Deceptive and Lazy AIs: When Language Models Learn to Mislead
As AI systems grow more capable, it is becoming increasingly difficult for humans to supervise them. In this article, we discuss this pressing issue through the lens of a recent paper presented at ICLR 2025 by a group from Anthropic: "Language Models Learn to Mislead Humans via RLHF". This paper investigates an intriguing […]