Deceptive and Lazy AIs: When Language Models Learn to Mislead

As AI systems grow more capable, it is becoming increasingly difficult for humans to supervise them. In this article, we discuss this pressing issue through the lens of a recent paper presented at ICLR 2025 by a group from Anthropic: "Language Models Learn to Mislead Humans via RLHF". This paper investigates an intriguing […]
