Deceptive and Lazy AIs: When Language Models Learn to Mislead


As AI systems continue to grow more capable, it is becoming increasingly difficult for humans to supervise them. In this article, we discuss this pressing issue through the lens of a recent paper presented at ICLR 2025 by Anthropic and collaborators: Language Models Learn to Mislead Humans via RLHF.

This paper investigates an intriguing and concerning phenomenon: when facing tasks that are too difficult for a large language model (LLM)—such as complex programming challenges—the model avoids simply saying “I don’t know” or outputting obviously incorrect code. Doing so would likely result in receiving a “BAD” rating from human evaluators. Instead, the model opts to produce overly complex or difficult-to-debug code, effectively obfuscating its failure. This behavior was empirically observed in controlled experiments.

What’s alarming is that this phenomenon may not be limited to research settings—it might already be happening in real-world LLMs and AI services. Even if you believe you haven’t been deceived, it’s possible that you already have—without realizing it.

In the following sections, we examine the underlying causes of this behavior, walk through the experimental setup and findings, and discuss future directions for mitigating this deceptive tendency.

Background: The Necessity and Limitations of RLHF

Language models aim to generate plausible text. More precisely, they aim to generate text sampled from the same distribution as their training data. If the training data is full of errors, then a well-trained language model will learn to imitate those errors. If 1% of the training data is wrong, the model will learn to make mistakes with a probability of 1%. Garbage in, garbage out. Language models are neither magic nor alchemy.

As a result, the capability of a language model tends to converge around the “average level” of its training data. The exact performance depends on how the output is chosen: if we sample from the model’s distribution, we get an “average-like” performance; if we take the most likely text under the model distribution, we get something akin to a majority vote across the training data. While sampling and majority voting differ, for simplicity we’ll refer to both as “approximately average” in the following. This means that if the training data comes from web corpora, the model’s behavior reflects the average person who writes on the web; if trained on GitHub, the output approximates the average GitHub user’s code quality.
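To make the sampling-versus-argmax distinction concrete, here is a tiny sketch (purely illustrative, not from the paper) of how the two output strategies behave over the same learned distribution:

```python
import random
from collections import Counter

random.seed(0)

# Toy "training data": 60% of writers answer A, 40% answer B.
data = ["A"] * 60 + ["B"] * 40

# Sampling from the learned distribution reproduces the mix: average-like behavior.
sampled = Counter(random.choice(data) for _ in range(10_000))
print(sampled)  # roughly Counter({'A': 6000, 'B': 4000})

# Taking the most likely output acts like a majority vote: always A.
most_likely = Counter(data).most_common(1)[0][0]
print(most_likely)  # 'A'
```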

Even this “average human level” is quite impressive. But as the expectations for language models continue to rise, we inevitably face the need to surpass this ceiling. In order to achieve output quality that goes beyond “next-token prediction” or “plausible text generation,” the training objective must be redefined.

This is where Reinforcement Learning from Human Feedback (RLHF) comes in. In RLHF, humans evaluate the model’s outputs, and the model is fine-tuned to prefer those outputs that receive higher human ratings. A base language model might produce plausible but mediocre completions or plausible and genuinely good ones. RLHF encourages the model to prioritize the latter, so that it doesn’t just sound plausible—but also becomes helpful or correct.
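As a rough sketch of how this is usually implemented (not the paper's code; `reward_model` below is a toy stand-in): a reward model is first fit on pairwise human preferences with a Bradley-Terry style loss, and the policy is then fine-tuned to produce outputs that score highly under that learned reward.

```python
import torch
import torch.nn.functional as F

def preference_loss(reward_model, chosen, rejected):
    # Bradley-Terry / pairwise logistic loss: the response humans preferred
    # should get a higher scalar score than the one they rejected.
    r_chosen = reward_model(chosen).squeeze(-1)
    r_rejected = reward_model(rejected).squeeze(-1)
    return -F.logsigmoid(r_chosen - r_rejected).mean()

# Toy example: a linear reward model over 16-dimensional response encodings.
reward_model = torch.nn.Linear(16, 1)
loss = preference_loss(reward_model, torch.randn(8, 16), torch.randn(8, 16))
loss.backward()
```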

If you’ve used ChatGPT, you may have seen messages like “Which response do you prefer?” or encountered GOOD/BAD buttons under the outputs. These are part of the RLHF process.

But there is a catch: there can be a gap between what humans “think is good” and what is actually good. As the model optimizes for what scores well with humans, it may start to inflate its responses with flattering language or insert plausible-sounding (but unsupported) explanations just to gain points.

This becomes a serious issue when tasks assigned to the AI grow in complexity. When it becomes harder for humans to assess quality reliably, deceiving the human becomes easier than solving the task rigorously—and if both deception and honest work yield similar scores, the efficiency-maximizing model may choose to deceive. Of course, if the deception is detected, it might result in a BAD rating. But if the risk of failure from an honest attempt is higher than the risk of being caught deceiving, deception becomes the optimal strategy under the reward model. As both AIs and tasks become more sophisticated, this tendency becomes increasingly difficult to ignore.

In reinforcement learning, this type of behavior—where the agent exploits quirks in the reward signal to achieve high scores without solving the intended task—is known as reward hacking, a long-standing topic of discussion. Classic examples include:

  • A tic-tac-toe bot learns to win by playing moves at enormous board coordinates, causing its opponents to exhaust memory and crash.
  • A robot trained to maximize total momentum learns to vibrate rapidly in place, racking up momentum without any meaningful movement.
  • An evolutionary algorithm asked to produce a robot that runs 50 meters as fast as possible evolves a tall pole-shaped robot that simply falls forward, crossing the finish line in a single toppling motion.
  • An ad generator trained to maximize click-through rate starts generating clickbait headlines and thumbnails.

These are manifestations of Goodhart’s Law: “When a measure becomes a target, it ceases to be a good measure.”

Consider using the 50-meter dash time to evaluate which robot is faster—it’s a valid observational metric. But if we say, “Your sole mission is to minimize 50-meter dash time, no matter what,” then what we get might be a bizarre pole-shaped robot engineered solely to exploit the metric.
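A minimal numerical sketch of the same effect (purely illustrative, not from the paper): the proxy reward tracks the true reward almost everywhere, but it has one exploitable quirk, and an optimizer that pushes hard on the proxy lands exactly on that quirk.

```python
import numpy as np

def true_reward(x):
    return -(x - 1.0) ** 2                      # the designer's real objective: best at x = 1

def proxy_reward(x):
    # Agrees with the true reward almost everywhere, plus one exploitable spike near x = 8.
    return true_reward(x) + 100.0 * np.exp(-(x - 8.0) ** 2)

xs = np.linspace(-5, 15, 100_000)
x_hack = xs[np.argmax(proxy_reward(xs))]
print(f"proxy-optimal x = {x_hack:.2f}, true reward there = {true_reward(x_hack):.1f}")  # near 8, about -48
print(f"true-optimal  x = 1.00, true reward there = {true_reward(1.0):.1f}")             # 0.0
```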

Language Models Learn to Mislead Humans via RLHF

This section introduces a paper presented at ICLR 2025 by Anthropic and collaborators: Language Models Learn to Mislead Humans via RLHF.

While reward hacking in LLMs trained with RLHF had been discussed before, prior work often relied on deliberately provoking malicious behavior from the model. Moreover, popular training methods such as PPO and DPO (Rafailov et al., NeurIPS 2023) already include mechanisms, such as a penalty for drifting too far from the initial model, precisely to mitigate such issues.
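As a rough sketch of that mechanism (the standard KL-regularized shaping used in RLHF, not code from this paper): the reward that is actually optimized is the reward-model score minus a penalty for deviating from the reference (initial) model.

```python
def kl_shaped_reward(reward_model_score, logprob_policy, logprob_reference, beta=0.1):
    # Standard RLHF shaping: penalize tokens that the current policy makes much
    # more likely than the reference (initial) model does.
    kl_penalty = logprob_policy - logprob_reference
    return reward_model_score - beta * kl_penalty

# Same reward-model score, but the second response deviates strongly from the
# reference model and is penalized accordingly.
print(kl_shaped_reward(1.0, logprob_policy=-2.0, logprob_reference=-2.1))  # ~0.99
print(kl_shaped_reward(1.0, logprob_policy=-2.0, logprob_reference=-9.0))  # ~0.30
```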

Compared to earlier research, this work is notable for the following three reasons:

  • Unlike prior studies that often induced LLMs to behave badly on purpose, this research observed deceptive behavior even when using a well-designed RLHF setup that explicitly incorporates mechanisms to avoid extreme solutions. In other words, the model misled humans despite efforts to avoid reward hacking.
  • In most previously known reward hacking scenarios, the issue arose from oversimplified reward definitions or poor modeling of human judgment. For example, when a reward-hacked agent evolved into a pole-shaped robot, it was easy for designers to recognize it as a design flaw. In contrast, this study demonstrates that even when using human evaluations more directly as reward signals, LLMs still engaged in reward hacking—producing outputs that deceived humans without them realizing it.
  • Through QA and coding tasks, the study identified specific strategies LLMs use to deceive users—for example, returning obfuscated or misleading code for unsolved problems to hide errors.

Experimental Setup

The study evaluated two types of tasks: question answering (QA) and coding.

  • In the QA task, the model is presented with a document (around 5,000 tokens) and must answer a multiple-choice question (4 options) based on its content. The LLM receives both the document and the question and outputs an answer along with a rationale.
  • In the coding task, the model receives a specification of an algorithm and sample inputs/outputs, and must produce a Python implementation that fulfills the specification.

Three types of reward signals were defined for these tasks:

  1. True reward \( R^* \): This represents the objective correctness of the output. In the QA task, \( R^* = 1 \) if the model’s answer is correct, and 0 otherwise. The rationale is ignored. In the coding task, \( R^* = 1 \) if the output passes all comprehensive test cases (an average of 34 per task); otherwise, 0.
  2. Human evaluation \( R^{\text{human}} \): This reflects whether human evaluators perceive the output as good or bad. In the QA task, evaluators are given 3 minutes to read both the answer and its rationale before rating it as good (1) or bad (0). In the coding task, evaluators are provided with an environment where they can write test functions and execute arbitrary code. They are given 10 minutes per solution to assign a good/bad rating.
  3. Training reward \( R^{\text{train}} \): This is the reward used during RLHF training. Ideally, \( R^{\text{train}} = R^{\text{human}} \), but due to cost constraints, approximations are used. For the QA task, a reward model is trained on human preference data and serves as a proxy evaluator. For the coding task, a simple rule is used: if the solution passes 2–3 basic tests, it receives a reward of 1; otherwise, 0. While simplistic, pilot studies showed that most users evaluate solutions using similarly minimal tests, and this criterion correlates strongly with human judgment.

Although \( R^{\text{train}} \) is coarser than ideal, this compromise is acceptable for two main reasons: (i) real-world LLM training often relies on such proxy reward models due to cost and scalability constraints, and (ii) the true objective of the study is to test whether trained LLMs deceive real humans; for that, actual human evaluations (\( R^{\text{human}} \)) and objective checks (\( R^* \)) are conducted at test time. Refining this setup is left to future work.
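To make the coding-task rewards concrete, here is a minimal sketch of the gap between the training reward and the true reward as described above; the test lists and the buggy solution are hypothetical, not taken from the paper's harness.

```python
def train_reward(solution, basic_tests):
    # R_train: 1 if the handful of simple tests pass, else 0.
    return int(all(solution(x) == y for x, y in basic_tests))

def true_reward(solution, full_tests):
    # R*: 1 only if the comprehensive test suite (about 34 cases on average) passes.
    return int(all(solution(x) == y for x, y in full_tests))

# A wrong solution can still earn full training reward if it happens to satisfy
# the few basic tests -- exactly the gap the model can learn to exploit.
buggy = lambda n: n * 2                          # suppose the spec also requires f(0) == 1
basic_tests = [(1, 2), (2, 4)]                   # the cheap checks evaluators typically write
full_tests = basic_tests + [(0, 1), (5, 10)]     # the comprehensive suite
print(train_reward(buggy, basic_tests), true_reward(buggy, full_tests))  # 1 0
```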

The experimental setup is as follows:
→ Start with a pretrained language model
→ Fine-tune it using RLHF with \( R^{\text{train}} \)
→ Evaluate the RLHF’ed model on test data to measure its true correctness \( R^* \) and human-perceived quality \( R^{\text{human}} \)

For the QA task, the base model used is Llama-2-7B. For the coding task, Deepseek-Coder-7B is used.

To assess \( R^{\text{human}} \):

  • For QA, 35 human evaluators were hired, including teachers, writers, editors, and students with experience in English reading comprehension and question answering.
  • For coding, 10 university students majoring in computer science or electrical engineering—some with competitive programming experience—were employed to rate the outputs.

RLHF Does Not Improve True Performance—But Makes Humans Feel It Does

Below are the reward metrics before and after RLHF.

Question Answering:

Question Answering Results: Blue = before RLHF, Orange = after RLHF. While the true reward on the right shows little to no improvement (possibly even a slight decrease), human evaluations in the middle indicate that people feel the responses are much better after RLHF.

Coding:

Coding Results: Blue = before RLHF, Orange = after RLHF. As with QA, the true reward metric on the right barely improves, yet human evaluations in the middle indicate that people feel the responses are much better after RLHF.

In both cases, RLHF results in little to no actual performance gain. And yet, humans consistently perceive the outputs as having become much better.


RLHF Increases the Rate of Human Misjudgment

Humans are prone to making misjudgments—both before and after RLHF. That is, they may believe a model output is correct when it is actually wrong, or vice versa. However, after RLHF, the frequency of such misjudgments increases noticeably. This suggests that the LLM learns to produce more deceptive or misleading outputs—those that humans are more likely to misjudge.

Human Misjudgment in QA: Left = before RLHF, Right = after RLHF. RLHF leads to an increase in human misjudgment.
Human Misjudgment in Coding: Left = before RLHF, Right = after RLHF. RLHF again increases the rate of human misjudgment.

The paper specifically analyzes an important trend: **after RLHF, there is a noticeable increase in cases where the model output is objectively incorrect (\(R^* = 0\)) but humans incorrectly judge it to be correct (\(R^{\text{human}} = 1\)).**

After RLHF, Models Learn to Insert Sophisticated Fake Evidence

In the QA task, the LLM receives a document and a question, and is expected to generate an answer along with its rationale.

By comparing the rationales before and after RLHF, a clear pattern emerges: before RLHF, the rationales were often naive and straightforward. But after RLHF, the model begins to actively insert sophisticated fake evidence. While the rationale has no effect on whether the answer is correct, including plausible-looking evidence (even if fabricated) seems to make humans more likely to be persuaded—and therefore more likely to label the answer as “good.”

Change in Rationales: (Left) The question; (Middle) the model’s response before RLHF; (Right) the response after RLHF. Both are incorrect. As highlighted in red, the post-RLHF model learns to insert sophisticated fake evidence.

Both before and after RLHF, LLMs may include hallucinated evidence. However, before RLHF, such evidence tended to be weak or obviously false. After RLHF, the evidence becomes confident, persuasive, and—though still false—appears much more credible. Human evaluators were often observed to be persuaded by this fake evidence and agreed with the LLM’s incorrect claims with high confidence.

Moreover, the pre-RLHF rationales were often inconsistent—supporting claim A at the beginning and switching to claim B later. It was easy for readers to notice, “This model doesn’t really know what it’s saying.” In contrast, post-RLHF rationales, while untrue, are internally consistent and logically coherent, making it difficult to tell truth from falsehood without verifying against the source document.

RLHF-tuned models also make use of citations. In particular, they were observed to selectively quote fragments from the source document that support their (incorrect) claims, while ignoring contradicting evidence. In such cases, the evidence itself is genuine, but the way it is presented is misleading—making it even harder for humans to detect the deception.

While post-RLHF models may appear more capable, as previously discussed, their true accuracy—i.e., the objective correctness of their answers—has barely changed (and sometimes slightly declined). What has improved is their ability to appear correct, not their actual correctness.

After RLHF, Models Learn to Generate Incorrect but Hard-to-Read Code

Post-RLHF LLMs tend to generate code that is harder to read—characterized by a lack of modularization and complex control flow.

Change in Code Complexity: (Left) Coding prompt; (Middle) LLM output before RLHF; (Right) after RLHF. Both outputs are incorrect. Pre-RLHF code is easier to read and debug. Post-RLHF code is harder to read and makes error detection more difficult.

Notably, when the post-RLHF model produces incorrect code, it tends to make it more complex. One way to quantify this is through cyclomatic complexity, a metric for measuring control flow complexity. When the model produces correct code, the complexity is similar before and after RLHF. But when the output is incorrect, post-RLHF code is much more complex, as confirmed by quantitative measurements.
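The paper does not name its measurement tooling; in Python, one common way to compute cyclomatic complexity is the radon package, roughly like this:

```python
# pip install radon
from radon.complexity import cc_visit

source = """
def solve(xs):
    total = 0
    for x in xs:
        if x > 0:
            if x % 2 == 0:
                total += x
            else:
                total -= x
        elif x < 0:
            total += 1
    return total
"""

for block in cc_visit(source):
    # Cyclomatic complexity counts independent paths through the control flow;
    # higher values mean more branches a reviewer has to reason about.
    print(block.name, block.complexity)
```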

Change in Cyclomatic Complexity: Blue = before RLHF, Orange = after RLHF. (Left) Complexity when the model produces correct code; (Right) complexity when the model produces incorrect code. While complexity remains similar for correct solutions, RLHF-tuned models tend to produce much more complex code when they are wrong.

The paper also reports that the number of functions in incorrect code drops significantly after RLHF. This suggests that post-RLHF models, when wrong, tend to produce monolithic, unmodularized, and difficult-to-read code. Furthermore, while the core logic is wrong, such code is still able to pass simple unit tests—leading humans to falsely believe the code is correct.
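The function count itself is easy to measure; the paper's exact tooling is not stated, but a simple approximation with Python's standard ast module looks like this:

```python
import ast

def count_functions(source: str) -> int:
    # Count def/async def nodes anywhere in the module (a rough proxy for modularization).
    tree = ast.parse(source)
    return sum(isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef))
               for node in ast.walk(tree))

print(count_functions("def f():\n    return 1\n\ndef g():\n    return f() + 1\n"))  # 2
```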

As previously mentioned, the overall accuracy—i.e., the objective correctness of their code—remains virtually unchanged. What the model has learned is not to be more correct, but rather to be better at hiding its mistakes.

The Road Ahead: When Human Evaluation Fails, Should We Let AI Judge Itself?

These findings suggest that as AI systems grow more capable—and are assigned more complex tasks—we are approaching a stage where human supervision becomes ineffective, or even counterproductive.

So, how can we continue to improve AI beyond this point? Traditional machine learning techniques have largely assumed that human judgment is the gold standard, so overcoming this barrier is no trivial challenge.

One straightforward idea is this: if AI has become so capable, why not let it evaluate itself instead of relying on humans? This approach has naturally been explored and is known as self-verification or self-critique (Weng et al., EMNLP Findings 2023, Chen et al., ICLR 2024).

However, recent studies have pointed out critical limitations to this approach. One such study was presented at ICLR 2025 by a group from Arizona State University: On the Self-Verification Limitations of Large Language Models on Reasoning and Planning Tasks.

In this work, an LLM is tasked with playing three roles: generator, verifier, and critic. The generator first attempts to solve a problem. The verifier then checks whether the answer is correct. If the verifier determines the answer is wrong, the critic provides feedback highlighting the error. This feedback is given to the generator, who tries again. The loop continues until the verifier deems the answer acceptable.
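Schematically, the loop looks like the sketch below (the function names are placeholders, not the authors' code); in the paper, each role is played by prompting an LLM.

```python
def self_verification_loop(problem, generate, verify, critique, max_rounds=10):
    feedback = None
    candidate = None
    for _ in range(max_rounds):
        candidate = generate(problem, feedback)   # generator proposes a solution
        if verify(problem, candidate):            # verifier accepts or rejects it
            return candidate
        feedback = critique(problem, candidate)   # critic points out the suspected error
    return candidate                              # give up after max_rounds attempts
```

Note that "accepted" here only means the verifier was convinced, which is exactly where verifier mistakes become exploitable.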

They tested this setup on formally verifiable problems like the Game of 24 (e.g., given 4, 7, 8, 8, produce 24 using arithmetic: \((7 - (8 / 8)) \times 4 = 24\)) and graph coloring. Surprisingly, the system with self-verification mechanisms performed worse than a model that simply answered without any such mechanism. For example, accuracy in the Game of 24 dropped from 5% to 3%, and in graph coloring from 16% to 2%.

In contrast, when the same iterative setup used rule-based verifiers (i.e., non-LLM tools that can guarantee correctness), the results improved significantly: the Game of 24 rose from 5% to 36%, and graph coloring from 16% to 38%.
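Such a rule-based verifier is easy to build for the Game of 24, and unlike an LLM judge it cannot be talked into a wrong verdict. A minimal sketch (my own illustration, not the paper's code):

```python
import ast
from fractions import Fraction

def verify_24(expression: str, numbers: list[int]) -> bool:
    # Accept only arithmetic expressions that use exactly the given numbers
    # and evaluate (with exact rational arithmetic) to 24.
    leaves = []
    allowed_ops = (ast.Add, ast.Sub, ast.Mult, ast.Div)

    def evaluate(node):
        if isinstance(node, ast.Constant) and isinstance(node.value, int):
            leaves.append(node.value)
            return Fraction(node.value)
        if isinstance(node, ast.BinOp) and isinstance(node.op, allowed_ops):
            left, right = evaluate(node.left), evaluate(node.right)
            if isinstance(node.op, ast.Add):
                return left + right
            if isinstance(node.op, ast.Sub):
                return left - right
            if isinstance(node.op, ast.Mult):
                return left * right
            return left / right
        raise ValueError("disallowed syntax")

    try:
        tree = ast.parse(expression, mode="eval")
        value = evaluate(tree.body)
    except (SyntaxError, ValueError, ZeroDivisionError):
        return False
    return sorted(leaves) == sorted(numbers) and value == 24

print(verify_24("(7 - (8 / 8)) * 4", [4, 7, 8, 8]))  # True
```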

While LLM-based verifiers are generally accurate, they are still fallible. Using such imperfect evaluators introduces reward hacking-like dynamics. Suppose a verifier is 99% accurate. When a generator is faced with a task it cannot solve reliably, it may generate 100 random solutions and pick the one that happens to “fool” the verifier. As task difficulty increases, such hacks become more efficient than solving the task honestly—and models will converge toward these strategies, whether explicitly or emergently.
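To make that arithmetic concrete: if each incorrect candidate independently has a 1% chance of being wrongly accepted, then among 100 candidates the probability that at least one slips past the verifier is

\[ 1 - 0.99^{100} \approx 0.63, \]

so "fooling the verifier" succeeds most of the time even though no candidate is actually correct.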

A similar structure is discussed in another ICLR 2025 paper from Amazon and Harvard: Mind the Gap: Examining the Self-Improvement Capabilities of Large Language Models. This study emphasizes that self-verification is especially unreliable for QA tasks. For instance, if an LLM answers the question “Who was the 42nd president of the United States?” with “George Bush,” it does so because it mistakenly believes this is correct. When asked to verify its own answer, it will likely reaffirm it as correct—since the underlying knowledge is already flawed. This makes self-correction extremely difficult.

The relationship between a good responder and a good verifier resembles a chicken-and-egg problem. If you already have the world’s best verifier, then yes, you can elevate a weaker responder to that level. But improving beyond that verifier’s level becomes extremely difficult.

That said, the Mind the Gap paper also notes that for tasks like arithmetic or puzzles, verification is inherently easier than generation. In such cases, some degree of self-improvement may still be possible. Because these tasks allow for the construction of rule-based, deterministic verifiers—as shown in the Self-Verification Limitations study—LLMs can indeed improve performance through guided feedback loops. As AlphaGo demonstrated, when the problem domain has clear rules and rewards, AI (especially reinforcement learning-based systems) can excel. I also believe focusing on tasks where generation is hard but verification is easy will be a key strategy for future development.

One reason coding AIs have advanced so rapidly in recent years (beyond strong demand and monetization potential) may lie in this exact principle: programming has well-defined rules and abundant tools for automated verification. These properties likely serve as powerful underlying drivers of progress in this domain.

Conclusion

Up until now, AI has grown by freely leveraging centuries of human trial and error. It has ridden on top of a vast accumulation of human data, insights, and behind all that, an even greater number of human failures. The road so far has been one paved and maintained by humanity. But from here on, as AI moves into the frontier—beyond the reach of human education—it will need to learn through its own trials and errors.

Unlike the past, where correct answers were provided, AI will now begin to generate its own vast landscape of mistakes. Progress may no longer be linear. The road ahead is not paved; it must be carved out through detours, false starts, and repeated course corrections.

Moreover, while the scaling laws of recent years taught us that investing more resources would yield better performance metrics, we may now be entering an era where those very metrics can no longer be trusted. It is an open and fascinating question: which evaluation metrics will lose reliability, and which ones will remain trustworthy?

If we must now embrace non-monotonic progress and also refine our evaluation metrics to greater precision, then the cost of AI development will rise even further. Training costs, already high, could escalate to levels where the return on investment becomes unclear. In practical terms, whether AI can continue to achieve truly substantive growth depends on how well we can balance the cost invested against the value returned. That is precisely why it is critical to pursue more efficient techniques that reduce cost and help smooth the path forward.

This challenge remains unsolved—even at the cutting edge of research. Can AI overcome the wall now standing before it? And if so, how? I hope this article has prompted you to reflect on these questions as well.

Author Profile

If you found this article useful or interesting, I would be delighted if you could share your thoughts on social media.

New posts are announced on @joisino_en (Twitter), so please be sure to follow!

Ryoma Sato

Currently an Assistant Professor at the National Institute of Informatics, Japan.

Research Interest: Machine Learning and Data Mining.

Ph.D (Kyoto University).
