January 2026

Even GPT-5.2 Can’t Count to Five

In this post, based on Even GPT-5.2 Can’t Count to Five: The Case for Zero-Error Horizons in Trustworthy LLMs, we discuss how state-of-the-art large language models still make mistakes on extremely simple problems. As a concrete example: if you ask whether the number of 1s in 11000 is even or odd, gpt-5.2-2025-12-11 answers “odd”. If you […]
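For reference, the correct answer to the example above can be checked mechanically; a minimal sketch in Python:

```python
# Count the 1s in the bit string "11000" and report the parity.
s = "11000"
ones = s.count("1")
parity = "even" if ones % 2 == 0 else "odd"
print(ones, parity)  # 2 even — so the model's answer "odd" is wrong
```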

Even GPT-5.2 Can’t Count to Five Read Post »

How LLMs Really Do Arithmetic

LLMs can answer prompts like “226-68=” by outputting “158”, but it turns out that this computation is carried out in a much stranger way than one might imagine, as shown by [Nikankin+ ICLR 2025]. Let us first state the assumptions: we do not use chain-of-thought, and we consider the setting where the model directly outputs an
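As a sanity check on the example prompt, the reference answer is ordinary integer subtraction:

```python
# The prompt "226-68=" evaluated as plain integer arithmetic.
print(226 - 68)  # 158
```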

How LLMs Really Do Arithmetic Read Post »
