Sorting with LLMs

Sorting is a classic task in computer science, but it has recently intersected with state-of-the-art LLMs and sparked a new research trend. Sorting can be performed as long as a comparison function is defined. Traditional comparison functions assumed measurable numeric quantities such as height, price, or distance. However, if we call an LLM inside the […]
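The idea can be sketched with Python's standard sorting machinery. The `llm_compare` function below is a hypothetical stand-in: a real version would send the two items to a model API and parse its verdict, while this sketch fakes the judgment with string length so the example runs on its own.

```python
from functools import cmp_to_key

def llm_compare(a: str, b: str) -> int:
    """Stand-in for an LLM call such as 'Which of A and B is funnier?'.
    A real implementation would prompt a model and parse its answer;
    here we fake the judgment with string length so the code runs."""
    if len(a) < len(b):
        return -1
    if len(a) > len(b):
        return 1
    return 0

items = ["a pun", "an elaborate shaggy-dog story", "a knock-knock joke"]
# cmp_to_key turns any pairwise comparator -- even an LLM-backed one --
# into a key usable by Python's built-in sort
ranked = sorted(items, key=cmp_to_key(llm_compare))
```

Note that an LLM comparator, unlike a numeric one, is not guaranteed to be transitive or even deterministic, which is part of what makes this research direction interesting.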

Speculative Decoding Explained

“Generating…” We are all familiar with the blinking cursor that slowly reveals text word by word. Whether it is ChatGPT or a local LLM, the generation speed of Transformer-based autoregressive models is fundamentally bound by their sequential nature: the model computes, emits one token, computes again, emits the next token, and repeats. This strict dependency chain introduces
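The core loop can be illustrated with a toy greedy sketch, under loudly stated assumptions: `target` and `draft` below are hypothetical stand-ins for a large and a small model, each mapping a context to its next token, and the acceptance rule is the simplified greedy variant rather than the full rejection-sampling scheme.

```python
def target(ctx):
    # Toy "large" model: the next token continues the sequence mod 10
    return (ctx[-1] + 1) % 10

def draft(ctx):
    # Toy "small" model: usually agrees, but errs when the context ends in 7
    return 0 if ctx[-1] == 7 else (ctx[-1] + 1) % 10

def speculative_decode(prompt, k=4, max_new=8):
    out = list(prompt)
    while len(out) < len(prompt) + max_new:
        # 1) the draft model proposes k tokens cheaply, one by one
        ctx = out[:]
        proposed = []
        for _ in range(k):
            t = draft(ctx)
            proposed.append(t)
            ctx.append(t)
        # 2) the target model checks the proposals; in a real system
        #    all k checks happen in a single parallel forward pass
        ctx = out[:]
        for t in proposed:
            correct = target(ctx)
            if t == correct:
                ctx.append(t)        # accept the drafted token for free
            else:
                ctx.append(correct)  # fix the first mistake, discard the rest
                break
        out = ctx
    return out[:len(prompt) + max_new]
```

When the draft model agrees with the target, several tokens are committed per target pass, which is where the speedup comes from.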

Even GPT-5.2 Can’t Count to Five

In this post, we discuss how state-of-the-art large language models still make mistakes on extremely simple problems, based on “Even GPT-5.2 Can’t Count to Five: The Case for Zero-Error Horizons in Trustworthy LLMs”. For a concrete example: if you ask whether the number of 1s in 11000 is even or odd, gpt-5.2-2025-12-11 answers “odd”. If you
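The ground truth for this example is one line of Python:

```python
x = "11000"
ones = x.count("1")                      # the string contains two 1s
parity = "even" if ones % 2 == 0 else "odd"
```

So the correct answer is “even”, and an answer of “odd” is simply wrong, no matter how simple the question looks.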

How LLMs Really Do Arithmetic

LLMs can answer prompts like “226-68=” by outputting “158”, but it turns out that this computation is carried out in a much stranger way than we might imagine, as shown by [Nikankin+ ICLR 2025]. Let us first state the assumptions: we do not use chain-of-thought, and we consider the setting where the model directly outputs an

TwoNN Intrinsic Dimension Explained: Python and Visual Illustrations

Real-world data often live in a high-dimensional ambient space, but the points themselves concentrate near a much lower-dimensional manifold. The “visible” (ambient) dimension is easy to read off from the feature vector length, while the “intrinsic” dimension (the effective degrees of freedom) is much harder to estimate. Two Nearest Neighbors (TwoNN) is a simple yet
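A minimal sketch of the estimator, in its maximum-likelihood form: for each point take the ratio mu = r2 / r1 of its two nearest-neighbour distances, then estimate d_hat = N / sum(log mu). The brute-force neighbour search and the synthetic data below are illustrative choices, not the only way to do this.

```python
import math
import random

def twonn_dimension(points):
    """TwoNN intrinsic-dimension estimate (maximum-likelihood form):
    for each point, mu = r2 / r1 from its two nearest neighbours,
    then d_hat = N / sum(log mu)."""
    n = len(points)
    log_mu_sum = 0.0
    for i, p in enumerate(points):
        # brute-force nearest-neighbour search; fine for small N
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        r1, r2 = dists[0], dists[1]
        log_mu_sum += math.log(r2 / r1)
    return n / log_mu_sum

# 500 points on a 2-D plane embedded in a 5-D ambient space:
rng = random.Random(0)
pts = [(rng.random(), rng.random(), 0.0, 0.0, 0.0) for _ in range(500)]
d_hat = twonn_dimension(pts)  # should land near 2, not near the ambient 5
```

The ambient dimension of `pts` is 5, but the estimator recovers a value close to the intrinsic dimension 2, which is exactly the gap the post is about.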

Fisher Information Explained: Python and Visual Illustrations

Definition of Fisher Information The Fisher information is defined as $$\mathrm{FisherInformation}(\theta_0)\stackrel{\text{def}}{=}-\mathbb{E}_{X\sim p(x\mid\theta_0)}\left[\frac{d^2}{d\theta^2}\log p(x\mid\theta)\bigg|_{\theta=\theta_0}\right].$$ Fisher information quantifies how precisely a model parameter can be estimated. A larger Fisher information means the parameter can be estimated more accurately, while a smaller Fisher information indicates that estimation is more difficult. Fisher information admits several equivalent interpretations. Equivalent Expressions
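As a worked instance of the definition, consider a Bernoulli($\theta$) model, whose Fisher information has the closed form $1/(\theta(1-\theta))$. The sketch below evaluates the definition directly: the second derivative of $\log p(x\mid\theta)$ is taken numerically, and the expectation over $X$ is computed exactly by summing over $x\in\{0,1\}$.

```python
import math

def fisher_info_bernoulli(theta, eps=1e-5):
    """Fisher information for Bernoulli(theta), straight from the
    definition -E_X[ d^2/dtheta^2 log p(X | theta) ]: numerical
    second derivative, exact expectation over X in {0, 1}."""
    def log_p(x, t):
        return x * math.log(t) + (1 - x) * math.log(1 - t)
    info = 0.0
    for x, p_x in ((1, theta), (0, 1 - theta)):
        # central-difference approximation of the second derivative
        d2 = (log_p(x, theta + eps) - 2 * log_p(x, theta)
              + log_p(x, theta - eps)) / eps ** 2
        info += -p_x * d2
    return info

theta = 0.3
closed_form = 1.0 / (theta * (1 - theta))  # known Bernoulli Fisher information
```

The numerical value agrees with the closed form, confirming that the definition above is what the formula computes.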

Attention in LLMs and Extrapolation

It is now understood that the attention mechanism in large language models (LLMs) serves multiple functions. By analyzing attention, we gain insight into why LLMs succeed at in-context learning and chain-of-thought, and consequently why they sometimes succeed at extrapolation. In this article, we aim to unpack these functions by observing various types of attention mechanisms. Basic

Interestingness First Classifiers

Most existing machine learning models aim to maximize predictive accuracy, but in this article, I will introduce classifiers that prioritize interestingness. What Does It Mean to Prioritize Interestingness? For example, let us consider the task of classifying whether a user is an adult based on their profile. If the profile contains an age feature, then the
