AI

Speculative Decoding Explained

“Generating…” We are all familiar with the blinking cursor slowly revealing text word by word. Whether it’s ChatGPT or a local LLM, the generation speed of Transformer-based autoregressive models is fundamentally bound by their sequential nature. The model computes, emits one token, computes again, emits the next token, and repeats. This strict dependency chain introduces […]
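
To make that bottleneck concrete, here is a minimal sketch of a greedy autoregressive loop in Python; `next_token_logits` is a hypothetical stand-in for a single Transformer forward pass, not any particular library's API.

```python
# Minimal sketch of the sequential dependence in autoregressive decoding.
# `next_token_logits` is a hypothetical stand-in for one Transformer forward pass.
import numpy as np

def next_token_logits(token_ids):
    # Placeholder: in a real model this would be a full forward pass over the prefix.
    rng = np.random.default_rng(sum(token_ids))
    return rng.standard_normal(1000)  # pretend vocabulary of 1,000 tokens

def generate(prompt_ids, max_new_tokens=10):
    ids = list(prompt_ids)
    for _ in range(max_new_tokens):
        logits = next_token_logits(ids)      # step t needs all tokens 0..t-1
        ids.append(int(np.argmax(logits)))   # emit exactly one token, then repeat
    return ids

print(generate([1, 2, 3]))
```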


Even GPT-5.2 Can’t Count to Five

In this post, we discuss how state-of-the-art large language models still make mistakes on extremely simple problems, based on Even GPT-5.2 Can’t Count to Five: The Case for Zero-Error Horizons in Trustworthy LLMs. As a concrete example: if you ask whether the number of 1s in 11000 is even or odd, gpt-5.2-2025-12-11 answers “odd”. If you
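
For reference, the question itself is trivial to check programmatically; the snippet below simply counts the 1s in 11000 and reports the parity (the correct answer is “even”).

```python
# Count the 1s in "11000" and report the parity the model is asked about.
s = "11000"
ones = s.count("1")
print(ones, "even" if ones % 2 == 0 else "odd")  # 2 even
```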


How LLMs Really Do Arithmetic

LLMs can answer prompts like “226-68=” by outputting “158”, but this computation turns out to be carried out in a far stranger way than we might imagine, as shown by [Nikankin+ ICLR 2025]. Let us first state the assumptions. We do not use chain-of-thought. We consider the setting where the model directly outputs an
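
To make the setting concrete, the sketch below spells out the direct-answer setup described above; the prompt string and expected completion are illustrative only, not a specific evaluation harness.

```python
# The no-chain-of-thought setting: the model sees "226-68=" and must emit "158" directly,
# with no intermediate reasoning tokens in between.
prompt = "226-68="
expected = str(226 - 68)   # "158": the only tokens the model is allowed to produce
print(prompt + expected)   # 226-68=158
```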


Fisher Information Explained: Python and Visual Illustrations

Definition of Fisher Information The Fisher information is defined as $$\mathrm{FisherInformation}(\theta_0)\stackrel{\text{def}}{=}-\mathbb{E}_{X\sim p(x\mid\theta_0)}\left[\frac{d^2}{d\theta^2}\log p(x\mid\theta)\bigg|_{\theta=\theta_0}\right].$$ Fisher information quantifies how precisely a model parameter can be estimated. A larger Fisher information means the parameter can be estimated more accurately, while a smaller Fisher information indicates that estimation is more difficult. Fisher information admits several equivalent interpretations. Equivalent Expressions $$\begin{align}&\mathrm{FisherInformation}(\theta_0)\\&\stackrel{\text{def}}{=}-\mathbb{E}_{X\sim p(x\mid\theta_0)}\left[\frac{d^2}{d\theta^2}\log p(x\mid\theta)\bigg|_{\theta=\theta_0}\right]\end{align}$$ […]
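
As a quick numerical illustration of this definition (a Bernoulli example of my own choosing, not necessarily the one used in the post), the sketch below estimates the Fisher information by Monte Carlo and compares it with the closed form $1/(\theta_0(1-\theta_0))$.

```python
# Monte Carlo estimate of Fisher information for X ~ Bernoulli(theta0),
# using the definition -E[ d^2/dtheta^2 log p(x|theta) ] evaluated at theta = theta0.
import numpy as np

theta0 = 0.3
rng = np.random.default_rng(0)
x = rng.binomial(1, theta0, size=200_000)

# For Bernoulli, d^2/dtheta^2 log p(x|theta) = -x/theta^2 - (1-x)/(1-theta)^2.
second_derivative = -x / theta0**2 - (1 - x) / (1 - theta0) ** 2
fisher_mc = -second_derivative.mean()

fisher_exact = 1.0 / (theta0 * (1.0 - theta0))
print(fisher_mc, fisher_exact)  # both close to ~4.76
```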


Attention in LLMs and Extrapolation

It is now understood that the attention mechanism in large language models (LLMs) serves multiple functions. By analyzing attention, we gain insight into why LLMs succeed at in-context learning and chain-of-thought—and, consequently, why LLMs sometimes succeed at extrapolation. In this article, we aim to unpack this question by observing various types of attention mechanisms. Basic
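
As a reminder of the basic computation that all of these functions build on, here is a minimal single-head scaled dot-product attention in NumPy; it is a generic sketch, not the implementation of any particular model.

```python
# Minimal single-head scaled dot-product attention over a toy sequence.
import numpy as np

def attention(Q, K, V):
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                    # similarity of each query to each key
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over key positions
    return weights @ V, weights

rng = np.random.default_rng(0)
Q = rng.standard_normal((4, 8))   # 4 query positions, dimension 8
K = rng.standard_normal((4, 8))
V = rng.standard_normal((4, 8))
out, w = attention(Q, K, V)
print(w.round(2))                 # each row sums to 1: where each position "looks"
```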


Interestingness First Classifiers

Most existing machine learning models aim to maximize predictive accuracy, but in this article, I will introduce classifiers that prioritize interestingness. What Does It Mean to Prioritize Interestingness? For example, let us consider the task of classifying whether a user is an adult based on their profile. If the profile contains an age feature, then the
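
To make the example concrete, here is the accuracy-maximizing baseline that the excerpt is setting up, not the interestingness-first classifier itself; it assumes “adult” means age 18 or over.

```python
# The accuracy-maximizing (but uninteresting) baseline for the adult-classification example:
# if the profile already contains an age feature, simply threshold it.
def is_adult(profile: dict) -> bool:
    return profile["age"] >= 18   # assumption: "adult" means age 18 or over

print(is_adult({"age": 25, "favorite_genre": "jazz"}))  # True
print(is_adult({"age": 15, "favorite_genre": "pop"}))   # False
```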


Word Tour: One-dimensional Word Embeddings via the Traveling Salesman Problem

In the field of Natural Language Processing (NLP), a central theme has always been “how to make computers understand the meaning of words.” One fundamental technique for this is “Word Embedding.” This technique converts words into numerical vectors (lists of numbers), with methods like Word2Vec and GloVe being well-known. Using these vectors allows for calculations
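
As a toy sketch of the idea in the title, the snippet below orders a handful of words by a greedy nearest-neighbor tour over random stand-in vectors; the actual Word Tour method uses trained embeddings and a proper TSP solver, so this is only a heuristic illustration.

```python
# Toy sketch of ordering words one-dimensionally via a greedy nearest-neighbor "tour".
import numpy as np

words = ["cat", "dog", "car", "bus", "apple", "pear"]
rng = np.random.default_rng(0)
vecs = {w: rng.standard_normal(5) for w in words}   # stand-ins for Word2Vec/GloVe vectors

def greedy_tour(words, vecs):
    tour = [words[0]]
    remaining = list(words[1:])
    while remaining:
        last = vecs[tour[-1]]
        nxt = min(remaining, key=lambda w: np.linalg.norm(vecs[w] - last))
        tour.append(nxt)                # append the closest remaining word
        remaining.remove(nxt)
    return tour

print(greedy_tour(words, vecs))  # words with similar vectors end up adjacent in the ordering
```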

