Fisher Information Explained: Python and Visual Illustrations

Definition of Fisher Information

The Fisher information is defined as

$$
\mathrm{FisherInformation}(\theta_0)
\stackrel{\text{def}}{=}
-\mathbb{E}_{X\sim p(x\mid\theta_0)}
\left[
\frac{d^2}{d\theta^2}\log p(x\mid\theta)\bigg|_{\theta=\theta_0}
\right].
$$

Fisher information quantifies how precisely a model parameter can be estimated.
A larger Fisher information means the parameter can be estimated more accurately,
while a smaller Fisher information indicates that estimation is more difficult.

When the Fisher information is large, the cross-entropy landscape forms a sharp valley, making the optimal value easier to pinpoint. When the Fisher information is small, the valley becomes flat, and the minimum becomes ambiguous.

Fisher information admits several equivalent interpretations.

Equivalent Expressions

$$
\begin{align}
&\mathrm{FisherInformation}(\theta_0) \\
&\stackrel{\text{def}}{=}
-\mathbb{E}_{X \sim p(x \mid \theta_0)}
\left[\frac{d^2}{d\theta^2} \log p(x \mid \theta)\bigg|_{\theta=\theta_0}\right] \\
&\stackrel{\text{(a)}}{\approx}
-\frac{1}{n} \sum_{i = 1}^n
\left[\frac{d^2}{d\theta^2} \log p(x_i \mid \theta)\bigg|_{\theta=\theta_0}\right]
\qquad (x_i \sim p(x \mid \theta_0)\ \text{i.i.d.}) \\
&=
-\frac{d^2}{d\theta^2}
\left(\frac{1}{n} \sum_{i = 1}^n \log p(x_i \mid \theta)\right)\bigg|_{\theta=\theta_0}
\tag{1} \\
&\stackrel{\text{(b)}}{\approx}
-\frac{d^2}{d\theta^2}
\mathbb{E}_{X\sim p(x\mid\theta_0)}[\log p(X\mid\theta)]
\bigg|_{\theta=\theta_0} \\
&=
\frac{d^2}{d\theta^2}
\mathrm{CrossEntropy}(p(x\mid\theta_0), p(x\mid\theta))\bigg|_{\theta=\theta_0} \\
&=
\frac{d^2}{d\theta^2}
\mathrm{KL}(p(x\mid\theta_0)\,\|\,p(x\mid\theta))
\bigg|_{\theta=\theta_0}.
\end{align}
$$

Here, (a) and (b) are Monte-Carlo approximations; under suitable regularity conditions they become exact as \(n\to\infty\), so every line holds with equality in that limit.

Thus, Fisher information equals the second derivative (curvature) of the cross-entropy or the Kullback–Leibler divergence with respect to the parameter \(\theta\).
In other words, Fisher information describes the curvature of these functions.
Consequently, when the Fisher information is large, even a small change in \(\theta\) changes the distribution noticeably, so nearby parameter values are easy to distinguish from data.
Conversely, when the Fisher information is small, the distribution remains almost unchanged when \(\theta\) is perturbed, so nearby parameter values are hard to tell apart.
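
As a concrete illustration of approximation (a), here is a minimal sketch that estimates the Fisher information of the mean of \(\mathcal{N}(\mu,1)\) at \(\mu_0=0\) by averaging central finite-difference second derivatives of the per-sample log-likelihood. The script and its variable names are purely illustrative; the analytical value, derived in the numerical example below, is \(1\).

import numpy as np

# Approximation (a): Fisher information as the negative average second
# derivative of the per-sample log-likelihood, with the derivative taken
# numerically by central finite differences.
rng = np.random.default_rng(0)
n = 100_000
mu0 = 0.0
x = rng.normal(loc=mu0, scale=1.0, size=n)

def log_pdf(x, mu):
    return -0.5 * np.log(2 * np.pi) - 0.5 * (x - mu) ** 2

eps = 1e-3
second_deriv = (log_pdf(x, mu0 + eps) - 2 * log_pdf(x, mu0)
                + log_pdf(x, mu0 - eps)) / eps**2
fisher_mc = -np.mean(second_deriv)
print("Finite-difference estimate:", fisher_mc)  # should be close to 1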

Taylor Expansion of KL Divergence

Let

$$
\mathrm{KL}(\theta)=\mathrm{KL}(p(x\mid\theta_0)\,\|\,p(x\mid\theta)).
$$

This function attains its minimum value \(0\) at \(\theta=\theta_0\), so both \(\mathrm{KL}(\theta_0)=0\) and \(\mathrm{KL}'(\theta_0)=0\).
A second-order Taylor expansion around \(\theta_0\) therefore reduces to the quadratic term,

$$
\mathrm{KL}(\theta)
\approx
\frac{1}{2}
\mathrm{FisherInformation}(\theta_0)
(\theta-\theta_0)^2.
$$

Thus, Fisher information also appears as the coefficient of the quadratic approximation of the KL divergence.
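
For example, for the normal-mean model \(\mathcal{N}(\theta,1)\), the KL divergence from \(\mathcal{N}(\theta_0,1)\) to \(\mathcal{N}(\theta,1)\) is \((\theta-\theta_0)^2/2\) (a standard closed form), which coincides with \(\tfrac12\cdot 1\cdot(\theta-\theta_0)^2\) since the Fisher information is \(1\). The minimal sketch below checks this against a Monte-Carlo estimate of the KL divergence.

import numpy as np

# Minimal sketch: Monte Carlo estimate of KL(N(theta0,1) || N(theta,1))
# versus the quadratic approximation 0.5 * FisherInformation * (theta - theta0)**2,
# where the Fisher information of the normal mean is 1.
rng = np.random.default_rng(0)
theta0 = 0.0
x = rng.normal(loc=theta0, scale=1.0, size=200_000)

def log_pdf(x, theta):
    return -0.5 * np.log(2 * np.pi) - 0.5 * (x - theta) ** 2

for theta in [0.1, 0.3, 0.5]:
    kl_mc = np.mean(log_pdf(x, theta0) - log_pdf(x, theta))
    kl_quad = 0.5 * 1.0 * (theta - theta0) ** 2
    print(theta, kl_mc, kl_quad)
# The two values agree up to Monte Carlo error (in this Gaussian case the
# quadratic "approximation" is in fact exact).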

Numerical Example

We analytically and numerically compute the Fisher information of the mean parameter \(\mu\) of the normal model \(\mathcal{N}(\mu,1)\) with known variance \(\sigma^2=1\), evaluated at \(\mu_0=0\).

Analytical Calculation

$$
p(x\mid\mu)=\frac{1}{\sqrt{2\pi}}
\exp\left(-\frac{(x-\mu)^2}{2}\right)
$$

$$
\log p(x\mid\mu)
= -\frac12\log(2\pi)-\frac{(x-\mu)^2}{2}
$$

$$
\frac{d}{d\mu}\log p(x\mid\mu)
= x-\mu,
\qquad
\frac{d^2}{d\mu^2}\log p(x\mid\mu)
= -1
$$

$$\mathrm{FisherInformation}(\mu)=-\mathbb{E}[-1]=1.$$

Numerical Calculation

Using equation (1), we fit a quadratic curve to the average log-likelihood and use its second derivative as a numerical estimate of Fisher information.

import numpy as np
import matplotlib.pyplot as plt

# 1. Generate data from N(0,1)
rng = np.random.default_rng(0)
n = 2000
mu0 = 0.0
sigma = 1.0
x = rng.normal(loc=mu0, scale=sigma, size=n)

# 2. Grid of μ values and mean log-likelihood
mus = np.linspace(-2, 2, 81)

def log_pdf_normal(x, mu, sigma):
    return -0.5 * np.log(2 * np.pi * sigma**2) - 0.5 * (x - mu) ** 2 / (sigma**2)

avg_loglik = np.array([np.mean(log_pdf_normal(x, m, sigma)) for m in mus])

# 3. Quadratic fit
coeffs = np.polyfit(mus, avg_loglik, deg=2)
a, b, c = coeffs
fit_loglik = np.polyval(coeffs, mus)

# 4. Fisher information estimate:
#    the fitted quadratic a*mu^2 + b*mu + c has second derivative 2a,
#    so the Fisher information estimate is -(2a)
fisher_est = -2 * a
print("Estimated Fisher information:", fisher_est)
# => Estimated Fisher information: 1.0000000000000002

# 5. Plot
plt.figure()
plt.plot(mus, avg_loglik, lw=2,
         color="#005aff",
         label="Monte Carlo average log-likelihood")
plt.plot(mus, fit_loglik,
         color="#ff4b00", lw=2,
         linestyle="--",
         label="Quadratic fit")
plt.axvline(mu0, color="gray", linestyle=":", linewidth=1)
plt.xlabel(r"$\mu$", fontsize=16)
plt.ylabel("Average log-likelihood", fontsize=16)
plt.title("Fisher information via quadratic fit", fontsize=16)
plt.legend(fontsize=14)
plt.show()

Monte-Carlo estimate of the log-likelihood (blue) and the quadratic fit (red dashed).

Running the code confirms that the numerical estimate closely matches the analytical value \(1\).

Mixture Model Example

Next, consider a more complex model:

$$p(x\mid\theta)=0.5\,\mathcal{N}(\theta,1)+0.5\,\mathcal{N}(-\theta,1),$$

and estimate the Fisher information at \(\theta=2\).
The model is symmetric in \(\theta\): it defines the same distribution at \(\theta=2\) and \(\theta=-2\),
so the average log-likelihood is symmetric around \(\theta=0\) and multimodal.

import numpy as np
import matplotlib.pyplot as plt

def log_normal_pdf(x, mean, var=1.0):
    return -0.5 * np.log(2 * np.pi * var) - 0.5 * (x - mean) ** 2 / var

def log_mixture_pdf(x, theta):
    a = log_normal_pdf(x, theta)
    b = log_normal_pdf(x, -theta)
    m = np.maximum(a, b)
    return m + np.log(0.5 * np.exp(a - m) + 0.5 * np.exp(b - m))

def sample_mixture(theta, n, rng):
    comps = rng.integers(0, 2, size=n) * 2 - 1  # -1 or +1
    means = comps * theta
    return rng.normal(loc=means, scale=1.0)

rng = np.random.default_rng(1)
n = 200_000
theta0 = 2.0

# Sample data
x = sample_mixture(theta0, n, rng)

# Grid near θ0
thetas_fit = np.linspace(0.5, 3.5, 81)

avg_loglik = np.array([np.mean(log_mixture_pdf(x, th)) for th in thetas_fit])

# Quadratic fit
coeffs = np.polyfit(thetas_fit, avg_loglik, deg=2)
a, b, c = coeffs

# Recompute the average log-likelihood on a wider grid for plotting
thetas = np.linspace(-4, 4, 161)
avg_loglik = np.array([np.mean(log_mixture_pdf(x, th)) for th in thetas])
fit_loglik = np.polyval(coeffs, thetas)

fisher_est = -2 * a
print("Estimated Fisher information at theta0=2:", fisher_est)
# => Estimated Fisher information at theta0=2: 0.9487177351723483

plt.figure()
plt.plot(thetas, avg_loglik,
         color="#005aff",
         label="Monte Carlo average log-likelihood")
plt.plot(thetas, fit_loglik,
         color="#ff4b00",
         linestyle="--",
         label="Quadratic fit")
plt.axvline(theta0, color="gray", linestyle=":", linewidth=1)

plt.xlabel(r"$\theta$", fontsize=16)
plt.ylabel("Average log-likelihood", fontsize=16)
plt.ylim(-4, -1.8)
plt.title("Fisher information via quadratic fit", fontsize=16)
plt.legend(fontsize=14)
plt.show()

Monte-Carlo estimate of the log-likelihood (blue) and the quadratic fit (red dashed).

Even though the likelihood is multimodal, a local quadratic fit still provides a usable estimate of the Fisher information.

Fisher Information for Independent and Identically Distributed Data

For independent and identically distributed samples \(X_1,\dots,X_n\),

$$
\begin{aligned}
\mathrm{FisherInformation}(\theta_0)
&\stackrel{\text{def}}{=}
-\mathbb{E}
\left[
\frac{d^2}{d\theta^2}
\log p(X_1,\dots,X_n\mid\theta)
\bigg|_{\theta=\theta_0}
\right] \\
&=
-\mathbb{E}
\left[
\frac{d^2}{d\theta^2}
\sum_{i=1}^n \log p(X_i\mid\theta)
\bigg|_{\theta=\theta_0}
\right] \\
&=
\sum_{i=1}^n
-\mathbb{E}
\left[
\frac{d^2}{d\theta^2}
\log p(X_i\mid\theta)
\bigg|_{\theta=\theta_0}
\right] \\
&=
n\cdot \mathrm{FisherInformation_{single}}(\theta_0).
\end{aligned}
$$

Thus, Fisher information scales linearly with the number of observations.
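
This scaling is easy to verify numerically. The minimal sketch below (again for the normal-mean model, whose per-sample Fisher information is \(1\)) estimates the Fisher information of the joint likelihood of \(n\) i.i.d. samples from the curvature of the total log-likelihood and recovers roughly \(n\cdot 1\).

import numpy as np

# Minimal sketch: the Fisher information of n i.i.d. samples from N(mu, 1)
# equals n times the single-sample Fisher information (= 1). We estimate it
# from the second derivative of the total log-likelihood at mu0.
rng = np.random.default_rng(0)
mu0 = 0.0

def total_loglik(x, mu):
    return np.sum(-0.5 * np.log(2 * np.pi) - 0.5 * (x - mu) ** 2)

eps = 1e-3
for n in [10, 100, 1000]:
    x = rng.normal(loc=mu0, scale=1.0, size=n)
    curvature = (total_loglik(x, mu0 + eps) - 2 * total_loglik(x, mu0)
                 + total_loglik(x, mu0 - eps)) / eps**2
    print(n, -curvature)  # should be close to n * 1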

Cramér–Rao Lower Bound

For any unbiased estimator \(\hat{\theta}\),

$$
\mathrm{Var}(\hat{\theta})
\ge
\frac{1}{\mathrm{FisherInformation}(\theta_0)}.
$$

For i.i.d. data, this becomes

$$
\mathrm{Var}(\hat{\theta})
\ge
\frac{1}{n\cdot \mathrm{FisherInformation_{single}}(\theta_0)}.
$$

Thus, the standard deviation of \(\hat{\theta}\) satisfies

$$
\mathrm{Std}(\hat{\theta})
\ge
\frac{1}{\sqrt{n\cdot \mathrm{FisherInformation_{single}}(\theta_0)}},
$$

which matches the classical \(O(1/\sqrt{n})\) convergence rate, with Fisher information determining the constant. A larger Fisher information yields faster convergence and more accurate estimation.
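
As a sanity check, the minimal sketch below estimates the mean of \(\mathcal{N}(0,1)\) with the sample mean, an unbiased estimator that attains the bound in this model, and compares its empirical variance with \(1/(n\cdot 1)\).

import numpy as np

# Minimal sketch: empirical variance of the sample mean (an unbiased estimator
# of mu for N(mu, 1)) versus the Cramer-Rao lower bound
# 1 / (n * FisherInformation_single), with FisherInformation_single = 1.
rng = np.random.default_rng(0)
mu0 = 0.0
n = 50
n_trials = 100_000

estimates = rng.normal(loc=mu0, scale=1.0, size=(n_trials, n)).mean(axis=1)
print("Empirical variance of the sample mean:", estimates.var())
print("Cramer-Rao lower bound               :", 1.0 / (n * 1.0))
# The sample mean attains the bound here, so the two numbers nearly coincide.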

Multivariate Case

When the parameter is vector-valued \(\boldsymbol{\theta}=(\theta_1,\dots,\theta_d)\), the Fisher information becomes the Fisher information matrix

$$
\mathbf{I}(\boldsymbol{\theta_0})
\stackrel{\text{def}}{=}
-\mathbb{E}_{X\sim p(x\mid\boldsymbol{\theta_0})}
\left[
\nabla_{\boldsymbol{\theta}}^2
\log p(X\mid\boldsymbol{\theta})
\bigg|_{\boldsymbol{\theta}=\boldsymbol{\theta_0}}
\right].
$$

The Cramér–Rao bound generalizes to a matrix inequality: the covariance matrix of any unbiased estimator dominates the inverse Fisher information matrix in the positive-semidefinite order.
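
As an illustration, the minimal sketch below estimates the Fisher information matrix of the normal model \(\mathcal{N}(\mu,\sigma^2)\) with parameter vector \(\boldsymbol{\theta}=(\mu,\sigma)\) as a finite-difference Hessian of the negative average log-likelihood. The helper `hessian` is introduced only for this illustration; the standard analytical answer is \(\mathrm{diag}(1/\sigma^2,\,2/\sigma^2)\), i.e., \(\mathrm{diag}(1,2)\) at \(\sigma=1\).

import numpy as np

# Minimal sketch: estimate the Fisher information matrix of N(mu, sigma^2)
# with theta = (mu, sigma) at theta0 = (0, 1) as the Hessian of the negative
# average log-likelihood, computed by central finite differences.
rng = np.random.default_rng(0)
x = rng.normal(loc=0.0, scale=1.0, size=500_000)

def neg_avg_loglik(theta):
    mu, sigma = theta
    return np.mean(0.5 * np.log(2 * np.pi * sigma**2) + 0.5 * (x - mu) ** 2 / sigma**2)

def hessian(f, theta0, eps=1e-3):
    theta0 = np.asarray(theta0, dtype=float)
    d = len(theta0)
    H = np.zeros((d, d))
    for i in range(d):
        for j in range(d):
            def shifted(si, sj):
                t = theta0.copy()
                t[i] += si
                t[j] += sj
                return f(t)
            H[i, j] = (shifted(eps, eps) - shifted(eps, -eps)
                       - shifted(-eps, eps) + shifted(-eps, -eps)) / (4 * eps**2)
    return H

print(hessian(neg_avg_loglik, [0.0, 1.0]))
# Expected to be close to [[1, 0], [0, 2]]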

Summary

Fisher information is a fundamental quantity describing how precisely model parameters can be estimated.
It can be interpreted as the curvature of the cross-entropy or KL divergence.
For independent and identically distributed data, it increases linearly with the sample size.
The Cramér–Rao inequality shows that a larger Fisher information enables more accurate parameter estimation.

Author Profile


Ryoma Sato

Currently an Assistant Professor at the National Institute of Informatics, Japan.

Research Interest: Machine Learning and Data Mining.

Ph.D (Kyoto University).


