Those of you who have already tested the variance of normally distributed data may have asked yourselves how the link between the normal variance and the chi-squared distribution arises. Trust me: the story I am about to tell you is an exciting one!

Simple explanation based on population mean

The chi-squared distribution with n degrees of freedom is defined as the sum of n independent squared standard-normal variables \sum_{k=1}^{n} Z_{k}^{2} with Z_{k} \sim \mathcal{N}(0,1):

\displaystyle Z_{1}^{2}+\dots+Z_{n}^{2} \sim \chi^{2}(n)
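If you want to convince yourself of this definition numerically, here is a minimal simulation sketch. It assumes numpy and scipy are available; the sample size n, the replication count, and the seed are arbitrary choices of mine. A large p-value in the Kolmogorov-Smirnov test indicates that the simulated sums are consistent with \chi^{2}(n).

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)  # seed chosen arbitrarily
n = 5            # degrees of freedom (arbitrary choice)
reps = 100_000   # number of simulated sums (arbitrary choice)

# Each row holds n independent standard-normal draws; summing the
# squares row-wise yields one sample of Z_1^2 + ... + Z_n^2.
z = rng.standard_normal((reps, n))
chi2_samples = (z ** 2).sum(axis=1)

# Kolmogorov-Smirnov test against the chi-squared(n) distribution
print(stats.kstest(chi2_samples, stats.chi2(df=n).cdf))
```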

Now, the population variance is given by

\displaystyle \sigma^{2}=E\Big[\big(X-E(X)\big)^{2}\Big]

An estimator for the variance based on the population mean is

\displaystyle \widehat{\sigma}^{2}=\frac{1}{n}\sum_{i=1}^{n}\big(X_{i}-\mu\big)^{2}

In order to demonstrate the relationship to the chi-squared distribution, let's multiply by n/\sigma^{2}.

\displaystyle \frac{n\widehat{\sigma}^{2}}{\sigma^{2}}=\sum_{i=1}^{n}\Big(\frac{X_{i}-\mu}{\sigma}\Big)^{2}=\sum_{i=1}^{n}Z_{i}^{2}\sim\chi^{2}(n)

Dividing each deviation X_{i}-\mu by \sigma is a z-transformation: it turns every random observation X_{i} into an independent standard-normal variable Z_{i}, whose squares are then summed. But this is exactly the definition of the chi-squared distribution! Hence, for a known population mean, n times the ratio of the variance estimator to the true underlying population variance follows a chi-squared distribution with n degrees of freedom, where n denotes the sample size.
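Here is a minimal simulation sketch of this result (numpy and scipy assumed; \mu, \sigma, n, the replication count, and the seed are arbitrary choices):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
mu, sigma = 10.0, 2.0   # known population parameters (arbitrary)
n, reps = 8, 100_000    # sample size and replications (arbitrary)

x = rng.normal(mu, sigma, size=(reps, n))

# Variance estimator based on the *known* population mean mu
sigma2_hat = ((x - mu) ** 2).mean(axis=1)

# n * sigma2_hat / sigma^2 should follow chi-squared(n)
stat = n * sigma2_hat / sigma ** 2
print(stats.kstest(stat, stats.chi2(df=n).cdf))
```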

Advanced explanation based on sample observations only

Usually, however, you don't know the population mean. So you might be interested in proving the claim that the sample variance is also related to the chi-squared distribution.

We make use of the relationship

\displaystyle \sum_{i=1}^{n}\big(X_{i}-\mu\big)^{2}=\sum_{i=1}^{n}\Big(X_{i}-\overline{X}\Big)^{2}+\sum_{i=1}^{n}\Big(\overline{X}-\mu\Big)^{2}

which holds because the cross term 2(\overline{X}-\mu)\sum_{i=1}^{n}(X_{i}-\overline{X}) vanishes (the deviations from the sample mean sum to zero). With this, we can write the sample variance as

\displaystyle S^{2}=\frac{1}{n-1}\sum_{i=1}^{n}\Big(X_{i}-\overline{X}\Big)^{2}=\frac{1}{n-1}\Big(\sum_{i=1}^{n}\big(X_{i}-\mu\big)^{2}-n\big(\overline{X}-\mu\big)^{2}\Big)
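As a quick sanity check, both the decomposition and the rewritten sample variance can be verified numerically for a single sample (a minimal sketch; numpy assumed, parameter values arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
mu, sigma, n = 10.0, 2.0, 8  # arbitrary example values
x = rng.normal(mu, sigma, size=n)
xbar = x.mean()

# The decomposition around mu versus around the sample mean
lhs = ((x - mu) ** 2).sum()
rhs = ((x - xbar) ** 2).sum() + n * (xbar - mu) ** 2
print(np.isclose(lhs, rhs))  # True: the identity holds exactly

# The rewritten sample variance agrees with the direct formula
s2_direct = ((x - xbar) ** 2).sum() / (n - 1)
s2_rewritten = (lhs - n * (xbar - mu) ** 2) / (n - 1)
print(np.isclose(s2_direct, s2_rewritten))  # True
```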

Multiplying the sample variance by (n-1)/\sigma^{2} gives

\displaystyle \frac{(n-1)S^{2}}{\sigma^{2}}=\sum_{i=1}^{n}\Big(\frac{X_{i}-\overline{X}}{\sigma}\Big)^{2}=\sum_{i=1}^{n}\Big(\frac{X_{i}-\mu}{\sigma}\Big)^{2}-\Big(\frac{\overline{X}-\mu}{\sigma/\sqrt{n}}\Big)^{2}

Rearranging the equation gives

\displaystyle \frac{(n-1)S^{2}}{\sigma^{2}}+\Big(\frac{\overline{X}-\mu}{\sigma/\sqrt{n}}\Big)^{2}=\sum_{i=1}^{n}\Big(\frac{X_{i}-\overline{X}}{\sigma}\Big)^{2}+\Big(\frac{\overline{X}-\mu}{\sigma/\sqrt{n}}\Big)^{2}=\sum_{i=1}^{n}\Big(\frac{X_{i}-\mu}{\sigma}\Big)^{2}

To distinguish the different terms, let us introduce the shorthand

\displaystyle P=\frac{(n-1)S^{2}}{\sigma^{2}},\quad Q_{1}=\Big(\frac{\overline{X}-\mu}{\sigma/\sqrt{n}}\Big)^{2},\quad Q_{2}=\sum_{i=1}^{n}\Big(\frac{X_{i}-\overline{X}}{\sigma}\Big)^{2},\quad Q=\sum_{i=1}^{n}\Big(\frac{X_{i}-\mu}{\sigma}\Big)^{2}

The equation now reads as follows:

P+Q_1=Q_2+Q_1=Q
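For a single concrete sample, these four quantities can be computed directly and the identity verified (a minimal sketch; numpy assumed, parameter values arbitrary):

```python
import numpy as np

rng = np.random.default_rng(3)
mu, sigma, n = 10.0, 2.0, 8  # arbitrary example values
x = rng.normal(mu, sigma, size=n)
xbar = x.mean()

P  = (n - 1) * x.var(ddof=1) / sigma ** 2
Q1 = ((xbar - mu) / (sigma / np.sqrt(n))) ** 2
Q2 = (((x - xbar) / sigma) ** 2).sum()
Q  = (((x - mu) / sigma) ** 2).sum()

print(np.isclose(P, Q2))      # True: P and Q_2 are the same quantity
print(np.isclose(P + Q1, Q))  # True: P + Q_1 = Q_2 + Q_1 = Q
```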

Q_1 \sim \chi^2(1) because (\overline{X}-\mu)/(\sigma/\sqrt{n}) is a standard-normal variable, and Q \sim \chi^2(n) because it is a sum of n independent squared standard-normal variables; both terms refer to the population mean \mu. The summands of Q_2 , in contrast, depend on \overline{X} . However, \overline{X} is calculated from the X_i, so you can manipulate at most n-1 of the X_i without modifying the sum. As a consequence, Q_2 only has n-1 degrees of freedom. According to Cochran's theorem, Q_2\sim \chi^2(n-1), and the degrees of freedom of the three expressions must be equal. That is: Q \sim \chi^2(n) , Q_2+Q_1 \sim \chi^2(n) and P+Q_1 \sim \chi^2(n).

But P=Q_2, so Cochran's theorem also gives P \sim \chi^2(n-1), with P independent of Q_1. It follows that

\displaystyle \frac{(n-1)S^{2}}{\sigma^{2}} \sim \chi^2(n-1)
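Once more, a minimal simulation sketch (numpy and scipy assumed; parameter values arbitrary) confirms the result; note that \mu is not used when computing S^{2}:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
mu, sigma = 10.0, 2.0   # population parameters (arbitrary)
n, reps = 8, 100_000    # sample size and replications (arbitrary)

x = rng.normal(mu, sigma, size=(reps, n))

# Sample variance S^2 with the usual n-1 denominator (ddof=1);
# only the data itself is used, not the population mean mu.
s2 = x.var(axis=1, ddof=1)

# (n-1) * S^2 / sigma^2 should follow chi-squared(n-1)
stat = (n - 1) * s2 / sigma ** 2
print(stats.kstest(stat, stats.chi2(df=n - 1).cdf))
```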

Hopefully this explanation helped you get an idea of why the variance is related to the chi-squared distribution 🙂