Those of you who have already tested the variance of normally distributed data may have asked yourselves how the link between the normal variance and the chi-squared distribution arises. Trust me: the story I am about to tell you is an exciting one!

Simple explanation based on population mean

The chi-squared distribution with n degrees of freedom is defined as the sum of n independent squared standard-normal variables \sum_{k=1}^{n} Z_{k}^{2} with Z_{k} \sim \mathcal{N}(0,1):

\displaystyle Z_{1}^{2}+\dots+Z_{n}^{2} \sim \chi^{2}(n)
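If you want to convince yourself of this definition numerically, here is a minimal simulation sketch. It assumes numpy and scipy are available; the sample size n, the replication count, and the seed are arbitrary choices of mine. A large p-value in the Kolmogorov-Smirnov test indicates that the simulated sums are consistent with \chi^{2}(n).

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)  # seed chosen arbitrarily
n = 5            # degrees of freedom (arbitrary choice)
reps = 100_000   # number of simulated sums (arbitrary choice)

# Each row holds n independent standard-normal draws; summing the
# squares row-wise yields one sample of Z_1^2 + ... + Z_n^2.
z = rng.standard_normal((reps, n))
chi2_samples = (z ** 2).sum(axis=1)

# Kolmogorov-Smirnov test against the chi-squared(n) distribution
print(stats.kstest(chi2_samples, stats.chi2(df=n).cdf))
```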

Now, the population variance is given by

\displaystyle \sigma^{2}=E\Big[\big(X-E(X)\big)^{2}\Big]

An estimator for the variance based on the population mean is

\displaystyle \widehat{\sigma}^{2}=\frac{1}{n}\sum_{i=1}^{n}\big(X_{i}-\mu\big)^{2}

In order to demonstrate the relationship to the chi-squared distribution, let's multiply by n/\sigma^{2}.

\displaystyle \frac{n\widehat{\sigma}^{2}}{\sigma^{2}}=\sum_{i=1}^{n}\Big(\frac{X_{i}-\mu}{\sigma}\Big)^{2}=\sum_{i=1}^{n}Z_{i}^{2}\sim\chi^{2}(n)

Dividing each deviation X_{i}-\mu by \sigma is a z-transformation: it turns every random observation X_{i} into an independent standard-normal variable Z_{i}, whose squares are then summed. But this is exactly the definition of the chi-squared distribution! Hence, for a known population mean, n times the ratio of the variance estimator to the true underlying population variance follows a chi-squared distribution with n degrees of freedom, where n denotes the sample size.
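Here is a minimal simulation sketch of this result (numpy and scipy assumed; \mu, \sigma, n, the replication count, and the seed are arbitrary choices):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
mu, sigma = 10.0, 2.0   # known population parameters (arbitrary)
n, reps = 8, 100_000    # sample size and replications (arbitrary)

x = rng.normal(mu, sigma, size=(reps, n))

# Variance estimator based on the *known* population mean mu
sigma2_hat = ((x - mu) ** 2).mean(axis=1)

# n * sigma2_hat / sigma^2 should follow chi-squared(n)
stat = n * sigma2_hat / sigma ** 2
print(stats.kstest(stat, stats.chi2(df=n).cdf))
```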

Advanced explanation based on sample observations only

Usually, however, you don't know the population mean. So you might be interested in proving the claim that the sample variance is also related to the chi-squared distribution.

We make use of the relationship

\displaystyle \sum_{i=1}^{n}\big(X_{i}-\mu\big)^{2}=\sum_{i=1}^{n}\Big(X_{i}-\overline{X}\Big)^{2}+\sum_{i=1}^{n}\Big(\overline{X}-\mu\Big)^{2}

which holds because the cross term 2(\overline{X}-\mu)\sum_{i=1}^{n}(X_{i}-\overline{X}) vanishes (the deviations from the sample mean sum to zero). With this, we can write the sample variance as

\displaystyle S^{2}=\frac{1}{n-1}\sum_{i=1}^{n}\Big(X_{i}-\overline{X}\Big)^{2}=\frac{1}{n-1}\Big(\sum_{i=1}^{n}\big(X_{i}-\mu\big)^{2}-n\big(\overline{X}-\mu\big)^{2}\Big)
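As a quick sanity check, both the decomposition and the rewritten sample variance can be verified numerically for a single sample (a minimal sketch; numpy assumed, parameter values arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
mu, sigma, n = 10.0, 2.0, 8  # arbitrary example values
x = rng.normal(mu, sigma, size=n)
xbar = x.mean()

# The decomposition around mu versus around the sample mean
lhs = ((x - mu) ** 2).sum()
rhs = ((x - xbar) ** 2).sum() + n * (xbar - mu) ** 2
print(np.isclose(lhs, rhs))  # True: the identity holds exactly

# The rewritten sample variance agrees with the direct formula
s2_direct = ((x - xbar) ** 2).sum() / (n - 1)
s2_rewritten = (lhs - n * (xbar - mu) ** 2) / (n - 1)
print(np.isclose(s2_direct, s2_rewritten))  # True
```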

Multiplying the sample variance by (n-1)/\sigma^{2} gives

\displaystyle \frac{(n-1)S^{2}}{\sigma^{2}}=\sum_{i=1}^{n}\Big(\frac{X_{i}-\overline{X}}{\sigma}\Big)^{2}=\sum_{i=1}^{n}\Big(\frac{X_{i}-\mu}{\sigma}\Big)^{2}-\Big(\frac{\overline{X}-\mu}{\sigma/\sqrt{n}}\Big)^{2}

Rearranging the equation gives

\displaystyle \frac{(n-1)S^{2}}{\sigma^{2}}+\Big(\frac{\overline{X}-\mu}{\sigma/\sqrt{n}}\Big)^{2}=\sum_{i=1}^{n}\Big(\frac{X_{i}-\overline{X}}{\sigma}\Big)^{2}+\Big(\frac{\overline{X}-\mu}{\sigma/\sqrt{n}}\Big)^{2}=\sum_{i=1}^{n}\Big(\frac{X_{i}-\mu}{\sigma}\Big)^{2}

To distinguish the different terms, let us introduce the shorthand

\displaystyle P=\frac{(n-1)S^{2}}{\sigma^{2}},\quad Q_{1}=\Big(\frac{\overline{X}-\mu}{\sigma/\sqrt{n}}\Big)^{2},\quad Q_{2}=\sum_{i=1}^{n}\Big(\frac{X_{i}-\overline{X}}{\sigma}\Big)^{2},\quad Q=\sum_{i=1}^{n}\Big(\frac{X_{i}-\mu}{\sigma}\Big)^{2}

The equation now reads as follows:

P+Q_1=Q_2+Q_1=Q
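For a single concrete sample, these four quantities can be computed directly and the identity verified (a minimal sketch; numpy assumed, parameter values arbitrary):

```python
import numpy as np

rng = np.random.default_rng(3)
mu, sigma, n = 10.0, 2.0, 8  # arbitrary example values
x = rng.normal(mu, sigma, size=n)
xbar = x.mean()

P  = (n - 1) * x.var(ddof=1) / sigma ** 2
Q1 = ((xbar - mu) / (sigma / np.sqrt(n))) ** 2
Q2 = (((x - xbar) / sigma) ** 2).sum()
Q  = (((x - mu) / sigma) ** 2).sum()

print(np.isclose(P, Q2))      # True: P and Q_2 are the same quantity
print(np.isclose(P + Q1, Q))  # True: P + Q_1 = Q_2 + Q_1 = Q
```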

Q_1 \sim \chi^2(1) because (\overline{X}-\mu)/(\sigma/\sqrt{n}) is a standard-normal variable, and Q \sim \chi^2(n) because it is a sum of n independent squared standard-normal variables; both terms refer to the population mean \mu. The summands of Q_2 , in contrast, depend on \overline{X} . However, \overline{X} is calculated from the X_i, so you can manipulate at most n-1 of the X_i without modifying the sum. As a consequence, Q_2 only has n-1 degrees of freedom. According to Cochran's theorem, Q_2\sim \chi^2(n-1), and the degrees of freedom of the three expressions must be equal. That is: Q \sim \chi^2(n) , Q_2+Q_1 \sim \chi^2(n) and P+Q_1 \sim \chi^2(n).

But P=Q_2, so Cochran's theorem also gives P \sim \chi^2(n-1), with P independent of Q_1. It follows that

\displaystyle \frac{(n-1)S^{2}}{\sigma^{2}} \sim \chi^2(n-1)
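Once more, a minimal simulation sketch (numpy and scipy assumed; parameter values arbitrary) confirms the result; note that \mu is not used when computing S^{2}:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
mu, sigma = 10.0, 2.0   # population parameters (arbitrary)
n, reps = 8, 100_000    # sample size and replications (arbitrary)

x = rng.normal(mu, sigma, size=(reps, n))

# Sample variance S^2 with the usual n-1 denominator (ddof=1);
# only the data itself is used, not the population mean mu.
s2 = x.var(axis=1, ddof=1)

# (n-1) * S^2 / sigma^2 should follow chi-squared(n-1)
stat = (n - 1) * s2 / sigma ** 2
print(stats.kstest(stat, stats.chi2(df=n - 1).cdf))
```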

Hopefully this explanation helped you get an idea of why the variance is related to the chi-squared distribution 🙂