# Proof of Unbiasedness of Sample Variance Estimator

(As I received some remarks about the unnecessary length of this proof, I provide a shorter version here.)

In many applications of statistics and econometrics, as well as in many other fields, it is necessary to estimate the variance of a sample. The estimator of the variance, see equation (1), is common knowledge and most people simply apply it without any further concern. The question that arose for me was: why do we actually divide by n-1 and not simply by n? In the following lines we are going to see the proof that the sample variance estimator is indeed unbiased.

$s^2$ = variance of the sample

$x_i$ = realizations of the random variable X, with $i$ running from 1 to n

$\bar x$ = sample average

$\mu$ = mean of the population

$\delta^2$ = population variance

(1) $s^2=\frac{1}{n-1}\sum\limits_{i=1}^n(x_i-\bar x)^2$
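For readers who prefer to see equation (1) in action, here is a minimal Python sketch (the data values are made up for illustration; note that `statistics.variance` uses the same $n-1$ divisor):

```python
import statistics

# made-up sample
x = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
n = len(x)
x_bar = sum(x) / n

# equation (1): divide the sum of squared deviations by n - 1, not n
s2 = sum((xi - x_bar) ** 2 for xi in x) / (n - 1)

print(s2)                      # 32/7 ≈ 4.571
print(statistics.variance(x))  # same value: the library also divides by n - 1
```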

#### First step of the proof

(2) $x_i-\bar x = x_i - \mu + \mu - \bar x$

(3) $x_i-\bar x =( x_i - \mu) + (\mu - \bar x)$

(4) $(x_i-\bar x)^2 = [(x_i - \mu) + (\mu - \bar x)]^2$

(5) $(x_i-\bar x)^2 = (x_i - \mu)(x_i - \mu)+(x_i - \mu)(\mu - \bar x)+(\mu - \bar x)(x_i - \mu)+(\mu - \bar x)(\mu - \bar x)$

(6) $(x_i-\bar x)^2 = (x_i - \mu)^2+2(x_i - \mu)(\mu - \bar x)+(\mu - \bar x)^2$

(7) $\sum\limits_{i=1}^n(x_i-\bar x)^2 = \sum\limits_{i=1}^n(x_i - \mu)^2+2\sum\limits_{i=1}^n(x_i - \mu)(\mu - \bar x)+\sum\limits_{i=1}^n(\mu - \bar x)^2$

(8) $\sum\limits_{i=1}^n(x_i-\bar x)^2 = \sum\limits_{i=1}^n(x_i - \mu)^2+2(\mu - \bar x)\sum\limits_{i=1}^n(x_i - \mu)+n(\mu - \bar x)^2$

(9) $\sum\limits_{i=1}^n(x_i-\bar x)^2 = \sum\limits_{i=1}^n(x_i - \mu)^2+n(\mu - \bar x)^2+2(\mu - \bar x)\sum\limits_{i=1}^n(x_i - \mu)$

(10) $\sum\limits_{i=1}^n(x_i-\bar x)^2 = \sum\limits_{i=1}^n(x_i - \mu)^2+n(\mu - \bar x)^2+2(\mu - \bar x)(\sum\limits_{i=1}^n x_i - \sum\limits_{i=1}^n \mu)$

(11) $\sum\limits_{i=1}^n(x_i-\bar x)^2 = \sum\limits_{i=1}^n(x_i - \mu)^2+n(\mu - \bar x)^2+2(\mu - \bar x)(\sum\limits_{i=1}^n x_i - n\mu)$

(12) $\sum\limits_{i=1}^n(x_i-\bar x)^2 = \sum\limits_{i=1}^n(x_i - \mu)^2+n(\mu - \bar x)^2+2(\mu - \bar x)(n\bar x - n\mu)$ as $n\bar x=\sum\limits_{i=1}^n x_i$ which derives from $\bar x=\frac{\sum\limits_{i=1}^n x_i}{n}$

(13) $\sum\limits_{i=1}^n(x_i-\bar x)^2 = \sum\limits_{i=1}^n(x_i - \mu)^2+n(\mu - \bar x)^2+2n(\mu - \bar x)(\bar x - \mu)$

(14) $\sum\limits_{i=1}^n(x_i-\bar x)^2 = \sum\limits_{i=1}^n(x_i - \mu)^2+n(\mu - \bar x)^2+2n(-1)(-\mu + \bar x)(\bar x - \mu)$

(15) $\sum\limits_{i=1}^n (x_i-\bar x)^2 = \sum\limits_{i=1}^n(x_i - \mu)^2+n(\mu - \bar x)^2-2n(\bar x-\mu )(\bar x - \mu)$

(16) $\sum\limits_{i=1}^n(x_i-\bar x)^2 = \sum\limits_{i=1}^n(x_i - \mu)^2+n(\mu - \bar x)^2-2n(\bar x-\mu )^2$

(17) $\sum\limits_{i=1}^n(x_i-\bar x)^2 = \sum\limits_{i=1}^n(x_i - \mu)^2+n(\bar x-\mu)^2-2n(\bar x-\mu )^2$

which is allowed, since $(\mu - \bar x)^2=(\bar x-\mu)^2$, so nothing changes in the result

(18) $\sum\limits_{i=1}^n(x_i-\bar x)^2 = \sum\limits_{i=1}^n(x_i - \mu)^2-n(\bar x-\mu )^2$
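Equation (18) is a purely algebraic identity, so it can be checked numerically for any data set and any value of $\mu$. A small Python sketch with made-up numbers:

```python
# made-up data and an arbitrary mu; the identity holds for any choice
x = [1.5, 2.0, 3.5, 4.0, 6.0]
mu = 3.0
n = len(x)
x_bar = sum(x) / n

# equation (18): sum (x_i - x_bar)^2 = sum (x_i - mu)^2 - n (x_bar - mu)^2
lhs = sum((xi - x_bar) ** 2 for xi in x)
rhs = sum((xi - mu) ** 2 for xi in x) - n * (x_bar - mu) ** 2
print(abs(lhs - rhs) < 1e-12)  # True
```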

#### Second step of the proof

(19) $E(X)=\sum\limits_{i=1}^n x_i f(x_i)$ for a discrete random variable X; if every observation is equally likely, i.e. $f(x_i)=\frac{1}{n}$, then

(21) $E(X)=\frac{1}{n}\sum\limits_{i=1}^n x_i$

(22) $E(S^2)=E\left(\frac{1}{n-1}\left[\sum\limits_{i=1}^n(x_i - \mu)^2-n(\bar x-\mu )^2\right]\right)$

(23) $E(S^2)=\frac{1}{n-1}\left[E\left(\sum\limits_{i=1}^n (x_i - \mu)^2\right)-n E(\bar x-\mu )^2\right]$

#### Third step of the proof

(24) $E(X+Y)=\sum\limits_{i=1}^n\sum\limits_{j=1}^n (x_i+y_j)f(x_i,y_j)$

(25) $E(X+Y)=\sum\limits_{i=1}^n\sum\limits_{j=1}^n x_if(x_i,y_j)+\sum\limits_{i=1}^n\sum\limits_{j=1}^n y_jf(x_i,y_j)$

(26) $E(X+Y)=\sum\limits_{i=1}^n x_if(x_i)\sum\limits_{j=1}^n f(y_j)+\sum\limits_{j=1}^n y_jf(y_j)\sum\limits_{i=1}^n f(x_i)$, using $f(x_i,y_j)=f(x_i)f(y_j)$ for independent X and Y

(27) $E(X+Y)=\sum\limits_{i=1}^n x_if(x_i)+\sum\limits_{j=1}^n y_jf(y_j)$, as $\sum\limits_{j=1}^n f(y_j)=1$ and $\sum\limits_{i=1}^n f(x_i)=1$

(28)$E(X+Y)=E(X)+E(Y)$
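The linearity result (28) can also be checked numerically. A small sketch with two made-up finite distributions, each value carrying equal weight $f=\frac{1}{n}$:

```python
# two made-up equally weighted distributions
xs = [1.0, 3.0, 5.0]
ys = [2.0, 4.0, 9.0]

e_x = sum(xs) / len(xs)  # E(X) = 3.0
e_y = sum(ys) / len(ys)  # E(Y) = 5.0

# expectation of the sum over all equally likely pairs (x_i, y_j),
# mirroring the double sum in (24)
e_sum = sum(x + y for x in xs for y in ys) / (len(xs) * len(ys))
print(e_sum)  # 8.0 = E(X) + E(Y)
```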

Applying this to our original function:

(29) $E(g(X))=\sum\limits_{i=1}^n g(x_i)f(x_i)$

(30) with $g(x_i)=(x_i - \mu)^2$: $E((X-\mu)^2)=\sum\limits_{i=1}^n (x_i - \mu)^2f(x_i)$

(31) $E((X-\mu)^2)=\sum\limits_{i=1}^n (x_i - \mu)^2\frac{1}{n}$, as $f(x_i)=\frac{1}{n}$

(32) $E((X-\mu)^2)=\frac{1}{n}\sum\limits_{i=1}^n (x_i - \mu)^2=Var(X)$

(33) $E\left(\sum\limits_{i=1}^n(x_i - \mu)^2\right)=\sum\limits_{i=1}^n E\left((x_i - \mu)^2\right)=\sum\limits_{i=1}^n Var(x_i)$

(34) $E\left(\sum\limits_{i=1}^n(x_i - \mu)^2\right)=nVar(x_i)$

Plugging (34) into (23) gives us:

(35)$E(S^2)=\frac{1}{n-1}[nVar(x_i)-nE(\bar x-\mu )^2]$

Notice also that:

(36)$Var(\bar x)=Var(\frac{\sum\limits_{i=1}^n x_i}{n})=Var(\frac{1}{n}\sum\limits_{i=1}^n x_i)$

and that:

(37)$Var(a+bx_i)$

(38)$\mu_y=a+b\mu_x$

(39)$y_i=a+bx_i$

using (38) and (39) in (37) we get:

(40)$Var(a+bx_i)=\frac{1}{n}\sum\limits_{i=1}^n (y_i-\mu_y)^2$

(41) $Var(a+bx_i)=\frac{1}{n}\sum\limits_{i=1}^n (a+bx_i-(a+b\mu_x))^2$

(42)$Var(a+bx_i)=\frac{1}{n}\sum\limits_{i=1}^n(a+bx_i-a-b\mu_x)^2$

(43)$Var(a+bx_i)=\frac{1}{n}\sum\limits_{i=1}^n(bx_i-b\mu_x)^2$

(44)$Var(a+bx_i)=\frac{1}{n}\sum\limits_{i=1}^n(b(x_i-\mu_x))^2$

(45)$Var(a+bx_i)=\frac{1}{n}\sum\limits_{i=1}^n(b^2(x_i-\mu_x)^2)$

(46)$Var(a+bx_i)=b^2\frac{1}{n}\sum\limits_{i=1}^n(x_i-\mu_x)^2$

(47)$Var(a+bx_i)=b^2Var(x_i)$
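Result (47) can be verified on any finite set of values. A short sketch, using the $\frac{1}{n}$ divisor of the derivation (40)-(47) and made-up values for $a$, $b$ and the data:

```python
# population variance with the 1/n divisor, as in (40)-(47)
def pop_var(values):
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values) / len(values)

x = [1.0, 2.0, 4.0, 7.0]
a, b = 3.0, -2.0
y = [a + b * xi for xi in x]  # the linear transformation of (39)

# equation (47): Var(a + b x) = b^2 Var(x)
print(abs(pop_var(y) - b ** 2 * pop_var(x)) < 1e-12)  # True
```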

(48)$Var(\bar x)=Var(\frac{\sum\limits_{i=1}^n x_i}{n})=Var(\frac{1}{n}\sum\limits_{i=1}^n x_i)$

(49)$Var(\bar x)=Var(\frac{1}{n}\sum\limits_{i=1}^n x_i)$

(50)$Var(\bar x)=\frac{1}{n^2}Var(\sum\limits_{i=1}^n x_i)$

(51)$Var(\bar x)=\frac{1}{n^2}\sum\limits_{i=1}^n Var(x_i)$

(52) $Var(\bar x)=\frac{1}{n^2}\sum\limits_{i=1}^n\left(\frac{1}{n}\sum\limits_{j=1}^n(x_j-\mu_x)^2\right)$

just looking at the last part of (51), where we have $Var(x_i)=\frac{1}{n}\sum\limits_{j=1}^n (x_j-\mu_x)^2$, we can apply simple computation rules of variance calculation:

(53) $Var(x_i)=\frac{1}{n}\sum\limits_{j=1}^n(x_j-\mu_x)^2$

(54)$Var(X+Y)=\frac{1}{n}\sum\limits_{i=1}^n((X+Y)-(\mu_x+\mu_y))^2$

now the $x_i$ on the lhs of (53) corresponds to the $(X+Y)$ on the rhs of (54), and the $\mu_x$ on the rhs of (53) corresponds to the $\mu_x+\mu_y$ on the rhs of (54). What exactly do we mean by that? Well,

(55)$Var(X+Y)=E[(X+Y)-(\mu_x+\mu_y)]^2$

(56)$Var(X+Y)=E[(X-\mu_x)+(Y-\mu_y)]^2$

(57)$Var(X+Y)=E[(X-\mu_x)^2+(Y-\mu_y)^2+2(X-\mu_x)(Y-\mu_y)]$

(58) $Var(X+Y)=Var(X)+Var(Y)+2E[(X-\mu_x)(Y-\mu_y)]$

the term $2E[(X-\mu_x)(Y-\mu_y)]$ is twice the covariance of X and Y, and it is zero, as X is independent of Y. This leaves us with the variance of X and the variance of Y. From (52) we know that
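That the cross term vanishes when X and Y are independent can be illustrated by simulation. A sketch with arbitrarily chosen distributions (a normal X with variance 4 and a uniform Y with variance 3):

```python
import random

random.seed(2)
reps = 100_000
xs = [random.gauss(1.0, 2.0) for _ in range(reps)]    # Var(X) = 2^2 = 4
ys = [random.uniform(0.0, 6.0) for _ in range(reps)]  # Var(Y) = 6^2/12 = 3

def var(vals):
    m = sum(vals) / len(vals)
    return sum((v - m) ** 2 for v in vals) / len(vals)

# for independent X and Y the covariance term drops out,
# so Var(X + Y) should be close to Var(X) + Var(Y) = 7
total_var = var([a + b for a, b in zip(xs, ys)])
print(total_var)  # close to 7
```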

(59) $Var(\bar x)=\frac{1}{n^2}\sum\limits_{i=1}^n \left(\frac{1}{n}\sum\limits_{j=1}^n (x_j-\mu_x)^2\right)$

and playing around with it brings us to the following:

(60)$Var(\bar x)=\frac{1}{n^2}\sum\limits_{i=1}^n Var(X)$

(61)$Var(\bar x)=\frac{1}{n^2}n Var(X)$

(62)$Var(\bar x)=\frac{1}{n}Var(X)$

(63)$Var(\bar x)=\frac{1}{n}\delta^2$
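Equation (63) lends itself to a quick Monte Carlo check; the sample size and the number of replications below are arbitrary choices:

```python
import random

random.seed(0)
sigma2 = 4.0   # population variance delta^2 (arbitrary)
n = 10         # sample size (arbitrary)
reps = 50_000  # number of simulated samples

# simulate many sample means and measure their spread
means = [sum(random.gauss(0.0, sigma2 ** 0.5) for _ in range(n)) / n
         for _ in range(reps)]
m = sum(means) / reps
var_of_mean = sum((v - m) ** 2 for v in means) / reps

print(var_of_mean)  # close to sigma2 / n = 0.4, as (63) predicts
```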

now we have everything to finalize the proof. Return to equation (23)

(64)$E(S^2)=\frac{1}{n-1}[E(\sum\limits_{i=1}^n(x_i - \mu)^2)-nE(\bar x-\mu )^2]$

and we see that we have:

(65) $E(S^2)=\frac{1}{n-1}\left[\sum\limits_{i=1}^n Var(x_i)-n\frac{1}{n}Var(X)\right]$

(66) $E(S^2)=\frac{1}{n-1}\left[\sum\limits_{i=1}^n \delta^2-n\frac{1}{n}\delta^2\right]$

(67)$E(S^2)=\frac{1}{n-1}[n\delta^2-n\frac{1}{n}\delta^2]$

(68)$E(S^2)=\frac{1}{n-1}[n\delta^2-\delta^2]$

so we are able to factorize and we end up with:

(69)$E(S^2)=\frac{1}{n-1}[\delta^2(n-1)]$

in which the $(n-1)$ cancels with the $\frac{1}{n-1}$, and it follows that

(70)$E(S^2)=\delta^2$
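To see the result at work, one can average $S^2$ over many simulated samples: the $n-1$ divisor of equation (1) hits $\delta^2$ on average, while dividing by $n$ systematically underestimates it. A sketch with arbitrary choices of distribution, sample size and replications:

```python
import random

random.seed(1)
sigma2 = 9.0   # true population variance delta^2
n = 5          # a small sample size makes the bias of the 1/n divisor visible
reps = 100_000

sum_unbiased, sum_biased = 0.0, 0.0
for _ in range(reps):
    x = [random.gauss(0.0, sigma2 ** 0.5) for _ in range(n)]
    x_bar = sum(x) / n
    ss = sum((xi - x_bar) ** 2 for xi in x)
    sum_unbiased += ss / (n - 1)  # equation (1)
    sum_biased += ss / n          # naive divisor

mean_unbiased = sum_unbiased / reps
mean_biased = sum_biased / reps
print(mean_unbiased)  # close to delta^2 = 9.0
print(mean_biased)    # close to (n-1)/n * 9.0 = 7.2, i.e. biased downward
```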

Sometimes I may have jumped over some steps and it could be that they are not as clear for everyone as they are for me, so in the case it is not possible to follow my reasoning just leave a comment and I will try to describe it better.

As most comments and remarks are not about missing steps, but demand a more compact version of the proof, I felt obliged to provide one here.

## 41 thoughts on “Proof of Unbiasedness of Sample Variance Estimator”

1. Perry Thomas says:

At last someone who does NOT say “It can be easily shown that…”

2. In your step (1) you use n as if it is both a constant (the size of the sample) and also the variable used in the sum (ranging from 1 to N, which is undefined but I guess is the population size). Shouldn’t the variable in the sum be i, and shouldn’t you be summing from i=1 to i=n? This makes it difficult to follow the rest of your argument, as I cannot tell in some steps whether you are referring to the sample or to the population.

1. You are right, I’ve never noticed the mistake. It should clearly be i=1 and not n=1. And you are also right when saying that N is not defined, but as you said it is the sample size. I will add it to the definition of variables.

However, you should still be able to follow the argument; if there are any further misunderstandings, please let me know.

3. please how do we show the proving of V( y bar subscript st) = summation W square subscript K x S square x ( 1- f subscript n) / n subscript k …..please I need ur assistant

1. Unfortunately I do not really understand your question. Is your formula taken from the proof outlined above? Do you want to prove that the estimator for the sample variance is unbiased? Or do you want to prove something else and are asking me to help you with that proof? In any case, I need some more information 🙂

4. I am very glad with this proven .how can we calculate for estimate of average size
and whats the formula. including some example thank you

1. I am happy you like it 🙂 But I am sorry that I still do not really understand what you are asking for.

5. please can you enlighten me on how to solve linear equation and linear but not homogenous case 2 in mathematical method

please how can I prove …v(Y bar ) = S square /n(1-f)
and

S subscript = S /root n x square root of N-n /N-1
and

S square = summation (y subscript – Y bar )square / N-1

1. I am getting really confused here 🙂 are you asking for a proof of

$V(\bar{Y}) = \frac{S^{2}}{n(1-f)}$

7. I like it….

an investigator want to know the adequacy of working condition of the employees of a plastic production factory whose total working population is 5000. if the junior staff is 4 times the intermediate staff working population and the senior staff constitute 15% of the working population .if further ,male constitute 75% ,50% and 80% of junior , intermediate and senior staff respectively of the working population .draw a stratified sample sizes in a table ( taking cognizance of the sex and cadres ).

8. please am sorry for the inconvenience ..how can I prove v(Y estimate)

9. Jerry joel says:

Gud day sir, thanks alot for the write-up because it clears some of my confusion but i am stil having problem with 2(x-u_x)+(y-u_y), how it becomes zero. Pls explan to me more.

1. Hello! The expression is zero as X and Y are independent and the covariance of two independent variables is zero. Does this answer your question?
Regards!
Regards!

10. Jerry joel says:

Pls sir, i need more explanation how 2(x-u_x) + (y-u_y) becomes zero while deriving?

11. Jerry joel says:

Yes!, thanks alot sir.

12. Janio Javier says:

Please I ‘d like an orientation about the proof of the estimate of sample mean variance for cluster design with subsampling (two stages) with probability proportional to the size in the first step and without replacement, and simple random sample in the second step also without replacement. .
Thanks a lot for your help.

1. Thank you for your comment. The proof I provided in this post is very general. However, your question refers to a very specific case to which I do not know the answer. Nevertheless, I saw that Peter Egger and Filip Tarlea recently published an article in Economic Letters called “Multi-way clustering estimation of standard errors in gravity models”, this might be a good place to start.

14. Nate says:

Thanks a lot for this proof. All the other ones I found skipped a bunch of steps and I had no idea what was going on. Econometrics is very difficult for me–more so when teachers skip a bunch of steps. This post saved me some serious frustration. Thanks!

What exactly do you mean by prove the biased estimator of the sample variance? Do you mean the bias that occurs in case you divide by n instead of n-1?

15. sigmaoverrootn says:

it would be better if you break it into several Lemmas

for example, first proving the identities for Linear Combinations of Expected Value, and Variance, and then using the result of the Lemma, in the main proof

you made it more cumbersome that it needed to be

Hi, thanks again for your comments. I really appreciate your in-depth remarks. While it is certainly true that one can rewrite the proof differently and less cumbersomely, I wonder if the benefit of bringing in lemmas outweighs its costs. In my eyes, lemmas would probably hamper the quick comprehension of the proof. This way the proof seems simple. I like things simple. Cheers, ad.

16. abbas says:

pls how do we solve real statistic using excel analysis

Hey Abbas, welcome back! What do you mean by solving real statistics? About Excel, I think Excel has a data analysis extension. If I were to use Excel that is probably the place I would start looking. However, use R! It is free and a very good statistical software. I could write a tutorial, if you tell me what exactly it is that you need.

1. good day sir.

can u kindly give me the procedure to analyze experimental design using SPSS

Sorry mate. I do not speak SPSS.

17. Eq. (36) contains an error. There the index i is not summed over.

You are right. I fixed it. Much appreciated.

18. I think it should be clarified that over which population is E(S^2) being calculated. Is x_i (for each i=0,…,n) being regarded as a separate random variable? If so, the population would be all permutations of size n from the population on which X is defined. I am confused here. Are N and n separate values?

Hey! Thank you for your comment! Indeed, it was not very clean the way I specified X, n and N. I revised the post and tried to improve the notation. Now, X is a random variable, $x_i$ is one observation of variable X. Overall, we have 1 to n observations. I hope this makes it clearer.