Proof of Unbiasedness of Sample Variance Estimator

(As I received some remarks about the unnecessary length of this proof, I provide a shorter version here.)

In many applications of statistics and econometrics, as well as in numerous other fields, it is necessary to estimate the variance of a sample. The estimator of the variance, see equation (1), is common knowledge and most people simply apply it without any further concern. The question that arose for me was: why do we actually divide by n-1 and not simply by n? In the following lines we are going to see the proof that the sample variance estimator is indeed unbiased.

s^2 = variance of the sample

x_i = sample observations

\bar x = sample average

n = sample size

\mu = mean of the population

\delta^2 = population variance

(1) s^2=\frac{1}{n-1}\sum\limits_{i=1}^n(x_i-\bar x)^2

1^{st} step

(2) x_i-\bar x = x_i - \mu + \mu - \bar x

(3) x_i-\bar x =( x_i - \mu) + (\mu - \bar x)

(4) (x_i-\bar x)^2 = [(x_i - \mu) + (\mu - \bar x)]^2

(5) (x_i-\bar x)^2 = (x_i - \mu)(x_i - \mu)+(x_i - \mu)(\mu - \bar x)+(\mu - \bar x)(x_i - \mu)+(\mu - \bar x)(\mu - \bar x)

(6) (x_i-\bar x)^2 = (x_i - \mu)^2+2(x_i - \mu)(\mu - \bar x)+(\mu - \bar x)^2

(7) \sum\limits_{i=1}^n(x_i-\bar x)^2 = \sum\limits_{i=1}^n(x_i - \mu)^2+2\sum\limits_{i=1}^n(x_i - \mu)(\mu - \bar x)+\sum\limits_{i=1}^n(\mu - \bar x)^2

(8) \sum\limits_{i=1}^n(x_i-\bar x)^2 = \sum\limits_{i=1}^n(x_i - \mu)^2+2(\mu - \bar x)\sum\limits_{i=1}^n(x_i - \mu)+n(\mu - \bar x)^2

(9) \sum\limits_{i=1}^n(x_i-\bar x)^2 = \sum\limits_{i=1}^n(x_i - \mu)^2+n(\mu - \bar x)^2+2(\mu - \bar x)\sum\limits_{i=1}^n(x_i - \mu)

(10) \sum\limits_{i=1}^n(x_i-\bar x)^2 = \sum\limits_{i=1}^n(x_i - \mu)^2+n(\mu - \bar x)^2+2(\mu - \bar x)\left(\sum\limits_{i=1}^n x_i - \sum\limits_{i=1}^n\mu\right)

(11) \sum\limits_{i=1}^n(x_i-\bar x)^2 = \sum\limits_{i=1}^n(x_i - \mu)^2+n(\mu - \bar x)^2+2(\mu - \bar x)\left(\sum\limits_{i=1}^n x_i - n\mu\right)

(12) \sum\limits_{i=1}^n(x_i-\bar x)^2 = \sum\limits_{i=1}^n(x_i - \mu)^2+n(\mu - \bar x)^2+2(\mu - \bar x)(n\bar x - n\mu), as n\bar x=\sum\limits_{i=1}^n x_i, which follows from \bar x=\frac{1}{n}\sum\limits_{i=1}^n x_i

(13) \sum\limits_{i=1}^n(x_i-\bar x)^2 = \sum\limits_{i=1}^n(x_i - \mu)^2+n(\mu - \bar x)^2+2n(\mu - \bar x)(\bar x - \mu)

(14) \sum\limits_{i=1}^n(x_i-\bar x)^2 = \sum\limits_{i=1}^n(x_i - \mu)^2+n(\mu - \bar x)^2+2n(-1)(\bar x - \mu)(\bar x - \mu)

(15) \sum\limits_{i=1}^n(x_i-\bar x)^2 = \sum\limits_{i=1}^n(x_i - \mu)^2+n(\mu - \bar x)^2-2n(\bar x-\mu)(\bar x - \mu)

(16) \sum\limits_{i=1}^n(x_i-\bar x)^2 = \sum\limits_{i=1}^n(x_i - \mu)^2+n(\mu - \bar x)^2-2n(\bar x-\mu)^2

(17) \sum\limits_{i=1}^n(x_i-\bar x)^2 = \sum\limits_{i=1}^n(x_i - \mu)^2+n(\bar x-\mu)^2-2n(\bar x-\mu)^2

where replacing (\mu - \bar x)^2 by (\bar x-\mu)^2 changes nothing, as squaring removes the sign

(18) \sum\limits_{i=1}^n(x_i-\bar x)^2 = \sum\limits_{i=1}^n(x_i - \mu)^2-n(\bar x-\mu)^2
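Identity (18) is purely algebraic, so it holds for any reference value \mu, not only the true population mean. As a quick numerical sanity check (not part of the original proof), here is a short Python sketch that evaluates both sides of (18) on a random sample:

```python
import random

# Numerical check of identity (18):
#   sum_i (x_i - xbar)^2  ==  sum_i (x_i - mu)^2  -  n * (xbar - mu)^2
# The identity is algebraic, so any value of mu works.
random.seed(42)
mu = 5.0
x = [random.gauss(mu, 2.0) for _ in range(10)]
n = len(x)
xbar = sum(x) / n

lhs = sum((xi - xbar) ** 2 for xi in x)
rhs = sum((xi - mu) ** 2 for xi in x) - n * (xbar - mu) ** 2
assert abs(lhs - rhs) < 1e-9  # equal up to floating-point rounding
```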

2^{nd} step

(19) E(X)=\sum\limits_{i=1}^n x_i f(x_i) for a discrete random variable X with probability mass function f; if every realization is equally likely, i.e. f(x_i)=\frac{1}{n}, then

(21) E(X)=\frac{1}{n}\sum\limits_{i=1}^n x_i

(22) E(S^2)=E\left(\frac{1}{n-1}\left[\sum\limits_{i=1}^n(x_i - \mu)^2-n(\bar x-\mu)^2\right]\right)

(23) E(S^2)=\frac{1}{n-1}\left[E\left(\sum\limits_{i=1}^n(x_i - \mu)^2\right)-nE\left((\bar x-\mu)^2\right)\right]

3^{rd} step

(24) E(X+Y)=\sum\limits_{i=1}^n\sum\limits_{j=1}^n(x_i+y_j)f(x_i,y_j)

(25) E(X+Y)=\sum\limits_{i=1}^n\sum\limits_{j=1}^n x_i f(x_i,y_j)+\sum\limits_{i=1}^n\sum\limits_{j=1}^n y_j f(x_i,y_j)

(26) E(X+Y)=\sum\limits_{i=1}^n x_i f(x_i)\sum\limits_{j=1}^n f(y_j)+\sum\limits_{j=1}^n y_j f(y_j)\sum\limits_{i=1}^n f(x_i), assuming X and Y are independent, so that f(x_i,y_j)=f(x_i)f(y_j)

(27) E(X+Y)=\sum\limits_{i=1}^n x_i f(x_i)+\sum\limits_{j=1}^n y_j f(y_j), as \sum\limits_{j=1}^n f(y_j)=1 and \sum\limits_{i=1}^n f(x_i)=1

(28) E(X+Y)=E(X)+E(Y)

(Linearity of expectation in fact holds even without independence; the independence assumption just simplifies the algebra here.)
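As a small empirical illustration of (28), not part of the original proof, the sample mean of x+y equals the sum of the sample means by exactly the same linearity argument:

```python
import random

# Empirical check of (28): E(X+Y) = E(X) + E(Y).
# The sample mean of (x + y) equals the sum of the two sample means,
# by the same linearity that holds for expectations.
random.seed(0)
N = 10_000
xs = [random.uniform(0, 1) for _ in range(N)]
ys = [random.expovariate(2.0) for _ in range(N)]

def mean(v):
    return sum(v) / len(v)

mean_sum = mean([a + b for a, b in zip(xs, ys)])
sum_means = mean(xs) + mean(ys)
assert abs(mean_sum - sum_means) < 1e-9
```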

Applying this to our original function: the expected value of a function g(X) is

(29) E(g(X))=\sum\limits_{i=1}^n g(x_i)f(x_i)

For g(X)=(X - \mu)^2 this is, by definition, the variance:

(30) E\left((x_i - \mu)^2\right)=Var(x_i)

By the linearity result (28), the expectation of the sum in (23) splits into a sum of expectations:

(31) E\left(\sum\limits_{i=1}^n(x_i - \mu)^2\right)=\sum\limits_{i=1}^n E\left((x_i - \mu)^2\right)

(32) E\left(\sum\limits_{i=1}^n(x_i - \mu)^2\right)=\sum\limits_{i=1}^n Var(x_i)

(33) \sum\limits_{i=1}^n Var(x_i)=nVar(x_i), as all x_i are identically distributed

(34) E\left(\sum\limits_{i=1}^n(x_i - \mu)^2\right)=nVar(x_i)

Plugging (34) into (23) gives us:

(35) E(S^2)=\frac{1}{n-1}\left[nVar(x_i)-nE\left((\bar x-\mu)^2\right)\right]

Notice also that:

(36) Var(\bar x)=Var\left(\frac{\sum\limits_{i=1}^n x_i}{n}\right)=Var\left(\frac{1}{n}\sum\limits_{i=1}^n x_i\right)

and that:

(37) Var(a+bx_i)

(38)\mu_y=a+b\mu_x

(39)y_i=a+bx_i

using (38) and (39) in (37) we get:

(40) Var(a+bx_i)=\frac{1}{n}\sum\limits_{i=1}^n(y_i-\mu_y)^2

(41) Var(a+bx_i)=\frac{1}{n}\sum\limits_{i=1}^n(a+bx_i-(a+b\mu_x))^2

(42) Var(a+bx_i)=\frac{1}{n}\sum\limits_{i=1}^n(a+bx_i-a-b\mu_x)^2

(43) Var(a+bx_i)=\frac{1}{n}\sum\limits_{i=1}^n(bx_i-b\mu_x)^2

(44) Var(a+bx_i)=\frac{1}{n}\sum\limits_{i=1}^n(b(x_i-\mu_x))^2

(45) Var(a+bx_i)=\frac{1}{n}\sum\limits_{i=1}^n b^2(x_i-\mu_x)^2

(46) Var(a+bx_i)=b^2\frac{1}{n}\sum\limits_{i=1}^n(x_i-\mu_x)^2

(47) Var(a+bx_i)=b^2Var(x_i)
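Result (47) is easy to confirm numerically. The following Python sketch (an addition, not part of the original derivation) uses the standard library's population-variance function, which matches the \frac{1}{n}\sum convention used in (40):

```python
import random
from statistics import pvariance  # population variance: (1/n) * sum (x - mean)^2

# Check (47): Var(a + b*x) = b^2 * Var(x).
# Note the additive constant a drops out entirely.
random.seed(1)
x = [random.gauss(0, 3) for _ in range(1000)]
a, b = 7.0, -2.5
y = [a + b * xi for xi in x]

assert abs(pvariance(y) - b**2 * pvariance(x)) < 1e-6
```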

Knowing (40)-(47), let us return to (36), and we see that:

(48) Var(\bar x)=Var\left(\frac{\sum\limits_{i=1}^n x_i}{n}\right)=Var\left(\frac{1}{n}\sum\limits_{i=1}^n x_i\right)

(49) Var(\bar x)=Var\left(\frac{1}{n}\sum\limits_{i=1}^n x_i\right)

(50) Var(\bar x)=\frac{1}{n^2}Var\left(\sum\limits_{i=1}^n x_i\right)

(51) Var(\bar x)=\frac{1}{n^2}\sum\limits_{i=1}^n Var(x_i)

(52) Var(\bar x)=\frac{1}{n^2}\sum\limits_{i=1}^n E\left((x_i-\mu_x)^2\right)

Just looking at the last part of (51), where we have Var(x_i)=E\left((x_i-\mu_x)^2\right), we can apply simple computation rules of variance calculation. The step from (50) to (51), i.e. that the variance of a sum is the sum of the variances, deserves a closer look:

(53) Var(x_i)=E\left((x_i-\mu_x)^2\right)

(54) Var(X+Y)=E\left(((X+Y)-(\mu_x+\mu_y))^2\right)

Now the x_i on the lhs of (53) corresponds to the (X+Y) on the rhs of (54), and the \mu_x on the rhs of (53) corresponds to the \mu_x+\mu_y on the rhs of (54). Now what exactly do we mean by that? Well:

(55) Var(X+Y)=E\left[(X+Y)-(\mu_x+\mu_y)\right]^2

(56) Var(X+Y)=E\left[(X-\mu_x)+(Y-\mu_y)\right]^2

(57) Var(X+Y)=E\left[(X-\mu_x)^2+(Y-\mu_y)^2+2(X-\mu_x)(Y-\mu_y)\right]

(58) Var(X+Y)=Var(X)+Var(Y)+2E\left[(X-\mu_x)(Y-\mu_y)\right]

The term E\left[(X-\mu_x)(Y-\mu_y)\right] is the covariance of X and Y, and it is zero, as X is independent of Y. This leaves us with the variance of X plus the variance of Y. From (52) we know that

(59) Var(\bar x)=\frac{1}{n^2}\sum\limits_{i=1}^n Var(x_i)

and, since the x_i are identically distributed, this brings us to the following:

(60) Var(\bar x)=\frac{1}{n^2}\sum\limits_{i=1}^n Var(X)

(61) Var(\bar x)=\frac{1}{n^2}nVar(X)

(62) Var(\bar x)=\frac{1}{n}Var(X)

(63) Var(\bar x)=\frac{1}{n}\delta^2
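Result (63), Var(\bar x)=\delta^2/n, can also be seen in a quick Monte Carlo experiment (again an addition, not part of the original proof): draw many samples, compute each sample mean, and compare the spread of those means to \delta^2/n.

```python
import random
from statistics import pvariance

# Monte Carlo check of (63): Var(xbar) = sigma^2 / n.
# Draw many samples of size n, record each sample mean, and compare
# the variance of those means to sigma^2 / n.
random.seed(2)
sigma, n, reps = 2.0, 25, 20_000
means = []
for _ in range(reps):
    sample = [random.gauss(0, sigma) for _ in range(n)]
    means.append(sum(sample) / n)

est = pvariance(means)
theory = sigma**2 / n  # 4 / 25 = 0.16
assert abs(est - theory) < 0.02  # Monte Carlo noise, loose tolerance
```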

Now we have everything needed to finalize the proof. Return to equation (23):

(64) E(S^2)=\frac{1}{n-1}\left[E\left(\sum\limits_{i=1}^n(x_i - \mu)^2\right)-nE\left((\bar x-\mu)^2\right)\right]

and we see that we have:

(65) E(S^2)=\frac{1}{n-1}\left[\sum\limits_{i=1}^n Var(x_i)-n\frac{1}{n}Var(X)\right]

(66) E(S^2)=\frac{1}{n-1}\left[\sum\limits_{i=1}^n\delta^2-n\frac{1}{n}\delta^2\right]

(67) E(S^2)=\frac{1}{n-1}\left[n\delta^2-n\frac{1}{n}\delta^2\right]

(68) E(S^2)=\frac{1}{n-1}\left[n\delta^2-\delta^2\right]

so we can factor out \delta^2 and end up with:

(69)E(S^2)=\frac{1}{n-1}[\delta^2(n-1)]

in which the factor (n-1) cancels, and it follows that

(70)E(S^2)=\delta^2
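To close the loop on the original question of why we divide by n-1 and not n, here is a Monte Carlo sketch (not part of the proof itself) comparing the two divisors. On average, the n-1 version recovers \delta^2, while the n version underestimates it by the factor (n-1)/n:

```python
import random

# Monte Carlo illustration of (70): dividing by n-1 gives an unbiased
# estimator of sigma^2; dividing by n is biased low by factor (n-1)/n.
random.seed(3)
sigma2, n, reps = 4.0, 5, 100_000
avg_unbiased, avg_biased = 0.0, 0.0
for _ in range(reps):
    x = [random.gauss(0, sigma2 ** 0.5) for _ in range(n)]
    xbar = sum(x) / n
    ss = sum((xi - xbar) ** 2 for xi in x)
    avg_unbiased += ss / (n - 1)
    avg_biased += ss / n
avg_unbiased /= reps
avg_biased /= reps

assert abs(avg_unbiased - sigma2) < 0.05               # near 4.0
assert abs(avg_biased - sigma2 * (n - 1) / n) < 0.05   # near 3.2
```

With a small sample size such as n=5 the bias of the 1/n version is substantial (here about 0.8), which is exactly what the (n-1)/n factor in (68)-(69) predicts.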

Sometimes I may have jumped over some steps, and it could be that they are not as clear to everyone as they are to me. In case it is not possible to follow my reasoning, just leave a comment and I will try to describe it better.

As most comments and remarks are not about missing steps but rather demand a more compact version of the proof, I felt obliged to provide one here.


37 Responses to Proof of Unbiasedness of Sample Variance Estimator

  1. Perry Thomas says:

    At last someone who does NOT say “It can be easily shown that…”

  2. Savi Maharaj says:

    In your step (1) you use n as if it is both a constant (the size of the sample) and also the variable used in the sum (ranging from 1 to N, which is undefined but I guess is the population size). Shouldn’t the variable in the sum be i, and shouldn’t you be summing from i=1 to i=n? This makes it difficult to follow the rest of your argument, as I cannot tell in some steps whether you are referring to the sample or to the population.

    • You are right, I’ve never noticed the mistake. It should clearly be i=1 and not n=1. And you are also right when saying that N is not defined, but as you said it is the sample size. I will add it to the definition of variables.

      However, you should still be able to follow the argument, if there any further misunderstandings, please let me know.

  3. Pingback: Unbiased Estimator of Sample Variance – Vol. 2 | Economic Theory Blog

  4. it’s very interesting

  5. please how do we show the proving of V( y bar subscript st) = summation W square subscript K x S square x ( 1- f subscript n) / n subscript k …..please I need ur assistant

    • Unfortunately I do not really understand your question. Is your formula taken from the proof outlined above? Do you want to prove that the estimator for the sample variance is unbiased? Or do you want to prove something else and are asking me to help you with that proof? In any case, I need some more information 🙂

  6. abbas says:

    I am very glad with this proven .how can we calculate for estimate of average size
    and whats the formula. including some example thank you

  7. abbas says:

    please can you enlighten me on how to solve linear equation and linear but not homogenous case 2 in mathematical method

  8. comrade Abbas says:

    please how can I prove …v(Y bar ) = S square /n(1-f)
    and

    S subscript = S /root n x square root of N-n /N-1
    and

    S square = summation (y subscript – Y bar )square / N-1

  9. abbas says:

    I like it….

    please help me to check this sampling techniques

    an investigator want to know the adequacy of working condition of the employees of a plastic production factory whose total working population is 5000. if the junior staff is 4 times the intermediate staff working population and the senior staff constitute 15% of the working population .if further ,male constitute 75% ,50% and 80% of junior , intermediate and senior staff respectively of the working population .draw a stratified sample sizes in a table ( taking cognizance of the sex and cadres ).

    I am confused about it please help me out thanx

  10. abbas says:

    please am sorry for the inconvenience ..how can I prove v(Y estimate)

  11. Jerry joel says:

    Gud day sir, thanks alot for the write-up because it clears some of my confusion but i am stil having problem with 2(x-u_x)+(y-u_y), how it becomes zero. Pls explan to me more.

  12. Jerry joel says:

    Pls sir, i need more explanation how 2(x-u_x) + (y-u_y) becomes zero while deriving?

  13. Jerry joel says:

    Yes!, thanks alot sir.

  14. Janio Javier says:

    Please I ‘d like an orientation about the proof of the estimate of sample mean variance for cluster design with subsampling (two stages) with probability proportional to the size in the first step and without replacement, and simple random sample in the second step also without replacement. .
    Thanks a lot for your help.

    • Thank you for you comment. The proof I provided in this post is very general. However, your question refers to a very specific case to which I do not know the answer. Nevertheless, I saw that Peter Egger and Filip Tarlea recently published an article in Economic Letters called “Multi-way clustering estimation of standard errors in gravity models”, this might be a good place to start.

  15. Janio Javier says:

    Thank you for your prompt answer. I will read that article. Janio

  16. Nate says:

    Thanks a lot for this proof. All the other ones I found skipped a bunch of steps and I had no idea what was going on. Econometrics is very difficult for me–more so when teachers skip a bunch of steps. This post saved me some serious frustration. Thanks!

  17. WALELGN GETE says:

    Please Proofe The Biased Estimator Of Sample Variance.

    • ad says:

      What do exactly do you mean by prove the biased estimator of the sample variance? Do you mean the bias that occurs in case you divide by n instead of n-1?

  18. sigmaoverrootn says:

    it would be better if you break it into several Lemmas

    for example, first proving the identities for Linear Combinations of Expected Value, and Variance, and then using the result of the Lemma, in the main proof

    you made it more cumbersome that it needed to be

    • ad says:

      Hi, thanks again for your comments. I really appreciate your in-depth remarks. While it is certainly true that one can re-write the proof differently and less cumbersome, I wonder if the benefit of brining in lemmas outweighs its costs. In my eyes, lemmas would probably hamper the quick comprehension of the proof. This way the proof seems simple. I like things simple. Cheers, ad.

  19. abbas says:

    pls how do we solve real statistic using excel analysis

    • ad says:

      Hey Abbas, welcome back! What do you mean by solving real statistics? About excel, I think Excel has a data analysis extension. If I were to use Excel that is probably the place I would start looking. However, use R! It free and a very good statistical software. I could write a tutorial, if you tell me what exactly it is that you need.

  20. Beat Tödtli says:

    Eq. (36) contains an error. There the index i is not summed over.
