The denominator of sample variance!

When I was young, I tried to fit every new thing I learned with what I already knew.

But every time I learned something that didn’t fit my view of the world, I rejected it altogether!

I could not sustain this approach for long because it resulted in poor grades. I soon had to start memorizing things without seeking proofs for them.

One such thing I accepted without thinking much was using “n-1” in the denominator while calculating variance. As you already know, “n” represents the total number of observations here.

But today, for the first time in my life, I came across two different versions of the variance — population variance and sample variance.

Population variance is calculated with an “n” in the denominator, whereas sample variance requires an “n-1”.

I was curious about this difference and went on a fact-finding mission online.

And what I found just blew me away and made me admire the elegance of statistics!

Before getting into what I found, let us take a look at the process of calculating variance:

1) First, the mean is calculated for all the observations.
2) Next, the deviation of each observation from the mean is computed.
3) Finally, all these deviations are squared, and the average of these squared deviations is the variance.
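
The three steps above can be sketched in a few lines of Python. The marks are made-up numbers, and `population_variance` is just an illustrative name, not something from a library.

```python
def population_variance(observations):
    """Average of squared deviations from the mean (denominator N)."""
    n_obs = len(observations)
    # Step 1: the mean of all the observations.
    mean = sum(observations) / n_obs
    # Step 2: the deviation of each observation from the mean.
    deviations = [x - mean for x in observations]
    # Step 3: square the deviations and average them.
    return sum(d ** 2 for d in deviations) / n_obs


# Hypothetical marks for a tiny classroom of four pupils.
marks = [50, 60, 70, 80]
print(population_variance(marks))  # 125.0
```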

Since we are taking an average in the last step, it makes sense to go ahead with a straightforward “N” in the denominator.

(If you still managed to be awake while reading this boring explanation, let me point out that I just did parkour from “n” to “N.” The essence of statistics lies in understanding the difference between these two notations. N usually represents the total number of observations in a population, whereas n refers to that number in a sample.)

Let us say that you are trying to find the variance of marks for all the pupils in a classroom.

The population mean and the squared deviations can be calculated quite easily here.

It makes sense to go ahead and directly average these squared deviations to find the variance.

But if you are trying to find the variance of marks for all the pupils of a country in a particular age group, it is practically impossible to find the population mean.

Here, you would have to find a sample representative of the population and somehow use it to estimate the mean and variance values.

Suppose you selected pupils in a classroom as this representative sample.

You could find the mean here in the usual manner and report it as approximately equal to the population mean since this sample is assumed to be representative.

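But this mean is still not exactly equal to the population mean. The observations in a sample sit closer to their own mean than to the population mean, so averaging the squared deviations from the sample mean tends to underestimate the population variance.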

To correct it, a factor of n/(n-1) is applied to this variance, which is the same as dividing the sum of squared deviations by n-1 instead of n. This factor is called Bessel’s correction, and it ensures that the sample variance calculated is less biased.
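
Here is a small sketch of the correction, assuming some made-up marks that are treated as a sample rather than a whole population. Python's standard `statistics` module has `pvariance` (denominator n) and `variance` (denominator n-1), so the two can be compared directly.

```python
import math
import statistics

# Hypothetical marks, treated as a sample from a larger population.
sample = [50, 60, 70, 80]
n = len(sample)

# Denominator n: the plain average of squared deviations.
uncorrected = statistics.pvariance(sample)
# Denominator n-1: the usual sample variance.
sample_var = statistics.variance(sample)

# Bessel's correction: multiply the n-denominator value by n / (n - 1).
corrected = uncorrected * n / (n - 1)

print(uncorrected, corrected, sample_var)
print(math.isclose(corrected, sample_var))  # True
```

The corrected value is always a little larger than the uncorrected one, and the gap shrinks as n grows because n/(n-1) approaches 1.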

There is a convoluted mathematical proof for this, which made my head spin, but I also found a brilliant intuitive explanation online.

To understand it, we need to look at the concept of “degrees of freedom.”

Let us say that you had the task of finding three numbers that add up to 100.

You could play with the first two numbers here: you have the freedom to choose -0.01 as the first number and 10000 as the second, but the third number has to make up for these extremes so that the total is 100.

Therefore, the degrees of freedom here would be 2. Similarly, in a system of n numbers bound by one such constraint, the degrees of freedom are “n-1.”
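
A rough sketch of this counting, with `last_number` as a made-up helper name. It also applies the same idea to deviations from a mean, which always add up to zero; the marks are hypothetical.

```python
def last_number(chosen, total):
    """The one value that is forced once all the others are chosen."""
    return total - sum(chosen)


# Three numbers that must add up to 100: two are free, the third is not.
first, second = -0.01, 10000
third = last_number([first, second], total=100)
print(third)  # roughly -9899.99

# The same counting applies to deviations from a mean: they always sum
# to zero, so n-1 of them determine the last one.
marks = [50, 60, 70, 80]  # hypothetical marks
mean = sum(marks) / len(marks)
deviations = [x - mean for x in marks]
forced = last_number(deviations[:-1], total=0)
print(forced == deviations[-1])  # True
```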

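Since we are using the sample mean, rather than the true population mean, to calculate the variance, we must respect the fact that the result would be a biased estimate.

The deviations from the sample mean always add up to zero, so once n-1 of them are known, the last one is fixed. In that sense, we are ‘setting aside’ one observation from our sample while averaging the squared deviations, and dividing by the n-1 deviations that are actually free compensates for the error.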

Hence, using “n-1” in the denominator ensures that the sample variance calculated is less biased.
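
One way to check this without the proof is to take a population small enough to list every possible sample. The sketch below assumes a made-up population of four marks and samples of size 2 drawn with replacement, so that the draws are independent; `mean_squared_deviation` is an illustrative helper, not a library function.

```python
from itertools import product


def mean_squared_deviation(values, denominator):
    """Sum of squared deviations from the mean, divided by `denominator`."""
    center = sum(values) / len(values)
    return sum((x - center) ** 2 for x in values) / denominator


# A hypothetical population small enough to know its true variance.
population = [50, 60, 70, 80]
true_variance = mean_squared_deviation(population, len(population))

# Every possible sample of size 2, drawn with replacement (an assumption
# that keeps the draws independent).
n = 2
samples = list(product(population, repeat=n))

# Average each estimator over all 16 possible samples.
avg_with_n = sum(mean_squared_deviation(s, n) for s in samples) / len(samples)
avg_with_n_minus_1 = (
    sum(mean_squared_deviation(s, n - 1) for s in samples) / len(samples)
)

print(true_variance)       # 125.0
print(avg_with_n)          # 62.5
print(avg_with_n_minus_1)  # 125.0
```

With n in the denominator, the average lands at half the true value here, which is the (n-1)/n shrinkage for n = 2. With n-1, the average matches the true variance, even though any single sample can still be well off.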

Of course, it doesn’t mean that sample variance would be an accurate estimate of population variance with n-1 in the denominator. But like all other things in statistics, we are just trying to be less wrong than more correct!

The Dumb Datum
