
The denominator of sample variance!

 


When I was young, I tried to fit every new thing I learned with what I already knew.

But every time I learned something that didn’t fit my view of the world, I rejected it altogether!

I could not sustain this approach for long because it resulted in poor grades. I soon had to start mugging up material without seeking proof for it.

One such thing I accepted without thinking much was using “n-1” in the denominator while calculating variance. As you already know, “n” represents the total number of observations here.

But today, for the first time in my life, I came across two different versions of the variance — population variance and sample variance.

Population variance is calculated with an “n” in the denominator, whereas sample variance requires an “n-1”.

I was curious about this difference and went on a fact-finding mission online.

And what I found just blew me away and made me admire the elegance of statistics!

Before understanding what it is, let us take a look at the process of calculating variance:

1) First, the mean is calculated for all the observations.
2) Next, the deviation of each observation from the mean is computed.
3) Finally, all these deviations are squared, and the average of these squared deviations is the variance.
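
The three steps above can be sketched in a few lines of Python. This is only an illustration: the function name and the marks are made up for the example, and it divides by the plain count of observations, as in the last step.

```python
def population_variance(observations):
    """Average squared deviation from the mean (denominator = count)."""
    count = len(observations)
    mean = sum(observations) / count                  # step 1: the mean
    deviations = [x - mean for x in observations]     # step 2: deviations
    squared = [d ** 2 for d in deviations]            # step 3: square them...
    return sum(squared) / count                       # ...and average

marks = [2, 4, 4, 4, 5, 5, 7, 9]  # made-up marks for eight pupils
variance = population_variance(marks)
print(variance)  # 4.0
```

Here the mean is 5, the squared deviations add up to 32, and 32 divided by 8 observations gives 4.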

Since we are taking an average in the last step, it makes sense to go ahead with a straightforward “N” in the denominator.

(If you still managed to be awake while reading this boring explanation, let me point out that I just did parkour from “n” to “N.” The essence of statistics lies in understanding the difference between these two notations. N usually represents the total number of observations in a population, whereas n refers to that number in a sample.)

Let us say that you are trying to find the variance of marks for all the pupils in a classroom.

The population mean and the squared deviations can be calculated quite easily here.

It makes sense to go ahead and directly average these squared deviations to find the variance.

But if you are trying to find the variance of marks for all the pupils of a country in a particular age group, it is practically impossible to collect every mark and find the population mean.

Here, you would have to find a sample representative of the population and somehow use it to estimate the mean and variance values.

Suppose you selected pupils in a classroom as this representative sample.

You could find the mean here in the usual manner and report it as approximately equal to the population mean since this sample is assumed to be representative.

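But this mean is still not equal to the population mean, and using it to estimate the variance of the population introduces a systematic error. The sample mean sits right in the middle of its own sample, so the squared deviations measured from it are never larger than the ones measured from the true population mean. Averaging them with “n” in the denominator therefore tends to underestimate the population variance.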
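
A quick simulation makes this error visible. The sketch below is built on my own assumptions, not on real data: a hypothetical population of marks that is normally distributed with mean 50 and variance 100, samples of 5 pupils, and 20,000 repetitions. It averages the squared deviations twice, once around the true mean and once around each sample's own mean, always dividing by n.

```python
import random

def mean_squared_deviation(values, center):
    """Average of squared deviations from `center` (denominator = n)."""
    return sum((x - center) ** 2 for x in values) / len(values)

rng = random.Random(0)            # fixed seed so the run is repeatable
TRUE_MEAN = 50.0                  # assumed population mean
TRUE_SD = 10.0                    # assumed population sd, so variance = 100
SAMPLE_SIZE = 5
TRIALS = 20000

total_around_true_mean = 0.0
total_around_sample_mean = 0.0
for _ in range(TRIALS):
    sample = [rng.gauss(TRUE_MEAN, TRUE_SD) for _ in range(SAMPLE_SIZE)]
    sample_mean = sum(sample) / SAMPLE_SIZE
    total_around_true_mean += mean_squared_deviation(sample, TRUE_MEAN)
    total_around_sample_mean += mean_squared_deviation(sample, sample_mean)

avg_around_true_mean = total_around_true_mean / TRIALS
avg_around_sample_mean = total_around_sample_mean / TRIALS
print(avg_around_true_mean, avg_around_sample_mean)
```

With these settings the first average should land near 100, while the second should land near 80, which is 100 scaled by (n-1)/n = 4/5. That gap is exactly what the “n-1” denominator is meant to close.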

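To correct it, the variance calculated with “n” in the denominator is multiplied by a factor of n/(n-1), which is the same as dividing the sum of squared deviations by “n-1” instead of “n”. This adjustment is called Bessel’s correction, and for independent random samples it makes the sample variance an unbiased estimate of the population variance.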
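
Python's standard `statistics` module happens to offer both versions, so the relationship between the two denominators can be checked directly. In this sketch, `pvariance` divides by n and `variance` divides by n-1; the sample values and variable names are made up for the example.

```python
import math
import statistics

sample = [2, 4, 4, 4, 5, 5, 7, 9]   # made-up sample of marks
n = len(sample)

divided_by_n = statistics.pvariance(sample)          # denominator n
divided_by_n_minus_1 = statistics.variance(sample)   # denominator n - 1

# Scaling the n-denominator result by n / (n - 1) should give
# the same number as dividing by n - 1 in the first place.
corrected = divided_by_n * n / (n - 1)
print(divided_by_n, divided_by_n_minus_1, corrected)
print(math.isclose(corrected, divided_by_n_minus_1))
```

The correction matters most for small samples. For a sample of 8 the factor is 8/7, while for a sample of 1,000 it is 1000/999 and barely changes the result.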

There is a convoluted mathematical proof for this which made my head feel dizzy, but I found a brilliant intuitive explanation for this online.

To understand it, we need to look at the concept of “degrees of freedom.”

Let us say that you had the task of finding three numbers that add up to 100.

You could play with the first two numbers here — you have the freedom to choose -0.01 as the first number and 10000 as the second number, but the third number has to make up for these extremities to add up to 100.

Therefore, there are 2 degrees of freedom here. Similarly, when n numbers are tied together by one such constraint, there are “n-1” degrees of freedom.
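
The same idea can be written as a tiny sketch. The helper name below is hypothetical; it simply returns the one number that is forced once the others are chosen. The second half applies the idea to deviations from a sample mean, which always add up to zero, so the last deviation is forced in the same way.

```python
import math

def forced_last_value(free_choices, required_total):
    """The one value that is no longer free once the rest are chosen."""
    return required_total - sum(free_choices)

# Three numbers that must add up to 100: two are free, the third is forced.
free_choices = [-0.01, 10000]
third = forced_last_value(free_choices, 100)
numbers = free_choices + [third]
print(numbers, math.isclose(sum(numbers), 100))

# Deviations from a sample mean must add up to 0, so knowing
# n - 1 of them pins down the last one.
sample = [2, 4, 4, 4, 5, 5, 7, 9]   # made-up sample
mean = sum(sample) / len(sample)
deviations = [x - mean for x in sample]
last_deviation = forced_last_value(deviations[:-1], 0)
print(last_deviation, deviations[-1])
```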

The same thing happens when we use the sample mean to calculate the variance. The deviations from the sample mean always add up to zero, so once “n-1” of them are known, the last one is fixed.

Therefore, we are ‘setting aside’ one observation from our sample while averaging the squared deviations, and dividing only by the “n-1” deviations that are actually free to vary.

Hence, using “n-1” in the denominator removes the systematic underestimation that comes from dividing by “n.”

Of course, it doesn’t mean that the sample variance from any single sample would exactly equal the population variance with n-1 in the denominator. But like many other things in statistics, we are just trying to be less wrong rather than perfectly correct!

