Making Sense of Sample Standard Deviation

I had an opportunity to explain sample standard deviation to my colleague last week. There is a lot of technical jargon surrounding sample standard deviation, like unbiased estimator and Bessel’s correction, but it’s not that hard to make intuitive sense of it.

Consider a set of $n$ samples $x_1, x_2, \ldots, x_n$ of a normally distributed random variable $X$. The sample mean is:

$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$$

An important thing to notice here is that $\bar{x}$ isn’t the true mean. It is an estimate of the mean of the underlying distribution of $X$. Now let’s look at the definition of variance (for a discrete random variable $X$ with mean $\mu$ and a finite outcome space $\Omega$ whose outcomes are equally likely):

$$\mathrm{Var}(X) = \frac{1}{|\Omega|}\sum_{x \in \Omega}(x - \mu)^2$$

If we simply applied this formula to estimate the variance, as in

$$v = \frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2$$

then $v$ is a biased estimator of $\mathrm{Var}(X)$ as long as $\mu$ is not equal to $\bar{x}$. Why? Because the sample mean $\bar{x}$ is “closer” to the sampled values than $\mu$ is, so the squared differences between the samples and $\bar{x}$ tend to add up to less than those between the samples and $\mu$. Unconvinced? Let us prove it.
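Before the proof, a quick simulation can make the claim concrete. This is a minimal sketch, not part of the original argument: the helper names, the choice of a standard normal (true variance 1), the sample size of 5, and the seed are all assumptions made for illustration.

```python
import random

def biased_variance(samples):
    """Divide the sum of squared deviations from the sample mean by n."""
    n = len(samples)
    mean = sum(samples) / n
    return sum((x - mean) ** 2 for x in samples) / n

def average_biased_variance(n, trials, seed=0):
    """Average the biased estimate over many samples of size n from N(0, 1)."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    total = 0.0
    for _ in range(trials):
        samples = [rng.gauss(0.0, 1.0) for _ in range(n)]
        total += biased_variance(samples)
    return total / trials

# The true variance is 1.0, yet the average estimate lands noticeably below it.
avg = average_biased_variance(n=5, trials=20000)
print(avg)
```

With five samples per trial the average should come out near 0.8 rather than 1, which is the systematic shortfall the rest of the post explains.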

Theorem: the sum of squared differences between the samples and a point $c$ has its unique minimum where $c$ equals the sample mean

This was a hard lemma for me to prove. I had a sleepless night or two, but I couldn’t come up with a proof for the life of me. Then Min Xu introduced me to a well-known proof, as follows:

Let $x_1, x_2, \ldots, x_n$ be $n$ samples of any random variable $X$ with outcome space $\Omega = \mathbb{R}$ and a finite mean $\mu$. Notice that $X$ doesn’t need to be normally distributed, and its outcome space is infinite. Now consider the following minimization problem:

$$\min_{c} \sum_{i=1}^{n}(x_i - c)^2$$

Expanding each term inside the sum yields:

$$\min_{c} \sum_{i=1}^{n}\left(x_i^2 - 2x_i c + c^2\right)$$

Splitting the sum term by term (addition is commutative and associative), we get:

$$\min_{c} \sum_{i=1}^{n} x_i^2 - 2c\sum_{i=1}^{n} x_i + nc^2$$

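Now, substitute the definition of the sample mean ($\sum_{i=1}^{n} x_i = n\bar{x}$) and divide by $n$, which does not change the minimizer since $n > 0$, to get:

$$\min_{c}\; c^2 - 2\bar{x}c + \frac{1}{n}\sum_{i=1}^{n} x_i^2$$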

Since this is a second-order polynomial in $c$ with a positive leading coefficient, it grows without bound as $c$ goes to either negative or positive infinity, and it has its unique minimum where the first derivative is zero. Take the derivative of this function and set it to zero: $2c - 2\bar{x} = 0$. The function has its unique minimum at $c = \bar{x}$, as desired. Q.E.D.
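The theorem is easy to check numerically. The sketch below uses an arbitrary, made-up list of values and a hypothetical helper `sum_of_squares`; it also compares the excess over the minimum with $n(c - \bar{x})^2$, a standard identity that is not stated in the proof above but follows from the same expansion.

```python
def sum_of_squares(samples, c):
    """Sum of squared differences between each sample and the point c."""
    return sum((x - c) ** 2 for x in samples)

samples = [2.0, 3.0, 5.0, 10.0]  # arbitrary illustrative values
n = len(samples)
mean = sum(samples) / n
at_mean = sum_of_squares(samples, mean)

for c in (mean - 1.0, mean - 0.1, mean + 0.1, mean + 1.0):
    excess = sum_of_squares(samples, c) - at_mean
    # The excess over the minimum matches n * (c - mean)^2 up to rounding.
    print(c, excess, n * (c - mean) ** 2)
```

Every candidate other than the mean gives a strictly larger sum, and the gap grows with the square of the distance from the mean.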

Corollary: The Sum of the Squares of the Differences Between Samples and Their Mean Divided by the Sample Size is a Biased Estimator of the Variance

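Proof: It follows from the previous theorem that

$$\frac{1}{n}\sum_{i=1}^{n}(x_i - \mu)^2 > \frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2$$

whenever $\mu$ is not equal to $\bar{x}$. The left-hand side has expected value $\mathrm{Var}(X)$, so the right-hand side falls short of the variance on average. Q.E.D.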
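The size of the shortfall can be made explicit. The following is a sketch of the standard calculation, added here for illustration and assuming the samples are independent and identically distributed with finite variance $\sigma^2$ (an assumption the corollary itself does not spell out):

$$
\begin{aligned}
\sum_{i=1}^{n}(x_i - \bar{x})^2 &= \sum_{i=1}^{n}(x_i - \mu)^2 - n(\bar{x} - \mu)^2 \\
E\left[\sum_{i=1}^{n}(x_i - \bar{x})^2\right] &= n\sigma^2 - n \cdot \frac{\sigma^2}{n} = (n - 1)\sigma^2 \\
E\left[\frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2\right] &= \frac{n - 1}{n}\sigma^2
\end{aligned}
$$

The second line uses $E[(x_i - \mu)^2] = \sigma^2$ and $E[(\bar{x} - \mu)^2] = \mathrm{Var}(\bar{x}) = \sigma^2 / n$. On average, then, the estimator that divides by $n$ recovers only a fraction $\frac{n-1}{n}$ of the true variance.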

Bessel’s correction

Now that we’ve convinced ourselves that $\frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2$ only gives us a biased estimator of the variance, can we do better? Indeed. One simple and popular way of accounting for this bias is to use Bessel’s correction:

$$\frac{1}{n - 1}\sum_{i=1}^{n}(x_i - \bar{x})^2$$
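Python’s standard library happens to expose both versions, which makes for a compact illustration: `statistics.pvariance` divides by $n$ and `statistics.variance` divides by $n - 1$. The data below is made up for the example.

```python
import statistics

samples = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]  # illustrative values
n = len(samples)

biased = statistics.pvariance(samples)    # sum of squares divided by n
corrected = statistics.variance(samples)  # sum of squares divided by n - 1

# The two differ only by the factor n / (n - 1).
print(biased, corrected, biased * n / (n - 1))
```

The corrected value is always the larger of the two, by exactly the factor $\frac{n}{n-1}$.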

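With Bessel’s correction, you divide the sum by $n - 1$ as opposed to $n$ to account for the fact that the sum is smaller than it is supposed to be. Note that the ratio of $n - 1$ to $n$ approaches 1 as $n$ goes to infinity, so the correction fades for large samples, and here’s why that is the right behavior. Recall Chebyshev’s inequality:

$$P(|X - \mu| \geq k\sigma) \leq \frac{1}{k^2}$$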

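Suppose we had $n$ independent and identically distributed random variables $X_1, \ldots, X_n$ with mean $\mu$ and variance $\sigma^2$, and let $\bar{X}$ be their average. Since the mean of their sum $n\bar{X}$ is $n\mu$ and its variance is $n\sigma^2$ (so its standard deviation is $\sqrt{n}\sigma$), we have:

$$P\left(|n\bar{X} - n\mu| \geq k\sqrt{n}\sigma\right) \leq \frac{1}{k^2}$$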

Dividing both sides of the inequality inside the probability by $n$ yields:

$$P\left(|\bar{X} - \mu| \geq \frac{k\sigma}{\sqrt{n}}\right) \leq \frac{1}{k^2}$$

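Now, let $k = \frac{\sqrt{n}}{\sigma}\delta$ for some real number $\delta > 0$:

$$P\left(|\bar{X} - \mu| \geq \delta\right) \leq \frac{\sigma^2}{n\delta^2}$$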

In other words, $P(|\bar{X} - \mu| \geq \delta)$ is bounded above by a quantity that is inversely proportional to $n$, for every $\delta > 0$. The sample mean therefore closes in on $\mu$, and the bias shrinks, as the number of samples increases. Thus, any correction we add should converge to the original (biased) definition of the variance as $n$ grows, and Bessel’s correction has exactly that property.
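The bound can be compared against a simulation. This is a rough sketch under assumed settings: standard normal draws ($\mu = 0$, $\sigma = 1$), $\delta = 0.5$, a fixed seed, and a hypothetical helper `tail_frequency`. None of these choices come from the argument above.

```python
import random

def tail_frequency(n, delta, trials, seed=0):
    """Estimate P(|mean of n N(0, 1) draws| >= delta) by simulation."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    hits = 0
    for _ in range(trials):
        mean = sum(rng.gauss(0.0, 1.0) for _ in range(n)) / n
        if abs(mean) >= delta:
            hits += 1
    return hits / trials

delta = 0.5
results = []
for n in (5, 20, 80):
    bound = 1.0 / (n * delta ** 2)  # sigma^2 / (n * delta^2) with sigma = 1
    freq = tail_frequency(n, delta, trials=5000)
    results.append((n, freq, bound))
    print(n, freq, bound)
```

The observed frequencies should sit below the Chebyshev bound and shrink as $n$ grows. The bound is loose for normal data, but it holds for any distribution with finite variance.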