We ought never to allow ourselves to be persuaded of the truth of anything unless on the evidence of our own reason – René Descartes

Making Sense of Sample Standard Deviation

I had an opportunity to explain sample standard deviation to my colleague last week.
There is a lot of technical jargon surrounding sample standard deviation,
like unbiased estimators and Bessel's correction,
but it's not that hard to make intuitive sense of it.

Consider a set of n samples ${x}_{1},{x}_{2},\cdots {x}_{n}$ of a normally distributed random variable X.
The sample mean is:
$$\overline{x}=\frac{1}{n}{\displaystyle \sum _{i=1}^{n}{x}_{i}}$$

An important thing to notice here is that $\overline{x}$ isn't the true mean.
It is an estimate of the mean of the underlying distribution X.
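As a quick sanity check of that distinction, here is a minimal sketch using only Python's standard library (the seed, true mean, and sample size are arbitrary choices for illustration):

```python
import random

random.seed(0)

mu, sigma = 5.0, 2.0  # true mean and standard deviation of the underlying distribution
samples = [random.gauss(mu, sigma) for _ in range(100)]

# The sample mean estimates mu but is essentially never exactly equal to it.
x_bar = sum(samples) / len(samples)
```

With 100 samples, `x_bar` lands close to `mu` (the standard error is about 0.2 here) without ever matching it exactly.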
Now let's look at the definition of variance
(for a discrete random variable X with mean μ, distributed uniformly over a finite outcome space Ω):
$$Var(X)=\frac{1}{|\mathrm{\Omega}|}{\displaystyle \sum _{x\in \mathrm{\Omega}}(x-\mu {)}^{2}}$$
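As a concrete instance of this formula, consider a fair six-sided die, whose outcomes are uniform over Ω = {1, …, 6}:

```python
# Variance of a fair six-sided die via the uniform formula above.
omega = [1, 2, 3, 4, 5, 6]
mu = sum(omega) / len(omega)  # 3.5
var = sum((x - mu) ** 2 for x in omega) / len(omega)  # 17.5 / 6 ≈ 2.9167
```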

If we simply applied this formula to estimate the variance v as in:
$$v=\frac{1}{n}{\displaystyle \sum _{i=1}^{n}({x}_{i}-\overline{x}{)}^{2}}$$
then v is a biased estimator of $Var(X)$: it underestimates the variance whenever μ is not equal to $\overline{x}$.
Why? Because the sample mean $\overline{x}$ is "closer" to the samples than μ is,
the differences between the samples and $\overline{x}$ tend to be smaller than those between the samples and μ.
Unconvinced?
Let us prove it.

Theorem: the sum of squared differences between the samples and a constant has its unique minimum where the constant equals the sample mean #

This was a hard lemma for me to prove. I had a sleepless night or two,
but I couldn't come up with a proof for the life of me.
Then Min Xu introduced me to a well-known proof, as follows:

Let ${x}_{1},{x}_{2},\cdots {x}_{n}$ be n samples of any random variable X with outcome space $\Omega =R$
and a finite mean μ.
Notice that X doesn't need to be a standard normal, and the size of its outcome space is infinite.
Now consider the following minimization problem:
$$\underset{c}{min}\sum _{i=1}^{n}({x}_{i}-c{)}^{2}$$

Expanding each term inside the sum yields:
$$\underset{c}{min}\sum _{i=1}^{n}({x}_{i}^{2}-2{x}_{i}c+{c}^{2})$$

Rearranging terms and pulling the constants out of each sum, we get:
$$\underset{c}{min}\sum _{i=1}^{n}{x}_{i}^{2}-2c\sum _{i=1}^{n}{x}_{i}+n{c}^{2}$$

Now, substitute the definition of the sample mean and divide by n
(dividing by a positive constant doesn't change the minimizer):
$$\underset{c}{min}{c}^{2}-2\overline{x}c+\frac{1}{n}\sum _{i=1}^{n}{x}_{i}^{2}$$

Since this is a second-order polynomial in c with a positive leading coefficient,
the function diverges to positive infinity as c goes to either negative or positive infinity,
and it has a unique minimum where the first derivative is zero.
Taking the derivative and setting it to zero gives $2c-2\overline{x}=0$.
The function thus attains its unique minimum at $c=\overline{x}$, as desired. Q.E.D.
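The theorem is easy to verify numerically. A minimal sketch with Python's standard library (the seed, sample size, and probe offsets are arbitrary): the sum of squares evaluated at the sample mean should be strictly smaller than at any other point.

```python
import random

random.seed(1)
xs = [random.gauss(0.0, 1.0) for _ in range(50)]

def sum_of_squares(xs, c):
    """Sum of squared differences between the samples and a constant c."""
    return sum((x - c) ** 2 for x in xs)

x_bar = sum(xs) / len(xs)

# Probe a few points on either side of the sample mean.
offsets = (-1.0, -0.1, -0.01, 0.01, 0.1, 1.0)
minimum_wins = all(
    sum_of_squares(xs, x_bar) < sum_of_squares(xs, x_bar + d) for d in offsets
)
```

In fact, expanding as in the proof shows that the sum of squares at $\overline{x}+d$ exceeds the minimum by exactly $n{d}^{2}$.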

Corollary: the sum of squared differences between the samples and their mean, divided by the sample size, is a biased estimator of the variance #

Proof:
It follows from the previous theorem that
$\frac{1}{n}\sum _{i=1}^{n}({x}_{i}-\mu {)}^{2}>\frac{1}{n}\sum _{i=1}^{n}({x}_{i}-\overline{x}{)}^{2}$
whenever μ is not equal to $\overline{x}$. Q.E.D.

Now that we've convinced ourselves that $\frac{1}{n}\sum _{i=1}^{n}({x}_{i}-\overline{x}{)}^{2}$
only gives us a biased estimator of the variance, can we do better?
Indeed.
One of the most natural and popular ways of accounting for this bias
is Bessel's correction:
$$\frac{1}{n-1}\sum _{i=1}^{n}({x}_{i}-\overline{x}{)}^{2}$$

With Bessel's correction, you divide the sum by n − 1 instead of n to account for the fact that
the sum of squared differences from the sample mean is smaller than it is supposed to be.
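A simulation makes the underestimation, and the correction, visible. This is a minimal sketch with Python's standard library (the seed, sample size n = 5, and trial count are arbitrary); we repeatedly draw small samples from a standard normal (true variance 1) and average the two estimators:

```python
import random

random.seed(2)
mu, sigma = 0.0, 1.0
n, trials = 5, 20000

biased_sum = 0.0
unbiased_sum = 0.0
for _ in range(trials):
    xs = [random.gauss(mu, sigma) for _ in range(n)]
    x_bar = sum(xs) / n
    ss = sum((x - x_bar) ** 2 for x in xs)
    biased_sum += ss / n          # divide by n: tends to underestimate sigma**2
    unbiased_sum += ss / (n - 1)  # Bessel's correction

biased = biased_sum / trials      # averages to about (n - 1)/n * sigma**2 = 0.8
unbiased = unbiased_sum / trials  # averages to about sigma**2 = 1.0
```

The biased estimator averages to roughly (n − 1)/n times the true variance, which is exactly the deficit Bessel's correction repairs.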
Note that the correction vanishes as n goes to infinity (the ratio $\frac{n}{n-1}$ approaches 1), and here's why that is the right behavior.
Recall Chebyshev's inequality:
$$P(|X-\mu |\ge k\sigma )\le \frac{1}{{k}^{2}}$$

Suppose we have n independent and identically distributed random variables, each with mean μ
and variance ${\sigma}^{2}$.
Their sum $n\overline{X}$ has mean nμ and variance $n{\sigma}^{2}$ (standard deviation $\sqrt{n}\sigma $), so Chebyshev's inequality gives:
$$P(|n\overline{X}-n\mu |\ge k\sqrt{n}\sigma )\le \frac{1}{{k}^{2}}$$

Dividing each side of the inequality by n yields:
$$P(|\overline{X}-\mu |\ge \frac{k\sigma}{\sqrt{n}})\le \frac{1}{{k}^{2}}$$

Now, let $k=\frac{\sqrt{n}}{\sigma}\delta $ for some positive real number δ:
$$P(|\overline{X}-\mu |\ge \delta )\le \frac{{\sigma}^{2}}{n{\delta}^{2}}$$

In other words, the bound on $P(|\overline{X}-\mu |\ge \delta )$ is inversely proportional to n for every δ > 0,
so the sample mean concentrates around the true mean as n grows.
Thus, any correction we apply should converge to the original definition of the (biased) variance,
since the bias evaporates as the number of samples increases,
and Bessel's correction has exactly that property.
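The 1/n decay of that bound can also be checked empirically. A minimal sketch with Python's standard library (the seed, δ = 0.5, and the sample sizes are arbitrary choices): for each n, the observed tail frequency should sit below Chebyshev's bound, and the bound itself shrinks like 1/n.

```python
import random

random.seed(3)
mu, sigma, delta = 0.0, 1.0, 0.5
trials = 5000

def tail_probability(n):
    """Empirical estimate of P(|sample mean - mu| >= delta) for sample size n."""
    hits = 0
    for _ in range(trials):
        x_bar = sum(random.gauss(mu, sigma) for _ in range(n)) / n
        if abs(x_bar - mu) >= delta:
            hits += 1
    return hits / trials

# For each n: (empirical tail probability, Chebyshev bound sigma**2 / (n * delta**2)).
results = {n: (tail_probability(n), sigma ** 2 / (n * delta ** 2)) for n in (4, 16, 64)}
```

Chebyshev's bound is loose for normal data (the true tail shrinks much faster), but it is enough to show the concentration that makes the bias vanish.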