estimating standard deviation from data

Question

Suppose results $x_1,x_2,x_3$ of an experiment are values of a random variable X with an unknown variance $\sigma^2.$ What is the best approximation of $\sigma^2$?

D. Luenberger mentions in "Investment Science" that the best approximation of $\sigma^2$ is $\frac{1}{2}\sum_{i=1}^3 (x_i-\hat x)^2$, where $\hat x=\frac{x_1+x_2+x_3}{3}.$ Where can I find a proof of that?

Why is this formula different from the "variance of data", which according to "Statistics" by D. Freedman et al. is $\frac{1}{3}\sum_{i=1}^3 (x_i-\hat x)^2$?

score 2 · Accepted Answer · 2013-10-17T13:16:18.713

The first formula is the sample variance, which in general is $$\hat \sigma_{n}^{2} = \frac{1}{n-1}\sum_{i=1}^{n} (x_{i}-\bar x)^2,$$ whereas the second formula is the population variance: $$\hat \sigma_{n}^{2} = \frac{1}{n}\sum_{i=1}^{n} (x_{i}-\bar x)^2.$$ The only difference is in the denominator, with the sample formula correcting for the bias in small samples. Note that for large $n$ the two formulas give essentially the same value. For small samples, however, the sample mean $\bar x$ is likely to be much closer to the sample data than the true mean $\mu$ is, so the population formula will, on average, underestimate the variance.
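
To make the "underestimates on average" claim concrete, here is a minimal simulation sketch. It assumes normally distributed data with a true variance of 4, a sample size of $n=3$ as in the question, and 20,000 repetitions; these values and the helper name `average_estimates` are illustrative choices, not part of the question. It uses only Python's standard library, where `statistics.variance` divides by $n-1$ and `statistics.pvariance` divides by $n$.

```python
import random
import statistics


def average_estimates(n=3, trials=20_000, sigma=2.0, seed=0):
    """Average both variance estimates over many simulated samples of size n."""
    rng = random.Random(seed)      # fixed seed so the run is reproducible
    total_n_minus_1 = 0.0          # running sum of the 1/(n-1) estimates
    total_n = 0.0                  # running sum of the 1/n estimates
    for _ in range(trials):
        # Draw one small sample from a normal distribution (assumed here).
        sample = [rng.gauss(0.0, sigma) for _ in range(n)]
        total_n_minus_1 += statistics.variance(sample)   # divides by n - 1
        total_n += statistics.pvariance(sample)          # divides by n
    return total_n_minus_1 / trials, total_n / trials


avg_n_minus_1, avg_n = average_estimates()
print(f"true variance:            {2.0 ** 2:.3f}")
print(f"average of 1/(n-1) form:  {avg_n_minus_1:.3f}")
print(f"average of 1/n form:      {avg_n:.3f}")
```

Under these assumptions the average of the $\frac{1}{n-1}$ estimates should land close to the true variance, while the average of the $\frac{1}{n}$ estimates should land close to $\frac{n-1}{n}\sigma^2$, that is, about two thirds of the true variance when $n=3$.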

As for a proof, look up the unbiased estimator (UE) of the variance. You will see that the sample variance formula has an expected value equal to the true variance. As for it being the "best" estimator: in general there is no single "best" estimator; it depends on your assumed sampling distribution. For example, the sample variance is the minimum-variance unbiased estimator for normally distributed data. However, there are other ways to define "best", such as maximum likelihood or maximum posterior probability, which are the criteria used in maximum likelihood and Bayesian estimation, respectively.
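
For reference, the unbiasedness argument can be sketched in a few lines. This assumes $x_1,\dots,x_n$ are independent draws with common mean $\mu$ and finite variance $\sigma^2$; independence is an assumption that the question does not state explicitly. Start from the algebraic identity

$$\sum_{i=1}^{n} (x_i-\bar x)^2 = \sum_{i=1}^{n} (x_i-\mu)^2 - n(\bar x-\mu)^2 .$$

Taking expectations, with $E\left[(x_i-\mu)^2\right]=\sigma^2$ and $E\left[(\bar x-\mu)^2\right]=\operatorname{Var}(\bar x)=\frac{\sigma^2}{n}$, gives

$$E\left[\sum_{i=1}^{n} (x_i-\bar x)^2\right] = n\sigma^2 - n\cdot\frac{\sigma^2}{n} = (n-1)\sigma^2 .$$

So dividing the sum by $n-1$ yields an estimator with expected value $\sigma^2$, while dividing by $n$ yields one with expected value $\frac{n-1}{n}\sigma^2$. For $n=3$ that is $\frac{2}{3}\sigma^2$.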

Hope that helps.

The population variance (or the true variance) is $\sigma^2$. Whether you divide by $n$ or $n-1$, the quantities only depend on the sample, and hence they're both sample variances (sample variance is a term for something that estimates the variance based on the sample). "but for small samples, the sample mean $\bar{x}$ is likely be much closer to the sample data than the true mean $\mu$ will be" - what do you mean by that? — Stefan Hansen, Oct 17 '13 at 13:14
Stefan - you are correct that both are based on a sample of data. However, the sample variance formula assumes that your data are in fact a sample from a larger population, whilst the second formula is more applicable if you have sampled the entire population. If you use the second formula on a small sample from a larger population, you will, on average, underestimate the variance. My admittedly somewhat vague statement regarding the sample mean vs. the true mean was an attempt at an intuitive explanation of why it underestimates. The link I provided is more rigorous. — , Oct 17 '13 at 13:20
