The method of least squares is the most basic method in statistical linear models. For the simplest linear model$$Y_i=\beta_0+X_i\beta_1+\epsilon_i$$we look for the $\beta_0$ and $\beta_1$ that minimize the squared Euclidean distance $\sum\limits_{i=1}^n|Y_i-\beta_0-X_i\beta_1|^2$. I studied statistics, and an engineering student asked me why we do not instead use $\sum\limits_{i=1}^n|Y_i-\beta_0-X_i\beta_1|$ or something like $\sum\limits_{i=1}^n|Y_i-\beta_0-X_i\beta_1|^4$. I got stuck. Is there a convincing answer to this question?
Minimizing the $1$-norm of the residual is also a popular approach, and is more robust against outliers. Minimizing the $2$-norm is nice because there's a closed-form solution (you just have to solve the normal equations), and it also gives you an optimal solution (in some sense) when the noise is Gaussian. – littleO Oct 25 '13 at 00:37
1 Answer
One of the most frequently given rationales is that if one assumes the errors are independent and distributed as $N(0,\sigma^2)$, then the least-squares estimators of $\beta_0$ and $\beta_1$ coincide with the maximum-likelihood estimators.
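To spell out why the two coincide, here is a short sketch, assuming the errors are i.i.d. $N(0,\sigma^2)$ and the $X_i$ are treated as fixed. The log-likelihood is
$$
\ell(\beta_0,\beta_1,\sigma^2) = -\frac n2\log\left(2\pi\sigma^2\right) - \frac{1}{2\sigma^2}\sum_{i=1}^n\left(Y_i-\beta_0-X_i\beta_1\right)^2 .
$$
For any fixed $\sigma^2>0$ the first term does not involve $\beta_0,\beta_1$, and the second term is a negative multiple of the sum of squares, so maximizing $\ell$ over $(\beta_0,\beta_1)$ is the same as minimizing $\sum_{i=1}^n|Y_i-\beta_0-X_i\beta_1|^2$. The same computation with a Laplace (double-exponential) error density, whose logarithm is proportional to $-|\varepsilon_i|$, would lead instead to minimizing $\sum_{i=1}^n|Y_i-\beta_0-X_i\beta_1|$, so the choice of exponent can be read as a choice of error model.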
Another rationale is the Gauss--Markov theorem. In that theorem it is not assumed that anything has a normal (or "Gaussian") distribution and it is not even assumed that the errors all have the same distribution. The assumptions are
- The errors $\varepsilon_i$ have expected value $0$;
- The errors $\varepsilon_i$ all have the same (finite) variance, which we denote by $\sigma^2$;
- The errors $\varepsilon_i$ are uncorrelated (but not necessarily independent).
The conclusion is that the least-squares estimators $\hat\beta_0$ and $\hat\beta_1$ are the "best linear unbiased estimators" of $\beta_0$ and $\beta_1$. That means that among all linear unbiased estimators of $\beta_0$ and $\beta_1$, they have the smallest mean squared error $\mathbb E\left(\left(\hat\beta_i - \beta_i\right)^2\right)$.
Let us be careful about what "linear" means. (I know of an instance where someone said they were "affine" but not "linear"; that person was confused.) "Linear" means that the mapping $$ \begin{bmatrix} Y_1 \\ \vdots \\ Y_n\end{bmatrix} \mapsto \begin{bmatrix} \hat\beta_0 \\ \hat\beta_1 \end{bmatrix} \text{ (with $\begin{bmatrix} X_1 \\ \vdots \\ X_n\end{bmatrix}$ fixed)} $$ is linear, i.e. it is additive and 1st-degree homogeneous.
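To make "linear unbiased estimator" concrete, here is a small Python sketch for the slope. It relies on two facts that hold under the assumptions listed above: a linear estimator $\sum_i c_i Y_i$ is unbiased for $\beta_1$ exactly when $\sum_i c_i = 0$ and $\sum_i c_i X_i = 1$, and its variance is $\sigma^2\sum_i c_i^2$. The competing "endpoint" estimator $(Y_n-Y_1)/(X_n-X_1)$, the function names, and the design points are illustrative choices of mine, not part of the theorem.

```python
# Compare two linear unbiased estimators of the slope beta_1.
# Any linear estimator has the form sum(c_i * Y_i); with uncorrelated errors
# of common variance sigma^2, its variance is sigma^2 * sum(c_i^2).

def ols_slope_weights(xs):
    """Weights c_i such that the least-squares slope equals sum(c_i * Y_i)."""
    n = len(xs)
    x_bar = sum(xs) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    return [(x - x_bar) / sxx for x in xs]

def endpoint_slope_weights(xs):
    """Weights for the illustrative competitor (Y_n - Y_1) / (X_n - X_1)."""
    span = xs[-1] - xs[0]
    weights = [0.0] * len(xs)
    weights[0] = -1.0 / span
    weights[-1] = 1.0 / span
    return weights

def is_unbiased_for_slope(weights, xs, tol=1e-12):
    """Check sum(c_i) = 0 and sum(c_i * X_i) = 1 up to rounding."""
    total = sum(weights)
    weighted = sum(c * x for c, x in zip(weights, xs))
    return abs(total) < tol and abs(weighted - 1.0) < tol

def variance_factor(weights):
    """Variance of sum(c_i * Y_i) divided by sigma^2."""
    return sum(c * c for c in weights)

xs = [1.0, 2.0, 3.0, 4.0, 5.0]  # illustrative design points
ols = ols_slope_weights(xs)
endpoint = endpoint_slope_weights(xs)

print("both unbiased:", is_unbiased_for_slope(ols, xs),
      is_unbiased_for_slope(endpoint, xs))
print("variance / sigma^2:", variance_factor(ols), variance_factor(endpoint))
```

For these design points the variance factors are $1/10$ for least squares and $1/8$ for the endpoint estimator, in line with the theorem. No distribution for the errors was needed to reach that comparison.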
The central idea of the proof is a lemma saying that the least-squares estimators are uncorrelated with every linear unbiased estimator of $0$. A linear unbiased estimator of $0$ is a linear combination of $Y_1,\ldots,Y_n$ with coefficients that may depend on the (observable) $X$ values but not on unobservables such as $\beta_0,\beta_1,\sigma^2$, and whose expected value remains $0$ regardless of the values of those unobservables.
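Here is a sketch of how the lemma yields the theorem, written for the slope (the intercept is handled the same way). The symbols $\tilde\beta_1$ and $U$ are notation introduced only for this sketch. Let $\tilde\beta_1$ be any linear unbiased estimator of $\beta_1$ and put $U=\tilde\beta_1-\hat\beta_1$. Then $U$ is a linear unbiased estimator of $0$, so the lemma gives $\operatorname{cov}(\hat\beta_1,U)=0$, and
$$
\operatorname{var}(\tilde\beta_1)=\operatorname{var}(\hat\beta_1+U)=\operatorname{var}(\hat\beta_1)+\operatorname{var}(U)\ge\operatorname{var}(\hat\beta_1).
$$
Because both estimators are unbiased, variance and mean squared error are the same thing here, which is the conclusion of the theorem.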
Another rationale is that least-squares is easy to work with. For example, suppose you want a $90\%$ confidence interval for $\beta_1$. You get (assuming normality and independence) something based on a $t$-distribution with $n-2$ degrees of freedom (two coefficients are estimated), and making use of $\hat\beta_1$ and $X_1,\ldots,X_n$ (the dependence on the $X$s is non-linear, and the non-linearity can be isolated in the inversion of a $2\times 2$ matrix that depends on the $X$s). The hypothesis test corresponding to that confidence interval can be shown to be precisely the likelihood-ratio test.
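As an illustration of how little work this takes, here is a minimal Python sketch of the interval for $\beta_1$, assuming independent normal errors with common variance. It uses the standard form $\hat\beta_1 \pm t\cdot\sqrt{\hat\sigma^2/S_{xx}}$ with $\hat\sigma^2$ equal to the residual sum of squares divided by $n-2$. The function name and the data are made up for the example, and the critical value $1.860$ is the tabulated $0.95$ quantile of the $t$-distribution with $8$ degrees of freedom, hard-coded so that the sketch needs only the standard library.

```python
import math

def slope_confidence_interval(xs, ys, t_crit):
    """Return (beta1_hat, lower, upper) for the slope in Y = b0 + b1*X + error.

    t_crit is the two-sided critical value of the t-distribution with
    n - 2 degrees of freedom, supplied by the caller (e.g. from a table).
    """
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    beta1_hat = sxy / sxx
    beta0_hat = y_bar - beta1_hat * x_bar
    # Residual sum of squares; divide by n - 2 because two coefficients are fitted.
    rss = sum((y - beta0_hat - beta1_hat * x) ** 2 for x, y in zip(xs, ys))
    sigma2_hat = rss / (n - 2)
    std_err = math.sqrt(sigma2_hat / sxx)
    return beta1_hat, beta1_hat - t_crit * std_err, beta1_hat + t_crit * std_err

# Made-up data with n = 10, so the t-distribution has 8 degrees of freedom.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0]
ys = [2.6, 2.9, 3.7, 3.9, 4.4, 5.2, 5.4, 6.1, 6.4, 7.1]
T_CRIT_90_DF8 = 1.860  # tabulated value for a two-sided 90% interval, 8 df

slope, lower, upper = slope_confidence_interval(xs, ys, T_CRIT_90_DF8)
print(f"slope = {slope:.4f}, 90% CI = ({lower:.4f}, {upper:.4f})")
```

The only place the $X$s enter non-linearly is through $S_{xx}=\sum_i (X_i-\bar X)^2$, which is what the inversion of the $2\times 2$ matrix produces in this two-parameter case.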