Deriving the least squares method

Question

I would like to understand more about the Least Squares, but it remains unclear to me, where these equations come from.

for a linear approximation $y=ax+b$

$a=\frac{m\sum_{i=1}^{m} x_iy_i-(\sum_{i=1}^{m}x_i)(\sum_{i=1}^{m}y_i)} {m\sum_{i=1}^{m}x_i^2 - (\sum_{i=1}^{m}x_i)^2}$

and

$b = \frac{(\sum_{i=1}^{m}x_i^2)(\sum_{i=1}^{m}y_i)-(\sum_{i=1}^{m}x_iy_i)(\sum_{i=1}^{m}x_i)}{m\sum_{i=1}^{m}x_i^2 - (\sum_{i=1}^{m}x_i)^2}$

What I have tried so far:

In the least squares method, the quadratic error should be minimized.

This means

$E(a,b)=\sum_{i=1}^{m}[y_i-(ax_i+b)]^2$

should be minimal.

For a minimum to exist, the partial derivatives of this expression must be zero.

$\frac{\partial{E}}{\partial{a}}$ = $2 \sum_{i=1}^{m}[y_i-(ax_i+b)]\hspace{2pt}(-x_i)=0$

$\frac{\partial{E}}{\partial{b}}$ = $2 \sum_{i=1}^{m}[y_i-(ax_i+b)]\hspace{2pt}(-1)=0$

Rearranging these, yields the following:

$a\sum_{i=1}^{m}x_i^2 + b\sum_{i=1}^{m}x_i = \sum_{i=1}^{m}x_iy_i$

$a\sum_{i=1}^{m}x_i+bm = \sum_{i=1}^{m}y_i$

This is where I'm stuck. I see the terms $x_iy_i$, etc. that also appear in the expressions in question - but I don't know how to rearrange to get there.

Things are much, much nicer if we make good use of matrix and vector notation. — littleO, Jan 02 '24 at 11:10
Note that m appears in the answers for both a and b, as well as the sums you have in your equations. So maybe just temporarily name all of these using some new letters, then solve the system using maybe Cramer's rule, then sub back and simplify. — coffeemath, Jan 02 '24 at 11:28

Egor Ivanov · Accepted Answer · 2024-01-02T11:19:34.917

You are doing everything right, but for such derivation it is easier to use matrix calculus (link). It is quite similar to usual $\mathbb R\to\mathbb R$ function differentiating rules. This is the one we are going to use: $$d\langle A\textbf{x}, \textbf{x}\rangle=\langle (A + A^T)\textbf{x}, d\textbf{x}\rangle $$ Let me briefly show: $$f(\textbf{x}) = \|A\textbf{x}-\textbf{b}\|_2^2 \to \min$$ To use the rule above we need other function presentation: $$f(\textbf{x}) = \langle A\textbf{x} - \textbf{b}, A\textbf{x} - \textbf{b} \rangle = \langle A\textbf{x}, A\textbf{x} \rangle - 2\langle A\textbf{x}, \textbf{b} \rangle + \langle \textbf{b}, \textbf{b} \rangle= $$ $$= \langle A^TA\textbf{x}, \textbf{x} \rangle - 2\langle A^T\textbf{b}, \textbf{x}\rangle + \langle \textbf{b}, \textbf{b} \rangle$$ Now it is easy to apply the rule: $$df(\textbf{x}) = 2 \langle A^TA \textbf{x}, d\textbf{x} \rangle - 2\langle A^T\textbf{b}, d\textbf{x}\rangle \Rightarrow \nabla f = 2(A^TA\textbf{x} - A^T\textbf{b})$$ To get the extremum point we need to find gradient zeros: $$A^TA\textbf{x} = A^T\textbf{b} \Rightarrow \textbf{x} = (A^TA)^{-1}A^T\textbf{b}$$ We get the same result as least squares method does. But the proof is not finished. We need to find the Hessian and proof that it is positive defined.

UPD: For your example: $$A = \begin{bmatrix} x_1 & 1\\ x_2 & 1 \\ \vdots & \vdots\\ x_m & 1 \end{bmatrix}$$ Then: $$ A^TA = \begin{bmatrix} x_1 & x_2 & \dots & x_m\\ 1 & 1 & \dots & 1 \end{bmatrix} \begin{bmatrix} x_1 & 1\\ x_2 & 1 \\ \vdots & \vdots\\ x_m & 1 \end{bmatrix}= \begin{bmatrix} \sum_{k=1}^m x_k^2 & \sum_{k=1}^m x_k\\ \sum_{k=1}^m x_k & 1 \end{bmatrix} $$ Now it is easy to get expressions for $a$ and $b$.

I’d just call this regular old vector calculus rather than matrix calculus. — littleO, Jan 03 '24 at 08:58

Deriving the least squares method

1 Answers1

Linked