Difference between Newton's method and Gauss-Newton method

Question

I know that the Gauss-Newton method is essentially Newton's method, modified to use the approximation $2J^TJ$ (where $J$ is the Jacobian matrix) for the Hessian matrix.

I don't understand why we use this approximation. Can anyone explain where this approximation comes from?

Thanks

Vexx23 · Answer 1 · 2018-03-12T16:06:15.670

If the objective function is the classical sum of squares, it can be written as the squared norm of an error vector $\boldsymbol e \in \mathbb{R}^m$: $$f(\boldsymbol x) = \| \boldsymbol e(\boldsymbol x)\|^2 = \boldsymbol e ^T(\boldsymbol x) \boldsymbol e(\boldsymbol x) $$ where $\boldsymbol x \in \mathbb{R}^n $ is the decision variable.

Newton's algorithm tries to minimize the objective function by finding a point where its gradient vanishes, using a local linear approximation of the gradient difference: $$\nabla f(\boldsymbol x_{k+1}) - \nabla f(\boldsymbol x_{k}) \approx \boldsymbol Hf(\boldsymbol x_{k})(\boldsymbol x_{k+1} -\boldsymbol x_{k})$$ This works under the hypothesis that the function is convex or that the Hessian matrix is locally positive semi-definite; otherwise the algorithm can fail, because it is attracted by any stationary point, which may be a minimum, a maximum or a saddle point. If the above expression is rewritten as an affine transformation: $$\nabla f(\boldsymbol x_{k+1}) \approx \nabla f(\boldsymbol x_{k}) +\boldsymbol Hf(\boldsymbol x_{k})(\boldsymbol x_{k+1} -\boldsymbol x_{k})$$ the optimal update $\Delta \boldsymbol x^*_k = \boldsymbol x_{k+1} -\boldsymbol x_{k}$ can be found by requiring $\nabla f(\boldsymbol x_{k+1}) = \boldsymbol 0$, that is, by solving the equation: $$\nabla f(\boldsymbol x_{k}) = -\boldsymbol Hf(\boldsymbol x_{k})\Delta \boldsymbol x^*_k $$
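To make the update rule concrete, here is a minimal sketch of the Newton iteration in Python with NumPy. The function name `newton_minimize`, the tolerance, the iteration cap and the convex test function are all assumptions chosen for illustration; they are not part of the answer.

```python
import numpy as np

def newton_minimize(grad, hess, x0, tol=1e-10, max_iter=50):
    """Repeat x <- x + dx, where dx solves hess(x) @ dx = -grad(x)."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) < tol:   # gradient (almost) vanishes: stop
            break
        dx = np.linalg.solve(hess(x), -g)
        x = x + dx
    return x

# Hypothetical convex test function: f(x) = x0^4 + x0^2 + (x1 - 1)^2,
# whose unique minimizer is (0, 1).
grad = lambda x: np.array([4 * x[0] ** 3 + 2 * x[0], 2 * (x[1] - 1)])
hess = lambda x: np.array([[12 * x[0] ** 2 + 2, 0.0], [0.0, 2.0]])

x_star = newton_minimize(grad, hess, [1.0, 0.0])
print(x_star)
```

The sketch has no safeguard (line search or Hessian modification), so on a non-convex function it can be drawn to a maximum or a saddle point, as noted above.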

The Hessian matrix, owing to the particular structure of the cost function, depends upon both first and second derivatives of each component $e_i(\boldsymbol x)$ of the error vector. Considering that: $$e_i(\boldsymbol x + \Delta \boldsymbol x) = e_i(\boldsymbol x) + \nabla e_i^T(\boldsymbol x) \Delta \boldsymbol x + \frac{1}{2}\Delta \boldsymbol x^T \boldsymbol He_i(\boldsymbol x) \Delta \boldsymbol x + \mathcal{O}\left(\|\Delta \boldsymbol x \|^3\right), \quad \Delta \boldsymbol x \to \boldsymbol 0, \forall i = 1, \ldots, m $$ hence: $$ \boldsymbol Hf(\boldsymbol x) = \frac{\partial^2 f(\boldsymbol x)}{\partial \boldsymbol x^2} = \frac{\partial}{\partial \boldsymbol x}\frac{\partial f(\boldsymbol x)}{\partial \boldsymbol x}= \boldsymbol J \left(2 \boldsymbol e^T(\boldsymbol x) \boldsymbol J \boldsymbol e(\boldsymbol x) \right) = 2 \left(\boldsymbol J^T \boldsymbol e(\boldsymbol x) \boldsymbol J \boldsymbol e(\boldsymbol x) + \sum_{i=1}^{m} e_i(\boldsymbol x) \boldsymbol He_i(\boldsymbol x) \right)$$ where $\boldsymbol He_i(\boldsymbol x)$ is the Hessian of the component $e_i$ and $\boldsymbol J \boldsymbol e(\boldsymbol x)$ is the Jacobian matrix of the error, defined as: $$\boldsymbol J \boldsymbol e(\boldsymbol x) = \begin{pmatrix} \nabla e_1^T(\boldsymbol x) \\ \vdots \\ \nabla e_m^T(\boldsymbol x) \end{pmatrix}$$

If the second derivatives of the error components $e_h(\boldsymbol x)$ $$ \frac{\partial^2 e_h(\boldsymbol x)}{\partial x_i \partial x_j} $$ are not known, one can approximate the Hessian matrix by neglecting the second term (which becomes more and more negligible as the error gets smaller, so the approximation makes perfect sense when the residuals are very small):

$$ \boldsymbol Hf(\boldsymbol x) \approx 2 \boldsymbol J^T \boldsymbol e(\boldsymbol x) \boldsymbol J \boldsymbol e(\boldsymbol x) $$ $$ \boldsymbol \nabla f(\boldsymbol x) = 2 \boldsymbol J^T \boldsymbol e(\boldsymbol x) \boldsymbol e(\boldsymbol x) $$

This gives rise to the Gauss-Newton algorithm: $$ 2 \boldsymbol J^T \boldsymbol e(\boldsymbol x) \boldsymbol e(\boldsymbol x) = -\left( 2 \boldsymbol J^T \boldsymbol e(\boldsymbol x) \boldsymbol J \boldsymbol e(\boldsymbol x) \right) \Delta \boldsymbol x^* \Leftrightarrow $$ $$ \boldsymbol J^T \boldsymbol e(\boldsymbol x) \boldsymbol e(\boldsymbol x) = -\left(\boldsymbol J^T \boldsymbol e(\boldsymbol x) \boldsymbol J \boldsymbol e(\boldsymbol x) \right) \Delta \boldsymbol x^* $$
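The last equation can be turned into a short iteration. The following is a sketch only, assuming NumPy; the name `gauss_newton`, the stopping rule, and the exponential model with noise-free synthetic data are hypothetical choices made for illustration, not something the answer specifies.

```python
import numpy as np

def gauss_newton(residual, jacobian, x0, tol=1e-10, max_iter=50):
    """Repeat x <- x + dx, where dx solves (J^T J) dx = -J^T e."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        e = residual(x)                  # error vector e(x), shape (m,)
        J = jacobian(x)                  # Jacobian of e at x, shape (m, n)
        dx = np.linalg.solve(J.T @ J, -J.T @ e)
        x = x + dx
        if np.linalg.norm(dx) < tol:     # step is negligible: stop
            break
    return x

# Hypothetical example: fit y = a * exp(b * t) to noise-free data
# generated with a = 2, b = -1, so the residuals vanish at the solution.
t = np.linspace(0.0, 2.0, 10)
y = 2.0 * np.exp(-1.0 * t)
residual = lambda p: p[0] * np.exp(p[1] * t) - y
jacobian = lambda p: np.column_stack(
    [np.exp(p[1] * t), p[0] * t * np.exp(p[1] * t)]
)

p_hat = gauss_newton(residual, jacobian, [1.5, -0.8])
print(p_hat)
```

Only first derivatives of the residuals are needed here. Solving the normal equations with `np.linalg.solve` assumes the Jacobian has full column rank; otherwise $\boldsymbol J^T \boldsymbol J$ is singular and the call raises an error.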

It's also worth mentioning the Gauss-Newton step always exists. — mathmath8128, Sep 24 '18 at 20:13

score 4 · Answer 2 · answered Sep 25 '15 at 15:33

The difference can be seen with a scalar function.

Gauss Newton is used to solve nonlinear least squares problems and the objective has the form $f(x) = r(x)^2$. The derivatives are $f'(x) = 2 r(x) r'(x)$ and $f''(x) = 2 ( r(x) r''(x) + (r'(x))^2)$.

Newton's method uses the second derivative $f''(x)$ above, whereas the Gauss-Newton method uses the approximation $f''(x) \approx 2 (r'(x))^2$ (that is, the second derivative of $r$ is dropped).
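As a small illustration of the two formulas, the sketch below takes one step of each method from the same starting point. The residual $r(x) = x^2 - 2$ and the starting point are assumptions picked only for the example.

```python
def newton_step(x, r, dr, d2r):
    """One Newton step on f = r^2, using the full second derivative f''."""
    f1 = 2 * r(x) * dr(x)
    f2 = 2 * (r(x) * d2r(x) + dr(x) ** 2)
    return x - f1 / f2

def gauss_newton_step(x, r, dr):
    """One Gauss-Newton step on f = r^2, with f'' replaced by 2 * r'(x)^2."""
    f1 = 2 * r(x) * dr(x)
    f2_approx = 2 * dr(x) ** 2
    return x - f1 / f2_approx

# Hypothetical residual r(x) = x^2 - 2, which vanishes at sqrt(2).
r = lambda x: x ** 2 - 2
dr = lambda x: 2 * x
d2r = lambda x: 2.0

x0 = 2.0
x_newton = newton_step(x0, r, dr, d2r)
x_gn = gauss_newton_step(x0, r, dr)
print(x_newton, x_gn)
```

In the scalar case the Gauss-Newton step simplifies to $x - r(x)/r'(x)$, which is the Newton-Raphson root-finding step applied to $r$ itself.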

score 1 · Answer 3 · answered Sep 25 '15 at 08:42

Newton computes the update step $s$ by solving $F'(x)·s=-F(x)$.

Gauss-Newton determines the update by minimizing the error in the linearization of the overdetermined system, i.e., minimizes $\|F'(x)·s+F(x)\|$. The expanded form of the square of this error is $$ \|F(x)\|^2 + 2·F(x)^TF'(x)·s+s^T·F'(x)^TF'(x)·s $$ The quadratic term is not an approximation for the Hessian of $\|F(x+s)\|^2$, just an expression in the error minimization of a linear system.
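A quick numerical check of this viewpoint: minimizing $\|F'(x)·s+F(x)\|$ directly with a least-squares solver gives the same step as solving the normal equations $F'(x)^TF'(x)·s=-F'(x)^TF(x)$. The random matrix and vector below are stand-ins for $F'(x)$ and $F(x)$ at some fixed $x$; they are assumptions for the sake of the example.

```python
import numpy as np

rng = np.random.default_rng(0)
J = rng.standard_normal((6, 2))   # stands in for F'(x): 6 equations, 2 unknowns
F = rng.standard_normal(6)        # stands in for F(x)

# Step that minimizes ||J s + F|| (linear least squares).
s_lstsq, *_ = np.linalg.lstsq(J, -F, rcond=None)

# Step from the normal equations (J^T J) s = -J^T F.
s_normal = np.linalg.solve(J.T @ J, -J.T @ F)

print(s_lstsq, s_normal)
```

The two agree whenever the matrix has full column rank. The least-squares solver avoids forming $F'(x)^TF'(x)$ explicitly, which is usually preferable numerically.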

Lutz Lehmann

Why do people often remember the square but rarely write (or remember?) the 2-norm (Euclidean) when writing the notation for the least-squares problem? – JeeyCi Aug 15 '23 at 02:40
