If the objective function is the classical sum of squares, it can be written as the norm squared of a certain error vector $\boldsymbol e \in \mathbb{R}^m$:
$$f(\boldsymbol x) = \| \boldsymbol e(\boldsymbol x))\|^2 = \boldsymbol e ^T(\boldsymbol x) \boldsymbol e(\boldsymbol x) $$
where $\boldsymbol x \in \mathbb{R}^n $ is the decision variable.
Newton algorithm tries to minimize the objective function by finding a point where its gradient vanishes, by using a local linear approximation of the gradient difference:
$$\nabla f(\boldsymbol x_{k+1}) - \nabla f(\boldsymbol x_{k}) \approx \boldsymbol Hf(\boldsymbol x_{k})(\boldsymbol x_{k+1} -\boldsymbol x_{k})$$
in the hypothesis that the function is convex or that the hessian matrix is locally positive semi-definite, otherwise the algorithm fails, because it gets attracted by any stationary point, which may be either a minimum, a maximum or saddle).
If the above expression is rewritten as an affine transformation:
$$\nabla f(\boldsymbol x_{k+1}) \approx \nabla f(\boldsymbol x_{k}) +\boldsymbol Hf(\boldsymbol x_{k})(\boldsymbol x_{k+1} -\boldsymbol x_{k})$$
the optimum update $\Delta \boldsymbol x^*_k = \boldsymbol x_{k+1} -\boldsymbol x_{k}$ can be found by solving the equation:
$$\nabla f(\boldsymbol x_{k}) = -\boldsymbol Hf(\boldsymbol x_{k})\Delta \boldsymbol x_k $$
The Hessian matrix, owing to the particular structure of the cost function, depends upon both first and second derivatives of each component $e_i(\boldsymbol x)$ of the error vector. Considering that:
$$e_i(\boldsymbol x + \Delta \boldsymbol x) \approx e_i(\boldsymbol x) + \nabla e_i^T(\boldsymbol x) \Delta \boldsymbol x + \frac{1}{2}\Delta \boldsymbol x^T \boldsymbol He_i(\boldsymbol x) \Delta \boldsymbol x + \|\Delta \boldsymbol x \|^3, \quad \Delta \boldsymbol x \to \boldsymbol 0, \forall i = 1, \ldots, m $$
hence:
$$ \boldsymbol Hf(\boldsymbol x) = \frac{\partial^2 f(\boldsymbol x)}{\partial \boldsymbol x^2} = \frac{\partial}{\partial \boldsymbol x}\frac{\partial f(\boldsymbol x)}{\partial \boldsymbol x}= \boldsymbol J \left(2 \boldsymbol e^T(\boldsymbol x) \boldsymbol J \boldsymbol e(\boldsymbol x) \right) = 2 \left(\boldsymbol J^T \boldsymbol e(\boldsymbol x) \boldsymbol J \boldsymbol e(\boldsymbol x) + \sum_{i=1}^{m} e_i(\boldsymbol x) \boldsymbol He_i(\boldsymbol x) \right)$$
where $\boldsymbol J \boldsymbol e(\boldsymbol x)$ is the error jacobian matrix defined as:
$$\boldsymbol J \boldsymbol e(\boldsymbol x) = \begin{pmatrix} \nabla e_1^T(\boldsymbol x) \\ \vdots \\ \nabla e_m^T(\boldsymbol x) \end{pmatrix}$$
If the second derivatives of the error components $e_h(\boldsymbol x)$
$$ \frac{\partial^2 e_h(\boldsymbol x)}{\partial x_i \partial x_j} $$ are not known, one can approximate the hessian matrix by neglecting the second part (which becomes more and more negligible as the error gets smaller so it makes perfectly sense when the residuals are very small):
$$ \boldsymbol Hf(\boldsymbol x) \approx 2 \boldsymbol J^T \boldsymbol e(\boldsymbol x) \boldsymbol J \boldsymbol e(\boldsymbol x) $$
$$ \boldsymbol \nabla f(\boldsymbol x) = 2 \boldsymbol J^T \boldsymbol e(\boldsymbol x) \boldsymbol e(\boldsymbol x) $$
This gives rise to the Gauss-Newton algorithm:
$$ 2 \boldsymbol J^T \boldsymbol e(\boldsymbol x) \boldsymbol e(\boldsymbol x) = -\left( 2 \boldsymbol J^T \boldsymbol e(\boldsymbol x) \boldsymbol J \boldsymbol e(\boldsymbol x) \right) \Delta \boldsymbol x^* \Leftrightarrow $$
$$ \boldsymbol J^T \boldsymbol e(\boldsymbol x) \boldsymbol e(\boldsymbol x) = -\left(\boldsymbol J^T \boldsymbol e(\boldsymbol x) \boldsymbol J \boldsymbol e(\boldsymbol x) \right) \Delta \boldsymbol x^* $$