I am currently trying to understand the math used in training a neural network, in which gradient descent is used to minimize the error between the target and the output computed by the network. I am currently reading this tutorial.
So an example:
Given a network like this:
We wish to minimize the error function, which for a single training example $(x,y)$ is
\begin{align} J(W,b; x,y) = \frac{1}{2} \left\| h_{W,b}(x) - y \right\|^2. \end{align}
(question: Why multiply by one half?)
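A quick check of what the factor does, assuming a scalar output $h$ for simplicity (the scalar case is my own simplification, not something the tutorial states):

\begin{align} \frac{\partial}{\partial h} \, \frac{1}{2} (h - y)^2 = h - y, \qquad \frac{\partial}{\partial h} \, (h - y)^2 = 2 (h - y). \end{align}

As far as I can tell, the half only cancels the 2 that the power rule produces. It rescales the cost by a constant, so the location of the minimum is unchanged.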
For $m$ training examples the cost becomes
\begin{align} J(W,b) &= \left[ \frac{1}{m} \sum_{i=1}^m \left( \frac{1}{2} \left\| h_{W,b}(x^{(i)}) - y^{(i)} \right\|^2 \right) \right] + \frac{\lambda}{2} \sum_{l=1}^{n_l-1} \; \sum_{i=1}^{s_l} \; \sum_{j=1}^{s_{l+1}} \left( W^{(l)}_{ji} \right)^2 \end{align}
(question: Why the second term? And why compute and minimize the average of the error rather than the exact, summed error?)
Using the partial derivatives of the cost function, one can compute the gradient along which the weights and biases have to descend in order to minimize it:
\begin{align} W_{ij}^{(l)} &= W_{ij}^{(l)} - \alpha \frac{\partial}{\partial W_{ij}^{(l)}} J(W,b) \\ b_{i}^{(l)} &= b_{i}^{(l)} - \alpha \frac{\partial}{\partial b_{i}^{(l)}} J(W,b) \end{align}
where $\alpha$ is the learning rate, which determines how large a step is taken along the negative gradient.
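To convince myself of what the update rule does, here is a minimal sketch on a toy one-parameter cost. The function names `gradient_descent` and `toy_grad`, the cost $J(\theta) = \frac{1}{2}(\theta - 3)^2$, and the values of `alpha` and `steps` are all my own assumptions for illustration, not taken from the tutorial.

```python
def gradient_descent(grad, theta0, alpha, steps):
    """Repeatedly apply the update theta := theta - alpha * grad(theta)."""
    theta = theta0
    for _ in range(steps):
        theta = theta - alpha * grad(theta)
    return theta


def toy_grad(theta):
    """Derivative of the toy cost J(theta) = 0.5 * (theta - 3)^2."""
    return theta - 3.0


# Start away from the minimum and take many small steps.
theta_final = gradient_descent(toy_grad, theta0=0.0, alpha=0.1, steps=200)
print(theta_final)  # should end up very close to the minimizer theta = 3
```

With a small `alpha` each step moves only a fraction of the way towards the minimum; for this particular cost, an `alpha` above 2 would make the iterates diverge instead.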
As far as I understand, backpropagation is most useful here, as it provides an efficient way of computing these partial derivatives:
\begin{align} \frac{\partial}{\partial W_{ij}^{(l)}} J(W,b) &= \left[ \frac{1}{m} \sum_{i=1}^m \frac{\partial}{\partial W_{ij}^{(l)}} J(W,b; x^{(i)}, y^{(i)}) \right] + \lambda W_{ij}^{(l)} \\ \frac{\partial}{\partial b_{i}^{(l)}} J(W,b) &= \frac{1}{m}\sum_{i=1}^m \frac{\partial}{\partial b_{i}^{(l)}} J(W,b; x^{(i)}, y^{(i)}) \end{align}
(question: Again, why the average and the second term?)
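My attempt at seeing where the two pieces come from. Differentiation is linear, so the derivative of an average is the average of the per-example derivatives. For the second term of $J(W,b)$, only the summand containing the chosen weight survives (the summation indices $l'$, $p$, $q$ are renamed by me to avoid clashing with $l$, $i$, $j$):

\begin{align} \frac{\partial}{\partial W_{ij}^{(l)}} \; \frac{\lambda}{2} \sum_{l'=1}^{n_l-1} \; \sum_{p=1}^{s_{l'}} \; \sum_{q=1}^{s_{l'+1}} \left( W^{(l')}_{qp} \right)^2 = \frac{\lambda}{2} \cdot 2 \, W^{(l)}_{ij} = \lambda W^{(l)}_{ij} \end{align}

The biases do not appear in that sum, which would explain why the derivative with respect to $b_i^{(l)}$ has no $\lambda$ term.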
They then further explain how one can get the derivative for a single training example $(x,y)$.
First they define an error term $\delta^{(l)}_i$, which contains the information "how much of the error in the output was caused by node $i$ in layer $l$". The error term of an output node, $\delta^{(n_l)}_i$, can easily be computed as:
\begin{align} \delta^{(n_l)}_i = \frac{\partial}{\partial z^{(n_l)}_i} \;\; \frac{1}{2} \left\|y - h_{W,b}(x)\right\|^2 = - (y_i - a^{(n_l)}_i) \cdot f'(z^{(n_l)}_i) \end{align}
in which $z^{(l)}_i$ denotes the total weighted sum of inputs to unit $i$ in layer $l$, including the bias term (for example, $\textstyle z_i^{(2)} = \sum_{j=1}^n W^{(1)}_{ij} x_j + b^{(1)}_i$), and $a^{(l)}_i$ is the activation of node $i$ in layer $l$, $a^{(l)}_i = f(z^{(l)}_i)$.
(question: I am not sure I understand how they derived this partial derivative. I understand why they compute it, but I don't understand the result.)
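My attempt at reproducing it with the chain rule, assuming $h_{W,b}(x) = a^{(n_l)}$ and that $f$ is applied to each unit separately, so that only the $k = i$ term of the squared norm depends on $z^{(n_l)}_i$:

\begin{align} \frac{\partial}{\partial z^{(n_l)}_i} \; \frac{1}{2} \sum_{k=1}^{s_{n_l}} \left( y_k - a^{(n_l)}_k \right)^2 &= \left( y_i - a^{(n_l)}_i \right) \cdot \frac{\partial}{\partial z^{(n_l)}_i} \left( y_i - f(z^{(n_l)}_i) \right) \\ &= - \left( y_i - a^{(n_l)}_i \right) \cdot f'(z^{(n_l)}_i) \end{align}

The factor $\frac{1}{2}$ cancels the 2 from the square, and the minus sign comes from differentiating $-f(z^{(n_l)}_i)$.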
This is where things begin to become weird, and my intuition no longer follows what is going on.
The error term of each node $i$ in a hidden layer $l$ is then defined as
$$\delta^{(l)}_i = \left( \sum_{j=1}^{s_{l+1}} W^{(l)}_{ji} \delta^{(l+1)}_j \right) f'(z^{(l)}_i)$$
which then, by some magic, gives the desired partial derivatives:
\begin{align} \frac{\partial}{\partial W_{ij}^{(l)}} J(W,b; x, y) &= a^{(l)}_j \delta_i^{(l+1)} \\ \frac{\partial}{\partial b_{i}^{(l)}} J(W,b; x, y) &= \delta_i^{(l+1)}. \end{align}
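To check that these formulas really do produce the partial derivatives, here is a minimal pure-Python sketch that implements them for one training example and compares the result against a finite-difference estimate. Everything specific in it is my own assumption and not from the tutorial: the sigmoid activation, the 2-3-1 network, the weight values, and the helper names (`forward`, `cost`, `backprop`, `numeric_grad_W`). Note that the lists are 0-indexed, so `W[l]` connects `a[l]` to `a[l + 1]`.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def sigmoid_prime_from_activation(act):
    """f'(z) for the sigmoid, written in terms of a = f(z)."""
    return act * (1.0 - act)


def forward(W, b, x):
    """Return the activations of every layer; W[l][i][j] is the weight
    from unit j in layer l to unit i in layer l + 1."""
    a = [x]
    for Wl, bl in zip(W, b):
        z = [sum(w * aj for w, aj in zip(row, a[-1])) + bi
             for row, bi in zip(Wl, bl)]
        a.append([sigmoid(v) for v in z])
    return a


def cost(W, b, x, y):
    """J(W, b; x, y) = 0.5 * ||h(x) - y||^2 for one example."""
    out = forward(W, b, x)[-1]
    return 0.5 * sum((o - t) ** 2 for o, t in zip(out, y))


def backprop(W, b, x, y):
    a = forward(W, b, x)
    # Output layer: delta_i = -(y_i - a_i) * f'(z_i)
    delta = [-(t - o) * sigmoid_prime_from_activation(o)
             for t, o in zip(y, a[-1])]
    grad_W, grad_b = [None] * len(W), [None] * len(W)
    for l in reversed(range(len(W))):
        # dJ/dW[l][i][j] = a[l][j] * delta_i and dJ/db[l][i] = delta_i,
        # where delta belongs to layer l + 1.
        grad_W[l] = [[d * aj for aj in a[l]] for d in delta]
        grad_b[l] = list(delta)
        if l > 0:
            # Hidden layer l: delta_j = (sum_i W[l][i][j] * delta_i) * f'(z_j)
            delta = [sum(W[l][i][j] * delta[i] for i in range(len(delta)))
                     * sigmoid_prime_from_activation(a[l][j])
                     for j in range(len(a[l]))]
    return grad_W, grad_b


def numeric_grad_W(W, b, x, y, l, i, j, eps=1e-6):
    """Central finite difference of the cost with respect to W[l][i][j]."""
    original = W[l][i][j]
    W[l][i][j] = original + eps
    plus = cost(W, b, x, y)
    W[l][i][j] = original - eps
    minus = cost(W, b, x, y)
    W[l][i][j] = original  # restore the weight
    return (plus - minus) / (2 * eps)


# Hypothetical 2-3-1 network with arbitrary fixed weights.
W = [[[0.1, -0.2], [0.4, 0.3], [-0.5, 0.2]], [[0.3, -0.1, 0.6]]]
b = [[0.0, 0.1, -0.1], [0.2]]
x, y = [1.0, 0.5], [1.0]

grad_W, grad_b = backprop(W, b, x, y)
max_diff = max(abs(grad_W[l][i][j] - numeric_grad_W(W, b, x, y, l, i, j))
               for l in range(len(W))
               for i in range(len(W[l]))
               for j in range(len(W[l][i])))
print(max_diff)  # largest gap between backprop and finite differences
```

If the backpropagation formulas are right, `max_diff` should be tiny, limited only by the finite-difference approximation. That suggests the "magic" is just the chain rule applied layer by layer, with each $\delta^{(l+1)}$ reused to compute $\delta^{(l)}$ instead of being recomputed from scratch.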
