Solving derivative of squared error where the predictor is a sigmoid function

Question

$\newcommand{\sigmoid}{\operatorname{sigmoid}}$In the book "Make your own neural network" by Tariq Rashid, I have to take the derivative of my cost function which is:

$$ \left(t-\sigmoid\left(\sum_j w_{jk}\times o_j\right)\right)^2 $$

where $t$ is the true value and thus is a constant. $o_j$ is the value of the previous node and $w_{jk}$ are the weights that connect $o_j$ to the error node. Trying to work out the derivative of the function myself I get the following result:

$$ 2\left(t-\sigmoid\left(\sum_j w_{jk}\times o_j\right)\right)\times \left(\sigmoid\left(\sum_j w_{jk}\times o_j\right)\right)\left(1-\sigmoid\left(\sum_j w_{jk}\times o_j\right)\right) $$

The problem is that the derivative in the book is a bit different, and I have no idea why or what I did wrong. The answer in the book has $-2$ and is multiplied by $o_j$ at the end:

$$ -2\left(t-\sigmoid\left(\sum_j w_{jk}\times o_j\right)\right)\times \left(\sigmoid\left(\sum_j w_{jk}\times o_j\right)\right)\left(1-\sigmoid\left(\sum_j w_{jk}\times o_j\right)\right)\times o_j $$

Where do the $-2$ and the $o_j$ in the equation come from? What step of the chain rule did I miss?

score 0 · Accepted Answer · edited Apr 24 '23 at 07:12

First, recall that the sigmoid function is defined as: $$\sigma(x)=\frac{e^{x}}{1+e^{x}}=\frac{1}{1+e^{-x}}=(1+e^{-x})^{-1}$$ then, using the quotient rule (or chain rule) you can show that: $$\sigma^{\prime}(x)=\sigma(x)\cdot\big(1-\sigma(x)\big)$$

In fact: $$ \sigma^{\prime}(x)=\left((1+e^{-x})^{-1}\right)^{\prime}=-(1+e^{-x})^{-2}\cdot(-e^{-x})=\frac{e^{-x}}{(1+e^{-x})^{2}}=\frac{e^{-x}}{1+e^{-x}}\frac{1}{1+e^{-x}}=\frac{e^{-x}}{1+e^{-x}}\sigma(x)=\left(1-\frac{1}{1+e^{-x}}\right)\sigma(x)=\big(1-\sigma(x)\big)\cdot\sigma(x)$$
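The identity $\sigma^{\prime}(x)=\sigma(x)\big(1-\sigma(x)\big)$ can be sanity-checked numerically. The following is a minimal sketch in plain Python; the helper names (`sigmoid_prime`, `numeric_derivative`), the step size `h`, and the sample points are illustrative choices, not part of the original answer.

```python
import math


def sigmoid(x):
    """Logistic sigmoid 1 / (1 + e^{-x})."""
    return 1.0 / (1.0 + math.exp(-x))


def sigmoid_prime(x):
    """Closed-form derivative sigma(x) * (1 - sigma(x))."""
    s = sigmoid(x)
    return s * (1.0 - s)


def numeric_derivative(f, x, h=1e-6):
    """Central finite-difference approximation of f'(x)."""
    return (f(x + h) - f(x - h)) / (2.0 * h)


# Compare the closed form with the finite-difference estimate at a few points.
for x in (-2.0, 0.0, 0.5, 3.0):
    print(x, sigmoid_prime(x), numeric_derivative(sigmoid, x))
```

The two printed columns should agree to many decimal places, which is consistent with the closed-form derivative above.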

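Now, your function (as a function of the weights $w_{jk}$) is $$F(\ldots,w_{jk},\ldots)=\left(t-\sigma\left(\sum_{\ell}w_{\ell k}o_{\ell}\right)\right)^{2},$$ where the summation index is written as $\ell$ so that it does not clash with the fixed index $j$ of the weight we differentiate with respect to.

Using the chain rule you get its derivative w.r.t. the weight $w_{jk}$. Since $t$ is a constant, differentiating the inner function $t-\sigma(\cdot)$ contributes a factor $-\sigma^{\prime}(\cdot)$, which is where the minus sign comes from: $$\frac{\partial}{\partial w_{jk}}F(\ldots,w_{jk},\ldots)=\frac{\partial}{\partial w_{jk}}\left(\left(t-\sigma\left(\sum_{\ell}w_{\ell k}o_{\ell}\right)\right)^{2}\right)$$ $$=2\left(t-\sigma\left(\sum_{\ell}w_{\ell k}o_{\ell}\right)\right)\cdot\left(-\sigma^{\prime}\left(\sum_{\ell}w_{\ell k}o_{\ell}\right)\right)\cdot\frac{\partial}{\partial w_{jk}}\left(\sum_{\ell}w_{\ell k}o_{\ell}\right)$$ $$=-2\left(t-\sigma\left(\sum_{\ell}w_{\ell k}o_{\ell}\right)\right)\cdot\sigma^{\prime}\left(\sum_{\ell}w_{\ell k}o_{\ell}\right)\cdot\frac{\partial}{\partial w_{jk}}\left(\sum_{\ell}w_{\ell k}o_{\ell}\right)$$

Now use the fact that $$\sigma^{\prime}(x)=\sigma(x)\cdot\big(1-\sigma(x)\big)$$ to get: $$\frac{\partial}{\partial w_{jk}}F(\ldots,w_{jk},\ldots)=-2\left(t-\sigma\left(\sum_{\ell}w_{\ell k}o_{\ell}\right)\right)\cdot\sigma\left(\sum_{\ell}w_{\ell k}o_{\ell}\right)\cdot\left(1-\sigma\left(\sum_{\ell}w_{\ell k}o_{\ell}\right)\right)\cdot\frac{\partial}{\partial w_{jk}}\left(\sum_{\ell}w_{\ell k}o_{\ell}\right)$$

Finally, observe that $$\frac{\partial}{\partial w_{jk}}\left(\sum_{\ell}w_{\ell k}o_{\ell}\right)=\sum_{\ell}\delta_{\ell j}o_{\ell}=o_{j},$$ where $\delta_{\ell j}$ is the Kronecker delta. Putting everything together gives the result from the book: $$\frac{\partial}{\partial w_{jk}}F(\ldots,w_{jk},\ldots)=-2\left(t-\sigma\left(\sum_{\ell}w_{\ell k}o_{\ell}\right)\right)\cdot\sigma\left(\sum_{\ell}w_{\ell k}o_{\ell}\right)\cdot\left(1-\sigma\left(\sum_{\ell}w_{\ell k}o_{\ell}\right)\right)\cdot o_{j}.$$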
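As a cross-check, the book's expression $-2\,(t-\sigma)\,\sigma\,(1-\sigma)\,o_j$ can be compared against a finite-difference gradient of the squared error. This is only a sketch in plain Python for a single output node $k$; the function names and the sample values of `w`, `o` and `t` are made up for illustration.

```python
import math


def sigmoid(x):
    """Logistic sigmoid 1 / (1 + e^{-x})."""
    return 1.0 / (1.0 + math.exp(-x))


def squared_error(w, o, t):
    """(t - sigmoid(sum_j w_j * o_j))^2 for one output node."""
    z = sum(wj * oj for wj, oj in zip(w, o))
    return (t - sigmoid(z)) ** 2


def analytic_gradient(w, o, t):
    """Book's formula: -2 (t - s) s (1 - s) o_j for each weight w_j."""
    z = sum(wj * oj for wj, oj in zip(w, o))
    s = sigmoid(z)
    return [-2.0 * (t - s) * s * (1.0 - s) * oj for oj in o]


def numeric_gradient(w, o, t, h=1e-6):
    """Central finite differences, perturbing one weight at a time."""
    grad = []
    for j in range(len(w)):
        w_plus = list(w)
        w_minus = list(w)
        w_plus[j] += h
        w_minus[j] -= h
        diff = squared_error(w_plus, o, t) - squared_error(w_minus, o, t)
        grad.append(diff / (2.0 * h))
    return grad


# Hypothetical sample values, chosen only for the check.
w = [0.3, -0.8, 0.5]
o = [0.9, 0.1, 0.4]
t = 1.0

for a, n in zip(analytic_gradient(w, o, t), numeric_gradient(w, o, t)):
    print(a, n)
```

Each pair of printed values should agree closely. Dropping the minus sign or the trailing $o_j$ from `analytic_gradient` makes the two columns disagree, which is a quick way to see that both factors are needed.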

Your answer made me realize that what I forgot was that within the sigmoid function there is another function, $\sum_j w_{jk} o_j$, which explains why the multiplication by $o_j$ is at the end. What I fail to understand is why the answer in the book has a negative 2. $x^2$ would become $2x$, so isn't the same true for (t-sigmoid)^2 becoming 2(t-sigmoid)? — Jesse, Nov 18 '17 at 23:34
Exactly, but using the chain rule you can derive the answer from the book. — Hector Blandin, Nov 18 '17 at 23:37
