What does "a gradient of 1 on $\hat y$" mean?

Question

Chapter 8.7 of the Deep Learning Book says "Suppose our cost function has put a gradient of 1 on $\hat y$ …" Does it mean the derivative of the cost function w.r.t $\hat y$ is equal to 1?

For example, my model has 2 hidden layers and 1 output layer, so

$\hat y=xw_1w_2w_3$

Suppose the model uses mean squared error as the loss function; then the loss for $n$ data points is defined as

$loss{(\hat y, y)} = {\dfrac {1}{n}}\sum _{i=1}^{n}(y_{i}-{\hat y_{i}})^{2}$

where $\hat y$ is the output vector the model predicts and $y$ denotes the corresponding ground-truth labels.

Does "put a gradient of 1 on $\hat y$" mean the following?

$\dfrac{\partial \ loss{(\hat y, y)}}{\partial \hat y} = 1$

The part cited above comes from Page 314 of the book

Could you maybe add the page number to make it easier to find the context? — csch2, Aug 13 '21 at 04:15
@JJJohn This is for you, the bounty placer. It seems like the assertion is correct. The cost function, evaluated at $\hat{y}$, has a derivative of $1$. Since we want to decrease the cost, we must proceed (back-propagate) in the direction where the cost decreases, i.e. lowering the weights. This is why the analysis considers the update $w \to w-\epsilon \mathbf{g}$. The calculation for the first-order term is given by equation 4.9 (in the same book), and the point is that the second- and third-order effects that appear in the expansion can be too large, hence batch normalization is considered. — Sarvesh Ravichandran Iyer, Aug 27 '21 at 04:25
Note that the derivative isn't $1$ for all values of $\hat{y}$ (if so, then it's a stupid notion of cost). Instead, we have fixed some $w_i$ while initializing the learning, so the $\hat{y}$ corresponding to these parameters is the point at which the derivative of the cost function is $1$. We are obviously looking for a point where the derivative becomes zero, so we proceed in the direction opposite to the local increase of the cost function; in this case the derivative is positive, so we go backward. — Sarvesh Ravichandran Iyer, Aug 27 '21 at 04:27
@TeresaLisbon Thank you. Would you consider moving your comments to answer? — JJJohn, Aug 28 '21 at 11:30
@JJJohn Sure, I will do it shortly (not immediately, because I've got some really pressing non-MSE work) but definitely in like 2-3 hours. I also noticed some edits, so I'll modify my answer likewise as well! — Sarvesh Ravichandran Iyer, Aug 28 '21 at 11:49
@JJJohn I've written the answer and added some references from the same book, which I really enjoyed reading. — Sarvesh Ravichandran Iyer, Aug 28 '21 at 20:58

score 1 · Answer 1 · answered Aug 28 '21 at 20:57

The cost function, evaluated at the current $\hat{y}$, has a derivative of $1$. Since we want to decrease the cost, we must proceed (i.e. back-propagate) in the direction where the cost decreases, i.e. by moving the weights against the gradient.
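As a sketch of why this is a statement about one particular point, assume the mean squared error from the question. Differentiating with respect to a single component of $\hat y$ gives

$$ \frac{\partial \, loss(\hat y, y)}{\partial \hat y_i} = \frac{2}{n}\,(\hat y_i - y_i), $$

which equals $1$ only when $\hat y_i - y_i = n/2$. So "a gradient of 1 on $\hat y$" describes the value of the derivative at the current prediction; it is not an identity that holds for every $\hat y$.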

This is why the analysis considers the update $\mathbf{w} \to \mathbf{w} - \epsilon \mathbf{g}$. The first-order calculation is given by equation 4.5 (in the same book, page 86), which is: $$ x' = x - \epsilon \nabla_x f(x) $$

Equations $4.9$ and $4.10$ from the same book show the second-order effects of gradient descent as well. The formula above, when matched with our situation and the iteration $w \to w - \epsilon g$, shows that our guess is correct.
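To make the update concrete, here is a minimal numerical sketch, assuming the scalar model $\hat y = x w_1 w_2 w_3$ from the question and an upstream gradient of $1$ on $\hat y$. The function names `forward` and `grad_wrt_weights` and all numeric values are illustrative assumptions; they are not taken from the book.

```python
import numpy as np

def forward(x, w):
    """Scalar deep linear model: y_hat = x * w_1 * w_2 * ... * w_l."""
    return x * np.prod(w)

def grad_wrt_weights(x, w, upstream=1.0):
    """Backpropagate an upstream gradient on y_hat to every weight.

    d y_hat / d w_i = x * prod_{j != i} w_j, scaled by the upstream gradient.
    """
    g = np.empty_like(w)
    for i in range(len(w)):
        g[i] = upstream * x * np.prod(np.delete(w, i))
    return g

# Illustrative values (assumptions, chosen only for this example).
x = 1.0
w = np.array([0.5, -1.5, 2.0])
eps = 0.01

# "A gradient of 1 on y_hat": the cost's derivative at the current y_hat is 1.
g = grad_wrt_weights(x, w, upstream=1.0)

# Gradient descent step w -> w - eps * g.
w_new = w - eps * g

# Actual change in y_hat versus the first-order prediction -eps * g^T g.
actual_change = forward(x, w_new) - forward(x, w)
first_order_change = -eps * (g @ g)

print("gradient on weights:", g)
print("actual change in y_hat:", actual_change)
print("first-order prediction:", first_order_change)
```

With these values the actual change in $\hat y$ is close to, but not equal to, the first-order prediction $-\epsilon \mathbf{g}^\top \mathbf{g}$; the gap comes from the higher-order terms in $\epsilon$ that appear when the product of updated weights is expanded.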
