I was watching Andrew Ng's CS229 ML lectures on YouTube and I noticed something when he was explaining gradient descent using contour plots.

[image: contour plot]

He shows how $\theta$ is updated at each iteration of gradient descent, and says that each step always moves in a direction orthogonal to the contour rings. In the image, the optimal value of $\theta_0$ is around $25$, and we start at around $30$. He explains that in linear regression the cost function looks like a bowl with a single global minimum. Given how $\theta_0$ is updated, how is it possible that in the first iteration $\theta_0$ moves further away from its optimal value (from $30$ to $>30$)? If we update $\theta_0$ based on the gradient, how can it move further away from the minimum?

TShiong
  • 1,257
frank
  • 11
  • Think about what gradient descent is doing – Andrew Jul 29 '23 at 21:53
  • In my head, when I plot the cost function with respect to theta0 and theta1, gradient descent checks the partial derivatives and goes towards that direction. The final direction that it moves is a sum of the two directions above. I just can't imagine a scenario where it would go away from the minimum in any direction. – frank Jul 29 '23 at 22:09
  • $f(x+h)=f(x)+h^T\nabla f(x)+o(|h|)$. Now think about how gradient descent works – Andrew Jul 29 '23 at 22:12
  • I don't see it. – frank Jul 29 '23 at 22:43
  • The idea behind gradient descent is choosing $h$ in the above equation to minimize $f(x+h)$. What is problematic is that the above equation describes a qualitative local relationship. Indeed, all it says is that if $h$ is a small enough step in the direction of $-\nabla f(x)$, then $f(x+h)<f(x)$. It doesn't say how small $h$ actually needs to be. – Andrew Jul 30 '23 at 02:10
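A small numerical sketch may help make the comments above concrete. Everything in it is an assumption chosen for illustration, not taken from the lecture: the matrix `A`, the minimum `T_STAR`, the start point, and both learning rates are hypothetical. The cost is a tilted quadratic bowl $J(\theta)=\tfrac12(\theta-\theta^*)^T A(\theta-\theta^*)$, so its contours are rotated ellipses. The start point has $\theta_0=30$ and the minimum has $\theta_0^*=25$, yet the partial derivative with respect to $\theta_0$ at the start is negative, so a descent step increases $\theta_0$ while the cost still goes down. The same sketch also shows the step-size caveat: with too large a learning rate, the same direction increases the cost.

```python
# Hypothetical tilted quadratic bowl (all numbers are illustrative assumptions):
#   J(t) = 0.5 * (t - T_STAR)^T A (t - T_STAR)
A = ((1.0, 3.0), (3.0, 10.0))  # symmetric positive definite, off-diagonal term tilts the ellipses
T_STAR = (25.0, 0.0)           # location of the single global minimum


def cost(t):
    """Value of the quadratic cost at the point t = (theta0, theta1)."""
    d0, d1 = t[0] - T_STAR[0], t[1] - T_STAR[1]
    return 0.5 * (A[0][0] * d0 * d0 + 2.0 * A[0][1] * d0 * d1 + A[1][1] * d1 * d1)


def grad(t):
    """Gradient of the cost, A @ (t - T_STAR), written out for the 2x2 case."""
    d0, d1 = t[0] - T_STAR[0], t[1] - T_STAR[1]
    return (A[0][0] * d0 + A[0][1] * d1, A[1][0] * d0 + A[1][1] * d1)


def step(t, lr):
    """One gradient descent update with learning rate lr."""
    g = grad(t)
    return (t[0] - lr * g[0], t[1] - lr * g[1])


start = (30.0, -3.0)

# Small step: theta0 moves away from 25, but the cost decreases.
small = step(start, 0.05)
print("gradient at start:", grad(start))
print("theta0:", start[0], "->", small[0])
print("cost:  ", cost(start), "->", cost(small))

# Same direction, step too large: the cost increases.
large = step(start, 0.5)
print("cost with large step:", cost(large))

# Repeating the small step still converges to the minimum.
final = start
for _ in range(5000):
    final = step(final, 0.05)
print("after 5000 small steps:", final)
```

The partial derivative with respect to $\theta_0$ depends on $\theta_1$ as well, through the off-diagonal entry of `A`. That is why the sign of $\theta_0-\theta_0^*$ alone does not determine which way $\theta_0$ moves: the step decreases the cost, not the distance of each coordinate to its optimal value.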

0 Answers