Gradient Descent: Optimal fixed step size for a quadratic objective?

Question

All of the literature I'm reading immediately skips to the idea of adjusting the step size as you iterate (as far as I can tell) to maximize the rate of convergence. In the context of neural network modeling, I'm building up from fixed-step gradient descent to the more involved methods for the purpose of minimizing error.

For example, say I have a simple twice-differentiable function $x^2 + 4$. Is there some way to formalize the rate of convergence and subsequently solve for the fixed step size that maximizes this? The claim is that objective quadratic functions have an optimal fixed step size and this can be shown analytically.

Wouldn't you want to use some variant of Newton's method to optimize a quadratic objective function? I'm pretty sure gradient descent is typically considered quite slow and there are quadratic programming methods which will work much faster. — muzzlator, Feb 27 '13 at 18:29
Read through the discussion here:
http://en.wikipedia.org/wiki/Newton's_method_in_optimization — muzzlator, Feb 27 '13 at 18:34
i don't disagree with the notion that it is slow but I'm trying to start at the simplest of the methods and work my way over to NM. — PatternMatching, Feb 27 '13 at 18:37
Fair enough. The way to determine the step size that would be optimal for quadratics in gradient descent is the same as the step size in Newton's method since for quadratic objectives, the search direction is the same. ie) $|\frac{f^{(1)}(x_i)}{ f^{(2)}(x_i)}|$ which for your function $f(x) = x^2 + 4$ gives $|x_0|$ — muzzlator, Feb 27 '13 at 18:48

score 1 · Answer 1 · answered May 06 '22 at 07:45

The optimal step length $\alpha s^k$ (actually for any search direction $\boldsymbol s^k$) for a quadratic objective function $f(\boldsymbol x)$ is given by

$$ \alpha^k = -\frac{ \big( \boldsymbol s^k \big)^T \nabla f\big(\boldsymbol x^k\big)}{\big( \boldsymbol s^k \big)^T Q \boldsymbol s^{k}}. $$

In the gradient descent method, you pick $\boldsymbol s^k = \nabla f\big(\boldsymbol x^k\big)$. Here the minus sign comes into play due to the definition of $\alpha^k$, potentially the positive-definiteness of $Q$ and the update formula $$\boldsymbol x^{k+1} = \boldsymbol x^k + \alpha^k \boldsymbol s^k.$$

In your special case, $Q = 2, \boldsymbol s^k = \nabla f\big(\boldsymbol x^k\big) = 2 x$ and thus $$ \alpha^k = -\frac{2x \cdot 2x }{2x \cdot 2 \cdot 2x} = -\frac12.$$

Gradient Descent: Optimal fixed step size for a quadratic objective?

1 Answers1