Gradient descent vs. Newton's method -- which one requires more computation?

Question

Broadly speaking, when numerically minimizing a d-dimensional objective function:

Gradient descent generally requires more iterations, but each iteration is fast
(we only need to compute 1st derivatives)
Newton's method generally requires fewer iterations, but each iteration is slow
(we need to compute 2nd dervatives too)

My question is: in terms of the total amount of computation required, which one generally ends up being faster -- Newton's method or gradient-descent? Does this depend on $d$? How?

If this is a better question for another site please let me know.

Minor update

If it matters for the sake of comparison, let's assume the function is convex and "typical" (i.e. I'm not going to explicitly choose a function that exhibits the worst-case behavior of either algorithm). I'm just trying to understand what the rule of thumb is regarding the performance of each method.

What precisely do you mean by total amount of computation? Usually for nonlinear programming algorithms, one compares methods in terms of convergence rates. — wonko, Aug 12 '14 at 14:59
@wonko: It's a pretty clear question -- I mean if I run it on a computer, which one will finish faster (perhaps up to a constant factor)? — user541686, Aug 12 '14 at 17:14
Newton-type methods will usually win by far, if done properly. A good reference for the state-of-the-art methods for numerical optimization is Nocedal and Wright's book. — Nick Alger, Jun 12 '15 at 00:12
@NickAlger: Thanks for the tip, I'll take a look at the book. — user541686, Jun 12 '15 at 18:58

score 2 · Accepted Answer · answered Aug 12 '14 at 17:53

2

I think this depends a lot on the structure of the function you are optimizing (duh). In general non-convex cases, both algorithms have the same worst-case complexity for the number of iterations taken to drive the norm of the gradient below some given tolerance. Not sure what this means in terms of actual computation time for instances because the constant factors come into play.

You can look at this paper by Gould et al and references therein for more details.

I think a good thumbrule is - if your problem is convex and you have a reasonably good initial guess, Newton's (or Quasi-Newton) is usually much faster in practice.

answered Aug 12 '14 at 17:53

wonko

485

Are you sure it doesn't strongly depend on d? Intuitively I would expect Newton to become worse than gradient descent as d increases (since it needs second derivatives), but not sure... (and good point about the structure of the function, I think it's safe to assume it's a typical convex function for the sake of the question) – user541686 Aug 12 '14 at 20:57
You're right in the sense that the actual number of computations would depend on $d$ since obviously you are computing more derivative terms as the dimension increases. But I'm not sure there is a general statement you can make about what the tradeoff looks like between iterations and computations per iteration between the two algorithms. – wonko Aug 12 '14 at 21:39
1

@Mehrdad Surprisingly it tends to be quite the opposite - Newton type methods tend to do way better for large $d$. The reason is that it turns out you don't actually need to compute the full $d$-by-$d$ Hessian to apply it's inverse (or an approximation of its inverse) to a vector. The key methods to look into are "inexact Newton-Krylov" and "L-BFGS". – Nick Alger Jun 12 '15 at 19:49
@NickAlger: Very interesting, thanks for the pointers! – user541686 Jun 12 '15 at 20:01

score -1 · Answer 2 · answered Aug 12 '14 at 10:22

-1

In general a function one is trying to find a zero of is given by an explicit formula so the calculation of the derivative is a single-occurrence event and need not to be incorporated into the iteration. For the advantages of Newton's method see http://en.wikipedia.org/wiki/Newton%27s_method

answered Aug 12 '14 at 10:22

Mikhail Katz

42,112
3
66
131

I meant when this is done numerically -- I'll update the question. – user541686 Aug 12 '14 at 10:22
The OP asks about optimization not equation solving, I guess. – Claude Leibovici Aug 12 '14 at 10:38
@Claude, I don't see how this could be answered without some kind of information about the regularity of the function. Does this question make sense? – Mikhail Katz Aug 12 '14 at 10:40
It makes a lot of sense to me. I confess that I implicitly admit the regularity of the function. To me, the main question is : what means "compute the derivatives" ? Are they analytical or numerical ? The problem is very interesting (at least, to me !). Cheers :-) – Claude Leibovici Aug 12 '14 at 10:47
@Claude, computing the derivative only increases each iteration by a constant factor, whereas if you assume regularity, doesn't Newton's method provide an exponential improvement on other methods? – Mikhail Katz Aug 12 '14 at 11:19
I must go now but I shall be delighted to continue this discussion with you tomorrow. Cheers :-) – Claude Leibovici Aug 12 '14 at 11:24
@user72694: If you have to compute the derivative at each step (at the same cost as computing the function) then Newton's method is suboptimal and the secant method does better. If you get the derivative calculation for free -- perhaps as a side-effect of computing the function -- then Newton is faster than secant. – Charles Aug 12 '14 at 17:56

Gradient descent vs. Newton's method -- which one requires more computation?

Minor update

2 Answers2