2

I am working on a high dimensional (N ~ 1000-60000) optimization problem which is currently solved with an LBFGS algorithm. I have experimented with different diagonal preconditioners as I know that the gradients in some dimensions are oders of magnitude larger than in other dimensions and I have observed a significant speed up for the majority of the problem instances I looked at - but not for all. The longer I work with the this simple preconditioner, however, the more I come to the realization that I do not understand why this makes a difference. Shouldn't an LBFGS algorithm account for the scaling in different dimensions automatically? Am I doing something fundamentally wrong when implementing the Jacobi preconditioner with the two following modifications of my objective function:

  1. The first line of code in the objective function is

    x = x./PC

i.e., I a scale my variables x with the preconditioner PC and

  1. The last line of code in the objective function is

[f,dx] = [f, dx./PC]

i.e., I scale the gradient accordingly. x, dx, and PC are vectors. f is the objective function value.

Thank you everybody for your time. I appreciate any help and pointers to relevant literature. The stuff I found focused on CG methods and I could not relate that to my problem...

Mark
  • 21

1 Answers1

2

L-BFGS doesn't scale your problem automatically and can be quite sensitive. Scaling your problem before solving (as you are now doing) is an important step but the L-BFGS update itself can still give rise to ill-conditioned Hessians approximations. For this reason, a few ad-hoc scalings have been developed for the Hessian approximations themselves. They are mostly based on the product $y_k^T s_k$, where $s_k = x_k - x_{k-1}$ is the step and $y_k = \nabla f(x_k) - \nabla f(x_{k-1})$. A good reference for this is the book Numerical Optimization by Nocedal and Wright (Springer, 2nd ed.) I presume you're using an implementation of L-BFGS using the "two loop recursion". In between the two loops, you have a choice to use an "initial" matrix $H_k^0$. A common choice is the diagonal matrix $$ H_k^0 = \frac{s_{k-1}^T y_{k-1}}{y_{k-1}^T y_{k-1}} \, I. $$ Note also that varying the memory parameter can have a significant influence on performance. For some problems, $m=1$ is astounding but most of the time $5 \leq m \leq 10$ gives the best results.

Dominique
  • 3,144
  • I was surprised to find that the $H_k^0$ of Nocedal & Wright (2006) in some cases is better than, for example, using the diagonal part of the Hessian. – a06e Oct 17 '18 at 12:39