0

I am studying line search methods and I stumbled upon the "backtracking algorithm" for calculating a step length $a_k$ that gives a sufficient decrease in our function. To actually compute this sufficient decrease, we use the first Wolfe condition, which is as follows :

$$f(x_k + ap_k) \leq f(x_k) + ca \nabla f_k^Tp_k, c \in (0,1)$$

The backtracking algorithm says: Start with an arbitrary positive $a$. While the above equation isn't true, substitue $a$ with a smaller value, i.e. $$a \leftarrow ρ a$$ where $ρ \in (0,1)$. When we reach a value for $a$ for which the above inequality is satisfied, the algorithm stops.

I have a question that is of a more intuitive nature I think. Why do we make $a$ smaller at every iteration ? Why not make it bigger? If we are talking about a simple convex function like $f(x)=x^2$, I think I can justify the decrease of $a$. But consider the following graph enter image description here

You see that small values of $a$ are acceptable, then there is an interval for which we can't accept $a$ and the finally another interval of acceptable values. If our first estimation of $a$ falls in the interval of non acceptable values, can't we just increase it's value so it goes to the next interval of acceptable values? Because according to the backtracking algorithm, if we land in an non acceptable value of $a$, we will make smaller and smaller until if enters the first acceptable area.

Is there a reason that we want $a$ to be small and we avoid to make it bigger? Or is it because this particular graph isn't convex ? Not sure.

Sorry if my thoughts are a bit jumbled. Thanks in advance.

Note : Picture is taken from "Numerical Optimization, Nocedal & Wright".

Thomas
  • 477

1 Answers1

1

If we increase $a$, we do not know if this will ever result in a valid choice. There is no guaranty for finding another valley farther away from where we are.

But we can be sure that we will find a point below the line $l$ if only we make $a$ small enough. This is because $c\in(0,1),$ which means that $\phi$ is steeper than $l$ at $a=0.$

EDIT: Why can we be sure that we will find a suitable $a$ with $\phi(a)<l(a)$?

Let $g(a)=l(a)-\phi(a).$ Then $g(0)=f(x_k)-f(x_k)=0$ and $g'(0)=(c-1)(\nabla f_k^T p_k)>0,$ because $c-1<0$ and $\nabla f_k^T p_k<0.$

We have $$ g'(0)=\lim_{a\to 0}\frac{g(a)-g(0)}{a-0} = \lim_{a\to 0}\frac{g(a)}{a} $$ From the definition of $\lim$: For each $\varepsilon>0$ there is a $\delta>0$ such that $$ |a-0|<\delta \;\Rightarrow \; \left|\frac{g(a)}{a}-g'(0)\right|<\varepsilon $$ Now set $\varepsilon=\frac12 g'(0)$ and set $\delta$ accordingly, such that the aforementioned implication holds. Then $$ 0<a<\delta \;\Rightarrow\; g'(0)-\frac 12 g'(0) < \frac{g(a)}{a} < g'(0)+\frac 12 g'(0) \\ \;\Rightarrow \; \frac 12 g'(0) < \frac{g(a)}{a} \;\Rightarrow \; g(a) > \frac a2 g'(0) >0 $$

Reinhard Meier
  • 7,331
  • 10
  • 18
  • Thank you ! I have 2 questions. 1) So the graph I posted having a 2nd acceptable interval is coincidental, and noone can guarantee us that it will actually exist? 2) Regarding you 2nd statement. Is this true because we have chosen a descent direction, meaning that at least for small values of $a$, $φ$ will decrease for sure, like in the graph? Of course for large steps we dont know what will happen for sure, but for small values of the step, the function $φ$ is guaranteed to decrease right? – Thomas Nov 15 '19 at 14:54
  • Correct. It is exactly like that. But in order to stay below $l,$ it is not enough to have a descent direction, we also need that $l$ is not as steep as $\phi.$ Therefore we allow $c$ to be in $(0,1)$, only. – Reinhard Meier Nov 15 '19 at 14:58
  • Thanks. Im thinking though. At $a=0$ Both $l$ and $φ$ are equal to $f(x_k)$. Isn't there a chance that $φ$ will always be above $l$ for positive $a$? Or , since $c$ can get very close to $0$ and therefore $l$ will be $f(x_k)$, $φ$ will definitely be under it at some $a$? Sorry if i say silly things, trying to wrap my head around it – Thomas Nov 15 '19 at 15:08
  • Just saw your editted comment. The steepness of $φ$ is $\nabla f(x_k + a_k p_k)p_k$ and the steepness of $l$ is $c \nabla f_k p_k$. The slope of $l$ is the same as the slope of $φ$ at $a=0$ and $φ$ is going to decrease. How can we know for sure that there will be a value of $c$ so the that slope of $l$ will little enough for $l$ to be above $φ$? – Thomas Nov 15 '19 at 15:54
  • 1
    The steepness is not the same at $a=0.$ It differs by a factor of $c.$ And we set the value $c.$ It is a parameter of the algorithm. $l-\phi$ is differentiable and we have $(l-\phi)(0)=0$ and $(l'-\phi')(0)=(c-1)(\nabla f_kp_k)>0.$ Therefore, we can be sure that there is a $\delta>0$ with $(l-\phi)(\delta) > 0.$ – Reinhard Meier Nov 15 '19 at 16:04
  • Actually, you're right. At $a=0$, $φ$ is steepst. What does it mean? That when i move from $0$ and go to positive $a$, for very small steps $φ$ will be steepest. Isnt that the definition of slope/derivative? The rage of change for infinitesimal changes. So for bigger $a$ i dont know if it will continue to be steeper than $l$, but for small ones I can guarantee to find a small enough $a$ for which $φ$ will be steeper, therefore below $l$. Correct? – Thomas Nov 15 '19 at 16:39
  • 1
    @Thomas I have added the required explanations to the answer. – Reinhard Meier Nov 16 '19 at 11:03
  • Thanks a lot for your effort and for the help. I appreciate it. One last question, the comments I typed in my last comment above are correct? – Thomas Nov 16 '19 at 11:12
  • 1
    They are correct and probably helpful for the intuition, but not sufficient for a proof. – Reinhard Meier Nov 16 '19 at 11:19
  • Yes of course, they were more from an intuitive point of view :) Thanks alot you really helped me. Have a good day – Thomas Nov 16 '19 at 11:23