2

Curve fitting problems are solved by minimizing a cost/error function with respect to the model's parameters. Gradient descent and Newton's method are among many algorithms commonly used to minimize this function.

The $L_\infty$ norm can also be used as a cost function for linear/polynomial regression. My question: is it possible to use gradient descent to minimize cost defined by the $L_\infty$ norm (i.e. $\text{cost} = \max|\text{predicted} - \text{actual}|$)? How is the gradient of this function even defined?

rookie
  • 409
  • 5
  • 16
  • 1
    The issue here is that the cost is not differentiable everywhere (take $x \mapsto \max(-x,x)=|x|$ for example). You can deal with it in many ways, for example, re-write the problem as a constrained minimization problem or use an algorithm specifically designed to deal with $\max$ functions. – copper.hat Mar 10 '14 at 17:17
  • @copper.hat Yes, I agree. However, in Excel, you're able to minimize this function using Microsoft's "generalized reduced gradient" algorithm. This leads me to think that there must be some way of defining the gradient of this function. My friend suggested taking the difference of the errors and divide by the change in parameter (a difference quotient). Do you think that would work? – rookie Mar 10 '14 at 17:24
  • 1
    Well, it may work in the sense that it will likely reduce the cost value, but I would guess that it will either be slow or fail near points for which the $\max $ is attained at more than one point. You might be better off using a smooth approximation to $\max$. – copper.hat Mar 10 '14 at 17:27
  • If the underlying problem is convex, there are many methods for solving the problem in a concrete way (I doubt they are in Excel, though). – copper.hat Mar 10 '14 at 17:29
  • @copper.hat How would you guess Excel is solving this problem? – rookie Mar 10 '14 at 17:37
  • I don't know. Usually 'reduced gradient' means there is some equality constraint and the algorithm works on a lower dimensional space (with the remaining variables being functions of these, much as a circle $x^2+y^2=1$ is 'mostly' one dimensional). Sometimes the term is used for projected gradient, but again, this is in the context of constrained optimization. So, I have no idea! – copper.hat Mar 10 '14 at 17:42
  • Thank you, @copper.hat--you've been very helpful! Do you think that this question might be better suited for the statistics forum? – rookie Mar 10 '14 at 17:57
  • 1
    I'm not sure as I can't tell what bearing statistics has on your problem. The problem you are looking at is fairly standard (curve fitting), and I don't know the details of the grg2 algorithm used. I suspect grg2 is more an amalgam of techniques and algorithms rather than a specific type (such as SQP, Newton, etc.) – copper.hat Mar 10 '14 at 18:15

1 Answers1

3

The traditional approach would be to reformulate the problem to make it smooth. In this case, it's easy: $$ \min_x \ \|Ax - b\|_{\infty} $$ is equivalent to $$ \min_{x,t} \ t \quad \text{subject to} \ -te \leq Ax - b \leq te, $$ where $e = (1, 1, \dots, 1)$. This is a linear program that you can solve with a standard LP solver. You can play the same trick with the $\ell_1$ norm, which is often used in curve fitting to mitigate the influence of outliers: $$ \min_x \ \|Ax - b\|_1 $$ is equivalent to $$ \min_{x,s,t} \ \sum_i s_i+t_i \quad \text{subject to} \ Ax-b = s - t, \ (s, t) \geq 0. $$ Again, this is a linear program.

If you want to avoid constraints, you could apply a subgradient method to the nonsmooth objective $\|Ax-b\|_{\infty}$. It's almost identical to applying the gradient method except that instead of a gradient, you compute a subgradient. See http://www.seas.ucla.edu/~vandenbe/236C/lectures/subgradients.pdf.

Dominique
  • 3,144