When does $\ell_1$-norm regularization yield the same result as $\ell_0$-norm regularization?

Question

Often we would like to solve an optimization problem such as $$ \text{minimize} \quad f(x) + \alpha \| x \|_0, $$ where the optimization variable is $x \in \mathbb R^n$, $f:\mathbb R^n \to \mathbb R$ is convex, $\alpha > 0$, and $ \| x \|_0$ is the number of nonzero components of $x$. Unfortunately, this optimization problem is non-convex. So, as a heuristic, people instead solve the problem $$ \text{minimize} \quad f(x) + \beta \| x \|_1, $$ where the "$\ell_0$-norm" is replaced by the $\ell_1$-norm. (Here $\beta > 0$.) This heuristic approach often gives very good results.

Question: Under what conditions is it guaranteed that any minimizer for the second problem is also a global minimizer for the first problem? When is it guaranteed that any minimizer for the second problem has the same sparsity pattern as the global minimizer for the first problem (assuming a global minimizer for the first problem exists and is unique)?

I think this question has been studied extensively in compressed sensing theory, and I'd like to know what kinds of results are available.

(By the way, of course, the "$\ell_0$-norm" is not really a norm.)

The results you're asking for are far from trivial (there are theorems that essentially tell you you can find the right sparsity set + the right sign under a number of conditions but proving this requires a lot of preliminary results, probably too many for a standard MSE answer...). Two important papers for this: https://projecteuclid.org/download/pdfview_1/euclid.aos/1245332830 https://projecteuclid.org/download/pdfview_1/euclid.imsc/1362751196 — tibL, Sep 27 '16 at 14:07

user2664946 · Answer 1 · 2017-01-02T19:14:44.483

There was a lengthy homework problem in a convex optimization class I took that addressed this question.

The problem used various relaxations and duality theory to show that the "reverse-Huber function" was a good convex candidate for approximating the $\ell_0$ norm.

The reverse Huber function $B(x)$ is given by $$ B(x) = \begin{cases} |x| & \text{if } |x| \leq 1, \\ \dfrac{x^2 + 1}{2} & \text{if } |x| > 1. \end{cases} $$

So the candidate that emerges from duality theory, applied componentwise and summed, agrees with the $\ell_1$ norm whenever every element of the input vector has magnitude at most 1. This leads to the qualitative conclusion that your $\ell_1$ optimal solution might not be $\ell_0$ optimal if the vector contains elements larger than 1 in magnitude.
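To make the comparison concrete, here is a minimal sketch, assuming NumPy, that evaluates the reverse Huber penalty and the $\ell_1$ norm on two vectors. The function names and the test vectors are my own illustrative choices, not part of the homework problem.

```python
import numpy as np

def reverse_huber(x):
    """Elementwise reverse Huber function: |x| if |x| <= 1, else (x^2 + 1) / 2."""
    x = np.asarray(x, dtype=float)
    return np.where(np.abs(x) <= 1.0, np.abs(x), (x**2 + 1.0) / 2.0)

def reverse_huber_penalty(x):
    """Sum of the reverse Huber function over the components of x."""
    return float(np.sum(reverse_huber(x)))

def l1_norm(x):
    """Plain l1 norm, for comparison."""
    return float(np.sum(np.abs(np.asarray(x, dtype=float))))

# Hypothetical vectors: one with all entries in [-1, 1], one with larger entries.
small = np.array([0.5, -0.25, 0.0, 1.0])
large = np.array([0.5, -3.0, 0.0, 2.0])

print(reverse_huber_penalty(small), l1_norm(small))  # the two penalties coincide
print(reverse_huber_penalty(large), l1_norm(large))  # reverse Huber is larger
```

On the first vector the two penalties coincide. On the second, the quadratic branch makes the reverse Huber penalty strictly larger than the $\ell_1$ norm.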

For linear regression problems, I vaguely remember that the $\ell_1$ norm ensures thresholding in some situations depending on the singular values of the data matrix.
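I can't reconstruct that result, but the simplest special case I am sure of is denoising, where $f(x) = \tfrac12 \| x - y \|_2^2$, i.e. the data matrix is the identity. There the $\ell_0$ problem is solved by hard thresholding at $\sqrt{2\alpha}$ and the $\ell_1$ problem by soft thresholding at $\beta$. So the sparsity patterns agree when $\beta = \sqrt{2\alpha}$ (ignoring ties at the threshold), although the nonzero values differ. A small sketch, assuming NumPy; the data vector `y` and the value of `alpha` are made up for illustration:

```python
import numpy as np

def hard_threshold(y, alpha):
    """Minimizer of 0.5 * ||x - y||^2 + alpha * ||x||_0: keep y_i iff |y_i| > sqrt(2 * alpha)."""
    y = np.asarray(y, dtype=float)
    return np.where(np.abs(y) > np.sqrt(2.0 * alpha), y, 0.0)

def soft_threshold(y, beta):
    """Minimizer of 0.5 * ||x - y||^2 + beta * ||x||_1: shrink each y_i toward 0 by beta."""
    y = np.asarray(y, dtype=float)
    return np.sign(y) * np.maximum(np.abs(y) - beta, 0.0)

# Hypothetical data and penalty weight, chosen only for illustration.
y = np.array([3.0, -0.4, 1.5, 0.2, -2.5])
alpha = 0.5
beta = np.sqrt(2.0 * alpha)  # match the two thresholds

x_l0 = hard_threshold(y, alpha)
x_l1 = soft_threshold(y, beta)

# Same support, but the l1 solution is shrunk toward zero.
same_support = np.array_equal(x_l0 != 0, x_l1 != 0)
print(x_l0, x_l1, same_support)
```

This only covers the identity data matrix. For a general data matrix neither problem decouples across components like this, which is presumably where conditions on the matrix come in.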

Here is the homework problem that walks through the reasoning for using the reverse Huber function (and in turn the $\ell_1$-norm). I don't remember which homework assignment dealt with the linear regression result I mentioned, but I believe it was introduced in a paper by Laurent El Ghaoui.

Thanks. Can you post or link to the homework problem? — eternalGoldenBraid, Dec 29 '16 at 23:07
