optimization equivalence

Question

Given the functions $f_1(r,x)$ and $f_2(r,y)$: $[0,1]\times \Bbb R \to \Bbb R ^+$, solve the following problem

$$\underset{r,x,y}{\text{argmin}}\; f_1(r,x)+f_2(r,y) \\ \text{subject to}\; x^2+y^2=1$$

When is this equivalent to

$$\underset{r}{\text{argmin}}\; (\underset{x}{\text{argmin}}f_1(r,x)+\underset{y}{\text{argmin}}f_2(r,y)) \\ \text{subject to}\; x^2+y^2=1\;\;?$$

Generally speaking when do we have
$$\underset{x,y}{\text{argmin}}\; f(x,y)=\underset{x}{\text{argmin}}\; \underset{y}{\text{argmin}}\;f(x,y)\;\;?$$

Would the result be any different if the $f_1$ and $f_2$ were convex?

EDIT:

I believe I need to give mote context to the problem that I am trying to solve.

Simplifying a little bit, I have a statistical model where I am trying to model two groups (parameters $x$ and $y$). The parameter $r$ is for both groups.

I want to minimize the negative log-likelihood with the given constraint. That would correspond to the first problem.

Now in reality, I have more then 50 groups (and the corresponding 50 parameters) and still one "global" parameter $r$.

So the idea (maybe wrong?) is to minimize for a given $r$ without the constraint and then to minimize over $r$ so that the constraint is satisfied. I.e. the second problem should be defined as
$$h(r): r \mapsto (x^*,y^*)=(\underset{x}{\text{argmin}}\; f_1(r,x), \underset{y}{\text{argmin}}\;f_2(r,y)) \\ \underset{r}{\text{argmin}}\; (||h(r)||_2^2-1)^2$$

Under which conditions this second problem is equivalent to the first one?

What is meant by the second optimization? You have $\arg\min_r [\arg\min_x f_1(r,x) + \arg\min_y f_2(r,y)]$ but you also have constraints on $x$ and $y$? What does this mean? Specifically, the expression $[\arg\min_x f_1(r,x) + \arg\min_y f_2(r,y)]$ depends only on $r$, and so it is not clear how you are enforcing the constraints $x^2+y^2=1$. — Michael, Jun 08 '16 at 04:30
Perhaps you really mean to minimize the function $h(r)$ over $r \in [0,1]$, where $h(r)$ is the mimimum of $f_1(r,x) + f_2(r,y)$ over all $(x,y) \in \mathbb{R}^2$ such that $x^2+y^2=1$. — Michael, Jun 08 '16 at 04:38
Generally speaking, if $g(w,z)$ is any real-valued function defined over $(w, z) \in \mathcal{W} \times \mathcal{Z}$, where $\mathcal{W}$ and $\mathcal{Z}$ are any (possibly multi-dimensional) sets, then $$\inf_{(w,z) \in \mathcal{W}\times \mathcal{Z}} g(w,z) = \inf_{a \in \mathcal{W}}\left[\inf_{b \in \mathcal{Z}} g(a,b)\right] = \inf_{b \in \mathcal{Z}}\left[\inf_{a \in \mathcal{W}} g(a,b)\right]$$ — Michael, Jun 08 '16 at 04:46

Michael · Accepted Answer · 2016-06-08T23:03:07.370

Here are two comments. The first comment shows the two approaches you consider are generally not the same, even if the $f_1(r,x), f_2(r,x)$ functions are convex in $(r,x)$. The second comment suggests a standard Lagrange approach which may be more helpful.

Comment 1: An example showing an optimality gap even for convex problems.

Consider the following convex functions: \begin{align*} f_1(x,r) &= (r+x)^2\\ f_2(y,r) &= r + y^2 \end{align*}

Problem 1:

Minimize: $f_1(x,r)+f_2(y,r)$

Subject to: $(x,y) \in \mathbb{R}^2, x^2+y^2=1, r \in [0,1]$

This reduces to minimizing $(r+x)^2 + r + y^2$ subject to the constraints (which in turn is equivalent to minimizing $r^2 + 2rx + r + 1$ subject to $x \in [-1,1], r\in [0,1]$). It can be shown the optimal solution is $(x^*, y^*,r^*)=(-1,0,1/2)$ and gives an objective value of $f_1(x^*, r^*)+f_2(y^*,r^*)=3/4$.

Problem 2:

First: For each $r \in [0,1]$ find $\hat{x}(r) = \arg\min_{x\in\mathbb{R}} f_1(x,r)$ and $\hat{y}(r) = \arg\min_{y \in \mathbb{R}} f_2(y,r)$ (I designed the $f_1, f_2$ functions so that the minimizing points are unique). We get: \begin{align*} \hat{x}(r) &= \arg\min_{x \in \mathbb{R}} [(r+x)^2 ]= -r\\ \hat{y}(r) &= \arg\min_{y \in \mathbb{R}} [r + y^2] = 0 \end{align*}

Second: Find all $r\in [0,1]$ such that $\hat{x}(r)^2 + \hat{y}(r)^2 = 1$: We want $r^2 + 0^2 = 1$. The only $r$ that works is $\hat{r}=1$. This gives the vector:

$$ (\hat{x}, \hat{y}, \hat{r})=(-1, 0, 1) $$ for an objective value of $f_1(\hat{x},\hat{r})+f_2(\hat{y},\hat{r}) = 1$. This is not as good as the solution $(x^*,y^*,r^*)$ in problem 1 because that solution also satisfies the constraints but gets a better objective value.

Comment 2: The standard Lagrange multiplier method

It sounds like you have $m$ functions and you want to solve:

Problem A:

Minimize: $\sum_{i=1}^m f_i(x_i,r)$

Subject to: $r \in [0,1], \quad x_i \in \mathbb{R} \quad \forall i \in \{1, ..., m\}, \quad \sum_{i=1}^m x_i^2 = 1$

Let $\lambda$ be a given real number, called a Lagrange multiplier, and consider the simpler problem:

Problem B:

Minimize: $\sum_{i=1}^m [f_i(x_i,r) + \lambda x_i^2]$

Subject to: $ x_i \in \mathbb{R} \quad \forall i \in \{1, ..., m\}, \quad r \in [0,1]$

Let $(x_i^*(\lambda), r^*(\lambda))$ be a solution to problem B for a given $\lambda$. If you can find a value $\lambda \in \mathbb{R}$ for which $\sum_{i=1}^m (x_i^*(\lambda))^2=1$, then $(x_i^*(\lambda), r^*(\lambda))$ is also optimal for problem A. This is true for any functions $f_i(x_i,r)$, regardless of convexity. I wrote up notes on this general topic, see Theorem II.2 in page 7 of these notes: http://ee.usc.edu/stochastic-nets/docs/network-optimization-notes.pdf

Your question also seems similar in spirit to Exercises VIII-F.15-VIII-F.17 in those notes.

yes the problem is similar to Exercises VIII-F.15-VIII-F.17. However, it seems it is not possible to reduce the multidimensional problem (more than 50 parameters) to simpler unidimensional ones (?). PS: Using the Lagrange multiplier does not reduce the dimensionality of the problem. — teucer, Jun 09 '16 at 02:37
Given any $(\lambda, r)$, you solve 50 unidimensional problems to find the best $x_i$ values for that $(\lambda, r)$. You then have to scan over all $r$ and $\lambda$ (so, you have to solve the unidimensional problems many times for each new $(\lambda, r)$ combination). Your proposed approach had no $\lambda$, but still required solving unidimensional problems many times for each possible $r$. So the Lagrange method involves a 2-d search over $(r,\lambda)$ as opposed to your proposed 1-d search over $r$ (more complexity, but comes with an optimality theorem). — Michael, Jun 09 '16 at 21:48
Searching over the best Lagrange multiplier $\lambda \in \mathbb{R}$ to enforce a single constraint is often easy because you know whether to increase or decrease $\lambda$ on the next try, and you can often do a bisection-type search to quickly zero-in on the optimal $\lambda^*$. — Michael, Jun 09 '16 at 21:54
yes you are absolutely right! I have not appreciated the fact that one can fix $\lambda$ and $r$, because there are no constraints anymore. — teucer, Jun 10 '16 at 01:43