I have the following optimization problem.
There are values $p_1', p_2', ..., p_n'$, which can be approximated using the following equations:
$$ \begin{aligned} p_1 &= F_1(g(\vec{a}, b)) \\ p_2 &= F_2(g(\vec{a}, b)) \\ &\;\;\vdots \\ p_n &= F_n(g(\vec{a}, b)) \end{aligned} $$
$\vec{a} = (a_1, \dots, a_m)$ has more unknowns than there are equations ($m>n$). Each $F_i$ is a very complicated differentiable function whose closed form I do not know. $g(\vec{a}, b)$ has the following form:
$$ g(\vec{a}, b) = (1-b)\sum_{i=1}^m a_i A_i + bB, $$ where the $A_i$ and $B$ are known.
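For concreteness, here is a minimal sketch of $g$ in plain Python. It assumes the $A_i$ and $B$ are vectors of equal length; the function name `g`, the list representation, and the toy numbers are only illustrative and not part of my actual problem.

```python
def g(a, b, A, B):
    """Return (1 - b) * sum_i a_i * A_i + b * B, computed elementwise.

    a: list of m floats, b: float,
    A: list of m vectors (each a list of d floats), B: list of d floats.
    """
    d = len(B)
    # Weighted sum of the known vectors A_i.
    mix = [sum(a_i * A_i[k] for a_i, A_i in zip(a, A)) for k in range(d)]
    # Blend the mixture with B according to b.
    return [(1.0 - b) * mix[k] + b * B[k] for k in range(d)]


# Toy data (made up for illustration): m = 3 vectors of dimension d = 2.
A = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
B = [10.0, -10.0]
print(g([0.5, 0.5, 0.0], 0.0, A, B))  # b = 0: pure mixture of the A_i
print(g([0.5, 0.5, 0.0], 1.0, A, B))  # b = 1: the A_i drop out, leaving B
```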
I am only interested in recovering the correct $b$; any $\vec{a}$ is an acceptable solution for me.
I am trying to solve this problem using the fact that $F$ is differentiable by minimizing some differentiable loss function $L(p_1', p_2', ..., p_n', p_1, p_2, ..., p_n)$ using automatic differentiation tools in deep learning frameworks.
I run into the issue that I cannot bring $b$ to the same scale as the entries of $\vec{a}$. Hence, I am using two Adam optimizers to optimize $\vec{a}$ and $b$ separately. However, my problem is that the optimization is very sensitive to the learning rates of both optimizers. For example, if the learning rate of the $b$-optimizer is too high, the loss function is minimized, but $b$ is not correct. The opposite is true if the learning rate of the other optimizer is too high.
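The two-optimizer setup can be sketched roughly as follows. Everything concrete here is an assumption made for illustration: the linear stand-in for $F_i$ (my real $F_i$ has no closed form), the squared-error loss, the toy data, the learning rates, and the hand-written `Adam` class. The gradients are derived by hand rather than by automatic differentiation so that the sketch needs nothing beyond the standard library.

```python
import math

# Toy stand-ins (all assumed): F_i(x) = W[i] . x and a squared-error loss.
A = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # m = 3 known vectors A_i
B = [10.0, -10.0]                          # known vector B
W = [[1.0, 0.0], [0.0, 1.0]]               # n = 2 linear "F_i"
p_target = [3.35, -2.65]                   # p_i' from a = [0.5, 0.5, 0], b = 0.3


def loss_and_grads(a, b):
    """Squared-error loss and its hand-derived gradients w.r.t. a and b."""
    d = len(B)
    mix = [sum(a_j * A_j[k] for a_j, A_j in zip(a, A)) for k in range(d)]
    x = [(1.0 - b) * mix[k] + b * B[k] for k in range(d)]
    resid = [sum(w[k] * x[k] for k in range(d)) - t for w, t in zip(W, p_target)]
    loss = sum(r * r for r in resid)
    # dL/dx_k first, then the chain rule through g.
    gx = [sum(2.0 * r * w[k] for r, w in zip(resid, W)) for k in range(d)]
    grad_a = [(1.0 - b) * sum(gx[k] * A_j[k] for k in range(d)) for A_j in A]
    grad_b = sum(gx[k] * (B[k] - mix[k]) for k in range(d))
    return loss, grad_a, grad_b


class Adam:
    """Minimal Adam for a flat list of parameters, with its own learning rate."""

    def __init__(self, n_params, lr, beta1=0.9, beta2=0.999, eps=1e-8):
        self.lr, self.beta1, self.beta2, self.eps = lr, beta1, beta2, eps
        self.m = [0.0] * n_params
        self.v = [0.0] * n_params
        self.t = 0

    def step(self, params, grads):
        self.t += 1
        new = []
        for k, (p, gr) in enumerate(zip(params, grads)):
            self.m[k] = self.beta1 * self.m[k] + (1 - self.beta1) * gr
            self.v[k] = self.beta2 * self.v[k] + (1 - self.beta2) * gr * gr
            m_hat = self.m[k] / (1 - self.beta1 ** self.t)  # bias correction
            v_hat = self.v[k] / (1 - self.beta2 ** self.t)
            new.append(p - self.lr * m_hat / (math.sqrt(v_hat) + self.eps))
        return new


a, b = [0.4, 0.6, 0.1], 0.0        # a starts near a plausible solution
opt_a = Adam(len(a), lr=1e-3)      # learning rate for the entries of a
opt_b = Adam(1, lr=1e-2)           # separate learning rate for b

initial_loss, _, _ = loss_and_grads(a, b)
for _ in range(500):
    _, grad_a, grad_b = loss_and_grads(a, b)
    a = opt_a.step(a, grad_a)
    b = opt_b.step([b], [grad_b])[0]
final_loss, _, _ = loss_and_grads(a, b)
print(initial_loss, final_loss, b)
```

Because there are more unknowns than equations, a low final loss in a setup like this does not by itself guarantee that the recovered `b` matches the value used to generate the targets, which is exactly the sensitivity I am asking about.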
My initialization starts with $\vec{a}$ as close as possible to the end result. Let's assume that $B$ is very different from $\sum_i^m (a_i A_i)$ and $F_i$ are designed to highlight this difference, thus making $p_i$ different from each other.
Is there an effective way to balance the two learning rates, or another approach to solving this kind of optimization problem? Will the situation be different if $m=n-1$?