Suppose the true relationship is $y=(xb-b)+c+d$. If I had a table of $y$ values for different values of $x$, $b$, $c$ and $d$, but did not know this relationship, how would I go about finding it? Would regression analysis produce the equation, or would I have to plot the values, set some of the variables to zero, and model it manually?

  • If you have all the values of $x,b,c,d$ there are no variables left to fit. You could however plot them and see how much error there is. Could you give an example of some of your data or more context for the problem? – overfull hbox Dec 24 '18 at 20:54
  • But the problem is that we are assuming we don't know the relationship $y=(xb-b)+c+d$, so even if we have the values of $x$, $b$, $c$ and $d$, we don't know how they combine to produce a given $y$ value. How would I plot multiple independent variables against one dependent variable? – Tariro Manyika Dec 26 '18 at 07:45
  • (An empty LaTeX table that did not render in the comment.) – Tariro Manyika Dec 26 '18 at 07:47
  • Your table appears to be misformatted. But to be clear, you have a set of data with points looking like $(x,b,c,d)$ and you want to determine the relationship between them? – overfull hbox Dec 26 '18 at 15:03
  • Yes, that's exactly it. A multiple linear regression model works to an extent, but it isn't good enough. I tried formatting that table; it's copied straight from my LaTeX document, so it's formatted correctly there, but this comment section won't display it properly. – Tariro Manyika Dec 26 '18 at 20:47

1 Answer


One "data driven" approach to finding the relationship between the variables is to run a linear regression onto a set of prespecified functions. This requires that you have some idea about what types of functions relate the variables (for instance, that the relationship is a polynomial of degree $\leq 2$).

First, make vectors: $$ X = \begin{bmatrix}x_1 \\ x_2 \\ \vdots \\ x_n \end{bmatrix}, ~~ B = \begin{bmatrix}b_1 \\ b_2 \\ \vdots \\ b_n \end{bmatrix}, ~~ C = \begin{bmatrix}c_1 \\ c_2 \\ \vdots \\ c_n \end{bmatrix}, ~~ D = \begin{bmatrix}d_1 \\ d_2 \\ \vdots \\ d_n \end{bmatrix}. $$

Now, construct a "library" of possible relationships between your variables. The columns of this matrix should be functions of the variables $x,b,c,d$ applied to each data point. For example, if you expect a polynomial relationship of degree $\leq 2$ you could make the library: $$ A = \begin{bmatrix} | & | & | & | & | & | & | & | & & | \\ 1 & X & B & C & D & X^2 & XB & XC & \cdots & D^2\\ | & | & | & | & | & | & | & | & & | \end{bmatrix} $$ where products and powers are taken entrywise, so that, for example, $$ XB = \begin{bmatrix}x_1b_1 \\ x_2b_2 \\ \vdots \\ x_nb_n \end{bmatrix}. $$

More generally, any column could be $f(X,B,C,D)$, where the $i$-th entry of this column is simply $f(x_i,b_i,c_i,d_i)$.
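As a rough sketch of how such a library could be assembled in practice, the snippet below builds every monomial up to a chosen degree with NumPy. The helper name `build_library`, the dictionary layout, and the toy values are my own assumptions for illustration, not part of the question's data.

```python
from itertools import combinations_with_replacement

import numpy as np


def build_library(columns, degree=2):
    """Return (A, names): all monomials of the given columns up to `degree`.

    `columns` maps a variable name to a 1-D array of its observed values.
    (Hypothetical helper, written only for this illustration.)
    """
    var_names = list(columns)
    n = len(next(iter(columns.values())))
    lib_cols, lib_names = [np.ones(n)], ["1"]  # constant column first
    for deg in range(1, degree + 1):
        for combo in combinations_with_replacement(var_names, deg):
            col = np.ones(n)
            for name in combo:
                col = col * columns[name]  # entrywise product, e.g. x_i * b_i
            lib_cols.append(col)
            lib_names.append("*".join(combo))
    return np.column_stack(lib_cols), lib_names


# Toy values, for illustration only.
data = {
    "x": np.array([1.0, 2.0, 3.0]),
    "b": np.array([4.0, 5.0, 6.0]),
    "c": np.array([0.0, 1.0, 0.0]),
    "d": np.array([2.0, 2.0, 2.0]),
}
A, names = build_library(data, degree=2)
print(names)
print(A.shape)
```

For four variables and degree 2 this gives 1 constant, 4 linear, and 10 quadratic columns. Non-polynomial candidates (say, $\sin x$ or $1/b$) could be appended as extra columns in the same way.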

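Now note that the product $Ac$ gives a linear combination (weighted sum) of the columns of $A$, with the weights given by the entries of $c$ (here $c$ is the vector of coefficients, not the data variable $c$). So if $Y$ is the vector of observed $y$ values, you can solve $Ac = Y$ for $c$ to find how $y$ depends on the columns of your library. Of course, in practice the system will usually be overdetermined or noisy, so you will have to solve the least squares problem $\min_c \Vert Y-Ac \Vert_2$.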

If your data exactly satisfies $y=(xb-b)+c+d$, then $c$ will have a coefficient of $1$ on the $XB$ column, $-1$ on the $B$ column, $1$ on the $C$ and $D$ columns, and $0$ everywhere else.

Example

Suppose we have data:

Y,   X,  B,  C,  D
16,  6,  2,  2,  4
22,  2,  7,  6,  9
5,   1,  4,  1,  4
33,  5,  7,  0,  5
13,  1,  4,  9,  4

If we know that $Y$ is a linear function of $X$, $B$, $C$, $D$, and $XB$, we can form our library A = [X,B,C,D,XB]:

[[ 6,  2,  2,  4, 12],
 [ 2,  7,  6,  9, 14],
 [ 1,  4,  1,  4,  4],
 [ 5,  7,  0,  5, 35],
 [ 1,  4,  9,  4,  4]]

Now, solving the least squares problem gives:

c = [0,-1,1,1,1]

This tells us that $y = 0\cdot x + (-1)\cdot b + 1\cdot c+1\cdot d+1\cdot xb$, which is exactly what we expected.
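The example can be reproduced with a few lines of NumPy. This is only a sketch of one way to do it: `np.linalg.lstsq` is NumPy's general least squares solver, and the variable names are my own choices.

```python
import numpy as np

# Data from the example table; columns are Y, X, B, C, D.
table = np.array([
    [16, 6, 2, 2, 4],
    [22, 2, 7, 6, 9],
    [5, 1, 4, 1, 4],
    [33, 5, 7, 0, 5],
    [13, 1, 4, 9, 4],
], dtype=float)
Y, X, B, C, D = table.T

# Library A = [X, B, C, D, XB], with XB the entrywise product.
A = np.column_stack([X, B, C, D, X * B])

# Solve min_c ||Y - A c||_2.
coeffs, residuals, rank, singular_values = np.linalg.lstsq(A, Y, rcond=None)

for name, value in zip(["x", "b", "c", "d", "xb"], coeffs):
    print(f"{name:>2s}: {value: .6f}")
print("rank of A:", rank)
```

Here $A$ is square and has full rank, so the least squares solution is also the exact solution of $Ac = Y$. Up to floating-point rounding, the printed coefficients should match the vector given above.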

Now, if you didn't know that the only product term would be $xb$, you could have added more functions to the library; ideally, least squares would then give coefficients of 0 for the functions that do not appear in the relationship (in our example we get a coefficient of 0 for $x$, since there is no standalone $x$ term). How well this works depends on how much data you have and how noisy it is. If you think the relationship is simple, you could promote sparsity through L1 regularization by instead solving: $$ \min_c \Vert Y-Ac \Vert_2 + \lambda \Vert c \Vert_1 $$ where $\lambda$ is a tunable parameter.
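To give a feel for the L1-regularized version, here is a minimal sketch using proximal gradient descent (ISTA). Note that it minimises the common squared variant, $\tfrac12\Vert Y-Ac\Vert_2^2 + \lambda\Vert c\Vert_1$ (the lasso), rather than the unsquared norm written above. The synthetic data, the noise level, the choice of extra library columns, and $\lambda = 1$ are all assumptions made for illustration.

```python
import numpy as np


def soft_threshold(v, t):
    """Shrink each entry of v toward zero by t (proximal map of t * ||.||_1)."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)


def lasso_ista(A, Y, lam, n_iter=5000):
    """Minimise 0.5 * ||Y - A c||_2^2 + lam * ||c||_1 by proximal gradient."""
    # Step size 1/L, where L = (largest singular value of A)^2 bounds the
    # curvature of the quadratic term.
    step = 1.0 / np.linalg.norm(A, 2) ** 2
    c = np.zeros(A.shape[1])
    for _ in range(n_iter):
        grad = A.T @ (A @ c - Y)  # gradient of the quadratic term
        c = soft_threshold(c - step * grad, step * lam)
    return c


def objective(A, Y, c, lam):
    return 0.5 * np.sum((Y - A @ c) ** 2) + lam * np.sum(np.abs(c))


# Synthetic, slightly noisy data generated from y = (xb - b) + c + d.
rng = np.random.default_rng(0)
n = 200
X, B, C, D = rng.uniform(-2, 2, size=(4, n))
Y = (X * B - B) + C + D + 0.01 * rng.standard_normal(n)

# A library with a few candidate terms that are NOT in the true relationship.
names = ["1", "x", "b", "c", "d", "x*b", "x*c", "x*d", "b*c", "x^2"]
A = np.column_stack([np.ones(n), X, B, C, D, X * B, X * C, X * D, B * C, X**2])

lam = 1.0
c_hat = lasso_ista(A, Y, lam)
for name, value in zip(names, c_hat):
    print(f"{name:>4s}: {value: .3f}")
```

With enough data and little noise, the coefficients on `b`, `c`, `d`, and `x*b` should come out close to $-1, 1, 1, 1$ and the rest close to zero. Larger $\lambda$ pushes more coefficients to exactly zero, at the cost of shrinking the ones that remain.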