
I have an interesting problem. Say I have lots of datasets like this:

a = 21
b = 23
c = 58
d = 498
etc (lots of other values)

X = 85

I need to find the formula that derives X from a, b, c, d etc, with the added complication that I don't know whether all of the values affect X or whether some have no effect on it. Is there a generic method to do that?

I cannot vary a, b, c and d and check the derived value of X; however, I have a huge number of these datasets (combinations of values and the resulting X) to look at. I have some programming skills, so I am able to analyse all of these datasets using an algorithm, but I have no idea what that algorithm should be. Any help would be appreciated.

Note: I am new to this site, and don't know which tags to use, so feel free to retag this.

EDIT: Each dataset contains the same number of values, and the positions are fixed, i.e. 'a' of one dataset corresponds to the 'a' in others.

Bluefire
  • In general, for a finite sequence of numbers there is no way to tell which one should be 'next', i.e. to tell what $X$ should be. Is there any additional structure to how $X$ relates to $a$, $b$, $c$, $d$, etc? – Servaes Jun 01 '14 at 12:36
  • I have a general idea for which of a, b, c, d et al are related to X, but I'm not sure. But surely, with the huge volume of data that I have, I should be able to find a relationship? – Bluefire Jun 01 '14 at 12:44
  • Entering $1+1$ into your calculator and pressing enter, it will respond $2$ a million times over. But there's no way to be sure (mathematically) that it will always do so unless you know something about the inner workings of your calculator. You will need to know (or assume) something about how the output relates to the input if you want to find a relationship mathematically. – Servaes Jun 01 '14 at 12:51
  • I'm not quite sure what you mean. I've probably misunderstood you, but I can assume that the calculator I have here is consistent, that is, if a, b, c, d etc are the same, then X will always be the same. – Bluefire Jun 01 '14 at 12:52
  • I admit I was a bit vague. I am indeed assuming that your process is consistent.

    What you are asking for is an algorithm that, given an arbitrary sequence of numbers, outputs the next number in the sequence. But there is no way to determine what the next number should be. In fact, a sequence is defined by giving all of its terms, so any number could be next.

    – Servaes Jun 01 '14 at 12:56
  • Unless you have some restrictions, i.e. some relations, which your sequence should satisfy. – Servaes Jun 01 '14 at 12:56
  • Relations... like what? Maximum values? The maximum value of any parameter (a, b, c, d, etc or X) is 99. Anything else? – Bluefire Jun 01 '14 at 12:58
  • This certainly narrows things down, but it is not sufficient to determine $X$. Precisely what would be sufficient is a difficult question. An example of a relation would be: "If $a$ is doubled, then so is $X$", or "$X$ is less than the sum of all the inputs". Do you have any relation like this between input and output? – Servaes Jun 01 '14 at 13:01
  • Right, I think I understand now. I have an assumption that I'm not sure is true, but I guess I will have to stick with it. The assumption is that X is a weighted average of all the other data, so a might have half the weight of b, twice that of c, and d might have no weight at all. – Bluefire Jun 01 '14 at 13:04
  • Then it remains to determine the weights of each of the variables. For this you need at least as many data sets as you have variables. However, if you have more, then there might be no exact solution (meaning your assumption might be false). – Servaes Jun 01 '14 at 13:05

1 Answer


If you think there is a linear relationship between the inputs $a, b, c$, etc., and the output $x$ (your $X$), then you could find the least-squares solution to the system of equations $\mathbf {Ay = X}$. The matrix $\mathbf A$ has one row per dataset, of the form $[a_i\ b_i\ c_i \ldots]$, and $\mathbf X$ is a column vector containing the corresponding outputs $x_i$. The vector $\mathbf y$ holds the weights in your weighted average.
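Written out for $n$ datasets, the system might look as follows. This is only a sketch of the layout; the labels $y_a, y_b, y_c, \ldots$ for the individual weights are names introduced here for illustration and do not appear elsewhere in the thread.

$$
\begin{bmatrix}
a_1 & b_1 & c_1 & \cdots \\
a_2 & b_2 & c_2 & \cdots \\
\vdots & \vdots & \vdots & \\
a_n & b_n & c_n & \cdots
\end{bmatrix}
\begin{bmatrix}
y_a \\ y_b \\ y_c \\ \vdots
\end{bmatrix}
=
\begin{bmatrix}
x_1 \\ x_2 \\ \vdots \\ x_n
\end{bmatrix}
$$

A weight close to zero in the fitted $\mathbf y$ would suggest that the corresponding input has little or no effect on $x$.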

The system $\mathbf {Ay = X}$ does not necessarily have a solution, but you can find the "best fit" by multiplying both sides by $\mathbf A^t$ and solving the resulting system; i.e., $\mathbf {A}^t\mathbf{Ay} = \mathbf{A}^t\mathbf{X}$.

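Thus, provided $\mathbf{A}^t\mathbf{A}$ is invertible (that is, the columns of $\mathbf A$ are linearly independent), the best-fit solution for your weights is $\mathbf{\hat y} = (\mathbf{A}^t\mathbf{A})^{-1}\mathbf{A}^t\mathbf{X}$.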
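As a rough illustration of the formula above, here is a minimal pure-Python sketch that forms $\mathbf{A}^t\mathbf{A}$ and $\mathbf{A}^t\mathbf{X}$ and solves the resulting square system by Gaussian elimination. The function names and the sample data are made up for this example (the outputs are generated from assumed weights so the result can be checked); in practice a numerical library would be the more robust choice.

```python
def transpose(M):
    """Return the transpose of a matrix given as a list of rows."""
    return [list(col) for col in zip(*M)]


def matmul(P, Q):
    """Multiply two matrices given as lists of rows."""
    Qt = transpose(Q)
    return [[sum(p * q for p, q in zip(row, col)) for col in Qt] for row in P]


def solve(M, rhs):
    """Solve the square system M z = rhs by Gaussian elimination
    with partial pivoting."""
    n = len(M)
    # Augmented matrix [M | rhs], converted to floats.
    aug = [[float(v) for v in M[i]] + [float(rhs[i])] for i in range(n)]
    for col in range(n):
        # Pick the row with the largest entry in this column as the pivot.
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("A^t A is singular: columns of A are dependent")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        # Eliminate the entries below the pivot.
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for k in range(col, n + 1):
                aug[r][k] -= factor * aug[col][k]
    # Back substitution.
    z = [0.0] * n
    for i in reversed(range(n)):
        s = sum(aug[i][k] * z[k] for k in range(i + 1, n))
        z[i] = (aug[i][n] - s) / aug[i][i]
    return z


def least_squares_weights(A, X):
    """Solve the normal equations (A^t A) y = A^t X for the weights y."""
    At = transpose(A)
    AtA = matmul(At, A)
    AtX = [sum(a * x for a, x in zip(row, X)) for row in At]
    return solve(AtA, AtX)


# Hypothetical data: each row is one dataset [a_i, b_i, c_i].
A = [
    [21, 23, 58],
    [40, 10, 77],
    [5, 90, 33],
    [62, 48, 12],
    [15, 15, 99],
]
# Assumed "true" weights, used only to generate the outputs for this demo.
# The third input is given weight 0, i.e. it has no effect on x.
true_weights = [0.5, 0.25, 0.0]
X = [sum(w * v for w, v in zip(true_weights, row)) for row in A]

weights = least_squares_weights(A, X)
print(weights)
```

Because the sample outputs here are exactly linear in the inputs, the fitted weights should match the assumed ones up to rounding error, with the third weight coming out near zero. With real, noisy data the fit would only be approximate, and the size of the residuals $\mathbf{A\hat y} - \mathbf{X}$ would indicate how well the linear assumption holds.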

Théophile
  • What do you mean by $\mathbf A^t$? – Bluefire Jun 01 '14 at 13:56
  • I mean the transpose of the matrix $\mathbf A$. – Théophile Jun 01 '14 at 14:00
  • I'm not too experienced with matrices and linear algebra, so I don't know what that is D: Is software like Matlab able to process this for me? – Bluefire Jun 01 '14 at 14:06
  • Yes, Matlab would certainly be able to do this. In fact, it is such a common problem that, from what I see, the syntax in Matlab is as simple as "$\mathtt{y = A\backslash X}$". Here's a link to the Matlab documentation. Note that the variables on that page are slightly different from here; they're solving the system $\mathbf{Ax = B}$, so you'll have to relabel accordingly. – Théophile Jun 01 '14 at 14:36