
Very often, I can easily see that my data has a "pattern". This pattern usually resembles something as simple as multiplying the previous point by 1.2 or 1.3. But it can also appear to be exponential or parabolic. The problem is when I know that there "should be one" but I'm not smart enough to extract it mathematically.

Is there a way to convert data into an equation?

For instance, I'm stumped:

3, 3000
5, 1000 = 0.333
7, 500 = 0.5
9, 300 = 0.6
11, 200 = 0.6667
13, 140 = 0.7
15, 105 = 0.75
17, 81.67 = 0.7778
19, 65.33 = 0.8
21, 53.35 = 0.816667

I "know" I should be seeing a pattern here. But, what is it? And, how can I get it? I do not know.

  • You have three numbers in each field (except first one). What do they stand for? Also, what does comma mean, and what does equal sign? – Kaster May 27 '14 at 21:55
  • Sorry about that: x, y = percent – J. H. George May 27 '14 at 21:59
  • So, if I understood you correctly, you wanna get some kind of relationship (function of two variables) that connects pair of numbers $(x,y)$ with some percentage value $p$, right? – Kaster May 27 '14 at 22:01
  • The last term at the right appears to be the quotient of $y$ and the previous $y$ (that's why it doesn't appear in the first line). – Raymond Manzoni May 27 '14 at 22:02
  • Correct - to both of you. @Kaster Yes, with this example and I'm also hoping what I learn here will work with other data patterns. – J. H. George May 27 '14 at 22:08
  • The square of $x$ multiplied by $y$ is slowly decreasing. A rough approximation for $y$ could be $$y\approx \frac {23000}{(x-0.23)^2}$$ – Raymond Manzoni May 27 '14 at 22:26
  • Thanks very much, but what technique helped you to get this? – J. H. George May 27 '14 at 22:53
  • No decisive method I fear... $x$ is increasing while $y$ is decreasing so that an idea is to plot $x\cdot y$ (and obtain a decreasing curve). $y$ is decreasing faster than $x$ is increasing so that $x^2\cdot y$ may be more interesting but still decreasing. Powers a little higher than $2$ will make the curve increase for $x\gg 1$ and show (kind of) irregular values. After that I searched a satisfying value of $a$ such that $(x-a)^2\cdot y\approx$ constant (say so that the first and last value are equal). I don't think there is a general method to obtain such approximations : – Raymond Manzoni May 27 '14 at 23:29
  • you have to try/guess a formula first and adjust or add parameters to get better results. This last part may be solved with computer algebra by searching the parameters $(p_i)$ (for $i=1\cdots m$) such that $\sum_n (f(x_n,p_1,p_2,\cdots,p_m)-y_n)^2$ will be minimal. For linear functions $f$ you may use linear regression while for more complicated ones you may use gradient descent, simulated annealing or one of the numerous optimization methods available. – Raymond Manzoni May 27 '14 at 23:30
  • @Raymond Manzoni - That's what I suspected, "try/guess". It is my current method too. I will try to learn more about the other methods you gave though. If you cut-n-paste your comments into an answer, then I'll select it - thanks again for your help. – J. H. George May 28 '14 at 13:22

2 Answers


I don't know a general method either, and I consider experience in guessing functions from their visual appearance important.

One idea is to search an implicit function $\;I(x_i,y_i)\approx K\;$ for $\,i=1,2,\cdots ,n\;$ with $K$ constant (independent of $i$).

Since $x_i$ is increasing with $i$ while $y_i$ is decreasing it may be interesting to plot the graph of $i\mapsto x_i\cdot y_i$ (obtaining a decreasing curve).
$y_i$ is decreasing faster than $x_i$ is increasing so that $i\mapsto I_0(x_i,y_i)=(x_i)^2\cdot y_i$ may be more interesting.
This is still decreasing, but rather slowly, and powers a little higher than $2$ will make the curve increase for larger $x_i$. We could be near an acceptable solution and not require faster or slower growing functions like $e^{x_i}$ or $\log(x_i)$ (or whatever). The curve is somewhat irregular (while the $x_i$ are regular), which hints that a simple and exact solution may not exist.
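The comparison of candidate invariants described above can be sketched in a few lines of Python. This is only an illustration on the data from the question; the helper names `invariant_candidates` and `spread` are made up for this sketch, and the max/min ratio is just one possible way to measure how "constant" a sequence is.

```python
# Data from the question: (x, y) pairs.
xs = [3, 5, 7, 9, 11, 13, 15, 17, 19, 21]
ys = [3000, 1000, 500, 300, 200, 140, 105, 81.67, 65.33, 53.35]


def invariant_candidates(xs, ys, powers=(1, 2, 3)):
    """Return {power: [x**power * y for each data point]}."""
    return {p: [x**p * y for x, y in zip(xs, ys)] for p in powers}


def spread(values):
    """Ratio max/min of a positive sequence; 1.0 would mean exactly constant."""
    return max(values) / min(values)


candidates = invariant_candidates(xs, ys)
for p, vals in candidates.items():
    # A spread close to 1 suggests x**p * y is nearly constant.
    print(f"x^{p} * y: spread = {spread(vals):.3f}")
```

The power whose products have the spread closest to $1$ is the most promising starting guess; here that is the power $2$.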

Let's add a small perturbation with the term '$a$' in $\;I_a(x_i,y_i):=(x_i-a)^2\cdot y_i\;$; then solving $\;I_a(x_1,y_1)=I_a(x_n,y_n)=K\,$ returns $\,a\approx 0.23\,$ and $K\approx 23000$.
This gives the rough approximation: $$y\approx \frac {23000}{(x-0.23)^2}$$ Once a general function has been guessed (for example $\;y:=f(x,p_1,p_2,p_3)=\dfrac {p_1}{x^2+p_2x+p_3}\,$ here), different optimization methods are available to find the best-fitting parameters $p_j$ for $j=1,\cdots,m$, that is, those for which $\;\sum_i (f(x_i,p_1,p_2,\cdots,p_m)-y_i)^2\,$ is minimal: linear, quadratic or cubic regression for simple cases, and gradient descent, simulated annealing and so on for more complicated functions.
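The condition $I_a(x_1,y_1)=I_a(x_n,y_n)$ can be solved in closed form. A minimal Python sketch follows, assuming $a<x_1$ so that the positive square roots can be taken on both sides, $(x_1-a)\sqrt{y_1}=(x_n-a)\sqrt{y_n}$. The function name `shift_and_constant` is made up for this illustration.

```python
import math

# Data from the question: (x, y) pairs.
xs = [3, 5, 7, 9, 11, 13, 15, 17, 19, 21]
ys = [3000, 1000, 500, 300, 200, 140, 105, 81.67, 65.33, 53.35]


def shift_and_constant(x_first, y_first, x_last, y_last):
    """Solve (x_first - a)^2 * y_first = (x_last - a)^2 * y_last = K.

    Assumes a < x_first < x_last, so both (x - a) factors are positive
    and the equation reduces to a linear one in a after taking roots.
    """
    r_first, r_last = math.sqrt(y_first), math.sqrt(y_last)
    a = (x_first * r_first - x_last * r_last) / (r_first - r_last)
    K = (x_first - a) ** 2 * y_first
    return a, K


a, K = shift_and_constant(xs[0], ys[0], xs[-1], ys[-1])
print(f"a = {a:.4f}, K = {K:.1f}")

# Relative error of y ≈ K / (x - a)^2 at every data point.
rel_errors = [abs(K / (x - a) ** 2 - y) / y for x, y in zip(xs, ys)]
print(f"largest relative error: {max(rel_errors):.3%}")
```

By construction the approximation is exact at the first and last points; the relative errors show how far it drifts in between.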

More about 'curve fitting' may be found at Wikipedia.


Let's add that quick solutions may sometimes be obtained with tools like Wolfram Alpha.

Asking for a 'curve fit' of your $(x_i,y_i)$ points returns this result:

[image: fit y]

Not so good, so let's try the same operation with $(x_i,1/y_i)$, which gives a better result:

[image: fit 1/y]

Note that the correlation is rather good even in the simple quadratic case (that should correspond nearly to the solution I proposed earlier).

From this we see that polynomial regressions (degrees larger than $4$ should be avoided anyway) may give excellent results with some preliminary work on the data. For exponential growth you could use $(x_i,\log(y_i))$ or $(\log(x_i),\log(y_i))$ or a whole table of transformations adapted to your data.
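The quadratic fit on the transformed data $(x_i,1/y_i)$ does not need Wolfram Alpha. Here is a rough pure-Python sketch that solves the normal equations directly; the helper names `solve_linear` and `fit_quadratic` are invented for this example, and a library routine would normally be preferable. Writing $1/y\approx a x^2+bx+c$ and comparing with $y=p_1/(x^2+p_2x+p_3)$ gives $p_1=1/a$, $p_2=b/a$, $p_3=c/a$.

```python
# Data from the question: (x, y) pairs.
xs = [3, 5, 7, 9, 11, 13, 15, 17, 19, 21]
ys = [3000, 1000, 500, 300, 200, 140, 105, 81.67, 65.33, 53.35]


def solve_linear(A, b):
    """Solve A z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    z = [0.0] * n
    for i in reversed(range(n)):  # back substitution
        s = sum(M[i][j] * z[j] for j in range(i + 1, n))
        z[i] = (M[i][n] - s) / M[i][i]
    return z


def fit_quadratic(us, vs):
    """Least-squares coefficients (a, b, c) of v ≈ a*u^2 + b*u + c."""
    rows = [[u * u, u, 1.0] for u in us]
    AtA = [[sum(r[i] * r[j] for r in rows) for j in range(3)] for i in range(3)]
    Atv = [sum(r[i] * v for r, v in zip(rows, vs)) for i in range(3)]
    return solve_linear(AtA, Atv)


inv_ys = [1.0 / y for y in ys]          # transformed data 1/y
a, b, c = fit_quadratic(xs, inv_ys)
p1, p2, p3 = 1.0 / a, b / a, c / a      # back to y = p1 / (x^2 + p2*x + p3)
print(f"p1 = {p1:.3f}, p2 = {p2:.6f}, p3 = {p3:.6f}")

# Largest residual of the fit, measured on 1/y (the quantity actually fitted).
max_resid = max(abs(a * x * x + b * x + c - v) for x, v in zip(xs, inv_ys))
print(f"largest residual on 1/y: {max_resid:.2e}")
```

Note that this fit minimizes the squared errors on $1/y$, not on $y$ itself.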

Hoping this helped more,

Raymond Manzoni

You have received good and detailed answers from Raymond Manzoni. However, I would like to point out a few things.

Suppose that your model is $$y=\dfrac {p_1}{x^2+p_2x+p_3}$$ This is nonlinear with respect to the parameters, and minimizing the sum of squared errors is problematic as long as you do not have "reasonable" starting guesses. As Raymond Manzoni did, you can make the model linear by writing it as $$\frac{1}{y}=a x^2 + b x + c$$ which is linear in $a,b,c$ and easy to solve. However, you must notice that, in this case, you try to minimize $$SSQ=\sum_{i=1}^{i=N} \Big(\frac{1}{y_i}-\frac{1}{y_i^{*}}\Big)^2$$ (with $y_i^{*}$ the model value at $x_i$) while, for the initial problem, your goal is to minimize $$SSQ=\sum_{i=1}^{i=N} \Big({y_i}-{y_i^{*}}\Big)^2$$ So the first step is very good for providing estimates of $p_1,p_2,p_3$ from the solution values of $a,b,c$, and from these you can start the nonlinear regression.

I used your data and got the following results :

  • at the end of the first step, $p_1=23111.567$, $p_2=-0.400060$, $p_3= -0.343428$ and the sum of squares computed for the $y$'s equals $2937.59$
  • using the above values as initial guesses for the nonlinear model, the parameters are $p_1= 23618.883$, $p_2=-0.721977$, $p_3= -0.134989$ for which the sum of squares is $11.18$ which is very different.

In other words, when you use regression, the sum of squares to be minimized must be based on the measured values and not on any of their possible transforms (the distribution of errors on $y$ does not have much to do with the distribution of errors on $\frac{1}{y}$).
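The second step (nonlinear regression on $y$ itself) could look like the sketch below. It is only an illustration: it uses a damped Gauss-Newton iteration, which is one possible choice and not necessarily the method used for the numbers above, and it starts from the rough approximation $23000/(x-0.23)^2$, i.e. $p_1=23000$, $p_2=-0.46$, $p_3=0.0529$, rather than from the linearized fit. All function names are made up for this example.

```python
# Data from the question: (x, y) pairs.
xs = [3, 5, 7, 9, 11, 13, 15, 17, 19, 21]
ys = [3000, 1000, 500, 300, 200, 140, 105, 81.67, 65.33, 53.35]


def ssq_y(p):
    """Sum of squared errors on y itself for y = p1 / (x^2 + p2*x + p3)."""
    denoms = [x * x + p[1] * x + p[2] for x in xs]
    if any(d <= 0 for d in denoms):
        return float("inf")  # reject a non-positive denominator at a data point
    return sum((y - p[0] / d) ** 2 for y, d in zip(ys, denoms))


def solve_linear(A, b):
    """Solve A z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    z = [0.0] * n
    for i in reversed(range(n)):
        s = sum(M[i][j] * z[j] for j in range(i + 1, n))
        z[i] = (M[i][n] - s) / M[i][i]
    return z


def gauss_newton(p, iterations=50):
    """Damped Gauss-Newton: a step is accepted only if it lowers ssq_y."""
    for _ in range(iterations):
        rows, res = [], []
        for x, y in zip(xs, ys):
            d = x * x + p[1] * x + p[2]
            # Partial derivatives of p1/d with respect to p1, p2, p3.
            rows.append([1.0 / d, -p[0] * x / d**2, -p[0] / d**2])
            res.append(y - p[0] / d)
        JtJ = [[sum(r[i] * r[j] for r in rows) for j in range(3)] for i in range(3)]
        Jtr = [sum(r[i] * e for r, e in zip(rows, res)) for i in range(3)]
        step = solve_linear(JtJ, Jtr)
        t, current = 1.0, ssq_y(p)
        while t > 1e-8:  # halve the step until the fit improves
            trial = [pi + t * si for pi, si in zip(p, step)]
            if ssq_y(trial) < current:
                p = trial
                break
            t /= 2
        else:
            break  # no improving step found: stop iterating
    return p


p_start = [23000.0, -0.46, 0.0529]  # from y ≈ 23000 / (x - 0.23)^2
p_fit = gauss_newton(p_start)
print("start:", p_start, "SSQ on y =", ssq_y(p_start))
print("fit:  ", p_fit, "SSQ on y =", ssq_y(p_fit))
```

Because every accepted step must lower the sum of squares on $y$, the refined parameters are never worse than the starting guess by that measure.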