2

In Excel I created some data. The variables are

Time
Length
Width
height
ChocolateQuality
price

The response variable is Price. I used the following formula to generate the values for price:

price =  B2*C2*D2*5  +  2*E2  +  SIN(A2/3.14)
or
price = Length*Width*height*5  +  2*ChocolateQuality  +  sin(Time/3.14)

I am puzzled that using linear regression on this data creates such a high Multiple R (0.969034397). The price variable is not linearly related to Length, Width, or height so I would not expect such a good result. Can anyone explain why this works?

I was trying to show that the creation of a new variable, the product of height, width, and length, should improve the regression results (and it does raise it to .99) but I didn't expect the result without the creation of the new variable to be so high. Regarding the input data I used formulas for length =4+3RAND(), Width = =3+2RAND(), height =2+1*RAND(), and ChocoQuality =RANDBETWEEN(1,4). RAND() is an Excel function that generates random numbers betweein -1 and +1. Randbetween generates integers between 1 and 4 (in this case). I created 325 rows of data in this way.

Data looks like this but has 325 rows:

Length  Width   height  ChololateQuality    Price
6.584903273 3.724123946 2.729031731 1   336.6201706
6.926510813 4.805111112 2.810795911 1   468.8262491
4.229677111 3.746150844 2.00365992  2   163.3347825
5.621000949 4.869836742 2.190494881 3   306.4677664
6.245666039 3.141578124 2.520717921 2   251.6919095
4.866733045 3.342057324 2.609198462 2   215.3510926
4.687606927 4.779600152 2.6980397   1   304.8411104
Glorfindel
  • 3,955
  • I was trying to show that the creation of a new variable, the product of height, width, and length, should improve the regression results (and it does raise it to .99) but I didn't expect the result without the creation of the new variable to be so high. Regarding the input data I used formulas for length =4+3RAND(), Width = =3+2RAND(), height =2+1*RAND(), and ChocoQuality =RANDBETWEEN(1,4). RAND() is an Excel function that generates random numbers betweein -1 and +1. Randbetween generates integers between 1 and 4 (in this case). I created 325 rows of data in this way. – onyourmark Nov 27 '22 at 12:21
  • Oh, I see. Everything is random, and you generated the price with a linear formula in the volume. – Jean-Claude Arbaut Nov 27 '22 at 12:25
  • yes exactly, I created a linear formula on the product – onyourmark Nov 27 '22 at 12:27
  • Side comment: RAND() generates numbers in $[0,1]$. How is time generated? – Jean-Claude Arbaut Nov 27 '22 at 12:28
  • Oh right, sorry RAND() is in [0,1]. Time was not included in the analysis actually. Sorry I just pasted that in with the dataset. – onyourmark Nov 27 '22 at 12:33
  • The linear fit will get worse if you expand the fractional range of the dimensions. Make each one go from $1$ to $50$, say, and see what you get. – Ross Millikan Nov 27 '22 at 20:46

1 Answers1

2

The result is not surprising.

A linear approximation of the volume $V=xyz$, given a small variation of $x,y,z$ is:

$$\mathrm dV=\mathrm dx\cdot y\cdot z+x\cdot \mathrm dy\cdot z+x\cdot y\cdot \mathrm dz$$

Here we have variations of $x,y,z$ respectively around the means $6.5, 4.5, 2.5$, and the variations don't exceed 0.5. And we multiply the volume by $5$ in the formula. So we have the approximate relationship:

$$\mathrm{price}\simeq5\mathrm V\simeq K+(5\times 4.5\times 2.5) x +(5\times 6.5\times 2.5) y+(5\times 6.5\times 4.5) z$$

Where the constant $K$ accounts for the mean $V$ as well as the means of $x,y,z$ and other random variables that enter the model.

The regression coefficients should therefore be rather close to $56.25$, $81.25$ and $146.25$ respectively. And it's indeed what is observed on my tests in R. And owing to the near-linear relationship, the $R^2$ is indeed high.

  • Thanks. I sort of understand what you have said and will think about it more. But I didn't get the same coefficients as you ( I think you removed ChocolateQuality in your explanation and so I removed it from my test). My coefficients are 110, 160, 215, and K of -1104. But really I would think that the p-values for these would be weak but they aren't. I ran it again and I got coef of 110, 160, 215, and K=-1104. The Mult R these two times is .9. But are you suggesting that if the variation in the x, y, and z values is wider then the Mult R should go down? – onyourmark Nov 27 '22 at 12:56
  • I only introduced $x,y,z$ in the model, that's why I have closer coefficients. And you can't expect to have extremely close values, as the linearization is not perfect either. It's a rough approximation, but it's enough to explain why the regression is so good. – Jean-Claude Arbaut Nov 27 '22 at 12:59
  • If the variation of $x,y,z$ is wider, it does not automatically mean the $R^2$ is worse: you don't change the regression by multiplying a regressor by a constant, for instance. Geometrically you will still span the same subspace and the projection of the dependent variable on it is the same. Likewise adding variables (even if they are unrelated to your data) automatically grows the spanned vector space, and the regression has a better $R^2$, which does not mean the regression is more meaningful. – Jean-Claude Arbaut Nov 27 '22 at 13:08
  • @onyourmark As to the $p$-values: they only tell you if you can accept the $H_0$ hypothesis that a coefficient is zero. Given the near-linear relationship I showed you, these coefficients are certainly not $0$, and the $p$-value is very small (hence we reject $H_0$) – Jean-Claude Arbaut Nov 27 '22 at 13:09
  • @onyourmark There are other causes for concern, like the deviation of the residuals from normality, and the heteroscedasticity of residuals that you can check resp. with a QQ-plot and a residuals vs fitted plot. You can still fit the model (basically it's just a matrix formula), but the inference you may draw from it is dubious. – Jean-Claude Arbaut Nov 27 '22 at 13:25
  • Do you know any good book that discusses something like this or mildly related to this (or even just statistics in general) with concepts such as subspace and/or linear approximations – onyourmark Nov 28 '22 at 22:27
  • @onyourmark Not easy to answer. I thought any good book on regression would address this, but I'm not so sure. You might try a book on linear algebra, but it will be light on regression. And if you want more advanced topics, such as details on QR, SVD and pseudoinverse with applications to regression, it's still another kind of textbook. My best pick would be Matrix Tricks for Linear Statistical Models, but make sure to check the contents. You may also find such topics for free online (university courses). – Jean-Claude Arbaut Nov 29 '22 at 08:00
  • Thanks for the recommendation! I will take a look. – onyourmark Nov 30 '22 at 11:00