Polynomial regression - correctness and accuracy

Question

I have just finished a code that performs polynomial regression, doing $(X'X)^{-1}X'y$ (where $X'$ is the transpose) to estimate the vector of coefficients.

Now I'd like to add some check procedures to assert that everything is correct and that the regression model can be used with confidence. From Wikipedia I know that "This is the unique least squares solution as long as $X$ has linearly independent columns. Since $X$ is a Vandermonde matrix, this is guaranteed to hold provided that at least $m + 1$ of the $x_i$ are distinct (for which $m < n$ is a necessary condition)." So I guess that a good first step would be to check whether $m$ is indeed less than $n$... I could also tell the user that the degree of the regression shouldn't be too high, to avoid overfitting the data.

The thing is, everything should be hidden from the user... I can't ask him to perform cross-validation. On the other hand, I could write a little leave-one-out cross validation code which would be run on the whole training data every time a new model is created.

Any thought or suggestion ?

Thanks

So is $X'$ the transpose? And shouldn't the expression be $(X' X)^{-1} X' y$ using your notation? — Pratyush Sarkar, Jan 17 '14 at 15:37
you're right, typo corrected, thank you ! (I use the correct formula in my code, of course) — CTZStef, Jan 17 '14 at 15:46
Honestly: I do not see what you want to be checked. Is your question on how to check whether $(X'X)^{-1}X'y$ is calculated correctly? To this end you have to do checks: given $X$, you should obtain the $i$-th column of $(X'X)^{-1}X'$ by calculating $(X'X)^{-1}X'e_{i}$, where $e_{i}$ is the standard basis vector with zeroes except a one at entry $i$. The confidence put in a regression model (choice of $X$) is what defines "doing a regression". There is of course no general answer. — MWL, Jan 20 '14 at 13:34
Well, as I wrote in my post, there must be some guidelines to assert whether or not a regression model can be used with confidence ? Also, what about m < n ? I mean I asked precise questions, I don't get it why the question is downgraded... — CTZStef, Jan 20 '14 at 13:40

Alt · Accepted Answer · 2014-01-26T21:24:36.867

1- Let the matrices $X_{train}$ and $Y_{train}$ be your training data, and $X_{tune}$ and $Y_{tune}$ be the held-out data that is not included in the training data.

Solve $\hat{\theta}=(X_{train}^TX_{train})^{\dagger}X_{train}^TY_{train}$

Now you regression model is" $Y=X\hat{\theta}$.

Use this regression model to estimate the outputs for the held-out data:

$\hat{Y}_{tune}=X_{tune}\hat{\theta}$

Afterwards, calculate $corr(\hat{Y}_{tune},Y_{tune})$, the correlation coefficient between $\hat{Y}_{tune}$ and $Y_{tune}$, if the correlation is greater than some threshold, then you have found a good fit. (Threshold=.8 may be a good guess).

2- A question that you might have is about the best number of polynomial terms to use when forming your input Vandermonde matrix. Here is my suggestion, let $k$ be the number of columns in $X$, then start from $k=1$ and record the resulting correlation, keep increasing $k$ until some maximum integer (say k=1000, but it really depends on the complexity of the underlying function that generates your data.) Finally choose the $k$ that yields the maximum correlation.

3- By $(X_{train}^TX_{train})^{\dagger}$, I mean the pseudo-inverse of $(X_{train}^TX_{train})$. Note that $(X_{train}^TX_{train})$ is always PSD, but not necessarily PD. So if its determinant is zero you can either use the Moore–Penrose pseudoinverse, or use the Tikhonov regularization, which is the foillowing:

$(X_{train}^TX_{train})^{\dagger}\approx (X_{train}^TX_{train}+\epsilon I)^{-1}$.

You can also tune $\epsilon$ by tuning on the held-out data.

4- For choosing either $k$ or $\epsilon$, if your sample size is small, I recommend using cross-validation, rather than only working on one held-out dataset.

Hope this helps.

score 0 · Answer 2 · answered Jan 21 '14 at 20:11

This answer is based on a software development practice rather than anything from mathematics.

It sounds like you have two categories of question:

1) Is the approach mathematically correct?

and

2) Is the code producing the correct result?

You can answer both questions via non-mathematical means: write a test suite and run it each time you update your code. E.g., for a polynomial you know, generate points for which the best fit is that polynomial. Then check that your code finds the correct one. You can programmatically test a few hundred or thousand cases -- and a good test suite will check edge and corner cases as well.

I would say that I have more a runtime problem rather than a unit test problem. I know how to test the stuff I programmed, what I want to know is some tricks on how to assert, at runtime, that the user entered correct data to perform regression on, and that the result makes sense. So far I have modified my code to create a test matrix as well as a data matrix, then I check the error between the actual test values and the results of the projection of the test data on the regression model. — CTZStef, Jan 22 '14 at 16:13

Polynomial regression - correctness and accuracy

2 Answers2