Regularization vs increasing # of data points of least squares

Question

This question concerns least squares approximation when the number of data points is LESS than the number of variables, which makes the problem ill-posed.

In such a problem, would it generally be more accurate to use regularization or to search for more data points?

There is no mathematical trick that can cover for having too few data points. From the data you have, you can come up with an infinite number of solutions that fit the data perfectly, but you will have no idea which is correct, which means that you will not be able to predict future measurements with any confidence. On the other hand, even when you already have sufficient data points, more data will still allow you to reduce your margin of error. More data is always a highly desirable thing in statistics. — Paul Sinclair, Mar 24 '18 at 02:05
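To make the "infinite number of solutions" point concrete, here is a minimal NumPy sketch. The data values and variable names are made up for illustration: two measurements are fitted with a three-coefficient quadratic, and two different coefficient vectors both reproduce the data while disagreeing at a new point.

```python
import numpy as np

# Two measurements, three unknown coefficients of y = c0 + c1*x + c2*x^2
# (illustrative values, not taken from the thread).
x = np.array([0.0, 1.0])
y = np.array([1.0, 3.0])
A = np.column_stack([np.ones_like(x), x, x**2])  # 2 x 3 design matrix

# One exact fit: lstsq returns the minimum-norm solution when A is rank-deficient.
c_min, *_ = np.linalg.lstsq(A, y, rcond=None)

# Adding any null-space vector of A gives another exact fit.
_, _, Vt = np.linalg.svd(A)
null_vec = Vt[-1]                 # A @ null_vec is numerically zero
c_other = c_min + 5.0 * null_vec  # the factor 5.0 is arbitrary

def predict(c, x_new):
    """Evaluate the fitted quadratic at x_new."""
    return c[0] + c[1] * x_new + c[2] * x_new**2

fit_err_min = np.max(np.abs(A @ c_min - y))
fit_err_other = np.max(np.abs(A @ c_other - y))
pred_gap = abs(predict(c_min, 2.0) - predict(c_other, 2.0))

print("max fit error:", fit_err_min, fit_err_other)
print("disagreement at x = 2:", pred_gap)
```

Both fit errors are at round-off level, yet the two models give clearly different predictions at x = 2, and nothing in the two data points favors one over the other.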
Regarding "no mathematical trick," isn't regularized least squares technically a mathematical trick for when you have too few data points? — David, Mar 24 '18 at 06:58
Yes, it is a mathematical trick. No, it does not cover for having too few data points. It gives you an answer, but it does not give you the answer. It is simply a way of picking one solution out of the grab-bag of equally likely possibilities. The one it picks is no more reliable than the others. — Paul Sinclair, Mar 24 '18 at 14:26
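As a rough illustration of "picking one solution out of the grab-bag", the sketch below assumes Tikhonov (ridge) regularization, which is only one common choice; the thread does not say which regularizer is meant. The function name `ridge_solve` and the data are hypothetical. With a small penalty, the ridge answer approaches the minimum-norm solution, i.e. the selection rule is "smallest coefficients", not "most likely to be correct".

```python
import numpy as np

def ridge_solve(A, y, lam):
    """Solve min ||A c - y||^2 + lam * ||c||^2 via the normal equations."""
    n = A.shape[1]
    return np.linalg.solve(A.T @ A + lam * np.eye(n), A.T @ y)

# Underdetermined system: 2 equations, 3 unknowns (illustrative values).
A = np.array([[1.0, 0.0, 0.0],
              [1.0, 1.0, 1.0]])
y = np.array([1.0, 3.0])

c_ridge = ridge_solve(A, y, lam=1e-6)
c_min_norm = np.linalg.pinv(A) @ y  # minimum-norm exact solution

print("ridge solution:       ", c_ridge)
print("minimum-norm solution:", c_min_norm)
```

Changing the penalty (for example, penalizing only some coefficients) would select a different member of the same solution family, which is why the data alone cannot validate the choice.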
I see. What about a situation where you have a set of data points that are clustered together, plus a couple of outliers? If you use only the clustered data points, your # of data points is less than the # of variables, but if you include the outliers, your # of data points is greater than or equal to the # of variables. In that scenario, would it be best to include those outliers or to use regularization? — David, Mar 24 '18 at 20:48
Without additional data points, you don't know if those "outliers" really are outliers (i.e., bad data) or are accurate, but in an area where by chance you only got a couple points. If they are outliers, then regularization will improve the accuracy of your model. If they are accurate, then regularization will worsen the accuracy of your model. There is just no way to know without sufficient data. This is why we call that problem ill-posed. — Paul Sinclair, Mar 25 '18 at 04:53
Interesting. I am using least squares for computing gradients. I need to compute gradients on a mesh of cells. The cells are hexes (6 faces) or tets (4 faces). If I use the cell in question plus its adjacent neighbors, I would have 7 (5) data pts for a hex (tet) cell. But with a quadratic basis of size 10, I have insufficient data pts. Due to the nature of some of the meshes, I can only search for additional data points in 1 direction. I am not sure if I should use regularization or perform this search in 1 direction, where the latter would seem to bias the gradient in favor of that direction? — David, Mar 25 '18 at 06:36
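The setup described above can be sketched as follows, under several assumptions that are not in the thread: a regular hex stencil with unit spacing, ridge regularization with a small penalty, and hypothetical helper names (`quadratic_basis`, `ls_gradient`). The fit uses offsets from the cell center, so the gradient at the center is read off from the coefficients of x, y and z.

```python
import numpy as np

def quadratic_basis(d):
    """Rows of [1, x, y, z, xy, xz, yz, x^2, y^2, z^2] for offsets d (n x 3)."""
    x, y, z = d[:, 0], d[:, 1], d[:, 2]
    return np.column_stack([np.ones(len(d)), x, y, z,
                            x * y, x * z, y * z, x**2, y**2, z**2])

def ls_gradient(center, points, values, lam=1e-8):
    """Ridge-regularized quadratic fit; returns the gradient at `center`."""
    A = quadratic_basis(points - center)          # 7 x 10 for a hex stencil
    n = A.shape[1]
    c = np.linalg.solve(A.T @ A + lam * np.eye(n), A.T @ values)
    return c[1:4]                                 # coefficients of x, y, z

# Hex cell center plus its 6 face neighbors (assumed regular, spacing h).
h = 1.0
center = np.zeros(3)
points = np.array([[0, 0, 0],
                   [h, 0, 0], [-h, 0, 0],
                   [0, h, 0], [0, -h, 0],
                   [0, 0, h], [0, 0, -h]], dtype=float)

def f(p):
    """Test field with known gradient (2, -3, 0.5) at the origin."""
    x, y, z = p[:, 0], p[:, 1], p[:, 2]
    return 1 + 2 * x - 3 * y + 0.5 * z + x * y + x**2

grad = ls_gradient(center, points, f(points))
print("estimated gradient at center:", grad)
```

On this symmetric stencil the cross-term columns (xy, xz, yz) are identically zero, so the regularizer only pins down coefficients the data cannot see, and the gradient reduces to central differences. On an irregular mesh, or with extra points taken in a single direction, that decoupling would no longer hold.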
That is a significantly different situation than I thought we were talking about. Your meaning for "outlier" is even different from mine ("data pt far away from area of interest" vs "data pt far away from pattern of other data pts, possibly caused by a bad measurement"). Gradients are very unstable. Unless you have some condition that restricts how quickly your function can change, I would not go looking for data any farther from your point than absolutely necessary. — Paul Sinclair, Mar 26 '18 at 16:34
@PaulSinclair Sorry, I was unclear. When I originally mentioned "outlier," I was picturing a situation "data pt far away from area of interest." In the last comment, I was picturing a different situation (data pt far away from pattern of other data pts) that is more relevant to my current problem of interest. Since you suggested to "not look for data any farther from your point than necessary," would it be better in this case to use regularization vs. finding more data farther away from the pattern of my current set of data pts? — David, Mar 26 '18 at 17:20
I didn't mean that you were unclear. At worst, you used concepts more common in a different situation; I should have been more wary in my assumptions. Gradients are not something that I have personal experience with calculating numerically. The problem is that even well-behaved functions can change very quickly, leading to greatly varied gradients in a small region. It depends on your needs. If you want to minimize the size of the gradients, then bring in more distant points. If you want the result to be most responsive to the immediate surroundings, regularize. — Paul Sinclair, Mar 26 '18 at 23:29
Ah I see. When you use least squares, do you generally find that a linear basis [1,x,y,z] is sufficient, or do you at times need a quadratic basis [1,x,y,z,xy,xz,yz,x^2,y^2,z^2]? — David, Mar 27 '18 at 21:22
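For reference, the two bases can be compared by counting unknowns against what a stencil can actually determine. The sketch below assumes the same regular 7-point hex stencil with unit spacing (an illustrative choice, not something stated in the thread) and uses hypothetical helper names.

```python
import numpy as np

def linear_basis(d):
    """Rows of [1, x, y, z] for offsets d (n x 3)."""
    return np.column_stack([np.ones(len(d)), d])

def quadratic_basis(d):
    """Rows of [1, x, y, z, xy, xz, yz, x^2, y^2, z^2] for offsets d (n x 3)."""
    x, y, z = d[:, 0], d[:, 1], d[:, 2]
    return np.column_stack([np.ones(len(d)), x, y, z,
                            x * y, x * z, y * z, x**2, y**2, z**2])

# Cell center plus 6 face neighbors of a regular hex, as offsets from the center.
offsets = np.array([[0, 0, 0],
                    [1, 0, 0], [-1, 0, 0],
                    [0, 1, 0], [0, -1, 0],
                    [0, 0, 1], [0, 0, -1]], dtype=float)

A_lin = linear_basis(offsets)
A_quad = quadratic_basis(offsets)

rank_lin = np.linalg.matrix_rank(A_lin)
rank_quad = np.linalg.matrix_rank(A_quad)

print("linear basis:   ", A_lin.shape, "rank", rank_lin)
print("quadratic basis:", A_quad.shape, "rank", rank_quad)
```

With the linear basis the system is overdetermined (7 equations, 4 unknowns, full column rank), so ordinary least squares has a unique solution. With the quadratic basis, 10 unknowns face only 7 independent equations, so 3 degrees of freedom must come from regularization or extra points.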
