Training vs validation set

Question

I have a data set with which I am trying to find correlations.

I split the data into a training set and a validation set. I also have a solver I built which finds the "best coefficients" to give me the best results on the training set.

After solving for the training set, the validation set shows completely different results which do not support the results on the training set.

I then made my solver output all of its results rather than only the best one. Some candidates give positive, very similar results on both the training and validation sets, while others give results that differ between the two sets.

Is it okay to choose, by hand, the results that show the most similar and most positive results in both validation and training sets, or does this defeat the purpose of the validation set and invalidate the results?

Your best bet here is to use cross-validation, if you want to improve the performance of your model on the validation set. As indicated elsewhere, choosing the best results by hand is a BIG no-no, I would agree. — Adrian Keister, Dec 10 '18 at 14:57
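For readers unfamiliar with the cross-validation mentioned in this comment, here is a minimal sketch of k-fold cross-validation using only the standard library. The data, the one-variable least-squares model, and the names `fit_line`, `mse` and `k_fold_scores` are all assumptions made for illustration; they stand in for the asker's own data and solver, which are not shown in the question.

```python
import random

def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def mse(params, xs, ys):
    """Mean squared error of the fitted line on the given points."""
    slope, intercept = params
    return sum((slope * x + intercept - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

def k_fold_scores(xs, ys, k=5, seed=0):
    """Fit on k-1 folds, score on the held-out fold, repeat for every fold."""
    idx = list(range(len(xs)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    scores = []
    for held_out in folds:
        held = set(held_out)
        train = [i for i in idx if i not in held]
        params = fit_line([xs[i] for i in train], [ys[i] for i in train])
        scores.append(mse(params, [xs[i] for i in held_out], [ys[i] for i in held_out]))
    return scores

# Hypothetical data: a noisy straight line (an assumption, not the asker's data).
rng = random.Random(1)
xs = [rng.uniform(0, 10) for _ in range(100)]
ys = [2.0 * x + 1.0 + rng.gauss(0, 1) for x in xs]

scores = k_fold_scores(xs, ys, k=5)
mean_score = sum(scores) / len(scores)
print("per-fold MSE:", scores)
print("mean MSE:", mean_score)
```

Every point is held out exactly once, so the averaged score does not depend on one lucky or unlucky split. The fitting procedure is fixed before any held-out score is looked at, which is the part that hand-picking results would break.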

score 1 · Answer 1 · answered Dec 10 '18 at 07:44

This absolutely will invalidate the results. Data science is meant to be scientific, which means free from bias. It's OK to make a hypothesis and test it; if the results aren't what you want, that's fine: you revise the hypothesis and try again. That's what good science does. But manual "data wrangling" (hand-picking the results that look best on both sets) is a big no-no.
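One way to see why hand-picking invalidates the results is a small simulation. The sketch below is an assumed toy setup, not the asker's solver: the labels are coin flips, every candidate "result" is a random guess, and we keep the candidate that looks best on both the training and validation sets. All names (`accuracy`, `selected`, the set sizes) are made up for the illustration.

```python
import random

rng = random.Random(0)

def accuracy(preds, labels):
    """Fraction of predictions that match the labels."""
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)

def random_guesses(n):
    return [rng.randint(0, 1) for _ in range(n)]

# Pure-noise labels: by construction there is nothing to learn.
n_small, n_test = 50, 1000
train_y = random_guesses(n_small)
val_y = random_guesses(n_small)
test_y = random_guesses(n_test)  # fresh data, never used for selection

selected = None
for _ in range(500):
    candidate = {
        "train": accuracy(random_guesses(n_small), train_y),
        "val": accuracy(random_guesses(n_small), val_y),
        "test": accuracy(random_guesses(n_test), test_y),
    }
    # Mimic hand-picking: prefer the candidate that looks good on BOTH sets.
    if selected is None or min(candidate["train"], candidate["val"]) > min(
        selected["train"], selected["val"]
    ):
        selected = candidate

print(selected)
```

The selected candidate scores clearly above chance on both the training and validation sets, yet on the untouched test data it falls back to roughly 50%. Once the validation set is used to choose among results, it no longer provides an independent check.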

Paul Childs

Thanks for your answer. I worry that my solver is just finding outliers in the data, and focusing on them. The results that match both might be the third or fourth best results that the solver found. It is still technically found by the solver.
Does this make any difference?
– Frank Dec 10 '18 at 07:55
A least-squares regression, for example, is sensitive to outliers, because squaring the residuals gives large errors disproportionate weight. There are other estimators, each biased differently, as well as techniques for filtering out noise, but these shouldn't be applied by hand; the choice should be based on knowledge of the expected error. – Paul Childs Dec 10 '18 at 23:22
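A minimal sketch of that sensitivity, using made-up numbers: when fitting a single constant, the value that minimizes squared error is the mean, and the value that minimizes absolute error is the median. Adding one outlier shows how differently the two respond.

```python
import statistics

# Hypothetical measurements clustered around 10 (illustrative values only).
clean = [9.8, 10.1, 10.0, 9.9, 10.2, 10.0, 9.7, 10.3]
with_outlier = clean + [100.0]

# Least-squares estimate of a constant is the mean.
mean_shift = abs(statistics.mean(with_outlier) - statistics.mean(clean))

# Least-absolute-deviation estimate of a constant is the median.
median_shift = abs(statistics.median(with_outlier) - statistics.median(clean))

print("shift in mean:  ", mean_shift)
print("shift in median:", median_shift)
```

The single outlier drags the squared-error estimate far from the cluster, while the absolute-error estimate stays put. Which loss is appropriate depends on the error distribution you expect, and that choice should be made before looking at which results come out best.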
