Scaling and cross-validation in statistical models

Question

Let's say I have a two-dimensional dataset (variables X and Y). My goal is to fit a model that best describes the X-Y relationship using a training subset of the dataset, and then to evaluate its performance on a separate test subset. Suppose I also want to normalize/scale the Y vector to zero mean and a standard deviation of 1. Which is better, and why: scaling before partitioning, or partitioning first and then scaling the training and test partitions separately?

Thanks.

I can't see any benefit to normalising the Y-vector. It will just add an extra layer to the relationship between X and Y. — tomi, Apr 12 '15 at 22:26
@tomi I'm mainly doing the normalisation for another purpose: Gaussian process hyperparameter optimization gives more numerically stable results when Y is normalised. — rodrigo, Apr 13 '15 at 23:19

score 0 · Answer 1 · answered Jun 12 '19 at 19:23

Ideally, you want to simulate the conditions under which the model will meet real-world test data, so that your estimate of the generalization error is as accurate as possible. You should therefore assume that the test set is available (1) only in an online fashion (i.e., one input at a time) and (2) only after training is complete.

Hence, you should compute the normalization parameters $P$ on the training set, and then apply this same normalization $P$ to both the training set and the test set.
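As a minimal sketch of this recipe, assuming $P$ is just the mean and (population) standard deviation of the training targets: the helper names `fit_standardizer` and `standardize` and the toy numbers are made up for illustration, and only the Python standard library is used.

```python
from statistics import fmean, pstdev


def fit_standardizer(y_train):
    """Estimate the normalization parameters P = (mean, std) from training targets only."""
    mean = fmean(y_train)
    std = pstdev(y_train, mu=mean)
    if std == 0.0:
        raise ValueError("training targets are constant; cannot scale to unit variance")
    return mean, std


def standardize(values, mean, std):
    """Apply an already-estimated (mean, std) to any list of values."""
    return [(v - mean) / std for v in values]


# Hypothetical toy targets, already partitioned into train and test.
y_train = [2.0, 4.0, 6.0, 8.0]
y_test = [5.0, 11.0]

# P is computed on the training partition only ...
mean, std = fit_standardizer(y_train)

# ... and the same P is applied to both partitions.
y_train_scaled = standardize(y_train, mean, std)
y_test_scaled = standardize(y_test, mean, std)
```

With this approach the scaled training targets have zero mean and unit standard deviation, while the scaled test targets generally do not. That is expected: the test set is being measured on the scale the model was trained on.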

Notice that this strategy is identical to that used by default for batch normalization in deep neural network libraries: namely, you learn the normalization parameters at training time, and then at test time you freeze the parameters and use them as is on test data.
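The "learn, then freeze" pattern can be sketched roughly as follows. This is not how any particular deep learning library implements batch normalization; `FrozenStandardizer` and its methods are hypothetical names, and the statistics are plain training-set mean and standard deviation. It only illustrates that, once fitted, the parameters stay fixed while test values arrive one at a time.

```python
from statistics import fmean, pstdev


class FrozenStandardizer:
    """Learns mean/std at training time, then applies them unchanged at test time."""

    def __init__(self):
        self.mean = None
        self.std = None

    def fit(self, y_train):
        # Training time: estimate the parameters from the training targets.
        self.mean = fmean(y_train)
        self.std = pstdev(y_train, mu=self.mean)
        if self.std == 0.0:
            raise ValueError("training targets are constant; cannot scale")
        return self

    def transform_one(self, y):
        # Test time: use the frozen parameters; nothing is re-estimated here.
        if self.mean is None:
            raise RuntimeError("call fit() on the training data first")
        return (y - self.mean) / self.std


scaler = FrozenStandardizer().fit([1.0, 2.0, 3.0])
params_before = (scaler.mean, scaler.std)

# Simulated test stream: one value at a time, seen only after training.
stream = [2.0, 10.0, -4.0]
scaled_stream = [scaler.transform_one(y) for y in stream]

params_after = (scaler.mean, scaler.std)
```

Because `transform_one` never updates `mean` or `std`, a given raw value always maps to the same scaled value regardless of what else appears in the test stream.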
