Best practice for pairing samples for linear regression

Question

I am building linear regression models in R from two samples that have no ground truth and no obvious way to pair an observation from one sample with an observation from the other.

What is the best practice for this scenario? The most obvious method would be to sort both samples and pair them by position, but I'm wondering if there are any better methods. The other method I thought of would be to pair each sample with its nearest neighbor by percentile or rank.
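
For concreteness, the sort-and-pair approach I have in mind looks roughly like this. It is only a sketch with made-up data: `cpu_a` and `cpu_b` are hypothetical names for the runtimes on each CPU, and I am not claiming the pairing is statistically justified.

```r
set.seed(1)

# Made-up runtimes in seconds; cpu_a and cpu_b are hypothetical stand-ins
# for the measurements taken on each CPU (equal sample sizes here).
cpu_a <- rnorm(50, mean = 10, sd = 1)
cpu_b <- rnorm(50, mean = 14, sd = 1.5)

# Pair by rank: the i-th smallest runtime on A with the i-th smallest on B.
paired <- data.frame(a = sort(cpu_a), b = sort(cpu_b))

# Ordinary least squares on the rank-paired values.
fit <- lm(b ~ a, data = paired)
coef(fit)
```

Because both columns are sorted, the fitted slope is positive by construction, which is part of why I am unsure this is a sound way to pair.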

What if the two samples have different numbers of observations? Which observations should be removed? Should observations ever be duplicated?
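
One alternative to removing or duplicating observations that I have considered is to compare both samples at the same percentiles. A minimal sketch with made-up data follows; `cpu_a` and `cpu_b` are hypothetical names, and using the smaller sample size for the percentile grid is my own assumption, not an established rule.

```r
set.seed(2)

# Made-up runtimes with unequal sample sizes (hypothetical data).
cpu_a <- rnorm(80, mean = 10, sd = 1)
cpu_b <- rnorm(50, mean = 14, sd = 1.5)

# Common percentile grid sized to the smaller sample (an assumption).
n <- min(length(cpu_a), length(cpu_b))
probs <- ppoints(n)

# Evaluate each sample's empirical quantiles on the same grid, so no
# original observation is dropped or duplicated outright.
paired <- data.frame(
  a = quantile(cpu_a, probs, names = FALSE),
  b = quantile(cpu_b, probs, names = FALSE)
)

fit <- lm(b ~ a, data = paired)
coef(fit)
```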

Any help would be greatly appreciated.

First, this question is probably a better fit for Cross Validated. https://stats.stackexchange.com/questions/ask Second, what do you mean by pairing samples? Are you saying that you have a sample of random variables from one distribution, and a sample of another random variable from a different distribution, with no relationship between the samples, but you’re trying to make a linear regression model to predict one variable from the other? — Joe, May 25 '20 at 23:40
Thanks, I'll post there. The samples are the runtimes of a program on two different CPUs. We are predicting how the program will run across CPUs using metrics like cpuUser/cpuKernel time. Because the runs are all of the same program, the runtimes on each CPU follow a normal distribution. — Robert Cordingly, May 25 '20 at 23:56
It sounds like you want to do an analysis of variance (ANOVA) — Joe, May 26 '20 at 01:48

