I am performing a Kruskal-Wallis test on very large samples (100 000+ values). While the distributions look similar in the figures, the test reports a significant difference between the two distributions. This makes me think that the Kruskal-Wallis test may become less reliable when the sample sizes are too large. Is this the case?
It is not that "the Kruskal Wallis test can become less reliable when using too large sample sizes" but that with large sample sizes even very small differences in distributions are likely to produce "significant" results even when they are not in any sense substantial – Henry Sep 01 '21 at 12:00
1 Answer
Ideally, the sample size for a test would be just a little larger than is necessary to detect an effect of a size that is of practical importance. When the sample size is huge, the test may have enough power to declare differences of no practical importance statistically significant.
Consider three samples of size $n=100$ from similar gamma populations: one sample from $\mathsf{Gamma}(5, .100)$ (with mean $\mu=50,$ median $\eta=46.709)$ and two samples from $\mathsf{Gamma}(5, .101)$ (with mean $\mu = 49.505,$ median $\eta=46.247).$
set.seed(901)
x1 = rgamma(100, 5, .1)
x2 = rgamma(100, 5, .101)
x3 = rgamma(100, 5, .101)
x = c(x1,x2,x3); g = rep(1:3, each=100)
boxplot(x ~ g, horizontal=T, col="skyblue2", notch=T)
Boxplots do not show noticeable differences in locations. Notches in the sides of the boxes are nonparametric confidence intervals for the medians, calibrated so that overlap between two intervals suggests no significant difference in location.
A K-W test does not reject the null hypothesis that all three samples are from populations with the same location.
kruskal.test(x ~ g)
Kruskal-Wallis rank sum test
data: x by g
Kruskal-Wallis chi-squared = 1.4712, df = 2, p-value = 0.4792
By contrast, if we repeat the above with sample size $n=10\,000,$ the K-W test finds highly statistically significant differences among the populations.
set.seed(2021)
x1 = rgamma(10000, 5, .1)
x2 = rgamma(10000, 5, .101)
x3 = rgamma(10000, 5, .101)
x = c(x1,x2,x3); g = rep(1:3, each=10000)
kruskal.test(x ~ g)
Kruskal-Wallis rank sum test
data: x by g
Kruskal-Wallis chi-squared = 16.363, df = 2, p-value = 0.0002798
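To put a size on the differences behind these p-values, one option is to convert each K-W statistic into an effect size. The sketch below uses epsilon-squared, $\varepsilon^2 = H/(N-1),$ where $H$ is the K-W statistic and $N$ the total number of observations. This measure is not part of the analysis above; it is one of several possible choices, and `kw_epsilon_sq` is a hypothetical helper name, not a base R function.

```r
# Epsilon-squared effect size for a Kruskal-Wallis test: H / (N - 1),
# where H is the test statistic and N is the total number of observations.
# kw_epsilon_sq is a hypothetical helper, not part of base R.
kw_epsilon_sq <- function(H, N) H / (N - 1)

# Statistics copied from the two test outputs shown above
eps_small <- kw_epsilon_sq(H = 1.4712, N = 3 * 100)    # n = 100 per group
eps_large <- kw_epsilon_sq(H = 16.363, N = 3 * 10000)  # n = 10 000 per group

print(c(n100 = eps_small, n10000 = eps_large))
```

Both values come out well below 0.01, so even the "highly significant" result corresponds to a very small effect on this scale.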
Neither result is wrong. However, the second scenario uses more data than necessary, unless it is of practical importance to detect a difference in location of about $0.5$ units.
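The effect of sample size on power can also be checked by simulation rather than from a single pair of samples. The following is a minimal sketch, assuming the same three gamma populations as above; `reject_rate`, the number of replications, and the seed are illustrative choices, and the estimated rates will vary with them.

```r
# Estimate how often the K-W test rejects at level alpha when each of the
# three groups has n observations (populations as in the example above).
# reject_rate is a hypothetical helper; reps is kept small for speed.
reject_rate <- function(n, reps = 200, alpha = 0.05) {
  g <- factor(rep(1:3, each = n))
  pvals <- replicate(reps, {
    x <- c(rgamma(n, 5, 0.1), rgamma(n, 5, 0.101), rgamma(n, 5, 0.101))
    kruskal.test(x, g)$p.value
  })
  mean(pvals < alpha)  # proportion of simulated data sets that reject
}

set.seed(1)  # arbitrary seed, for reproducibility only
rate_small <- reject_rate(100)
rate_large <- reject_rate(10000)
print(c(n100 = rate_small, n10000 = rate_large))
```

Comparing the two rates shows how much more often the same small difference between populations is flagged when $n$ is large.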

