
Let $x$ and $y$ be two arrays of independent random numbers drawn from a Gaussian (normal) distribution, and let $X$ and $Y$ be their cumulative sums at each step:

$X_i = \sum_{k=0}^i x_k$
$Y_i = \sum_{k=0}^i y_k$

Even though there is no correlation between $x$ and $y$, since they are i.i.d., there appears to be a strong correlation between $X$ and $Y$.

I found this perplexing while writing some R code. Might someone with better math knowledge have a simple explanation of this phenomenon?

R code here:

N = 1000
x = rnorm(N)
y = rnorm(N)

X = x
for(i in 2:N){ X[i] = X[i-1] + x[i] }

Y = y
for(i in 2:N){ Y[i] = Y[i-1] + y[i] }

summary(lm(y~x))
summary(lm(Y~X))

The first regression shows an $R^2$ of almost zero, while the second one has a rather large nonzero $R^2$. Try this multiple times.

  • Do the $x_i$s and $y_i$s have non-zero means? – Henry Oct 25 '20 at 23:07
  • Could you post your R code? – yberman Oct 25 '20 at 23:07
  • The means of x and y are zero. – wang1908 Oct 26 '20 at 00:37
  • If you repeat the simulations many times you will find the correlation is widely distributed (it seems almost uniform between $-0.5$ and $+0.5$, and is quite likely to be outside that range). The expectation of the correlation is $0$ but the expectation of the square of the correlation is higher (perhaps near $\frac14$ in some cases). Auto-correlation of the individual cumulative sums is a major cause. – Henry Oct 26 '20 at 03:25

1 Answer


Try plotting them and you might see why:

import numpy as np
n = 100000
x = np.random.normal(0, 1, n)
y = np.random.normal(0, 1, n)

X = np.cumsum(x)
Y = np.cumsum(y)

x and y, correlation -0.000755: [plot]

X and Y, correlation 0.3245: [plot]

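Essentially, x and y are freshly random at every step, so each pair of values is unrelated to the pairs next to it. X and Y, by contrast, go on a "walk" together: each value stays close to the previous one, so the points on the X, Y graph cluster along a path, and that clustering alone produces a nonzero sample correlation.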
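To see how much the correlation of the walks varies from run to run, the experiment can be repeated many times. The following is only a rough sketch: the function name `walk_correlations`, the trial count, the walk length and the seed are all arbitrary choices made for illustration, not anything taken from the question.

```python
import numpy as np

def walk_correlations(n_steps=1000, n_trials=500, seed=0):
    """Sample correlations of the raw steps and of their cumulative sums.

    All parameter values here are illustrative assumptions.
    """
    rng = np.random.default_rng(seed)
    r_steps = np.empty(n_trials)
    r_walks = np.empty(n_trials)
    for t in range(n_trials):
        # Independent standard normal steps, as in the question.
        x = rng.standard_normal(n_steps)
        y = rng.standard_normal(n_steps)
        # Correlation between the steps themselves.
        r_steps[t] = np.corrcoef(x, y)[0, 1]
        # Correlation between the two cumulative sums (random walks).
        r_walks[t] = np.corrcoef(np.cumsum(x), np.cumsum(y))[0, 1]
    return r_steps, r_walks

r_steps, r_walks = walk_correlations()
print("steps: mean r = %.3f, mean r^2 = %.4f" % (r_steps.mean(), (r_steps**2).mean()))
print("walks: mean r = %.3f, mean r^2 = %.4f" % (r_walks.mean(), (r_walks**2).mean()))
```

If the picture above is right, the mean of r should sit near zero in both cases, while the mean of r² should be tiny for the steps and much larger for the walks. A single large correlation between X and Y is therefore a matter of spread, not of any systematic relationship.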