
I am wondering why in Pearson's chi-squared test, the divisor of each element in the sum is the matching expectation and not the matching variance.

As I understand it, the test works by standardizing each normal variable before summing, so that the result can be tested against the chi-squared distribution, which describes a sum of squares of standard normal random variables.

The way a normal random variable is standardized is by subtracting the expectation and dividing by the standard deviation. So, in Pearson's test, this should give the variance in the divisor of each element, not the expectation.
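
To make the comparison concrete, the two forms in question can be written as follows. The notation is assumed for illustration: $k$ cells, observed counts $O_i$, cell probabilities $p_i$, $n$ observations, and $E_i = np_i$.

$$\sum_{i=1}^{k} \frac{(O_i - E_i)^2}{E_i} \quad \text{(Pearson's statistic)} \qquad \text{versus} \qquad \sum_{i=1}^{k} \frac{(O_i - E_i)^2}{n p_i (1 - p_i)} \quad \text{(dividing by each cell's variance)}$$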

Jean-Sébastien
Danra
  • Notice that the $O_i$ are not independent: their total is fixed. A more careful analysis is needed to establish the law of the statistic. – Sasha Dec 06 '12 at 16:56

2 Answers


Intuition/informal proof: if you model each cell count as Poisson, the expected value is equal to the variance, so when you divide by the expected value you are in fact dividing by the variance, as you thought you should. Thinking of the counts as Poisson-distributed makes this natural, since the mean and variance of a $\operatorname{Poisson}(\lambda)$ distribution are both $\lambda$.
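
As a quick numerical check of the mean-equals-variance property, here is a minimal sketch that sums the Poisson probability mass function directly. The helper name `poisson_moments` and the truncation point `kmax` are illustrative choices, not something from a particular library.

```python
import math

def poisson_moments(lam, kmax=200):
    """Approximate the mean and variance of Poisson(lam) by summing the pmf.

    kmax is an assumed truncation point; it must lie far out in the tail
    (kmax much larger than lam) for the sums to be accurate.
    """
    # Compute the pmf in log space to avoid overflow in lam**k and k!.
    pmf = [math.exp(-lam + k * math.log(lam) - math.lgamma(k + 1))
           for k in range(kmax + 1)]
    mean = sum(k * p for k, p in enumerate(pmf))
    var = sum((k - mean) ** 2 * p for k, p in enumerate(pmf))
    return mean, var

mean, var = poisson_moments(4.0)
print(mean, var)  # both should be very close to lam = 4.0
```

For comparison, a single Binomial$(n, p)$ cell has mean $np$ and variance $np(1-p)$, so the two agree only approximately, and only when $p$ is small.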

For a formal proof, check out MIT's OpenCourseWare.

Great question!

Jonathan Christensen
  • Thank you for the reference to the proof! Could you please elaborate on the intuition? The Poisson distribution is what the Binomial distribution tends to as $n \to \infty$ and $p \to 0$. This is not the general case in Pearson's test. – Danra Dec 07 '12 at 01:38
  • The Poisson distribution has an independent existence, not just as the limit of the Binomial distribution. It's the distribution of the number of "rare" events that occur in a given time period, and is very often used to model counts. In each cell of the contingency table we are counting how many observations fall into that cell, so you can think of the counts in the contingency table like Poisson-distributed random variables. It's not formal--in most cases those cells aren't actually Poisson-distributed, for various reasons--but I think it's a reasonable intuitive argument. – Jonathan Christensen Dec 07 '12 at 01:54
  • As I understand it, a Poisson distribution is appropriate when n is large and p is small; see http://en.wikipedia.org/wiki/Poisson_distribution#Derivation_of_Poisson_distribution_.E2.80.94_The_law_of_rare_events . Granted, "large" and "small" are relative terms, but does the Poisson intuition really hold, for instance, in the case of a repeated coin toss, where there are only two cells, and p0 and p1 are both "large"? Isn't it more intuitive that the variances in this case are those of binomial distributions? – Danra Dec 07 '12 at 08:55
  • It's intuition. I find it useful because I often think of counts like those found in contingency tables in terms of Poisson random variables. If you don't find it useful then ignore it. If you want a rigorous answer, work through the rigorous proof. – Jonathan Christensen Dec 07 '12 at 17:16

The mathematical proof shared by Jonathan Christensen in the other answer is great.

Here is my intuitive interpretation:

I was also deeply confused that every "simple" explanation out there references the Poisson distribution, which intuitively does not seem right because the underlying process should be binomial. I too initially thought that the chi-squared test would make more sense if the divisor were $np_iq_i$ instead of $np_i$ (i.e. $E_i$).

After reading the proof, I now understand it much better. Long story short, we must not interpret each cell's term individually, because doing so causes confusion instead of giving the right intuition. The chi-squared test applies Pearson's theorem as a whole. Did you notice that we have to sum over all the cells and are not allowed to pick and choose cells (e.g. remove columns/rows that are not of interest)? The statistic $\sum\dfrac{(O_i - E_i)^2}{E_i}$ only converges to a $\chi^2$ distribution if all the cells (mutually exclusive and collectively exhaustive) are added together.
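
A small simulation can make this visible. The sketch below is only an illustration under assumed settings (three cells with probabilities 0.5, 0.3, 0.2, and a hypothetical helper named `simulate_means`): it draws multinomial counts with the standard library and compares the average of Pearson's statistic with the average of the "divide by each cell's variance" alternative. Since $\mathrm{Var}(O_i) = np_i(1-p_i)$, the first average should sit near $k-1$ and the second near $k$.

```python
import random
from collections import Counter

def simulate_means(probs, n, trials, seed=0):
    """Average Pearson's statistic and the variance-normalized sum over many samples.

    probs  : cell probabilities (assumed to sum to 1)
    n      : observations per simulated sample
    trials : number of simulated samples
    """
    rng = random.Random(seed)
    cells = range(len(probs))
    expected = [n * p for p in probs]               # E_i = n p_i
    variances = [n * p * (1 - p) for p in probs]    # Var(O_i) = n p_i q_i
    pearson_total = 0.0
    naive_total = 0.0
    for _ in range(trials):
        # One multinomial sample: n draws over the cells, then count per cell.
        counts = Counter(rng.choices(cells, weights=probs, k=n))
        sq_dev = [(counts[i] - expected[i]) ** 2 for i in cells]
        pearson_total += sum(s / e for s, e in zip(sq_dev, expected))
        naive_total += sum(s / v for s, v in zip(sq_dev, variances))
    return pearson_total / trials, naive_total / trials

pearson_mean, naive_mean = simulate_means([0.5, 0.3, 0.2], n=200, trials=2000, seed=1)
print(pearson_mean)  # should be close to k - 1 = 2
print(naive_mean)    # should be close to k = 3
```

The mean of a $\chi^2$ distribution equals its degrees of freedom, so an average near $k-1$ is what Pearson's theorem predicts for the first statistic, while the per-cell standardization overshoots because it ignores the dependence between cells.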

Individually, each cell's variance is $np_iq_i$, but the cells are not independent of each other because they sum to a fixed total: once all but one of the cell values are known, the last one is determined. That is, the covariance between the cells is not zero. It is actually negative, because a large value in one cell means that the other cells need to be smaller to compensate. Following the proof, when you sum up all the cells, the resulting distribution needs to take the covariance into account. The end result (after a full two pages of maths) is that $\sum\dfrac{(O_i - E_i)^2}{E_i}$ converges to a $\chi^2$ distribution with the degrees of freedom given by the theorem. The result holds for the sum over all the cells taken together. Removing any cell would break the proof and render the theorem inapplicable.
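
The two-cell case (a repeated coin toss) shows how the dependence does the work. This is a sketch in assumed notation, with $p_1 = p$, $p_2 = q = 1 - p$, and $O_1 + O_2 = n$, so that $O_2 - E_2 = -(O_1 - E_1)$:

$$\frac{(O_1 - np)^2}{np} + \frac{(O_2 - nq)^2}{nq} = (O_1 - np)^2\left(\frac{1}{np} + \frac{1}{nq}\right) = \frac{(O_1 - np)^2}{npq}$$

Dividing each of the two terms by its expectation is therefore the same as dividing a single squared deviation by the binomial variance $npq$, which gives one squared standardized variable and hence one degree of freedom. For more than two cells the same cancellation comes from the covariances $\operatorname{Cov}(O_i, O_j) = -np_ip_j$ for $i \neq j$.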

In summary, don't rely on the "intuitive" Poisson interpretation. There is literally no mention of the Poisson distribution in the proof. Think of the chi-squared statistic as a single statistic instead of a sum of individual statistics.

Thanks Daniel