Pearson's chi-square statistic is the sum $\sum_i (O_i-E_i)^2/E_i$ over observed counts $O_i$ and expected counts $E_i$. However, I don't understand why each term is divided by $E_i$.

For example, if I want to measure the distance between $f(x)$ and $g(x)$, then I would simply do the following: $d(x)=|f(x)-g(x)|$ or $d^2(x)=(f(x)-g(x))^2$.

So, if I haven't decided upon my critical value yet and I just keep the results of the formula, what is the purpose of the division?

1 Answer

Take the underlying model to be $X_i \sim \mathsf{Pois}(E_i)$. A Poisson variable has mean and variance both equal to $E_i$, so its standard deviation is $\sqrt{E_i}$, and standardizing gives $Z_i = \frac{X_i - E_i}{\sqrt{E_i}} \stackrel{aprx}{\sim} \mathsf{Norm}(0,1)$ and $Z_i^2 = \frac{(X_i - E_i)^2}{E_i} \stackrel{aprx}{\sim} \mathsf{Chisq}(1).$ Then the sum of the $Z_i^2$ is approximately $\mathsf{Chisq}(\nu),$ where linear constraints on the $X_i$'s are used to determine the degrees of freedom $\nu.$
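To see numerically what the division buys, here is a minimal sketch (not part of the original answer) that computes $E[(X_i-E_i)^2]$ and $E[(X_i-E_i)^2/E_i]$ for $X_i \sim \mathsf{Pois}(E_i)$ by summing directly over the Poisson probabilities. It relies on the Poisson variance being equal to its mean $E_i$. The function names `poisson_pmf` and `expected_sq_dev`, the truncation point `k_max`, and the three example values of $E_i$ are my own choices for illustration.

```python
import math

def poisson_pmf(k, mean):
    # P(X = k) for X ~ Pois(mean), computed in log space to avoid overflow
    return math.exp(k * math.log(mean) - mean - math.lgamma(k + 1))

def expected_sq_dev(mean, scaled, k_max=None):
    """E[(X - mean)^2] (scaled=False) or E[(X - mean)^2 / mean] (scaled=True)
    for X ~ Pois(mean), by direct summation over k = 0..k_max."""
    if k_max is None:
        # Assumption: truncating about 20 standard deviations above the mean
        # leaves a negligible tail for the means used here.
        k_max = int(mean + 20 * math.sqrt(mean) + 50)
    total = 0.0
    for k in range(k_max + 1):
        term = (k - mean) ** 2
        if scaled:
            term /= mean
        total += term * poisson_pmf(k, mean)
    return total

# Illustrative expected counts (hypothetical values)
for mean in (5.0, 50.0, 500.0):
    raw = expected_sq_dev(mean, scaled=False)
    scaled = expected_sq_dev(mean, scaled=True)
    print(f"E_i = {mean:6.1f}   E[(X-E)^2] = {raw:9.4f}   E[(X-E)^2/E] = {scaled:.4f}")
```

The unscaled squared deviation grows in proportion to $E_i$, so cells with large expected counts would dominate a plain sum of squares. After dividing by $E_i$, every cell has expected contribution 1, which is what lets the terms be added on a common scale and compared with a chi-square distribution.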

BruceET
  • [+1] very illuminating answer. – Jean Marie Sep 03 '17 at 07:02
  • @JeanMarie. It is an answer that explains (not exactly justifies) why this approximate chi-sq statistic is commonly used. The transparent comparison of Obs and Exp counts is appealing. As long as all $E_i$'s exceed 5 or so (the bigger the better), it is accurate enough to be useful. (We know that not so much from analysis as from simulations of many situations. R statistical software will simulate an accurate P-value in many applications.) // The chi-sq statistic based on a likelihood ratio is noticeably more accurate, but its use is resisted in the social and biological sciences because it's distastefully messy. – BruceET Sep 03 '17 at 15:00
  • Thanks very much for these complementary comments. – Jean Marie Sep 03 '17 at 15:17
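
For anyone who wants to compare the two statistics mentioned in the comments, here is a rough sketch that computes Pearson's statistic alongside the likelihood-ratio statistic, taken here in its usual form $G = 2\sum_i O_i \ln(O_i/E_i)$. The observed counts are made up for illustration, the null hypothesis of equal cell probabilities is an assumption, and the function names `pearson_chi2` and `likelihood_ratio_g` are hypothetical.

```python
import math

def pearson_chi2(observed, expected):
    # Sum of (O_i - E_i)^2 / E_i over all cells
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

def likelihood_ratio_g(observed, expected):
    # G = 2 * sum of O_i * ln(O_i / E_i); cells with O_i = 0 contribute 0,
    # the limit of o * ln(o / e) as o -> 0
    return 2.0 * sum(o * math.log(o / e)
                     for o, e in zip(observed, expected) if o > 0)

# Hypothetical data: 100 observations in 4 cells, null of equal probabilities
observed = [18, 22, 27, 33]
total = sum(observed)
expected = [total / len(observed)] * len(observed)

# Rule of thumb from the comment above: all expected counts should exceed about 5
rule_of_thumb_ok = min(expected) > 5

# One linear constraint (the totals must match), so nu = cells - 1
dof = len(observed) - 1

print("all E_i > 5:", rule_of_thumb_ok)
print("degrees of freedom:", dof)
print("Pearson chi-square:", pearson_chi2(observed, expected))
print("likelihood-ratio G:", likelihood_ratio_g(observed, expected))
```

With expected counts this far above 5, the two statistics come out close to each other. Both are referred to $\mathsf{Chisq}(\nu)$ with the same $\nu$.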