Determining the degrees of freedom for a $\chi$-squared test

Question

I have read that the degrees of freedom are calculated by subtracting $1$ from the number of states a random variable can be in. I am performing a goodness-of-fit test on a $64\times 32$ matrix where the expected frequency of any $a[i,j]$ is $50\,000$ and the observed frequency can lie between $0$ and $100\,000$. What confuses me is how to calculate the degrees of freedom. Since the observed value might range from $0$ to $100\,000$, will my degrees of freedom be equal to $100\,000-1$? Please advise.

BruceET · Answer 1 · 2016-04-09T20:45:02.917

If you are doing a chi-squared goodness-of-fit (GOF) test for data in a matrix with $r$ rows and $c$ columns, and finding the expected count in a cell as (row total)(column total)/(grand total), then $df = (r-1)(c-1).$

Degrees of freedom depend on the numbers of row and column categories, not on the observed and expected counts in the cells.
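As a quick check in R (a sketch on made-up counts, not the asker's data): when `chisq.test` is given an $r \times c$ table, it builds the expected counts from the row and column totals and reports $df = (r-1)(c-1)$. The table `tab` below is hypothetical and only serves to show that the size of the counts does not affect the df.

```r
# Hypothetical 3 x 4 table of counts; the values are made up for illustration.
tab <- matrix(c(12, 15,  9, 14,
                20, 18, 22, 25,
                 8, 11, 10, 13), nrow = 3, byrow = TRUE)

res <- chisq.test(tab)               # expected counts from row and column totals
df_reported <- unname(res$parameter) # df reported by the test
df_formula  <- (nrow(tab) - 1) * (ncol(tab) - 1)
c(reported = df_reported, formula = df_formula)

# Multiplying every count by 10 changes the statistic but not the df.
df_scaled <- unname(chisq.test(10 * tab)$parameter)

# For a 64 x 32 table under this scheme:
df_64x32 <- (64 - 1) * (32 - 1)
df_64x32
```

The same formula applies to a table of any size, as long as the expected counts really are computed from the margins.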

Note: That said, I have never done a chi-squared GOF test for counts in a matrix anywhere near as large as the one you are talking about. I think you should read about the assumptions of the GOF test and make sure they apply in your situation. If you have doubts, perhaps describe your situation, data, and goals on our sister 'statistics' (or 'crossvalidated') site, and ask whether there is a better way toward your goals. That site tends to get more people with active experience in 'big data' applications.

I'm not saying you are doing the wrong analysis, but something seems to be confusing you, and I'm not sure your simply-resolved question here is the one you really should be asking.

Addendum (posted later, based on information in Comments): I had a look at the paper you linked. It is not exactly entry level material for the main subject matter, which I will not pursue here. However, I think I have a clearer view of what you are trying to do with a statistical test.

Chi-squared test. The chi-squared GOF test you propose is based on $rc = 2048$ $X$-values, each with expectation $E = 50,000.$ For purposes of the test, you essentially ignore the matrix structure because you do not use it to get $E$ (already specified for each cell). Thus, your GOF statistic turns out to be

$$Q = \sum_{i=1}^{rc} \frac{(X_i - E)^2}{E}.$$

Under the null hypothesis that $E(X_i) \equiv E$, the test statistic is approximately $Chisq(rc).$ (The distinction between $df = rc$ and $df = rc - 1$ would hardly matter in practice, but the former is correct because you are not using your $X$-values to estimate $E$, nor using the total of the $X_i$.)

An assumption of the test is that $X_i$ are approximately normal so that $Z_i = (X_i - E)/\sqrt{E}$ is approximately standard normal, $Z_i^2 = (X_i - E)^2/E$ is approximately $Chisq(df=1)$, and $Q$ is approximately $Chisq(df=rc).$ Thus one would reject $H_0$ at the 5% level, if $Q \ge 2154.4,$ the value that cuts 5% from the upper tail of $Chisq(df = rc)$.
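A minimal R sketch of this calculation follows. The matrix `a` is simulated as a stand-in for real counts, and the Poisson model used to generate it is an assumption made only so that the example runs. The names `a`, `cc`, `crit`, and `pval` are illustrative.

```r
# Sketch: GOF statistic when every cell has the same externally specified E.
set.seed(1)                          # arbitrary seed, for reproducibility only
r <- 64; cc <- 32; E <- 50000
a <- matrix(rpois(r * cc, E), nrow = r, ncol = cc)   # assumed Poisson counts

Q    <- sum((a - E)^2 / E)           # the matrix structure is ignored
df   <- r * cc                       # E is not estimated from the data
crit <- qchisq(0.95, df)             # 5% critical value, about 2154.4
pval <- pchisq(Q, df, lower.tail = FALSE)   # upper-tail p-value

c(Q = Q, df = df, crit = crit, p.value = pval)
```

With real data, replace the simulated `a` with the observed matrix; reject $H_0$ at the 5% level when `Q >= crit`, or equivalently when `pval <= 0.05`.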

If the $X_i$ are counts distributed $Pois(\lambda = E),$ then $E(X_i) = E,\;$ $V(X_i) = E,\,$ and $SD(X_i) = \sqrt{E}.$ Certainly, the discrete distribution $Pois(50,000)$ is well approximated by $Norm(50,000, \sqrt{50,000}).$
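One way to see how close the two distributions are is to compare a few quantiles, and the two CDFs over a range of about four standard deviations either side of $E$. This is only a rough numerical check; the grid of points and the continuity correction of $0.5$ are choices made for this sketch.

```r
E  <- 50000
sd <- sqrt(E)

# Compare a few quantiles of Pois(E) and Norm(E, sqrt(E)).
p      <- c(0.025, 0.5, 0.975)
q_pois <- qpois(p, lambda = E)
q_norm <- qnorm(p, mean = E, sd = sd)
cbind(p, q_pois, q_norm, diff = q_pois - q_norm)

# Largest gap between the two CDFs on an integer grid within +/- 4 SD of E,
# using a continuity correction of 0.5 for the normal CDF.
x       <- round(seq(E - 4 * sd, E + 4 * sd, length.out = 200))
max_gap <- max(abs(ppois(x, E) - pnorm(x + 0.5, E, sd)))
max_gap
```

Both the quantile differences and `max_gap` should come out small relative to the scale of the problem, which is what the normal approximation needs.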

Normal test. A simpler and somewhat similar test (of the null hypothesis that cell means average $E = 50,000$) would use the statistic $Z = (\bar X - E)/\sqrt{E/rc},$ where $\bar X$ is the mean of the $X_i.$ Under the same assumptions as above, $Z$ is approximately standard normal. Thus, one would reject $H_0$ at the 5% level if $|Z| \ge 1.96.$
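For a single data set, the normal test can be sketched in R as below, with $\bar X$ computed as the mean of the cell counts. The helper `z_test_mean` is a hypothetical name introduced here, and it assumes $V(X_i) = E$, as under the Poisson model. The data `x` are simulated only to make the example runnable.

```r
# Hypothetical helper: z test of H0 that the mean cell count equals E,
# assuming Var(X_i) = E (as under a Poisson model).
z_test_mean <- function(x, E) {
  k <- length(x)                       # number of cells, rc
  z <- (mean(x) - E) / sqrt(E / k)     # standardized sample mean
  p <- 2 * pnorm(-abs(z))              # two-sided p-value
  c(z = z, p.value = p)
}

set.seed(2)                            # arbitrary seed, for reproducibility only
x <- rpois(64 * 32, 50000)             # simulated stand-in for the observed counts
z_test_mean(x, 50000)
```

Note that this test looks only at the average count, so it can miss deviations that cancel out across cells; the chi-squared statistic $Q$ is sensitive to those.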

The following simulation in R of $m = 10,000$ tests of each type shows that they do have a significance level near 5%, when the $X_i \sim Pois(50,000).\;$ [A larger $m$ would get results a little closer to 5%; but not exactly, because the tests themselves are based on continuous distributions approximating discrete observations.]

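 m = 10000;  Q = Z = numeric(m)           # number of simulated tests; storage
 E = 50000;  k = 64*32                    # expected count per cell; number of cells
 se = sqrt(E/k);  crit = qchisq(.95, k)   # SE of the mean; 5% critical value
 for(i in 1:m) {
   X = rpois(k, E)                        # one simulated set of counts under H0
   Q[i] = sum((X-E)^2/E)                  # chi-squared GOF statistic
   Z[i] = (mean(X) - E)/se }              # normal test statistic

 mean(Q);  sd(Q); mean(Q > crit)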
 ## 2046.720  # aprx rc = k = 2048
 ## 63.77823  # aprx sqrt(2k) = 64
 ## 0.0483    # aprx 5% signif level: P(Rej Ho | Ho true)

 mean(Z); sd(Z);  mean(abs(Z) > 1.96)
 ## 0.001443955  # aprx E(Z) = 0
 ## 1.005732     # aprx SD(Z) = 1
 ## 0.0516       # aprx 5% signif level

The figure below on the left shows simulated values of $Q$ along with the density of $Chisq(df = rc)$; the area to the right of the vertical red line is 5%. On the right are simulated values of $Z$ along with the standard normal density curve; areas outside of the vertical red lines add to 5%.

 par(mfrow = c(1,2))    # 2-panel graph
   hist(Q, prob=T, col="wheat", ylim=c(0,.007))
     curve(dchisq(x, k), col="blue", add=T)
     abline(v = qchisq(.95, k), col="red")
   hist(Z, prob=T, col="skyblue2")
     curve(dnorm(x), col="blue", add=T)
     abline(v = c(-1.96, 1.96), col="red")
 par(mfrow = c(1,1))   # return to default graphs

I am not calculating the expected frequency with that formula; rather, the expected frequency has been derived from elsewhere. Yes, this test applies in my situation, because I have read about it in some relevant research papers. Yes, I am unsure whether I am applying the test correctly, because I am a novice right now. I got another answer on a different site which says dof = (64*32)-1. I am more confused now. — , Apr 09 '16 at 14:19
If expected cell counts are externally determined (without looking at your data), not by using row and column totals, then $df = rc - 1$ is correct. This is not the 'usual' situation. Do you really have $rc$ 'categories' of interest? (In the 'big data' field, I have found that the fact that someone else has taken a particular approach and published the results does not necessarily mean the approach is correct. Just raw instinct, but something seems wrong here.) — BruceET, Apr 09 '16 at 14:27
I am following the procedure (SAC test) for calculating a strict avalanche matrix according to the following paper, and then applying a goodness-of-fit test to determine whether the strict avalanche criterion is satisfied. This is the link to the paper: http://eprint.iacr.org/2010/564.pdf . The reason I think all the categories matter is that the strict avalanche criterion indirectly states that all observed frequencies should be 50000. — , Apr 09 '16 at 15:07
Now that I have performed the test, I am getting a chi-squared statistic that is very high and a p-value of 0. Can you suggest why this is so? — , Apr 09 '16 at 18:56
A large chi-squared statistic gives a tiny p-value, which leads you to reject the null hypothesis that the cell means are all $E = 50,000.$ Some $X_i$ may be too small, others too large. Presumably $\bar X \approx 50,000,$ or else you just have the wrong value of $E.$ Maybe try setting $E = \bar X$ from the start and use $df = rc - 1.$ Or perhaps deviations from $E$ occur in clusters, or the $X_i$ are nothing like Poisson. I shouldn't speculate further without learning more about the basic subject matter. If you can focus in on the real issue, maybe post another question, here or on 'crossvalidated' or on a cryptography site. — BruceET, Apr 09 '16 at 20:38
