
I am currently reading a "pop-science" book on statistical fallacies. On page 36 the authors discuss how events can cluster around certain locations by chance. They illustrate this with a $6 \times 6$ checkerboard and two dice.


Without further explanation they calculate that a square with $4$ events will happen roughly every second simulation ($\frac{1}{0.54} = 1.85 \approx 2$), where one simulation consists of $36$ rolls of two dice (as far as I understand).

I understand from this answer that, for a single fair die with $n$ faces, the expected number of rolls until some number repeats is:

$$\operatorname{E}[r] = \sum_{r=2}^{n+1} r \frac{n!(r-1)}{(n-r+1)!n^r}$$
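To make the formula concrete, here is a minimal Python sketch that evaluates the sum exactly with rational arithmetic. The function names are made up for illustration, and the sketch assumes the summand is read as the probability that the first repeat occurs on roll $r$, times $r$.

```python
from fractions import Fraction
from math import factorial

def first_repeat_pmf(n):
    """P(first repeat occurs on roll r), r = 2..n+1, for a fair n-sided die.

    Each term is n!(r-1) / ((n-r+1)! n^r), taken from the formula above.
    """
    return {
        r: Fraction(factorial(n) * (r - 1), factorial(n - r + 1) * n**r)
        for r in range(2, n + 2)
    }

def expected_rolls_until_repeat(n):
    """E[r] = sum over r of r * P(first repeat on roll r)."""
    return sum(r * p for r, p in first_repeat_pmf(n).items())

# n = 6 is an ordinary die; n = 36 stands in for the two-dice board.
for n in (6, 36):
    print(n, float(expected_rolls_until_repeat(n)))
```

For $n = 6$ this gives roughly $3.77$ rolls. The probabilities sum to exactly $1$, which is a quick sanity check on the formula.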

So, given that rolling two distinguishable dice is equivalent to rolling one 36-sided die, I can calculate the expected number of rolls until some square has two events. But how do I generalize this to $N$ events? I suspect that would be one way to figure out how many rolls it takes to see a square with $4$ events, right?

Also, according to a footnote, the authors appear to be using the Poisson distribution to calculate this.

I know that the Binomial distribution converges towards the Poisson distribution as the number of trials goes to infinity while the product $np$ remains fixed.

I see how one could e.g. calculate the probability that $4$ events occur in a specific square using both the Binomial and Poisson distribution for a simulation of 36 throws of a 36-sided die:

$${36 \choose 4} \left( \frac{1}{36}\right)^4 \left( \frac{35}{36}\right)^{32} \approx 0.014$$

$$\frac{1^4e^{-1}}{4!} \approx 0.015$$
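The two numbers above, and the convergence of the Binomial to the Poisson for fixed $np$, can be checked numerically. This is only a sketch: the helper names are illustrative, and the values of $n$ beyond $36$ are arbitrary choices to show the trend.

```python
from math import comb, exp, factorial

def binom_pmf(k, n, p):
    """P(exactly k successes in n independent trials, success probability p)."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

def poisson_pmf(k, lam):
    """P(exactly k events) for a Poisson variable with mean lam."""
    return lam**k * exp(-lam) / factorial(k)

K = 4      # number of events in the specific square
LAM = 1.0  # n * p, held fixed while n grows

poisson_value = poisson_pmf(K, LAM)
for n in (36, 360, 3600, 36000):
    b = binom_pmf(K, n, LAM / n)
    print(f"n={n:6d}  binomial={b:.6f}  poisson={poisson_value:.6f}  "
          f"diff={poisson_value - b:.6f}")
```

With $n = 36$ the Binomial value is about $0.0142$ against a Poisson value of about $0.0153$, and the gap shrinks as $n$ grows.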

But I am not sure how the authors used the Poisson distribution to calculate the expected number of simulations of 36 throws necessary to see an unspecified square with 4 events.

Sorry for the long question, I'd appreciate if someone could at least point out some book or other resource where I can learn all this from. Thanks a lot!

  • On your picture, what are the sticks/Roman numerals? Can you reproduce the specific statement the author is making (to make sure there is no confusion between a fixed "event," "any event," etc.)? – Clement C. Aug 24 '15 at 15:19
  • The idea is to roll two distinguishable fair dice (a white and a black die) and each time draw a line at the coordinates given by the dice. If you were to e.g. roll a 3 (white die) and a 4 (black die), you would draw a line at the square in row 3 and column 4.

    For a simulation you roll the dice 36 times, after which you have drawn 36 lines on a $6 \times 6$ board, for a mean of 1 line per square.

    "The mean value after 36 rolls is exactly 1 strike per box. Statistically one would expect a box with 4 strikes in approximately every second simulation (1/0.54 = 1.85 ~ 2) [...]."

    – user245312 Aug 24 '15 at 17:43
  • I am curious. Right now, I have the impression that what they do is (1) look at the probability to have 4 strikes in a fixed, specified square (as you computed with a Poisson r.v.); (2) do some union bound/assumption of independence (?) that seems dubious to multiply it by 36, to get the probability that this happens in /any/ square (getting something like .55); (3) consider this as a geometric r.v. to look at the expected number of occurrences needed for this to happen (the $1/.55$). I am definitely not certain of that, and if it is, it looks highly sketchy to me. – Clement C. Aug 24 '15 at 18:50

2 Answers


I think the cited book may be trying to state something much simpler than the question and answers (so far) seem to suppose.

One trial of the experiment consists of $36$ instances of selecting at random one of $36$ boxes. The expected number of boxes to be selected exactly $k$ times per trial is then ($^*$) $36\ p_k$, where

$$p_k = {36 \choose k} \left( \frac{1}{36}\right)^k \left( \frac{35}{36}\right)^{36-k}\ \ [k\in \{0,1,...,36\}]$$

with corresponding Poisson approximations.

For ease of expression, let's refer to a box selected exactly $k$ times (in one trial) as a "$k$-hit" box.

Now, the above rate of $36\ p_k$ $k$-hit boxes per trial is the same as $\frac{1}{36\ p_k}$ trials per $k$-hit box.

E.g., a rate of $36\ p_4 \approx 36\cdot 0.015 \approx 0.54$ $4$-hit boxes per trial is the same as approximately $\frac{1}{0.54}\approx 1.85\approx 2$ trials per $4$-hit box, just as the cited book states.
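As a numerical check of these rates, the following Python sketch tabulates $36\ p_k$ with both the exact binomial $p_k$ and its Poisson approximation. The names `p_exact`, `p_poisson` and `expected_k_hit_boxes` are illustrative only.

```python
from math import comb, exp, factorial

BOXES = 36   # squares on the board
THROWS = 36  # selections per trial

def p_exact(k):
    """Binomial probability that one given box is selected exactly k times."""
    p = 1 / BOXES
    return comb(THROWS, k) * p**k * (1 - p)**(THROWS - k)

def p_poisson(k):
    """Poisson approximation to p_exact, with mean THROWS / BOXES = 1."""
    lam = THROWS / BOXES
    return lam**k * exp(-lam) / factorial(k)

def expected_k_hit_boxes(k, pmf=p_exact):
    """E(N_k) = 36 * p_k, the expected number of k-hit boxes per trial."""
    return BOXES * pmf(k)

print(" k   exact  poisson  trials per k-hit box (exact)")
for k in range(6):
    exact = expected_k_hit_boxes(k)
    approx = expected_k_hit_boxes(k, p_poisson)
    print(f"{k:2d}  {exact:6.3f}  {approx:7.3f}  {1 / exact:8.2f}")
```

For $k = 4$ the exact rate is about $0.513$ and the unrounded Poisson rate is about $0.552$. The $0.54$ in the text comes from rounding $p_4$ to $0.015$ before multiplying by $36$.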

NB: The expected number of trials per $k$-hit box is $E(\frac{1}{N_k})$, where $N_k$ is the number of boxes that are hit exactly $k$ times in one trial. For this, $\frac{1}{E(N_k)}$ might not be a very good approximation; however, the latter appears to be what the book uses, perhaps because it can be more easily calculated (without need of simulation). Note that, although similar, this is not the same as the expected number of trials needed to obtain at least one $k$-hit box (which could also be readily computed from the binomial probabilities $p_k$).
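The difference between the two quantities can be seen in a small simulation. The sketch below is an illustration, not the book's method: it estimates $E(N_4)$ and the probability that a trial contains at least one $4$-hit box. Because trials are independent, the number of trials until the first such box is geometric, so its mean is the reciprocal of that probability. The function names, trial count and seed are arbitrary choices.

```python
import random
from collections import Counter

BOXES = 36
THROWS = 36

def count_k_hit_boxes(rng, k=4):
    """Run one trial and return N_k, the number of boxes hit exactly k times.

    Only valid for k >= 1: boxes that are never hit do not appear in the Counter.
    """
    hits = Counter(rng.randrange(BOXES) for _ in range(THROWS))
    return sum(1 for c in hits.values() if c == k)

def simulate(trials=20000, k=4, seed=0):
    """Return (estimate of E(N_k), estimate of P(N_k >= 1))."""
    rng = random.Random(seed)
    counts = [count_k_hit_boxes(rng, k) for _ in range(trials)]
    mean_boxes = sum(counts) / trials
    p_at_least_one = sum(c > 0 for c in counts) / trials
    return mean_boxes, p_at_least_one

mean_boxes, p_any = simulate()
print("estimated E(N_4):                    ", round(mean_boxes, 3))
print("1 / E(N_4):                          ", round(1 / mean_boxes, 2))
print("estimated P(at least one 4-hit box): ", round(p_any, 3))
print("mean trials until first 4-hit box:   ", round(1 / p_any, 2))
```

The second reciprocal is always at least as large as the first, since $P(N_4 \ge 1) \le E(N_4)$.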


($^*$) This can be proved easily by the same method as used here. Note that $X$ (= the maximum number of hits among the boxes, in one trial) is not the quantity of interest; rather, the focus is on $N_k$ (= the number of boxes that are hit exactly $k$ times in one trial). That is, the computation is not about $P(X=k)$, but rather $E(N_k)$. This expectation can be written $$E(N_k) = E(I^{(k)}_1 + I^{(k)}_2 + ... + I^{(k)}_{36}) = 36\ E(I^{(k)}_1) = 36\ p_k$$ where $I^{(k)}_j =1\text{ IF box }j\text{ is hit exactly }k\text{ times ELSE }0$.

NB: The indicator variables $I^{(k)}_j$ are not independent, but the result holds as stated because of the linearity of the expectation operator.

r.e.s.
  • The Poisson approximation here is to restore independence among the boxes? – Clement C. Aug 25 '15 at 18:17
  • Say you have 4 boxes, 2 with 2 balls and 2 with 0 balls. The probability of a box with 3 balls, after randomly allocating a 5th ball, is 0.5. If however you have 4 boxes with 1 ball, P(X = 3) = 0. So how can it be correct to assume independence? I also don't see any error in the simulation provided by Bruce Trumbo, which gives a different result. – user245312 Aug 25 '15 at 19:26
  • @ClementC. - No, there is no need to "restore independence" -- the Poisson approximations are just easier to compute than the exact binomial probabilities. – r.e.s. Aug 26 '15 at 01:58
  • I see. Don't really see the point in it there, however, since for these values the Binomial probability is clearly computable. – Clement C. Aug 26 '15 at 02:02
  • @user245312 - I've added some detail about the expected value calculation. The book appears to focus on the expected value of the number of boxes that are hit exactly $k$ times in one trial, not on the maximum number of hits in one trial. See especially Tabelle 7 in the same section of the book -- it tabulates Poisson approximations to what I have called $E(N_k)$ and $p_k$ for $k = 0,1,2,3,4,5$. (Note that these sum to $36$ and $1$, respectively, over all $k\in\{0,1,...,36\}$.) – r.e.s. Aug 26 '15 at 02:05
  • Thanks, makes sense to me now. What I thought the book was trying to state is the expected number of trials of the experiment until at least $1$ box is selected $4$ times. According to a simulation this turns out to be $\frac{1}{0.42}\approx 2.38$. The "expected value of the number of boxes that are hit exactly $4$ times in one trial" is indeed $\frac{1}{0.54}\approx 1.85$ (in agreement with another simulation). – user245312 Aug 26 '15 at 14:56
  • @user245312 - In your last sentence, I think you meant to write that the "expected value of the number of boxes that are hit exactly $4$ times in one trial" is indeed approximately $0.54$ (in agreement with your latest simulation, which verifies the binomial probability of about $0.513$), so there are approximately $\frac{1}{0.54} \approx 1.85$ trials per $4$-hit box. (More accurately, that's $\frac{1}{0.513} \approx 1.95$ trials per $4$-hit box.) – r.e.s. Aug 26 '15 at 15:29
  • @r.e.s I have to admit that I am still confused. Is $0.513$ the expected number of boxes that are hit exactly 4 times in one trial? Is $1.95$ then the expected number of trials for $1$ box with 4 hits to occur? I wrote another simulation and it returns $2.3478$ for the mean number of trials until a box with 4 hits occurs. – user245312 Aug 26 '15 at 18:25
  • @r.e.s. Conduct the following experiment: (1.) randomly allocate $36$ balls into $36$ boxes (2.) repeat 1. until you see a box with $4$ balls and write down the number of repeats it took you until this happened (3.) repeat 2. $N$ times (large $N$). What is the mean number of repeats of 1. until you saw a box with $4$ balls? According to my simulation, it is $2.3478$. – user245312 Aug 26 '15 at 18:33
  • @user245312 - Correction: In my last comment I mistakenly referred to 0.513 as the "binomial probability", when I meant to say it is the expected value $E(N_4)=36\cdot p_4$. I was trying to point out that this is consistent with the book, with my answer, and also with your own simulation here. The book appears to treat $\frac{1}{E(N_4)}$ as an approximation of $E(\frac{1}{N_4})$ (the "expected number of trials per occurrence of a $4$-hit box"). The experiment in your most recent comment simulates something very similar, but not quite the same. – r.e.s. Aug 27 '15 at 01:44

Comment:

Your analysis for a particular cell seems correct. However, you are looking for $P(X = 4)$, where $X$ is the maximum number of hits in any of 36 cells when the two-dice experiment is repeated 36 times. In the spirit of the problem, it may make more sense to look for $P(X \ge 4).$

If it helps toward an analytic solution, I conclude from a simulation in R that the author's claim is wrong. The table at the end of the following simulation gives the approximate distribution of $X$ correct to about 2 places--enough to see that $P(X = 4)$ is nearer to 0.39 than to 0.54. The author also claims that $P(X = 5) = 0.11$, while my simulation has $P(X = 5) \approx 0.09.$

However, the author is correct in his general message that such 'clusters' are more likely than one might guess from intuition. And one gets four or more hits about half the time.

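 m <- 10^5                # number of simulated 36-roll experiments
 x <- numeric(m)          # x[i] = maximum number of hits in any one cell
 for (i in 1:m) {
   cells <- sample(1:36, 36, replace = TRUE)  # 36 rolls; cells numbered 1..36
   x[i] <- max(rle(sort(cells))$lengths)      # longest run in the sorted rolls
 }
 mean(x == 4);  mean(x >= 4)                  # estimates of P(X = 4), P(X >= 4)
 ## 0.39411
 ## 0.49928
 round(table(x)/m, 3)                         # approximate distribution of X
 ## x
 ##     2     3     4     5     6     7     8     9 
 ## 0.018 0.482 0.394 0.090 0.014 0.002 0.000 0.000 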

Note on the simulation: Cells are numbered 1 through 36. The R function rle is for 'run length encoding'. Finding lengths of runs in sorted data is an easy (if perhaps not optimally efficient) way to find numbers of repeated hits.
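For readers who prefer Python, the same experiment can be sketched with a counter instead of run-length encoding. This is an illustrative translation rather than the code used for the table above. The function names, the smaller trial count and the seed are arbitrary, so the estimates will differ slightly from the R output.

```python
import random
from collections import Counter

def max_hits(rng, boxes=36, throws=36):
    """One experiment: throw `throws` times, return the largest count in any box."""
    hits = Counter(rng.randrange(boxes) for _ in range(throws))
    return max(hits.values())

def max_hits_distribution(trials=20000, seed=1):
    """Estimate the distribution of X, the maximum number of hits in any box."""
    rng = random.Random(seed)
    tally = Counter(max_hits(rng) for _ in range(trials))
    return {x: tally[x] / trials for x in sorted(tally)}

dist = max_hits_distribution()
for x, p in dist.items():
    print(x, round(p, 3))
print("P(X >= 4) ~", round(sum(p for x, p in dist.items() if x >= 4), 3))
```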

BruceET
  • I wrote a quick Python simulation and get results near 0.42 for P(X = 4) and near 0.09 for P(X = 5) in agreement with your simulation. Could it be because the authors assume a Poisson distribution for simplicity? – user245312 Aug 24 '15 at 19:08
  • I read the article in German (up to an approximation of my knowledge of German), and in the free translation offered (very amusing in places, hardly accurate). But I think I understood the problem exactly. Then I tried half a dozen "approximately correct" methods the author might have used, without getting a match to his answer. One guess is that he may have done an approximation, but impatiently, with too few iterations to get the right answer. – BruceET Aug 24 '15 at 19:32
  • The answer by r.e.s. makes a lot of sense to me now (see linearity of expectation). So the answer for P(X = 4) would be $36{36 \choose 4} \left( \frac{1}{36}\right)^4 \left( \frac{35}{36}\right)^{32} \approx 0.51$. The book gives 0.54, but that's easily explained by a Poisson approximation. But now I wonder, what went wrong with the simulation (annotated version of my Python simulation)? – user245312 Aug 26 '15 at 13:17
  • This does it. As r.e.s. wrote in a comment, "the book appears to focus on the expected value of the number of boxes that are hit exactly k times in one trial." – user245312 Aug 26 '15 at 13:40
  • Good. Now we know how the author got his answer. The question remains whether it is the answer to exactly the right question to make his point. It would be the cell with the max number of cases that got the attention. – BruceET Aug 26 '15 at 16:55
  • Here is the experiment that I thought the book wanted us to conduct (although with two dice and checkmarks): (1.) randomly allocate $36$ balls into $36$ boxes (2.) repeat 1. until you see a box with $4$ balls and write down the number of repeats it took you until this happened (3.) repeat 2. $N$ times (large $N$). What is the mean number of repeats of 1. until you saw a box with $4$ balls? According to my simulation, it is $2.3478$. What do you mean by the cell with the maximum number of cases? How often a box gets 36 out of 36 cases? – user245312 Aug 26 '15 at 18:47
  • I'm pleased that you believe you have the answer to your interpretation of the question. – BruceET Aug 26 '15 at 22:34