6

A lottery, where 6 balls out of 50 are drawn randomly without replacement, allows players of the lottery to be off by at most 1 for each number on their lottery ticket. Additional rules of the lottery are as follows:

  • The balls are numbered 1 thru 50 and are all equally likely to be drawn.

  • Both the winning numbers drawn and the players tickets contain random numbers (from 1 to 50). That is, the player does not pick/choose their own numbers, it is done randomly for them via computer.

  • Both the winning numbers and the numbers on lottery tickets are sorted in ascending order and winning tickets must match/map in that order. For example, if the winning numbers drawn are (sorted) 5, 10, 15, 16, 20, and 30 and the ticketholder has 5, 10, 16, 17, 20, and 30 (also sorted), the 16s will not map together. The matching numbers are (5:5), (10:10), (15:16), (16:17), (20:20), and (30:30). That is a winning ticket.

So the question is how much more likely is a single lottery ticket expected to be a winner using the off by 1 rule verses a similar but much stricter lottery (also 6 out of 50 balls) but which disallows the off by 1 rule (numbers must match exactly)?

Work done so far:

The strict lottery chances of winning on a single ticket are 1 in about 15.9 million which is 1 / (50 choose 6).

The off by 1 lottery was simulated on computer and I am seeing a 503 increase in probability (about 1 in 31,600 chance of winning).

I would like to know if this type of problem can be done mathematically or if it is too difficult. Also, I was hoping someone could do a simulation also to help verify my finding of 503x increase in chances of winning.

Additional info that may be helpful for analysis... The worst case boost in expected win probability is 7x if (for example), the drawn balls are 1,2,3,4,5,6. This would match the following tickets: (1,2,3,4,5,6), (1,2,3,4,5,7), (1,2,3,4,6,7), (1,2,3,5,6,7), (1,2,4,5,6,7), (1,3,4,5,6,7), and (2,3,4,5,6,7). Best case boost is 729x where the drawn balls are something like 5, 10, 15, 20, 23, 30. The 3 most frequent boost scenarios (from simulation results) are 486x, 648x, and 729x. These are 3 out of 137 scenarios I am seeing in simulation of 100,000,000 decisions. I am not yet sure if there more.

If something is unclear, please ask before attempting to answer the question and I will clarify.

David
  • 1,702
  • 3
    You asked the same question two days ago, and it was put on hold. Why are you posting it again instead of trying to improve the original one??? – barak manos Jun 22 '15 at 06:22
  • 2
    Because I was asked to reword it for clarity so rather than try to clean up the previous post, I started fresh. – David Jun 22 '15 at 07:45
  • This version of the Question seems very clear. But that does not seem mean it's easy to solve. – BruceET Jun 23 '15 at 00:18
  • I agree there appear to be a lot of cases to handle in this problem but simulating it seems very easy (for me) so it should be for others too. Just think about what the winning ticket must have and just check those and count them up. Actually my simulation program doesn't even have any checks for winning tickets because the nested loops just loop thru them all and I just count up the innermost loop iterations. For example, if the lowest numbered ball chosen is 5, my outmost loops starts at 4, not at 1. 1 cannot possibly be a match so no sense even checking it. The trick is in the looping. – David Jun 23 '15 at 16:48
  • If someone wrote this simulation on a fast computer using a true complied language (such as c), I suspect they could simulate 1 million or more decisions per second so even 1 billion should only take 10 minutes or less. With that many, the boost should be very accurate. – David Jun 23 '15 at 16:50
  • The most virtual tickets my simulation checks is 729 cuz each loop will iterate at most 3 numbers and there are 6 nested loops. In many cases they will iterate less because of close neighbors in the drawn balls (such as 5 and 6). This simulation is simple conceptually to get to run but analyzing the results is not so simple as there appear to be at least 137 different cases and possibly more. Not what I expected. – David Jun 23 '15 at 17:06
  • The most amazing thing about this off by 1 variation to me is how it goes from only 1 possible winning ticket (without the off by 1), to 137 different categories/classes of different winning tickets. I didn't expect that at all. It is amazing that with only 6 balls this can happen. This is why I have some doubt about my simulation and was hoping someone else could write one to help verify. I spot checked a few of the 137 and they seem accurate. The categories come from a combination of how close or far the "neighbors" are apart and if any of them are "end neighbors" (such as 1 or 50). – David Jun 24 '15 at 16:08
  • For clarity I should define what I mean by different classes/categories of winning tickets. Anytime my simulation program sees a set of 6 drawn balls such that the number of associated winning tickets (such as 486) has not previous been seen, I log that as a new entry. In that case I would call that a 486x boost in win chances. It is amazing to me how the boosts are fairly well distributed over the 7x to 729x range as well. This problem has some very interesting properties that I (and likely many others) would not have expected. It warrants further/deeper analysis. – David Jun 24 '15 at 16:14
  • @Bruce Trumbo, I appreciate your efforts and welcome you to help analyze this problem. I am analyzing the 137 cases and should have more insight soon which will help. My program found 131 of them so far so I am waiting on the last few then I can play around with the numbers which I am dumping into a database along with sample 6 ball sequences. – David Jun 24 '15 at 23:08
  • Just for fun I played around with increasing the number of balls above 50 and noticed if it becomes 106, I get over 1 trillion possible winning tickets for all possible 6 ball combinations combined. That is, T(106+1, 6) > $10^{12}$. – David Jun 29 '15 at 16:11
  • Also I am rerunning the uncached recursion and having the computer count how many calls it makes to itself (to the function T). It has only been running about 1 hour and already it is at about 1 billion calls. I suspect it will hit somewhere on the order of 8 to 12 billion uncached. Quite a difference from about 12,000 cached. – David Jul 01 '15 at 01:41
  • The final tally for the # of recursive calls to function T(n,k) is about 14 billion. That of course is without caching (storing) the intermediate results. I just counted the each time it hit a "base case" and not the part where it calls T four times. If I counted all it would be over 30 billion. Caching the intermediate results in this example makes a HUGE difference. This is a very "cool" formula that works but of course the computer simulation gives much more detail such as all the buckets and # of recursive calls... Still, both methods are very useful and confirm each others correctness. – David Jul 01 '15 at 13:00

4 Answers4

2

Here's a quick derivation which gives the same solution as that provided by gar, although a slightly different recursion.

A winning $k$ ball combination consists of integer sequences $0<x_1<\cdots<x_k$ and $0<y_1<\cdots<y_k$ so that $|x_i-y_i|\le1$.

Let $A_k(n)$ be the number of ways to pick winning $k$ ball combinations so that $x_k,y_k\le n$. This is the same as $T(n+1,k)$ defined by gar.

Similarly, let $B_k(n)$ be the number of ways to pick $k$ ball combinations so that $x_k\le n$ and $y_k\le n+1$. We'd get the same number of instead we require $x_k\le n+1$ and $y_k\le n$ due to symmetry.

Obviously, $A_k(n)=B_k(n)=0$ for $n<k$. Otherwise, $A_0(n)=B_0(n)=1$.

Let's first consider $A_k(n)$. If $(x_k,y_k)=(n,n)$, what remains is picking the remaining $k-1$ balls with $x_{k-1},y_{k-1}\le n-1$, which can be done in $A_{k-1}(n-1)$ ways. If $(x_k,y_k)=(n,n-1)$, the remaining $k-1$ balls should have $x_{k-1}\le n-1$ and $y_{k-1}\le n-2$, which can be done in $B_{k-1}(n-2)$ ways; the same applies to $(x_k,y_k)=(n-1,n)$. Otherwise, $x_k,y_k\le n-1$, which leaves $A_k(n-1)$ alternatives. Thus, $$ A_k(n)=A_{k-1}(n-1)+2B_{k-1}(n-2)+A_k(n-1). $$

We derive a recursion for $B_k(n)$ in a similar way. If $(x_k,y_k)=(n,n+1)$, the remaining $k-1$ balls require $x_{k-1}\le n-1$, $y_{k-1}\le n$, leaving $B_{k-1}(n-1)$ alternatives. Otherwise, $x_k,y_k\le n$, leaving $A_k(n)$ alternatives. Thus, $$ B_k(n)=B_{k-1}(n-1)+A_k(n). $$

Plugging in $k=6$, $n=50$ gives $A_6(50)=8000567708$. Since each of the two integer sequences are drawn randomly with probability $1/\binom{n}{k}$, this gives the probability of winning $$ \frac{A_6(50)}{\binom{50}{6}^2} =\frac{8000567708}{15890700^2}\approx 0.00003168361647. $$

  • I wonder if this version will run faster as a recursive function than gars which is taking so long to run on my computer. – David Jun 27 '15 at 13:13
  • 1
    @David: It is possible to eliminate $B_k(n)$ from the first equation, which yields the same recursion as that of gar. In order to compute it fast, you need to store the results in a table so it doesn't have to recompute every value. Then, the time it takes is basically proportional to the size of the table, which is $k\times n$. – Einar Rødland Jun 27 '15 at 16:48
  • Einar, I gave you the checkmark cuz gar chimed in first with the answer that worked but you both did well thanks. What a fun problem to solve and it shows that with both math and computers, it is rather "easy" to solve. – David Jul 01 '15 at 01:12
  • Also, even with caching of the intermediate results, my implementation is showing over 12,000 recursive calls to T() function. However that is a huge reduction to the uncached version which I am still running and has so far hit over 13.7 billion calls to itself. – David Jul 01 '15 at 09:01
  • The recursive formula gives the correct answer but the simulation is more informative because it counts the buckets with probability of each and shows the lowest numbered ball combo that fit in each bucket. It is informative to see and analyze those. – David Jul 02 '15 at 12:05
1

The results of the "exhaustive" simulation are in. I simulated every possible (sorted) 6 ball drawn outcome (there are 50 choose 6 = 15,890,700 of them). In total, I got 8,000,567,708 (slightly over 8 billion) winning tickets which means there is a 503.48x boost in chances of winning using the off by 1 rule. I had my simulation program print out (to the screen) how many iterations and it did indeed show the correct 15,890,700. Note that even with a very small percentage of randomly simulated 6 ball draws, I was already getting 503x so this exhaustive simulation was not really necessary but it is reassuring. Also the 137 different buckets was confirmed here, meaning there are 137 different boost factors contributing to the final result of 503.48x boost in win probability (vs. the must match exactly lottery).

I welcome other analysis, simulations, and comments about these results and my simulation. If you do your own simulation, please include code and some output for a chance to win bounty of 100 points.

Update:

I have the results of the 137 buckets/categories of winning tickets (from all possible 6 balls combinations). Format of tuples is bucket #, boost amount, frequency (out of $50 \choose 6$ = 15,890,700), [sample 6 balls fitting this category (lowest # balls)], % this category occurs out of the $50 \choose 6$:

1 486 3882669 [1,4,7,10,13,16 ] 24.4335932%
2 729 2760681 [2,5,8,11,14,17 ] 17.3729351%
3 648 2509710 [2,4,7,10,13,16 ] 15.7935774%
4 432 1578938 [1,4,6,9,12,15 ] 9.9362394%
5 324 1280163 [1,4,5,8,11,14 ] 8.0560517%
6 405 805638 [1,3,6,9,12,15 ] 5.0698711%
7 270 665494 [1,3,6,7,10,13 ] 4.1879464%
8 576 442890 [2,4,7,9,12,15 ] 2.7871019%
9 567 295260 [2,4,6,9,12,15 ] 1.8580679%
10 216 219037 [1,2,5,7,10,13 ] 1.3783974%
11 243 212420 [1,2,5,8,11,14 ] 1.3367567%
12 288 167656 [1,4,5,8,10,13 ] 1.0550574%
13 360 160284 [1,3,6,8,11,14 ] 1.0086654%
14 180 136968 [1,3,6,7,10,11 ] 0.8619381%
15 378 105450 [1,4,6,8,11,14 ] 0.6635957%
16 162 88482 [1,2,4,6,9,12 ] 0.5568163%
17 384 78033 [1,4,6,9,11,14 ] 0.4910608%
18 351 71706 [1,3,5,8,11,14 ] 0.4512451%
19 240 70984 [1,3,6,7,10,12 ] 0.4467015%
20 504 50616 [2,4,6,9,11,14 ] 0.3185259%
21 135 40020 [1,2,5,6,8,11 ] 0.2518454%
22 333 27417 [2,4,5,7,10,13 ] 0.1725349%
23 108 26358 [1,2,3,6,9,12 ] 0.1658706%
24 495 25308 [2,4,6,8,11,14 ] 0.1592630%
25 189 21438 [1,2,4,7,10,13 ] 0.1349091%
26 144 16778 [1,2,4,6,9,11 ] 0.1055838%
27 234 13418 [1,3,5,8,9,12 ] 0.0844393%
28 225 9751 [1,3,6,7,9,12 ] 0.0613629%
29 126 9522 [1,2,4,7,8,11 ] 0.0599218%
30 90 8690 [1,2,4,6,48,50 ] 0.0546861%
31 512 8436 [2,4,7,9,12,14 ] 0.0530877%
32 120 6720 [1,2,5,7,48,50 ] 0.0422889%
33 150 6595 [1,3,6,7,8,11 ] 0.0415023%
34 192 6318 [1,2,5,7,10,12 ] 0.0397591%
35 312 5776 [1,3,5,8,10,13 ] 0.0363483%
36 315 5776 [1,3,6,8,10,13 ] 0.0363483%
37 72 5288 [1,2,3,5,8,10 ] 0.0332773%
38 198 4602 [1,3,4,6,9,12 ] 0.0289603%
39 222 4524 [1,4,6,7,9,12 ] 0.0284695%
40 252 4370 [1,4,5,8,10,12 ] 0.0275004%
41 306 4370 [1,3,5,7,10,13 ] 0.0275004%
42 330 4294 [1,4,6,8,10,13 ] 0.0270221%
43 105 3522 [1,2,4,7,8,10 ] 0.0221639%
44 96 3280 [1,2,3,6,8,11 ] 0.0206410%
45 160 3120 [1,4,5,6,9,11 ] 0.0196341%
46 186 3120 [2,3,4,6,8,11 ] 0.0196341%
47 168 2964 [1,2,4,7,9,12 ] 0.0186524%
48 336 2812 [1,4,6,8,11,13 ] 0.0176959%
49 81 2546 [1,2,3,5,8,11 ] 0.0160219%
50 45 1808 [1,2,3,4,7,10 ] 0.0113777%
51 63 1804 [1,2,4,7,49,50 ] 0.0113526%
52 171 1718 [1,3,4,6,8,11 ] 0.0108114%
53 210 1636 [1,3,6,8,10,50 ] 0.0102953%
54 296 1482 [2,4,5,7,10,12 ] 0.0093262%
55 320 1406 [1,3,6,8,11,13 ] 0.0088479%
56 440 1406 [2,4,6,8,11,13 ] 0.0088479%
57 100 943 [1,3,5,7,8,9 ] 0.0059343%
58 256 703 [1,4,6,9,11,50 ] 0.0044240%
59 441 703 [2,4,6,9,11,13 ] 0.0044240%
60 60 539 [1,2,3,5,7,9 ] 0.0033919%
61 84 486 [1,2,3,6,8,10 ] 0.0030584%
62 195 388 [1,3,5,8,9,11 ] 0.0024417%
63 117 361 [1,2,5,6,8,10 ] 0.0022718%
64 132 318 [1,3,4,6,9,10 ] 0.0020012%
65 204 310 [1,3,5,7,10,11 ] 0.0019508%
66 54 287 [1,2,3,5,8,9 ] 0.0018061%
67 70 248 [1,2,4,7,8,9 ] 0.0015607%
68 114 244 [1,3,4,6,8,50 ] 0.0015355%
69 99 242 [1,2,4,6,7,10 ] 0.0015229%
70 30 168 [1,2,3,4,7,8 ] 0.0010572%
71 40 168 [1,2,3,4,7,9 ] 0.0010572%
72 48 166 [1,2,3,6,7,50 ] 0.0010446%
73 124 160 [1,4,5,6,8,10 ] 0.0010069%
74 112 158 [1,2,4,7,9,50 ] 0.0009943%
75 147 158 [1,2,4,7,9,11 ] 0.0009943%
76 177 158 [1,3,5,6,8,11 ] 0.0009943%
77 156 156 [1,3,5,8,9,50 ] 0.0009817%
78 267 154 [1,3,5,7,9,12 ] 0.0009691%
79 64 122 [1,2,3,6,8,50 ] 0.0007677%
80 36 90 [1,2,3,5,6,8 ] 0.0005664%
81 42 90 [1,2,3,5,7,8 ] 0.0005664%
82 18 86 [1,2,3,4,5,8 ] 0.0005412%
83 75 84 [1,2,4,5,7,9 ] 0.0005286%
84 102 84 [1,2,44,46,48,50 ] 0.0005286%
85 33 82 [1,2,3,4,6,9 ] 0.0005160%
86 87 82 [1,2,4,5,7,10 ] 0.0005160%
87 69 80 [1,2,3,5,7,10 ] 0.0005034%
88 111 80 [1,2,5,7,8,10 ] 0.0005034%
89 130 80 [1,3,5,8,9,10 ] 0.0005034%
90 165 80 [1,2,5,7,9,11 ] 0.0005034%
91 141 78 [1,2,4,6,8,11 ] 0.0004909%
92 176 78 [1,3,4,6,9,11 ] 0.0004909%
93 185 78 [1,3,6,8,9,11 ] 0.0004909%
94 251 78 [2,4,5,7,9,11 ] 0.0004909%
95 208 76 [1,3,5,8,10,50 ] 0.0004783%
96 272 76 [1,3,5,7,10,12 ] 0.0004783%
97 273 76 [1,3,5,8,10,12 ] 0.0004783%
98 275 76 [1,3,6,8,10,12 ] 0.0004783%
99 28 47 [1,2,3,4,6,8 ] 0.0002958%
100 161 40 [2,4,5,7,8,10 ] 0.0002517%
101 148 39 [1,4,6,7,9,50 ] 0.0002454%
102 249 39 [2,4,6,7,9,11 ] 0.0002454%
103 200 38 [1,3,6,8,48,50 ] 0.0002391%
104 220 38 [1,4,6,8,10,50 ] 0.0002391%
105 377 38 [2,4,6,8,10,12 ] 0.0002391%
106 66 6 [1,2,4,6,7,50 ] 0.0000378%
107 22 4 [1,2,3,4,6,50 ] 0.0000252%
108 25 4 [1,2,3,4,48,50 ] 0.0000252%
109 27 4 [1,2,3,5,49,50 ] 0.0000252%
110 46 4 [1,2,3,5,7,50 ] 0.0000252%
111 52 4 [1,2,3,46,48,50 ] 0.0000252%
112 7 2 [1,2,3,4,5,6 ] 0.0000126%
113 12 2 [1,2,3,4,5,50 ] 0.0000126%
114 13 2 [1,2,3,4,5,7 ] 0.0000126%
115 15 2 [1,2,3,4,49,50 ] 0.0000126%
116 51 2 [1,2,4,5,7,8 ] 0.0000126%
117 55 2 [1,3,5,6,7,8 ] 0.0000126%
118 57 2 [1,3,4,5,7,8 ] 0.0000126%
119 58 2 [1,2,4,5,7,50 ] 0.0000126%
120 76 2 [1,3,5,6,7,50 ] 0.0000126%
121 78 2 [1,3,4,6,7,50 ] 0.0000126%
122 85 2 [1,3,4,5,7,9 ] 0.0000126%
123 91 2 [1,2,4,46,48,50 ] 0.0000126%
124 94 2 [1,2,4,6,8,50 ] 0.0000126%
125 95 2 [1,3,4,6,7,9 ] 0.0000126%
126 110 2 [1,3,4,6,48,50 ] 0.0000126%
127 118 2 [1,3,5,6,8,50 ] 0.0000126%
128 123 2 [1,2,4,6,8,10 ] 0.0000126%
129 149 2 [1,3,4,6,8,10 ] 0.0000126%
130 153 2 [1,3,5,6,8,10 ] 0.0000126%
131 155 2 [1,3,5,7,8,10 ] 0.0000126%
132 170 2 [1,3,5,7,48,50 ] 0.0000126%
133 178 2 [1,3,5,7,9,50 ] 0.0000126%
134 233 2 [1,3,5,7,9,11 ] 0.0000126%
135 16 1 [1,2,3,48,49,50 ] 0.0000063%
136 49 1 [1,2,4,47,49,50 ] 0.0000063%
137 169 1 [1,3,5,46,48,50 ] 0.0000063%

These results should be 100% accurate now since they are not from random 6 ball combos, they are from all possible 6 ball combos (out of 50 possible) counting each only once. If I multiply the boost amount by the occurring percentage of each bucket, I get an overall (weighted average) boost value of 503.4748449x. That should be the "mathematically" correct answer (using a computer).

Here is something else interesting. Bucket 20 has a boost factor of 504x. That is almost identical to the average of 503.47...x but just with a single case/bucket (out of 137). It seems very rare that something like that would happen for a problem so complex but that reinforces my observation and point that the boost factors are spread out rather uniformly over a 7x to 729x range.

Also here is something else interesting: Running the nested loops starting at {1,2,3,4,5,6] and proceeding sequentially revealed that bucket 137 (not sorted by frequency yet) was "hit" very early on (at 6 ball sequence [2,5,8,11,14,17]). It seems like any 6 balls sequence after that will fall into one of the already seen 137 buckets. This is somewhat surprising since the nested loops have to run up to [45,46,47,48,49,50].

David
  • 1,702
  • Also I wanted to comment that this problem IS doable on paper if the person is patient and is able to isolate all 137 different cases. Perhaps this would be a good class problem where the workload is divided up among many students and then the results are combined. Isolating the patterns of all 137 different cases would be a task. My simulation program reported these to me but I did not ask it to show me the details of those different cases, just that they exist. Some are kinda obvious such as drawn balls of 5,10,15,20,25, 30. This is clearly a 729x scenario. However, 1,6,11,16,21,26 is not. – David Jun 24 '15 at 07:38
  • 1,6,11,16,21,26 is a 486x scenario because the first digit can only be matched 2 ways, not 3. So 2x3x3x3x3x3 = 486. This scenario also happens with something like 10, 20, 30, 40, 45, 50 because the 50 can only be matched 2 ways (with 49 and 50). It is interesting to note how close this common 486x scenario is to the overall average of 503.48x. It may just be coincidental. – David Jun 24 '15 at 07:45
  • I am thinking I could capture all 137 cases in my program and sort them by frequency of occurrence starting with the most frequent and list them but it will be a long list. I am still amazed at how, by allowing each digit to be off by 1, so many scenarios (buckets) occur. On paper, and by not knowing how many there are, errors in computation would almost be guaranteed. The computer however easily catches them all and I can even have it print out the ball numbers so I can see what the actual scenario is. For example, balls 1,2,47,48,49,50 has 15 possible matching winning tickets. – David Jun 24 '15 at 12:51
  • Are you sure all of your "137" cases are equally likely? In classical model, getting balls 5, 10, 15, 20, 25, 30 and balls 1,2,3,4,5,6 are equally likely outcomes. For the 'within one' model it is more complicated. My simulation of a special case seems to have been unhelpful, and I am deleting it, along with my previous comments. – BruceET Jun 24 '15 at 16:29
  • I never said the 137 cases are equally likely. Remember these are classes of winning tickets. For example, (1,2,3,4,5,6) and (45,46,47,48,49,50) fit into the same class which is the 7x boost class (since there are 7 winning tickets that will match either). Also, (5,10,15,20,21,30) should be in the same class as (10,15,16,19,24,29). cuz they both have 1 very close neighbor and the others are at least 3 apart. – David Jun 24 '15 at 17:18
  • Remember the order things happen here. We are selecting 6 random equiprobable balls as the winning numbers and a lottery player gets 1 random set of 6 generated numbers. So if the winning numbers are say 7, 11, 14, 15, 30, and 49, all we are interested in counting (in this trial) is how much of a win boost does the off by 1 allowance contribute? The easiest way I have seen to calculate this is run nested loops starting with all player ticket numbers at -1. For example, 6,10,13,14,29, and 48 is the winning ticket with the lowest sum of the 6 numbers. Next winner would be 6,10,13,14,29,49. – David Jun 24 '15 at 17:27
  • I think you will have better luck attracting a fresh ideas and useful answers if you clearly describe your findings to date in your Question, organizing contents of your key comments. Then delete comments. Getting very difficult to follow. – BruceET Jun 24 '15 at 17:36
  • I am currently simulating all 137 cases and dumping them into a database and sorting them based on frequency. I can list perhaps the 10 most frequently occurring scenarios (with example winning numbers), since they will make up a good chunk of all the outcomes and we can use that as a starting point for some deep analysis of this "beast" of a problem. The first 100 or so are reasonably fast but the last 37 or so take a long time to show up (using randomly generated winning balls). – David Jun 24 '15 at 22:27
  • For the sake of verifying an algorithm that might exhaustively generate them, please post (here or elsewhere) all of the known cases. – r.e.s. Jun 25 '15 at 01:59
  • There are so many buckets it will take me so long to list them all. I listed the 5 most and the 5 least frequently occurring categories for analysis (for checking by hand) in my posted answer. If those are all correct then chances are my entire simulation is correct. – David Jun 25 '15 at 02:04
  • I suspect you're missing many cases. E.g. what about $(15,[1,2,3,4,49,50])$? – r.e.s. Jun 25 '15 at 02:34
  • OK, thanks for posting all the known cases. – r.e.s. Jun 25 '15 at 02:39
  • You're welcome. What amazes me is not only how many categories/buckets there are, but how "evenly distributed" then are across the 7x to 729x range. If you sort them by boost amount you will see that. It is totally not what I expected but seems accurate. I haven't found any flaws in my simulation yet and am waiting to see if anyone else does. I think it is correct. – David Jun 25 '15 at 02:51
  • Also, another somewhat amazing thing is that the 486x case which is most common may be the first case that pops up during a simulation of only 1 trial. Someone may see that and just say "oh about a 500x win boost" and they would be so very close to the correct answer of 503.48x which takes several orders of magnitude more trials to obtain. – David Jun 25 '15 at 03:20
1

You've already seen there are different cases of winning numbers $a_1 < \dots < a_6$. That is, the number of winning tickets under the off-by-one rule varies based on the gaps between the winning numbers. You can model the different cases as $7$-tuples $(x_0, x_1, x_2, x_3, x_4, x_5, x_6)$. Here $x_i$ denotes the gap between the $i$-th and $i+1$-th number for $1 \leq i \leq 5$, so $x_i = a_{i+1} - a_i$. The $x_0$ and $x_6$ denote the gap between the "edges", that is, the numbers 1 and 50, so $x_0 = a_1 - 1$ and $x_6 = 50 - a_6$.

For achieving the maximum number of winning tickets we have two "empty spots" around each $a_i$ that don't overlap with the other empty spots. Then, for each $a_i$ we have 3 matching numbers under the off-by-one rule, independently, so the number of winning tickets is $3^6 = 729$. The condition on the $x_i$ is that for $1 \leq i \leq 5$ we have $x_i \geq 2$ (for if $x_i = 1$ then we have one "overlapping gap" between two numbers), and we have $x_0, x_6 \geq 1$. Otherwise, we want the $x_i$ to sum to $50-6 = 44$. The number of winning numbers that meets these conditions is ${44 - 5-1\choose 7-1} = {38 \choose 6}$ (Stars and bars, where we first subtract 1 from each $x_i$ for $1 \leq i \leq 5$, so we're looking for 7-tuples of positive integers that sum to $44-5 = 39$)

The full answer can be computed by considering all different conditions for the $x_i$ . For $1 \leq i \leq 5$ we have 3 conditions: $x_i = 0 \vee x_i = 1 \vee x_i \geq 2$. For $i \in \{0, 6\}$ we have either $x_i = 0$ or $x_i \geq 1$.

For example, fixing $x_i = 1$ for a single $1 \leq i \leq 5$ and leaving the other conditions as in the maximum situation we just computed, we have a single "overlapping gap" between two numbers, hence the number of matching tickets will be $3^4 \cdot 8 = 648$. The number of winning numbers that match this will be ${39 \choose 5} \cdot 5$, where multiplication with 5 follows from that we can pick $i \in \{1, 2, 3, 4, 5\}$.

As we saw here, the $x_i$ do not play a unique role, i.e. they are basically symmetric (besides the difference between the "interior" ones and the ones at the edges). However, fixing two adjacent $x_i$ is different from fixing two non-adjacent $x_i$.

Mark
  • 1,433
  • Mark. That $39 \choose 5$ * 5 answer is not working for me. That gives about 18.12% if divided by $50 \choose 6$. My simulation is only showing 15.79% for the 648x bucket. I think it should be $38 \choose 5$ * 5. Please check and verify. – David Jun 25 '15 at 09:59
  • What I would also like to know is how can anyone (without the aid of an electronic computer), figure out and isolate all the different winning classes without missing any and without double counting any? This type of problem goes from super easy (without the off by 1 allowance), to a "nightmare" (on paper) with it. Another point is that even though I have IDed 137 different classes of winning tickets (based on their boost value), it could be that 2 (or more) different patterns of winners got lumped into the same bucket because they just happened to have the same boost factor. – David Jun 25 '15 at 10:05
1

The probability of winning is given by \begin{align*} P(win) &= \frac{T(n+1,k)}{\dbinom{n}{k}^2} \end{align*} where $T(n,k)$ is computed by the recurrence relation \begin{align*} T(n,k) &= T(n-1,k) + 2*T(n-1,k-1) + T(n-2,k-1) - T(n-2,k-2)\\ T(0,0) = T(1,0) = T(2,0) = T(2,1) &= 1\\ T(1,1) = T(2,2) &= 0 \\ T(n,k) &= 0 \;\;\;\;\;\;\text{if $k<0$ or $k>n$} \end{align*} as given in A209414

Hence, for $n=50, k=6$, the resulting probability is \begin{align*} P(win) &= \frac{8000567708}{\dbinom{50}{6}^2}\approx 0.0000316836 \end{align*}

I have matched with simulations for small values and they seem to agree.

For example, in J:

   n=: 9
   k=: 5
   sim=: 3 : 0
a=:/:~k?n
win=:/:~k?n
k=(+/0=|a-win)+(+/1=|a-win)
)
   (+/%#)(sim"0)1e6#0

which gave about 0.28, close to actual $\dfrac{4446}{\dbinom{9}{5}^2}\approx 0.280045351473923$

Update

Some implementations,

In python:

def memoize(f):
    """ Memoization decorator for functions taking one or more arguments. """
    class memodict(dict):
        def __init__(self, f):
            self.f = f
        def __call__(self, *args):
            return self[args]
        def __missing__(self, key):
            ret = self[key] = self.f(*key)
            return ret
    return memodict(f)

@memoize
def T(n,k):
    if n == 0 and k == 0:
        return 1
    if n == 1 and k == 0:
        return 1
    if n == 2 and k == 1:
        return 1                
    if n == 2 and k == 0:
        return 1   
    if n == 1 and k== 1:
        return 0
    if n == 2 and k == 2:
        return 0
    if n < k or k< 0:     
        return 0
    return T(n-1,k) + 2*T(n-1,k-1) + T(n-2,k-1) - T(n-2,k-2)
nn = 50
kk = 6
print T(nn+1,kk)

The memoization decorator is taken from here

If a CAS is used, it's easier to implement. E.g., in FriCAS:

)set fun cache all A
T(0,0) == 1
T(1,0) == 1
T(2,0) == 1
T(2,1) == 1
T(1,1) == 0
T(2,2) == 0
T(n,k) == 
  if k<0 or k>n then 0
  else T(n-1,k) + 2*T(n-1,k-1) + T(n-2,k-1) - T(n-2,k-2)
T(51,6)

and Einar Rødland's formula:

)set fun cache all A
)set fun cache all B
A(n,0) == 1
B(n,0) == 1
A(n,k) == 
    if k>n then 0
    else A(n-1,k-1)+2*B(n-2,k-1)+A(n-1,k)
B(n,k) == 
    if k>n then 0
    else B(n-1,k-1)+A(n,k)
A(50,6) 
gar
  • 4,948
  • I can confirm the answer for $n=50$, $k=6$, although my derivation was a little different. I guess the reasoning leading up to the recursion should be added to the answer. – Einar Rødland Jun 26 '15 at 20:31
  • @EinarRødland: I computed for small numbers and took it to oeis – gar Jun 27 '15 at 03:35
  • Wow so interesting. I have never seen a binomial squared in a counting problem in college. I made this problem up myself and worked reasonably hard on it to make sure I got correct results to share. I thank all those that upvoted it and even voted it as their favorite question. To me it seems like one of the most interesting counting problems I have come up with to date. I am surprised the final answer can be gotten mathematically without having to isolate all the 137 "buckets" (cases). but the P(win) answer matches mine within a very small fraction of a percent. Very impressive! – David Jun 27 '15 at 03:45
  • gar: Why is T(n,k) defined twice? – David Jun 27 '15 at 03:52
  • @David: I too enjoyed the problem! The square in the denominator is because the sample space for the player and the lottery ticket seller is the same. And there's a condition on the right side for the second definition (k<0 or k>n). – gar Jun 27 '15 at 03:57
  • Oh ok I didn't see that way out to the right until you mentioned it cuz it got "lost" in some clutter on my screen. It seems like my simulation was right on the money which I am happy about. Can you explain a little more about how the recurrence relation works and how you came up with it? I probably don't have to ask this but I will to satisfy my curiosity and maybe that of others viewing this problem... Is there any way (other than brute force) to (mathematically) come up with the 137 different "buckets" as I did in my simulation? – David Jun 27 '15 at 04:14
  • Ok, I updated the formula for readability. And as I mentioned in the previous comment, I computed for small values and looked up in oeis.org. I computed the exact values for $n=4$ and $k=1,2,3,4$, and looked up the sequence. – gar Jun 27 '15 at 04:26
  • @gar: I've posted a formal derivation, although giving a slightly different recursion formula. Result is the same. – Einar Rødland Jun 27 '15 at 10:01
  • @gar. I implemented your solution as a recursive function. It ran reasonably fast for n=9, k=5 but n=50,k=6 is taking "forever and a day". I am surprised it is taking so long. Did yours take a long time to compute for n=50 and k=6? – David Jun 27 '15 at 12:58
  • Don't compute the recurrence directly. The results must be cached (memoize) to get the answer. I can update with the python code I used if you want. – gar Jun 27 '15 at 13:46
  • @EinarRødland: Yes, that's a nice formal derivation, thanks! – gar Jun 27 '15 at 13:48
  • ok thanks, I will try storing (caching) the partial results. I'll try it with a small n like maybe 20 to 25 first and compare the speedup. Then hopefully I can "let 'er rip" with n=50. – David Jun 28 '15 at 02:55
  • Something interesting happened. For me, the recursion seems to run faster if I combine some of the cases so the code is shorter. I was able to get about a 10% speedup that way. I will just let the recursion run when I sleep and then when I wake I will check the answer and elapsed time and share the results. I am not caching the intermediate results because I want to see how long it takes the pure recursion to run. – David Jun 28 '15 at 04:02
  • I ran the recursion and also got T(51,6) = 8,000,567,708 but here is the funny part to me... the recursion (uncached) ran somewhere around 5 times slower than my actual simulation of all possible 6 ball drawings which also computes the number of winning tickets for each. I have to play around with caching to see if I can get the recursion faster. Depending on the programming environment, it is not just as simple as storing the intermediate results in memory. I am very impressed the recurrence relation works but not impressed with the very sluggish speed I get. I will work on it more. – David Jun 28 '15 at 10:47
  • Cached is around 6 orders of magnitude faster for me. That is, about 1 million times faster. I used a 2 dimensional array and let the program store and lookup the intermediate results such as T(10,5). I noticed that the values stored contain all 6 values of k from n = 6 up until n gets to 46. When n gets to 47, only the k values of 2 to 6 get stored. For n = 48, only k values of 3 to 6... I counted 275 stored values (including T(51,6) itself). – David Jun 28 '15 at 11:15
  • @gar. 2 "subquestions" related to this question: 1) How did you know to use this particular recurrence relation for this problem? 2) How does this recurrence relation even work? – David Jun 29 '15 at 23:58
  • Also, what did they do with formulas like this back in the days before computers? How could they compute some final answer such as for T(51, 6)? Even with my highly cached version, the computer is counting over 12,000 calls to itself for function T. Uncached I am not sure but I could check sometime soon. Perhaps it is millions. – David Jun 30 '15 at 11:43
  • @David : As I said earlier, I got the formula experimentally. Before computers, I think it was not their goal to compute large values! – gar Jun 30 '15 at 14:04
  • @David: Computing $T(n,k)$ for $n\le N$, $k\le K$ starting with the lowest values of $n$ and $k$ only requires filling in a $N\times K$ table: in this case 306 table cells in all. This is doable by hand (and a little patience). However, I suspect this recursion may also have a closed form since it is linear. – Einar Rødland Jul 01 '15 at 10:39
  • ok well my program told me 275 (table cells) were cached (I didn't bother caching the very small values of n and k) but even with caching about 90% of the possible n,k combos, my program told me over 12,000 calls to the T function. Without any cache I am seeing over 17.4 billion calls and it is still going. – David Jul 01 '15 at 11:15
  • @David : We can get the number of calls to that recursion by using a cached version of its modification. Modify the formula as $$T(n,k) = T(n-1,k) + T(n-1,k-1) + T(n-2,k-1) + T(n-2,k-2) + 1$$ and return all 1 at the boundaries. The value I got is $18692318145$ without caching and $413$ calls with caching. – gar Jul 01 '15 at 17:21
  • I got the same exact number uncached but since I didn't cache the small values of n and k, I got slightly over 12,000 calls to T but the runtime was only about 1/40th of a second on my dual core computer – David Jul 01 '15 at 23:12
  • I wonder what would be a good way to estimate the number of recursive calls not actually having a computer count them up. There are 4 calls in the last condition of the recursion but how would we estimate how many times that line will be called? When I first saw this equation I had no idea there would be that many calls and that it would take hours to evaluate uncached. – David Jul 01 '15 at 23:18