4

Informal description and motivation

I am comparing the output of two search engines. Each engine is searching over the same set of 2,000 documents and returning the top 20 hits. I'm trying to construct a hypothesis test where $H_0$ is the hypothesis that the search results are being randomly generated by both engines (i.e., that the results of one search engine are independent of the results of the other search engine).

My first strategy is to answer the question: If I select 20 numbers at random (and without replacement) from the integers 1 to 2,000 to form one hitlist, and repeat to form another hitlist of 20, how likely is it that they these two lists will share $0 \leq X\leq 20$ hits in common?


More formal, and general, description

Let $S_X:=\{1,2,...,N\},S_Y:=\{1,2,...,M\}$.

Let $X_k\subset S_X:|X_k|=k\leq N$ and $Y_j\subset S_Y:|Y_j|=j \leq M$

Finally, let $P(x\in X_1)=\frac{1}{N}\; \forall x \in S_X, \textrm{and}\; P(y \in Y_1)=\frac{1}{M}\;\forall y \in S_Y$, and, more generally:

$$P(x\in X_{k+1}|x \notin X_k) = \frac{1}{N-k} \; \forall x \notin X_k$$ and

$$P(y\in Y_{k+1}|x \notin Y_k) = \frac{1}{M-k} \; \forall y \notin Y_k$$


Question

Given $0\leq k \leq N$, $0 \leq j \leq M$, and $ 0 \leq c \leq \min \{k,j\}$, find $P(|X_k\cap Y_j|\leq c)$

  • 2
    The total number of ways to create the 2 lists is ${2000 \choose 20}^2$. The number of ways to create 2 lists with exactly $X \leq 20$ elements common is ${2000 \choose X}{ {2000-X} \choose {20-X}}^2$. Dividing these should give you the probability. – karmanaut Oct 21 '15 at 01:16
  • 1
    @karmanaut it seems like you're overcounting. See DirkGently's answer. –  Oct 21 '15 at 19:08

2 Answers2

2

Assuming $M\le N$, $$ P(|X_k\cap Y_j|=c)=\frac{\binom Mc\binom{M-c}{j-c}\binom{N-j}{k-c}}{\binom M j\binom N k}, $$ as there are $\binom M c$ ways of choosing the intersection, $\binom{M-c}{j-c}$ ways of choosing the rest of the elements of $Y_j$, and $\binom{N-j}{k-c}$ ways of choosing the rest of the elements of $X_k$.

DirkGently
  • 1,618
0

The experiment you are conducting follows the hypergeometric distribution where a success is a site appearing in the second list that already appeared in the first.

Suppose you have sampled the first 20 sites to form the first list. Using Wikipedia's notation:

  • the population size is $N = 20,000$,
  • the number of successes in the population is the size of the first list $K = 20$,
  • the number of draws you're making is the size of the second list $n = 20$,
  • you want to know the probability of $k$ successes i.e. $k$ sites appearing in both lists.

Suppose $X$ is a random variable following this distribution. Then $P(X = k) = \frac{{K \choose k}{N - k \choose n-k}}{{N \choose n}}$ and the expected number of successes is $E[X] = n \frac{K}{N}$.

So the expected number of sites appearing in both lists is $\frac{20^2}{20,000} = 0.02$ (if the lists are in fact generated independently at random).

rlchqrd
  • 103
  • 5