
I am continuing from a previous question, "Expected Value for Sum of Unfair Dice equals Expected Value for Sum of Fair Dice?"

Introduction:

  • Suppose there are two dice : $D_1$ and $D_2$. Both $D_1$ and $D_2$ have 6 sides (i.e. $i = 1,2,3,4,5,6$).

  • If $D_1 = i$, there is a $0.5$ probability that $D_2 = i$, and the remaining probability is spread evenly over the other five faces (i.e. the second die matches the first die with probability $0.5$ and takes each of the other outcomes with probability $0.5/5 = 0.1$)

  • Suppose both dice are rolled $n$ times: $(x_1,y_1)$, $(x_2, y_2)$, ... $(x_n,y_n)$

  • The objective is to estimate the Expected Value of the sum of both dice, i.e. $Z = D_1 + D_2$ , thus we are interested in $E(Z)$
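To make the setup concrete, here is a minimal simulation sketch of the two dice. It assumes that $D_1$ is fair, which the description above does not actually state; the names `roll_pair` and `simulate` are just illustrative.

```python
import random

def roll_pair(rng):
    """Roll D1, then D2 with P(D2 = D1) = 0.5 and 0.1 for each other face."""
    x = rng.randint(1, 6)  # assumption: the first die is fair
    if rng.random() < 0.5:
        y = x  # second die repeats the first
    else:
        y = rng.choice([f for f in range(1, 7) if f != x])
    return x, y

def simulate(n, seed=0):
    """Return n paired rolls (x_k, y_k)."""
    rng = random.Random(seed)
    return [roll_pair(rng) for _ in range(n)]

rolls = simulate(100_000)
match_rate = sum(x == y for x, y in rolls) / len(rolls)
mean_sum = sum(x + y for x, y in rolls) / len(rolls)
print(match_rate, mean_sum)  # should land near 0.5 and near 7
```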

My Problem: Suppose there are 3 people that are trying to estimate $E(Z)$ based on these $n$ rolls

  • Person 1 is provided the individual dice rolls $(x_1,y_1)$, $(x_2, y_2)$, ... $(x_n,y_n)$ and is told that the dice rolls are independent
  • Person 2 is provided only with the sums of the dice rolls $z_1, z_2, ... z_n$ and is told that the dice rolls are independent
  • Person 3 is provided the individual dice rolls $(x_1,y_1)$, $(x_2, y_2)$, ... $(x_n,y_n)$ and is told that the outcome of Dice 2 depends on the outcome of Dice 1

Part 1: Expected Values (i.e. Mean Estimator)

  • Person 1: To calculate the expected value of the dice sum for Person 1, we can use the formula $E(x+y) = E(x) + E(y)$ :

$$E(Z) = \hat{Z} = \hat{E}(D_1) + \hat{E}(D_2) = \left( \sum_{i=1}^{6} i \cdot \frac{n_{1i}}{n} \right) + \left( \sum_{i=1}^{6} i \cdot \frac{n_{2i}}{n} \right)$$

  • Person 2: To calculate the expected value of the dice sum for Person 2, we can directly calculate the expected value of $z$: $$E(Z) = \hat{Z} = \sum_{k=2}^{12} k \cdot \frac{n_k}{n}$$

  • Person 3: To calculate the expected value of the dice for Person 3, we assume that every outcome of Dice 2 potentially depends on the outcome of Dice 1, thus resulting in a large sum of conditional probabilities (in reality, many of these will likely be $0$):

$$ E(Z) = \hat{Z} = \sum_{i=1}^{6} \sum_{j=1}^{6} (i+j) \cdot P(Y=j|X=i) \cdot P(X=i) = \sum_{i=1}^{6} \sum_{j=1}^{6} (i+j) \cdot \frac{n_{2ij}}{n_{1i}} \cdot \frac{n_{1i}}{n}$$

Apparently, because of the linearity of expectation (see "Proof of linearity for expectation given random variables are dependent"), the expected values for all 3 people are equal.
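As a quick check, the three formulas above can be evaluated on a small data set. This is only a sketch: the ten pairs in `rolls` are made up for illustration, and exact fractions are used so that the comparison does not depend on rounding.

```python
from collections import Counter
from fractions import Fraction

# Hypothetical sample of n = 10 paired rolls (x_k, y_k).
rolls = [(1, 1), (2, 2), (3, 5), (6, 6), (4, 1),
         (2, 2), (5, 5), (3, 3), (6, 2), (1, 4)]
n = len(rolls)

n1 = Counter(x for x, _ in rolls)       # counts n_{1i} for Dice 1
n2 = Counter(y for _, y in rolls)       # counts n_{2j} for Dice 2
nz = Counter(x + y for x, y in rolls)   # counts n_k for the sums
n12 = Counter(rolls)                    # joint counts n_{2ij}

# Person 1: sum of the two marginal means.
person1 = (sum(Fraction(i * c, n) for i, c in n1.items())
           + sum(Fraction(j * c, n) for j, c in n2.items()))
# Person 2: mean of the observed sums.
person2 = sum(Fraction(k * c, n) for k, c in nz.items())
# Person 3: conditional frequency times marginal frequency.
person3 = sum((i + j) * Fraction(c, n1[i]) * Fraction(n1[i], n)
              for (i, j), c in n12.items())

print(person1, person2, person3)  # 32/5 32/5 32/5
```

The factor $n_{1i}$ cancels in Person 3's formula, which is why it reduces to the plain average of $x_k + y_k$.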

Part 2: Variances of the Mean Estimators

For all people, we use the formulae:

  • $Var(Z) = E(Z^2) - [E(Z)]^2$
  • $Var(E(Z)) = \frac{Var(Z)}{n}$
  • $Cov(X,Y) = E(XY) - E(X)E(Y)$

Thus,

  • For Person 1:

$$\text{Var}(Z) = \left(\sum_{i=1}^{6} i^2 \cdot \frac{n_{1i}}{n}\right) - \left(\left(\sum_{i=1}^{6} i \cdot \frac{n_{1i}}{n}\right)^2\right) + \left(\sum_{j=1}^{6} j^2 \cdot \frac{n_{2j}}{n}\right) - \left(\left(\sum_{j=1}^{6} j \cdot \frac{n_{2j}}{n}\right)^2\right)$$

$$ \text{Var}(\hat{Z}) = \text{Var}(E(Z)) = \frac{1}{n} \left[ \left(\sum_{i=1}^{6} i^2 \cdot \frac{n_{1i}}{n}\right) - \left(\left(\sum_{i=1}^{6} i \cdot \frac{n_{1i}}{n}\right)^2\right) + \left(\sum_{j=1}^{6} j^2 \cdot \frac{n_{2j}}{n}\right) - \left(\left(\sum_{j=1}^{6} j \cdot \frac{n_{2j}}{n}\right)^2\right) \right]$$

  • For Person 2:

$$\begin{align*} {Var}(Z) = \left(\sum_{k=2}^{12} k^2 \cdot \frac{n_k}{n}\right) - \left(\left(\sum_{k=2}^{12} k \cdot \frac{n_k}{n}\right)^2\right) \end{align*}$$

$$\text{Var}(\hat{Z}) = \text{Var}(E(Z)) = \frac{1}{n} \left[ \left(\sum_{k=2}^{12} k^2 \cdot \frac{n_k}{n}\right) - \left(\left(\sum_{k=2}^{12} k \cdot \frac{n_k}{n}\right)^2\right) \right]$$

  • For Person 3:

$$Var(X) = E(X^2) - [E(X)]^2 = \left(\sum_{i=1}^{6} i^2 \cdot \frac{n_{1i}}{n}\right) - \left(\sum_{i=1}^{6} i \cdot \frac{n_{1i}}{n}\right)^2$$

$$Var(Y) = E(Y^2) - [E(Y)]^2 = \left(\sum_{i=1}^{6} i^2 \cdot \frac{n_{2i}}{n}\right) - \left(\sum_{i=1}^{6} i \cdot \frac{n_{2i}}{n}\right)^2$$

$$Cov(X,Y) = E(XY) - E(X)E(Y) = \left(\sum_{i=1}^{6} \sum_{j=1}^{6} i \cdot j \cdot \frac{n_{2ij}}{n_{1i}} \cdot \frac{n_{1i}}{n}\right) - \left(\sum_{i=1}^{6} i \cdot \frac{n_{1i}}{n}\right) \cdot \left(\sum_{i=1}^{6} i \cdot \frac{n_{2i}}{n}\right)$$

$$Var(Z) = Var(X+Y)= Var(X) + Var(Y) + 2Cov(X,Y) = \left(\sum_{i=1}^{6} i^2 \cdot \frac{n_{1i}}{n}\right) - \left(\sum_{i=1}^{6} i \cdot \frac{n_{1i}}{n}\right)^2 + \left(\sum_{i=1}^{6} i^2 \cdot \frac{n_{2i}}{n}\right) - \left(\sum_{i=1}^{6} i \cdot \frac{n_{2i}}{n}\right)^2 + 2\left(\sum_{i=1}^{6} \sum_{j=1}^{6} i \cdot j \cdot \frac{n_{2ij}}{n_{1i}} \cdot \frac{n_{1i}}{n}\right) - 2\left(\sum_{i=1}^{6} i \cdot \frac{n_{1i}}{n}\right) \cdot \left(\sum_{i=1}^{6} i \cdot \frac{n_{2i}}{n}\right)$$

$$\text{Var}(\hat{Z}) = \text{Var}(E(Z)) = \frac{1}{n} \left[ \left(\sum_{i=1}^{6} i^2 \cdot \frac{n_{1i}}{n}\right) - \left(\sum_{i=1}^{6} i \cdot \frac{n_{1i}}{n}\right)^2 + \left(\sum_{i=1}^{6} i^2 \cdot \frac{n_{2i}}{n}\right) - \left(\sum_{i=1}^{6} i \cdot \frac{n_{2i}}{n}\right)^2 + 2\left(\sum_{i=1}^{6} \sum_{j=1}^{6} i \cdot j \cdot \frac{n_{2ij}}{n_{1i}} \cdot \frac{n_{1i}}{n}\right) - 2\left(\sum_{i=1}^{6} i \cdot \frac{n_{1i}}{n}\right) \cdot \left(\sum_{i=1}^{6} i \cdot \frac{n_{2i}}{n}\right) \right]$$

Based on the above analysis, it seems that the Variance for Person 3 will be larger than the Variance of Person 1 and the Variance of Person 2 (since both Person 1 and Person 2 are not taking into consideration covariances). However, this suggests that the Variance estimate for Person 3 will be closer to the actual variance, while Person 1 and Person 2 will underestimate the actual variance.

My Question: Have I correctly calculated the variances for all 3 people?

Thanks!

  • Note: Additional Variance Identity

Case 1:$$ Var(X+Y) = E(X^2) - [E(X)]^2 + E(Y^2) - [E(Y)]^2 + 2Cov(X,Y)$$

Case 2:

$$\begin{align*} \text{Var}(X+Y) &= E((X+Y)^2) - (E(X+Y))^2 \\ &= E(X^2 + Y^2 + 2XY) - [ (E(X))^2 + (E(Y))^2 + 2E(X)E(Y) ] \\ &= E(X^2) - (E(X))^2 + E(Y^2) - (E(Y))^2 + 2E(XY) - 2E(X)E(Y) \end{align*}$$

  • In Case 1, we assume $2Cov(X,Y)= 0$
  • In Case 2, we assume $2Cov(X,Y) = 2E(XY) - 2E(X)E(Y) = 0$
  • Thus, after making these substitutions in Case 1 and Case 2: the variances are same
stats_noob

1 Answer


The first thing to do is to calculate the actual means and variances. We assume that $$D_1 \sim \operatorname{Categorical}(\boldsymbol p)$$ with vector parameter $\boldsymbol p = (p_1, p_2, \ldots, p_6)$ satisfying $0 < p_i < 1$ and $p_1 + \cdots + p_6 = 1$, such that $\Pr[D_1 = i] = p_i$ for $i \in \{1, \ldots, 6\}$. Then we have $$\operatorname{E}[D_1] = \sum_{i=1}^6 i p_i,$$ which we will call $\mu_1$ for convenience. The variance we will denote by $$\operatorname{Var}[D_1] = \sum_{i=1}^6 i^2 p_i - \mu_1^2 = \sigma_1^2.$$

Then conditioned on $D_1$, the second die's outcome obeys $$D_2 \mid D_1 \sim \operatorname{Categorical}(\boldsymbol \pi),$$ where $$\pi_j \mid D_1 = \begin{cases}0.1, & j \ne D_1 \\ 0.5, & j = D_1 \end{cases} = 0.1 + 0.4 \mathbb 1 (j = D_1).$$

Its conditional expectation and variance are $$\operatorname{E}[D_2 \mid D_1] = \sum_{j=1}^6 j (0.1 + 0.4 \mathbb 1 (j = D_1)) = 2.1 + 0.4 D_1 = \mu_2(D_1)$$ and $$\begin{align} \operatorname{Var}[D_2 \mid D_1] &= \sum_{j=1}^6 j^2 (0.1 + 0.4 \mathbb 1 (j = D_1)) - \mu_2(D_1)^2 \\ &= 9.1 + 0.4 D_1^2 - (2.1 + 0.4D_1)^2 \\ &= 4.69 - 1.68 D_1 + 0.24 D_1^2 = \sigma_2^2(D_1). \end{align}$$

Note that we can also write these explicitly as follows: $$\mu_2(D_1) = \begin{cases}2.5 & D_1 = 1, \\ 2.9 & D_1 = 2, \\ 3.3 & D_1 = 3, \\ 3.7 & D_1 = 4, \\ 4.1 & D_1 = 5, \\ 4.5, & D_1 = 6, \end{cases} \qquad \sigma_2^2(D_1) = \begin{cases}3.25 & D_1 \in \{1,6\}, \\ 2.29 & D_1 \in \{2,5\}, \\ 1.81 & D_1 \in \{3,4\}. \end{cases}$$

Then by the law of total expectation, $$\begin{align} \operatorname{E}[D_1 + D_2] &= \operatorname{E}[\operatorname{E}[D_1 + D_2 \mid D_1]] \\ &= \operatorname{E}[D_1 + \operatorname{E}[D_2 \mid D_1]] \\ &= \operatorname{E}[D_1 + \mu_2(D_1)] \\ &= \operatorname{E}[D_1 + 2.1 + 0.4 D_1] \\ &= 1.4 \operatorname{E}[D_1] + 2.1 \\ &= 1.4 \mu_1 + 2.1. \tag{1} \end{align}$$ By the law of total variance, $$\begin{align} \operatorname{Var}[D_1 + D_2] &= \operatorname{E}[\operatorname{Var}[D_1 + D_2 \mid D_1]] + \operatorname{Var}[\operatorname{E}[D_1 + D_2 \mid D_1]] \\ &= \operatorname{E}[\operatorname{Var}[D_2 \mid D_1]] + \operatorname{Var}[1.4 D_1 + 2.1] \\ &= \operatorname{E}[4.69 - 1.68 D_1 + 0.24 D_1^2] + 1.96 \operatorname{Var}[D_1] \\ &= 4.69 - 1.68 \operatorname{E}[D_1] + 0.24 \operatorname{E}[D_1^2] + 1.96 \sigma_1^2 \\ &= 4.69 - 1.68 \mu_1 + 0.24 (\operatorname{Var}[D_1] + \mu_1^2) + 1.96 \sigma_1^2 \\ &= 4.69 - 1.68 \mu_1 + 0.24 \mu_1^2 + 2.2 \sigma_1^2. \tag{2} \end{align}$$ Consequently $(1)$ and $(2)$ allow us to compute the desired moments for any general categorical distribution on the first die's outcome $D_1$; e.g., if $D_1$ is fair, then $\mu_1 = 7/2$, $\sigma_1^2 = 35/12$, and $$\operatorname{E}[D_1 + D_2] = 7, \quad \operatorname{Var}[D_1 + D_2] = 49/6.$$ Notice that the expectation of the sum is the same as if $D_1$ and $D_2$ were iid fair dice, but the variance is not: $49/6$ exceeds the variance $35/6$ of iid fair dice.
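Formulas $(1)$ and $(2)$ can be checked against a brute-force enumeration of the joint distribution. The sketch below uses exact rational arithmetic; the helper names and the `loaded` distribution (face $k$ with probability $k/21$) are illustrative assumptions, not part of the problem.

```python
from fractions import Fraction

def joint_pmf(p):
    """Joint pmf of (D1, D2): marginal p for D1, then the 0.5 / 0.1 rule for D2."""
    return {(i, j): p[i - 1] * (Fraction(1, 2) if i == j else Fraction(1, 10))
            for i in range(1, 7) for j in range(1, 7)}

def moments_of_sum(p):
    """Mean and variance of D1 + D2 by direct enumeration."""
    pmf = joint_pmf(p)
    mean = sum((i + j) * w for (i, j), w in pmf.items())
    var = sum((i + j) ** 2 * w for (i, j), w in pmf.items()) - mean ** 2
    return mean, var

def closed_form(p):
    """Mean and variance of D1 + D2 from formulas (1) and (2)."""
    mu1 = sum(i * p[i - 1] for i in range(1, 7))
    var1 = sum(i * i * p[i - 1] for i in range(1, 7)) - mu1 ** 2
    mean = Fraction(14, 10) * mu1 + Fraction(21, 10)
    var = (Fraction(469, 100) - Fraction(168, 100) * mu1
           + Fraction(24, 100) * mu1 ** 2 + Fraction(22, 10) * var1)
    return mean, var

fair = [Fraction(1, 6)] * 6
loaded = [Fraction(k, 21) for k in range(1, 7)]  # hypothetical loaded first die

print(moments_of_sum(fair))  # (Fraction(7, 1), Fraction(49, 6))
print(closed_form(fair) == moments_of_sum(fair),
      closed_form(loaded) == moments_of_sum(loaded))  # True True
```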


With this foundation in mind, the question now turns to the matter of estimation. The obvious estimators when the full data are furnished are the sample mean and the (bias-corrected) sample variance: $$\bar Z = \bar D_1 + \bar D_2,$$ and $$\widehat{\operatorname{Var}}[Z] = \frac{1}{n-1} \sum_{i=1}^n (x_i + y_i - \bar Z)^2 .$$ There is no need to use any other estimator because the correlation between $D_1$ and $D_2$ is reflected in the data itself. Moreover, the fact that both estimators can be calculated from the full data, or from the sums of paired observations, means that there is never any difference between the estimates of Persons 1 and 2. Your use of the formula $$\widehat{\operatorname{Var}}[Z] = \widehat{\operatorname{Var}}[D_1] + \widehat{\operatorname{Var}}[D_2] + 2 \widehat{\operatorname{Cov}}[D_1, D_2]$$ is superfluous, because the sum of the resulting estimators for the marginal variances and twice the sample covariance simplifies algebraically to the sample variance of the sum. In other words, your impression that the estimate that Person 3 calculates will be larger is incorrect. All three people will get the same estimates.
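The algebraic identity between the decomposed estimator and the sample variance of the sums can be seen on any data set. Here is a minimal sketch with made-up paired rolls and exact fractions; `sample_cov` and `sample_var` are hypothetical helpers using the $n-1$ denominator.

```python
from fractions import Fraction

# Hypothetical paired observations (x_k, y_k).
rolls = [(1, 1), (2, 2), (3, 5), (6, 6), (4, 1),
         (2, 2), (5, 5), (3, 3), (6, 2), (1, 4)]
xs = [x for x, _ in rolls]
ys = [y for _, y in rolls]
zs = [x + y for x, y in rolls]

def mean(v):
    return Fraction(sum(v), len(v))

def sample_cov(a, b):
    """Bias-corrected sample covariance (denominator n - 1)."""
    ma, mb = mean(a), mean(b)
    return sum((u - ma) * (w - mb) for u, w in zip(a, b)) / (len(a) - 1)

def sample_var(a):
    """Bias-corrected sample variance, i.e. the covariance of a with itself."""
    return sample_cov(a, a)

direct = sample_var(zs)  # what Person 2 can compute from the sums alone
decomposed = sample_var(xs) + sample_var(ys) + 2 * sample_cov(xs, ys)
print(direct == decomposed)  # True
```

The equality is exact rather than approximate, because $(x_k + y_k - \bar Z)^2$ expands term by term into the two squared deviations plus twice the cross product.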

That said, the issue becomes more nuanced if, instead of being given the full data, a person were given only the marginal moments. For instance, suppose a hypothetical Person 4 were given the sample totals and sample sums of squares $$(n\bar D_1, n\bar D_2) = \left(\sum_{i=1}^n x_i, \sum_{i=1}^n y_i\right), \quad (n\overline{D_1^2}, n\overline{D_2^2}) = \left(\sum_{i=1}^n x_i^2, \sum_{i=1}^n y_i^2 \right).$$ If Person 4 is told that $D_1$ and $D_2$ are independent, they would use the independence assumption to infer that $$\bar Z = \bar D_1 + \bar D_2, \quad \widehat{\operatorname{Var}}[Z] \overset{\text{ind}}{\approx} \overline{D_1^2} - (\bar D_1)^2 + \overline{D_2^2} - (\bar D_2)^2.$$ The mean estimate is okay because linearity of expectation does not require independence, but the variance estimate is not, because of the missing covariance term that you alluded to. In the absence of any other information from the sample, namely the sum of products $$\sum_{i=1}^n x_i y_i,$$ it is not possible to construct a variance estimator that accounts for the dependence of $D_2$ on $D_1$.

Perhaps surprisingly, if there were a Person 5 who was given $$\sum_{i=1}^n (x_i + y_i), \quad \sum_{i=1}^n (x_i + y_i)^2,$$ this person actually can estimate the variance of $Z$ where Person 4 could not, because the sum of squared sums contains the necessary information about the relationship between the outcomes of the two dice. But Person 5 could not use this to estimate either the means or variances of the individual dice because the marginal information is lost once the data pairs are added together.
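The contrast between Person 4 and Person 5 can be illustrated by simulation. This sketch assumes a fair first die and keeps only the reduced summaries each person receives; `var4` and `var5` are illustrative names, and the simple $1/n$ moment estimators are used for brevity.

```python
import random

rng = random.Random(1)
n = 200_000
sx = sy = sxx = syy = sz = szz = 0
for _ in range(n):
    x = rng.randint(1, 6)  # assumption: fair first die
    if rng.random() < 0.5:
        y = x
    else:
        y = rng.choice([f for f in range(1, 7) if f != x])
    # Person 4 sees only these marginal totals.
    sx += x
    sy += y
    sxx += x * x
    syy += y * y
    # Person 5 sees only these totals of the sums.
    sz += x + y
    szz += (x + y) ** 2

# Person 4, assuming independence: add the two marginal variances.
var4 = (sxx / n - (sx / n) ** 2) + (syy / n - (sy / n) ** 2)
# Person 5: variance of the sum from its first two moments.
var5 = szz / n - (sz / n) ** 2
print(round(var4, 2), round(var5, 2))
```

With a fair first die, `var4` should settle near $35/6 \approx 5.83$ and `var5` near $49/6 \approx 8.17$, so only Person 5 recovers the variance of $Z$.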


Having just read your original question, I think you have a fundamental misconception. You seem to think that the dependence or independence of variables plays a role in estimation even when data is complete. This is absolutely wrong.

Remember that estimators are statistics. They are calculated from observed data. Any relationship between the random variables, whether known or unknown, and any property of those variables, such as parameters and model specifications, is reflected in the values of those observed data. It is only when the data are somehow reduced (e.g., to insufficient statistics) in a way that loses information about the underlying model that the "nice" properties of the usual estimators, such as unbiasedness, may no longer hold.

heropup