Which indicator variable best predicts a sampled value?

Question

There are two players in this cooperative game. I will first describe the basic game and then the real problem.

Consider first that player 1 is to sample a single value from a standard normal distribution. Player 2 knows the distribution player 1 samples from. After player 1 has seen the value they have sampled they can send player 2 a single bit of information. That is a single indicator value. Player 1 and 2 have agreed beforehand what the indicator value will be indicating.

To start with, player 2's prior belief is that player 1 has sampled from a standard normal distribution (because that is what was agreed). After receiving the single bit of information, player 2 will have some posterior belief about the distribution of the sampled value. The goal of both players is to agree a strategy to minimize the variance of player 2's posterior belief on average.

For example, they could agree that player 1 will tell player 2 the sign of the sampled value. If player 1 samples a positive value then player 2's posterior belief will now be the half normal distribution with variance $1 - \frac{2}{\pi}$. The same variance would occur if player 1 had sampled a negative value

Could they have chosen a different strategy which would have given a lower variance on average?

Now to the real problem. In the real version player 1 is to sample from some arbitrary continuous distribution with finite support. They can still confer to agree a strategy beforehand and player 2 still knows the distribution that player 1 will sample from.

Can they do any better than agreeing that player 1 will send to player 2 an indicator that indicates if the sampled value is above the median or not?

Bounty question

If we constrain the support to be $[0, 1]$, what distribution gives the largest possible gap from the result you get from sending an indicator that optimally minimizes the variance of player 2's posterior belief and the variance of player 2's belief if they were simply informed whether the sampled value is above the median or not?

user51547 · Accepted Answer · 2023-05-19T07:14:13.003

I think it need not always be the median. Suppose $X$ is a triangular distribution with density $f(x)=2x$ on $[0,1]$ and the players agree to send a bit $Z=\mathbb{1}\{X > t\}$ for some $t\in (0,1)$. The mean of $X$ is $\frac{2}{3}$ and the median is $\frac{1}{\sqrt{2}}$.

We want to find $$\min_{t\in(0,1)}\mathbb{E}\left[\text{Var}(X\vert Z)\right].$$

By the Law of Total Variance, $$\text{Var}(X) = \mathbb{E}\left[\text{Var}(X\vert Z)\right] + \text{Var}\left(\mathbb{E}\left[X\vert Z\right]\right),$$ so equivalently we want to find $$\max_{t\in(0,1)}\text{Var}\left(\mathbb{E}\left[X\vert Z\right]\right).$$

We have

$$\mathbb{E}\left[X\vert Z=1\right]= \frac{1}{P(X>t)}\int_{t}^{1}xf(x)dx=\frac{\frac{2}{3}\left(1-t^3\right)}{1-t^2},$$ and $$\mathbb{E}\left[X\vert Z=0\right]= \frac{2}{3}t.$$

Then

$$\text{Var}\left(\mathbb{E}\left[X\vert Z\right]\right)=t^2\left(\frac{2}{3}t-\frac{2}{3}\right)^2 + \left(1-t^2\right)\left(\frac{\frac{2}{3}\left(1-t^3\right)}{1-t^2}-\frac{2}{3}\right)^2,$$

and you can check this is maximized at $t=\frac{\sqrt{5}}{2}-\frac{1}{2}\neq \frac{1}{\sqrt{2}}.$

Thank you. Which distribution maximizes the gap between what you get with the median and the smallest possible variance? — Simd, May 19 '23 at 08:09
Maybe trivial, but I think you can make the gap arbitrarily large by scaling $X$ by a constant? If you consider the variance ratio instead, or constrain to only distributions with support on $[0,1]$, that's an interesting question, not sure. — user51547, May 20 '23 at 01:30
Given the lack of response to the bounty, maybe it is a hard question? — Simd, May 26 '23 at 10:34

score 1 · Answer 2 · answered May 29 '23 at 02:40

Assuming WLOG a zero mean distribution for $X$, and that the indicator can be expressed as a threshold $Z=\mathbb{1}(X>a)$, we get

$$\text{Var}(X|Z)=\frac{t_a^2}{F_a(1-F_a)} \tag 1$$

where $t_a = \int_a^{\infty} x f_X(x) dx$ and $F_a = \int_{-\infty}^a f_X(x) dx$

The critical point wrt $a$ occurs at

$$ a = - \frac{t_a}{2 F_a} \frac{1-2 F_a}{1-F_a} \tag 2$$

Indeed, if the distribution is even, then $a=0$ (the median) is a solution (a little more general: if the median coincides with the mean). But in general that is not so.

Thank you. My question is, which distribution maximizes the difference between the optimal result and that which you get by choosing the median? — Simd, May 29 '23 at 03:44

Which indicator variable best predicts a sampled value?

Bounty question

2 Answers2