0

$$\min_{\theta \in \Theta} \sup_{P: D(P, P_0) \leq \rho} \mathbb{E}_P\left[ \ell(\theta;(X,Y)) \right]$$

The above equation is from this paper in ML: (https://arxiv.org/pdf/1805.12018.pdf) on the top of page 2. I'm having a hard time understanding what that equation is even saying though, specifically the part in the middle. So, is it correct to say the equation is saying there is a distribution P, which is within a distance of row from P0, and we want to minimize the expected value given that distribution? Where does the support come into play though?

curlycharcoal
  • 700
  • 4
  • 11
cmed123
  • 133
  • 2
    $\rho$ = rho and $\sup$ is supremum. – mjw Jun 08 '20 at 04:07
  • So the ML people have moved on from fitting their models on data they do have, to fitting models on data they don't have? God help us. –  Jun 08 '20 at 04:34
  • Oh, my bad; I thought sup meant "support" and not supremum. As far as I understand (from online), supremum is just basically an upper bound on the quantity below it? – cmed123 Jun 08 '20 at 04:44

1 Answers1

1

My understanding is that there is a fixed distribution $P_0$ and the inner supremum is over the set of distributions that are at most $\rho$ away from it with respect to $D$ distance. The outer minimization is choosing $\theta$ in some set $\Theta$ such that the supremum is minimized.

In words, find $\theta \in \Theta$ such that the worst expected loss with respect to $P$, given that $P$ is at most $\rho$-away from $P_0$, is minimized. So, this is a kind of minmax problem where you are minimizing the worst-case expected loss.

The support comes into play in the expectation operator. The expectation is taken over $P$ which means summing (or integrating if $P$ is continuous) over the values in the support of $P$.

curlycharcoal
  • 700
  • 4
  • 11
  • Thank you! Sorry, I initially thought the "sup" meant support and not supremum. So, as far as I understand support in this case is the same thing as "domain"? But then for supremum, I understand it as meaning "upper bound"; how so we want to minimize the upper bound? Or is the supremum where the "worst-case" part that you're talking about comes from? – cmed123 Jun 08 '20 at 04:47
  • Yes, support is the domain of $P$. sup isn't only an upper bound, it's the smallest upper bound. So, when you minimize the sup of a function, you are minimizing the largest possible value it can take. Since we always want to keep loss as low as possible in ML, it's common to think about the maximum loss as the worst-case scenario. Notice that $\mathbb{E}_P[\ell (\theta; (X,Y)) ]$ is not a fixed number. This function is applied to all close enough $P$s. So, we are trying to minimize the value it takes on the 'worst-possible $P$'. – curlycharcoal Jun 08 '20 at 04:51
  • Thank you. So, I understand all except for the one part: "minimizing the largest possible value it can take", specifically the "largest value" part. I'm confused on how "largest" comes into play here. I know you're correct that it's a minimax problem but, for some reason, I'm having a hard time seeing the "worst-case" or "largest" part of it. I get the minimizing the supremum is minimizing the lowest upper bound. is the upper bound here the loss? – cmed123 Jun 08 '20 at 05:03
  • For simplicity, suppose $S = {P : D(P, P_0) \leq \rho}$ is a finite set, in which case you can replace sup with max. For a fixed $\theta$, you can find the maximum value in ${\mathbb{E}_P[\ell (\theta, (X,Y))] : P \in S }$ by simply computing all values in it. Now vary $\theta$ to find the one that yields the smallest maximum. I think the missing link for you is understanding what supremum is and how it relates to maxima. I recommending reading pages 3-5 of Baby Rudin. – curlycharcoal Jun 08 '20 at 05:14