5

This is taken from Hastie, Tibshirani and Friedman's Elements of Statistical Learning. They describe the squared loss error and to minimize the expectation of this error, which they call Expected (squared) prediction error, $EPE$. They first define the squared error loss as $L(Y, f(X))=(Y-f(X))^{2}$. The criterion for choosing $f$ becomes, $$EPE(f)=E(Y-f(X))^{2}$$ $$EPE(f)=\int(y-f(x))^{2}P(dx, dy)$$ The authors then write the following: the condition on $X$ to get, $$EPE(f)=E_{X}E_{Y|X}([Y-f(X)]^{2}|X)$$ I think I might be missing something basic here, but how did the authors arrive at this by conditioning on X?

Then they go on to write: it suffices to minimize $EPE$ pointwise as, $$f(x) = argmin_{c}E_{Y|X}([Y-c]^{2}|X=x)$$ How do the authors arrive at this conclusion i.e. that it suffices to minimize EPE pointwise? This is not very intuitive to me.

In the end, the authors say: the solution for the equation above is, $$f(x)=E(Y|X=x)$$ How do they arrive at this solution? Could someone describe the intermediate math to achieve this as a solution?

2 Answers2

2

In addition to what's already been answered, I hope this derivation helps.

$$\text{EPE}(f)=\int [y-f(x)]^2\text{Pr}(dx,dy)$$ $$=\int [y-f(x)]^2p(x,y)dxdy$$ $$=\int_x\int_y [y-f(x)]^2p(x,y)dxdy$$ $$=\int_x\int_y [y-f(x)]^2p(y|x)p(x)dxdy$$ $$=\int_x\left(\int_y[y-f(x)]^2p(y|x)dy\right)p(x)dx$$ $$=\int_xE_{Y|X}\left([Y-f(X)]^2|X\right)p(x)dx$$ $$=E_XE_{Y|X}\left([Y-f(X)]^2|X\right)$$

Using slightly different notation to that of the book, $$f(x)=\text{argmin}_fE_{Y|X}([Y-f]^2|X=x)$$ $$\frac{\partial{EPE(f)}}{\partial{f}}=\frac{\partial}{\partial{f}}\left(E([Y-f]^2|X=x\right)=0$$ $$\frac{\partial}{\partial{f}}\left(E(Y^2-2Yf+f^2|X=x\right)=0$$ $$\frac{\partial}{\partial{f}}\left(E(Y^2|X=x)-2fE(Y|X=x)+f^2\right)=0$$ $$0-2E(Y|X=x)+2f=0$$ $$f(X)=f^*=E(Y|X=x)$$

phip
  • 36
  • 3
0

Take two functions $f,g$. If $f \leq g$ pointwise everywhere, then also $\mathbb{E} f \leq \mathbb{E} g$ for any measure.

Denote the pointwise $EPE$ of a function $g$ at $x$ by $F_g(x)$. We have that $F_f \leq F_g$ pointwise for any $g$. Thus also $\mathbb{E} F_f \leq \mathbb{E} E_g$ for any measure.

Oxonon
  • 266
  • In context of this problem, $F_{g}(x)$ represent $E_{X}$ and $F_{f}$ represents $E_{Y|X}$, and by your explanation it suffices to minimize $E_{Y|X}$ because $E_{Y|X} \leq E_{X}$. Am I understanding this correctly? The basic idea here being $Y|X \subseteq X$ ? – brewberry234 Dec 18 '21 at 20:21
  • Here $F_g(x) = \mathbb{E}[ (Y-g(X))^2 | X=x]$. – Oxonon Dec 20 '21 at 17:56
  • For the second question, which I've just noticed.. If you consider a random variable $Z$, then the $c$ minimising $\mathbb{E} (Z-c)^2$ is the expectation of $Z$. Consider this with the random variable $Y|X$ in place of $Z$ to answer your question (this is an intentionally non-rigorous explanation, since that the conditional expectation is the minimiser follows by definition of conditional expectation, and thus I can only recommend reading an intro to probability text first. That will explain this and more far better than I can). – Oxonon Dec 20 '21 at 18:00