I'm reading section $2.5$ of Elements of Statistical Learning by Hastie et al. (second edition), and there's an equation I don't quite understand (here we have $N$ training samples, the $X$ are vectors in some dimension, and the $Y$ are scalars).
The authors write on page $24$:
Suppose that we know that the relationship between $Y$ and $X$ is linear, $$ Y = X^T \beta + \epsilon $$
where $\epsilon \sim N(0, \sigma^2)$ and we fit the model by least squares to the training data. For an arbitrary test point $x_0$, we have $\hat{y}_0 = x_{0}^T \hat{\beta}$, which can be written as $\hat{y}_0 = x_{0}^T \beta + \sum_{i=1}^{N} l_i (x_0) \epsilon_i$, where $l_i(x_0)$ is the $i$th element of $X(X^TX)^{-1} x_0$.
From what I understand, the least-squares solution in a noisy setting like this would be
$ \hat{\beta} = (X^TX)^{-1}X^TY = (X^TX)^{-1}X^T(X \beta + \epsilon')$ where $\epsilon'$ is a vector of noise values for each training sample.
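To convince myself that the quoted identity at least holds numerically, here is a minimal sketch in Python with NumPy. The sample size, dimension, noise level, and random seed are my own arbitrary choices (not from the book), and I'm reading $l(x_0)$ as the vector $X(X^TX)^{-1} x_0$ with one entry per training sample.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed sizes and noise level, chosen only for illustration
N, p, sigma = 50, 3, 0.5

X = rng.normal(size=(N, p))              # training inputs, one row per sample
beta = rng.normal(size=p)                # "true" coefficients
eps = rng.normal(scale=sigma, size=N)    # noise vector (epsilon' above)
Y = X @ beta + eps

# Least-squares estimate: beta_hat = (X^T X)^{-1} X^T Y
beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)

x0 = rng.normal(size=p)                  # arbitrary test point

# l(x0) = X (X^T X)^{-1} x0, one weight per training sample
l = X @ np.linalg.solve(X.T @ X, x0)

lhs = x0 @ beta_hat                      # x0^T beta_hat
rhs = x0 @ beta + l @ eps                # x0^T beta + sum_i l_i(x0) eps_i

print(np.isclose(lhs, rhs))
```

On this synthetic data the two sides agree up to floating-point error, so the formula itself seems fine; what I can't reproduce is the algebra behind it.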
I'm not seeing how this leads to the equation described in the last line of the quote. Any insights appreciated.
