Understanding a statement from Elements of Statistical Learning about noisy linear regression

Question

I'm reading section $2.5$ of Elements of Statistical Learning by Hastie et al. (second edition), and there's an equation I don't quite understand (here we have $N$ training samples, the $X$ are vectors in some dimension, and the $Y$ are scalars).

The authors write on page $24$:

Suppose that we know that the relationship between $Y$ and $X$ is linear, $$ Y = X^T \beta + \epsilon $$

where $\epsilon \sim N(0, \sigma^2)$ and we fit the model by least squares to the training data. For an arbitrary test point $x_0$, we have $\hat{y}_0 = x_{0}^T \hat{\beta}$, which can be written as $\hat{y}_0 = x_{0}^T \beta + \sum_{i=1}^{N} l_i (x_0) \epsilon_i$ where $l_i(x_0)$ is the $i$'th element of $X(X^TX)^{-1} x_0$.

From what I understand, the linear least-squares solution for a noisy scenario like this would be

$ \hat{\beta} = (X^TX)^{-1}X^TY = (X^TX)^{-1}X^T(X \beta + \epsilon')$ where $\epsilon'$ is a vector of noise values for each training sample.

I'm not seeing how this leads to the equation described in the last line of the quote. Any insights appreciated.

angryavian · Accepted Answer · 2023-03-07T04:18:50.350

If you look carefully at the text, the $X$ in $Y=X^\top \beta + \epsilon$ is not bold, while the $X$ in the $\mathbf{X}(\mathbf{X^\top X})^{-1} x_0$ is bold.

This is because $Y$ is a real number while $X$ and $\beta$ are in $\mathbb{R}^p$, and $Y = X^\top \beta + \epsilon$ represents the model for a single data point. If you have $n$ data points, then this equation becomes

$$y_{n \times 1} = \mathbf{X}_{n \times p} \beta_{p \times 1} + \epsilon'_{n \times 1}$$ where I have added the dimensions of each term for clarity.

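Thus, $$\hat{\beta} = (\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top (\mathbf{X}\beta + \epsilon') = \beta + (\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top \epsilon'.$$

Multiplying on the left by $x_0^\top$ gives $$\hat{y}_0 = x_0^\top \hat{\beta} = x_0^\top \beta + x_0^\top (\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top \epsilon' = x_0^\top \beta + \sum_{i=1}^{n} l_i(x_0)\, \epsilon'_i,$$ where $l_i(x_0)$ is the $i$th element of $\mathbf{X}(\mathbf{X}^\top \mathbf{X})^{-1} x_0$, because the row vector $x_0^\top (\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top$ is the transpose of that vector.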
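If it helps, the identity quoted in the question, $\hat{y}_0 = x_0^\top \beta + \sum_i l_i(x_0)\,\epsilon_i$, can be checked numerically. The following is only a rough sketch under assumptions of my own: $p = 2$ predictors (so the $2 \times 2$ matrix $\mathbf{X}^\top \mathbf{X}$ can be inverted by hand), made-up values for $\beta$, $\sigma$ and $x_0$, and a hypothetical helper `xtx_inv_times` that is not from the book.

```python
import random

def xtx_inv_times(X, v):
    """Return (X^T X)^{-1} v for an n x 2 matrix X given as a list of rows."""
    # Entries of the symmetric 2x2 matrix X^T X = [[a, b], [b, d]].
    a = sum(r[0] * r[0] for r in X)
    b = sum(r[0] * r[1] for r in X)
    d = sum(r[1] * r[1] for r in X)
    det = a * d - b * b
    # Closed-form inverse of a 2x2 matrix applied to v.
    return [(d * v[0] - b * v[1]) / det, (a * v[1] - b * v[0]) / det]

def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

rng = random.Random(0)          # fixed seed, so the run is repeatable
n, sigma = 50, 0.5              # assumed sample size and noise level
beta = [2.0, -1.0]              # assumed "true" coefficients

# Each row of X is one X_i^T; y = X beta + eps' stacks the n single-point models.
X = [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(n)]
eps = [rng.gauss(0, sigma) for _ in range(n)]
y = [dot(row, beta) + e for row, e in zip(X, eps)]

# Least-squares fit: beta_hat = (X^T X)^{-1} X^T y.
Xty = [sum(row[j] * yi for row, yi in zip(X, y)) for j in range(2)]
beta_hat = xtx_inv_times(X, Xty)

# Weights l(x0) = X (X^T X)^{-1} x0, one entry per training point.
x0 = [0.3, 1.2]                 # arbitrary test point
w = xtx_inv_times(X, x0)
ell = [dot(row, w) for row in X]

lhs = dot(x0, beta_hat)                  # y0_hat = x0^T beta_hat
rhs = dot(x0, beta) + dot(ell, eps)      # x0^T beta + sum_i l_i(x0) eps_i
print(abs(lhs - rhs))
```

The printed difference should be at the level of floating-point rounding, which is consistent with the two expressions for $\hat{y}_0$ being the same quantity.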

edited Mar 07 '23 at 04:18

answered Mar 07 '23 at 02:05


I understand the gist of what you're saying, but the transposition of $X$ seems off in that last line, if I am not mistaken? – IntegrateThis Mar 07 '23 at 02:27

@IntegrateThis Sorry, fixed a typo. The data matrix $\mathbf{X}$ is typically $n \times p$ (consisting of $n$ row vectors $X_i^\top$), so there is no transpose in the equation $y = \mathbf{X} \beta + \epsilon'$. – angryavian Mar 07 '23 at 04:20
