In a simple linear regression the predicted y values are also the “conditional means” at each x value. For each x value, there is a distribution of y values in the population. How exactly do we know each y value on the regression line is the mean of the conditional distribution of y at that x value?

I’m trying to think of this in the simplest way possible, with 10 x values and 10 y values. If y on the regression line is 5 when x is 1, then one would say “when x is 1, the mean value of y is 5.” How does the line tell us the “mean” of y when we only have one actual y value to work with?

3 Answers

How exactly do we know each y value on the regression line is the mean of each conditional distribution for each y value?

To understand what's going on here, I think it's important to separate the theoretical probabilistic model from the parameter fitting algorithm based on actual observed data. On the one hand, we have the linear model, which is an abstract mathematical structure, and on the other hand, we have the regression line that has been calculated numerically from data, using, say, ordinary least squares. So I will break the explanation into these two parts.

The linear model

In this section, we'll go over the definition of a linear model. First, we assume we have random variables $Y, X_1, X_2, \ldots, X_p, \epsilon$ on the same sample space $\Omega$. This means in particular that these are all functions from the same set $\Omega$ to the real numbers. That is,

$Y, X_1, X_2, \ldots, X_p, \epsilon : \Omega \rightarrow \mathbb{R}$.

Now suppose there are constants $\beta_0, \beta_1, \beta_2, \ldots, \beta_p \in \mathbb{R}$ such that

  1. $Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_p X_p + \epsilon$, and

  2. $E[Y \mid X_1 = x_1, X_2 = x_2, \ldots, X_p = x_p] = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_p x_p$.

Then we say that there is a linear model of $Y$ in terms of $X_1, X_2, \ldots, X_p$, and we refer to the formula in criterion (1) as the linear model itself. The $\beta_i$ are called the coefficients or parameters of the model.

Note that both criteria (1) and (2) are definitions, and both are needed for the definition of a linear model. That is, there is no derivation of them. However, we can explain the intuition behind them as follows. Criterion (2) says that for each fixed set of values $x_i$ of the variables $X_i$, the value of $Y$ will be, on average, a linear combination of the $x_i$ (plus a constant $\beta_0$), hence the term linear model. Note that we are only enforcing this strict linear relationship on average. We are allowing for the possibility that other values of $Y$ can occur, which we express formally through the use of the random variable $\epsilon$ in criterion (1). We call $\epsilon$ the error.

Taking the conditional expected value of both sides of equation (1), and then plugging in equation (2), we find that

$E[\epsilon \mid X_1 = x_1, X_2 = x_2, \ldots, X_p = x_p] = 0$.

That is, the error has conditional mean $0$. We have derived this fact from the definition of a linear model.
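Spelled out, this step uses only the linearity of conditional expectation and the fact that each $X_j$ is fixed at $x_j$ under the conditioning. Writing $\mathbf{x}$ as shorthand for the event $X_1 = x_1, \ldots, X_p = x_p$ (a shorthand used only in this display):

$$
\begin{aligned}
E[Y \mid \mathbf{x}] &= E[\beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p + \epsilon \mid \mathbf{x}] \\
&= \beta_0 + \beta_1 x_1 + \cdots + \beta_p x_p + E[\epsilon \mid \mathbf{x}].
\end{aligned}
$$

By equation (2), the left-hand side is already equal to $\beta_0 + \beta_1 x_1 + \cdots + \beta_p x_p$, so subtracting that quantity from both sides leaves $E[\epsilon \mid \mathbf{x}] = 0$.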

Estimating the parameters of a linear model using data

Now suppose we have a collection of data $(y_1, y_2, \ldots, y_n)$. The connection to the above section comes by thinking of each observation $y_i \in \mathbb{R}$ as a realization, or observed value, of a random variable $Y$. Upon observing a given $y_i$, we may have also observed other quantities $x_{i1}, x_{i2}, \ldots, x_{ip} \in \mathbb{R}$. Each such $x_{ij}$ is also thought of as the realization of a random variable $X_j$. For example, we could study the population of stars in the Milky Way galaxy, and for a sample of $n$ such stars we could record that star $i$ has apparent brightness $y_i$, distance from the Earth $x_{i1}$, and frequency of emitted light (color) $x_{i2}$. In this case, we would be trying to model apparent brightness $Y$ as a function of distance $X_1$ and color $X_2$.

These variables $Y, X_1, X_2, \ldots, X_p$ we have defined may or may not actually obey the properties of a linear model, defined above, on the given population. If they don't, then we can still write down the corresponding model (as in equation (1) above) -- it just won't line up well with the data.

But suppose they do obey a linear model. Then by definition there are constants $\beta_i \in \mathbb{R}$ serving as the coefficients of this model, but we may not know their true values. A natural question is then, can we use the data itself to determine the $\beta_i$, or at least estimate them closely? This is what a fitting algorithm like ordinary least squares accomplishes.

We can now answer your question, quoted at the beginning. If we use a consistent estimator, such as ordinary least squares, then $\hat{\beta}_i$ converges in probability to $\beta_i$ as the sample size $n$ increases. So with enough data, the estimate $\hat{\beta}_i$ should be close to the true $\beta_i$. And in turn, if the estimates for $\beta_i$ are close to their true values, then the resulting regression function (a line, if $p = 1$) will be close to the true function defined by the right hand side of equation (2), which is by definition (the left hand side of equation (2)) the conditional expected value of $Y$. And for any fixed $\mathbf{x} = (x_1, x_2, \ldots, x_p)$, if there are enough observed $y$ values at this $\mathbf{x}$, then their mean should be close to the conditional expected value of $Y$ at $\mathbf{x}$, thanks to the law of large numbers.
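To make this concrete, here is a minimal simulation sketch in Python with NumPy. Everything specific in it is an assumption made for illustration: the "true" coefficients `beta0` and `beta1`, the standard normal error, the ten x levels, and the helper name `simulate_and_fit`. It compares the fitted line with the true conditional means, and with the sample mean of the observed y values at each x, once with a single observation per x and once with many.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical "true" parameters, chosen only for this illustration.
beta0, beta1 = 2.0, 3.0
x_levels = np.arange(1, 11)  # ten distinct x values


def simulate_and_fit(reps_per_x):
    """Draw reps_per_x observations of Y at each x level and fit OLS."""
    x = np.repeat(x_levels, reps_per_x).astype(float)
    y = beta0 + beta1 * x + rng.normal(0.0, 1.0, size=x.size)
    # polyfit returns coefficients from highest degree to lowest.
    slope, intercept = np.polyfit(x, y, deg=1)
    fitted = intercept + slope * x_levels
    # Sample mean of the observed y values at each x level.
    group_means = y.reshape(len(x_levels), reps_per_x).mean(axis=1)
    return fitted, group_means


# True conditional means E[Y | X = x] under the assumed model.
true_means = beta0 + beta1 * x_levels

for reps in (1, 10_000):
    fitted, group_means = simulate_and_fit(reps)
    print(reps,
          np.abs(fitted - true_means).max(),    # line vs. true conditional mean
          np.abs(group_means - fitted).max())   # observed group means vs. line
```

With one observation per x (the ten-point setting from the question), the "group mean" at each x is just the single observed y, and it can sit well off the line. With many observations per x, both discrepancies should shrink, which is the behaviour the consistency and law of large numbers arguments describe.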

Note that when I say "should be" in the paragraph above, this means with high probability, but not with certainty. Indeed, this probability might never reach $1$ with a finite sample size, if the population is infinite.

Summary

In summary, assuming a linear model means that you are assuming this is how the data will actually behave. If the population actually does follow the linear model you've proposed, then with enough observations you should see all the properties of the linear model, described above, realized in your actual data.

In particular, for any fixed observed $\mathbf{x} = (x_1, x_2, \ldots, x_p)$, the mean of the corresponding observed $y$ values should be close to the $y$ value on the ordinary least squares regression line. If it isn’t, this means (short of witnessing a very low probability event) that either the population you are studying doesn't follow this model perfectly, or you haven't collected enough data. Both of these shortcomings are probably going to happen in practice, which is why you can still see plenty of disagreements between mean observed $y$ values and regression line $y$ values, even though you are using a theoretical framework in which they agree.

  • So the idea that at each point the regression line describes the “mean of y” is theoretical? It can’t actually be describing a mean from our original data. Right? – King Squirrel Jun 09 '21 at 14:18
  • In general, that's right. It's the same as the idea that taking the mean of actual data points is a sample mean, which may or may not be the same as the theoretical, population mean. So the predicted y value at a fixed x value is not necessarily the mean of the observed y values at that x value. However, there are specific cases where this will occur. The simplest such case is when there is only one x value. Then the predicted y will be exactly the mean of the observed y's. – Cooler Paradox Jun 09 '21 at 16:49
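The single-x case mentioned in the comment above can be checked numerically. This is only a sketch with made-up numbers. Because every observation shares the same x, the intercept and slope are not uniquely determined, so the code asks `np.linalg.lstsq` for a least squares solution and looks at the fitted values rather than the coefficients.

```python
import numpy as np

# Made-up observations, all taken at the same x value.
x = np.full(5, 2.0)
y = np.array([4.1, 5.3, 4.8, 5.9, 5.4])

# Design matrix with an intercept column. Its two columns are proportional,
# so the coefficients are not unique. lstsq returns the minimum-norm least
# squares solution; singular values below rcond times the largest one are
# treated as zero.
X = np.column_stack([np.ones_like(x), x])
coef, *_ = np.linalg.lstsq(X, y, rcond=1e-10)

fitted = X @ coef
print(fitted[0], y.mean())
```

The predicted value at that x and the plain average of the observed y values should agree up to floating point error.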

Assume the random variable $Y$ can be modeled as $Y=\beta_0+\beta_1X_1+\dots+\beta_nX_n+\epsilon$ with $\epsilon\sim N(0,1)$ the random error term and $X_i$ being random variables.

After estimating the parameters $\beta_0,\beta_1,\dots,\beta_n$ via least squares, we model the random variable $Y$ as $\beta_0+\beta_1X_1+\dots+\beta_nX_n+\epsilon$ with the estimated values filled in for the $\beta_i$ as actual numbers.

Then given $X_1=x_1,\dots,X_n=x_n$, $Y$ given $\textbf x$ is $\beta_0+\beta_1x_1+\dots+\beta_nx_n+\epsilon$. This is normal with mean $\beta_0+\beta_1x_1+\dots+ \beta_nx_n$ and variance 1. That is, $E(Y|\textbf x)=\beta_0+\beta_1x_1+\dots+ \beta_nx_n$.

But this is exactly the value of the least squares regression line evaluated at $\textbf x$.

Vons
  • I followed you until “Then given X1.” – King Squirrel Jun 09 '21 at 06:51
  • @King by definition of what $Y$ is, $E(Y|\textbf x)= E(\beta_0+\beta_1X_1+\dots+\beta_nX_n+\epsilon|\textbf x) =\beta_0+\beta_1x_1+\dots+\beta_nx_n+E(\epsilon|\textbf x)=\beta_0+\beta_1x_1+\dots+\beta_nx_n$, since the error has mean $0$. – Vons Jun 09 '21 at 07:25

Starting from your example: in the case of linear regression, the parameters of a 2-dimensional normal distribution are inferred from the ten $(x_i, y_i)$ points (see https://en.wikipedia.org/wiki/Multivariate_normal_distribution for details), namely a two-dimensional mean and a 2x2 covariance matrix. Then, for $x=1$, the conditional distribution of $y$ is computed (see https://en.wikipedia.org/wiki/Multivariate_normal_distribution#Bivariate_conditional_expectation), and it can be shown that the mean of this distribution is the value predicted by linear regression.
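As a rough numerical check of this claim, the sketch below uses ten made-up points (the generating numbers and the helper name `conditional_mean` are assumptions for illustration only). It plugs the sample mean and sample covariance into the bivariate normal conditional mean formula, $\mu_y + \frac{\sigma_{xy}}{\sigma_x^2}(x - \mu_x)$, and compares the result with the ordinary least squares line fitted to the same points.

```python
import numpy as np

rng = np.random.default_rng(1)

# Ten made-up (x, y) points, matching the size of the example in the question.
x = np.arange(1, 11, dtype=float)
y = 1.5 + 0.8 * x + rng.normal(0.0, 1.0, size=x.size)

# Estimate the bivariate normal parameters: mean vector and covariance matrix.
mu_x, mu_y = x.mean(), y.mean()
cov = np.cov(x, y)  # 2x2 sample covariance matrix
var_x, cov_xy = cov[0, 0], cov[0, 1]


def conditional_mean(x0):
    """Mean of Y given X = x0 under the fitted bivariate normal."""
    return mu_y + cov_xy / var_x * (x0 - mu_x)


# Ordinary least squares line for the same points.
slope, intercept = np.polyfit(x, y, deg=1)

print(conditional_mean(1.0), intercept + slope * 1.0)
```

The two printed numbers should agree up to floating point error, because the least squares slope is the sample covariance divided by the sample variance of x, and the least squares line passes through the point of sample means.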

  • Is there a somewhat simple and concise way to show why the regression line represents the conditional mean of y for every x? I think this is where I’m lost. – King Squirrel Jun 09 '21 at 06:29
  • The mean minimizes the mean squared error (see https://www.probabilitycourse.com/chapter9/9_1_5_mean_squared_error_MSE.php). The linear regression line is also computed in a way that minimizes the squared error between the "predicted" y and the real y, for all x. This is why they match. – Giorgio Venturi Jun 09 '21 at 06:42
  • I think that’s where my confusion lies. I understand the mean of all y values will minimize the sum of the squared distance from each actual y value. I do not understand how at each and every x value we call the corresponding predicted y value the “mean y value.” I feel like these are two different things? Assuming a positively sloped regression curve we certainly are getting larger and larger values for y given larger x values. – King Squirrel Jun 09 '21 at 06:50
  • Does that not also mean we would have a line with slope zero? All predicted y values at each x would be the mean of y values? – King Squirrel Jun 09 '21 at 06:59