
I have a set of observations (x,y).

I want to use x values to predict y.

I plot a simple regression and this gives me an equation y = mx+c. This is the thin black line.

How do I construct confidence intervals around the value of y for any given x (e.g. like the red lines on the graph), with the aim that these red lines should theoretically contain 95% of the data?

enter image description here

Edit: Here is the solution graph using the accepted answer.

enter image description here

BYZZav
  • A $95\%$ confidence interval is NOT supposed to contain $95\%$ of the data, NOR $95\%$ of future observations not contained in the data on which the interval is based, NOR a region within which any data points will fall with $95\%$ probability. That's not what confidence intervals are. Rather, a confidence interval for the slope of the line should have a $95\%$ chance of containing the "true" slope. That's quite a different thing. – Michael Hardy Jun 06 '17 at 21:20
  • Also, when people draw pictures like the one you've drawn, with its red dotted lines, they're usually dealing with prediction intervals rather than confidence intervals. Prediction intervals in this kind of problem get wider as you move farther from the average $x$-value. I wonder whether prediction intervals are what you have in mind. – Michael Hardy Jun 06 '17 at 21:21
  • $\ldots,$ and prediction intervals are typically much wider than confidence intervals. It seems quite probable to me that prediction intervals are what you have in mind. – Michael Hardy Jun 06 '17 at 21:23
  • Thanks very much for your explanation - I am after prediction intervals rather than confidence intervals as you point out. – BYZZav Jun 06 '17 at 21:31

1 Answer


What you're basically trying to get is a 95% prediction interval for $y_0$, a new point with $x$-coordinate equal to some value $x_0$, say. If we let $\mu(x_0)$ denote the true mean, according to our linear model, of points with $x=x_0$, then the model for $y_0$ is:

$$y_0 = \mu(x_0) + \epsilon_0$$

where $\epsilon_0$ is the random error, assumed normally distributed and iid across all points. What this basically means is that if you're trying to predict where a new point $y_0$ will lie, your guess will have randomness arising from your estimate $\hat{\mu}(x_0)$ of $\mu(x_0)$ but also some randomness arising from the error term $\epsilon_0$.

Therefore, to get a prediction interval for $y_0$, we just need to study the variability of our estimate of $\mu(x_0)$ and then add a term to account for the randomness in $\epsilon_0$. The variance of $\epsilon_0$ is $\sigma^2$, which is estimated by $s^2$, so the crux of the matter is to find the variance of your estimate $\hat{\mu}(x_0)$ of the true mean $\mu(x_0)$ and then add $s^2$ to it to account for the error term.
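In variance terms, that adjustment can be written out as follows. This is a sketch that assumes the new point's error $\epsilon_0$ is independent of the data used to fit the line, so the two sources of randomness simply add:

$$y_0 - \hat{\mu}(x_0) = \left(\mu(x_0) - \hat{\mu}(x_0)\right) + \epsilon_0 \quad\Longrightarrow\quad Var\left(y_0 - \hat{\mu}(x_0)\right) = Var(\hat{\mu}(x_0)) + \sigma^2$$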

First, express your estimated mean for a point at $x=x_0$ in terms of things you already know. Notice that, if $a$ is your estimate of the $y$-intercept and if $b$ is your estimate of the true slope, then:

$$\hat{\mu}(x_0) = a + x_0 b$$

$a$ and $b$ are random variables. If we assume normal errors in the model then $a$ and $b$ are themselves normally distributed. That means that the linear combination $\hat{\mu}(x_0) = a + x_0b$ will also be normally distributed. All that's left is to calculate the variance (and hence the standard deviation) of $\hat{\mu}(x_0)$, and then we can find the confidence interval for $\mu(x_0)$ using a $t$ critical value (since we also have to estimate the true variance of the model's error term).

To find the variance of $\hat{\mu}(x_0)$ we just need the variances of $a$ and $b$ and their covariance. The variances of $a$ and $b$ are:

$$Var(a) = \sigma^2 \left(\frac{1}{n} + \frac{\bar{x}^2}{\sum (x_i-\bar{x})^2} \right),\quad Var(b) = \sigma^2 \frac{1}{\sum (x_i - \bar{x})^2}$$

Their covariance is:

$$Cov(a,b) = -\sigma^2 \frac{\bar{x}}{\sum (x_i - \bar{x})^2}$$

Doing some algebra, this means that

$$SD(\hat{\mu}(x_0)) = \sigma \sqrt{\frac{1}{n} + \frac{(x_0-\bar{x})^2}{\sum (x_i - \bar{x})^2}}$$
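For anyone who wants the algebra spelled out, it is just the variance of a linear combination, using the variances and covariance given above and then completing the square in the numerator:

$$\begin{aligned}
Var(\hat{\mu}(x_0)) &= Var(a) + x_0^2\, Var(b) + 2 x_0\, Cov(a,b) \\
&= \sigma^2 \left(\frac{1}{n} + \frac{\bar{x}^2 - 2 x_0 \bar{x} + x_0^2}{\sum (x_i - \bar{x})^2}\right) \\
&= \sigma^2 \left(\frac{1}{n} + \frac{(x_0-\bar{x})^2}{\sum (x_i - \bar{x})^2}\right)
\end{aligned}$$

Taking the square root gives the standard deviation.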

Now replacing $\sigma$ with $s$, the standard error estimated from your residuals, and using the appropriate $t$ critical value $t^*$ for 95% confidence (with $n-2$ degrees of freedom!), the half-width of your confidence interval for the mean $\mu(x_0)$ is:

$$ t^* \cdot s \sqrt{\frac{1}{n} + \frac{(x_0-\bar{x})^2}{\sum (x_i - \bar{x})^2}} $$

But this is only for the estimated mean. To come full circle and build a prediction interval for a new point, you need to add $s^2$ to the variance for the random error which takes the point off of the mean. This puts an extra $1$ under the square root, so the prediction interval for $y_0$ is $\hat{\mu}(x_0)$ plus or minus:

$$ t^* \cdot s \sqrt{1 + \frac{1}{n} + \frac{(x_0-\bar{x})^2}{\sum (x_i - \bar{x})^2}} $$
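As a rough sketch of how the two half-widths above could be computed, here is some plain Python. The function name `interval_half_widths`, its arguments, and the toy data are made up for illustration. The critical value `t_star` has to be supplied by you, either from a $t$ table or, if you have SciPy, from something like `scipy.stats.t.ppf(0.975, n - 2)`.

```python
import math


def interval_half_widths(xs, ys, x0, t_star):
    """Return (mu_hat, ci_half_width, pi_half_width) at x0.

    xs, ys : observed data for a simple linear regression y = a + b*x
    x0     : x-value at which to predict
    t_star : two-sided t critical value with n - 2 degrees of freedom
             (supplied by the caller; an assumption of this sketch)
    """
    n = len(xs)
    if n < 3:
        raise ValueError("need at least 3 points so that n - 2 > 0")

    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))

    b = sxy / sxx          # estimated slope
    a = y_bar - b * x_bar  # estimated intercept

    # s: residual standard error, with n - 2 degrees of freedom
    sse = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    s = math.sqrt(sse / (n - 2))

    mu_hat = a + b * x0
    core = 1 / n + (x0 - x_bar) ** 2 / sxx

    ci_half = t_star * s * math.sqrt(core)      # for the mean mu(x0)
    pi_half = t_star * s * math.sqrt(1 + core)  # for a new point y0
    return mu_hat, ci_half, pi_half


# Toy data (hypothetical); 3.182 is the 97.5% t quantile for 3 degrees of freedom.
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]
mu_hat, ci_half, pi_half = interval_half_widths(xs, ys, x0=4.5, t_star=3.182)
print("prediction interval:", mu_hat - pi_half, mu_hat + pi_half)
```

Evaluating this over a grid of $x_0$ values and plotting $\hat{\mu}(x_0) \pm$ the prediction half-width should give curved bands like the red lines in the question, narrowest at $\bar{x}$.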

gogurt
  • Thanks very much for this excellent answer - it was really helpful. – BYZZav Jun 06 '17 at 21:30
  • @MichaelHardy: whoa there... sorry. I made a mistake and I'll correct it. But by no means did I mean to mislead. Thanks for pointing that out. – gogurt Jun 07 '17 at 01:31