What are the nature and purpose of confidence intervals?
-
I will post my own answer below. I expect probably others will as well. – Michael Hardy Sep 08 '12 at 20:49
-
Is this question designed to address abstract duplicates? – Alex Becker Sep 08 '12 at 20:51
-
I know of no duplicates, and I don't know what "abstract" duplicates are. – Michael Hardy Sep 08 '12 at 21:06
-
See this meta question (or many others) for a discussion of abstract duplicates. – Alex Becker Sep 08 '12 at 21:09
-
OK, now I can answer: No, it's not meant to address such a thing. – Michael Hardy Sep 08 '12 at 21:15
-
Very good question, one that I hope people who ask technical confidence interval questions get directed to. – André Nicolas Sep 08 '12 at 23:06
3 Answers
Consider the normal distribution $N(\mu,\sigma^2)$. Its probability density function is $x\mapsto\text{constant}\cdot\exp\left(\frac{-1}{2}\cdot\left(\frac{x-\mu}{\sigma}\right)^2\right)$. It has expected value $\mu$ and standard deviation $\sigma$. It puts probability about $0.95$ in the interval whose endpoints are $\mu\pm1.96\sigma$.
Suppose one cannot observe $\mu$ and must estimate based on a sample of $n_1$ independently chosen observations from a population with this distribution. One can show that the sample mean $\bar{X}=(X_1+\cdots+X_{n_1})/n_1$ has a normal distribution with mean $\mu$ and standard deviation $\sigma/\sqrt{n_1}$.
For the moment we assume, unrealistically, that $\sigma$ is known.
We have $$ \Pr\left(\mu-1.96\frac{\sigma}{\sqrt{n_1}} < \bar X < \mu+1.96\frac{\sigma}{\sqrt{n_1}}\right) = 0.95 $$ and hence $$ \Pr\left( \bar X - 1.96\frac{\sigma}{\sqrt{n_1}} <\mu< \bar X + 1.96\frac{\sigma}{\sqrt{n_1}}\right) = 0.95. $$ The interval whose endpoints are $$ \bar X \pm 1.96\frac{\sigma}{\sqrt{n_1}}\tag{1} $$ is a $95\%$ confidence interval for $\mu$.
It is tempting to say that if we observe that the two numbers $\bar X\pm 1.96\dfrac{\sigma}{\sqrt{n_1}}$ are, for example, $5$ and $9$, then $$ \Pr\left(5<\mu<9\right)=0.95.\tag{DANGER!!} $$ But suppose we take a second sample, this time of size $n_2$. We can form another confidence interval, using the new sample mean in the role of $\bar X$ and $n_2$ in the role of $n_1$ in $(1)$ above. What changes when we take a new sample is the endpoints of the interval. What does not change is $\mu$. The $0.95$ probability means that $95\%$ of the time, when we take another such sample, $\mu$ will be within the interval that we get. Some people take this fact to be an objection to the statement labeled "DANGER!!" above, saying that "$\Pr\left(5<\mu<9\right)=0.95$" should be considered true only if it is the case that $95\%$ of all values of $\mu$ are between $5$ and $9$, and that is clearly false, since there is only one value of $\mu$. This, however, depends on the meaning of probability. Even if one so defines probability that this objection is not valid, nonetheless the mathematics of probability do not lead to the conclusion that the statement labeled "DANGER!!" is true. In practice, though, people often act as if one should be $95\%$ sure of the statement $5<\mu<9$, given the evidence of the sample.
To say that one is "$95\%$ confident" that $\mu$ lies within the confidence interval, as a term of art in statistics, means precisely that $95\%$ of the time, when one takes a new random sample of one or more observations, the interval one gets will contain $\mu$.
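The repeated-sampling reading of "$95\%$ confident" can be checked by simulation. The following is only a sketch under assumed values: $\mu=7$, $\sigma=2$, the sample size $25$, and the function name `coverage_known_sigma` are illustrative choices, not part of the argument above. It draws many samples, forms the interval $(1)$ each time, and counts how often the interval contains $\mu$.

```python
import random
from statistics import NormalDist


def coverage_known_sigma(mu=7.0, sigma=2.0, n=25, level=0.95,
                         trials=20_000, seed=0):
    """Fraction of simulated samples whose interval xbar +/- z*sigma/sqrt(n) contains mu."""
    rng = random.Random(seed)
    # Two-sided critical value of the standard normal (about 1.96 for level 0.95).
    z = NormalDist().inv_cdf(0.5 + level / 2)
    half_width = z * sigma / n ** 0.5
    hits = 0
    for _ in range(trials):
        xs = [rng.gauss(mu, sigma) for _ in range(n)]
        xbar = sum(xs) / n
        # The endpoints change from sample to sample; mu does not.
        if xbar - half_width < mu < xbar + half_width:
            hits += 1
    return hits / trials


cov = coverage_known_sigma()
print(f"empirical coverage: {cov:.3f}")
```

The printed fraction should be close to $0.95$. The simulation says nothing about any single realized interval such as $(5,9)$; it only describes the long-run behavior of the procedure.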
It was of course unrealistic to assume we know the population S.D. $\sigma$ but we have to estimate $\mu$ based on a sample. Suppose we estimate $\sigma$ based on the sample by using the square root of $$ S^2 = \frac{1}{n-1}\sum_{i=1}^n \left(X_i-\bar X\right)^2 $$ as the estimate of $\sigma$ (writing $n$ for the sample size). Earlier we said, in effect, that $$ \frac{\bar X-\mu}{\sigma/\sqrt{n}} $$ has a normal distribution with mean $0$ and variance $1$. The quantity $$ \frac{\bar X-\mu}{S/\sqrt{n}}\tag{2} $$ also has a distribution that does not depend on $\mu$ or $\sigma$. This is Student's t-distribution with $n-1$ degrees of freedom, introduced by the pseudonymous writer "Student", who was actually William Sealy Gosset. Thus we can find $A$ such that $(2)$ is in the interval bounded by $\pm A$ with probability $0.95$. It is a bigger number than $1.96$; formerly one looked it up in a table; now one uses software (it depends not only on the "$0.95$" but also on the sample size $n$). We get $$ \bar X \pm A\frac{S}{\sqrt{n}} $$ as endpoints of a confidence interval.
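To see why $A$ must exceed $1.96$, one can compare the two multipliers by simulation. This is a sketch under assumptions: the sample size $n=10$, the values $\mu=0$ and $\sigma=3$, and the helper name `coverage` are illustrative, and the constant $2.262$ is the tabulated $97.5$th percentile of Student's t-distribution with $9$ degrees of freedom, hard-coded so that the example needs only the standard library.

```python
import random
from math import sqrt

Z_CRIT = 1.96        # normal critical value, appropriate when sigma is known
T_CRIT_DF9 = 2.262   # table value: 97.5th percentile of Student's t, 9 degrees of freedom


def coverage(multiplier, mu=0.0, sigma=3.0, n=10, trials=20_000, seed=1):
    """Fraction of samples for which xbar +/- multiplier * S / sqrt(n) contains mu."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        xs = [rng.gauss(mu, sigma) for _ in range(n)]
        xbar = sum(xs) / n
        s = sqrt(sum((x - xbar) ** 2 for x in xs) / (n - 1))  # sample S.D.
        if abs(xbar - mu) < multiplier * s / sqrt(n):
            hits += 1
    return hits / trials


cov_z = coverage(Z_CRIT)      # too narrow once sigma is replaced by S
cov_t = coverage(T_CRIT_DF9)  # restores the nominal level
print(f"multiplier 1.96:  {cov_z:.3f}")
print(f"multiplier 2.262: {cov_t:.3f}")
```

With $S$ in place of $\sigma$, the multiplier $1.96$ gives coverage noticeably below $0.95$ for a sample this small, while the t-based multiplier brings it back to about $0.95$.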
Similarly one can find a confidence interval for $\sigma^2$ by observing that the distribution of $$ \frac{(n-1)S^2}{\sigma^2} $$ does not depend on the unobservables $\mu$ and $\sigma$; this is a chi-square distribution with $n-1$ degrees of freedom. From $$ A<\frac{(n-1)S^2}{\sigma^2}<B $$ we get $$ \frac{(n-1)S^2}{B} < \sigma^2 < \frac{(n-1)S^2}{A} $$ and that is a confidence interval.
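A short simulation can confirm the direction of the inversion: dividing $(n-1)S^2$ by the larger cutoff $B$ gives the lower endpoint for $\sigma^2$, and dividing by the smaller cutoff $A$ gives the upper one. The sketch below assumes $n=10$ and $\sigma=2$ for illustration; the cutoffs $2.700$ and $19.023$ are tabulated $2.5$th and $97.5$th percentiles of the chi-square distribution with $9$ degrees of freedom, and the function names are hypothetical.

```python
import random

CHI2_LO_DF9 = 2.700    # table value: 2.5th percentile, chi-square with 9 df (the cutoff A)
CHI2_HI_DF9 = 19.023   # table value: 97.5th percentile, chi-square with 9 df (the cutoff B)


def variance_interval(xs, a=CHI2_LO_DF9, b=CHI2_HI_DF9):
    """Interval ((n-1)S^2 / B, (n-1)S^2 / A) for sigma^2."""
    n = len(xs)
    xbar = sum(xs) / n
    ss = sum((x - xbar) ** 2 for x in xs)  # this sum equals (n-1) * S^2
    return ss / b, ss / a


def coverage(sigma=2.0, n=10, trials=20_000, seed=2):
    """Fraction of simulated samples whose interval contains sigma^2."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        xs = [rng.gauss(0.0, sigma) for _ in range(n)]
        lo, hi = variance_interval(xs)
        if lo < sigma ** 2 < hi:
            hits += 1
    return hits / trials


cov = coverage()
print(f"empirical coverage for sigma^2: {cov:.3f}")
```

The empirical coverage should be near $0.95$. Note that the interval is not symmetric about $S^2$, because the chi-square distribution is skewed.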
Many other examples exist.
-
Since Michael Chernick mentions that some weird things can happen, I say a little bit more. Suppose you actually know the value of $\mu$. Then given any $95\%$ confidence interval, the (epistemic) probability that the interval contains $\mu$, given what you know, would be either $0$ or $1$. But it would still qualify as a $95\%$ confidence interval by the definition above. Now suppose that you don't know the exact value of $\mu$, but some values of $\mu$, given what you know, are more probable than others---say you have a probability distribution concentrated near.... – Michael Hardy Sep 09 '12 at 02:36
-
....one particular point. Then if you got a $95\%$ confidence interval far from that point, the epistemic probability, given what you knew before, plus the data on which the interval is based, that $\mu$ is in that interval, would be much smaller than $0.95$, although it would be bigger than what it was before the data showed up. In that case, you should use the conditional probability distribution of $\mu$ given the data, found via Bayes' formula. Next, we observe that some additional information about the value of $\mu$ can come from the data itself, rather than from the...... – Michael Hardy Sep 09 '12 at 02:39
-
.....confidence interval. For example, suppose you have a random sample from a population uniformly distributed on the interval from $0$ to $\theta>0$. Then one can readily find values of $A$ and $B$ such that the interval from $A$ times the sample mean to $B$ times the sample mean is a $95\%$ confidence interval, by the definition above. But what if that interval fails to include the largest observed value, which is to the right of the interval? In that case, obviously the data themselves tell you that $\theta$ is not in the interval! But it still satisfies the definition. Next..... – Michael Hardy Sep 09 '12 at 02:43
-
Next, consider a population uniformly distributed on the interval from $\theta-1/2$ to $\theta+1/2$, and one has a sample of size $2$. Then the interval bounded by the two observations is a $50\%$ confidence interval. But: to be "$50\%$ confident" that $\theta$ is in that interval, in any common-sense sense of the word "confident", if the two observations are seen to differ by $0.001$, is absurd. And to be only $50\%$ confident, by any common-sense sense of the term, if the observations differ by $0.999$, is equally absurd. Yet the interval satisfies the definition. – Michael Hardy Sep 09 '12 at 02:49
-
In this last example, Fisher's technique of conditioning on an ancillary statistic, where the latter is just the distance between the two observations, gives a reasonable answer. The ancillary statistic seems to take into account information that would be disregarded by reposing $50\%$ confidence in the $50\%$ confidence interval. But "information" in this instance doesn't mean what is usually called Fisher information, nor information in the sense of Fisher's concept of sufficiency. Just what it does mean would bear examination. I don't have an answer to that one. – Michael Hardy Sep 09 '12 at 02:51
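The uniform example with a sample of size $2$ can be illustrated numerically. The sketch below is not from the comments themselves: the value $\theta=3$, the thresholds $0.5$ and $0.1$ used to split the samples, and the function name `simulate` are assumptions chosen for illustration. It estimates the overall coverage of the interval between the two observations, and then the coverage among samples whose observations are far apart or close together.

```python
import random


def simulate(theta=3.0, trials=100_000, seed=3):
    """Coverage of [min(x1, x2), max(x1, x2)] overall and conditional on the gap."""
    rng = random.Random(seed)
    total_hits = 0
    wide = wide_hits = 0      # samples whose observations differ by more than 0.5
    narrow = narrow_hits = 0  # samples whose observations differ by less than 0.1
    for _ in range(trials):
        x1 = rng.uniform(theta - 0.5, theta + 0.5)
        x2 = rng.uniform(theta - 0.5, theta + 0.5)
        lo, hi = min(x1, x2), max(x1, x2)
        hit = lo < theta < hi
        total_hits += hit
        gap = hi - lo  # the ancillary statistic: its distribution does not involve theta
        if gap > 0.5:
            wide += 1
            wide_hits += hit
        elif gap < 0.1:
            narrow += 1
            narrow_hits += hit
    return total_hits / trials, wide_hits / wide, narrow_hits / narrow


overall, given_wide, given_narrow = simulate()
print(f"overall coverage:          {overall:.3f}")
print(f"coverage when gap > 0.5:   {given_wide:.3f}")
print(f"coverage when gap < 0.1:   {given_narrow:.3f}")
```

Overall the interval covers $\theta$ about half the time, as the definition requires. When the gap exceeds $0.5$ the interval always contains $\theta$, since both observations lie within $1/2$ of $\theta$; when the gap is small it rarely does.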
Michael Hardy: I don't think anyone could give a better description of what the definition of a confidence interval is, so I will not try to give one of my own. But what I can add to the conversation is the fact that they are not unique and they are not always exact. There are exact confidence intervals and there are asymptotic ones.
Take the success parameter for a Bernoulli trial. If we have $n$ iid Bernoulli random variables we can construct an exact confidence interval using the Clopper-Pearson method, which works by evaluating cumulative binomial probabilities. There are also normal approximations to the binomial, and so approximate confidence intervals can be constructed using normal approximations. When the sample size is large these approximate intervals will have close to the advertised coverage. Coverage is the actual probability that if you repeat the procedure the new interval will contain the parameter (putting it in Michael Hardy's terms).
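The difference between exact and approximate intervals shows up clearly when the sample is small. The following sketch computes the actual coverage of both kinds of interval by summing binomial probabilities, so no simulation is involved. The choices $n=30$ and $p=0.1$, the bisection solver, and all function names are assumptions made for illustration; the approximate interval used here is the simple one $\hat p \pm 1.96\sqrt{\hat p(1-\hat p)/n}$, often called the Wald interval.

```python
from math import comb, sqrt
from statistics import NormalDist


def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p ** i * (1 - p) ** (n - i) for i in range(k + 1))


def bisect_root(f, lo=0.0, hi=1.0, iters=60):
    """Root of an increasing function f on [lo, hi], found by bisection."""
    for _ in range(iters):
        mid = (lo + hi) / 2
        if f(mid) < 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def clopper_pearson(k, n, alpha=0.05):
    """Exact interval: each endpoint puts tail probability alpha/2 on the observed k."""
    lower = 0.0 if k == 0 else bisect_root(
        lambda p: (1 - binom_cdf(k - 1, n, p)) - alpha / 2)  # P(X >= k) = alpha/2
    upper = 1.0 if k == n else bisect_root(
        lambda p: alpha / 2 - binom_cdf(k, n, p))            # P(X <= k) = alpha/2
    return lower, upper


def wald(k, n, alpha=0.05):
    """Normal-approximation interval centered at the sample proportion."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    p_hat = k / n
    half = z * sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - half, p_hat + half


def exact_coverage(interval, n, p, alpha=0.05):
    """Sum P(X = k) over the outcomes k whose interval contains the true p."""
    total = 0.0
    for k in range(n + 1):
        lo, hi = interval(k, n, alpha)
        if lo <= p <= hi:
            total += comb(n, k) * p ** k * (1 - p) ** (n - k)
    return total


cp_cov = exact_coverage(clopper_pearson, n=30, p=0.1)
wald_cov = exact_coverage(wald, n=30, p=0.1)
print(f"Clopper-Pearson coverage: {cp_cov:.3f}")
print(f"Wald coverage:            {wald_cov:.3f}")
```

For these values the Clopper-Pearson interval covers at least $95\%$ of the time, while the normal-approximation interval falls well short; the gap shrinks as $n$ grows.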
Since there can be more than one confidence interval for a parameter based on a random sample, how do we determine which one to use? Efron calls confidence intervals accurate if the actual coverage is equal to or close to the advertised coverage. Accuracy is a property we want any confidence interval to have before we use it. The exact and asymptotic binomial confidence intervals are accurate. A confidence interval would be correct (term due to Efron) if among all accurate intervals its expected length is the smallest.
Confidence intervals are set estimators of unknown parameters in statistics. They arise when we observe some data from a distribution that depends on some unknown parameter, and we use the data to obtain an interval estimator for the parameter. There are several different kinds of set estimators in statistics, but a confidence interval is one that is formed in such a way that, a priori, we have a certain level of "confidence" that the unknown parameter will fall within the interval we form (explained formally below).
Before examining this idea, it is worth noting that the concept of a "confidence interval" is actually part of a broader concept of a "confidence set", which may or may not be a single connected interval. In practice, most statistical problems involving set estimators for unknown parameters give rise to estimators where the set is a single connected interval, so in most cases it is appropriate to refer to a "confidence interval". However, there are some cases where the set estimator consists of disconnected parts, so it is actually best to step back and look generally at "confidence sets" which may or may not give single intervals.
Formal definition: Suppose we are going to observe some data vector $\mathbf{x} \in \mathscr{X}$ and we want to use this to form a confidence set for an unknown parameter $\theta \in \Theta$. Suppose we want to be able to form a confidence set for any specified "confidence level" $1-\alpha$. To do this, let $\mathscr{P}(\Theta)$ denote the power set containing all subsets of the parameter space $\Theta$. Formally, a confidence set is a mapping $\text{CI}: \mathscr{X} \times [0,1] \rightarrow \mathscr{P}(\Theta)$ that obeys the following "confidence property":$^\dagger$
$$\mathbb{P}(\theta \in \text{CI}(\mathbf{X}, \alpha) | \theta) \geqslant 1-\alpha \quad \quad \quad \text{for all } \theta \in \Theta \text{ and } 0 \leqslant \alpha \leqslant 1.$$
This confidence property gives us an a priori guarantee of the coverage probability for the estimator. It says that, regardless of the true parameter value, this parameter will fall within the (random) confidence set with probability no less than the specified confidence level. (In cases involving continuous data, we usually form a confidence set that exactly achieves the required confidence level, so in this case the confidence property holds with equality instead of as an inequality.)
The above definition shows the requirements for a set estimator that is a confidence set. The confidence property gives an a priori guarantee that coverage of the parameter occurs with the stipulated minimum probability (the confidence level), but it is important to note that once we observe the data vector $\mathbf{x}$, the estimated confidence set $\text{CI}(\mathbf{x}, \alpha)$ is no longer random, so the event $\theta \in \text{CI}(\mathbf{x}, \alpha)$ is now deterministic when conditioning on $\theta$. Thus, if we say that we have 95% "confidence" that $\theta \in \text{CI}(\mathbf{x}, \alpha)$, we are referring to an a priori minimum probability of 95% that the (random) set estimator will cover the parameter. By appeal to the law of large numbers, this is often framed by saying that if we were to repeatedly form the confidence set from random data, then in the long run, the parameter will fall within these intervals with proportion no less than the confidence level.
Now, if $\text{CI}(\mathbf{x}, \alpha)$ is a single connected interval, we generally call this a "confidence interval" rather than a "confidence set". In most statistical problems this is what occurs, so it is quite common for people to refer to confidence intervals without realising that it is possible to get a set estimator that is not a single connected interval. It is also worth noting that in many statistical problems, we appeal to approximating probability results (e.g., based on the central limit theorem) and so the confidence intervals formed in such problems might not have the exact coverage level that the confidence property requires. (In such cases, the confidence property holds if we condition on the approximating distribution.)
How confidence sets are formed: Confidence sets are usually formed using a pivotal quantity, which is a function of the data and parameter that has a distribution that is invariant to the parameter. We start by writing a probability statement for an event involving the pivotal quantity and then we "invert" this event to frame it as a coverage requirement for the parameter.
To see how this works, consider some pivotal quantity $Q(\mathbf{X}, \theta)$ and note that ---by definition--- this has a distribution that does not depend on $\theta$. Suppose we take a set $\mathcal{Q}(\alpha)$ that has coverage probability no less than $1-\alpha$ over the distribution of the pivotal quantity. (Note that this set does not depend on the parameter, since the distribution of the pivotal quantity does not depend on the parameter.) Then we have:
$$\begin{align} 1-\alpha &\leqslant \mathbb{P}(Q(\mathbf{X}, \theta) \in \mathcal{Q}(\alpha)) \\[6pt] &= \mathbb{P}(\theta \in \text{CI}(\mathbf{X}, \alpha)), \\[6pt] \end{align}$$
where we define the confidence set by:
$$\text{CI}(\mathbf{x}, \alpha) \equiv \{ \theta \in \Theta | Q(\mathbf{x}, \theta) \in \mathcal{Q}(\alpha) \}.$$
The transition from the event $Q(\mathbf{X}, \theta) \in \mathcal{Q}(\alpha)$ to the equivalent event $\theta \in \text{CI}(\mathbf{X}, \alpha)$ is the "inversion" used in forming the confidence set. With a sensible choice of the initial coverage set $\mathcal{Q}(\alpha)$ we can get a confidence set that has good properties, such as optimising to get the smallest possible width (see related discussion here). The vast majority of confidence intervals used in statistics are formed from simple pivotal quantities that usually involve some kind of standardisation of a point estimator for the parameter.
An example: The above theory is quite abstract, and it can be elucidated by a simple example. Following the other answer, I will use the example of a confidence interval for the mean parameter from normally distributed data. Suppose we are going to observe data $X_1,...,X_n \sim \text{IID N}(\mu, \sigma^2)$, and consider the following well-known pivotal quantity:
$$\sqrt{n} \cdot \frac{\bar{X}_n - \mu}{S_n} \sim \text{St}(\text{df} = n-1).$$
(Here $\bar{X}_n$ and $S_n$ denote the sample mean and sample standard deviation respectively, and the distribution on the right-hand-side is Student's T distribution.) Let $t_{n-1,\alpha/2}$ denote the critical point of the Student's T distribution with $n-1$ degrees-of-freedom and upper tail area $\alpha/2$. Choosing any value $0 \leqslant \alpha \leqslant 1$ we can use this pivotal quantity to obtain the probability result:
$$\begin{align} 1-\alpha &= \mathbb{P} \Bigg( - t_{n-1,\alpha/2} \leqslant \sqrt{n} \cdot \frac{\bar{X}_n - \mu}{S_n} \leqslant t_{n-1,\alpha/2} \Bigg) \\[6pt] &= \mathbb{P} \Bigg( \bar{X}_n - \frac{t_{n-1,\alpha/2}}{\sqrt{n}} \cdot S_n \leqslant \mu \leqslant \bar{X}_n + \frac{t_{n-1,\alpha/2}}{\sqrt{n}} \cdot S_n \Bigg) \\[6pt] &= \mathbb{P} \Big( \mu \in \text{CI}(\mathbf{X}, \alpha) \Big), \\[6pt] \end{align}$$
where:
$$\text{CI}(\mathbf{x}, \alpha) \equiv \Bigg[ \bar{x}_n \pm \frac{t_{n-1,\alpha/2}}{\sqrt{n}} \cdot s_n \Bigg].$$
This shows that the function $\text{CI}$ obeys the required confidence property, which establishes it as a confidence interval. With "confidence level" $1-\alpha$ the true mean parameter will fall within this confidence interval (with this interpreted as set out above).
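The inversion step can also be carried out by brute force, which makes the set definition concrete. The sketch below is illustrative only: the ten data values are hypothetical, the grid range and spacing are arbitrary, and the critical value $2.262$ is the tabulated $t_{9,0.025}$ (so $\alpha = 0.05$ and $n = 10$). It collects every grid value $\theta$ for which the pivotal quantity lands in $[-t_{n-1,\alpha/2},\, t_{n-1,\alpha/2}]$ and compares the result with the closed-form interval.

```python
from math import sqrt

T_CRIT_DF9 = 2.262  # table value of t_{9, 0.025}; matches n = 10 and alpha = 0.05

# Hypothetical observations, used only to illustrate the computation.
data = [4.9, 6.1, 5.5, 7.2, 6.8, 5.0, 6.3, 7.7, 5.9, 6.6]

n = len(data)
xbar = sum(data) / n
s = sqrt(sum((x - xbar) ** 2 for x in data) / (n - 1))


def pivot(theta):
    """Pivotal quantity sqrt(n) * (xbar - theta) / s at a candidate mean theta."""
    return sqrt(n) * (xbar - theta) / s


# Closed-form interval obtained by solving the inequality for the parameter.
half_width = T_CRIT_DF9 * s / sqrt(n)
closed_form = (xbar - half_width, xbar + half_width)

# Brute-force inversion: keep each candidate whose pivot lies in the coverage set.
step = 0.001
grid = [i * step for i in range(12_001)]  # candidates from 0 to 12
accepted = [theta for theta in grid if abs(pivot(theta)) <= T_CRIT_DF9]
grid_interval = (min(accepted), max(accepted))

print("closed form:", closed_form)
print("grid search:", grid_interval)
```

The two intervals agree up to the grid spacing. For this pivotal quantity the accepted values form one connected run of grid points, which is why the result is an interval rather than a more general set.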
$^\dagger$ In order to ensure that all events can be ascribed probabilities, we also require that, for each $\theta \in \Theta$, the event $\theta \in \text{CI}(\mathbf{X}, \alpha)$ is measurable.