Let's say I have data points $0.1$, $0.2$, $0.3$ coming from a normal distribution with mean $\mu$ and standard deviation $1$.

If I want to test the hypotheses $H_0: \mu = 0.15$ vs. $H_1: \mu > 0.15$, then the test statistic is $$T = \frac{\hat{\mu} - \mu}{1/\sqrt{n}} $$
where $\hat{\mu}$ is the sample mean. Under the null hypothesis,
$$T = \frac{\hat{\mu} - 0.15}{1/\sqrt{3}} \sim \mathcal{N}(0,1). $$

With my data, I know my observed test statistic is $T = \sqrt{3}\times 0.05$.
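As a quick check of that arithmetic, here is a minimal sketch using only the Python standard library. The helper name `z_statistic` is hypothetical, introduced just for illustration.

```python
from math import sqrt
from statistics import mean

data = [0.1, 0.2, 0.3]   # the three observations
mu0 = 0.15               # mean under the null hypothesis
sigma = 1.0              # known standard deviation

def z_statistic(sample, mu0, sigma):
    """Standardized distance of the sample mean from mu0 (hypothetical helper)."""
    n = len(sample)
    return (mean(sample) - mu0) / (sigma / sqrt(n))

t_obs = z_statistic(data, mu0, sigma)
print(t_obs)             # should agree with sqrt(3) * 0.05 up to rounding
```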

Now my confusion here is the rationale for why we do the following:
The p-value is $p = \mathbb{P}(Z > \sqrt{3}\times 0.05)$ where $Z$ is standard normal. Why do we look at the probability that the theoretical statistic exceeds the observed test statistic? Why not when it doesn't exceed?

A related question about the p-value comes up in the two-sided test ($H_1: \mu \neq 0.15$). There the p-value evaluates to
$$p = 2\mathbb{P}(Z > \sqrt{3}\times 0.05)$$
but if the two-sided alternative is more likely to hold than the one-sided alternative (since $\neq$ covers both $>$ and $<$, rather than $>$ alone), shouldn't we intuitively be more likely to reject in the two-sided test? But our p-value seems to give the opposite result with the factor of 2, as we reject if p is small and this factor of 2 makes it harder to reject. Where does my intuition go wrong?
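For concreteness, the two p-values can be evaluated numerically. This is only a sketch, assuming Python 3.8+ so that `statistics.NormalDist` is available; the variable names are my own.

```python
from math import sqrt
from statistics import NormalDist

Z = NormalDist()                 # standard normal, mean 0 and sd 1
t_obs = sqrt(3) * 0.05           # observed test statistic from the data

# One-sided: probability that Z lands above the observed statistic.
p_one_sided = 1 - Z.cdf(t_obs)

# Two-sided: probability that |Z| lands above |t_obs|, which doubles the tail.
p_two_sided = 2 * (1 - Z.cdf(abs(t_obs)))

print(p_one_sided, p_two_sided)
```

Both values are large here, because the observed statistic sits very close to 0.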

OneGapLater

1 Answer

> as we reject if p is small

This is probably the root of your confusion. The p-value doesn't determine whether we accept or reject the null hypothesis; the p-value is a descriptor of the test itself. The p-value tells us "how good" the test is (in a certain sense of "how good"), even before we perform the test; the result of the test will determine whether we accept or reject, but the p-value gives us information about how much we should "trust" that acceptance or rejection.

The p-value is the probability of a false positive, i.e., the probability that we will reject a true null hypothesis because the test showed an abnormal result. In a two-sided test, there are twice as many ways that a true null hypothesis could show an abnormal result; as you correctly stated, this makes rejection much more likely, and since the p-value is the probability of rejection, this makes the p-value higher. Since the p-value is higher, we will be less likely to "trust" the rejection -- it could more easily have been a statistical anomaly under the null hypothesis.
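The "twice as many ways" point can be checked with a small simulation under a true null hypothesis. This is a rough sketch, not a definitive implementation: the cutoff 1.645, the seed, and the number of trials are arbitrary choices of mine rather than anything from the question.

```python
import random
from math import sqrt

rng = random.Random(0)           # fixed seed (arbitrary) for reproducibility
mu0, sigma, n = 0.15, 1.0, 3     # the null hypothesis is true in this simulation
cutoff = 1.645                   # illustrative rejection threshold (assumption)
trials = 100_000

one_sided = two_sided = 0
for _ in range(trials):
    sample_mean = sum(rng.gauss(mu0, sigma) for _ in range(n)) / n
    t = (sample_mean - mu0) / (sigma / sqrt(n))
    one_sided += t > cutoff          # abnormal only in the upper tail
    two_sided += abs(t) > cutoff     # abnormal in either tail

one_sided_rate = one_sided / trials
two_sided_rate = two_sided / trials
print(one_sided_rate, two_sided_rate)
```

With the same cutoff, the two-sided rule should reject a true null roughly twice as often as the one-sided rule, which is where the factor of 2 comes from.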

As for your first question, your calculations are conceptually backward. Conceptually, we first determine the p-value, and then we run the test and observe the statistic. It's true that we can do it the other way around, first observing the statistic and then determining what the lowest p-value is such that the statistic still tells us to reject, but this is conceptually much more awkward (and that's probably why you're having trouble with the intuition). In any case, the reason p-value calculation requires us to look at the probability the test statistic will exceed a certain threshold is that we are calculating the probability of rejection given a true null hypothesis, and rejection occurs upon exceeding that threshold. In the conceptually awkward post-observation calculation of lowest possible p-value, we just set the threshold to be exactly the observed value (since that is the highest possible threshold we could set at which the observation would still tell us to reject) and go through with that calculation.
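To illustrate the post-observation reading, the sketch below compares the observed statistic against the rejection threshold for several significance levels. It assumes Python 3.8+ for `statistics.NormalDist`; the function `rejects` and the list of levels are hypothetical, chosen only for illustration.

```python
from math import sqrt
from statistics import NormalDist

Z = NormalDist()
t_obs = sqrt(3) * 0.05
p_value = 1 - Z.cdf(t_obs)       # one-sided p-value of the observed statistic

def rejects(t, alpha):
    """One-sided rule: reject when t exceeds the level-alpha threshold."""
    threshold = Z.inv_cdf(1 - alpha)
    return t > threshold

alphas = [0.01, 0.05, 0.10, 0.40, 0.50]
decisions = {alpha: rejects(t_obs, alpha) for alpha in alphas}

for alpha, rejected in decisions.items():
    # The rule rejects exactly at the levels that exceed the p-value.
    print(alpha, rejected, p_value < alpha)
```

In this example the two printed booleans agree at every level, so comparing the p-value with the level gives the same decision as comparing the statistic with the threshold.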

BallBoy
  • Thanks for your answer.
    Throughout university, I have always been taught the process: Pick a significance level $\alpha$ (0.05 in this case) -> Choose a test statistic, then calculate its p-value -> compare the p-value with the level. If $p$ is lower than the level then we reject the null; otherwise we accept the null. Is this what you mean by doing it backwards?
    – OneGapLater Aug 05 '19 at 00:39
  • Your first paragraph says that $p$ doesn't determine whether we accept or reject, but I've always used it to compare with the significance level (so I have been using it to accept or reject). I thought that the significance level tells us "how good" the test is (we pick small levels). – OneGapLater Aug 05 '19 at 00:41
  • @OneGapLater I guess I've been using "p-value" as synonymous with "significance level." If you ever hear "p-value of a test" that's how it's being used. – BallBoy Aug 05 '19 at 00:44
  • @OneGapLater If you're talking about p-value of a test statistic, it means "the lowest possible significance level at which we would still reject," as I mentioned in the answer. Since that's the definition, then you can compare it with your chosen significance level to accept or reject. But that's really just a conceptually less clear (in my opinion) way of checking whether your observed statistic is inside or outside the "accept" range determined by the significance level. – BallBoy Aug 05 '19 at 00:47
  • @OneGapLater You can think of the p-value of the observed statistic as a measurement of how anomalous the result is (under the null hypothesis). If it's too anomalous (p-value lower than $\alpha$), you'll reject the null hypothesis. You measure how anomalous it is by checking how many possible outcomes exceed -- i.e., are even more anomalous than -- it. And in a two-sided test, you're more likely to get anomalous results (since you can get them on either side), which makes them actually less anomalous (higher p-value). – BallBoy Aug 05 '19 at 00:50
  • @OneGapLater Perhaps my last comment gives a clearer perspective than my answer. – BallBoy Aug 05 '19 at 00:50