What is the principle due to which the QQ plots work and give a straight line if the sample data belongs to that distribution?

Question

I understand that if I have a sample that follows a normal distribution and I plot the sample quantiles against the theoretical normal quantiles, the points will closely follow a straight line. But why do the points follow a straight line? What principle makes this method work?

if they are the same distribution then they have the same quantiles — Gennaro Marco Devincenzis, Feb 17 '22 at 20:43

BruceET · Answer 1 · 2022-02-18T01:46:59.503

Take a sample x of size $n = 100$ from a normal population as an example. Sampling and computations in R:

set.seed(2022)
x = rnorm(100, 50, 7)
summary(x)
  Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
 29.70   46.71   51.20   50.97   56.06   70.21

An empirical CDF (ECDF) of the sample is made by sorting the sample from smallest to largest. Starting at height $0$ on the left, we make a graph that jumps up by $1/n$ at each observation and reaches height $1$ at the right. For a reasonably large sample, the ECDF is similar to the CDF of the population. In the plot below the ECDF is the black 'stairstep' and the CDF is the orange curve.

plot(ecdf(x))
curve(pnorm(x, 50, 7), add=TRUE, lwd=2, col="orange")
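The stairstep construction described above can also be reproduced by hand. This is a minimal sketch, assuming the same seed and parameters as the sample above; the names `xs`, `heights`, and `gap` are illustrative only.

```r
set.seed(2022)
x <- rnorm(100, 50, 7)       # same sample as above
n <- length(x)

xs <- sort(x)                # observations from smallest to largest
heights <- (1:n) / n         # ECDF height just after each jump of 1/n

# The built-in ecdf(), evaluated at the sorted data, should give the same
# heights (assuming no tied observations, as with continuous data).
same <- isTRUE(all.equal(ecdf(x)(xs), heights))

# Largest vertical gap, measured at the jump points, between the ECDF
# and the population CDF
gap <- max(abs(heights - pnorm(xs, 50, 7)))
gap
```

A small value of `gap` is what "the ECDF is similar to the CDF" means in practice; it tends to shrink as the sample size grows.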

The idea of a normal probability plot (Q-Q plot) is to use the quantile function (inverse CDF) to straighten the orange curve so it becomes a straight line with slope $\sigma = 7$ and y-intercept $\mu = 50.$

qqnorm(x)
abline(a = 50, b=7, lwd=2, col="orange")
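The reason for the straight line is a simple identity: if $X \sim \mathsf{Norm}(\mu, \sigma)$, then the $p$-quantile of $X$ is $\mu + \sigma z_p$, where $z_p$ is the $p$-quantile of the standard normal distribution. Population quantiles plotted against $z_p$ therefore fall exactly on the line with intercept $\mu$ and slope $\sigma$, and the sorted sample values, which estimate those quantiles, scatter around it. A minimal sketch, assuming plotting positions from `ppoints()`; the names `p`, `z`, `xs`, `fit`, and `r` are illustrative.

```r
set.seed(2022)
x <- rnorm(100, 50, 7)       # same sample as above
n <- length(x)

p  <- ppoints(n)             # plotting positions, (i - 1/2)/n for n > 10
z  <- qnorm(p)               # standard normal quantiles
xs <- sort(x)                # sample quantiles (order statistics)

# Population quantiles are exactly linear in z: mu + sigma * z
exact <- isTRUE(all.equal(qnorm(p, 50, 7), 50 + 7 * z))

# A least-squares line through the Q-Q points estimates mu and sigma
fit <- lm(xs ~ z)
coef(fit)                    # intercept near 50, slope near 7

r <- cor(z, xs)              # close to 1 when the points are nearly linear
```

The fitted intercept and slope are only rough estimates of $\mu$ and $\sigma$, but they show why the orange line with `a = 50, b = 7` is the natural reference.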

In some software, the reference line of the Q-Q plot is drawn through two points: the lower theoretical quartile paired with the lower quartile of the data, and the same for the upper quartiles. Often, as in my example, there is little difference between the two lines.

The main idea of a Q-Q plot is for the data to lie approximately along a line (with some tolerance for a few points near the maximum and minimum that may stray from the linear pattern).

qqnorm(x)
qqline(x, col="red", lwd=2)
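The quartile-based reference line can be computed by hand. The sketch below mirrors, as far as I understand its defaults, what `qqline()` does with `probs = c(0.25, 0.75)`; the names `qx`, `qz`, `slope`, and `intercept` are illustrative.

```r
set.seed(2022)
x <- rnorm(100, 50, 7)                    # same sample as above

probs <- c(0.25, 0.75)
qx <- quantile(x, probs, names = FALSE)   # quartiles of the data
qz <- qnorm(probs)                        # theoretical (standard normal) quartiles

# Line through the two points (qz[1], qx[1]) and (qz[2], qx[2])
slope     <- diff(qx) / diff(qz)
intercept <- qx[1] - slope * qz[1]
c(intercept = intercept, slope = slope)
```

Because quartiles are hardly affected by a few extreme observations, this line is less sensitive to stray points in the tails than a line based on the sample mean and standard deviation.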

In some countries (including those in North America) it has become customary to plot data quantiles on the horizontal axis. Then using the quartiles to make the line is a slight simplification.

qqnorm(x, datax=TRUE)
qqline(x, datax=TRUE, col="red", lwd=2)

In statistical practice, normal Q-Q plots are often used to check whether a sample comes from a normal distribution. If not, the departure of the plotted points can be easy to see. (Especially for small samples, many statisticians prefer to look at such plots instead of relying on formal goodness-of-fit tests, which may have poor power to detect non-normality.)

Below are normal Q-Q plots of a sample from an exponential distribution and a sample from a uniform distribution.

R code for above plot:

set.seed(217)
par(mfrow=c(1,2))
w = rexp(50)   # exponential sample of size 50
qqnorm(w, main="Norm Q-Q Plot: Exponential Data")
qqline(w, col="blue", lwd=2)
u = runif(50)  # uniform sample of size 50
qqnorm(u, main="Norm Q-Q Plot: Uniform Data")
qqline(u, col="blue", lwd=2)
par(mfrow=c(1,1))

Note: The same idea of turning ECDFs into linear plots is used to make Q-Q plots based on non-normal distributions.
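As an illustration of that note, here is a minimal sketch of an exponential Q-Q plot built by hand. The sample size, the rate of 2, and the names `q_theory`, `q_sample`, and `r_exp` are assumptions chosen for the example. The quantile of an exponential distribution with rate $\lambda$ is $1/\lambda$ times the quantile of the standard exponential, so the points should lie near a line through the origin with slope $1/\lambda$.

```r
set.seed(217)
w <- rexp(200, rate = 2)          # hypothetical exponential sample
n <- length(w)

q_theory <- qexp(ppoints(n))      # quantiles of the standard (rate 1) exponential
q_sample <- sort(w)               # sample quantiles

plot(q_theory, q_sample,
     xlab = "Exponential quantiles", ylab = "Sample quantiles",
     main = "Exponential Q-Q Plot")
abline(a = 0, b = 1/2, col = "orange", lwd = 2)   # slope = 1/rate

r_exp <- cor(q_theory, q_sample)  # near 1 if the exponential model fits
```

The largest few observations of an exponential sample are quite variable, so some scatter at the upper right of the plot is to be expected.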
