What's a Q-Q Plot?

[Note: I've made a Jupyter Notebook (Python) for this so that you can mess around with a few of these ideas yourself. The figures come from this notebook.]

⚫ ⚫ ⚫ ⚫

A Q-Q plot, or quantile-quantile plot, is a graphical tool that compares two datasets to see whether they come from the same underlying distribution.

Suppose I have two datasets. One is from a normal distribution (let's say it's the GPA of all of the students attending some college), and the other is a set of numbers that someone gives me. It looks fairly similar, but I'm not sure if it's from the same distribution.

One thing I could do is plot a histogram of each dataset and see if the histograms look similar. Another thing I could do (and this is where the Q-Q plot comes in!) is look at certain percentiles of the data and see if they match up with each other. (Note: a quantile is just the percentile divided by 100, so the 75th percentile is the 0.75 quantile.) Let's explore this idea a bit more.

The main idea here is that if the two datasets come from the same distribution and we plot the $P$-th percentile of dataset 1 against the $P$-th percentile of dataset 2 for a range of values of $P$, the points should fall approximately on the straight line $y = x$; this is because the percentiles of data taken from the same distribution should be approximately equal.

For example, here are the histograms of two datasets from the same distribution (normal with $\mu = 0$ and $\sigma = 1$). The histograms look fairly similar.

Now, let's look at the Q-Q plot below. Each point corresponds to a percentile; in this case, the lower-left point is the 1st percentile, the next closest point is the 2nd percentile, and so forth, up to the upper-right point, which is the 99th percentile. Just to be clear, the $x$-coordinate of the 1st percentile point (lower-left) is the value of the 1st percentile of dataset 1 (which in this case is almost -2.2), and the $y$-coordinate of this point is the value of the 1st percentile of dataset 2 (which is almost -2.2 as well).

Notice that these points make almost a straight line, and that line is $y = x$. I've drawn a red dotted line at $y = x$ for reference.

This is a good indication that our datasets are from the same distribution. Why? Remember: we want the values of matching percentiles to be almost equal; that's what lying on $y = x$ means in this case.
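The construction described above can be sketched in a few lines. This is not the notebook's code, just a minimal sketch assuming NumPy; the sample size, the seed, and the names `data1`, `data2`, `q1`, `q2` are my own choices for illustration.

```python
import numpy as np

# Two datasets drawn from the same distribution (normal, mu = 0, sigma = 1).
# The seed and sample size are arbitrary choices for this sketch.
rng = np.random.default_rng(0)
data1 = rng.normal(loc=0.0, scale=1.0, size=5000)
data2 = rng.normal(loc=0.0, scale=1.0, size=5000)

# 1st through 99th percentile, one Q-Q point per percentile.
percents = np.arange(1, 100)
q1 = np.percentile(data1, percents)  # x-coordinates of the Q-Q points
q2 = np.percentile(data2, percents)  # y-coordinates of the Q-Q points

# How far does the worst point sit from the line y = x?
max_gap = np.max(np.abs(q1 - q2))
print(f"largest gap between matching percentiles: {max_gap:.3f}")
```

Scatter-plotting `q1` against `q2` with whatever plotting library you like, plus a reference line at $y = x$, gives the Q-Q plot itself.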

Here are two datasets which are not from the same distribution. Notice that they look fairly different on the histogram.

Moreover, the Q-Q plot looks strange:

Notice how far the points are off the line $y = x$, indicating that the percentiles don't match up.

The gist is that a Q-Q plot gives us an "at a glance" view of whether two datasets come from the same distribution. If we want an actual score, we can compute the residual sum of squares (RSS) of the points from the line $y = x$.
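To make the "actual score" idea concrete, here is a rough sketch of an RSS from $y = x$: the sum of squared differences between matching percentiles. The helper name `qq_rss` is made up for this example, and I'm using an exponential distribution as the "different" dataset purely as an assumption, since any clearly non-normal data would do.

```python
import numpy as np

PERCENTS = np.arange(1, 100)  # 1st through 99th percentile

def qq_rss(a, b):
    """Residual sum of squares of the Q-Q points from the line y = x."""
    qa = np.percentile(a, PERCENTS)
    qb = np.percentile(b, PERCENTS)
    return float(np.sum((qb - qa) ** 2))

rng = np.random.default_rng(1)
same_a = rng.normal(0.0, 1.0, 5000)
same_b = rng.normal(0.0, 1.0, 5000)
other = rng.exponential(1.0, 5000)  # assumed "different" distribution

rss_same = qq_rss(same_a, same_b)  # both normal
rss_diff = qq_rss(same_a, other)   # normal vs. exponential
print(f"same distribution: {rss_same:.3f}, different: {rss_diff:.3f}")
```

The score is only meaningful relative to other scores computed the same way (same percentiles, similar sample sizes); there is no universal cutoff for "close enough".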

⚫ ⚫ ⚫ ⚫

Here are two important things to note about Q-Q plots, which are also reasons you might want to use them.

First, because we're only measuring percentiles, the datasets can have two different sizes: the first could have 1000 elements while the second could have 40 and it would make no difference to how we construct the Q-Q plot.
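A quick sketch of that point, assuming NumPy and using the same sizes as above (1000 and 40). Both datasets yield the same number of percentile values, so the Q-Q points pair up regardless of how many elements went in.

```python
import numpy as np

rng = np.random.default_rng(2)
big = rng.normal(0.0, 1.0, 1000)   # 1000 elements
small = rng.normal(0.0, 1.0, 40)   # only 40 elements

percents = np.arange(1, 100)
q_big = np.percentile(big, percents)
q_small = np.percentile(small, percents)

# Same number of points from each dataset, so they can be plotted pairwise.
print(q_big.shape, q_small.shape)
```

One caveat worth keeping in mind: with only 40 elements, `np.percentile` has to interpolate between neighbouring data points (linear interpolation is its default), so the small dataset's percentiles are noisier and the plot will look rougher.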

Second, it may become visually obvious that the datasets are from a similar distribution which is simply scaled or shifted or reflected.

For the second one, let's look at two Q-Q plots between data coming from normal curves: one where the means differ, and one where the standard deviations differ.

Homework: Mess around with different means and standard deviations and find a pattern in the Q-Q plot for when the theoretical distributions of the two datasets differ only in their means or only in their standard deviations.
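If you'd like a starting point for the homework, here is one possible setup. It's a sketch, not the notebook's code: the helper `qq_line` is a name I made up, and it summarizes a Q-Q plot by fitting a straight line through the points with `np.polyfit`. The means and standard deviations below are arbitrary; change them and watch what happens to the slope and intercept.

```python
import numpy as np

PERCENTS = np.arange(1, 100)

def qq_line(a, b):
    """Fit a straight line through the Q-Q points of a (x-axis) and b (y-axis)."""
    qa = np.percentile(a, PERCENTS)
    qb = np.percentile(b, PERCENTS)
    slope, intercept = np.polyfit(qa, qb, 1)  # degree-1 fit: [slope, intercept]
    return float(slope), float(intercept)

rng = np.random.default_rng(3)
base = rng.normal(0.0, 1.0, 5000)     # mu = 0, sigma = 1
shifted = rng.normal(3.0, 1.0, 5000)  # different mean only
scaled = rng.normal(0.0, 2.0, 5000)   # different standard deviation only

slope_shift, intercept_shift = qq_line(base, shifted)
slope_scale, intercept_scale = qq_line(base, scaled)
print(f"shifted: slope={slope_shift:.2f}, intercept={intercept_shift:.2f}")
print(f"scaled:  slope={slope_scale:.2f}, intercept={intercept_scale:.2f}")
```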

⚫ ⚫ ⚫ ⚫

Here's one additional cool thing. Say we know some dataset is from a family of distributions parameterized by degrees of freedom (DF); we know the family but not the DF. We can iterate through candidate DFs and look at the "Q-Q plots" between our data and the percentiles of the theoretical distribution to see which DF puts the points closest to $y = x$. Note: this is not actually a Q-Q plot (it's called a probability plot), but the idea is so similar that I figured I'd put it in here at the end.

The difference between the Q-Q plot and the probability plot is that in the Q-Q plot we're comparing two datasets; in the probability plot, at least one of these datasets is replaced by a theoretical distribution, meaning one defined by a formula rather than by sampled data. The normal curve, for example, is a theoretical distribution.

As an example, if we have a set of data which comes from a t-distribution with df = 7, we can iterate through the values from 1 to 30 and see which is the best fit. This actually worked fairly well when I tried it out; the example is in the Jupyter notebook above.
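Here is roughly what that search could look like. This is a sketch under my own assumptions rather than the notebook's code: it assumes SciPy for the theoretical percentiles (`scipy.stats.t.ppf`), scores each candidate df by the RSS from $y = x$, and uses a sample size and seed I picked arbitrarily.

```python
import numpy as np
from scipy import stats

# Simulated data from a t-distribution with df = 7 (size and seed are arbitrary).
rng = np.random.default_rng(4)
data = rng.standard_t(df=7, size=20000)

percents = np.arange(1, 100)
sample_q = np.percentile(data, percents)  # percentiles of our data

def rss_for_df(df):
    """RSS from y = x between the data's percentiles and a t(df)'s percentiles."""
    theo_q = stats.t.ppf(percents / 100.0, df)  # theoretical percentiles
    return float(np.sum((sample_q - theo_q) ** 2))

# Try every df from 1 to 30 and keep the one with the smallest RSS.
scores = {df: rss_for_df(df) for df in range(1, 31)}
best_df = min(scores, key=scores.get)
print(f"best-fitting df: {best_df}")
```

The winner won't necessarily be exactly 7 on any given run, since neighbouring dfs produce very similar percentiles and the sample is noisy, but it should land close by.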