qqnorm and ggplot2

nigol

unread,

Nov 14, 2009, 9:01:11 PM11/14/09

to ggplot2

I am new to R and ggplot2.
I would expect qqnorm(c(1,2,3,4)) and qplot(sample=c(1,2,3,4)) to
produce the same result, but unfortunately the location of the points
differ.
Is there a way to remove the discrepancy?
Thanks,
Dario

hadley wickham

unread,

Nov 15, 2009, 9:32:41 AM11/15/09

to nigol, ggplot2

Hmmm, I seem to have implemented the qq-plot differently to qqnorm.
Do you have a good reference for the construction of qqplots?

Hadley

> --
> You received this message because you are subscribed to the ggplot2 mailing list.
> To post to this group, send email to ggp...@googlegroups.com
> To unsubscribe from this group, send email to
> ggplot2+u...@googlegroups.com
> For more options, visit this group at
> http://groups.google.com/group/ggplot2

--
http://had.co.nz/

Peter Flom

unread,

Nov 15, 2009, 10:23:35 AM11/15/09

to hadley wickham, nigol, ggplot2

hadley wickham <h.wi...@gmail.com> wrote

>Hmmm, I seem to have implemented the qq-plot differently to qqnorm.
>Do you have a good reference for the construction of qqplots?

They are described in one of Bill Cleveland's books ... I forget which book, either Visualizing Data or The Elements of Graphing Data. His code is available too, but I forget exactly where.

Peter

Peter L. Flom, PhD
Statistical Consultant
Website: www DOT peterflomconsulting DOT com
Writing; http://www.associatedcontent.com/user/582880/peter_flom.html
Twitter: @peterflom

nigol

unread,

Nov 15, 2009, 10:55:35 AM11/15/09

to ggplot2

It seems that qqnorm uses the approach mentioned in: <http://
www.stats.gla.ac.uk/steps/glossary/probability_distributions.html#qqplot>.
I found that reference in wikipedia: <http://en.wikipedia.org/wiki/Q-
Q_plot>.

Here is an excerpt of that webpage:
The quantile-quantile (Q-Q) plot is constructed using the theoretical
cumulative distribution function, F(x), of the specified model. The
values in the sample of data, in order from smallest to largest, are
denoted x(1), x(2), ..., x(n). For i = 1, 2, ....., n, x(i) is plotted
against F-1((i-0.5)/n).

Dario

hadley wickham

unread,

Nov 16, 2009, 12:08:21 AM11/16/09

to nigol, ggplot2

Ah, yes, that's the definition I used. Thanks!
Hadley

Kasper Daniel Hansen

unread,

Nov 19, 2009, 12:24:23 PM11/19/09

to hadley wickham, nigol, ggplot2

The best reference I know of on this is Cleveland's graphing data. He
has clearly spend a lot of time thinking about this.

While the essentials of a qqplot are pretty easy, there are a
surprising number of important refinements as far as I remember. I
should say that based on my recollection, Cleveland focuses on qqplot
between two empirical distributions (where things gets especially
fuzzy when they have different sample sizes)

Kasper

JiHO

unread,

Jan 6, 2010, 6:25:47 PM1/6/10

to hadley wickham, nigol, ggplot2

Sorry to revive an old topic but I hit this difference today. To
remind you of the discussion, the question was about the discrepancy
between qqnorm in base R and stat="qq" in ggplot2.

Interesting bit:

> > It seems that qqnorm uses the approach mentioned in: <http://
> > www.stats.gla.ac.uk/steps/glossary/probability_distributions.html#qqplot>.
> > I found that reference in wikipedia: <http://en.wikipedia.org/wiki/Q-
> > Q_plot>.
> >
> > Here is an excerpt of that webpage:
> > The quantile-quantile (Q-Q) plot is constructed using the theoretical
> > cumulative distribution function, F(x), of the specified model. The
> > values in the sample of data, in order from smallest to largest, are
> > denoted x(1), x(2), ..., x(n). For i = 1, 2, ....., n, x(i) is plotted
> > against F-1((i-0.5)/n).
>

> Ah, yes, that's the definition I used. Thanks!

So I tried to pinpoint the difference with qqnorm more precisely and
here is the result

library("ggplot2")
d <- 1:10

# base R
quartz()
qqnorm(d)
# recreate manually
n <- length(d)
sampleQuant <- quantile(d, seq(0, 1, length=n))
normQuant <- qnorm(ppoints(n))
plot(normQuant, sampleQuant)
# -> perfect match

# ggplot2
quartz()
qplot(sample=d)
qplot(normQuant, sampleQuant)
# -> not quite
sampleQuant <- quantile(d, ppoints(n))
qplot(sample=d)
qplot(normQuant, sampleQuant)
# -> match

So it seems that, for the *sample* quantiles, the probabilities are:
- 0:(n-1)/n in base R, which is equivalent to plotting the sample
sorted in increasing order
- (1:n - 0.5)/n in ggplot2

According to the definition above, base R is right ((1:n - 0.5)/n is
used for the inverse CDF *only*), ggplot is wrong. Hadley, would you
care to comment on the difference?

JiHO
---
http://maururu.net

hadley wickham

unread,

Jan 8, 2010, 2:02:25 PM1/8/10

to JiHO, nigol, ggplot2

stat_qq just uses the output from ppoints. It seems like there are
some differences of opinion about exactly quantiles the default should
be - if you have a specific choice in mind, I think you should just
supply them using the quantiles argument so that there is no
confusion.

Hadley

--
http://had.co.nz/

JiHO

unread,

Jan 11, 2010, 7:50:59 AM1/11/10

to hadley wickham, nigol, ggplot2

On Fri, Jan 8, 2010 at 20:02, hadley wickham <h.wi...@gmail.com> wrote:
> stat_qq just uses the output from ppoints. It seems like there are
> some differences of opinion about exactly quantiles the default should
> be - if you have a specific choice in mind, I think you should just
> supply them using the quantiles argument so that there is no
> confusion.

to jump back on that (sorry, a bit late again), the opinions might be
different indeed, but the reference you said you used :

It seems that qqnorm uses the approach mentioned in:

<http://www.stats.gla.ac.uk/steps/glossary/probability_distributions.html#qqplot>.

I found that reference in wikipedia:
<http://en.wikipedia.org/wiki/Q- Q_plot>.

Here is an excerpt of that webpage: The
quantile-quantile (Q-Q) plot is constructed using the
theoretical cumulative distribution function, F(x),
of the specified model. The values in the sample of
data, in order from smallest to largest, are denoted
x(1), x(2), ..., x(n). For i = 1, 2, ....., n, x(i)
is plotted against F-1((i-0.5)/n).

is what base-R does, not what ggplot actually does. It might be that
ggplot is right, I am not knowledgeable enough to know that, but I
think it should be justified on the help page of stat_qq, with an
appropriate reference. And, at least because it is different from the
default behaviour of R (however wrong it may be), it should be
strongly argumented.

Thanks in advance,

JiHO
---
http://maururu.net

hadley wickham

unread,

Jan 11, 2010, 9:59:07 AM1/11/10

to JiHO, nigol, ggplot2

> is what base-R does, not what ggplot actually does. It might be that
> ggplot is right, I am not knowledgeable enough to know that, but I
> think it should be justified on the help page of stat_qq, with an
> appropriate reference. And, at least because it is different from the
> default behaviour of R (however wrong it may be), it should be
> strongly argumented.

Ok, I finally get what you're complaining about and I think you're
right - I was using the quantiles for both the sample and theoretical
distributions, when I should only be using them for the theoretical.

Here's what the new code will look like:

data <- remove_missing(data, na.rm, "sample", name = "stat_qq")

sample <- sort(data$sample)
n <- length(sample)

# Compute theoretical quantiles
if (is.null(quantiles)) {
quantiles <- ppoints(n)
} else {
stopifnot(length(quantiles) == n)
}
theoretical <- safe.call(distribution, list(p = quantiles, ...))

data.frame(sample, theoretical)

But that means you can no longer specify (e.g.) quantiles = seq(0.05,
0.95, by = 0.05)) if you have a large dataset and only want to see
select quantiles. I think that was my original intention, which I now
see is a bit different to qqnorm.

Hadley

--
http://had.co.nz/

JiHO

unread,

Jan 28, 2010, 6:17:25 AM1/28/10

to hadley wickham, nigol, ggplot2

Sorry, I did not see that you replied until today. I hope you still
remember the discussion a little bit. I think your approach of just
showing a few quantiles instead of the whole dataset is important too.
It's just that, when the number of quantiles is n = number of data
points, the quantiles should just be the sorted sample.
I think one issue here is the definition of sample quantiles. In the
current version of ggplot, quantiles uses the default quantile
definition which is *continuous*. Therefore, even when the number of
quantiles is equal to the number of points, there is still
interpolation going on:

> d <- 1:10
> d
[1] 1 2 3 4 5 6 7 8 9 10
> quantile(d, ppoints(10))
6.097561% 15.85366% 25.60976% 35.36585% 45.12195% 54.87805% 64.63415% 74.39024%
1.548780 2.426829 3.304878 4.182927 5.060976 5.939024 6.817073 7.695122
84.14634% 93.90244%
8.573171 9.451220

In the case of a qq-plot, I think that what you want to see on the
sample side are your data points, directly, without interpolation. In
the "usual" case where the number of quantiles is n, you just need to
sort your data. In the more general case, when you want to be able to
display just a sub-sample of the data corresponding to certain
quantiles (and I think this is desirable), what you need is a
*discontinuous* method of computing your quantiles, i.e. something
that will *not* interpolate and just give you the closest data point.
That's methods 1, 2 and 3 or the quantile function:

> quantile(d, ppoints(10), type=1)
6.097561% 15.85366% 25.60976% 35.36585% 45.12195% 54.87805% 64.63415% 74.39024%
1 2 3 4 5 6 7 8
84.14634% 93.90244%
9 10
> quantile(d, ppoints(10), type=2)
6.097561% 15.85366% 25.60976% 35.36585% 45.12195% 54.87805% 64.63415% 74.39024%
1 2 3 4 5 6 7 8
84.14634% 93.90244%
9 10
> quantile(d, ppoints(10), type=3)
6.097561% 15.85366% 25.60976% 35.36585% 45.12195% 54.87805% 64.63415% 74.39024%
1 2 3 4 5 5 6 7
84.14634% 93.90244%
8 9

In this case, I'd say go for the simplest one (type=1). So the change
compared to the previous version of the code could just be:

diff --git a/R/stat-qq.r b/R/stat-qq.r
index 7219425..4dc51e1 100644
--- a/R/stat-qq.r
+++ b/R/stat-qq.r
@@ -19,7 +19,7 @@ StatQq <- proto(Stat, {
}

theoretical <- safe.call(distribution, list(p = quantiles, ...))

- sample <- quantile(data$sample, probs=quantiles, na.rm=na.rm)
+ sample <- quantile(data$sample, probs=quantiles, na.rm=na.rm, type=1)

data.frame(sample, theoretical, group = data$group[1])
}

But I am no authority on the matter and this might require more
profound thinking. In the default case (nb of quantiles = n) it
matches qqnorm at least.

JiHO
---
http://maururu.net

Kasper Daniel Hansen

unread,

Jan 28, 2010, 9:32:49 AM1/28/10

to JiHO, hadley wickham, nigol, ggplot2

Why don't you read the Cleveland reference I gave earlier (Graphing Data). He has spent a lot of time thinking profoundly about these issues. It may be that your solution is similar to his (I have not checked), but what is wrong with standing on the shoulder of giants.

One side note: you can use
qq <- qqplot(x, y, plot.it = FALSE)
to get the usual R coordinates as
qq$x, qq$y

Kasper

hadley wickham

unread,

Jan 28, 2010, 10:01:05 AM1/28/10

to Kasper Daniel Hansen, JiHO, nigol, ggplot2

> Why don't you read the Cleveland reference I gave earlier (Graphing Data). He has spent a lot of time thinking profoundly about these issues. It may be that your solution is similar to his (I have not checked), but what is wrong with standing on the shoulder of giants.

I don't think that reference is as useful as you think it is! I read
it a couple of weeks ago and I found it pretty superficial - it tells
you what Cleveland does but not why.

Hadley

--
http://had.co.nz/

Nicholas Lewin-Koh

unread,

Jan 28, 2010, 11:27:42 AM1/28/10

to hadley wickham, Kasper Daniel Hansen, JiHO, nigol, ggplot2

Yes, Just sent me back to the shelf too. He sites Chambers et al. (1983)
and refers the method he shows as "a convention in statistical science".
Quantiles are calculated by linear interpolation which is probably not a
big problem, as long as interpolation is local. Handling ties may get
some hackles up. But as Hadley says, the overall treatment is sketchy.
He does site Wilks and Gnanadesikan from 1968 in Biometrika
as the inventors of the qqplot, I would suspect more theoretical
justification is there.

Just my snipe

Nicholas

On Thu, 28 Jan 2010 09:01 -0600, "hadley wickham" <h.wi...@gmail.com>
wrote:

Reply all

Reply to author

Forward