xlim(), stat_bin, the group aesthetic and discrete scales

John Rauser

May 11, 2010, 1:51:55 PM
to ggp...@googlegroups.com
(Trying a more specific list, apologies to r-help@ subscribers that may have already seen this.)

I'm a new ggplot user and am really loving it so far, but I have run into a little trouble.  I'm hoping someone here can clarify.

I've been plotting histograms of discrete variables like so:

> df<-data.frame(names=c("Bob","Mary","Joe","Bob","Bob"))
> p<-ggplot(df,aes(names))
> p+geom_histogram()

But sometimes I want to constrain the categories (the factor in my real dataset has many, many levels).  When I do this in (what seems like) the simplest way I get an error:

> p+geom_histogram()+xlim("Bob","Mary")
Error in data.frame(count = as.numeric(tapply(weight, bins, sum, na.rm = TRUE))$
  arguments imply differing number of rows: 0, 1

If I use xlim() to specify all the levels, or all the levels plus a few extra levels, things work as expected:

> p+geom_histogram()+xlim("Bob","Mary","Joe")
> p+geom_histogram()+xlim("Bob","Mary","Joe","Frank")

I found the error surprising because I had been successfully plotting ..density.. instead of the default ..count.. and constraining the x-axis like so:

> p+geom_histogram(aes(y=..density..,group=1))+xlim("Bob","Mary")

After some experimentation, I've come to the conclusion that using the group aesthetic is the key to successfully using xlim() with stat_bin and discrete scales.  That is, the following does what I want:

p+geom_histogram(aes(group=1))+xlim("Bob","Mary")

I have two questions:

1) Is this expected behavior?  The interaction of the group aesthetic and discrete scale manipulation seems unintuitive to me, but perhaps this is because I'm a ggplot novice.

2) More generally, how do people debug ggplot commands?  When I get a cryptic error message like the above it's nearly always because I've asked for something unreasonable, but it often takes me a long time and lots of experimentation to spot my error. Do folks have any ggplot debugging tips?

Thanks very much for your help,

-J

Dennis Murphy

May 11, 2010, 4:35:49 PM
to John Rauser, ggp...@googlegroups.com
On Tue, May 11, 2010 at 10:51 AM, John Rauser <jra...@gmail.com> wrote:
> (Trying a more specific list, apologies to r-help@ subscribers that may have already seen this.)
>
> I'm a new ggplot user and am really loving it so far, but I have run into a little trouble.  I'm hoping someone here can clarify.
>
> I've been plotting histograms of discrete variables like so:
>
> > df<-data.frame(names=c("Bob","Mary","Joe","Bob","Bob"))
> > p<-ggplot(df,aes(names))
> > p+geom_histogram()

Firstly, a histogram is intended for numeric data; you want a bar chart instead.
Secondly, xlim() [and ylim()] are *numeric* scales, not categorical. How do you scale 'Bob' and 'Mary'?
To put it another way, what's the numerical 'distance' between Bob and Mary that is necessary
for a histogram routine to rationally proceed? The fact that it 'worked' at all is because R assigns
numerical (integer) codes to different levels of factors (look at str(df)), where the codes are
assigned in lexicographic order.

To restrict the 'scale' in your bar chart, subset your data accordingly before constructing the plot
and then redefine the factor variable (names) to drop unused factor levels before plotting.

I respectfully suggest that you educate yourself about the concept of measurement scales, starting
with the distinctions among nominal, ordinal, interval and ratio scales of data. There is a copious amount
of material on the web to assist you in this enterprise.

As far as subselecting categories in R, look into the %in% operator  [type ?"%in%" at the R prompt]
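
The subset-then-drop-levels approach described above can be sketched in base R as follows. The names `keep` and `sub` are illustrative and not from the thread, and the guarded plotting step assumes a current ggplot2 release in which geom_bar() counts rows by default.

```r
# Example data from the original post
df <- data.frame(names = c("Bob", "Mary", "Joe", "Bob", "Bob"))

# Categories to keep (illustrative choice)
keep <- c("Bob", "Mary")

# Keep only the rows whose value is one of the wanted categories
sub <- df[df$names %in% keep, , drop = FALSE]

# Rebuild the factor so that unused levels ("Joe") are dropped
sub$names <- factor(sub$names, levels = keep)

# Counts per remaining level
print(table(sub$names))

# Build the bar chart only if ggplot2 is available (assumed, not required)
if (requireNamespace("ggplot2", quietly = TRUE)) {
  p_sub <- ggplot2::ggplot(sub, ggplot2::aes(names)) + ggplot2::geom_bar()
}
```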

HTH,
Dennis

John Rauser

May 11, 2010, 5:24:57 PM
to Dennis Murphy, ggp...@googlegroups.com
On Tue, May 11, 2010 at 1:35 PM, Dennis Murphy <djm...@gmail.com> wrote:

> Firstly, a histogram is intended for numeric data; you want a bar chart instead.

Terminology aside, geom_histogram() is merely a synonym for geom_bar(stat="bin").  See the opening sentence of: http://had.co.nz/ggplot2/geom_histogram.html.  Also:

> p+geom_bar(stat="bin")

> Secondly, xlim() [and ylim()] are *numeric* scales, not categorical. How do you scale 'Bob' and 'Mary'?

I don't think you are correct.  My understanding is that xlim() is simply a helper function that sets the limit parameter of whatever scale you've got.  stat_bin seems to be perfectly happy to work on categorical data, the online docs even provide nice examples (http://had.co.nz/ggplot2/stat_bin.html), as does Dr. Wickham's lovely book.  Further, the online docs also give nice examples of using limits= and xlim() with scale_discrete (http://had.co.nz/ggplot2/scale_discrete.html).  So, near as I can tell, these are equivalent:

> p+geom_bar(stat="bin")+scale_x_discrete(limits=c("Bob","Joe","Mary"))
> p+geom_bar(stat="bin")+xlim("Bob","Joe","Mary")
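
One way to check that claim is to inspect the scale objects directly. This is a minimal sketch that assumes a current ggplot2 release is installed (the 2010 version discussed in this thread may behave differently); `via_xlim` and `via_scale` are illustrative names.

```r
if (requireNamespace("ggplot2", quietly = TRUE)) {
  # xlim() with character arguments should produce a discrete position scale
  via_xlim <- ggplot2::xlim("Bob", "Mary")

  # The explicit form, with the limits set on the scale itself
  via_scale <- ggplot2::scale_x_discrete(limits = c("Bob", "Mary"))

  # Compare the class of the helper's result and the limits stored on each scale
  print(class(via_xlim)[1])
  print(identical(via_xlim$limits, via_scale$limits))
}
```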

Indeed this gives me my (now worked around) error:

> p+geom_bar(stat="bin")+scale_x_discrete(limits=c("Bob","Joe"))

Error in data.frame(count = as.numeric(tapply(weight, bins, sum, na.rm = TRUE))$
  arguments imply differing number of rows: 0, 1

> I respectfully suggest that you educate yourself...

Right... thanks for that.

I remain confused about the interaction between the group aesthetic and discrete scales (or some deeper thing that I can't yet fathom).  I have worked around my problem, so I'm mostly looking to understand more completely.  I'd be very appreciative if someone could help with that.

Best,

-J

Hadley Wickham

May 25, 2010, 1:23:49 PM
to John Rauser, ggp...@googlegroups.com
> 1) Is this expected behavior?

Definitely not - this is a small bug in stat_bin which will be fixed
in the next version. I will probably rewrite stat_bin at some point
in the future to be much more efficient - currently it's a bit slow,
particularly if you have a large number of discrete values and haven't
set group = 1.

> 2) More generally, how do people debug ggplot commands?  When I get a
> cryptic error message like the above it's nearly always because I've asked
> for something unreasonable, but it often takes me a long time and lots of
> experimentation to spot my error. Do folks have any ggplot debugging tips?

Unfortunately not :( I look at the output of tr() and then add in
browser() statements where I think the problem is likely to be, but
that's probably not helpful for most people.
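
A generic sketch of that workflow in base R, for readers who want to try it: the functions `bin_counts` and `build_plot` are hypothetical stand-ins for the layers of a plotting call, not ggplot2 internals. The call stack captured here is the same kind of information that traceback() prints after an error in an interactive session.

```r
# Hypothetical inner step that fails when values fall outside the limits
bin_counts <- function(x, limits) {
  kept <- x[x %in% limits]
  # browser()  # uncomment in an interactive session to pause and inspect `kept`
  if (length(kept) != length(x)) stop("some values fall outside the limits")
  table(factor(kept, levels = limits))
}

# Hypothetical outer call, standing in for the user-facing command
build_plot <- function(x, limits) bin_counts(x, limits)

# Record the call stack at the moment the error is signalled,
# then let tryCatch() turn the error into its message
stack <- NULL
msg <- tryCatch(
  withCallingHandlers(
    build_plot(c("Bob", "Mary", "Joe"), limits = c("Bob", "Mary")),
    error = function(e) stack <<- sys.calls()
  ),
  error = function(e) conditionMessage(e)
)

print(msg)
print(stack)
```

Reading the stack from the outermost call inward shows which function raised the error, which is a reasonable place to put the browser() call.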

Hadley

--
Assistant Professor / Dobelman Family Junior Chair
Department of Statistics / Rice University
http://had.co.nz/