how to best deal with overplotting

Mike

Nov 30, 2009, 4:14:03 PM
to ggplot2
All,

I know that there are several different ways to deal with
overplotting in ggplot2, but none of them seems to capture what I am
looking for when "the rubber hits the road". My problem is as
follows:

1) I have data across 2 dimensions, x & y, with one legend already
consumed by a categorical factor
2) I have data that will span several orders of magnitude, sometimes
with sparse data, and sometimes with dense data.

What I am "normally" used to doing is to perform an x-y "point"
plot with different colors to represent the different categories, then
I will place a histogram below the x-axis and to the left of the y-
axis (broken down by factor in each case). I cannot seem to find a
nice, easy way to create a histogram inside of the plot. It can be
done, but not easily at all, to my knowledge / skill level.

I have tried the following methods:
o point plot, making the points smaller, turning on a little
jitter, and giving some alpha. I don't like this because (a) I REALLY
have to make the points tiny to make data points not pile on top of
each other which makes the sparse points hard to see, (b) I don't like
jitter because it is distorting the data, and (c) alpha similarly
makes certain points disappear.
o bin2d plot, with the colour mapped to the factor and the fill
mapped to ..count.. I don't like this because a particular bin is
EITHER one factor or the other, so a lot of data is lost. Plus, when
I map two "aes" parameters, ..count.. and my factor, the legend for
count becomes empty. I could probably figure out what I am doing
wrong in that, though.
o I have also created a point plot like above, but with an overlay
of "stat_density2d", using "tile" and fill by density, to create an
effect similar to what Hadley has in his example. I then put a low
alpha on this plot so that it is mostly see-through and I can see the
points behind it. But again, this ends up making the graph just not-
quite-right. The density layer really just darkens things and it is
hard to see the variation in density.

Barring a solution that would relatively easily create a histogram,
which would be ideal, the easiest other solution that I see is to
plot the points in a different ORDER. E.g., if my factor has 3
levels, A, B, and C, I would like to be able to plot so that A is
drawn first, then B, then C. Then draw the plot again so that C is
drawn first, then A, etc. In this manner, I can use multiple plots of
the same data set to see how many points of the "other" levels are
hidden behind the level that is drawn last. I am hoping this is
feasible in ggplot2. Otherwise, I'll have to go back to non-ggplot
plots, and draw the data one factor level at a time.
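
For reference, the draw-order idea described above could be sketched roughly as follows. This is only an illustration under assumptions: the data frame `mydf`, its columns `x`, `y`, `catvar`, and the helper `plot_in_order()` are all hypothetical names, and the approach relies on ggplot2 drawing layers in the order they are added, so one `geom_point()` layer is built per level.

```r
library(ggplot2)

# Hypothetical example data: two numeric columns and a 3-level factor.
set.seed(1)
mydf <- data.frame(
  x = rnorm(300),
  y = rnorm(300),
  catvar = factor(sample(c("A", "B", "C"), 300, replace = TRUE))
)

# Hypothetical helper: add one point layer per level, in the requested
# order. Layers are drawn in the order they are added, so the last level
# in draw_order ends up on top.
plot_in_order <- function(df, draw_order) {
  layers <- lapply(draw_order, function(lev) {
    geom_point(data = df[df$catvar == lev, , drop = FALSE])
  })
  ggplot(df, aes(x, y, colour = catvar)) + layers
}

p_abc <- plot_in_order(mydf, c("A", "B", "C"))  # C drawn last, on top
p_cab <- plot_in_order(mydf, c("C", "A", "B"))  # B drawn last, on top
```

Printing `p_abc` and `p_cab` side by side shows which points of the other levels are hidden in each ordering.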

Thanks!
Mike Williamson

hadley wickham

Dec 1, 2009, 5:15:50 PM
to Mike, ggplot2
Hi Mike,

I don't see how marginal histograms help you with overplotting - they
do not tell you anything about the joint distribution. The technique
that I'd try first is bin2d facetted by your classification variable.
Otherwise, you could do something like:

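  ggplot(mydf, aes(x, y)) +
    # Background: catvar is dropped, so every facet shows all points in grey
    geom_point(data = transform(mydf, catvar = NULL), colour = "grey80") +
    # Foreground: each facet highlights only its own category in colour
    geom_point(aes(colour = catvar)) +
    facet_wrap(~ catvar)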

to get a facetted scatterplot with all points in the background
of each plot.
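
The first suggestion above (bin2d facetted by the classification variable) might look something like the sketch below. The data frame `mydf` and its columns are made-up stand-ins, the log scales are an assumption based on the data spanning several orders of magnitude, and `bins = 30` is an arbitrary choice.

```r
library(ggplot2)

# Hypothetical skewed data covering several orders of magnitude.
set.seed(42)
mydf <- data.frame(
  x = rlnorm(3000, sdlog = 2),
  y = rlnorm(3000, sdlog = 2),
  catvar = factor(sample(c("A", "B", "C"), 3000, replace = TRUE))
)

# One panel per category; within each panel the fill shows the bin count,
# so no category hides another.
p_bin <- ggplot(mydf, aes(x, y)) +
  geom_bin2d(bins = 30) +
  scale_x_log10() +
  scale_y_log10() +
  facet_wrap(~ catvar)
```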

Hadley


--
http://had.co.nz/

Mike

Dec 2, 2009, 1:04:20 PM
to ggplot2

Thanks for the advice! This method for faceting is very clever &
perfectly effective.

I am still looking for a nice means to put histograms with the
plots... I think I might just create a "wrapper function" that will
handle this.

Hadley gave me some further advice in some personal
communications, all of which was great. I just wanted to reply
quickly in this thread that this is a nice way to deal with
overplotting and different factors.

Thanks!
Mike

JiHO

Dec 6, 2009, 7:30:54 PM
to Mike, ggplot2
On Wed, Dec 2, 2009 at 19:04, Mike <this....@gmail.com> wrote:
>    I am still looking for a nice means to put histograms with the
> plots... I think I might just create a "wrapper function" that will
> handle this.

It would not be exactly histograms, but it might also be less
intrusive: you could try geom_rug with an alpha value < 1. This will
give you "bands" along each axis which are darker where there is more
data.

But overall, I of course agree with Hadley that this would only show
you where data is concentrated but won't help you determine what is
where.
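
A minimal sketch of the geom_rug idea, assuming a hypothetical data frame `mydf` with columns `x`, `y` and `catvar`; the alpha of 0.1 and the point size are arbitrary and would need tuning to the density of the real data:

```r
library(ggplot2)

# Hypothetical example data.
set.seed(7)
mydf <- data.frame(
  x = rlnorm(2000),
  y = rlnorm(2000),
  catvar = factor(sample(c("A", "B", "C"), 2000, replace = TRUE))
)

# Semi-transparent rug marks pile up into darker bands along each axis
# wherever the data are dense.
p_rug <- ggplot(mydf, aes(x, y, colour = catvar)) +
  geom_point(size = 1) +
  geom_rug(alpha = 0.1)
```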

JiHO
---
http://maururu.net