geom_point to discard points when there are too many?

Giovanni Azua

Sep 9, 2012, 11:42:36 AM
to ggplot2
Hello,

Because I have a lot of data, I get very cluttered plots using geom_point. Is there a clever option that avoids cluttering the plot when there are too many data points, but that doesn't discard the extreme ones? I could do that by pre-filtering the data frame using subset, but it would be a lot more convenient if ggplot2 had such an option ...

TIA,
Best regards,
Giovanni

Roman Luštrik

Sep 9, 2012, 11:45:19 AM
to Giovanni Azua, ggplot2
See ?geom_jitter.

Cheers,
Roman




--
In God we trust, all others bring data.

Roman Luštrik

Sep 9, 2012, 11:49:35 AM
to Giovanni Azua, ggplot2
Bah, senior moments starting early.

geom_jitter will not solve your problem, obviously. What you could try is changing the alpha value of your points. That way you will get darker clouds where there are more points. This is demonstrated here:  http://had.co.nz/ggplot2/geom_point.html
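
For instance, something along these lines (made-up data, purely to show the effect):

library(ggplot2)

## made-up example: 50,000 heavily overlapping points
df <- data.frame(x = rnorm(50000), y = rnorm(50000))

## semi-transparent points -- dense regions show up as darker clouds
ggplot(df, aes(x, y)) +
  geom_point(alpha = 0.05)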

Cheers,
Roman

Giovanni Azua

Sep 9, 2012, 11:52:06 AM
to ggplot2
Hi

Thank you for your prompt answers. Indeed, jittering doesn't solve my problem; besides, in this case having more points just clutters the plot and doesn't add any value:

[attachment: PastedGraphic-1.pdf]

Ben Bolker

Sep 9, 2012, 11:56:04 AM
to ggp...@googlegroups.com
I believe there are a variety of discussions of this on StackOverflow and
elsewhere (some are concerned with graphical clarity, others with slow
rendering and the file size of very large scatterplots). Setting the
transparency (alpha) as suggested is one answer; others are geom_hex()
(hexagonal binning) and 2-D density summaries such as geom_density2d().
If you don't want to thin your data set unconditionally (i.e. take a random
subsample of the whole thing) because you're worried about losing outlying
points (which are probably a lot less important than they look!), then one
of these data summaries might be the way to go. (Hexagonal binning in
particular is very efficient.)
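
For a true x-y scatterplot, a quick sketch of the hexagonal-binning route
might look like this (made-up data; geom_hex() needs the hexbin package
installed):

library(ggplot2)

## made-up data: 100,000 points would overwhelm a plain geom_point()
df <- data.frame(x = rnorm(1e5), y = rnorm(1e5))

## each hexagon's fill encodes how many points fall inside it
ggplot(df, aes(x, y)) +
  geom_hex(bins = 50)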


Ben Bolker

Sep 9, 2012, 11:59:57 AM
to ggp...@googlegroups.com
On 12-09-09 11:52 AM, Giovanni Azua wrote:
> Hi
>
> Thank you for your prompt answers. Indeed, jittering doesn't solve my
> problem; besides, in this case having more points just clutters the plot
> and doesn't add any value:
>
> Best!
> Giovanni

Oops, I didn't look at your example; I was thinking of a true 2-D
scatterplot. I would consider

(1) dispensing with geom_point() entirely, or
(2) doing something (outside ggplot) that subsets the data to only the
extremes, e.g. along the lines of:

library(plyr)  # for ddply()

## within each group defined by 'splitvar' (a placeholder for your grouping
## column), keep only rows outside the 2.5% and 97.5% runtime quantiles
subdat <- ddply(origdata, .(splitvar), function(x) {
  lwr <- quantile(x$runtime, 0.025)
  upr <- quantile(x$runtime, 0.975)
  x[x$runtime < lwr | x$runtime > upr, ]  # '<' for the lower tail, not '>'
})

... + geom_point(data=subdat)

...
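
In case it's useful, here is a rough sketch of how that could fit into a full
plot. The names (origdata, splitvar, runtime) are just the placeholders from
the snippet above, the data are made up, and the boxplot layer is only one
possible way of summarising the bulk of the data:

library(ggplot2)
library(plyr)

## made-up data standing in for the real 'origdata'
origdata <- data.frame(
  splitvar = rep(c("A", "B", "C"), each = 20000),
  runtime  = rexp(60000)
)

## extreme points only, per group (same idea as above)
subdat <- ddply(origdata, .(splitvar), function(x) {
  lwr <- quantile(x$runtime, 0.025)
  upr <- quantile(x$runtime, 0.975)
  x[x$runtime < lwr | x$runtime > upr, ]
})

## summarise the bulk with a boxplot and draw only the extremes as points
ggplot(origdata, aes(x = splitvar, y = runtime)) +
  geom_boxplot(outlier.shape = NA) +
  geom_point(data = subdat, alpha = 0.3)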


David Johnston

Sep 9, 2012, 11:49:31 AM
to Giovanni Azua, ggplot2
I do not believe that throwing away random data should be considered a
desirable feature.

If you are dealing with overlapping points, you should try assigning an
alpha level to either the geom itself or the color (alpha("blue", 0.10)).

Otherwise you should probably consider alternative forms of visualization;
binning with a tile map would be worth a look.

A lot depends on the actual data...
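
For example, a rough sketch of the tile-map idea with made-up data (here
using geom_bin2d(), which counts the points that fall into each rectangular
bin):

library(ggplot2)

## made-up data, purely illustrative
df <- data.frame(x = rnorm(1e5), y = rnorm(1e5))

## rectangular ("tile") binning: fill encodes the number of points per bin
ggplot(df, aes(x, y)) +
  geom_bin2d(bins = 40)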

David J.

