slow geom_boxplot performance on ~400,000 rows


Jason Edgecombe

Jun 25, 2011, 4:17:12 PM
to ggplot2
Hello everyone,

I'm trying to create a box & whiskers plot using ggplot2 on a data
frame with 2 columns and slightly less than 400,000 rows, but
performance is a problem. I've looked at
https://github.com/hadley/ggplot2/wiki/Case-Study%3A-Raman-Spectroscopic-Grading-of-Gliomas
but I can't understand how to precompute the values for the boxplot.

FYI, when running this through Sweave, I get the following warning:
Warning: position_dodge requires non-overlapping x intervals

The data is from my Garmin GPS running watch, and records multiple
workouts and heart rate at per-second granularity (approximated). I'm
trying to show a boxplot of my heart rate over each second during a
workout as grouped by month.

Below is an example of what I'm trying to do. Strangely, the example
code runs in about 5 seconds, but my real code never seems to finish
(it takes longer than 3 minutes), both with and without Sweave.
================cut==================
require(ggplot2)
require(zoo)

row.count=400000
months=seq(as.Date("2010/1/1"), as.Date("2011/6/1"), by="mon")

df=data.frame(yearmon=sample(months, row.count, replace=T),
heart.rate=runif(row.count, min=80, max=160))

print(ggplot(df, aes(x=yearmon, y=heart.rate, group=yearmon))
+ geom_boxplot()
+ opts(title = "Heart rate per minute summarized by month")
+ opts(axis.text.x=theme_text(angle=-90, hjust=0))
+ xlab("Month")
+ ylab("Heart Beats per Minute")
)

#boxplot(heart.rate ~ yearmon, data = df)
================cut==================

In my real code, my data frame is named "hr.month" instead of "df".
Here is some possibly useful info:
> summary(as.factor(hr.month$yearmon))
2009-12-01 2010-01-01 2010-02-01 2010-03-01 2010-04-01 2010-05-01 2010-06-01
     11308      17071      15393      13806      27418      40702      31987
2010-07-01 2010-08-01 2010-09-01 2010-10-01 2010-11-01 2010-12-01 2011-01-01
     18659      15958      20659      25321      15313       7726       7692
2011-02-01 2011-03-01 2011-04-01 2011-05-01 2011-06-01
     22331      21316      25773      38007      11028
> str(hr.month)
'data.frame': 387468 obs. of 2 variables:
$ yearmon : Date, format: "2009-12-01" "2009-12-01" ...
$ HeartRateBpm: num 103 102 102 101 101 ...


Thanks,
Jason

James Rome

Jun 26, 2011, 9:38:24 AM
to Jason Edgecombe, ggplot2
ggplot2 is very slow. But it may be that it is throwing an error that
you cannot see.

--
James A. Rome
Phone: (865) 482-5643
E-mail: jame...@gmail.com
URL: http://jamesrome.net

Jason Edgecombe

Jun 26, 2011, 9:53:53 AM
to ggp...@googlegroups.com

In https://github.com/hadley/ggplot2/wiki/Case-Study%3A-Raman-Spectroscopic-Grading-of-Gliomas , Hadley referred to a way of computing the statistics and then feeding the statistics, not the raw data, to ggplot. How can I do that?

Any other ways of speeding things up are welcome.

On 06/26/2011 09:38 AM, James Rome wrote:
> ggplot2 is very slow. But it may be that it is throwing an error that
> you cannot see.

James McCreight

Jun 26, 2011, 12:00:02 PM
to Jason Edgecombe, ggp...@googlegroups.com
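The issue here is that you have a massive factor variable along the x-axis. I noted recently that this causes major slowdowns.
http://groups.google.com/group/ggplot2/browse_thread/thread/c35c47c7a3adf52f/d94d2a9c95176513?lnk=gst&q=mccreight+speed#d94d2a9c95176513

I'm assuming it's in the calculations, because if you precompute them you'll get a massive speedup.
http://groups.google.com/group/ggplot2/browse_thread/thread/3f002345b780db1f/f8282a77521e550f?lnk=gst&q=mccreight+boxplot#f8282a77521e550f

This latter one is one I recommended for the examples on geom_boxplot, just to allude to the other thread I just posted on today.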
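
For readers who cannot follow the linked threads, here is a minimal sketch of the precomputing idea asked about earlier in this thread. It is not the code from those threads. It assumes a data frame shaped like the hr.month described above (a Date column yearmon and a numeric column HeartRateBpm) and uses synthetic data as a stand-in. The name box.stats is made up for illustration. The five numbers per month come from base R's boxplot.stats(), and geom_boxplot(stat = "identity") draws them without touching the raw rows.

```r
library(ggplot2)

# Synthetic stand-in for hr.month (assumption: same column names as the real data)
set.seed(1)
months <- seq(as.Date("2010-01-01"), as.Date("2011-06-01"), by = "month")
hr.month <- data.frame(
  yearmon      = sample(months, 400000, replace = TRUE),
  HeartRateBpm = runif(400000, min = 80, max = 160)
)

# Split the heart rates by month, then compute the five boxplot numbers per month.
# boxplot.stats()$stats is: lower whisker, lower hinge, median, upper hinge, upper whisker.
groups <- split(hr.month$HeartRateBpm, hr.month$yearmon)
stats  <- t(sapply(groups, function(x) boxplot.stats(x)$stats))

# One row per month instead of ~400,000 rows (box.stats is a hypothetical name)
box.stats <- data.frame(
  yearmon = as.Date(names(groups)),
  ymin    = stats[, 1],
  lower   = stats[, 2],
  middle  = stats[, 3],
  upper   = stats[, 4],
  ymax    = stats[, 5],
  row.names = NULL
)

# stat = "identity" tells geom_boxplot to use the supplied values as-is
p <- ggplot(box.stats, aes(x = yearmon, group = yearmon)) +
  geom_boxplot(aes(ymin = ymin, lower = lower, middle = middle,
                   upper = upper, ymax = ymax),
               stat = "identity") +
  xlab("Month") +
  ylab("Heart rate (bpm)")
print(p)
```

Since ggplot only ever sees one row per month here, the plotting cost should no longer depend on the number of raw observations.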

I was recently in a similar position and found, only after all of the above, that line plots (different colored lines at the hinges) contained as much information as boxplots. Looking at this many boxplots is a bit ridiculous, unless it's going to be put up on a wall, I suppose. Just a side note.
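
A rough sketch of the line-plot alternative, under the same assumptions as the thread's example data (a Date column yearmon and a numeric column HeartRateBpm, synthetic here). It uses quantile() quartiles as a stand-in for the hinges, which can differ slightly from the hinges boxplot.stats() reports. The name hinges is made up for illustration.

```r
library(ggplot2)

# Synthetic stand-in for hr.month (assumption: same column names as the real data)
set.seed(1)
months <- seq(as.Date("2010-01-01"), as.Date("2011-06-01"), by = "month")
hr.month <- data.frame(
  yearmon      = sample(months, 400000, replace = TRUE),
  HeartRateBpm = runif(400000, min = 80, max = 160)
)

# Quartiles per month: a 3 x n matrix with rows "25%", "50%", "75%"
groups <- split(hr.month$HeartRateBpm, hr.month$yearmon)
q <- sapply(groups, quantile, probs = c(0.25, 0.5, 0.75))

# Reshape to long form: one row per (month, statistic) pair.
# as.vector() walks the matrix column by column, i.e. month by month.
hinges <- data.frame(
  yearmon = rep(as.Date(colnames(q)), each = nrow(q)),
  stat    = factor(rep(rownames(q), times = ncol(q)), levels = rownames(q)),
  value   = as.vector(q)
)

# One colored line per statistic across the months
p <- ggplot(hinges, aes(x = yearmon, y = value, colour = stat)) +
  geom_line() +
  xlab("Month") +
  ylab("Heart rate (bpm)")
print(p)
```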

James

--
You received this message because you are subscribed to the ggplot2 mailing list.
Please provide a reproducible example: http://gist.github.com/270442

To post: email ggp...@googlegroups.com
To unsubscribe: email ggplot2+unsubscribe@googlegroups.com
More options: http://groups.google.com/group/ggplot2



--
James McCreight                               
cell: (831) 261-5149
VoIP (to cell): (720) 897-7546

Jason Edgecombe

Jun 26, 2011, 9:32:17 PM
to ggp...@googlegroups.com
That worked, thanks!

Is there a way to show the outliers like in the normal boxplot?
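
One possible approach, offered as a sketch rather than a confirmed answer from the list: boxplot.stats() also returns the points beyond the whiskers in its $out element, so they can be collected per month and layered on with geom_point(). The data below is synthetic (normally distributed, so that some outliers exist). The names box.stats and outliers are made up for illustration, and unique() is an optional step to avoid plotting the same value many times.

```r
library(ggplot2)

# Synthetic stand-in for hr.month (assumption: same column names as the real data)
set.seed(1)
months <- seq(as.Date("2010-01-01"), as.Date("2011-06-01"), by = "month")
n <- 50000
hr.month <- data.frame(
  yearmon      = sample(months, n, replace = TRUE),
  HeartRateBpm = round(rnorm(n, mean = 120, sd = 15))
)

groups <- split(hr.month$HeartRateBpm, hr.month$yearmon)

# Precomputed boxes: one row per month
stats <- t(sapply(groups, function(x) boxplot.stats(x)$stats))
box.stats <- data.frame(
  yearmon = as.Date(names(groups)),
  ymin = stats[, 1], lower = stats[, 2], middle = stats[, 3],
  upper = stats[, 4], ymax = stats[, 5],
  row.names = NULL
)

# Outliers: values beyond the whiskers, deduplicated within each month
outliers <- do.call(rbind, lapply(names(groups), function(m) {
  out <- unique(boxplot.stats(groups[[m]])$out)
  data.frame(yearmon = rep(as.Date(m), length(out)), HeartRateBpm = out)
}))

# Boxes from the summary table, points from the (much smaller) outlier table
p <- ggplot(box.stats, aes(x = yearmon, group = yearmon)) +
  geom_boxplot(aes(ymin = ymin, lower = lower, middle = middle,
                   upper = upper, ymax = ymax),
               stat = "identity") +
  geom_point(data = outliers, aes(y = HeartRateBpm), size = 1) +
  xlab("Month") +
  ylab("Heart rate (bpm)")
print(p)
```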

On 06/26/2011 12:00 PM, James McCreight wrote:
> The issue here is that you have a massive factor variable along the x-axis.
> I noted recently that this causes major slow downs.
> http://groups.google.com/group/ggplot2/browse_thread/thread/c35c47c7a3adf52f/d94d2a9c95176513?lnk=gst&q=mccreight+speed#d94d2a9c95176513
>
> I'm assuming it's in the calculations, cause precomputing them you'll get a
> massive speed up.
> http://groups.google.com/group/ggplot2/browse_thread/thread/3f002345b780db1f/f8282a77521e550f?lnk=gst&q=mccreight+boxplot#f8282a77521e550f
>
> This later one is one i recommended for the examples on geom_boxplot, just
> to allude to the other thread i just posted on today.
>
> I was recently in a similar position and found, only after the all the
> above, that line plots (different color lines at the hinges) contained as
> much information as boxplots. Looking at this many boxplots is a bit
> ridiculous, unless it's going to be put up on a wall i suppose.... just a
> side note.
>
> James
