Multicore and ggplot2

Brandon Hurr

Nov 2, 2011, 7:51:56 AM

to ggplot2

I've only managed to upgrade one of my machines to R 2.14, but I understand that it has better integration of multicore functions. I'm just curious because I'm currently waiting for 400 boxplots to finish plotting and saving... Has anyone used multicore functions and packages with ggplot2, and if so, how did you get it all to work together? I've got 3 idle processors itching to help me out.

If you've got some examples I'd love to see them. Heck, upload them to the wiki while you're at it...

Brandon

Justin Haynes

Nov 2, 2011, 10:24:15 AM

to Brandon Hurr, ggplot2

I was running into the same issue. The solution I found was to use
the doMC package and the .parallel argument to Hadley's plyr package.

It takes a bit of code rewriting, but you can get your data into the
appropriate format for plyr to split and then call a function that
does your plotting for each split.

My example was temperatures of components within a group of identical
machines where the x axis was the machine name and the y axis of the
box plot the temperatures.

However, I wound up moving away from that strategy since I couldn't
figure out how to number the plots sequentially while also providing
each with a descriptive file name.

If anyone else has thoughts on that issue I'd love to hear them too!
Also, I'm curious why d_ply doesn't take the .parallel argument, since
it seems tailor-made for this application.

Hope that helps...

Justin
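
One possible way around the numbering problem, sketched here in base R: build every file name up front from the group labels, so that the sequence number comes from the position in the label vector rather than from the order in which workers finish. The helper name make_plot_names, the "plot" prefix, and the zero-padding width are assumptions for illustration, not something from the thread.

```r
# Hypothetical helper: combine a zero-padded sequence number with a
# descriptive label, giving names such as "plot_001_t1.png".
make_plot_names <- function(labels, prefix = "plot", ext = "png") {
  # Replace runs of characters that are awkward in file names with "_"
  safe <- gsub("[^A-Za-z0-9_-]+", "_", labels)
  sprintf("%s_%03d_%s.%s", prefix, seq_along(labels), safe, ext)
}

groups <- c("t1", "t2", "t3")
files <- make_plot_names(groups)
print(files)
```

Each plotting call can then look up its own entry in files (for example by matching on the group label), whether it runs in a plain loop or on a worker.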

> --
> You received this message because you are subscribed to the ggplot2 mailing
> list.
> Please provide a reproducible example: http://gist.github.com/270442
>
> To post: email ggp...@googlegroups.com
> To unsubscribe: email ggplot2+u...@googlegroups.com
> More options: http://groups.google.com/group/ggplot2
>

Joshua Wiley

Nov 2, 2011, 12:44:03 PM

to Brandon Hurr, ggplot2

Interesting question. I'm hoping Hadley or someone more familiar with
the inner workings of ggplot2 will weigh in here, but here is a rough
sketch of a couple of approaches I've taken in the past (if no one
posts anything better, I'll put together a couple examples later).

1) (and this is by far what I do most often, which is not that much
because my data tend to be small, but anyway): split the data, make
separate saved but unrendered plots, then use grid + viewports to lay
them all out in one big plot.

2) Preprocess the data so the manipulations are done manually outside
of ggplot2, then just pass in data ready to be rendered (e.g., for
boxplots, precalculate the min, lower/upper hinge, median, and max).
I hate this because it is utterly inflexible aside from aesthetics.

Those are the best I've come up with so far though, hopefully there is
a nicer way.
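
A minimal sketch of approach 2, assuming a data frame with a grouping column site and a numeric column value (both names are made up for the example). The five-number summary is computed once with fivenum() and handed to geom_boxplot(stat = "identity"), so ggplot2 only has to draw. The ggplot2 part is skipped if the package is not installed.

```r
set.seed(1)
dat <- data.frame(site = rep(letters[1:4], each = 25), value = rnorm(100))

# fivenum() returns min, lower hinge, median, upper hinge, max.
# vapply gives a 5 x 4 matrix (one column per site); transpose to one row per site.
five <- t(vapply(split(dat$value, dat$site), fivenum, numeric(5)))
box_stats <- data.frame(
  site   = rownames(five),
  ymin   = five[, 1],
  lower  = five[, 2],
  middle = five[, 3],
  upper  = five[, 4],
  ymax   = five[, 5]
)
print(box_stats)

# Draw from the precomputed summary; no statistics are recalculated here.
if (requireNamespace("ggplot2", quietly = TRUE)) {
  p <- ggplot2::ggplot(
    box_stats,
    ggplot2::aes(x = site, ymin = ymin, lower = lower,
                 middle = middle, upper = upper, ymax = ymax)
  ) +
    ggplot2::geom_boxplot(stat = "identity")
}
```

Because the whiskers here run to the minimum and maximum, no outlier points are drawn, which is part of the inflexibility mentioned above.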

Cheers,

Josh


--
Joshua Wiley
Ph.D. Student, Health Psychology
Programmer Analyst II, ATS Statistical Consulting Group
University of California, Los Angeles
https://joshuawiley.com/

Brandon Hurr

Nov 2, 2011, 1:31:14 PM

to Joshua Wiley, ggplot2

Josh, your G+ post is actually what got me thinking about it in the first place after having tried doing something previously and failing miserably.

I think my problem is more up Justin's alley in that I'm plotting many columns of data in a for loop.
The following is the code I've been using to plot, chopped a bit for clarity:

for (i in 5:ncol(tRNAplot)) {
  title <- "longtitlegoeshere"

  plot <- ggplot(data = tRNAplot, aes_string(x = "Ripeness", y = paste("as.numeric(", colnames(tRNAplot)[i], ")", sep = ""), fill = "Variety")) +
    geom_boxplot() +
    facet_grid(. ~ Tissue) +
    opts(title = title, plot.title = theme_title(width = unit(15, "cm"))) +
    labs(y = paste(colnames(tRNAplot)[i]))

  # save plot
  ggsave(sprintf("GeneBoxplot.%s.%s.%s.%s.%s.png", "lots of labels go here", "and a number" ), plot = plot, width = 6, height = 6)
}

I think my previous efforts were using foreach and snow or doMC, but I had a hard time with the labeling and saving as Justin mentioned and eventually gave up after about a week and just bought a faster computer.

Minimally, it would be nice if you could set up your processors ahead of time and then just put an option in your ggplot() call (e.g. mc=TRUE) so that it would do all the underlying calculations and rendering using all available cores. Even if it wasn't plotting each graph on a separate core I would think it would speed things up, but I haven't really got a good example to go on. Usually I trawl through the forums for bits of code I can assemble, but there doesn't seem to be much out there to build on. :/

Brandon

Justin Haynes

Nov 2, 2011, 2:00:42 PM

to Brandon Hurr, Joshua Wiley, ggplot2

I pasted in my old bit of code below... see if it helps at all.

library(doMC)
registerDoMC()
library(ggplot2)

dat<-data.frame(site=letters[1:4],t1=rnorm(20),t2=rnorm(20),t3=rnorm(20))
dat.melt<-melt(dat,id.vars='site')

dat.melt$variable<-levels(dat.melt$variable)[dat.melt$variable]

my.func <- function(df) {
  print(ggplot(df, aes(x = site, y = value)) +
    geom_boxplot())
}

png('/tmp/plots%d.png')

ddply(dat.melt, .(variable), my.func, .parallel = TRUE)

Or you can modify my.func a bit to give better file names...
however, I still can't figure out how to do both.

Also, it's worth noting that with small data there is no increase in
speed, but once you're close to 1e6 rows it becomes a bit more
advantageous.

> dat<-data.frame(site=letters[1:4],t1=rnorm(1000),t2=rnorm(1000),t3=rnorm(1000))
> dat.melt<-melt(dat,id.vars='site')
>
> dat.melt$variable<-levels(dat.melt$variable)[dat.melt$variable]
>
> my.func<-function(df){
+ print(ggplot(df,aes(x=site,y=value))+
+ geom_boxplot())
+ }
>
> png('/tmp/plots%d.png')
>
> system.time(ddply(dat.melt,.(variable),my.func,.parallel=T))
user system elapsed
3.788 0.344 1.609
> system.time(ddply(dat.melt,.(variable),my.func,.parallel=F))
user system elapsed
2.792 0.020 2.816

> dat<-data.frame(site=letters[1:4],t1=rnorm(1000000),t2=rnorm(1000000),t3=rnorm(1000000))
> dat.melt<-melt(dat,id.vars='site')
>
> dat.melt$variable<-levels(dat.melt$variable)[dat.melt$variable]
>
> my.func<-function(df){
+ print(ggplot(df,aes(x=site,y=value))+
+ geom_boxplot())
+ }
>
> png('/tmp/plots%d.png')
>
> system.time(ddply(dat.melt,.(variable),my.func,.parallel=T))
user system elapsed
50.275 1.964 28.228
> system.time(ddply(dat.melt,.(variable),my.func,.parallel=F))
user system elapsed
46.947 0.248 47.305

Brandon Hurr

Nov 2, 2011, 2:15:45 PM

to Justin Haynes, Joshua Wiley, ggplot2

> Also, it's worth noting that with small data there is no increase in
> speed, but once you're close to 1e6 rows it becomes a bit more
> advantageous.

That's kind of disappointing really... Perhaps the speedups in ggplot2 0.9 will just make things a bit better so I don't have to think about it so much?

Leonardo Salas

Nov 2, 2011, 2:56:54 PM

to Brandon Hurr, ggplot2

Hi Brandon,

I’d like to add myself to the list of people interested in seeing how ggplot2 was used with multicore functions. Related to this issue, since you are saving the plots as images on your HD, and since access to the device is serial, would that not be the main performance bottleneck? I’d also be interested to hear how to quickly render ggplot2 graphs on web browsers.

Leo
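
One rough way to check how much of the time goes to writing the file, as a sketch rather than a proper benchmark: render the same plot once to a PDF device with file = NULL (no file is created) and once to a real file, and compare elapsed times. The helper time_render and the sample data are made up for illustration; the sketch falls back to base graphics if ggplot2 is not installed, and a single timing like this is noisy.

```r
set.seed(1)
dat <- data.frame(site = rep(letters[1:4], each = 250), value = rnorm(1000))

# Draw with ggplot2 when it is installed; fall back to base graphics otherwise.
draw <- if (requireNamespace("ggplot2", quietly = TRUE)) {
  p <- ggplot2::ggplot(dat, ggplot2::aes(x = site, y = value)) +
    ggplot2::geom_boxplot()
  function() print(p)
} else {
  function() boxplot(value ~ site, data = dat)
}

# Hypothetical helper: elapsed seconds to render one plot to a PDF device.
# With file = NULL the device is opened but no file is written.
time_render <- function(draw, file) {
  system.time({
    pdf(file)
    tryCatch(draw(), finally = dev.off())
  })[["elapsed"]]
}

out <- tempfile(fileext = ".pdf")
t_no_file <- time_render(draw, NULL)
t_to_disk <- time_render(draw, out)
cat("without file:", t_no_file, "s; with file:", t_to_disk, "s\n")
```

If the two numbers are close, most of the time is spent building and rendering the plot rather than on disk access, and splitting the work across cores should still help.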

Joshua Wiley

Nov 3, 2011, 2:44:21 AM

to Leonardo Salas, Brandon Hurr, ggplot2

Hi All,

I wrote a script and benchmarked regular versus byte-compiled ggplot2
(+ dependencies) and single- versus triple-core use of ggplot2. I drew
from Brandon's example, so it is basically just the same plot on nine
different columns of data, which makes the speedup from parallelizing
large and easy to implement. If you are interested, I just uploaded a page
with all the scripts and the log files as well as the final timing
results:

https://joshuawiley.com/R/ggplot2_benchmark.aspx

For those who are just interested in how to easily parallelize looping
through data/columns making plots, here is just the code I used
for that:

###########

## define a function that makes the plots I want
## renders as PDFs, and returns the grob (so I get a list of grobs at the end)
## (I could theoretically combine into one with grid viewports or the like)
myPlot <- function(ycol, dat) {
  p <- ggplot(data = dat, aes_string(x = "x", y = ycol, colour = "g3")) +
    geom_point() +
    stat_smooth(size = 2) +
    facet_grid(g1 ~ g2) +
    opts(title = paste("Plot of '", ycol, "' on 'x'", sep = ''))
  ggsave(paste("Benchmark_of_", ycol, "_", format(Sys.time(), "%H-%M-%S"), ".pdf", sep = ""),
         plot = p, width = 10, height = 10)
  return(p)
}
## initiate local cluster and push relevant packages and objects
library(parallel)  # makeCluster, clusterEvalQ, clusterExport, parLapply
cl <- makeCluster(getOption("cl.cores", 3))
clusterEvalQ(cl, {
  library(ggplot2)
})
## varlist takes a character vector of object names
clusterExport(cl, varlist = c("mydf", "myPlot"))
## actually do it
results <- parLapply(cl, X = colnames(mydf)[2:10], fun = myPlot, dat = mydf)
## shut the cluster down
stopCluster(cl)

Cheers,

Josh

Brandon Hurr

Nov 3, 2011, 7:23:12 AM

to Joshua Wiley, Leonardo Salas, ggplot2

Josh,

Thanks, I'll try and give it a go sometime in the next week. Over 400 plots I can see I'll get to keep a few minutes more of my life.

If anyone else has any more code or examples to share I'm very much open to seeing them.

Brandon

Brandon Hurr

Nov 5, 2011, 7:36:00 AM

to Joshua Wiley, Leonardo Salas, ggplot2

I ran your example and it works really well, Josh. I just need to implement it for my data. It literally cuts the plotting time in half.

I accidentally left it as 3 cores when I ran it on my dual core i5 (Mac/Lion) and it used all 4 of the hyperthreads. I couldn't do anything else thanks to the spinning beachball, but it finished so much faster. That and those plots are almost art... at least I'd hang one on my wall.

/nerded out for the weekend.
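
To avoid locking up the machine like that, the number of workers can be derived from the hardware instead of being hard-coded. A small sketch using detectCores() from the parallel package; the "leave one core free" policy and the variable names are only an assumption, not something from the benchmark script.

```r
library(parallel)

# Physical cores, where the platform can report them; detectCores() may return NA.
n_physical <- detectCores(logical = FALSE)
if (is.na(n_physical)) n_physical <- 1L

# Hypothetical policy: keep one core free for the desktop,
# but always use at least one worker.
n_workers <- max(1L, n_physical - 1L)
print(n_workers)
```

The result could then be passed to makeCluster() in place of the fixed 3.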

Brandon Hurr

Nov 25, 2011, 4:57:13 AM

to Joshua Wiley, Leonardo Salas, ggplot2

Just an update: it took about 30 min to update my plots to multicore this morning and it ran great. Saved about 10 minutes on the plotting. I think with a simpler case it would take much less time to update the code and it would save a lot of time. I'll try to use your example, Josh, to make a tutorial on the wiki if I can manage it.