Speed and memory optimisation in ggplot


JiHO

Mar 24, 2009, 10:56:35 AM3/24/09
to ggplot2, hadley wickham
Hi everyone,

This question is definitely for Hadley but I figured that the answer
might interest some people here.

I recently dipped my toes into plyr, which is another very
hadley-wickhamy package (hadley-wickhamy: adj.; that makes you wonder
how you were getting your work done before; syn. mind-blowing. But,
Hadley, see [1] at the bottom). The documentation currently says:

"The aim of this release is to provide a consistent and useful set of
tools for solving the split-apply-combine problem."

"The current major shortcoming of plyr is speed and memory usage.
[...] It is my aim to eventually implement these
functions in C for maximum speed and memory efficiency, so that they
are competitive with the built in operations. "

In ggplot2, the architecture seems to be quite stable now (you fix
the occasional bug very quickly, so it does not seem to require major
refactoring; you already rewrote some parts such as the scales and
legends handling, IIRC; etc.). Feature-wise it might not brew coffee
(yet!) but it is the most complete plotting package there is. So I
feel that the last remaining shortcoming of ggplot2 is also speed and
(to a lesser extent) memory usage. Therefore, I was wondering what
your plans were regarding the development of ggplot2, and its
optimization in particular? Is there any low-hanging fruit in terms
of improving efficiency?

The heavy use of proto makes the code base quite difficult to
approach. I would love to help (as might others), particularly in
streamlining/optimizing whatever can be streamlined. I don't know C
but I know Fortran, and if there is some number crunching to do, with
the kind of 2D data we usually deal with, Fortran is both efficient
and easy to program. Do you think it would be worthwhile to write a
developer's guide to ggplot2, or would your time be better spent
optimizing things yourself?

Some concrete examples of where I find ggplot slow:

# Increase the size of the data to get a more stressful
# (hence more representative) test case
dim(volcano)
v = volcano
multi = 6
require("plyr")
v = aaply(v, 1, function(X) { approx(x=1:length(X), y=X, n=length(X)*multi)$y })
v = aaply(v, 2, function(X) { approx(x=1:length(X), y=X, n=length(X)*multi)$y })
v = t(v)
# NB: why is the transposition necessary? bug in plyr?
dim(v)
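
(To make the transposition question concrete, here is a tiny sketch of
the behaviour I am seeing; if I read it correctly, the split margin
always ends up as the first dimension of aaply's result, hence the t()
above:)

m = matrix(1:6, nrow=2)              # 2 x 3
dim(aaply(m, 1, function(X) X*2))    # 2 x 3: split over rows, orientation kept
dim(aaply(m, 2, function(X) X*2))    # 3 x 2: split over columns, comes back transposed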

# Prepare the data for both base graphics and ggplot2, so that the
# comparison is done on the plotting code only
require("reshape")
require("ggplot2")
volM = v
volD = melt(v)
names(volD) = c("x", "y", "z")


# Compare contouring
Rprof("contour.out")
contour(volM, nlevels=10)
Rprof(NULL)

Rprof("geom_contour.out")
ggplot(volD) + geom_contour(aes(x=x,y=y,z=z), bins=10)
Rprof(NULL)

lapply(summaryRprof(filename = "contour.out"), head)
lapply(summaryRprof(filename = "geom_contour.out"), head)


# Compare tile mapping
Rprof("image.out")
image(volM, col=heat.colors(100))
Rprof(NULL)

Rprof("filled_contour.out")
filled.contour(volM)
Rprof(NULL)

Rprof("geom_tile.out")
ggplot(volD) + geom_tile(aes(x=x,y=y,fill=z))
Rprof(NULL)

lapply(summaryRprof(filename = "image.out"), head)
lapply(summaryRprof(filename = "filled_contour.out"), head)
lapply(summaryRprof(filename = "geom_tile.out"), head)

system("rm -f contour.out geom_contour.out image.out
filled_contour.out geom_tile.out")
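
(As a complement to Rprof, which shows where the time goes, a quick
sketch for comparing raw wall-clock times; note that the ggplot object
has to be print()ed explicitly inside system.time to force the actual
drawing:)

# Rough wall-clock comparison, elapsed time in seconds
system.time(contour(volM, nlevels=10))
system.time(print(ggplot(volD) + geom_contour(aes(x=x,y=y,z=z), bins=10)))
system.time(image(volM, col=heat.colors(100)))
system.time(print(ggplot(volD) + geom_tile(aes(x=x,y=y,fill=z))))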


From this it seems that ggplot spends a lot of time in deparse and in
data conversion functions (list, data.frame). I guess that's the cost
of flexibility, but could there be room for improvement here? The
printing via grid might be slower too, but although the print method
appears high in the totals, it does not seem to take that long in
itself.

I hope this will be somewhat helpful at some point. I must say that,
being a heavy user of tiles and contours in particular, I sometimes
wander back to the built-in functions for large plotting jobs (e.g.
plotting a 1000 x 500 matrix at several hundred successive time steps
of a numerical model's output before encoding the frames into a
movie; the difference between base and ggplot there can be measured
in minutes or even hours).
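
(To give an idea of the kind of job I mean, here is an illustrative
sketch of the frame loop; `frames` is a hypothetical list of 1000 x 500
matrices, one per time step, and the file names and ffmpeg call are
only placeholders:)

# Illustrative sketch only: render one PNG per time step, then encode a movie
for (i in seq_along(frames)) {
  png(sprintf("frame_%04d.png", i), width=1000, height=500)
  image(frames[[i]], col=heat.colors(100))   # or the geom_tile equivalent
  dev.off()
}
# then something like: ffmpeg -i frame_%04d.png movie.mp4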

Cheers,

JiHO
---
http://jo.irisson.free.fr/

[1] In terms of "marketing" though, I think you could improve the
description of plyr. "plyr is a set of tools that solves a common set
of problems: you need to break a big problem down into manageable
pieces, operate on each piece and then put all the pieces back
together" is indeed a good description of what it does. However, it is
very theoretical and not very appealing IMHO. At first, I was not
really sure what the package was doing exactly. You may want to
rephrase that to start with common use cases in R (e.g. "you need to
apply a formula to every factor-group of a data.frame, to every
element of a list, to every row of a matrix") before continuing
towards a more general point of view ("break the problem into
manageable pieces", etc.) and insist on the consistency and
predictability of plyr compared to the base functions. Just my 2 cents.

hadley wickham

Mar 24, 2009, 9:53:39 PM3/24/09
to JiHO, ggplot2
> This question is definitely for Hadley but I figured that the answer might
> interest some people here.
>
> I recently dipped my toes into plyr which is another very hadley-whickhamy
> package (hadley-whickhamy: adj.; that makes you wonder how you were getting
> your work done before; syn. mind-blowing. but, Hadley, see [1] at the
> bottom). The documentation currently says:
>
> "The aim of this release is to provide a consistent and useful set of tools
> for solving the split-apply-combine problem."
>
> "The current major shortcoming of plyr is speed and memory usage. [...] It
> is my aim to eventually implement these
> functions in C for maximum speed and memory efficiency, so that they are
> competitive with the built in operations. "
>
> In ggplot2, the architecture seems to be quite stable now (you are fixing
> the occasional bugs very quickly, so it does not seem to require major
> refactoring; you already rewrote some parts such as the scales and legends
> handling IIRC; etc.). Feature-wise it might not brew coffee (yet!) but it is
> the most complete plotting package there is. So I feel that the last
> shortcoming of ggplot2 is also speed and (to a lesser extent) memory usage.
> Therefore, I was wondering what your plans were regarding the development of
> ggplot2, and its optimization in particular? Are there low hanging fruits in
> terms of improving efficiency?

Definitely. The main thing I need to do is convert the main data
pipeline to use plyr, and then optimise any cases that are
particularly slow.

> The heavy use of proto makes for a code base quite difficult to approach. I
> would love to help (as might others) particularly in streamlining/optimizing
> what could be streamlined. I don't know C but I know Fortran, and if there
> is some number crunching to do, with the kind of 2D data we usually deal
> with, Fortran is both efficient and easy to program with. Do you think that
> it would be worthwhile to write a developer's guide to ggplot2 or that your
> time would be better spent optimizing the thing yourself?

If you wanted to spend time on this, the most useful thing would be
to optimise plyr (or at least to figure out where it's slow). That way
you help everyone that's using plyr too :) The plyr code should be
much more approachable, and I really want to make it easy to
understand, so if there is something you don't understand I'm
motivated to fix it.

I've also learned a lot about writing R code since I started ggplot2,
and do need to go back and refactor it.

> Some concrete examples of where I find ggplot slow:
>
> # Increase the size of the data to get a more stressful (hence
> representative) test case
> dim(volcano)
> v = volcano
> multi = 6
> require("plyr")

> v = aaply(v,1,function(X){approx(x=1:length(X), y=X, n=length(X)*multi)$y})
> v = aaply(v,2,function(X){approx(x=1:length(X), y=X, n=length(X)*multi)$y})

I'm sure there's lots of room for improvement. You also might want to
have a look at my profr package, which provides some visualisations
that make it a little easier to understand profiling output
(particularly what is nested within what).
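
For example, something along these lines (an untested sketch, assuming
ggplot2 and the volD data frame from your example are loaded):

# Profile the geom_tile example and look at the call tree
library(profr)
p <- profr(print(ggplot(volD) + geom_tile(aes(x=x, y=y, fill=z))))
head(p)   # one row per observed call, with start/end times and stack depth
plot(p)   # shows which calls are nested within which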

> I hope this will be somewhat helpful at some point. I must say that, being a
> heavy user of tile and contours in particular, I sometimes wander back to
> built-in functions for large plotting jobs (i.e. plotting a 1000 x 500
> matrix at several hundred successive time steps in the output of a numerical
> model before encoding them in a movie. The difference between base and
> ggplot here can be measured in minutes or even hours).

Hmm, I would expect your code above to be fairly fast. I'll add
looking into it to my to-do list.

> [1] In terms of "marketing" though, I think you could improve the
> description of plyr. "plyr is a set of tools that solves a common set of
> problems: you need to break a big problem down into manageable pieces,
> operate on each piece and then put all the pieces back together" is indeed
> a good description of what it does. However it is very theoretical and not
> very appealing IMHO. At first, I was not really sure what the package was
> doing exactly. You may want to rephrase that to start with common use cases
> in R (e.g. "you need to apply a formula to every to every factor-group of a
> data.frame, to every element of a list, to every row of a matrix") before
> continuing towards a more general point of view ("break the problem into
> manageable pieces", etc.) and insist on the consistency and predictability
> of plyr compared to base functions. Just my 2 cents.

Thanks - that's a good idea!

Hadley

--
http://had.co.nz/

JiHO

Mar 24, 2009, 10:39:38 PM3/24/09
to hadley wickham, ggplot2
Thanks for the answer, Hadley. As I told you, I intended to have a
better look at plyr, so I will try to dive into the code when I can.

On 2009-March-24 , at 21:53 , hadley wickham wrote:

>> I hope this will be somewhat helpful at some point. I must say
>> that, being a
>> heavy user of tile and contours in particular, I sometimes wander
>> back to
>> built-in functions for large plotting jobs (i.e. plotting a 1000 x
>> 500
>> matrix at several hundred successive time steps in the output of a
>> numerical
>> model before encoding them in a movie. The difference between base
>> and
>> ggplot here can be measured in minutes or even hours).
>
> Hmm, I would expect your code above to be fairly fast. I'll add
> looking into it to my to do list.

Maybe there is something specific to my system that makes it
particularly slow. I do not remember tiling being this slow in the
initial versions of ggplot2 (but then again, I used it on several
different machines, outputting to X11 or quartz, etc.).

Anyhow, on my current system (a MacBook[*] with half its RAM in use,
outputting to the quartz device, from R running in a terminal) I get
the following elapsed times, in seconds:

Contouring:
   0.54  contour
   5.78  geom_contour

Tiling:
   3.04  image
   5.36  filled.contour
  28.76  geom_tile

So ggplot is quite a bit slower in both cases. It takes roughly 30
seconds to draw the tile plot of the 522 x 366 matrix (dim(v)). You can
imagine what printing hundreds of 1000 x 500 tile plots would mean then!

Outputting to other non-displaying devices speeds up the base methods
but not really ggplot:

CairoPDF:
  Contouring:
     0.5   contour
     5.52  geom_contour
  Tiling:
     1.04  image
     3.3   filled.contour
    28.56  geom_tile

pdf():
  Tiling:
     0.8   image
     2.34  filled.contour
    25.58  geom_tile

png():
  Tiling:
     1.5   image
     2.68  filled.contour
    26.88  geom_tile

So if there is some slowness in the display device, it is not what is
currently limiting ggplot.

Those are all just one-off timings, of course. I should replicate
them, but some things are already quite obvious.
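
(Something along these lines would do for the replication; the time_it
helper below is just a hypothetical sketch, not a function from any
package:)

# Hypothetical helper: run an expression n times and average the elapsed time
time_it = function(expr, n=5) {
  e = substitute(expr)
  caller = parent.frame()
  times = numeric(n)
  for (i in seq_len(n)) {
    t0 = proc.time()
    eval(e, caller)
    times[i] = (proc.time() - t0)["elapsed"]
  }
  mean(times)
}

time_it(image(volM, col=heat.colors(100)))
time_it(print(ggplot(volD) + geom_tile(aes(x=x,y=y,fill=z))))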

Thanks in advance for your work on all that.

[*] Machine details
(I just discovered `system_profiler | more` and it's cool)
ProductName: Mac OS X
ProductVersion: 10.5.6
Processor Name: Intel Core 2 Duo
Processor Speed: 2.16 GHz
Number Of Processors: 1
Total Number Of Cores: 2
L2 Cache: 4 MB
Memory: 2 GB
Bus Speed: 667 MHz

JiHO
---
http://jo.irisson.free.fr/

hadley wickham

Mar 25, 2009, 12:19:13 AM3/25/09
to JiHO, ggplot2
> Thanks for the answer Hadley. As I told you I intended to have a better look
> at plyr so I will try to dive into the code when I can.

Great :)

> Maybe there is something specific to my system that make it particularly
> slow. I did not remember that tiling was so slow earlier, in the initial
> versions of ggplot2 (but then again, I used it on several different
> machines, outputting to X11 or quartz, etc.).
>
> Anyhow, on my current system (MacBook[*] with RAM half busy, outputting to
> the quartz device, from R running in a terminal) I get:
>
> Contouring:
>   0.54 contour
>   5.78 geom_contour
>
> Tiling:
>   3.04 image
>   5.36 filled contour
>   28.76        geom_tile

It will definitely be slower, but it shouldn't be this much slower!

Thanks for the report.

Hadley

--
http://had.co.nz/
