Feature proposals


Kohske Takahashi

Apr 24, 2012, 1:43:21 PM
to ggplot2-dev
Here are two feature proposals:

1. Replace data.frame with data.table
Not sure but this may reduce the calculation time for plot generation.

2. Implement interface to extract some stats
stat_bin, stat_smooth etc perform statistical manipulation on the way
to generate final result.
Consistent interface to access the midstream results may be useful in
some cases.
Maybe
stat_bin(outp = TRUE)
then the results of ggplot_build include the midstream results of stat_bin.

--
Kohske Takahashi <takahash...@gmail.com>

Assistant Professor,
Research Center for Advanced Science and Technology,
The University of Tokyo, Japan.
http://www.fennel.rcast.u-tokyo.ac.jp/profilee_ktakahashi.html

Brandon Hurr

Apr 24, 2012, 1:48:33 PM
to Kohske Takahashi, ggplot2-dev
I don't know about #1, but #2 I have seen requested more often than I can remember. I think lots of people would like access to the summary data output from the stat functions. Whether or not that's a good idea should be up to you, Winston and Hadley probably. 

Winston Chang

Apr 24, 2012, 2:48:10 PM
to Kohske Takahashi, ggplot2-dev
On Tue, Apr 24, 2012 at 12:43 PM, Kohske Takahashi <takahash...@gmail.com> wrote: 
2. Implement interface to extract some stats
stat_bin, stat_smooth etc perform statistical manipulation on the way
to generate final result.
Consistent interface to access the midstream results may be useful in
some cases.
Maybe
stat_bin(outp = TRUE)
then the results of ggplot_build include the midstream results of stat_bin.


I think this is a good idea. I think others have argued that people shouldn't rely on their graphing package to do the statistics for them, and I can see the merits of this... but on the other hand, it would be nice if people could at least get the data from the stat so that they can examine what the stat does to their data. (If this sort of option existed, we might have caught that nasty duplicate-dropping bug in facet_grid earlier.)

In order to inspect the output of stats, I've been modifying the stat code itself to print out the data. It would be nice to have a cleaner way of doing this.


With regard to extracting models from stat_smooth, I started writing some code to do the opposite: you would pass a list of model objects to it, and it would draw prediction lines for each model, but I haven't gotten very far on it yet.

-Winston

Hadley Wickham

Apr 24, 2012, 3:51:00 PM
to Kohske Takahashi, ggplot2-dev
> 1. Replace data.frame with data.table
> Not sure but this may reduce the calculation time for plot generation.

I'm not excited about doing this because I'm not a big fan of the
data.table interface (but the speed is lovely)

> 2. Implement interface to extract some stats
> stat_bin, stat_smooth etc perform statistical manipulation on the way
> to generate final result.
> Consistent interface to access the midstream results may be useful in
> some cases.
> Maybe
> stat_bin(outp = TRUE)
> then the results of ggplot_build include the midstream results of stat_bin.

I think the best way to do this is have the stats be a thin wrapper
around a regular R function - then if you wanted to use them yourself,
you can just use the regular function.
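
For illustration only, the thin-wrapper idea might look roughly like the sketch below. The helper name bin_counts and its bins argument are hypothetical, not existing ggplot2 functions; the point is just that the binning lives in a plain R function that a stat could call and that a user could also call directly.

```r
# Hypothetical helper (not part of ggplot2): the statistical work lives in a
# plain function that returns an ordinary data frame.
bin_counts <- function(x, bins = 10) {
  # Equal-width breaks spanning the range of x
  breaks <- seq(min(x), max(x), length.out = bins + 1)
  h <- hist(x, breaks = breaks, plot = FALSE)
  data.frame(
    xmin  = head(breaks, -1),
    xmax  = tail(breaks, -1),
    count = h$counts
  )
}

# A stat would call bin_counts() internally; a user can call it directly
# to see exactly what would be plotted.
res <- bin_counts(mtcars$mpg, bins = 5)
print(res)
```

With this split, inspecting what a stat does to the data needs no special option: you call the same function outside the plot.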

Hadley

--
Assistant Professor / Dobelman Family Junior Chair
Department of Statistics / Rice University
http://had.co.nz/

Dennis Murphy

Apr 24, 2012, 4:00:53 PM
to Kohske Takahashi, ggplot2-dev
Hi Kohske:

On Tue, Apr 24, 2012 at 10:43 AM, Kohske Takahashi
<takahash...@gmail.com> wrote:
> Here are two feature proposals:
>
> 1. Replace data.frame with data.table
> Not sure but this may reduce the calculation time for plot generation.

data.table objects inherit from the data.frame class, so one should be able
to use them directly in ggplot2 as input data frames; in fact, Matthew
Dowle (the primary developer of data.table) has several unit tests for
ggplot() in his test suite. If the suggestion is to use the data.table
package to produce ggplots, that's an entirely different issue which
would entail a significant rewrite of the code for the stat_*
functions (at least). I have a hard time imagining Hadley to be
enthusiastic about such a proposal. One of the practical disadvantages
I see with it is that data.table AFAIK only converts data frames into
data tables - I don't know how it would handle list input, for
example. I've never seen a case where data.table takes a generic list
as input (e.g., a list of model objects) and outputs a data.table.
One of the practical advantages of plyr and reshape is that the scope
of input and output classes is quite broad - data.table isn't as
versatile in that respect. Since one often needs to pre-process data
for graphics, this is IMO a significant limitation.

I'm a big fan of data.table and am in awe of how fast it can be
relative to other data processing functions/packages in R, but I still
use plyr and reshape2 far more often than data.table in practice
because plyr and reshape2 can handle all types of base data classes in
R (particularly lists) whereas data.table is limited (in my
experience, at least) to converting input data frames to data tables
and outputting data tables with a data frame attribute. The other
reason I use plyr/reshape2 more often in practice is because the
syntax of data.table is not always 'R-like' and you have to be rather
careful how you use data.table to reap its advantages.

Since ggplot() requires a data frame as input, there is no conceptual
issue re data tables as mentioned above, but the advantages of a data
table are lost when used as an input data frame. This means that to
use a data.table to its full advantage, the ggplot2 code would have to
be made 'data.table aware'.

>
> 2. Implement interface to extract some stats
> stat_bin, stat_smooth etc perform statistical manipulation on the way
> to generate final result.
> Consistent interface to access the midstream results may be useful in
> some cases.
> Maybe
> stat_bin(outp = TRUE)
> then the results of ggplot_build include the midstream results of stat_bin.

As Brandon mentioned, certain people are adamant about wanting this
feature, but from my perspective, all it does is slow down the
processing of a ggplot by forcing data manipulation and modeling
inside the ggplot() call. I can understand why people would want all
the stat_* output produced in the generation of a ggplot, but at the
same time, if you want those extra goodies, then you have no reason to
complain about how long it takes to produce a ggplot, since it's the
equivalent of asking ggplot2 to do your modeling *and* graphics for
you. Convenient, but quite possibly dangerous in the wrong hands.

If you want to improve speed, one suggestion would be to convert data
frames to lists in the internal code since lists carry a lot less
overhead than do data frames. This would be reasonable as long as you
don't need the attributes of a data frame, but it's likely to require
a major rewrite of certain existing code with all the attendant
headaches, and I question whether it's worth the performance gain.

Having just seen Hadley's contribution to this thread, I understand
his desire to decouple tasks (modularity) and then have a 'factory'
package to assemble the pieces into a ggplot. This is consistent with
his emerging philosophy on the subject and one that makes a lot of
sense to me, especially in the future as parallel
programming/processing becomes more routine.

My 2c,
Dennis

Kohske Takahashi

Apr 24, 2012, 4:08:39 PM
to Hadley Wickham, ggplot2-dev
>> 1. Replace data.frame with data.table
>> Not sure but this may reduce the calculation time for plot generation.
>
> I'm not excited about doing this because I'm not a big fan of the
> data.table interface (but the speed is lovely)
>

OK, I imagine all functions for data.frame are available for data.table,
so we would not need to change the data manipulation code.
But I need to look into it more.
Anyway, this proposal is a minor one.

>> 2. Implement interface to extract some stats
>

> I think the best way to do this is have the stats be a thin wrapper
> around a regular R function - then if you wanted to use them yourself,
> you can just use the regular function.
>

May I ask you to explain a bit more what you have in mind?

kohske


Kohske Takahashi

Apr 24, 2012, 4:23:59 PM
to Dennis Murphy, ggplot2-dev
Hi, thanks.

I imagine almost all functions that can be used with data.frame are also
available for data.table.
At least the plyr functions are.
But I'm not sure; I need to look into it more.
Anyway, thanks for the information about data.table.

>
>>
>> 2. Implement interface to extract some stats
>> stat_bin, stat_smooth etc perform statistical manipulation on the way
>> to generate final result.
>> Consistent interface to access the midstream results may be useful in
>> some cases.
>> Maybe
>> stat_bin(outp = TRUE)
>> then the results of ggplot_build include the midstream results of stat_bin.
>
> As Brandon mentioned, certain people are adamant about wanting this
> feature, but from my perspective, all it does is slow down the
> processing of a ggplot by forcing data manipulation and modeling
> inside the ggplot() call. I can understand why people would want all
> the stat_* output produced in the generation of a ggplot, but at the
> same time, if you want those extra goodies, then you have no reason to
> complain about how long it takes to produce a ggplot, since it's the
> equivalent of asking ggplot2 to do your modeling *and* graphics for
> you. Convenient, but quite possibly dangerous in the wrong hands.
>

I think the extraction would not slow things down significantly.
It only records the output; there would be no additional modeling, etc.

> If you want to improve speed, one suggestion would be to convert data
> frames to lists in the internal code since lists carry a lot less
> overhead than do data frames. This would be reasonable as long as you
> don't need the attributes of a data frame, but it's likely to require
> a major rewrite of certain existing code with all the attendant
> headaches, and I question whether it's worth the performance gain.

A data.frame is itself just a list with "data.frame" as its class attribute.
So I think that the difference in speed would be small.
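
A small base-R check of this point (a sketch only; df and cols are just illustrative names):

```r
df <- data.frame(x = 1:3, y = c("a", "b", "c"))

# A data frame is stored as a list of equal-length columns ...
print(is.list(df))
print(typeof(df))

# ... plus a few attributes, including the "data.frame" class.
print(names(attributes(df)))

# Removing the class leaves an ordinary list holding the same columns.
cols <- unclass(df)
print(class(cols))
print(identical(cols$x, df$x))
```

Any speed difference therefore comes from the data.frame methods (for example `[.data.frame`), not from the underlying storage.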

>
> Having just seen Hadley's contribution to this thread, I understand
> his desire to decouple tasks (modularity) and then have a 'factory'
> package to assemble the pieces into a ggplot. This is consistent with
> his emerging philosophy on the subject and one that makes a lot of
> sense to me, especially in the future as parallel
> programming/processing becomes more routine.
>

The extraction would not change the current overall structure.
It just records the results of fitting, etc.


So, OK, I will soon write up where I think this extraction is useful.

Thanks anyway.

kohske

Kohske Takahashi

Apr 24, 2012, 4:39:26 PM
to Dennis Murphy, ggplot2-dev
Example of the extraction of stats:

c <- ggplot(mtcars, aes(y=wt, x=mpg)) + facet_grid(. ~ cyl)
c + stat_smooth(method=lm) + geom_point()

r <- dlply(mtcars, .(cyl), function(x) lm(wt~mpg, data = x) )
l_ply(r, function(x) print(summary(x)$r.sq))

so in this case I want to visually explore the data, and just want to
know the r^2 of the fitting.
In this case, facet has only 1 dim and stat is quite simple.
But things will easily be complex, e.g., facet_grid(a+b~c+d), etc.
And also the ggplot2 stats and manually written stats may be different
due to error or some misunderstandings.

Just in case, I don't mean that users want to perform hard statistical
analysis with ggplot2.
The main thing is visual exploration, but sometimes I want to know
some numerical information about the visual information.
This is because visualization is a part of data exploration. ggplot2
boosts this process.

After the visual inspection, I will move on to the next step, a rigorous
statistical analysis.
At that stage, thanks to the visualization, I can proceed in a focused way.

kohske


On April 25, 2012, at 5:23, Kohske Takahashi <takahash...@gmail.com> wrote:

Brian Diggs

Apr 24, 2012, 5:40:36 PM
to ggplo...@googlegroups.com
On 4/24/2012 1:39 PM, Kohske Takahashi wrote:
> Example of the extraction of stats:
>
> c<- ggplot(mtcars, aes(y=wt, x=mpg)) + facet_grid(. ~ cyl)
> c + stat_smooth(method=lm) + geom_point()
>
> r<- dlply(mtcars, .(cyl), function(x) lm(wt~mpg, data = x) )
> l_ply(r, function(x) print(summary(x)$r.sq))

I could see it working something like this:

p <- ggplot(mtcars, aes(y=wt, x=mpg)) + facet_grid(. ~ cyl) +
  stat_smooth(method=lm) + geom_point()

In the built version, the data corresponding to the stat is already there:

> head(ggplot_build(p)$data[[1]],10)
          x        y     ymin     ymax        se PANEL group
1  21.40000 2.759828 2.306109 3.213546 0.2005690     1     1
2  21.55823 2.745576 2.299987 3.191165 0.1969752     1     1
3  21.71646 2.731324 2.293762 3.168887 0.1934272     1     1
4  21.87468 2.717073 2.287426 3.146719 0.1899278     1     1
5  22.03291 2.702821 2.280975 3.124667 0.1864796     1     1
6  22.19114 2.688569 2.274401 3.102737 0.1830856     1     1
7  22.34937 2.674317 2.267697 3.080937 0.1797488     1     1
8  22.50759 2.660066 2.260857 3.059274 0.1764724     1     1
9  22.66582 2.645814 2.253873 3.037755 0.1732599     1     1
10 22.82405 2.631562 2.246735 3.016389 0.1701150     1     1

There could be an additional list, stat maybe, which has the appropriate
models, or whatever transformation was made before turning it into a
data frame with aesthetics. Effectively, ggplot_build(p)$stat[[1]]
would be equivalent to the r above.

I don't see how to do it offhand, but it would be under the mechanism
that is part of calculate_stats. That would have to send back both the
transformed data (data frame) and whatever appropriate intermediate
objects there are (lm objects for stat_smooth(method="lm"); loess
objects for stat_smooth(method="loess"); tables (?) for stat_sum;
whatever-it-is-represents-a-boxplot for boxplots; etc.)
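
To make the "send back both" idea concrete, here is a rough base-R sketch. It is not ggplot2's calculate_stats; smooth_with_model and its n argument are hypothetical names, and the 95% interval and 80-point grid are assumptions chosen to resemble the built data shown above.

```r
# Hypothetical sketch (not ggplot2 internals): return the data a smoother
# would draw together with the model object that produced it.
smooth_with_model <- function(df, n = 80) {
  fit  <- lm(y ~ x, data = df)
  grid <- data.frame(x = seq(min(df$x), max(df$x), length.out = n))
  pred <- predict(fit, newdata = grid, se.fit = TRUE)
  # Half-width of a pointwise 95% confidence band (assumed level)
  half <- qt(0.975, fit$df.residual) * pred$se.fit
  list(
    data = data.frame(
      x    = grid$x,
      y    = as.numeric(pred$fit),
      ymin = as.numeric(pred$fit - half),
      ymax = as.numeric(pred$fit + half),
      se   = as.numeric(pred$se.fit)
    ),
    model = fit
  )
}

# One panel of the example: the 4-cylinder cars, with wt as y and mpg as x.
four_cyl <- subset(mtcars, cyl == 4)
out <- smooth_with_model(data.frame(x = four_cyl$mpg, y = four_cyl$wt))
print(head(out$data, 3))
print(summary(out$model)$r.squared)
```

The data element plays the role of the existing built data, while the model element is the kind of intermediate object a separate stat list could hold.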


--
Brian S. Diggs, PhD
Senior Research Associate, Department of Surgery
Oregon Health & Science University

Dennis Murphy

Apr 24, 2012, 5:41:10 PM
to Kohske Takahashi, ggplot2-dev
Hi Kohske:

data.table is an extension of a data frame in the sense that the [i,
j] indices can be generalized to expressions (not necessarily
logical). Like data.frame indexing, it also allows the use of extra
arguments, which Matthew has cleverly exploited in several ways.
However, the internal structure of a data table is not the same as
that of a data frame. For one thing, data tables generally have one or
more keys which are used to perform binary search (the key advantage
of a data table, pardon the pun, that makes them much faster to
process than ordinary data frames). To exploit that advantage in
ggplot2, the package would have to be made 'data.table aware' and
capable of using the keys of a data table. In contrast, if a data
table is input as a data frame, the keys are essentially unbound in
the conversion and the advantages of a data table are lost.

To get the speedups you envision with data.table, it seems to me you'd
have to use data.table in the stat_* function code, which creates a
package dependency based on code over which Hadley has no control. I
don't see that happening...

I have no idea how much effort it would take to make ggplot2
data.table aware, but it would add complexity to the code at a time
when Hadley is striving to untangle parts of ggplot2 that can be
generalized and decoupled/decentralized into separate packages. This
is consistent with modular programming practice (which I think is the
point of the last line of Hadley's e-mail that you were asking about
earlier). Moreover, the long-range goal is to convert all the ggplot2
code to S3 so that it encourages contributions from others. The proto
package is a significant entry barrier for many otherwise competent R
programmers to contribute to ggplot2 (as is lack of knowledge of the
grid package). data.table has its own rather idiosyncratic syntax
which is not 'intuitively obvious', sufficiently so that it may deter
some from contributing to the ggplot2 project.

Dennis

Dennis Murphy

Apr 24, 2012, 6:52:16 PM
to Kohske Takahashi, ggplot2-dev
Hi:

On Tue, Apr 24, 2012 at 1:39 PM, Kohske Takahashi
<takahash...@gmail.com> wrote:
> Example of the extraction of stats:
>
> c <- ggplot(mtcars, aes(y=wt, x=mpg)) + facet_grid(. ~ cyl)
> c + stat_smooth(method=lm) + geom_point()
>
> r <- dlply(mtcars, .(cyl), function(x) lm(wt~mpg, data = x) )
> l_ply(r, function(x) print(summary(x)$r.sq))

In addition to Brian's comments, with which I agree, I could
also see where one might use data.table to extract groupwise
statistics; using your example,

library('data.table')
mtcarsDT <- data.table(mtcars, key = 'cyl')
mtcarsR2 <- mtcarsDT[, as.list(summary(lm(wt ~ mpg, data = .SD))$r.sq),
                     by = cyl]
> mtcarsR2
     cyl        V1
[1,]   4 0.5086326
[2,]   6 0.4645102
[3,]   8 0.4229655

[It's not intuitively obvious why you need as.list() as a wrapper or
what data = .SD means ('subdata'), but fortunately I've printed out
the data.table test suite :) This is what I meant about the
idiosyncrasies of data.table syntax.]

Since a data.table object inherits from data.frame, you can use a
data.table object as input to ggplot2 if you wish. If you want, you
can use data.table to produce ggplot2 graphics by group similarly to
d_ply(), which may be a bit faster sometimes since a data.table sorts
such that data with the same key value is in contiguous memory. Here's
a toy example:

tstDT <- data.table(gp = rep(1:3, each = 20), x = rep(1:20, 3),
                    y = 4 + 0.6 * rep(1:20, 3) + rnorm(60), key = 'gp')
tstDT[, print(ggplot(.SD, aes(x = x, y = y)) + geom_point() +
              geom_smooth(method = 'lm')), by = gp]

It takes about the same time as d_ply() for this example, which is pretty small:

> system.time(tstDT[, print(ggplot(.SD, aes(x = x, y = y)) + geom_point() +
+             geom_smooth(method = 'lm')), by = gp])
   user  system elapsed
   0.56    0.02    0.58
> system.time(d_ply(tstDT, .(gp), ggfn))
   user  system elapsed
   0.52    0.03    0.55

where

ggfn <- function(dt) {
  require('ggplot2')
  print(ggplot(dt, aes(x = x, y = y)) + geom_point() +
        geom_smooth(method = 'lm'))
}

Using data.table in conjunction with ggplot2 is perfectly sensible to
me as a data analysis strategy, especially when data.table outperforms
plyr.

Is that what you had in mind?

Dennis

>
> so in this case I want to visually explore the data, and just want to
> know the r^2 of the fitting.
> In this case, facet has only 1 dim and stat is quite simple.
> But things will easily be complex, e.g., facet_grid(a+b~c+d), etc.
> And also the ggplot2 stats and manually written stats may be different
> due to error or some misunderstandings.

We've never seen that problem on the ggplot2 list :D


>
> Just in case, I don't mean that users want to perform hard statistical
> analysis with ggplot2.
> The main thing is visual exploration, but sometimes I want to know
> some numerical information about the visual information.
> This is because visualization is a part of data exploration. ggplot2
> boosts this process.

Absolutely agree.

Hadley Wickham

Apr 26, 2012, 10:25:27 AM
to Kohske Takahashi, Dennis Murphy, ggplot2-dev
On Tue, Apr 24, 2012 at 2:39 PM, Kohske Takahashi
<takahash...@gmail.com> wrote:
> Example of the extraction of stats:
>
> c <- ggplot(mtcars, aes(y=wt, x=mpg)) + facet_grid(. ~ cyl)
> c + stat_smooth(method=lm) + geom_point()
>
> r <- dlply(mtcars, .(cyl), function(x) lm(wt~mpg, data = x) )
> l_ply(r, function(x) print(summary(x)$r.sq))
>
> so in this case I want to visually explore the data, and just want to
> know the r^2 of the fitting.

There's not going to be any way to extract R^2 from the current ggplot2
code, because nothing returns the model object, just the predictions
from it.

> In this case, facet has only 1 dim and stat is quite simple.
> But things will easily be complex, e.g., facet_grid(a+b~c+d), etc.

But that shouldn't make it much harder - you'd just have .(a, b, c, d).
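
As a rough sketch of grouping by several variables at once: base R's split() is swapped in here for plyr's dlply() only to keep the example dependency-free, and cyl and am stand in for the a, b, c, d of the facet formula.

```r
# Split mtcars by every combination of two grouping variables; drop = TRUE
# discards combinations that do not occur in the data.
groups <- split(mtcars, list(cyl = mtcars$cyl, am = mtcars$am), drop = TRUE)

# Fit the same model in each group and collect R^2. Groups with very few
# rows give degenerate fits, so treat their values with care.
r2 <- sapply(groups, function(d) summary(lm(wt ~ mpg, data = d))$r.squared)
print(round(r2, 3))
```

Adding another grouping variable means adding one more element to the list passed to split().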

> And also the ggplot2 stats and manually written stats may be different
> due to error or some misunderstandings.

Well ideally, the stat functions would be easily accessible,
documented and well-tested. They need to be separated cleanly from
the stat ggplot2 object to make this sort of thing easier.

Another option would be to have a function like:

d <- c + stat_smooth(method=lm) + geom_point()
ggplot_data(d)

and then you could extract the processed data from it. But this is
pretty similar to ggplot_build(d)$data.

Kohske Takahashi

Apr 26, 2012, 10:49:49 AM
to Hadley Wickham, Dennis Murphy, ggplot2-dev
Hi,

>> In this case, facet has only 1 dim and stat is quite simple.
>> But things will easily be complex, e.g., facet_grid(a+b~c+d), etc.
>
> But that shouldn't make it much harder - you'd just have .(a, b, c, d).

Yes of course, but things will be more complex by color, size, fill, etc, e.g.,
ggplot(data, aes(x, y, colour = fa, size = fb, linetype = fc)) + ... +
facet_grid(e+f~g+d)

And this kind of thing actually happens during visual exploration.

So I'm not sure which interface is best, but I definitely think this
feature is useful.
I will propose an interface when I come up with a good one.

kohske


>
>> And also the ggplot2 stats and manually written stats may be different
>> due to error or some misunderstandings.
>
> Well ideally, the stat functions would be easily accessible,
> documented and well-tested.  They need to be separated cleanly from
> the stat ggplot2 object to make this sort of thing easier.
>
> Another option would be to have a function like:
>
> d <- c + stat_smooth(method=lm) + geom_point()
> ggplot_data(d)
>
> and then you could extract the processed data from it.  But this is
> pretty similar to ggplot_build(d)$data.
>



Hadley Wickham

Apr 26, 2012, 11:41:08 AM
to Kohske Takahashi, Dennis Murphy, ggplot2-dev
On Thu, Apr 26, 2012 at 8:49 AM, Kohske Takahashi
<takahash...@gmail.com> wrote:
> Hi,
>
>>> In this case, facet has only 1 dim and stat is quite simple.
>>> But things will easily be complex, e.g., facet_grid(a+b~c+d), etc.
>>
>> But that shouldn't make it much harder - you'd just have .(a, b, c, d).
>
> Yes of course, but things will be more complex by color, size, fill, etc, e.g.,
> ggplot(data, aes(x, y, colour = fa, size = fb, linetype = fc)) + ... +
> facet_grid(e+f~g+d)
>
> And this kind of thing actually happens during visual exploration.

And then you just add those variables too ;)

Kohske Takahashi

Apr 26, 2012, 11:44:31 AM
to Hadley Wickham, Dennis Murphy, ggplot2-dev
>> And this kind of thing actually happens during visual exploration.
>
> And then you just add those variables too ;)

Yes of course :-(