Non-atomic columns with ddply


Stavros Macrakis

May 9, 2011, 6:44:05 PM
to manipulatr
I am having some trouble getting ddply/transform to create a column which contains a *list* of items.

Consider the test data frame

    df <- data.frame(a=1:5,b=c(1,1,2,3,3))

Suppose I want to add a column to it which contains, for each b, the number of rows with that value of b.  That is very easy:

    ddply(df,.(b),transform,q=length(a))

But now suppose I want column q to contain the ***list*** of a's having that value of b:

> ddply(df,.(b),transform,q=list(a))
  a b X1.2 X3L X4.5
1 1 1    1  NA   NA
2 2 1    2  NA   NA
3 3 2   NA   3   NA
4 4 3   NA  NA    4
5 5 3   NA  NA    5

Apparently ddply is trying to be clever about the list and bursting it into columns, which is certainly often useful.  But what if I don't want that fancy extra functionality? -- I simply want the list itself in the column.

In data.frame, the "I" hack 'protects' the list from being interpreted as a recursive list of columns:

> data.frame(a=1:5,b=c(1,1,2,3,3),q=I(list(1:2,1:2,3,4:5,4:5)))
  a b    q
1 1 1 1, 2
2 2 1 1, 2
3 3 2    3
4 4 3 4, 5
5 5 3 4, 5

But I can't find the equivalent in ddply, though I tried a variety of things:

> ddply(df,.(b),transform,q=I(a))
  a b q
1 1 1 1
2 2 1 2
3 3 2 3
4 4 3 4
5 5 3 5
> ddply(df,.(b),transform,q=I(list(a)))
Error in data.frame(list(a = 1:2, b = c(1, 1)), q = list(1:2)) : 
  arguments imply differing number of rows: 2, 1
> encapsulate <- function(vvv) function() vvv    # create an opaque object
> ddply(df,.(b),transform,q=encapsulate(a))
Error in data.frame(list(a = 1:2, b = c(1, 1)), q = function ()  : 
  arguments imply differing number of rows: 2, 0
> ddply(df,.(b),transform,q=as.list(a))
  a b q.1L q.2L X3L q.4L q.5L
1 1 1    1    2  NA   NA   NA
2 2 1    1    2  NA   NA   NA
3 3 2   NA   NA   3   NA   NA
4 4 3   NA   NA  NA    4    5
5 5 3   NA   NA  NA    4    5

The problem is not in creating the right result -- after all, if we convert the list to a string before returning it, there's no problem:

> ddply(data.frame(a=1:5,b=c(1,1,2,3,3)),.(b),transform,q=as.character(list(a)))
  a b   q
1 1 1 1:2
2 2 1 1:2
3 3 2   3
4 4 3 4:5
5 5 3 4:5

-----------------------------------------------

OK, so let me try a different tack.  Suppose I want the top n (for fixed n).  The straightforward approach with explicit subscripts works:

> ddply(df,.(b),transform,q=a[1],r=a[2])
  a b q  r
1 1 1 1  2
2 2 1 1  2
3 3 2 3 NA
4 4 3 4  5
5 5 3 4  5

But now I think, why don't I use the auto-bursting functionality that got me into trouble before?  So:

> ddply(df,.(b),transform,q=a[1:2])
  a b  q
1 1 1  1
2 2 1  2
3 3 2  3
4 3 2 NA
5 4 3  4
6 5 3  5
Warning message:
In data.frame(list(a = 3L, b = 2), q = c(3L, NA)) :
  row names were found from a short variable and have been discarded

Oops.  That doesn't work.
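
For reference, one way the fixed-n case could be generalized is sketched below in base R. This is only an illustration under assumptions: the helper name `top_n_cols` and the column names `top1`, `top2` are hypothetical and are not part of plyr. It indexes `a[seq_len(n)]` once per group (which pads with NA when the group is short) and assigns each element to its own column, so every value is recycled over the group's rows instead of being bound row-wise.

```r
df <- data.frame(a = 1:5, b = c(1, 1, 2, 3, 3))

# Hypothetical helper: add columns top1..topn holding the first n values
# of `a` within one group; short groups are padded with NA by indexing.
top_n_cols <- function(d, n = 2) {
  top <- d$a[seq_len(n)]
  for (i in seq_len(n)) {
    d[[paste0("top", i)]] <- top[i]   # length-1 value recycled over the rows
  }
  d
}

# Split by b, apply the helper to each piece, and stack the pieces again.
res <- do.call(rbind, lapply(split(df, df$b), top_n_cols))
rownames(res) <- NULL
res
```

Because the helper takes and returns a data frame, it should also be usable as the function argument of ddply, as in `ddply(df, .(b), top_n_cols)`.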

At this point, I humbly ask for help from those who are more experienced with plyr!

Thanks!

           -s

Hadley Wickham

May 9, 2011, 7:04:18 PM
to Stavros Macrakis, manipulatr
Hi Stavros,

For problems like this, it's usually easiest to diagnose by running dlply:

> dlply(df,.(b),transform,q=list(a))
$`1`
  a b X1.2
1 1 1    1
2 2 1    2

$`2`
  a b X3L
3 3 2   3

$`3`
  a b X4.5
4 4 3    4
5 5 3    5

Which indicates it's not plyr that's at fault, but transform. Luckily
mutate (the plyr equivalent of transform) doesn't have this problem:

> dlply(df,.(b),mutate,q=list(a))
$`1`
  a b    q
1 1 1 1, 2
2 2 1 1, 2

$`2`
  a b q
3 3 2 3

$`3`
  a b    q
4 4 3 4, 5
5 5 3 4, 5

And indeed ddply works fine with mutate:

> ddply(df,.(b),mutate,q=list(a))
  a b    q
1 1 1 1, 2
2 2 1 1, 2
3 3 2    3
4 4 3 4, 5
5 5 3 4, 5

If you don't like the behaviour of either mutate or transform, then
you'll need to manipulate the data frame directly.
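
As a rough sketch of what manipulating the data frame directly could look like, the following uses only base R. The approach and the variable name `groups` are assumptions for illustration, not something plyr provides: it splits `a` by `b`, looks up each row's group, and wraps the result in `I()` so the list is stored as a single column.

```r
df <- data.frame(a = 1:5, b = c(1, 1, 2, 3, 3))

# One list element per distinct value of b, named "1", "2", "3".
groups <- split(df$a, df$b)

# Look up each row's group by name, giving one list element per row.
# I() keeps the list as a single column, as in the data.frame() example.
df$q <- I(unname(groups[as.character(df$b)]))
df
```

Each element of `df$q` is then the full vector of a's sharing that row's value of b.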

Hadley


--
Assistant Professor / Dobelman Family Junior Chair
Department of Statistics / Rice University
http://had.co.nz/

Stavros Macrakis

May 10, 2011, 10:06:20 AM
to Hadley Wickham, manipulatr
Hadley,

Thank you for the quick and clear diagnosis and recommendation!

Truth be told, I hadn't noticed the mutate function in the plyr 1.4 announcement -- I was relying on your ggplot book and your 2009 paper to guide me through using plyr.  And somehow I assumed that 'transform' was part of the plyr package (I never use it otherwise).

May I suggest you update the "Split-apply-combine strategy" paper, incorporating mutate?

Thanks again for the help -- and of course the package itself!

            -s