I am having some trouble getting ddply/transform to create a column which contains a *list* of items.
Consider the test data frame:
> library(plyr)
> df <- data.frame(a=1:5,b=c(1,1,2,3,3))
Suppose I want to add a column to it which contains, for each b, the number of rows with that value of b. That is very easy:
> ddply(df,.(b),transform,q=length(a))
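For comparison, the same per-group count can be sketched in base R without plyr. This is only an illustration: it rebuilds the same test data frame, and the variable name counts is a hypothetical choice that does not appear in the calls above.

```r
df <- data.frame(a = 1:5, b = c(1, 1, 2, 3, 3))
# ave() applies FUN within each group of b and returns one value per row
counts <- ave(df$a, df$b, FUN = length)
counts
```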
But now suppose I want column q to contain the *list* of a's having that value of b:
> ddply(df,.(b),transform,q=list(a))
  a b X1.2 X3L X4.5
1 1 1    1  NA   NA
2 2 1    2  NA   NA
3 3 2   NA   3   NA
4 4 3   NA  NA    4
5 5 3   NA  NA    5
Apparently ddply is trying to be clever about the list and bursting it into columns, which is certainly often useful. But what if I don't want that fancy extra functionality? I simply want the list itself in the column.
In data.frame, wrapping the list in I() "protects" it from being interpreted as a recursive list of columns:
> data.frame(a=1:5,b=c(1,1,2,3,3),q=I(list(1:2,1:2,3,4:5,4:5)))
  a b    q
1 1 1 1, 2
2 2 1 1, 2
3 3 2    3
4 4 3 4, 5
5 5 3 4, 5
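To make the contrast concrete, here is a minimal base R sketch of the two behaviours in plain data.frame, with and without I(). The names burst and kept are hypothetical, chosen only for this illustration.

```r
# Without I(): each element of the list becomes its own column
burst <- data.frame(a = 1:2, q = list(1:2, 3:4))
# With I(): the list is kept as a single column with one element per row
kept <- data.frame(a = 1:2, q = I(list(1:2, 3:4)))

ncol(burst)    # 3 columns: a plus one per list element
ncol(kept)     # 2 columns: a and q
kept$q[[2]]    # the vector stored in row 2
```

The second form is the shape I am hoping to get back from ddply.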
But I can't find the equivalent in ddply, though I tried a variety of things:
> ddply(df,.(b),transform,q=I(a))
  a b q
1 1 1 1
2 2 1 2
3 3 2 3
4 4 3 4
5 5 3 5
> ddply(df,.(b),transform,q=I(list(a)))
Error in data.frame(list(a = 1:2, b = c(1, 1)), q = list(1:2)) :
arguments imply differing number of rows: 2, 1
> encapsulate <- function(vvv) function() vvv # create an opaque object
> ddply(df,.(b),transform,q=encapsulate(a))
Error in data.frame(list(a = 1:2, b = c(1, 1)), q = function () :
arguments imply differing number of rows: 2, 0
> ddply(df,.(b),transform,q=as.list(a))
  a b q.1L q.2L X3L q.4L q.5L
1 1 1    1    2  NA   NA   NA
2 2 1    1    2  NA   NA   NA
3 3 2   NA   NA   3   NA   NA
4 4 3   NA   NA  NA    4    5
5 5 3   NA   NA  NA    4    5
The problem is not in computing the right value for each group. After all, if we convert the list to a string before returning it, there's no problem:
> ddply(data.frame(a=1:5,b=c(1,1,2,3,3)),.(b),transform,q=as.character(list(a)))
  a b   q
1 1 1 1:2
2 2 1 1:2
3 3 2   3
4 4 3 4:5
5 5 3 4:5
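For reference, the result I am after can be built outside ddply. The following is a sketch in base R only, not a plyr solution; the names groups and df_q are hypothetical, and it assumes that as.character(df$b) matches the group names produced by split(), which holds for the whole-number values used here.

```r
df <- data.frame(a = 1:5, b = c(1, 1, 2, 3, 3))
# Named list with one vector of a's per distinct value of b
groups <- split(df$a, df$b)
# Look up each row's group by name, then protect the list with I()
df_q <- df
df_q$q <- I(unname(groups[as.character(df$b)]))
df_q
```

This gives a real list column (df_q$q[[1]] is the vector 1:2), but it bypasses the split-apply-combine machinery, which is what I would like to avoid.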
-----------------------------------------------
OK, so let me try a different tack. Suppose I want only the first n values of a in each group (for fixed n), each in its own column. The straightforward approach with explicit subscripts works:
> ddply(df,.(b),transform,q=a[1],r=a[2])
  a b q  r
1 1 1 1  2
2 2 1 1  2
3 3 2 3 NA
4 4 3 4  5
5 5 3 4  5
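The explicit-subscript version generalizes to any fixed n with a loop. Below is a base R sketch under my own assumptions: n, first_n, key, out and the column names a1, a2, ... are all hypothetical, and the code relies on indexing past the end of a vector returning NA.

```r
df <- data.frame(a = 1:5, b = c(1, 1, 2, 3, 3))
n <- 2L  # hypothetical fixed n
# First n values of a per group, padded with NA for short groups
first_n <- lapply(split(df$a, df$b), function(v) v[seq_len(n)])
key <- as.character(df$b)  # group name for each row
out <- df
for (i in seq_len(n)) {
  # Column i holds the i-th value of each row's group
  out[[paste0("a", i)]] <- vapply(first_n[key], function(v) v[i],
                                  integer(1), USE.NAMES = FALSE)
}
out
```

With n = 2 the columns a1 and a2 hold the same values as q and r above.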
But now I think, why don't I use the auto-bursting functionality that got me into trouble before? So:
> ddply(df,.(b),transform,q=a[1:2])
  a b  q
1 1 1  1
2 2 1  2
3 3 2  3
4 3 2 NA
5 4 3  4
6 5 3  5
Warning message:
In data.frame(list(a = 3L, b = 2), q = c(3L, NA)) :
row names were found from a short variable and have been discarded
Oops. That doesn't work.
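As far as I can tell from the warning, the extra row comes from the group with b == 2. Here is a sketch reproducing that one step in base R, assuming ddply simply calls transform on each piece; the names piece and res are hypothetical.

```r
df <- data.frame(a = 1:5, b = c(1, 1, 2, 3, 3))
piece <- df[df$b == 2, ]   # the single row with b == 2
piece$a[1:2]               # indexing past the end pads with NA
# transform() hands the piece and the new column to data.frame(),
# which recycles the one-row piece to match the two-element column
res <- suppressWarnings(transform(piece, q = a[1:2]))
nrow(res)
```

So q = a[1:2] is treated as a column of length 2 for that group, not as two separate columns.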
At this point, I humbly ask for help from those who are more experienced with plyr!
Thanks!
-s