Re: Warning about incorrectly reordered output in aaply [resurrected old thread]


Steve Lianoglou

Jul 8, 2011, 5:43:48 PM

to Hadley Wickham, Michael, manipulatr

Hi,

Sorry to bring up an old thread, but I think I've hit the same
situation that was described below: ddply is reordering the rows of
its output relative to the input data.frame, which I thought it
wasn't meant to do.

Btw, I'm using R-2.13 and plyr 1.5.2

Consider a simplified data.frame I'm using, where I want to transform
the "count" column to be a fraction of the counts for a given
entrez.id (a character identifier):

=============
library(plyr)
df <- structure(list(chr = c("chr1", "chr1", "chr1", "chr1", "chr1",
"chr1", "chr1", "chr1", "chr1", "chr1", "chr1", "chr1", "chr1",
"chr1", "chr1", "chr1", "chr1", "chr1"), pos = c(792472L, 879631L,
879722L, 879896L, 901037L, 910460L, 910533L, 949862L, 991442L,
1105662L, 1169036L, 1169501L, 1170339L, 1227349L, 1246998L, 1264215L,
1376089L, 1378211L), entrez.id = c("643837", "148398", "148398",
"148398", "339451", "84069", "84069", "9636", "375790", "554210",
"126792", "126792", "126792", "6339", "126789", "80772", "64856",
"64856"), count = c(3L, 35L, 4L, 9L, 55L, 2L, 4L, 74L, 348L,
0L, 86L, 16L, 69L, 41L, 83L, 189L, 11L, 0L)), .Names = c("chr",
"pos", "entrez.id", "count"), row.names = c(15L, 20L, 21L, 22L,
23L, 24L, 25L, 30L, 31L, 33L, 37L, 38L, 39L, 42L, 43L, 49L, 76L,
77L), class = "data.frame")

calcd <- ddply(df, .(entrez.id), summarise, fraction=count / sum(count))
all(calcd$entrez.id == df$entrez.id)
## [1] FALSE

all(calcd$entrez.id == sort(df$entrez.id))
## [1] TRUE
============

The entrez.id column happens to be a character type (a character
vector whose values look like integers).

We see that calcd is ordered by .(entrez.id), sorted as character
strings, and not in the order of the data.frame I passed to ddply.
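
To make the character ordering concrete, here is a minimal base-R sketch. It only illustrates how strings of digits sort and is not a claim about plyr's internals. The `ids` vector is a handful of entrez.id values from the data above, and `method = "radix"` is used purely so the result does not depend on the locale.

```r
# A few entrez.id values from the example, kept as character strings
ids <- c("643837", "148398", "9636", "84069")

# Character sort compares strings left to right, so "9636" lands last
sorted_chr <- sort(ids, method = "radix")
# "148398" "643837" "84069" "9636"

# Numeric sort of the same values, for comparison
sorted_num <- sort(as.numeric(ids))
# 9636 84069 148398 643837

# Ordering row numbers 1..12 as strings gives the permutation
# described for aaply in the quoted message below
perm <- order(as.character(1:12), method = "radix")
# 1 10 11 12 2 3 4 5 6 7 8 9
```

So the order of calcd looks like a plain string sort of the IDs, which is why it matches sort(df$entrez.id) but not df$entrez.id.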

If I'm not mistaken, we're supposed to expect the output of ddply to
be in the same order as the input, no?

Thanks,

-steve

On Tue, Dec 28, 2010 at 11:50 AM, Hadley Wickham <had...@rice.edu> wrote:
> Hi all,
>
> This is now fixed in the development version of plyr, which I'm aiming
> to release later today. I've added more test cases to make sure it
> doesn't happen again.
>
> Regards,
>
> Hadley
>
> On Sat, Dec 11, 2010 at 11:10 AM, Michael <bra...@mit.edu> wrote:
>> Hi. I have discovered a potentially critical bug in the aaply
>> functions (and maybe some of the other **ply functions) that you might
>> want to be aware of. Hadley is aware of, and is fixing, this problem,
>> but he asked me to post what I found to the list.
>>
>> The problem is with aaply reordering output in unexpected ways. If
>> you have more than 10 items in a dimension, aaply has been permuting
>> the rows of output according to the character representation of the
>> row number (it happens with columns, too). So if you have, say, 12 rows
>> that you need to process, output row 1 corresponds to input row 1,
>> output row 2 corresponds to input row 10, output 3 to input 11, output
>> 4 to input 12, output 5 to input 2, output 6 to input 3, and so
>> forth. If you are expecting the output of a **ply function to
>> correspond to the order of the input, you really need to check that.
>> Where it was killing me was that the signs of regression coefficients
>> weren't making sense. It was because the coefficient estimates were
>> matched to the wrong covariates. Yikes!
>>
>> There is a development version of plyr, 1.3, that I am now using (it's
>> up to Hadley if he wants to release that link to everyone).
>> Fortunately, the bug I described above is fixed in 1.3. But there is
>> another bug that remains: re-ordering the output according to
>> dimnames. If the row names are not in alphabetical order, the output
>> will be reordered so that the dimnames determine the order. Unless
>> you are expecting that behavior, this kind of automated reordering can
>> cause a world of hurt. For this, I have some test code (pasted below)
>> that illustrates the problem. What I don't know is whether these
>> tests fail under 1.2.1, because I am now using 1.3, and I can't risk
>> going back. If you are using 1.3, and maybe even 1.2.1, my
>> recommendation is that you remove ALL dimnames from any input arrays
>> to aaply. Under 1.3, I think all of the problems disappear with that
>> workaround. But if you are using 1.2.1, there may still be a problem
>> even without the dimnames. In either case, if you are getting wacky
>> results from any of the **ply functions, this might be a reason why.
>>
>> So here's the test code. The W, V, and M tests pass in 1.3, but will
>> probably fail under 1.2.1 (again, I can't really test that). The X,
>> Y, Z, P and Q tests should pass under 1.3, and would probably also
>> fail under 1.2.1.
>>
>> X <-
>> array(1:12,dim=c(3,4),dimnames=list(R=LETTERS[c(13,12,14)],C=LETTERS[c(20,17,25,19)]))
>> X.1.right <- rowSums(X)
>> X.1.aaply <- aaply(X,1,sum)
>> cat("X: ",all.equal(X.1.right, X.1.aaply),"\n")
>>
>> Y <-
>> array(1:60,dim=c(3,4,5),dimnames=list(R=c(13,12,14),C=c(20,17,25,19),L=c(1,3,5,2,4)))
>> Y.1.right <- colSums(Y,dims=2)
>> Y.1.aaply <- aaply(Y,3,sum)
>> cat("Y: ",all.equal(Y.1.right, Y.1.aaply),"\n")
>>
>> Z <-
>> array(1:12,dim=c(3,4),dimnames=list(R=c(13,12,14),C=c(20,17,25,19)))
>> Z.1.right <- colSums(Z)
>> Z.1.aaply <- aaply(Z,2,sum)
>> cat("Z: ",all.equal(Z.1.right,Z.1.aaply),"\n")
>>
>> W <- array(1:12,dim=c(3,4))
>> W.1.right <- colSums(W)
>> W.1.aaply <- aaply(W,2,sum)
>> cat("W: ",all.equal(W.1.right, W.1.aaply),"\n")
>>
>> V <- array(1:(13*14),dim=c(13,14))
>> V.1.right <- colSums(V)
>> V.1.aaply <- aaply(V,2,sum)
>> cat("V: ",all.equal(V.1.right, V.1.aaply),"\n")
>>
>> M <- array(1:(5*9*14),dim=c(5,9,14))
>> M.1.right <- colSums(M,dims=2)
>> M.1.aaply <- aaply(M,3,sum)
>> cat("M: ",all.equal(M.1.right, M.1.aaply),"\n")
>>
>> P <- array(1:(5*9*14),dim=c(5,9,14),dimnames=list(A=1:5,B=1:9,C=14:1))
>> P.1.right <- colSums(P,dims=2)
>> P.1.aaply <- aaply(P,3,sum)
>> cat("P: ",all.equal(P.1.right, P.1.aaply),"\n")
>>
>> Q <- array(1:(5*9*14),dim=c(5,9,14),dimnames=list(A=5:1,B=1:9,C=1:14))
>> Q.1.right <- colSums(Q,dims=2)
>> Q.1.aaply <- aaply(Q,3,sum)
>> cat("Q-dim3: ",all.equal(Q.1.right,Q.1.aaply),"\n")
>>
>> Q.2.right <- rowSums(Q)
>> Q.2.aaply <- aaply(Q,1,sum)
>> cat("Q-rowSums: ",all.equal(Q.2.right,Q.2.aaply),"\n")
>>
>>
>>
>>
>> Michael Braun
>> MIT Sloan School of Management
>> bra...@mit.edu
>>
>> --
>> You received this message because you are subscribed to the Google Groups "manipulatr" group.
>> To post to this group, send email to manip...@googlegroups.com.
>> To unsubscribe from this group, send email to manipulatr+...@googlegroups.com.
>> For more options, visit this group at http://groups.google.com/group/manipulatr?hl=en.
>>
>>
>
>
>
> --
> Assistant Professor / Dobelman Family Junior Chair
> Department of Statistics / Rice University
> http://had.co.nz/
>
>
>

--
Steve Lianoglou
Graduate Student: Computational Systems Biology
| Memorial Sloan-Kettering Cancer Center
| Weill Medical College of Cornell University
Contact Info: http://cbio.mskcc.org/~lianos/contact

Hadley Wickham

Jul 10, 2011, 1:04:00 PM

to Steve Lianoglou, Michael, manipulatr

Hi Steve,

No, there's no guarantee that ddply output will be in the same order
as the input, and in general, I don't see any possible way you could
guarantee that.
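
If the input order matters, one option is not to rely on the output order at all. The following is a minimal sketch in base R, not something proposed in this thread. It assumes that `ave` is an acceptable substitute for ddply for a per-group fraction like the one above. The toy data frame `d` and the names `grouped` and `restored` are made up for illustration.

```r
# Toy data (hypothetical); the groups are deliberately not in sorted order
d <- data.frame(
  id    = c("9", "10", "10", "2"),
  count = c(4, 1, 3, 5),
  stringsAsFactors = FALSE
)

# ave() returns a vector aligned with the input rows, so no row moves
d$fraction <- d$count / ave(d$count, d$id, FUN = sum)

# Alternative: put a group-sorted result back in first-appearance order.
# 'grouped' stands in for a result that came back sorted by id.
# This assumes each group's rows are contiguous in the input.
grouped  <- d[order(d$id), ]
restored <- grouped[order(match(grouped$id, unique(d$id))), ]
```

The first approach sidesteps the question because each value is computed in place. The second only restores the order of the groups, so it is worth checking the result against the input before relying on it.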