ddply etc. keeping row order

Kohske Takahashi

Jul 27, 2011, 12:47:24 PM

to manipulatr

Hi,

Is there a simple way to use ddply while keeping the original row order?

For example:

> dput(d)
structure(list(g = c(2L, 2L, 1L, 1L, 2L, 2L), v = c(-1.90127112738315,
-1.20862680183042, -1.13913266070505, 0.14899803094742, -0.69427656843677,
0.872558638137971), i = 1:6), .Names = c("g", "v", "i"), row.names = c(NA,
-6L), class = "data.frame")

> # data looks like:

> d
  g          v i
1 2 -1.9012711 1
2 2 -1.2086268 2
3 1 -1.1391327 3
4 1  0.1489980 4
5 2 -0.6942766 5
6 2  0.8725586 6

> # This works, but the original row order is not preserved:

> ddply(d, .(g), summarize, v=scale(v))
  g           v
1 1 -0.70710678
2 1  0.70710678
3 2 -0.99094901
4 2 -0.40348367
5 2  0.03276177
6 2  1.36167091

> # One workaround is to add an explicit index column and reorder by it after ddply:

> d$i <- 1:nrow(d)
> arrange(ddply(d, .(g), summarize, v=scale(v), i=i), i)
  g           v i
1 2 -0.99094901 1
2 2 -0.40348367 2
3 1 -0.70710678 3
4 1  0.70710678 4
5 2  0.03276177 5
6 2  1.36167091 6

thanks in advance.

--
Kohske Takahashi <takahash...@gmail.com>

Research Center for Advanced Science and Technology,
The University of Tokyo, Japan.
http://www.fennel.rcast.u-tokyo.ac.jp/profilee_ktakahashi.html
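
For the specific case in the question (a per-group transformation that returns one value per input row), base R's ave() writes each group's result back into the original positions, so the row order is never disturbed. This is a different technique from ddply, shown only as a minimal sketch; the column name v_scaled is made up for illustration.

```r
# Same data as in the question, without the helper index column.
d <- data.frame(
  g = c(2L, 2L, 1L, 1L, 2L, 2L),
  v = c(-1.90127112738315, -1.20862680183042, -1.13913266070505,
        0.14899803094742, -0.69427656843677, 0.872558638137971)
)

# ave() applies FUN within each level of g and puts the results back
# in the original row positions. scale() returns a one-column matrix,
# so as.numeric() strips it down to a plain vector of the same length.
d$v_scaled <- ave(d$v, d$g, FUN = function(x) as.numeric(scale(x)))

print(d)
```

This only applies when the function returns exactly as many values as it receives; it does not help with summaries that change the number of rows.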

Hadley Wickham

Jul 28, 2011, 10:02:59 AM

to Kohske Takahashi, manipulatr

See - http://groups.google.com/group/manipulatr/browse_thread/thread/91e7e9b6b507af9a

But I didn't realise I accidentally took my discussion with Steve off-list.

Here's a quick example of why you can't guarantee the same order in general:

library(plyr)
df <- data.frame(x = c("a", "b", "b", "a"), y = 1:4)

# It would be possible if the aggregation function returned the
# same number of rows
ddply(df, "x", identity)

# But not if it returned fewer rows:
ddply(df, "x", head, 1)

# Or more rows
ddply(df, "x", function(df) df[c(1,2,1), ])
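
To make the point concrete, the three calls above can be compared by row count. This is a minimal sketch using only the plyr calls already shown; the variable names (same, fewer, more, row_counts) are made up for illustration.

```r
library(plyr)

df <- data.frame(x = c("a", "b", "b", "a"), y = 1:4)

# One output row per input row: a one-to-one mapping back to df exists.
same <- ddply(df, "x", identity)

# One row per group: most input rows have no counterpart in the output.
fewer <- ddply(df, "x", head, 1)

# Three rows per group: some input rows appear more than once.
more <- ddply(df, "x", function(piece) piece[c(1, 2, 1), ])

row_counts <- c(input = nrow(df), same = nrow(same),
                fewer = nrow(fewer), more = nrow(more))
print(row_counts)
```

Only when the output row count matches the input is there an obvious "original order" to restore.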

--
Assistant Professor / Dobelman Family Junior Chair
Department of Statistics / Rice University
http://had.co.nz/

Kohske Takahashi

Jul 28, 2011, 10:24:31 AM

to Hadley Wickham, manipulatr

> See - http://groups.google.com/group/manipulatr/browse_thread/thread/91e7e9b6b507af9a

I missed it...

>
> But I didn't realise I accidentally took my discussion with Steve off-list.

> Here's a quick example of why you can't guarantee the same order in general:

got it.

I want something like this, but it does not look elegant...

ddply2 <- function(.data, .variables, .fun = NULL, ..., .progress = "none",
                   .drop = TRUE, .parallel = FALSE, .KEEP.ORDER = FALSE) {
  .variables <- as.quoted(.variables)
  pieces <- plyr:::splitter_d(.data, .variables, drop = .drop)
  r <- ldply(.data = pieces, .fun = .fun, ..., .progress = .progress,
             .parallel = .parallel)
  if (.KEEP.ORDER) {
    stopifnot(nrow(r) == nrow(.data))
    r <- r[order(unlist(pieces$index)), ]
    rownames(r) <- NULL
    r
  } else {
    r
  }
}

> ddply2(df, "x", I)
  x y
1 a 1
2 a 4
3 b 2
4 b 3
> ddply2(df, "x", I, .KEEP.ORDER=TRUE)
  x y
1 a 1
2 b 2
3 b 3
4 a 4
> ddply2(df, "x", head, 1)
  x y
1 a 1
2 b 2
> ddply2(df, "x", head, 1, .KEEP.ORDER=TRUE)
Error: nrow(r) == nrow(.data) is not TRUE

Peter Meilstrup

Jul 28, 2011, 3:07:04 PM

to Kohske Takahashi, Hadley Wickham, manipulatr

I would agree with Hadley that it does not make sense to make this an
option for ddply (since preserving order only applies to a special
class of worker functions, and plyr tries to be agnostic to what the
worker functions do.)

How about making your workaround into a separate function that can be
composed with ddply? This has an added advantage, that you could then
use it with other data frame functions that ordinarily disrupt order,
like merge().

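keeping.order <- function(data, fn, ...) {
  # gensym() is not a base R function and is not defined in this thread;
  # it is assumed to return a symbol that does not clash with names in data.
  col <- as.character(gensym(envir=data))
  data[,col] <- 1:nrow(data)
  out <- fn(data, ...)
  if (!col %in% colnames(out)) stop("Ordering column not preserved by function")
  out <- out[order(out[,col]),]
  out[,col] <- NULL
  out
}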

# Then,

d <- structure(list(g = c(2L, 2L, 1L, 1L, 2L, 2L), v = c(-1.90127112738315,
-1.20862680183042, -1.13913266070505, 0.14899803094742, -0.69427656843677,
0.872558638137971)), .Names = c("g", "v"), row.names = c(NA,
-6L), class = "data.frame")

ddply(d, .(g), mutate, v=scale(v)) #does not preserve order of d
keeping.order(d, ddply, .(g), mutate, v=scale(v)) #preserves order of d

# This has the advantage that you can apply it to other data frame
# functions like merge.

names <- data.frame(g=c(1, 2), name = c("Thomas", "Jim"))
merge(d, names) #does not preserve order of d
keeping.order(d, merge, names) #preserves order of d

# Peter
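
Since gensym() is not available in base R, here is a minimal sketch of the same wrapper using only base functions. The function name keeping_order, the temporary column name ".row_order", and the name_lookup data frame are assumptions made for illustration, not part of the code above; make.unique() is used to avoid clashing with an existing column.

```r
keeping_order <- function(data, fn, ...) {
  # Build a temporary column name that cannot collide with existing ones:
  # make.unique() appends a suffix if ".row_order" is already taken.
  col <- tail(make.unique(c(colnames(data), ".row_order")), 1)
  data[[col]] <- seq_len(nrow(data))

  out <- fn(data, ...)

  # The wrapped function must carry the temporary column through.
  if (!col %in% colnames(out)) stop("Ordering column not preserved by function")

  out <- out[order(out[[col]]), , drop = FALSE]
  out[[col]] <- NULL
  rownames(out) <- NULL
  out
}

d <- data.frame(g = c(2L, 2L, 1L, 1L, 2L, 2L),
                v = c(-1.9, -1.2, -1.1, 0.1, -0.7, 0.9))
name_lookup <- data.frame(g = c(1, 2), name = c("Thomas", "Jim"))

# merge() sorts by the key column; the wrapper restores the order of d.
merged <- keeping_order(d, merge, name_lookup)
print(merged)
```

As with the original, this only works when the wrapped function passes the extra column through unchanged (mutate and merge do; a summarize call that lists its output columns explicitly would drop it).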

Kohske Takahashi

Jul 29, 2011, 2:56:07 AM

to Peter Meilstrup, Hadley Wickham, manipulatr

good point.
thanks.

kohske
