using aggregate on large data frames with several index variables

greg.j...@weyerhaeuser.com

Mar 21, 2009, 6:26:13 PM
to Forest-R
I have often been stymied by aggregate using large amounts of memory
when more than one or two index variables are needed. I have found a
hack that avoids this issue. If any of you have a more elegant way
to do this, let me know.

Let's say we have a large data.frame that we want to aggregate using 4
index variables. We tried the straightforward way but ran out of
memory. Here we want the mean of x for each unique combination of id,
plot, class, and species:

result <- aggregate( list(a.mean=df$x), list(id=df$id, plot=df$plot,
class=df$class, species=df$species), mean )

Alas, we ran out of memory. Try this hack instead:

# create a character vector that combines all index variables
# (the separator must not occur in any of the index values)
i <- paste(df$id, df$plot, df$class, df$species, sep=",")

# aggregate using the hybrid index "i"
result <- aggregate( list(a.mean=df$x), list(i=i), mean )

# unpack the index
# as.character() guards against result$i coming back as a factor
i2 <- matrix(unlist(strsplit(as.character(result$i), ",")), ncol=4, byrow=TRUE)
result$id <- i2[,1] # in this case "id" was a character
result$plot <- as.numeric(i2[,2])
result$class <- as.numeric(i2[,3])
result$species <- i2[,4] # species was a character too
result$i <- NULL # clean up after ourselves

Now we have a data.frame "result" with id, plot, class, species, and
a.mean.
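
For anyone who wants to check the hack before trusting it on real data, here is a minimal sketch on a small made-up data frame. The column names follow the ones above, but the values (including the species codes) are invented for illustration, and the data are small enough that the direct four-way aggregate still runs and can serve as a reference. The as.character() call is my own guard in case the grouping column comes back as a factor.

```r
# Toy data: column names follow the post, values are made up
set.seed(1)
n <- 200
df <- data.frame(
  id      = sample(c("a", "b", "c"), n, replace = TRUE),
  plot    = sample(1:4, n, replace = TRUE),
  class   = sample(1:3, n, replace = TRUE),
  species = sample(c("PSME", "TSHE"), n, replace = TRUE),
  x       = runif(n),
  stringsAsFactors = FALSE
)

# Reference: the direct four-way aggregate (fine at this size)
direct <- aggregate(list(a.mean = df$x),
                    list(id = df$id, plot = df$plot,
                         class = df$class, species = df$species),
                    mean)

# The hack: aggregate on a single pasted key, then unpack the key
i <- paste(df$id, df$plot, df$class, df$species, sep = ",")
result <- aggregate(list(a.mean = df$x), list(i = i), mean)
i2 <- matrix(unlist(strsplit(as.character(result$i), ",")),
             ncol = 4, byrow = TRUE)
result$id      <- i2[, 1]
result$plot    <- as.numeric(i2[, 2])
result$class   <- as.numeric(i2[, 3])
result$species <- i2[, 4]
result$i       <- NULL

# Compare the two results group by group
key  <- c("id", "plot", "class", "species")
both <- merge(direct, result, by = key, suffixes = c(".direct", ".hack"))
same_groups <- nrow(both) == nrow(direct) && nrow(direct) == nrow(result)
same_means  <- isTRUE(all.equal(both$a.mean.direct, both$a.mean.hack))
```

If same_groups and same_means are both TRUE, the pasted-key version found the same groups and the same means as the direct call. Note that the check only makes sense when none of the index values contains the separator.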