Aggregating using dplyr

11 views
Skip to first unread message

Gouri Shankar Mishra

unread,
Sep 20, 2016, 6:22:43 PM9/20/16
to Davis R Users' Group
Hello All

This seems to be a very straightforward problem - but for some reason it is not working for me. 

I have a dataset (attached)

The ID variables are in columns 1 to 9
I have a number of "data variables" from columns 10 to 15. 

I want to aggregate by the different categories in the 9th column - "OWNRAGE.id". In other words, I want to collapse the different categories under OWNRAGE.id to a single category and sum all the values. 

I used the following code

idVars <- names(df[,1:9])
dataVars <- names(df[10:15])
by_vars = idVars[ idVars != "OWNRAGE.id" ]

library(dplyr)
data_aggregated <- group_by(df, one_of(by_vars)) %>% summarise_each(funs(sum), one_of(dataVars))

The above does not work and gives me an error
Error in eval(expr, envir, enclos) : 
  wrong result size (0), expected 84 or 1
In addition: Warning message:
In one_of(c("GEO.id", "GEO.id2", "GEO.annotation.id", "NAICS.id",  :
  Unknown variables: `GEO.id`, `GEO.id2`, `GEO.annotation.id`, `NAICS.id`, `NAICS.annotation.id`, `CBOGROUP.id`, `CBOSEX.id`, `YEAR.id`


I also tried the base aggregate function
data_aggregated <- aggregate( df[ dataVars], by=as.list( data_subset[ idVars ] ), sum )

This gives me an empty data frame. 

I will greatly appreciate any ideas. 
df.csv

Brandon Hurr

unread,
Sep 20, 2016, 7:11:07 PM9/20/16
to davi...@googlegroups.com
The answer seems to be to use standard evaluation for grouping. Does the output of this look right to you? 

 group_by_(df, .dots = by_vars) %>% summarise_each(funs(sum(., na.rm=TRUE)), -OWNRAGE.id)

Your csv had rownumbers so others might have to get rid of those first. But, essentially, you're grouping by everything you want to aggregate on and then summarize_each everything else except OWNRAGE.id. 

A mix of NSE and SE seems to get you there. Others may have better solutions. 

B


On Tue, Sep 20, 2016 at 3:22 PM, Gouri Shankar Mishra <gouri....@gmail.com> wrote:
summarise_each(funs(sum)


Matt Espe

unread,
Sep 21, 2016, 12:35:07 PM9/21/16
to Davis R Users' Group
Hi,

I am a little confused about what you are attempting to do here. 

If you merely want to aggregate by a single column (OWNRAGE.id):

aggregate(df[,-grep("\\.id", colnames(df))], list(df$OWNRAGE.id), sum, na.rm = TRUE)

However, it seem you might be trying to aggregate by all "ID" columns other than OWNRAGE.id. The trouble is, some of your "ID" columns are entirely NA. If you remove those columns, aggregate works as expected.

idVars <- names(df[,1:9])
dataVars 
<- names(df[10:15])

by_vars 
= idVars[ !idVars %in% c("GEO.id2", "GEO.annotation.id", "OWNRAGE.id") ]

aggregate(df[,dataVars], as.list(df[,by_vars]), sum, na.rm = TRUE)



There are more elegant ways to remove the NA columns and to get your ID columns to avoid issues with shifting columns due to row names, but the above should work. 

Matt

Gouri Shankar Mishra

unread,
Sep 21, 2016, 12:40:47 PM9/21/16
to Davis R Users' Group
Brandon - your suggestion wroked great. 

Matt - I guess I now understand why my code was not working. Thanks. I was trying to aggregate only by the one column (OWNRAGE.id). 
Reply all
Reply to author
Forward
0 new messages