Aggregating categorical variables

9 views
Skip to first unread message

Elise Hellwig

unread,
Jun 24, 2014, 3:40:57 PM6/24/14
to davi...@googlegroups.com
Hi,

So I have a set of survey data for a number of countries. Some of the questions have numeric answers, like how many minutes does it take you to get water (ex. 25). However, some have categorical answers, like what is your highest level of education completed (ex, primary). I would like to take all the data for a categorical variable for a specific country and assign that variable the value of the most frequent response for that country (I know its not very nuanced but we need to start simple here).

I know for variables with numeric values you can use something like

aggregate(variable, country, fun=mean)


to do that for you. However I am having trouble finding a good way to do it for the categorical variables.  Currently, I am using

aggregate(variable, country, FUN=function(t) names(which.max(table(t))))

but the output it gives is in a very weird form. Instead of having a data frame with x variables each with y observations (each one corresponding to a different country), it produces a list x long, and in each list are y variables each with one observation, that being the appropriate value for that variable for the country of interest. The weirdest thing is though that the variables aren't the countries, they are just numbers.

Thanks!
Elise

Vince S. Buffalo

unread,
Jun 24, 2014, 4:18:32 PM6/24/14
to davi...@googlegroups.com
Hi Elise, 

Are the numbers it returns integers? This is likely because your categorial data is a factor. Factors are just integer vectors with labels (e.g high, medium, low). Factors can be ordered (e.g. elementary school, high school, college) or unordered (green, yellow, pink). If your column is an ordered factor, max works automatically. So:

> set.seed(0)
> d <- data.frame(country=sample(letters[1:5], 10, replace=TRUE))
> d$edu <- factor(sample(c("high school", "college", "elementary school"), 10, replace=TRUE), levels=c("elementary school", "high school", "college"), ordered=TRUE)
> d
   country               edu
1        e       high school
2        b       high school
3        b       high school
4        c elementary school
5        e           college
6        b elementary school
7        e           college
8        e elementary school
9        d elementary school
10       d           college
> aggregate(d$edu, list(country=d$country), max)
  country                 x
1       b       high school
2       c elementary school
3       d           college
4       e           college
> d$edu
 [1] high school       high school       high school       elementary school
 [5] college           elementary school college           elementary school
 [9] elementary school college
Levels: elementary school < high school < college
> d$country
 [1] e b b c e b e e d d
Levels: b c d e

Note how countries are unordered and edu is ordered. R's generics for max, min, and comparison operators understand this ordering:

> d$edu < rev(d$edu)
 [1]  TRUE FALSE FALSE  TRUE FALSE  TRUE FALSE  TRUE  TRUE FALSE
> rev(d$edu)
 [1] college           elementary school elementary school college
 [5] elementary school college           elementary school high school
 [9] high school       high school
Levels: elementary school < high school < college
> d$edu
 [1] high school       high school       high school       elementary school
 [5] college           elementary school college           elementary school
 [9] elementary school college
Levels: elementary school < high school < college

So the solution is just make sure you're using ordered factors — R will take care of the rest! 

HTH,
Vince


--
Check out our R resources at http://www.noamross.net/davis-r-users-group.html
---
You received this message because you are subscribed to the Google Groups "Davis R Users' Group" group.
To unsubscribe from this group and stop receiving emails from it, send an email to davis-rug+...@googlegroups.com.
Visit this group at http://groups.google.com/group/davis-rug.
For more options, visit https://groups.google.com/d/optout.

--
Vince Buffalo
Ross-Ibarra Lab (www.rilab.org)
Plant Sciences, UC Davis

Elise Hellwig

unread,
Jun 24, 2014, 5:48:52 PM6/24/14
to davi...@googlegroups.com
I think maybe I didn't specify what I want to do very well. I want to take the value for the variable that appeared the largest number of times for a country and make that the value for the country. So given this set of data below

       edu           country
1       none           a
2     higher           a
3       none           a
4    primary          a
5    primary          a
6  secondary        b
7  secondary        b
8  secondary        b
9    primary           b
10 secondary       b
11 secondary       c
12   primary         c
13      none          c
14   primary         c
15    higher          c
16 secondary      d
17    higher          d
18 secondary      d
19      none          d
20      none          d

country a would be non/primary, b would be secondary, c would be primary, and d would be secondary/none. In this case the variables have some sort of ordering but often times they don't (for example type of toilet or urban rural home location). Thanks!

Elise


You received this message because you are subscribed to a topic in the Google Groups "Davis R Users' Group" group.
To unsubscribe from this topic, visit https://groups.google.com/d/topic/davis-rug/p1yynBhVE6w/unsubscribe.
To unsubscribe from this group and all its topics, send an email to davis-rug+...@googlegroups.com.

Vince S. Buffalo

unread,
Jun 24, 2014, 5:59:21 PM6/24/14
to davi...@googlegroups.com
Oh, I see — I read your initial email too quickly (sorry). Is this what you're looking for?

aggregate(d$edu, list(country=d$country), function(x) levels(d$edu)[which.max(table(x))])
  country                 x
1       b       high school
2       c elementary school
3       d elementary school
4       e           college

But this doesn't handle ties — not sure how you want to handle that. Also, very subtle thing to note: this relies on the fact that table uses all factor levels (e.g. it doesn't drop unused factors – see below). Otherwise, this would most definitely not be safe.

> table(factor(letters[1:10], levels=letters))

a b c d e f g h i j k l m n o p q r s t u v w x y z
1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
Reply all
Reply to author
Forward
0 new messages