Batch find and replace

Rosemary Hartman

unread,

Jun 19, 2014, 11:16:07 AM6/19/14

to davi...@googlegroups.com

Hello all,

I have a data processing question that probably has an easy answer, but I can't figure it out. I have a data set on zooplankton that includes species/genus names, and I want to lump them into families or orders. I have a separate data frame of all the species and what orders they are in. I want to make a new column in my original data frame with the larger group that the species falls into. FOr example:

animalsdf = data.frame(length = c(5,4,3,5,6), weight = c(3,4,5,6,9), species = c("rat", "mouse", "lizard", "lizard", "turtle"))

groups = data.frame(species = c("rat","mouse","lizard","turtle"), class = c("mammal", "mammal", "reptile", "reptile"))

# I feel like I should be able to do something like this

llply(animalsdf$species, function(p) {
llply(groups, function(){sub(groups[,1], groups[,2], p)})
})

# But that doesn't work

# I also tried this

animalsdf$class = as.character(animalsdf$species)

for (x in 1:nrow(groups) {
for (i in 1:nrow(animalsdf)) {
animalsdf$class[i] = sub(groups[x,1], groups[x,2], animalsdf$class[i])
}
}
)

# But that didn't work either.

#I can get it to work for one class at a time by doing this:
for (i in 1:nrow(animalsdf)) {
animalsdf$class[i] = sub(groups[1,1], groups[1,2], animalsdf$class[i])
}

#I feel like I should be able to figure this out, but after three hours of trying I thought I'd let some fresh eyes look at it.

--

Rosemary Hartman
roseh...@gmail.com
301-538-0136

Dave Harris

unread,

Jun 19, 2014, 11:27:07 AM6/19/14

to davi...@googlegroups.com

merge(animalsdf, groups) should do the trick.

Dave

--
Check out our R resources at http://www.noamross.net/davis-r-users-group.html
---
You received this message because you are subscribed to the Google Groups "Davis R Users' Group" group.
To unsubscribe from this group and stop receiving emails from it, send an email to davis-rug+...@googlegroups.com.
Visit this group at http://groups.google.com/group/davis-rug.
For more options, visit https://groups.google.com/d/optout.

Dave Harris

unread,

Jun 19, 2014, 11:56:32 AM6/19/14

to Rosemary Hartman, davi...@googlegroups.com

Alternatively, you can do it manually with

cbind(

animalsdf,

groups[match(animalsdf$species, groups$species), ]

)

i.e., find the rows in groups that contain the species in animalsdf, pull them out in the proper order, and paste the results together with your original data frame.

You end up with two copies of the species column, which is actually a good thing because it lets you double-check that you did it right.

Dave

On Thu, Jun 19, 2014 at 8:29 AM, Rosemary Hartman <roseh...@gmail.com> wrote:

Oh my god that's awesome. I knew it was way too simple.

Patrick Kilduff

unread,

Jun 19, 2014, 12:24:11 PM6/19/14

to davi...@googlegroups.com

another way, if you like database syntax:

library(sqldf)

sqldf("SELECT * FROM animalsdf LEFT JOIN groups USING(species) ORDER BY class, species")

Patrick Kilduff

Department of Wildlife, Fish and Conservation Biology

University of California Davis

Vince S. Buffalo

unread,

Jun 19, 2014, 1:29:01 PM6/19/14

to davi...@googlegroups.com

It's worth noting why merge magically worked with minimal arguments — I've seen this lead to incorrect results and very confused users.

merge() does a join of all columns with the same name (see the default by argument is by = intersect(names(x), names(y)). So if these two dataframes had another set of columns with the same name (I'll use entry_date) as an example, it joins by this too:

> groups = data.frame(species = c("rat","mouse","lizard","turtle", "frog"), class = c("mammal", "mammal", "reptile", "reptile", "amphibia"), entry_year=c("2014", "2012", "2013", "2013", "2014"))

> animalsdf = data.frame(length = c(5,4,3,5,6), weight = c(3,4,5,6,9), species = c("rat", "mouse", "lizard", "lizard", "turtle"), entry_year=c("2014", "2014", "2013", "2012", "2011"))

> animalsdf

length weight species entry_year

1 5 3 rat 2014

2 4 4 mouse 2014

3 3 5 lizard 2013

4 5 6 lizard 2012

5 6 9 turtle 2011

> groups

species class entry_year

1 rat mammal 2014

2 mouse mammal 2012

3 lizard reptile 2013

4 turtle reptile 2013

5 frog amphibia 2014

> merge(animalsdf, groups)

species entry_year length weight class

1 lizard 2013 3 5 reptile

2 rat 2014 5 3 mammal

Holy inner join batman! So beware; I usually prefer to be a bit explicit:

> merge(animalsdf, groups, by.x="species", by.y="species")
species length weight entry_year.x class entry_year.y
1 lizard 3 5 2013 reptile 2013
2 lizard 5 6 2012 reptile 2013
3 mouse 4 4 2014 mammal 2012
4 rat 5 3 2014 mammal 2014
5 turtle 6 9 2011 reptile 2013

Call me paranoid (maybe), but being explicit prevents future headaches (this is true in all code).

I also added a new entry to groups — frogs/amphibia. This is to illustrate that this really is an inner join of the data (well, if you want to get technical, a natural join) and this entry of groups is lost since there are no frogs in animalsdf. There are some cases when you may want to do an outer join:

> merge(animalsdf, groups, by.x="species", by.y="species", all=TRUE)

species length weight entry_year.x class entry_year.y

1 lizard 3 5 2013 reptile 2013

2 lizard 5 6 2012 reptile 2013

3 mouse 4 4 2014 mammal 2012

4 rat 5 3 2014 mammal 2014

5 turtle 6 9 2011 reptile 2013

6 frog NA NA <NA> amphibia 2014

If you expect everything in groups to be in animalsdf when joined by "species", doing an outer join and then comparing the number of rows and NAs before and after can be used to spot problems. Also, you can tinker with all.x and all.y to get left- and right outer joins.

Anyways, hope this helps — a few times I've done careless merge/joins that have later lead to me wanting to merge my head with a wall and join group therapy, so maybe this will prevent someone these headaches. For complex cases or any moderately large dataset, Patrick Kilduff is right about sqldf (or using SQLite and RSQLite) which is made to do this stuff (and for slightly larger than moderately large data, you can use indexing to speed things up).

Vince

Vince Buffalo
Ross-Ibarra Lab (www.rilab.org)
Plant Sciences, UC Davis

David Harris

unread,

Jun 19, 2014, 1:30:43 PM6/19/14

to davi...@googlegroups.com

Great advice, Vince. Thanks for the follow-up.

Dave

Bonnie Dixon

unread,

Jun 19, 2014, 1:55:58 PM6/19/14

to davi...@googlegroups.com

Here is a way to do it using the dplyr package:

require(dplyr)

left_join(animalsdf, groups)

(But, since I am not up on the details of all different kinds of joins that Vince describes, this solution may need to be scrutinized more closely for why it works and whether it would work in all cases like this.)

Bonnie

Reply all

Reply to author

Forward