Batch find and replace

13 views
Skip to first unread message

Rosemary Hartman

unread,
Jun 19, 2014, 11:16:07 AM6/19/14
to davi...@googlegroups.com
Hello all,

I have a data processing question that probably has an easy answer, but I can't figure it out. I have a data set on zooplankton that includes species/genus names, and I want to lump them into families or orders. I have a separate data frame of all the species and what orders they are in. I want to make a new column in my original data frame with the larger group that the species falls into. FOr example:

animalsdf = data.frame(length = c(5,4,3,5,6), weight = c(3,4,5,6,9), species = c("rat", "mouse", "lizard", "lizard", "turtle"))

groups = data.frame(species = c("rat","mouse","lizard","turtle"), class = c("mammal", "mammal", "reptile", "reptile"))

# I feel like I should be able to do something like this
llply(animalsdf$species, function(p) {
  llply(groups, function(){sub(groups[,1], groups[,2], p)})
})

# But that doesn't work
# I also tried this

animalsdf$class = as.character(animalsdf$species)

for (x in 1:nrow(groups) {
    for (i in 1:nrow(animalsdf)) {
    animalsdf$class[i] = sub(groups[x,1], groups[x,2], animalsdf$class[i])
}
}
)
# But that didn't work either.

#I can get it to work for one class at a time by doing this:
  for (i in 1:nrow(animalsdf)) {
    animalsdf$class[i] = sub(groups[1,1], groups[1,2], animalsdf$class[i])
}

#I feel like I should be able to figure this out, but after three hours of trying I thought I'd let some fresh eyes look at it.

--

Dave Harris

unread,
Jun 19, 2014, 11:27:07 AM6/19/14
to davi...@googlegroups.com
merge(animalsdf, groups) should do the trick.

Dave


--
Check out our R resources at http://www.noamross.net/davis-r-users-group.html
---
You received this message because you are subscribed to the Google Groups "Davis R Users' Group" group.
To unsubscribe from this group and stop receiving emails from it, send an email to davis-rug+...@googlegroups.com.
Visit this group at http://groups.google.com/group/davis-rug.
For more options, visit https://groups.google.com/d/optout.

Dave Harris

unread,
Jun 19, 2014, 11:56:32 AM6/19/14
to Rosemary Hartman, davi...@googlegroups.com
Alternatively, you can do it manually with 

cbind(
  animalsdf, 
  groups[match(animalsdf$species, groups$species), ]
)

i.e., find the rows in groups that contain the species in animalsdf, pull them out in the proper order, and paste the results together with your original data frame.  

You end up with two copies of the species column, which is actually a good thing because it lets you double-check that you did it right.

Dave


On Thu, Jun 19, 2014 at 8:29 AM, Rosemary Hartman <roseh...@gmail.com> wrote:
Oh my god that's awesome. I knew it was way too simple.

Patrick Kilduff

unread,
Jun 19, 2014, 12:24:11 PM6/19/14
to davi...@googlegroups.com
another way, if you like database syntax:

library(sqldf)

sqldf("SELECT * FROM animalsdf LEFT JOIN groups USING(species) ORDER BY class, species")
Patrick Kilduff

Department of Wildlife, Fish and Conservation Biology
University of California Davis

Vince S. Buffalo

unread,
Jun 19, 2014, 1:29:01 PM6/19/14
to davi...@googlegroups.com
It's worth noting why merge magically worked with minimal arguments — I've seen this lead to incorrect results and very confused users. 

merge() does a join of all columns with the same name (see the default by argument is by = intersect(names(x), names(y)). So if these two dataframes had another set of columns with the same name (I'll use entry_date) as an example, it joins by this too:

> groups = data.frame(species = c("rat","mouse","lizard","turtle", "frog"), class = c("mammal", "mammal", "reptile", "reptile", "amphibia"), entry_year=c("2014", "2012", "2013", "2013", "2014"))
> animalsdf = data.frame(length = c(5,4,3,5,6), weight = c(3,4,5,6,9), species = c("rat", "mouse", "lizard", "lizard", "turtle"), entry_year=c("2014", "2014", "2013", "2012", "2011"))
> animalsdf
  length weight species entry_year
1      5      3     rat       2014
2      4      4   mouse       2014
3      3      5  lizard       2013
4      5      6  lizard       2012
5      6      9  turtle       2011
> groups
  species    class entry_year
1     rat   mammal       2014
2   mouse   mammal       2012
3  lizard  reptile       2013
4  turtle  reptile       2013
5    frog amphibia       2014
> merge(animalsdf, groups)
  species entry_year length weight   class
1  lizard       2013      3      5 reptile
2     rat       2014      5      3  mammal

Holy inner join batman! So beware; I usually prefer to be a bit explicit:

> merge(animalsdf, groups, by.x="species", by.y="species")
  species length weight entry_year.x   class entry_year.y
1  lizard      3      5         2013 reptile         2013
2  lizard      5      6         2012 reptile         2013
3   mouse      4      4         2014  mammal         2012
4     rat      5      3         2014  mammal         2014
5  turtle      6      9         2011 reptile         2013

Call me paranoid (maybe), but being explicit prevents future headaches (this is true in all code).

I also added a new entry to groups — frogs/amphibia. This is to illustrate that this really is an inner join of the data (well, if you want to get technical, a natural join) and this entry of groups is lost since there are no frogs in animalsdf. There are some cases when you may want to do an outer join:

> merge(animalsdf, groups, by.x="species", by.y="species", all=TRUE)
  species length weight entry_year.x    class entry_year.y
1  lizard      3      5         2013  reptile         2013
2  lizard      5      6         2012  reptile         2013
3   mouse      4      4         2014   mammal         2012
4     rat      5      3         2014   mammal         2014
5  turtle      6      9         2011  reptile         2013
6    frog     NA     NA         <NA> amphibia         2014

If you expect everything in groups to be in animalsdf when joined by "species", doing an outer join and then comparing the number of rows and NAs before and after can be used to spot problems. Also, you can tinker with all.x and all.y to get left- and right outer joins.

Anyways, hope this helps — a few times I've done careless merge/joins that have later lead to me wanting to merge my head with a wall and join group therapy, so maybe this will prevent someone these headaches. For complex cases or any moderately large dataset, Patrick Kilduff is right about sqldf (or using SQLite and RSQLite) which is made to do this stuff (and for slightly larger than moderately large data, you can use indexing to speed things up).

Vince



Vince Buffalo
Ross-Ibarra Lab (www.rilab.org)
Plant Sciences, UC Davis

David Harris

unread,
Jun 19, 2014, 1:30:43 PM6/19/14
to davi...@googlegroups.com
Great advice, Vince. Thanks for the follow-up. 

Dave

Bonnie Dixon

unread,
Jun 19, 2014, 1:55:58 PM6/19/14
to davi...@googlegroups.com
Here is a way to do it using the dplyr package:

require(dplyr)
left_join(animalsdf, groups)

(But, since I am not up on the details of all different kinds of joins that Vince describes, this solution may need to be scrutinized more closely for why it works and whether it would work in all cases like this.)

Bonnie
Reply all
Reply to author
Forward
0 new messages