It's worth noting why merge magically worked with minimal arguments — I've seen this lead to incorrect results and very confused users.
merge() does a join of all columns with the same name (see the default by argument is by = intersect(names(x), names(y)). So if these two dataframes had another set of columns with the same name (I'll use entry_date) as an example, it joins by this too:
> groups = data.frame(species = c("rat","mouse","lizard","turtle", "frog"), class = c("mammal", "mammal", "reptile", "reptile", "amphibia"), entry_year=c("2014", "2012", "2013", "2013", "2014"))
> animalsdf = data.frame(length = c(5,4,3,5,6), weight = c(3,4,5,6,9), species = c("rat", "mouse", "lizard", "lizard", "turtle"), entry_year=c("2014", "2014", "2013", "2012", "2011"))
> animalsdf
length weight species entry_year
1 5 3 rat 2014
2 4 4 mouse 2014
3 3 5 lizard 2013
4 5 6 lizard 2012
5 6 9 turtle 2011
> groups
species class entry_year
1 rat mammal 2014
2 mouse mammal 2012
3 lizard reptile 2013
4 turtle reptile 2013
5 frog amphibia 2014
> merge(animalsdf, groups)
species entry_year length weight class
1 lizard 2013 3 5 reptile
2 rat 2014 5 3 mammal
Holy
inner join batman! So beware; I usually prefer to be a bit explicit:
> merge(animalsdf, groups, by.x="species", by.y="species")
species length weight entry_year.x class entry_year.y
1 lizard 3 5 2013 reptile 2013
2 lizard 5 6 2012 reptile 2013
3 mouse 4 4 2014 mammal 2012
4 rat 5 3 2014 mammal 2014
5 turtle 6 9 2011 reptile 2013
Call me paranoid (maybe), but being
explicit prevents future headaches (this is true in all code).
I also added a new entry to groups — frogs/amphibia. This is to illustrate that this really is an inner join of the data (well, if you want to get technical, a
natural join) and this entry of groups is lost since there are no frogs in animalsdf. There are some cases when you may want to do an
outer join:
> merge(animalsdf, groups, by.x="species", by.y="species", all=TRUE)
species length weight entry_year.x class entry_year.y
1 lizard 3 5 2013 reptile 2013
2 lizard 5 6 2012 reptile 2013
3 mouse 4 4 2014 mammal 2012
4 rat 5 3 2014 mammal 2014
5 turtle 6 9 2011 reptile 2013
6 frog NA NA <NA> amphibia 2014
If you expect everything in groups to be in animalsdf when joined by "species", doing an outer join and then comparing the number of rows and NAs before and after can be used to spot problems. Also, you can tinker with all.x and all.y to get
left- and
right outer joins.
Anyways, hope this helps — a few times I've done careless merge/joins that have later lead to me wanting to merge my head with a wall and join group therapy, so maybe this will prevent someone these headaches. For complex cases or any moderately large dataset, Patrick Kilduff is right about sqldf (or using SQLite and RSQLite) which is made to do this stuff (and for slightly larger than moderately large data, you can use indexing to speed things up).
Vince