I wanted to bring to everybody's attention bug #58, whereby grouping in the reduce phase can be incomplete (no records are lost, but some groups are split). This happens when factors are used as keys, which is very common when working with data frames. The workaround is simple: use character vectors instead. Different ways to achieve this are discussed in this
thread. I am still thinking about what the long-term solution should be.
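
One possible way to apply the workaround is sketched below in base R, assuming the keys come from factor columns of a data frame. The example data and the helper name `factors_to_character` are made up for illustration and are not taken from the thread.

```r
# Hypothetical example data; the column names are illustrative only.
df <- data.frame(key = factor(c("a", "b", "a")), value = 1:3)

# Hypothetical helper: convert every factor column to character,
# leaving all other columns untouched.
factors_to_character <- function(d) {
  d[] <- lapply(d, function(col) if (is.factor(col)) as.character(col) else col)
  d
}

df_chr <- factors_to_character(df)
```

Alternatively, passing `stringsAsFactors = FALSE` to `data.frame()` or `read.csv()` should avoid creating factors in the first place.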