First, keep in mind that the examples are in general not optimized but kept as simple as possible.
The number of reduce tasks defaults to one in many Hadoop distributions. You need to change that. One way is to supply this additional argument to mapreduce:
backend.parameters = list(hadoop= list(D = "mapred.reduce.tasks=10"))
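For reference, this is roughly where it goes in the call. A minimal sketch, assuming the wc.map and wc.reduce from the tutorial and a placeholder input path; adjust to your setup:

mapreduce(
  input = "/user/me/input",          # placeholder HDFS path, use your own
  input.format = "text",             # plain text lines, as in the tutorial
  map = wc.map,
  reduce = wc.reduce,
  combine = TRUE,                    # run the reduce as a combiner as well
  backend.parameters = list(
    hadoop = list(D = "mapred.reduce.tasks=10")))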
The 10 should actually be slightly less than the number of available reduce slots. This argument is deprecated because it was being abused, but I think this use case will be preserved by whatever replaces it.

The memory issue points to something else, like a word with a very large count. Even then, with the combiner on, it should work. Nonetheless I would check that the pattern is appropriate for your file; it assumes single-space-separated words. Another thing to try is a smaller local file, say 1 MB, on the local backend: do a debug(wc.map) and debug(wc.reduce) and see that they do what you expect (sketched below). Have you tried different data sizes and written down the times-to-completion for the map phase and the reduce phase? How do they grow? That usually gives some insight. Another thing you can do is add
rmr.str(word)
rmr.str(counts)
as the first two lines of the reduce function. It will write a lot to your standard error, so use it at smaller data sizes.
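To make that concrete, here is a minimal debugging sketch. I'm assuming your reduce is the one from the tutorial and the 1 MB sample path is a placeholder:

rmr.options(backend = "local")       # run everything inside the current R session

wc.reduce = function(word, counts) {
  rmr.str(word)                      # str()-style dump of the key, goes to standard error
  rmr.str(counts)                    # same for the vector of counts for that key
  keyval(word, sum(counts))          # the tutorial's original reduce logic
}

debug(wc.map)                        # the browser opens when the local run hits the map
mapreduce(
  input = "/tmp/sample-1mb.txt",     # placeholder: a small local file
  input.format = "text",
  map = wc.map,
  reduce = wc.reduce,
  combine = TRUE)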
This is just an example after all, and it hasn't been tested on more than one file. The debugging guidelines on the wiki can also serve, to some extent, as a performance troubleshooting guide. There are other things you could try to move the work to the map phase, but I would first understand what the problem is instead of trying solutions. My suspicion is that your lines are short and your vocabulary huge and uniformly distributed; in real life lines are longer and the vocabulary is smaller, with a Zipf distribution. That could explain speed issues but not memory issues, so we need to dig deeper to figure out what's wrong.
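Just to illustrate what I mean by moving work to the map phase (not the tutorial's code, only a sketch of one option): you could aggregate counts within each map call, so the shuffle sees one pair per distinct word per chunk instead of one pair per occurrence.

wc.map.aggregating = function(., lines) {
  words = unlist(strsplit(lines, split = " "))   # same single-space pattern as before
  tab = table(words)                             # count within this chunk of lines
  keyval(names(tab), as.vector(tab))             # emit (word, partial count) pairs
}

The reduce stays the same; each mapper just emits fewer, pre-aggregated pairs. But again, I would diagnose before reaching for this.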
Antonio