Hi,
I am not 100% sure what is going on here, but I noticed that you call make.input.format with only three arguments. There is no guarantee that the default for the third argument works with every combination; in particular, a mode of "text" combined with the default sequence-file streaming format is untested and doesn't make a lot of sense to me. You may want to set that argument to NULL. I guess we could add a bit more logic in there that selects the default based on the mode, it's only a few lines of code, but it's just not clear which cases to cover. The idea is that there is an easy version of make.input.format (and likewise make.output.format) that takes a string format argument, in your case that would be make.input.format("csv", sep = ","), and an advanced version that takes three arguments: a format function, a mode ("text" or "binary"), and a java class. The steps are as follows:
your keyval pair -> R format function -> some output machinery that depends on the mode -> a java class to read the data back into hadoop.
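To make the two versions concrete, here is a rough sketch in R. This assumes rmr2 is loaded; the reader function my.csv.reader is a hypothetical placeholder, and the exact signature your format function needs may differ by rmr2 version, so treat this as an illustration, not a reference:

```r
library(rmr2)

# Easy version: name a built-in format and pass its options through.
in.fmt <- make.input.format("csv", sep = ",")

# Advanced version: an explicit format function, a mode, and a java class.
# With mode = "text" the safe choice for the java class is NULL, which
# lets streaming fall back to its default text handling.
my.csv.reader <- function(con, nrecs) {
  lines <- readLines(con, n = nrecs)
  if (length(lines) == 0) return(NULL)
  df <- read.csv(textConnection(lines), header = FALSE)
  keyval(NULL, df)  # no key; the whole record is the value
}
in.fmt.adv <- make.input.format(format = my.csv.reader,
                                mode = "text",
                                streaming.format = NULL)
```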
Unfortunately the last step is missing in the local implementation, as we don't have any java components there to execute the class. It is not a deal breaker, you can do almost anything without it, but it is the biggest divergence between the two backends. I understand the frustration, but we can't let the perfect get in the way of the good, and the two backends are almost equivalent, with the notable exception of these IO classes, which are simply ignored.
Now, on why the combiner breaks things. When you are using the default java class, SequenceFileFormat or some such, you are expected to generate typedbytes on the R side, not text. What I think is happening is this: in local mode the java side is ignored, so there is no problem. On hadoop with the combiner off, java can't make sense of what you are writing out, but it doesn't matter, because it never has to read it back in. When you turn the combiner on, the data has to be read back in to go into the reduce phase, and if java can't tell key from value, which it can't, because you didn't use the typedbytes format, it will protest.
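The failure mode can be sketched like this. This is a hypothetical word-count style job, not your code; the point is only that with combine = TRUE the intermediate data must be readable on the java side, while with combine = FALSE a format mismatch goes unnoticed because nothing reads the data back:

```r
library(rmr2)

# Hypothetical job: with combine = FALSE a text/typedbytes mismatch
# slips through; with combine = TRUE java must parse the intermediate
# key/value pairs to feed the combiner, and a mismatch surfaces there.
wc <- function(input, input.format) {
  mapreduce(input = input,
            input.format = input.format,
            map = function(k, v) keyval(v, 1),
            reduce = function(k, vv) keyval(k, sum(unlist(vv))),
            combine = TRUE)
}
```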
So in practice I suggest you use
make.output.format("csv", sep = ",")
just to see if it works with the combiner on. If it does, you may still want to use your own format for speed (this has been improved in 1.3, but for now the csv reader and writer are pretty sluggish). In that case, just add the argument streaming.format = NULL, which tells streaming to revert to its default behavior, which is text format. If neither of these works, I suspect we may be looking at a streaming bug. I don't have your data, so please give it a spin and let me know what you find. I am traveling, so I may not be as responsive as I would like, sorry about that.
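Concretely, the two things to try could look like this. Here my.format stands in for your existing format function, whatever it is called; this is a sketch under that assumption:

```r
library(rmr2)

# 1) Try the built-in csv writer first, just to see whether the job
#    succeeds with the combiner turned on:
out.fmt <- make.output.format("csv", sep = ",")

# 2) If that works but is too slow, keep your own text-mode format
#    (my.format is a placeholder for your function) and force streaming
#    back to its default text behavior:
out.fmt.custom <- make.output.format(format = my.format,
                                     mode = "text",
                                     streaming.format = NULL)
```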
Antonio