Hi Haiying,
I am facing similar issues with the Elephant Bird libraries. I have a rather complex protobuf message, for which I had to modify Rahul Ravindran's repeatedFieldFix branch. For example, I had issues with fields declared as enums and byte arrays, as well as with messages that have certain levels of nesting. Still, Rahul's branch proved a very good starting point.
Now, about the issues I am facing: I ran a simple count(*) on a 52 TB table whose data is compressed with LZO. On a quite large cluster (200+), mappers start to time out and eventually throw OOM errors. I have tried an -Xmx of up to 32 GB, without any obvious improvement. I feel that the code has a nasty memory leak somewhere, or that we are just missing something obvious in the configs.
If you can share your message definition and some sample data, I would gladly try it out with the codebase I currently have, just to see if I get anything different. I have been working on the codebase for more than 3 weeks now, and while all basic queries work fine (they did not work out of the box, even with all of Rahul's fixes), there seem to be serious performance issues.
Did you try to partition your data, to see if it makes a difference?
Also, you say that with Pig everything worked fine?