Sorting Protocol Buffers with Hadoop MapReduce

Owen O'Malley

Dec 16, 2010, 6:18:17 PM
to Protocol Buffers
All,
   I'm hooking ProtoBuf (as well as Avro and Thrift) into Hadoop MapReduce. For that to make sense, I need to be able to sort on the protobuf messages. Hadoop uses a compare function over the bytes of two serialized objects. Obviously, I could just use a memcmp, but that would lead to a sort order that is hard to explain to users. My compare function should instead give the obvious sort order, which will be much easier to understand.
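
To illustrate why a raw memcmp order is hard to explain, here is a minimal sketch (not taken from the linked code). It assumes an integer field encoded as a protobuf varint, which stores the low 7 bits first; the class and method names are hypothetical. Under a bytewise comparison, 255 ends up sorting after 256:

```java
import java.util.Arrays;

final class VarintOrderDemo {
    /** Encodes a non-negative value as a base-128 varint, low 7 bits first. */
    static byte[] encodeVarint(long value) {
        byte[] buf = new byte[10];
        int n = 0;
        while ((value & ~0x7FL) != 0) {
            // More bits remain: emit 7 bits with the continuation bit set.
            buf[n++] = (byte) ((value & 0x7F) | 0x80);
            value >>>= 7;
        }
        buf[n++] = (byte) value;
        return Arrays.copyOf(buf, n);
    }

    /** memcmp-style comparison: unsigned bytes, shorter array first on a tie. */
    static int compareBytes(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int diff = (a[i] & 0xFF) - (b[i] & 0xFF);
            if (diff != 0) {
                return diff;
            }
        }
        return a.length - b.length;
    }

    public static void main(String[] args) {
        byte[] small = encodeVarint(255); // bytes FF 01
        byte[] large = encodeVarint(256); // bytes 80 02
        // Positive result: bytewise, 255 sorts after 256.
        System.out.println(compareBytes(small, large) > 0);
    }
}
```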

The rough idea is that I iterate over the fields in id order and compare each pair of values according to the field's type. If a message is missing a field that has a default value, the default value is used in the comparison.
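
A rough sketch of that idea, for illustration only: it is not the linked code and it does not touch the wire format. It assumes every field is an integer and models a message as a map from field id to value, with a second map supplying the schema's defaults; `FieldOrderCompare` and its parameters are hypothetical names.

```java
import java.util.Map;
import java.util.TreeMap;

final class FieldOrderCompare {
    /**
     * Compares two messages field by field in ascending field-id order.
     * The keys of `defaults` stand in for the schema; a field missing from
     * a message is compared using its default value.
     */
    static int compare(Map<Integer, Long> a,
                       Map<Integer, Long> b,
                       Map<Integer, Long> defaults) {
        // TreeMap iterates its keys in ascending order, i.e. by field id.
        for (Map.Entry<Integer, Long> field : new TreeMap<>(defaults).entrySet()) {
            long left = a.getOrDefault(field.getKey(), field.getValue());
            long right = b.getOrDefault(field.getKey(), field.getValue());
            int result = Long.compare(left, right);
            if (result != 0) {
                return result; // first differing field decides the order
            }
        }
        return 0;
    }
}
```

With defaults of 0 for fields 1 and 2, a message that sets only field 1 to 5 compares equal to one that sets field 1 to 5 and field 2 to 0, because the missing field falls back to its default. A real comparator would also need a per-type rule for strings, nested messages, and repeated fields.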

The code is here: http://bit.ly/f8Scfo

It compiles and works in simple testing. I need to do more testing, but I thought I'd see if anyone here would be willing to take a look at my code for correctness and soundness.

-- Owen

Kenton Varda

Dec 22, 2010, 7:00:50 PM
to Owen O'Malley, Protocol Buffers
Hmm, you might want to consider parsing both messages into DynamicMessages (or into the generated message classes if you have them compiled in) and then using a protobuf-reflection-based algorithm (not to be confused with Java reflection; see Message#getField()).  Your current code is very much coupled to the wire format, and relies on tags being in a particular order (which is not technically guaranteed).  On the other hand, your code is probably faster than a reflection-based algorithm would be.
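
The tag-order point can be shown with a small sketch. It is a stand-in for the DynamicMessage approach, not the protobuf API: it assumes messages containing only varint fields (wire type 0) and uses hypothetical names. Decoding the fields into a map sorted by field number before comparing makes the result independent of the order in which tags appear on the wire:

```java
import java.util.TreeMap;

final class WireOrderDemo {
    /** Reads one base-128 varint starting at pos[0] and advances pos[0]. */
    static long readVarint(byte[] buf, int[] pos) {
        long result = 0;
        for (int shift = 0; shift < 64; shift += 7) {
            byte b = buf[pos[0]++];
            result |= (long) (b & 0x7F) << shift;
            if ((b & 0x80) == 0) {
                return result;
            }
        }
        throw new IllegalArgumentException("malformed varint");
    }

    /** Decodes varint-only wire bytes into a map sorted by field number. */
    static TreeMap<Integer, Long> parseVarintFields(byte[] wire) {
        TreeMap<Integer, Long> fields = new TreeMap<>();
        int[] pos = {0};
        while (pos[0] < wire.length) {
            long tag = readVarint(wire, pos);
            if ((tag & 7) != 0) {
                // Simplification: this sketch handles wire type 0 only.
                throw new IllegalArgumentException("only varint fields are handled");
            }
            // The tag is (field number << 3) | wire type; a later value
            // for the same field replaces an earlier one.
            fields.put((int) (tag >>> 3), readVarint(wire, pos));
        }
        return fields;
    }
}
```

For example, the byte sequences `08 01 10 02` and `10 02 08 01` both decode to field 1 = 1 and field 2 = 2, so they yield equal maps even though a comparison that walks the raw bytes in tag order would treat them differently.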
