Serialising protobuf with zlib encoding

Stephen Knox

Feb 23, 2015, 10:49:19 AM

to elephant...@googlegroups.com

Hi all,

Newbie here.

I am interested in parsing a protobuf file format (OSM PBF described here) in Hadoop, which led me to elephant-bird. Having looked through the source and issues, I see that everything was originally built for LZO compression, but that there have been efforts to accommodate other compression formats in the code base (MultiInputFormat). However, the MultiInputFormat class currently only implements LZO compression, whereas the format I am interested in parsing uses zlib compression for its data blocks.

So my questions are:

- As a high level task, how complicated would it be to add zlib encoding to the elephant-bird library (pointers only, I realise I have a lot more RTFMing / UTSLing to do)?

- If I did manage to implement this, would gzip compression be welcomed into the library, subject to suitable levels of documentation and test coverage? (I am assuming from the MultiInputFormat documentation and pull requests that this is a yes, but I wonder whether I am missing something about the characteristics of LZO vs zlib within Hadoop and about how file blocks are split.)

Many thanks for your help

Stephen Knox

Raghu Angadi

Feb 23, 2015, 12:16:03 PM

to elephant...@googlegroups.com, Stephen Knox

Of course, contributions are very much welcome.

Can you describe your use case a bit better? Specifically, by 'zlib compression' do you mean sequence files with zlib compression? What is your app going to be written in (Hive, Pig, Scalding, Java MR, etc.)?

Unlike most other file formats, the LZOP files supported in MultiInputFormat are split using a separate 'index file' (if you have a large input.lzo file, it requires an input.lzo.index file to split it in Hadoop). This makes it different from other file formats in a fundamental way.
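
To illustrate why an index file matters for splitting, here is a minimal sketch in Python. It assumes the index is a flat sequence of 8-byte big-endian offsets, one per compressed block; that layout, and the helper names read_index and aligned_splits, are assumptions made for illustration and are not taken from elephant-bird or hadoop-lzo. The point is only that split boundaries have to land on block starts, which a reader cannot find without the index.

```python
import struct
from typing import List, Tuple


def read_index(index_bytes: bytes) -> List[int]:
    """Parse an index ASSUMED to be consecutive 8-byte big-endian block offsets."""
    count = len(index_bytes) // 8
    return list(struct.unpack(f">{count}q", index_bytes[: count * 8]))


def aligned_splits(
    block_offsets: List[int], file_length: int, target_size: int
) -> List[Tuple[int, int]]:
    """Return (start, length) splits whose boundaries fall on block starts.

    A split is closed at the first block offset that lies at least
    target_size bytes past the current split start.
    """
    if file_length <= 0:
        return []
    splits = []
    start = 0
    for offset in sorted(block_offsets):
        # Ignore offsets that cannot serve as a boundary inside the file.
        if offset <= start or offset >= file_length:
            continue
        if offset - start >= target_size:
            splits.append((start, offset - start))
            start = offset
    # The last split runs to the end of the file.
    splits.append((start, file_length - start))
    return splits
```

Without the index, the only safe choice is a single split covering the whole file, which is what aligned_splits returns when block_offsets is empty.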

Stephen Knox

Feb 24, 2015, 10:36:58 AM

to elephant...@googlegroups.com, stephe...@gmail.com

Hi Raghu,

The use case is a file format where the headers are un-encoded, but most of the data is contained within blocks which are compressed with zlib, then uncompressed at parse time.

This is an example of the data parts of the format as protobuf:

 message Blob {
   optional bytes raw = 1; // No compression
   optional int32 raw_size = 2; // Only set when compressed, to the uncompressed size
   optional bytes zlib_data = 3;
   // optional bytes lzma_data = 4; // PROPOSED.
   // optional bytes OBSOLETE_bzip2_data = 5; // Deprecated.
 }
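
For illustration, unpacking one such Blob at parse time could look roughly like the sketch below. It uses Python's standard zlib module and takes the three fields as plain arguments rather than a generated protobuf class; the function name blob_payload is hypothetical, and the check against raw_size is my own assumption about how that field would be used.

```python
import zlib
from typing import Optional


def blob_payload(
    raw: Optional[bytes] = None,
    raw_size: Optional[int] = None,
    zlib_data: Optional[bytes] = None,
) -> bytes:
    """Return the uncompressed payload of a Blob-like record."""
    if raw is not None:
        # Uncompressed case: the bytes are stored as-is.
        return raw
    if zlib_data is not None:
        data = zlib.decompress(zlib_data)
        # raw_size, when present, is the expected uncompressed length.
        if raw_size is not None and len(data) != raw_size:
            raise ValueError(
                f"expected {raw_size} bytes after decompression, got {len(data)}"
            )
        return data
    raise ValueError("Blob has neither raw nor zlib_data set")
```

Each Blob is decompressed independently here, so a reader that knows where a Blob starts should not need any of the preceding data to decode it.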

The app will probably be written in Pig.

This does sound fundamentally different from what is already in elephant-bird. Perhaps there is not much value in using it as a basis?

Stephen
