Newbie here.
I am interested in parsing a protobuf file format (OSM PBF, described here) in Hadoop, which led me to elephant-bird. I have looked through the source and the issues and found that everything was originally built for LZO compression, but that there have been efforts to accommodate other compression formats in the code base (MultipleInputFormat). The MultipleInputFormat class currently only implements LZO compression. The format I am interested in parsing uses zlib compression for its data blocks.
So my questions are:
- As a high-level task, how complicated would it be to add zlib compression to the elephant-bird library? (Pointers only; I realise I have a lot more RTFMing / UTSLing to do.)
- If I did manage to implement this, would zlib compression be welcomed into the library, subject to suitable levels of documentation and test coverage? I am assuming from the MultipleInputFormat documentation and pull requests that the answer is yes, but I wonder whether I am missing something about the characteristics of LZO vs zlib within Hadoop and about how file blocks are split.
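
To make the zlib part concrete, here is a minimal sketch of inflating a single zlib-compressed data block using only the JDK's `java.util.zip` classes. It is not elephant-bird code and not a proposed patch. The class name `ZlibBlockSketch` and both method names are made up for illustration, and I am assuming (from my reading of the format description) that each block's uncompressed size is known before inflating.

```java
import java.io.ByteArrayOutputStream;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

// Hypothetical helper, for illustration only; not part of elephant-bird.
final class ZlibBlockSketch {

    // Compress bytes into a zlib-wrapped stream. Only needed to produce
    // sample input for inflate() below.
    static byte[] deflate(byte[] raw) {
        Deflater deflater = new Deflater(); // default: zlib header and trailer
        deflater.setInput(raw);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[4096];
        while (!deflater.finished()) {
            int n = deflater.deflate(buf);
            out.write(buf, 0, n);
        }
        deflater.end();
        return out.toByteArray();
    }

    // Inflate one zlib-compressed block whose uncompressed size is known.
    // Assumption: the container format records that size next to the block.
    static byte[] inflate(byte[] zlibData, int rawSize) {
        Inflater inflater = new Inflater();
        inflater.setInput(zlibData);
        byte[] out = new byte[rawSize];
        int off = 0;
        try {
            while (off < rawSize && !inflater.finished()) {
                int n = inflater.inflate(out, off, rawSize - off);
                if (n == 0 && (inflater.needsInput() || inflater.needsDictionary())) {
                    break; // truncated input or preset dictionary required
                }
                off += n;
            }
        } catch (DataFormatException e) {
            throw new IllegalArgumentException("not a valid zlib stream", e);
        } finally {
            inflater.end();
        }
        if (off != rawSize) {
            throw new IllegalArgumentException(
                "expected " + rawSize + " bytes, inflated " + off);
        }
        return out;
    }
}
```

The sketch handles one block in isolation. How such blocks would line up with Hadoop input splits is exactly the part I am unsure about.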