Output compression with LzoTraits


Phil Kallos

Apr 22, 2014, 10:38:27 PM
to cascadi...@googlegroups.com
I am trying to build a Scalding job that reads LZO data and writes it out as LZO-compressed TSV. I can already read LZO-compressed data and write it as uncompressed TSV. The next step is to turn on LZO compression for the output. To do this I've used some of the code from LzoTraits, as follows:

case class LzoTsvSource(p: String) extends FixedPathSource(p: String) with LzoTsv

I am able to run this against my local Hadoop v1.0.3 installation, and the output is correctly written to HDFS as TSV with LZO compression. Fantastic!
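For context, here is a minimal sketch of how a source like the one above can serve as both input and sink in a job. It assumes the input is also LZO-compressed TSV and that scalding-commons and hadoop-lzo are on the classpath; the job name and the --input/--output arguments are hypothetical, not taken from my actual job.

```scala
import com.twitter.scalding._
import com.twitter.scalding.commons.source.LzoTsv

// Same source as above, repeated so the sketch is self-contained.
case class LzoTsvSource(p: String) extends FixedPathSource(p) with LzoTsv

// Hypothetical job: the class name and argument names are illustrative only.
class LzoTsvCopyJob(args: Args) extends Job(args) {
  LzoTsvSource(args("input"))            // assumption: input is LZO-compressed TSV
    .read                                // open the source as a pipe
    .write(LzoTsvSource(args("output"))) // sink through the LzoTsv scheme
}
```

Exact package names and trait members can differ between Scalding versions, so treat this as a sketch rather than a drop-in job.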

However, when I run this job on Amazon EMR with Hadoop v1.0.3, the job runs successfully but does not actually write any files to the output directory (save for the _SUCCESS file). I've poked around, and my local instance is configured almost identically (as far as mapred-site.xml, core-site.xml ...).

Does anybody have any pointers on how I could debug this? The job logs show
INFO cascading.flow.hadoop.FlowMapper: sinking to: Hfs["LzoTextDelimited[[..]]"]
and
INFO org.apache.hadoop.io.compress.CodecPool: Got brand-new compressor

Alternatively, it seems like I should be able to specify -Dmapred.output.compress=true, but then I found https://github.com/twitter/scalding/issues/533, which seems to indicate that this may not behave as expected.

Any guidance is much appreciated, thanks!