Scoobi Sequence files in /tmp/scoobi-user directory not compressed?

shil...@gmail.com

Oct 13, 2014, 3:06:16 PM
to scoob...@googlegroups.com
In a multi-step job, Scoobi writes the output of each step as sequence files to the /tmp/scoobi-<user>/ directory, but these sequence files are not compressed. Is there a way to enable compression for these temporary sequence files?

Here are the parameters I see that Scoobi sets:

mapreduce.output.fileoutputformat.outputdir /tmp/scoobi-user/ReportingProcess$-1009-192939-832207653/tmp-out-step_2_of_6
scoobi.output.213:405.format org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat
scoobi.output.213:405.key org.apache.hadoop.io.NullWritable
scoobi.output.213:405.value BSd857af22-5a4f-4cf2-b5b4-0c1a66abfb6f
scoobi.output.231:451.format org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat
scoobi.output.231:451.key org.apache.hadoop.io.NullWritable
scoobi.output.231:451.value BS6b1148ad-0c5e-4c07-bb3b-47f3165acaf3

The compression params are also set (they come from mapred-site.xml and show up in the job.xml):

mapreduce.map.output.compress true
mapreduce.map.output.compress.codec com.hadoop.compression.lzo.LzoCodec
mapreduce.output.fileoutputformat.compress true
mapreduce.output.fileoutputformat.compress.codec org.apache.hadoop.io.compress.GzipCodec
mapreduce.output.fileoutputformat.compress.type BLOCK
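
For reference, these settings expressed through the plain Hadoop API would look like the sketch below (standard Hadoop 2.x calls only, not Scoobi code; the object and method names are just mine for illustration):

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.io.SequenceFile.CompressionType
import org.apache.hadoop.io.compress.GzipCodec
import org.apache.hadoop.mapreduce.Job
import org.apache.hadoop.mapreduce.lib.output.{FileOutputFormat, SequenceFileOutputFormat}

object OutputCompression {
  // Programmatic equivalent of the mapreduce.output.fileoutputformat.compress* properties above.
  def withOutputCompression(conf: Configuration): Job = {
    val job = Job.getInstance(conf)
    FileOutputFormat.setCompressOutput(job, true)                                  // ...compress = true
    FileOutputFormat.setOutputCompressorClass(job, classOf[GzipCodec])             // ...compress.codec = GzipCodec
    SequenceFileOutputFormat.setOutputCompressionType(job, CompressionType.BLOCK)  // ...compress.type = BLOCK
    job
  }
}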

Eric Torreborre

Oct 14, 2014, 9:34:56 PM
to scoob...@googlegroups.com
There is no way to do that at the moment but that seems very doable.

Unfortunately I can't spend much time on Scoobi at the moment, but pull requests are welcome as usual :-)

and just call compressWith on the newly created bridge, then you should see those intermediate files being compressed.
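
Something along these lines, as a rough sketch: only compressWith mirrors the real call, the bridge trait below is just a stand-in for Scoobi's internal intermediate sink:

import org.apache.hadoop.io.SequenceFile.CompressionType
import org.apache.hadoop.io.compress.{CompressionCodec, GzipCodec}

// Stand-in for the intermediate bridge that backs the tmp-out-step_* sequence files;
// only compressWith is meant to mirror the real API, the rest is illustrative.
trait IntermediateBridge {
  def compressWith(codec: CompressionCodec,
                   compressionType: CompressionType = CompressionType.BLOCK): IntermediateBridge
}

object CompressBridge {
  // Call this right where the bridge for a step is created, so the intermediate
  // sequence files get written gzip/BLOCK compressed.
  def apply(bridge: IntermediateBridge): IntermediateBridge =
    bridge.compressWith(new GzipCodec)
}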

Eric.