Tsv output compression in scalding


John

Nov 14, 2012, 6:20:17 AM
to cascadi...@googlegroups.com
Hello,

I am struggling with a simple task: I would like to compress the entire Tsv output of a Scalding job.

Basically, the equivalent of this plain old mapreduce job configuration:
        conf.setOutputFormat(TextOutputFormat.class);
        TextOutputFormat.setCompressOutput(conf, true);
        TextOutputFormat.setOutputCompressorClass(conf, BZip2Codec.class);

I tried to override the config method in my scalding job like this:
        override def config(implicit mode: Mode) =
          super.config ++ Map(
            "mapred.output.compress" -> "true",
            "mapred.output.compression.codec" -> "org.apache.hadoop.io.compress.BZip2Codec")

But it somehow gets ignored: when I look at the job configuration in the JobTracker web interface, it says mapred.output.compress = false.

Any help would be greatly appreciated, thanks in advance

John

Oscar Boykin

Nov 14, 2012, 7:19:38 PM
to cascadi...@googlegroups.com
What version of scalding are you using?

I expect this to work too, so I'm a little confused.



--
You received this message because you are subscribed to the Google Groups "cascading-user" group.
To view this discussion on the web visit https://groups.google.com/d/msg/cascading-user/-/L46JE-L3LDUJ.
To post to this group, send email to cascadi...@googlegroups.com.
To unsubscribe from this group, send email to cascading-use...@googlegroups.com.
For more options, visit this group at http://groups.google.com/group/cascading-user?hl=en.



--
Oscar Boykin :: @posco :: https://twitter.com/intent/user?screen_name=posco

Koert Kuipers

Nov 14, 2012, 8:19:11 PM
to cascadi...@googlegroups.com
i noticed that scalding in general ignores the compression setting in our mapred-site.xml

we are using 0.8.1

John

Nov 15, 2012, 9:51:33 AM
to cascadi...@googlegroups.com
I am using scalding 0.8.0 with scala 2.9.2

Koert

Jun 22, 2013, 5:42:25 PM
to cascadi...@googlegroups.com
i ran into this issue again, and realized it's because cascading's TextLine and TextDelimited have sinkCompression set to Compress.DISABLE by default, which means they override whatever was set in the jobConf. i find this confusing; however, i don't think this will change in Cascading.

how about if in scalding we have all the Source classes that use TextLine or TextDelimited set sinkCompression to Compress.DEFAULT? that way i think (have to check) the sinks will respect mapred.output.compress
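
To make the override concrete, here is a minimal, self-contained Java sketch of the behaviour described above. It does not use Cascading or Hadoop: the Compress enum, the applySinkCompression method and the class name are hypothetical stand-ins, and a plain Map plays the role of the jobConf. It only models the idea that DISABLE overwrites the job setting while DEFAULT leaves it alone; the real scheme may differ in detail.

```java
import java.util.HashMap;
import java.util.Map;

// Toy model only: none of these names come from Cascading or Hadoop.
class SinkCompressionModel {
    // Hypothetical stand-in for the three-valued sink compression setting.
    enum Compress { DEFAULT, ENABLE, DISABLE }

    static final String KEY = "mapred.output.compress";

    // Models what a sink might do to the job configuration when it is initialised.
    static Map<String, String> applySinkCompression(Map<String, String> jobConf, Compress setting) {
        Map<String, String> result = new HashMap<>(jobConf);
        if (setting == Compress.DISABLE) {
            result.put(KEY, "false");   // overrides whatever the job asked for
        } else if (setting == Compress.ENABLE) {
            result.put(KEY, "true");    // forces compression on
        }
        // DEFAULT: leave the job's own value untouched
        return result;
    }

    public static void main(String[] args) {
        Map<String, String> jobConf = new HashMap<>();
        jobConf.put(KEY, "true"); // what the job's config override requested

        System.out.println("DISABLE -> " + applySinkCompression(jobConf, Compress.DISABLE).get(KEY));
        System.out.println("DEFAULT -> " + applySinkCompression(jobConf, Compress.DEFAULT).get(KEY));
    }
}
```

In this model the job's "true" survives only under DEFAULT, which matches the symptom of mapred.output.compress showing up as false in the job configuration.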




Călin-Andrei Burloiu

Aug 1, 2013, 4:01:06 AM
to cascadi...@googlegroups.com
I opened an issue on GitHub for the Scalding project.



Ahsan Rabbani

Jun 10, 2014, 11:48:53 AM
to cascadi...@googlegroups.com
Has it been resolved yet? Could you please post the link to the issue?

Ahsan Rabbani

Jun 10, 2014, 12:29:18 PM
to cascadi...@googlegroups.com
Answering my own question: it looks like it hasn't been resolved.

m.orazow

Jun 17, 2014, 1:39:47 PM
to cascadi...@googlegroups.com
I made a workaround for this. Apparently, in Scalding we need to change only a single line to enable compressed output for delimited text.

This line,
HadoopSchemeInstance(new CHTextDelimited(fields, null, skipHeader, writeHeader, separator, strict, quote, types, safe))
creates the Cascading TextDelimited class with sinkCompression set to null (the second argument), so Cascading automatically disables compression.
The description of setSinkCompression says as much: "Method setSinkCompression sets the sinkCompression of this TextLine object. If null, compression will remain disabled."

I created a small project that applies this change to produce compressed Tsv output: https://github.com/morazow/WordCount-Compressed

I also submitted a pull request to the Scalding source that changes the 'null' above to 'TextLine.Compress.DEFAULT': https://github.com/twitter/scalding/pull/903
With this change, compression settings made in the jobConf should take effect.
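
As a rough illustration of why passing null keeps compression off, here is a small self-contained Java sketch. It is a toy model, not Cascading's code: the TextScheme class and its Compress enum are hypothetical, the initial DISABLE value is taken from the earlier message in this thread, and the null handling follows the setSinkCompression description quoted above.

```java
// Toy model only: TextScheme and Compress are hypothetical names.
class NullSinkCompressionModel {
    enum Compress { DEFAULT, ENABLE, DISABLE }

    static class TextScheme {
        // Assumed initial value, as described earlier in the thread.
        private Compress sinkCompression = Compress.DISABLE;

        TextScheme(Compress sinkCompression) {
            setSinkCompression(sinkCompression);
        }

        // A null argument is ignored, so the initial DISABLE stays in place.
        void setSinkCompression(Compress sinkCompression) {
            if (sinkCompression != null) {
                this.sinkCompression = sinkCompression;
            }
        }

        Compress getSinkCompression() {
            return sinkCompression;
        }
    }

    public static void main(String[] args) {
        // Passing null, as the original Scalding line does
        System.out.println("null    -> " + new TextScheme(null).getSinkCompression());
        // Passing DEFAULT, as the pull request proposes
        System.out.println("DEFAULT -> " + new TextScheme(Compress.DEFAULT).getSinkCompression());
    }
}
```

Under these assumptions the null case ends up as DISABLE, while the DEFAULT case leaves the decision to the job configuration.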