Scalding: Reading CSV from Snappy-compressed HDFS

ach...@box.com

Jul 26, 2013, 6:17:27 PM
to cascadi...@googlegroups.com
How can I tell Scalding to read a CSV file from HDFS that was compressed with Snappy?


Command: hadoop jar target/scalding-jobs-0.0.1.jar com.adelbertc.scalding.job.DummyJob --hdfs --input /adelbertc/ds+2013-07-25 --output /adelbertc/dummy

Error:

cascading.tuple.TupleException: unable to read from input identifier: hdfs://nameservice1/adelbertc/ds=2013-07-25/2013-07-26_1374860564-m-00001
	at cascading.tuple.TupleEntrySchemeIterator.hasNext(TupleEntrySchemeIterator.java:127)
	at cascading.flow.stream.SourceStage.map(SourceStage.java:76)
	at cascading.flow.stream.SourceStage.run(SourceStage.java:58)
	at cascading.flow.hadoop.FlowMapper.run(FlowMapper.java:127)
	at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:418)
	at org.apache.hadoop.mapred.MapTask.run(MapTask.java:333)
	at org.apache.hadoop.mapred.Child$4.run(Child.java:268)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:396)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1408)
	at org.apache.hadoop.mapred.Child.main(Child.java:262)
Caused by: cascading.tap.TapException: did not parse correct number of values from input data, expected: 37, got: 1:SEQ !org.apache.hadoop.io.NullWritable org.apache.hadoop.io.Text�z�֐��YɼS?�  �� �


I'm guessing this is because my files are Snappy-compressed and I didn't tell the job that. If it helps, here is the output I saw on the machine where I ran the hadoop command before it failed: http://pastie.org/private/dznuqca6xmyv2nxtvo4wa





Oscar Boykin

Jul 29, 2013, 12:19:15 PM
to cascadi...@googlegroups.com
We usually use LZO-compressed data, and to read it we use the code in scalding-commons with Lzo in the trait and class names (sorry, on mobile now).

To read Snappy data, I think you are going to have to write a Cascading SnappyScheme to handle the decompression. Maybe someone has already written one for Cascading?
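
A minimal sketch of that LZO pattern, assuming the LzoTsv source in com.twitter.scalding.commons.source (the class and field names here are illustrative; check them against the scalding-commons version you use):

import com.twitter.scalding._
import com.twitter.scalding.commons.source.LzoTsv

// Read LZO-compressed, tab-delimited data via the scalding-commons source
// and write it back out as plain TSV. The field names are placeholders.
class LzoReadJob(args: Args) extends Job(args) {
  LzoTsv(args("input"), ('user, 'count))
    .read
    .write(Tsv(args("output")))
}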
--
Oscar Boykin :: @posco :: http://twitter.com/posco

Sergey Malov

Mar 19, 2014, 11:55:42 AM
to cascadi...@googlegroups.com
Hi,
Has anything changed as far as Snappy compression support is concerned? Perhaps it is possible through some MR settings, since MR jobs read Snappy files with no problems?

Thank you

Sergey Malov



Oscar Boykin

Mar 25, 2014, 12:25:48 PM
to cascadi...@googlegroups.com
I have not read Snappy data myself, but I think you can add Snappy compression support at the Hadoop layer. Normal Hadoop options can be passed with:

-Dsomeoption=somevalue

by putting those args BEFORE the job name when you pass them to scalding.Tool.
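
For example, reusing the command from the original post (the io.compression.codecs value is illustrative; the exact property your cluster needs for Snappy may differ):

hadoop jar target/scalding-jobs-0.0.1.jar \
  -Dio.compression.codecs=org.apache.hadoop.io.compress.DefaultCodec,org.apache.hadoop.io.compress.SnappyCodec \
  com.adelbertc.scalding.job.DummyJob --hdfs --input /adelbertc/ds=2013-07-25 --output /adelbertc/dummy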

Alternatively, you can set common options inside a job base class by overriding the config method on Job:
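
(What follows is a minimal sketch, assuming a Scalding version where Job.config returns Map[AnyRef, AnyRef]; the io.compression.codecs value is again illustrative.)

import com.twitter.scalding._

// Base class that merges extra Hadoop properties into the job configuration.
// The property name and codec list below are illustrative only; check what
// your cluster actually requires for Snappy.
class SnappyAwareJob(args: Args) extends Job(args) {
  override def config: Map[AnyRef, AnyRef] =
    super.config ++ Map(
      "io.compression.codecs" ->
        ("org.apache.hadoop.io.compress.DefaultCodec," +
         "org.apache.hadoop.io.compress.SnappyCodec"))
}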




Sergey Malov

Mar 25, 2014, 3:21:51 PM
to cascadi...@googlegroups.com
Yes, that's exactly how I did it; it works fine.

Sergey Malov

Oren Benjamin

Mar 28, 2014, 10:51:24 AM
to cascadi...@googlegroups.com
Hi Sergey,

Were you able to load your Snappy-compressed file successfully? If so, could you elaborate on your solution?

Thanks,
    -- Oren