Scalding: Reading CSV from Snappy-compressed HDFS

ach...@box.com

Jul 26, 2013, 6:17:27 PM
to cascadi...@googlegroups.com
How can I tell Scalding to read a CSV file from HDFS that was compressed with Snappy?


Command: hadoop jar target/scalding-jobs-0.0.1.jar com.adelbertc.scalding.job.DummyJob --hdfs --input /adelbertc/ds+2013-07-25 --output /adelbertc/dummy

Error:

cascading.tuple.TupleException: unable to read from input identifier: hdfs://nameservice1/adelbertc/ds=2013-07-25/2013-07-26_1374860564-m-00001
	at cascading.tuple.TupleEntrySchemeIterator.hasNext(TupleEntrySchemeIterator.java:127)
	at cascading.flow.stream.SourceStage.map(SourceStage.java:76)
	at cascading.flow.stream.SourceStage.run(SourceStage.java:58)
	at cascading.flow.hadoop.FlowMapper.run(FlowMapper.java:127)
	at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:418)
	at org.apache.hadoop.mapred.MapTask.run(MapTask.java:333)
	at org.apache.hadoop.mapred.Child$4.run(Child.java:268)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:396)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1408)
	at org.apache.hadoop.mapred.Child.main(Child.java:262)
Caused by: cascading.tap.TapException: did not parse correct number of values from input data, expected: 37, got: 1:SEQ !org.apache.hadoop.io.NullWritable org.apache.hadoop.io.Text�z�֐��YɼS?�  �� �


I'm guessing this is because my files are Snappy-compressed and I didn't tell the job that. If it helps, here is the output I saw on the machine where I ran the hadoop command before it failed: http://pastie.org/private/dznuqca6xmyv2nxtvo4wa





Oscar Boykin

Jul 29, 2013, 12:19:15 PM
to cascadi...@googlegroups.com
We usually use LZO-compressed data, and to read it we use the code in scalding-commons with Lzo in the trait and class names (sorry, on mobile now).

To read Snappy data, I think you are going to have to write a Cascading SnappyScheme to handle the decompression. Maybe someone has already written one for Cascading?
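
A minimal sketch of that LZO pattern, assuming the LzoTsv source in com.twitter.scalding.commons.source (the class and field names here are illustrative; check them against the scalding-commons version you use):

import com.twitter.scalding._
import com.twitter.scalding.commons.source.LzoTsv

// Read LZO-compressed, tab-delimited data via the scalding-commons source
// and write it back out as plain TSV. The field names are placeholders.
class LzoReadJob(args: Args) extends Job(args) {
  LzoTsv(args("input"), ('user, 'count))
    .read
    .write(Tsv(args("output")))
}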
--
Oscar Boykin :: @posco :: http://twitter.com/posco

Sergey Malov

Mar 19, 2014, 11:55:42 AM
to cascadi...@googlegroups.com
Hi,
Has anything changed as far as Snappy compression support is concerned? Perhaps it is possible through some MR settings, since MR jobs read Snappy files with no problems?

Thank you

Sergey Malov



Oscar Boykin

Mar 25, 2014, 12:25:48 PM
to cascadi...@googlegroups.com
I have not read Snappy data myself, but I think you can add Snappy compression support at the Hadoop layer. Normal Hadoop options can be passed with:

-Dsomeoption=somevalue

by putting those args BEFORE the job name when you pass them to scalding.Tool.
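
For example, reusing the command from the original post (the io.compression.codecs value is illustrative; the exact property your cluster needs for Snappy may differ):

hadoop jar target/scalding-jobs-0.0.1.jar \
  -Dio.compression.codecs=org.apache.hadoop.io.compress.DefaultCodec,org.apache.hadoop.io.compress.SnappyCodec \
  com.adelbertc.scalding.job.DummyJob --hdfs --input /adelbertc/ds=2013-07-25 --output /adelbertc/dummy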

Alternatively, you can set common options inside a job base class by overriding the config method on Job:
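
(What follows is a minimal sketch, assuming a Scalding version where Job.config returns Map[AnyRef, AnyRef]; the io.compression.codecs value is again illustrative.)

import com.twitter.scalding._

// Base class that merges extra Hadoop properties into the job configuration.
// The property name and codec list below are illustrative only; check what
// your cluster actually requires for Snappy.
class SnappyAwareJob(args: Args) extends Job(args) {
  override def config: Map[AnyRef, AnyRef] =
    super.config ++ Map(
      "io.compression.codecs" ->
        ("org.apache.hadoop.io.compress.DefaultCodec," +
         "org.apache.hadoop.io.compress.SnappyCodec"))
}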




Sergey Malov

Mar 25, 2014, 3:21:51 PM
to cascadi...@googlegroups.com
Yes, that's exactly how I did it; it works fine.

Sergey Malov

Oren Benjamin

Mar 28, 2014, 10:51:24 AM
to cascadi...@googlegroups.com
Hi Sergey,

Were you able to load your Snappy-compressed file successfully? If so, could you elaborate on your solution?

Thanks,
    -- Oren