How to use 4mc (fourMC) compressed data as an input source in Cascading


Neha Kumari

Nov 9, 2017, 2:23:38 AM
to cascading-user
The schemes supported by Cascading are TextDelimited and SequenceFile. The input I want to provide is compressed, and I want my GlobHfs to understand and process it. How can I do that?

Ken Krugler

Nov 9, 2017, 9:14:44 AM
to cascadi...@googlegroups.com
If the text file is compressed (as <name>.gz) then Cascading will handle that automatically.

If it's a Hadoop sequence file (not exactly the same as a Cascading SequenceFile) that was compressed by the standard Hadoop output formats, then Cascading will be able to read it, as long as the same compression codec is available when you run your Cascading job.

If it's a custom compression format, you'll have to write your own input Scheme to read it.
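For the gzipped-text case, a minimal sketch of a source tap might look like the following (the path pattern, field names, and delimiter are placeholders, not anything from your job):

import cascading.scheme.hadoop.TextDelimited;
import cascading.tap.Tap;
import cascading.tap.hadoop.GlobHfs;
import cascading.tuple.Fields;

public class GzipTextSource {

    // Builds a source tap over gzipped, delimited text files.
    // Hadoop decompresses .gz input transparently, so no extra codec setup is needed.
    public static Tap makeGzipTextSource() {
        // Field names and delimiter are placeholders for whatever your data actually contains.
        TextDelimited scheme = new TextDelimited(new Fields("user", "value"), "\t");

        // GlobHfs matches multiple part files; the path below is a placeholder.
        return new GlobHfs(scheme, "hdfs://namenode/path/to/input/*.gz");
    }
}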

— Ken



--------------------------
Ken Krugler
custom big data solutions & training
Hadoop, Cascading, Cassandra & Solr

Neha Kumari

Nov 9, 2017, 10:17:17 AM
to cascading-user
Hey Ken, thanks for your reply. The input file is a 4mc file, i.e. abc.4mc. Will Cascading automatically handle this?

Ken Krugler

Nov 9, 2017, 11:36:24 AM
to cascadi...@googlegroups.com

If you set up your source format (Scheme) as a WritableSequenceFile then it should work, though you need to know the types used for the keys and values.
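For example, something along these lines (a sketch only: LongWritable/Text for the key and value types is just a guess, so substitute whatever Writable classes the job that wrote the file actually used):

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;

import cascading.scheme.hadoop.WritableSequenceFile;
import cascading.tap.Tap;
import cascading.tap.hadoop.Hfs;
import cascading.tuple.Fields;

public class SeqFileSource {

    // Builds a source tap over a Hadoop sequence file, exposing the key and value
    // as the fields "key" and "value". The Writable types must match what was written.
    public static Tap makeSequenceFileSource(String inputPath) {
        WritableSequenceFile scheme =
            new WritableSequenceFile(new Fields("key", "value"), LongWritable.class, Text.class);

        return new Hfs(scheme, inputPath);
    }
}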

— Ken




Neha Kumari

Nov 10, 2017, 2:21:27 AM
to cascading-user
@Ken - that gave an error for me:

Error: java.io.IOException: hdfs://path/to/my/input/4mcfile/-r-00000.4mc not a SequenceFile
    at org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:1853)
    at org.apache.hadoop.io.SequenceFile$Reader.initialize(SequenceFile.java:1813)
    at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1762)
    at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1776)
    at org.apache.hadoop.mapred.SequenceFileRecordReader.<init>(SequenceFileRecordReader.java:49)
    at org.apache.hadoop.mapred.SequenceFileInputFormat.getRecordReader(SequenceFileInputFormat.java:64)
    at cascading.tap.hadoop.io.MultiInputFormat$1.operate(MultiInputFormat.java:253)
    at cascading.tap.hadoop.io.MultiInputFormat$1.operate(MultiInputFormat.java:248)
    at cascading.util.Util.retry(Util.java:753)
    at cascading.tap.hadoop.io.MultiInputFormat.getRecordReader(MultiInputFormat.java:247)
    at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.<init>(MapTask.java:169)
    at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:432)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:343)
    at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:168)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
    at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:162)


Ken Krugler

Nov 10, 2017, 9:15:37 AM
to cascadi...@googlegroups.com
Are you processing the .4mc file on the same Hadoop system that created it?

I.e., do you have the required 4mc codec installed on the system where you're running the Cascading workflow?
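If not, you'd typically add the 4mc codec to Hadoop's codec list, either in core-site.xml or on the job configuration. A sketch (the codec class name here is my best recollection of the hadoop-4mc project's packaging, so verify it against the 4mc jar you actually have installed):

import org.apache.hadoop.mapred.JobConf;

public class FourMcCodecConfig {

    // Registers the 4mc codec alongside the standard Hadoop codecs so that
    // .4mc files can be decompressed. The FourMcCodec class name is an
    // assumption -- check it against the hadoop-4mc jar on your cluster.
    public static JobConf withFourMcCodec(JobConf conf) {
        conf.set("io.compression.codecs",
            "org.apache.hadoop.io.compress.DefaultCodec,"
                + "org.apache.hadoop.io.compress.GzipCodec,"
                + "com.hadoop.compression.fourmc.FourMcCodec");
        return conf;
    }
}

That said, the codec only covers decompression. Since the error says the file is "not a SequenceFile", the WritableSequenceFile scheme may simply be the wrong fit for .4mc data, which points back to the earlier suggestion of writing your own input Scheme around the 4mc project's input format.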

— Ken
 
