How to use 4mc (fourMC) compressed data as an input source in Cascading


Neha Kumari

Nov 9, 2017, 2:23:38 AM
to cascading-user
The schemes supported by Cascading are TextDelimited and SequenceFile. The input I want to provide is compressed, and I want my GlobHfs to understand and process it. How can I do that?

Ken Krugler

Nov 9, 2017, 9:14:44 AM
to cascadi...@googlegroups.com
If the text file is compressed (as <name>.gz) then Cascading will handle that automatically.

If it's a Hadoop sequence file (not exactly the same as a Cascading SequenceFile) that was compressed by the standard Hadoop output formats, then Cascading will be able to read it, as long as the same compression codec is available when you run your Cascading job.

If it's a custom compression format, you'll have to write your own input Scheme to read it.
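For the gzipped-text case, a minimal sketch of a source tap might look like the following (the path pattern, field names, and delimiter are placeholders, not anything from your job):

import cascading.scheme.hadoop.TextDelimited;
import cascading.tap.Tap;
import cascading.tap.hadoop.GlobHfs;
import cascading.tuple.Fields;

public class GzipTextSource {

    // Builds a source tap over gzipped, delimited text files.
    // Hadoop decompresses .gz input transparently, so no extra codec setup is needed.
    public static Tap makeGzipTextSource() {
        // Field names and delimiter are placeholders for whatever your data actually contains.
        TextDelimited scheme = new TextDelimited(new Fields("user", "value"), "\t");

        // GlobHfs matches multiple part files; the path below is a placeholder.
        return new GlobHfs(scheme, "hdfs://namenode/path/to/input/*.gz");
    }
}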

— Ken



--------------------------
Ken Krugler
custom big data solutions & training
Hadoop, Cascading, Cassandra & Solr

Neha Kumari

Nov 9, 2017, 10:17:17 AM
to cascading-user
Hey Ken, thanks for your reply. The input file is a 4mc file, i.e. abc.4mc. Will Cascading automatically handle this?

Ken Krugler

Nov 9, 2017, 11:36:24 AM
to cascadi...@googlegroups.com

If you set up your source format (Scheme) as a WritableSequenceFile then it should work, though you need to know the types used for the keys and values.
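For example, something along these lines (a sketch only: LongWritable/Text for the key and value types is just a guess, so substitute whatever Writable classes the job that wrote the file actually used):

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;

import cascading.scheme.hadoop.WritableSequenceFile;
import cascading.tap.Tap;
import cascading.tap.hadoop.Hfs;
import cascading.tuple.Fields;

public class SeqFileSource {

    // Builds a source tap over a Hadoop sequence file, exposing the key and value
    // as the fields "key" and "value". The Writable types must match what was written.
    public static Tap makeSequenceFileSource(String inputPath) {
        WritableSequenceFile scheme =
            new WritableSequenceFile(new Fields("key", "value"), LongWritable.class, Text.class);

        return new Hfs(scheme, inputPath);
    }
}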

— Ken




Neha Kumari

Nov 10, 2017, 2:21:27 AM
to cascading-user
@Ken - that gave an error for me:

Error: java.io.IOException: hdfs://path/to/my/input/4mcfile/-r-00000.4mc not a SequenceFile
    at org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:1853)
    at org.apache.hadoop.io.SequenceFile$Reader.initialize(SequenceFile.java:1813)
    at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1762)
    at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1776)
    at org.apache.hadoop.mapred.SequenceFileRecordReader.<init>(SequenceFileRecordReader.java:49)
    at org.apache.hadoop.mapred.SequenceFileInputFormat.getRecordReader(SequenceFileInputFormat.java:64)
    at cascading.tap.hadoop.io.MultiInputFormat$1.operate(MultiInputFormat.java:253)
    at cascading.tap.hadoop.io.MultiInputFormat$1.operate(MultiInputFormat.java:248)
    at cascading.util.Util.retry(Util.java:753)
    at cascading.tap.hadoop.io.MultiInputFormat.getRecordReader(MultiInputFormat.java:247)
    at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.<init>(MapTask.java:169)
    at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:432)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:343)
    at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:168)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
    at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:162)


Ken Krugler

Nov 10, 2017, 9:15:37 AM
to cascadi...@googlegroups.com
Are you processing the .4mc file on the same Hadoop system that created it?

I.e., do you have the required 4mc codec installed on the system where you're running the Cascading workflow?
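If not, you'd typically add the 4mc codec to Hadoop's codec list, either in core-site.xml or on the job configuration. A sketch (the codec class name here is my best recollection of the hadoop-4mc project's packaging, so verify it against the 4mc jar you actually have installed):

import org.apache.hadoop.mapred.JobConf;

public class FourMcCodecConfig {

    // Registers the 4mc codec alongside the standard Hadoop codecs so that
    // .4mc files can be decompressed. The FourMcCodec class name is an
    // assumption -- check it against the hadoop-4mc jar on your cluster.
    public static JobConf withFourMcCodec(JobConf conf) {
        conf.set("io.compression.codecs",
            "org.apache.hadoop.io.compress.DefaultCodec,"
                + "org.apache.hadoop.io.compress.GzipCodec,"
                + "com.hadoop.compression.fourmc.FourMcCodec");
        return conf;
    }
}

That said, the codec only covers decompression. Since the error says the file is "not a SequenceFile", the WritableSequenceFile scheme may simply be the wrong fit for .4mc data, which points back to the earlier suggestion of writing your own input Scheme around the 4mc project's input format.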

— Ken
 
