Using CDH4


Philippe Laflamme

Jun 13, 2012, 12:47:54 PM
to cascadi...@googlegroups.com
Hi,

I'm trying to figure out whether Cascading 2.0 will run on CDH4. CDH4 offers MRv1 jobs by packaging Hadoop 0.20.2 (with patches) [1]. I know Cascading only supports stable APIs (Hadoop 1.0.x), but if I'm not mistaken, it can also run on 0.20.2(?)

I'm using the Cascading samples for testing. I was able to run the logparser and loganalyzer samples, but the wordcount sample is causing problems:

With mapred.reducer.new-api and mapred.mapper.new-api set to false, I get the following error on the console:

java.lang.ClassCastException: cascading.tap.hadoop.io.MultiInputSplit cannot be cast to org.apache.hadoop.mapred.FileSplit
at org.apache.hadoop.mapred.TextInputFormat.getRecordReader(TextInputFormat.java:54)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:373)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:327)
at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:263)

(Strangely, TextInputFormat in 1.0.2 also casts to FileSplit, so the execution path has to be different somehow.)

With mapred.reducer.new-api and mapred.mapper.new-api set to true, I get the following error in the JobTracker logs:

java.io.IOException: Type mismatch in key from map: expected cascading.tuple.Tuple, recieved org.apache.hadoop.io.LongWritable 
 at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.collect(MapTask.java:861) 
 at org.apache.hadoop.mapred.MapTask$NewOutputCollector.write(MapTask.java:576) 
 at org.apache.hadoop.mapreduce.task.TaskInputOutputContextImpl.write(TaskInputOutputContextImpl.java:88) 
 at org.apache.hadoop.mapreduce.lib.map.WrappedMapper$Context.write(WrappedMapper.java:106) 
 at org.apache.hadoop.mapreduce.Mapper.map(Mapper.java:120) 
 at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:140) 
 at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:645) 
 at org.apache.hadoop.mapred.MapTask.run(MapTask.java:325)

I'm not looking for this to be fixed in Cascading. I'm looking for some guidance as to whether the wordcount sample "should" work in an MR1 0.20.2 environment. If so, the culprit might be one of the CDH4 patches, which could be reported as a CDH issue.

Thanks,
Philippe
[1] https://ccp.cloudera.com/display/DOC/CDH+Version+and+Packaging+Information#CDHVersionandPackagingInformation-CDHVersion4.0.0Packaging

Chris K Wensel

Jun 14, 2012, 3:03:25 AM
to cascadi...@googlegroups.com

From the download page:

These releases support stable releases of Apache Hadoop 0.20.2, 0.20.205.0, and Hadoop 1.0.x.

That means that for every stable Cascading release we run the Cascading regression tests against every one of those versions from the Apache Maven repository.

The examples themselves are tested with the default Cascading Hadoop dependency after every stable build; the default is currently 1.0.2, but they should work perfectly fine on 0.20.2, etc.

ckw


Philippe Laflamme

Jun 14, 2012, 9:25:44 AM
to cascadi...@googlegroups.com
OK, I understand. Now, should we run jobs with the "mapred.*.new-api" properties set to true or false? I'd like to track down this problem with the correct settings.
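
For context, here's roughly how I'm toggling them; a minimal sketch, assuming the usual way of passing job properties through a HadoopFlowConnector (the class name and paths here are just illustrative):

import java.util.Properties;

import cascading.flow.FlowConnector;
import cascading.flow.hadoop.HadoopFlowConnector;
import cascading.property.AppProps;

public class NewApiToggle
{
  public static void main( String[] args )
  {
    Properties properties = new Properties();

    // these end up in the JobConf of every submitted step
    properties.setProperty( "mapred.mapper.new-api", "false" );
    properties.setProperty( "mapred.reducer.new-api", "false" );

    AppProps.setApplicationJarClass( properties, NewApiToggle.class );

    FlowConnector connector = new HadoopFlowConnector( properties );
    // ... build taps/pipes and connector.connect( ... ).complete() as usual
  }
}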

Thanks,
Philippe

Philippe Laflamme

Jun 14, 2012, 1:37:24 PM
to cascadi...@googlegroups.com
An update, in case someone finds this information useful...

It seems that the wordcount sample fails with CDH4 when a Cascade involves both cluster map/reduce jobs and local jobs. The wordcount sample outputs its results locally using Lfs(); if I switch these taps to Hfs(), things work.
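
The relevant change, roughly (a sketch assuming the sample's TextLine scheme; the paths are illustrative):

import cascading.scheme.hadoop.TextLine;
import cascading.tap.Tap;
import cascading.tap.hadoop.Hfs;
import cascading.tap.hadoop.Lfs;

public class WordCountSinks
{
  // the sample's original sink writes to the local file system,
  // which forces that step into local mode:
  static Tap localSink = new Lfs( new TextLine(), "local/words" );

  // replacing Lfs with Hfs keeps the whole Cascade on the cluster:
  static Tap hdfsSink = new Hfs( new TextLine(), "output/words" );
}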

Locally, Hadoop thinks the job's InputFormat is the default (TextInputFormat), even though Cascading set it to MultiInputFormat. I've confirmed that the JobConf contains the correct value for the property (mapred.input.format.class) in both cases, so I don't know where that setting gets lost. Any ideas, anyone?
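
For reference, the check I ran on each JobConf was essentially this (a sketch):

import org.apache.hadoop.mapred.JobConf;

public class InputFormatCheck
{
  public static void main( String[] args )
  {
    JobConf conf = new JobConf();

    // the value Cascading writes into the configuration:
    System.out.println( conf.get( "mapred.input.format.class" ) );

    // the class Hadoop will actually instantiate; JobConf falls back
    // to TextInputFormat when the property is missing:
    System.out.println( conf.getInputFormat().getClass().getName() );
  }
}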

Here's the diff between the JobConf properties if that helps in any way: https://gist.github.com/2931655

Philippe

Andrew Purtell

Jun 14, 2012, 6:34:47 PM
to cascadi...@googlegroups.com
CDH4's MR1 package is a forward port of the MapReduce client, JobTracker, and TaskTracker code onto Hadoop 2 HDFS and core. For some core and mapred package classes, I believe the inheritance hierarchy changed: classes became interfaces and vice versa. So while this MR1 package is source-level compatible with MRv1 apps and Cascading's MR job planner, it seems necessary to recompile Cascading against the CDH4 MR1 classes.
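
If it helps, in a Maven build the swap amounts to roughly the following; the CDH4 MR1 coordinates below are from memory, so verify them against Cloudera's repository:

<!-- Cloudera's artifact repository -->
<repository>
  <id>cloudera</id>
  <url>https://repository.cloudera.com/artifactory/cloudera-repos/</url>
</repository>

<!-- replaces the stock org.apache.hadoop:hadoop-core 1.0.x dependency -->
<dependency>
  <groupId>org.apache.hadoop</groupId>
  <artifactId>hadoop-client</artifactId>
  <version>2.0.0-mr1-cdh4.0.0</version>
</dependency>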

Philippe Laflamme

Jun 14, 2012, 11:03:39 PM
to cascadi...@googlegroups.com
Yep, recompiling against the CDH4 MR1 classes seems to fix the problem. Thanks!

Philippe Laflamme

Jun 15, 2012, 1:04:08 PM
to cascadi...@googlegroups.com
Hi,

I've forked Cascading and modified its dependencies to build against CDH4. Things mostly work, except for the serialization tests: I get 7 failures, all in the cascading.tuple.hadoop.SerializedPipesPlatformTest class.

The error reported in every test is the following:

Exception: unable to load serializer for: [B from: org.apache.hadoop.io.serializer.SerializationFactory

(complete stack trace here: https://gist.github.com/2937585)

So it seems that the serializer for raw byte arrays ("[B" is the JVM name for byte[]) is not registered correctly. Or it is, but it's not being picked up everywhere in the Flow/Cascade.
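
For reference, my understanding is that the registration in play amounts to appending Cascading's serializer to Hadoop's io.serializations list, roughly like this (a sketch; cascading.tuple.hadoop.BytesSerialization is the class that should handle "[B"):

import org.apache.hadoop.mapred.JobConf;

public class SerializationCheck
{
  public static void main( String[] args )
  {
    JobConf conf = new JobConf();

    // SerializationFactory resolves serializers from this list;
    // BytesSerialization is the entry that should cover raw byte[]:
    conf.set( "io.serializations",
      conf.get( "io.serializations" )
        + ",cascading.tuple.hadoop.BytesSerialization" );

    System.out.println( conf.get( "io.serializations" ) );
  }
}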

Anyone have an idea what may be the problem?

Thanks,
Philippe

Chris K Wensel

Jun 15, 2012, 1:44:31 PM
to cascadi...@googlegroups.com
Hey Philippe

I know this isn't much help, but we are working on a "certification" program so that non-Apache distributions can test Cascading before they release their proprietary distributions.

By not using the Apache distribution, users have forgone community support for Apache Hadoop (see the number of questions that get deflected from the Apache list back to the distribution mailing list).

The same is somewhat true for Cascading. 

So we want to make sure each vendor tests their distribution with Cascading; if the tests pass and any issues arise with Cascading afterwards, it is much more likely to be a Cascading issue than a distribution issue, which is something I'm actually empowered to help with.

So you might ping the vendor and ask them why they broke Hadoop... grin.

That said, I suspect someone else on the list can probably give better guidance than I can.

chris

Philippe Laflamme

Jun 15, 2012, 2:02:48 PM
to cascadi...@googlegroups.com
I completely understand. I don't wish to take up too much bandwidth on this mailing list, nor am I expecting a complete solution; but if I'm to report something to Cloudera, I'd like some insight from this community first. Maybe someone has a feel for what the issue may be...

Also, since CDH is pretty popular (at least, around me it is), I thought it would be beneficial to the Cascading community to have such a "port", for example by making the jars available with some kind of classifier (e.g., -cdh4).
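
Something along these lines; the coordinates are purely hypothetical, just to illustrate the idea:

<dependency>
  <groupId>cascading</groupId>
  <artifactId>cascading-hadoop</artifactId>
  <version>2.0.0</version>
  <!-- hypothetical classifier; no such artifact is published today -->
  <classifier>cdh4</classifier>
</dependency>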

Anyway, thanks for listening :)

Cheers,
Philippe

Chris K Wensel

Jun 15, 2012, 2:34:33 PM
to cascadi...@googlegroups.com
That's a great idea. 

But I would think Cloudera should be responsible for that, since they generally get paid, whereas the community/Apache work is volunteer. They should be making things easier for people.

Maybe I can morph the "certification" scripts to also make it easier for the distros to publish binary-compatible releases (if a recompile is necessary). But if it requires a fork, then we get into a complicated situation (especially if there is no way to merge the differences).

That said, do ping them. They know what patches they back-ported (or speculatively forward-ported). That will be your easiest path, I think.

ckw

Philippe Laflamme

Jun 15, 2012, 2:54:11 PM
to cascadi...@googlegroups.com
I hadn't thought about that before (expecting Cloudera to provide Cascading packages). In fact, that's pretty much what they already do: they package Hadoop, but also a whole bunch of libraries and tools that use Hadoop. Cascading should definitely be part of that.

Pinged: https://issues.cloudera.org/browse/DISTRO-401

If anyone cares and has an idea, don't hesitate to contribute to the issue.

Cheers,
Philippe

Jacob

Jun 24, 2012, 6:12:12 AM
to cascadi...@googlegroups.com
Yes, I am starting a new project and would love to use Cascading, as it's an elegant abstraction and enables me to unit test away from HDFS.

However, my use case is tens of thousands of small files and events arriving every day, which need to be held in a segregated fashion in a few thousand buckets, each of which does not grow much on a daily basis. Whilst I have seen the advice that appending is achievable outside of Hadoop, I can see that if I do this I will be moving most of my data in and out of HDFS on a nightly basis.

Instead, I have been leaning towards Hadoop 2 / CDH4, as I already have append working and there are a lot of other features I would like to leverage in the future, such as NameNode HA and federation. In short, if there is a path by which Cascading can run on Hadoop 2, either now or in the not-too-distant future, I would be extremely interested and happy to help. I will definitely add to the issue linked above.

Jacob

Ted Dunning

Jun 24, 2012, 1:19:52 PM
to cascadi...@googlegroups.com
Append has been removed from Hadoop again (append as in close the file and then re-open for append).  There is an alternative that supports appends as well as full random write.

HA is a solved problem, but not with the recently announced approach in CDH4. To solve this, you need a much more substantial effort.

Hadoop still has the small file problem.  There is an alternative that supports hundreds of billions of small files.

And Cascading already *is* part of a distribution for Hadoop.

You should check out MapR.  There are additional things I haven't mentioned.  (I have an interest in MapR's success, btw)

This is not to say that CDH should not include Cascading. They should. It is just to say that people who need solutions to these problems should investigate additional options.


Chris K Wensel

Jun 25, 2012, 10:36:43 AM
to cascadi...@googlegroups.com
Hadoop 2.0 will be supported around the time it becomes stable for production use. There are no indications it is anywhere close, as it was only recently labeled Alpha.

ckw


Cindy Li

Sep 7, 2012, 7:51:02 PM
to cascadi...@googlegroups.com
Hi Philippe


By switching to Hfs(), did you change the wordcount code, or just change the command-line output path to use HDFS?

Here is what I did, but I still got an error. It seems output/urls wasn't created in HDFS correctly.

[cloudera@localhost wordcount]$ hadoop jar wordcount.jar data/url+page.200.txt hdfs:///user/cloudera/output local
12/09/07 16:11:46 INFO util.HadoopUtil: resolving application jar from found main method on: wordcount.Main
12/09/07 16:11:46 INFO planner.HadoopPlanner: using application jar: /home/cloudera/Cascading-2.0-SDK-20120822/source/wordcount/wordcount.jar
12/09/07 16:11:46 INFO property.AppProps: using app.id: 37481464AE115BB68BF9D659CA662E12
12/09/07 16:11:46 WARN conf.Configuration: mapred.used.genericoptionsparser is deprecated. Instead, use mapreduce.client.genericoptionsparser.used
12/09/07 16:11:47 WARN conf.Configuration: mapred.job.tracker is deprecated. Instead, use mapreduce.jobtracker.address
12/09/07 16:11:47 INFO hadoop.Hfs: forcing job to local mode, via source: Lfs["TextLine[['offset', 'line']->[ALL]]"]["data/url+page.200.txt"]"]
12/09/07 16:11:47 WARN conf.Configuration: mapred.job.tracker is deprecated. Instead, use mapreduce.jobtracker.address
12/09/07 16:11:47 INFO planner.HadoopPlanner: using application jar: /home/cloudera/Cascading-2.0-SDK-20120822/source/wordcount/wordcount.jar
12/09/07 16:11:47 WARN conf.Configuration: mapred.job.tracker is deprecated. Instead, use mapreduce.jobtracker.address
12/09/07 16:11:47 INFO planner.HadoopPlanner: using application jar: /home/cloudera/Cascading-2.0-SDK-20120822/source/wordcount/wordcount.jar
12/09/07 16:11:47 WARN conf.Configuration: mapred.job.tracker is deprecated. Instead, use mapreduce.jobtracker.address
12/09/07 16:11:47 WARN conf.Configuration: mapred.output.compress is deprecated. Instead, use mapreduce.output.fileoutputformat.compress
12/09/07 16:11:47 INFO hadoop.Hfs: forcing job to local mode, via sink: Lfs["TextLine[['offset', 'line']->[ALL]]"]["local/urls"]"]
12/09/07 16:11:47 INFO planner.HadoopPlanner: using application jar: /home/cloudera/Cascading-2.0-SDK-20120822/source/wordcount/wordcount.jar
12/09/07 16:11:47 WARN conf.Configuration: mapred.job.tracker is deprecated. Instead, use mapreduce.jobtracker.address
12/09/07 16:11:47 INFO hadoop.Hfs: forcing job to local mode, via sink: Lfs["TextLine[['offset', 'line']->[ALL]]"]["local/words"]"]
12/09/07 16:11:47 INFO cascade.Cascade: [import pages+url pipe+...] starting
12/09/07 16:11:47 WARN conf.Configuration: mapred.job.tracker is deprecated. Instead, use mapreduce.jobtracker.address
12/09/07 16:11:47 INFO cascade.Cascade: [import pages+url pipe+...]  parallel execution is enabled: true
12/09/07 16:11:47 INFO cascade.Cascade: [import pages+url pipe+...]  starting flows: 4
12/09/07 16:11:47 INFO cascade.Cascade: [import pages+url pipe+...]  allocating threads: 4
12/09/07 16:11:47 INFO cascade.Cascade: [import pages+url pipe+...] starting flow: import pages
12/09/07 16:11:48 INFO flow.Flow: [import pages] at least one sink does not exist
12/09/07 16:11:48 INFO flow.Flow: [import pages] starting
12/09/07 16:11:48 INFO flow.Flow: [import pages]  source: Lfs["TextLine[['offset', 'line']->[ALL]]"]["data/url+page.200.txt"]"]
12/09/07 16:11:48 INFO flow.Flow: [import pages]  sink: Hfs["SequenceFile[['url', 'page']]"]["hdfs:/user/cloudera/output/pages"]"]
12/09/07 16:11:48 WARN conf.Configuration: mapred.job.tracker is deprecated. Instead, use mapreduce.jobtracker.address
12/09/07 16:11:48 INFO flow.Flow: [import pages]  parallel execution is enabled: true
12/09/07 16:11:48 INFO flow.Flow: [import pages]  starting jobs: 1
12/09/07 16:11:48 INFO flow.Flow: [import pages]  allocating threads: 1
12/09/07 16:11:48 INFO flow.FlowStep: [import pages] starting step: (1/1) ...ser/cloudera/output/pages
12/09/07 16:11:48 WARN conf.Configuration: mapred.job.tracker is deprecated. Instead, use mapreduce.jobtracker.address
12/09/07 16:11:48 WARN conf.Configuration: mapred.input.dir is deprecated. Instead, use mapreduce.input.fileinputformat.inputdir
12/09/07 16:11:48 INFO util.NativeCodeLoader: Loaded the native-hadoop library
12/09/07 16:11:48 WARN snappy.LoadSnappy: Snappy native library is available
12/09/07 16:11:48 INFO snappy.LoadSnappy: Snappy native library loaded
12/09/07 16:11:48 INFO mapred.FileInputFormat: Total input paths to process : 1
12/09/07 16:11:48 INFO mapreduce.JobSubmitter: number of splits:2
12/09/07 16:11:48 WARN conf.Configuration: mapred.jar is deprecated. Instead, use mapreduce.job.jar
12/09/07 16:11:48 WARN conf.Configuration: mapred.reduce.tasks is deprecated. Instead, use mapreduce.job.reduces
12/09/07 16:11:48 WARN conf.Configuration: mapred.output.value.class is deprecated. Instead, use mapreduce.job.output.value.class
12/09/07 16:11:48 WARN conf.Configuration: mapred.output.key.comparator.class is deprecated. Instead, use mapreduce.job.output.key.comparator.class
12/09/07 16:11:48 WARN conf.Configuration: mapred.job.name is deprecated. Instead, use mapreduce.job.name
12/09/07 16:11:48 WARN conf.Configuration: mapred.output.dir is deprecated. Instead, use mapreduce.output.fileoutputformat.outputdir
12/09/07 16:11:48 WARN conf.Configuration: mapred.map.tasks is deprecated. Instead, use mapreduce.job.maps
12/09/07 16:11:48 WARN conf.Configuration: mapred.output.key.class is deprecated. Instead, use mapreduce.job.output.key.class
12/09/07 16:11:48 WARN conf.Configuration: mapred.working.dir is deprecated. Instead, use mapreduce.job.working.dir
12/09/07 16:11:48 INFO mapred.ResourceMgrDelegate: Submitted application application_1347025361263_0013 to ResourceManager at /0.0.0.0:8032
12/09/07 16:11:49 INFO mapreduce.Job: The url to track the job: http://localhost.localdomain:8088/proxy/application_1347025361263_0013/
12/09/07 16:11:49 INFO flow.FlowStep: [import pages] submitted hadoop job: job_1347025361263_0013
12/09/07 16:12:31 WARN flow.FlowStep: [import pages] task completion events identify failed tasks
12/09/07 16:12:31 WARN flow.FlowStep: [import pages] task completion events count: 7
12/09/07 16:12:31 WARN flow.FlowStep: [import pages] event = Task Id : attempt_1347025361263_0013_m_000000_0, Status : FAILED
12/09/07 16:12:31 WARN flow.FlowStep: [import pages] event = Task Id : attempt_1347025361263_0013_m_000001_0, Status : FAILED
12/09/07 16:12:31 WARN flow.FlowStep: [import pages] event = Task Id : attempt_1347025361263_0013_m_000001_1, Status : FAILED
12/09/07 16:12:31 WARN flow.FlowStep: [import pages] event = Task Id : attempt_1347025361263_0013_m_000000_1, Status : FAILED
12/09/07 16:12:31 WARN flow.FlowStep: [import pages] event = Task Id : attempt_1347025361263_0013_m_000001_2, Status : FAILED
12/09/07 16:12:31 WARN flow.FlowStep: [import pages] event = Task Id : attempt_1347025361263_0013_m_000000_2, Status : FAILED
12/09/07 16:12:31 WARN flow.FlowStep: [import pages] event = Task Id : attempt_1347025361263_0013_m_000001_3, Status : TIPFAILED
12/09/07 16:12:31 INFO flow.Flow: [import pages] stopping all jobs
12/09/07 16:12:31 INFO flow.FlowStep: [import pages] stopping: (1/1) ...ser/cloudera/output/pages
12/09/07 16:12:31 INFO mapred.ResourceMgrDelegate: Killing application application_1347025361263_0013
12/09/07 16:12:31 INFO flow.Flow: [import pages] stopped all jobs
12/09/07 16:12:31 INFO util.Hadoop18TapUtil: deleting temp path hdfs:/user/cloudera/output/pages/_temporary
12/09/07 16:12:31 WARN cascade.Cascade: [import pages+url pipe+...] flow failed: import pages
cascading.flow.FlowException: local step failed
        at cascading.flow.planner.FlowStepJob.blockOnJob(FlowStepJob.java:191)
        at cascading.flow.planner.FlowStepJob.start(FlowStepJob.java:137)
        at cascading.flow.planner.FlowStepJob.call(FlowStepJob.java:122)
        at cascading.flow.planner.FlowStepJob.call(FlowStepJob.java:42)
        at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
        at java.util.concurrent.FutureTask.run(FutureTask.java:138)
        at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
        at java.lang.Thread.run(Thread.java:662)
12/09/07 16:12:31 INFO cascade.Cascade: [import pages+url pipe+...] starting flow: export word
12/09/07 16:12:31 INFO flow.Flow: [export word] at least one sink does not exist
12/09/07 16:12:31 INFO cascade.Cascade: [import pages+url pipe+...] starting flow: export url
12/09/07 16:12:31 INFO flow.Flow: [export url] at least one sink does not exist
12/09/07 16:12:31 INFO flow.Flow: [export word] starting
12/09/07 16:12:31 INFO flow.Flow: [export word]  source: Hfs["SequenceFile[['word', 'count']]"]["hdfs:/user/cloudera/output/words"]"]
12/09/07 16:12:31 INFO flow.Flow: [export word]  sink: Lfs["TextLine[['offset', 'line']->[ALL]]"]["local/words"]"]
12/09/07 16:12:31 INFO flow.Flow: [export word]  parallel execution is enabled: true
12/09/07 16:12:31 INFO flow.Flow: [export word]  starting jobs: 1
12/09/07 16:12:31 INFO flow.Flow: [export word]  allocating threads: 1
12/09/07 16:12:32 INFO flow.Flow: [export url] starting
12/09/07 16:12:32 INFO flow.Flow: [export url]  source: Hfs["SequenceFile[['url', 'word', 'count']]"]["hdfs:/user/cloudera/output/urls"]"]
12/09/07 16:12:32 INFO flow.Flow: [export url]  sink: Lfs["TextLine[['offset', 'line']->[ALL]]"]["local/urls"]"]
12/09/07 16:12:32 INFO flow.Flow: [export url]  parallel execution is enabled: true
12/09/07 16:12:32 INFO flow.Flow: [export url]  starting jobs: 1
12/09/07 16:12:32 INFO flow.Flow: [export url]  allocating threads: 1
12/09/07 16:12:32 INFO flow.FlowStep: [export url] starting step: (1/1) local/urls
12/09/07 16:12:32 INFO flow.FlowStep: [export word] starting step: (1/1) local/words
12/09/07 16:12:32 INFO mapreduce.JobSubmitter: Cleaning up the staging area /tmp/hadoop-yarn/staging/cloudera/.staging/job_1347025361263_0014
12/09/07 16:12:32 ERROR security.UserGroupInformation: PriviledgedActionException as:cloudera (auth:SIMPLE) cause:org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: hdfs://0.0.0.0:8020/user/cloudera/output/words
12/09/07 16:12:32 ERROR security.UserGroupInformation: PriviledgedActionException as:cloudera (auth:SIMPLE) cause:org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: hdfs://0.0.0.0:8020/user/cloudera/output/words
12/09/07 16:12:32 INFO flow.Flow: [export word] stopping all jobs
12/09/07 16:12:32 INFO flow.FlowStep: [export word] stopping: (1/1) local/words
12/09/07 16:12:32 INFO flow.Flow: [export word] stopped all jobs
12/09/07 16:12:32 WARN cascade.Cascade: [import pages+url pipe+...] flow failed: export word
cascading.flow.FlowException: unhandled exception
        at cascading.flow.BaseFlow.complete(BaseFlow.java:840)
        at cascading.cascade.Cascade$CascadeJob.call(Cascade.java:762)
        at cascading.cascade.Cascade$CascadeJob.call(Cascade.java:710)
        at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
        at java.util.concurrent.FutureTask.run(FutureTask.java:138)
        at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
        at java.lang.Thread.run(Thread.java:662)
Caused by: org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: hdfs://0.0.0.0:8020/user/cloudera/output/words
        at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:231)
        at org.apache.hadoop.mapred.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:45)
        at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:251)
        at cascading.tap.hadoop.io.MultiInputFormat.getSplits(MultiInputFormat.java:194)
        at cascading.tap.hadoop.io.MultiInputFormat.getSplits(MultiInputFormat.java:130)
        at org.apache.hadoop.mapreduce.JobSubmitter.writeOldSplits(JobSubmitter.java:478)
        at org.apache.hadoop.mapreduce.JobSubmitter.writeSplits(JobSubmitter.java:470)
        at org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:360)
        at org.apache.hadoop.mapreduce.Job$11.run(Job.java:1226)
        at org.apache.hadoop.mapreduce.Job$11.run(Job.java:1223)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:396)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1232)
        at org.apache.hadoop.mapreduce.Job.submit(Job.java:1223)
        at org.apache.hadoop.mapred.JobClient$1.run(JobClient.java:609)
        at org.apache.hadoop.mapred.JobClient$1.run(JobClient.java:604)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:396)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1232)
        at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:604)
        at cascading.flow.hadoop.planner.HadoopFlowStepJob.internalNonBlockingStart(HadoopFlowStepJob.java:104)
        at cascading.flow.planner.FlowStepJob.blockOnJob(FlowStepJob.java:174)
        at cascading.flow.planner.FlowStepJob.start(FlowStepJob.java:137)
        at cascading.flow.planner.FlowStepJob.call(FlowStepJob.java:122)
        at cascading.flow.planner.FlowStepJob.call(FlowStepJob.java:42)
        ... 5 more
12/09/07 16:12:32 INFO mapreduce.JobSubmitter: Cleaning up the staging area /tmp/hadoop-yarn/staging/cloudera/.staging/job_1347025361263_0015
12/09/07 16:12:32 ERROR security.UserGroupInformation: PriviledgedActionException as:cloudera (auth:SIMPLE) cause:org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: hdfs://0.0.0.0:8020/user/cloudera/output/urls
12/09/07 16:12:32 ERROR security.UserGroupInformation: PriviledgedActionException as:cloudera (auth:SIMPLE) cause:org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: hdfs://0.0.0.0:8020/user/cloudera/output/urls
12/09/07 16:12:32 INFO flow.Flow: [export url] stopping all jobs
12/09/07 16:12:32 INFO flow.FlowStep: [export url] stopping: (1/1) local/urls
12/09/07 16:12:32 INFO flow.Flow: [export url] stopped all jobs
12/09/07 16:12:32 INFO flow.Flow: [export url] shutting down job executor
12/09/07 16:12:32 INFO flow.Flow: [export url] shutdown complete
12/09/07 16:12:32 WARN cascade.Cascade: [import pages+url pipe+...] flow failed: export url
cascading.flow.FlowException: unhandled exception
        at cascading.flow.BaseFlow.complete(BaseFlow.java:840)
        at cascading.cascade.Cascade$CascadeJob.call(Cascade.java:762)
        at cascading.cascade.Cascade$CascadeJob.call(Cascade.java:710)
        at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
        at java.util.concurrent.FutureTask.run(FutureTask.java:138)
        at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
        at java.lang.Thread.run(Thread.java:662)
Caused by: org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: hdfs://0.0.0.0:8020/user/cloudera/output/urls
        at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:231)
        at org.apache.hadoop.mapred.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:45)
        at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:251)
        at cascading.tap.hadoop.io.MultiInputFormat.getSplits(MultiInputFormat.java:194)
        at cascading.tap.hadoop.io.MultiInputFormat.getSplits(MultiInputFormat.java:130)
        at org.apache.hadoop.mapreduce.JobSubmitter.writeOldSplits(JobSubmitter.java:478)
        at org.apache.hadoop.mapreduce.JobSubmitter.writeSplits(JobSubmitter.java:470)
        at org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:360)
        at org.apache.hadoop.mapreduce.Job$11.run(Job.java:1226)
        at org.apache.hadoop.mapreduce.Job$11.run(Job.java:1223)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:396)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1232)
        at org.apache.hadoop.mapreduce.Job.submit(Job.java:1223)
        at org.apache.hadoop.mapred.JobClient$1.run(JobClient.java:609)
        at org.apache.hadoop.mapred.JobClient$1.run(JobClient.java:604)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:396)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1232)
        at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:604)
        at cascading.flow.hadoop.planner.HadoopFlowStepJob.internalNonBlockingStart(HadoopFlowStepJob.java:104)
        at cascading.flow.planner.FlowStepJob.blockOnJob(FlowStepJob.java:174)
        at cascading.flow.planner.FlowStepJob.start(FlowStepJob.java:137)
        at cascading.flow.planner.FlowStepJob.call(FlowStepJob.java:122)
        at cascading.flow.planner.FlowStepJob.call(FlowStepJob.java:42)
        ... 5 more
12/09/07 16:12:32 INFO cascade.Cascade: [import pages+url pipe+...] stopping all flows
12/09/07 16:12:32 INFO cascade.Cascade: [import pages+url pipe+...] stopping flow: export url
12/09/07 16:12:32 INFO flow.Flow: [export url] stopping all jobs
12/09/07 16:12:32 INFO flow.FlowStep: [export url] stopping: (1/1) local/urls
12/09/07 16:12:32 INFO flow.Flow: [export url] stopped all jobs
12/09/07 16:12:32 INFO cascade.Cascade: [import pages+url pipe+...] stopping flow: export word
12/09/07 16:12:32 INFO flow.Flow: [export word] stopping all jobs
12/09/07 16:12:32 INFO flow.FlowStep: [export word] stopping: (1/1) local/words
12/09/07 16:12:32 INFO flow.Flow: [export word] stopped all jobs
12/09/07 16:12:32 INFO cascade.Cascade: [import pages+url pipe+...] stopping flow: url pipe+word pipe
12/09/07 16:12:32 INFO flow.Flow: [url pipe+word pipe] stopping all jobs
12/09/07 16:12:32 INFO flow.FlowStep: [url pipe+word pipe] stopping: (2/2) ...user/cloudera/output/urls
12/09/07 16:12:32 INFO flow.FlowStep: [url pipe+word pipe] stopping: (1/2) ...ser/cloudera/output/words
12/09/07 16:12:32 INFO flow.Flow: [url pipe+word pipe] stopped all jobs
12/09/07 16:12:32 INFO cascade.Cascade: [import pages+url pipe+...] stopping flow: import pages
12/09/07 16:12:32 INFO flow.Flow: [import pages] stopping all jobs
12/09/07 16:12:32 INFO flow.FlowStep: [import pages] stopping: (1/1) ...ser/cloudera/output/pages
12/09/07 16:12:32 WARN ipc.Client: Unexpected error reading responses on connection Thread[IPC Client (1043744321) connection to localhost.localdomain/127.0.0.1:41379 from cloudera,5,main]
java.lang.NullPointerException
        at org.apache.hadoop.ipc.Client$Connection.receiveResponse(Client.java:852)
        at org.apache.hadoop.ipc.Client$Connection.run(Client.java:781)
12/09/07 16:12:32 INFO mapred.ResourceMgrDelegate: Killing application application_1347025361263_0013
12/09/07 16:12:32 INFO flow.Flow: [import pages] stopped all jobs
12/09/07 16:12:32 INFO cascade.Cascade: [import pages+url pipe+...] stopped all flows
Exception in thread "main" cascading.cascade.CascadeException: flow failed: import pages
        at cascading.cascade.Cascade$CascadeJob.call(Cascade.java:771)
        at cascading.cascade.Cascade$CascadeJob.call(Cascade.java:710)
        at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
        at java.util.concurrent.FutureTask.run(FutureTask.java:138)
        at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
        at java.lang.Thread.run(Thread.java:662)
Caused by: cascading.flow.FlowException: local step failed
        at cascading.flow.planner.FlowStepJob.blockOnJob(FlowStepJob.java:191)
        at cascading.flow.planner.FlowStepJob.start(FlowStepJob.java:137)
        at cascading.flow.planner.FlowStepJob.call(FlowStepJob.java:122)
        at cascading.flow.planner.FlowStepJob.call(FlowStepJob.java:42)
        ... 5 more