Input path does not exist


Vishal Raut

Sep 20, 2012, 1:34:42 AM
to cascadi...@googlegroups.com
Hello,
   I am running a job on Hadoop/Cascading. The input files are stored on Amazon S3. There are thousands of input files across buckets on S3, but some of the expected files/buckets are missing.

  I am constructing the paths for the input files in advance. Is there any way to know whether a file is actually present at a given path? Can we fetch the files dynamically?

  I am getting the following exception when running the job on my local machine:

Exception in thread "main" cascading.flow.FlowException: unhandled exception
    at cascading.flow.Flow.complete(Flow.java:821)
    at com.smsweb.hadoop.json.GetMessageFromCauseIdandDateForMonths.testJSONParsing(GetMessageFromCauseIdandDateForMonths.java:157)
    at com.smsweb.hadoop.json.GetMessageFromCauseIdandDateForMonths.main(GetMessageFromCauseIdandDateForMonths.java:49)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.apache.hadoop.util.RunJar.main(RunJar.java:156)
Caused by: org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: hdfs://localhost:54310/user/hadoop/myfiles/dt=2011-07-12-00-45/0.gz
    at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:190)
    at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:201)
    at cascading.tap.hadoop.MultiInputFormat.getSplits(MultiInputFormat.java:240)
    at cascading.tap.hadoop.MultiInputFormat.getSplits(MultiInputFormat.java:188)
    at org.apache.hadoop.mapred.JobClient.writeOldSplits(JobClient.java:810)
    at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:781)
    at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:730)
    at cascading.flow.FlowStepJob.blockOnJob(FlowStepJob.java:164)
    at cascading.flow.FlowStepJob.start(FlowStepJob.java:140)
    at cascading.flow.FlowStepJob.call(FlowStepJob.java:129)
    at cascading.flow.FlowStepJob.call(FlowStepJob.java:39)
    at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
    at java.util.concurrent.FutureTask.run(FutureTask.java:138)
    at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
    at java.lang.Thread.run(Thread.java:662)

Please suggest how to handle this situation.

Thanks and Regards
Vishal Raut

Paul Lam

Sep 20, 2012, 5:19:17 AM
to cascadi...@googlegroups.com
not sure why you'd want to construct paths in advance, but if you insist, why not write a pre-Cascading, plain Java function that generates the list automatically using the S3 API, and then feed that list into the Cascading job?

it's more canonical to specify a parent dir as the input and let Cascading scan the files for you. make use of a template source if you need to
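To illustrate the pre-Cascading filtering step, here is a minimal sketch. The class and method names (`InputPathFilter`, `filterExisting`) and the stub predicate are hypothetical; in a real job the predicate would wrap Hadoop's `FileSystem` API, e.g. `fs.exists(new Path(uri))` with `fs` obtained from `FileSystem.get(URI.create(uri), jobConf)`, so that missing paths are dropped before the taps are built and `InvalidInputException` is never thrown.

```java
import java.util.List;
import java.util.function.Predicate;
import java.util.stream.Collectors;

public class InputPathFilter {

    // Keep only the candidate paths for which the existence check succeeds.
    // In a real Hadoop job, pass a predicate that calls fs.exists(new Path(p)).
    static List<String> filterExisting(List<String> candidates, Predicate<String> exists) {
        return candidates.stream()
                         .filter(exists)
                         .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> candidates = List.of(
            "s3n://mybucket/dt=2011-07-12-00-45/0.gz",
            "s3n://mybucket/dt=2011-07-12-01-45/0.gz");

        // Stub existence check for illustration only; replace with
        // a FileSystem.exists(...) call against S3/HDFS.
        List<String> present = filterExisting(candidates, p -> p.contains("00-45"));

        System.out.println(present);
    }
}
```

The surviving paths can then be handed to the Cascading job (for example as the sources of a multi-source tap), so the flow only ever sees inputs that are known to exist.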