Input path does not exist on local hdfs file

Sarma Tangirala

Jan 4, 2016, 11:03:48 PM
to cascadi...@googlegroups.com
Hello,

This is a stack trace I get when trying to re-run a cascade.

cascading.flow.FlowException: [GetDailyUsersAndSignaturesFlow] unhandled exception
    at cascading.flow.BaseFlow.complete(BaseFlow.java:954)
    at cascading.cascade.BaseCascade$CascadeJob.call(BaseCascade.java:953)
    at cascading.cascade.BaseCascade$CascadeJob.call(BaseCascade.java:900)
    at java.util.concurrent.FutureTask.run(FutureTask.java:262)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: hdfs://ip-10-13-1-3.ec2.internal:8020/mnt/var/lib/hadoop/tmp/9410495473_tm_client_daily_activity3_E3623DAFA1C844A49A9E77248631B276
    at org.apache.hadoop.mapred.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:251)
    at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:200)
    at org.apache.hadoop.mapred.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:45)
    at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:279)
    at cascading.tap.hadoop.io.MultiInputFormat.getSplits(MultiInputFormat.java:200)
    at cascading.tap.hadoop.io.MultiInputFormat.getSplits(MultiInputFormat.java:134)
    at org.apache.hadoop.mapreduce.JobSubmitter.writeOldSplits(JobSubmitter.java:624)
    at org.apache.hadoop.mapreduce.JobSubmitter.writeSplits(JobSubmitter.java:616)
    at org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:492)
    at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1296)
    at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1293)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:415)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628)
    at org.apache.hadoop.mapreduce.Job.submit(Job.java:1293)
    at org.apache.hadoop.mapred.JobClient$1.run(JobClient.java:562)
    at org.apache.hadoop.mapred.JobClient$1.run(JobClient.java:557)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:415)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628)
    at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:557)
    at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:548)
    at cascading.flow.hadoop.planner.HadoopFlowStepJob.internalNonBlockingStart(HadoopFlowStepJob.java:106)
    at cascading.flow.planner.FlowStepJob.blockOnJob(FlowStepJob.java:265)
    at cascading.flow.planner.FlowStepJob.start(FlowStepJob.java:184)
    at cascading.flow.planner.FlowStepJob.call(FlowStepJob.java:146)
    at cascading.flow.planner.FlowStepJob.call(FlowStepJob.java:48)
    ... 4 more


It's failing when it's trying to look for this path,

Caused by: org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: hdfs://ip-10-13-1-3.ec2.internal:8020/mnt/var/lib/hadoop/tmp/9410495473_tm_client_daily_activity3_E3623DAFA1C844A49A9E77248631B276

and I'm confused about why it's looking for that temporary HDFS location instead of looking up the correct partition on S3, as it normally would.

Thanks
Sarma

--
Sarma Tangirala | Software Engineer

Andre Kelpe

Jan 5, 2016, 6:09:39 AM
to cascading-user
This looks like an HDFS configuration error. Did you change anything in your setup?

- André

--
You received this message because you are subscribed to the Google Groups "cascading-user" group.
To unsubscribe from this group and stop receiving emails from it, send an email to cascading-use...@googlegroups.com.
To post to this group, send email to cascadi...@googlegroups.com.
Visit this group at https://groups.google.com/group/cascading-user.
To view this discussion on the web visit https://groups.google.com/d/msgid/cascading-user/CAMvszSChirRbcPKaZKZNg3mtRuC4YGztq94%3D%2BkBs7OWjw9XXpg%40mail.gmail.com.
For more options, visit https://groups.google.com/d/optout.




sarma.t...@tubemogul.com

Jan 5, 2016, 7:32:55 AM
to cascadi...@googlegroups.com
I'm using EMR and I don't think I've changed the configuration.

When I run the Cascading application, I see that the flow has two steps. The first step is skipped, and the second step fails with this exception. Maybe that indicates a code problem?



Andre Kelpe

Jan 5, 2016, 7:42:15 AM
to cascading-user
By default flows are skipped when the sink is newer than the source,
unless you provided your own FlowSkipStrategy:
https://github.com/Cascading/cascading/blob/3.0/cascading-core/src/main/java/cascading/flow/FlowSkipIfSinkNotStale.java

- André
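
The default rule described above can be pictured as a plain timestamp comparison. The sketch below is a simplified, self-contained model and not the Cascading source: the names `SkipIfSinkNotStaleSketch` and `shouldSkip` are hypothetical, and it assumes that a non-positive sink timestamp means the sink has not been written yet.

```java
final class SkipIfSinkNotStaleSketch {
    // Returns true when the work can be skipped:
    // the sink exists and no source is newer than it.
    static boolean shouldSkip(long sinkModified, long... sourceModified) {
        if (sinkModified <= 0) {
            return false; // assumption: a missing sink reports a non-positive timestamp
        }
        for (long source : sourceModified) {
            if (source > sinkModified) {
                return false; // a source changed after the sink was written
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(shouldSkip(2_000L, 1_000L)); // sink newer than source
        System.out.println(shouldSkip(1_000L, 2_000L)); // source newer than sink
        System.out.println(shouldSkip(0L, 0L));         // no usable sink timestamp
    }
}
```

In this model a sink timestamp of 0 or below is treated as "nothing written yet", so the work runs instead of being skipped.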

Sarma Tangirala

Jan 5, 2016, 11:29:55 AM
to cascadi...@googlegroups.com
You're right! The flow that failed has,

  source modification date at: Thu Jan 01 00:00:00 UTC 1970.


Sarma Tangirala

Jan 5, 2016, 11:50:50 AM
to cascadi...@googlegroups.com
Well this is what I see right now in the output,

16/01/05 16:42:35 INFO flow.Flow: [GetDailyUsersAndSignat...] sink oldest modified date: Wed Dec 31 23:59:59 UTC 1969

and 

16/01/05 16:42:36 INFO flow.Flow: [GetDailyUsersAndSignat...] source modification date at: Thu Jan 01 00:00:00 UTC 1970
16/01/05 16:42:36 INFO flow.Flow: [GetDailyUsersAndSignat...] skipping step: (1/2)
16/01/05 16:42:36 INFO flow.Flow: [GetDailyUsersAndSignat...] starting step: (2/2) ...ities/run_date=2016-01-05

and then eventually the exception.

Shouldn't this return false for the skip flow method?

Andre Kelpe

Jan 5, 2016, 11:52:21 AM
to cascading-user
There seems to be a clock skew on one of your systems, or s3fs is returning wrong dates all of a sudden. I don't think that you are processing data from 1969...

- André


Sarma Tangirala

Jan 5, 2016, 11:53:47 AM
to cascadi...@googlegroups.com
Indeed :D

Thanks for the help!


Sarma Tangirala

Jan 5, 2016, 12:02:16 PM
to cascadi...@googlegroups.com
Side question,

Can you point me to how cascading figures out what the timestamps are on the sinks and the sources?

Andre Kelpe

Jan 5, 2016, 1:13:01 PM
to cascading-user

Sarma Tangirala

Jan 5, 2016, 3:11:33 PM
to cascadi...@googlegroups.com
Hey Andre,

I don't think this is something to do with clock skew. When I use hdfs on my cluster to look up the modified stamp on the s3 partitions it returns the date that is being reported by cascading. The actual files though report the correct modified stamp. Do you have any other ideas about this? I'm using an orc tap with hfs if that makes a difference.

Thanks
Sarma


Ken Krugler

Jan 5, 2016, 3:48:16 PM
to cascadi...@googlegroups.com
Good to know that you're using "an orc tap", as that's useful for providing effective help.

Note that S3 doesn't actually have directories - it fakes these by using '/' characters in the file path to implicitly define directories.

So the S3 filesystem in Hadoop has to play some tricks to decide if an input "directory" actually exists.

I'm wondering if there's an issue with the ORC tap & S3 when it tries to derive the modification date, where it returns 0 (or -1).

Providing the actual tap component and version being used would be helpful.

Regards,

-- Ken


--------------------------
Ken Krugler
custom big data solutions & training
Hadoop, Cascading, Cassandra & Solr
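
To illustrate the point about S3 directories, here is a minimal sketch of a flat key store in which a "directory" exists only because some key starts with its prefix. It is a toy model, not the Hadoop S3 filesystem code: `FlatKeyStore` and its methods are hypothetical names, and the object key is made up.

```java
import java.util.Set;
import java.util.TreeSet;

final class FlatKeyStore {
    private final Set<String> keys = new TreeSet<>();

    void put(String key) {
        keys.add(key);
    }

    // A "directory" is only implied: it exists if any key begins with "<dir>/".
    boolean directoryExists(String dir) {
        String prefix = dir.endsWith("/") ? dir : dir + "/";
        for (String key : keys) {
            if (key.startsWith(prefix)) {
                return true;
            }
        }
        return false;
    }

    // Only stored objects are real entries; an implied directory is not one.
    boolean isObject(String key) {
        return keys.contains(key);
    }

    public static void main(String[] args) {
        FlatKeyStore store = new FlatKeyStore();
        store.put("client_activity/type=9/part-00000.orc"); // made-up object key
        System.out.println(store.directoryExists("client_activity/type=9"));
        System.out.println(store.isObject("client_activity/type=9"));
    }
}
```

Because the implied directory is not a stored object, it has no metadata of its own, so any modification time reported for it has to be derived or defaulted by the filesystem layer.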





Sarma Tangirala

Jan 5, 2016, 3:57:21 PM
to cascadi...@googlegroups.com
Sorry about not being clear before.

The "orc tap" is from a lib called corc. Here's the pom dependency.

<dependency>
<groupId>com.hotels</groupId>
<artifactId>corc-cascading</artifactId>
<version>2.0.0</version>
</dependency>

As for the component itself, here's the bit of code that creates the tap.

// ORC scheme from corc that reads a single string column.
OrcFile.SourceBuilder builder = OrcFile.source();
Fields fields = new Fields("tm_client_id");
builder.declaredFields(fields);
String typeString = "struct<tm_client_id:string>";
builder.columns(typeString);
builder.schema(typeString);

// Hfs tap over filePath using the ORC scheme.
Hfs parentTap = new Hfs(builder.build(), filePath);

// Partition the parent tap on the "type" field.
DelimitedPartition typePartition = new DelimitedPartition(new Fields("type"));
Tap typeTap = new PartitionTap(parentTap, typePartition);

return typeTap;

The S3 dir structure is reflected here in the tap construction, "s3://client_activity/type=9/".





Ken Krugler

Jan 5, 2016, 4:20:14 PM
to cascadi...@googlegroups.com
Hi Sarma,

The snippet of code is useful, as it shows that you're actually using a PartitionTap that wraps an Hfs tap that uses the ORC File Scheme - that's another important detail.

From what I remember, you can't use a PartitionTap (or TemplateTap) in a Cascade, if another Flow depends on it.

PartitionTap.getModifiedTime() delegates to the parent tap. Looking at the Hfs.getModifiedTime() method, it seems to skip sub-directories, which is where all of the partitioned data lives, so I would expect it to return 0. PartitionTap could probably try harder to return an appropriate modified date, but I think there are other issues with a Cascade trying to use partitioned data timestamps to figure out when a downstream Flow should be triggered.

But Chris Wensel should weigh in here…

-- Ken
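
The behaviour described above can be reproduced on a local filesystem. The sketch below is a model only, not the Hfs implementation: `newestDirectFile` is a hypothetical helper that takes the newest modification time among the files directly inside a directory and ignores sub-directories, so a directory holding only partition sub-directories yields 0. The directory and file names are illustrative.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;

final class PartitionModifiedTimeSketch {
    // Newest modification time (ms since epoch) of the files directly under dir.
    // Sub-directories are skipped, so their contents never count.
    static long newestDirectFile(Path dir) {
        long newest = 0L;
        try (DirectoryStream<Path> children = Files.newDirectoryStream(dir)) {
            for (Path child : children) {
                if (!Files.isDirectory(child)) {
                    newest = Math.max(newest, Files.getLastModifiedTime(child).toMillis());
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return newest;
    }

    // Builds <tmp>/type=9/data.orc, measures, then adds <tmp>/marker and measures again.
    static long[] demo() {
        try {
            Path parent = Files.createTempDirectory("client_activity");
            Path partition = Files.createDirectory(parent.resolve("type=9"));
            Files.createFile(partition.resolve("data.orc"));
            long onlySubDirs = newestDirectFile(parent);
            Files.createFile(parent.resolve("marker"));
            long withDirectFile = newestDirectFile(parent);
            return new long[] {onlySubDirs, withDirectFile};
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        long[] result = demo();
        System.out.println("only sub-directories: " + result[0]);
        System.out.println("with a direct file:   " + result[1]);
    }
}
```

The first value is 0 even though the partition contains a file, which is the situation a partitioned source would be in under this model.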


Sarma Tangirala

Jan 5, 2016, 5:02:45 PM
to cascadi...@googlegroups.com
Hey Ken,

Thanks for the response.
Does this part make sense to you though?

sink oldest modified date: Wed Dec 31 23:59:59 UTC 1969
source modification date at: Thu Jan 01 00:00:00 UTC 1970

Thanks
Sarma

Ken Krugler

Jan 5, 2016, 5:42:55 PM
to cascadi...@googlegroups.com


> Does this part make sense to you though?
>
> sink oldest modified date: Wed Dec 31 23:59:59 UTC 1969
> source modification date at: Thu Jan 01 00:00:00 UTC 1970

Yes re the source timestamp (what getModifiedTime would return, when called for a partitioned directory).

If there's no data for the sink, then I'd expect it to return 0 as well (based on Hfs code), so that's a bit odd to me.

-- Ken

Andre Kelpe

Jan 6, 2016, 6:27:05 AM
to cascading-user
Those are Unix timestamps, a.k.a. the epoch. For us Unix people the world did not exist before Jan 1st 1970 :-). Joking aside, it could be the code that talks to S3 that causes this. It can't determine the timestamp of a directory (since directories don't exist in S3) and returns a constant like 0 or -1. This depends on which of the various S3 libs in Hadoop is in use, like s3a or s3n. If you are on EMR, they might have their own set of patches that we don't know about. Maybe start by creating a Tap and calling getModifiedTime with various files and directories to see how you can trigger that behaviour.

- André
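
The two dates in the log are consistent with the constants mentioned above. This small check uses only `java.time` to print 0 ms and -1 ms since the epoch in UTC; the class name `EpochDates` is illustrative, and the log's own date formatting is not reproduced here.

```java
import java.time.Instant;

final class EpochDates {
    // Renders a millisecond epoch timestamp as an ISO-8601 instant in UTC.
    static String utc(long epochMillis) {
        return Instant.ofEpochMilli(epochMillis).toString();
    }

    public static void main(String[] args) {
        System.out.println(utc(0L));  // 1970-01-01T00:00:00Z
        System.out.println(utc(-1L)); // 1969-12-31T23:59:59.999Z
    }
}
```

A value of 0 falls on Jan 1st 1970 and a value of -1 falls in the last second of Dec 31st 1969, matching the source and sink dates reported in the log.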

