[Scalding/Cascading] Job takes very, very long to configure in case of long input list


Hyun Joon Seol

Aug 14, 2013, 10:21:36 AM
to cascadi...@googlegroups.com
Hello everyone,
Our team uses Scalding to build a language model from a very large corpus.
We normally fed globbed input paths (e.g. /data/corpus/2013/part-*) to Scalding, but we recently restructured our code to de-glob these paths for our own purposes.
As a result, the input list became significantly longer for some use cases, while other cases still have a relatively short input list.

Long story short, with the longer input list Cascading hung for a very long time while configuring the job, stuck calculating shortest paths.
When we put the globbed input list back, it worked again. But in the future we might have a use case with multiple globbed paths, so this issue is likely to come up again.
How are we supposed to manage a very long input list?

Our environment is Hadoop 1.0.3, Scalding 0.8.8.

FYI, here is the jdb stack trace. Thank you for your time. :)

[1] org.jgrapht.alg.RankingPathElementList.isDifferent (null)
[2] org.jgrapht.alg.RankingPathElementList.isAlreadyAdded (null)
[3] org.jgrapht.alg.RankingPathElementList.addPathElements (null)
[4] org.jgrapht.alg.KShortestPathsIterator.tryToAddNewPaths (null)
[5] org.jgrapht.alg.KShortestPathsIterator.updateOutgoingVertices (null)
[6] org.jgrapht.alg.KShortestPathsIterator.next (null)
[7] org.jgrapht.alg.KShortestPaths.getPaths (null)
[8] cascading.flow.planner.ElementGraphs.getAllShortestPathsBetween (ElementGraphs.java:53)
[9] cascading.flow.planner.ElementGraph.getAllShortestPathsTo (ElementGraph.java:393)
[10] cascading.flow.planner.FlowPlanner.failOnMissingGroup (FlowPlanner.java:420)
[11] cascading.flow.hadoop.planner.HadoopPlanner.buildFlow (HadoopPlanner.java:206)
[12] cascading.flow.FlowConnector.connect (FlowConnector.java:454)
[13] com.naver.speech.lmtrain.PreprocessCorpus.buildFlow (PreprocessCorpus.scala:62)
[14] com.naver.speech.lmtrain.PreprocessCorpus.run (PreprocessCorpus.scala:65)
[15] com.naver.speech.lmtrain.Scheduler.doJob (Scheduler.scala:119)
[16] com.naver.speech.lmtrain.Scheduler.run (Scheduler.scala:125)
[17] com.twitter.scalding.Tool.start$1 (Tool.scala:109)
[18] com.twitter.scalding.Tool.run (Tool.scala:125)
[19] com.twitter.scalding.Tool.run (Tool.scala:72)
[20] org.apache.hadoop.util.ToolRunner.run (ToolRunner.java:65)
[21] com.twitter.scalding.Tool$.main (Tool.scala:133)
[22] com.twitter.scalding.Tool.main (null)
[23] sun.reflect.NativeMethodAccessorImpl.invoke0 (native method)
[24] sun.reflect.NativeMethodAccessorImpl.invoke (NativeMethodAccessorImpl.java:39)
[25] sun.reflect.DelegatingMethodAccessorImpl.invoke (DelegatingMethodAccessorImpl.java:25)
[26] java.lang.reflect.Method.invoke (Method.java:597)
[27] org.apache.hadoop.util.RunJar.main (RunJar.java:197)

Oscar Boykin

Aug 14, 2013, 11:24:02 AM
to cascadi...@googlegroups.com
This is definitely a Cascading or JGraphT issue, not something at the Scalding layer, as far as I can tell.

We have seen other cases where the planner takes a very long time to complete a plan.  We used to see the same issue with paths based on time, which is why we implemented TimePathedSource to do this globbing to work around the issue.
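As a rough illustration of the globbing idea, here is a minimal Java sketch that collapses an explicit path list back into one glob per directory, so the planner sees fewer inputs. It is not how TimePathedSource is implemented; the class name GlobCollapser and its methods are hypothetical, and it assumes '/'-separated paths whose file names contain no glob metacharacters. The resulting glob can match files that were not in the original list, so check that before relying on it.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of Scalding or Cascading.
final class GlobCollapser {

    // Groups paths by parent directory. A directory with several files becomes
    // one glob built from the longest common prefix of its file names.
    // Caveat: the glob may also match files that were not in the input list.
    static List<String> collapse(List<String> paths) {
        Map<String, List<String>> namesByDir = new LinkedHashMap<>();
        for (String path : paths) {
            int slash = path.lastIndexOf('/');
            String dir = path.substring(0, slash + 1); // keeps the trailing '/', "" if none
            String name = path.substring(slash + 1);
            namesByDir.computeIfAbsent(dir, d -> new ArrayList<>()).add(name);
        }
        List<String> result = new ArrayList<>();
        for (Map.Entry<String, List<String>> entry : namesByDir.entrySet()) {
            List<String> names = entry.getValue();
            if (names.size() == 1) {
                // A single file in this directory: keep the explicit path.
                result.add(entry.getKey() + names.get(0));
            } else {
                result.add(entry.getKey() + commonPrefix(names) + "*");
            }
        }
        return result;
    }

    // Longest prefix shared by every name in the (non-empty) list.
    static String commonPrefix(List<String> names) {
        String prefix = names.get(0);
        for (String name : names) {
            int max = Math.min(prefix.length(), name.length());
            int i = 0;
            while (i < max && prefix.charAt(i) == name.charAt(i)) {
                i++;
            }
            prefix = prefix.substring(0, i);
        }
        return prefix;
    }

    public static void main(String[] args) {
        List<String> paths = Arrays.asList(
                "/data/corpus/2013/part-00000",
                "/data/corpus/2013/part-00001",
                "/data/corpus/2012/part-00000");
        for (String glob : collapse(paths)) {
            System.out.println(glob);
        }
    }
}
```

The collapsed list can then be passed wherever the job previously took the de-globbed paths.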

I don't know anything about the JGraphT library or Cascading's use of it, but perhaps that can be optimized.

--
Oscar Boykin :: @posco :: http://twitter.com/posco

Chris K Wensel

Sep 6, 2013, 1:00:07 PM
to cascadi...@googlegroups.com
we may have a fix for this in the latest 2.2 wip. please give it a try.

ckw

Mike Drogalis

Oct 8, 2013, 1:46:17 PM
to cascadi...@googlegroups.com
Running into this issue on short lists of inputs as well. It's definitely a bug within Cascading.

Mike DeLaurentis

Oct 21, 2013, 10:15:58 AM
to cascadi...@googlegroups.com
Hi,

I'm seeing a similar situation, where the planning phase is taking a very long time. It's been hanging in the planning phase for about 10 hours now. The stack trace is very similar to the one above. I ran jstack on the hadoop process probably 100 times or so and it's always the same (see below). I can't tell if it's hanging inside RankingPathElementList.isDifferent, or if that's simply being called many times. This is using Cascading 2.2, called directly from Clojure (not using Cascalog). I will try to come up with a Java test case that reproduces it, but I just wanted to check in the meantime and see if anyone has any insight.

"main" prio=10 tid=0x00000000008bd800 nid=0x6070 runnable [0x00007fa14e0d5000]
   java.lang.Thread.State: RUNNABLE
        at org.jgrapht.alg.RankingPathElementList.isDifferent(Unknown Source)
        at org.jgrapht.alg.RankingPathElementList.isAlreadyAdded(Unknown Source)
        at org.jgrapht.alg.RankingPathElementList.addPathElements(Unknown Source)
        at org.jgrapht.alg.KShortestPathsIterator.tryToAddNewPaths(Unknown Source)
        at org.jgrapht.alg.KShortestPathsIterator.updateOutgoingVertices(Unknown Source)
        at org.jgrapht.alg.KShortestPathsIterator.next(Unknown Source)
        at org.jgrapht.alg.KShortestPaths.getPaths(Unknown Source)
        at cascading.flow.planner.ElementGraphs.getAllShortestPathsBetween(ElementGraphs.java:53)
        at cascading.flow.planner.ElementGraph.getAllShortestPathsFrom(ElementGraph.java:382)
        at cascading.flow.planner.FlowPlanner.failOnLoneGroupAssertion(FlowPlanner.java:428)
        at cascading.flow.hadoop.planner.HadoopPlanner.buildFlow(HadoopPlanner.java:231)
        at cascading.flow.hadoop.planner.HadoopPlanner.buildFlow(HadoopPlanner.java:80)
        at cascading.flow.FlowConnector.connect(FlowConnector.java:459)
        at batcheetah.derivatives.platforms$eval12261$fn__12264.invoke(platforms.clj:156)
        at clojure.lang.MultiFn.invoke(MultiFn.java:231)
        at batcheetah.derivatives.main$run.invoke(main.clj:48)
        at batcheetah.derivatives.main$main.invoke(main.clj:110)
        at clojure.lang.Var.invoke(Var.java:415)
        at batcheetah.derivatives.RunDerivatives.main(RunDerivatives.java:14)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:606)
        at org.apache.hadoop.util.RunJar.main(RunJar.java:187)

Chris K Wensel

Oct 21, 2013, 12:39:42 PM
to cascadi...@googlegroups.com
just confirming we are going to need a test to sort this out. 

Mike DeLaurentis

Oct 21, 2013, 3:57:07 PM
to cascadi...@googlegroups.com
Sure, I understand. I'm trying to distill our problem to a small flow that we can use to reproduce it. Would you prefer that I fork Cascading and add a test case, or should I just give you a simple stand-alone Java program that illustrates the problem?

I've been trying to build Cascading and have run into some trouble. With Gradle 1.8, running "gradle build" in the top level of Cascading gives me this error:

FAILURE: Build failed with an exception.

* What went wrong:
Could not resolve all dependencies for configuration ':cascading-hadoop:providedCompile'.
> Could not download artifact 'commons-httpclient:commons-httpclient:3.1@jar'
   > Artifact 'commons-httpclient:commons-httpclient:3.1@jar' not found.

I'm not very familiar with gradle. Do you have any idea what that's about?


Chris K Wensel

Oct 21, 2013, 6:30:13 PM
to cascadi...@googlegroups.com
a pull request with a test would be awesome.

there is a bug in ivy (used by gradle) where blah blah blah causes that error, just rm -rf ~/.m2/.../commons-httpclient and things should be better.

ckw

Mike DeLaurentis

Oct 22, 2013, 12:56:57 PM
to cascadi...@googlegroups.com
I just made a pull request: https://github.com/Cascading/cascading/pull/16

Thanks for the tips on building it. Your suggestion worked.

Chris K Wensel

Oct 22, 2013, 12:58:26 PM
to cascadi...@googlegroups.com
awesome. let me see what i can come up with.

Mike DeLaurentis

Oct 24, 2013, 8:55:06 AM
to cascadi...@googlegroups.com
Hi Chris, I was wondering if you had a chance to look at this at all. If you could give any indication about whether you think this is a legitimate issue, whether you think you'll address it, or if you have any suggested workarounds, I would really appreciate it.

Thanks,

Mike

Chris K Wensel

Oct 24, 2013, 12:05:40 PM
to cascadi...@googlegroups.com
I believe it's likely an issue, but I can't say exactly when I'll have time to dig into it. I hope it will be this evening on my flight to Kiev, if I can finish this slide deck; otherwise next week.

Chris K Wensel

Oct 29, 2013, 3:38:47 PM
to cascadi...@googlegroups.com
I still have no ETA on this. 

there is more than one non-linearity we have to overcome, all of which were scheduled to be re-written in 3.0 (this hasn't been an issue for 5 years, so we figured we had a few more months).

The workaround should be to make smaller Flows and use a Cascade to run them as a single unit. The added benefit is you can re-run the Cascade and it will be faster if only a few source sets of data have been modified.
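One way to start on this workaround is to split the long input list into fixed-size batches, with each batch feeding its own smaller Flow. The Java sketch below shows only that partitioning step; the class name InputBatcher and the batch size are hypothetical, and building a Flow per batch, connecting the Flows into a Cascade, and combining the per-batch outputs are deliberately left out because they depend on the job.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of Cascading.
final class InputBatcher {

    // Splits items into consecutive batches of at most batchSize elements.
    // Each batch is intended to become the source list of one smaller Flow.
    static <T> List<List<T>> partition(List<T> items, int batchSize) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        List<List<T>> batches = new ArrayList<>();
        for (int start = 0; start < items.size(); start += batchSize) {
            int end = Math.min(start + batchSize, items.size());
            // Copy the sublist so each batch is independent of the input list.
            batches.add(new ArrayList<>(items.subList(start, end)));
        }
        return batches;
    }

    public static void main(String[] args) {
        List<String> inputs = Arrays.asList(
                "/data/corpus/2013/part-00000",
                "/data/corpus/2013/part-00001",
                "/data/corpus/2013/part-00002");
        // A batch size of 2 is an arbitrary value chosen for illustration.
        for (List<String> batch : partition(inputs, 2)) {
            System.out.println(batch);
        }
    }
}
```

Keeping batches stable between runs (for example by sorting the input list first) should help the re-run case, since an unchanged batch maps to the same Flow each time.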

ckw

Vishwa vichu

Dec 29, 2016, 1:15:42 AM
to cascading-user
Hi Chris,

Is the above-mentioned issue (the planner taking a long time to calculate shortest paths with the JGraphT algorithm) resolved in the Cascading 3.1 stable release?

Viswa.

Chris K Wensel

Dec 29, 2016, 2:57:08 PM
to cascadi...@googlegroups.com
you will have to see if it works for you or not.

I recommend using wip-3.2

ckw
