Fwd: Job stuck in compilation

Oscar Boykin

May 23, 2013, 6:28:32 PM
to cascadi...@googlegroups.com, Chris Wensel
Anyone else see cascading take a long time in flow planning phase?

We've seen some very long cases when the graph gets up to 30-35 steps.

---------- Forwarded message ----------
Date: Thu, May 23, 2013 at 1:22 PM
Subject: Re: Job stuck in compilation
To: Scalding Users <scaldin...@twitter.com>


In case anyone is interested, I think I figured out why this happened.

The job had too many steps.  I started from the top of the script and tried to compile larger and larger chunks of the code.  At first it worked fine.  But as the number of steps approached about 30, the flow planning time started to increase in a super-linear fashion (at least quadratically, but maybe exponentially).  By the time the job was at 35 steps, it took over 20 minutes to start running.  I didn't push it past that point.

Clearly there's some kind of search/optimization that happens when you compile a scalding job, and it seems to become intractable somewhere around 30 steps.


On Thu, May 23, 2013 at 11:02 AM:
I'm trying to run a script using scald.rb.  It seems to rsync the jar and then hang while compiling the job.  Has anyone seen this behavior before?  Here is the output:

[INFO] Found Job Class: BuildBFLModelComplete
13/05/23 14:56:09 INFO util.HadoopUtil: resolving application jar from found main method on: com.twitter.scalding.Tool$
13/05/23 14:56:09 INFO planner.HadoopPlanner: using application jar: /home/mmiller/ads-batch-deploy.jar
13/05/23 14:56:09 INFO property.AppProps: using app.id: 9397446BB6825C965870E7D7DA3EE207

After this point it will sit there for literally 1 hour.




--
Oscar Boykin :: @posco :: http://twitter.com/posco

Mansoor Ashraf

May 24, 2013, 9:57:29 AM
to cascadi...@googlegroups.com, Chris Wensel
We have run into this issue several times now. Once you go over 30 steps, compilation never finishes.

Chris K Wensel

May 24, 2013, 12:26:56 PM
to Mansoor Ashraf, cascadi...@googlegroups.com
If you can provide a test I can take a look at it.

sourab...@corp.247customer.com

Jun 6, 2013, 7:19:32 AM
to cascadi...@googlegroups.com, Chris Wensel
I am also seeing the same kind of nonlinear delay in the planning phase once there are more than 30 steps.

I have iterative code that does joins, and that is what creates an increasing number of steps depending on the input. Here are my stats:
Number of steps 13  --> less than 1 min
Number of steps 18  --> less than 1 min
Number of steps 23  --> less than 1 min
Number of steps 28  --> 7 mins
Number of steps 33  --> 43 mins
Number of steps 48  --> more than 2 hrs...did not start
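
For a rough sense of the growth rate, the two measured points above (28 steps in 7 minutes, 33 steps in 43 minutes) can be fitted to either an exponential or a power-law model. The sketch below is only an illustration: the class and method names are made up for this example, two data points cannot tell the models apart, and the "less than 1 min" rows are upper bounds that it ignores.

```java
class GrowthEstimate {
    /** Per-step multiplier b, assuming time grows as c * b^steps. */
    static double exponentialBase(int steps1, double minutes1, int steps2, double minutes2) {
        return Math.pow(minutes2 / minutes1, 1.0 / (steps2 - steps1));
    }

    /** Exponent k, assuming time grows as c * steps^k. */
    static double powerLawExponent(int steps1, double minutes1, int steps2, double minutes2) {
        return Math.log(minutes2 / minutes1) / Math.log((double) steps2 / steps1);
    }

    public static void main(String[] args) {
        // Data points taken from the stats above: 28 steps -> 7 min, 33 steps -> 43 min.
        System.out.printf("exponential base per step: %.2f%n", exponentialBase(28, 7, 33, 43));
        System.out.printf("power-law exponent:        %.2f%n", powerLawExponent(28, 7, 33, 43));
    }
}
```

On these two points the exponential fit gives a multiplier of roughly 1.44 per added step, while the power-law fit would need an exponent of roughly 11, which is at least consistent with growth much worse than quadratic.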

Logs for 28 steps:

13/06/06 16:18:05 INFO util.HadoopUtil: resolving application jar from found main method on: com.tfs.bdp.uc.UCGenerator
13/06/06 16:18:05 INFO planner.HadoopPlanner: using application jar: null
13/06/06 16:18:05 INFO property.AppProps: using app.id: F93A6CDA1847F2252FBD32AD7E1AD962
13/06/06 16:25:23 INFO util.Version: Concurrent, Inc - Cascading 2.0.4
13/06/06 16:25:23 INFO flow.Flow: [] starting
13/06/06 16:25:23 INFO flow.Flow: []  source: MultiSourceTap[1:[Lfs["TextLine[['line']->[ALL]]"]["/var/tmp/idm/localtest"]"]]]
13/06/06 16:25:23 INFO flow.Flow: []  source: MultiSourceTap[1:[Lfs["TextLine[['line']->[ALL]]"]["/var/tmp/idm/localtest-type2"]"]]]
13/06/06 16:25:23 INFO flow.Flow: []  source: MultiSourceTap[1:[Lfs["TextLine[['line']->[ALL]]"]["/var/tmp/idm/localtest"]"]]]
13/06/06 16:25:23 INFO flow.Flow: []  source: MultiSourceTap[1:[Lfs["TextLine[['line']->[ALL]]"]["/var/tmp/idm/localtest"]"]]]
13/06/06 16:25:23 INFO flow.Flow: []  sink: Lfs["TextLine[['line']->['?ucid', '?blob']]"]["/var/tmp/idm/localtest-out"]"]
13/06/06 16:25:23 INFO flow.Flow: []  parallel execution is enabled: false
13/06/06 16:25:23 INFO flow.Flow: []  starting jobs: 28
13/06/06 16:25:23 INFO flow.Flow: []  allocating threads: 1
13/06/06 16:25:23 INFO flow.FlowStep: [] starting step: (1/28)

Thanks
Sourabh

Chris K Wensel

Jun 6, 2013, 12:10:00 PM
to cascadi...@googlegroups.com
if you see this line

13/06/06 16:25:23 INFO flow.Flow: [] starting

you are past the planning phase and are executing the flow.

also, this isn't actually running on a cluster since you are using Lfs, so it is highly likely you are having memory issues.

Oscar Boykin

Jun 6, 2013, 12:15:38 PM
to cascadi...@googlegroups.com
But Chris, look at the time delta:

13/06/06 16:18:05 INFO property.AppProps: using app.id: F93A6CDA1847F2252FBD32AD7E1AD962
13/06/06 16:25:23 INFO util.Version: Concurrent, Inc - Cascading 2.0.4
13/06/06 16:25:23 INFO flow.Flow: [] starting

He only saw "starting" after waiting 7 minutes.  I think that is the issue he is reporting.
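
The gap can be read straight off the log timestamps. A minimal sketch, assuming the timestamps use the `yy/MM/dd HH:mm:ss` layout seen in these logs (the `PlanningDelay` class and its methods are hypothetical helpers, not part of Cascading or Hadoop):

```java
import java.time.Duration;
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;

class PlanningDelay {
    // Assumed layout of the leading timestamp, e.g. "13/06/06 16:18:05".
    private static final DateTimeFormatter LOG_TIME =
        DateTimeFormatter.ofPattern("yy/MM/dd HH:mm:ss");

    /** Parses the 17-character timestamp at the start of a log line. */
    static LocalDateTime timestampOf(String logLine) {
        return LocalDateTime.parse(logLine.substring(0, 17), LOG_TIME);
    }

    /** Time elapsed between two log lines. */
    static Duration between(String earlierLine, String laterLine) {
        return Duration.between(timestampOf(earlierLine), timestampOf(laterLine));
    }

    public static void main(String[] args) {
        String appId = "13/06/06 16:18:05 INFO property.AppProps: using app.id: F93A6CDA1847F2252FBD32AD7E1AD962";
        String starting = "13/06/06 16:25:23 INFO flow.Flow: [] starting";
        Duration delay = between(appId, starting);
        System.out.println(delay.getSeconds() + " seconds before the flow started");
    }
}
```

For the two lines quoted above this gives 438 seconds, i.e. 7 minutes 18 seconds between `using app.id` and `starting`.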



--
You received this message because you are subscribed to the Google Groups "cascading-user" group.
To unsubscribe from this group and stop receiving emails from it, send an email to cascading-use...@googlegroups.com.
To post to this group, send email to cascadi...@googlegroups.com.
Visit this group at http://groups.google.com/group/cascading-user?hl=en.
For more options, visit https://groups.google.com/groups/opt_out.
 
 

Chris K Wensel

Jun 6, 2013, 12:24:31 PM
to cascadi...@googlegroups.com
duh, ok I get it now..

try using a modern release to see if the problem is aggravated/alleviated. 2.1.6 and 2.2.0-wip. 

also sending me the dot would be useful.

sourabh chaki

Jun 7, 2013, 5:21:55 AM
to cascadi...@googlegroups.com
Hi Chris,

I tried with 2.1.6, but I don't see any improvement there either. For 28 steps it took 9 minutes to start, for 43 steps it took more than 40 minutes, and for 48 steps it took more than 2 hours. I tried the same on a cluster, but the startup delay did not improve.

After I start my application, the first log line from Cascading appears after a 9-minute delay. This is the 28-step case.

13/06/07 14:33:54 INFO uc.UCGenerator: Fetching type2 events :: [/var/tmp/idm/localtest-type2]
13/06/07 14:42:00 INFO jvm.JvmMetrics: Initializing JVM Metrics with processName=JobTracker, sessionId=
13/06/07 14:42:00 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
13/06/07 14:42:01 WARN mapred.JobClient: No job jar file set.  User classes may not be found. See JobConf(Class) or JobConf#setJar(String).
13/06/07 14:42:01 WARN snappy.LoadSnappy: Snappy native library not loaded
13/06/07 14:42:01 INFO mapred.FileInputFormat: Total input paths to process : 1
13/06/07 14:42:01 INFO mapred.Task:  Using ResourceCalculatorPlugin : null

Thanks
Sourabh


Oscar Boykin

Jun 7, 2013, 11:14:49 AM
to cascadi...@googlegroups.com
Can you share a test job with Chris? My guess is there is some bad scaling of some scheduling algorithm that only shows up in some graphs.

Chris K Wensel

Jun 7, 2013, 12:12:02 PM
to cascadi...@googlegroups.com
> Can you share a test job with Chris? My guess is there is some bad scaling of some scheduling algorithm that only shows up in some graphs.


that's unlikely. but the planner could be making redundant passes on the process graph partitioning. that could add non-linear behavior.

that said.. I've never seen these log entries before. (I just grep'd 100mb of test logs looking for them)

13/06/07 14:33:54 INFO uc.UCGenerator: Fetching type2 events :: [/var/tmp/idm/localtest-type2]
13/06/07 14:42:00 INFO jvm.JvmMetrics: Initializing JVM Metrics with processName=JobTracker, sessionId=

since I don't have a dot, I can't further speculate.

--

sourabh chaki

Jun 17, 2013, 7:14:53 AM
to cascadi...@googlegroups.com, Chris Wensel, nat...@nathanmarz.com
I have created a small test to simulate this scenario. I am using JCascalog on top of Cascading. The code is on my GitHub: https://github.com/sourabhchaki/cascalog-cascading-test/blob/master/src/main/java/com/home/test/CascadingTestInJcascalog.java

Looping in Nathan, as this code is written in JCascalog.

Here I am doing a self-join on the same input for a given depth. For every level of depth, Cascading creates 2 jobs. Thus, by changing the depth, I was able to check the preparation time for different numbers of Cascading jobs. The job preparation time increases nonlinearly as the number of jobs increases.
The execution results for this test are below.

depth=5, steps: 10, time taken: 1 sec
[17/06/2013:14:13:19 IST] [INFO] [cascading.property.AppProps main]: using app.id: FC106638099703F5450E89B08BB7442F
[17/06/2013:14:13:20 IST] [INFO] [cascading.util.Version flow]: Concurrent, Inc - Cascading 2.0.0
[17/06/2013:14:13:20 IST] [INFO] [cascading.flow.Flow flow]: [] starting
......
[17/06/2013:14:13:20 IST] [INFO] [cascading.flow.Flow flow]: []  starting jobs: 10
[17/06/2013:14:13:20 IST] [INFO] [cascading.flow.Flow flow]: []  allocating threads: 1
[17/06/2013:14:13:20 IST] [INFO] [cascading.flow.FlowStep pool-1-thread-1]: [] starting step: (6/10)

depth=10, steps: 20, time taken: 15 mins
[17/06/2013:14:14:50 IST] [INFO] [cascading.property.AppProps main]: using app.id: 264A79523E9A9AF21EB04D2814FBCF9F
[17/06/2013:14:29:54 IST] [INFO] [cascading.util.Version flow]: Concurrent, Inc - Cascading 2.0.0
[17/06/2013:14:29:54 IST] [INFO] [cascading.flow.Flow flow]: [] starting
.....
[17/06/2013:14:29:54 IST] [INFO] [cascading.flow.Flow flow]: []  starting jobs: 20

I tried with depth=15, so jobs=30, and waited for 1 hour, but the application never started.

Hope this will help you to investigate the problem.

Let me know if you need any more details.

Thanks
Sourabh

Chris K Wensel

Jun 17, 2013, 8:54:34 PM
to sourabh chaki, cascadi...@googlegroups.com, nat...@nathanmarz.com
sorry, I cannot provide any help on this without it being a raw cascading application I can debug. hopefully Nathan or Sam can jump in.

ckw

Sam Ritchie

Jun 17, 2013, 9:31:09 PM
to cascadi...@googlegroups.com, sourabh chaki, nat...@nathanmarz.com
It looks like this is a really serious issue. I can go ahead and translate that code into a Cascading job if you need it -- otherwise, Sourabh, would you mind sending out the dot files for your experiments?

(The dot files will look a bit more sane if you change your "?" prefixes to "!".)

--
Sam Ritchie, Twitter Inc
703.662.1337
@sritchie

sourab...@corp.247customer.com

Jun 19, 2013, 5:46:33 AM
to cascadi...@googlegroups.com, Chris Wensel
Hi Sam,

I have generated dot files doing the following steps:

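       // Api.execute(new StdoutTap(), newMappings);                 // original call, which runs the query
       Flow flow = Api.compileFlow(new StdoutTap(), newMappings);    // only plan the flow, do not run it
       flow.writeDOT(depth * 2 + "jobs.dot");                        // e.g. "10jobs.dot" for depth=5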

Dot files for 10jobs and 20jobs are attached.

Here too I can see the same problem. For 20 jobs it takes 15 minutes to generate the dot file, whereas for 10 jobs it takes less than 1 second. I generated these dot files locally, and I saw the same result on a Linux machine as well.

I am new to JCascalog and have yet to gain experience with Cascading. It would be really helpful if you could convert my example to plain Cascading, if Chris needs that for debugging.

Thanks in advance.

Regards,
Sourabh
10jobs.dot
20jobs.dot

Sam Ritchie

Jun 19, 2013, 11:46:42 AM
to cascadi...@googlegroups.com, Chris Wensel
Chris, what are you thinking, here? I think that if we run a profiler on that example code (Thanks again, Sourabh, awesome example), it'll become clear what's going on. I can spend some time next week converting this code to straight-up Cascading, but I'd like to avoid that if possible, since we already have a great repro here.


sourabh chaki

Jun 26, 2013, 8:24:16 AM
to cascadi...@googlegroups.com, Chris Wensel

Did any one of you get a chance to look into it?

Thanks
Sourabh

Sam Ritchie

Jun 26, 2013, 2:35:29 PM
to cascadi...@googlegroups.com, Chris Wensel
Chris, it seems like you're pretty uninterested in this issue... can you give us some advice on how to proceed?


Chris K Wensel

Jun 26, 2013, 2:51:00 PM
to Sam Ritchie, cascadi...@googlegroups.com
please don't misinterpret my lack of time for being a lack of interest. 

that said, handing me a reproducible test case without any additional embellishments (scala, clojure, etc) that I can plug into our test framework (via a pull request) is the best way for me to resolve an issue and keep it resolved.

without that, I will get to it when I can.

ckw

Matt Martin

Oct 15, 2014, 10:01:54 PM
to cascadi...@googlegroups.com, sritc...@gmail.com
I know this is a pretty old thread, but I ran into a similar problem with very slow job compilation.  After finding this thread, I initially suspected some issue in Cascading.  I was seeing the issue in a fairly simple Scalding job that happened to have a lot of unions.  I thought this job could easily be translated to straight Cascading and might serve as a good test case to help isolate the underlying issue.  But a funny thing happened when I re-wrote the job in Cascading: the job compiled and completed quickly.  For reference, here's a gist to show the code I used (and how I was able to modify the scalding code to run as quickly as the Cascading code): https://gist.github.com/matt-martin/538adf5a28d856503fae.  Maybe there is still some deeper issue with Cascading, but this experience taught me to be just a tiny bit wary of how scalding itself builds a flow.

TLDR: In my case the slow compilation seemed to be due to the way the Scalding code was translated to Cascading and not Cascading itself.

Matt Martin

Oct 15, 2014, 10:04:57 PM
to cascadi...@googlegroups.com, sritc...@gmail.com
Sorry, here is the proper Gist link: https://gist.github.com/matt-martin/4cd9cab6ec761eb7f100

Chris K Wensel

Oct 15, 2014, 11:29:21 PM
to cascadi...@googlegroups.com, Sam Ritchie
Thanks for the follow up.

Next time you suspect the planner doing stupid things, give the Cascading 3 wips a run through. 

It does a few less stupid things than 2.x does. And I hope to find any new ones as quickly as possible before we ship 3.0.0.

ckw

Matt Martin

Oct 16, 2014, 9:34:39 PM
to cascadi...@googlegroups.com, sritc...@gmail.com
Probably should've taken a moment yesterday to put together a Cascading job that more closely resembles the Scalding job that was hanging on compilation.  Sure enough, here is a fairly simple Cascading job that hangs on job compilation (note you can bump up the value of numInputPaths to get it to hang longer):


I tried it with cascading-3.0.0-wip-53 just to make sure it was not an issue specific to 2.x.  Note the gist also has another version of a very similar cascading job that doesn't show any issues.  The difference is that the slow job nests a bunch of Merge objects--i.e. "new Merge(pipeX, new Merge(pipeY, ...))"--whereas the speedier job calls "new Merge(pipeX, pipeY, ...)".  Not sure if this is a general enough case to be worth consideration, but thought I'd bring it up "just in case."
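
The structural difference between the two variants can be pictured with a small stand-in model. This sketch does not use Cascading's classes: `Node`, `nested`, `mergeCount`, and `flatten` are hypothetical names, and it only illustrates that a right-nested chain of binary merges over n pipes contains n-1 merge nodes, while the flat form has a single merge with n inputs. It assumes the order in which the inputs are merged does not matter for the job.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class MergeFlattening {
    /** Hypothetical stand-in for a pipe; NOT cascading.pipe.Pipe. */
    static final class Node {
        final String name;
        final List<Node> inputs; // empty for a source pipe

        Node(String name, Node... inputs) {
            this.name = name;
            this.inputs = Arrays.asList(inputs);
        }

        boolean isMerge() {
            return !inputs.isEmpty();
        }
    }

    /** Builds merge(p0, merge(p1, merge(p2, ...))) over n source pipes (n >= 1). */
    static Node nested(int n) {
        Node acc = new Node("p" + (n - 1));
        for (int i = n - 2; i >= 0; i--) {
            acc = new Node("merge", new Node("p" + i), acc);
        }
        return acc;
    }

    /** Counts the merge nodes in the tree. */
    static int mergeCount(Node node) {
        if (!node.isMerge()) return 0;
        int count = 1;
        for (Node in : node.inputs) count += mergeCount(in);
        return count;
    }

    /** Collects the source pipes under a tree of merges, left to right. */
    static List<Node> sources(Node node) {
        List<Node> out = new ArrayList<>();
        if (!node.isMerge()) {
            out.add(node);
            return out;
        }
        for (Node in : node.inputs) out.addAll(sources(in));
        return out;
    }

    /** Rewrites a tree of merges as one merge over all of its sources. */
    static Node flatten(Node root) {
        return new Node("merge", sources(root).toArray(new Node[0]));
    }

    public static void main(String[] args) {
        Node deep = nested(20);
        Node flat = flatten(deep);
        System.out.println("nested form: " + mergeCount(deep) + " merge nodes");
        System.out.println("flat form:   " + mergeCount(flat) + " merge node, "
            + flat.inputs.size() + " inputs");
    }
}
```

In a real job the equivalent change is the one described above: collect the pipes first and pass them all to a single Merge, rather than wrapping each new pipe around the previous Merge.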

Matt  

Chris K Wensel

Oct 17, 2014, 12:01:31 PM
to cascadi...@googlegroups.com
Hey Matt

I'm unclear whether the issue persists at the same level with the 3.0 planner, or is it worse?

ckw


For more options, visit https://groups.google.com/d/optout.

Chris K Wensel

Oct 17, 2014, 12:40:24 PM
to cascadi...@googlegroups.com
on 3.0 you can set the FlowPlanner#TRACE_STATS_PATH


this may help identify which rules are misbehaving..

will also add: before we ship 3.0.0, multiple rule registries will be supported in some fashion. some rule registries may behave better under differing circumstances (no HashJoins in assembly) than others (compensating for pesky HashJoins).

ckw



Matt Martin

Oct 23, 2014, 8:00:48 PM
to cascadi...@googlegroups.com
Hi Chris,

I haven't done extensive comparisons between 2.x and 3.0, but the performance issue seemed to be about the same on both.  That being said, I tried to reproduce the issue again just now with 3.0.0-wip-57 (instead of 3.0.0-wip-53) and it seems like the issue has disappeared.  I was going to try using TRACE_STATS_PATH, but the results aren't particularly informative now that I cannot recreate the slowness I was seeing before.  I assume maybe you guys fixed whatever the issue was in one of the more recent WIP releases?

Matt

Chris K Wensel

Oct 23, 2014, 8:09:37 PM
to cascadi...@googlegroups.com
if you go back to wip-53 does the slowdown come back? just curious.

I can't think of anything I did that would have made a material difference.

ckw



Matt Martin

Oct 23, 2014, 8:31:08 PM
to cascadi...@googlegroups.com
My bad.  Combination of wishful thinking and trying to do too much at once.  It looks as though wip-57 is more or less as slow as wip-53.

However, when I run the code with TRACE_STATS_PATH, it seems as though most of the time difference is not in the planner.  Here are the stats from the "fast" version:


The difference in duration (~6 seconds) is not anywhere close to the overall difference in runtime.  I have, however, verified that the slowdown comes in the call to "flowConnector.connect(...)" and not "flow.complete()".
...

Chris K Wensel

Oct 23, 2014, 10:58:54 PM
to cascadi...@googlegroups.com
confusing.

let me add additional instrumentation to the stats doc to see if we can isolate things a little better.

do note you are running the local mode planner, so things will get more hairy when you are using MR or Tez, since there is a lot more work to do.

that said, staying in local mode and running a profiler just might be the best thing. better yet, run your java vm with flight recorder on, and share the dump with me.

-XX:+UnlockCommercialFeatures -XX:+FlightRecorder -XX:FlightRecorderOptions=defaultrecording=true,dumponexit=true,dumponexitpath=PATH,disk=true,repository=PATH

ckw


Chris K Wensel

Oct 24, 2014, 2:56:16 PM
to cascadi...@googlegroups.com
quick update.

found some time to run this.

it's not the query planner that has the slowdown; it's the fact that this is a local mode application, and that when we are building the executable graph, we are working against the whole graph, not a sub-graph localized in a Mapper/Reducer/Processor/etc. so any graph algorithms have a much larger search space.

that means we are leveraging easy-to-use but slow algorithms to identify various bits of meta-data, and those algorithms have an O(...) profile that isn't useful at scale. but they don't come into play with mapreduce and tez, since there they kick in on sub-graphs that have very few inputs/outputs.

Converting the job to tez, the planner ran in 25 seconds (lots of graph partitioning happening) and the flow completed in 12 seconds in Tez local mode.

So this is only an issue with Cascading local mode, and something we should address where we can.

I created a skeleton project people can fork to add new ad-hoc one-off tests. I put your code here, plus my changes


ckw



Umesh Pawar

Aug 18, 2015, 2:57:15 AM
to cascading-user
I am facing this issue in a Scalding job.

My job has just 20 steps.

Judging by the discussion above, other people could run jobs with up to 30 steps.

Please let me know of a solution or workaround to resolve this. This is very urgent for me.
...