Concurrent Execution (Scalding)?

Nathan Stults

May 4, 2012, 12:30:17 PM
to cascadi...@googlegroups.com
Hi, another naive question I'm afraid :) 

I am deploying my Scalding jobs to Amazon EMR, and EMR Job Flows seem like they will only run jobs in a serial fashion. Is it possible to structure or launch Scalding jobs in a way that can execute several pipes or even jobs in parallel? My alternative as far as I can tell is to submit multiple job flows to EMR, which isn't really too bad of an option, but if I can do all my work in a single instance of an EMR cluster I think it will be more economical. 

Thanks for any insight,

Nathan

Chris K Wensel

May 4, 2012, 12:41:29 PM
to cascadi...@googlegroups.com

If you are launching Scalding apps as EMR Job Flows, they only run sequentially. This has to do with EMR and not Cascading or Scalding. 

Don't think of EMR Job Flows as one to one with a Cascading Flow.

Your best bet is to create one app with one jar, with all your flows, and run them as a Cascade.

ckw


--
You received this message because you are subscribed to the Google Groups "cascading-user" group.
To view this discussion on the web visit https://groups.google.com/d/msg/cascading-user/-/6sCz_NQsb3MJ.
To post to this group, send email to cascadi...@googlegroups.com.
To unsubscribe from this group, send email to cascading-use...@googlegroups.com.
For more options, visit this group at http://groups.google.com/group/cascading-user?hl=en.


Ken Krugler

May 4, 2012, 12:59:21 PM
to cascadi...@googlegroups.com
Just to add a bit more color commentary…

On May 4, 2012, at 9:41am, Chris K Wensel wrote:

> If you are launching Scalding apps as EMR Job Flows, they only run sequentially. This has to do with EMR and not Cascading or Scalding.

Also note that within a single EMR Job Flow, you can have multiple steps, but those steps run sequentially.

And yes, it would be nice if that wasn't the case :)

I was using EMR while teaching a web mining tutorial at Strata, and had a lab that would automatically upload the Hadoop job jar to S3 and add a step to a persistent (--alive) Job Flow, but each job had to run sequentially.

> Don't think of EMR Job Flows as one to one with a Cascading Flow.
>
> Your best bet is to create one app with one jar, with all your flows, and run them as a Cascade.

A step in an EMR Job Flow executes the main() method of your job jar, which in turn can configure/submit any number of actual Hadoop jobs (in parallel or sequentially).

So by doing what Chris suggests, you'll get parallelism (either inside of one Cascading Flow, or across multiple Flows if you're using a Cascade).
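
To make the "one main(), many jobs" point concrete, here is a minimal sketch in plain Scala that starts several units of work at once and waits for all of them. It does not use Cascading or Hadoop: `runJob`, `runAll`, and the job names are hypothetical stand-ins for whatever configures and submits a real job, so treat it as an illustration of the shape rather than a working EMR step.

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

object ParallelStepSketch {
  // Hypothetical stand-in for configuring and submitting one Hadoop job;
  // returns true when the job "succeeds".
  def runJob(name: String): Boolean = {
    Thread.sleep(100) // pretend the cluster is doing work
    true
  }

  // Start every job at once, wait for all of them, and report overall success.
  def runAll(names: Seq[String]): Boolean = {
    val running = names.map(n => Future(runJob(n)))
    Await.result(Future.sequence(running), 1.minute).forall(identity)
  }

  // What a single EMR step would invoke: one main() that launches several jobs.
  def main(args: Array[String]): Unit = {
    val ok = runAll(Seq("job1", "job2", "job3"))
    println(s"all jobs successful: $ok")
  }
}
```

With a Cascade you would not manage the futures yourself; the sketch only shows why a single step is not limited to a single job.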

-- Ken

--------------------------
Ken Krugler
custom big data solutions & training
Hadoop, Cascading, Mahout & Solr




Nathan Stults

May 4, 2012, 1:48:53 PM
to cascadi...@googlegroups.com
Perfect, just what I needed. I haven't had to touch Cascading directly so far, so I want to confirm my perceptions: in Scalding, the entry point for Hadoop looks like this (I expanded intermediate functions):


    val flow = new HadoopFlowConnector( config ).connect( flowDef )
    flow.complete
    flow.getFlowStats.isSuccessful


It looks to me then like a Cascade is a kind of FlowConnector? So can I do something like


    val flow1 = new HadoopFlowConnector( config ).connect( flowDef1 )
    val flow2 = new HadoopFlowConnector( config ).connect( flowDef2 )

    val cascade = new CascadeConnector().connect( flow1, flow2 )
    cascade.complete()
    cascade.getCascadeStats.isSuccessful


in order to execute the cascade? And if no dependencies are detected, will these run in parallel with one another?

( I tried to click the "For More Information see Topological Scheduling" in the doc, but it didn't go anywhere, maybe it isn't completed yet...)

Thank you,

Nathan




Nathan Stults

May 4, 2012, 3:43:40 PM
to cascadi...@googlegroups.com
Sorry, that last question was a bit lazy of me. This works for me quite well, if anyone else using Scalding would also like to Cascade...


    // The poster's own package (presumably where Job1 and Job2 are defined).
    import com.hsihealth.scalding._

    import com.twitter.scalding._
    import cascading.cascade.{Cascade, CascadeConnector}
    import cascading.stats.FlowStats
    import scala.collection.JavaConversions._

    // Wraps several Scalding jobs and runs their flows as a single Cascade.
    class CascadedJob(args : Args) extends Job(args) {

      val jobs = List(new Job1(args), new Job2(args))

      override def run(implicit mode : Mode) = {
        // Build one Cascading flow per job.
        val flows = jobs.map { job =>
          job.buildFlow(mode)
        }.toSeq
        val cascade = new CascadeConnector().connect( flows:_* )
        cascade.complete()
        // The cascade counts as successful only if every child flow succeeded.
        cascade.getCascadeStats.getChildren.toSeq.forall { statObj =>
          statObj.asInstanceOf[FlowStats].isSuccessful
        }
      }
    }
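
On the earlier question of what runs in parallel: the idea of ordering flows by their data dependencies can be modelled in a few lines of plain Scala. This is only a simplified sketch, not Cascading's actual scheduler. `FlowModel`, `waves`, and the path names are hypothetical, and the only assumption is that one flow depends on another when it reads a path the other writes.

```scala
object FlowOrderSketch {
  // Simplified model of a flow: a name plus the paths it reads and writes.
  final case class FlowModel(name: String, sources: Set[String], sinks: Set[String])

  // Group flows into "waves": every flow in a wave depends only on flows in
  // earlier waves, so the flows inside one wave could run in parallel.
  def waves(flows: List[FlowModel]): List[List[String]] = {
    // A flow depends on another if it reads something the other writes.
    def dependsOn(a: FlowModel, b: FlowModel): Boolean =
      a.name != b.name && a.sources.exists(b.sinks.contains)

    @annotation.tailrec
    def loop(pending: List[FlowModel], acc: List[List[String]]): List[List[String]] =
      if (pending.isEmpty) acc.reverse
      else {
        // Ready flows are those that do not wait on any still-pending flow.
        val (ready, blocked) =
          pending.partition(f => !pending.exists(other => dependsOn(f, other)))
        require(ready.nonEmpty, "cyclic dependency between flows")
        loop(blocked, ready.map(_.name) :: acc)
      }

    loop(flows, Nil)
  }
}
```

For example, if flow A writes a path that flow C reads while flow B is unrelated, `waves` puts A and B in the first wave and C in the second. Two jobs with no shared paths, like the two in the post above, would land in the same wave.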

Oscar Boykin

May 4, 2012, 4:47:12 PM
to cascadi...@googlegroups.com
Looks like a pull request to me, Nathan.

:)

Nathan Stults

May 4, 2012, 6:33:56 PM
to cascadi...@googlegroups.com

Why yes it does. Will do.

Nathan Stults

May 4, 2012, 8:02:08 PM
to cascadi...@googlegroups.com
Oscar, can you tell me how to run a single Spec from SBT as opposed to the whole suite? The Scalding tests take pretty long to run, and my new Spec seems to want to run last...

Thank you,

Nathan

Oscar Boykin

May 5, 2012, 12:12:38 PM
to cascadi...@googlegroups.com
Go into sbt and, from the sbt console, use test-only with a wildcard pattern for the tests (you can't do this from the command line for some reason):

    sbt
    # wait for sbt to load up and give you a console.
    test-only *MyTestSpec*

It should run just the one you name.

adam ilardi

Mar 7, 2013, 2:41:48 PM
to cascadi...@googlegroups.com
Hello. Was this ever merged into the project? I can't seem to find any examples of Cascade usage.

Thanks,
Adam