Performance


Dave

Jul 27, 2012, 11:22:57 AM
to seg...@googlegroups.com
Hello all,

New user to Segue.

I have been able to get Segue up and running with all my required packages and code, along with my exported data.

One of the first things I tried was running a very trivial function, one that literally just returns the parameter I pass in. It returns the correct result, but it takes around 5 minutes to do so. Could someone explain why this process takes 5 minutes? Is this just the nature of how Amazon EMR works?

Also, just to be clear: if I opt to use a machine with multiple cores, e.g. an 8-core machine, is there any harm in using a function that takes advantage of multicore, so that I can utilize all the cores on the machine where the job is being run?

James Long

Jul 30, 2012, 7:41:02 AM
to seg...@googlegroups.com
Hey Dave, welcome to Segue. Glad you were able to get it up and running.

Your long cycle times are not atypical. A lot of communication happens
between your desktop and the remote cluster before the job actually
starts running. Trivial jobs usually run faster than 5 minutes, but
that's not unheard of. Jobs well suited to Segue are usually jobs
that would run for many hours on a desktop machine: since there's
a multi-minute startup time regardless of the job's size, the job
has to be long-running enough to make up for that startup latency.

In terms of multicore, avoid it like the plague when using Segue. To
load up multicore machines with more instances of R, use the
instancesPerNode parameter of the createCluster() function to run
multiple R instances on each of your worker nodes. I usually set
instancesPerNode equal to the number of processors. The exception is
if I have high-memory jobs which are failing due to lack of memory;
then I start backing down the number of instances.
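Something along these lines, roughly from memory (the instance counts here are just illustrative placeholders, and you'd substitute your own AWS credentials):

```r
library(segue)
setCredentials("your-aws-key", "your-aws-secret")

## 4 worker nodes, each running 4 R instances --
## set instancesPerNode to match the cores on your instance type
myCluster <- createCluster(numInstances = 4,
                           instancesPerNode = 4)

## ... run your emrlapply() jobs against myCluster here ...

stopCluster(myCluster)
```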

I hope that helps some.

-J

Dave

Jul 30, 2012, 8:56:38 AM
to seg...@googlegroups.com
Is there any particular reason why multicore should be avoided and more instances of R used instead? My multicore calls are fairly deep in my code, well below the level where I have broken things up to be spread across the cluster with Segue, and the multicore calls themselves take at least 30 minutes to run (240 minutes without multicore).

James Long

Jul 30, 2012, 11:04:17 AM
to seg...@googlegroups.com
The advice to avoid multicore is a general heuristic against having
multiple systems managing parallelization of R. I have seen cases
where folks take a simple mclapply() and wrap it in emrlapply(), and
wrapping multicore inside of Segue makes debugging really, really
hard. That doesn't sound like your situation; yours may be one where
it makes sense to use multicore. Just know that if things start
blowing up, it's going to be hard to understand what's causing the
problem, as you have many levels of abstraction with lots of
sub-forking.

If you have individual worker tasks that are expected to take longer
than 10 minutes, be sure to set the taskTimeout parameter of the
emrlapply() function to a higher value. By default EMR will kill
single tasks running longer than 10 minutes, under the assumption
that if a task runs that long, something must have blown up.
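For example, for your 30-minute multicore tasks, something like this (taskTimeout is, as I recall, in minutes; myInputList and myLongRunningFunction are just placeholder names):

```r
## give each worker task up to 90 minutes before EMR kills it
results <- emrlapply(myCluster, myInputList, myLongRunningFunction,
                     taskTimeout = 90)
```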

-J

Dave

Jul 30, 2012, 7:30:57 PM
to seg...@googlegroups.com
Thanks for the response.

Just one quick question, based on the help for createCluster():

instancesPerNode

Number of R instances per node. Default of NULL uses AWS defaults.

Where are the AWS defaults defined?


James Long

Jul 31, 2012, 6:24:41 AM
to seg...@googlegroups.com
I've had rather unreliable results trying to use the defaults, so I'm not sure what they actually are. My practice has been to always specify a value. I should change the code to throw a warning when the default is used.

-J