instancesPerNode
Number of R instances per node. The default, NULL, uses the AWS defaults.
Where are the AWS defaults defined?
The advice to avoid multicore is a general heuristic: it keeps you from
having multiple systems managing the parallelization of R at the same
time. That doesn't sound like your situation, but I have seen cases
where folks take a simple mclapply() and wrap it in emrlapply().
Wrapping multicore inside of Segue makes debugging really, really hard.
Your situation sounds different, and it might make sense to use
multicore there. Just know that if things start blowing up, it's going
to be hard to work out what's causing the problem, because you have
many levels of abstraction with lots of sub-forking.
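To make the nesting concrete, here's a rough sketch of the pattern I'm describing (hypothetical; the cluster size, list, and inner function are made up for illustration):

```r
library(segue)
library(multicore)

# Two layers of parallelization: emrlapply() farms list elements out to
# EMR workers, and each worker then forks again with mclapply(). If a
# forked child dies, the error has to bubble up through both layers,
# which is what makes debugging so painful.
myCluster <- createCluster(numInstances = 2)
results <- emrlapply(myCluster, as.list(1:10), function(x) {
  mclapply(1:8, function(i) x * i)  # second layer of forking, per worker
})
stopCluster(myCluster)
```

This is the shape to be wary of when the mclapply() call is shallow and trivial; when the inner work is genuinely long-running, as in your case, the trade-off can go the other way.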
If you have individual worker tasks that are expected to take longer
than 10 minutes, be sure to set the taskTimeout parameter of the
emrlapply() function to a higher value. By default, EMR will kill any
single task running longer than 10 minutes, on the assumption that
if a task runs that long, something must have blown up.
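For tasks in the 30-minute range you mentioned, that would look something like this (a sketch; cluster setup is elided, and myCluster, myList, and myLongFun are placeholder names — check ?emrlapply for the timeout's exact units and default):

```r
# Raise the per-task timeout well above the expected ~30-minute runtime
# so EMR doesn't kill long-running but healthy tasks.
results <- emrlapply(myCluster, myList, myLongFun,
                     taskTimeout = 60)
```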
-J
> Is there any particular reason why multi core should be avoided and more
> instances of R should be used instead? My multi core calls are fairly deep
> into my code from where I have broken things up to be spread amongst the
> cluster with segue and the MC calls themselves take at least 30 minutes to
> run or (240 minutes without MC)
>
>
> On Friday, July 27, 2012 11:22:57 AM UTC-4, Dave wrote:
>>
>> Hello all,
>>
>> New user to Segue.
>>
>> I have been able to successfully get Segue up and running with all my
>> required packages and code. Along with my data exported.
>>
>> One of the first things that I tried was running a very trivial function.
>> Something that literally returns the parameter that I pass in. Doing so
>> returns the correct result but this also takes around 5 minutes to do so.
>> Could someone explain why this process takes 5 minutes? Is this just the
>> nature of how Amazon EMR works?
>>
>> Also, just to be clear. If I opt to use a machine with multiple cores
>> i.e. an 8 core machine. Is there any harm in using a function that takes
>> advantage of multicore so that I can utilize all cores on the machine where
>> the job is being run?