Amazon EMR multi-core processing?


Beth

Jun 25, 2012, 12:10:55 AM
to mr...@googlegroups.com
Hi. I'm trying to determine what type of instances, and how many, I should choose to run my job on Amazon EMR. I've run several test cases, but I was surprised to find only small changes in processing time for instances with different numbers of cores. My understanding is that Hadoop 0.20 is the latest version that can be run on Amazon EMR instances, but multi-core CPU processing isn't available in Hadoop until 0.21. Does anyone know if this is correct? Is it really true that jobs run on only a single core on Amazon EMR?

Anyone have recommendations for which instance type is best? I'm currently planning to use m2.xlarge or c1.medium.
Thanks! Been enjoying getting to know mrjob lately!

Steve Johnson

Jun 25, 2012, 12:25:46 AM
to mr...@googlegroups.com
I don't know much about how many cores which versions of Hadoop can use, but I do know that c1.medium and m2.xlarge mostly do worse than m1.large and m1.xlarge, performance- and cost-wise. But by all means, run your own tests on your own code!
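For anyone running these comparisons themselves, here is a minimal mrjob.conf sketch for pinning the instance type and count per test run. The option names (`ec2_instance_type`, `num_ec2_instances`) are from mrjob's EMR runner; double-check them against the mrjob version you're on:

```yaml
# mrjob.conf -- sketch; option names assume mrjob's EMR runner, verify for your version
runners:
  emr:
    ec2_instance_type: m1.large   # instance type for the cluster nodes
    num_ec2_instances: 4          # total number of instances to launch
```

Swapping these two values between runs on the same input makes it easy to compare wall-clock time and cost across instance types.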

Mark Cahill

Jun 25, 2012, 9:42:23 AM
to mr...@googlegroups.com
Yes, Hadoop spreads jobs across multiple cores. I've found m2.xlarge spot instances to be the most cost-effective at off-peak times; during peaks, small instances actually end up cheapest. Check spot prices before you launch the job if you're trying to save money. Generally, Hadoop performs better with fewer, larger instances than with lots of smaller ones, because there is some overhead in dispatching tasks to the cluster nodes.
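To make that comparison concrete, here is a small Python sketch that computes cost per concurrent map slot, which is a reasonable way to rank instance types at whatever spot prices you currently see. All the hourly prices and slot counts below are made-up placeholders, not real EC2 quotes; substitute the spot prices from the pricing history and the slot counts from the EMR documentation:

```python
# Sketch: rank instance types by cost per map slot.
# All hourly prices and slot counts below are ILLUSTRATIVE PLACEHOLDERS --
# plug in real spot prices and the per-instance slot counts from the EMR docs.

def cost_per_map_slot(hourly_price, map_slots):
    """Dollars per hour for each concurrent map task."""
    return hourly_price / map_slots

def cheapest(quotes):
    """Return the instance type with the lowest cost per map slot.

    `quotes` maps instance type -> (hourly_price, map_slots).
    """
    return min(quotes, key=lambda t: cost_per_map_slot(*quotes[t]))

if __name__ == '__main__':
    # Hypothetical numbers, for illustration only:
    quotes = {
        'm1.small':  (0.08, 2),
        'm1.large':  (0.32, 4),
        'm2.xlarge': (0.35, 6),
    }
    for name in sorted(quotes):
        price, slots = quotes[name]
        print('%-10s $%.3f per slot-hour' % (name, cost_per_map_slot(price, slots)))
    print('cheapest: %s' % cheapest(quotes))
```

This also shows why the answer flips between peak and off-peak: as the spot price of a large type rises, its cost per slot can cross above that of the small types.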

Beth

Jun 25, 2012, 10:51:22 AM
to mr...@googlegroups.com
Thank you very much. Great information. Any idea what peak times are generally?

Henry

Jun 25, 2012, 12:59:23 PM
to mr...@googlegroups.com
Hi Beth,

This is a topic I'm interested in as well. The best strategy is to do the following:

1. Collect the job metrics (perhaps using the built-in Ganglia support?)

Let me know how it goes; we would need to write some custom code to join the metrics, and I'm up for it.
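For the Ganglia suggestion, EMR ships a stock install-ganglia bootstrap action. A sketch of enabling it from mrjob.conf, assuming mrjob's `bootstrap_actions` option and Amazon's standard bootstrap-action S3 path (verify both against your mrjob and EMR versions):

```yaml
# mrjob.conf sketch -- option name and S3 path are assumptions, check your docs
runners:
  emr:
    bootstrap_actions:
      - s3://elasticmapreduce/bootstrap-actions/install-ganglia
```

With Ganglia installed on the cluster, per-node CPU utilization makes it obvious whether your job is actually keeping all cores busy.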

Shivkumar Shivaji

Jun 25, 2012, 1:03:51 PM
to mr...@googlegroups.com
Mappers and reducers can run as separate processes, and thus on different cores; Hadoop supports multiple such processes on one machine. The speedup is easier to see for large jobs. m1.large, for example, runs twice as many mappers and reducers as m1.small for the same number of instances.

However, there is typically no way to have a single mapper or reducer run on more than one core/thread. I'm not sure what Hadoop 0.21 supports for this kind of multi-threading; I'll try to find out more.

The number of mappers and reducers per instance for m1.large and m1.small is in the EMR documentation. Amazon bases these numbers on the number of cores each instance type supports.
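If you want to override Amazon's per-node defaults rather than accept them, Hadoop 0.20 exposes the slot counts as the `mapred.tasktracker.map.tasks.maximum` and `mapred.tasktracker.reduce.tasks.maximum` properties. A sketch using EMR's configure-hadoop bootstrap action from mrjob.conf; the option name and S3 path are assumptions, so verify them against your mrjob and EMR versions:

```yaml
# mrjob.conf sketch -- bootstrap_actions option and configure-hadoop path
# are assumptions; the property names are standard Hadoop 0.20 settings
runners:
  emr:
    bootstrap_actions:
      - s3://elasticmapreduce/bootstrap-actions/configure-hadoop -m mapred.tasktracker.map.tasks.maximum=4 -m mapred.tasktracker.reduce.tasks.maximum=2
```

Raising the map-slot count only helps if your tasks are CPU-bound and the node has idle cores; oversubscribing memory-hungry tasks can make things worse.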

Shivkumar Shivaji

Jun 25, 2012, 1:23:21 PM
to mr...@googlegroups.com
To add to my reply: 

There is a multithreaded mapper Java class (one mapper using more than one thread) available since Hadoop 0.20. This covers the second case of multi-threading from my previous reply. The class has gone through major bug fixes in 0.21. From mrjob, it requires writing custom Java code to exploit this type of parallelism.

However, for most people, the process-level parallelism that works with any Hadoop version is more than sufficient.

Mark Cahill

Jun 28, 2012, 9:42:39 PM
to mr...@googlegroups.com
Peak times fluctuate by day and vary by instance type. You can see them by looking at the spot-instance pricing history.