Is it possible to launch an EMR Cluster on a VPC (private subnet)?

Sergio Mafra

Mar 29, 2014, 6:35:34 PM
to mr...@googlegroups.com
Hi fellows,

I'm just starting to use MRJob and am wondering whether we can deploy the cluster on a private subnet in a VPC.

All best,

Sergio

Jeffrey Quinn

Mar 30, 2014, 4:03:21 AM
to mr...@googlegroups.com
I have been wondering the same thing. From reading the mrjob 0.4.3 dev documentation, you should be able to accomplish this via the "--emr-api-param" option, which lets you pass parameters directly to the EMR API. However, it doesn't look like the 0.4.3 code is on GitHub yet (unless I'm mistaken), so I can't test this.

If it turns out there is no such MRJob option, a workaround is to create your own persistent jobflow in your VPC and then feed the jobflow id to MRJob.

You can do this easily through the Ruby elastic-mapreduce CLI that Amazon supports (note: requires Ruby 1.8.7), via a command like this:

elastic-mapreduce --create --alive --name "Super Private Cluster" --subnet "subnet-XXXXXXXX"

The key option here is `--subnet`, which takes the ID of a subnet in your VPC.

This prints the jobflow id to STDOUT; you can also look it up via `elastic-mapreduce --list`.

Then you can just tell MRJob to use the jobflow you just created via `--emr-job-flow-id`:

 python yourjob.py -r emr --emr-job-flow-id <jobflow id> input_file

The one issue with this workaround is that you lose MRJob's bootstrapping capabilities, which is a pain. You can solve this by adding `--bootstrap-action "s3://myawsbucket/my_bootstrap_script" --args "arg1,arg2"` to your `elastic-mapreduce` command when you create your persistent jobflow (you'll need to put your bootstrap script on S3). Just make sure your bootstrap script installs the mrjob Python module in addition to the rest of your bootstrap commands (MRJob normally does this for you).
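For anyone scripting the workaround above, here is a minimal Python sketch of the same three steps: build the `elastic-mapreduce` command, pull the jobflow id out of its output, and build the mrjob command. It only assembles argument lists and does not call AWS. The helper names, the `subnet-XXXXXXXX` and `j-2AB3CD4EF5GH` placeholders, and the assumption that the jobflow id appears in the CLI output as `j-` followed by uppercase letters and digits are illustrative, not part of either tool.

```python
import re
import shlex


def create_jobflow_cmd(name, subnet_id, bootstrap_uri=None, bootstrap_args=None):
    """Build the elastic-mapreduce argv for a persistent jobflow in a VPC subnet.

    Only flags mentioned in this thread are used.
    """
    cmd = ["elastic-mapreduce", "--create", "--alive",
           "--name", name, "--subnet", subnet_id]
    if bootstrap_uri:
        cmd += ["--bootstrap-action", bootstrap_uri]
        if bootstrap_args:
            # The CLI takes bootstrap arguments as one comma-separated value.
            cmd += ["--args", ",".join(bootstrap_args)]
    return cmd


def find_jobflow_id(cli_output):
    """Return the first 'j-XXXX' style id found in the output, or None.

    Assumption: ids look like 'j-' plus uppercase letters and digits.
    """
    match = re.search(r"\bj-[A-Z0-9]+\b", cli_output)
    return match.group(0) if match else None


def mrjob_cmd(script, jobflow_id, input_path):
    """Build the argv that points an mrjob script at an existing jobflow."""
    return ["python", script, "-r", "emr",
            "--emr-job-flow-id", jobflow_id, input_path]


if __name__ == "__main__":
    create = create_jobflow_cmd(
        "Super Private Cluster",
        "subnet-XXXXXXXX",  # placeholder subnet id
        bootstrap_uri="s3://myawsbucket/my_bootstrap_script",
        bootstrap_args=["arg1", "arg2"],
    )
    print(" ".join(shlex.quote(part) for part in create))

    # Hypothetical CLI output, used only to exercise the parser.
    jobflow_id = find_jobflow_id("Created job flow j-2AB3CD4EF5GH")
    run = mrjob_cmd("yourjob.py", jobflow_id, "input_file")
    print(" ".join(shlex.quote(part) for part in run))
```

The lists could be passed to `subprocess.check_output` to actually run the commands; that part is left out because it depends on your local CLI setup and credentials.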

Sergio Mafra

Mar 30, 2014, 9:48:22 AM
to mr...@googlegroups.com
Hey Jeffrey,

Thanks so much for such a detailed answer. I'll test it next week.
If we have the "--emr-api-param" option available, what would the command line be?
Can you show us?

All best,

Sérgio

Jeffrey Quinn

Apr 10, 2014, 6:27:10 PM
to mr...@googlegroups.com
No problem. If this option worked as expected in 0.4.2, you would add something like this to your mrjob.conf:

runners:
  emr:
    emr_api_params:
      Instances.Ec2SubnetId: subnet-XXXXXXXX

However, when you actually do this, mrjob complains that it doesn't recognize the option and does not launch your cluster in the VPC.

The easiest workaround I have found (creating and bootstrapping a cluster manually, as I mentioned earlier, is a pain) is to patch the mrjob source code. You only need to do this for the mrjob on your local machine; the cluster can use the unpatched version. Something like this:

mrjob/emr.py around line 1187

        emr_job_flow_id = emr_conn.run_jobflow(
            self._job_name,
            self._opts['s3_log_uri'],
            # added line: ask EMR to launch the jobflow in this VPC subnet
            api_params={'Instances.Ec2SubnetId': 'subnet-XXXXXXXX'},
            **args)
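
A side note on the patch above: if `args` ever already contained an `api_params` key, passing `api_params=` next to `**args` would fail with a duplicate-keyword `TypeError`. Whether that can happen in this version of mrjob is not something this thread establishes, so treat the following as a defensive sketch only. `with_subnet` is a hypothetical helper, not part of mrjob or boto; it returns a copy of the keyword arguments with the subnet merged in.

```python
def with_subnet(run_jobflow_kwargs, subnet_id):
    """Return a copy of the kwargs with the subnet added to 'api_params'.

    Hypothetical helper: it keeps any api_params that are already present
    and leaves the caller's dictionaries untouched.
    """
    merged = dict(run_jobflow_kwargs)
    api_params = dict(merged.get("api_params") or {})
    api_params["Instances.Ec2SubnetId"] = subnet_id
    merged["api_params"] = api_params
    return merged


if __name__ == "__main__":
    # Made-up kwargs, only to show the merge; real ones come from mrjob.
    args = {"ami_version": "latest", "api_params": {"Some.Other.Param": "x"}}
    print(with_subnet(args, "subnet-XXXXXXXX"))
```

With this, the patched call could be written as `emr_conn.run_jobflow(self._job_name, self._opts['s3_log_uri'], **with_subnet(args, 'subnet-XXXXXXXX'))`.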

Sergio Mafra

Apr 10, 2014, 6:56:42 PM
to mr...@googlegroups.com
Hey Jeffrey,

It seems that launching EMR on a private subnet is not possible at the moment. See: http://cloudconclave.blogspot.com.br/2014/03/emr-in-private-subnet.html

Jeffrey Quinn

Apr 10, 2014, 7:00:31 PM
to mr...@googlegroups.com
That's news to me. I have an EMR cluster running in my VPC this very moment :)

Maybe they were running into this issue: http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/emr-troubleshoot-error-vpc.html

Or maybe "private" subnet means something different from how I'm interpreting it?

Jeffrey Quinn

Apr 10, 2014, 7:04:05 PM
to mr...@googlegroups.com
Oh, I see now. Sorry, I misunderstood; yes, that's true. I asked about this on SO recently and got a response from someone who is apparently an AWS dev:

Sergio Mafra

Apr 10, 2014, 7:06:08 PM
to mr...@googlegroups.com
Yep. BTW, I know the guy who answered your question. He works for AWS here in Brazil. Pretty odd!