bootstrap actions for the .mrjob.conf file

Vishal Goklani

Nov 2, 2012, 1:34:24 PM
to mr...@googlegroups.com
Hi,

I'm new to mrjob and have read through the docs, but I'm still unclear on how best to set up the bootstrap actions for mrjob.

What I would like to do is upgrade my python instance to 2.7.3, and install numpy, cython, pandas, sklearn, etc.

So my questions are as follows:

1. Is it possible to pre-build the binaries on S3, and then just copy the files over?
2. Is it possible to pre-build an AMI and just use that for each EMR instance?
3. Is there a simple way of separating the 32-bit pre-built binaries from the 64-bit binaries (i.e. the small instance is 32-bit, etc.)?
4. If not, is everyone just rebuilding the binaries when the instance is started? (That seems very inefficient.)

It's also not clear how one would "migrate over their PYTHONPATH to EMR". Does this mean that mrjob would auto-magically copy everything in my current PYTHONPATH to each EMR instance? Do I have to do anything to initiate this process?

Finally, could someone please post a detailed .mrjob.conf file showing each of these steps and how they are intended to be run? That is, could someone please indicate where in the configuration file I should tell mrjob to download and install Python 2.7.3, and where I should tell it to install my Python modules? The ordering of the bootstrap commands is not clear.

Thanks!

Vishal

Steve Johnson

Nov 2, 2012, 1:58:34 PM
to mr...@googlegroups.com
Here are some likely-accurate answers off the top of my head.

This is a really hairy problem which I have tackled before. Upgrading Python on EMR in a bootstrap action is *not* an easy task. Possible, but not easy. Unfortunately I no longer have access to the script I used.

1. Yes, it should be possible, but compiling is probably easier, or using a .deb from somewhere.
2. No, to my knowledge you cannot build your own AMI. May have changed, so don't take my word for it.
3. I have no idea.
4. I was able to do it in a bootstrap action, so it happens once per instance. If you keep the same job flow running it isn't that high a cost, though I understand most people don't use mrjob's EMR job flow pool features.

PYTHONPATH: This means *you* need to make a tarball with your source tree in it and use --python-archive (I think) with mrjob.

Example config

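runners:
  emr:
    # Bootstrap actions are scripts; this one would build and install Python 2.7.
    bootstrap_actions:
      - install_python.sh
    # Plain shell commands go in bootstrap_cmds, which run after the actions.
    bootstrap_cmds:
      - sudo python2.7 -m easy_install my_package
    python_archives: [vishals_code.tar.gz]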

I hope this helps. Let me know if I can clarify anything. Unfortunately for you I don't have specific information related to installing Python on EMR beyond what I've written here, but I do happen to know quite a bit about mrjob. Your problem boils down to "I need to install some Debian software on EMR as fast as possible the first time the cluster comes online," which is fairly common. Regardless, just don't expect it to be fast.

The only reason I upgraded Python in the first place was because of a segfault bug in Python itself that manifested on some of my data, and I had to use 2.6.3 instead of 2.6.2. If you don't have *that* sort of problem, I would strongly recommend writing 2.6-compatible code if at all possible for use on EMR.
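
The tarball step Steve describes for PYTHONPATH could be sketched roughly as follows. This is only an illustration: `make_python_archive` is a made-up helper name, not part of mrjob, and it assumes that mrjob unpacks the archive and puts the archive root on PYTHONPATH, so the package directory is kept as the top-level entry.

```python
import os
import tarfile


def make_python_archive(source_dir, archive_path):
    """Pack source_dir into a .tar.gz, skipping compiled bytecode.

    Hypothetical helper for illustration; not part of mrjob.
    """
    def skip_bytecode(tarinfo):
        # Returning None from a tarfile filter drops that entry
        # (and, for a directory, everything underneath it).
        base = os.path.basename(tarinfo.name)
        if base == "__pycache__" or base.endswith((".pyc", ".pyo")):
            return None
        return tarinfo

    # Keep the package directory itself as the top-level entry so that
    # "import <package>" works once the archive root is on PYTHONPATH
    # (an assumption about how the archive gets unpacked).
    top_level = os.path.basename(os.path.normpath(source_dir))
    with tarfile.open(archive_path, "w:gz") as tar:
        tar.add(source_dir, arcname=top_level, filter=skip_bytecode)
    return archive_path
```

Calling `make_python_archive("vishals_code", "vishals_code.tar.gz")` would produce the file named in the `python_archives` line of the example config.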

Hunter Blanks

Nov 2, 2012, 2:14:34 PM
to mr...@googlegroups.com
Vishal,

One thing to consider is that you probably don't want to use the 32-bit instances anyway -- Hadoop tends to be pretty memory intensive, so I think you'll have an easier time just always using c1.mediums as your minimum size.

But, there remain good reasons to use 32 bit for some things, of course.

-HJB

Vishal Goklani

Nov 2, 2012, 2:25:48 PM
to mr...@googlegroups.com
Thanks for the response Hunter. I agree entirely, and was only using 32bit instances for testing. Curious if you've tried Disco, which uses less overhead than Hadoop (Erlang vs Java).

Shivkumar Shivaji

Nov 2, 2012, 2:26:22 PM
to mr...@googlegroups.com
For binaries such as numpy, cython, pandas, and sklearn, it is probably quicker to find .rpm packages and install them in a bootstrap script. I think EMR uses a Red Hat-based Linux. Maybe try bringing up a persistent instance, invoking the install, and debugging along the way. Compiling is also possible (and perhaps easier), as Steve has suggested; however, it will certainly take a long while per instance. Even numpy seems to take 20+ minutes for me. Using an rpm can save this time. You can consider putting all the required rpms on S3 and using a bootstrap script to download and install them, or you can seed the rpms from the machine where the job is running.

Shiv
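
The "put pre-built packages on S3 and install them at bootstrap" idea could look something like the sketch below, written for .deb files. Everything specific here is an assumption: the helper name `deb_install_cmds`, the bucket layout with one subfolder per `uname -m` value (for example `i686` and `x86_64`), that `hadoop fs` on the instance can read `s3://` URIs, and that each generated string is run through a shell so `$(uname -m)` gets expanded on the node.

```python
import posixpath


def deb_install_cmds(s3_prefix, deb_names, download_dir="/tmp/debs"):
    """Build shell commands that fetch pre-built .deb files and install them.

    Hypothetical helper. $(uname -m) is expanded on the instance, so 32-bit
    and 64-bit nodes read from different subfolders of s3_prefix.
    """
    if not deb_names:
        return []
    prefix = s3_prefix.rstrip("/")
    local_paths = [posixpath.join(download_dir, name) for name in deb_names]
    cmds = ["mkdir -p %s" % download_dir]
    for name, local_path in zip(deb_names, local_paths):
        # Assumes "hadoop fs" on the instance can read s3:// URIs.
        cmds.append("hadoop fs -copyToLocal %s/$(uname -m)/%s %s"
                    % (prefix, name, local_path))
    # Install everything in a single dpkg call so packages that depend on
    # each other are handled together.
    cmds.append("sudo dpkg -i %s" % " ".join(local_paths))
    return cmds
```

The returned strings are meant to be pasted into (or generated into) a list of bootstrap commands; this also gives one possible answer to the 32-bit versus 64-bit question, since each node picks its own subfolder.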

Roy Smith

Nov 2, 2012, 2:32:06 PM
to mr...@googlegroups.com
On Nov 2, 2012, at 2:26 PM, Shivkumar Shivaji wrote:

> On binaries such as numpy, cython, pandas, sklearn it is probably quicker to find .rpm and call them in a bootstrap script. I think EMR uses a red hat based linux.

We've been using apt-get in our EMR bootstrap scripts. Our EMR instances are Ubuntu. Not sure if that's the default, or if it just happens to be the AMI we configured.
--
Roy Smith
r...@panix.com

Shivkumar Shivaji

Nov 2, 2012, 2:34:26 PM
to mr...@googlegroups.com
Interesting, I must have missed this of late. Installing .debs from the bootstrap script is an easy solution then!

Shiv

Shivkumar Shivaji

Nov 2, 2012, 2:42:10 PM
to mr...@googlegroups.com
Disco is interesting. I have heard it mentioned at least a few times. 

However, in my mind, it lacks the broad community support of Hadoop, and I would not risk it in production unless it's an experimental project. That could change with more adoption of Disco. If you do want to use Disco, currently you have to do a lot of stuff on your own. Amazon's EMR and mrjob won't work with Disco either :) I was interested in benchmarking it a few times, but never got around to it.

Shiv

Vishal Goklani

Nov 2, 2012, 3:10:46 PM
to mr...@googlegroups.com
Hi Roy,

How did you get MRJOB to use Ubuntu for EMR? Would you mind posting your .mrjob.conf file?

Roy Smith

Nov 2, 2012, 3:16:48 PM
to mr...@googlegroups.com
On Nov 2, 2012, at 3:10 PM, Vishal Goklani wrote:

> Hi Roy,
>
> How did you get MRJOB to use Ubuntu for EMR? Would you mind posting your .mrjob.conf file?

I'm pretty much a novice at mrjob, so a lot of this was trial-and-error. I'm not actually sure our EMR hosts run ubuntu, just that the apt-get commands work.

$ cat mrjob.conf
runners:
  emr:
    ami_version: 2.2.1
    aws_region: us-east-1
    aws_access_key_id: XXXX
    aws_secret_access_key: XXXX
    bootstrap_cmds:
      - sudo apt-get -y install python-virtualenv
      - sudo apt-get -y install mercurial
      - sudo apt-get -y install libcurl4-openssl-dev
      - sudo mkdir -p /home/songza/deploy
      - sudo chown -R hadoop.hadoop /home/songza
      - hg clone XXXX /home/songza/deploy/current
      - virtualenv /home/songza/env/python
      - /home/songza/env/python/bin/easy_install pip
      - /home/songza/env/python/bin/pip install -r /home/songza/deploy/current/deploy/python/emr-requirements.txt
      - echo 'source /home/songza/env/python/bin/activate' >> ~/.bashrc
      - echo 'export SONGZA_BASEDIR=/home/songza/deploy/current' >> ~/.bashrc
      - echo 'export PYTHONPATH=/home/songza/deploy/current' >> ~/.bashrc
      - (cd /home/songza/deploy/current; make aws)

    cmdenv:
      TZ: Etc/UTC

    ec2_key_pair: compute
    ec2_key_pair_file: /tmp/compute.pem
    ec2_instance_type: m2.4xlarge
    enable_emr_debugging: True
    num_ec2_instances: 1
    s3_log_uri: s3://songza.compute/tmp/logs/
    s3_scratch_uri: s3://songza.compute/tmp/
    ssh_tunnel_to_job_tracker: True






--
Roy Smith
r...@panix.com



Shivkumar Shivaji

Nov 2, 2012, 3:27:00 PM
to mr...@googlegroups.com
Thanks. Also, it looks like Amazon EMR moved to Debian squeeze late last year, so your AMI is likely Debian squeeze. I personally have not checked on this for a while and have used bootstraps that are not package dependent.

Shiv

Steve Johnson

Nov 2, 2012, 3:31:40 PM
to mr...@googlegroups.com
Let me clarify: EMR instances run Ubuntu. You do not need to configure anything to make that happen.

Some comments on this config...

        bootstrap_cmds: 
            - sudo mkdir -p /home/songza/deploy
            - sudo chown -R hadoop.hadoop /home/songza
Why are you doing this instead of using /home/hadoop? 

            - hg clone XXXX  /home/songza/deploy/current
This is probably not a great idea if you're making local changes and expecting to see them reflected in your EMR job without pushing to your master branch.

            - virtualenv /home/songza/env/python
            - /home/songza/env/python/bin/easy_install pip
            - /home/songza/env/python/bin/pip install -r /home/songza/deploy/current/deploy/python/emr-requirements.txt
            - echo 'source /home/songza/env/python/bin/activate' >> ~/.bashrc
There's not much reason to use virtualenv in a bootstrap action since all your configuration will be global to the cluster anyway.
 
            - echo 'export SONGZA_BASEDIR=/home/songza/deploy/current' >> ~/.bashrc
            - echo 'export PYTHONPATH=/home/songza/deploy/current' >> ~/.bashrc
Use cmdenv for this. Don't assume a shell. 
 
            - (cd /home/songza/deploy/current; make aws)
This is good.
 
        enable_emr_debugging: True
Strangely, I never figured out exactly what this does.
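
To illustrate the "use cmdenv, don't assume a shell" suggestion, here is a small sketch that pulls `export NAME=value` lines out of a list of bootstrap commands and collects them into a dict suitable for a `cmdenv:` section. The helper name `move_exports_to_cmdenv` and the exact line pattern it recognizes are assumptions made for this example; it only handles the simple single-quoted `echo ... >> ~/.bashrc` form used in the config above.

```python
import re

# Matches lines like: echo 'export NAME=value' >> ~/.bashrc
_EXPORT_RE = re.compile(r"^echo\s+'export\s+(\w+)=([^']*)'\s*>>\s*~/\.bashrc$")


def move_exports_to_cmdenv(bootstrap_cmds):
    """Split bootstrap commands into (remaining_cmds, cmdenv).

    Hypothetical helper: commands that only append an "export" line to
    ~/.bashrc are turned into cmdenv entries; everything else is kept.
    """
    remaining, cmdenv = [], {}
    for cmd in bootstrap_cmds:
        match = _EXPORT_RE.match(cmd.strip())
        if match:
            cmdenv[match.group(1)] = match.group(2)
        else:
            remaining.append(cmd)
    return remaining, cmdenv
```

For the config in question, the two `export` lines would become `SONGZA_BASEDIR` and `PYTHONPATH` entries that sit under `cmdenv:` next to the existing `TZ: Etc/UTC`, while the `source .../activate` line is left where it is.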

Roy Smith

Nov 2, 2012, 3:45:34 PM
to mr...@googlegroups.com
On Nov 2, 2012, at 3:31 PM, Steve Johnson wrote:

> Let me clarify: EMR instances run Ubuntu. You do not need to configure anything to make that happen.
>
> Some comments on this config...
>
>>         bootstrap_cmds:
>>             - sudo mkdir -p /home/songza/deploy
>>             - sudo chown -R hadoop.hadoop /home/songza
> Why are you doing this instead of using /home/hadoop?

I know it's bad form, but we have some places in our codebase which have this path hard-wired.  Life is not always pretty :-)

>>             - hg clone XXXX  /home/songza/deploy/current
> This is probably not a great idea if you're making local changes and expecting to see them reflected in your EMR job without pushing to your master branch.

We're not.  Mostly we drag in the repo so we have access to some database code.  It's all read-only.

>>             - virtualenv /home/songza/env/python
>>             - /home/songza/env/python/bin/easy_install pip
>>             - /home/songza/env/python/bin/pip install -r /home/songza/deploy/current/deploy/python/emr-requirements.txt
>>             - echo 'source /home/songza/env/python/bin/activate' >> ~/.bashrc
> There's not much reason to use virtualenv in a bootstrap action since all your configuration will be global to the cluster anyway.

On the other hand, there's no reason not to.  We use virtualenv in other places.  All not using it on EMR would do is introduce a configuration difference between our EMR and other environments.  Why introduce differences when you don't have to?

>>             - echo 'export SONGZA_BASEDIR=/home/songza/deploy/current' >> ~/.bashrc
>>             - echo 'export PYTHONPATH=/home/songza/deploy/current' >> ~/.bashrc
> Use cmdenv for this. Don't assume a shell.

Hmmm, I wasn't aware of cmdenv.  I'll read up on it, thanks.

>>         enable_emr_debugging: True
> Strangely, I never figured out exactly what this does.

Neither have we, but it seemed reasonable to turn it on :-)

--
Roy Smith



Shivkumar Shivaji

Nov 2, 2012, 4:03:23 PM
to mr...@googlegroups.com
Quick short comments..

>> Let me clarify: EMR instances run Ubuntu. You do not need to configure anything to make that happen.

Debian squeeze, not Ubuntu, to be precise. The two are, however, quite similar, with .deb package management.

>>>         enable_emr_debugging: True
>> Strangely, I never figured out exactly what this does.
>
> Neither have we, but it seemed reasonable to turn it on :-)

This writes EMR logs to Amazon's SimpleDB, a key-value database. If you are using this feature, ensure that your SimpleDB logs are deleted after a while; otherwise they will take up space on SimpleDB.

Shiv

Steve Johnson

Nov 2, 2012, 4:25:18 PM
to mr...@googlegroups.com
Ha, I stand corrected.

Roy, it's a bit curious that you didn't know about cmdenv, considering you're using it to set the time zone. :-)

Roy Smith

Nov 2, 2012, 4:31:38 PM
to mr...@googlegroups.com
Heh, I didn't even notice that.  Like I said, lots of trial-and-error.  I think the timezone part came from some example config I got from somewhere and didn't fully understand, and I just kept poking at it until it did what I wanted.



On Nov 2, 2012, at 4:25 PM, Steve Johnson wrote:

> Roy, it's a bit curious that you didn't know about cmdenv, considering you're using it to set the time zone. :-)


--
Roy Smith



Vishal Goklani

Nov 3, 2012, 7:15:41 AM
to mr...@googlegroups.com
This looks like an interesting approach:


Is he compiling, tarring, and then moving the files to S3 first?