bootstrap actions for the .mrjob.conf file
From: Vishal Goklani <vgokl...@gmail.com>
Date: Fri, 2 Nov 2012 10:34:24 -0700 (PDT)
Local: Fri, Nov 2 2012 1:34 pm
Subject: bootstrap actions for the .mrjob.conf file

Hi,

I'm new to mrjob and have read through the docs, but I'm still unclear on how
best to set up the bootstrap actions for mrjob.

What I would like to do is upgrade my python instance to 2.7.3, and install
numpy, cython, pandas, sklearn, etc.

So my questions are as follows:

Is it possible to pre-build the binaries on S3, and then just copy the
files over?
Is it possible to pre-build an AMI and just use that for each EMR instance?
Is there a simple way of separating the 32bit pre-built binaries from the
64bit binaries (i.e. the small instance is 32bit, etc)?
If not, is everyone just rebuilding the binaries when the instance is
started (seems very inefficient).

It's also not clear how one would "migrate over their PYTHONPATH to EMR".
Does this mean that mrjob would auto-magically copy everything in my
current PYTHONPATH to each EMR instance? Do I have to do anything to
initiate this process?

Finally, could someone please post a detailed .mrjob.conf file showing each
of these steps, and how they are intended to be run. That is, could someone
please indicate where in the configuration file I should tell MRJOB to
download and install python 2.7.3, and where I should tell it to install
my python modules. The ordering of the bootstrap commands is not clear.

Thanks!

Vishal


 
From: Steve Johnson <dior...@gmail.com>
Date: Fri, 2 Nov 2012 10:58:34 -0700
Local: Fri, Nov 2 2012 1:58 pm
Subject: Re: bootstrap actions for the .mrjob.conf file

Here are some likely-accurate answers off the top of my head.

This is a really hairy problem which I have tackled before. Upgrading
Python on EMR in a bootstrap action is *not* an easy task. Possible, but
not easy. Unfortunately I no longer have access to the script I used.

1. Yes, it should be possible, but compiling is probably easier, or using a
.deb from somewhere.
2. No, to my knowledge you cannot build your own AMI. May have changed, so
don't take my word for it.
3. I have no idea.
4. I was able to do it in a bootstrap action, so it happens once per
instance. If you keep the same job flow running it isn't that high a cost,
though I understand most people don't use mrjob's EMR job flow pool
features.
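
On question 3, one possible approach is to keep the pre-built binaries under per-architecture prefixes and pick the right one at bootstrap time from the machine type. This is only a sketch, not something confirmed in this thread: the bucket name, the `prebuilt/` layout, and both function names are hypothetical.

```python
import platform


def arch_prefix(machine=None):
    """Map a machine type string to a hypothetical '32bit'/'64bit' prefix."""
    machine = (machine or platform.machine()).lower()
    if machine in ("x86_64", "amd64"):
        return "64bit"
    if machine in ("i386", "i486", "i586", "i686"):
        return "32bit"
    raise ValueError("unrecognized machine type: %r" % machine)


def binaries_uri(bucket, package, machine=None):
    """Build the S3 URI of a pre-built tarball (the layout is an assumption)."""
    return "s3://%s/prebuilt/%s/%s.tar.gz" % (
        bucket, arch_prefix(machine), package)
```

A bootstrap script could call `binaries_uri("my-bucket", "numpy")` on each instance to decide which tarball to download.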

PYTHONPATH: This means *you* need to make a tarball with your source tree
in it and use --python-archive (I think) with mrjob.
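
A minimal sketch of building such a tarball with Python's standard `tarfile` module is below. The function name is made up for illustration, and it assumes your packages should sit at the top level of the archive so that they are importable once the archive is unpacked and put on the path.

```python
import os
import tarfile


def build_source_archive(source_dir, archive_path):
    """Pack source_dir into a gzipped tarball, skipping compiled bytecode.

    Paths are stored relative to source_dir, so the packages inside it
    end up at the top level of the archive.
    """
    with tarfile.open(archive_path, "w:gz") as tar:
        for root, dirs, files in os.walk(source_dir):
            # Prune bytecode cache directories and keep a stable order.
            dirs[:] = sorted(d for d in dirs if d != "__pycache__")
            for name in sorted(files):
                if name.endswith((".pyc", ".pyo")):
                    continue
                full_path = os.path.join(root, name)
                arcname = os.path.relpath(full_path, source_dir)
                tar.add(full_path, arcname=arcname)
    return archive_path
```

For example, `build_source_archive("src", "vishals_code.tar.gz")` would produce a file you could then list under `python_archives`.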

*Example config*

runners:
  emr:
    # bootstrap_actions takes scripts; shell commands go in bootstrap_cmds
    bootstrap_actions:
      - install_python.sh
    bootstrap_cmds:
      - sudo python2.7 -m easy_install my_package
    python_archives: [vishals_code.tar.gz]

I hope this helps. Let me know if I can clarify anything. Unfortunately for
you I don't have specific information related to installing Python on EMR
beyond what I've written here, but I do happen to know quite a bit about
mrjob. Your problem boils down to "I need to install some Debian software
on EMR as fast as possible the first time the cluster comes online," which
is fairly common. Regardless, just don't expect it to be fast.

The only reason I upgraded Python in the first place was because of a
segfault bug in Python itself that manifested on some of my data, and I had
to use 2.6.3 instead of 2.6.2. If you don't have *that* sort of problem, I
would strongly recommend writing 2.6-compatible code if at all possible for
use on EMR.


 
From: Hunter Blanks <hun...@twilio.com>
Date: Fri, 2 Nov 2012 11:14:34 -0700
Local: Fri, Nov 2 2012 2:14 pm
Subject: Re: bootstrap actions for the .mrjob.conf file

Vishal,

One thing to consider is that you probably don't want to use the 32-bit
instances anyway -- Hadoop tends to be pretty memory intensive, so I think
you'll have an easier time just always using c1.mediums as your minimum
size.

But, there remain good reasons to use 32 bit for some things, of course.

-HJB


 
From: Vishal Goklani <vgokl...@gmail.com>
Date: Fri, 2 Nov 2012 11:25:48 -0700 (PDT)
Local: Fri, Nov 2 2012 2:25 pm
Subject: Re: bootstrap actions for the .mrjob.conf file

Thanks for the response Hunter. I agree entirely, and was only using 32bit
instances for testing. Curious if you've tried Disco, which uses less
overhead than Hadoop (Erlang vs Java).


 
From: Shivkumar Shivaji <sshiv...@gmail.com>
Date: Fri, 2 Nov 2012 11:26:22 -0700
Local: Fri, Nov 2 2012 2:26 pm
Subject: Re: bootstrap actions for the .mrjob.conf file

For binaries such as numpy, cython, pandas, and sklearn, it is probably quicker to find .rpm packages and install them from a bootstrap script. I think EMR uses a Red Hat-based Linux. Try bringing up a persistent instance, running the install there, and debugging along the way. Compiling is also possible (and perhaps easier), as Steve suggested, but it will take a long while per instance; even numpy alone seems to take 20+ minutes for me. Using an rpm saves that time. You could put all the required rpms on S3 and use a bootstrap script to download and install them, or you could seed the rpms from the machine where the job is running.

Shiv

From: Roy Smith <r...@panix.com>
Date: Fri, 2 Nov 2012 14:32:06 -0400
Local: Fri, Nov 2 2012 2:32 pm
Subject: Re: bootstrap actions for the .mrjob.conf file
On Nov 2, 2012, at 2:26 PM, Shivkumar Shivaji wrote:

> On binaries such as numpy, cython, pandas, sklearn it is probably quicker to find .rpm and call them in a bootstrap script. I think EMR uses a red hat based linux.

We've been using apt-get in our EMR bootstrap scripts.  Our EMR instances are ubuntu.  Not sure if that's the default, or just happens to be the AMI we configured.
--
Roy Smith
r...@panix.com

 
From: Shivkumar Shivaji <sshiv...@gmail.com>
Date: Fri, 2 Nov 2012 11:34:26 -0700
Local: Fri, Nov 2 2012 2:34 pm
Subject: Re: bootstrap actions for the .mrjob.conf file
Interesting, I must have missed this of late. Installing .debs from the bootstrap script is an easy solution then!

Shiv
From: Shivkumar Shivaji <sshiv...@gmail.com>
Date: Fri, 2 Nov 2012 11:42:10 -0700
Local: Fri, Nov 2 2012 2:42 pm
Subject: Re: bootstrap actions for the .mrjob.conf file

Disco is interesting. I have heard it mentioned at least a few times.

However, in my mind, it lacks the broad community support of Hadoop, and I would not risk it in production unless it's an experimental project. That could change with more adoption of Disco. If you do want to use Disco, you currently have to do a lot of the work on your own. Amazon's EMR and mrjob won't work with Disco either :) I was interested in benchmarking it a few times, but never got around to it.

Shiv

From: Vishal Goklani <vgokl...@gmail.com>
Date: Fri, 2 Nov 2012 15:10:46 -0400
Local: Fri, Nov 2 2012 3:10 pm
Subject: Re: bootstrap actions for the .mrjob.conf file
Hi Roy,

How did you get MRJOB to use Ubuntu for EMR? Would you mind posting your .mrjob.conf file?
From: Roy Smith <r...@panix.com>
Date: Fri, 2 Nov 2012 15:16:48 -0400
Local: Fri, Nov 2 2012 3:16 pm
Subject: Re: bootstrap actions for the .mrjob.conf file
On Nov 2, 2012, at 3:10 PM, Vishal Goklani wrote:

> Hi Roy,

> How did you get MRJOB to use Ubuntu for EMR? Would you mind posting your .mrjob.conf file?

I'm pretty much a novice at mrjob, so a lot of this was trial-and-error.  I'm not actually sure our EMR hosts run ubuntu, just that the apt-get commands work.

$ cat mrjob.conf
runners:
    emr:
        ami_version: 2.2.1
        aws_region: us-east-1
        aws_access_key_id: XXXX
        aws_secret_access_key: XXXX
        bootstrap_cmds:
            - sudo apt-get -y install python-virtualenv
            - sudo apt-get -y install mercurial
            - sudo apt-get -y install libcurl4-openssl-dev
            - sudo mkdir -p /home/songza/deploy
            - sudo chown -R hadoop.hadoop /home/songza
            - hg clone XXXX  /home/songza/deploy/current
            - virtualenv /home/songza/env/python
            - /home/songza/env/python/bin/easy_install pip
            - /home/songza/env/python/bin/pip install -r /home/songza/deploy/current/deploy/python/emr-requirements.txt
            - echo 'source /home/songza/env/python/bin/activate' >> ~/.bashrc
            - echo 'export SONGZA_BASEDIR=/home/songza/deploy/current' >> ~/.bashrc
            - echo 'export PYTHONPATH=/home/songza/deploy/current' >> ~/.bashrc
            - (cd /home/songza/deploy/current; make aws)

        cmdenv:
            TZ: Etc/UTC

        ec2_key_pair: compute
        ec2_key_pair_file: /tmp/compute.pem
        ec2_instance_type: m2.4xlarge
        enable_emr_debugging: True
        num_ec2_instances: 1
        s3_log_uri: s3://songza.compute/tmp/logs/
        s3_scratch_uri: s3://songza.compute/tmp/
        ssh_tunnel_to_job_tracker: True

--
Roy Smith
r...@panix.com

 
From: Shivkumar Shivaji <sshiv...@gmail.com>
Date: Fri, 2 Nov 2012 12:27:00 -0700
Subject: Re: bootstrap actions for the .mrjob.conf file
Thanks. Also, it looks like Amazon EMR moved to Debian squeeze late last year, so your AMI is likely Debian squeeze. I have not checked on this for a while and have been using bootstraps that do not depend on packages.

Shiv

From: Steve Johnson <dior...@gmail.com>
Date: Fri, 2 Nov 2012 12:31:40 -0700
Local: Fri, Nov 2 2012 3:31 pm
Subject: Re: bootstrap actions for the .mrjob.conf file

Let me clarify: *EMR instances run Ubuntu. You do not need to configure
anything to make that happen.*

Some comments on this config...

>         bootstrap_cmds:
>             - sudo mkdir -p /home/songza/deploy
>             - sudo chown -R hadoop.hadoop /home/songza

Why are you doing this instead of using /home/hadoop?

>             - hg clone XXXX  /home/songza/deploy/current

This is probably not a great idea if you're making local changes and
expecting to see them reflected in your EMR job without pushing to your
master branch.

>             - virtualenv /home/songza/env/python
>             - /home/songza/env/python/bin/easy_install pip
>             - /home/songza/env/python/bin/pip install -r /home/songza/deploy/current/deploy/python/emr-requirements.txt
>             - echo 'source /home/songza/env/python/bin/activate' >> ~/.bashrc

There's not much reason to use virtualenv in a bootstrap action since all
your configuration will be global to the cluster anyway.

>             - echo 'export SONGZA_BASEDIR=/home/songza/deploy/current' >> ~/.bashrc
>             - echo 'export PYTHONPATH=/home/songza/deploy/current' >> ~/.bashrc

Use cmdenv for this. Don't assume a shell.
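
Variables set through cmdenv arrive in the task's process environment, so job code can read them directly without depending on ~/.bashrc being sourced. A rough sketch of the reading side follows; the helper name and the fallback default are assumptions for illustration, not part of mrjob.

```python
import os

# Fallback used when the variable is not set (hypothetical default,
# taken from the path in the config above).
DEFAULT_BASEDIR = "/home/songza/deploy/current"


def get_basedir(environ=None):
    """Return SONGZA_BASEDIR from the environment, or the default."""
    environ = os.environ if environ is None else environ
    return environ.get("SONGZA_BASEDIR", DEFAULT_BASEDIR)
```

Passing the mapping in as an argument keeps the helper easy to test locally, where the variable is usually not set.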

>             - (cd /home/songza/deploy/current; make aws)

This is good.

>         enable_emr_debugging: True

Strangely, I never figured out exactly what this does.

 
From: Roy Smith <r...@panix.com>
Date: Fri, 2 Nov 2012 15:45:34 -0400
Local: Fri, Nov 2 2012 3:45 pm
Subject: Re: bootstrap actions for the .mrjob.conf file

On Nov 2, 2012, at 3:31 PM, Steve Johnson wrote:

> Let me clarify: EMR instances run Ubuntu. You do not need to configure anything to make that happen.

> Some comments on this config...

>         bootstrap_cmds:
>             - sudo mkdir -p /home/songza/deploy
>             - sudo chown -R hadoop.hadoop /home/songza
> Why are you doing this instead of using /home/hadoop?

I know it's bad form, but we have some places in our codebase which have this path hard-wired.  Life is not always pretty :-)

>            - hg clone XXXX  /home/songza/deploy/current
> This is probably not a great idea if you're making local changes and expecting to see them reflected in your EMR job without pushing to your master branch.

We're not.  Mostly we drag in the repo so we have access to some database code.  It's all read-only.

>             - virtualenv /home/songza/env/python
>             - /home/songza/env/python/bin/easy_install pip
>             - /home/songza/env/python/bin/pip install -r /home/songza/deploy/current/deploy/python/emr-requirements.txt
>             - echo 'source /home/songza/env/python/bin/activate' >> ~/.bashrc
> There's not much reason to use virtualenv in a bootstrap action since all your configuration will be global to the cluster anyway.

On the other hand, there's no reason not to.  We use virtualenv in other places.  All not using it on EMR would do is introduce a configuration difference between our EMR and other environments.  Why introduce differences when you don't have to?

>             - echo 'export SONGZA_BASEDIR=/home/songza/deploy/current' >> ~/.bashrc
>             - echo 'export PYTHONPATH=/home/songza/deploy/current' >> ~/.bashrc
> Use cmdenv for this. Don't assume a shell.

Hmmm, I wasn't aware of cmdenv.  I'll read up on it, thanks.

>         enable_emr_debugging: True
> Strangely, I never figured out exactly what this does.

Neither have we, but it seemed reasonable to turn it on :-)

--
Roy Smith
r...@panix.com


 
From: Shivkumar Shivaji <sshiv...@gmail.com>
Date: Fri, 2 Nov 2012 13:03:23 -0700
Local: Fri, Nov 2 2012 4:03 pm
Subject: Re: bootstrap actions for the .mrjob.conf file

Quick short comments..

>> Let me clarify: EMR instances run Ubuntu. You do not need to configure anything to make that happen.

Debian squeeze, not Ubuntu, to be precise. The two, however, are quite similar, and both use .deb package management.

>>         enable_emr_debugging: True
>> Strangely, I never figured out exactly what this does.

> Neither have we, but it seemed reasonable to turn it on :-)

This writes EMR logs to Amazon's SimpleDB, a key-value database. If you are using this feature, make sure your SimpleDB logs are deleted after a while; otherwise they will keep taking up space on SimpleDB.

Shiv


 
From: Steve Johnson <dior...@gmail.com>
Date: Fri, 2 Nov 2012 13:25:18 -0700
Local: Fri, Nov 2 2012 4:25 pm
Subject: Re: bootstrap actions for the .mrjob.conf file

Ha, I stand corrected.

Roy, it's a bit curious that you didn't know about cmdenv, considering
you're using it to set the time zone. :-)

From: Roy Smith <r...@panix.com>
Date: Fri, 2 Nov 2012 16:31:38 -0400
Local: Fri, Nov 2 2012 4:31 pm
Subject: Re: bootstrap actions for the .mrjob.conf file

Heh, I didn't even notice that.  Like I said, lots of trial-and-error.  I think the timezone part came from some example config I got from somewhere, didn't fully understand, and just kept poking at until it did what I wanted.

On Nov 2, 2012, at 4:25 PM, Steve Johnson wrote:

> Roy, it's a bit curious that you didn't know about cmdenv, considering you're using it to set the time zone. :-)

--
Roy Smith
r...@panix.com

 
From: Vishal Goklani <vgokl...@gmail.com>
Date: Sat, 3 Nov 2012 07:15:41 -0400
Local: Sat, Nov 3 2012 7:15 am
Subject: Re: bootstrap actions for the .mrjob.conf file

This looks like an interesting approach:

http://bit.ly/X7pD1R

Is he compiling, tarring, and then moving the files to S3 first?