I'm new to mrjob and have read through the docs, but I'm still unclear on how best to set up the bootstrap actions for mrjob.
What I would like to do is upgrade Python to 2.7.3 and install numpy, cython, pandas, sklearn, etc.
So my questions are as follows:
1. Is it possible to pre-build the binaries on S3, and then just copy the files over?
2. Is it possible to pre-build an AMI and just use that for each EMR instance?
3. Is there a simple way of separating the 32bit pre-built binaries from the 64bit binaries (i.e. the small instance is 32bit, etc)?
4. If not, is everyone just rebuilding the binaries when the instance is started (seems very inefficient)?
It's also not clear how one would "migrate over their PYTHONPATH to EMR". Does this mean that mrjob would auto-magically copy everything in my current PYTHONPATH to each EMR instance? Do I have to do anything to initiate this process?
Finally, could someone please post a detailed .mrjob.conf file showing each of these steps and how they are intended to be run? That is, could someone please indicate where in the configuration file I should tell mrjob to download and install Python 2.7.3, and where I should tell it to install my Python modules? The ordering of the bootstrap commands is not clear.
Here are some likely-accurate answers off the top of my head.
This is a really hairy problem which I have tackled before. Upgrading
Python on EMR in a bootstrap action is *not* an easy task. Possible, but
not easy. Unfortunately I no longer have access to the script I used.
1. Yes, it should be possible, but compiling is probably easier, or using a
.deb from somewhere.
2. No, to my knowledge you cannot build your own AMI. May have changed, so
don't take my word for it.
3. I have no idea.
4. I was able to do it in a bootstrap action, so it happens once per
instance. If you keep the same job flow running it isn't that high a cost,
though I understand most people don't use mrjob's EMR job flow pool
features.
PYTHONPATH: This means *you* need to make a tarball with your source tree
in it and use --python-archive (I think) with mrjob.
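A minimal sketch of the tarball step, using only the standard library. The function name and paths are hypothetical, and it assumes your packages should sit at the top level of the archive so that the unpacked directory can go straight onto PYTHONPATH; check that layout against the mrjob docs for your version.

```python
import os
import tarfile


def _skip_bytecode(tarinfo):
    """Leave compiled bytecode and __pycache__ directories out of the archive."""
    base = os.path.basename(tarinfo.name)
    if base == "__pycache__" or base.endswith(".pyc"):
        return None  # returning None tells tarfile to drop this member
    return tarinfo


def build_python_archive(source_dir, archive_path):
    """Pack the contents of source_dir into a gzipped tarball.

    Each top-level entry is stored under its own name, so a package at
    source_dir/mypkg ends up as mypkg/ at the root of the archive.
    """
    with tarfile.open(archive_path, "w:gz") as tar:
        for name in sorted(os.listdir(source_dir)):
            tar.add(
                os.path.join(source_dir, name),
                arcname=name,
                filter=_skip_bytecode,
            )
    return archive_path
```

The resulting file is what you would hand to the --python-archive option mentioned above, for example after calling build_python_archive("src", "src.tar.gz").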
I hope this helps. Let me know if I can clarify anything. Unfortunately for
you I don't have specific information related to installing Python on EMR
beyond what I've written here, but I do happen to know quite a bit about
mrjob. Your problem boils down to "I need to install some Debian software
on EMR as fast as possible the first time the cluster comes online," which
is fairly common. Regardless, just don't expect it to be fast.
The only reason I upgraded Python in the first place was because of a
segfault bug in Python itself that manifested on some of my data, and I had
to use 2.6.3 instead of 2.6.2. If you don't have *that* sort of problem, I
would strongly recommend writing 2.6-compatible code if at all possible for
use on EMR.
On Fri, Nov 2, 2012 at 10:34 AM, Vishal Goklani <vgokl...@gmail.com> wrote:
> Hi,
> I'm new to mrjob and read through the docs, but it's still unclear on how
> to optimally setup the bootstrap actions for mrjob.
> What I would like to do is upgrade my python instance to 2.7.3, and
> install numpy, cython, pandas, sklearn, etc.
> So my questions are as follows:
> Is it possible to pre-build the binaries on S3, and then just copy the
> files over?
> Is it possible to pre-build an AMI and just use that for each EMR instance?
> Is there a simple way of separating the 32bit pre-built binaries from the
> 64bit binaries (i.e. the small instance is 32bit, etc)?
> If not, is everyone just rebuilding the binaries when the instance is
> started (seems very inefficient).
> It's also not clear how one would "migrate over their PYTHONPATH to EMR".
> Does this mean that mrjob would auto-magically copy everything in my
> current PYTHONPATH to each EMR instance? Do I have to do anything to
> initiate this process?
> Finally, could someone please post a detailed .mrjob.conf file showing
> each of these steps, and how they are intended to be run. That is, could
> someone please indicate where in the configuration file I should tell MRJOB
> to download and install python 2.7.3, and where I should tell it to
> install my python modules. The ordering of the bootstrap commands is not
> clear.
One thing to consider is that you probably don't want to use the 32-bit
instances anyway -- Hadoop tends to be pretty memory intensive, so I think
you'll have an easier time just always using c1.mediums as your minimum
size.
But, there remain good reasons to use 32 bit for some things, of course.
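If you do end up mixing 32-bit and 64-bit instances, one way to keep the pre-built binaries apart is to store them under one prefix per architecture and choose the prefix at bootstrap time. This is only a sketch: the bucket name, the directory names, and the artifact_uri helper are all made up for illustration, and it only builds the URI (fetching it is left to whatever tool your image provides).

```python
import platform

# Hypothetical S3 layout: one prefix per architecture.
ARTIFACT_ROOT = "s3://example-bucket/prebuilt"  # placeholder bucket

# Map the strings platform.machine() commonly reports to a directory name.
_ARCH_DIRS = {
    "x86_64": "64bit",
    "amd64": "64bit",
    "i386": "32bit",
    "i686": "32bit",
}


def artifact_uri(package, machine=None):
    """Return the URI of the pre-built tarball for this machine's architecture.

    `machine` defaults to the architecture of the host running the code.
    """
    machine = (machine or platform.machine()).lower()
    try:
        arch_dir = _ARCH_DIRS[machine]
    except KeyError:
        raise ValueError("no prebuilt binaries for architecture %r" % machine)
    return "%s/%s/%s.tar.gz" % (ARTIFACT_ROOT, arch_dir, package)
```

Run on each instance during bootstrap, artifact_uri("numpy") would pick the prefix matching that instance, so the same bootstrap script can serve both instance families.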
Thanks for the response, Hunter. I agree entirely, and was only using 32bit instances for testing. I'm curious whether you've tried Disco, which has less overhead than Hadoop (Erlang vs Java).
On Friday, November 2, 2012 2:14:36 PM UTC-4, Hunter wrote:
> Vishal,
> One thing to consider is that you probably don't want to use the 32-bit instances anyway -- Hadoop tends to be pretty memory intensive, so I think you'll have an easier time just always using c1.mediums as your minimum size.
> But, there remain good reasons to use 32 bit for some things, of course.
> -HJB
For binaries such as numpy, cython, pandas, and sklearn, it is probably quicker to find .rpm packages and install them from a bootstrap script. I think EMR uses a Red Hat-based Linux. Maybe try bringing up a persistent instance, invoking the install there, and debugging along the way. Compiling is also possible (and perhaps easier), as Steve has suggested; however, it will certainly take a long while per instance. Even numpy alone seems to take 20+ minutes for me. Using an rpm can save this time. You could put all the required rpms on S3 and use a bootstrap script to download and install them, or you could seed the rpms from the machine where the job is running.
Shiv
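A rough sketch of what "ship the packages, then install them in a bootstrap step" could look like as an mrjob config, generated from Python so the ordering is explicit. Several things here are assumptions rather than facts from this thread: the package file names are invented, it uses .deb files and dpkg (swap in the rpm equivalents if your image is Red Hat based), bootstrap_files is the option name as I recall it from mrjob's EMR runner, uploaded files are assumed to land in the bootstrap working directory under their base name, and the commands are assumed to run in list order. The output is JSON, which I believe mrjob accepts for its config file; verify all of this against the docs for your mrjob version.

```python
import json


def build_emr_conf(package_paths):
    """Return an mrjob-style config dict that uploads and installs packages.

    One install command is generated per package, in the order given.
    """
    cmds = []
    for path in package_paths:
        # Assumption: each uploaded file is available under its base name.
        filename = path.rsplit("/", 1)[-1]
        cmds.append("sudo dpkg -i %s" % filename)
    return {
        "runners": {
            "emr": {
                "bootstrap_files": list(package_paths),
                "bootstrap_cmds": cmds,
            }
        }
    }


if __name__ == "__main__":
    # Hypothetical local package files, listed in install order.
    conf = build_emr_conf([
        "packages/python-numpy.deb",
        "packages/python-pandas.deb",
    ])
    print(json.dumps(conf, indent=2))
```

Because the install commands are built from the same list as the uploads, putting a dependency earlier in the list is enough to have it installed first.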
On Nov 2, 2012, at 10:58 AM, Steve Johnson <dior...@gmail.com> wrote:
> Here are some likely-accurate answers off the top of my head.
> This is a really hairy problem which I have tackled before. Upgrading Python on EMR in a bootstrap action is *not* an easy task. Possible, but not easy. Unfortunately I no longer have access to the script I used.
> 1. Yes, it should be possible, but compiling is probably easier, or using a .deb from somewhere.
> 2. No, to my knowledge you cannot build your own AMI. May have changed, so don't take my word for it.
> 3. I have no idea.
> 4. I was able to do it in a bootstrap action, so it happens once per instance. If you keep the same job flow running it isn't that high a cost, though I understand most people don't use mrjob's EMR job flow pool features.
> PYTHONPATH: This means *you* need to make a tarball with your source tree in it and use --python-archive (I think) with mrjob.
> I hope this helps. Let me know if I can clarify anything. Unfortunately for you I don't have specific information related to installing Python on EMR beyond what I've written here, but I do happen to know quite a bit about mrjob. Your problem boils down to "I need to install some Debian software on EMR as fast as possible the first time the cluster comes online," which is fairly common. Regardless, just don't expect it to be fast.
> The only reason I upgraded Python in the first place was because of a segfault bug in Python itself that manifested on some of my data, and I had to use 2.6.3 instead of 2.6.2. If you don't have *that* sort of problem, I would strongly recommend writing 2.6-compatible code if at all possible for use on EMR.
On Nov 2, 2012, at 2:26 PM, Shivkumar Shivaji wrote:
> On binaries such as numpy, cython, pandas, sklearn it is probably quicker to find .rpm and call them in a bootstrap script. I think EMR uses a red hat based linux.
We've been using apt-get in our EMR bootstrap scripts. Our EMR instances are Ubuntu. Not sure if that's the default, or if it just happens to be the AMI we configured.
--
Roy Smith
r...@panix.com
Disco is interesting. I have heard it mentioned at least a few times.
However, in my mind, it lacks the broad community support of Hadoop, and I would not risk it in production unless it's an experimental project. That could change with more adoption of Disco. If you do want to use Disco, you currently have to do a lot of work on your own. Amazon's EMR and mrjob won't work with Disco either :) I was interested in benchmarking it a few times, but never got around to it.
Shiv
On Nov 2, 2012, at 11:25 AM, Vishal Goklani <vgokl...@gmail.com> wrote:
> Thanks for the response Hunter. I agree entirely, and was only using 32bit instances for testing. Curious if you've tried Disco, which uses less overhead than Hadoop (Erlang vs Java).
How did you get mrjob to use Ubuntu for EMR? Would you mind posting your .mrjob.conf file?
On Nov 2, 2012, at 2:32 PM, Roy Smith <r...@panix.com> wrote:
> On Nov 2, 2012, at 2:26 PM, Shivkumar Shivaji wrote:
>> On binaries such as numpy, cython, pandas, sklearn it is probably quicker to find .rpm and call them in a bootstrap script. I think EMR uses a red hat based linux.
> We've been using apt-get in our EMR bootstrap scripts. Our EMR instances are ubuntu. Not sure if that's the default, or just happens to be the AMI we configured.
> --
> Roy Smith
> r...@panix.com
> How did you get mrjob to use Ubuntu for EMR? Would you mind posting your .mrjob.conf file?
I'm pretty much a novice at mrjob, so a lot of this was trial-and-error. I'm not actually sure our EMR hosts run Ubuntu, just that the apt-get commands work.
Thanks. Also, it looks like Amazon EMR moved to Debian squeeze late last year, so your AMI is likely Debian squeeze. I have not checked on this for a while, and I used bootstraps that do not depend on a package manager.
Shiv
On Nov 2, 2012, at 12:16 PM, Roy Smith <r...@panix.com> wrote:
> On Nov 2, 2012, at 3:10 PM, Vishal Goklani wrote:
>> Hi Roy,
>> How did you get MRJOB to use Ubunutu for EMR? Would you mind posting your .mrjob.conf file?
> I'm pretty much a novice at mrjob, so a lot of this was trial-and-error. I'm not actually sure our EMR hosts run ubuntu, just that the apt-get commands work.
Let me clarify: *EMR instances run Ubuntu. You do not need to configure
anything to make that happen.*
Some comments on this config...
> bootstrap_cmds:
> - sudo mkdir -p /home/songza/deploy
> - sudo chown -R hadoop.hadoop /home/songza
Why are you doing this instead of using /home/hadoop?
> - hg clone XXXX /home/songza/deploy/current
This is probably not a great idea if you're making local changes and
expecting to see them reflected in your EMR job without pushing to your
master branch.
> Let me clarify: EMR instances run Ubuntu. You do not need to configure anything to make that happen.
> Some comments on this config...
> bootstrap_cmds:
> - sudo mkdir -p /home/songza/deploy
> - sudo chown -R hadoop.hadoop /home/songza
> Why are you doing this instead of using /home/hadoop?
I know it's bad form, but we have some places in our codebase which have this path hard-wired. Life is not always pretty :-)
> - hg clone XXXX /home/songza/deploy/current
> This is probably not a great idea if you're making local changes and expecting to see them reflected in your EMR job without pushing to your master branch.
We're not. Mostly we drag in the repo so we have access to some database code. It's all read-only.
> - virtualenv /home/songza/env/python
> - /home/songza/env/python/bin/easy_install pip
> - /home/songza/env/python/bin/pip install -r /home/songza/deploy/current/deploy/python/emr-requirements.txt
> - echo 'source /home/songza/env/python/bin/activate' >> ~/.bashrc
> There's not much reason to use virtualenv in a bootstrap action since all your configuration will be global to the cluster anyway.
On the other hand, there's no reason not to. We use virtualenv in other places. Skipping it on EMR would only introduce a configuration difference between our EMR and other environments. Why introduce differences when you don't have to?
> - echo 'export SONGZA_BASEDIR=/home/songza/deploy/current' >> ~/.bashrc
> - echo 'export PYTHONPATH=/home/songza/deploy/current' >> ~/.bashrc
> Use cmdenv for this. Don't assume a shell.
Hmmm, I wasn't aware of cmdenv. I'll read up on it, thanks.
> enable_emr_debugging: True
> Strangely, I never figured out exactly what this does.
Neither have we, but it seemed reasonable to turn it on :-)
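To make the cmdenv suggestion concrete, here is a small sketch that pulls the `export` lines out of a bootstrap command list and turns them into a dict suitable for a cmdenv setting. The split_exports helper and its regular expression are illustrative only: they recognise just the exact `echo 'export NAME=value' >> ~/.bashrc` shape used in the config quoted above, and anything else is left in the command list untouched.

```python
import re

# Matches only the exact form: echo 'export NAME=value' >> ~/.bashrc
_EXPORT_RE = re.compile(r"^echo 'export (\w+)=([^']*)' >> ~/\.bashrc$")


def split_exports(bootstrap_cmds):
    """Separate .bashrc export lines from the other bootstrap commands.

    Returns (remaining_cmds, cmdenv), where cmdenv maps variable names
    to the values that were being exported.
    """
    remaining = []
    cmdenv = {}
    for cmd in bootstrap_cmds:
        match = _EXPORT_RE.match(cmd)
        if match:
            cmdenv[match.group(1)] = match.group(2)
        else:
            remaining.append(cmd)
    return remaining, cmdenv
```

Applied to the two export lines from the config, this yields SONGZA_BASEDIR and PYTHONPATH entries that could sit under cmdenv, so the job no longer depends on a login shell reading ~/.bashrc.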
>> Let me clarify: EMR instances run Ubuntu. You do not need to configure anything to make that happen.
Debian squeeze, not Ubuntu, to be precise. The two are quite similar, however, since both use .deb package management.
>> enable_emr_debugging: True
>> Strangely, I never figured out exactly what this does.
> Neither have we, but it seemed reasonable to turn it on :-)
This writes EMR logs to Amazon's SimpleDB, a key-value database. If you are using this feature, make sure your SimpleDB logs are deleted after a while; otherwise they will keep taking up space in SimpleDB.
Heh, I didn't even notice that. Like I said, lots of trial-and-error. I think the timezone part came from some example config I got from somewhere and didn't fully understand; I just kept poking at it until it did what I wanted.
On Nov 2, 2012, at 4:25 PM, Steve Johnson wrote:
> Roy, it's a bit curious that you didn't know about cmdenv, considering you're using it to set the time zone. :-)