Now you should be able to run Dumbo jobs on Elastic MapReduce. To start a cluster, you can use the Ruby client like so:
elastic-mapreduce --create --alive
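Once the job flow is up, you can find the master node's public DNS name with the same client. A sketch only: the listing format varies by client version, and the hostname and keypair path below are placeholders, not real values.

```shell
# List job flows to find the master node's public DNS name
# (the hostname and keypair path below are made up).
elastic-mapreduce --list
ssh -i ~/.ssh/my-ec2-keypair.pem hadoop@ec2-xx-xx-xx-xx.compute-1.amazonaws.com
```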
SSH into the cluster using your EC2 keypair as user hadoop and install Dumbo with the following two commands:
wget http://peak.telecommunity.com/dist/ez_setup.py
sudo python ez_setup.py dumbo
Then you can run your Dumbo scripts. I was able to run the ipcount.py demo with the following command:
dumbo start ipcount.py -hadoop /home/hadoop -input s3://anhi-test-data/wordcount/input/ -output s3://anhi-test-data/output/dumbo/wc/
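For reference, a Dumbo script along the lines of the ipcount.py demo looks roughly like this. The shipped demo may differ; the mapper and reducer below are a sketch that counts the first whitespace-separated field (the client IP) of each Apache access-log line:

```python
# Sketch of an ipcount-style Dumbo job: count how often each client IP
# appears in Apache access logs. Not the shipped demo, just the shape
# of a typical Dumbo script.

def mapper(key, value):
    # value is one log line; the client IP is the first field
    yield value.split(" ")[0], 1

def reducer(key, values):
    # sum all counts emitted for one IP
    yield key, sum(values)

# When launched via "dumbo start ipcount.py ...", the functions are
# hooked up with:
#
#     import dumbo
#     dumbo.run(mapper, reducer)
```

Since the mapper and reducer are plain Python generators, they can also be exercised locally without Hadoop or Dumbo installed.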
The -hadoop option is important. At this point I haven't created an automatic Dumbo install script, so you'll have to install Dumbo by hand each time you launch the cluster. Fortunately, installation is easy.
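In the meantime, the two install commands above could be wrapped in an EMR bootstrap action so each new cluster installs Dumbo at launch. This is an untested sketch, not an official Amazon script, and the S3 bucket name below is a placeholder:

```shell
#!/bin/bash
# install-dumbo.sh -- hypothetical bootstrap action that repeats the
# manual install steps on every node when the cluster launches.
set -e
wget http://peak.telecommunity.com/dist/ez_setup.py
sudo python ez_setup.py dumbo
```

Upload it to S3 and pass it at cluster creation, e.g. elastic-mapreduce --create --alive --bootstrap-action s3://my-bucket/install-dumbo.sh (the bucket path is made up).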
On Nov 24, 7:24 pm, Klaas Bosteels <klaas.boste...@gmail.com> wrote:
> FYI
>
> > Hi Klaas.
> > Just a quick note, we had to roll back the AMI that had dumbo support.
> > It should be back on there before Monday, though.
> > Sorry for the inconvenience, I should have emailed you sooner.
> > Regards,
> > Andrew
>
> On Nov 24, 2009, at 7:28, "Klaas Bosteels" <klaas.boste...@gmail.com> wrote:
>
> Nitin,
> It's great to hear that you'll be using dumbo for your pycon talk!
> The python version is probably the problem yeah. You could try
> changing the generated commands manually, but I think you might also
> run into some other issues if you don't use python 2.5 or newer.
> Btw, you could also use amazon EMR instead of raw EC2 instances.
> Here's part of an email I got from one of the amazon guys recently:
>
> Now you should be able to run Dumbo jobs on Elastic MapReduce. To
> start a cluster, you can use the Ruby client as so:
>
> elastic-mapreduce --create --alive
>
> SSH into the cluster using your EC2 keypair as user hadoop and install
> Dumbo with the following two commands:
>
> wget http://peak.telecommunity.com/dist/ez_setup.py
> sudo python ez_setup.py dumbo
>
> Then you can run your Dumbo scripts. I was able to run the ipcount.py
> demo with the following command.
>
> dumbo start ipcount.py -hadoop /home/hadoop -input
> s3://anhi-test-data/wordcount/input/ -output
> s3://anhi-test-data/output/dumbo/wc/
>
> The -hadoop option is important. At this point I haven't created an
> automatic Dumbo install script, so you'll have to install Dumbo by
> hand each time you launch the cluster. Fortunately installation is
> easy.
>
> I'll try to blog about this on http://dumbotics.com once the automatic
> install script is ready.
> -Klaas
>
> On 24 Nov 2009, at 15:34, Nitin Madnani wrote:
>
> Klaas,
>
> Thanks for getting back to me! Yeah, I think the Python version may be
> the kicker here. It's Python 2.3! Do you think that's the problem?
>
> BTW, I am trying to do all this for my PyCon 2010 talk which is on
> doing large scale natural language processing using NLTK and Dumbo.
>
> BTW, just to clarify, the regular streaming command (non-module
> specification) without using Dumbo seems to have worked just fine. I
> guess I will try taking the command line generated by Dumbo and
> modifying it to use the other semantics and see what happens.
>
> If nothing works on this cluster, I will have to use EC2, I guess.
>
> Nitin
>
> On Tue, Nov 24, 2009 at 5:04 AM, Klaas Bosteels
>