Re: [Numpy-discussion] Numpy/Scipy for EC2

Dorian Raymer

Oct 29, 2009, 2:24:00 AM
to Discussion of Numerical Python, codenod...@googlegroups.com, sy...@googlegroups.com, sage-n...@googlegroups.com
Hi Dan,

I have recently created an AMI for running python processes. 

I recommend using the Ubuntu server AMIs provided by http://alestic.com/
Alestic is a well-known provider of public AMIs. I think this is exactly the place you want to start from; anything you need is an apt-get or easy_install away. 
From the moment you launch an instance, you are literally minutes away from being able to run a computation with Python. 

I also recommend the Firefox plugin called Elasticfox for interfacing with the AWS API. It is a lot easier than the command-line API tools!

I left some rough notes on my AMI creation/setup process here: http://wiki.github.com/codenode/codenode/backend-demonstration-ec2-image
The notes include the AMI ID of my resulting image, which you should be able to launch if you wish. If you are interested, I can go into more detail on how I set up the OS/Python environment, etc.

The image I created is used as the Codenode live public notebook backend: http://live.codenode.org/
You can create an account, log in, start a Notebook, import NumPy and run any code you want right now!

Hope this is useful,
Dorian

I cross-posted this to codenode-devel, sympy, and sage-notebook; I think this topic could be of interest to others on those lists. 




On Wed, Oct 28, 2009 at 9:29 PM, Dan Yamins <dya...@gmail.com> wrote:
Hi all:

I'm gearing up to build an Amazon Machine Image (AMI) for use in doing NumPy/SciPy computations on the Amazon EC2 cloud.

I'm writing to ask if anyone has any advice for which (if any) publicly available AMI I should start with.

If anyone has any specific AMIs that they think are good bases from which to modify -- or really, any other advice about using NumPy/SciPy on EC2 -- I'd love to know.

Beyond that, even if you don't know which AMI to recommend (or even what an AMI is), I still would like advice about which Linux flavor to use. I've had some experience with Mac OS X (and, with David Cournapeau's help over this list, I was able to build 64-bit SciPy with Python 2.6!), but I really know nothing about what the build process is like on Linux (and most likely, unless someone recommends a good AMI with optimized BLAS/LAPACK already built, I'm going to have to build it from scratch). So, should I use Ubuntu or Debian or Fedora or CentOS or ...?

Thanks!
Dan

_______________________________________________
NumPy-Discussion mailing list
NumPy-Di...@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Dorian Raymer

Nov 20, 2009, 6:25:07 PM
to codenod...@googlegroups.com


---------- Forwarded message ----------
From: Dan Yamins <dya...@gmail.com>
Date: Thu, Nov 19, 2009 at 5:45 PM
Subject: Re: [Numpy-discussion] Numpy/Scipy for EC2
To: Discussion of Numerical Python <numpy-di...@scipy.org>


Hi all:

I'm just writing to report on my experience using Starcluster, which enables the use of NumPy and SciPy in the Amazon EC2 cloud computing environment. The purpose of my email is to extol Starcluster's qualities, and suggest that the NumPy community be aware of its development. I suspect there are others in the community who find cloud computing an attractive idea but a little daunting to get into, and would be pleasantly surprised at how easy Starcluster makes it to get started using NumPy on Amazon EC2.

For those of you who aren't familiar with AMIs and the Amazon EC2 service, see e.g. http://en.wikipedia.org/wiki/Amazon_Elastic_Compute_Cloud. Three of the basic concepts are "Amazon Machine Images" (AMIs), "machine instances" of AMIs, and the Elastic Block Store (EBS) service. AMIs are disk images containing a virtual machine, including an operating system and other software you add on. Instances are temporarily allocated computers, booted with your chosen virtual machine, that you start up on demand, use for computations with software from the AMI, and then terminate. EBS is a persistent storage service, also from Amazon, that provides permanent file systems in the cloud. You allocate an EBS volume of a given size, attach the EBS volume(s) to a running machine instance just like any other hard drive, and use it to store the files you use/create during computation, both during the computation and then for later use whenever you start up a new instance.
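
To make the three concepts above concrete, here is a minimal sketch of the instance lifecycle in Python. It assumes the boto3 library, which postdates this thread and is not something the post mentions; the AMI ID, volume ID, availability zone, instance type and device name are all placeholders. The `ec2` argument is expected to behave like `boto3.client("ec2")`.

```python
def start_instance_with_volume(ec2, ami_id, volume_id, zone,
                               instance_type="m1.large", device="/dev/sdf"):
    """Boot one instance from an AMI and attach an existing EBS volume.

    `ec2` is assumed to be a boto3 EC2 client, e.g. boto3.client("ec2").
    All IDs and defaults here are placeholders, not values from the post.
    """
    # An instance is a temporary machine booted from the chosen AMI.
    # It must run in the same availability zone as the EBS volume.
    response = ec2.run_instances(
        ImageId=ami_id,
        InstanceType=instance_type,
        MinCount=1,
        MaxCount=1,
        Placement={"AvailabilityZone": zone},
    )
    instance_id = response["Instances"][0]["InstanceId"]

    # A volume can only be attached once the instance is running.
    ec2.get_waiter("instance_running").wait(InstanceIds=[instance_id])

    # The EBS volume then appears on the instance as a block device.
    ec2.attach_volume(VolumeId=volume_id, InstanceId=instance_id, Device=device)
    return instance_id


def terminate_instance(ec2, instance_id):
    """Terminate the instance; the separately created EBS volume persists."""
    ec2.terminate_instances(InstanceIds=[instance_id])
```

With real credentials configured, passing `boto3.client("ec2")` as `ec2` would launch a billable instance, so `terminate_instance` should be called once the computation is done. The files written to the EBS volume remain available to the next instance that attaches it.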

A couple of weeks ago I wrote to this list asking for advice on finding a good Amazon Machine Image (AMI) for using NumPy and SciPy on the Amazon cloud. I didn't want to have to build a Linux machine image with optimized BLAS and LAPACK myself, and I figured that there might be good existing publicly available AMIs that I could use as a base. Robert Kern suggested that I look into the Starcluster project (http://web.mit.edu/stardev/cluster/).

I have found Starcluster extremely useful. It made it possible for me to, in the course of one day, go from knowing essentially nothing about cloud computing to being able to run large-scale parallel clusters with my favorite NumPy/SciPy scripts.

The basis of what Starcluster offers is two solidly built AMIs. The operating system is Ubuntu Jaunty, and the images come with prebuilt optimized BLAS and LAPACK, NumPy, SciPy, matplotlib, IPython, and several other useful packages for scientific computing in Python. They use Python 2.6, and come in both 32-bit and 64-bit flavors. The AMIs are based on AMIs from Alestic (http://alestic.com/), and are built with best practices for ensuring stability and good interaction with Amazon's system. They have proved very stable and extensible.
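
A quick way to confirm what a freshly booted image actually provides is to try importing each package and report its version. The following is only a sketch: the package list mirrors the ones named above, and the helper name `installed_versions` is made up for illustration.

```python
import importlib


def installed_versions(package_names):
    """Map each package name to its version string, or None if it cannot be imported."""
    versions = {}
    for name in package_names:
        try:
            module = importlib.import_module(name)
        except ImportError:
            versions[name] = None
        else:
            # Not every module defines __version__, so fall back to "unknown".
            versions[name] = getattr(module, "__version__", "unknown")
    return versions


if __name__ == "__main__":
    # Package list taken from the description of the AMIs above.
    report = installed_versions(["numpy", "scipy", "matplotlib", "IPython"])
    for name, version in report.items():
        print(name, version if version is not None else "MISSING")
```

On an image where NumPy imports, `numpy.show_config()` prints the BLAS/LAPACK libraries it was built against, which is one way to check that the optimized builds are in use.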
   
In addition to these AMIs, Starcluster has three extremely useful features:

    -- Built-in support for mounting EBS drives as NFS filesystems, and then administering the shared drive across multiple machine instances. 
    -- The Sun Grid Engine (SGE), a queuing system for scheduling jobs to be run in parallel across instances
    -- A python module with a few commands that give you an incredibly simple interface for automating the process of starting/terminating a cluster of instances, mounting the shared drive, starting the grid engine, &c -- and configuring your cluster needs (e.g. how many nodes it will contain, which AMIs to use, which EBS volumes to mount etc.). 

As a result, all you have to do to have a NumPy-enabled cluster-on-demand is:
    1) Get an Amazon EC2 account, and the accompanying security credentials (X.509 certificate and SSH keypair) for your account.
    2) Install starcluster ("easy_install starcluster")
    3) Follow the installation procedure on the starcluster website for getting, attaching, and formatting an EBS volume as an NFS drive.
    4) Set up your starcluster configuration file.
    5) Start a 1-node cluster, modify the installation as you see fit, and re-bundle the result into a new AMI as described on the Amazon website http://docs.amazonwebservices.com/AWSEC2/latest/GettingStartedGuide/.   (Don't forget to edit your starcluster configuration file to reflect your new AMI.)   This step is optional -- If you don't need anything else special, you can just use Starcluster's base images.

After that, starting a cluster is as easy as typing a single command ("starcluster -s"). To submit parallel jobs on your cluster, you can learn to use the Sun Grid Engine "qsub" command (http://gridengine.sunsource.net/nonav/source/browse/~checkout~/gridengine/doc/htmlman/htmlman1/qsub.html) or use the Python bindings to the SGE interface (http://code.google.com/p/drmaa-python/). Or, if you like Parallel Python, that works perfectly well on these clusters too.
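
One common pattern with "qsub" is an array job, where the same script runs once per task and each copy works on its own share of the data. The sketch below assumes the job is submitted with tasks numbered 1..N in steps of 1, so that the `SGE_TASK_ID` and `SGE_TASK_LAST` environment variables set by Grid Engine give the task number and task count. The script name `worker.py`, the item count and the sum-of-squares workload are placeholders.

```python
import os


def env_int(name, default):
    """Read an integer environment variable, falling back to `default`.

    Outside an array job SGE sets the task variables to "undefined",
    so anything that is not a plain integer is treated as missing.
    """
    value = os.environ.get(name, "")
    return int(value) if value.isdigit() else default


def task_slice(n_items, n_tasks, task_id):
    """Return the [start, stop) item range for a 1-based task_id.

    Items are split as evenly as possible; the first (n_items % n_tasks)
    tasks each receive one extra item.
    """
    base, extra = divmod(n_items, n_tasks)
    index = task_id - 1
    start = index * base + min(index, extra)
    stop = start + base + (1 if index < extra else 0)
    return start, stop


def main():
    task_id = env_int("SGE_TASK_ID", 1)
    n_tasks = env_int("SGE_TASK_LAST", 1)
    start, stop = task_slice(1000, n_tasks, task_id)
    # Placeholder workload: replace with the real NumPy/SciPy computation.
    total = sum(i * i for i in range(start, stop))
    print("task %d/%d handled items [%d, %d): %d"
          % (task_id, n_tasks, start, stop, total))


if __name__ == "__main__":
    main()
```

Submitting it with something along the lines of `qsub -cwd -b y -t 1-8 python worker.py` would run eight copies, each handling a different slice; run directly on a laptop, the same script falls back to a single task covering everything.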

Overall, in my experience, Starcluster has been easy, stable and powerful, and I encourage anyone who is curious about cloud computing with Numpy to look into it. 

Starcluster is by no means a finished project.  At the moment, you can only administer one cluster at a time from your given local machine, since starcluster has no notion of a "session" and it can't distinguish between different clusters you've started up (you can start multiple clusters, but then any starcluster commands that you type in your local terminal might get confused about which amazon machine instances you're referring to, so it has trouble administering them.)    Also, there's no dynamic load balancing, so once you've started a cluster with a certain number of nodes, you're stuck with that number of computers while the cluster is running, even if you're only using a few of them or suddenly need more. 

The developer of the project (Justin Riley) says on his website that he's planning to add these features in the next release. Now, I'm not the creator or developer or maintainer of Starcluster, and I have no affiliation with Justin Riley or the project whatsoever, so I want to make it clear I don't speak for them in any way except as a satisfied user. I don't know what his commitment to his development plans is, either -- however, I hope he sticks to his timeline, as I think continuing the vigorous development of his project would be a real plus for the NumPy community. I'm hoping that if others in the NumPy community like his project and start using it, that will add to the likelihood of continued development. (If anyone from the NumPy community is interested in helping the developer out, perhaps you should consider shooting him an email.)

Anyhow, I apologize for this long email, and hope it may be of use to somebody!

Dan






 
