Hi Steven,
Indeed, I used NLTK for my EC2 experiments in the PyCon talk. I tried two approaches:
(a) I created a new Amazon Machine Image (AMI) with NLTK installed as part of the filesystem.
(b) Dumbo lets you specify which eggs to ship to EC2 for a streaming job via its '-libegg' option, so I simply specified the nltk and yaml eggs on the (local) command line.
To be honest, (b) was easier and worked better than (a), but it obviously makes only the NLTK code available, not the data. Getting the data onto the nodes would require extra work to install it as part of the filesystem. Since I was using larger external datasets, that was not a problem for me.
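To make (b) concrete, here is a rough sketch of what such a job might look like (the egg filenames, Hadoop path, and input/output locations below are placeholders, not the ones I actually used):

    # wordcount.py: count word tokens using an NLTK tokenizer.
    # WordPunctTokenizer is purely regex-based, so it needs no
    # downloaded NLTK data, only the code shipped in the egg.
    import dumbo
    from nltk.tokenize import WordPunctTokenizer

    tokenizer = WordPunctTokenizer()

    def mapper(key, value):
        # value is one line of input text
        for token in tokenizer.tokenize(value):
            yield token, 1

    def reducer(key, values):
        yield key, sum(values)

    if __name__ == "__main__":
        dumbo.run(mapper, reducer)

which you would then launch with something like:

    dumbo start wordcount.py -hadoop /path/to/hadoop \
        -input input.txt -output counts \
        -libegg nltk.egg -libegg PyYAML.egg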
To replicate my PyCon experiments, see
http://www.umiacs.umd.edu/~nmadnani/pycon/replicate.pdf
I have also been thinking about creating a public AMI that would contain all of NLTK, all of its data, and all of the contrib stuff. However, the general problem is that you then have to create a new AMI every time you want to update the NLTK codebase. Option (b) is better since you can just specify the relevant egg file.
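With that workflow, picking up a new NLTK release is just a matter of rebuilding the egg (if the source tree uses a standard setuptools setup, something like 'python setup.py bdist_egg' should do it) and pointing -libegg at the new file, with no AMI rebuild required.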
I would be more than happy to be involved if you wanted to create some NLTK-based EC2 resources.
Cheers,
- Nitin