Here are some shell scripts that I'm using with success to help me manage running jug across a cluster. Luis, if you're interested in something like this, I can put these on the git and they can be upgraded to be more release-ready (although perhaps they should be renamed if so). I like being able to hit Tab after typing jug and see both plain old jug and all of my commands for managing the slaves:
[hamiltont@foo ~]$ jug
jug jughalt juglog jugoutput jugrun jugstatus
__________________________________________
# Using parallel ssh with IP addresses of all slave computers
alias pssh='pssh -i -h workers.iplist '
# Allows me to keep an eye on all workers
alias jugstatus='watch -n 10 -d jug status --cache benchmark_jug.py'
# Rapidly starts a ton of workers, and also starts a screen session that writes to the log file once the entire job is complete (the waitonjug script is posted below)
alias jugrun='echo "`date`: Start" >> ~/.juglog; screen -d -m -S jugwatcher sh ~/.waitonjug; pssh screen -d -S jug -m sh run_workers.sh'
# Allows me to see the output of all executors
alias jugoutput='pssh cat /mnt/localhd/.jug*'
# Kills all jug workers
alias jughalt='echo "`date`: Halt" >> ~/.juglog; screen -S jugwatcher -X quit; pssh pkill jug'
# Simply outputs my log file
alias juglog='cat ~/.juglog'
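For reference, workers.iplist is just a pssh hosts file with one [user@]host[:port] entry per line. The addresses below are made-up placeholders, not my actual slaves:

```shell
# Hypothetical contents of workers.iplist (placeholder addresses);
# pssh reads one [user@]host[:port] entry per line
cat > workers.iplist <<'EOF'
hamiltont@10.0.0.11
hamiltont@10.0.0.12
hamiltont@10.0.0.13
EOF
```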
_______________
[hamiltont@foo ~]$ cat run_workers.sh
#!/bin/sh
# Removing output from old runs
rm -f /mnt/localhd/.jug*
# Start 10 executors on this slave, redirecting their output to per-executor
# files in a directory that is not shared via NFS
for i in $(seq 1 10)   # seq rather than {1..10}: the script is run with plain sh, which may lack brace expansion
do
jug execute benchmark_jug.py > /mnt/localhd/.jug$i.out 2> /mnt/localhd/.jug$i.err &
done
# Waiting for all sub-processes to complete before we terminate
wait
__________________________________
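The watcher script is roughly the following (a sketch of the idea, not necessarily verbatim): block until jug reports that every task in the jugfile has a result, then stamp the log. It assumes jug's sleep-until subcommand, which waits for all results before returning:

```shell
#!/bin/sh
# Sketch of ~/.waitonjug: `jug sleep-until` blocks until every task in
# benchmark_jug.py has a result, then we append a timestamped line to the log
jug sleep-until benchmark_jug.py
echo "`date`: Completed" >> ~/.juglog
```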
[hamiltont@foo ~]$ juglog
Wed Nov 14 02:46:39 EST 2012: Start
Wed Nov 14 02:46:39 EST 2012: Completed
Wed Nov 14 02:47:39 EST 2012: Halt
Wed Nov 14 02:49:10 EST 2012: Start
Wed Nov 14 03:00:00 EST 2012: Completed