I thought they were; bjobs reported 16 tasks, and they had been running for 10 minutes before I checked on them, but only 9 of them were creating output files.
However, in order to make testing and debugging easier, I've reduced my code to a very simple example and have found what I think is an issue with the interaction between jug and the batch system on the cluster I'm using. Just in case it is useful for others to test or consider, here is what I've done.
I've reduced my script to the following testing code:
from jug import Task
import time

def process_file(target):
    '''very simple process of target file'''
    for i in range(20):
        with open(target, 'w') as fh:
            fh.write(str(i) + "\n")
        time.sleep(1)

start_year = 2008
end_year = 2040
run_dir = './runid/'
seasons = ['djf', 'mam', 'jja', 'son']

todo = []
for y in range(start_year, end_year + 1):
    for s in seasons:
        target = run_dir + str(y) + str(s)
        meantask = Task(process_file, target)
        todo.append(meantask)
jug reports 132 tasks (33 years x 4 seasons) ready to go. Running these tasks creates a series of files in the "runid" directory (which needs to be created before starting and should be emptied before re-running), each containing a number that is incremented once a second for 20 seconds. The original script was intended to average monthly data from a climate model into seasonal-mean files for each year.
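For reference, creating and emptying the output directory before each run is a one-liner (a sketch; it assumes you run it from the directory containing the jugfile):

```shell
# create runid/ if it doesn't exist, and clear output from any previous run
mkdir -p runid && rm -f runid/*
```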
I submit to the batch scheduler using the command
bsub -o %J.out -J "test[1-16]" jug execute
I can count the number of running processes using
cat runid/* | grep -v 19 | wc -l
and the number of completed processes using
cat runid/* | grep 19 | wc -l
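As a small convenience, the two counts can be wrapped in one shell function (just a sketch; it anchors the grep patterns to the whole line, since each file holds a single counter value and 19 is the final one):

```shell
# count running vs completed tasks from the files in runid/
# (each task overwrites its file with its loop counter; 19 is the final value)
count_tasks() {
    running=$(cat runid/* 2>/dev/null | grep -cv '^19$')
    completed=$(cat runid/* 2>/dev/null | grep -c '^19$')
    echo "running: $running  completed: $completed"
}
count_tasks
```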
When the batch job is running I see the same behaviour as with the full version: fewer than 16 tasks are running at any one time, and the number changes as tasks complete and new ones are (or are not) started, but it stays below 16. The output from the batch job suggests only 15 tasks ran (there was no output from job test[1] in the output file), which points at something odd going on.
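To pin down which array element goes missing, LSF can write one log file per array element using the %I (array index) placeholder alongside %J, along these lines (an untested sketch, guarded so it is a no-op where bsub is absent):

```shell
# one output file per array element: %J is the job ID, %I the array index
if command -v bsub >/dev/null 2>&1; then
    bsub -o %J_%I.out -J "test[1-16]" jug execute
fi
```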
When I run the test interactively on the login node using 16 processes, i.e. not using the batch scheduler, I see the expected behaviour: 16 tasks in progress at all times. So something funny is going on in the interaction between jug and the cluster I'm using (I can use the cluster to process hundreds of simple jobs at once, so I'm reasonably confident that the cluster itself is ok).
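For completeness, the interactive run amounts to launching 16 workers by hand, roughly as follows (a sketch, guarded so it is a no-op where jug is not installed):

```shell
# start 16 jug workers in the background -- the interactive equivalent
# of the 16-element bsub job array -- and wait for all of them to finish
if command -v jug >/dev/null 2>&1; then
    for i in $(seq 1 16); do
        jug execute &
    done
    wait
fi
```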
Now to raise this with the cluster sys admins.
Thanks for the reassurance that I wasn't doing anything dumb.
M