Control of the number of concurrent processes


Matthew Mizielinski

Feb 3, 2015, 8:41:57 AM
to jug-...@googlegroups.com

I'm testing out Jug as it appears to have some promising properties, but I've hit one awkward point: how to control the number of concurrent processes.

The testing configuration I have has 132 tasks to complete, each of which will take ~20 minutes. I submit 16 processes in a job array using

bsub -o %J.out -J "debug[1-16]" jug execute jugfile.py

When all 16 processes start to run I check "jug status" and it reports

Task name                                    Waiting       Ready    Finished     Running
----------------------------------------------------------------------------------------
jugfile.process_file                               0         123           0           9

Checking the files these processes are churning through suggests that only 9 of them are running as reported.

I would like to be able to scale a process such as this up to several thousand tasks, possibly using up to a hundred processors (should the cluster I'm using become quiet enough).
It is not clear from the documentation how I can control this (there is a little at https://jug.readthedocs.org/en/latest/tasks.html#task-size-control, which implies this is only possible for mapreduce-based processing, along the lines of the sketch below).
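
(For reference, the task-size control described on that page applies to jug.mapreduce, where map_step bundles several inputs into a single task; a minimal sketch, not my actual code:)

from jug import mapreduce

def double(x):
    # illustrative per-item work; stands in for the real processing
    return 2 * x

# map_step=8 groups 8 inputs into each jug task; this controls task
# granularity, not how many 'jug execute' processes run concurrently
results = mapreduce.map(double, list(range(132)), map_step=8)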

Any suggestions on how to control this would be welcome.

Matthew

Matthew Mizielinski

Feb 3, 2015, 8:54:57 AM
to jug-...@googlegroups.com
Additional information; I'm using jug version 0.9.6

Luis Pedro Coelho

Feb 3, 2015, 10:59:38 AM
to jug-...@googlegroups.com
Hi Matthew,

On Tue, 3 Feb 2015, at 02:41 PM, Matthew Mizielinski wrote:
> The testing configuration I have has 132 tasks to complete, each of which
> will take ~20 minutes. I submit 16 processes in a job array using
>
> bsub -o %J.out -J "debug[1-16]" jug execute jugfile.py

This is the correct way to do it.

> When all 16 processes start to run I check "jug status" and it reports
>
> Task name                                    Waiting       Ready    Finished     Running
> ----------------------------------------------------------------------------------------
> jugfile.process_file                               0         123           0           9
>
> Checking the files these processes are churning through suggests that
> only 9 of them are running as reported.

This seems odd.

Are you sure that all 16 processes are running? Sometimes, it may take a
few seconds for LSF to start running these....

HTH,
--
Luis Pedro Coelho | EMBL | http://luispedro.org
My blog: http://metarabbit.wordpress.com

Matthew Mizielinski

Feb 4, 2015, 4:43:30 AM
to jug-...@googlegroups.com

I thought they were; bjobs reported 16 tasks and they were running for 10 minutes before I checked on them, but only 9 of them were creating the output files.

However, in order to make testing and debugging easier I've reduced my code down to a very simple example and have found what I think is an issue with the interaction between jug and the batch system on the cluster I'm using. Just in case it is useful for others to test or consider, here is what I've done.

I've reduced my script to the following testing code:

from jug import Task
import time

def process_file(target):
    '''very simple process of target file'''
    for i in range(20):
        with open(target, 'w') as fh:
            fh.write(str(i) + "\n")
        time.sleep(1)

start_year = 2008
end_year = 2040
run_dir = './runid/'

seasons = ['djf', 'mam', 'jja', 'son']

todo = []
for y in range(start_year, end_year + 1):
    for s in seasons:
        target = run_dir + str(y) + str(s)
        meantask = Task(process_file, target)
        todo.append(meantask)

jug reports 132 tasks ready to go. Running these tasks would create a series of files in the "runid" directory (needs to be created before starting and should be emptied before re-running) containing a number that is incremented once a second for 20 seconds. The original script was intended to average monthly data from a climate model into seasonal mean files for each year.

I submit to the batch scheduler using the command

bsub -o %J.out -J "test[1-16]" jug execute

I can count the number of running processes using

cat runid/* | grep -v 19 | wc -l

and the number of completed processes using

cat runid/* | grep 19 | wc -l

When the batch job is running I see the same behaviour as with the full version: fewer than 16 tasks run at any one time, and while the number fluctuates as tasks complete and new ones start (or fail to start), it stays below 16. Output from the batch job suggested only 15 of the 16 processes ran (there was no output from job test[1] in the output file), which points to something odd going on.

When I run the test interactively on the login node using 16 processes, i.e. not using the batch scheduler (e.g. with the loop below), I see the expected behaviour: 16 tasks in progress at all times. So something funny is going on with the interaction between jug and the cluster I'm using (I can use the cluster for processing hundreds of simple jobs at once, so I'm reasonably confident that the cluster itself is ok).
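
(Something like the following launches the 16 interactive workers; the exact invocation is from memory:)

for i in $(seq 16); do
    jug execute jugfile.py &
done
wait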

Now to raise this with the cluster sys admins.

Thanks for the reassurance that I wasn't doing anything dumb.

M

Matthew Mizielinski

Feb 4, 2015, 5:10:43 AM
to jug-...@googlegroups.com
After rerunning and looking through the output file once more I've found something anomalous:

In my most recent output there are results from each of the 16 processes and each process reports having done either 10 or 11 tasks. Adding them all up gives 169 tasks completed out of 132 tasks to be done.

Is it possible that two jug processes are attempting to do exactly the same task at the same time? If so, how would I diagnose or prevent this?

Should it be relevant here, I'm using jug version 0.9.6.

TIA,

Matthew

Luis Pedro Coelho

Feb 4, 2015, 9:26:53 AM
to jug-...@googlegroups.com
> After rerunning and looking through the output file once more I've found
> something anomalous:
>
> In my most recent output there are results from each of the 16 processes
> and each process reports having done either 10 or 11 tasks. Adding them
> all up gives 169 tasks completed out of 132 tasks to be done.
>
> Is it possible that two jug processes are attempting to do exactly the
> same task at the same time? If so, how would I diagnose or prevent this?

Yes, this is strange and wrong. Jug should not work like this (I mean,
the whole point is to avoid this!).

Old versions of NFS do not support the exclusive file-creation primitives
that jug uses, but unless you are using a very old cluster, this should
not be an issue.
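
(For context, a minimal sketch of the exclusive-create pattern in question; it illustrates the POSIX primitive, not jug's actual internals:)

import os

def try_lock(lockfile):
    # O_CREAT | O_EXCL fails if the file already exists, and POSIX
    # requires the check-and-create to be atomic; old NFS versions
    # did not guarantee that atomicity across clients
    try:
        fd = os.open(lockfile, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
        os.close(fd)
        return True
    except OSError:
        return False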

> Should it be relevant here I'm using jug version 0.9.6.

The newest versions do have a few bugfixes, but they are probably not
relevant (unless you're running Python 3, are you?).

Luis

Matthew Mizielinski

Feb 4, 2015, 10:47:12 AM
to jug-...@googlegroups.com


Yes, this is strange and wrong. Jug should not work like this (I mean,
the whole point is to avoid this!).

Old versions of NFS do not support exclusive file creation primitives
that jug uses, but unless you are using a very old cluster, this should
not be an issue.

The cluster is only a couple of years old (see details at jasmin.ac.uk should you be interested) and the storage is something fancy from Panasas (something in their ActiveStor line, but I'm out of my depth at this point).
 

> Should it be relevant here I'm using jug version 0.9.6.

The newest versions do have a few bugfixes, but they are probably not
relevant (unless you're running Python 3, are you?).


Python 2.7 (2.7.3, I think).

Is there any way I can get jug to kick out any additional (debugging) info that might be handy?

M

Luis Pedro Coelho

Feb 4, 2015, 12:30:10 PM
to jug-...@googlegroups.com
> Is there any way I can get jug to kick out any additional (debugging)
> info that might be handy?

jug --verbose=info ...

prints out more things, but I don't know if it'll be very helpful.

You could try to have each Task print out its arguments and see if two
different processes print out the same combination of arguments. In your
code, something like:

def process_file(target):
    '''very simple process of target file'''
    print("RUNNING ON TARGET -{}-".format(target))
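
(You could also record which host and process ran each task; socket and os here are standard-library additions, not jug features:)

import os
import socket

def process_file(target):
    '''very simple process of target file'''
    # duplicated work shows up as the same target printed
    # by two different host/pid pairs
    print("RUNNING ON TARGET -{}- host={} pid={}".format(
        target, socket.gethostname(), os.getpid()))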

HTH
Luis



Bryan Lawrence

Feb 5, 2015, 1:15:54 AM
to jug-...@googlegroups.com
Hi Matthew

I think we might have to understand file locking in PANFS ... compared with NFS ... at one point I did wonder about it  ... if it does turn out to be a problem we could instantiate an appropriate object store since it's obviously important.

Do carry on logging your results to this group, and I'll try to get some others interested in the problem, but I have to confess it won't be a high priority for the systems team right now ... so the more information we can get independent of them, the better.

Cheers
Bryan

Matthew Mizielinski

Feb 6, 2015, 4:31:24 AM
to jug-...@googlegroups.com

jug --verbose=info ...

prints out more things, but I don't know if it'll be very helpful.

Not in this case.
 
You could try to have each Task print out its arguments and see if two
different processes print out the same combination of arguments. In your
code, something like:

def process_file(target):
    '''very simple process of target file'''
    print("RUNNING ON TARGET -{}-".format(target))


This confirmed that some tasks are being performed multiple times by different jug processes.
As Bryan suggests, this may be connected with the Panasas file system used on Jasmin, so I've raised a ticket with the support teams there.
I'll follow up on this thread should we get more useful information.

Thanks for your help,

M

Luis Pedro Coelho

Feb 9, 2015, 10:57:17 AM
to jug-...@googlegroups.com
By the way (and I could have thought of this sooner): using the redis
backend would probably work for you. It works well if your task results
are small (and you seem to be writing results to disk inside the tasks
anyway).
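
(Something along these lines, with a redis-server instance reachable from the compute nodes; redis-host is a placeholder, and the exact jugdir syntax is worth checking against the docs for your version:)

bsub -o %J.out -J "test[1-16]" jug execute --jugdir=redis://redis-host:6379/ jugfile.py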

HTH
Luis
--
Luis Pedro Coelho | EMBL | http://luispedro.org
My blog: http://metarabbit.wordpress.com

Bryan Lawrence

Feb 9, 2015, 11:00:07 AM
to jug-...@googlegroups.com
Hi Luis

Indeed, I tried the redis backend some time ago, and it could well work. I'm going to try to find someone who has time (i.e. not me) to try it in the near future.

Cheers
Bryan
--

Bryan Lawrence
University of Reading: Professor of Weather and Climate Computing.
National Centre for Atmospheric Science: Director of Models and Data.
STFC: Director of the Centre for Environmental Data Archival.