Using doit to dispatch jobs on a cluster

True Merrill

May 4, 2023, 2:40:17 PM
to python-doit
Hi all,

I'm joining a development team that is interested in using doit to manage a batch processing pipeline. This is relatively simple when all of the tasks are executed locally on the machine running doit, but it isn't clear whether doit can support tasks that run on remote machines, even if we extend it by defining new actions.

Please tell me if the use case I outline below won't work, or whether there are existing tools that fit it better.

Essentially, we want to manage a batch processing pipeline where each task is executed as a cluster job (I am using Slurm and Sun Grid Engine).  To do this, I've already made a custom doit Action class (inheriting from BaseAction) which manages all of the overhead of packaging a task as a batch script and submitting to our cluster. 
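To make the packaging step concrete, here is a minimal sketch of how a task's command might be rendered into a Slurm batch script. It is not the actual Action class described above; the function name render_batch_script, the log_dir parameter, and the log file layout are all hypothetical, and only the Slurm side is shown.

```python
import shlex


def render_batch_script(task_name, command, log_dir="logs"):
    """Return the text of a Slurm batch script that runs `command`.

    `task_name`, `command` (a list of argv strings) and `log_dir` are
    illustrative parameters, not part of doit or Slurm.
    """
    lines = [
        "#!/bin/bash",
        # Naming the job after the task makes it easy to find in the queue.
        f"#SBATCH --job-name={task_name}",
        # %j is replaced by Slurm with the job ID.
        f"#SBATCH --output={log_dir}/{task_name}-%j.out",
        "set -euo pipefail",
        # Quote each argument so spaces survive the shell.
        shlex.join(command),
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(render_batch_script("align", ["python", "align.py", "in put.txt"]))
```

A custom action would presumably write this text to a file and hand it to the scheduler's submit command.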

When doit runs, it marches through the computational graph and for tasks that are incomplete, it submits a cluster job to execute the task.  Each (parent) task may have downstream dependent (child) tasks that require data from the parent.  So we use the scheduling system (provided by Slurm or SGE) to link the child jobs to be executed when the parent jobs complete.  The doit process terminates when all of the required jobs have been submitted to the cluster.
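As a rough illustration of the linking described above, the sketch below builds an sbatch command that holds a child job until its parents finish successfully, using Slurm's --dependency=afterok option. The helper names build_sbatch_command and parse_job_id are hypothetical, and SGE would need a different mechanism (for example qsub -hold_jid), which is not shown.

```python
def build_sbatch_command(script_path, parent_job_ids=()):
    """Build the argv for submitting `script_path` after its parents succeed."""
    # --parsable makes sbatch print just the job ID (optionally ";cluster").
    cmd = ["sbatch", "--parsable"]
    if parent_job_ids:
        # afterok: start only once every listed job has exited with code 0.
        ids = ":".join(str(job_id) for job_id in parent_job_ids)
        cmd.append(f"--dependency=afterok:{ids}")
    cmd.append(script_path)
    return cmd


def parse_job_id(sbatch_output):
    """Extract the job ID from `sbatch --parsable` output."""
    return sbatch_output.strip().split(";")[0]


if __name__ == "__main__":
    print(build_sbatch_command("child.sh", parent_job_ids=[101, 102]))
```

The job ID parsed from each parent's submission would be passed along as parent_job_ids when the child is submitted.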

This works well for submitting a collection of linked jobs, but in our current implementation doit doesn't check whether a task is already running or pending execution. So if a user runs the same dodo.py script twice, they will likely resubmit jobs that are already running or queued to run. In effect, we have a race condition.

Does anyone know how I can make doit check whether a task is in progress on a remote machine and prevent dispatching multiple jobs?
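One possible direction, sketched here under assumptions rather than as a known doit feature, is to ask the scheduler whether a job with the task's name is already queued before submitting. The sketch assumes each job is submitted with a job name unique to its task, and the names active_job_ids and should_submit are hypothetical. By default squeue lists pending and running jobs, so an empty result suggests nothing is in flight.

```python
import subprocess


def active_job_ids(job_name, run=subprocess.run):
    """Return IDs of queued or running Slurm jobs named `job_name`.

    `run` is injectable so the function can be exercised without a cluster.
    """
    result = run(
        ["squeue", "--noheader", f"--name={job_name}", "--format=%i"],
        capture_output=True,
        text=True,
        check=True,
    )
    # One job ID per line; an empty output means no matching jobs.
    return result.stdout.split()


def should_submit(job_name, run=subprocess.run):
    """Return True when no job with this name is pending or running."""
    return not active_job_ids(job_name, run=run)
```

A check like this narrows the window but does not close it: two doit runs started at nearly the same time could both see an empty queue and both submit, so some form of locking would still be needed for a strict guarantee.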

Am I trying to fit a square peg into a round hole with doit, and are there better tools available?

Has anyone successfully used doit in an application where tasks are being executed on a remote machine?  How did you handle the case of monitoring processes that are in progress on a remote system?

Thank you!