The function "list -s" is exactly what I was looking for! I was searching for preflight, dryrun, etc.
Regarding the clustering/scaling: we use the platform's backend (PyTorch DistributedDataParallel) for the GPU/cluster support. Our base environment is pretty static, so we don't need too many smarts. Meaning, we can consistently use the same node as the "controller" (rank 0) and can make a ton of assumptions. Doit is really just handling the dependency-graph determination part.
My one remaining question is about the parallel case (when the jobs are small): I'm trying to figure out which "slot" in the parallel pool is being used. For example, if I want to run 16 jobs across 4 GPUs, is there an easy way to know which slot a job landed in, so I know which GPU to assign? Is there some environment variable or argument available with a "worker id" or similar?
A little more background if you're bored:
What we really do is generate an intermediate JSON structure, similar to doit's, which contains stanzas of inputs, outputs, and actions. Then we have scripts that recapitulate that into dodo.py files, or linear Python. When we want to scale to larger platforms we'll write a converter to Snakemake or something more heavyweight. For now, doit hits the sweet spot of size, ease of use, and easy installation.
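To make that concrete, here's roughly what the conversion looks like: each stanza maps onto a doit task dict. The JSON schema and task names below are illustrative, not our exact format:

```python
import json

# A hypothetical two-stanza pipeline in the intermediate format.
PIPELINE_JSON = """
[
  {"name": "align", "inputs": ["reads.fq"], "outputs": ["aln.bam"],
   "actions": ["bwa mem ref.fa reads.fq > aln.bam"]},
  {"name": "sort", "inputs": ["aln.bam"], "outputs": ["aln.sorted.bam"],
   "actions": ["samtools sort aln.bam -o aln.sorted.bam"]}
]
"""

def stanzas_to_tasks(text):
    """Map each JSON stanza onto a doit task dict.

    inputs -> file_dep, outputs -> targets; doit then derives the
    dependency graph from the overlapping file names.
    """
    for stanza in json.loads(text):
        yield {
            "name": stanza["name"],
            "file_dep": stanza["inputs"],
            "targets": stanza["outputs"],
            "actions": stanza["actions"],
        }

# In the generated dodo.py this becomes a task generator:
def task_pipeline():
    yield from stanzas_to_tasks(PIPELINE_JSON)
```

Since the stanzas already carry inputs/outputs explicitly, the same structure should translate to Snakemake rules (input/output/shell) fairly mechanically later on.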