Generators of CWL workflows


andr...@gmail.com

Jan 3, 2018, 10:10:33 AM
to common-workflow-language

I wanted to discuss the need for building generators of CWL.


Currently there is a GUI-based CWL generator (Rabix). Although it greatly enhances the usability of CWL for casual users, for a developer (like myself) a scripting generator is preferable.


Of course, one can directly write CWL by hand, but this exercise quickly gets tiresome due to the high verbosity of CWL. For the same reason, CWL is very difficult to code-review and maintain. We already had a quick discussion about this with folks from Seven Bridges.


There is scriptcwl (https://github.com/NLeSC/scriptcwl), which has been getting multiple updates recently.


I have lately been working on another Python generator of CWL workflows. It was inspired by the scriptcwl package, but instead of creating a separate object model as scriptcwl does, it inherits from the native Python counterparts of YAML structures (dict and list). This way, one can always manipulate the data structures directly in Python for a final polish and immediately dump them as CWL. The objective is to generate the final CWL code from Python and never edit CWL by hand (at least for the workflows), so that only the Python codebase needs to be maintained and stored in revision control.
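The dict-subclass idea can be sketched in a few lines. This is my own minimal illustration of the approach, not the actual package; the class and method names (`Workflow`, `add_input`, `save`) are assumptions:

```python
# Illustrative sketch of a CWL workflow as a dict subclass; names here
# are hypothetical, not the real generator's API.
import yaml

class Workflow(dict):
    """A CWL workflow that is just a Python dict, so it can be
    manipulated directly and dumped as YAML at any time."""

    def __init__(self):
        # "class" is a Python keyword, hence the ** unpacking
        super().__init__(cwlVersion="v1.0", **{"class": "Workflow"},
                         inputs=[], outputs=[], steps=[])

    def add_input(self, id, type):
        self["inputs"].append({"id": id, "type": type})

    def save(self, path):
        # convert to a plain dict so yaml.safe_dump accepts it
        with open(path, "w") as f:
            yaml.safe_dump(dict(self), f, sort_keys=False)

wf = Workflow()
wf.add_input("manifest", "File")
# direct dict manipulation for a final polish
wf["inputs"][0]["doc"] = "sample manifest file"
text = yaml.safe_dump(dict(wf), sort_keys=False)
print(text)
```

Because the workflow *is* a dict, anything the wrapper methods don't cover can be patched in place before dumping, which is the "final polish" step described above.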


My generator provides various methods to perform bulk generation of CWL in order to greatly reduce the verbosity. For example, there are methods for adding all tool outputs that match a wildcard pattern, or for creating workflow inputs for all tool inputs (again based on pattern matching, if required).


Reducing verbosity and boilerplate is my major reason for using a generator.

 

Below I have pasted an annotated example of the generating code and its output.

 

I saw some snippets of discussions about defining a CWL object model and APIs in various languages. If that is going to be done, then the generator would be better off using that object model.


I would like to get some idea of the current thinking here on the subject of CWL generators.

 

Thanks,

Andrey

 

# load a directory of tools
tool_lib = tools.tool_library(pjoin(cwl_tool_dir, "*.cwl"), path_start=cwl_tool_dir)

# create a workflow; `wf` inherits from Python dict
wf = workflow(tool_lib=tool_lib)

wf.add_inputs(dict(prepareref_tgz="File",
                   manifest="File",
                   sample="sample_reads"))

# create a step in the workflow; `s` inherits from Python dict
s = wf.add_step("ariba_run.cwl")

# add all tool outputs as step outputs
s.add_outs()

s.add_in("prepareref_tgz")

# reads are extracted from a 'record' object
s.add_in(id="reads_1", source="sample", valueFrom="$(self.file1)")
s.add_in(id="reads_2", source="sample", valueFrom="$(self.file2)")

# add all remaining needed inputs of the tool by creating a workflow input
# for each and connecting it to the step input (a step-specific prefix is
# added by default)
s.add_ins()

wf.add_output("ariba_run/assembled_genes")

# add workflow outputs: the first (wildcard) entry selects all matching
# step outputs; the second entry passes a workflow input through to the
# output with a prefix
wf.add_outputs(["ariba_run/*",
                ("out", "manifest")])

# hand-copy requirements from a tool
wf["requirements"] = wf.get_tool_lib().tools["gene_extractor.cwl"]["requirements"]

wf.save("test_wf_02.cwl")

 

 

The generated CWL code:

 

 

#!/usr/bin/env cwl-runner
class: Workflow
cwlVersion: v1.0
inputs:
- id: prepareref_tgz
  type: File
- id: manifest
  type: File
- id: sample
  type: sample_reads
- id: ariba_run__assembled_threshold
  type:
  - 'null'
  - float
- id: ariba_run__assembly_cov
  type:
  - 'null'
  - int
- id: ariba_run__force
  type:
  - 'null'
  - boolean
- id: ariba_run__gene_nt_extend
  type:
  - 'null'
  - int
- id: ariba_run__min_scaff_depth
  type:
  - 'null'
  - int
- id: ariba_run__noclean
  type:
  - 'null'
  - boolean
- id: ariba_run__nucmer_breaklen
  type:
  - 'null'
  - int
- id: ariba_run__nucmer_min_id
  type:
  - 'null'
  - int
- id: ariba_run__nucmer_min_len
  type:
  - 'null'
  - int
- id: ariba_run__outdir
  type:
  - 'null'
  - string
- id: ariba_run__threads
  type:
  - 'null'
  - int
- id: ariba_run__unique_threshold
  type:
  - 'null'
  - float
- id: ariba_run__verbose
  type:
  - 'null'
  - boolean
outputs:
- id: assembled_genes
  type: File
  outputSource:
  - ariba_run/assembled_genes
- id: ariba_run__assembled_seqs
  type: File
  outputSource:
  - ariba_run/assembled_seqs
- id: ariba_run__assemblies
  type: File
  outputSource:
  - ariba_run/assemblies
- id: ariba_run__log_clusters
  type: File
  outputSource:
  - ariba_run/log_clusters
- id: ariba_run__report
  type: File
  outputSource:
  - ariba_run/report
- id: ariba_run__version_info
  type: File
  outputSource:
  - ariba_run/version_info
- id: out__manifest
  type: File
  outputSource:
  - manifest
requirements:
- class: ScatterFeatureRequirement
- class: InlineJavascriptRequirement
- class: StepInputExpressionRequirement
- class: SubworkflowFeatureRequirement
- class: MultipleInputFeatureRequirement
- class: SchemaDefRequirement
  types:
  - fields:
    - name: sample_reads/file1
      type: File
    - name: sample_reads/file2
      type: File
    - name: sample_reads/SampleID
      type: string
    name: sample_reads
    type: record
steps:
- run: ariba_run.cwl
  id: ariba_run
  in:
  - id: prepareref_tgz
    source: prepareref_tgz
  - id: reads_1
    source: sample
    valueFrom: $(self.file1)
  - id: reads_2
    source: sample
    valueFrom: $(self.file2)
  - id: assembled_threshold
    source: ariba_run__assembled_threshold
  - id: assembly_cov
    source: ariba_run__assembly_cov
  - id: force
    source: ariba_run__force
  - id: gene_nt_extend
    source: ariba_run__gene_nt_extend
  - id: min_scaff_depth
    source: ariba_run__min_scaff_depth
  - id: noclean
    source: ariba_run__noclean
  - id: nucmer_breaklen
    source: ariba_run__nucmer_breaklen
  - id: nucmer_min_id
    source: ariba_run__nucmer_min_id
  - id: nucmer_min_len
    source: ariba_run__nucmer_min_len
  - id: outdir
    source: ariba_run__outdir
  - id: threads
    source: ariba_run__threads
  - id: unique_threshold
    source: ariba_run__unique_threshold
  - id: verbose
    source: ariba_run__verbose
  out:
  - id: assembled_genes
  - id: assembled_seqs
  - id: assemblies
  - id: log_clusters
  - id: report
  - id: version_info

 

 

Lourens Veen

Jan 10, 2018, 3:07:35 AM
to common-workflow-language
Hi Andrey, and everyone,

I've been thinking a bit about this, in the light of ScriptCWL (of which my colleague is the author) and the proof-of-concept Python bindings for a future version of CWL that I've been working on over at https://github.com/NLeSC/pycwl/tree/cwl-experimental.

It seems to me that there is room for at least two workflow-generation solutions: a low-level one that lets you define a workflow by constructing a graph of objects (whether dictionaries or high-level objects or some combination) and serialising that to YAML (which is my idea of pycwl), and a high-level one that lets you load a set of steps and connect them together into a workflow (like ScriptCWL and your solution). Of course, the second one could be built on top of the first, if it were available, and I think your solution approximates such a design.

Whether to go with a dict/list based representation of the YAML description, or use real classes, is I think a matter of style. Maybe it depends on the language you're using as well. In the last couple of decades we've seen a shift from object-oriented languages like Java and types-first programming to dynamically-typed languages like Python and JavaScript and behaviour-first programming using simple types. Personally, I'm a fan of type safety, so my Python bindings will have real classes, using the standard Ruamel/PyYAML functionality to construct them from the YAML input, and to serialise them back to YAML again (and yes, it does that, it's part of the YAML spec).
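The typed-classes approach can be illustrated with PyYAML's standard tag mechanism (`yaml.YAMLObject`), which registers a class for both construction from YAML and serialisation back to YAML. This is my own minimal sketch, not pycwl's actual object model; the `WorkflowInput` class and its `!WorkflowInput` tag are assumptions:

```python
# Sketch of round-tripping typed objects through YAML with PyYAML;
# the class and tag here are hypothetical, not pycwl's real model.
import yaml

class WorkflowInput(yaml.YAMLObject):
    yaml_tag = "!WorkflowInput"
    yaml_loader = yaml.SafeLoader   # allow safe_load to build instances
    yaml_dumper = yaml.SafeDumper   # allow safe_dump to emit the tag

    def __init__(self, id, type):
        self.id = id
        self.type = type

doc = """
- !WorkflowInput {id: manifest, type: File}
- !WorkflowInput {id: threads, type: int}
"""
inputs = yaml.safe_load(doc)      # YAML -> typed objects
dumped = yaml.safe_dump(inputs)   # typed objects -> YAML again
print(dumped)
```

The same mechanism works in ruamel.yaml, which additionally preserves comments and formatting across the round trip.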

As for a high-level gluing-of-steps API, I quite like ScriptCWL (but of course I'm biased :-)). I don't really understand your "add all step inputs as workflow inputs" function, though. Is that still useful for workflows containing more than one step? How do you wire together two steps in your system?

pim...@gmail.com

Jan 16, 2018, 12:25:15 PM
to common-workflow-language

Hi Andrey, 


I wonder if it might be interesting to align Cromwell's WOM with, for example, a scriptcwl object model?

Gr. Pim

Andrey Tovchigrechko

Jan 23, 2018, 5:38:13 PM
to common-workflow-language
Lourens,
Looking at your code, are you writing your own parser for CWL? You are not using cwltool for this because you want to experiment with the future CWL 2 syntax, is that correct?
I like the idea of defining custom (de-)serializers for the object model directly from/to YAML. The reason I am disinclined to use any object model now is that none is yet defined (the CWL standard does not define one).

And here comes my key issue with how the CWL ecosystem is being developed:
CWL was initially conceived as a kind of workflow exchange language: something that can be passed to different engines and that is easy to write parsers for. It was not assumed to be easily writable by humans; think of a universal assembly language that could be executed by different CPU models.
Early discussions on this site contain statements like "it will be easy to create various generators for CWL".
In reality, the generators were never a focus of development, and the enthusiasts were left writing CWL by hand ("coding in assembler"). UI-based tools like Rabix Composer are not for everyone. The fact that CWL has still gained such a following reflects how attractive the idea of a universal workflow description format is.

I think that the lack of programmatic generators is seriously hurting the acceptance of CWL. I notice that there is essentially one engine (Toil) that supports it on the popular cluster backends (and on one cloud in a truly distributed mode), and even that is rough around the edges. Cromwell is coming, but it will initially support only a small subset of requirement statements. I think what happens is that pipeline developers (coding bioinformaticians, for example) come to try it, get horrified by the verbosity compared to, say, NextFlow, and abandon the idea. Most people simply do not have time to code like that. That undercuts the user base, which in turn removes the incentive for implementing support across different engines.

There has to be a practical generator tool now that is as easy to build pipelines with as NextFlow or WDL. I agree that it does not matter much whether the API is functional-style or OO, as long as it is easy to use for typical cases, can address complicated cases (maybe with more direct CWL manipulation), and can keep up with the changing standard and its support in the major implementations.

Answering your question about "add all step inputs as workflow inputs":

I am trying to address this use case: often a workflow is built from multiple tools where only a small subset of each tool's parameters is used to define the workflow graph. For example, if you are using Bowtie2 in the middle of a workflow, you would use the input read files and the output SAM file as the ports connected to other steps, but Bowtie2 has about a dozen other parameters that tune its behavior. Often it is desirable to expose at least some of them to the process (e.g. the user) that is executing the workflow. In CWL (at least as far as I could find), you have to propagate them into the containing workflow by copying their declarations twice: first, as the step's "in" attributes; second, as the containing workflow's "input" attributes. That is a lot of boilerplate. In WDL and NextFlow, the user can override internal step parameters from the command line (in WDL, by using HOCON chained dot notation) without propagating the parameter definitions upward through all the layers.
So, my generator lets you generate those CWL boilerplate parameter-propagation chains with a single function call. Normally, you would first use explicit parameter names to chain steps into a graph using "in" and "out" attributes, and then write a call that says "now take all remaining inputs of step X, create an 'input' definition for each of them in the containing workflow, and create a step 'in' definition that is connected to that input". By default the call prefixes the generated workflow parameter names with the step name, and it allows selecting only a subset of parameter names with a glob pattern.
There is another method that solves the similar task of generating boilerplate code for bulk-exposing the outputs of a given step/tool as workflow outputs.
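Over plain dicts, the bulk "propagate remaining step inputs upward" operation described above can be sketched as follows. The function name `add_ins` and the field layout are my assumptions for illustration, not the actual generator's API:

```python
# Hypothetical sketch of bulk parameter propagation over plain dicts;
# names like add_ins are illustrative, not the real generator's API.
from fnmatch import fnmatch

def add_ins(wf, step, tool_inputs, pattern="*"):
    """For every tool input not already wired in the step, create a
    step-name-prefixed workflow input and connect the step to it."""
    wired = {i["id"] for i in step["in"]}
    for inp in tool_inputs:
        if inp["id"] in wired or not fnmatch(inp["id"], pattern):
            continue
        wf_id = "{}__{}".format(step["id"], inp["id"])
        wf["inputs"].append({"id": wf_id, "type": inp["type"]})
        step["in"].append({"id": inp["id"], "source": wf_id})

wf = {"inputs": [], "steps": []}
step = {"id": "ariba_run",
        "in": [{"id": "prepareref_tgz", "source": "prepareref_tgz"}]}
wf["steps"].append(step)

# tool inputs as they might come from the parsed tool description
tool_inputs = [{"id": "prepareref_tgz", "type": "File"},
               {"id": "threads", "type": ["null", "int"]}]

add_ins(wf, step, tool_inputs)
print(wf["inputs"])   # [{'id': 'ariba_run__threads', 'type': ['null', 'int']}]
```

One call replaces the two hand-written declarations per parameter (the workflow "input" and the step "in" entry), which is exactly the boilerplate the prose complains about.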

Andrey Tovchigrechko

Jan 23, 2018, 5:46:09 PM
to common-workflow-language
They also have this CWL object model:
I do not know how that plugs into WOM.

Jeff Gentry

Jan 23, 2018, 5:48:22 PM
to Andrey Tovchigrechko, common-workflow-language
Hi - the wdl4s repo was merged fully into Cromwell and development continued from there. You can see the remnants in Cromwell's wom, wdl, and cwl subdirectories.

--
You received this message because you are subscribed to the Google Groups "common-workflow-language" group.
To unsubscribe from this group and stop receiving emails from it, send an email to common-workflow-language+unsub...@googlegroups.com.
To post to this group, send email to common-workflow-language@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/common-workflow-language/3e553efa-697d-464e-b333-68cbff5d594e%40googlegroups.com.

For more options, visit https://groups.google.com/d/optout.

Andrey Tovchigrechko

Jan 23, 2018, 5:55:33 PM
to common-workflow-language
Thanks. I guess that also supports my argument that focusing on any object model that is not part of the standard is chasing a moving target, and no better than defining a focused set of functions based on common use cases.

Jeff Gentry

Jan 23, 2018, 5:58:49 PM
to Andrey Tovchigrechko, common-workflow-language
Yes, in this specific case it'd certainly be wise to at least wait until it's something one could call complete with a straight face before latching on to it :)



Peter Amstutz

Jan 23, 2018, 8:14:25 PM
to Andrey Tovchigrechko, common-workflow-language
Hi Andrey,

I pretty much agree with everything you are saying. Understand that the progression of the project has been first developing the core spec and second getting it adopted by workflow engines, so it is only recently that there is sufficient foundation to be able to focus more on ease of use.

Are you suggesting that generator efforts should be more prominent on the front page? A generator or domain-specific language needs an implementation, documentation, tutorials, and such, which is a whole project on its own. It sounds like you are already working on something like this; if you are volunteering to lead this effort then I will be happy to promote it.

Thanks,
Peter


Lourens Veen

Jan 24, 2018, 5:47:03 AM
to common-workflow-language
Hi Andrey,

Yes, I'm writing my own parser (and generator, but that's not there yet), as I'm not doing CWL 1, and I wanted to try a bit of a different approach, using more YAML features and allowing for more wiggle room between the object format and the serialisation format.

I see the need for a nice Python standard library for CWL as well; in fact I sort of needed one when I started building Cerise (a job submission engine for CWL workflows), and that's what led to this project. It's not just generation though, or (de)serialising from/to an object representation: I think it would also be really nice to have utility functions for e.g. checking types or collecting a list of input files that should be staged, i.e. the CWL-specific but not runner-specific parts of a CWL runner implementation. I just need to find a solid chunk of time to work on the pycwl code again, as time's been in short supply recently. I will try to update the Google doc with my object model soon, as not all of it survived first contact with coding a parser :-).
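The staging utility mentioned above (collect all input files so a runner can stage them) can be sketched over the plain-dict representation of a CWL job order. This is my own illustration, not pycwl's actual API:

```python
# Sketch of a runner-agnostic helper that walks a CWL job order and
# yields every File object; the function name is hypothetical.
def collect_files(node):
    """Recursively yield every {"class": "File", ...} object in a
    CWL input document (dicts, lists, and nesting thereof)."""
    if isinstance(node, dict):
        if node.get("class") == "File":
            yield node
        else:
            for value in node.values():
                yield from collect_files(value)
    elif isinstance(node, list):
        for item in node:
            yield from collect_files(item)

job = {"prepareref_tgz": {"class": "File", "path": "ref.tar.gz"},
       "reads": [{"class": "File", "path": "r1.fq"},
                 {"class": "File", "path": "r2.fq"}]}
paths = [f["path"] for f in collect_files(job)]
print(paths)   # ['ref.tar.gz', 'r1.fq', 'r2.fq']
```

A real implementation would also follow `secondaryFiles` and `Directory` listings, but the recursive walk is the CWL-specific, runner-agnostic core.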

Best,

Lourens