Draft 1 tool description spec ready for review


peter

Dec 8, 2014, 3:50:51 PM
to common-workf...@googlegroups.com
I've finished my first pass through writing the Draft 1 spec for the tool
description language, and it is ready for review and comments:

https://github.com/common-workflow-language/common-workflow-language/blob/draft-1/specification/tool-description.md

Some notes:

* I tried to formalize the semantics for running tools, but I'm sure
there are still lots of holes to fill in.

* Added a required "schema" field, which we can use to distinguish
between versions of the tool document language.

* Language probably wavers between being documentation (as a user, you
can do this) and specification (an implementation MUST do this). We
will probably eventually want to split this into separate documents for
concepts/tutorial/examples and the actual specification.

* I made a few changes to the format:
* $expr no longer takes an object with separate "lang" and "value"
fields, it just takes a string with the javascript code directly (see
the sketch below)
* Renamed optional "meta" field in "file" records to "metadata" and
removed the "file_type" required field until we have decided how we
want to handle file types.

* We should add a well-defined "status" field in the output record that
indicates whether a tool execution should be considered successful or
failed.

* Punted on specifying that tool metadata fields should use FOAF and
DOAP until we get our act together with json-ld more generally.

* I made an effort to write the spec from the perspective of "what do I
think is the best way this should work" and not "what do the current
implementations do" so the implementations are likely to be out of step.

* Didn't talk enough about nested adapters, or Docker. TODO

* No discussion about generating a GUI form to create and edit the input
document. This should go in its own document.

* I think that the most important bar a specification has to reach is
whether a competent developer can take the spec and write a conforming
implementation without reference to existing implementations. I'm not
sure if the current spec passes that bar (probably not).

Since this is a milestone draft, with more drafts intended to follow it,
I'd like to suggest that our discussion focus on clarifying the
semantics and fixing any major errors, and not on adding new features.

- Peter



Nebojsa Tijanic

Dec 10, 2014, 9:36:14 AM
to peter, common-workf...@googlegroups.com
Thanks a lot for writing this up, Peter!

I considered making a pull request with some proposed changes but, since they are mostly conceptual, it seems like the discussion belongs on the list:

[01] Rename job.json to job.cwl.json (less chance of conflict with tools).
[02] Rename "$expr" to "@expr". Less conflict with js/json stuff like angularjs and mongodb.
[03] Move schemas from raw.githubusercontent.com to gh-pages branch
[04] Add "@type" field for File, CommandLineTool and JobOrder entities.
[05] Use only cpu and mem for resources initially to keep things simple.
[06] Enumerate secondary files by naming convention (see below).
[07] Include certain fields in output.adapter for output file metadata (see below).
[08] Apart from "$job", expressions should have a "$self" variable if used in context of input.adapter.transform (bound to value for that input) or output.adapter.metadata (bound to path of file matching glob, see below).
[09] Restrict use of expressions to specific places:
 - input.adapter.transform
 - adapter.baseCmd[]
 - adapter.stdin
 - adapter.stdout
 - adapter.args[].value
 - output.adapter.glob
 - output.adapter.metadata[]
 - output.adapter.value
[10] Replace docker image URI with imageRepo and imageTag (I think docker doesn't quite have URIs).
[11] It might be more consistent to specify that Files have a default transform of "$self.path".
[12] Represent expressions and $job references as {"@type": "Transform", "lang": "<javascript or jsonpointer>", "value": "<js_expr or jsonpointer_expr>"}
[13] Remove adapter.transform and only use adapter.value
[14] Specify that if, after running the tool, the working dir contains a "result.cwl.json" file, no output adapters should be used and the file content must be the result that matches the outputs schema. This seems quite useful for very complex tools where a wrapper is needed, since the wrapper can also report the outputs (see the sketch just below).
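
As a sketch of [14], a wrapper could write a result.cwl.json along these lines (output names and field values here are hypothetical; the record just has to match the tool's outputs schema):

    {
      "aligned_reads": {
        "path": "./output.bam",
        "metadata": {"sample": "NA12878"}
      }
    }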

Regarding [06] and [07], it's really useful to have a way to set output file metadata (and secondary files) from the output adapter. If file metadata are RDF properties, it's possible to do it via inference rules, but that's way too complicated for people to write imo. A simpler way that should cover most cases:
 - An output.adapter.metadata dict which will be set as File.metadata for output files. Values can be expressions with $self bound to the path of that specific file.
 - output.adapter.inheritMetadataFrom field with an input ID as value. Metadata for the produced file will be the same as for the input file(s), before updating with the output.adapter.metadata dict. In case of multiple input files, a "tuple intersection" can be used (only set the properties which are the same for all input files). This can later be extended with different strategies.
 - output.adapter.secondaryFiles can be a list of extensions. A naming convention can be used (as is the case in most bioinformatics tools/formats) instead of enumerating actual files. We would need a way to signify whether the extension is appended or replaced (e.g. ".bai" and "$.dict", respectively - so "file.bam" would have "file.bam.bai" and "file.dict"). See the sketch after this list.
 - output.adapter.inheritSecondaryFilesFrom with an input ID as value can also be useful for indexer tools, in case there are multiple different indices for the same file, each created by a different tool.
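
To make the append-vs-replace convention concrete, a sketch of such an output adapter (field names as proposed above; the input ID is invented):

    {
      "glob": "*.bam",
      "secondaryFiles": [".bai", "$.dict"],
      "inheritMetadataFrom": "#reads"
    }

Here "file.bam" would be expected to have "file.bam.bai" (extension appended) and "file.dict" (extension replaced) next to it.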

If there's general agreement about any of these points, I'll gladly make a pull request for the specs as well as the reference implementation. Sorry I wasn't of any help during the initial writeup, had a very busy week.

Thanks again,
--
Nebojsa Tijanic
Seven Bridges Genomics, Inc.

peter

Dec 11, 2014, 2:02:47 PM
to Nebojsa Tijanic, common-workf...@googlegroups.com
On Wed, 2014-12-10 at 15:35 +0100, Nebojsa Tijanic wrote:


> [01] Rename job.json to job.cwl.json (less chance of conflict with
> tools).

Agree.

> [02] Rename "$expr" to "@expr". Less conflict with js/json stuff like
> angularjs and mongodb.

Do Angular and MongoDB already explicitly have an $expr key? Also
"@expr" seems like it would similarly have a potential conflict with
json-ld.

> [03] Move schemas from raw.githubusercontent.com to gh-pages branch

Go ahead. I haven't used GitHub Pages, so I would appreciate it if you
can take care of that.

> [04] Add "@type" field for File, CommandLineTool and JobOrder
> entities.

This looks like taking baby steps towards using json-ld. Rather than
just adding in a few bits and pieces, wouldn't it make more sense to
start the conversation about json-ld and Avro for draft 2?

> [05] Use only cpu and mem for resources initially to keep things
> simple.

Agree.

> [06] Enumerate secondary files by naming convention (see below).
> [07] Include certain fields in output.adapter for output file metadata
> (see below).

Can I propose this be put on the TODO for draft 2 or 3, when we tackle
metadata?

> [08] Apart from "$job", expressions should have a "$self" variable if
> used in context of input.adapter.transform (bound to value for that
> input) or output.adapter.metadata (bound to path of file matching
> glob, see below).

Agree on $self for inputs. (I remember seeing a $context variable in the
Rabix code, not understanding what it was for, and taking it out; now
that you've explained it, this makes sense.)

> [09] Restrict use of expressions to specific places:
> - input.adapter.transform
> - adapter.baseCmd[]
> - adapter.stdin
> - adapter.stdout
> - adapter.args[].value
> - output.adapter.glob
> - output.adapter.metadata[]
> - output.adapter.value

Agree. Schema validation should already be checking that? If there are
gaps, go ahead and bring the spec and schema in line.

> [10] Replace docker image URI with imageRepo and imageTag (I think
> docker doesn't quite have URIs).

Agree. These are going to be Docker-specific keys, then? We should
also include an option for a custom Docker registry.

> [11] It might be more consistent to specify that Files have a default
> transform of "$self.path".

Agree. This also got me thinking that all the adapter transformations
could be defined in the spec in terms of default transform expressions.

> [12] Represent expressions and $job references as {"@type":
> "Transform", "lang": "<javascript or jsonpointer>", "value": "<js_expr
> or jsonpointer_expr>"}

This seems cumbersome to write. Same comment from earlier about
introducing @type tags without going to full json-ld. Also, if we use
json-ld then we can get rid of jsonpointer, and might define this whole
fragment differently. I took out the "lang" when I was writing draft 1
specifically to make it clear that expressions are only Javascript.
Needs further discussion.

> [13] Remove adapter.transform and only use adapter.value

Agree. I don't think my reference implementation or the spec even talks
about adapter.transform?

> [14] Specify that if, after running the tool, the working dir contains
> a "result.cwl.json" file, no output adapters should be used and the
> file content must be the result that matches the outputs schema. This
> seems quite useful for very complex tools where a wrapper is needed,
> since the wrapper can also report the outputs.

Agree.

Regarding outputs, what do you think about specifying a required
"status" field in the output record which indicates "success" or
"failure"?

> If there's general agreement about any of these points, I'll gladly
> make a pull request for specs as well as the reference implementation.
> Sorry I wasn't of any help during initial writeup, had a very busy
> week.

Agree on 1, 3, 5, 8, 9, 10, 11, 13, 14. Others need further discussion.

Thanks,
Peter


Nebojsa Tijanic

Dec 12, 2014, 7:24:35 AM
to peter, common-workf...@googlegroups.com
Yeah, a lot of the above was baby steps to json-ld. I'd be happy to leave draft 1 as-is with regard to those points to accelerate the release of the JSON-LD-based draft 2.

> Regarding outputs, what do you think about specifying a required
> "status" field in the output record which indicates "success" or
> "failure"?

By "output record" do you mean the result.cwl.json?

If so, it feels unneeded, at least at this stage. If we have success/failure on the level of individual jobs, the natural way to detect it seems to be the process exit code (TODO: successCodes field), exceptions raised while evaluating expressions, or validation failing.

If someone wrote a wrapper, they can still report success/failure through exit code and give more info on stderr. 

On the other hand, if we nest the outputs structure under e.g. an "outputs" key of result.cwl.json, it leaves room for extending the structure later. I'd still rely on the exit code, just to avoid having multiple ways of declaring success/failure.
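
As a sketch, the successCodes idea could look like this in the tool description (placement and values are just a guess; grep is a handy example since it exits with 1 when there are no matches, which isn't a failure):

    {
      "adapter": {"baseCmd": ["grep", "-c"]},
      "successCodes": [0, 1]
    }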

Will send a pull request over the weekend.

Tim Pierce

Dec 22, 2014, 4:05:14 PM
to common-workf...@googlegroups.com
Hi, folks --

I'm Tim Pierce, one of the other engineers who works with Peter at Curoverse.  I volunteered to put another couple of eyes on the tool description document, and Peter asked if I would send my suggestions here for public feedback.

This is really exciting stuff and I'm really glad to see how much work the group has gotten done so far.  I apologize in advance if I'm touching on stuff that the rest of you have already addressed -- I reviewed some of the discussion before I sent this but really only briefly.

One thing that I would like to see in this document is some mention of how fatal errors are handled -- in the event of malformed JSON in the tool description document, or an unknown type name in the input schema, that sort of thing.  Maybe if that's a function of the workflow infrastructure, it's out of scope for the tool description document, but it would be helpful at least to have some idea of what kinds of document errors are considered fatal.

I find it a little bit confusing that "inputs" is used as the name of both the input schema in the tool description document and the literal inputs in the job order document.  Maybe this doesn't present confusion in practice -- it just looks a little squirrelly to me at first.

1. Execution

The "Execution" section of the document specifies, "The designated output directory is empty except for "job.cwl.json" which contains the job order."

Is that correct?  The job order document is specified in the designated *output* directory?  That seems counterintuitive to me -- I would not expect the designated output directory to contain any files that are necessary before the job has even started.

2. Assumptions and Restrictions

Item 3 under "Assumptions and Restrictions" says: "Tool input must come from: the command line, the standard input stream, by reading input files, and/or by accessing network resources."  Item #5 says "Tool output is emitted via the standard output stream, by writing files, or by accessing a network resource."

Are there any other forms of input or output possible? We have already ruled out keyboard or interactive input in #1. Is this wording just meant to emphasize #9, to ensure that tools don't get input from an attached atomic vector plotter or whatever?  It's probably harmless, but I'm wondering if there are specific input/output sources that I'm not thinking of that need additional clarification.

Suggestion for an additional restriction: "Tool output files will be written with ownership and permissions that permit other tools in the workflow to read those files as inputs.  The workflow infrastructure is responsible for enforcing this."

3. References

I'd like to see the spec clarify what behavior is expected if the user writes a reference in the tool description document that can't be resolved.

4. Mixins

Suggestion: have an additional example that shows explicitly how a mixin is processed when it's introduced in a nested object in the source document (i.e. not at top level).  For example:

    {
      "foo": 11,
      "bar": {
        "item1": 12,
        "$mixin": "doc2.json#"
      }
    }
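
It would also help to show the expected result.  For instance, assuming doc2.json contains {"item2": 13} and that keys from the including document take precedence over mixed-in keys (I'm inferring the precedence rule here; the spec should state it explicitly), the processed document would be:

    {
      "foo": 11,
      "bar": {
        "item1": 12,
        "item2": 13
      }
    }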

5. Expressions

Does this specification need to describe how the expression should be sandboxed, or is it up to the implementor to choose a JS sandbox solution? Would it be useful to include pointers to one or two possible solutions?

6. Input schema

I'm concerned about the note that the $schema keyword will be ignored. Does this imply that there's no way to write new versions of the input schema or to gracefully upgrade existing documents?  (I recognize that trying to version schemas is an incredibly knotty problem and doesn't generally work out in practice as beautifully as it should on paper.)

I was also confused by this: "Any object in the schema where the "type" keyword is meaningful may contain a field named "adapter" containing an adapter record."  Is there any situation in which the type keyword is *not* meaningful?  (Also for "output schema")

7. Adapter record

"If the adapter record is listed in the input schema..." Is it optional to list the adapter record in the input schema or required?  If required, that should be made explicit; if optional, should specify what default settings are used to build the command line.  Also need clarification on "the corresponding value" in the job order document -- does that mean a job order field named "adapter"?  How is that field used?

The default value for both "separator" and "itemSeparator" is "none."  Recommend saying "the empty string" for clarity.  (And is that really the correct default for itemSeparator, or should the default be whitespace?)
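
For example, given a list-valued input ["chr1", "chr2"] and an adapter fragment like:

    {"itemSeparator": ","}

I'd expect the rendered argument to be "chr1,chr2".  With an empty-string default it would presumably come out as "chr1chr2", and with a whitespace default as "chr1 chr2" -- worth spelling out which behavior is intended.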

If I'm reading the example presented correctly, the tool's #/adapter/args/0/value field should be a job order reference, i.e. "$job" and not "$ref".

8. Resources

The "resources" field appears to be incomplete.  The description says "it has four fields" but only two are listed (cpu and mem).

9. Output schema

"Any object in the schema where the "type" keyword is meaningful may contain an "adapter" object." Same concern as for "input schema" -- is there any situation in which this keyword is not meaningful?

10. Output adapter

Should specify a collation order for files matched by the glob pattern (UTF-8?)

"After the tool process finishes, if the directory contains a file called "result.cwl.json", no output adapters will be used and the content of that file must be the output record."  If this file is written to the designated output directory before the tool is even run, the tool's output will be ignored entirely.  Is that intentional?  I'm concerned that it could be a way to accidentally skip pipeline results if the workflow is not written carefully.

Nebojsa Tijanic

Dec 24, 2014, 7:09:41 AM
to Tim Pierce, common-workf...@googlegroups.com
Hi, Tim. Thanks a lot for the comments and pull request!

My thoughts inline.

>> Is that correct?  The job order document is specified in the designated *output* directory?  That seems counterintuitive to me -- I would not expect the designated output directory to contain any files that are necessary before the job has even started.

Yes. The wording should perhaps be slightly different. The job order is placed inside the cwd, which is also the designated output directory.
The reason for this is the following: some tools may be too complex or weird to describe with adapters, so a "wrapper" script may be needed. If people are already writing the wrapper script, they shouldn't need to write additional javascript expressions in the tool description - the wrapper can read the inputs from job.cwl.json and write the outputs to result.cwl.json.

An alternative could be to write the job order elsewhere and set an environment variable to the path of the job.cwl.json file. This would keep the working dir clear when the process is executed. Not sure if it's worth it.

>> Is this wording just meant to emphasize #9, to ensure that tools don't get input from an attached atomic vector plotter or whatever?  It's probably harmless, but I'm wondering if there are specific input/output sources that I'm not thinking of that need additional clarification.

I think it's just emphasis, as you said.

>> Suggestion for an additional restriction: "Tool output files will be written with ownership and permissions that permit other tools in the workflow to read those files as inputs.  The workflow infrastructure is responsible for enforcing this."

This is a great point, but it's more of a note than a restriction - when implementing the executor, one would of course have to handle this. Tools will always create files as the user that started the process. This could be a bit tricky with Docker, since we can only guarantee that the image has a root user. It would make implementation a lot easier if the spec said that docker images will run with UID 1001 or some such, but it will make image building harder. I'd rather keep the tool adapting process easy and executor implementation a bit harder.

>> I'd like to see the spec clarify what behavior is expected if the user writes a reference in the tool description document that can't be resolved.

Agreed, the spec doesn't discuss errors much, as you said. It might be best to just specify that these are errors and leave the error reporting and retry logic and such to executor implementations. I think all document errors should be fatal. Schema errors can be detected before execution and it's best not to "swallow" any adapter/expression related errors.

>> The "resources" field appears to be incomplete.  The description says "it has four fields" but only two are listed (cpu and mem).

Whoops, fixed. Used to be four, but we simplified it to two.

>> "Any object in the schema where the "type" keyword is meaningful may contain an "adapter" object." Same concern as for "input schema" -- is there any situation in which this keyword is not meaningful?

Sadly, yes. JSON-Schema uses a "properties" object to declare object properties (as opposed to array of objects). This is one of the reasons we will likely switch to Avro.

>> Should specify a collation order for files matched by the glob pattern (UTF-8?)

Great idea! Issue created.

>> "After the tool process finishes, if the directory contains a file called "result.cwl.json", no output adapters will be used and the content of that file must be the output record."  If this file is written to the designated output directory before the tool is even run, the tool's output will be ignored entirely.  Is that intentional?  I'm concerned that it could be a way to accidentally skip pipeline results if the workflow is not written carefully.

Hmm, the idea is that result.cwl.json would be created by the process. The workflow engine should make sure that it's not there before the process starts and no one but the tool process can create it. Depending on the implementation, it might be possible for a malicious tool to mess up the workflow by writing files in other tools' directories, but that can be prevented by only giving it write access to its own cwd.

In a workflow run, each tool instance would have its own directory with write access. Perhaps this was the source of confusion? The spec doesn't mention workflows yet.


Tim Pierce

Dec 29, 2014, 2:50:16 PM
to Nebojsa Tijanic, common-workf...@googlegroups.com
Thanks for the response. Again, I apologize in advance if I'm going over old business here, and please feel free to direct me at any archived discussion of these issues.

On Wed, Dec 24, 2014 at 7:09 AM, Nebojsa Tijanic <nebojsa...@sbgenomics.com> wrote:
> Hi, Tim. Thanks a lot for the comments and pull request!
>
> My thoughts inline.
>
> >> Is that correct?  The job order document is specified in the designated *output* directory?  That seems counterintuitive to me -- I would not expect the designated output directory to contain any files that are necessary before the job has even started.
>
> Yes. The wording should perhaps be slightly different. The job order is placed inside the cwd, which is also the designated output directory.
> The reason for this is the following: some tools may be too complex or weird to describe with adapters, so a "wrapper" script may be needed. If people are already writing the wrapper script, they shouldn't need to write additional javascript expressions in the tool description - the wrapper can read the inputs from job.cwl.json and write the outputs to result.cwl.json.
>
> An alternative could be to write the job order elsewhere and set an environment variable to the path of the job.cwl.json file. This would keep the working dir clear when the process is executed. Not sure if it's worth it.

It makes sense if this is simply a side effect of having the designated output directory default to cwd, but it's not a good idea to codify side effects as requirements of a specification. It doesn't seem to me that there's any intrinsic benefit to requiring input files to appear in the designated output directory, and there are some potentially harmful effects (e.g. it prevents a workflow from being able to guarantee that the designated output directory is empty before the tool runs).

Can we define a "designated input directory" for a tool, and allow that also to default to cwd? 
 

> >> Suggestion for an additional restriction: "Tool output files will be written with ownership and permissions that permit other tools in the workflow to read those files as inputs.  The workflow infrastructure is responsible for enforcing this."
>
> This is a great point, but it's more of a note than a restriction - when implementing the executor, one would of course have to handle this. Tools will always create files as the user that started the process. This could be a bit tricky with Docker, since we can only guarantee that the image has a root user. It would make implementation a lot easier if the spec said that docker images will run with UID 1001 or some such, but it will make image building harder. I'd rather keep the tool adapting process easy and executor implementation a bit harder.

I see: this section essentially defines a bunch of things that tools may not do (rely on non-POSIX calls, write to input files, read from hardware devices, etc.) as opposed to things that they simply aren't required to do. Fair enough.

 
> >> I'd like to see the spec clarify what behavior is expected if the user writes a reference in the tool description document that can't be resolved.
>
> Agreed, the spec doesn't discuss errors much, as you said. It might be best to just specify that these are errors and leave the error reporting and retry logic and such to executor implementations. I think all document errors should be fatal. Schema errors can be detected before execution and it's best not to "swallow" any adapter/expression related errors.

That makes sense to me too; I'd just like to make that explicit.
 
>> "After the tool process finishes, if the directory contains a file called "result.cwl.json", no output adapters will be used and the content of that file must be the output record."  If this file is written to the designated output directory before the tool is even run, the tool's output will be ignored entirely.  Is that intentional?  I'm concerned that it could be a way to accidentally skip pipeline results if the workflow is not written carefully.

Hmm, the idea is that result.cwl.json would be created by the process. The workflow engine should make sure that it's not there before the process starts and no one but the tool process can create it. Depending on the implementation, it might be possible for a malicious tool to mess up the workflow by writing files in other tool's directories, but that can be prevented by only giving it write access to its own cwd.

In a workflow run, each tool instance would have it's own directory with write access. Perhaps this was the source of confusion? The spec doesn't mention workflows yet.

If the workflow spec is expected to ensure that this file is not accidentally or maliciously created before the tool is run, then that seems fine. I'm just looking for places where defining the spec rigorously could ensure that a conforming implementation can't allow race conditions that could pollute a computation.

FWIW, this concern dovetails a little bit with my question about "designated output directory" above. If a tool could designate separate directories for input and output, then it (or the workflow infrastructure) can ensure that the designated output directory contains no files when the tool begins executing. Only if we assume that the designated output directory is also the input directory do we run into messy edge cases like this.

--t.