embedding CWL into a Docker image


Peter Amstutz

Nov 23, 2015, 10:38:02 AM
to common-workf...@googlegroups.com
Hello everyone,

I have been thinking for a while it would be nice to be able to bundle
a CWL tool description along with a Docker image, so that the two
artifacts don't need to be distributed separately. Doing some
research, I found that Docker offers a LABEL directive in a Dockerfile for
setting simple key-value metadata. I can think of at least 3
approaches to using this, so I wanted to see if anyone else had any
thoughts or experience:

1) Add a label with the full text of the CWL document. This is the
simplest approach. The drawbacks are that there does not appear to be a
way to set a label using the text of a file, so it requires a small
script to generate a Dockerfile on the fly with the LABEL directive
and the quoted text. It also may be awkward for viewing since labels
are assumed to be small snippets of text and not multi-KB documents.

2) Add a label that points to a file inside the image. Then we need
to get the file out of the image. Unfortunately, "docker cp"
apparently only works with running containers, so I'm not sure how
this would work.

3) Split up the top level of the CWL document into a bunch of separate
labels. So there would be a cwl.inputs label, a cwl.outputs label,
etc. This also requires generating the Dockerfile on the fly, but the
result is a bit more parsable by humans as well as potentially a
little bit more workable with the limited metadata filtering offered
by Docker. The downside of this approach is that you can't describe
more than one tool inside the Docker image at a time.

Thoughts?

Thanks,
Peter

Stian Soiland-Reyes

Nov 24, 2015, 8:29:00 AM
to common-workflow-language

I agree that this is a very nice proposition.

I think I prefer option #2, as it's easier to get started with.

Instead of docker cp you can use "docker cat" which should generally
work (except where there is a very minimal installation or a
not-so-nice-behaving ENTRYPOINT)

With this option we can also have a default file-path, e.g.
/cwl/tool-description.json - which is easy to add even when
not using a Dockerfile.


#3 sounds tricky with regards to inputs and outputs - but you could do
cwl.inputs.1 or something silly. It has a lower barrier of entry, but requires
additional tooling to convert to a regular tool description.

#1 sounds tricky as the tool description doesn't fit well on a single
line, potential JSON/Yaml escaping issues from a Dockerfile or bash, etc. So
although this is easiest to consume for a CWL engine, it is also the hardest to
produce for a Docker container maker.


I think we want to encourage the Docker makers so they have a low barrier of
entry, IMHO both #2 and #3 have this. #3 has slightly lower barrier of entry,
but may give confusion if they try to read the CWL specs which then
won't match what they see.

--
Stian Soiland-Reyes, eScience Lab
School of Computer Science
The University of Manchester
http://soiland-reyes.com/stian/work/ http://orcid.org/0000-0001-9842-9718

Michael Crusoe

Nov 24, 2015, 9:55:10 AM
to Stian Soiland-Reyes, common-workflow-language
On Tue, Nov 24, 2015 at 3:29 PM Stian Soiland-Reyes <soilan...@cs.manchester.ac.uk> wrote:

> I agree that this is a very nice proposition.
>
> I think I prefer option #2, as it's easier to get started with.
>
> Instead of docker cp you can use "docker cat" which should generally
> work (except where there is a very minimal installation or a
> not-so-nice-behaving ENTRYPOINT)
>
> With this option we can also have a default file-path, e.g.
> /cwl/tool-description.json - which is easy to add even when
> not using a Dockerfile.

As for default file path, we should use a Filesystem Hierarchy Standard compliant one. The Debian Med team approved the following locations:
`/usr/share/cwl/${binary-name}`
`$XDG_DATA_HOME/cwl/${binary-name}.cwl`
(which would usually be `$HOME/.local/share/cwl/${binary-name}.cwl`)

Nebojsa Tijanic

Nov 24, 2015, 1:38:54 PM
to Michael Crusoe, Stian Soiland-Reyes, common-workflow-language
Option 2 makes it much harder to see what the image offers in terms of cwl (would need to download image and run a container or manually look at layers). Does option 2 require the image to also have a cwl runner? If not, it means people would first need to extract the cwl files outside the container and then use them with the docker image inserted as runtime requirement.

What about option #4: add a label that has a URL as value which points where cwl descriptions can be found?

--
You received this message because you are subscribed to the Google Groups "common-workflow-language" group.
To unsubscribe from this group and stop receiving emails from it, send an email to common-workflow-la...@googlegroups.com.
To post to this group, send email to common-workf...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/common-workflow-language/CAD%3DWrcJh6cctgxXfTucMMkBw5wLM%3DLcnWmmY30YkNZ1q_TEPoA%40mail.gmail.com.

For more options, visit https://groups.google.com/d/optout.

Peter Amstutz

Nov 24, 2015, 2:16:24 PM
to Nebojsa Tijanic, Michael Crusoe, Stian Soiland-Reyes, common-workflow-language
The idea is that an external CWL runner can inspect the image and get
the tool file out. So for example instead of "run: footool.cwl" you
could say something like "run: docker:fooimage". The image wouldn't
need to have a cwl runner in it, although that's another direction we
could go.

It took me a minute to realize for (2) by "docker cat" Stian means
"docker run <image> cat". This has the drawback of having to actually
download and run the image. The advantage is that it's just a COPY
and then a LABEL in the Dockerfile.
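
A minimal sketch of the "docker run <image> cat" approach, driven from Python. The function names are hypothetical, and the use of --entrypoint (to sidestep a not-so-nice-behaving ENTRYPOINT) is an assumption added here, not something proposed in the thread. The image still needs a cat binary.

```python
import subprocess


def cat_command(image, path):
    """Build the argv that prints `path` from inside `image`."""
    # --rm removes the container once cat exits; --entrypoint replaces the
    # image's own ENTRYPOINT so a wrapper script cannot intercept the call.
    return ["docker", "run", "--rm", "--entrypoint", "cat", image, path]


def read_tool_description(image, path):
    """Run the container and return the file contents as text.

    Requires a working Docker installation and downloads the image if it
    is not already present locally.
    """
    return subprocess.check_output(cat_command(image, path)).decode("utf-8")
```

For example, read_tool_description("fooimage", "/usr/share/cwl/tool.cwl") would return the text of the tool description, where both the image name and the path are placeholders.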

I think option (4) URL to an external resource kind of defeats the
purpose of having it packed in with the image. The external resource
could be inaccessible or change unexpectedly.

I'm divided, since (1) putting the CWL document into the contents of a
label lets you use "docker inspect" to get it out easily without
executing anything, but generating the label to put on the image
requires a little bit of help to json-escape the document text as a
string.
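
A minimal sketch of reading such a label back out of "docker inspect" output, which is a JSON array with the labels under Config.Labels. The label key "cwl.document" and the function name are hypothetical; no label name had been agreed on at this point in the thread.

```python
import json


def label_from_inspect(inspect_output, key):
    """Return the value of label `key` from `docker inspect` JSON, or None."""
    entries = json.loads(inspect_output)  # docker inspect prints a JSON array
    if not entries:
        return None
    # "Labels" is null for images that carry no labels at all.
    labels = (entries[0].get("Config") or {}).get("Labels") or {}
    return labels.get(key)


# Hand-written stand-in for real `docker inspect fooimage` output.
sample = json.dumps(
    [{"Config": {"Labels": {"cwl.document": "class: CommandLineTool\n"}}}]
)
document = label_from_inspect(sample, "cwl.document")
```

Because the JSON decoder undoes the string escaping, the document comes back with its original newlines, and nothing inside the image has to be executed.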

Any other ideas?

Thanks,
Peter

Nebojsa Tijanic

Nov 24, 2015, 2:35:25 PM
to Peter Amstutz, Michael Crusoe, Stian Soiland-Reyes, common-workflow-language

On Tue, Nov 24, 2015 at 8:16 PM, Peter Amstutz <peter....@curoverse.com> wrote:
> I think option (4) URL to an external resource kind of defeats the
> purpose of having it packed in with the image.  The external resource
> could be inaccessible or change unexpectedly.

The external cwl resource(s) changing can be a feature of sorts - it allows for description updates without the heavy work of modifying the image.
Pointing to e.g. a specific git commit can keep them immutable, though they can still become inaccessible. Also, #4 is not mutually exclusive with #1-3.

I'd prefer #2 for storing the files since it allows for {import:} statements and keeps the YAML files readable. There's not much difference between docker inspect and docker run if the image needs to be downloaded anyway.

Peter Amstutz

Nov 30, 2015, 10:05:26 AM
to Nebojsa Tijanic, Michael Crusoe, Stian Soiland-Reyes, common-workflow-language
Here's what I came up with (but I think we should defer finalizing
this for the next draft.)

Dockerfile:

COPY example.cwl /usr/share/cwl/example/
LABEL org.w3id.cwl.tool="/usr/share/cwl/example/example.cwl"

Use a URL scheme like this to refer to files inside docker containers:

dockertool:example/container:tag?/usr/share/cwl/tool.cwl

Then using a tweaked "url join" function and url-aware file access
everything else involved in reading and executing the CWL file just
works.

Alternately, we could scan containers for the contents of
/usr/share/cwl, and the label might not even be necessary.

- Peter

Stian Soiland-Reyes

Nov 30, 2015, 11:14:22 AM
to Peter Amstutz, Nebojsa Tijanic, Michael Crusoe, common-workflow-language
This sounds like a sensible approach. Keep the description close to
the binary; keeping it online defeats the whole idea - this is
basically what we already can do.


As to requiring download before accessing CWL metadata, I think the
only metadata that is useful before downloading would be label and
description.

Those could (possibly duplicated or outdated) be provided as Docker
labels - the difference being that these would be labels and
descriptions for the docker image itself (which could have multiple
tool descriptions). I don't think this is CWL specific, so perhaps
just reuse (kind of) dcterms:title and dcterms:description?

LABEL dcterms.title="The Example Tool"
LABEL dcterms.description="Have fun using the Example Tool for any example usage"

(or.. if we go by the reverse domain notation strictly)


LABEL org.purl.dc.terms.title="The Example Tool"
LABEL org.purl.dc.terms.description="Have fun using the Example Tool for any example usage"

Using DC Terms means you can also add dcterms.license etc.

Nebojsa Tijanic

Dec 2, 2015, 11:25:14 AM
to Stian Soiland-Reyes, Peter Amstutz, Michael Crusoe, common-workflow-language
> Alternately, we could scan containers for the contents of /usr/share/cwl, and the label might not even be necessary.

How about a combination - put the directory containing cwl files in a docker label (and suggest people use a standard location like /usr/share). The use of the label signifies there's cwl inside, and other metadata can refer to common properties of all cwl descriptions.

> Use a URL scheme like this to refer to files inside docker containers:
> dockertool:example/container:tag?/usr/share/cwl/tool.cwl

Wish there was a way to make better URLs. The image repo+tag need to be part of the path (and we should allow for non-dockerhub images), but that makes it harder for the internal path to work with a regular urljoin. Perhaps if we require the tag to be present, we can use it to include the inside-docker path in the path section of the URL (and thus work with joins):

dockertool:example/container:tag/usr/share/cwl/tool.cwl for dockerhub
dockertool://docker.example.org/example/container:tag/usr/share/cwl/tool.cwl for other registries

I'm probably overthinking this, and putting the in-docker path in the query segment of the URL is fine.

Peter Amstutz

Dec 2, 2015, 5:52:19 PM
to Nebojsa Tijanic, Stian Soiland-Reyes, Michael Crusoe, common-workflow-language

Storing a directory path would work, although one benefit of referencing exact files is that one could start a container with "sleep 3600" and then use "docker cp" to pull files out (Stian's "docker run cat" trick works too) without additional fiddling with "ls".

The url thing is annoying because "foo", "foo/bar", "foo/bar:baz", "docker.io/foo/bar:baz" and the hexadecimal image id are all valid ways to refer to an image. Because there can be 0, 1, or 2 slashes and the colon is optional, there isn't really a way to parse it reliably except by either only accepting the least ambiguous form (e.g. dockertool://docker.io/foo/bar:baz/usr/share/tool.cwl, which is still a little awkward) or separating the path out into the query field.

So yes, I already over thought it and that's the conclusion I came to.

- Peter

Nebojsa Tijanic

Dec 3, 2015, 8:04:33 AM
to Peter Amstutz, Stian Soiland-Reyes, Michael Crusoe, common-workflow-language
On Wed, Dec 2, 2015 at 11:52 PM, Peter Amstutz <peter....@curoverse.com> wrote:

> So yes, I already over thought it and that's the conclusion I came to.

Haha. I'm fine with query though, was just going through the same thought process apparently :) 

Peter Amstutz

Dec 3, 2015, 9:05:10 AM
to Nebojsa Tijanic, Stian Soiland-Reyes, Michael Crusoe, common-workflow-language
I forgot to mention, it also turns out that the Python urlparse module
hard codes various URL schemes and applies different behavior
depending on the scheme, so if you give it a scheme it doesn't
recognize (such as dockertool:...) you don't get the same path joining
behavior as if it were a http:// URL. So in Python at least there's
no way around having a custom path join function.
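
A custom join along these lines could look roughly as follows. This is a minimal sketch assuming the query form dockertool:<image>?<path> discussed earlier in the thread; the function names are hypothetical, and it deliberately uses posixpath instead of urlparse so that the unrecognized scheme does not matter.

```python
import posixpath

SCHEME = "dockertool:"


def split_dockertool(url):
    """Split 'dockertool:<image>?<path>' into (image, path)."""
    if not url.startswith(SCHEME):
        raise ValueError("not a dockertool URL: %r" % url)
    # The first '?' ends the image reference, so slashes and colons inside
    # the image name never have to be interpreted.
    image, sep, path = url[len(SCHEME):].partition("?")
    if not sep or not image or not path.startswith("/"):
        raise ValueError("expected dockertool:<image>?/<path>: %r" % url)
    return image, path


def join_dockertool(base, ref):
    """Resolve `ref` against `base`, staying inside the same image."""
    # A colon before the first slash is treated as a scheme, so references
    # such as http://... are returned unchanged.
    if ":" in ref.split("/", 1)[0]:
        return ref
    image, path = split_dockertool(base)
    joined = posixpath.normpath(posixpath.join(posixpath.dirname(path), ref))
    return "%s%s?%s" % (SCHEME, image, joined)
```

With base "dockertool:example/container:tag?/usr/share/cwl/tool.cwl", the relative reference "types.yml" resolves to "dockertool:example/container:tag?/usr/share/cwl/types.yml".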

- Peter