Taverna, SCUFL2 and wfdesc


Stian Soiland-Reyes

Oct 21, 2014, 11:42:22 AM
to common-workf...@googlegroups.com
Hi,

I am one of the developers of the Taverna workflow system - http://taverna.org.uk/

(btw - we have just recently been accepted as an Apache Incubator project - http://incubator.apache.org/projects/taverna.html )



In Taverna 2, we had our workflow definition language called t2flow. It is a fairly one-to-one mapping to internal Java objects in Taverna, and people found it hard to develop against (although several did anyway, e.g. a web-based editor) - see http://ns.taverna.org.uk/2008/xml/t2flow/ for the gory details :)



For Taverna 3 we therefore made the Scufl2 workflow format - http://dev.mygrid.org.uk/wiki/display/developer/SCUFL2 - and a corresponding Java API - http://github.com/taverna/taverna-scufl2. SCUFL2 is not meant as a generic workflow language, but as a way to generalize the Taverna workflow model. See our "Taverna, reloaded" paper http://www.taverna.org.uk/pages/wp-content/uploads/2010/04/T2Architecture.pdf for details of the Taverna execution model.

In short, a Scufl2 Workflow Bundle is a structured ZIP file with a series of XML files, which follow XSD schemas (but are also valid RDF/XML). The schemas are currently at https://github.com/taverna/taverna-scufl2/tree/master/scufl2-rdfxml/src/main/resources/uk/org/taverna/scufl2/rdfxml/xsd
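
Since a workflow bundle is just a structured ZIP file, you can poke at one with standard tooling. A minimal Java sketch - the bundle filename is a placeholder, and the entry names in the comment are only illustrative:

    import java.util.zip.ZipEntry;
    import java.util.zip.ZipFile;

    public class ListBundleEntries {
        public static void main(String[] args) throws Exception {
            // "helloworld.wfbundle" is a placeholder; any Scufl2 bundle is a ZIP
            try (ZipFile zip = new ZipFile("helloworld.wfbundle")) {
                // Print every entry; expect things like workflowBundle.rdf
                // and RDF/XML files under workflow/ and profile/
                zip.stream().map(ZipEntry::getName).forEach(System.out::println);
            }
        }
    }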

Within the bundle, the configuration of each step (e.g. a command line tool invocation) is described as a JSON file - the structure varies per activity type, and Taverna has quite a few activity types from plugins - http://dev.mygrid.org.uk/wiki/display/tav250/Service+types


The JSON for the Command Line Tool activity, which I guess is most relevant to you, is unfortunately a bit in flux - we feel the need for a cleaner separation between the tool definition (command line, parameters, input and output files, etc.) and the invocation details (e.g. where to SSH or how to create symbolic links).


In Taverna's Workbench application we have a UI for creating such configurations on an ad-hoc basis with a kind of template-based shell script. I mentioned this in the call. See http://dev.mygrid.org.uk/wiki/display/tav250/Command and sibling pages.

It is possible to load a set of these descriptions from an XML file, to browse the available command line tools. Obviously this also requires the tools to be installed, so this has mainly been used within a grid execution infrastructure like Nordugrid/KnowARC. The most widely used XML file is http://taverna.nordugrid.org/sharedRepository/xml.php which is generated from http://taverna.nordugrid.org/sharedRepository/index.php


--
Stian Soiland-Reyes
http://orcid.org/0000-0001-9842-9718

Nebojsa Tijanic

Oct 23, 2014, 1:02:59 PM
to Stian Soiland-Reyes, common-workf...@googlegroups.com
Hi, Stian.

Thanks a lot for the links and explanations. I've seen some Taverna workflow XML files before, and they seemed too tightly coupled with Java to suit our purpose (I suppose they were from an older version). The Scufl2 format seems a lot more similar to what we've been discussing here. There was a lot about it I didn't understand; hope you don't mind me asking a few questions:

 - If I understand correctly, the "parallelize" layer would process individual items of a collection supplied on an input port if the incoming collection's depth is larger than the declared port depth. It seems that iterationStrategyStack is meant to specify what happens if there is more than one port with depth > expected. Is this correct? If so, are there any strategies apart from combinations of Cartesian product and zipping?

 - What is the difference between portDepth and granularPortDepth?

 - Can port types be declared? If so, what are the available types and are structs supported?

 - In the paper, the example for loops was an async service. This seems like it saves some compute resources (a thread), but it also seems like it could have been done by having the processor poll and block. Have you encountered any other cases where loops were needed?

Regarding the tool service, the model seems straightforward (judging by the screenshots). The advanced/file_lists wiki page is empty, and that was one of the things I didn't get: how do input/output ports map to tools that work with lists of files (of arbitrary length)?

Thanks,




--
Nebojsa Tijanic
Seven Bridges Genomics, Inc.

Stian Soiland-Reyes

Oct 24, 2014, 12:08:21 PM
to Nebojsa Tijanic, common-workf...@googlegroups.com, List for general discussion and hacking of the Taverna project

On 23 October 2014 18:02, Nebojsa Tijanic <nebojsa...@sbgenomics.com> wrote:
>
> Thanks a lot for the links and explanations. I've seen some taverna workflow
> XML files earlier and they seemed too tightly coupled with Java to suit our
> purpose (I suppose they were from older version).

Yes, that is one particular reason why we moved away from the t2flow format - it relied on a particular Java serialization that was hard to deal with outside the engine code, where those classes are not available.

> The Scufl2 format seems a
> lot more similar to what we've been discussing here. I didn't understand a
> lot of things about it, hope you don't mind me asking a few questions:

Feel free! I am sorry that our scufl2 documentation is still lacking quite a bit..

We have only used Scufl2 for Taverna workflows, so it follows the Taverna execution semantics quite closely. For other systems it would be natural to reuse what is equivalent (e.g. processors, ports, data links) and leave out what is not directly mappable, e.g. the dispatch layers.

We have made processors configurable as well, so if there is no need to distinguish between a node and its execution, profiles for workflow system X could skip activities and their binding to processors.

> - If I understand correctly, the "parallelize" layer would process
> individual items of a collection supplied on an input port if incoming
> collection depth is larger than declared port depth. It seems that
> iterationStrategyStack is meant to specify what happens if there is more
> than one port with depth > expected. Is this correct? If so, are there any
> strategies apart from combinations of Cartesian product and zipping?

That's right. If the port depths of the inputs match those on the link, there are no iterations and no parallelization within that processor - it is simply executed 'as is'. That is also the interpretation if no iteration strategy is defined. (In Taverna 1 we would autogenerate a cross product if nothing was defined, but this gave unpredictable iterations with more than two inputs.)

If the depth on the incoming link is less than the expected depth (e.g. a single value when expecting a list) it is simply wrapped in singleton list(s).
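
As a rough sketch of that wrapping rule (not Taverna code, just the idea):

    // Wrap a value in singleton lists until it reaches the expected depth;
    // a plain value has depth 0, each list layer adds one.
    static Object coerceDepth(Object value, int actualDepth, int expectedDepth) {
        while (actualDepth < expectedDepth) {
            value = java.util.List.of(value);
            actualDepth++;
        }
        return value;
    }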

There are only those two strategies at the moment, what we call cross and dot product.

http://dev.mygrid.org.uk/wiki/display/tav250/List+handling

It is also possible, in SCUFL2, to combine multiple layers in the iteration strategy stack to do, say, an outer dot product of lists at depth 2 and a cross product of the inner lists at depth 1. This is quite complex to explain to users, so we never added it to the user interface. We instead explain how to do this with a nested workflow - http://dev.mygrid.org.uk/wiki/display/tav250/List+handling#Listhandling-Usingnestedworkflowstotweaklisthandling
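
To make cross versus dot concrete, here is a minimal sketch of the two strategies for two input ports. This is a simplification - the engine actually matches indexed jobs as they arrive rather than materializing lists like this:

    import java.util.ArrayList;
    import java.util.List;

    // Cross product: every combination of items from a and b.
    static <A, B> List<List<Object>> cross(List<A> as, List<B> bs) {
        List<List<Object>> jobs = new ArrayList<>();
        for (A a : as)
            for (B b : bs)
                jobs.add(List.of(a, b));
        return jobs;
    }

    // Dot product: pair items index by index, like a zip,
    // stopping at the end of the shorter list.
    static <A, B> List<List<Object>> dot(List<A> as, List<B> bs) {
        List<List<Object>> jobs = new ArrayList<>();
        for (int i = 0; i < Math.min(as.size(), bs.size()); i++)
            jobs.add(List.of(as.get(i), bs.get(i)));
        return jobs;
    }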

In our engine we also have a third option which we have never really used - PrefixDotProduct.

> Matches jobs where the index array of the job on index 0 is the prefix of the index array of the job on index 1. This node can only ever have exactly two child nodes!

As it was never used, we did not add support for it in SCUFL2.

Are there other iteration mechanisms you are thinking of? E.g. "Just first value" or something?

> - What is the difference between portDepth and granularPortDepth?

This has to do with streaming activities. Most Taverna activities always have the same values for these two (in scufl2 you can therefore leave granularPortDepth==null to mean the same). However, some services output a list but are able to give back the values one at a time. The Taverna Engine supports this and forwards those items to any downstream service that expects individual values. The original service does however still have to return the final list when it has finished iterating (if it ever finishes) - which the iteration strategy, and downstream services expecting more than single values, are waiting for.

This is for instance the case for a service that retrieves a large CSV file or does a large SQL query, and can spit out the individual column values, row by row, even before the whole file has been transferred back.

In this case, granularPortDepth=0 and portDepth=1.

When modifying granularPortDepth there are many port combinations that don't make sense, as you have to output on all the granular ports at the same time. E.g. if output ports A, B and C have granularDepth 0 and depth 1, while port D has granular depth and depth 0, then you have to return all of A/B/C for every granular return:

{ a1, b1, c1 } # granular
{ a2, b2, c2 } # granular
{ a3, b3, c3 } # granular
{ [a1,a2,a3] [b1,b2,b3] [c1,c2,c3] d } # final
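
Conceptually (this is just a model, not the engine's API), a granular producer behaves something like:

    import java.util.ArrayList;
    import java.util.List;
    import java.util.function.Consumer;

    // Hypothetical sketch: emit each row to downstream consumers as soon as
    // it arrives (granularPortDepth=0), but still return the completed list
    // (portDepth=1) for consumers that need the whole collection.
    static List<String> streamRows(Iterable<String> rows, Consumer<String> downstream) {
        List<String> all = new ArrayList<>();
        for (String row : rows) {
            downstream.accept(row); // forwarded immediately at depth 0
            all.add(row);
        }
        return all; // the final depth-1 list
    }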

> - Can port types be declared? If so, what are the available types and are
> structs supported?

Syntactic types we do as part of the activity configuration - as different activities have different ways to describe bytes/integers/etc.

We use the configuration key "dataType" per port defined in the config, e.g.

{ "outputTypes": [
{ "port": "inputA", "dataType": "R_EXP" }
{ "port": "inputB", "dataType": "PNG_FILE" }
] }

In this example, configuring an R script, there are some pre-determined constants which determine how the output values are picked up or delivered to R - e.g. PNG_FILE will save a graph as a PNG, while R_EXP serializes an R structure so that it can be passed to another R script.

Most of the time we don't need to define the syntactic types in the workflow definition, as the service implementation will know based on the rest of the configuration - e.g. if we are using a WSDL service, then the XSD will say that for portA we need a string and for portB we need to base64-encode a binary. A REST service will know from the declared Content-Type whether the input is binary or a string - and from the returned HTTP Content-Type what we're getting back.

In t2flow we often included this inferred information at definition time, but we found that it would usually end up out of sync, wrong or confusing (with many home-brewed MIME types that were never used by anything). With SCUFL2 we wanted to move to a more prescriptive workflow definition.

Semantic types we do as annotations, as they do not affect the execution. We also see these as advisory, as the service might change its mind overnight and return something different (or, more likely, something even more specific).
We use http://purl.org/DP/components - see the extracted wfdesc (which includes all annotations):

https://github.com/wf4ever/scufl2-wfdesc/blob/master/src/test/resources/valid_component_imagemagickconvert.wfdesc.ttl

> - In the paper, the example for loops was an async service. This seems like
> it saves some compute resources (a thread), but like it also could have been
> done by having the processor poll+block. Have you encountered any other
> cases where loops were needed?

Yes, it can also be implemented by the specific activity/processor, which several of our plugins do. It is however difficult to support this for generic services, as there is no agreed system for poll/getResult.

The loop can also be used for user-driven loops, where you have an interactive step that asks for tweaked parameters or "is this result good enough?". This can even be automated, as you can configure the loop to feed output ports back in as new input ports. In this case your inner workflow must output a "loop": "false" value to stop the loop.

The loop condition checker can also call out to something in the world, e.g. loop until the weather forecast is sunny or a sensor measurement is within range.
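
A sketch of that feedback pattern - the "loop" port is from the description above, while runInner and the use of string maps are just illustrative:

    import java.util.HashMap;
    import java.util.Map;
    import java.util.function.Function;

    // Repeatedly invoke the inner workflow, feeding its outputs back in as
    // the next iteration's inputs, until it outputs "loop": "false".
    static Map<String, String> loopUntilDone(Map<String, String> inputs,
            Function<Map<String, String>, Map<String, String>> runInner) {
        Map<String, String> current = new HashMap<>(inputs);
        while (true) {
            Map<String, String> outputs = runInner.apply(current);
            if ("false".equals(outputs.get("loop")))
                return outputs;               // condition port says stop
            current = new HashMap<>(outputs); // feed back as new inputs
        }
    }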

> Regarding the tool service, the model seems straightforward (by looking at
> the screenshots). The advanced/file_lists wiki page is empty, and that was
> one of the things I didn't get: how do input/output ports map to tools that
> work with lists of files (of arbitrary length)?

It gets trickier, as Taverna lists are ordered. For input lists you can choose to have them passed as a folder: we'll make folder/0, folder/1 etc. and also produce an index file.
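
Roughly like this - the folder/0, folder/1 naming follows the description above, but the index file name and format here are made up:

    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.util.List;

    // Materialize an ordered list as folder/0, folder/1, ...
    // plus an index file recording the order.
    static void writeListAsFolder(List<byte[]> items, Path folder) throws IOException {
        Files.createDirectories(folder);
        StringBuilder index = new StringBuilder();
        for (int i = 0; i < items.size(); i++) {
            Files.write(folder.resolve(Integer.toString(i)), items.get(i));
            index.append(i).append('\n');
        }
        Files.writeString(folder.resolve("index.txt"), index.toString());
    }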

We do not yet have support for arbitrary lists of outputs, but I guess you could do the same in reverse.

--
Stian Soiland-Reyes, myGrid team
School of Computer Science
The University of Manchester
http://soiland-reyes.com/stian/work/ http://orcid.org/0000-0001-9842-9718

Nebojsa Tijanic

Oct 30, 2014, 7:21:59 AM
to Stian Soiland-Reyes, common-workf...@googlegroups.com, List for general discussion and hacking of the Taverna project
Hi, Stian.

Thank you for the in-depth explanation. The workflow model seems really good and I hope that we will be directly compatible with it (different encoding, but hopefully basically the same thing).

In all the Taverna examples I've seen so far, individual records (or lists of records) flow through the data links. Is there a way to pass files between components (without streaming record by record)? If so, can additional metadata about a file be included as well? I've read something about data being passed as URLs, but I can't remember where, and it wasn't in depth. Happy to just look at the docs if you can point me to them.

If so, it feels like our current draft of the tool descriptions can in many cases be mapped directly to Taverna (by including a custom activity/processor, I think). The workflow execution may not be optimal, but it feels like it would work.

Stian Soiland-Reyes

Nov 5, 2014, 7:13:30 AM
to Nebojsa Tijanic, common-workf...@googlegroups.com, List for general discussion and hacking of the Taverna project
The elements passed can be of different types. Internally in the engine we always have an indirection of a "reference set", which can point to either a file, a URL, an "inline string" or "inline bytes" - the last two being just in-memory representations.
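
As a conceptual model - these are not the engine's actual classes - the indirection could be pictured as a small sum type:

    import java.net.URI;
    import java.nio.file.Path;

    // Hypothetical model of one "reference set" alternative: a single
    // logical value the engine can resolve from any of these forms.
    sealed interface Reference permits FileRef, UrlRef, InlineString, InlineBytes {}
    record FileRef(Path path) implements Reference {}
    record UrlRef(URI url) implements Reference {}
    record InlineString(String value) implements Reference {}
    record InlineBytes(byte[] bytes) implements Reference {}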

The workflow definition is however agnostic about what those files are/would be/will be - but that is something we added support for in the abstract workflow definitions in wfdesc (http://purl.org/wf4ever/wfdesc), where basically you can declare the wfdesc:Artifact as a fixed filename.

We do not yet have a good way in Taverna to attach metadata to the data items that pass along. We want to add a way to attach arbitrary provenance to a workflow run from a service - and perhaps allowing any kind of annotations along the way would be the way forward (our Data Bundle output already has the mechanism for capturing that - https://w3id.org/bundle/#manifest-annotations).

This would for instance be a way to attach "image/svg+xml" as
dc:format on a binary blob, so that it can get a nice extension/view
coming out of the workflow.





Nebojsa Tijanic

Nov 6, 2014, 10:41:34 AM
to Stian Soiland-Reyes, common-workf...@googlegroups.com, List for general discussion and hacking of the Taverna project
Hi, Stian.

This is great! wfprov basically solves a problem we wanted to tackle later (run descriptions) and wfdesc is general enough that we're definitely compatible with it.

Annotations seem great for end results of the workflow, but don't seem needed for intermediary values if wfdesc:Artifact can be an arbitrary JSON-compatible structure (which is what we plan to use).

I've two more questions regarding the "parallelize" layer:
 - Suppose a processor takes a list of values and outputs a single value. If we supply a matrix on the input port, will the output automatically be a list?
 - Since nested workflows are treated as processors, does that mean the input ports of the workflow will act as wait/join spots for the data flow? For example, if a workflow has a processor with input/output ports of depth 0 and it gets a list from an upstream processor, it will not wait for every item of the list to be ready; if the same processor is nested in a workflow, will it wait for the whole list to be ready? If so, was this designed to solve a specific problem?

Alan R Williams

Nov 6, 2014, 11:04:10 AM
to common-workf...@googlegroups.com
On 06/11/2014 15:41, Nebojsa Tijanic wrote:
> Hi, Stian.
>
> This is great! wfprov basically solves a problem we wanted to tackle
> later (run descriptions) and wfdesc is general enough that we're
> definitely compatible with it.
>
> Annotations seem great for end results of the workflow, but don't seem
> needed for intermediary values if wfdesc:Artifact can be an arbitrary
> JSON-compatible structure (which is what we plan to use).

There has been discussion in the Taverna community of some annotations being generated by services, for example to associate metadata with data. In that case, being able to annotate intermediate values will be useful.

> I've two more questions regarding the "parallelize" layer:
> - Suppose a processor takes a list of values and outputs a single
> value. If we supply it a matrix on the input port, will the output
> automatically be a list?

Yes.

> - Since nested workflows are treated as processors, does that mean the
> input ports of the workflow will act as wait/join spots for data flow?
> For example, if a workflow has a processor with input/output ports of
> depth 0 and gets a list from an upstream processor, it will not wait for
> each item of the list to be ready; if the same processor is nested in a
> workflow, will it wait for the list to be ready?

It depends on whether you specified that the input to the nested workflow was a list or a single value. If you specify that it is a list, then there is a single invocation of the nested workflow; it waits until it has the list, and the processor inside the nested workflow then iterates over the list. If you specify that the input to the nested workflow is a single value, then you will have iterations of the nested workflow itself, and inside each nested workflow invocation there would be one processor invocation.
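
Illustratively - the names here are made up, only the depth semantics follow the description above:

    import java.util.List;
    import java.util.function.Function;

    // Declared list input (depth 1): one invocation of the nested workflow,
    // which receives the whole list and iterates internally.
    static <T, R> R invokeOnList(List<T> list, Function<List<T>, R> nested) {
        return nested.apply(list);
    }

    // Declared single-value input (depth 0): the outer engine iterates,
    // invoking the nested workflow once per item.
    static <T, R> List<R> invokePerItem(List<T> list, Function<T, R> nested) {
        return list.stream().map(nested).toList();
    }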

> If so, was this
> designed to solve a specific problem?

Alan

Stian Soiland-Reyes

Nov 11, 2014, 8:58:30 PM
to common-workf...@googlegroups.com
On 6 November 2014 16:05, Alan R Williams
<alan.r....@manchester.ac.uk> wrote:

>> If so, was this
>> designed to solve a specific problem?

The pipelining (executing as soon as data is available) was added as a
significant speed-up for workflows where basically every other step is
iterating over a data-load+split step at the very top. Added in
Taverna 2, this was a natural progression from the existing "run as
soon as all inputs are ready" semantics. We just changed the "run"
from a single processor run over the accumulated inputs to individual
iteration runs over indexed inputs (producing indexed outputs).

The fact that Taverna's nested workflows do not stream through can be
seen either as a bug or a feature. In a way it is part of their promise
of being a process that they consume data at the port depths they
declare - if data just flowed in and out independently, using the
input/output ports as, well, ports, the border of the nested workflow
would be washed away, and it would no longer be meaningful to talk
about particular runs of that nested workflow.


We also have control links, which cause a processor to not start until
the controlling processor has completed all its iterations -
synchronizing with a nested workflow is however more powerful, because
you can basically 'block' at intermediate list depths, and its inner
inputs will be traversed in list order rather than in the order
results arrive.