On 23 October 2014 18:02, Nebojsa Tijanic <nebojsa...@sbgenomics.com> wrote:
>
> Thanks a lot for the links and explanations. I've seen some taverna workflow
> XML files earlier and they seemed too tightly coupled with Java to suit our
> purpose (I suppose they were from older version).
Yes, that is one particular reason why we moved away from the t2flow format - it relied on a specific Java serialization that was hard to deal with outside the engine code, where those classes are not available.
> The Scufl2 format seems a
> lot more similar to what we've been discussing here. I didn't understand a
> lot of things about it, hope you don't mind me asking a few questions:
Feel free! I am sorry that our scufl2 documentation is still lacking quite a bit..
We have only used Scufl2 for Taverna workflows, so it follows the Taverna execution semantics quite closely. For other systems it would be natural to reuse what can be equivalent (e.g. processors, ports, data links) and leave out what is not directly mappable, e.g. the dispatch layers.
We have made processors be configurable as well, so if there is no need to distinguish between a node and its execution, then profiles for workflow system X could skip activities and their binding to processors.
> - If I understand correctly, the "parallelize" layer would process
> individual items of a collection supplied on an input port if incoming
> collection depth is larger than declared port depth. It seems that
> iterationStrategyStack is meant to specify what happens if there is more
> than one port with depth > expected. Is this correct? If so, are there any
> strategies apart from combinations of Cartesian product and zipping?
That's right. If the port depths of inputs match those on the link, there would be no iterations and no parallelization within that processor - it is simply executed 'as is'. That is also the interpretation if there is no iteration strategy defined. (In Taverna 1 we would autogenerate a cross product if nothing was defined, but this gave unpredictable iterations with more than two inputs.)
If the depth on the incoming link is less than the expected depth (e.g. a single value when expecting a list) it is simply wrapped in singleton list(s).
There are only those two strategies at the moment, what we call cross and dot product.
http://dev.mygrid.org.uk/wiki/display/tav250/List+handling
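To make the two strategies and the singleton wrapping concrete, here is a rough sketch in Python. This is not Taverna code: the function names are made up for illustration, and the equal-length check in the dot product is an assumption of this sketch rather than documented engine behaviour.

```python
from itertools import product


def cross_product(*inputs):
    """Cross product: one job for every combination of items across the lists."""
    return list(product(*inputs))


def dot_product(*inputs):
    """Dot product: pair up the items found at the same index in each list.

    Assumption of this sketch: lists of unequal length are rejected.
    """
    lengths = {len(lst) for lst in inputs}
    if len(lengths) > 1:
        raise ValueError("dot product needs lists of equal length")
    return list(zip(*inputs))


def wrap_to_depth(value, actual_depth, expected_depth):
    """Wrap a too-shallow value in singleton lists until the depths match."""
    for _ in range(expected_depth - actual_depth):
        value = [value]
    return value
```

With two inputs [1, 2] and ["a", "b"], the cross product gives four jobs and the dot product gives two; wrap_to_depth("x", 0, 1) turns a single value into ["x"].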
It is also possible, in SCUFL2, to combine multiple layers in the iteration strategy stack to do, say, an outer dot product of lists at depth 2 and a cross product of the inner lists at depth 1. This is quite complex to explain to users, so we never added it to the user interface. We instead explain how to do this with a nested workflow - http://dev.mygrid.org.uk/wiki/display/tav250/List+handling#Listhandling-Usingnestedworkflowstotweaklisthandling
In our engine we also have a third option which we have never really used - PrefixDotProduct.
> Matches jobs where the index array of the job on index 0 is the prefix of the index array of the job on index 1. This node can only ever have exactly two child nodes!
As it was never used we did not add support for this to SCUFL2.
Are there other iteration mechanisms you are thinking of? E.g. "Just first value" or something?
> - What is the difference between portDepth and granularPortDepth?
This has to do with streaming activities. Most Taverna activities always have the same values for these (in scufl2 you can therefore leave granularPortDepth==null to mean the same). However, there are some services that output a list, but which are able to give back those values one at a time. The Taverna Engine supports this - and forwards those items to any downstream service that expects individual values. The original service does however still have to return the final list when it is finished iterating (if it ever finishes) - which the iteration strategy and downstream services expecting more than single values are waiting for.
This is for instance the case for a service that retrieves a large CSV file or does a large SQL query, and can spit out the individual column values, row by row, even before the whole file has been transferred back.
In this case, granularPortDepth=0 and portDepth=1.
When modifying granularPortDepth there are many port combinations that don't make sense, as you have to output on all the granular ports at the same time, e.g. if you say output ports A, B and C are granularDepth 0 and depth 1, while port D is granular/depth 0 - then you have to return all of A/B/C for every granular return.
{ a1, b1, c1 } # granular
{ a2, b2, c2 } # granular
{ a3, b3, c3 } # granular
{ [a1,a2,a3] [b1,b2,b3] [c1,c2,c3] d } # final
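The same sequence can be sketched as a Python generator. This only illustrates the ordering of granular and final returns and is not the Taverna activity API; the name stream_rows, the event tags, and using the row count as the value of port D are all assumptions made for the example.

```python
def stream_rows(rows):
    """Yield one 'granular' event per row, then a single 'final' event.

    Ports A, B and C have granular depth 0 and depth 1, so each is emitted
    item by item and again as a complete list at the end. Port D has depth 0
    and only appears in the final return (here: the row count, hypothetical).
    """
    a_items, b_items, c_items = [], [], []
    for a, b, c in rows:
        a_items.append(a)
        b_items.append(b)
        c_items.append(c)
        # All granular ports must be output together for each granular return.
        yield ("granular", {"A": a, "B": b, "C": c})
    # The final return carries the complete lists plus the depth-0 port D.
    yield ("final", {"A": a_items, "B": b_items, "C": c_items, "D": len(a_items)})
```

A downstream consumer expecting single values can act on each "granular" event as it arrives, while one expecting lists waits for the "final" event.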
> - Can port types be declared? If so, what are the available types and are
> structs supported?
Syntactic types we do as part of the activity configuration - as different activities have different ways to describe bytes/integers/etc.
We use the configuration key "dataType" per port defined in the config, e.g.
{ "outputTypes": [
{ "port": "inputA", "dataType": "R_EXP" },
{ "port": "inputB", "dataType": "PNG_FILE" }
] }
In this example, configuring an R script, there are some pre-determined constants which determine how the output values are picked up or delivered to R - e.g. PNG_FILE will save a graph as a PNG, while R_EXP serializes an R structure so that it can be passed to another R script.
Most of the time we don't need to define the syntactic types in the workflow definition, as the service implementation will know based on the rest of the configuration - e.g. if we are using a WSDL Service then the XSD will say that for portA we need a string, and for portB we need to base64-encode a binary. A REST service will know based on the declared Content-Type if the input is binary or string - and it will know from the HTTP-returned Content-Type what we're getting back.
In t2flow we often included this inferred information at definition time, but we found that it would usually end up out of sync, wrong or confusing (many home-brewed MIME types that were never used by anything). With SCUFL2 we wanted to move to a more prescriptive workflow definition.
Semantic types we do as annotations, as they do not affect the execution. We also see these as advisory, as the service might change its mind overnight and return something different (or, more likely, something even more specific).
We use http://purl.org/DP/components - see the extracted wfdesc (which includes all annotations):
> - In the paper, the example for loops was an async service. This seems like
> it saves some compute resources (a thread), but like it also could have been
> done by having the processor poll+block. Have you encountered any other
> cases where loops were needed?
Yes, it can also be implemented by the specific activity/processor, which several of our plugins do. It is however difficult to support this for generic services as no agreed system for poll/getResult is established.
The loop can also be used for user-driven loops, where you have an interactive step that asks for tweaked parameters or "is this result good enough". This can even be automated, as you can configure the loop to feed output ports back as new input ports. In this case your inner workflow must output a "loop": "false" value to stop the loop.
The loop condition checker can be calling something in the world, e.g. loop until the weather forecast is sunny or a sensor measurement is within range.
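The feedback loop could be sketched roughly like this in Python. It is only a sketch of the idea, not the Taverna loop layer: run_loop, the max_iterations guard and the halve example workflow are hypothetical, and only the "loop": "false" stop value comes from the description above.

```python
def run_loop(inner_workflow, inputs, max_iterations=100):
    """Re-run inner_workflow until it outputs "loop": "false".

    Outputs whose names match input ports are fed back as the next inputs.
    max_iterations is a safety guard assumed for this sketch.
    """
    for _ in range(max_iterations):
        outputs = inner_workflow(inputs)
        if outputs.get("loop") == "false":
            return outputs
        # Feed matching output ports back in as new input values.
        feedback = {name: value for name, value in outputs.items() if name in inputs}
        inputs = {**inputs, **feedback}
    raise RuntimeError("loop did not stop within max_iterations")


def halve(inputs):
    """Hypothetical inner workflow: halve the error until it drops below 1."""
    error = inputs["error"] / 2
    return {"error": error, "loop": "false" if error < 1 else "true"}
```

Calling run_loop(halve, {"error": 8}) keeps feeding the "error" output back in until the inner workflow reports "loop": "false". The condition check could just as well call out to an external service or ask the user.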
> Regarding the tool service, the model seems straightforward (by looking at
> the screenshots). The advanced/file_lists wiki page is empty, and that was
> one of the things I didn't get: how do input/output ports map to tools that
> work with lists of files (of arbitrary length)?
It gets trickier.. as Taverna lists are ordered. For input lists you can say to put it as a folder, and we'll make folder/0, folder/1 etc. and also produce an index file.
We have not yet got support for arbitrary lists of outputs, but I guess you could do the same in reverse.
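As a rough illustration of the folder approach for input lists, and of how the reverse could work for outputs, here is a Python sketch. It is not how the Taverna tool service is implemented: the function names, the index file name "index.txt" and its one-name-per-line format are all assumptions.

```python
import os


def write_list_as_folder(items, folder, index_name="index.txt"):
    """Write each list item to folder/0, folder/1, ... plus an index file.

    The index file (name and format assumed here) records the order.
    """
    os.makedirs(folder, exist_ok=True)
    names = []
    for position, item in enumerate(items):
        name = str(position)
        with open(os.path.join(folder, name), "w", encoding="utf-8") as handle:
            handle.write(item)
        names.append(name)
    with open(os.path.join(folder, index_name), "w", encoding="utf-8") as handle:
        handle.write("\n".join(names))
    return names


def read_folder_as_list(folder, index_name="index.txt"):
    """Reverse direction: rebuild an ordered list by following the index file."""
    with open(os.path.join(folder, index_name), encoding="utf-8") as handle:
        names = handle.read().splitlines()
    items = []
    for name in names:
        with open(os.path.join(folder, name), encoding="utf-8") as handle:
            items.append(handle.read())
    return items
```

Reading via the index file rather than a directory listing keeps the list order intact, since a plain listing would sort "10" before "2".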
--
Stian Soiland-Reyes, myGrid team
School of Computer Science
The University of Manchester
http://soiland-reyes.com/stian/work/ http://orcid.org/0000-0001-9842-9718