Sumatra and workflow management systems

Denis

Sep 1, 2015, 6:50:56 PM
to sumatra-users
Dear Sumatra users,

I am implementing tools for reproducible research in our lab, and I am learning about the capabilities of Sumatra. My question: has anyone tried using it in combination with workflow management systems?

Currently, I am working on a use case that I describe below, and I would greatly appreciate your help in understanding how I could use Sumatra to handle it. Let's assume I have a task consisting of two steps:

Step 1. Given an image img_0001.tif and a file with input parameters track_assumptions_3.param, run a tracking algorithm tracker.py that outputs a file tracking_results.txt.

Step 2. Run a statistical analysis stat.R, given as input tracking_results.txt and some parameters stat_assumptions_8.param. The result is stat_results.txt.

Further, assume there are two students. The first student tests different tracking algorithms with different parameters and uses Sumatra to keep track of all his steps (SumatraProject1). He stores the results in a folder that is available to the second student, who performs Step 2. The second student tests different statistical analysis scripts and also uses Sumatra (SumatraProject2).

When we have some results of the statistical analysis (e.g. stat_results.txt), we would like to know which inputs we used in all processing steps (e.g. img_0001.tif, track_assumptions_3.param, tracking_results.txt, stat_assumptions_8.param). Of course, I could write some custom functions to recursively collect all inputs from SumatraProject1 and SumatraProject2. However, I am wondering whether there are any existing solutions to this issue. Ideally, it would be great to build a dependency graph of the different data-processing steps, which is a feature of workflow management systems... One year ago, in this thread, I found the following:

"However, workflows _can_ be supported implicitly (identifying sequences/graphs of computations in which the output of one computation is the input to another) - we currently have a Google Summer of Code student working on tools to support this."

Any updates on that?

Best regards,
Denis

Andrew Davison

Sep 2, 2015, 4:18:45 AM
to sumatr...@googlegroups.com
Dear Denis,

> On 02 Sep 2015, at 00:50, Denis <denis.s...@gmail.com> wrote:
>
> Dear Sumatra users,
>
> I am implementing tools for reproducible research in our lab, and I am learning about the capabilities of Sumatra. My question: has anyone tried using it in combination with workflow management systems?
>
> Currently, I am working on a use case that I describe below, and I would greatly appreciate your help in understanding how I could use Sumatra to handle it. Let's assume I have a task consisting of two steps:
>
> Step 1. Given an image img_0001.tif and a file with input parameters track_assumptions_3.param, run a tracking algorithm tracker.py that outputs a file tracking_results.txt.
>
> Step 2. Run a statistical analysis stat.R, given as input tracking_results.txt and some parameters stat_assumptions_8.param. The result is stat_results.txt.
>
> Further, assume there are two students. The first student tests different tracking algorithms with different parameters and uses Sumatra to keep track of all his steps (SumatraProject1). He stores the results in a folder that is available to the second student, who performs Step 2. The second student tests different statistical analysis scripts and also uses Sumatra (SumatraProject2).

To handle this in Sumatra you would need to have a single Project. Each student could maintain their own local configuration (e.g. student 1 would have Python as the default executable, student 2 would have R), but the project name and record store would have to be the same for both students. Tags could be used to filter the records so each student could display only the records they are working on themselves.
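
To illustrate the tag-based filtering idea, here is a minimal sketch in plain Python. It does not use Sumatra's API: the record layout, the labels (run_a, run_b, run_c) and the tag names (tracking, statistics, final) are all assumptions made up for the example. It only shows how records from both students can live in one shared store and still be viewed separately.

```python
# Hypothetical records from one shared record store. The dict layout, labels
# and tag names are illustrative only, not Sumatra's storage format.
records = [
    {"label": "run_a", "executable": "python", "tags": {"tracking"}},
    {"label": "run_b", "executable": "R", "tags": {"statistics"}},
    {"label": "run_c", "executable": "python", "tags": {"tracking", "final"}},
]


def with_tag(records, tag):
    """Return the labels of the records that carry `tag`, in store order."""
    return [rec["label"] for rec in records if tag in rec["tags"]]


# Each student looks only at the records tagged for their own step.
tracking_runs = with_tag(records, "tracking")
statistics_runs = with_tag(records, "statistics")
```

Because both students write to the same store, a record from Step 2 can still refer to a file produced in Step 1, which would not be possible with two separate projects.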

It would be possible in principle to allow records to belong to multiple projects, but this has not been implemented.

> When we have some results of the statistical analysis (e.g. stat_results.txt), we would like to know which inputs we used in all processing steps (e.g. img_0001.tif, track_assumptions_3.param, tracking_results.txt, stat_assumptions_8.param). Of course, I could write some custom functions to recursively collect all inputs from SumatraProject1 and SumatraProject2. However, I am wondering whether there are any existing solutions to this issue. Ideally, it would be great to build a dependency graph of the different data-processing steps, which is a feature of workflow management systems... One year ago, in this thread, I found the following:
>
> "However, workflows _can_ be supported implicitly (identifying sequences/graphs of computations in which the output of one computation is the input to another) - we currently have a Google Summer of Code student working on tools to support this."

The (local) web interface now contains links between records and data files, so from the page for stat_results.txt, you can follow the hyperlinks back through all the intermediate computations to the original data. It should be straightforward to add a view which shows the entire dependency graph on a single web page. Can I suggest you create a new issue in the issue tracker (https://github.com/open-research/sumatra/issues) explaining precisely what you would like to see?
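
Following the links back from stat_results.txt to the original data amounts to a graph traversal, which can be sketched roughly as below. This is a simplified illustration, not Sumatra code: the Record tuple, the labels tracking_run and stat_run, and the function trace_inputs are hypothetical, and the sketch assumes that a file name alone identifies a data file (a real record store would more likely also compare a content hash).

```python
from collections import namedtuple

# Hypothetical, simplified stand-in for a provenance record. This is not
# Sumatra's Record class, only the fields the traversal needs.
Record = namedtuple("Record", ["label", "inputs", "outputs"])


def trace_inputs(target, records):
    """Return (set of input files, list of record labels) behind `target`."""
    # Map each output file to the record that produced it.
    producer = {out: rec for rec in records for out in rec.outputs}
    seen_files, used_records = set(), []
    stack = [target]
    while stack:
        path = stack.pop()
        rec = producer.get(path)
        if rec is None:
            continue  # original data: no record in the store produced it
        if rec.label not in used_records:
            used_records.append(rec.label)
        for inp in rec.inputs:
            if inp not in seen_files:
                seen_files.add(inp)
                stack.append(inp)  # keep walking back through this input
    return seen_files, used_records


# The two steps from the use case, stored as records in one shared project.
records = [
    Record("tracking_run",
           ["img_0001.tif", "track_assumptions_3.param"],
           ["tracking_results.txt"]),
    Record("stat_run",
           ["tracking_results.txt", "stat_assumptions_8.param"],
           ["stat_results.txt"]),
]

files, runs = trace_inputs("stat_results.txt", records)
```

Here files contains all four inputs named in the question, including the intermediate tracking_results.txt, and runs lists the two computations, starting with the one closest to the final result. The same output-to-producer mapping could feed a single-page dependency graph view.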

Also, since I have limited time, if anyone on the list is interested in working on this, please chime in!

Cheers,

Andrew
