Execution of a workflow with iterables set at the start


Augustine Koh

Apr 12, 2014, 9:20:14 PM
to nipy...@googlegroups.com
Dear all, 

I have some questions about a workflow I recently built successfully (albeit a very short one) that consists of 2 registration nodes. Since it works, I will soon be building a bigger pipeline that would hopefully replace the manual processing we do in our group, so I'm taking it step by step to be sure that my concepts and understanding are correct before I move on to the next level.

The DAG of the workflow is attached to this post. As you can see, it contains an infosource IdentityInterface node at the beginning to ensure that the portion from datasource to datasink gets repeated for each of my subjects. I have 16 mouse subjects, and when I executed the whole workflow it ran fine and all the data turned out as expected, but I have some questions about the performance and behavior of the execution with infosource that I hope to ask here:

1. When I executed the workflow, I could see from the messages in the terminal that the workflow was executed for one mouse at a time (i.e. the registration workflow from datasource to datasink was run for all 16 mice sequentially; the program would wait for the previous mouse's run to complete before it executed the workflow for the next mouse). I was wondering if it would be possible to execute the entire thing in parallel for all 16 mice simultaneously. I've considered using MapNode to wrap the infosource node, but I'm concerned that would be using MapNode wrongly, since it ultimately reduces the branches into one, which is not what I'm trying to do here. Also, I'm running the workflow on my single MacBook Pro, so it's a small machine, and from what I've read about grid engines, most of them are only meant for large computing clusters, so it seems like they can't help here. Any ideas or suggestions to enable simultaneity would be greatly appreciated!

2. This question is more of a conceptual one. I noticed that the workflow executed for the 16 mice in a random order that did not follow the subject numbers. For example, the workflow would execute first on mouse 3, then go on to mouse 9, then mouse 6, and so on until the last mouse. Is this normal behavior, and is there any reason why it behaves as such? In the source code for my workflow, I listed the mouse numbers in the infosource iterables in a nice sequence [1,2,3,...,16], so I always thought that since the workflow ran sequentially, it would follow this sequence too.
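For reference, the iterables declaration in my script looks roughly like this (a sketch; the node name `infosource` and field `subject_id` are illustrative, and the node creation itself is omitted):

```python
# Sketch of how the subject numbers are declared as iterables on the
# infosource (IdentityInterface) node; the list is written in order 1..16.
subject_ids = list(range(1, 17))
iterables = [("subject_id", subject_ids)]
# infosource.iterables = iterables  # `infosource` is the IdentityInterface node
```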

Thank you very much!

Best regards,
Augustine
graph.dot.png

Michael Waskom

Apr 13, 2014, 10:10:05 PM
to nipy...@googlegroups.com
Hi Augustine,

For parallel execution in a local environment you can use the MultiProc (multiprocessing) plugin. You set the plugin when you call Workflow.run(), which is what determines whether your graph executes in series or in parallel. Once you do that, each iterable value should run in parallel.
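For example (a sketch; `wf` stands for your existing Workflow object, and `n_procs=4` is just an assumed cap on worker processes that suits a laptop):

```python
# Keyword arguments selecting parallel local execution with the
# MultiProc plugin; n_procs limits how many processes run at once.
run_kwargs = {"plugin": "MultiProc", "plugin_args": {"n_procs": 4}}
# wf.run(**run_kwargs)  # versus plain wf.run(), which executes serially
```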

Best,
Michael

Augustine Koh

Apr 13, 2014, 11:18:40 PM
to nipy...@googlegroups.com
Dear Michael,

Thank you for your advice and for directing me to the nipype plugins documentation page! I think I must have seen the multiprocessing plugin before, but I didn't understand that "local system" could refer to a small machine such as my MacBook. Now I get the drift. And just to be sure: if we never specify any plugin in the workflow's run() method, is the workflow always going to use the Linear plugin? I've noticed that when no distribution plugin is specified, the workflows always execute serially.

I've just tried the MultiProc plugin, and the performance gains are great even for such a small workflow, reducing the time from 7 minutes to 2 minutes! So far, I've been keeping track of my workflows' durations by reading the terminal output (the date and time are printed before each node executes) and subtracting the start time from the end time. However, I'm wondering if there's any way to get the workflow to automatically print its execution duration at the end, just like the time that gets printed at the end of the nipype installation tests?

Best regards,
Augustine

Michael Waskom

Apr 15, 2014, 11:33:43 AM
to nipy...@googlegroups.com
There's nothing built into nipype that I'm aware of, but you could easily add that to your script using the time module.
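A minimal sketch (the commented `wf.run()` call stands in for your actual workflow; the sleep is only a placeholder so the snippet has something to time):

```python
import time

start = time.time()
# wf.run(plugin="MultiProc")  # your actual workflow call goes here
time.sleep(0.2)  # placeholder standing in for the real work
elapsed = time.time() - start
print(f"Workflow finished in {elapsed:.1f} s")
```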

Satrajit Ghosh

Apr 15, 2014, 4:54:05 PM
to nipy-user
hi augustine,

nipype stores provenance info for each node in files called provenance.provn and provenance.ttl; it also generates a workflow-level provenance file. each node's record contains timestamps for start and end plus the duration. these should give you the numbers you are looking for.
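if those files are not being written, provenance recording can be switched on in the nipype configuration (a sketch; check the option name against the nipype config docs):

```ini
[execution]
write_provenance = true
```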

also, regarding ordering: nipype uses a depth-first topological sort by default for any of the batch modes. this creates a slight bias towards nodes from a single disconnected subgraph. however, there is no guaranteed correspondence between the order of iterables and the sequence of execution.

cheers,

satra


Augustine Koh

Apr 20, 2014, 10:24:29 PM
to nipy...@googlegroups.com
Dear Satra,

I did not manage to find any files named provenance after executing my workflows. However, I found a file called "report.rst" inside each unique directory where each node gets executed (the sandboxes); might that be the same as the provenance files you were telling me about?
The file contains sections such as "hierarchy", "exec id", "original inputs", "execution inputs", "execution outputs", "runtime info" (where I could find the node's duration, but no explicit start or end time), and "Environment" (the execution environment).

Also, I've noticed that after executing a workflow, a folder (with the same name as the workflow itself) gets created in the same directory as the workflow's Python script. The folder contains files named "d3.v3.min.js", "graph.json", "graph1.json", and "index.html". I've opened them up to take a look, and they all seem quite cryptic, except for index.html, which opens as a diagram. Do you know what these files are for?

Ah, that explains the order of execution.

Thank you!

Best regards,
Augustine

Augustine Koh

Apr 20, 2014, 10:26:12 PM
to nipy...@googlegroups.com
Dear Michael,

Yes, I think I could try doing that. To do so, I would need to use the Function interface to create an interface for Python's time module, right?

Thank you.

Best regards,
Augustine

Augustine Koh

Apr 20, 2014, 10:31:24 PM
to nipy...@googlegroups.com
Dear Satra,

It's good to be able to obtain all the provenance info through the provenance files, but I was thinking it would be very convenient if the time taken for the workflow execution could be printed in the terminal, just as the time was printed when we ran the installation tests for nipype. Perhaps the only way to do so is to create an interface for the Python time module and wrap it up as a node? But then the time module isn't something that can accept neuroimaging data inputs from other nodes in a workflow...

Thank you.

Best regards,
Augustine