Nipype with SGEGraph plugin: MapNode parallelization


Nicolas Toussaint

Jul 10, 2014, 2:19:17 PM
to nipy...@googlegroups.com
Hello,

I am trying to submit a workflow containing MapNodes to an SGE cluster. Correct me if I am wrong, but:

When using the 'SGE' plugin, it submits jobs one after the other, and when it encounters a MapNode it nicely submits all of the iterated sub-jobs at the same time.
When using the 'SGEGraph' plugin, it nicely submits all nodes to the queue up front. However, when it encounters a MapNode, it treats the iterables sequentially and submits all of them to a single cluster node one after the other, thereby losing the potential parallelism of the iterables.

Is there a simple solution to this? Shouldn't nipype's SGEGraph plugin submit MapNode iterables as a vector of jobs?
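[For illustration: a "vector of jobs" in SGE terms is an array job, submitted once with `qsub -t 1-N`, where each task selects its own input via `$SGE_TASK_ID`. A minimal, hypothetical sketch of generating such a submission script — the `process_subject` command and file names are made up, and this is not nipype code:]

```python
def make_array_script(inputs, command="process_subject"):
    """Build an SGE array-job script: one qsub, N parallel tasks.

    Each task indexes into the bash INPUTS array using SGE_TASK_ID
    (1-based in SGE, hence the -1 for bash's 0-based arrays).
    """
    lines = ["#!/bin/bash", f"#$ -t 1-{len(inputs)}"]
    lines.append("INPUTS=(" + " ".join(inputs) + ")")
    lines.append(f'{command} "${{INPUTS[$((SGE_TASK_ID-1))]}}"')
    return "\n".join(lines)


script = make_array_script(["sub01.nii", "sub02.nii", "sub03.nii"])
print(script)
```

A single `qsub` of this script would run all three tasks in parallel, which is exactly the behaviour the MapNode iterables would want.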



Any help or comment welcome.

Thank you,

Nicolas Toussaint,
Research Associate,
Centre for Medical Image Computing,
University College London

Satrajit Ghosh

Jul 10, 2014, 4:36:25 PM
to nipy-user
hi nicolas,

unfortunately the graph plugin currently assumes that it cannot submit jobs from jobs. it is possible that the job could go ahead and submit jobs to the queue but we would need to code that into the job for MapNodes. 
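[For context: a graph-based plugin typically submits the entire DAG up front, encoding dependencies with `qsub -hold_jid`. A hypothetical sketch of that pattern — this is not nipype's actual code, and the job ids are fake:]

```python
def build_qsub_commands(dag):
    """Turn a DAG (node name -> list of dependency names, given in a
    valid topological order) into one qsub command per node, chained
    with -hold_jid so SGE enforces the dependencies."""
    job_ids = {}
    commands = []
    next_id = 1000  # stand-in for the job ids qsub would return
    for node, deps in dag.items():
        hold = ",".join(str(job_ids[d]) for d in deps)
        cmd = f"qsub -N {node}"
        if hold:
            cmd += f" -hold_jid {hold}"
        cmd += f" {node}.sh"
        commands.append(cmd)
        job_ids[node] = next_id
        next_id += 1
    return commands


cmds = build_qsub_commands({
    "preproc": [],
    "mapnode_0": ["preproc"],
    "mapnode_1": ["preproc"],
    "merge": ["mapnode_0", "mapnode_1"],
})
print("\n".join(cmds))
```

Because every node is its own qsub call, the MapNode iterations (`mapnode_0`, `mapnode_1`) can run in parallel once `preproc` finishes; the difficulty Satra describes is that the number of MapNode iterations may only be known at run time, i.e. from within a running job.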

if you are up for sending a PR that will be great, but otherwise could you please open an issue?

cheers,

satra



Nicolas Toussaint

Jul 11, 2014, 5:44:04 AM
to nipy...@googlegroups.com
Thank you for your prompt answer. I will have a look and see how feasible this is, and open an issue in any case.

I have another question: when using the SGE plugin, I'd like to launch my workflow on the cluster via ssh. I therefore need to use nohup or disown to keep the script running after the ssh session disconnects. Whenever I do that, the Python process then fails to submit cluster jobs after the disown. Any idea why that is happening?

Because of these two related issues it is impossible to properly use nipype in a cluster environment :-/

Satrajit Ghosh

Jul 11, 2014, 6:10:41 AM
to nipy-user
hi nicolas,

> I have another question: when using the SGE plugin, I'd like to launch my workflow on the cluster via ssh. I therefore need to use nohup or disown to keep the script running after the ssh session disconnects. Whenever I do that, the Python process then fails to submit cluster jobs after the disown. Any idea why that is happening?
>
> Because of these two related issues it is impossible to properly use nipype in a cluster environment :-/

for running things without using nohup i would recommend looking into screen, tmux, and mosh.

we use nipype on many clusters :)

cheers,

satra

Arman Eshaghi

Jul 11, 2014, 7:13:53 AM
to nipy...@googlegroups.com
Hi Nicolas,

How do you run your Python environment? What I usually do is run an IPython notebook inside screen. In any case, disown -a should also work after starting your Python session. Before running disown, check your running jobs with the "jobs" command. After you run disown -a, "jobs" should show nothing, but "ps" should show that your Python processes are still running and will not receive a kill signal after you log out.
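[The sequence above can be sketched as a small bash session, using `sleep` as a stand-in for the long-running Python job:]

```shell
#!/usr/bin/env bash
# Background a stand-in job and note its pid.
sleep 5 &
pid=$!

jobs          # the background job is listed in the shell's job table

disown -a     # detach all jobs from this shell

# `jobs` now prints nothing, but the process itself is still alive
# (visible to ps) and will no longer receive SIGHUP when the shell exits.
jobs
ps -p "$pid"
```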

All the best,
Arman


Nicolas Toussaint

Jul 11, 2014, 9:00:48 AM
to nipy...@googlegroups.com
Hello, thank you for your answers,

I am not using the IPython notebook, but as you say, disown -a should be sufficient. I tried the nohup, disown and screen strategies without success. My Python process is not actually killed; it keeps running after I log out and back in, but it is then unable to submit anything to the cluster via qsub. It is now clear that this is a problem only I am experiencing; perhaps it is due to the specifics of our cluster. I will investigate.

Thank you for your help,

Nicolas

Nicolas Toussaint

Jul 11, 2014, 12:33:29 PM
to nipy...@googlegroups.com
So, for your information: it seems that during an SGE execution, the while loop around 'qacct_verification' calls 'os.getlogin()', which fails when we are in a nohup or screen situation (there is no controlling terminal).
This can be partially overcome by passing the login to the class and calling 'os.getlogin()' only once, at SGEPlugin class initialization:
https://github.com/ntoussaint/nipype/commit/c0df5b5a69b0a158125eb7285c517bcbc867df6e

I am not sure this hack is a very elegant solution, but it solved my problem.
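[The failure can be reproduced and worked around in plain Python, independently of the linked commit: `os.getlogin()` reads the controlling terminal, which is absent under nohup/screen/cron, whereas `getpass.getuser()` falls back to environment variables and the password database. A minimal sketch of the fallback:]

```python
import getpass
import os


def robust_login():
    """Return the user's login name even without a controlling terminal.

    os.getlogin() raises OSError under nohup/screen/cron because it
    consults the controlling tty; getpass.getuser() checks LOGNAME/USER
    and finally the password database, so it works in detached sessions.
    """
    try:
        return os.getlogin()
    except OSError:
        return getpass.getuser()


print(robust_login())
```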

Nicolas