SGELikeBatchManagerBase _get_result raises IOError


Baldur

Dec 7, 2012, 10:00:07 AM
to nipy...@googlegroups.com
Running a workflow with the condor plugin (a subclass of SGELikeBatchManagerBase), the condor job is correctly submitted and begins running on a remote node. The working directory (nodeDir) is shared on an NFS drive (both the condor master and the workers have NFS clients, with the drive mounted with the cto and actimeo=3 options).

After launching the condor job, the taskid key is correctly added to the _pending dictionary with the value nodeDir. The 'run' function in DistributedPluginBase then calls the '_get_result' function implemented in SGELikeBatchManagerBase, which in turn tries to locate the file result_*.pklz in the nodeDir of the taskid. This times out and fails, and the IOError is raised.

The reason for the failure is that result_*.pklz is only created when the job finishes.
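My reading of the plugin code is roughly the following (a paraphrase from memory; the names and the timeout value are approximate, not the actual nipype source):

    # rough paraphrase of SGELikeBatchManagerBase._get_result;
    # names and the timeout value are approximate
    import glob
    import os
    import time

    def _get_result(self, taskid):
        node_dir = self._pending[taskid]
        timeout = 5.0  # the real value comes from the nipype config
        start = time.time()
        results = []
        while time.time() - start < timeout:
            results = glob.glob(os.path.join(node_dir, 'result_*.pklz'))
            if results:
                break
            time.sleep(2)  # wait and retry in case NFS is lagging
        if not results:
            # this is the IOError I am seeing
            raise IOError('no result_*.pklz found in %s' % node_dir)

So while the job is still running there is no result_*.pklz yet, the loop spins until the timeout, and the IOError fires.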

Could this be an error in my job class? I subclassed CommandLine and am running a tool that takes about 10 minutes to complete. Or is there something else going on? For example, I saw that Satra added an access to the '..' directory to allow a stat, and the failing job is running a few levels deeper because it is in a mapflow.
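The bit I mean is something like this (paraphrasing again; as I understand it, touching '..' forces the NFS client to refresh its attribute cache so a freshly written file becomes visible to glob):

    # paraphrase of the '..' access I saw in the source: listing the
    # parent directory refreshes the NFS attribute cache for node_dir
    os.listdir(os.path.join(node_dir, '..'))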

Cheers
Baldur

Chris Filo Gorgolewski

Dec 7, 2012, 10:04:07 AM
to nipy...@googlegroups.com
I've seen this before. It seems that the mechanism meant to prevent synchronisation issues is actually what is doing the busy-wait. Michael, can you have a look at this? Baldur - have you tried the dagman plugin?

Best,
Chris



Baldur

Dec 7, 2012, 10:17:56 AM
to nipy...@googlegroups.com
Hi Chris

No, I haven't looked at the dagman plugin (I also haven't used plain dagman with condor). Is there a nipype/dagman example I could use as a guide?

Cheers
Baldur

Baldur

Dec 7, 2012, 10:25:27 AM
to nipy...@googlegroups.com
Looking at the dagman code, I see it builds the dag file itself. So I can just use this as a straightforward replacement for plugin=condor. Is that right?
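i.e. just something like this (guessing the plugin name from the source; untested):

    # hypothetical swap, assuming the dagman plugin is registered
    # under the name 'CondorDAGMan'
    workflow.run(plugin='CondorDAGMan')  # instead of plugin='Condor'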

Cheers
Baldur

Chris Filo Gorgolewski

Dec 7, 2012, 10:32:23 AM
to nipy...@googlegroups.com
yup



Satrajit Ghosh

Dec 7, 2012, 10:32:33 AM
to nipy-user
On Fri, Dec 7, 2012 at 10:00 AM, Baldur <baldur...@gmail.com> wrote:
Running a workflow with the condor plugin (a subclass of SGELikeBatchManagerBase), the condor job is correctly submitted and begins running on a remote node. The working directory (nodeDir) is shared on an NFS drive (both the condor master and the workers have NFS clients, with the drive mounted with the cto and actimeo=3 options).

After launching the condor job, the taskid key is correctly added to the _pending dictionary with the value nodeDir. The 'run' function in DistributedPluginBase then calls the '_get_result' function implemented in SGELikeBatchManagerBase, which in turn tries to locate the file result_*.pklz in the nodeDir of the taskid. This times out and fails, and the IOError is raised.

The reason for the failure is that result_*.pklz is only created when the job finishes.

but it should get to that point only if this call fails:


so this is where the problem is:


let me replace that with a patch and you can see if that helps. give me a few minutes.
 
Could this be an error in my job class?

highly unlikely :)

cheers,

satra

Baldur

Dec 7, 2012, 11:15:11 AM
to nipy...@googlegroups.com
Hi Satra,

Yes, I tried the line manually, and the .count doesn't seem to work.

I'm going to try the following trick instead: condor_q <taskid> -format 1 ID. If the length of the result string is 1, the task is correctly queued.
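Roughly what I have in mind as the pending check (an untested sketch; the helper name mirrors the plugin's _is_pending):

    # untested sketch: 'condor_q <taskid> -format 1 ID' prints a '1'
    # for each matching queued job, so non-empty output means the
    # task is still pending
    import subprocess

    def _is_pending(taskid):
        out = subprocess.check_output(
            ['condor_q', str(taskid), '-format', '1', 'ID'])
        return len(out.strip()) > 0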

Baldur

Dec 7, 2012, 12:21:30 PM
to nipy...@googlegroups.com

That -format trick works as a fix. I'm using Condor version 7, by the way.