SGELikeBatchManagerBase _get_result raises IOError


Baldur

Dec 7, 2012, 10:00:07 AM
to nipy...@googlegroups.com
Running a workflow with the condor plugin (a subclass of SGELikeBatchManagerBase), the condor job is correctly submitted and begins running on a remote node. The working directory (nodeDir) is shared on an NFS drive (both the condor master and the workers have NFS clients, with the drive mounted with the cto and actimeo=3 options).

After launching the condor job, the taskid key is correctly added to the _pending dictionary with the value nodeDir. The 'run' function in DistributedPluginBase then calls the '_get_result' function implemented in SGELikeBatchManagerBase, which in turn tries to locate the file result_*.pklz in the nodeDir of the taskid. This times out and fails, and the IOError is raised.

The reason for the failure is that result_*.pklz is only created when the job finishes.
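My reading of the plugin code is roughly the following (a paraphrase from memory; the names and the timeout value are approximate, not the actual nipype source):

    # rough paraphrase of SGELikeBatchManagerBase._get_result;
    # names and the timeout value are approximate
    import glob
    import os
    import time

    def _get_result(self, taskid):
        node_dir = self._pending[taskid]
        timeout = 5.0  # the real value comes from the nipype config
        start = time.time()
        results = []
        while time.time() - start < timeout:
            results = glob.glob(os.path.join(node_dir, 'result_*.pklz'))
            if results:
                break
            time.sleep(2)  # wait and retry in case NFS is lagging
        if not results:
            # this is the IOError I am seeing
            raise IOError('no result_*.pklz found in %s' % node_dir)

So while the job is still running there is no result_*.pklz yet, the loop spins until the timeout, and the IOError fires.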

Could this be an error in my job class? I subclassed CommandLine and am running a tool that takes about 10 minutes to complete. Or is there something else going on? For example, I saw that Satra added an access to the '..' directory to allow a stat, and the failing job is running a few levels deeper because it is in a mapflow.
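The bit I mean is something like this (paraphrasing again; as I understand it, touching '..' forces the NFS client to refresh its attribute cache so a freshly written file becomes visible to glob):

    # paraphrase of the '..' access I saw in the source: listing the
    # parent directory refreshes the NFS attribute cache for node_dir
    os.listdir(os.path.join(node_dir, '..'))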

Cheers
Baldur

Chris Filo Gorgolewski

Dec 7, 2012, 10:04:07 AM
to nipy...@googlegroups.com
I've seen this before. It seems that the mechanism meant to prevent synchronisation issues is actually what is doing the busy-wait. Michael, can you have a look at this? Baldur - have you tried the dagman plugin?

Best,
Chris



Baldur

Dec 7, 2012, 10:17:56 AM
to nipy...@googlegroups.com
Hi Chris

No, I haven't looked at the dagman plugin (I also haven't used plain dagman with condor). Is there a nipype/dagman example I could use as a guide?

Cheers
Baldur

Baldur

Dec 7, 2012, 10:25:27 AM
to nipy...@googlegroups.com
Looking at the dagman code, I see it builds the dag file itself. So I can just use this as a straightforward replacement for plugin=condor. Is that right?
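i.e. just something like this (guessing the plugin name from the source; untested):

    # hypothetical swap, assuming the dagman plugin is registered
    # under the name 'CondorDAGMan'
    workflow.run(plugin='CondorDAGMan')  # instead of plugin='Condor'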

Cheers
Baldur

Chris Filo Gorgolewski

Dec 7, 2012, 10:32:23 AM
to nipy...@googlegroups.com
yup



Satrajit Ghosh

Dec 7, 2012, 10:32:33 AM
to nipy-user
On Fri, Dec 7, 2012 at 10:00 AM, Baldur <baldur...@gmail.com> wrote:
Running a workflow with the condor plugin (a subclass of SGELikeBatchManagerBase), the condor job is correctly submitted and begins running on a remote node. The working directory (nodeDir) is shared on an NFS drive (both the condor master and the workers have NFS clients, with the drive mounted with the cto and actimeo=3 options).

After launching the condor job, the taskid key is correctly added to the _pending dictionary with the value nodeDir. The 'run' function in DistributedPluginBase then calls the '_get_result' function implemented in SGELikeBatchManagerBase, which in turn tries to locate the file result_*.pklz in the nodeDir of the taskid. This times out and fails, and the IOError is raised.

The reason for the failure is that result_*.pklz is only created when the job finishes.

but it should get to that point only if this call fails:


so this is where the problem is:


let me replace that with a patch and you can see if that helps. give me a few minutes.
 
Could this be an error in my job class?

highly unlikely :)

cheers,

satra

Baldur

Dec 7, 2012, 11:15:11 AM
to nipy...@googlegroups.com
Hi Satra,

Yes, I tried the line manually, and the .count doesn't seem to work.

I'm going to try the following trick instead: condor_q <taskid> -format 1 ID. If the length of the result string is 1, the task is correctly queued.
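Roughly what I have in mind as the pending check (an untested sketch; the helper name mirrors the plugin's _is_pending):

    # untested sketch: 'condor_q <taskid> -format 1 ID' prints a '1'
    # for each matching queued job, so non-empty output means the
    # task is still pending
    import subprocess

    def _is_pending(taskid):
        out = subprocess.check_output(
            ['condor_q', str(taskid), '-format', '1', 'ID'])
        return len(out.strip()) > 0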

Baldur

Dec 7, 2012, 12:21:30 PM
to nipy...@googlegroups.com

That -format trick works as a fix. I'm using Condor version 7, by the way.