Running a workflow with the condor plugin (a subclass of SGELikeBatchManagerBase), the Condor job is correctly submitted and begins running on a remote node. The working directory (nodeDir) is shared on an NFS drive (both the Condor master and the workers have NFS clients, with the drive mounted with the cto and actimeo=3 options).
After the Condor job is launched, the taskid key is correctly added to the _pending dictionary with the value nodeDir. The 'run' function in DistributedPluginBase then calls the '_get_result' function implemented in SGELikeBatchManagerBase, which in turn tries to locate a result_*.pklz file in the nodeDir of that taskid. This times out and fails, and an IOError is raised.
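
As I understand it, the lookup behaves roughly like the sketch below. This is only my own illustration of the polling-and-timeout behaviour, not the actual nipype code; the function name, poll interval, and timeout value are all made up:

    import glob
    import os
    import time

    def get_result_sketch(node_dir, timeout=60):
        """Illustrative only: poll node_dir for a result_*.pklz file,
        giving up after `timeout` seconds (values are made up)."""
        waited = 0
        pattern = os.path.join(node_dir, 'result_*.pklz')
        while not glob.glob(pattern):
            if waited >= timeout:
                # This is the failure mode I am seeing
                raise IOError('result file not found in %s' % node_dir)
            time.sleep(2)   # poll interval, illustrative
            waited += 2
        return glob.glob(pattern)[0]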
The reason for the failure is that result_*.pklz is only created when the job finishes, and my job runs for much longer than _get_result is apparently willing to wait.
Could this be an error in my job class? I subclassed CommandLine and am running a tool that takes about 10 minutes to complete. Or is there something else going on? For example, I saw that Satra added an access to the '..' directory to allow a stat (sketched below), and the failing job is running a few levels deeper because it is in a mapflow.
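
To illustrate the kind of parent-directory access I mean: listing or statting '..' can force the NFS client to refresh its attribute cache so that newly created files become visible. This is my own sketch of the trick as I understand it, not the actual nipype code, and the function name is made up:

    import os

    def refresh_nfs_cache(path):
        # Touching the parent directory can invalidate the NFS
        # attribute cache, so a subsequent stat/glob in `path`
        # sees files created on another host (illustrative sketch).
        os.listdir(os.path.join(path, '..'))
        os.stat(path)

If that refresh only happens one level up, I wonder whether it would still help in my case, since the mapflow node directory sits a few levels deeper.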
Cheers
Baldur