DRMAA Execution Error

Martijn Starmans

Jan 13, 2017, 10:42:05 AM
to Fastr Users
Dear all,

When run locally, my FASTR network executes just fine. However, when I run it on a cluster using the DRMAA plugin, it fails as early as the jobs for the source nodes. The source nodes of my network are multiple XNAT URLs of images for two modalities, plus segmentations for both. The following error occurs in a seemingly random job of the first node, after which all subsequent jobs naturally crash as well:

[MainProcess::fastr_jobfinished_callback] ERROR: network:1037 >> Encountered error (FastrResultFileNotFound): Could not find/read job result file /scratch/mstarmans/tmp/fastr_WORC_NFDDLx/segmentation_m1/LGG-Radiogenomics-002/__fastr_result__.pickle.gz, assuming the job crashed before it created output. (/scratch/mstarmans/fastr-env/lib/python2.7/site-packages/fastr/execution/executionpluginmanager.py:283)

I've attached the error report I get on the command line to this post.

The stderr and stdout are empty for this job. If I run "fastr execute" in the directory of the crashed job, the job executes fine. ..../LGG-Radiogenomics-001 was the previous job and appeared to have finished. However, the result directory of that first job is empty, and its __fastr__stderr.txt contains the following:

no mem for new parser
Traceback (most recent call last):
  File "/scratch/mstarmans/fastr-env/lib/python2.7/site-packages/fastr/execution/executionscript.py", line 36, in <module>
    import fastr
  File "/scratch/mstarmans/fastr-env/lib/python2.7/site-packages/fastr/__init__.py", line 97, in <module>
    from fastr.core.network import Network
  File "/scratch/mstarmans/fastr-env/lib/python2.7/site-packages/fastr/core/network.py", line 39, in <module>
    from fastr.core.node import Node, ConstantNode, SourceNode, SinkNode, MacroNode
  File "/scratch/mstarmans/fastr-env/lib/python2.7/site-packages/fastr/core/node.py", line 45, in <module>
    from fastr.core.tool import Tool
  File "/scratch/mstarmans/fastr-env/lib/python2.7/site-packages/fastr/core/tool.py", line 39, in <module>
    import fastr.core.target
  File "/scratch/mstarmans/fastr-env/lib/python2.7/site-packages/fastr/core/target.py", line 68, in <module>
  File "/cm/shared/apps/python/2.7.11/lib/python2.7/collections.py", line 374, in namedtuple
    exec class_definition in namespace

I can reproduce the error, but not always on the same job. For example, it might report the first 25 jobs as finished and then crash on LGG-Radiogenomics-026 instead of on 001 and 002. The stderr output of the preceding "finished" job, in this case LGG-Radiogenomics-025, is however always the same as above.

Hence it seems that something is wrong with the memory management of my jobs when using the DRMAA plugin. Does anyone have an idea what exactly it is and how to fix it? Thanks in advance.
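
One way to test the memory hypothesis is to check which memory limits a process actually sees when it runs as a cluster job, and compare them with the limits in a local shell. The following is a minimal sketch using only the Python standard library; the function names are made up for illustration and are not part of FASTR or the DRMAA plugin. It assumes a Unix system, and it only reports limits set through the rlimit mechanism, which may not be how a given scheduler enforces memory.

```python
import resource


def describe_limit(value):
    """Render an rlimit value; RLIM_INFINITY means no limit is set."""
    return "unlimited" if value == resource.RLIM_INFINITY else str(value)


def memory_limits():
    """Return {limit name: (soft, hard)} for the memory-related rlimits.

    Limits that the platform does not define are skipped.
    """
    limits = {}
    for name in ("RLIMIT_AS", "RLIMIT_DATA", "RLIMIT_STACK"):
        if not hasattr(resource, name):
            continue
        soft, hard = resource.getrlimit(getattr(resource, name))
        limits[name] = (describe_limit(soft), describe_limit(hard))
    return limits


if __name__ == "__main__":
    # Values are in bytes unless reported as "unlimited".
    for name, (soft, hard) in sorted(memory_limits().items()):
        print("%s soft=%s hard=%s" % (name, soft, hard))
```

Submitting this script as a job through the same queue and comparing its output with a local run would show whether the cluster job runs under a tighter address-space or data limit.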

Kind regards,



Martijn Starmans

Mar 23, 2017, 1:03:08 PM
to Fastr Users
After some test runs, with help from Hakim, we found that the nipype package was the cause. A suggestion might therefore be to make the DRMAA plugin compatible with this package in the next release.
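
For anyone who wants to check whether a package such as nipype noticeably raises the memory footprint of a job at import time, the following is a rough sketch. It imports a module in a fresh Python process and reports that process's peak resident set size. The helper name `peak_rss_after_import` is hypothetical, the approach assumes a Unix system, and the unit of `ru_maxrss` is platform dependent (kilobytes on Linux), so the numbers are only meaningful relative to each other on the same machine. This is not how we diagnosed the issue, just one possible check.

```python
import subprocess
import sys

# Code run in the child process: import the requested module, then print
# the peak resident set size of that process on the last line of output.
_CHILD_CODE = (
    "import importlib, resource, sys\n"
    "importlib.import_module(sys.argv[1])\n"
    "print(resource.getrusage(resource.RUSAGE_SELF).ru_maxrss)\n"
)


def peak_rss_after_import(module_name):
    """Import module_name in a fresh interpreter and return its peak RSS."""
    output = subprocess.check_output(
        [sys.executable, "-c", _CHILD_CODE, module_name]
    )
    # Use the last line, in case the module prints something on import.
    return int(output.decode().strip().splitlines()[-1])


if __name__ == "__main__":
    # Compare a lightweight stdlib module with the modules given as arguments.
    for name in ["json"] + sys.argv[1:]:
        print("%s: peak RSS %d" % (name, peak_rss_after_import(name)))
```

Running it with `nipype` as an argument, inside the same environment the jobs use, gives a figure that can be compared against the memory requested for each job.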

On Friday, January 13, 2017 at 16:42:05 UTC+1, Martijn Starmans wrote: