Dear all,
My fastr network executes just fine when run locally. However, when I run it on a cluster using the DRMAA plugin, I encounter an error that already crashes the jobs for the source nodes. The source nodes of my network are multiple XNAT URLs: images for two modalities plus segmentations for both. The following error occurs in a seemingly random job of the first node, after which all subsequent jobs naturally crash as well:
[MainProcess::fastr_jobfinished_callback] ERROR: network:1037 >> Encountered error (FastrResultFileNotFound): Could not find/read job result file /scratch/mstarmans/tmp/fastr_WORC_NFDDLx/segmentation_m1/LGG-Radiogenomics-002/__fastr_result__.pickle.gz, assuming the job crashed before it created output. (/scratch/mstarmans/fastr-env/lib/python2.7/site-packages/fastr/execution/executionpluginmanager.py:283)
I've attached the error report I get on the command line to this post.
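To illustrate what the error above is complaining about, a job's result file can be checked by hand with something like the following (a sketch; job_result_ok is my own hypothetical helper, not a fastr function, and the directory layout simply follows the path in the error message):

```python
import gzip
import os
import pickle

def job_result_ok(job_dir):
    """Return True if the job wrote a readable __fastr_result__.pickle.gz.

    Hypothetical helper, not part of fastr; job_dir is the job's temporary
    directory, e.g. .../fastr_WORC_NFDDLx/segmentation_m1/LGG-Radiogenomics-002.
    """
    path = os.path.join(job_dir, '__fastr_result__.pickle.gz')
    if not os.path.isfile(path):
        return False  # the job died before writing its result file
    try:
        with gzip.open(path, 'rb') as fh:
            pickle.load(fh)  # truncated or corrupt files fail here
        return True
    except (IOError, OSError, EOFError, pickle.UnpicklingError):
        return False
```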
The stderr and stdout are empty for this job. If I run "fastr execute" in the corresponding directory of the crashed job, the job executes fine. ..../LGG-Radiogenomics-001 was the previous job and appeared to have finished. However, the result folder of that first job is empty, and the following is stated in its __fastr__stderr.txt:
no mem for new parser
Traceback (most recent call last):
  File "/scratch/mstarmans/fastr-env/lib/python2.7/site-packages/fastr/execution/executionscript.py", line 36, in <module>
    import fastr
  File "/scratch/mstarmans/fastr-env/lib/python2.7/site-packages/fastr/__init__.py", line 97, in <module>
    from fastr.core.network import Network
  File "/scratch/mstarmans/fastr-env/lib/python2.7/site-packages/fastr/core/network.py", line 39, in <module>
    from fastr.core.node import Node, ConstantNode, SourceNode, SinkNode, MacroNode
  File "/scratch/mstarmans/fastr-env/lib/python2.7/site-packages/fastr/core/node.py", line 45, in <module>
    from fastr.core.tool import Tool
  File "/scratch/mstarmans/fastr-env/lib/python2.7/site-packages/fastr/core/tool.py", line 39, in <module>
    import fastr.core.target
  File "/scratch/mstarmans/fastr-env/lib/python2.7/site-packages/fastr/core/target.py", line 68, in <module>
    'write_bytes'])
  File "/cm/shared/apps/python/2.7.11/lib/python2.7/collections.py", line 374, in namedtuple
    exec class_definition in namespace
MemoryError
I can reproduce the error, but not always on the same job. It might, for example, state that the first 25 jobs are finished and then crash on LGG-Radiogenomics-026 instead of 001 and 002. However, the error report of the previous "finished" task, in this case LGG-Radiogenomics-025, is always the same as above.
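Incidentally, a MemoryError like the one in the traceback is easy to provoke outside fastr by capping a process's address space, which is effectively what a scheduler-enforced memory limit does on a cluster node (a sketch only, not the fastr code path; the 512 MB cap and 1 GB allocation are arbitrary example numbers):

```python
import subprocess
import sys
import textwrap

# Run a child interpreter whose address space is capped at 512 MB, similar
# in effect to a scheduler-enforced per-job memory limit. A too-large
# allocation then fails with MemoryError, just like the exec() call inside
# namedtuple() did in the traceback above.
child_code = textwrap.dedent("""
    import resource
    limit = 512 * 1024 * 1024
    resource.setrlimit(resource.RLIMIT_AS, (limit, limit))
    try:
        blob = 'x' * (1024 ** 3)  # 1 GB allocation, well over the cap
    except MemoryError:
        print('MemoryError under a 512 MB address-space cap')
""")
out = subprocess.run([sys.executable, '-c', child_code],
                     capture_output=True, text=True)
print(out.stdout.strip())
```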
Hence it seems something is going wrong with memory management within the DRMAA plugin. Does anyone have an idea what exactly is happening and how to fix it? Thanks in advance.
Kind regards,
Martijn