Hi,
I am running SMuFin on my server:
time mpirun -np 2 ~/bin/SMuFin --ref reference.fa.gz --normal_fastq_1 n1.txt --normal_fastq_2 n2.txt --tumor_fastq_1 t1.txt --tumor_fastq_2 t2.txt --cpus_per_node 40
But I get the following error after 8 hours of running:
mpirun has exited due to process rank 1 with PID 35201 on node ip-172-30-1-212 exiting improperly. There are two reasons this could occur:
1. this process did not call "init" before exiting, but others in the job did. This can cause a job to hang indefinitely while it waits for all processes to call "init". By rule, if one process calls "init", then ALL processes must call "init" prior to termination.
2. this process called "init", but exited without calling "finalize". By rule, all processes that call "init" MUST call "finalize" prior to exiting or it will be considered an "abnormal termination"
This may have caused other processes in the application to be terminated by signals sent by mpirun (as reported here).
--------------------------------------------------------------------------
real 494m29.044s
user 19259m51.031s
sys 46m34.115s
But when I run a small dataset, which takes about 20 minutes, everything is OK.
Can you help me find out what the problem is?
Thank you for your time!
Clark
The parent is 90x, and the mutated offspring is 45x.
[349856.809391] Out of memory: Kill process 15283 (SMuFin) score 328 or sacrifice child
[349856.809399] Killed process 15283 (SMuFin) total-vm:68314208kB, anon-rss:20162252kB, file-rss:0kB
But this time I don't see this warning.
Anyway, we'll follow your advice and have a try.
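For anyone hitting the same mpirun message: the "Killed process" lines quoted above come from the kernel log, and the kB figures are easier to compare against the machine's RAM once converted to GiB. Below is a minimal sketch of how that check might look. The function name oom_summary is hypothetical (it is not part of SMuFin or Open MPI), and it assumes the kernel log line has the same layout as the one pasted in this thread; other kernel versions may format it differently.

```shell
#!/bin/sh
# Hypothetical helper (not part of SMuFin or Open MPI): read kernel log lines on
# stdin and summarize every "Killed process" entry, converting kB to GiB.
# Assumes the layout seen in this thread:
#   Killed process PID (NAME) total-vm:NkB, anon-rss:NkB, ...
oom_summary() {
  sed -nE 's/.*Killed process ([0-9]+) \(([^)]*)\).*total-vm:([0-9]+)kB, anon-rss:([0-9]+)kB.*/\1 \2 \3 \4/p' |
    awk '{ printf "pid=%s name=%s total-vm=%.2fGiB anon-rss=%.2fGiB\n", $1, $2, $3 / 1048576, $4 / 1048576 }'
}

# Demo on the line quoted in this thread. On a live system the input would
# typically be `dmesg | oom_summary` (reading dmesg may require root).
printf '%s\n' '[349856.809399] Killed process 15283 (SMuFin) total-vm:68314208kB, anon-rss:20162252kB, file-rss:0kB' | oom_summary
```

For the line above this reports roughly 65 GiB of virtual memory and about 19 GiB resident for the killed SMuFin process. If the function prints nothing after a failed run, the kernel log holds no matching OOM kill and the cause is probably something else.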