mpirun exited due to process rank 1 exiting improperly

Clark Lee

Jul 7, 2015, 8:58:53 AM
to smu...@googlegroups.com

Hi,

I am running SMuFin on my server.
It's a cloud platform with 40 physical cores and 160 GB of memory.
I built the bwa index manually and ran the following command:

time mpirun -np 2 ~/bin/SMuFin --ref reference.fa.gz --normal_fastq_1 n1.txt --normal_fastq_2 n2.txt --tumor_fastq_1 t1.txt --tumor_fastq_2 t2.txt --cpus_per_node 40


But I get the following error after 8 hours of running: 


mpirun has exited due to process rank 1 with PID 35201 on node ip-172-30-1-212 exiting improperly. There are two reasons this could occur:

 

1. this process did not call "init" before exiting, but others in the job did. This can cause a job to hang indefinitely while it waits for all processes to call "init". By rule, if one process calls "init", then ALL processes must call "init" prior to termination.

2. this process called "init", but exited without calling "finalize". By rule, all processes that call "init" MUST call "finalize" prior to exiting or it will be considered an "abnormal termination"

 

This may have caused other processes in the application to be terminated by signals sent by mpirun (as reported here).

--------------------------------------------------------------------------

real    494m29.044s

user    19259m51.031s

sys     46m34.115s


But when I run a small dataset that takes about 20 minutes, everything is OK.

Can you help me find out what the problem is?


Thank you for your time!

Clark 


montse

Jul 8, 2015, 4:24:47 AM
to smu...@googlegroups.com
Hi,

Could you tell me the coverage of your samples? And is this whole human genome sequencing?

Thank you,
SMuFin

j_ma...@lbl.gov

Jul 11, 2015, 4:48:23 AM
to smu...@googlegroups.com
Hello,
It's a rice genome, parent and FNR mutated progeny. ~400 Mb and not terribly complex.

The parent is 90x, the mutated offspring is 45x.

montse

Jul 13, 2015, 6:56:00 AM
to smu...@googlegroups.com
Hi Joel,

SMuFin was designed to analyse human samples, so we do not know how it behaves with other species. Given the information you provided, though, it looks like a memory problem. In our tests with samples at 60x coverage, peak memory usage could reach 256 GB, so you may want to try a cluster with more memory.

Best,
Montse

Clark Lee

Jul 13, 2015, 9:33:19 PM
to smu...@googlegroups.com
Hello, 
I previously ran SMuFin on a machine with less memory.
When it ran into a memory problem, the system log showed a warning like the one below:

[349856.809391] Out of memory: Kill process 15283 (SMuFin) score 328 or sacrifice child

[349856.809399] Killed process 15283 (SMuFin) total-vm:68314208kB, anon-rss:20162252kB, file-rss:0kB

But this time I don't see this warning.

Anyway, we'll follow your advice and give it a try.
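
Checking for kernel OOM messages like the two quoted above can be scripted. The following is a minimal sketch, assuming the kernel log text (typically obtained from `dmesg` or the system log on Linux) uses the same "Killed process" format as the lines in this post. The function name `find_oom_kills` and the regular expression are illustrative assumptions; other kernel versions may word these lines differently.

```python
import re

# Matches the kernel's "Killed process" line in the format quoted in this thread.
# Assumption: other kernel versions may print additional or different fields.
KILLED_RE = re.compile(
    r"Killed process (?P<pid>\d+) \((?P<name>[^)]+)\)"
    r".*?total-vm:(?P<total_vm>\d+)kB, anon-rss:(?P<anon_rss>\d+)kB"
)

KIB_PER_GIB = 1024 ** 2


def find_oom_kills(log_text, name=None):
    """Return one dict per OOM kill found in the given kernel log text.

    If `name` is given, only kills of processes with that name are returned.
    """
    kills = []
    for line in log_text.splitlines():
        match = KILLED_RE.search(line)
        if match is None:
            continue
        if name is not None and match.group("name") != name:
            continue
        kills.append({
            "pid": int(match.group("pid")),
            "name": match.group("name"),
            "total_vm_gib": int(match.group("total_vm")) / KIB_PER_GIB,
            "anon_rss_gib": int(match.group("anon_rss")) / KIB_PER_GIB,
        })
    return kills


# The two log lines quoted in this post, used as sample input.
sample_log = (
    "[349856.809391] Out of memory: Kill process 15283 (SMuFin) "
    "score 328 or sacrifice child\n"
    "[349856.809399] Killed process 15283 (SMuFin) "
    "total-vm:68314208kB, anon-rss:20162252kB, file-rss:0kB\n"
)

for kill in find_oom_kills(sample_log, name="SMuFin"):
    print(kill)
```

An empty result only means no matching line was found in the text that was scanned; it does not by itself rule out a memory problem.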

Clark Lee

Jul 16, 2015, 8:51:24 AM
to smu...@googlegroups.com
Hi, 
This time we ran the same test with the parent downsampled to 43x coverage.
I monitored RAM usage with:
/usr/bin/time -f "mem_used: %M KB" mpirun ...

It shows "mem_used: 60191336 KB".
And as I mentioned before, the machine has 157G of memory.
             total       used       free     shared    buffers     cached
Mem:          157G        30G       127G       344K        52M        28G
-/+ buffers/cache:       1.3G       156G
Swap:           0B         0B         0B
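
For scale, the reported peak can be converted to the same units as the `free` output. This is a rough sketch, assuming the `%M` value from GNU time is the maximum resident set size in KiB and that, as far as I understand, it reflects the largest single process rather than the sum over both MPI ranks. The worst-case line multiplies by the `-np 2` from the original command, which is only an upper bound under that assumption.

```python
KIB_PER_GIB = 1024 ** 2


def kib_to_gib(kib):
    """Convert a size in KiB (as printed by /usr/bin/time %M) to GiB."""
    return kib / KIB_PER_GIB


peak_kib = 60191336  # "mem_used" value reported in this post
total_gib = 157      # "Mem: total" from the free output in this post
ranks = 2            # -np 2 in the original mpirun command

peak_gib = kib_to_gib(peak_kib)
# Assumption: every rank peaks at the same size at the same time.
worst_case_gib = peak_gib * ranks

print(f"peak per process: {peak_gib:.1f} GiB")
print(f"worst case for {ranks} ranks: {worst_case_gib:.1f} GiB of {total_gib} GiB")
```

Under these assumptions both figures stay below the 157G total.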

Joel Martin

Jul 28, 2015, 10:43:32 PM
to SMufin, fengs...@gmail.com
fwiw, this turned out to be a formatting issue with the reference and the odd exit message.  Issue is solved on our end, thanks.
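
The thread does not say what the formatting problem in the reference was. As a hedged illustration only, the sketch below checks a FASTA file for a few common formatting problems: sequence data before the first header, blank lines, Windows line endings, duplicate sequence names, and characters outside the IUPAC nucleotide codes. The function names `open_maybe_gzip` and `check_fasta` and the choice of checks are assumptions made for this example, not part of SMuFin, and passing them does not guarantee that SMuFin will accept the file.

```python
import gzip

# IUPAC nucleotide codes in both cases; anything else is reported.
VALID_BASES = set("ACGTUNRYSWKMBDHVacgtunryswkmbdhv")


def open_maybe_gzip(path):
    """Open a plain or gzip-compressed text file, keeping line endings as-is."""
    with open(path, "rb") as handle:
        magic = handle.read(2)
    if magic == b"\x1f\x8b":  # gzip magic number
        return gzip.open(path, "rt", newline="")
    return open(path, "rt", newline="")


def check_fasta(lines):
    """Return a list of (line_number, message) for suspicious lines."""
    problems = []
    seen_names = set()
    seen_header = False
    for lineno, raw in enumerate(lines, start=1):
        if raw.endswith(("\r\n", "\r")):
            problems.append((lineno, "carriage return in line ending"))
        line = raw.rstrip("\r\n")
        if not line.strip():
            problems.append((lineno, "blank line"))
        elif line.startswith(">"):
            seen_header = True
            fields = line[1:].split()
            name = fields[0] if fields else ""
            if not name:
                problems.append((lineno, "header without a sequence name"))
            elif name in seen_names:
                problems.append((lineno, f"duplicate sequence name {name!r}"))
            seen_names.add(name)
        elif not seen_header:
            problems.append((lineno, "sequence data before first '>' header"))
        else:
            bad = set(line) - VALID_BASES
            if bad:
                problems.append((lineno, f"unexpected characters {sorted(bad)}"))
    return problems


# Small in-memory example with deliberate problems on every line.
example = [">chr1\r\n", "AC GT\n", "\n", ">chr1\n"]
for lineno, message in check_fasta(example):
    print(f"line {lineno}: {message}")
```

To check a real file, the two helpers can be combined, for example `with open_maybe_gzip("reference.fa.gz") as handle: problems = check_fasta(handle)`, using the reference file name from the original command.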