mpirun exited due to process rank 1 exiting improperly

Clark Lee

Jul 7, 2015, 8:58:53 AM

to smu...@googlegroups.com

Hi,

I am running SMuFin on my server.

It's a cloud platform with 40 physical cores and 160 GB of memory.

I built the bwa index manually and ran the following command:

time mpirun -np 2 ~/bin/SMuFin --ref reference.fa.gz --normal_fastq_1 n1.txt --normal_fastq_2 n2.txt --tumor_fastq_1 t1.txt --tumor_fastq_2 t2.txt --cpus_per_node 40

But I get the following error after 8 hours of running:

mpirun has exited due to process rank 1 with PID 35201 on node ip-172-30-1-212 exiting improperly. There are two reasons this could occur:

1. this process did not call "init" before exiting, but others in the job did. This can cause a job to hang indefinitely while it waits for all processes to call "init". By rule, if one process calls "init", then ALL processes must call "init" prior to termination.

2. this process called "init", but exited without calling "finalize". By rule, all processes that call "init" MUST call "finalize" prior to exiting or it will be considered an "abnormal termination"

This may have caused other processes in the application to be terminated by signals sent by mpirun (as reported here).

--------------------------------------------------------------------------

real 494m29.044s

user 19259m51.031s

sys 46m34.115s

But when I run a small dataset, which takes 20 minutes, everything is OK.

Can you help me find out what the problem is?

Thank you for your time!

Clark

montse

Jul 8, 2015, 4:24:47 AM

to smu...@googlegroups.com

Hi,

Could you tell me what the coverage of your samples is? And is it whole human genome sequencing?

Thank you,
SMuFin

j_ma...@lbl.gov

Jul 11, 2015, 4:48:23 AM

to smu...@googlegroups.com

Hello,
It's a rice genome: parent and FNR-mutated progeny. ~400 Mb and not terribly complex.

The parent is 90x; the mutated offspring is 45x.

montse

Jul 13, 2015, 6:56:00 AM

to smu...@googlegroups.com

Hi Joel,

SMuFin was designed to analyse human samples, so we do not know how it behaves with other species. But given the information you provided, it looks like a memory problem. In our tests with samples at 60x coverage, peak memory usage reached 256 GB, so maybe you should try a cluster with more memory.

Best,
Montse

Clark Lee

Jul 13, 2015, 9:33:19 PM

to smu...@googlegroups.com

Hello,

I used to run SMuFin on a machine with less memory.

When it ran into a memory problem, the system log showed a warning like the one below:

[349856.809391] Out of memory: Kill process 15283 (SMuFin) score 328 or sacrifice child

[349856.809399] Killed process 15283 (SMuFin) total-vm:68314208kB, anon-rss:20162252kB, file-rss:0kB

But this time I don't see this warning.
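
For anyone checking the same thing, here is a minimal sketch of how a saved kernel log could be searched for OOM-killer messages of the kind quoted above. The function name `find_oom_kills` and the file name `kernel.log` are hypothetical; the two patterns are simply taken from the log lines in this post, and other kernel versions may word these messages differently.

```shell
#!/bin/sh
# find_oom_kills FILE [PROCESS]
# Print lines in FILE that look like OOM-killer messages, optionally
# restricted to those that mention PROCESS (for example SMuFin).
# The patterns match the two message shapes quoted in this thread.
find_oom_kills() {
    log_file=$1
    process=${2:-}
    # An empty PROCESS makes the second grep match every line.
    grep -E 'Out of memory: Kill process|Killed process' "$log_file" |
        grep -F -- "$process"
}

# Example (hypothetical file name): save the kernel ring buffer, then search it.
#   dmesg > kernel.log
#   find_oom_kills kernel.log SMuFin
```

If the function prints nothing for the time window of the failed run, the OOM killer is probably not what terminated the process, which matches what is described here.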

Anyway, we'll follow your advice and have a try.

Clark Lee

Jul 16, 2015, 8:51:24 AM

to smu...@googlegroups.com

Hi,

This time we ran the same test with the parent downsampled to 43x coverage.

And I monitored RAM usage with

/usr/bin/time -f "mem_used: %M KB" mpirun ...

It showed "mem_used: 60191336 KB"

And as I mentioned before, the machine has 157G of memory.

                    total   used   free   shared   buffers   cached
Mem:                 157G    30G   127G     344K       52M      28G
-/+ buffers/cache:          1.3G   156G
Swap:                  0B     0B     0B
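
To compare the `%M` figure with the table above, it helps to put both in the same units. The sketch below converts the reported kilobyte count to GiB, assuming the table came from `free -h` (which uses powers of 1024). The helper name `kb_to_gib` is hypothetical. One caveat: GNU time takes `%M` from the maximum resident set size it gets back for the command it ran, so it may not reflect the combined footprint of both MPI ranks.

```shell
# kb_to_gib KB
# Convert a kilobyte count (as printed by GNU time's %M) to GiB,
# rounded to one decimal place.
kb_to_gib() {
    awk -v kb="$1" 'BEGIN { printf "%.1f\n", kb / 1024 / 1024 }'
}

kb_to_gib 60191336   # the value reported above; prints 57.4
```

That is roughly 57 GiB, well below the 157G the machine reports.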

Joel Martin

Jul 28, 2015, 10:43:32 PM

to SMufin, fengs...@gmail.com

fwiw, this turned out to be a formatting issue with the reference and the odd exit message. Issue is solved on our end, thanks.
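
The thread does not say what the formatting issue actually was, so the following is only a generic sanity check one might run on a reference before a long job, assuming it is a FASTA file (plain or gzip-compressed, as with `reference.fa.gz` above). The function name `check_fasta` and the three checks are illustrative assumptions, not something SMuFin documents or requires.

```shell
# check_fasta FILE
# Report a few common FASTA formatting problems in FILE.
# Prints one line per kind of problem; prints nothing if none are found.
# Assumes a gzip whose "-cdf" passes uncompressed input through unchanged
# (as GNU gzip does).
check_fasta() {
    gzip -cdf -- "$1" | awk '
        /\r$/        { crlf = 1 }            # Windows (CRLF) line endings
        /^[ \t\r]*$/ { blank = 1; next }     # empty or whitespace-only line
        !seen_header && !/^>/ { nohdr = 1 }  # data before the first header
        /^>/         { seen_header = 1 }
        END {
            if (crlf)  print "CRLF line endings"
            if (blank) print "blank lines"
            if (nohdr) print "sequence data before first > header"
        }'
}

# Example:
#   check_fasta reference.fa.gz
```

A clean result only rules out these three problems; it does not prove the reference is in the form a given tool expects.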
