Fatal error in MPI_Send

Sofia P

May 27, 2015, 6:36:59 PM
to smu...@googlegroups.com
Hello,

I am trying to run SMuFin on my server. Details:
NUMA nodes: 2
Architecture: x86-64
CPUs: 24
Threads per core: 2
Cores per socket: 6
Sockets: 2
Total memory: 257903 MB
Swap: 5343 MB

I run the following command:
mpirun --np 2 ./SMuFin --ref feb2015.ref.fa --normal_fastq_1 normal_at1fastqs_1.txt --normal_fastq_2 normal_at1fastqs_2.txt --tumor_fastq_1 tumor_at1fastqs_1.txt --tumor_fastq_2 tumor_at1fastqs_2.txt --cpus_per_node 8

But I get the following error after 5 hours of running:
Fatal error in MPI_Send: Invalid count, error stack:
MPI_Send(173): MPI_Send(buf=0x7f21f5d90010, count=-2120186758, MPI_CHAR, dest=0, tag=9, MPI_COMM_WORLD) failed
MPI_Send(97).: Negative count, value is -2120186758

Does it have to do with the resources I am using? Unfortunately the server has many users and I cannot use all of the available resources. Do you think it really needs more, and will otherwise keep crashing? I tried to estimate the memory my samples should require for the program to run, and it did not seem to exceed what is available on the server.

Thank you for your time!
Sofia

SMufin

May 29, 2015, 6:40:28 AM
to smu...@googlegroups.com
Hi Sofia,

Your resources seem to be enough to run SMUFIN. Could you answer a few questions to help us work out what is happening:

- How long does it run before failing?

- Does SMUFIN produce any files in the reference genome folder? It should.

Thanks

sofi...@yahoo.gr

May 29, 2015, 1:54:58 PM
to smu...@googlegroups.com
Hello, thank you for your response!

The program runs for 9 hours, then stops and displays the message. I have all of my files (including the reference genome) in the smufin_0.9.3_mpi_beta folder, and unfortunately I do not see any output files from the program there.
Should I see a specific intermediate file there?

sofi...@yahoo.gr

Jun 2, 2015, 4:42:40 AM
to smu...@googlegroups.com
Following up on my previous message: this time I ran it on a different server, with the same number of threads but more CPUs, and it gives me this error:

*** An error occurred in MPI_Send
*** on communicator MPI_COMM_WORLD
*** MPI_ERR_COUNT: invalid count argument
*** MPI_ERRORS_ARE_FATAL: your MPI job will now abort
--------------------------------------------------------------------------
mpirun has exited due to process rank 1 with PID 13862 on
node pevzner exiting improperly. There are two reasons this could occur:

1. this process did not call "init" before exiting, but others in
the job did. This can cause a job to hang indefinitely while it waits
for all processes to call "init". By rule, if one process calls "init",
then ALL processes must call "init" prior to termination.

2. this process called "init", but exited without calling "finalize".
By rule, all processes that call "init" MUST call "finalize" prior to
exiting or it will be considered an "abnormal termination"

This may have caused other processes in the application to be
terminated by signals sent by mpirun (as reported here).

sofi...@yahoo.gr

Jun 4, 2015, 9:24:04 AM
to smu...@googlegroups.com
And on the first server the error is:

montse

Jun 5, 2015, 8:18:15 AM
to smu...@googlegroups.com
Hi,

One of the first steps of SMUFIN, which should take much less than 9 hours, is to create the genome indexes used for the final alignments. In the folder where the reference genome "feb2015.ref.fa" is located, you should see other files named "feb2015.ref.fa.amb", "feb2015.ref.fa.ann", "feb2015.ref.fa.bwt", etc.

If those files do not appear, it might be due to a problem with BWA that prevents the program from using the aligner.

One possible way to solve this is to produce the reference indexes manually with BWA using the command: bwa index reference_genome.fa
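As a quick sanity check before launching, a sketch like the following (the filename is the one from this thread; adjust it to your own reference) lists which of the expected BWA index files are present:

```shell
# Sketch: check for the BWA index files next to the reference genome.
# REF is assumed from the thread; replace it with your actual reference.
REF=feb2015.ref.fa
for ext in amb ann bwt pac sa; do
    if [ -e "$REF.$ext" ]; then
        echo "$REF.$ext present"
    else
        echo "$REF.$ext MISSING"
    fi
done
# Rebuild any missing indexes with:  bwa index "$REF"
```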

Best,
SMuFin




sofi...@yahoo.gr

Jun 6, 2015, 9:52:00 AM
to smu...@googlegroups.com
Hello,
thank you for your response. I was aware of that, so before running SMuFin I had already indexed my reference genome with BWA, and all the files were in the same folder. However, I still do not see any output files after running SMuFin, and the program continues to give the error:

Fatal error in MPI_Send: Invalid count, error stack:
MPI_Send(173): MPI_Send(buf=0x7f21f5d90010, count=-2120186758, MPI_CHAR, dest=0, tag=9, MPI_COMM_WORLD) failed
MPI_Send(97).: Negative count, value is -2120186758

A colleague and I are trying to figure out which of the MPI_Send statements in main.cpp causes the problem.

sofi...@yahoo.gr

Jun 9, 2015, 5:46:30 AM
to smu...@googlegroups.com
Hello again,
we think we have found what is going wrong.
As I mentioned before, after 8-9 hours of running the program gives this error:
MPI_Send(173): MPI_Send(buf=0x7fdf02022010, count=-2120189196, MPI_CHAR, dest=0, tag=9, MPI_COMM_WORLD) failed
MPI_Send(97).: Negative count, value is -2120189196
application called MPI_Abort(MPI_COMM_WORLD, 336178690) - process 1

This corresponds to line 716 in main.cpp:
MPI_Send(merged_mem, merged_size, MPI_CHAR, MASTER_ID, MPI_GET_SEQS_FROM_PARTITION, MPI_COMM_WORLD);

My colleague thinks that this is probably because the variable "merged_size" became too large for an integer value. He monitored the value, and the program crashed when it reached 2,174,778,100. A signed 32-bit integer has a maximum value of 2^31-1 (2,147,483,647). If the value gets larger than that, it wraps around and turns negative; indeed, 2,174,778,100 - 2^32 = -2,120,189,196, which is exactly the count in the error, so maybe that is why MPI_Send complains about a negative count.

Do you think it could be because of that? My normal-sample FASTQ is bigger than the tumor one; could this play a role? I am not experienced with this language myself, so is there something I could do or try in this case?

skill...@gmail.com

Jul 2, 2015, 3:56:02 AM
to smu...@googlegroups.com
Hi, did you get an answer to your question? I finally got SMuFin to run, but after about 20 hours I got the same error, except that I'm running on a 64-bit system.

Fatal error in MPI_Send: Invalid count, error stack:

MPI_Send(196): MPI_Send(buf=0x7ff05a3b5010, count=-1063321726, MPI_CHAR, dest=6, tag=10, MPI_COMM_WORLD) failed
MPI_Send(113): Negative count, value is -1063321726


chris....@gmail.com

Oct 6, 2015, 7:16:42 AM
to SMufin
Hi all,

Ran into the same problem, at the same MPI_Send call. It doesn't matter whether your system is 32-bit or 64-bit: in either case the 'count' parameter to MPI_Send is a 32-bit signed integer, which is likely to overflow. If it happens to end up negative, as in this case, MPI_Send detects the error and aborts. Worse, when the buffer size taken modulo 2^32 comes out positive, the count still looks valid and the buffer is silently truncated with no error at all. (On Linux x86-64 with GCC and glibc, size_t is an unsigned 64-bit integer; this can vary from platform to platform.)

I'm currently experimenting with a patch that replaces all suspect MPI_Send and MPI_Recv calls with a buffer-chunking strategy that can send up to 2^64 bytes. If we want it to work on 32-bit platforms too, it would be advisable to replace size_t with unsigned long long everywhere. If it works in testing I'll post the patch.
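To make the idea concrete, here is a rough sketch of the sending side of such a chunking strategy. This is plain C++ with a stand-in for MPI_Send so it can be read and tried without MPI installed; all names here are hypothetical, not taken from the actual patch:

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Stand-in for MPI_Send(buf, count, MPI_CHAR, dest, tag, comm): in the real
// patch each slice below would go through one MPI_Send call. The count each
// call receives must fit in a signed 32-bit int.
static std::vector<int> sent_counts;  // records each per-call count, for illustration
static void send_bytes(const char* /*buf*/, int count) {
    sent_counts.push_back(count);
}

// Send `total` bytes in slices no larger than `max_chunk`, so no single
// count can overflow int. Anything <= INT_MAX works; 1 GiB keeps messages modest.
void chunked_send(const char* buf, std::size_t total,
                  std::size_t max_chunk = std::size_t{1} << 30) {
    std::size_t offset = 0;
    while (offset < total) {
        std::size_t n = std::min(max_chunk, total - offset);
        send_bytes(buf + offset, static_cast<int>(n));  // n <= 2^30, safe as int
        offset += n;
    }
}
```

The receiving side would loop symmetrically with MPI_Recv, and the full 64-bit size could be communicated first in a small fixed-size header message so the receiver knows how many bytes to expect.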

Chris
