OpenMPI issue reported to SPEC HPG group. Thoughts?

Sheppard, Raymond W

Jan 14, 2026, 11:46:16 AM
to 'George Bosilca' via Open MPI users, rlie...@amd.com
Hi,
Ray Sheppard here, wearing my SPEC hat. We received a mail from AMD that we are not sure how to deal with, so I thought I would pass it along in case anyone has relevant thoughts about it. It looks like Jeff S. filed the issue they cite. We are sort of fishing for a response to them, so any info is appreciated. Thanks.
Ray

Dear Support.

I am an engineer at AMD who is currently running the SPECMPI2007 benchmarks, and we are experiencing issues with running the 122.Tachyon benchmark when compiled with OpenMPI 5. It is our goal to be able to run SPECMPI with OpenMPI 5 to minimize the overhead of MPI in our benchmarking.

In our usual configuration, we run the benchmark on 256 ranks using OpenMPI 5 with the cross-memory attach (CMA) transport, and the 122.Tachyon benchmark appears to deadlock. When running Tachyon with OpenMPI 4.1.8 and the UCX fabric, this issue does not occur.

On investigating further, we observe the following:
- With MPICH v4.3.0 the benchmark fails to run because MPICH detects an MPI error: an MPI_Allgather() call uses the same array for the send and receive buffers, which is disallowed by the MPI spec.
- After modifying the benchmark to correct the Allgather call (see the sketch below):
  - MPICH runs to completion, then crashes at finalization.
  - OpenMPI still deadlocks.
- The deadlock is only observed when running on more than 35 ranks, and it is present in multiple versions of OpenMPI (v5.0.5, v5.0.8).
We came across the OpenMPI problem while investigating https://github.com/open-mpi/ompi/issues/12979, which may be relevant.
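
For reference, here is a minimal, self-contained sketch of the MPI_IN_PLACE correction described above. The variable names and counts are illustrative only and are not taken from the Tachyon source:

    /* Sketch of replacing an aliased MPI_Allgather with MPI_IN_PLACE.
     * Illustrative only; not the actual 122.Tachyon code. */
    #include <mpi.h>
    #include <stdlib.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);

        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        const int count = 4;  /* elements contributed per rank */
        double *data = malloc((size_t)size * count * sizeof *data);

        /* Each rank fills only its own block of the shared array. */
        for (int i = 0; i < count; i++)
            data[rank * count + i] = (double)rank;

        /* Non-conforming pattern (what MPICH flags): passing the same
         * pointer as both the send and the receive buffer, e.g.
         *   MPI_Allgather(data, count, MPI_DOUBLE,
         *                 data, count, MPI_DOUBLE, MPI_COMM_WORLD);
         * Conforming fix: pass MPI_IN_PLACE as the send buffer. The send
         * count and datatype are then ignored, and each rank's contribution
         * is read from its own block of the receive buffer. */
        MPI_Allgather(MPI_IN_PLACE, 0, MPI_DATATYPE_NULL,
                      data, count, MPI_DOUBLE, MPI_COMM_WORLD);

        free(data);
        MPI_Finalize();
        return 0;
    }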

Is this a known issue with the 122.Tachyon benchmark, and are you able to help us run 122.Tachyon on OpenMPI 5?

Thank you in advance for your help. If you require any further information, please do not hesitate to reach out to me.

Thanks
James

Pritchard Jr., Howard

Jan 14, 2026, 12:12:54 PM
to us...@lists.open-mpi.org, rlie...@amd.com
Hello Ray,

A few questions to help better understand the problem.

- Are you running the benchmark on a single node?
- If the answer to that question is yes, could you try using UCX for messaging (mpirun --mca pml ucx) and see if you still observe the hang? A sketch of the invocation is below.
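
For example, something along these lines (a sketch only; it assumes the OpenMPI build includes UCX support, and the Tachyon binary name and workload arguments are placeholders):

    # Force the UCX PML:
    mpirun -np 256 --mca pml ucx ./tachyon <workload args>

    # For comparison, the shared-memory path (ob1 PML + sm BTL, which can use
    # CMA for single-copy transfers); this is the configuration reported to hang:
    mpirun -np 256 --mca pml ob1 --mca btl self,sm ./tachyon <workload args>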

Also, it would help to open an issue for this problem at https://github.com/open-mpi/ompi/issues

Thanks,

Howard

Sheppard, Raymond W

Jan 14, 2026, 12:21:58 PM
to us...@lists.open-mpi.org
Thanks Howard,
Yes, I think they are using a single node. I believe they have nodes that now support 256 ranks. I am the middleman, but I will pass it along and report the result. I am not sure it should be a new formal issue, since they think it is part of 12979 (sorry about the Windoze "Safe Links" the school now uses). If you think it should be a new issue, though, I could pass back that they should open one. I do not know more than the email we received; they could fill in the necessary info.
Ray


Pritchard Jr., Howard

Jan 14, 2026, 2:50:00 PM
to us...@lists.open-mpi.org
Hi Ray,

I think you're correct about opening a new issue. I added a link to this email chain on that issue.

Howard
