Ralph Castain
Jan 20, 2026, 12:32:56 PM
to pmix-forum
Hello all
I was hoping we would have this month’s quarterly meeting, but I guess I missed that it had been cancelled (old brain syndrome). However, there is at least one fairly urgent thing that needs to be resolved.
The current Standard stipulates that collectives such as PMIx_Fence are to:
1. Return PMIX_OPERATION_SUCCEEDED if the caller is a singleton
2. Return PMIX_OPERATION_SUCCEEDED if only local procs are involved in the collective (i.e., the local PMIx server handles the operation without upcalling the host)
3. Return PMIX_SUCCESS if the above two conditions are not met and there is no immediate error found in the info passed to the API
4. Return an appropriate error otherwise
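To make the four rules concrete, here is a minimal sketch of the decision they describe. It is not the library's code: the `sketch_` names, the context struct, and the enum values are hypothetical stand-ins for the real PMIx definitions, and evaluating the rules in the order listed is an assumption.

```c
#include <stdbool.h>

/* Hypothetical stand-ins for the PMIx status codes. The numeric values are
 * placeholders, not the values defined in the PMIx headers. */
typedef enum {
    SKETCH_SUCCESS,              /* stands in for PMIX_SUCCESS */
    SKETCH_OPERATION_SUCCEEDED,  /* stands in for PMIX_OPERATION_SUCCEEDED */
    SKETCH_ERR_BAD_PARAM         /* stands in for "an appropriate error" */
} sketch_status_t;

/* Hypothetical summary of what is known when the collective is called. */
typedef struct {
    bool is_singleton;  /* rule 1: the caller is a singleton */
    bool all_local;     /* rule 2: only local procs participate */
    bool info_valid;    /* rules 3/4: no immediate error in the info passed in */
} sketch_fence_ctx_t;

/* Pick the immediate return status, checking the rules in the listed order
 * (assumption: rules 1 and 2 are evaluated before the info check). */
sketch_status_t sketch_fence_status(const sketch_fence_ctx_t *ctx)
{
    if (ctx->is_singleton) {
        return SKETCH_OPERATION_SUCCEEDED;  /* rule 1 */
    }
    if (ctx->all_local) {
        return SKETCH_OPERATION_SUCCEEDED;  /* rule 2 */
    }
    if (ctx->info_valid) {
        return SKETCH_SUCCESS;              /* rule 3: callback fires later */
    }
    return SKETCH_ERR_BAD_PARAM;            /* rule 4 */
}
```

Dropping #2 from the Standard would amount to deleting the `all_local` branch in this sketch, so a local-only fence would fall through to rule 3.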
The purpose behind returning PMIX_OPERATION_SUCCEEDED was to indicate that the operation succeeded, but the user’s callback function was not going to be called because it was “atomically” completed.
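For anyone less familiar with that convention, here is a rough sketch of what it means on the caller's side of a non-blocking fence. The `cb_` names and the tracker struct are hypothetical stand-ins, not PMIx types, and no real PMIx call is made here.

```c
/* Hypothetical stand-ins for the PMIx status codes (placeholder values). */
typedef enum {
    CB_SUCCESS,              /* stands in for PMIX_SUCCESS */
    CB_OPERATION_SUCCEEDED,  /* stands in for PMIX_OPERATION_SUCCEEDED */
    CB_ERROR                 /* stands in for any PMIx error constant */
} cb_status_t;

/* Hypothetical tracker for one outstanding fence. */
typedef struct {
    int done;            /* nonzero once the fence has completed */
    cb_status_t status;  /* final status of the fence */
} cb_tracker_t;

/* Completion callback: records a result that arrives asynchronously. */
void cb_on_complete(cb_status_t status, void *cbdata)
{
    cb_tracker_t *t = (cb_tracker_t *)cbdata;
    t->status = status;
    t->done = 1;
}

/* Interpret the immediate return value of a non-blocking fence call.
 * Returns 1 if the caller must still wait for the callback, 0 otherwise. */
int cb_handle_return(cb_status_t rc, cb_tracker_t *t)
{
    switch (rc) {
    case CB_OPERATION_SUCCEEDED:
        /* Completed atomically: the callback will not fire, so finish inline. */
        t->status = CB_SUCCESS;
        t->done = 1;
        return 0;
    case CB_SUCCESS:
        /* Request accepted: completion arrives later through the callback. */
        return 1;
    default:
        /* Immediate error: the callback will not fire either. */
        t->status = rc;
        t->done = 1;
        return 0;
    }
}
```

The point of the sketch is that a caller who waits on the callback after an atomic completion would wait forever, which is why the two success codes have to be kept distinct.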
The question lies in the #2 requirement. Looking at the library, this requirement is not currently supported. IIRC, the change was made because some host environments wanted/needed to see that a fence had been called, even if it was locally completed. In other words, the PMIx server needed to upcall the host, even though the operation did not require any host involvement. In that scenario, the host will be calling the PMIx server’s callback function, and the library will return whatever status code the host provides (hopefully PMIX_SUCCESS).
I’m okay with that behavior, but it does not really comply with the current Standard. The original document was aimed at avoiding a global host operation when one wasn’t truly needed - but perhaps that was an incorrect assumption. So should we change the Standard to remove #2? Or do I need to modify the library (e.g., check for local-only participants and convert PMIX_SUCCESS to PMIX_OPERATION_SUCCEEDED when the host calls the server back)?
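If we went the library route, the conversion could look roughly like the sketch below. This is only an illustration of the idea: the `cv_` names are hypothetical, the enum values are placeholders, and where such a check would actually live in the server code is left open.

```c
#include <stdbool.h>

/* Hypothetical stand-ins for the PMIx status codes (placeholder values). */
typedef enum {
    CV_SUCCESS,              /* stands in for PMIX_SUCCESS */
    CV_OPERATION_SUCCEEDED,  /* stands in for PMIX_OPERATION_SUCCEEDED */
    CV_ERROR                 /* stands in for any PMIx error constant */
} cv_status_t;

/* Map the status handed back by the host onto the status reported upward.
 * local_only is assumed to be computed by the server from the participant
 * list before it upcalls the host. */
cv_status_t cv_convert_host_status(cv_status_t host_status, bool local_only)
{
    if (local_only && host_status == CV_SUCCESS) {
        /* Only local procs took part: report the atomic-completion code. */
        return CV_OPERATION_SUCCEEDED;
    }
    /* Errors, and successes that involved remote procs, pass through as-is. */
    return host_status;
}
```

The host would still see every fence under this option; only the status reported after the host calls back would change.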
Need to know in time for the next release, currently planned for end-of-Jan. In the absence of input, I’ll assume we update the Standard and leave the library alone.
Ralph