MPI Error Handling Inside PETSc

Bland, Wesley

Jun 8, 2018, 12:17:42 PM
to openco...@googlegroups.com
Hi OpenCoarrays folks,

In the MPI Forum, we're getting close to adopting a proposal to change the default communicator where errors are raised when they don't have anywhere else to go (think something like MPI_ALLOC_MEM, which doesn't have a communicator). Instead of getting those errors on the error handler for MPI_COMM_WORLD, it would move to MPI_COMM_SELF to allow more local error handling. Rather than having an error for passing an invalid argument potentially cause all processes to trigger the error handler, only the local process would see it.

This doesn't affect ordinary error handling, such as when an MPI_RECV fails for some reason. That error would still trigger the error handler attached to the communicator passed to the receive. However, the change is potentially backward incompatible, because people might be changing the error handler of MPI_COMM_WORLD but not MPI_COMM_SELF.
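To make the routing rule concrete, here is a toy model in plain C. It is only a sketch of the rule as described above, not MPI code: the types and functions (`runtime_t`, `route_comm_error`, `route_commless_error`) are hypothetical stand-ins invented for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-ins for illustration only; none of these are MPI types. */
typedef enum { HANDLER_FATAL, HANDLER_RETURN } handler_t;
typedef enum { COMM_WORLD, COMM_SELF, COMM_COUNT } comm_t;
typedef enum { RULE_CURRENT, RULE_PROPOSED } rule_t;

/* One error handler slot per communicator. */
typedef struct {
    handler_t handler[COMM_COUNT];
} runtime_t;

/* Both communicators start with the fatal handler, mirroring the
 * MPI default of MPI_ERRORS_ARE_FATAL. */
static runtime_t runtime_default(void)
{
    runtime_t rt = { { HANDLER_FATAL, HANDLER_FATAL } };
    return rt;
}

/* An error from a call that names a communicator (e.g. a receive)
 * goes to that communicator's handler under either rule. */
static handler_t route_comm_error(const runtime_t *rt, comm_t comm)
{
    assert(rt != NULL && comm < COMM_COUNT);
    return rt->handler[comm];
}

/* An error from a call with no communicator (e.g. MPI_ALLOC_MEM)
 * goes to WORLD's handler today and to SELF's under the proposal. */
static handler_t route_commless_error(const runtime_t *rt, rule_t rule)
{
    assert(rt != NULL);
    return rt->handler[rule == RULE_CURRENT ? COMM_WORLD : COMM_SELF];
}
```

In this model, a library that sets `HANDLER_RETURN` only on `COMM_WORLD` gets `HANDLER_RETURN` for a communicator-less error under `RULE_CURRENT` but `HANDLER_FATAL` under `RULE_PROPOSED`, which is the backward-incompatible case.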

The details can be found on this GitHub issue and in this PDF (search for ticket3).

I see that in OpenCoarrays, you guys do some basic error handling by changing the default error handler to MPI_ERRORS_RETURN on MPI_COMM_WORLD. So in this case, I believe that in order to get the same error handling in all possible (though unlikely) cases, you would also need to set the same error handler on MPI_COMM_SELF. Probably something along the lines of:

MPI_Comm_set_errhandler(MPI_COMM_SELF, MPI_ERRORS_RETURN);

Or, if you want to preserve the user error handler:

MPI_Errhandler orig_errhandler;
MPI_Comm_get_errhandler(MPI_COMM_SELF, &orig_errhandler);
if (orig_errhandler != MPI_ERRORS_ARE_FATAL) {
    /* Create custom error handler to deal with internal
     * errors and then call the user's error handler */
}

Before we vote this in next week, we wanted to reach out to some users to see if you have strong opinions about this. Despite the fact that this will have some impact on users, we think this is the right way to go to improve error management in the MPI Standard (there are other efforts going on if you're interested).

Thanks,
Wesley Bland

Alessandro Fanfarillo

Jun 8, 2018, 12:45:46 PM
to Bland, Wesley, openco...@googlegroups.com
Hi Wesley,
thanks for reaching out. I personally see this change having a minor impact on what we currently do. As you mentioned, this can be handled by adding MPI_Comm_set_errhandler(MPI_COMM_SELF, MPI_ERRORS_RETURN).
In OpenCoarrays, we change the default error handler in order to take advantage of the fault-tolerant features proposed by ULFM. The change being proposed should not affect the way we handle failures.
In summary, I personally don't have a strong opinion about this change.

Cheers,
Alessandro


Bland, Wesley

Jun 8, 2018, 1:30:35 PM
to Alessandro Fanfarillo, openco...@googlegroups.com
Thanks for the feedback (and sorry for the obviously badly copy-pasted email subject).