Providing read access to a distributed matrix to more than one thread on the same process

Kyle Schwiebert

May 21, 2024, 8:55:12 AM
to deal.II User Group
Hello,

I have a question to which I think the answer may be "no," but I thought I would ask. I'll ask first and then explain the "why?" at the end in case there is a better workaround from the outset.

I am initializing MPI myself with MPI_THREAD_MULTIPLE so that threads can each call MPI functions without interfering. To the extent possible, each thread has its own copy of MPI_COMM_WORLD so that simultaneous calls do not get mixed up. However, I have a matrix of type TrilinosWrappers::SparseMatrix to which BOTH threads need simultaneous access. Since you must give one and only one MPI_Comm object in the constructor, these sorts of conflicts are inevitable.

For obvious reasons I would not like to require a copy of this matrix for each thread. The other obvious solution is a mutex on the matrix, but this could easily get costly as both threads are calling matrix.vmult(...) in an iterative solver. I thus have two questions:

1) Is initializing MPI with MPI_THREAD_MULTIPLE going to break the deal.II internals for some reason, such that I should just not investigate this further?

2) I think the best solution, if possible, would be to get pointers to the internal data of my matrix which I can then associate with different MPI_Comm objects. Is this possible?

Why am I doing this?
This is a bit of a simplification, but imagine that I am solving a linear deferred correction problem. This means at each time step I solve A x_1 = b_1 and A x_2 = b_2. Let us assume that the matrix A does not have a well-known preconditioner which scales nicely with the number of processors. Then, instead of using 2n processors on each linear system in series, we could use n processors on each linear system simultaneously and expect this to be faster. I hope this makes sense.

Thanks for any tips you can offer.

Regards,
Kyle

Wolfgang Bangerth

May 21, 2024, 11:42:12 AM
to dea...@googlegroups.com

Kyle:

> I have a question to which I think the answer may be "no" but I thought I
> would ask. I'll just ask and then explain the "why?" at the end in case there
> is a better work around from the outset.
>
> I am initializing MPI myself with MPI_THREAD_MULTIPLE so that threads can each
> call MPI functions without interfering. To the extent possible each thread has
> its own copy of MPI_COMM_WORLD so that simultaneous calls do not get
> convoluted. However, I have a matrix of type TrilinosWrappers::SparseMatrix
> which BOTH threads need simultaneous access to. Since you must give one and
> only one MPI_Comm object in the constructor, these sorts of conflicts are
> inevitable.
>
> For obvious reasons I would not like to require a copy of this matrix for each
> thread. The other obvious solution is a mutex on the matrix, but this could
> easily get costly as both threads are calling matrix.vmult(...) in an
> iterative solver. I thus have two questions:
>
> 1) Is initializing MPI with MPI_THREAD_MULTIPLE going to break the deal.ii
> internals for some reason and I should just not investigate this further?

This should work. deal.II uses MPI_THREAD_SERIALIZED internally.
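
For reference, a minimal sketch (not code from this thread) of requesting
MPI_THREAD_MULTIPLE yourself and checking what the MPI library actually
provides could look like the following; how this interacts with
Utilities::MPI::MPI_InitFinalize depends on the deal.II version, so check
the documentation:

#include <mpi.h>

int main(int argc, char *argv[])
{
  int provided = MPI_THREAD_SINGLE;
  MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);

  if (provided < MPI_THREAD_MULTIPLE)
  {
    // The MPI library cannot make concurrent calls from several threads
    // safe, so bail out (or fall back to a serialized scheme).
    MPI_Abort(MPI_COMM_WORLD, 1);
  }

  // ... set up deal.II objects and run the program ...

  MPI_Finalize();
  return 0;
}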


> 2) I think the best solution, if possible, would be to get pointers to the
> internal data of my matrix which I can then associate with different MPI_Comm
> objects. Is this possible?

No. You should never try to use anything but the public interfaces of classes.
Anything else is bound to break things in unpredictable ways sooner or later.
Probably sooner.


> Why am I doing this?
> This is a bit of a simplification, but imagine that I am solving a linear
> deferred correction problem. This means at each time step I solve A . x_1 =
> b_1 and A . x_2 = b_2. Let us assume that the matrix A does not have a
> well-known preconditioner which scales nicely with the number of processors.
> Then instead of using 2n processors on each linear system in series, we could
> instead use n processors on each linear system simultaneously and expect this
> to be faster. I hope this makes sense.

Yes, this makes sense. But you should not expect to be able to solve multiple
linear systems at the same time over the same communicator. Each step in a
linear solver (vector dot products, matrix-vector products, etc.) consists of
multiple MPI messages where processes wait for data to be sent from other
processes. If you have multiple solves running on the same process, you will
receive messages in unpredictable orders that may or may not belong to the
current solve. Nothing good can come out of this.

But if the linear solve is the bottleneck, you can always build the matrix
multiple times (or create copies), with different (sub-)communicators and run
one solve on each communicator.
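
As a rough sketch of that idea (the function names and the even split below
are illustrative, not code from this thread), one could do something like:

#include <mpi.h>

void solve_two_systems_side_by_side()
{
  int world_rank = 0, world_size = 0;
  MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);
  MPI_Comm_size(MPI_COMM_WORLD, &world_size);

  // Half of the processes solve A x_1 = b_1, the other half A x_2 = b_2.
  const int color = (world_rank < world_size / 2) ? 0 : 1;

  MPI_Comm sub_comm;
  MPI_Comm_split(MPI_COMM_WORLD, color, world_rank, &sub_comm);

  // Build the Triangulation/DoFHandler/SparseMatrix on sub_comm instead of
  // MPI_COMM_WORLD, then run one iterative solve per sub-communicator:
  //   assemble_system(sub_comm);
  //   solve_on(sub_comm, color);

  MPI_Comm_free(&sub_comm);
}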

Best
W.

--
------------------------------------------------------------------------
Wolfgang Bangerth email: bang...@colostate.edu
www: http://www.math.colostate.edu/~bangerth/


Kyle Schwiebert

May 24, 2024, 7:28:05 PM
to deal.II User Group
Dr. Bangerth,

Thank you very much for the help. My problem size is not so large as to make copying the matrix impossible, but it is still undesirable and would require a significant rewrite of my code--and I plan to scale up eventually, so this is not a long-term solution.

I am also not sure MPI_THREAD_MULTIPLE works anyway. I attempted an alternative approach, protecting all calls to objects with a shared communicator by a mutex, and ran into an odd issue: the TrilinosWrappers::MPI::Vector::Vector() constructor seems to be causing segfaults--I do not see how this could be the user's fault since there are no input parameters. This happens even when I run with only one thread, unless I switch back to MPI_THREAD_SERIALIZED. I was able to verify that my MPI implementation is able to provide MPI_THREAD_MULTIPLE. As I am somewhat new to MPI, I think I'm at the end of what I can try.

Thank you again for your time in addressing my question.

Regards,
Kyle

Kyle Schwiebert

May 24, 2024, 9:17:34 PM
to deal.II User Group
I do actually have one question here that may be relevant. Whenever I am checking things out in gdb, it claims I have twice as many threads running as I asked for using MultithreadInfo::set_max_threads(). Is this possibly germane to my issue here, and what is the cause of it? As is common, my CPU has two logical cores per physical core, but CPU utilization suggests that only one thread per core is ever in use at any given time.

Wolfgang Bangerth

May 24, 2024, 11:15:37 PM
to dea...@googlegroups.com

> I also am not sure MPI_THREAD_MULTIPLE works anyway. I attempted an
> alternative approach: Protecting all calls to objects with a shared
> communicator by a mutex, and I ran into an odd issue: The
> TrilinosWrappers::MPI::Vector::Vector() constructor seems to be causing
> segfaults--I do not see how this could be the users' fault since there are no
> input parameters. This happens even when I run with only one thread unless I
> switch back to MPI_THREAD_SERIALIZED. I was able to verify that the MPI
> implementation I have is able to provide MPI_THREAD_MULTIPLE. As I am somewhat
> new to MPI I think I'm at the end of what I can try.

I don't know about the segfault -- a backtrace would be useful.

But I will say that the approach with the mutex isn't going to work. The mutex
only makes sure that only one copy of your code is running at any given time;
it does not ensure that the order of two operations is going to be the same on
two machines. And even if you could ensure that, I think you still might have
to use a parallel mutex (of which we have one in namespace Utilities::MPI).
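
The parallel mutex in question is Utilities::MPI::CollectiveMutex. For
reference, a bare-bones usage sketch (not code from this thread; check the
class documentation for details) would be:

#include <deal.II/base/mpi.h>

dealii::Utilities::MPI::CollectiveMutex mutex;

void critical_communication(const MPI_Comm comm)
{
  // Every process of 'comm' must reach this point; the lock is acquired
  // collectively and released again when it goes out of scope.
  dealii::Utilities::MPI::CollectiveMutex::ScopedLock lock(mutex, comm);

  // ... MPI communication that must not interleave with other critical
  //     sections on the same communicator ...
}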

In any case, the right way to deal with multiple threads communicating on the
same communicator is to ensure that each message is matched on both sides via
separate 'tag' values in all MPI communications. The analogy you should
consider is that an MPI communicator is a postal service that only knows about
street addresses. If multiple people are sending letters from the same address
at the same time, to multiple people at another address, you need to include a
name and/or reference number with each letter or you can't expect that the
receiver will be able to match an incoming letter to a specific open issue
(e.g., an invoice). That's what the 'tag' is there for. The problem, of
course, is that you can't select the tags when you use packages like deal.II
or PETSc or Trilinos -- they have all hardcoded these tags in their calls to
MPI somewhere deep down. As a consequence, you really cannot hope to use
multiple threads reliably on the same communicator unless you handle all of
the communication yourself.
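
To make the 'tag' point concrete, here is a sketch (with made-up tag values,
not something taken from deal.II or Trilinos) of two solver threads sharing a
communicator but keeping their messages apart:

#include <mpi.h>

// Arbitrary but distinct tags, one per solver/thread.
constexpr int TAG_SOLVER_1 = 101;
constexpr int TAG_SOLVER_2 = 102;

void exchange(const MPI_Comm comm, const int tag, const int partner_rank,
              const double *send_buf, double *recv_buf, const int count)
{
  // Because this tag is unique to one solver/thread, a message posted by the
  // other thread (with the other tag) can never be matched here, even though
  // both threads use the same communicator. Thread 1 passes TAG_SOLVER_1,
  // thread 2 passes TAG_SOLVER_2.
  MPI_Sendrecv(send_buf, count, MPI_DOUBLE, partner_rank, tag,
               recv_buf, count, MPI_DOUBLE, partner_rank, tag,
               comm, MPI_STATUS_IGNORE);
}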

Wolfgang Bangerth

May 24, 2024, 11:23:53 PM
to dea...@googlegroups.com
On 5/24/24 19:17, Kyle Schwiebert wrote:
>
>
> I do actually have one question here that may be relevant. Whenever I am
> checking things out in the gdb, it claims I have twice as many threads running
> as I asked for using MultithreadInfo::set_max_threads(). Is this possibly
> germane to my issue here and what is the cause of this? As is common my CPU
> has two logical cores per physical core, but the CPU utilization suggests that
> only one thread of each core is ever being used at any given time.

I don't have a good answer for this. It is possible that you are linking
(directly or indirectly) with a library that is built with OpenMP and creates
its own set of worker threads, while deal.II uses TBB, which also creates its
own set of worker threads. In practice, you will likely only ever see one or
the other set of worker threads being active.
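
If you want to pin both pools down, one hedged option (whether an OpenMP
runtime is actually involved depends on your build) is to cap deal.II's TBB
workers with MultithreadInfo::set_max_threads() and any OpenMP pool
separately:

#include <deal.II/base/multithread_info.h>
#ifdef _OPENMP
#  include <omp.h>
#endif

void limit_worker_threads(const unsigned int n_threads)
{
  // Caps the TBB-based task scheduler that deal.II uses internally.
  dealii::MultithreadInfo::set_max_threads(n_threads);

#ifdef _OPENMP
  // Caps an OpenMP pool created by other libraries linked into the program.
  omp_set_num_threads(n_threads);
#endif
}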

Kyle Schwiebert

Jul 22, 2024, 2:26:47 PM
to deal.II User Group
I wanted to follow up on this since I was able to find a solution which works for me. Hopefully it will help someone :)

Thank you very much for the help.

1) As suggested by Dr. Bangerth above, MPI_THREAD_MULTIPLE can work fine with deal.II if you are willing to handle the initialization of the libraries and MPI yourself.
2) Also as suggested above, a typical use of a serial mutex, a parallel mutex, or a combination thereof will not do the trick.
3) However, while not extensively tested, I have run into no issues with something like the snippet below. The idea is that with a parallel mutex we have no control over which thread gets access to the communicator that must be shared, only the guarantee that a single thread on each process gets access. The key is to put the root process in charge of which work group/thread id on each process gets released. Put more simply: Utilities::MPI::CollectiveMutex lets one thread through; my custom mutex lets a specific thread through.

I was a little surprised to see that neither the below way of solving the problem nor moving to MPI_THREAD_MULTIPLE caused a performance hit; indeed, the approach represents a sizable improvement over MPI-only parallelism. But I am not currently scaling to a large machine, so this could change.

void my_fun() // Function which will be called by several threads on each process.
{
  int rank;
  int thread_id = 0; // Your method of getting a thread id/work group here. Iteration number, sub-step, or what have you.
  MPI_Comm_rank(communicators[thread_id], &rank); // communicators[thread_id] must hold a duplicate of MPI_COMM_WORLD
  std::unique_lock<std::mutex> lock;
  if (rank == 0){
    std::unique_lock<std::mutex> lock_(my_mutex); // my_mutex is a std::mutex which must be accessible to all threads which might enter the current function on each MPI process.
    MPI_Barrier(communicators[thread_id]);
    lock.swap(lock_);
  }
  else{
    MPI_Barrier(communicators[thread_id]);
  }
  // Call to Trilinos or PETSc function (like a matrix mult) which is communicated over MPI_COMM_WORLD here.
  // Now exit function or otherwise release lock.
}

4) The bug I was having was actually unrelated to any concurrency issues; it was instead due to mistakenly renumbering DoFs component-wise in parallel computations.
5) I have no idea either why twice as many threads as asked for are being spawned, but as suggested, only half of them ever take up compute resources, so the impact is probably minimal. I am not using any libraries external to deal.II, and I'm strictly using the dealii::Threads functions for launching threads. This happens even when running the tutorial steps in serial.
6) One quick comment: in debug mode, deal.II throws an assertion about mixing threads and MPI with PETSc. However, I do not actually intend to do multithreaded reading/writing of the PETSc data structures; I am using threads for extra parallelism on top of the FEM parallelism, so this is a bit inconvenient. But I understand my use case is uncommon.

Thank you for all the help.

Regards,
Kyle



Kyle Schwiebert

Jul 22, 2024, 2:34:58 PM
to deal.II User Group
My apologies for a second message. Right after posting, I realized my code snippet is potentially wrong due to a copy-and-paste error. You have to call MPI_Barrier before releasing the mutex, or else rank 0 could release the mutex early, depending on what the underlying library call is doing. Here is the corrected snippet:

void my_fun() // Function which will be called by several threads on each process.
{
  int rank;
  int thread_id = 0; // Your method of getting a thread id/work group here. Iteration number, sub-step, or what have you.
  MPI_Comm_rank(communicators[thread_id], &rank); // communicators[thread_id] must hold a duplicate of MPI_COMM_WORLD
  std::unique_lock<std::mutex> lock;
  if (rank == 0){
    std::unique_lock<std::mutex> lock_(my_mutex); // my_mutex is a std::mutex which must be accessible to all threads which might enter the current function on each MPI process.
    MPI_Barrier(communicators[thread_id]);
    lock.swap(lock_);
  }
  else{
    MPI_Barrier(communicators[thread_id]);
  }
  // Call to Trilinos or PETSc function (like a matrix mult) which is communicated over MPI_COMM_WORLD here.
  MPI_Barrier(communicators[thread_id]); // Barrier before releasing the mutex, so rank 0 cannot release it early.
  // Now exit function or otherwise release lock.
}
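
For completeness, the 'communicators' array referenced in the snippet could be
set up, for example, by duplicating MPI_COMM_WORLD once per work group on the
main thread before any threads are launched. A sketch under that assumption
(only the array name is taken from the snippet above):

#include <mpi.h>
#include <vector>

std::vector<MPI_Comm> communicators;

void setup_communicators(const unsigned int n_work_groups)
{
  // MPI_Comm_dup is collective, so all processes must call it the same
  // number of times and in the same order.
  communicators.resize(n_work_groups);
  for (unsigned int i = 0; i < n_work_groups; ++i)
    MPI_Comm_dup(MPI_COMM_WORLD, &communicators[i]);
}

void free_communicators()
{
  for (MPI_Comm &comm : communicators)
    MPI_Comm_free(&comm);
  communicators.clear();
}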
