Trilinos::Amesos_Superludist not scaling across multiple nodes


Paras Kumar

Jul 14, 2022, 5:34:23 AM
to deal.II User Group
Dear dealii Community,

I am working on solving a nonlinear coupled problem involving a vector-valued displacement field and a scalar phase-field variable. The code is MPI-parallelized using p::d::T (parallel::distributed::Triangulation) and the TrilinosWrappers for linear algebra.

Usually I use CG+AMG to solve the linear systems for each of the variables within a staggered scheme. But in certain scenarios the iterative linear solver fails, and we then switch to the Amesos_Superludist solver. The code is run on 2 nodes (144 MPI processes in total), and as the performance monitor shows, the flop count of one of the nodes drops to (almost) zero once the switch from the iterative to the direct solver occurs, so only one node seems to be doing the computations. Please see the attached flops and memory-bandwidth plots; the blue and red lines represent the two nodes. Similar observations were also made for a larger problem involving 8 nodes.
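For context, the switch to SuperLU_DIST can be expressed through deal.II's Trilinos direct-solver wrapper. The following is only a minimal sketch (the function and object names are placeholders, not from my code): note that SolverDirect operates on a non-block matrix and vectors, so in a staggered scheme one passes the block for the field currently being solved.

```cpp
#include <deal.II/lac/trilinos_solver.h>
#include <deal.II/lac/trilinos_sparse_matrix.h>
#include <deal.II/lac/trilinos_vector.h>

using namespace dealii;

// Hypothetical helper: solve one (fully distributed) linear system with
// SuperLU_DIST via Amesos instead of CG+AMG.
void solve_direct(TrilinosWrappers::SparseMatrix &system_matrix,
                  TrilinosWrappers::MPI::Vector  &solution,
                  const TrilinosWrappers::MPI::Vector &system_rhs)
{
  SolverControl solver_control; // iteration limits are irrelevant for a direct solver

  // "Amesos_Superludist" asks Amesos for its SuperLU_DIST interface; this is
  // only available if Trilinos was configured with SuperLU_DIST support.
  TrilinosWrappers::SolverDirect::AdditionalData data(
    /*output_solver_details=*/true,
    /*solver_type=*/"Amesos_Superludist");

  TrilinosWrappers::SolverDirect direct_solver(solver_control, data);
  direct_solver.solve(system_matrix, solution, system_rhs);
}
```

Setting output_solver_details to true makes Amesos print information about the solver it instantiated, which is a quick way to verify which solver is actually in use.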

These plots seem to hint that the SuperLU_DIST solver does not scale across multiple nodes. One possible reason I could think of is that I missed some option while installing dealii with trilinos and superlu-dist using spack. I also attach the spack spec which I installed on the cluster. The gcc compiler and the corresponding ope...@4.1.2 are available from the cluster.
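One quick check along these lines (the spec below is a hypothetical example, not copied from the attached file; variant names follow Spack's trilinos package) is whether the Trilinos dependency actually carries SuperLU_DIST support:

```shell
# Inspect the concretized spec: without the superlu-dist (and amesos)
# variants on trilinos, the Amesos_Superludist interface is not built at all.
spack spec dealii ^trilinos+amesos+superlu-dist
```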

Any ideas on solving this issue would be of great help.

Kind regards,
Paras Kumar







job107652-mem_bw.png
job107652-flops.png
dealii-spack.txt

Wolfgang Bangerth

Jul 14, 2022, 1:09:50 PM
to dea...@googlegroups.com
Paras:
I'm not sure any of us have experience with Amesos:SuperLU, so I'm not sure
anyone will know right away what the problem may be.

But here are a couple of questions:
* What happens if you run the program with just two MPI jobs on one machine?
In that case, you can watch what the two programs are doing by having 'top'
run in a separate window.
* How do you distribute the matrix and right hand side? Are they both fully
distributed?
* Is the solution you get correct?
* If the answer to the last question is yes, then either Amesos or SuperLU is
apparently copying the data of the linear system from all other processes to
just one process that then solves the linear system. It might be useful to
take a debugger, running with just two MPI processes, to step into the Amesos
routines to see if you get to a place where that is happening, and then to
read the code in that place to see what flags need to be set to make sure the
solution really does happen in a distributed way.
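One common way to do the kind of debugger session suggested above is to give each of the two ranks its own gdb in a separate terminal window; a sketch (the program name is a placeholder):

```shell
# Launch two MPI ranks, each under gdb in its own xterm window.
# Set a breakpoint in the Amesos solve routine, then step into it to see
# whether the matrix is being gathered onto a single process.
mpirun -np 2 xterm -e gdb --args ./my_program
```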

That's about all I can offer.
Best
W.

--
------------------------------------------------------------------------
Wolfgang Bangerth email: bang...@colostate.edu
www: http://www.math.colostate.edu/~bangerth/

Paras Kumar

Jul 15, 2022, 5:43:01 AM
to dea...@googlegroups.com
Dear Wolfgang,

Thank you for the response.


> Paras:
> I'm not sure any of us have experience with Amesos:SuperLU, so I'm not sure
> anyone will know right away what the problem may be.

I was wondering if, while writing the wrappers and testing them, someone had already figured out the requisite combination of installation-time options.
 
> But here are a couple of questions:
> * What happens if you run the program with just two MPI jobs on one machine?
> In that case, you can watch what the two programs are doing by having 'top'
> run in a separate window.
I ran a job with a direct solver from the beginning on one node with 72 processes (see job107396-flops graph) and it is evident that all except a few cores do a similar amount of flops. I guess some of the processes are meant to "coordinate the work" and not do much actual computation.
 
> * How do you distribute the matrix and right hand side? Are they both fully
> distributed?
I use BlockSparseMatrix and BlockVector objects available through TrilinosWrappers.

 
> * Is the solution you get correct?
I do not have an exact solution for comparison, but the simulation results, both visual as well as global quantities like force, energy etc., seem to indicate that the physics is captured correctly. Thus, I claim that the solution is correct.
 
> * If the answer to the last question is yes, then either Amesos or SuperLU is
> apparently copying the data of the linear system from all other processes to
> just one process that then solves the linear system. It might be useful to
> take a debugger, running with just two MPI processes, to step into the Amesos
> routines to see if you get to a place where that is happening, and then to
> read the code in that place to see what flags need to be set to make sure the
> solution really does happen in a distributed way.

One would probably need to debug the code while running on at least two nodes. I do not have much experience with debugging an MPI code, but I will try to learn more about this.

Best regards,
Paras
job107396-flops.png

Wolfgang Bangerth

Jul 15, 2022, 3:23:12 PM
to dea...@googlegroups.com

> I'm not sure any of us have experience with Amesos:SuperLU, so I'm not sure
> anyone will know right away what the problem may be.
>
> I was wondering if, while writing the wrappers and testing them out, someone
> managed to figure out the requisite combination of installation time options.

I don't recall who wrote the wrappers and when. You might have to do some
git-archeology to find out.


> * What happens if you run the program with just two MPI jobs on one machine?
> In that case, you can watch what the two programs are doing by having 'top'
> run in a separate window.
>
> I ran a job with a direct solver from the beginning on one node with 72
> processes (see job107396-flops graph) and it is evident that all except a few
> cores do a similar amount of flops. I guess some of the processes are meant to
> "coordinate the work" and not do much actual computation.

Too many processes. Try to get it down to two.


> * If the answer to the last question is yes, then either Amesos or SuperLU is
> apparently copying the data of the linear system from all other processes to
> just one process that then solves the linear system. It might be useful to
> take a debugger, running with just two MPI processes, to step into the Amesos
> routines to see if you get to a place where that is happening, and then to
> read the code in that place to see what flags need to be set to make sure the
> solution really does happen in a distributed way.
>
> One would probably need to debug the code while running on at least two
> nodes. I do not have much experience with debugging an MPI code. Will try
> to learn more about this.

There's a video lecture on that topic :-)