MUMPS and SuperLU_DIST do not scale


yy.wayne

unread,
Feb 23, 2023, 5:34:51 AM2/23/23
to deal.II User Group
I'm using MUMPS and SuperLU_DIST through Trilinos' Amesos wrapper directly.
The code is simply step-4 with TrilinosWrappers::SparseMatrix and
LinearAlgebra::distributed::Vector.
Here's the result from my virtual machine for a 98k-DoF problem.
Snipaste_2023-02-23_18-31-08.png

I've tested a 300k+ DoF problem with Amesos_mumps on a server as well.
Multiple processes are not faster than one process.

Mumps_test.cc

Wolfgang Bangerth

unread,
Feb 23, 2023, 9:32:39 AM2/23/23
to dea...@googlegroups.com

yy.wayne:

> I'm using MUMPS and SuperLU_DIST through Trilinos' Amesos wrapper directly.
> The code is simply step-4 with TrilinosWrappers::SparseMatrix and
> LinearAlgebra::distributed::Vector.
> Here's the result from my virtual machine for a 98k-DoF problem.
> Snipaste_2023-02-23_18-31-08.png
>
> I've tested a 300k+ DoF problem with Amesos_mumps on a server as well.
> Multiple processes are not faster than one process.

It's not clear to me what you are doing. step-4 is not parallel, and unless
you extended it like we show in step-40, for example, all that happens when
you run the program with
mpirun -np 4 ./step-4
is that the same program is run four times on four different processes, but
all these four copies do is run the exact same instructions. This might also
explain why you see the output printed multiple times. If that is so, you
shouldn't be surprised that that takes exactly as long as running one instance.

If you want to test these kinds of solvers, start with step-40.

Best
W.

--
------------------------------------------------------------------------
Wolfgang Bangerth email: bang...@colostate.edu
www: http://www.math.colostate.edu/~bangerth/


yy.wayne

unread,
Feb 23, 2023, 9:37:10 AM2/23/23
to deal.II User Group
I'm making a big mistake then. I thought the only differences between step-4
and step-40 were the parallel distributed triangulation and handling ghost DoFs.
Thank you.

Best,
Wayne

Wolfgang Bangerth

unread,
Feb 23, 2023, 9:40:14 AM2/23/23
to dea...@googlegroups.com
On 2/23/23 07:37, 'yy.wayne' via deal.II User Group wrote:
> I'm making a big mistake then. I thought the only differences between step-4
> and step-40 were the parallel distributed triangulation and handling ghost DoFs.

I don't know what you're doing in your modification of step-4, so I really
can't tell. But step-4 itself has no notion that there are other processors.
Unless you modified it extensively, every process will build and store the
entire matrix, and I assume that you are also only calling MUMPS and SuperLU
with that entire matrix.

But of course I don't know -- I don't have the code you are running.

Best
W.

yy.wayne

unread,
Feb 23, 2023, 9:42:25 AM2/23/23
to deal.II User Group
Basically, I modified the SparseMatrix and Vector types to the TrilinosWrappers ones.
So if step-4 doesn't support parallelism, then what I got is reasonable.

Wolfgang Bangerth

unread,
Feb 23, 2023, 9:44:57 AM2/23/23
to dea...@googlegroups.com
On 2/23/23 07:42, 'yy.wayne' via deal.II User Group wrote:
> So if step-4 doesn't support parallelism, then what I got is reasonable.

Correct.
W.

yy.wayne

unread,
Feb 24, 2023, 6:37:17 AM2/24/23
to deal.II User Group
Step-32 and step-50 are important references for parallelization with TrilinosWrappers data types.
For someone who has to write code with Trilinos data types, here are the necessary modifications
from a serial to a parallel program:
  1. Change the triangulation to parallel::distributed::Triangulation, rather than
     GridTools::partition_triangulation() or parallel::shared::Triangulation.
  2. Use the SparsityPattern and solution/rhs Vector initialization specific to Trilinos; see step-42.
  3. Loop over only locally owned cells (or level cells) during assembly.
Though both allow parallelism, step-42's DoFTools::make_sparsity_pattern gives better performance
in the matrix LU decomposition (30%-50% less time) than step-50's.

However, another problem pops up. When testing a vector-valued problem, renumbering the DoFs
component-wise makes the reinit of vectors fail. If DoFRenumbering::component_wise is called,
  1. LinearAlgebra::distributed::Vector<double> v1(dof_handler.locally_owned_dofs(), mpi_communicator);
     v1.reinit(dof_handler.locally_owned_dofs(), mpi_communicator);
     succeeds.
  2. LinearAlgebra::distributed::Vector<double> v2;
     v2.reinit(dof_handler.locally_owned_dofs(), mpi_communicator);
     fails.
Is that related to the Trilinos data type?

yy.wayne

unread,
Feb 24, 2023, 6:47:08 AM2/24/23
to deal.II User Group
Sorry, I forgot to include the error message.
--------------------------------------------------------
An error occurred in line <164> of file </home/ubuntu/deal.II/dealii/source/base/partitioner.cc> in function
    void dealii::Utilities::MPI::Partitioner::set_owned_indices(const dealii::IndexSet&)
The violated condition was:
    locally_owned_indices.is_contiguous() == true
Additional information:
    The index set specified in locally_owned_indices is not contiguous.

Stacktrace:
-----------
#0  /home/ubuntu/deal.II/installed/lib/libdeal_II.g.so.9.4.0: dealii::Utilities::MPI::Partitioner::set_owned_indices(dealii::IndexSet const&)
#1  /home/ubuntu/deal.II/installed/lib/libdeal_II.g.so.9.4.0: dealii::Utilities::MPI::Partitioner::Partitioner(dealii::IndexSet const&, ompi_communicator_t* const&)
...
#9  /home/ubuntu/deal.II/installed/lib/libdeal_II.g.so.9.4.0: std::shared_ptr<dealii::Utilities::MPI::Partitioner> std::make_shared<dealii::Utilities::MPI::Partitioner, dealii::IndexSet const&, ompi_communicator_t* const&>(dealii::IndexSet const&, ompi_communicator_t* const&)
#10  /home/ubuntu/deal.II/installed/lib/libdeal_II.g.so.9.4.0: dealii::LinearAlgebra::distributed::Vector<double, dealii::MemorySpace::Host>::reinit(dealii::IndexSet const&, ompi_communicator_t* const&)
#11  /home/ubuntu/deal.II/installed/lib/libdeal_II.g.so.9.4.0: dealii::LinearAlgebra::distributed::Vector<double, dealii::MemorySpace::

Wolfgang Bangerth

unread,
Feb 24, 2023, 6:54:45 AM2/24/23
to dea...@googlegroups.com
On 2/24/23 04:37, 'yy.wayne' via deal.II User Group wrote:
>
> However, another problem pops up. When testing a vector-valued problem,
> renumbering the DoFs component-wise makes the reinit of vectors fail. If
> DoFRenumbering::component_wise is called,
>
> 1. LinearAlgebra::distributed::Vector<double>
> v1(dof_handler.locally_owned_dofs(), mpi_communicator);
> v1.reinit(dof_handler.locally_owned_dofs(), mpi_communicator);
> succeeds.
> 2. LinearAlgebra::distributed::Vector<double> v2;
> v2.reinit(dof_handler.locally_owned_dofs(), mpi_communicator);
> fails.
>
> Is that related to the Trilinos data type?

If you do not arrange the different components of a vector-valued finite
element into a block vector (i.e., if you use a single vector for all
components of the solution), then there is no need to call
DoFRenumbering::component_wise().

Best

yy.wayne

unread,
Feb 24, 2023, 7:22:43 AM2/24/23
to deal.II User Group
Well, I renumbered the DoFs because of this discussion.
I use Trilinos because the MatrixFree method relies on LinearAlgebra::distributed::Vector
or BlockVector. When solving the coarsest level of the MatrixFree multigrid problem directly,
the Trilinos direct solver only handles non-blocked vectors, so I transfer between
BlockVectors and Vectors. The ordering of DoFs is disrupted when the fe degree rises to
3 and higher, causing the transfer to fail.

Best,
Wayne

Wolfgang Bangerth

unread,
Feb 24, 2023, 7:27:57 AM2/24/23
to dea...@googlegroups.com
On 2/24/23 05:22, 'yy.wayne' via deal.II User Group wrote:
> I use Trilinos because the MatrixFree method relies on
> LinearAlgebra::distributed::Vector or BlockVector.
> When solving the coarsest level of the MatrixFree multigrid problem directly,
> the Trilinos direct solver only handles non-blocked vectors, so I transfer
> between BlockVectors and Vectors. The ordering of DoFs is disrupted when the
> fe degree rises to 3 and higher, causing the transfer to fail.

The problem with renumbering by component is that the DoFs owned by each
process are no longer a contiguous range, and that just isn't allowed by the
vector class you are trying to use. It needs to be a contiguous index range.

If you need to copy from block vectors to non-block vectors, you might just
have to copy in a way that doesn't just map from src[i] to dst[i], but from
src[i] to dst[mapping[i]] where mapping takes care of the translation of
corresponding vector elements.

yy.wayne

unread,
Feb 24, 2023, 7:30:54 AM2/24/23
to deal.II User Group
Yes, I think handling the transfer with a mapping is more reasonable...
Thanks for your suggestion.

Best,
Wayne
