Problem with scaling with parallel::distributed::Triangulation and direct solver


Nihar Bhardwaj Darbhamulla

Nov 5, 2025, 1:37:46 PM
to deal.II User Group
Hello all,

I have a parallel implementation in which my domain consists of two subdomains, with an hp::FECollection used to set up a different finite element on each subdomain. To load balance, I use parallel::CellWeights with the coefficients (1,1), which assigns equal weight to each DoF (a sketch of how this weighting is attached follows the renumbering snippet below). Furthermore, I plan to use a block preconditioner with an iterative solver, using direct solves for the inner solves, in order to evaluate the parallel scaling of my solution process. Therefore, I reorder my DoFs as follows:

DoFRenumbering::Cuthill_McKee(dof_handler);

// component 0 (pD) goes into block 0, the dim middle (vector) components
// into block 1, and component dim+1 (p) into block 2
std::vector<unsigned int> block_component(dim+2,1);
block_component[0] = 0;
block_component[dim+1] = 2;
DoFRenumbering::component_wise(dof_handler, block_component);
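For context, the weighting is attached roughly as follows (a sketch only; the exact parallel::CellWeights interface depends on the deal.II version, and the object has to stay alive, e.g. as a class member, for as long as the mesh may be repartitioned):

#include <deal.II/distributed/cell_weights.h>

// With the coefficient pair (1, 1), each cell's weight is 1 * n_dofs^1,
// i.e. every DoF carries the same weight in the load balancing.
const parallel::CellWeights<dim> cell_weights(
    dof_handler,
    parallel::CellWeights<dim>::ndofs_weighting({1, 1}));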


When constructing the grid, I transfer material id information across processes to update ghost cells, so that each locally owned, active cell knows whether it lies at the interface between the subdomains and builds the coupling matrix accordingly (one possible way to do this exchange is sketched below). I have tested and verified this implementation with a manufactured solution and a sparse direct solver, and it performs as expected. I also record timing measurements of the setup.
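One way such a ghost-cell exchange can be written is via GridTools::exchange_cell_data_to_ghosts; the following is a generic sketch of that pattern (the lambdas and names are illustrative, not necessarily the exact code I use):

#include <deal.II/grid/grid_tools.h>

// Send the material id of locally owned cells to the processes on which
// they appear as ghost cells, and store it there.
GridTools::exchange_cell_data_to_ghosts<types::material_id>(
    dof_handler,
    /*pack on the owning process:*/
    [](const auto &cell) -> std::optional<types::material_id> {
      return cell->material_id();
    },
    /*unpack on the ghost side:*/
    [](const auto &cell, const types::material_id &id) {
      cell->set_material_id(id);
    });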

I would like to run this problem with iterative solvers and block preconditioners, for which I do a block-wise reordering and use vectors and matrices from the PETScWrappers::MPI namespace. I use GMRES as the iterative solver and define a block preconditioner as follows:

template<class PreconditionerA, class PreconditionerS1, class PreconditionerS2>
    class BlockLowerTriangularPreconditioner : public Subscriptor
    {
        public:
            BlockLowerTriangularPreconditioner( const PreconditionerA  &M11,
                                                const PreconditionerS1 &M22,
                                                const PreconditionerS2 &M33,
                                                const LA::MPI::BlockSparseMatrix  &L,
                                                const std::vector<IndexSet> &owned_dofs)
                : M1(M11), M2(M22), M3(M33), L(&L),
                    owned_dofs(owned_dofs),
                    R21(owned_dofs[1], MPI_COMM_WORLD),
                    R32(owned_dofs[2], MPI_COMM_WORLD) {}
           
            void vmult( LA::MPI::BlockVector &dst,
                        const LA::MPI::BlockVector &src  ) const;
       
        private:
            const PreconditionerA &M1;
            const PreconditionerS1 &M2;
            const PreconditionerS2 &M3;
           
            const SmartPointer<const LA::MPI::BlockSparseMatrix> L;
            const std::vector<IndexSet> &owned_dofs;

            mutable LA::MPI::Vector R21, R32;
    };

    template<class PreconditionerA, class PreconditionerS1, class PreconditionerS2>
    void BlockLowerTriangularPreconditioner<PreconditionerA, PreconditionerS1, PreconditionerS2>::vmult(
        LA::MPI::BlockVector &dst,
        const LA::MPI::BlockVector &src) const
    {
        M1.vmult(dst.block(0), src.block(0));

        L->block(1,0).vmult(R21, dst.block(0));
        R21 *= -1.0;
        R21 += src.block(1);

        M2.vmult(dst.block(1), R21);
        // M2.vmult(dst.block(1), src.block(1));

        // L.block(2,0).vmult(R31, dst.block(0));
        L->block(2,1).vmult(R32, dst.block(1));

        R32 *= -1.0;
        R32 += src.block(2);

        M3.vmult(dst.block(2), R32);
        // M3.vmult(dst.block(2), src.block(2));
    }
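For reference, the vmult() above performs one block forward substitution with approximate inverses M1, M2, M3 of the diagonal blocks:

    dst_0 = M1 * src_0
    dst_1 = M2 * ( src_1 - L_{10} * dst_0 )
    dst_2 = M3 * ( src_2 - L_{21} * dst_1 )

where L_{10} and L_{21} are the sub-diagonal blocks of the system matrix; the (2,0) coupling is not used (the corresponding line is commented out above).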

where PreconditionerA, PreconditionerS1, and PreconditionerS2 are instances of the class LinearSolver::DirectInverseMatrix<Matrix>, defined as

template<class Matrix>
    class DirectInverseMatrix : public Subscriptor
    {
        public:
            DirectInverseMatrix(  const Matrix &K,
                                  const IndexSet &relevant_dofs )
                : K(&K), relevant_dofs(relevant_dofs),
                  u_distributed(relevant_dofs, MPI_COMM_WORLD) {}
           
            template<typename VectorType>
            void vmult(     VectorType &dst,
                            const VectorType &src) const;

        private:
            const SmartPointer<const Matrix> K;
            const IndexSet &relevant_dofs;

            mutable LA::MPI::Vector u_distributed;
    };

    template<class Matrix>
    template<typename VectorType>
    void DirectInverseMatrix<Matrix> :: vmult(
                VectorType &dst,
                const VectorType &src ) const
    {
        SolverControl   solver_control;
        PETScWrappers::SparseDirectMUMPS solver(solver_control, MPI_COMM_WORLD);
       
        // LA::MPI::Vector u_distributed(relevant_dofs, MPI_COMM_WORLD);

        try{
            solver.solve( *K, u_distributed, src );
        } catch ( std::exception &e ){
            Assert( false, ExcMessage(e.what()));
        }

        dst = u_distributed;
    }

Within my solve() function, I have the following calls

using invDirect = LinearSolver::DirectInverseMatrix< LA::MPI::SparseMatrix >;

        const invDirect inv_Ad( Mp.block(0, 0), owned_partitioning[0]);
        const invDirect inv_As( Mp.block(1, 1), owned_partitioning[1]);
        const invDirect inv_Mp( Mp.block(2, 2), owned_partitioning[2]);

        // const LinearSolver::BlockDiagonalPreconditioner<invDirect, invDirect, invDirect>
        //         prec_M(inv_Ad, inv_As, inv_Mp);

        const LinearSolver::BlockLowerTriangularPreconditioner<invDirect, invDirect, invDirect>
                    prec_M(inv_Ad, inv_As, inv_Mp, Mp, owned_partitioning);

        SolverControl solver_control(A.m(), 1e-8*R.l2_norm());

        SolverGMRES<LA::MPI::BlockVector>::AdditionalData gmres_data(20);
        SolverGMRES<LA::MPI::BlockVector> solver( solver_control, gmres_data);

        LA::MPI::BlockVector distributed_solution(owned_partitioning, mpi_communicator);

        constraints.set_zero(distributed_solution);

        solver.solve(A, distributed_solution, R, prec_M);

        pcout<<"Solved in "<< solver_control.last_step() <<" iterations" << std::endl;

        constraints.distribute(distributed_solution);

        U_local = distributed_solution;


In the parallel setup, I am testing with 214788 DoFs on 1, 2, 4, 8, and 16 processes; the timings are attached below. As can be seen, the assembly scales essentially perfectly as the number of processes doubles, but it is the solve that ends up taking quite long. The direct solves operate on blocks of the following sizes:
M1 - 66049
M2 - 132098
M3 - 16641

None of these sizes is prohibitively expensive for a direct solver in parallel. Given this premise, which directions should I probe into?

Regards,
Nihar





np4.jpg
np2.jpg
np1.jpg

Nihar Bhardwaj Darbhamulla

Nov 5, 2025, 1:48:56 PM
to deal.II User Group

Added here are timing data for execution with 8 and 16 processes.
np8.jpg
np16.jpg

Wolfgang Bangerth

Nov 8, 2025, 7:47:05 PM
to dea...@googlegroups.com
On 11/5/25 11:37, Nihar Bhardwaj Darbhamulla wrote:
>
> In the parallel setup, I am testing with 214788 dofs, with 1, 2,4,8 and 16
> processes, and the timings are attached below. As can be seen, the assembly
> scales perfectly well with doubling the processes, but it is the solve which
> ends up taking quite long. The direct solve is based on the following sub-
> sizes of dofs:
> M1 - 66049
> M2 - 132098
> M3 - 16641
>
> None of these sizes are prohibitively expensive for a Direct Solver in
> parallel. Given this premise, which directions should I potentially probe into.

Nihar:
without having looked at your code in detail:
* These are very small problems. I would not expect that a parallel solver can
provide any kind of speedup on a problem this small -- you will be spending
all of your time in communication and none in actual computations. You'll want
to test with much larger problems -- a good rule of thumb is that parallel
solvers only scale if you have perhaps 50k or 100k unknowns per MPI process;
you should apply that to the smallest of the blocks you have, M3. So you'll
have to make it at least ten times larger to even be relevant to 2 processes.
* I believe that in your code, you compute the sparse direct LU decomposition
every time the vmult() function is called for the preconditioner. That's
inefficient. You will want to compute the decomposition only once, in the
constructor of the DirectInverseMatrix, and then only *apply* the
decomposition in each call to vmult.
* The results you show indicate that your method leads to a roughly constant
number of GMRES iterations. That's a good start. In order to find out which
part of your implementation is leading to the growth in run time, put timer
sections into the various parts of your preconditioner so that you get
finer-grained information about your algorithm's individual components (see the sketch after this list).
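Schematically, such timer sections could look like this in the vmult() you posted, assuming the preconditioner is given a reference to a TimerOutput object (called computing_timer here; adjust the names to your code):

#include <deal.II/base/timer.h>

// Inside BlockLowerTriangularPreconditioner, add a member
//   TimerOutput &computing_timer;
// and wrap each stage of vmult() in a scoped section:
void vmult(LA::MPI::BlockVector &dst, const LA::MPI::BlockVector &src) const
{
  {
    TimerOutput::Scope t(computing_timer, "prec: block 0 solve");
    M1.vmult(dst.block(0), src.block(0));
  }
  {
    TimerOutput::Scope t(computing_timer, "prec: block 1 update + solve");
    L->block(1, 0).vmult(R21, dst.block(0));
    R21 *= -1.0;
    R21 += src.block(1);
    M2.vmult(dst.block(1), R21);
  }
  {
    TimerOutput::Scope t(computing_timer, "prec: block 2 update + solve");
    L->block(2, 1).vmult(R32, dst.block(1));
    R32 *= -1.0;
    R32 += src.block(2);
    M3.vmult(dst.block(2), R32);
  }
}

Calling computing_timer.print_summary() at the end then shows how the solve time splits between the three inner solves and the matrix-vector products.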

Best
W.

--
------------------------------------------------------------------------
Wolfgang Bangerth email: bang...@colostate.edu
www: http://www.math.colostate.edu/~bangerth/

Nihar Bhardwaj Darbhamulla

Nov 8, 2025, 9:48:47 PM
to deal.II User Group
Hello Prof. Wolfgang,

Thank you for the response. 

" These are very small problems. I would not expect that a parallel solver can
provide any kind of speedup on a problem this small -- you will be spending
all of your time in communication and none in actual computations. You'll want
to test with much larger problems -- a good rule of thumb is that parallel
solvers only scale if you have perhaps 50k or 100k unknowns per MPI process;
you should apply that to the smallest of the blocks you have, M3. So you'll
have to make it at least ten times larger to even be relevant to 2 processes. "

My idea was to observe timing behaviour similar to step-55, as the preconditioner template I have followed most closely emulates the implementation in step-55. The number of degrees of freedom is not significantly different from those at cycles 4 and 5. When I run step-55 with 2 processes and the same version of deal.II, I get the performance at cycles 4 and 5 shown below. This indicated to me that, for 2 MPI processes and this number of DoFs, the solution time should be quite low. Once I had reasonable solution times, I was planning to scale the code to a larger number of degrees of freedom.
step55_np2_45.jpg

"I believe that in your code, you compute the sparse direct LU decomposition
every time the vmult() function is called for the preconditioner. That's
inefficient. You will want to compute the decomposition only once, in the
constructor of the DirectInverseMatrix, and then only *apply* the
decomposition in each call to vmult. "

Prof. Bangerth, would there be a way to do this when using SparseDirectMUMPS? From reading the documentation, I only see a solve() function. The alternative would be to use SparseILU. Do you recommend using SparseILU instead?

Since there is clearly an inefficiency in using DirectInverseMatrix objects as my preconditioner, I switched to using InverseMatrix for the inner solves, i.e. CG with an AMG preconditioner, as shown in the code below:

using invCG = LinearSolver::InverseMatrix<  LA::MPI::SparseMatrix,
                                                    LA::MPI::PreconditionAMG>;

                                                   
        LA::MPI::PreconditionAMG precAd;
        {
            LA::MPI::PreconditionAMG::AdditionalData data;
            #ifdef USE_PETSC_LA
                data.symmetric_operator = true;
                data.max_iter = 4;
            #endif
            precAd.initialize(Mp.block(0,0), data);
        }

        LA::MPI::PreconditionAMG precAs;
        {
            LA::MPI::PreconditionAMG::AdditionalData data;
            #ifdef USE_PETSC_LA
                data.symmetric_operator = true;
                data.max_iter = 4;
            #endif
            precAs.initialize(Mp.block(1,1), data);
        }

        LA::MPI::PreconditionAMG precMp;
        {
            LA::MPI::PreconditionAMG::AdditionalData data;
            #ifdef USE_PETSC_LA
                data.symmetric_operator = true;
            #endif
            precMp.initialize(Mp.block(2,2), data);
        }

        const invCG inv_Ad( Mp.block(0,0), precAd );
        const invCG inv_As( Mp.block(1,1), precAs );
        const invCG inv_Mp( Mp.block(2,2), precMp );

        const LinearSolver::BlockDiagonalPreconditioner<invCG, invCG, invCG>
            prec_M( inv_Ad, inv_As, inv_Mp );
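For reference, the BlockDiagonalPreconditioner (not shown here) is analogous to the lower-triangular preconditioner above, but without the off-diagonal coupling terms; schematically it looks like this sketch (details of the actual class may differ):

template <class P1, class P2, class P3>
class BlockDiagonalPreconditioner : public Subscriptor
{
  public:
    BlockDiagonalPreconditioner(const P1 &M11, const P2 &M22, const P3 &M33)
      : M1(M11), M2(M22), M3(M33) {}

    // Apply each inner inverse to the corresponding block of the vector.
    void vmult(LA::MPI::BlockVector &dst, const LA::MPI::BlockVector &src) const
    {
      M1.vmult(dst.block(0), src.block(0));
      M2.vmult(dst.block(1), src.block(1));
      M3.vmult(dst.block(2), src.block(2));
    }

  private:
    const P1 &M1;
    const P2 &M2;
    const P3 &M3;
};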

I don't observe any improvement in performance. From my observations, the second CG solve, with the (1,1) block, takes around 70 iterations to converge, which accounts for the bulk of the computation time. I would most likely have to improve the preconditioner precAs, which might bring down the iteration counts and speed things up. Do you think this would be the right way to approach the problem?

Thank you once again for your response.

Regards,
Nihar

Wolfgang Bangerth

Nov 11, 2025, 1:58:33 PM
to dea...@googlegroups.com

> My idea was to observe timing behaviour similar to step-55, as the
> preconditioner template I have followed most closely emulates the
> behaviour of the implementation in step-55. The number of degrees of
> freedom are not significantly different from those observed at cycles 4
> and 5. The output I observe from cycles 4 and 5 when i run step-55 using
> 2 processes with the same version of deal.II, I get the performance as
> shown below. This indicated to me that for 2 mpi processes and the
> number of dofs, the solution time would be quite low. Once I had
> reasonable solution times, I was planning to scale the code to a larger
> number of degrees of freedom.

Ah, I see -- so the question you're really asking is why it takes 493
seconds to solve a problem with 215,000 unknowns. That's likely because
you do 19 outer iterations, and in each of them you call a direct solver
that decomposes the same matrix anew -- 19 decompositions in total.


> Prof. Bangerth, would there be a way to do this when using
> SparseDirectMUMPS? From reading the documentation, I only see a solve()
> function. The alternative would be to use sparseILU. Do you recommend
> using sparseILU instead?

I don't recall the exact interface of SparseDirectMUMPS from past
releases. SparseDirectUMFPACK allows you to compute a decomposition only
once, and then apply it repeatedly in vmult(). The interface of
SparseDirectMUMPS that's in the current developer version also allows
you to do that. If you can switch to the current developer version (or
check whether 9.7 can do that as well), you may want to try that.
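For the serial case, the pattern looks like this with SparseDirectUMFPACK (the matrix and vector names are placeholders; this class works on deal.II's own SparseMatrix, not on PETSc MPI matrices):

#include <deal.II/lac/sparse_direct.h>

SparseDirectUMFPACK direct;
direct.initialize(matrix);   // compute the LU decomposition once

// later, e.g. in every call to a preconditioner's vmult():
direct.vmult(dst, src);      // only forward/backward substitution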

SparseILU works sometimes, but it typically does not scale well to large
problems. (Sparse direct solvers often do not either, but at least they
don't require endless fiddling with settings.)


> Since there is clearly an inefficiency when using DirectInverseMatrix
> objects as my preconditioner, I switched to using InverseMatrix as my
> inner solver, wherein I have CG with AMG preconditioner as shown in the
> code below:
> [...]
> I don't seem to observe any form of improvement in performance. From my
> observations the second CG solve with the (1,1) block takes around 70
> iterations to converge, which adds to the bulk of the computation time.
> I would most likely have to add some performance improvements here for
> precAs, which might bring down the iteration counts and speed up things.
> Do you think this would be the right way to approach this problem?

I can't see where you use the AMG preconditioners, but the same applies:
You should only set them up once and then re-use many times. That is,
the preconditioners need to live *outside* the place where you solve the
inner (1,1) block.

Perhaps as a general rule: people spend whole PhD theses developing good
parallel solvers and preconditioners. In industry, consultants are paid
thousands or tens of thousands of dollars to figure out good solvers and
preconditioners. You should expect that figuring this out is a long
learning process that involves developing the skills to set up block
preconditioners in the right place, and to find ways to time the right
parts of the code. This is not going to be an easy process; nor is there
a magic bullet that the good people on this mailing list have that will
magically make it work for you.

Best
W.

Nihar Bhardwaj Darbhamulla

Nov 14, 2025, 1:20:15 PM
to deal.II User Group
Hello Prof. Wolfgang,

"I don't recall the exact interface of SparseDirectMUMPS from past
releases. SparseDirectUMFPACK allows you to compute a decomposition only
once, and then apply it repeatedly in vmult(). The interface of
SparseDirectMUMPS that's in the current developer version also allows
you to do that. If you can switch to the current developer version (or
check whether 9.7 can do that as well), you may want to try that. "

I used SparseDirectMUMPS as you suggested, i.e. I initialize it in the constructor and reuse it across solves. I believe it applies the factored form each time vmult is called.
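Concretely, the class now looks roughly like this (a sketch; whether the factorization is really cached across solve() calls depends on the deal.II version, as discussed above):

template <class Matrix>
class DirectInverseMatrix : public Subscriptor
{
  public:
    DirectInverseMatrix(const Matrix &K, const IndexSet &owned_dofs)
      : K(&K),
        solver(solver_control, MPI_COMM_WORLD),
        u_distributed(owned_dofs, MPI_COMM_WORLD) {}

    template <typename VectorType>
    void vmult(VectorType &dst, const VectorType &src) const
    {
      // The solver object lives as long as the preconditioner, so the
      // factorization of *K can be reused between calls (version permitting).
      solver.solve(*K, u_distributed, src);
      dst = u_distributed;
    }

  private:
    const SmartPointer<const Matrix>          K;
    mutable SolverControl                     solver_control;
    mutable PETScWrappers::SparseDirectMUMPS  solver;
    mutable LA::MPI::Vector                   u_distributed;
};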

" I can't see where you use the AMG preconditioners, but the same applies:
You should only set them up once and then re-use many times. That is,
the preconditioners need to live *outside* the place where you solve the
inner (1,1) block. "

I have added the code snippet initializing and running the CG+AMG preconditioner below. As it happens, the inner (1,1) block arises from the symmetric discretization of a Stokes system (as in step-22), and using BoomerAMG as is does not allow it to recognize the multiple DoFs associated with each node. This option is available when using BoomerAMG directly through PETSc ( -pc_hypre_boomeramg_nodal_coarsen ), but it is not exposed through the deal.II interface. The solutions I can think of are:
  1. Adding the option to the PETScWrappers source code and testing it. I'm not sure whether this has been tried before with the PETScWrappers, and whether a particular reordering has to be kept in mind to make it work (a possible untested user-level workaround is sketched after this list).
  2. Switching to the Trilinos wrappers. However, my research group has more background with PETSc, and the switch would come at a loss of support from a linear-algebra-library point of view.
  3. Modifying the formulation from a symmetric-gradient weak form to a vector-Laplacian weak form. This comes with modeling justifications and would be problem specific.
For now, I would prefer option 1, but I would like to know whether it has been tried before and whether any caveats are associated with using it.
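The untested user-level workaround mentioned under option 1 would be to push the flag into PETSc's global options database before initializing the preconditioner. It only has an effect if deal.II's PreconditionBoomerAMG ends up calling PCSetFromOptions() on the underlying PC object, which I have not verified:

#include <petscsys.h>

// Untested: register the hypre flag globally before precAs.initialize(...).
const PetscErrorCode ierr =
  PetscOptionsSetValue(nullptr, "-pc_hypre_boomeramg_nodal_coarsen", "1");
AssertThrow(ierr == 0, ExcMessage("PetscOptionsSetValue failed"));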

The code for using CG+AMG as a preconditioner
{
       
         using invCG = LinearSolver::InverseMatrix<  LA::MPI::SparseMatrix,
                                                     LA::MPI::PreconditionAMG>;

                                                   
         LA::MPI::PreconditionAMG precAd;
         {
             LA::MPI::PreconditionAMG::AdditionalData data;
             #ifdef USE_PETSC_LA
                 data.symmetric_operator = true;
                 data.max_iter = 4;
                 data.output_details = true;
             #endif
             precAd.initialize(A.block(0,0), data);
         }

         LA::MPI::PreconditionAMG precAs;
         {
             LA::MPI::PreconditionAMG::AdditionalData data;
             #ifdef USE_PETSC_LA
                 data.symmetric_operator = true;
                 data.max_iter = 4;
                 data.output_details=true;
             #endif
             precAs.initialize(A.block(1,1), data);
         }

         LA::MPI::PreconditionAMG precMp;
         {
             LA::MPI::PreconditionAMG::AdditionalData data;
             #ifdef USE_PETSC_LA
                 data.symmetric_operator = true;
             #endif
             precMp.initialize(Mp.block(2,2), data);
         }

         const invCG inv_Ad( A.block(0,0), precAd );
         const invCG inv_As( A.block(1,1), precAs );
         const invCG inv_Mp( Mp.block(2,2), precMp );

         const LinearSolver::BlockDiagonalPreconditioner<invCG, invCG, invCG>
             prec_M( inv_Ad, inv_As, inv_Mp );

with the InverseMatrix class defined as
template<class Matrix, class Preconditioner>
    class InverseMatrix : public Subscriptor
    {
        public:
            InverseMatrix(  const Matrix &K,
                            const Preconditioner &M )
                : K(&K), M(M),
                  pcout( std::cout, (Utilities::MPI::this_mpi_process(MPI_COMM_WORLD) == 0)) {}
           
            template<typename VectorType>
            void vmult(     VectorType &dst,
                            const VectorType &src) const;

        private:
            const SmartPointer<const Matrix> K;
            const Preconditioner             &M;
            ConditionalOStream pcout;
    };

    template<class Matrix, class Preconditioner>
    template<typename VectorType>
    void InverseMatrix<Matrix, Preconditioner> :: vmult(
                VectorType &dst,
                const VectorType &src ) const
    {
        SolverControl   solver_control(src.size(), 1e-8 * src.l2_norm());
        solver_control.log_history(true);
        SolverCG<VectorType>    cg(solver_control);
        dst = 0;

        try{
            cg.solve(*K, dst, src, M);
            pcout<<"CG solved in "<<solver_control.last_step() <<" iterations" << std::endl;
        } catch (std::exception &e) {
            Assert(false, ExcMessage(e.what()));
        }
    }

Thank you for your help once again.

Regards,
Nihar

Wolfgang Bangerth

Nov 14, 2025, 6:23:42 PM
to dea...@googlegroups.com


On 11/14/25 11:20, Nihar Bhardwaj Darbhamulla wrote:
> I have added the code snippet initializing and running the CG+AMG
> preconditioner below. As it happens, the inner (1,1) block arises out of
> the symmetric discretization of a Stokes system (as in Step-22), and
> using BoomerAMG as is does not allow for the recognition of multiple
> DoFs associated with each node, while this is an option available in
> using BoomerAMG through PETSc ( -pc_hypre_boomeramg_nodal_coarsen ),
> but is not exposed through the dealii interface. The solutions I can
> think of are:
>
> 1. adding the option to the PETScWrappers source code and testing it.
> I'm not sure if it has been tried before with the PETSc Wrappers,
> and whether there is a certain re-ordering that is to be kept in
> mind to execute this successfully.
> 2. Switch to using Trilinos wrappers. However, my research group has
> more background on using PETSc, and the switch would come at a loss
> of support available from a linear algebra library POV.
> 3. Modifying the formulation to go from a symmetric gradient weak-form
> to a vector laplacian weak-form. This comes with modeling
> justifications and would be problem specific.
>
> For now, I would prefer option 1, but would like to know if it has been
> tried before, and if any caveats are associated with using it.

I don't know whether it has been tried before, but in practice all you'd
have to do is add the necessary information to the AdditionalData
object, and in the place where we set up the AMG preconditioner in
deal.II, i.e., the place where we currently already use the
AdditionalData information, extend the code to now also support the
nodal coarsening. You'll want to look up how the existing
PreconditionAMG class is implemented.

Of course, we would be excited to merge such a patch into deal.II!

Best
W.

Nihar Bhardwaj Darbhamulla

Nov 18, 2025, 11:50:19 AM
to deal.II User Group
Thank you Prof. Wolfgang.

I had another question regarding code performance. I am using an hp::FECollection as in step-46 to have two different elements in different regions of the mesh. Furthermore, I have set up my matrices as in step-55, with the following snippet:

{
        A.clear();
        // A0.clear();

        // Initialize the block dynamic sparsity pattern
        BlockDynamicSparsityPattern dsp( relevant_partitioning );
        // DynamicSparsityPattern dsp0( locally_relevant_dofs );

        // Instantiate the cell coupling table. The coupling across the
        // faces goes through the dofs corresponding to the cells.
        Table<2, DoFTools::Coupling> cell_coupling(dim+2, dim+2);

        for(unsigned int c = 0; c < dim+2; c++)
            for(unsigned int d = 0; d < dim+2; d++)
                if( c == 0 && d == dim+1 ) // Coupling between pD and p
                    cell_coupling[c][d] = DoFTools::none;
                else if( c == dim+1 && d == 0 ) // Coupling between pD and p
                    cell_coupling[c][d] = DoFTools::none;
                else if( c == dim+1 && d == dim+1 ) // Invoking the saddle point constraint
                    cell_coupling[c][d] = DoFTools::none;
                else
                    cell_coupling[c][d] = DoFTools::always; // All other variables are coupled

        Table<2, DoFTools::Coupling> face_coupling(dim+2, dim+2);

        for(unsigned int c = 0; c < dim+2; c++)
            for(unsigned int d = 0; d < dim+2; d++){
                face_coupling[c][d] = DoFTools::none;
                if(   (c == 0 && (d != 0 && d != dim+1)) ||
                      (d == 0 && (c != 0 && c != dim+1)) ||
                      ((c != 0 && c!=dim+1) && (d != 0 && d!=dim+1)))
                    face_coupling[c][d] = DoFTools::always;
            }


        // Build the sparsity pattern from the DoFHandler
        DoFTools::make_flux_sparsity_pattern(   dof_handler,
                                                dsp,
                                                cell_coupling,
                                                face_coupling);
     
        constraints.condense(dsp);
        // constraints.condense(dsp0);

        // distribute the sparsity pattern across all partitions
        SparsityTools::distribute_sparsity_pattern(
                        dsp,
                        dof_handler.locally_owned_dofs(),
                        mpi_communicator,
                        locally_relevant_dofs );       
       
        // Re-initialize the system matrix
        A.reinit( owned_partitioning, dsp, mpi_communicator);
        // A0.reinit( locally_owned_dofs, dsp0, mpi_communicator);
    }

    {
        Mp.clear();

        // cell_couplingM corresponds to the coupling for making the
        // preconditioner coupling sparsity pattern
        Table<2, DoFTools::Coupling> cell_couplingM(dim+2, dim+2);
        for(unsigned int c = 0; c < dim+2; c++)
            for( unsigned int d = 0; d < dim+2; d++){
                if( c == dim+1 && d == dim+1 )
                    cell_couplingM[c][d] = DoFTools::always;
                else
                    cell_couplingM[c][d] = DoFTools::none;
            }

        // Initialize the block dynamic sparsity pattern
        BlockDynamicSparsityPattern dsp( relevant_partitioning );
       
        // Build the sparsity pattern from the DoFHandler
        DoFTools::make_sparsity_pattern(    dof_handler,
                                            cell_couplingM,
                                            dsp,
                                            constraints,
                                            false);
        constraints.condense(dsp);

        // distribute the sparsity pattern across all partitions
        SparsityTools::distribute_sparsity_pattern(
                        dsp,
                        dof_handler.locally_owned_dofs(),
                        mpi_communicator,
                        locally_relevant_dofs );
       
        // Re-initialize the system matrix
        Mp.reinit( owned_partitioning, dsp, mpi_communicator);
    }
   
    U_local.reinit( owned_partitioning,
                    relevant_partitioning,
                    mpi_communicator);
   
    R.reinit(   owned_partitioning,
                mpi_communicator);

The issue appears to be the very long times taken to set up the sparsity patterns and initialize the matrices. These are the timing measurements I have made with 1 and 2 processes for 214788 and 855556 degrees of freedom. I'm not sure what may be causing this bottleneck -- my hunch is that it has to do with load-balancing the individual blocks.

Thank you once again.

Regards,
Nihar
np2_855556.jpg
np2_214788.jpg
np1_214788.jpg

Wolfgang Bangerth

Nov 18, 2025, 1:19:51 PM
to dea...@googlegroups.com

On 11/18/25 09:50, Nihar Bhardwaj Darbhamulla wrote:
>
> The issue appears to be very long times taken for setting up the
> sparsity patterns and initializing the matrices. These are the timing
> measurements I have made with 1 and 2 processes with 214788 and 855556
> degrees of freedom. I'm not particularly sure as to what maybe causing
> this bottleneck - my hunch has to do with load-balancing the individual
> blocks.

I don't know either -- but put timers around individual lines or blocks
of lines to see which of the many functions you call is the expensive one!
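For example, using the calls from your snippet (assuming a TimerOutput member named computing_timer):

{
  TimerOutput::Scope t(computing_timer, "setup: make_flux_sparsity_pattern");
  DoFTools::make_flux_sparsity_pattern(dof_handler, dsp,
                                       cell_coupling, face_coupling);
}
{
  TimerOutput::Scope t(computing_timer, "setup: condense");
  constraints.condense(dsp);
}
{
  TimerOutput::Scope t(computing_timer, "setup: distribute_sparsity_pattern");
  SparsityTools::distribute_sparsity_pattern(dsp,
                                             dof_handler.locally_owned_dofs(),
                                             mpi_communicator,
                                             locally_relevant_dofs);
}
{
  TimerOutput::Scope t(computing_timer, "setup: A.reinit");
  A.reinit(owned_partitioning, dsp, mpi_communicator);
}

computing_timer.print_summary() will then show which of these steps dominates.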

Best
W.