Problem with scaling with parallel::distributed::Triangulation and direct solver

Nihar Bhardwaj Darbhamulla
Nov 5, 2025, 1:37:46 PM
to deal.II User Group
Hello all,

I have a parallel implementation in which the domain consists of two subdomains, with an hp::FECollection used to set up a different finite element on each subdomain. To load balance, I use parallel::CellWeights with the coefficients (1,1), which assigns equal weight to every DoF (sketched after the renumbering code below). Furthermore, I plan to use a block preconditioner with an iterative solver, with direct solves for the inner solves, in order to evaluate the parallel scaling of my solution process. Therefore, I renumber my DoFs as follows:

DoFRenumbering::Cuthill_McKee(dof_handler);

std::vector<unsigned int> block_component(dim+2,1);
block_component[0] = 0;
block_component[dim+1] = 2;
DoFRenumbering::component_wise(dof_handler, block_component);
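
For reference, the cell weighting I mention above looks roughly like the following (a minimal sketch rather than my exact code; it assumes a parallel::distributed::Triangulation named triangulation, the same dof_handler, and that the CellWeights object is kept alive, e.g. as a class member, so that its connection to the triangulation persists):

#include <deal.II/distributed/cell_weights.h>

// Weight each cell as 1 * (number of DoFs on the cell)^1, i.e. equal
// weight per DoF, using the (1,1) coefficients mentioned above.
parallel::CellWeights<dim> cell_weights(
    dof_handler,
    parallel::CellWeights<dim>::ndofs_weighting({1, 1}));

// Repartitioning then redistributes cells according to these weights.
triangulation.repartition();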


When constructing the grid, I transfer material id information across processes to update the ghost cells, so that every locally owned active cell knows whether it lies at the interface between the two subdomains and builds its coupling matrix accordingly. I have tested and verified this implementation with a manufactured solution and a sparse direct solver, and it performs as expected. I also take timing measurements of the setup.
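
The ghost-cell update of the material ids is of the kind sketched below (a minimal sketch under the assumption that the ids are set on locally owned cells only; GridTools::exchange_cell_data_to_ghosts is one way to do it, not necessarily my exact code):

#include <deal.II/grid/grid_tools.h>

// Push the material id of each locally owned cell to the corresponding
// ghost cell on the neighboring processes.
GridTools::exchange_cell_data_to_ghosts<types::material_id,
                                        parallel::distributed::Triangulation<dim>>(
    triangulation,
    // pack: evaluated on locally owned cells
    [](const auto &cell) { return cell->material_id(); },
    // unpack: evaluated on the matching ghost cells
    [](const auto &cell, const types::material_id id) {
      cell->set_material_id(id);
    });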

I would now like to run this problem with an iterative solver and a block preconditioner, for which I do a block-wise reordering and use vectors and matrices from the PETScWrappers::MPI namespace. I use GMRES as the iterative solver and define a block preconditioner as follows:

template<class PreconditionerA, class PreconditionerS1, class PreconditionerS2>
    class BlockLowerTriangularPreconditioner : public Subscriptor
    {
        public:
            BlockLowerTriangularPreconditioner( const PreconditionerA  &M11,
                                                const PreconditionerS1 &M22,
                                                const PreconditionerS2 &M33,
                                                const LA::MPI::BlockSparseMatrix  &L,
                                                const std::vector<IndexSet> &owned_dofs)
                : M1(M11), M2(M22), M3(M33), L(&L),
                    owned_dofs(owned_dofs),
                    R21(owned_dofs[1], MPI_COMM_WORLD),
                    R32(owned_dofs[2], MPI_COMM_WORLD) {}
           
            void vmult( LA::MPI::BlockVector &dst,
                        const LA::MPI::BlockVector &src  ) const;
       
        private:
            const PreconditionerA &M1;
            const PreconditionerS1 &M2;
            const PreconditionerS2 &M3;
           
            const SmartPointer<const LA::MPI::BlockSparseMatrix> L;
            const std::vector<IndexSet> &owned_dofs;

            mutable LA::MPI::Vector R21, R32;
    };

    template<class PreconditionerA, class PreconditionerS1, class PreconditionerS2>
    void BlockLowerTriangularPreconditioner<PreconditionerA, PreconditionerS1, PreconditionerS2>::vmult(
        LA::MPI::BlockVector &dst,
        const LA::MPI::BlockVector &src) const
    {
        M1.vmult(dst.block(0), src.block(0));

        L->block(1,0).vmult(R21, dst.block(0));
        R21 *= -1.0;
        R21 += src.block(1);

        M2.vmult(dst.block(1), R21);
        // M2.vmult(dst.block(1), src.block(1));

        // L.block(2,0).vmult(R31, dst.block(0));
        L->block(2,1).vmult(R32, dst.block(1));

        R32 *= -1.0;
        R32 += src.block(2);

        M3.vmult(dst.block(2), R32);
        // M3.vmult(dst.block(2), src.block(2));
    }
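
For readability, the vmult above amounts to the block forward substitution

    dst_0 = M1 * src_0
    dst_1 = M2 * ( src_1 - L(1,0) * dst_0 )
    dst_2 = M3 * ( src_2 - L(2,1) * dst_1 )

where M1, M2 and M3 stand for the (approximate) inverses of the diagonal blocks, L(i,j) for the off-diagonal blocks of the system matrix, and the (2,0) coupling is not applied (see the commented-out line).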

where PreconditionerA, PreconditionerS1 and PreconditionerS2 are instances of the class template LinearSolver::DirectInverseMatrix<Matrix>, defined as

template<class Matrix>
    class DirectInverseMatrix : public Subscriptor
    {
        public:
            DirectInverseMatrix(  const Matrix &K,
                                  const IndexSet &relevant_dofs )
                : K(&K), relevant_dofs(relevant_dofs),
                  u_distributed(relevant_dofs, MPI_COMM_WORLD) {}
           
            template<typename VectorType>
            void vmult(     VectorType &dst,
                            const VectorType &src) const;

        private:
            const SmartPointer<const Matrix> K;
            const IndexSet &relevant_dofs;

            mutable LA::MPI::Vector u_distributed;
    };

    template<class Matrix>
    template<typename VectorType>
    void DirectInverseMatrix<Matrix> :: vmult(
                VectorType &dst,
                const VectorType &src ) const
    {
        SolverControl   solver_control;
        // Note: a MUMPS solver is constructed and the block is factored anew
        // on every application of this inverse (i.e. in every vmult call).
        PETScWrappers::SparseDirectMUMPS solver(solver_control, MPI_COMM_WORLD);
       
        // LA::MPI::Vector u_distributed(relevant_dofs, MPI_COMM_WORLD);

        try{
            solver.solve( *K, u_distributed, src );
        } catch ( std::exception &e ){
            Assert( false, ExcMessage(e.what()));
        }

        dst = u_distributed;
    }

Within my solve() function, I have the following calls:

using invDirect = LinearSolver::DirectInverseMatrix< LA::MPI::SparseMatrix >;

        const invDirect inv_Ad( Mp.block(0, 0), owned_partitioning[0]);
        const invDirect inv_As( Mp.block(1, 1), owned_partitioning[1]);
        const invDirect inv_Mp( Mp.block(2, 2), owned_partitioning[2]);

        // const LinearSolver::BlockDiagonalPreconditioner<invDirect, invDirect, invDirect>
        //         prec_M(inv_Ad, inv_As, inv_Mp);

        const LinearSolver::BlockLowerTriangularPreconditioner<invDirect, invDirect, invDirect>
                    prec_M(inv_Ad, inv_As, inv_Mp, Mp, owned_partitioning);

        SolverControl solver_control(A.m(), 1e-8*R.l2_norm());

        SolverGMRES<LA::MPI::BlockVector>::AdditionalData gmres_data(20);
        SolverGMRES<LA::MPI::BlockVector> solver( solver_control, gmres_data);

        LA::MPI::BlockVector distributed_solution(owned_partitioning, mpi_communicator);

        constraints.set_zero(distributed_solution);

        solver.solve(A, distributed_solution, R, prec_M);

        pcout<<"Solved in "<< solver_control.last_step() <<" iterations" << std::endl;

        constraints.distribute(distributed_solution);

        U_local = distributed_solution;


In the parallel setup, I am testing with 214788 DoFs on 1, 2, 4, 8 and 16 processes; the timings are attached below. As can be seen, the assembly scales essentially perfectly when doubling the number of processes, but it is the solve that ends up taking very long. The direct solves act on blocks of the following sizes:
M1 - 66049 DoFs
M2 - 132098 DoFs
M3 - 16641 DoFs

None of these sizes is prohibitively expensive for a direct solver in parallel. Given this, which directions should I probe?

Regards,
Nihar





Attachments: np1.jpg, np2.jpg, np4.jpg

Nihar Bhardwaj Darbhamulla
Nov 5, 2025, 1:48:56 PM
to deal.II User Group

Attached are the timing data for execution on 8 and 16 processes.

Attachments: np8.jpg, np16.jpg