Temporary distributed vectors


Jie Cheng

Dec 13, 2017, 2:42:13 PM
to deal.II User Group
Dear developers and users

I am trying to parallelize my incompressible Navier-Stokes solver based on step-57. I am not sure how to initialize the various temporary distributed vectors in the preconditioners. For example, I have a class called ApproximateMassSchur which represents B*(diag(M))^{-1}*B^T.
Here is the serial code which works fine:

  class ApproximateMassSchur : public Subscriptor
  {
  public:
    ApproximateMassSchur(const BlockSparseMatrix<double> &M);
    void vmult(Vector<double> &dst, const Vector<double> &src) const;

  private:
    const SmartPointer<const BlockSparseMatrix<double>> mass_matrix;
    mutable Vector<double> tmp1, tmp2;
  };

  ApproximateMassSchur::ApproximateMassSchur(const BlockSparseMatrix<double> &M)
    : mass_matrix(&M), tmp1(M.block(0, 0).m()), tmp2(M.block(0, 0).m())
  {
  }

  void ApproximateMassSchur::vmult(Vector<double> &dst,
                                   const Vector<double> &src) const
  {
    mass_matrix->block(0, 1).vmult(tmp1, src);                 // tmp1 = B^T * src
    mass_matrix->block(0, 0).precondition_Jacobi(tmp2, tmp1);  // tmp2 = diag(Mu)^{-1} * tmp1
    mass_matrix->block(1, 0).vmult(dst, tmp2);                 // dst  = B * tmp2
  }

  mass_matrix is a block matrix with sub-blocks [Mu, B^T; B, Mp]; as you can see, in vmult I compute dst = B*(diag(Mu))^{-1}*B^T*src.

  Here is how I attempted to adapt it to PETSc; this is just an incomplete sketch:

  class ApproximateMassSchur : public Subscriptor
  {
  public:
    ApproximateMassSchur(const PETScWrappers::MPI::BlockSparseMatrix &M);
    void vmult(PETScWrappers::MPI::Vector &dst, const PETScWrappers::MPI::Vector &src) const;

  private:
    const SmartPointer<const PETScWrappers::MPI::BlockSparseMatrix> mass_matrix;
    const MPI_Comm& comm;
    mutable PETScWrappers::MPI::Vector tmp1, tmp2;
  };

  ApproximateMassSchur::ApproximateMassSchur(const PETScWrappers::MPI::BlockSparseMatrix &M)
    : mass_matrix(&M), comm(mass_matrix->get_mpi_communicator()),
      // owned_partitioning is not actually available here -- see question 1 below
      tmp1(owned_partitioning[0], comm), tmp2(owned_partitioning[0], comm)
  {
  }

  void ApproximateMassSchur::vmult(PETScWrappers::MPI::Vector &dst,
                                   const PETScWrappers::MPI::Vector &src) const
  {
    mass_matrix->block(0, 1).vmult(tmp1, src);
    // PETScWrappers::SparseMatrix has no precondition_Jacobi() -- see question 2 below
    mass_matrix->block(0, 0).precondition_Jacobi(tmp2, tmp1);
    mass_matrix->block(1, 0).vmult(dst, tmp2);
  }

  The temporary vectors tmp1 and tmp2 should have the same size as mass_matrix->block(0, 0).m(), which is the number of velocity dofs, and should be distributed. I would like to know:
  1. What is the best practice to initialize tmp1 and tmp2? One way I can think of is constructing a std::vector<IndexSet> owned_partitioning as in step-55 and passing it into ApproximateMassSchur.
      (I'm assuming owned_partitioning[0] contains the locally owned velocity dofs.) But I don't like this idea, because I think the distribution information should somehow be obtainable from mass_matrix.
  2. PETScWrappers::SparseMatrix does not have a precondition_Jacobi member function. Should I construct a temporary matrix and invert the diagonal entries manually?

  Thank you very much!

Jie


Wolfgang Bangerth

Dec 13, 2017, 11:43:57 PM
to dea...@googlegroups.com, Jie Cheng
On 12/13/2017 12:42 PM, Jie Cheng wrote:
>
> The temporary vectors tmp1 and tmp2 should have the size as
> mass_matrix->block(0, 0).m(), which is the number of velocity dofs, and should
> be distributed. I would like to know:
> 1. What is the best practice here to initialize tmp1 and tmp2? One way I
> can think of is constructing a *std::vector<IndexSet> owned_partitioning* as
> in step-55, and pass it into ApproximateMassSchur.
> (I'm assuming owned_partitioning[0] returns the locally owned velocity
> dofs) But I don't like this idea because I think the distribution information
> should somehow be obtained from mass_matrix.

We typically just initialize vectors using another vector as a "template". For
example, all vector classes have a function so that you can say
Vector vec (original, false);
which will create 'vec' to have the same kind of element structure as
'original' but without copying the elements (that's what the 'false' is for).
Likewise, you can typically say
vec.reinit (original);

This way, you won't have to deal with index sets -- the new vector will simply
look like the other one, regardless of how it was initialized.

In your case, these temporary vectors should all only have locally owned
elements, no ghosts. That's exactly what you get from the solution and rhs
vectors passed to linear solvers. So, in your case, you will have that
'original' equals 'solution.block(0)' or '.block(1)', or if you prefer,
'system_rhs.block(0)' or '.block(1)'.
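
For instance (just a sketch; I am assuming 'solution' is the parallel block
vector whose block(0) is the velocity part):

  // Give tmp1/tmp2 the same parallel layout as the velocity block of the
  // solution vector (locally owned entries only, no ghosts); the values of
  // 'solution' are not copied and tmp1/tmp2 are overwritten in vmult() anyway.
  tmp1.reinit(solution.block(0));
  tmp2.reinit(solution.block(0));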


> 2. PETScWrappers::SparseMatrix does not have a precondition_Jacobi member
> function. So should I construct a temporary matrix and invert the diagonal
> terms manually?

Jacobi is a terrible preconditioner -- don't use it. Or do you want to use it
for your (diag(M))^{-1}?

You could use PETScWrappers::PreconditionJacobi for what you want to do. Its
base class has a vmult() function that applies a single Jacobi step.
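
For example, something along these lines (a sketch, with names of my own
choosing):

  // Set up once, e.g. as a member initialized in the constructor:
  PETScWrappers::PreconditionJacobi mass_diag_inverse;
  mass_diag_inverse.initialize(mass_matrix->block(0, 0));

  // Then, in vmult():
  mass_matrix->block(0, 1).vmult(tmp1, src);   // tmp1 = B^T * src
  mass_diag_inverse.vmult(tmp2, tmp1);         // tmp2 = diag(Mu)^{-1} * tmp1
  mass_matrix->block(1, 0).vmult(dst, tmp2);   // dst  = B * tmp2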

Best
W.

--
------------------------------------------------------------------------
Wolfgang Bangerth email: bang...@colostate.edu
www: http://www.math.colostate.edu/~bangerth/

Jie Cheng

Dec 14, 2017, 12:11:57 AM
to deal.II User Group
Dear Wolfgang

Thank you very much for your help.

Unfortunately, ApproximateMassSchur does not know about system_rhs or solution, because of the way I wrote it: the src and dst vectors are of size n_dof_p, whereas I need n_dof_u. But your idea is still valid: I can pass in a "template" vector.
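
Something like this, I suppose (just a sketch; the extra constructor argument
is new):

  ApproximateMassSchur::ApproximateMassSchur(
    const PETScWrappers::MPI::BlockSparseMatrix &M,
    const PETScWrappers::MPI::Vector            &velocity_template)
    : mass_matrix(&M)
  {
    // Copy only the parallel layout of the template (e.g. solution.block(0)
    // or system_rhs.block(0)), not its values.
    tmp1.reinit(velocity_template);
    tmp2.reinit(velocity_template);
  }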

> In your case, these temporary vectors should all only have locally owned
> elements, no ghosts. That's exactly what you get from the solution and rhs
> vectors passed to linear solvers. So, in your case, you will have that
> 'original' equals 'solution.block(0)' or '.block(1)', or if you prefer,
> 'system_rhs.block(0)' or '.block(1)'.

I want to use it for diag(M)^{-1}. What about PETScWrappers::PreconditionBlockJacobi? I use it in CG solvers;
is that bad practice too?

> Jacobi is a terrible preconditioner -- don't use it. Or do you want to use it
> for your (diag(M))^{-1}?


Got it! 
> You could use PETScWrappers::PreconditionJacobi for what you want to do. Its
> base class has a vmult() function that applies a single Jacobi step.

Jie 

Jie Cheng

Dec 15, 2017, 1:50:08 PM
to deal.II User Group
Dear Wolfgang

Another issue related to this question: ApproximateMassSchur has a vmult member function defined, so it can be used in a CG solver to compute the product of its inverse with some vector, just like step-20 does.
However, unlike the deal.II SolverCG class, PETScWrappers::SolverCG only takes a MatrixBase type. In this case, is deriving my ApproximateMassSchur from PETScWrappers::MatrixBase with an overridden vmult function the correct solution?
Or should I simply compute and store B*diag(M)^{-1}*B^T as a PETScWrappers::MPI::SparseMatrix?

Thanks
Jie

Wolfgang Bangerth

Dec 16, 2017, 9:52:20 AM
to dea...@googlegroups.com, Jie Cheng

Jie,

> Another issue related to this question is that, ApproximateMassSchur has a
> vmult member function defined, so it can be used in CG solver to get the
> product of its inverse with some vector, just like step-20 does.
> However unlike the dealii SolverCG class, PETScWrappers::SolverCG only takes
> MatrixBase type.

That is correct. Whereas the deal.II SolverCG class does not care about the
type of the matrix as long as it has a vmult() function, the
PETScWrappers::SolverCG class specifically wants a PETSc matrix object. That
means that you can use dealii::SolverCG with a PETSc matrix, but not the other
way around.

So, in your case, you could just use the dealii::SolverCG class. Or...

> In this case, is deriving my ApproximateMassSchur from
> PETScWrappers::MatrixBase with an overridden vmult function a correct solution?
> Or I simply should compute and store B*diag(M)^{-1}*B^T as a
> PETScWrappers::MPI::SparseMatrix?

...you need to make sure that you represent B*diag(M)^-1*B^T in the form of a
PETSc matrix object. You can do this by either multiplying these out, thereby
computing the element-by-element structure of the matrix and putting things
into a PETScWrappers::MPI::SparseMatrix -- which is inefficient -- or, you
make sure you describe to PETSc that a vmult with this matrix is implemented
differently. For the latter, take a look at the PETScWrappers::MatrixFree
class. In essence, this class does exactly what you want and already have in
your ApproximateMassSchur class. You would have to derive your class from it
and override several vmult-like functions.

This is probably the more elegant approach (if you insist on using
PETScWrappers::SolverCG), but I feel like it is not terribly well explained.
You may want to look at tests/petsc/petsc_mf_testmatrix.h for an example. If
you go this route, I think it would be quite nice to get your example and see
if we could possibly put it into the documentation of the
PETScWrappers::MatrixFree class as a demonstration of what a user needs to do.
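
To give you an idea, such a derived class might look roughly like the
following. This is only a sketch written from memory -- the constructor
arguments and the exact set of virtual functions to override should be checked
against the class documentation and the test mentioned above, and all names
here are mine:

  class MassSchurMatrixFree : public PETScWrappers::MatrixFree
  {
  public:
    MassSchurMatrixFree(const PETScWrappers::MPI::BlockSparseMatrix &M,
                        const PETScWrappers::MPI::Vector &u_template,
                        const PETScWrappers::MPI::Vector &p_template)
      : // The operator maps pressure vectors to pressure vectors, so the
        // (global and local) sizes come from the pressure space.
        PETScWrappers::MatrixFree(u_template.get_mpi_communicator(),
                                  p_template.size(), p_template.size(),
                                  p_template.local_size(),
                                  p_template.local_size()),
        mass_matrix(&M)
    {
      jacobi.initialize(M.block(0, 0));
      tmp1.reinit(u_template);
      tmp2.reinit(u_template);
    }

    virtual void vmult(PETScWrappers::VectorBase       &dst,
                       const PETScWrappers::VectorBase &src) const override
    {
      mass_matrix->block(0, 1).vmult(tmp1, src);  // tmp1 = B^T * src
      jacobi.vmult(tmp2, tmp1);                   // tmp2 = diag(Mu)^{-1} * tmp1
      mass_matrix->block(1, 0).vmult(dst, tmp2);  // dst  = B * tmp2
    }

    // The operator is symmetric, so Tvmult() can simply forward to vmult();
    // vmult_add() and Tvmult_add() also need to be overridden and would add
    // the product to dst instead of overwriting it.
    virtual void Tvmult(PETScWrappers::VectorBase       &dst,
                        const PETScWrappers::VectorBase &src) const override
    {
      vmult(dst, src);
    }

  private:
    const SmartPointer<const PETScWrappers::MPI::BlockSparseMatrix> mass_matrix;
    PETScWrappers::PreconditionJacobi jacobi;
    mutable PETScWrappers::MPI::Vector tmp1, tmp2;
  };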

Jie Cheng

Dec 16, 2017, 10:50:30 AM
to deal.II User Group
Wolfgang

I went with the first route. But I want to know if I could benefit from the second route in the future: is there any performance difference between the deal.II CG and the PETSc CG? Similarly, what about deal.II SolverFGMRES vs. PETScWrappers::SolverGMRES,
and UMFPACK vs. MUMPS (yes, a direct solver is still used, as in step-57)? I was under the impression that the deal.II CG, FGMRES and UMFPACK only work in serial, but surprisingly they do work with distributed vectors and matrices. However, the scaling of my parallel code is not particularly exciting, and I am not sure whether that is due to the use of the direct solver or the use of the deal.II built-in solvers. How well does a direct solver scale in parallel, assuming memory is not a problem?

Also, unlike UMFPACK, which can be initialized once and then used many times, MUMPS has to reallocate memory every time it is used, right? I figure this might be bad.

Thank you as always! 

Jie

Wolfgang Bangerth

Dec 17, 2017, 5:45:30 PM
to dea...@googlegroups.com, Jie Cheng

Jie,

> I went with the first route. But I want to know if I can benefit from the
> second route in the future: is there any performance difference between dealii
> CG and PETSc CG?

Not that I know of.


> Similarly, what about dealii SolverFGMRES vs
> PETScWrappers::SolverGMRES,

Same -- I suspect that all of these solvers spend all of their time in the
matrix-vector and vector-vector operations, and that the performance of the
operations in the class itself is unimportant.


> and UMFPACK vs MUMPS (Yes, direct solver is still used like in step-57...)?

These are actually quite different algorithms, so I would very much expect
them to perform differently.

> I
> was under the impression that dealii CG, FGMRES and UMFPACK only work in
> serial. But surprisingly they do work with distributed vector/matrix!

That's correct. All they care about is that vectors and matrices allow certain
operations, but whether these operations happen on a single machine, or in
parallel, is unimportant to these iterative solvers. Of course, that's
different for UMFPACK, which is an algorithm that computes a decomposition of
the matrix, and the matrix needs to be available in its entirety on the
current processor.


> Although
> the scaling of my parallel code is not particular exciting. But I am not sure
> if it is due to the use of direct solver, or the use of dealii built-in
> solvers. How well does direct solver scale in parallel assuming memory is not
> a problem?

MUMPS scales pretty well, as far as I know. I would expect that it scales to a
few hundred processors at least, but you should look up information on their
project's website.


> Also, unlike UMFPACK, which can be initialized only once and used for many
> times, MUMPS has to reallocate memory every time when it is used, right? I
> figure this might be bad.

I imagine this is possible, but it may not be possible through the deal.II
interfaces. Though it looks like a once-created SparseDirectMUMPS object will
re-use the same factorization of the matrix if you call the solve() function
multiple times.
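
For reference, the usage I have in mind is something like this (a sketch with
placeholder object names):

  SolverControl solver_control;
  PETScWrappers::SparseDirectMUMPS direct_solver(solver_control,
                                                 mpi_communicator);
  // The first solve() factorizes the matrix; further calls on the same
  // object with the same matrix appear to re-use that factorization.
  direct_solver.solve(system_matrix.block(0, 0), dst_1, rhs_1);
  direct_solver.solve(system_matrix.block(0, 0), dst_2, rhs_2);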

Jie Cheng

Dec 17, 2017, 11:14:50 PM
to deal.II User Group
Dear Wolfgang

Thanks to your advice, I now have a working serial solver and a PETSc-based parallel solver. I managed to reuse the factorization in the MUMPS solver, as I commented in another thread.

I did some further experiments and found something interesting. For my problems, solving with B*(diag(M))^{-1}*B^T actually takes much more time than the direct solver for the inverse of A.
I took it for granted that calculating B*(diag(M))^{-1}*B^T explicitly as a SparseMatrix isn't efficient, which is why I used ApproximateMassSchur, just like step-20. However, I found this is not true.
If I use ApproximateMassSchur, I cannot use any built-in preconditioner when solving with CG; if I use mmult to compute it explicitly, then I can use SparseILU to precondition it. Experiments showed that the second approach significantly outperforms the first:
in a smaller test case, the time spent in the CG solver was 57 s vs. 5 s, and in a bigger test case it was 523 s vs. 25 s. Of course, when I timed the second approach I included the mmult time. The second approach could be further improved if I knew how to make a good estimate of the sparsity pattern beforehand. Do these results sound reasonable to you?
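
For completeness, the explicit serial computation I am referring to looks
roughly like this (a sketch with placeholder names; SparseMatrix::mmult takes
the inverted diagonal as its optional scaling vector, and the exact
requirements on how the result matrix and its sparsity pattern have to be set
up should be checked in the documentation):

  // Invert the diagonal of the velocity mass matrix Mu.
  Vector<double> diag_inverse(mass_matrix.block(0, 0).m());
  for (unsigned int i = 0; i < diag_inverse.size(); ++i)
    diag_inverse[i] = 1.0 / mass_matrix.block(0, 0).diag_element(i);

  // S = B * diag(Mu)^{-1} * B^T; with rebuild_sparsity_pattern left at its
  // default of true, mmult() rebuilds the pattern attached to schur_matrix.
  SparsityPattern      schur_pattern;
  SparseMatrix<double> schur_matrix;
  schur_matrix.reinit(schur_pattern);
  mass_matrix.block(1, 0).mmult(schur_matrix,
                                mass_matrix.block(0, 1),
                                diag_inverse);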

Unfortunately, I couldn't figure out how to compute B*(diag(M))^{-1}*B^T in the PETSc-based parallel code, because PETScWrappers::MPI::SparseMatrix does not have an mmult function. I also tried to assemble B*(diag(M))^{-1}*B^T from cells, but I didn't know how to "predict" its sparsity pattern based on the BlockDynamicSparsityPattern of the whole system. Any thoughts on how I should proceed with the parallel code?

PS: the purely deal.II-based serial code using the second approach beats the parallel PETSc-based code using the first approach on 4 processors, which is pretty ironic.

Thank you
Jie

 

Wolfgang Bangerth

Dec 17, 2017, 11:40:27 PM
to dea...@googlegroups.com
On 12/17/2017 09:14 PM, Jie Cheng wrote:
>
> I did some further experiments and found something interesting. For my
> problems, solving B*(diag(M))^{-1}*B^T actually uses much more time than the
> direct solver for the A inverse.
> I took for granted that calculating B*(diag(M))^{-1}*B^T out explicitly as a
> SparseMatrix isn't efficient, that is why I used ApproximateMassSchur, just
> like step-20. However, I found this is not true.
> If I use ApproximateMassSchur, I couldn't use any built-in preconditioner when
> solving it with CG. If I use mmult to calculate it explicitly then I could use
> SparseILU to precondition it. Experiment showed the second approach
> outperforms significantly:
> in a smaller test case, the time spent in the CG solver is 57s vs 5s and in a
> bigger test case, it was 523s vs 25s. Of course when I timed the second
> approach I included the mmult time. The second approach can be further
> improved if I knew how to make a nice estimation of the sparsity pattern
> beforehand. Does the results sound reasonable to you?

Yes. Though I would caution that you should establish how the various options
scale as the problem becomes larger and larger. For example, no
preconditioning works pretty well for small problems. But it is terrible if
you have moderate to large problems.


> Unfortunately I couldn't figure out how to compute B*(diag(M))^{-1}*B^T in
> PETSc-based parallel code because PETScWrappers::MPI::SparseMatrix does not
> have a mmult function. I also tried to assemble B*(diag(M))^{-1}*B^T from
> cells but I didn't know how to "predict" its sparsity pattern based on the
> BlockDynamicSparsityPattern of the whole system. Any thoughts on how I should
> proceed with the parallel code?

The way to deal with the sparsity pattern is to "simulate" what entries you
would produce in the mmult, add them to the sparsity pattern, attach a matrix,
and then do the whole thing again but this time actually put the entries into
the matrix.


As for the missing mmult function: There is nothing wrong in proposing a patch
to the wrapper classes, if you can find the necessary function in the PETSc
interfaces:
http://www.mcs.anl.gov/petsc/petsc-current/docs/manualpages/Mat/index.html
In fact, I think one of these is your function:

http://www.mcs.anl.gov/petsc/petsc-current/docs/manualpages/Mat/MatMatMatMult.html
http://www.mcs.anl.gov/petsc/petsc-current/docs/manualpages/Mat/MatPtAP.html#MatPtAP

While there, also take a look at this page describing the "simulation" step above:
http://www.mcs.anl.gov/petsc/petsc-current/docs/manualpages/Mat/MatMatMultSymbolic.html#MatMatMultSymbolic
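
If it helps, here is a rough, untested sketch of how one could do this today
by reaching down to the raw PETSc objects underneath the wrapper classes (it
relies on the fact that the deal.II PETSc matrix wrappers convert to the
underlying Mat; double-check the calls against the PETSc manual pages):

  // Underlying PETSc matrices of the deal.II wrapper blocks.
  Mat B  = mass_matrix.block(1, 0);
  Mat Bt = mass_matrix.block(0, 1);
  Mat Mu = mass_matrix.block(0, 0);

  // d = 1 ./ diag(Mu)
  Vec d;
  MatCreateVecs(Mu, &d, nullptr);
  MatGetDiagonal(Mu, d);
  VecReciprocal(d);

  // Bt_scaled = diag(Mu)^{-1} * B^T
  Mat Bt_scaled;
  MatDuplicate(Bt, MAT_COPY_VALUES, &Bt_scaled);
  MatDiagonalScale(Bt_scaled, d, nullptr);

  // S = B * diag(Mu)^{-1} * B^T (symbolic and numeric product in one call).
  Mat S;
  MatMatMult(B, Bt_scaled, MAT_INITIAL_MATRIX, PETSC_DEFAULT, &S);

  // ... use S, then clean up with MatDestroy()/VecDestroy().

That said, a proper mmult() in the wrapper classes would of course be nicer.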


> PS. the purely dealii-based serial code using the second approach beats the
> parallel petsc-based code using the first approach with 4 processors, which is
> pretty ironic.

Yes, this is funny. But it again may not scale to large problems.

Jie Cheng

Dec 18, 2017, 12:10:51 AM
to deal.II User Group

> The way to deal with the sparsity pattern is to "simulate" what entries you
> would produce in the mmult, add them to the sparsity pattern, attach a matrix,
> and then do the whole thing again but this time actually put the entries into
> the matrix.
 
That's exactly what this github issue (https://github.com/dealii/dealii/issues/5598) is about, right?
 
> As for the missing mmult function: There is nothing wrong in proposing a patch
> to the wrapper classes

I'd like to give it a try. I am not very familiar with PETSc itself, so just to get me started: even with mmult, we would still require the dst matrix to have a correct, or at least similar, sparsity pattern beforehand, right? I should probably find some PETSc mmult examples first.

Wolfgang Bangerth

Dec 18, 2017, 10:08:00 AM
to dea...@googlegroups.com, Jie Cheng
On 12/17/2017 10:10 PM, Jie Cheng wrote:
>
> The way to deal with the sparsity pattern is to "simulate" what entries you
> would produce in the mmult, add them to the sparsity pattern, attach a
> matrix,
> and then do the whole thing again but this time actually put the entries into
> the matrix.
>
> That's exactly what this github issue
> <https://github.com/dealii/dealii/issues/5598> is about, right?

Yes, in some sense. But it's also exactly what DoFTools::make_sparsity_pattern
does, for example.


> As for the missing mmult function: There is nothing wrong in proposing a
> patch
> to the wrapper classes
>
>
> I'd like to give it a try. I am not quite familiar with PETSc itself, just to
> get me started, even with mmult do we still require the dst matrix to have a
> correct or at least similar sparsity pattern beforehand, right? I should
> probably find some PETSc mmult examples first.

Yes, that's a good start.