./solver -A A.mtx -f b.mtx -p solver.type=lgmres precond.relax.type=ilu0 precond.coarsening.type=smoothed_aggregation

Solver
======
Type:             LGMRES(30,3)
Unknowns:         160000
Memory footprint: 46.40 M

Preconditioner
==============
Number of levels:    4
Operator complexity: 1.34
Grid complexity:     1.19
Memory footprint:    59.39 M

level     unknowns       nonzeros      memory
---------------------------------------------
    0       160000         798400     46.62 M (74.69%)
    1        26800         239728     11.31 M (22.43%)
    2         3016          27090      1.29 M ( 2.53%)
    3          366           3686    174.93 K ( 0.34%)

Iterations: 11
Error:      4.15313e-09

[Profile:     1.290 s] (100.00%)
[  self:      0.002 s] (  0.13%)
[  reading:   0.879 s] ( 68.13%)
[  setup:     0.124 s] (  9.61%)
[  solve:     0.285 s] ( 22.13%)
Hello Denis,
thanks for your reply. It is quite nice that you take the time ;)
Today, I have worked on the strategy I described yesterday:
- managing my numbering to put my "external vertices" at the end of my containers -- which are just std::vector
- playing with vector.resize to "hide" my external vertices before passing all my data structures (i.e. my rhs and my val/col/ptr) to AMGCL (I am lucky that my code uses CRS, so I can use the zero_copy adapter).
- after solving the linear system with AMGCL, I use vector.resize again to "show" my external vertices, and I use my own communication pattern to fetch the updated values.
I haven't fully validated it, but my first tests seem to work quite well, so I will try to finish this approach this week.
If it fails, I will go back to your strategy. It would be better for me if my strategy really works, since it would enable me to use my code and, in this way, to build a meaningful benchmark between my linear solvers and yours. By meaningful, I mean that I can use all my physical models. I plan to test the performance on 3 main solvers:
- a high Reynolds incompressible fluid solver based on projection/correction method with a FEFV discretization
- a low Reynolds incompressible flow solver based on coupled U/p method with a stabilized FE discretization
- a linear elasticity model / pure FE
I will focus on "industrial" test cases, meaning meshes of around 1e6 to 2e7 vertices, highly unstructured and anisotropic.
If you are interested, I could try to write a small (informal) report that you could put on your GitHub.
I find this objective interesting because, beyond the power of the technologies (for which I don't have any valuable input to give), I found the C++ architecture of your library quite neat and efficient (it is the main reason why I chose it). I guess such an architecture could easily be extended in the future to add new functionalities/technologies. In view of this objective, it would be beneficial if more and more people tried to use it. In this spirit, the more examples and documentation there are, the better.
In the same vein, I have spent some time understanding how to precisely define input XML files (it is the input format of the code I use -- yes, I am quite lucky) to use "all" your possible solvers. So I think it would be good to put an "exhaustive" database of XML (and/or JSON) files in your examples folder to help new users.
Lastly, I have a specific question. I have read your article "subdomain deflation with local amg..." and I wonder which direct solver you used for AMGCL. Is it PaStiX? If yes, did you link it against an optimized BLAS library (MKL? BLIS?)? And I have the same question for the Trilinos version you used (I haven't used Trilinos, but I guess it is possible to use direct solvers with highly optimized BLAS libraries there). It is important to know this in order to judge the results you mention in your article adequately.
Ok, so far so good, but I still have a lot of work to do with AMGCL! Completing my benchmarks with the segregated flow solver (easy now), then playing with your Navier-Stokes block preconditioner (I begin next week, and it will take time to do cleanly -- say February. Here I am quite confident that AMGCL will give me a great improvement in global CPU time) and, lastly, trying to see if AMGCL can help me with the very stiff problem of elasticity in a distributed parallel framework (I plan this topic for March -- it is the most difficult by far). In the end, as I have said, I will write a report that I will send you, in case you want to put it in your library package to help/interest other users.
As I have said, since I find the C++ patterns of your library nice, it would be a pleasure for me to help you develop the library, in the hope that many other people will use and improve it. In this spirit, since I have no competence in linear solvers, I would like to offer my help on the development of the "examples" part. After reflection, I could write a lightweight framework which provides:
- a reader for a tetrahedral mesh file and a builder of the associated (P1) matrix pattern. As a format, I would like to use the MEDIT format. It is a quite simple format that has several advantages for our purpose:
- the open source mesher Gmsh can read/write this format, so a user could easily build some meshes to test AMGCL
- I know it quite well!
- a builder providing the matrix values arising from the P1 discretization of the following problems:
- Stokes problem
- linear isotropic elasticity
- linear convection / diffusion scalar equation
- ==> the advantage of these restricted choices is that they need a minimal set of input parameters (which agrees quite well with the examples you already provide) but are still quite representative of a lot of problems. So they could interest some users.
- a Paraview writer (mesh and solution)
Please let me know if these things are of interest to you.
If yes, I will do them. Ok, I need to be precise --> since I have a job which takes (unfortunately too much of!) my time, it will take me a while to do the work I described cleanly. I think I could finish it in maybe 3 or 4 months, not before. Is that good for you?
To end this post, some little things I have found during my study of your library:
- in amgcl/mpi/coarsening/pmis.hpp, at line 436, my compiler (gcc 7.2.0) doesn't like the use of <undone> (a problem with templates, as usual ;). I need to use:

      int UNDONE = this->undone;
      std::vector<ptrdiff_t> rem_state(Sp.recv.count(), UNDONE);

  ==> I guess it is specific to my compiler, but maybe some other users will have the same problem.
- I like the profiler<mpi_aggregator>, but I cannot use it the way I use profiler<clock> (i.e. declaring extern profiler<> prof), because at construction mpi_aggregator uses MPI_COMM_WORLD to get the rank/size. The problem is that if I declare a global profiler<mpi_aggregator> somewhere in my code, some compilers may build this object before main, and hence before the MPI_Init that many people (myself included!) classically call in main. In the end, the application crashes because mpi_aggregator tries to call MPI_Comm_rank before MPI_Init is launched. Ok, to remedy the problem, I have delayed the construction of all the counters until first use (it is done inside the current() function). It works fine for me, but it is not so clean. I think we could slightly change the mpi_aggregator implementation to let the external user give it the MPI communicator (like you do for the solvers).
- during my benchmarks, I have found that the inner_product (the default one of backend::builtin) is (in sequential / non-OMP mode) 3 times slower than the classical implementation. So I would suggest a complete distinction between the OMP implementation and the sequential one (and, for the latter, using a classical loop implementation or an STL one). I think this can matter, since for many stiff problems the inner product can take more than 33% of the total time.
- I think a good thing for users would be to provide a mechanism to output the residuals during a resolution. Two things are interesting:
- the classical "residual versus iteration"
- the (less) classical "spatial" residual (say, at some sampled iterations). It is not natural from the "algebraic" linear solver point of view, but quite natural & useful if you think of the people who link the library into a solver where they have the notion of an underlying mesh (so a visualization route).
- ==> well, the idea seems interesting (to me at least!) but I need to go further and maybe propose a practical implementation to you.
Sincerely,
Thomas.
Hi Denis,
thanks for your commits.
I haven't been able to check everything because I'm at home and I didn't install MPI on my laptop ;)
I will do the job tomorrow at office. At first glance:
- for pmis::undone, it should clearly work!
- for the profiler, your suggestion to use a global function with a static local variable is the right one, I guess. I will try it together with the new constructor you provide for the mpi_aggregator. Let me give you some news on this later, but it should indeed be good.
In the same spirit, note that I have found it useful to add two compile flags to AMGCL:
- AMGCL_HAS_MPI --> if defined, then in <amgcl/util.hpp> I switch the declaration of

      namespace amgcl { extern profiler<> prof; }

  to

      #include <amgcl/perf_counter/mpi_aggregator.hpp>
      namespace amgcl { extern profiler<perf_counter::mpi_aggregator<perf_counter::clock, false>> prof; }
- AMGCL_INTSIZE32 --> in <amgcl/util.hpp>, I add:

      #ifdef AMGCL_INTSIZE32
      #  define Ptrdiff_t int
      #else
      #  define Ptrdiff_t ptrdiff_t
      #endif
Accordingly, with < find . -type f -exec sed -i "s/ptrdiff_t/Ptrdiff_t/g" {} \; >, I changed all the code, and it works fine for me. Mmm, maybe it is not clean, but it is quite useful, because many codes (mine, but also PaStiX, Scotch, OpenFOAM...) allow choosing the integer representation. So I think it would be good if amgcl did it too; otherwise, I think users will be obliged to recompile all their sources. Do you think it could be useful?
Next is the "problem" of the inner product. I have seen your commit, but from my point of view it doesn't solve the problem. To explain why I think this, I am sending you a "dot.png" picture which is the result of a benchmark using several ways to compute the inner product. The picture corresponds to a benchmark done on my laptop (Intel i5-4310, 2 GHz). Note that the results are really similar on several other (Intel) processors. Tomorrow, I could send you some benchmark results from several other CPUs. I have also attached a main.cpp & run.sh which should let you form your own opinion. Note that if you find quite different results, I would be interested to discuss it, because I have been doing this kind of benchmark for a long time on my different CPUs and I always find the same behavior.
Next are your comments on the results I found when comparing my code with yours. First, I'm quite sure that my runs don't use OpenMP (I manage the whole environment of the cluster myself). But you are right, my timings were strange! I think I know why now... In fact, I think I made a mistake between tol and abstol when defining the settings of amgcl (my code only has <tol>, and running your library I unfortunately used abstol). Ok, I will rerun the benchmarks tomorrow and send you updated results ;);) Thanks for pointing out the strangeness of the results.
Last but not least, I find it interesting to discuss your following comments:

> Things I would also try:
> * Replace AMG(ILU0) with AMG(SPAI0), just to see if this will be able to converge. If it does, SPAI0 should be faster to set up, and has better parallel scalability.
> * It looks like you have a non-scalar problem (three velocity components in case of 3D flow). I would try to see if replacing `backend<double>` with `backend<amgcl::static_matrix<double, 3, 3>>` will improve performance (number of iterations should not change too much, but each iteration should be cheaper due to improved cache locality). An example can be found here:
Ok, first, concerning SPAI0: I have tested it, but for my problem it slows down the global CPU time, since the problem is not so easy to converge. Nevertheless, I have a remark on it. In my code, I have the possibility to "couple" diagonal preconditioning with ILU(k). Something like a "composite" preconditioning. Do you think it would be possible to do such a thing with amgcl?
Now, the most interesting part for me is the second one. Let me explain why. As I have said, I often work (for high Reynolds flows) with a segregated solver, which roughly speaking means that I solve two parts:
- the first one (group A) refers to all the variables except the pressure. It means: 1- the 3 components of the velocity (note we don't need to couple them for high Reynolds flows), 2- all the densities (not just one, in fact, because we can think about multiphase flows) and 3- all the remaining scalars (they can be some passive scalars, some mass fractions of chemical species, and so on). In the end, it is quite common to have around 10 scalar unknowns in this part. Note also that groups 1 and 3 share the same structure, advection/diffusion/source, whereas the density group is simple: pure advection.
- the second one (group B) is the equation on the pressure. Ok, something like a pure Poisson problem.
Classically, all the group A problems converge easily (something like 10 iterations with basic GMRES/ILU0), whereas the pressure problem is quite difficult to converge (it is quite common to have some hundreds of iterations with GMRES/ILU1).
With your AMG solver, I'm now much faster on the pressure, so the main bottleneck is group A, and I have a strong interest in speeding this group up. Your idea of using static_matrix is quite good, but I think we could do a little better. What I have in mind is to write static_diagonal_matrix and static_spherical_matrix types. With these types, I think we could solve all our scalar problems of group A in an "optimal" fashion. Note that the static_diagonal_matrix would apply to groups A.1 and A.3 (whose variables don't share the same matrix), whereas the static_spherical_matrix would be used for the densities (which all share the same velocity, so the same matrix -- only the rhs differs). I hope it is a good idea. What do you think about it? And from a strict implementation point of view, if I have understood the code well, it is not difficult. I think it is sufficient to do a kind of copy/paste from static_matrix. Do you think so too?
Do you see anything else? And do you think that in the AMG framework it is still useful to support such a refinement of the ILU(k) family (if yes, I could see with my friend how to do it)?
Otherwise, to close with the other details, a few comments on your last mail.
* For the inner_product, I didn't suggest using an optimized BLAS kernel or anything else. In fact, I just wondered why you need the compensated trick:

      for (c = s = 0, ptrdiff_t i...) {
          d = x(i) * y(i) - c; // subtract the compensation from the new term
          t = s + d;
          c = (t - s) - d;     // low-order bits lost in the addition
          s = t;
      }
      return s;

  instead of the simpler version (twice as fast for a scalar value_type):

      for (s = 0, ptrdiff_t i...) { s += x(i) * y(i); }

  Now I also understand that when using AMGCL it is of no concern anyway.
* For the ptrdiff_t, I can be fair. What happened is that I'm not comfortable with ptrdiff_t, which I have seldom used in my life. When I tried to use AMGCL for the first time, knowing that my code uses int* ptr, int* col for the CRS matrix storage, and that I tried to use the zero_copy adapter to gain a little time (I began with the sequential case), the code didn't compile, because the zero_copy adapter explicitly requires that my CRS index type has the size of ptrdiff_t. Knowing also that it could lead to failures if I tried to use an int* instead of a ptrdiff_t*, I was afraid and chose my Ptrdiff_t ~ int trick. I quite agree that it is dirty, but I was not sure I could do otherwise. So, to close this discussion, I just have one question: what is the best and safest way -- with the library "as it is" -- to deal with my "int* ptr, int* col" storage? I just want to be sure there won't be "compatibility" problems...
* Last, for the use of schur_pressure_correction to handle my "multi-valued" case: the idea is quite nice, but I'm not sure I'm able to do it. When you speak about "recursion", I'm not sure how I would do it. Do I need to "modify" the schur_pressure_correction class to handle multiple (template) solvers? It seems a little technical from a strict C++ point of view, no? Or maybe you had something else in mind?
Great thanks...
Thomas.
Hi Denis,
For the ILU(k) preconditioner, as my friend explained to me, his implementation is not really a "distributed" one (no inter-process communication). He simply uses the information from the remote part to handle a little more coupling between the domains. It seems easy to do (well, I can hope, at least ;). Ok, we will do the job together, and we will give you feedback/sources if it is useful (my friend says it is the case, but I will try to find good cases demonstrating this fact).
Sincerely.

Ok, a lot of work to do on my side. Let me give you some news when I have more results... (I haven't forgotten the implementation of some mesh functionalities & associated problems that could be good for the examples part.)
For the schur_pressure_correction, ok, I understand what you mean now. Modern C++ features are quite useful & beautiful... I will ask a coworker to deal with this topic, making a comparison with the cruder approach of using static_matrix or something related.
For the ptrdiff_t, ok, it is clearer, and I can safely give up my so-beautiful find & sed trick ;)
For the Kahan summation, I understand better what you have done now. I had strictly never heard about this trick before. I'm a little ashamed, since I have been doing numerical analysis for more than 10 years! Well, in fact, it is quite good, since it is always great to learn something new ;) Thanks for it.