./solver -A A.mtx -f b.mtx -p solver.type=lgmres precond.relax.type=ilu0 precond.coarsening.type=smoothed_aggregation

Solver
======
Type:             LGMRES(30,3)
Unknowns:         160000
Memory footprint: 46.40 M

Preconditioner
==============
Number of levels:    4
Operator complexity: 1.34
Grid complexity:     1.19
Memory footprint:    59.39 M

level     unknowns       nonzeros      memory
---------------------------------------------
    0       160000         798400     46.62 M (74.69%)
    1        26800         239728     11.31 M (22.43%)
    2         3016          27090      1.29 M ( 2.53%)
    3          366           3686    174.93 K ( 0.34%)

Iterations: 11
Error:      4.15313e-09

[Profile:     1.290 s] (100.00%)
[  self:      0.002 s] (  0.13%)
[  reading:   0.879 s] ( 68.13%)
[  setup:     0.124 s] (  9.61%)
[  solve:     0.285 s] ( 22.13%)
Hello Denis,
thanks for your reply. It is quite nice that you take the time ;)
Today, I have worked on the strategy I described yesterday:
- managing my numbering to put my "external vertices" at the end of my containers -- which are just std::vector
- playing with vector.resize to "hide" my external vertices before passing all my data structures (i.e. my rhs and my val/col/ptr) to AMGCL (I am lucky that my code uses CRS, so I can use the zero_copy adapter).
- after solving the linear system with AMGCL, I use vector.resize again to "show" my external vertices, and I use my own communication pattern to fetch the updated values.
I haven't fully validated it, but my first tests seem to work quite well, so I will try to finish this approach this week.
If it fails, I will go back to your strategy. It would be better for me if my strategy really works, since it would enable me to use my code and, in this way, to build a meaningful benchmark between my linear solvers and yours. By meaningful, I mean that I can use all my physical models. I plan to test the performance on 3 main solvers:
- a high Reynolds incompressible fluid solver based on projection/correction method with a FEFV discretization
- a low Reynolds incompressible flow solver based on coupled U/p method with a stabilized FE discretization
- a linear elasticity model / pure FE
I will focus on "industrial" test cases, meaning meshes of around 1e6 to 2e7 vertices, highly unstructured and anisotropic.
If you are interested, I could try to write a small (informal) report that you could put on your GitHub.
I find this objective interesting because, beyond the power of the technologies (for which I don't have any valuable input to give), I found the C++ architecture of your library quite neat and efficient (it is the main reason why I chose it). I guess such an architecture could easily be extended in the future to add new functionalities/technologies. In view of this objective, it would be beneficial if more and more people tried to use it. In this spirit, the more examples and documentation there are, the better.
In the same vein, I have spent some time understanding how to precisely define input XML files (it is the input format of the code I use -- yes, I am quite lucky) to use "all" your possible solvers. So I think it would be good to put an "exhaustive" database of XML (and/or JSON) files in your examples folder to help new users.
Lastly, I have a specific question. I have read your article "subdomain deflation with local amg..." and I wonder which direct solver you used for AMGCL. Is it PaStiX? If yes, did you link it against an optimized BLAS library (MKL? BLIS?)? And I have the same question for the Trilinos version you used (I haven't used Trilinos, but I guess it is possible to use direct solvers with highly optimized BLAS libraries there). It is important to know this in order to judge the results you mention in your article adequately.
Ok, so far so good, but I still have a lot of work to do with AMGCL! Completing my benchmarks with the segregated flow solver (easy now), then playing with your Navier-Stokes block preconditioner (I begin next week, and it will take time to do cleanly -- say February. Here I am quite confident that AMGCL will give me a great improvement in global CPU time) and, lastly, trying to see if AMGCL can help me with the very stiff problem of elasticity in a distributed parallel framework (I plan this topic for March -- it is the most difficult by far). In the end, as I have said, I will write a report that I will send you, in case you want to put it in your library package to help/interest other users.
As I have said, since I find the C++ patterns of your library nice, it would be a pleasure for me to help you develop the library, in the hope that many other people will use and improve it. In this spirit, since I have no competence in linear solvers, I would like to offer my help on the development of the "examples" part. After reflection, I could write a lightweight framework which provides:
- a reader for a tetrahedral mesh file and a builder of the associated (P1) matrix pattern. As a format, I would like to use the MEDIT format. It is a quite simple format that has several advantages for our purpose:
- the open source mesher Gmsh can read/write this format, so a user could easily build some meshes to test AMGCL
- I know it quite well!
- a builder providing the matrix values arising from the P1 discretization of the following problems:
- Stokes problem
- linear isotropic elasticity
- linear convection / diffusion scalar equation
- ==> the advantage of these restricted choices is that they need a minimal set of input parameters (which agrees quite well with the examples you already provide) but are still quite representative of a lot of problems. So they could interest some users.
- a Paraview writer (mesh and solution)
Please let me know if these things are of interest to you.
If yes, I will do them. Ok, I need to be precise --> since I have a job which takes (unfortunately too much of!) my time, it will take me a while to do the work I described cleanly. I think I could finish it in maybe 3 or 4 months, not before. Is that good for you?
To end this post, some little things I have found during my study of your library:
- in amgcl/mpi/coarsening/pmis.hpp, at line 436, my compiler (gcc 7.2.0) doesn't like the use of <undone> (a problem with templates, as usual ;). I need to use:

      int UNDONE = this->undone;
      std::vector<ptrdiff_t> rem_state(Sp.recv.count(), UNDONE);

  ==> I guess it is specific to my compiler, but maybe some other users will have the same problem.
- I like the profiler<mpi_aggregator>, but I cannot use it the way I use profiler<clock> (i.e. declaring extern profiler<> prof), because at construction mpi_aggregator uses MPI_COMM_WORLD to get the rank/size. The problem is that if I declare a global profiler<mpi_aggregator> somewhere in my code, some compilers may build this object before main, and hence before the MPI_Init that many people (myself included!) classically call in main. In the end, the application crashes because mpi_aggregator tries to call MPI_Comm_rank before MPI_Init is launched. Ok, to remedy the problem, I have delayed the construction of all the counters until first use (it is done inside the current() function). It works fine for me, but it is not so clean. I think we could slightly change the mpi_aggregator implementation to let the external user give it the MPI communicator (like you do for the solvers).
- during my benchmarks, I have found that the inner_product (the default one of backend::builtin) is (in sequential / non-OMP mode) 3 times slower than the classical implementation. So I would suggest a complete distinction between the OMP implementation and the sequential one (and, for the latter, using a classical loop implementation or an STL one). I think this can matter, since for many stiff problems the inner product can take more than 33% of the total time.
- I think a good thing for users would be to provide a mechanism to output the residuals during a resolution. Two things are interesting:
- the classical "residual versus iteration"
- the (less) classical "spatial" residual (say, at some sampled iterations). It is not natural from the "algebraic" linear solver point of view, but quite natural & useful if you think of the people who link the library into a solver where they have the notion of an underlying mesh (so a visualization route).
- ==> well, the idea seems interesting (to me at least!) but I need to go further and maybe propose a practical implementation to you.
Sincerely,
Thomas.
Hi Denis,
thanks for your commits.
I haven't been able to check everything because I'm at home and I didn't install MPI on my laptop ;)
I will do the job tomorrow at office. At first glance:
- for pmis::undone, it should clearly work!
- for the profiler, your suggestion to use a global function with a static local variable is the right one, I guess. I will try it together with the new constructor you provide for the mpi_aggregator. Let me give you some news on this later, but it should indeed be good.
In the same spirit, note that I have found it useful to add two compile flags to AMGCL:
- AMGCL_HAS_MPI --> if defined, then in <amgcl/util.hpp> I switch the declaration of

      namespace amgcl { extern profiler<> prof; }

  to

      #include <amgcl/perf_counter/mpi_aggregator.hpp>
      namespace amgcl { extern profiler<perf_counter::mpi_aggregator<perf_counter::clock, false>> prof; }
- AMGCL_INTSIZE32 --> in <amgcl/util.hpp>, I add:

      #ifdef AMGCL_INTSIZE32
      #  define Ptrdiff_t int
      #else
      #  define Ptrdiff_t ptrdiff_t
      #endif
Accordingly, with < find . -type f -exec sed -i "s/ptrdiff_t/Ptrdiff_t/g" {} \; >, I changed all the code, and it works fine for me. Mmm, maybe it is not clean, but it is quite useful, because many codes (mine, but also PaStiX, Scotch, OpenFOAM...) allow choosing the integer representation. So I think it would be good if amgcl did it too; otherwise, I think users will be obliged to recompile all their sources. Do you think it could be useful?
Next is the "problem" of the inner product. I have seen your commit, but from my point of view it doesn't solve the problem. To explain why I think this, I am sending you a "dot.png" picture which is the result of a benchmark using several ways to compute the inner product. The picture corresponds to a benchmark done on my laptop (Intel i5-4310, 2 GHz). Note that the results are really similar on several other (Intel) processors. Tomorrow, I could send you some benchmark results from several other CPUs. I have also attached a main.cpp & run.sh which should let you form your own opinion. Note that if you find quite different results, I would be interested to discuss it, because I have been doing this kind of benchmark for a long time on my different CPUs and I always find the same behavior.
Next are your comments on the results I found when comparing my code with yours. First, I'm quite sure that my runs don't use OpenMP (I manage the whole environment of the cluster myself). But you are right, my timings were strange! I think I know why now... In fact, I think I made a mistake between tol and abstol when defining the settings of amgcl (my code only has <tol>, and running your library I unfortunately used abstol). Ok, I will rerun the benchmarks tomorrow and send you updated results ;);) Thanks for pointing out the strangeness of the results.
Last but not least, I find it interesting to discuss your following comments:

> Things I would also try:
> * Replace AMG(ILU0) with AMG(SPAI0), just to see if this will be able to converge. If it does, SPAI0 should be faster to set up, and has better parallel scalability.
> * It looks like you have a non-scalar problem (three velocity components in case of 3D flow). I would try to see if replacing `backend<double>` with `backend<amgcl::static_matrix<double, 3, 3>>` will improve performance (number of iterations should not change too much, but each iteration should be cheaper due to improved cache locality). An example can be found here:
Ok, first, concerning SPAI0: I have tested it, but for my problem it slows down the global CPU time, since the problem is not so easy to converge. Nevertheless, I have a remark on it. In my code, I have the possibility to "couple" diagonal preconditioning with ILU(k). Something like a "composite" preconditioning. Do you think it would be possible to do such a thing with amgcl?
Now, the most interesting part for me is the second one. Let me explain why. As I have said, I often work (for high Reynolds flows) with a segregated solver, which roughly speaking means that I solve two parts:
- the first one (group A) refers to all the variables except the pressure. It means: 1- the 3 components of the velocity (note we don't need to couple them for high Reynolds flows), 2- all the densities (not just one, in fact, because we can think about multiphase flows) and 3- all the remaining scalars (they can be some passive scalars, some mass fractions of chemical species, and so on). In the end, it is quite common to have around 10 scalar unknowns in this part. Note also that groups 1 and 3 share the same structure, advection/diffusion/source, whereas the density group is simple: pure advection.
- the second one (group B) is the equation on the pressure. Ok, something like a pure Poisson problem.
Classically, all the group A problems converge easily (something like 10 iterations with basic GMRES/ILU0), whereas the pressure problem is quite difficult to converge (it is quite common to have some hundreds of iterations with GMRES/ILU1).
With your AMG solver, I'm now much faster on the pressure, so the main bottleneck is group A, and I have a strong interest in speeding this group up. Your idea of using static_matrix is quite good, but I think we could do a little better. What I have in mind is to write static_diagonal_matrix and static_spherical_matrix types. With these types, I think we could solve all our scalar problems of group A in an "optimal" fashion. Note that the static_diagonal_matrix would apply to groups A.1 and A.3 (whose variables don't share the same matrix), whereas the static_spherical_matrix would be used for the densities (which all share the same velocity, so the same matrix -- only the rhs differs). I hope it is a good idea. What do you think about it? And from a strict implementation point of view, if I have understood the code well, it is not difficult. I think it is sufficient to do a kind of copy/paste from static_matrix. Do you think so too?
Do you see anything else? And do you think that in the AMG framework it is still useful to support such a refinement of the ILU(k) family (if yes, I could see with my friend how to do it)?
Otherwise, to close with the other details, a few comments on your last mail.
* For the inner_product, I didn't suggest using an optimized BLAS kernel or anything else. In fact, I just wondered why you need the compensated trick:

      for (c = s = 0, ptrdiff_t i...) {
          d = x(i) * y(i) - c; // subtract the compensation from the new term
          t = s + d;
          c = (t - s) - d;     // low-order bits lost in the addition
          s = t;
      }
      return s;

  instead of the simpler version (twice as fast for a scalar value_type):

      for (s = 0, ptrdiff_t i...) { s += x(i) * y(i); }

  Now I also understand that when using AMGCL it is of no concern anyway.
* For the ptrdiff_t, I can be fair. What happened is that I'm not comfortable with ptrdiff_t, which I have seldom used in my life. When I tried to use AMGCL for the first time, knowing that my code uses int* ptr, int* col for the CRS matrix storage, and that I tried to use the zero_copy adapter to gain a little time (I began with the sequential case), the code didn't compile, because the zero_copy adapter explicitly requires that my CRS index type has the size of ptrdiff_t. Knowing also that it could lead to failures if I tried to use an int* instead of a ptrdiff_t*, I was afraid and chose my Ptrdiff_t ~ int trick. I quite agree that it is dirty, but I was not sure I could do otherwise. So, to close this discussion, I just have one question: what is the best and safest way -- with the library "as it is" -- to deal with my "int* ptr, int* col" storage? I just want to be sure there won't be "compatibility" problems...
* Last, for the use of schur_pressure_correction to handle my "multi-valued" case: the idea is quite nice, but I'm not sure I'm able to do it. When you speak about "recursion", I'm not sure how I would do it. Do I need to "modify" the schur_pressure_correction class to handle multiple (template) solvers? It seems a little technical from a strict C++ point of view, no? Or maybe you had something else in mind?
Great thanks...
Thomas.
Hi Denis,
For the ILU(k) preconditioner, as my friend explained to me, his implementation is not really a "distributed" one (no inter-process communication). He simply uses the information from the remote part to handle a little more coupling between the domains. It seems easy to do (well, I can hope, at least ;). Ok, we will do the job together, and we will give you feedback/sources if it is useful (my friend says it is the case, but I will try to find good cases demonstrating this fact).
Sincerely.

Ok, a lot of work to do on my side. Let me give you some news when I have more results... (I haven't forgotten the implementation of some mesh functionalities & associated problems that could be good for the examples part.)
For the schur_pressure_correction, ok, I understand what you mean now. Modern C++ features are quite useful & beautiful... I will ask a coworker to deal with this topic, making a comparison with the cruder approach of using static_matrix or something related.
For the ptrdiff_t, ok, it is clearer, and I can safely give up my so-beautiful find & sed trick ;)
For the Kahan summation, I understand better what you have done now. I had strictly never heard about this trick before. I'm a little ashamed, since I have been doing numerical analysis for more than 10 years! Well, in fact, it is quite good, since it is always great to learn something new ;) Thanks for it.