Hello Daniel and Everyone:
After spending quite a bit of time working with AMGCL,
I came to the realization that perhaps I am not using the best strategy.
Background info/usage:
1) Transient (time-dependent) Poisson-type problem (7-point stencil) on a uniform hexahedral grid in a non-cubic, non-uniform domain.
2) The matrix keeps growing: at each time step it contains the previous time step's matrix plus a small increase in size (< 0.1%).
3) Only an approximate solution is needed, so a small residual decrease is enough (relative error < 0.1 is OK). With typical multigrid (4 levels) the setup time is on average 5 times larger than the solution time (usually 2 iterations are enough).
Observations:
1) Right out of the box, AMGCL is faster than PETSc, Trilinos, and Hypre.
2) Still, I would like to reduce the wall time even more :)
3) I tried skipping the multigrid setup by using precond.max_levels = 1, and it converges ~ 80% of the time in fewer than 10 iterations. When it does not converge, I switch to precond.max_levels = 4 (as in the snippet below) and it converges in ~ 2 iterations. With this strategy I am saving ~ 25-30% of the solution time. That is good, but since I do not know AMGCL well, perhaps there is more that can be done? Please let me know. I am including the code snippet for this strategy below.
4) Not using MG (precond.max_levels = 1; at least I think this means I am not using MG) still leads to good convergence if I allow a few more iterations.
I bet it would always converge (the matrices are symmetric positive definite). This makes me think the problem is perhaps ideal for a GPU,
because I read in the AMGCL docs that the solution phase on GPUs is very fast. And because for symmetric systems we only need to add
the new rows, the GPU could keep the same matrix, and we would only append the new rows for each new iteration/solution. What do you think?
Would this be a good approach on typical low-end GPUs with ~ 3 GB of memory?
(The matrix alone takes less than 500 MB of memory, half of that if we could use symmetry.) Actually, storing only the lower (or upper) triangular
part would be ideal, because then the coefficient matrix only needs to be "appended" with the new rows; otherwise it is more complicated...
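To make the "append only the new rows" idea concrete, here is a minimal sketch of what I have in mind. It is plain C++ and does not use AMGCL: LowerCsr, append_row and multiply are hypothetical names of my own, and I do not know whether any AMGCL backend accepts triangular storage. It only shows that with lower-triangular CSR storage the existing rows are never touched when the system grows.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical container (not part of AMGCL): stores only the lower triangle,
// including the diagonal, of a symmetric matrix in CSR format.
struct LowerCsr {
    std::vector<std::ptrdiff_t> ptr{0};  // row pointers, starts with a single 0
    std::vector<std::ptrdiff_t> col;     // column indices
    std::vector<double> val;             // nonzero values

    std::size_t rows() const { return ptr.size() - 1; }

    // Append one new row at the bottom. Every column index must be <= the new
    // row index, so previously stored rows are never modified.
    void append_row(const std::vector<std::ptrdiff_t>& cols,
                    const std::vector<double>& vals) {
        assert(cols.size() == vals.size());
        const std::ptrdiff_t i = static_cast<std::ptrdiff_t>(rows());
        for (std::size_t k = 0; k < cols.size(); ++k) {
            assert(cols[k] >= 0 && cols[k] <= i);
            col.push_back(cols[k]);
            val.push_back(vals[k]);
        }
        ptr.push_back(static_cast<std::ptrdiff_t>(col.size()));
    }

    // y = A * x, using the stored lower triangle plus symmetry.
    std::vector<double> multiply(const std::vector<double>& x) const {
        assert(x.size() == rows());
        std::vector<double> y(rows(), 0.0);
        for (std::size_t i = 0; i < rows(); ++i) {
            for (std::ptrdiff_t k = ptr[i]; k < ptr[i + 1]; ++k) {
                const std::size_t j = static_cast<std::size_t>(col[k]);
                y[i] += val[k] * x[j];
                if (j != i) y[j] += val[k] * x[i];  // mirrored upper entry
            }
        }
        return y;
    }
};
```

The catch is visible in multiply: the mirrored update writes to y[j] for other rows, so a parallel (OpenMP or GPU) version would need atomics or a different traversal. That may cancel part of the memory saving.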
Here is a small code snippet illustrating my current usage:
typedef amgcl::backend::builtin<double> PrecBackend; // float did not save Wall time with OMP backend
typedef amgcl::backend::builtin<double> SolvBackend;

typedef amgcl::make_solver<
    amgcl::amg<
        PrecBackend,
        amgcl::coarsening::smoothed_aggregation,
        amgcl::relaxation::spai0
        >,
    amgcl::solver::bicgstab<SolvBackend>
    > Solver;

Solver::params prm;
prm.precond.max_levels = 1; // this is for the first attempt, not setting up multigrid (I think)
prm.solver.maxiter = 10;
prm.solver.tol = 0.1;

solveagain:
    amgcl::profiler<> prof;

    prof.tic("setup");
    Solver solve(std::tie(n, ptr, col, val), prm);
    double time_setup = prof.toc("setup");

    prof.tic("solve");
    std::tie(iters, error) = solve(rhs, x);
    double time_solve = prof.toc("solve");

    if( error > prm.solver.tol ){
        if( prm.precond.max_levels < 2 ){
            prm.precond.max_levels = 4;
            prm.solver.maxiter = 30; // never needs more than 2-3 with AMG
            std::fill( x.begin(), x.end(), 0.0 );
            std::copy( RHS, RHS + neqns, rhs.begin());
            goto solveagain;
        }
        // no convergence, handle error, but this never happens ...
    }
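For readers who prefer to see the control flow without the goto, the same two-stage idea can be sketched as a small helper. This is only an illustration under my own assumptions: TwoStageResult and solve_two_stage are hypothetical names, and `attempt` stands for any callable that builds a solver with the given max_levels and maxiter, solves into x, and returns (iters, error) the way the Solver above does.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <tuple>
#include <vector>

// Outcome of the two-stage strategy (hypothetical helper type).
struct TwoStageResult {
    std::size_t iters;   // iterations used by the attempt that was kept
    double error;        // relative residual reported by that attempt
    bool used_fallback;  // true if the multigrid attempt was needed
};

// attempt(max_levels, maxiter, x) must return std::tuple<std::size_t, double>.
template <class Attempt>
TwoStageResult solve_two_stage(Attempt&& attempt, std::vector<double>& x, double tol) {
    assert(tol > 0.0);
    std::size_t iters = 0;
    double error = 0.0;

    // Stage 1: single level (no hierarchy to build), at most 10 iterations.
    std::tie(iters, error) = attempt(1u, 10u, x);
    if (error <= tol) return {iters, error, false};

    // Stage 2: restart from a zero initial guess with the 4-level hierarchy.
    std::fill(x.begin(), x.end(), 0.0);
    std::tie(iters, error) = attempt(4u, 30u, x);
    return {iters, error, true};
}
```

In my case the callable would wrap the Solver construction and the solve(rhs, x) call from the snippet above. The caller still has to check the returned error against tol to catch the (so far never observed) case where even the multigrid attempt fails.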
Please let me know what you think. In my tests, both with and without the MG preconditioner, bicgstab is always better than cg and gmres.
I tried allowing more than 10 iterations so that it always converges without MG, but that does not reduce the wall time: after ~ 10 iterations it is better to set up the MG preconditioner.
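The trade-off I am describing can be written as a rough cost model. All symbols here are my own notation, not measured values: p is the fraction of solves where the single-level attempt converges (about 0.8 for me), k_1 the average iteration count when it does, m the iteration cap (10), c_1 the cost of one single-level iteration, T_s the multigrid setup time, and T_mg the multigrid solve time (T_s is about 5 T_mg in my runs). The single-level setup cost is neglected.

```latex
\begin{align*}
E[T_{\text{two-stage}}] &= p\,k_1 c_1 + (1-p)\,\bigl(m\,c_1 + T_s + T_{mg}\bigr), \\
T_{\text{always-MG}}    &= T_s + T_{mg}, \\
E[T_{\text{two-stage}}] < T_{\text{always-MG}}
  &\iff p\,k_1 c_1 + (1-p)\,m\,c_1 < p\,\bigl(T_s + T_{mg}\bigr).
\end{align*}
```

Raising m increases the left-hand side for every failed attempt, which is consistent with what I see: beyond ~ 10 iterations the extra single-level work costs more than it saves.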
Would a preconditioner without MG have a good chance on a typical GPU such as a GTX 1060 with 3 GB, if I can update only the new rows of the matrix?
Any other ideas on what I could change to reduce the wall time on a typical PC, with or without a GPU?
Thanks for your advice/feedback !
Regards,