Understanding AMGCL on CPU/GPU to optimize solution strategy


C B

Apr 4, 2021, 2:23:16 PM4/4/21
to amgcl
Hello Daniel and Everyone:

After spending quite a bit of time working with AMGCL,
I came to the realization that perhaps I am not using the best strategy.

  Background info/usage:
  1) Transient (time-dependent) Poisson-type problem (7-point stencil) on a uniform hexahedral grid over a non-cubic, non-uniform domain.
  2) The matrix grows over time: it contains the previous time-step matrix plus a small increase in size (< 0.1%) per step.
  3) Only an approximate solution is required (relative error < 0.1 is OK). With typical multigrid (4 levels) the setup time is on average 5 times larger than the solution time (usually 2 iterations are enough).

  Observations:
  1) Right out of the box, AMGCL is faster than PETSc, Trilinos, and Hypre.
  2) Still, I would like to reduce the wall time even more :)
  3) I tried skipping setup (precond.max_levels = 1), and it converges ~80% of the time in fewer than 10 iterations. When it does not converge, I just switch to precond.max_levels = 2 and it converges in ~2 iterations. With this strategy I am saving ~25-30% of the solution time, which is good, but not knowing AMGCL well, perhaps there is more that can be done? Please let me know. I am including the code snippet for this strategy below.
  4) Not using MG (precond.max_levels = 1; I think I am not using MG...) leads to good convergence if I allow a few more iterations. I bet it would always converge (the matrices are symmetric positive definite). I am thinking that perhaps this is ideal to solve on a GPU, because I read in the AMGCL docs that the solution phase on GPUs is very fast. And because for symmetric systems we only need to update the new rows, the GPU could keep the same matrix, and we only add the new rows for each new iteration/solution. What do you think? Would this be a good approach on typical low-end GPUs with ~3 GB of memory? (The matrix alone takes less than 500 MB of memory, or half that if we could exploit symmetry.) Actually, using only the lower (or upper) triangular part would be ideal, because the coefficient matrix then only needs to be "appended" with the new rows; otherwise it is more complicated...

Here is the little code snippet to illustrate my current use:

typedef amgcl::backend::builtin<double> PrecBackend; // float did not save wall time with the OpenMP backend
typedef amgcl::backend::builtin<double> SolvBackend;

typedef amgcl::make_solver<
    amgcl::amg<
        PrecBackend,
        amgcl::coarsening::smoothed_aggregation,
        amgcl::relaxation::spai0
        >,
    amgcl::solver::bicgstab<SolvBackend>
    > Solver;

Solver::params prm;
prm.precond.max_levels = 1; // first attempt: do not build the multigrid hierarchy (I think)
prm.solver.maxiter = 10;
prm.solver.tol = 0.1;

solveagain:

// n, ptr, col, val hold the CSR matrix; rhs and x are the right-hand side and solution.
amgcl::profiler<> prof;
prof.tic("setup");
Solver solve(std::tie(n, ptr, col, val), prm);
double time_setup = prof.toc("setup");
prof.tic("solve");
std::tie(iters, error) = solve(rhs, x);
double time_solve = prof.toc("solve");

if (error > prm.solver.tol) {
    if (prm.precond.max_levels < 2) {
        // Fall back to a full multigrid hierarchy and retry from scratch.
        prm.precond.max_levels = 4;
        prm.solver.maxiter = 30; // never needs more than 2-3 iterations with AMG
        std::fill(x.begin(), x.end(), 0.0);
        std::copy(RHS, RHS + neqns, rhs.begin());
        goto solveagain;
    }
    // no convergence, handle error, but this never happens...
}

Please let me know what you think. In my tests, both with the MG preconditioner and without it, BiCGStab is always better than CG and GMRES.
I tried using more than 10 iterations to always make it converge without MG, but it does not reduce wall time; after ~10 iterations it is better to set up the MG preconditioner.
Would a non-MG preconditioner have a good chance on a typical GPU such as a GTX 1060 with 3 GB, if I can update only the new rows of the matrix?
Any other ideas on what I could change to reduce wall time on a typical PC, with or without a GPU?

Thanks for your advice/feedback !

    Regards,

Denis Demidov

Apr 5, 2021, 12:22:11 AM4/5/21
to amgcl
Hello Carl (and Daniel),



On Sunday, April 4, 2021 at 9:23:16 PM UTC+3 cebau...@gmail.com wrote:
Hello Daniel and Everyone:

After spending quite a bit of time working with AMGCL,
I came to the realization that perhaps I am not using the best strategy.

  Background info/usage:
  1) Transient (time-dependent) Poisson-type problem (7-point stencil) on a uniform hexahedral grid over a non-cubic, non-uniform domain.
  2) The matrix grows over time: it contains the previous time-step matrix plus a small increase in size (< 0.1%) per step.
  3) Only an approximate solution is required (relative error < 0.1 is OK). With typical multigrid (4 levels) the setup time is on average 5 times larger than the solution time (usually 2 iterations are enough).

  Observations:
  1) Right out of the box, AMGCL is faster than PETSc, Trilinos, and Hypre.
  2) Still, I would like to reduce the wall time even more :)
  3) I tried skipping setup (precond.max_levels = 1), and it converges ~80% of the time in fewer than 10 iterations. When it does not converge, I just switch to precond.max_levels = 2 and it converges in ~2 iterations. With this strategy I am saving ~25-30% of the solution time, which is good, but not knowing AMGCL well, perhaps there is more that can be done? Please let me know. I am including the code snippet for this strategy below.
  4) Not using MG (precond.max_levels = 1; I think I am not using MG...) leads to good convergence if I allow a few more iterations. I bet it would always converge (the matrices are symmetric positive definite). I am thinking that perhaps this is ideal to solve on a GPU, because I read in the AMGCL docs that the solution phase on GPUs is very fast. And because for symmetric systems we only need to update the new rows, the GPU could keep the same matrix, and we only add the new rows for each new iteration/solution. What do you think? Would this be a good approach on typical low-end GPUs with ~3 GB of memory?

Do you mean launching an additional kernel that would finalize the computation for the new rows? I think that could be disproportionately expensive
(because kernel launches are expensive on the GPU). Also, libraries like cuSPARSE tend to keep sparse matrices in opaque data structures optimized for the GPU,
so you would need to rebuild the matrix if you want to do the SpMV in a single kernel.
 
  (The matrix alone takes less than 500 MB of memory, or half that if we could exploit symmetry.) Actually, using only the lower (or upper) triangular
    part would be ideal, because the coefficient matrix then only needs to be "appended" with the new rows; otherwise it is more complicated...

Since setting max_levels to 1 yields good results for you, I would also try amgcl::relaxation::as_preconditioner instead of amgcl::amg:

something like

typedef amgcl::make_solver<
    amgcl::relaxation::as_preconditioner<PrecBackend, amgcl::relaxation::spai0>,
    amgcl::solver::bicgstab<SolvBackend>
    > Solver;

This should be even cheaper to set up, and it should behave similarly (it could take about twice as many iterations, but only
because a single AMG level does one pre-relaxation and one post-relaxation).
I think your configuration is good for a Poisson problem. Regarding CPU vs GPU and AMG vs single-level, you would need to test these on your specific problem/GPU, but here are some results for my Intel(R) Core(TM) i5-3570K CPU @ 3.40GHz vs an NVIDIA GeForce GTX 1050 Ti. The problem is 3D Poisson on an 80^3 grid, and the relative error tolerance is set to 1e-6:

CPU, full AMG:
solver -n 80 solver.tol=1e-6
Solver
======
Type:             BiCGStab
Unknowns:         512000
Memory footprint: 27.34 M

Preconditioner
==============
Number of levels:    4
Operator complexity: 1.62
Grid complexity:     1.13
Memory footprint:    180.50 M

level     unknowns       nonzeros      memory
---------------------------------------------
    0       512000        3545600    134.56 M (61.85%)
    1        64560        1903196     40.72 M (33.20%)
    2         4185         270731      4.89 M ( 4.72%)
    3          237          13051    336.19 K ( 0.23%)

Iterations: 8
Error:      4.6025e-07

[Profile:          0.825 s] (100.00%)
[ self:            0.004 s] (  0.44%)
[  assembling:     0.038 s] (  4.61%)
[  setup:          0.225 s] ( 27.30%)
[  solve:          0.558 s] ( 67.66%)

CPU, single-level:
solver -n 80 solver.tol=1e-6 -1 solver.maxiter=200
Solver
======
Type:             BiCGStab
Unknowns:         512000
Memory footprint: 27.34 M

Preconditioner
==============
Relaxation as preconditioner
  Unknowns: 512000
  Nonzeros: 3545600
  Memory:   61.91 M

Iterations: 110
Error:      9.03894e-07

[Profile:          1.935 s] (100.00%)
[ self:            0.003 s] (  0.13%)
[  assembling:     0.045 s] (  2.30%)
[  setup:          0.020 s] (  1.02%)
[  solve:          1.869 s] ( 96.54%)

GPU, full AMG:
solver_cuda -n 80 solver.tol=1e-6
GeForce GTX 1050 Ti

Solver
======
Type:             BiCGStab
Unknowns:         512000
Memory footprint: 27.34 M

Preconditioner
==============
Number of levels:    4
Operator complexity: 1.62
Grid complexity:     1.13
Memory footprint:    137.54 M

level     unknowns       nonzeros      memory
---------------------------------------------
    0       512000        3545600    102.75 M (61.85%)
    1        64560        1903196     30.78 M (33.20%)
    2         4185         270731      3.68 M ( 4.72%)
    3          237          13051    339.89 K ( 0.23%)

Iterations: 8
Error:      4.6025e-07

[Profile:          0.678 s] (100.00%)
[ self:            0.229 s] ( 33.73%)
[  assembling:     0.038 s] (  5.60%)
[  setup:          0.293 s] ( 43.24%)
[  solve:          0.118 s] ( 17.42%)

GPU, single-level:
solver_cuda -n 80 solver.tol=1e-6 -1 solver.maxiter=200
GeForce GTX 1050 Ti

Solver
======
Type:             BiCGStab
Unknowns:         512000
Memory footprint: 27.34 M

Preconditioner
==============
Relaxation as preconditioner
  Unknowns: 512000
  Nonzeros: 3545600
  Memory:   46.44 M

Iterations: 115
Error:      9.33971e-07

[Profile:          0.650 s] (100.00%)
[ self:            0.219 s] ( 33.63%)
[  assembling:     0.044 s] (  6.70%)
[  setup:          0.049 s] (  7.58%)
[  solve:          0.339 s] ( 52.08%)

So, at least for this problem, at this precision, the single-level GPU solver seems to be the optimal choice.

Denis Demidov

Apr 6, 2021, 3:31:57 AM4/6/21
to amgcl


---------- Forwarded message ---------
From: C B <cebau...@gmail.com>
Date: Tue, Apr 6, 2021 at 8:51 AM
Subject: Re: Understanding AMGCL on CPU/GPU to optimize solution strategy
To: Denis Demidov <dennis....@gmail.com>


Hello Denis !
Thank you so much for your recommendations :),

With amgcl::relaxation::as_preconditioner<PrecBackend, amgcl::relaxation::spai0> the wall time was reduced by 20% for jobs with 400k unknowns.
I am hoping that for larger cases the savings will be even larger; I will let you know...

I need to build my Win10 environment with CUSP to fully understand the timings. I thought that only copying the new rows to the GPU would save time, but I don't really know how AMGCL works with the GPU; there is a lot I need to learn :)

I don't have solver_cuda in my Win10 environment. Does your reported profile time include the wall time required to copy the matrix and RHS to the GPU, or is the copying time negligible?
I tried to figure this out myself, but I am afraid I may be missing something here:
I have a GeForce GTX 1060 3 GB with a PCI-Express 3.0 x16 interface, and I hope the motherboard supports PCIe 3.0 x16.
Version 3.x: 8 GT/s per lane
  • x1: 985 MB/s
  • x16: 15.75 GB/s
I don't know whether the effective bandwidth is close to 16 GB/s, or whether these are peak values well above what we can actually get...

In your example with ~500k unknowns, a 7-point stencil, and some other data, let's say roughly 80 bytes per unknown, that is only about 40 MB. Even at an effective bandwidth of 8 GB/s, the data transfer would take only ~5 ms. Is this correct? Is this the speed at which you can move data to your GPU?

Thanks again for your recommendations!
Cheers,




--
Cheers,
Denis

Denis Demidov

Apr 6, 2021, 3:56:55 AM4/6/21
to amgcl
Carl,

On Tuesday, April 6, 2021 at 10:31:57 AM UTC+3 Denis Demidov wrote:


---------- Forwarded message ---------
From: C B <cebau...@gmail.com>
Date: Tue, Apr 6, 2021 at 8:51 AM
Subject: Re: Understanding AMGCL on CPU/GPU to optimize solution strategy
To: Denis Demidov <dennis....@gmail.com>


Hello Denis !
Thank you so much for your recommendations :),

With amgcl::relaxation::as_preconditioner<PrecBackend, amgcl::relaxation::spai0> the wall time was reduced by 20% for jobs with 400k unknowns.
I am hoping that for larger cases the savings will be even larger; I will let you know...

It may also go the other way, because a single-level preconditioner should become less and less effective as the problem grows.
 

I need to build my Win10 environment with CUSP to fully understand the timings. I thought that only copying the new rows to the GPU would save time, but I don't really know how AMGCL works with the GPU; there is a lot I need to learn :)

I don't have solver_cuda in my Win10 environment. Does your reported profile time include the wall time required to copy the matrix and RHS to the GPU, or is the copying time negligible?

solver_cuda is built from the same source as solver; you just need to copy solver.cpp to solver.cu and build it with nvcc (and with -DSOLVER_BACKEND_CUDA). The AMGCL cmake configuration should do this for you if it finds your CUDA SDK installation.
 
I tried to figure this out myself, but I am afraid I may be missing something here:
I have a GeForce GTX 1060 3 GB with a PCI-Express 3.0 x16 interface, and I hope the motherboard supports PCIe 3.0 x16.
Version 3.x: 8 GT/s per lane
  • x1: 985 MB/s
  • x16: 15.75 GB/s
I don't know whether the effective bandwidth is close to 16 GB/s, or whether these are peak values well above what we can actually get...

Frankly, I haven't looked at these numbers for quite some time. My guess is that the effective bandwidth will be significantly lower than the theoretical one, because the kernels in AMGCL are not the most efficient, and also because many kernels are launched for a single V-cycle, so there is some overhead related to that.
 

In your example with ~ 500k unknowns, with a 7 point stencil, and some other data, lets say roughly 80 bytes per unknown, only 40 MB, even if we get an effective bandwidth of 8 GB/s the data transfer would take only 5 ms. Is this correct ?, is this the speed at which you can move data to your GPU ?

If you build the solver with -DAMGCL_PROFILING, you get a more detailed profile:

solver_cuda -n 80 solver.tol=1e-6 solver.maxiter=200
GeForce GTX 1050 Ti

Solver
======
Type:             BiCGStab
Unknowns:         512000
Memory footprint: 27.34 M

Preconditioner
==============
Number of levels:    4
Operator complexity: 1.62
Grid complexity:     1.13
Memory footprint:    137.54 M

level     unknowns       nonzeros      memory
---------------------------------------------
    0       512000        3545600    102.75 M (61.85%)
    1        64560        1903196     30.78 M (33.20%)
    2         4185         270731      3.68 M ( 4.72%)
    3          237          13051    339.89 K ( 0.23%)

Iterations: 8
Error:      4.6025e-07

[Profile:                    0.770 s] (100.00%)
[ self:                      0.254 s] ( 33.00%)
[  assembling:               0.038 s] (  4.95%)
[  setup:                    0.338 s] ( 43.85%)
[   self:                    0.023 s] (  2.96%)
[    coarse operator:        0.111 s] ( 14.36%)
[    coarsest level:         0.006 s] (  0.82%)
[    move to backend:        0.083 s] ( 10.79%)
[    relaxation:             0.009 s] (  1.22%)
[    transfer operators:     0.105 s] ( 13.70%)
[     self:                  0.036 s] (  4.68%)
[      aggregates:           0.027 s] (  3.55%)
[      smoothing:            0.035 s] (  4.60%)
[      tentative:            0.007 s] (  0.87%)
[  solve:                    0.140 s] ( 18.20%)
[    axpby:                  0.003 s] (  0.41%)
[    axpbypcz:               0.005 s] (  0.65%)
[    clear:                  0.031 s] (  4.06%)
[    coarse:                 0.001 s] (  0.10%)
[    copy:                   0.001 s] (  0.15%)
[    inner_product:          0.020 s] (  2.62%)
[    relax:                  0.073 s] (  9.46%)
[      residual:             0.017 s] (  2.23%)
[      vmul:                 0.056 s] (  7.23%)
[    residual:               0.003 s] (  0.41%)
[    spmv:                   0.002 s] (  0.32%)

The "move to backend" operation here takes about 10% of the total time (and is significantly more than 5 ms).
Note that the data is moved to the GPU not in a single batch, but as many smaller individual vectors. A CSR matrix, for example, consists of three arrays, which are copied separately.

C B

Apr 9, 2021, 1:29:14 AM4/9/21
to Denis Demidov, amgcl
Hello Denis,

Thank you so much for your insights and great examples.

As you said, the single-level preconditioner is asymptotically slower than AMG; however, for the sizes that I tested,
increasing the number of unknowns up to 5 times, I am still getting roughly the same overall savings.
This might be related to the large residual tolerance that I am using.

After some work I managed to build with CUDA on Windows.
I ran into a couple of things, which I am sharing here for others who may hit the same issues:
1) Boost 1.75 works fine with VS and the builtin backend, but not with the CUDA backend. With CUDA there is a bug in an include file that the Boost developers have found
and fixed, but I prefer to work with a stable release, so I moved to 1.74.
2) Each CUDA SDK version requires a minimum driver version; there is a long table about this in the release notes.
Most people do not run into this issue because the CUDA Toolkit installs the right driver. However, in my case CUDA 11.2 could not install the correct driver,
so solver_cuda was built but would not execute. To trap initialization errors we can use code similar to the CUDA library utilities:

cudaError_t error_id = cudaGetDeviceCount(&deviceCount);
if (error_id != cudaSuccess) { /* handle the error */ }

Then I installed CUDA 10.0 and everything was fine: no more initialization issues, and I was able to run solver_cuda without any problems.
This is very encouraging; the power of AMGCL is incredible! We can change from one backend to another with minimal changes :)

Thanks again for your help !
Cheers!

