Solving unconventional algebraic systems with AMGCL

29 views

Skip to first unread message

C B

unread,

Jul 4, 2021, 4:12:55 PM7/4/21

to amgcl

Hi Denis and Everyone,

I continue solving large sparse SPD systems where I only need a small residual reduction (0.05),

and the solution is required for a very large number of time steps,

therefore I am not using the power of AMG, but just bicgstab with a diagonal preconditioner.

Below are timings obtained with the following options for all solvers:

./solver*.exe -B --matrix matrices/mat_bin_006.bin --rhs matrices/rhs_bin_006.bin -1 -p solver.tol=0.05

I would like to ask you why the "inner product" step is so expensive on the GPUs,

and for comparison the top results are solution times on the CPUs, and below on the GPUs.

On the (left) side of the snapshot are the results from an old i7 CPU with an old Quadro GPU and on the (right) for a newer CPU with a newer GPU.

When comparing "inner product" on the CPUs vs. GPUs we see that on both computers it takes 5x more time on the GPU.

Another thing that is interesting to me is that on both computers the time for "spmv" is much lower than the "inner product" time,

and that on the old GPU the "spmv" time is soo low.

Another important thing that surprised me is that on the new computer (right) the "setup" time is much larger than on the "old" computer (left).

Your insights will be much appreciated.

Cheers,

Denis Demidov

unread,

Jul 5, 2021, 12:45:50 AM7/5/21

to amgcl

Hi Carl,

On Sunday, July 4, 2021 at 11:12:55 PM UTC+3 cebau...@gmail.com wrote:

Hi Denis and Everyone,

I continue solving large sparse SPD systems where I only need a small residual reduction (0.05),
and the solution is required for a very large number of time steps,
therefore I am not using the power of AMG, but just bicgstab with a diagonal preconditioner.

Below are timings obtained with the following options for all solvers:

./solver*.exe -B --matrix matrices/mat_bin_006.bin --rhs matrices/rhs_bin_006.bin -1 -p solver.tol=0.05

I would like to ask you why the "inner product" step is so expensive on the GPUs,
and for comparison the top results are solution times on the CPUs, and below on the GPUs.
On the (left) side of the snapshot are the results from an old i7 CPU with an old Quadro GPU and on the (right) for a newer CPU with a newer GPU.

When comparing "inner product" on the CPUs vs. GPUs we see that on both computers it takes 5x more time on the GPU.

Inner product is basically a reduction operation, which is an expensive operation on a GPU. It may take several kernels to compute, or, as in the case of VexCL implementation, it runs a single kernel followed by the transfer of a small resulting vector over the PCI bus. The computation is then finalized on the CPU. A pipelined implementation of the iterative solver could probably help here, but is not available in amgcl yet:

https://arxiv.org/abs/1801.04728

https://arxiv.org/abs/1612.01395

Another thing that is interesting to me is that on both computers the time for "spmv" is much lower than the "inner product" time,
and that on the old GPU the "spmv" time is soo low.
Another important thing that surprised me is that on the new computer (right) the "setup" time is much larger than on the "old" computer (left).

The reason is probably the same one you found in the case of full amg: slow transfer of the data to the GPU memory. The profile for the single-level relaxation lacks this information, but the matrix and, in the case of SPAI0, the diagonal vector, still need to be copied to the GPU memory.

C B

unread,

Jul 5, 2021, 10:28:11 AM7/5/21

to Denis Demidov, amgcl

Hi Denis,

Thank you so much for your explanation, and for sharing the papers,

the authors of the second paper wrote that they implemented it on Petsc, I wonder if they know that AMGCL is way better and faster than Petsc?....

As you indicated that copying the matrix and diagonal preconditioner takes time, and in the context of solving very similar matrices many times,

matrices which only change in the last few row/cols as they grow in each iteration,

what about keeping pinned memory chunks on the GPU for the three vectors defining the matrix,

and having one "setup" function call to initiate this memory allocation, and then an "update" call to update the matrix which will only

take "updates" to the 3 matrix vectors. In theory this would reduce the copying to the fraction of the memory size updated,

which in my case is less than 0.1 %.

What do you think about this?, would this be useful to reduce "setup" time, or copying is a small part of "setup" ?

Thanks for you advice,

I wish I could take a class with you!

Cheers,

--
You received this message because you are subscribed to the Google Groups "amgcl" group.
To unsubscribe from this group and stop receiving emails from it, send an email to amgcl+un...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/amgcl/b9214531-4af9-4173-adab-900a220ebae2n%40googlegroups.com.

Denis Demidov

unread,

Jul 5, 2021, 11:21:47 AM7/5/21

to amgcl

On Monday, July 5, 2021 at 5:28:11 PM UTC+3 cebau...@gmail.com wrote:

Hi Denis,
Thank you so much for your explanation, and for sharing the papers,
the authors of the second paper wrote that they implemented it on Petsc, I wonder if they know that AMGCL is way better and faster than Petsc?....

As you indicated that copying the matrix and diagonal preconditioner takes time, and in the context of solving very similar matrices many times,
matrices which only change in the last few row/cols as they grow in each iteration,
what about keeping pinned memory chunks on the GPU for the three vectors defining the matrix,
and having one "setup" function call to initiate this memory allocation, and then an "update" call to update the matrix which will only
take "updates" to the 3 matrix vectors. In theory this would reduce the copying to the fraction of the memory size updated,
which in my case is less than 0.1 %.
What do you think about this?, would this be useful to reduce "setup" time, or copying is a small part of "setup" ?

I've added some profiling to the vexcl matrix construction shown in this patch (applied to the vexcl source code):

https://gist.github.com/ddemidov/452f2708766d61c42925b847d98fda00

This results in the following profile:

solver_vexcl_cuda -n 128

1. NVIDIA GeForce GTX 1050 Ti

Solver

======

Type: BiCGStab

Unknowns: 2097152

Memory footprint: 112.00 M

Preconditioner

==============

Number of levels: 4

Operator complexity: 1.62

Grid complexity: 1.13

Memory footprint: 744.74 M

level unknowns nonzeros memory

---------------------------------------------

0 2097152 14581760 553.22 M (61.61%)

1 263552 7918340 168.88 M (33.46%)

2 16128 1114704 20.01 M ( 4.71%)

3 789 53055 2.62 M ( 0.22%)

Iterations: 10

Error: 2.50965e-09

[Profile: 2.557 s] (100.00%)

[ self: 0.158 s] ( 6.16%)

[ assembling: 0.115 s] ( 4.48%)

[ setup: 1.593 s] ( 62.29%)

[ self: 0.065 s] ( 2.54%)

[ CSR copy: 0.048 s] ( 1.89%)

[ coarse operator: 0.436 s] ( 17.06%)

[ coarsest level: 0.031 s] ( 1.20%)

[ move to backend: 0.611 s] ( 23.88%)

[ self: 0.055 s] ( 2.14%)

[ convert: 0.349 s] ( 13.65%)

[ memcpy: 0.207 s] ( 8.09%)

[ relaxation: 0.028 s] ( 1.09%)

[ transfer operators: 0.374 s] ( 14.63%)

[ self: 0.116 s] ( 4.53%)

[ aggregates: 0.110 s] ( 4.32%)

[ smoothing: 0.131 s] ( 5.14%)

[ tentative: 0.016 s] ( 0.64%)

[ solve: 0.692 s] ( 27.06%)

[ axpby: 0.027 s] ( 1.07%)

[ axpbypcz: 0.023 s] ( 0.89%)

[ clear: 0.033 s] ( 1.29%)

[ coarse: 0.266 s] ( 10.39%)

[ copy: 0.003 s] ( 0.13%)

[ inner_product: 0.263 s] ( 10.30%)

[ relax: 0.013 s] ( 0.49%)

[ residual: 0.000 s] ( 0.02%)

[ vmul: 0.012 s] ( 0.47%)

[ residual: 0.035 s] ( 1.35%)

[ spmv: 0.029 s] ( 1.14%)

The move to backend operation here takes 0.611 / 1.593 = 40% of the setup, and the copy is 0.207 / 1.593 = 13%.

The rest is the conversion from CSR to ELL format. ELL format is optimized for spmv performance on GPU, but it takes some time to construct from CSR, as you can see here. The following patch changes the vexcl matrix format used in amgcl from vex::sparse::matrix (defaults to ELL on a GPU) to vex::sparse::csr:

https://gist.github.com/ddemidov/c68bab93d54efa1a6c491733c1451983

The result is:

solver_vexcl_cuda -n 128

1. NVIDIA GeForce GTX 1050 Ti

Solver

======

Type: BiCGStab

Unknowns: 2097152

Memory footprint: 112.00 M

Preconditioner

==============

Number of levels: 4

Operator complexity: 1.62

Grid complexity: 1.13

Memory footprint: 744.74 M

level unknowns nonzeros memory

---------------------------------------------

0 2097152 14581760 553.22 M (61.61%)

1 263552 7918340 168.88 M (33.46%)

2 16128 1114704 20.01 M ( 4.71%)

3 789 53055 2.62 M ( 0.22%)

Iterations: 10

Error: 2.50965e-09

[Profile: 2.479 s] (100.00%)

[ self: 0.100 s] ( 4.04%)

[ assembling: 0.116 s] ( 4.69%)

[ setup: 1.115 s] ( 44.97%)

[ self: 0.053 s] ( 2.14%)

[ CSR copy: 0.044 s] ( 1.77%)

[ coarse operator: 0.429 s] ( 17.28%)

[ coarsest level: 0.031 s] ( 1.23%)

[ move to backend: 0.207 s] ( 8.37%)

[ relaxation: 0.033 s] ( 1.33%)

[ transfer operators: 0.318 s] ( 12.84%)

[ self: 0.108 s] ( 4.36%)

[ aggregates: 0.099 s] ( 4.01%)

[ smoothing: 0.100 s] ( 4.05%)

[ tentative: 0.011 s] ( 0.42%)

[ solve: 1.148 s] ( 46.31%)

[ axpby: 0.000 s] ( 0.01%)

[ axpbypcz: 0.001 s] ( 0.03%)

[ clear: 0.001 s] ( 0.02%)

[ coarse: 0.649 s] ( 26.16%)

[ copy: 0.007 s] ( 0.28%)

[ inner_product: 0.474 s] ( 19.13%)

[ relax: 0.007 s] ( 0.29%)

[ residual: 0.000 s] ( 0.02%)

[ vmul: 0.007 s] ( 0.27%)

[ residual: 0.000 s] ( 0.01%)

[ spmv: 0.009 s] ( 0.37%)

The "move to backend" in the setup is now just a "move to backend/copy" from the previous profile, and the solution step takes 65% more time.

The overall solution time (setup + solve) is almost equal for both cases (since the Poisson problem converges so fast), but since setup time is much more important for you, this could benefit your application.

Another note on why the inner_product apparently takes so much. AMGCL_TOC uses the CPU clock, and the GPU kernels are launched asynchronously. Inner product returns the result to the CPU, which means it acts as a synchronization point for CUDA/OpenCL kernels.

So the numbers shown in the solve profile are not that meaningful unfortunately. The following (dirty) patch to amgcl/util.hpp syncs the GPU context before each call to AMGCL_TOC.

https://gist.github.com/ddemidov/17159d405c00a28e2a3616553704b72e

The results make much more sense now (this is still with CSR matrix format on the GPU):

solver_vexcl_cuda -n 128

1. NVIDIA GeForce GTX 1050 Ti

Solver

======

Type: BiCGStab

Unknowns: 2097152

Memory footprint: 112.00 M

Preconditioner

==============

Number of levels: 4

Operator complexity: 1.62

Grid complexity: 1.13

Memory footprint: 744.74 M

level unknowns nonzeros memory

---------------------------------------------

0 2097152 14581760 553.22 M (61.61%)

1 263552 7918340 168.88 M (33.46%)

2 16128 1114704 20.01 M ( 4.71%)

3 789 53055 2.62 M ( 0.22%)

Iterations: 10

Error: 2.50965e-09

[Profile: 2.390 s] (100.00%)

[ self: 0.086 s] ( 3.60%)

[ assembling: 0.112 s] ( 4.68%)

[ setup: 1.042 s] ( 43.61%)

[ self: 0.051 s] ( 2.14%)

[ CSR copy: 0.037 s] ( 1.54%)

[ coarse operator: 0.386 s] ( 16.14%)

[ coarsest level: 0.030 s] ( 1.27%)

[ move to backend: 0.207 s] ( 8.67%)

[ relaxation: 0.026 s] ( 1.07%)

[ transfer operators: 0.306 s] ( 12.78%)

[ self: 0.102 s] ( 4.26%)

[ aggregates: 0.094 s] ( 3.92%)

[ smoothing: 0.101 s] ( 4.24%)

[ tentative: 0.009 s] ( 0.37%)

[ solve: 1.150 s] ( 48.11%)

[ axpby: 0.010 s] ( 0.44%)

[ axpbypcz: 0.017 s] ( 0.71%)

[ clear: 0.004 s] ( 0.19%)

[ coarse: 0.006 s] ( 0.25%)

[ copy: 0.001 s] ( 0.04%)

[ inner_product: 0.018 s] ( 0.76%)

[ relax: 0.519 s] ( 21.74%)

[ residual: 0.488 s] ( 20.40%)

[ vmul: 0.032 s] ( 1.33%)

[ residual: 0.252 s] ( 10.53%)

[ spmv: 0.321 s] ( 13.45%)

Reply all

Reply to author

Forward

0 new messages