Solving unconventional algebraic systems with AMGCL


C B

Jul 4, 2021, 4:12:55 PM7/4/21
to amgcl
Hi Denis and Everyone,

I continue solving large sparse SPD systems where I only need a small residual reduction (tol = 0.05), and the solution is required for a very large number of time steps. I am therefore not using the full power of AMG, just BiCGStab with a diagonal preconditioner.

Below are timings obtained with the following options for all solvers:

./solver*.exe -B --matrix matrices/mat_bin_006.bin --rhs matrices/rhs_bin_006.bin -1 -p solver.tol=0.05 

I would like to ask why the "inner product" step is so expensive on the GPUs. For comparison, the top results are solution times on the CPUs and the ones below are on the GPUs. The left side of the snapshot shows results from an old i7 CPU with an old Quadro GPU, and the right side from a newer CPU with a newer GPU.

[image.png: snapshot of solver timings, CPUs on top and GPUs below; old i7 + Quadro on the left, newer CPU + GPU on the right]

Comparing "inner product" on the CPUs vs. the GPUs, we see that on both computers it takes about 5x more time on the GPU. Another thing that is interesting to me is that on both computers the "spmv" time is much lower than the "inner product" time, and that on the old GPU the "spmv" time is remarkably low. Another thing that surprised me is that on the new computer (right) the "setup" time is much larger than on the "old" computer (left).

Your insights will be much appreciated.
Cheers,



Denis Demidov

Jul 5, 2021, 12:45:50 AM7/5/21
to amgcl
Hi Carl,

On Sunday, July 4, 2021 at 11:12:55 PM UTC+3 cebau...@gmail.com wrote:

When comparing "inner product" on the CPUs vs. GPUs we see that on both computers it takes 5x more time on the GPU.

Inner product is basically a reduction, which is an expensive operation on a GPU. It may take several kernels to compute, or, as in the VexCL implementation, a single kernel followed by the transfer of a small vector of partial results over the PCI bus, with the computation finalized on the CPU. A pipelined implementation of the iterative solver could probably help here, but is not available in amgcl yet.
Another thing that is interesting to me is that on both computers the time for "spmv" is much lower than the "inner product" time,
and that on the old GPU the "spmv" time is remarkably low.
Another important thing that surprised me is that on the new computer (right) the "setup" time is much larger than on the "old" computer (left).

The reason is probably the same one you found in the case of the full AMG: slow transfer of the data to GPU memory. The profile for the single-level relaxation lacks this information, but the matrix and, in the case of SPAI0, the diagonal vector still need to be copied to GPU memory.

C B

Jul 5, 2021, 10:28:11 AM7/5/21
to Denis Demidov, amgcl
Hi Denis,
Thank you so much for your explanation, and for sharing the papers. The authors of the second paper wrote that they implemented it on top of PETSc; I wonder if they know that AMGCL is way better and faster than PETSc?

As you indicated, copying the matrix and the diagonal preconditioner takes time. In my case I solve very similar matrices many times; they only change in the last few rows/columns as they grow in each iteration. What about keeping pinned memory chunks on the GPU for the three vectors defining the matrix, with one "setup" call that performs the initial allocation and an "update" call that only transfers the changed parts of the three matrix vectors? In theory this would reduce the copying to the fraction of the memory that was updated, which in my case is less than 0.1%.
What do you think about this? Would it be useful for reducing "setup" time, or is copying only a small part of "setup"?

Thanks for your advice,
I wish I could take a class with you!
Cheers,



Denis Demidov

Jul 5, 2021, 11:21:47 AM7/5/21
to amgcl
On Monday, July 5, 2021 at 5:28:11 PM UTC+3 cebau...@gmail.com wrote:

I've added some profiling to the vexcl matrix construction, shown in this patch (applied to the vexcl source code):

This results in the following profile:

solver_vexcl_cuda -n 128
1. NVIDIA GeForce GTX 1050 Ti

Solver
======
Type:             BiCGStab
Unknowns:         2097152
Memory footprint: 112.00 M

Preconditioner
==============
Number of levels:    4
Operator complexity: 1.62
Grid complexity:     1.13
Memory footprint:    744.74 M

level     unknowns       nonzeros      memory
---------------------------------------------
    0      2097152       14581760    553.22 M (61.61%)
    1       263552        7918340    168.88 M (33.46%)
    2        16128        1114704     20.01 M ( 4.71%)
    3          789          53055      2.62 M ( 0.22%)

Iterations: 10
Error:      2.50965e-09

[Profile:                    2.557 s] (100.00%)
[ self:                      0.158 s] (  6.16%)
[  assembling:               0.115 s] (  4.48%)
[  setup:                    1.593 s] ( 62.29%)
[   self:                    0.065 s] (  2.54%)
[    CSR copy:               0.048 s] (  1.89%)
[    coarse operator:        0.436 s] ( 17.06%)
[    coarsest level:         0.031 s] (  1.20%)
[    move to backend:        0.611 s] ( 23.88%)
[     self:                  0.055 s] (  2.14%)
[      convert:              0.349 s] ( 13.65%)
[      memcpy:               0.207 s] (  8.09%)
[    relaxation:             0.028 s] (  1.09%)
[    transfer operators:     0.374 s] ( 14.63%)
[     self:                  0.116 s] (  4.53%)
[      aggregates:           0.110 s] (  4.32%)
[      smoothing:            0.131 s] (  5.14%)
[      tentative:            0.016 s] (  0.64%)
[  solve:                    0.692 s] ( 27.06%)
[    axpby:                  0.027 s] (  1.07%)
[    axpbypcz:               0.023 s] (  0.89%)
[    clear:                  0.033 s] (  1.29%)
[    coarse:                 0.266 s] ( 10.39%)
[    copy:                   0.003 s] (  0.13%)
[    inner_product:          0.263 s] ( 10.30%)
[    relax:                  0.013 s] (  0.49%)
[      residual:             0.000 s] (  0.02%)
[      vmul:                 0.012 s] (  0.47%)
[    residual:               0.035 s] (  1.35%)
[    spmv:                   0.029 s] (  1.14%)

The "move to backend" operation here takes 0.611 / 1.593 ≈ 38% of the setup, and the memcpy itself is 0.207 / 1.593 ≈ 13%.
The rest is the conversion from CSR to the ELL format. ELL is optimized for spmv performance on a GPU, but, as you can see here, it takes some time to construct from CSR. The following patch changes the vexcl matrix format used in amgcl from vex::sparse::matrix (which defaults to ELL on a GPU) to vex::sparse::csr:

The result with the CSR format is:

solver_vexcl_cuda -n 128
1. NVIDIA GeForce GTX 1050 Ti

Solver
======
Type:             BiCGStab
Unknowns:         2097152
Memory footprint: 112.00 M

Preconditioner
==============
Number of levels:    4
Operator complexity: 1.62
Grid complexity:     1.13
Memory footprint:    744.74 M

level     unknowns       nonzeros      memory
---------------------------------------------
    0      2097152       14581760    553.22 M (61.61%)
    1       263552        7918340    168.88 M (33.46%)
    2        16128        1114704     20.01 M ( 4.71%)
    3          789          53055      2.62 M ( 0.22%)

Iterations: 10
Error:      2.50965e-09

[Profile:                    2.479 s] (100.00%)
[ self:                      0.100 s] (  4.04%)
[  assembling:               0.116 s] (  4.69%)
[  setup:                    1.115 s] ( 44.97%)
[   self:                    0.053 s] (  2.14%)
[    CSR copy:               0.044 s] (  1.77%)
[    coarse operator:        0.429 s] ( 17.28%)
[    coarsest level:         0.031 s] (  1.23%)
[    move to backend:        0.207 s] (  8.37%)
[    relaxation:             0.033 s] (  1.33%)
[    transfer operators:     0.318 s] ( 12.84%)
[     self:                  0.108 s] (  4.36%)
[      aggregates:           0.099 s] (  4.01%)
[      smoothing:            0.100 s] (  4.05%)
[      tentative:            0.011 s] (  0.42%)
[  solve:                    1.148 s] ( 46.31%)
[    axpby:                  0.000 s] (  0.01%)
[    axpbypcz:               0.001 s] (  0.03%)
[    clear:                  0.001 s] (  0.02%)
[    coarse:                 0.649 s] ( 26.16%)
[    copy:                   0.007 s] (  0.28%)
[    inner_product:          0.474 s] ( 19.13%)
[    relax:                  0.007 s] (  0.29%)
[      residual:             0.000 s] (  0.02%)
[      vmul:                 0.007 s] (  0.27%)
[    residual:               0.000 s] (  0.01%)
[    spmv:                   0.009 s] (  0.37%)

The "move to backend" step of the setup now takes only as long as the "memcpy" sub-step of the previous profile (0.207 s), while the solution step takes about 65% more time.
The overall solution time (setup + solve) is almost equal in both cases (since the Poisson problem converges so fast), but since setup time is much more important for you, this could benefit your application.


Another note on why the inner_product apparently takes so much time: AMGCL_TOC uses the CPU clock, while the GPU kernels are launched asynchronously. The inner product returns its result to the CPU, so it acts as a synchronization point for the CUDA/OpenCL kernels and gets billed for the time spent in every kernel queued before it.
The numbers shown in the solve profile are therefore, unfortunately, not that meaningful. The following (dirty) patch to amgcl/util.hpp syncs the GPU context before each call to AMGCL_TOC.

The results make much more sense now (this is still with CSR matrix format on the GPU):

solver_vexcl_cuda -n 128
1. NVIDIA GeForce GTX 1050 Ti

Solver
======
Type:             BiCGStab
Unknowns:         2097152
Memory footprint: 112.00 M

Preconditioner
==============
Number of levels:    4
Operator complexity: 1.62
Grid complexity:     1.13
Memory footprint:    744.74 M

level     unknowns       nonzeros      memory
---------------------------------------------
    0      2097152       14581760    553.22 M (61.61%)
    1       263552        7918340    168.88 M (33.46%)
    2        16128        1114704     20.01 M ( 4.71%)
    3          789          53055      2.62 M ( 0.22%)

Iterations: 10
Error:      2.50965e-09

[Profile:                    2.390 s] (100.00%)
[ self:                      0.086 s] (  3.60%)
[  assembling:               0.112 s] (  4.68%)
[  setup:                    1.042 s] ( 43.61%)
[   self:                    0.051 s] (  2.14%)
[    CSR copy:               0.037 s] (  1.54%)
[    coarse operator:        0.386 s] ( 16.14%)
[    coarsest level:         0.030 s] (  1.27%)
[    move to backend:        0.207 s] (  8.67%)
[    relaxation:             0.026 s] (  1.07%)
[    transfer operators:     0.306 s] ( 12.78%)
[     self:                  0.102 s] (  4.26%)
[      aggregates:           0.094 s] (  3.92%)
[      smoothing:            0.101 s] (  4.24%)
[      tentative:            0.009 s] (  0.37%)
[  solve:                    1.150 s] ( 48.11%)
[    axpby:                  0.010 s] (  0.44%)
[    axpbypcz:               0.017 s] (  0.71%)
[    clear:                  0.004 s] (  0.19%)
[    coarse:                 0.006 s] (  0.25%)
[    copy:                   0.001 s] (  0.04%)
[    inner_product:          0.018 s] (  0.76%)
[    relax:                  0.519 s] ( 21.74%)
[      residual:             0.488 s] ( 20.40%)
[      vmul:                 0.032 s] (  1.33%)
[    residual:               0.252 s] ( 10.53%)
[    spmv:                   0.321 s] ( 13.45%)