I've added some profiling to the vexcl matrix construction with the following patch (applied to the vexcl source code):
This results in the following profile:
solver_vexcl_cuda -n 128
1. NVIDIA GeForce GTX 1050 Ti
Solver
======
Type: BiCGStab
Unknowns: 2097152
Memory footprint: 112.00 M
Preconditioner
==============
Number of levels: 4
Operator complexity: 1.62
Grid complexity: 1.13
Memory footprint: 744.74 M
level     unknowns       nonzeros      memory
---------------------------------------------
    0      2097152       14581760    553.22 M (61.61%)
    1       263552        7918340    168.88 M (33.46%)
    2        16128        1114704     20.01 M ( 4.71%)
    3          789          53055      2.62 M ( 0.22%)
Iterations: 10
Error: 2.50965e-09
[Profile:                2.557 s] (100.00%)
[ self:                  0.158 s] (  6.16%)
[  assembling:           0.115 s] (  4.48%)
[  setup:                1.593 s] ( 62.29%)
[   self:                0.065 s] (  2.54%)
[    CSR copy:           0.048 s] (  1.89%)
[    coarse operator:    0.436 s] ( 17.06%)
[    coarsest level:     0.031 s] (  1.20%)
[    move to backend:    0.611 s] ( 23.88%)
[     self:              0.055 s] (  2.14%)
[      convert:          0.349 s] ( 13.65%)
[      memcpy:           0.207 s] (  8.09%)
[    relaxation:         0.028 s] (  1.09%)
[    transfer operators: 0.374 s] ( 14.63%)
[     self:              0.116 s] (  4.53%)
[      aggregates:       0.110 s] (  4.32%)
[      smoothing:        0.131 s] (  5.14%)
[      tentative:        0.016 s] (  0.64%)
[  solve:                0.692 s] ( 27.06%)
[    axpby:              0.027 s] (  1.07%)
[    axpbypcz:           0.023 s] (  0.89%)
[    clear:              0.033 s] (  1.29%)
[    coarse:             0.266 s] ( 10.39%)
[    copy:               0.003 s] (  0.13%)
[    inner_product:      0.263 s] ( 10.30%)
[    relax:              0.013 s] (  0.49%)
[      residual:         0.000 s] (  0.02%)
[      vmul:             0.012 s] (  0.47%)
[    residual:           0.035 s] (  1.35%)
[    spmv:               0.029 s] (  1.14%)
The "move to backend" operation here takes 0.611 / 1.593 ≈ 38% of the setup, and the memcpy part alone is 0.207 / 1.593 ≈ 13%.
The rest of "move to backend" is mostly the conversion from CSR to ELL format (the "convert" entry, 0.349 s). ELL is optimized for spmv performance on a GPU, but it takes some time to construct from CSR, as you can see here. The following patch changes the vexcl matrix format used in amgcl from vex::sparse::matrix (defaults to ELL on a GPU) to vex::sparse::csr:
The result is:
solver_vexcl_cuda -n 128
1. NVIDIA GeForce GTX 1050 Ti
Solver
======
Type: BiCGStab
Unknowns: 2097152
Memory footprint: 112.00 M
Preconditioner
==============
Number of levels: 4
Operator complexity: 1.62
Grid complexity: 1.13
Memory footprint: 744.74 M
level     unknowns       nonzeros      memory
---------------------------------------------
    0      2097152       14581760    553.22 M (61.61%)
    1       263552        7918340    168.88 M (33.46%)
    2        16128        1114704     20.01 M ( 4.71%)
    3          789          53055      2.62 M ( 0.22%)
Iterations: 10
Error: 2.50965e-09
[Profile:                2.479 s] (100.00%)
[ self:                  0.100 s] (  4.04%)
[  assembling:           0.116 s] (  4.69%)
[  setup:                1.115 s] ( 44.97%)
[   self:                0.053 s] (  2.14%)
[    CSR copy:           0.044 s] (  1.77%)
[    coarse operator:    0.429 s] ( 17.28%)
[    coarsest level:     0.031 s] (  1.23%)
[    move to backend:    0.207 s] (  8.37%)
[    relaxation:         0.033 s] (  1.33%)
[    transfer operators: 0.318 s] ( 12.84%)
[     self:              0.108 s] (  4.36%)
[      aggregates:       0.099 s] (  4.01%)
[      smoothing:        0.100 s] (  4.05%)
[      tentative:        0.011 s] (  0.42%)
[  solve:                1.148 s] ( 46.31%)
[    axpby:              0.000 s] (  0.01%)
[    axpbypcz:           0.001 s] (  0.03%)
[    clear:              0.001 s] (  0.02%)
[    coarse:             0.649 s] ( 26.16%)
[    copy:               0.007 s] (  0.28%)
[    inner_product:      0.474 s] ( 19.13%)
[    relax:              0.007 s] (  0.29%)
[      residual:         0.000 s] (  0.02%)
[      vmul:             0.007 s] (  0.27%)
[    residual:           0.000 s] (  0.01%)
[    spmv:               0.009 s] (  0.37%)
The "move to backend" step in the setup now costs the same as the "memcpy" entry of "move to backend" in the previous profile (0.207 s), and the solve step takes about 66% more time (1.148 s vs. 0.692 s).
The overall solution time (setup + solve) is almost equal for both cases, 2.285 s with ELL vs. 2.263 s with CSR, because the Poisson problem converges so fast. But since setup time is much more important for you, this could benefit your application.
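To make the setup cost of ELL a bit more concrete, here is a minimal host-side sketch of a CSR to ELL conversion. This is not the vexcl implementation: the `Csr` and `Ell` structs and the `csr_to_ell` and `ell_spmv` functions are hypothetical names made up for illustration, and the layout is a plain padded ELL with no special handling of long rows. It only shows that ELL needs an extra pass over the matrix plus padded storage, whereas keeping CSR amounts to copying three arrays.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Plain CSR storage (illustrative, not a vexcl type).
struct Csr {
    std::size_t rows;
    std::vector<std::size_t> ptr; // row pointers, size rows + 1
    std::vector<int>         col; // column indices, size nnz
    std::vector<double>      val; // values, size nnz
};

// Simplified ELL storage (illustrative, not a vexcl type).
// Every row is padded to the same width and entries are stored
// slot-major, so that entries of neighbouring rows are adjacent in memory.
struct Ell {
    std::size_t rows;
    std::size_t width;       // maximum number of nonzeros in a row
    std::vector<int>    col; // size rows * width, -1 marks padding
    std::vector<double> val; // size rows * width, 0.0 in padding
};

Ell csr_to_ell(const Csr &A) {
    assert(A.ptr.size() == A.rows + 1);

    // Pass 1: find the widest row.
    std::size_t width = 0;
    for (std::size_t i = 0; i < A.rows; ++i)
        width = std::max(width, A.ptr[i + 1] - A.ptr[i]);

    // Pass 2: allocate padded storage and scatter the CSR entries into it.
    Ell E{A.rows, width,
          std::vector<int>(A.rows * width, -1),
          std::vector<double>(A.rows * width, 0.0)};

    for (std::size_t i = 0; i < A.rows; ++i) {
        std::size_t slot = 0;
        for (std::size_t j = A.ptr[i]; j < A.ptr[i + 1]; ++j, ++slot) {
            E.col[slot * A.rows + i] = A.col[j];
            E.val[slot * A.rows + i] = A.val[j];
        }
    }
    return E;
}

// Reference y = E * x on the host, skipping padded slots.
std::vector<double> ell_spmv(const Ell &E, const std::vector<double> &x) {
    std::vector<double> y(E.rows, 0.0);
    for (std::size_t slot = 0; slot < E.width; ++slot) {
        for (std::size_t i = 0; i < E.rows; ++i) {
            int c = E.col[slot * E.rows + i];
            if (c >= 0) y[i] += E.val[slot * E.rows + i] * x[c];
        }
    }
    return y;
}
```

For a 3x3 matrix with row lengths 2, 1 and 3, `csr_to_ell` pads every row to width 3, so 9 slots are stored for 6 nonzeros.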
Another note, on why the inner_product apparently takes so much time.
AMGCL_TOC uses the CPU clock, and the GPU kernels are launched asynchronously. Inner product returns the result to the CPU, which means it acts as a synchronization point for CUDA/OpenCL kernels.
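The effect can be reproduced without a GPU. In the sketch below, `std::async` stands in for an asynchronous kernel launch and `wait()` on the returned future stands in for the synchronizing inner product. `launch_async`, `host_seconds`, `time_without_sync` and `time_with_sync` are hypothetical helpers written for this illustration; they are not part of amgcl or vexcl.

```cpp
#include <cassert>
#include <chrono>
#include <functional>
#include <future>
#include <thread>

using host_clock = std::chrono::steady_clock;

// Stand-in for an asynchronous kernel launch: returns immediately,
// while the "kernel" (a sleep) keeps running on another thread.
std::future<void> launch_async(std::chrono::milliseconds work) {
    return std::async(std::launch::async,
                      [work] { std::this_thread::sleep_for(work); });
}

// Seconds elapsed on the host clock while f() runs.
double host_seconds(const std::function<void()> &f) {
    auto t0 = host_clock::now();
    f();
    return std::chrono::duration<double>(host_clock::now() - t0).count();
}

struct Timings {
    double launch; // time charged to the operation that launched the work
    double sync;   // time charged to the operation that waited for it
};

// Host timer stopped right after the launch: the launching operation
// looks almost free and the waiting operation absorbs the cost.
Timings time_without_sync(std::chrono::milliseconds work) {
    std::future<void> pending;
    Timings t{};
    t.launch = host_seconds([&] { pending = launch_async(work); });
    assert(pending.valid());
    t.sync = host_seconds([&] { pending.wait(); });
    return t;
}

// Synchronizing before the timer is stopped charges the cost
// to the operation that actually did the work.
double time_with_sync(std::chrono::milliseconds work) {
    return host_seconds([&] { launch_async(work).wait(); });
}
```

With a 100 ms "kernel", `time_without_sync` should report a tiny `launch` time and a `sync` time close to 100 ms, while `time_with_sync` reports the full duration for the launching operation.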
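So, unfortunately, the per-operation numbers in the solve profile are not that meaningful: the time spent in asynchronously launched kernels gets attributed to whichever operation ends up waiting for them (inner_product and coarse in the profiles above). The following (dirty) patch to amgcl/util.hpp syncs the GPU context before each call to AMGCL_TOC.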
The results make much more sense now (this is still with CSR matrix format on the GPU):
solver_vexcl_cuda -n 128
1. NVIDIA GeForce GTX 1050 Ti
Solver
======
Type: BiCGStab
Unknowns: 2097152
Memory footprint: 112.00 M
Preconditioner
==============
Number of levels: 4
Operator complexity: 1.62
Grid complexity: 1.13
Memory footprint: 744.74 M
level     unknowns       nonzeros      memory
---------------------------------------------
    0      2097152       14581760    553.22 M (61.61%)
    1       263552        7918340    168.88 M (33.46%)
    2        16128        1114704     20.01 M ( 4.71%)
    3          789          53055      2.62 M ( 0.22%)
Iterations: 10
Error: 2.50965e-09
[Profile:                2.390 s] (100.00%)
[ self:                  0.086 s] (  3.60%)
[  assembling:           0.112 s] (  4.68%)
[  setup:                1.042 s] ( 43.61%)
[   self:                0.051 s] (  2.14%)
[    CSR copy:           0.037 s] (  1.54%)
[    coarse operator:    0.386 s] ( 16.14%)
[    coarsest level:     0.030 s] (  1.27%)
[    move to backend:    0.207 s] (  8.67%)
[    relaxation:         0.026 s] (  1.07%)
[    transfer operators: 0.306 s] ( 12.78%)
[     self:              0.102 s] (  4.26%)
[      aggregates:       0.094 s] (  3.92%)
[      smoothing:        0.101 s] (  4.24%)
[      tentative:        0.009 s] (  0.37%)
[  solve:                1.150 s] ( 48.11%)
[    axpby:              0.010 s] (  0.44%)
[    axpbypcz:           0.017 s] (  0.71%)
[    clear:              0.004 s] (  0.19%)
[    coarse:             0.006 s] (  0.25%)
[    copy:               0.001 s] (  0.04%)
[    inner_product:      0.018 s] (  0.76%)
[    relax:              0.519 s] ( 21.74%)
[      residual:         0.488 s] ( 20.40%)
[      vmul:             0.032 s] (  1.33%)
[    residual:           0.252 s] ( 10.53%)
[    spmv:               0.321 s] ( 13.45%)