multithreading support for large BA problems


Nikolaus Demmel

Oct 24, 2020, 3:57:21 PM
to Ceres Solver
Hi,

I'm benchmarking Ceres on some large BAL-style bundle adjustment problems, and I just want to make sure I have the configuration straight in terms of multi-threading support.

I understand how multi-threading is used for residual/Jacobian evaluation and for computing the Schur complement (SC), but for larger BA problems significant time is spent solving the reduced camera system.

I'm considering the SPARSE_SCHUR and ITERATIVE_SCHUR linear solvers with JACOBI and SCHUR_JACOBI preconditioners.

To my understanding, there is no option for a multi-threaded exact sparse solver (SPARSE_SCHUR), right? (I'm using SUITE_SPARSE as the sparse linear algebra library, but I assume it's no different with CX_SPARSE or EIGEN_SPARSE, since the docs mention that SUITE_SPARSE is strongly preferred for large BA problems.)

What about the inexact solvers (ITERATIVE_SCHUR), both the explicit and the implicit SC variant? Do they make use of multithreading in the CG iterations? It doesn't seem like it to me (except that the SCHUR_JACOBI preconditioner in the implicit SC variant uses the parallel Schur eliminator to compute the block-diagonal preconditioner), but maybe I'm missing something: there are many different variants of the sparse matrix classes, so I might be looking at the wrong code.

Best,
Niko

PS: My BLAS library is OpenBLAS, and I'm turning its multi-threading support off with OPENBLAS_NUM_THREADS=1 as suggested in the docs. This is on Ubuntu Linux.

Sameer Agarwal

Oct 25, 2020, 11:25:05 PM
to ceres-...@googlegroups.com
On Sat, Oct 24, 2020 at 12:57 PM Nikolaus Demmel <nde...@gmail.com> wrote:
Hi,

I'm benchmarking Ceres on some large BAL-style bundle adjustment problems, and I just want to make sure I have the configuration straight in terms of multi-threading support.

I understand how multi-threading is used for residual/Jacobian evaluation and for computing the Schur complement (SC), but for larger BA problems significant time is spent solving the reduced camera system.

I'm considering the SPARSE_SCHUR and ITERATIVE_SCHUR linear solvers with JACOBI and SCHUR_JACOBI preconditioners.

To my understanding, there is no option for a multi-threaded exact sparse solver (SPARSE_SCHUR), right? (I'm using SUITE_SPARSE as the sparse linear algebra library, but I assume it's no different with CX_SPARSE or EIGEN_SPARSE, since the docs mention that SUITE_SPARSE is strongly preferred for large BA problems.)

You are right. None of the sparse solvers are threaded.
 
What about the inexact solvers (ITERATIVE_SCHUR), both the explicit and the implicit SC variant?

In ITERATIVE_SCHUR, some of the preconditioners use the Schur eliminator, in which case they can use threads to compute the preconditioner; other than that, no.

Do they make use of multithreading in the CG iterations?

No.
 
It doesn't seem like it to me (except that the SCHUR_JACOBI preconditioner in the implicit SC variant uses the parallel Schur eliminator to compute the block-diagonal preconditioner), but maybe I'm missing something: there are many different variants of the sparse matrix classes, so I might be looking at the wrong code.

You have it right.

Sameer
 

Best,
Niko

PS: My BLAS library is OpenBLAS, and I'm turning its multi-threading support off with OPENBLAS_NUM_THREADS=1 as suggested in the docs. This is on Ubuntu Linux.


Nikolaus Demmel

Oct 26, 2020, 5:45:55 AM
to Ceres Solver
Hi Sameer,

thanks for your reply, this is very helpful.

One more question about the implicit SC iterative solver. If I'm not mistaken, this corresponds to the "implicit-schur" variant from the Multicore Bundle Adjustment paper (http://grail.cs.washington.edu/projects/mcba/pba.pdf). In that paper, parallel evaluation of the matrix-vector products needed in CG is discussed. Is the reason this is not implemented in Ceres some fundamental difficulty, e.g. because Ceres is general-purpose whereas the paper specifically discusses BAL-style BA problems? Or is it simply that no one has gotten around to implementing a multithreaded version in Ceres so far?

Kind regards,
Nikolaus



Sameer Agarwal

Oct 27, 2020, 9:59:32 AM
to ceres-...@googlegroups.com
On Mon, Oct 26, 2020 at 2:45 AM Nikolaus Demmel <nde...@gmail.com> wrote:
Hi Sameer,

thanks for your reply, this is very helpful.

One more question about the implicit SC iterative solver. If I'm not mistaken, this corresponds to the "implicit-schur" variant from the Multicore Bundle Adjustment paper (http://grail.cs.washington.edu/projects/mcba/pba.pdf). In that paper, parallel evaluation of the matrix-vector products needed in CG is discussed. Is the reason this is not implemented in Ceres some fundamental difficulty, e.g. because Ceres is general-purpose whereas the paper specifically discusses BAL-style BA problems? Or is it simply that no one has gotten around to implementing a multithreaded version in Ceres so far?

The short version is that I tried and failed to get good performance out of it. It likely has to do with the fact that Ceres solves a more general problem and uses a more general block sparse matrix representation than the one used in the MCBA paper. 

So there are two products to be evaluated: Jx and J'x. Since the Jacobian is stored in block-row-major order, Jx is trivial to thread. I am pretty sure that when we had OpenMP annotation-based threading, that product was threaded, but my memory fails me right now. Either way, the resulting reduction in time was small.
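To make that concrete, here is a minimal sketch of the block-row-parallel Jx product. The BlockRow/Cell types are simplified stand-ins for Ceres's internal block-sparse storage, not its actual API; each thread owns a disjoint range of block rows, so the writes to y need no synchronization.

#include <algorithm>
#include <thread>
#include <vector>

// Simplified stand-ins for a block-row-major sparse matrix
// (hypothetical types, not the real ceres::internal classes).
struct Cell {
  int col_begin;         // first scalar column of this cell's column block
  int col_size;          // number of scalar columns
  const double* values;  // row-major block of row_size x col_size entries
};
struct BlockRow {
  int row_begin;  // first scalar row of this block row
  int row_size;   // number of scalar rows
  std::vector<Cell> cells;
};

// y += J x, parallelized over block rows.
void ParallelJx(const std::vector<BlockRow>& rows, const double* x,
                double* y, int num_threads) {
  const int n = static_cast<int>(rows.size());
  const int chunk = (n + num_threads - 1) / num_threads;
  std::vector<std::thread> pool;
  for (int t = 0; t < num_threads; ++t) {
    pool.emplace_back([&, t] {
      const int begin = t * chunk;
      const int end = std::min(begin + chunk, n);
      for (int r = begin; r < end; ++r) {
        for (const Cell& cell : rows[r].cells) {
          for (int i = 0; i < rows[r].row_size; ++i) {
            double sum = 0.0;
            for (int j = 0; j < cell.col_size; ++j) {
              sum += cell.values[i * cell.col_size + j] * x[cell.col_begin + j];
            }
            y[rows[r].row_begin + i] += sum;  // this row is owned by thread t
          }
        }
      }
    });
  }
  for (std::thread& th : pool) th.join();
}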

J'x is more complicated since going row wise now requires that we scatter and gather, which also increases memory use.

It is entirely possible that there is a better way to do this; I'd be happy to brainstorm and review code. It would be nice to have the iterative part of Ceres be faster.

Sameer
 

Nikolaus Demmel

Oct 27, 2020, 6:15:16 PM
to Ceres Solver

On Tuesday, October 27, 2020 at 2:59:32 PM UTC+1, Sameer Agarwal wrote:


On Mon, Oct 26, 2020 at 2:45 AM Nikolaus Demmel <nde...@gmail.com> wrote:
Hi Sameer,

thanks for your reply, this is very helpful.

One more question about the implicit SC iterative solver. If I'm not mistaken, this corresponds to the "implicit-schur" variant from the Multicore Bundle Adjustment paper (http://grail.cs.washington.edu/projects/mcba/pba.pdf). In that paper, parallel evaluation of the matrix-vector products needed in CG is discussed. Is the reason this is not implemented in Ceres some fundamental difficulty, e.g. because Ceres is general-purpose whereas the paper specifically discusses BAL-style BA problems? Or is it simply that no one has gotten around to implementing a multithreaded version in Ceres so far?

The short version is that I tried and failed to get good performance out of it. It likely has to do with the fact that Ceres solves a more general problem and uses a more general block sparse matrix representation than the one used in the MCBA paper. 

So there are two products to be evaluated: Jx and J'x. Since the Jacobian is stored in block-row-major order, Jx is trivial to thread. I am pretty sure that when we had OpenMP annotation-based threading, that product was threaded, but my memory fails me right now. Either way, the resulting reduction in time was small.
 
Thanks for the insights. Makes sense. 


J'x is more complicated since going row wise now requires that we scatter and gather, which also increases memory use.

It is entirely possible that there is a better way to do this; I'd be happy to brainstorm and review code. It would be nice to have the iterative part of Ceres be faster.

By scatter and gather you mean being able to parallelize the product J'x over rows of J' (== columns of J)? Sounds almost like storing J a second time in compressed (block-) column form, which would indeed be a lot of additional memory. Alternatively, you could split up x and do a map-reduce over the columns of J' (== rows of J). Of course, in the reduction you would have to do additional vector additions (of the size of the reduced camera system). Not sure which one would work better, though I think the map-reduce should not require too much additional memory. I haven't really thought this through.

The explicit iterative Schur solver should be well parallelizable, since you only need Hx, assuming H is also stored in row-block-compressed form.

I agree those would be nice projects. Quite self-contained, e.g. for someone who wants to get into contributing to Ceres. At the moment there is no time, unfortunately. Maybe I should find a student to implement it and find the best parallelization strategy for implicit Schur.

Best,
Niko


 
 

Sameer Agarwal

Oct 29, 2020, 9:45:33 AM
to ceres-...@googlegroups.com
A few more comments inline.


The short version is that I tried and failed to get good performance out of it. It likely has to do with the fact that Ceres solves a more general problem and uses a more general block sparse matrix representation than the one used in the MCBA paper. 

So there are two products to be evaluated: Jx and J'x. Since the Jacobian is stored in block-row-major order, Jx is trivial to thread. I am pretty sure that when we had OpenMP annotation-based threading, that product was threaded, but my memory fails me right now. Either way, the resulting reduction in time was small.
 
Thanks for the insights. Makes sense. 


J'x is more complicated since going row wise now requires that we scatter and gather, which also increases memory use.

It is entirely possible that there is a better way to do this; I'd be happy to brainstorm and review code. It would be nice to have the iterative part of Ceres be faster.

By scatter and gather you mean being able to parallelize the product J'x over rows of J' (== columns of J)? Sounds almost like storing J a second time in compressed (block-) column form, which would indeed be a lot of additional memory. Alternatively, you could split up x and do a map-reduce over the columns of J' (== rows of J). Of course, in the reduction you would have to do additional vector additions (of the size of the reduced camera system). Not sure which one would work better, though I think the map-reduce should not require too much additional memory. I haven't really thought this through.

I mean break up J and x into chunks of rows, compute y_i = J_i' x_i on each thread, and then at the end sum up the y_i's. This requires k copies the size of the parameter vector.

The additional vector operations are not the size of the reduced camera system. They are of the size of the full parameter vector.
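To make the scatter-gather concrete, a sketch of that scheme over a simplified flat list of non-zero entries (a hypothetical representation, not Ceres's block-sparse types):

#include <algorithm>
#include <thread>
#include <vector>

struct Entry { int row, col; double value; };  // one scalar non-zero of J

// y = J' x via map-reduce: thread t accumulates into a private buffer,
// and the buffers are summed at the end. Each buffer is the size of the
// full parameter vector (num_cols), which is the extra memory cost
// mentioned above.
void ParallelJtx(const std::vector<Entry>& entries, int num_cols,
                 const double* x, double* y, int num_threads) {
  std::vector<std::vector<double>> partial(
      num_threads, std::vector<double>(num_cols, 0.0));
  const int n = static_cast<int>(entries.size());
  const int chunk = (n + num_threads - 1) / num_threads;
  std::vector<std::thread> pool;
  for (int t = 0; t < num_threads; ++t) {
    pool.emplace_back([&, t] {
      const int begin = t * chunk;
      const int end = std::min(begin + chunk, n);
      for (int i = begin; i < end; ++i) {
        // Entry (row, col, v) of J contributes v * x[row] to (J'x)[col].
        partial[t][entries[i].col] += entries[i].value * x[entries[i].row];
      }
    });
  }
  for (std::thread& th : pool) th.join();
  // Reduce: sum the k per-thread copies into y.
  std::fill(y, y + num_cols, 0.0);
  for (const std::vector<double>& p : partial) {
    for (int c = 0; c < num_cols; ++c) y[c] += p[c];
  }
}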
 
The explicit iterative Schur solver should be well parallelizable, since you only need Hx, assuming H is also stored in row-block-compressed form.

Yes, but the cost of computing H is rather high, so the zone in which H + iterative methods work well is quite narrow.
 
I agree those would be nice projects. Quite self-contained, e.g. for someone who wants to get into contributing to Ceres. At the moment there is no time, unfortunately. Maybe I should find a student to implement it and find the best parallelization strategy for implicit Schur.

That would be great, and a nice contribution to improving performance! The other would be to add support for CUDA/GPU-based linear solvers.

Sameer

 

Nikolaus Demmel

Oct 30, 2020, 5:06:31 PM
to Ceres Solver


On Thursday, October 29, 2020 at 2:45:33 PM UTC+1, Sameer Agarwal wrote:


J'x is more complicated since going row wise now requires that we scatter and gather, which also increases memory use.

It is entirely possible that there is a better way to do this; I'd be happy to brainstorm and review code. It would be nice to have the iterative part of Ceres be faster.

By scatter and gather you mean being able to parallelize the product J'x over rows of J' (== columns of J)? Sounds almost like storing J a second time in compressed (block-) column form, which would indeed be a lot of additional memory. Alternatively, you could split up x and do a map-reduce over the columns of J' (== rows of J). Of course, in the reduction you would have to do additional vector additions (of the size of the reduced camera system). Not sure which one would work better, though I think the map-reduce should not require too much additional memory. I haven't really thought this through.

I mean break up J and x into chunks of rows, compute y_i = J_i' x_i on each thread, and then at the end sum up the y_i's. This requires k copies the size of the parameter vector.

The additional vector operations are not the size of the reduced camera system. They are of the size of the full parameter vector.

That was what I was thinking of with map-reduce, but I didn't think about it being the full parameter vector. One could probably still exploit the sparsity of the blocks here in an efficient way, maybe exploiting known properties of the "landmark" part. But yeah, one needs to test and investigate.
 
 
The explicit iterative Schur solver should be well parallelizable, since you only need Hx, assuming H is also stored in row-block-compressed form.

Yes, but the cost of computing H is rather high, so the zone in which H + iterative methods work well is quite narrow.

That's a valuable insight, thanks.
 
 
I agree those would be nice projects. Quite self-contained, e.g. for someone who wants to get into contributing to Ceres. At the moment there is no time, unfortunately. Maybe I should find a student to implement it and find the best parallelization strategy for implicit Schur.

That would be great, and a nice contribution to improving performance! The other would be to add support for CUDA/GPU-based linear solvers.

Yes. There is no timeline. If someone reading this mailing list wants to jump on it, don't hold back.
 
Best,
Niko





 

dmitriy.k...@gmail.com

Nov 2, 2020, 6:16:28 AM
to Ceres Solver
> By scatter and gather you mean to be able to parallelize the product J'x over rows of J' (== columns of J)? Sounds almost like storing J a second time in compressed (block-) column form, which would indeed be a lot of additional memory.
One might store only a CompressedColumnBlockStructure corresponding to the structure of J (with value offsets pointing into the same values array of J, keeping in mind an intrinsic transpose).
This allows running parallel products for both J and J' while storing only the structure twice (the values are stored once); it also simplifies modification logic (no need to track value changes).
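Roughly, the idea in code (illustrative types, not the actual ceres::internal classes): the column-wise view stores only structure plus offsets into J's existing values array, and reading a cell through it performs the transpose implicitly.

#include <vector>

struct TransposedCell {
  int row_block;     // block row of J this cell belongs to
  int value_offset;  // offset into J's shared values array
};
struct ColumnBlock {
  int col_begin, col_size;
  std::vector<TransposedCell> cells;
};
// Built once per structure change; `values` aliases J's array and is
// never copied, so updates to J's values are picked up automatically.
struct CompressedColumnView {
  std::vector<ColumnBlock> col_blocks;
  const double* values;
};

// Contribution of one column block to y = J' x. Each column block owns
// a disjoint slice of y, so the product parallelizes over column blocks
// without write conflicts. A cell of J is stored row-major as a
// row_size x col_size block, so element (j, i) of the transposed block
// is values[value_offset + i * col_size + j].
void LeftProductForColumnBlock(const CompressedColumnView& view,
                               const std::vector<int>& row_begins,
                               const std::vector<int>& row_sizes,
                               const double* x, double* y, int cb) {
  const ColumnBlock& col_block = view.col_blocks[cb];
  for (const TransposedCell& cell : col_block.cells) {
    const int r0 = row_begins[cell.row_block];
    const int rs = row_sizes[cell.row_block];
    for (int j = 0; j < col_block.col_size; ++j) {
      double sum = 0.0;
      for (int i = 0; i < rs; ++i) {
        sum += view.values[cell.value_offset + i * col_block.col_size + j] *
               x[r0 + i];
      }
      y[col_block.col_begin + j] += sum;
    }
  }
}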

Sameer Agarwal

Nov 2, 2020, 7:21:12 AM
to ceres-...@googlegroups.com

On Mon, Nov 2, 2020, 3:16 AM dmitriy.k...@gmail.com <dmitriy.k...@gmail.com> wrote:
> By scatter and gather you mean to be able to parallelize the product J'x over rows of J' (== columns of J)? Sounds almost like storing J a second time in compressed (block-) column form, which would indeed be a lot of additional memory.
One might store only a CompressedColumnBlockStructure corresponding to the structure of J (with value offsets pointing into the same values array of J, keeping in mind an intrinsic transpose).
This allows running parallel products for both J and J' while storing only the structure twice (the values are stored once); it also simplifies modification logic (no need to track value changes).

That's a cool idea. It is worth prototyping, measuring the performance gain, and comparing it to the scatter-and-gather approach.

Sameer



Nikolaus Demmel

Nov 2, 2020, 8:36:56 AM
to Ceres Solver


On Monday, November 2, 2020 at 1:21:12 PM UTC+1, Sameer Agarwal wrote:

On Mon, Nov 2, 2020, 3:16 AM dmitriy.k...@gmail.com <dmitriy....@gmail.com> wrote:
> By scatter and gather you mean to be able to parallelize the product J'x over rows of J' (== columns of J)? Sounds almost like storing J a second time in compressed (block-) column form, which would indeed be a lot of additional memory.
One might store only a CompressedColumnBlockStructure corresponding to the structure of J (with value offsets pointing into the same values array of J, keeping in mind an intrinsic transpose).
This allows running parallel products for both J and J' while storing only the structure twice (the values are stored once); it also simplifies modification logic (no need to track value changes).

That's a cool idea. It is worth prototyping, measuring the performance gain, and comparing it to the scatter-and-gather approach.

Yes, that sounds easy to implement and should work nicely. Not sure if cache performance might be an issue when iterating over columns in that way, but maybe it's OK if the blocks aren't too small. The alternatives also have overhead, so it's definitely worth comparing them for different problem sizes.

Best,
Niko

 


Sameer Agarwal

Nov 2, 2020, 12:08:14 PM
to ceres-...@googlegroups.com
That's a cool idea. It is worth prototyping, measuring the performance gain, and comparing it to the scatter-and-gather approach.

Yes, that sounds easy to implement and should work nicely. Not sure if cache performance might be an issue when iterating over columns in that way, but maybe it's OK if the blocks aren't too small. The alternatives also have overhead, so it's definitely worth comparing them for different problem sizes.

Cache performance will matter. We store all the E blocks together for this reason.
Sameer

 

Dmitriy Korchemkin

Nov 2, 2020, 11:39:15 PM
to ceres-...@googlegroups.com
Could you suggest which problems are good for testing performance, and what software packages / datasets I might use to evaluate the performance impact without diving too deep into a particular problem?


Sameer Agarwal

Nov 3, 2020, 8:45:39 AM
to ceres-...@googlegroups.com
Dmitriy,
You can use bundle_adjuster.cc with --linear_solver iterative_schur, --preconditioner jacobi or schur_jacobi, and --ordering user.
I recommend using some of the larger problems at https://grail.cs.washington.edu/projects/bal/ 

For example

./bin/bundle_adjuster --input ../../Downloads/problem-1521-939551-pre.txt --linear_solver iterative_schur --ordering user --preconditioner jacobi
iter      cost      cost_change  |gradient|   |step|    tr_ratio  tr_radius  ls_iter  iter_time  total_time
   0  2.447219e+08    0.00e+00    3.21e+15   0.00e+00   0.00e+00  1.00e+04        0    3.02e+00    5.98e+00
   1  1.297878e+07    2.32e+08    3.36e+14   7.43e+05   9.54e-01  3.00e+04        5    6.71e+00    1.27e+01
   2  1.645178e+06    1.13e+07    2.74e+12   7.29e+05   9.96e-01  9.00e+04        8    7.23e+00    1.99e+01
   3  1.631006e+06    1.42e+04    1.96e+13   1.32e+06   1.59e-01  6.83e+04      152    4.71e+01    6.70e+01
   4  1.565355e+06    6.57e+04    1.26e+14   1.06e+06   8.40e-01  9.97e+04      207    6.48e+01    1.32e+02
   5  1.553176e+06    1.22e+04    4.10e+13   1.55e+06   8.88e-01  1.87e+05      318    9.16e+01    2.23e+02

Solver Summary (v 2.0.0-eigen-(3.3.7)-lapack-suitesparse-(5.8.0)-cxsparse-(3.2.0)-acceleratesparse-eigensparse-no_openmp)

                                     Original                  Reduced
Parameter blocks                       941072                   941072
Parameters                            2832342                  2832342
Residual blocks                       4739031                  4739031
Residuals                             9478062                  9478062

Minimizer                        TRUST_REGION
Trust region strategy     LEVENBERG_MARQUARDT

                                        Given                     Used
Linear solver                 ITERATIVE_SCHUR          ITERATIVE_SCHUR
Preconditioner                         JACOBI                   JACOBI
Threads                                     1                        1
Linear solver ordering            939551,1521              939551,1521
Schur structure                         2,3,9                    2,3,9

Cost:
Initial                          2.447219e+08
Final                            1.553176e+06
Change                           2.431687e+08

Minimizer iterations                        6
Successful steps                            6
Unsuccessful steps                          0

Time (in seconds):
Preprocessor                         2.962168

  Residual only evaluation           2.811207 (5)
  Jacobian & residual evaluation    13.141844 (6)
  Linear solver                    199.748744 (5)
Minimizer                          220.539127

Postprocessor                        0.280967
Total                              223.782263

Termination:                   NO_CONVERGENCE (Maximum number of iterations reached. Number of iterations: 5.)


If you run (on Linux/macOS with glog) with -logtostderr -v=3, you can get more detailed timing information, which will look like:


IterativeSchurComplementSolver::Solve
                                   Delta   Cumulative
                           Setup :    1.29029      1.29029
                           Solve :   43.63016     44.92045
                           Total :    0.00001     44.92046

   3  1.631006e+06    1.42e+04    1.96e+13   1.32e+06   1.59e-01  6.83e+04      152    4.86e+01    6.89e+01

Here, Solve refers to the time spent running CG, which is where the matrix-vector product between the Schur complement and the vector is evaluated.


Sameer


Dmitriy Korchemkin

Nov 13, 2020, 8:02:26 AM
to Ceres Solver
Here are some preliminary results.

a. Data
I've selected all problems with >1000 cameras from the `final` directory and several problems with >1000 cameras from the `venice` and `ladybug` directories.

b. Scenario
For each problem and implementation, the following invocation of the bundle_adjuster example was used:
bundle_adjuster --linear_solver iterative_schur --ordering user --preconditioner jacobi --num_iterations 50 --num_threads ${num_threads} --input ${problem_path}
I've decided to increase the number of iterations to 50, because in the initial iterations the number of CG iterations is typically very small.

I've used a platform with two Xeon Platinum 8176 CPUs and 2080 Ti GPUs for evaluation.
Since initial evaluation showed that running across NUMA nodes results in a significant performance drop, for all tests the process was bound to a specific set of cores on a single NUMA node.
For the GPGPU implementation, the GPU attached to that NUMA node was used.

Performance was evaluated for the num_threads parameter in {4, 8, 16, 28, 56}, with 28 being the number of cores per CPU (and 56 being the only thread count that involved hyper-threading).

c. Results
The left plot corresponds to the improvement (as a ratio) in total execution time over the current implementation; the right plot corresponds to the improvement in linear solver time over the current implementation.
Solid lines are the mean improvement; dashed lines are the 0.25 and 0.75 quantiles.


[attachment: ceres-spmv.png (the improvement-ratio plots described above)]
Mean improvements in total runtime are:
                     4         8       16       28       56
cuSPARSE     7.1982374 7.5955914 7.078499 7.582467 7.407877
parallel     0.7114466 0.9600542 1.449785 1.985489 2.234053
parallel-tbb 1.1171031 1.7885792 2.652213 3.012272 2.865478


d. Implementation details
parallel and parallel-tbb correspond to parallel execution of Jx and J'x (using the idea of a column structure for the left products); the parallel-tbb version uses TBB-based threading with some tuning of the job partitioning, while the parallel version uses C++ threads.

cuSPARSE corresponds to the SpMV products being offloaded to a single GPU (the entire ImplicitSchurComplement::RightMultiply is done on the GPU).
Since cuSPARSE does not support a sparse matrix format similar to ceres-solver's:
* The CSR structure and the permutation for value remapping are precomputed.
* On each update of J, its values are streamed in chunks from CPU memory into GPU memory with a parallel remapping (on the GPU) into CSR order.
(Note: final/problem-13682-4456117-pre.txt was not able to fit into GPU memory.)


e. Conclusions
There is some improvement, but it is not very great; in the parallel CPU version neither the left nor the right products scale well (with 16 threads the typical improvement for the right products alone is 8x, and for the left products 2.8x).

Please tell me if you find any of these results "interesting"; I plan to test the left products with parallel reduction of the results and to finalize the GPGPU variant in the next 1-2 weeks.

Nikolaus Demmel

Nov 13, 2020, 8:42:01 AM
to Ceres Solver
Hi Dmitriy,

wow, I find this very fascinating. Thanks a lot for sharing this detailed comparison!

This clearly shows how much memory/cache efficiency matters, seeing the difference in improvement between the left and right products.

I'm particularly interested in how the parallel reduction for the left products will hold up. The vectors in the reduction could be stored as dense or as block-sparse vectors. For a high number of threads, block-sparse vectors could be much more memory- and also runtime-efficient (no need to sum up blocks of zeros). For a smaller number of threads and/or smaller problems, maybe dense is better.

Another thought / alternative: instead of summing intermediate result vectors in the reduction, another option would be to still iterate over rows as in the reduction (good for cache), but to use mutexes to guard writing to blocks of a single result vector. This would save the summation and the memory cost for the vectors during the reduction. The assumption for why this could work well is that if the different threads work on sufficiently far-apart parts of the factor graph, the overlap, and thus the mutex contention for writing to the result vector, might be small. But I cannot judge the overhead of frequent lock acquisitions.

Another question is whether the speedup for a high number of threads is better for the "explicit Schur complement" solver option. So far the assumption is that implicit-SC >> explicit-SC for large problems, but if the parallelization potential is better, this assessment might change when many threads and sufficient RAM are available. For explicit SC you don't need the left products, since the (reduced) Hessian is symmetric.

Some questions:
- How do you compute "mean improvement"? For each solver you compute the ratio of total runtime to the total runtime of "current", then compute the (geometric?) mean of these ratios over all problems?
- How did you tune the job partitioning for the TBB implementation and how much difference does that make? I'm using TBB in different projects, but I've only ever used the default here. Any resources / guides that you can recommend for this would be appreciated.

Cheers,
Niko

Dmitriy Korchemkin

Nov 14, 2020, 5:37:11 AM
to Ceres Solver
> - How do you compute "mean improvement"? For each solver you compute the ratio of total runtime to the total runtime of "current", then compute the (geometric?) mean of these ratios over all problems?
For each solver and each number of threads, the ratio of the total runtime of "current" with the same number of threads (since Jacobian & residual evaluation is already parallel) to the total runtime of the implementation under test is computed; these ratios are then geometrically averaged over all datasets for a particular <implementation, number_of_threads>.

I forgot to note that for the cuSPARSE implementation 4 datasets were excluded due to the final residual being >1% off. I am still checking whether there is a problem in the implementation or in the "nature of things" (the relative norm of the difference in the right product (ISC) computed on the GPU is ~1e-10..1e-12, while for the CPU-parallel implementation it is ~1e-13..1e-15; the GPU implementation lacks tests and probably I'm just missing something).

> - How did you tune the job partitioning for the TBB implementation and how much difference does that make? I'm using TBB in different projects, but I've only ever used the default here. Any resources / guides that you can recommend for this would be appreciated.
I do not have exact numbers for the final implementation, but in the middle of the process my observations were as follows:
* Switching from C++ threads to TBB provides no improvement at all (using tbb::auto_partitioner, the default)
* Using tbb::simple_partitioner results in a ~2..5% improvement
* Using tbb::simple_partitioner and limiting the number of work pieces to num_threads * 64 provides a ~10..15% improvement over C++ threads; the 64 was selected by hand from the sequence of powers of 2 (see the sketch below)
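A sketch of that last tuning, assuming a plain blocked-range loop over block rows (process_row is a placeholder for the per-row work, not a Ceres function):

#include <algorithm>
#include <functional>
#include <tbb/blocked_range.h>
#include <tbb/parallel_for.h>

// Cap the number of work pieces at num_threads * 64 by picking the
// grain size accordingly, and use tbb::simple_partitioner so that TBB
// does not merge or re-split the chunks on its own.
void ParallelOverBlockRows(int num_block_rows, int num_threads,
                           const std::function<void(int)>& process_row) {
  const int max_pieces = num_threads * 64;
  const int grain = std::max(1, num_block_rows / max_pieces);
  tbb::parallel_for(
      tbb::blocked_range<int>(0, num_block_rows, grain),
      [&](const tbb::blocked_range<int>& range) {
        for (int r = range.begin(); r != range.end(); ++r) process_row(r);
      },
      tbb::simple_partitioner());
}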

Keep in mind that the TBB backend is unavailable in the current ceres-solver (it was removed at 16f9b34c3bfb206721b456f8962e889124d56c3e, I believe in the 1.13->1.14 transition); I believe the same changes can be made to the C++-threads backend. I just happen to have an implementation of the TBB backend for Ceres 2.0, which I use due to its easier support for nested parallel regions (these are handled automagically with TBB [if the rest of your code uses TBB], but otherwise require some careful thinking about which parts of the code are multithreaded by which library).
Moreover, I think even better results might be achieved if we divide the work based on the number of non-zeros in a concrete set of rows/columns and do no dynamic work-splitting.

>  mutexes to guard writing to blocks of a single result vector
Do you think "literal" mutexes will be better than atomic CAS? I believe x64 and ARM64 have at least atomic CAS for 64-bit values (I'm not sure about atomic addition of IEEE 754 floating point).
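For reference, the usual CAS loop for atomically adding to a double looks like this (std::atomic<double>::fetch_add only arrived in C++20, so earlier code has to spell it out); whether it beats per-block mutexes would still need measuring:

#include <atomic>

// Lock-free accumulation into a shared double. x64 and ARM64 both have
// 64-bit CAS, so std::atomic<double> is lock-free there, but under
// heavy contention the retry loop still serializes the writers.
inline void AtomicAdd(std::atomic<double>* target, double increment) {
  double expected = target->load(std::memory_order_relaxed);
  while (!target->compare_exchange_weak(expected, expected + increment,
                                        std::memory_order_relaxed)) {
    // On failure, expected has been refreshed with the current value;
    // retry with the updated sum.
  }
}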

Sameer Agarwal

Nov 16, 2020, 1:35:54 PM
to ceres-...@googlegroups.com
Thanks for sharing these Dmitriy. This lines up with what I have observed in the past.
I think the fact that the GPU versions perform so much better may be a good reason to start thinking about adding support for CUDA to Ceres Solver. It's been an ask for a while.

As for mutexes, I do not think you will get much performance out of them. The problem here is not the scattering of the product into output vectors and their summation, but rather the irregular memory access pattern of the matrix-vector product and the lookups into the block-sparse matrix.

Sameer


Sameer Agarwal

Aug 13, 2022, 4:47:10 PM
to Ceres Solver
Reviving this thread, because over the past few months Joydeep Biswas has been working on adding support for GPUs to Ceres Solver, and he has a CL in the works to add GPU support for CGNR.
Based on the very nice performance that CUDA's SpMV is showing (which lines up with what Dmitriy observed in his experiments), once Joydeep's current WIP goes in over the next few days,
I am very much interested in building upon it to add CUDA support for ITERATIVE_SCHUR also.
Dmitriy, if you or anyone else is interested in this effort, we would be delighted to have your help.
Sameer

Dmitriy Korchemkin

Aug 17, 2022, 7:42:53 AM
to Ceres Solver
Hi,


> Joydeep Biswas has been working on adding support for GPUs to ceres solver and he has a CL in the works to add GPU support for CGNR.
The current approach for CGNR works by writing the Jacobian in CSR format, followed by CSR SpMV products using cuSPARSE, right?


When I was experimenting with accelerating the iterative Schur complement 2 years ago, I used the standard block-sparse Jacobian writer with a two-stage conversion to CSR offloaded to the GPU:
- “offline” (once per Jacobian structure change): convert the block-sparse partitioned structure to two CSR structures and prepare permutations for remapping elements from the block-sparse structure to the CSR structure
- “online” (once per change of the Jacobian values): stream chunks of values in block-sparse order to the GPU, remapping them into CSR order in a pipelined manner.

This was needed to get a reasonable performance boost over parallel CPU SpMV with caching of the transposed block-sparse structure (described earlier in this thread).
Back then I hadn't tried to implement a Jacobian writer for a partitioned CSR matrix, though it would probably be reasonably efficient; my approach was driven by the idea of minimizing the amount of data transferred to the GPU (thus avoiding CSR on the CPU side) and utilizing the higher memory bandwidth of the GPU for the format conversion.
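For illustration, the “offline” stage might look roughly like this (simplified types, not the actual ceres::internal ones; it assumes the cells within each block row are sorted by block column):

#include <vector>

struct Cell { int col_begin, col_size, value_offset; };
struct BlockRow { int row_begin, row_size; std::vector<Cell> cells; };

// For every value in block-sparse order, compute its destination index
// in CSR order. The "online" stage then only applies this permutation
// to the streamed values; the structure never travels to the GPU again.
std::vector<int> BlockSparseToCsrPermutation(
    const std::vector<BlockRow>& rows,
    const std::vector<int>& csr_row_ptr) {  // one entry per scalar row + 1
  int num_values = 0;
  for (const BlockRow& r : rows) {
    for (const Cell& c : r.cells) num_values += r.row_size * c.col_size;
  }
  std::vector<int> permutation(num_values);
  // next[row] = next free slot within CSR row `row`.
  std::vector<int> next(csr_row_ptr.begin(), csr_row_ptr.end() - 1);
  for (const BlockRow& r : rows) {
    for (const Cell& c : r.cells) {  // cells in block-column order
      for (int i = 0; i < r.row_size; ++i) {
        for (int j = 0; j < c.col_size; ++j) {
          permutation[c.value_offset + i * c.col_size + j] =
              next[r.row_begin + i]++;
        }
      }
    }
  }
  return permutation;
}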

If the algorithm outlined above looks reasonable, I think I can put some effort into bringing it to the current upstream.



I am also interested in whether anyone is aware of existing code/papers on efficient block-sparse-matrix by dense-vector multiplication on the GPU (i.e. using the same storage format as ceres-solver, but comparably efficient to CSR SpMV from cuSPARSE), because the CSR format is rather expensive to store: a factor of ~1.5x for double-word indices and double-precision coefficients (8-byte values plus 4-byte column indices is 12 bytes per non-zero, vs. 8 bytes for the values alone).
While this might not be a big problem on a modern GPU with >=24 GB of memory (since we still have a limit of 2^31-1 non-zero elements, we can fit any Jacobian into 24 GB of GPU memory even in CSR format), I think that cutting down the amount of memory that needs to be traversed for a single product might improve performance (since SpMV products are typically bandwidth-limited).

Sameer Agarwal

Aug 17, 2022, 10:43:12 AM
to ceres-...@googlegroups.com
Hi Dmitriy,
My replies are inline.

On Wed, Aug 17, 2022 at 4:42 AM Dmitriy Korchemkin <dmitriy.k...@gmail.com> wrote:

> Joydeep Biswas has been working on adding support for GPUs to ceres solver and he has a CL in the works to add GPU support for CGNR.
The current approach for CGNR works by writing the Jacobian in CSR format, followed by CSR SpMV products using cuSPARSE, right?

Yes.
 
When I was experimenting with accelerating the iterative Schur complement 2 years ago, I used the standard block-sparse Jacobian writer with a two-stage conversion to CSR offloaded to the GPU:
- “offline” (once per Jacobian structure change): convert the block-sparse partitioned structure to two CSR structures and prepare permutations for remapping elements from the block-sparse structure to the CSR structure
- “online” (once per change of the Jacobian values): stream chunks of values in block-sparse order to the GPU, remapping them into CSR order in a pipelined manner.

This was needed to get a reasonable performance boost over parallel CPU SpMV with caching of the transposed block-sparse structure (described earlier in this thread).
Back then I hadn't tried to implement a Jacobian writer for a partitioned CSR matrix, though it would probably be reasonably efficient; my approach was driven by the idea of minimizing the amount of data transferred to the GPU (thus avoiding CSR on the CPU side) and utilizing the higher memory bandwidth of the GPU for the format conversion.

If the algorithm outlined above looks reasonable, I think I can put some effort into bringing it to the current upstream.

I was planning on doing something similar but without the permutation step.

1. Construct two CSR matrices for the e and f columns respectively, and have an object which updates their values arrays every time the Jacobian is updated.
2. With these two in hand, and the BlockCRSJacobi preconditioner, we can easily compute (f'f)^-1 and (e'e)^-1 (the latter is needed for the jacobi and the schur_power_series_expansion preconditioners).
3. Copy these 3/4 matrices to the GPU and implement the implicit Schur complement product and the preconditioners using them.
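As a sketch of what step 2 could look like for the block diagonal of F'F (illustrative and Eigen-based; f_blocks is a hypothetical per-parameter-block grouping of Jacobian blocks, not an actual Ceres structure):

#include <vector>
#include <Eigen/Dense>

// For each f parameter block, accumulate f'f over all residual blocks
// that touch it and invert the small dense result, yielding the
// block-Jacobi preconditioner (f'f)^-1.
std::vector<Eigen::MatrixXd> BlockJacobiFtF(
    const std::vector<std::vector<Eigen::MatrixXd>>& f_blocks) {
  std::vector<Eigen::MatrixXd> inv_diag;
  inv_diag.reserve(f_blocks.size());
  for (const std::vector<Eigen::MatrixXd>& blocks : f_blocks) {
    const int n = blocks.front().cols();
    Eigen::MatrixXd d = Eigen::MatrixXd::Zero(n, n);
    for (const Eigen::MatrixXd& f : blocks) d += f.transpose() * f;
    inv_diag.push_back(d.inverse());  // small dense inverse per block
  }
  return inv_diag;
}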

I am not sure what the relative penalty is for doing the CSR conversion on the CPU vs. doing it on the GPU. Doing the block-sparse to CSR conversion on the CPU is straightforward and, with a little bit of effort, parallelizable.

A partitioned CSR writer sounds very interesting. It will, however, make the schur_jacobi preconditioner inaccessible, because it depends on the Schur eliminator, which uses a block-sparse matrix as input.

Perhaps we should start with the simpler approach described above to get a baseline, and then try more sophisticated approaches to see what the performance/complexity tradeoff is?

BTW I would very much like to see the transposed block structure approach and how we can integrate it into Ceres. If you have a patch that you can revive there, that would be great.

I am also interested in whether anyone is aware of existing code/papers on efficient block-sparse-matrix by dense-vector multiplication on the GPU (i.e. using the same storage format as ceres-solver, but comparably efficient to CSR SpMV from cuSPARSE), because the CSR format is rather expensive to store: a factor of ~1.5x for double-word indices and double-precision coefficients (8-byte values plus 4-byte column indices is 12 bytes per non-zero, vs. 8 bytes for the values alone).
While this might not be a big problem on a modern GPU with >=24 GB of memory (since we still have a limit of 2^31-1 non-zero elements, we can fit any Jacobian into 24 GB of GPU memory even in CSR format), I think that cutting down the amount of memory that needs to be traversed for a single product might improve performance (since SpMV products are typically bandwidth-limited).

I would like to see this too; that said, in some experiments I did recently, SpMV on block-sparse matrices was not that much faster than SpMV using the corresponding CSR matrix. I need to look at the experiments more carefully, but while there is more arithmetic intensity in block-sparse matrices, we also have to look up block size information across two different arrays. Also, some of the performance of the partitioned matrix view comes from knowing the fixed sizes of the e and f blocks, which allows us to use fixed-size matrix-vector products.

In terms of recent papers on bundle adjustment and GPUs, the one I have been meaning to read is the work from Megvii

MegBA: A GPU-Based Distributed Library for Large-Scale Bundle Adjustment
Jie Ren, Wenteng Liang, Ran Yan, Luo Mai, Shiwen Liu, Xiao Liu

Sameer

 

Mark Shachkov

Aug 17, 2022, 2:02:32 PM
to Ceres Solver
Since the reference to MegBA was introduced, I want to repeat a concern raised by Dmitriy in a private conversation: it might be worth planning the changes to target the use of more than a single GPU, maybe staying within a single-host boundary for the moment.


Sameer Agarwal

Aug 17, 2022, 2:05:20 PM
to ceres-...@googlegroups.com
On Wed, Aug 17, 2022 at 11:02 AM Mark Shachkov <marksh...@gmail.com> wrote:
Since the reference to MegBA was introduced, I want to repeat a concern raised by Dmitriy in a private conversation: it might be worth planning the changes to target the use of more than a single GPU, maybe staying within a single-host boundary for the moment.

Mark, I am not sure I understand what you are saying. Are you saying we should plan for multiple GPUs on a single host?
What sort of considerations need to go into that?

Sameer

 

Mark Shachkov

Aug 17, 2022, 2:13:22 PM
to ceres-...@googlegroups.com
Are you saying we should plan for multiple GPUs on a single host?
Yes.

What sort of considerations need to go into that?
Currently, I think that being aware of the system bus topology and (not) moving data accordingly is most important.

Mark Shachkov



Sameer Agarwal

Aug 17, 2022, 2:16:25 PM
to ceres-...@googlegroups.com
Does that fundamentally change how the single-GPU work is done? I suspect that is going to be the common case, so we should get that going before we block on multi-GPU issues.
Sameer


Mark Shachkov

Aug 17, 2022, 2:29:54 PM
to ceres-...@googlegroups.com
Does that fundamentally change how the single-GPU work is done?
If moving data between a CPU and a GPU that are not directly connected (traversing the CPU-CPU bridge on hosts with more than a single CPU node) kills performance (establishing whether this is the case needs a bit of experimentation, given the memory transfer patterns specific to a particular implementation), then topology awareness is needed even for the single-GPU case.

Mark Shachkov



Sameer Agarwal

Aug 17, 2022, 2:32:31 PM
to ceres-...@googlegroups.com
On Wed, Aug 17, 2022 at 11:29 AM Mark Shachkov <marksh...@gmail.com> wrote:
Does that fundamentally change how the single-GPU work is done?
If moving data between a CPU and a GPU that are not directly connected (traversing the CPU-CPU bridge on hosts with more than a single CPU node) kills performance (establishing whether this is the case needs a bit of experimentation, given the memory transfer patterns specific to a particular implementation), then topology awareness is needed even for the single-GPU case.

Sure, but this seems like a second-order consideration right now to me. First order should be getting high performance on a single GPU, and once that is stable, performant, and showing its limits, move to multiple GPUs.

Sameer

 

Mark Shachkov

Aug 28, 2022, 5:15:36 AM
to Ceres Solver
Sameer, the implementation of GPGPU SpMV for views on ceres::internal::BlockSparseMatrix is surely an interesting attempt to gain speedups for a broader set of problems. Meanwhile, what do you think (especially given your experience with PBA) about the possibility of introducing into ceres-solver structures specific to "boring BA", with the 2 non-zero blocks in each Jacobian row stored explicitly as e and f parts? I don't have any code/experimentation results currently, but it seems that for such problems/structure there might be more efficient ways to exploit GPGPU capabilities for SpMVs.

Mark Shachkov


Sameer Agarwal

Aug 28, 2022, 8:33:59 PM
to ceres-...@googlegroups.com
Hi Mark,

That would require two assumptions. 

One "Boring BA" as you say, no camera-camera constraints, or even just camera regularization constraints. 
and second fixed camera and point sized blocks, and knowing these sizes in advance or via template expansion. PBA can do what it does because it is primarily used by visualsfm or for benchmarking purposes. Real bundle adjustment problems almost always comes with camera regularization constraints, and more sophisticated camera matrix structure. For example split rotation/translation or shared intrinsics. 

Sure, if we introduce the boring BA variant we will win some benchmarks, but practically speaking those code paths will not see much use and will further complicate the internals.

Given that GPU-based SpMV is so fast, before we do all this I think implementing the basic implicit Schur complement on the GPU is going to be a significant win. I have half a CL for this in the works, but if you and/or Dmitriy want to work on it, you are welcome to it too. Please let me know so that I am not working on it in parallel.


Sameer




Mark Shachkov

Aug 29, 2022, 4:32:59 AM
to Ceres Solver
Sameer, I have no code currently, so not much will be wasted on my side.

I came to know ceres-solver via COLMAP, and I think COLMAP is still quite a heavy user of ceres-solver. As far as I know, COLMAP does not employ any regularization during its most time-consuming BA solves, and I think sufficiently little work would be needed to get rid of "holes" in the camera parametrization (given that COLMAP is capable of using PBA, this work is already done to some extent). Maybe we can bring this to the attention of Johannes Schönberger to get a comment on whether there is interest and will to make such changes.

A somewhat orthogonal thought: it might be valuable to gather, via a public questionnaire (for example, in the way the Eigen developers did), statistics about the types of problems and the ways users apply ceres-solver to them.

Mark Shachkov


Sameer Agarwal

Aug 30, 2022, 1:24:13 AM
to ceres-...@googlegroups.com
Yes, it would be good to ask Johannes as well as Noah Snavely.
Good idea on the questionnaire too. I mostly devote my time to things that will be used inside Google and curiosities that interest me, but a broader understanding of community needs would be great too.
Sameer


Jason

Oct 23, 2022, 10:50:07 AM
to Ceres Solver
What is the current status of CUDA support for BlockSparseMatrix and PartitionedMatrixView so that ImplicitSchurComplement::RightMultiplyAndAccumulate can run on the GPU?

Sameer Agarwal

Oct 23, 2022, 11:21:57 AM
to ceres-...@googlegroups.com
Dmitriy has plans, but I do not have an ETA.
We are working on improving CPU-based multi-threading for ITERATIVE_SCHUR first. Full support is one PR away.
Sameer 

Nikolaus Demmel

Dec 12, 2024, 12:34:26 PM
to Ceres Solver
Hi Sameer and all,

I hope everyone is well! Checking in on this old thread: did CPU-based multi-threading (and possibly CUDA support) for ITERATIVE_SCHUR make it in, in the meantime?

Best,
Nikolaus

Dmitriy Korchemkin

Dec 12, 2024, 1:28:38 PM
to Ceres Solver
Hi, Nikolaus

CPU-based multi-threading for ITERATIVE_SCHUR was finished and merged.

The GPU part only exists in the form of a series of patches on Gerrit which we discussed with Sameer, but shortly afterwards I got hit by some real-life problems and never finished reworking them to the state where they could be merged in.

In the end I came to the conclusion that with a reasonably good preconditioner, moving just the left/right products to the GPU is not sufficient (i.e. it brings just a 2x...3x improvement in total run-time), because computing the preconditioner is not cheap.
But moving the preconditioner to the GPU makes sense if its block structure is preserved, and then you're not that far from implementing SpMV for block-sparse matrices yourself (not via cuSPARSE).
And then the solution of SpMV via "cool hacks with permutations" sort of makes less sense.

I can try to finalize those patches during this Christmas/New Year season, if it still makes sense.

Nikolaus Demmel

Dec 12, 2024, 1:50:20 PM
to Ceres Solver
Thanks for the quick reply!

Great to hear multi-threaded ITERATIVE_SCHUR is merged; this is fantastic. Would you mind pointing me to Gerrit or a git commit so I can see in which version it's available? I would love to give this a try! (I assume it doesn't require an additional option and just uses the existing `num_threads` configuration, but I can also check the code myself.)

Sorry to hear about your real-life problems. I hope you are doing OK in the meantime! Don't worry, you don't owe anyone the finished contribution; there are certainly many more important things in life!

I'm not in a position to say how much sense it makes. I see some other recent activity around CUDA support in Ceres; I'm not sure how that is related to your work, or maybe that is for different solvers and not ITERATIVE_SCHUR... Maybe Sameer can comment when he has a minute.

All the best!
Nikolaus

Dmitriy Korchemkin

Dec 12, 2024, 2:09:14 PM
to Ceres Solver
> Would you mind pointing me to Gerrit or a git commit so I can see in which version it's available?
I think that the last point when the CPU parallelization was touched is

commit 354002f989e7abdab9a4c17291662ab6e3102283  Schedule task in ParallerFor from the previous task

Thus, parallel matrix-vector products should be available since 2.2.0.

If ImplicitSchurComplement::RightMultiplyAndAccumulate has lots of the word "parallel" in it, you have a version with parallel right products (the partitioned matrix view has its own copy of the options and parallel context, so those products are also parallelized):
https://github.com/ceres-solver/ceres-solver/blob/master/internal/ceres/implicit_schur_complement.cc#L106-L144


Keep in mind that the SpMV product is limited by memory bandwidth, so once the memory bandwidth is depleted, adding more cores does not bring more performance.
I believe Sameer observed that on Apple silicon the memory bandwidth available to a single core is large enough to make the profit from CPU parallelization minuscule.

Sameer Agarwal

Dec 13, 2024, 6:59:21 PM
to ceres-...@googlegroups.com
Hi Dmitriy,
Nice to hear from you and I hope you are doing well.
Let's talk about moving the preconditioner to the GPU. It would be cool if we can do that efficiently.
Sameer

