Hi,

I'm benchmarking Ceres on some large BAL-style bundle adjustment problems, and I want to make sure I have the configuration straight in terms of multi-threading support. I understand how multi-threading is used for residual/Jacobian evaluation and for computing the Schur complement (SC), but for larger BA problems significant time is spent solving the reduced camera system. I'm considering the SPARSE_SCHUR and ITERATIVE_SCHUR linear solvers with the JACOBI and SCHUR_JACOBI preconditioners.

To my understanding, there is no option for a multi-threaded exact sparse solver (SPARSE_SCHUR), right? (I'm using SUITE_SPARSE as the sparse linear algebra library, but I assume it's no different with CX_SPARSE or EIGEN_SPARSE, since the docs say SUITE_SPARSE is strongly preferred for large BA problems.)
What about the inexact solvers (ITERATIVE_SCHUR), both the explicit and implicit SC variants?
Are they making use of multithreading in the CG iterations?
It doesn't seem like it to me (except that the SCHUR_JACOBI preconditioner in the implicit SC variant uses the parallel Schur eliminator to compute the block-diagonal preconditioner), but maybe I'm missing something; there are many different sparse matrix classes, so I might be looking at the wrong code.
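For reference, this is roughly how I'm configuring the solver in my benchmarks (a minimal sketch; the thread count and the choice between the solver/preconditioner combinations are just placeholders):

  #include "ceres/ceres.h"

  // Sketch of the solver configuration under discussion; values are placeholders.
  void ConfigureSolver(ceres::Solver::Options* options) {
    options->linear_solver_type = ceres::ITERATIVE_SCHUR;   // or ceres::SPARSE_SCHUR
    options->preconditioner_type = ceres::SCHUR_JACOBI;     // or ceres::JACOBI
    options->use_explicit_schur_complement = false;         // implicit vs. explicit SC
    options->sparse_linear_algebra_library_type = ceres::SUITE_SPARSE;
    options->num_threads = 8;  // used for residual/Jacobian evaluation and the Schur eliminator
  }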
Best,
Niko

PS: My BLAS library is OpenBLAS and I'm turning its multi-threading support off with OPENBLAS_NUM_THREADS=1, as suggested in the docs. This is on Ubuntu Linux.
Hi Sameer,

thanks for your reply, this is very helpful.

One more question about the implicit SC iterative solver. If I'm not mistaken, this corresponds to the "implicit-schur" variant from the Multicore Bundle Adjustment paper (http://grail.cs.washington.edu/projects/mcba/pba.pdf). In that paper, parallel evaluation of the matrix-vector products needed in CG is discussed. Is the reason this is not implemented in Ceres some fundamental difficulty, e.g. because Ceres is general purpose whereas the paper discusses specifically BAL-style BA problems? Or is it simply that no one has gotten around to implementing a multi-threaded version in Ceres so far?
On Mon, Oct 26, 2020 at 2:45 AM Nikolaus Demmel <nde...@gmail.com> wrote:

> One more question about the implicit SC iterative solver. If I'm not mistaken, this corresponds to the "implicit-schur" variant from the Multicore Bundle Adjustment paper (http://grail.cs.washington.edu/projects/mcba/pba.pdf). In that paper, parallel evaluation of the matrix-vector products needed in CG is discussed. Is the reason this is not implemented in Ceres some fundamental difficulty, e.g. because Ceres is general purpose whereas the paper discusses specifically BAL-style BA problems? Or is it simply that no one has gotten around to implementing a multi-threaded version in Ceres so far?

The short version is that I tried and failed to get good performance out of it. It likely has to do with the fact that Ceres solves a more general problem and uses a more general block sparse matrix representation than the one used in the MCBA paper.

So there are two products to be evaluated: Jx and J'x. Since the Jacobian is stored in block row major order, Jx is trivial to thread. I am pretty sure that when we had OpenMP annotation based threading, that product was threaded, but my memory fails me right now. Either way, the resulting reduction in time was small.
J'x is more complicated, since going row-wise now requires that we scatter and gather, which also increases memory use. It is entirely possible that there is a better way to do this; I'd be happy to brainstorm and review code. It would be nice to have the iterative part of Ceres be faster.
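To illustrate why Jx threads trivially, here is a rough sketch (the BlockRow/Cell types are made up for the sketch, not the actual Ceres block-sparse classes): each thread owns a disjoint set of block rows and therefore writes a disjoint slice of the output, so no synchronization is needed.

  #include <cstddef>
  #include <thread>
  #include <vector>

  // Illustrative stand-ins for a block-row-major sparse Jacobian.
  struct Cell { int col_begin, cols; const double* values; };        // row-major block
  struct BlockRow { int row_begin, rows; std::vector<Cell> cells; };

  // y += J * x, threaded over block rows; each thread writes a disjoint slice of y.
  void JTimesX(const std::vector<BlockRow>& rows, const double* x, double* y,
               int num_threads) {
    std::vector<std::thread> workers;
    for (int t = 0; t < num_threads; ++t) {
      workers.emplace_back([&rows, x, y, t, num_threads] {
        for (std::size_t i = t; i < rows.size(); i += num_threads) {
          const BlockRow& row = rows[i];
          for (const Cell& cell : row.cells) {
            for (int r = 0; r < row.rows; ++r) {
              double sum = 0.0;
              for (int c = 0; c < cell.cols; ++c) {
                sum += cell.values[r * cell.cols + c] * x[cell.col_begin + c];
              }
              y[row.row_begin + r] += sum;
            }
          }
        }
      });
    }
    for (std::thread& w : workers) w.join();
  }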
Sameer
> The short version is that I tried and failed to get good performance out of it. It likely has to do with the fact that Ceres solves a more general problem and uses a more general block sparse matrix representation than the one used in the MCBA paper. So there are two products to be evaluated: Jx and J'x. Since the Jacobian is stored in block row major order, Jx is trivial to thread. I am pretty sure that when we had OpenMP annotation based threading, that product was threaded, but my memory fails me right now. Either way, the resulting reduction in time was small.

Thanks for the insights. Makes sense.

> J'x is more complicated, since going row-wise now requires that we scatter and gather, which also increases memory use. It is entirely possible that there is a better way to do this; I'd be happy to brainstorm and review code. It would be nice to have the iterative part of Ceres be faster.

By scatter and gather, do you mean parallelizing the product J'x over rows of J' (== columns of J)? That sounds almost like storing J a second time in compressed (block-)column form, which would indeed be a lot of additional memory. Alternatively, you could split up x and do a map-reduce over the columns of J' (== rows of J). Of course, in the reduction you would have to do additional vector additions (of the size of the reduced camera system). Not sure which one would work better, though the map-reduce should not require too much additional memory, I think. I haven't really thought this through.
The explicit iterative Schur solver should be well parallelizable, since you only need Hx, assuming H is also stored in block-row-compressed form.
I agree those would be nice projects. Quite self-contained, e.g. for someone who wants to get into contributing to Ceres. At the moment there is no time, unfortunately. Maybe I should find a student to implement it and find the best parallelization strategy for implicit Schur.
> > J'x is more complicated, since going row-wise now requires that we scatter and gather, which also increases memory use. It is entirely possible that there is a better way to do this; I'd be happy to brainstorm and review code. It would be nice to have the iterative part of Ceres be faster.
>
> By scatter and gather, do you mean parallelizing the product J'x over rows of J' (== columns of J)? That sounds almost like storing J a second time in compressed (block-)column form, which would indeed be a lot of additional memory. Alternatively, you could split up x and do a map-reduce over the columns of J' (== rows of J). Of course, in the reduction you would have to do additional vector additions (of the size of the reduced camera system). Not sure which one would work better, though the map-reduce should not require too much additional memory, I think. I haven't really thought this through.

I mean break up J and x into chunks of rows, compute y_i = J_i' x_i on each thread, and then at the end sum up the y_i's. This requires k copies the size of the number of parameters. The additional vector operations are not the size of the reduced camera system; they are the size of the full parameter vector.
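A minimal sketch of what I mean (same illustrative BlockRow/Cell layout as in the Jx sketch earlier in the thread, repeated here; none of this is the actual Ceres code):

  #include <algorithm>
  #include <cstddef>
  #include <thread>
  #include <vector>

  struct Cell { int col_begin, cols; const double* values; };        // row-major block
  struct BlockRow { int row_begin, rows; std::vector<Cell> cells; };

  // y = J' * x via per-thread partial sums: thread t accumulates its chunk of
  // block rows into a private vector of length num_parameters, and the partials
  // are summed at the end. Both the extra memory and the final reduction are of
  // the size of the full parameter vector, once per thread.
  void JTransposeTimesX(const std::vector<BlockRow>& rows, const double* x,
                        std::vector<double>* y, int num_threads) {
    const std::size_t num_parameters = y->size();
    std::vector<std::vector<double>> partials(
        num_threads, std::vector<double>(num_parameters, 0.0));
    std::vector<std::thread> workers;
    for (int t = 0; t < num_threads; ++t) {
      workers.emplace_back([&rows, x, &partials, t, num_threads] {
        std::vector<double>& y_t = partials[t];
        for (std::size_t i = t; i < rows.size(); i += num_threads) {
          const BlockRow& row = rows[i];
          for (const Cell& cell : row.cells) {
            for (int r = 0; r < row.rows; ++r) {
              const double x_r = x[row.row_begin + r];
              for (int c = 0; c < cell.cols; ++c) {
                y_t[cell.col_begin + c] += cell.values[r * cell.cols + c] * x_r;
              }
            }
          }
        }
      });
    }
    for (std::thread& w : workers) w.join();
    std::fill(y->begin(), y->end(), 0.0);  // gather: sum the per-thread partials
    for (const std::vector<double>& y_t : partials) {
      for (std::size_t j = 0; j < num_parameters; ++j) (*y)[j] += y_t[j];
    }
  }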
> The explicit iterative Schur solver should be well parallelizable, since you only need Hx, assuming H is also stored in block-row-compressed form.

Yes, but the cost of computing H is rather high, so the zone in which H + iterative methods work well is quite narrow.
> I agree those would be nice projects. Quite self-contained, e.g. for someone who wants to get into contributing to Ceres. At the moment there is no time, unfortunately. Maybe I should find a student to implement it and find the best parallelization strategy for implicit Schur.

That would be great, and a nice contribution to improving performance. The other would be to add support for CUDA/GPU-based linear solvers.
Sameer
> By scatter and gather you mean to be able to parallelize the product J'x over rows of J' (== columns of J)? Sounds almost like storing J a second time in compressed (block-) column form, which would indeed be a lot of additional memory.
One might store only a CompressedColumnBlockStructure corresponding to the structure of J (with value offsets pointing to the same values of J, keeping in mind an intrinsic transpose). This allows running parallel products for both J and J' while only storing the structure twice (the values are stored only once); it also simplifies the modification logic (no need to track value changes).
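A rough sketch of what I have in mind (the type names are illustrative, not the actual Ceres classes): the column-major view stores, per column block, which row block each cell comes from and an offset into J's single values array, so J'x can be iterated (and threaded) over column blocks without duplicating the values.

  #include <vector>

  // Illustrative column-major structure over the same values array as J.
  struct TransposedCell {
    int row_begin, rows;   // row block of the original cell in J
    int value_offset;      // offset of the cell's (row-major) values in J's values array
  };
  struct ColumnBlock { int col_begin, cols; std::vector<TransposedCell> cells; };

  // y += J' * x, iterated over column blocks. Each column block writes a
  // disjoint slice of y, so the outer loop can be parallelized directly
  // (e.g. over a thread pool) without scatter/gather buffers.
  void JTransposeTimesX(const std::vector<ColumnBlock>& columns,
                        const double* values, const double* x, double* y) {
    for (const ColumnBlock& col : columns) {
      for (const TransposedCell& cell : col.cells) {
        const double* block = values + cell.value_offset;  // rows x cols, row-major
        for (int r = 0; r < cell.rows; ++r) {
          const double x_r = x[cell.row_begin + r];
          for (int c = 0; c < col.cols; ++c) {
            y[col.col_begin + c] += block[r * col.cols + c] * x_r;
          }
        }
      }
    }
  }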
> One might store only a CompressedColumnBlockStructure corresponding to the structure of J (with value offsets pointing to the same values of J, keeping in mind an intrinsic transpose). This allows running parallel products for both J and J' while only storing the structure twice (the values are stored only once); it also simplifies the modification logic (no need to track value changes).

That's a cool idea. It is worth prototyping and measuring the performance gain, and comparing it to the scatter-and-gather approach.
> That's a cool idea. It is worth prototyping and measuring the performance gain, and comparing it to the scatter-and-gather approach.

Yes, that sounds easy to implement and should work nicely. Not sure if cache performance might be an issue when iterating over columns that way, but maybe it's fine if the blocks aren't too small. The alternatives also have overhead, so it's definitely worth comparing for different problem sizes.
> Joydeep Biswas has been working on adding support for GPUs to ceres solver and he has a CL in the works to add GPU support for CGNR.
The current approach for CGNR works by writing the Jacobian in CSR format, followed by CSR SpMV products using cuSPARSE, right?

When I was experimenting with accelerating the iterative Schur complement solver two years ago, I used the standard block-sparse Jacobian writer with a two-stage conversion to CSR offloaded to the GPU:
- “offline” (once per jacobian structure change): convert block-sparse partitioned structure to 2xCSR structures and prepare permutations for remapping elements from block-sparse structure to CSR structure
- “online” (once per jacobian values change): stream chunks of values in block-sparse order to GPU, remapping them into CSR order in a pipelined manner.
This was needed to get a reasonable performance boost over parallel CPU SpMV with caching of the transposed block-sparse structure (described elsewhere in this thread).
Back then I hadn't tried to implement a Jacobian writer for a partitioned CSR matrix, though; it would probably be reasonably efficient. My approach was driven by the idea of minimizing the amount of data transferred to the GPU (thus avoiding CSR on the CPU side) and utilizing the higher memory bandwidth of the GPU for performing the format conversion.
If the algorithm outlined above looks reasonable, I think I can put some effort into bringing it up to the current upstream.
I am also interested in whether anyone is aware of existing code or a paper on efficient block-sparse-matrix by dense-vector multiplication on the GPU (i.e. using the same storage format as ceres-solver, but comparably efficient to the CSR SpMV from cuSPARSE), because the CSR format is rather expensive to store (by a factor of ~1.5x for double-word indices and double-precision coefficients).
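(To spell out the 1.5x, under the assumption of 4-byte column indices and 8-byte double values: CSR costs roughly 8 + 4 = 12 bytes per non-zero, with the row pointers being negligible, whereas the block-sparse format stores about 8 bytes per non-zero since the index overhead is amortized over whole blocks; 12 / 8 = 1.5.)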
While it might not be a big problem on a modern GPU with >= 24 GB of memory (since we still have a limit of 2^31 - 1 non-zero elements, we can fit any Jacobian into 24 GB of GPU memory even in CSR format), I think that cutting down the amount of memory that needs to be traversed for a single product might improve performance (since SpMV products are typically bandwidth-limited).
Since the reference to MegBA was introduced, I want to repeat a concern raised by Dmitriy in a private conversation: it might be worth planning the changes to target the use of more than a single GPU, perhaps staying within a single-host boundary for the moment.
Are you saying we should plan for multiple GPUs on a single host?
What sort of considerations need to go into that?
Does that fundamentally change how single GPU stuff is done?
> Does that fundamentally change how single GPU stuff is done?

If moving data between a CPU and a GPU that are not directly connected (traversing the CPU-CPU bridge on hosts with more than a single CPU node) kills the performance (whether this is the case needs a bit of experimentation, given the memory transfer patterns specific to a particular implementation), then topology awareness is needed even for the single-GPU case.