SPARSE_SCHUR solver very slow on big bundle adjustment problems


Raúl Díaz

Jul 21, 2016, 2:18:56 PM
to Ceres Solver
Hi,

I'm trying to reconstruct an image dataset (~7000 images) using OpenMVG, which uses Ceres as its non-linear solver for bundle adjustment. I'm having performance issues as the problem gets big. OpenMVG sets DENSE_SCHUR as the solver while the problem has fewer than 100 images, and SPARSE_SCHUR with the JACOBI preconditioner for more than 100 images. Ceres reports good timings while the problem is still relatively small, but as the problem grows the solver just takes too much time. I've tried both the CXSparse and SuiteSparse algebra libraries, with similar behavior. I am running on a 48-core machine with OpenMP support. Here are a few reports to illustrate the problem:

With 102 images:

Solver Summary (v 1.10.0-lapack-suitesparse-cxsparse-openmp)

                                     Original                  Reduced
Parameter blocks                        40940                    39008
Parameters                             129201                   117609
Residual blocks                        133503                   133503
Residual                               267006                   267006

Minimizer                        TRUST_REGION

Sparse linear algebra library    SUITE_SPARSE
Trust region strategy     LEVENBERG_MARQUARDT

                                        Given                     Used
Linear solver                    SPARSE_SCHUR             SPARSE_SCHUR
Threads                                    48                       48
Linear solver threads                      48                       48
Linear solver ordering              AUTOMATIC               38813, 195

Cost:
Initial                          1.806568e+04
Final                            1.686165e+04
Change                           1.204027e+03

Minimizer iterations                       11
Successful steps                           11
Unsuccessful steps                          0

Time (in seconds):
Preprocessor                           1.0120

  Residual evaluation                  0.1268
  Jacobian evaluation                  1.3551
  Linear solver                        3.5802
Minimizer                              6.3650

Postprocessor                          0.0248
Total                                  7.4018

Termination:                      CONVERGENCE (Function tolerance reached. |cost_change|/cost: 3.047409e-07 <= 1.000000e-06)




With 477 images:

Solver Summary (v 1.10.0-lapack-suitesparse-cxsparse-openmp)

                                     Original                  Reduced
Parameter blocks                       141442                   139693
Parameters                             432087                   421593
Residual blocks                        479307                   479307
Residual                               958614                   958614

Minimizer                        TRUST_REGION

Sparse linear algebra library    SUITE_SPARSE
Trust region strategy     LEVENBERG_MARQUARDT

                                        Given                     Used
Linear solver                    SPARSE_SCHUR             SPARSE_SCHUR
Threads                                    48                       48
Linear solver threads                      48                       48
Linear solver ordering              AUTOMATIC              138855, 838

Cost:
Initial                          7.177760e+04
Final                            7.152920e+04
Change                           2.483920e+02

Minimizer iterations                       30
Successful steps                           22
Unsuccessful steps                          8

Time (in seconds):
Preprocessor                           5.5030

  Residual evaluation                  0.9512
  Jacobian evaluation                 11.3193
  Linear solver                       96.4503
Minimizer                            117.8575

Postprocessor                          0.0945
Total                                123.4550

Termination:                      CONVERGENCE (Function tolerance reached. |cost_change|/cost: 4.437567e-07 <= 1.000000e-06)


When the problem is very big, a single call can last 15 minutes. I've been running a job for a week now, and this is the minimal report for 3044 images. OpenMVG had the Ceres report flag set to false, so I do not have detailed information like the above, which I obtained by turning the report on and running a new problem from scratch yesterday.

Bundle Adjustment statistics (approximated RMSE):
 #views: 6844
 #poses: 3044
 #intrinsics: 2638
 #tracks: 772138
 #residuals: 6577458
 Initial RMSE: 0.396149
 Final RMSE: 0.395921
 Time (s): 925.159


I've tried switching to ITERATIVE_SCHUR/SCHUR_JACOBI, but that works even worse (100 images would take 150 seconds!). Maybe there's something wrong with the way I compiled OpenMVG, but it seems to work fine otherwise. The Ceres version I'm running is 1.10. From monitoring the CPUs, I can see that Ceres is not using all 48 of them, but just 1 most of the time.

Thanks
Raúl

Sameer Agarwal

Jul 21, 2016, 2:39:14 PM
to Ceres Solver
Can you share a -logtostderr -v=3 log for the bundle adjustment runs?

--
You received this message because you are subscribed to the Google Groups "Ceres Solver" group.
To unsubscribe from this group and stop receiving emails from it, send an email to ceres-solver...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/ceres-solver/da48975f-e887-4611-85ee-509e7456ee04%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Pierre Moulon

Jul 21, 2016, 3:02:34 PM
to ceres-...@googlegroups.com
I don't want to get in the way of the Ceres discussion; I'll just ask some additional questions and provide some comments.

1.
-> OpenMVG uses all the available CPUs if your compiler supports OpenMP.
In your case we can see that Ceres is trying to use all your CPUs and that the Ceres you are using was compiled with OpenMP support:
(v 1.10.0-lapack-suitesparse-cxsparse-openmp)
Threads                                    48                       48
Linear solver threads                      48                       48

The fact that only 1 thread is used most of the time is most likely due to a third-party library Ceres uses for the matrix computation
(the factorization? I think Sameer can provide more feedback about timing/threading performance for this part)

2.
It seems that you are using the incremental pipeline, which can be quite slow for very big datasets.
I would suggest trying the Global SfM pipeline, which uses fewer BA iterations.
Feel free to ask me additional questions if you want to give it a try on your dataset.

3.
You can also tweak some OpenMVG parameters to make the incremental SfM use less BA.
You can make the chosen image resection group size larger by lowering this percentage.

4.
In order to follow Sameer's suggestion about the log options (glog parameters),
you must compile your own Ceres, tell OpenMVG to use it, and add the glog configuration to https://github.com/openMVG/openMVG/blob/master/src/software/SfM/main_GlobalSfM.cpp
=> I set Ceres to use miniglog by default in order to minimize external dependencies.
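For illustration, a minimal sketch of what that glog configuration could look like at the top of main_GlobalSfM.cpp, assuming Ceres was rebuilt against the full glog library instead of the bundled miniglog (the flag names are standard glog globals, not OpenMVG code):

```cpp
// Hypothetical sketch: enabling the verbose logging Sameer asked for,
// assuming a Ceres build linked against full glog (not miniglog).
#include <glog/logging.h>

int main(int argc, char **argv) {
  google::InitGoogleLogging(argv[0]);
  FLAGS_logtostderr = true;  // send logs to stderr instead of log files
  FLAGS_v = 3;               // verbosity level 3, i.e. -v=3
  // ... rest of the SfM pipeline ...
  return 0;
}
```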

I'm also interested in the ceres-users' comments about the best way to use Ceres for large scenes ;-)


Regards/Cordialement,
Pierre M

Raúl Díaz

Jul 21, 2016, 3:21:44 PM
to ceres-...@googlegroups.com
Hi Pierre,

Thanks for the reply. Certainly, I'm using miniglog as it comes nicely packed in your software. So if I want to provide the logs Sameer is asking for, I need to install glog, right?

To reply to some of your comments, I think I can't use your global pipeline as I do not have the focals for all cameras in my image set, and that is a requirement, right?

On the other hand, I've tweaked openMVG a bit myself to speed things up (perhaps something we can discuss outside this thread):
 - p4pf solver when unknown intrinsics
 - reorganized _map_tracks as _map_tracks_reverse to be of the form {imageId, {trackId, featId}} so that searching for common tracks between images is dominated by the O(log n) search time, rather than the O(n).

Changing the group size is also an option, although I think it would not entirely solve this bottleneck. Reviewing my logs, I can see multiple calls to BA if some tracks have been rejected after the first call. There are times when BA is called 5 times in a row for that reason, yielding more than an hour of Ceres runtime for a single batch of images.

I'll install glog and let you know what I get.

Thanks
Raul


Pierre Moulon

Jul 21, 2016, 3:35:29 PM
to ceres-...@googlegroups.com
Hi Raul,

2016-07-21 21:21 GMT+02:00 Raúl Díaz <diaz...@gmail.com>:
Hi Pierre,

Thanks for the reply. Certainly, I'm using miniglog as it comes nicely packed in your software. So if I want to provide the logs Sameer is asking for, I need to install glog, right?
Install glog and gflags, compile your own Ceres, and then tell OpenMVG to use it.

To reply to some of your comments, I think I can't use your global pipeline as I do not have the focals for all cameras in my image set, and that is a requirement, right?
Yes.

On the other hand, I've tweaked openMVG a bit myself to speed things up (perhaps something we can discuss outside this thread):
 - p4pf solver when unknown intrinsics

It would be cool if you could submit a PR with this feature ;-)

 - reorganized _map_tracks as _map_tracks_reverse to be of the form {imageId, {trackId, featId}} so that searching for common tracks between images is dominated by the O(log n) search time, rather than the O(n).
Yes, I know it improves the running time; I haven't had the time to do it yet, despite having thought about it for a while.
The other reason is that in the past I tried to reject outlier correspondences by trackId, so the imageId (track list) was dynamic.
It would be valuable to have this as a PR too, and to refactor the code a bit to make this function usable from the openMVG::tracks module.

Changing the group size is also an option, although I think it would not entirely solve this bottleneck. Reviewing my logs, I can see multiple calls to BA if some tracks have been rejected after the first call. There are times when BA is called 5 times in a row for that reason, yielding more than an hour of Ceres runtime for a single batch of images.

OpenMVG's goal is to make the computation as accurate as possible, so running multiple BAs can sometimes seem cumbersome.
But as some of my papers show, running multiple BAs helps.
In your case, since you have a very large dataset, I think decreasing the number of BAs will not change the result much.
Currently, a series of BAs runs until no observation has a residual above 4 pixels:
=> continue to perform another BA if more than 50 observations have been rejected.
=> You can change the loop to run fewer iterations, or only one.

---
Off topic:
1. If you want to give Global SfM a try on your dataset, you can tweak openMVG_main_SfMInit_ImageListing to set a default focal length for all your cameras (something like 1.2 * max(w, h) should provide decent results).

2. You can also calibrate only the cameras that have a valid focal length with the GlobalSfM pipeline, then try to localize the remaining images in the reconstruction.
I have not tested this functionality on a large dataset, but it should work if the cameras calibrated in the first round cover your whole scene.


 

Raúl Díaz

Jul 24, 2016, 9:06:07 PM
to Ceres Solver
Hi

I waited for a couple of days to get the pipeline running and gather some logs for when the problem had become big and slow:

The stdout looks like this (the Ceres call lasted 1405 seconds):

CHOLMOD version 3.0.11, May 4, 2016: Symbolic Analysis: status: OK
  Architecture: Linux
    sizeof(int):      4
    sizeof(SuiteSparse_long):  8
    sizeof(void *):   8
    sizeof(double):   8
    sizeof(Int):      4 (CHOLMOD's basic integer)
    sizeof(BLAS_INT): 4 (integer used in the BLAS)
  Results from most recent analysis:
    Cholesky flop count: 5.5713e+10
    Nonzeros in L:       2.828e+07
  memory blocks in use:          15
  memory in use (MB):         166.8
  peak memory usage (MB):     330.9
  maxrank:    update/downdate rank:   8
  supernodal control: 1 40 (supernodal if flops/lnz >= 40)
  nmethods:   number of ordering methods to try: 1
    method 0: natural
        flop count: 5.5713e+10
        nnz(L):     2.828e+07
  OK

Solver Summary (v 1.10.0-lapack-suitesparse-cxsparse-openmp)

                                     Original                  Reduced
Parameter blocks                       468899                   467646
Parameters                            1419270                  1411752
Residual blocks                       1884494                  1884494
Residual                              3768988                  3768988

Minimizer                        TRUST_REGION

Sparse linear algebra library    SUITE_SPARSE
Trust region strategy     LEVENBERG_MARQUARDT

                                        Given                     Used
Linear solver                    SPARSE_SCHUR             SPARSE_SCHUR
Threads                                    48                       48
Linear solver threads                      48                       48
Linear solver ordering              AUTOMATIC             464708, 2938

Cost:
Initial                          4.780442e+05
Final                            4.706488e+05
Change                           7.395402e+03

Minimizer iterations                       28
Successful steps                           18
Unsuccessful steps                         10

Time (in seconds):
Preprocessor                          28.2264

  Residual evaluation                  2.4845
  Jacobian evaluation                 32.8710
  Linear solver                     1310.4017
Minimizer                           1376.7207

Postprocessor                          0.3506
Total                               1405.2977

Termination:                      CONVERGENCE (Function tolerance reached. |cost_change|/cost: 7.995233e-07 <= 1.000000e-06)


Bundle Adjustment statistics (approximated RMSE):
 #views: 6844
 #poses: 1791
 #intrinsics: 2400
 #tracks: 464708
 #residuals: 3768988
 Initial RMSE: 0.356141
 Final RMSE: 0.353375
 Time (s): 1405.3


The stderr looks like this (it's much longer, but essentially it repeats these forty-something-second blocks):

I0724 17:23:22.476943 31239 block_sparse_matrix.cc:78] Allocating values array with 452278560 bytes.
I0724 17:23:29.882452 31239 block_random_access_sparse_matrix.cc:75] Matrix Size [17628,17628] 14446008
I0724 17:23:30.399112 31239 detect_structure.cc:107] Schur complement static structure <2,3,6>.
I0724 17:24:15.843502 31239 wall_time.cc:74] 

SchurComplementSolver::Solve
                                   Delta   Cumulative
                           Setup :    5.07889      5.07889
                       Eliminate :    1.94684      7.02573
                    ReducedSolve :   42.77079     49.79652
                  BackSubstitute :    0.10135     49.89786
                           Total :    0.00002     49.89788

I0724 17:25:15.193806 31239 wall_time.cc:74] 

SchurComplementSolver::Solve
                                   Delta   Cumulative
                           Setup :    0.00361      0.00361
                       Eliminate :    1.93419      1.93780
                    ReducedSolve :   53.92556     55.86336
                  BackSubstitute :    0.10428     55.96764
                           Total :    0.00002     55.96766

I0724 17:26:03.628803 31239 wall_time.cc:74] 

SchurComplementSolver::Solve
                                   Delta   Cumulative
                           Setup :    0.00670      0.00670
                       Eliminate :    1.97086      1.97755
                    ReducedSolve :   43.26899     45.24654
                  BackSubstitute :    0.10377     45.35032
                           Total :    0.00002     45.35034

I0724 17:26:49.655076 31239 wall_time.cc:74] 

SchurComplementSolver::Solve
                                   Delta   Cumulative
                           Setup :    0.00404      0.00404
                       Eliminate :    1.95039      1.95442
                    ReducedSolve :   41.20693     43.16135
                  BackSubstitute :    0.10206     43.26341
                           Total :    0.00002     43.26343


Let me know if I need to provide something else.

Thanks
Raúl

Sameer Agarwal

Jul 25, 2016, 12:52:11 AM
to Ceres Solver
Raul,
Thanks for sharing the logs. From these we can infer that almost all the time is being spent factoring the Schur complement. Your Schur complement matrix is 17.6k x 17.6k with 14M non-zeros. That's a sizeable matrix to be factorizing every iteration. Since CHOLMOD is single threaded, its performance is entirely governed by the performance of the underlying BLAS & LAPACK libraries. Which BLAS/LAPACK library are you linking to? I hope it is not the reference BLAS. I recommend using either ATLAS or OpenBLAS.
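One quick way to check which BLAS ends up linked (a suggestion; the binary name here is assumed to be OpenMVG's incremental SfM driver, substitute your own) is to inspect the shared libraries of the final executable:

```shell
# List the BLAS/LAPACK shared libraries the binary actually links against.
# Seeing libopenblas or ATLAS here is good; a plain libblas.so usually
# means the slow reference BLAS.
ldd ./openMVG_main_IncrementalSfM | grep -i -E 'blas|lapack'
```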

The second thing to try is the ITERATIVE_SCHUR solver with the SCHUR_JACOBI preconditioner. For small to moderate systems it is not a good choice, but once you go past a thousand or so images it works quite well. You can also try the JACOBI and CLUSTER_JACOBI preconditioners. The former is cheaper but, in my opinion, performs worse; the latter performs better but is more expensive to set up, and in many cases that cost is prohibitive compared to SCHUR_JACOBI. You should use the single-linkage-based clustering algorithm with it.
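For reference, a hedged sketch of how these suggestions map onto Ceres's C++ API (the Solver::Options fields below are standard Ceres options; OpenMVG configures the solver in its own code, so this is illustrative rather than a drop-in patch):

```cpp
#include <ceres/ceres.h>

// Illustrative sketch (not OpenMVG's actual code) of the solver
// configuration suggested above for large bundle adjustment problems.
void ConfigureForLargeBA(ceres::Solver::Options *options) {
  options->linear_solver_type = ceres::ITERATIVE_SCHUR;
  // Cheap preconditioner that tends to work well past ~1000 images:
  options->preconditioner_type = ceres::SCHUR_JACOBI;
  // Alternative: CLUSTER_JACOBI is often a better preconditioner but is
  // more expensive to set up; pair it with single-linkage clustering.
  // options->preconditioner_type = ceres::CLUSTER_JACOBI;
  // options->visibility_clustering_type = ceres::SINGLE_LINKAGE;
  options->num_threads = 48;
  options->num_linear_solver_threads = 48;
}
```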

Sameer

