SPARSE_SCHUR solver very slow on big bundle adjustment problems


Raúl Díaz

Jul 21, 2016, 2:18:56 PM
to Ceres Solver
Hi,

I'm trying to reconstruct an image dataset (~7000 images) using OpenMVG, which uses Ceres as its non-linear solver for bundle adjustment. I'm having performance issues as the problem gets big. OpenMVG sets DENSE_SCHUR as the solver while the problem has fewer than 100 images, and SPARSE_SCHUR with the JACOBI preconditioner for more than 100 images. Ceres reports good timings while the problem is still relatively small, but as the problem grows the solver just takes too much time. I've tried both the CXSparse and SuiteSparse algebra libraries, with similar behavior. I am running on a 48-core machine with OpenMP support. Here are a few reports to illustrate the problem:

With 102 images:

Solver Summary (v 1.10.0-lapack-suitesparse-cxsparse-openmp)

                                     Original                  Reduced
Parameter blocks                        40940                    39008
Parameters                             129201                   117609
Residual blocks                        133503                   133503
Residual                               267006                   267006

Minimizer                        TRUST_REGION

Sparse linear algebra library    SUITE_SPARSE
Trust region strategy     LEVENBERG_MARQUARDT

                                        Given                     Used
Linear solver                    SPARSE_SCHUR             SPARSE_SCHUR
Threads                                    48                       48
Linear solver threads                      48                       48
Linear solver ordering              AUTOMATIC               38813, 195

Cost:
Initial                          1.806568e+04
Final                            1.686165e+04
Change                           1.204027e+03

Minimizer iterations                       11
Successful steps                           11
Unsuccessful steps                          0

Time (in seconds):
Preprocessor                           1.0120

  Residual evaluation                  0.1268
  Jacobian evaluation                  1.3551
  Linear solver                        3.5802
Minimizer                              6.3650

Postprocessor                          0.0248
Total                                  7.4018

Termination:                      CONVERGENCE (Function tolerance reached. |cost_change|/cost: 3.047409e-07 <= 1.000000e-06)




With 477 images:

Solver Summary (v 1.10.0-lapack-suitesparse-cxsparse-openmp)

                                     Original                  Reduced
Parameter blocks                       141442                   139693
Parameters                             432087                   421593
Residual blocks                        479307                   479307
Residual                               958614                   958614

Minimizer                        TRUST_REGION

Sparse linear algebra library    SUITE_SPARSE
Trust region strategy     LEVENBERG_MARQUARDT

                                        Given                     Used
Linear solver                    SPARSE_SCHUR             SPARSE_SCHUR
Threads                                    48                       48
Linear solver threads                      48                       48
Linear solver ordering              AUTOMATIC              138855, 838

Cost:
Initial                          7.177760e+04
Final                            7.152920e+04
Change                           2.483920e+02

Minimizer iterations                       30
Successful steps                           22
Unsuccessful steps                          8

Time (in seconds):
Preprocessor                           5.5030

  Residual evaluation                  0.9512
  Jacobian evaluation                 11.3193
  Linear solver                       96.4503
Minimizer                            117.8575

Postprocessor                          0.0945
Total                                123.4550

Termination:                      CONVERGENCE (Function tolerance reached. |cost_change|/cost: 4.437567e-07 <= 1.000000e-06)


When the problem is very big, a single call can last 15 minutes. I've been running a job for a week now, and this is the minimal report for 3044 images. OpenMVG had the Ceres report flag set to false, so I do not have detailed information like the above, which I obtained by turning the report on and running a new problem from scratch yesterday.

Bundle Adjustment statistics (approximated RMSE):
 #views: 6844
 #poses: 3044
 #intrinsics: 2638
 #tracks: 772138
 #residuals: 6577458
 Initial RMSE: 0.396149
 Final RMSE: 0.395921
 Time (s): 925.159


I've tried switching to ITERATIVE_SCHUR/SCHUR_JACOBI, but that works even worse (100 images would take 150 seconds!). Maybe there's something wrong with the way I compiled OpenMVG, but it seems to work fine otherwise. The Ceres version I'm running is 1.10. From monitoring the CPUs, I can see that Ceres is not using all 48 of them, but just 1 most of the time.

Thanks
Raúl

Sameer Agarwal

Jul 21, 2016, 2:39:14 PM
to Ceres Solver
Can you share a -logtostderr -v=3 log for the bundle adjustment runs?

--
You received this message because you are subscribed to the Google Groups "Ceres Solver" group.
To unsubscribe from this group and stop receiving emails from it, send an email to ceres-solver...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/ceres-solver/da48975f-e887-4611-85ee-509e7456ee04%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Pierre Moulon

Jul 21, 2016, 3:02:34 PM
to ceres-...@googlegroups.com
I don't want to get in the way of the Ceres discussion; I'll just ask some additional questions and provide some comments.

1.
-> OpenMVG uses all the available CPUs if your compiler supports OpenMP.
In your case we can see that Ceres is trying to use all your CPUs and that the Ceres you are using was compiled with OpenMP support:
(v 1.10.0-lapack-suitesparse-cxsparse-openmp)
Threads                                    48                       48
Linear solver threads                      48                       48

The fact that only 1 thread is used most of the time is most likely due to a third-party library Ceres uses for the matrix computation
(the factorization? I think Sameer can provide more feedback about timing/threading performance for this part)

2.
It seems that you are using the incremental pipeline, which can be quite slow for very big datasets.
I would suggest trying the Global SfM pipeline, which uses fewer BA iterations.
Feel free to ask me additional questions if you want to give it a try on your dataset.

3.
You can also tweak some OpenMVG parameters to make the incremental SfM use less BA.
You can make the chosen image resection group size larger by lowering this percentage.

4.
In order to follow Sameer's suggestion about the log options (glog parameters),
you must compile your own Ceres, tell OpenMVG to use it, and add the glog configuration to https://github.com/openMVG/openMVG/blob/master/src/software/SfM/main_GlobalSfM.cpp
=> I set Ceres to use miniglog by default in order to minimize external dependencies.
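For illustration, a minimal sketch of what that glog configuration could look like at the top of main_GlobalSfM.cpp, assuming Ceres was rebuilt against the full glog library instead of the bundled miniglog (the flag names are standard glog globals, not OpenMVG code):

```cpp
// Hypothetical sketch: enabling the verbose logging Sameer asked for,
// assuming a Ceres build linked against full glog (not miniglog).
#include <glog/logging.h>

int main(int argc, char **argv) {
  google::InitGoogleLogging(argv[0]);
  FLAGS_logtostderr = true;  // send logs to stderr instead of log files
  FLAGS_v = 3;               // verbosity level 3, i.e. -v=3
  // ... rest of the SfM pipeline ...
  return 0;
}
```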

I'm also interested in the ceres-users' comments about the best way to use Ceres for large scenes ;-)


Regards/Cordialement,
Pierre M

Raúl Díaz

Jul 21, 2016, 3:21:44 PM
to ceres-...@googlegroups.com
Hi Pierre,

Thanks for the reply. Certainly, I'm using miniglog as it comes nicely packed in your software. So if I want to provide the logs Sameer is asking for, I need to install glog, right?

To reply to some of your comments, I think I can't use your global pipeline as I do not have the focals for all cameras in my image set, and that is a requirement, right?

On the other hand, I've tweaked openMVG a bit myself to speed things up (perhaps something we can discuss outside this thread):
 - p4pf solver when unknown intrinsics
 - reorganized _map_tracks as _map_tracks_reverse to be of the form {imageId, {trackId, featId}} so that searching for common tracks between images is dominated by the O(log n) search time, rather than the O(n).

Changing the group size is also an option, although I think it would not entirely solve this bottleneck. Reviewing my logs, I can see multiple calls to BA if some tracks have been rejected after the first call. There are times when BA is called 5 times in a row for that reason, yielding more than an hour of Ceres runtime for a single batch of images.

I'll install glog and let you know what I get.

Thanks
Raul


Pierre Moulon

Jul 21, 2016, 3:35:29 PM
to ceres-...@googlegroups.com
Hi Raul,

2016-07-21 21:21 GMT+02:00 Raúl Díaz <diaz...@gmail.com>:
Hi Pierre,

Thanks for the reply. Certainly, I'm using miniglog as it comes nicely packed in your software. So if I want to provide the logs Sameer is asking for, I need to install glog, right?
Install glog and gflags, compile your own Ceres, and then tell OpenMVG to use it.

To reply to some of your comments, I think I can't use your global pipeline as I do not have the focals for all cameras in my image set, and that is a requirement, right?
Yes.

On the other hand, I've tweaked openMVG a bit myself to speed things up (perhaps something we can discuss outside this thread):
 - p4pf solver when unknown intrinsics

It would be cool if you could submit a PR with this feature ;-)

 - reorganized _map_tracks as _map_tracks_reverse to be of the form {imageId, {trackId, featId}} so that searching for common tracks between images is dominated by the O(log n) search time, rather than the O(n).
Yes, I know it improves the running time; I haven't had the time to do it yet, despite having thought about it for a while.
The other reason is that in the past I tried to reject outlier correspondences by trackId, so the imageId (track list) was dynamic.
It would be valuable to have this as a PR too, and to refactor the code a bit to make this function usable from the openMVG::tracks module.

Changing the group size is also an option, although I think it would not entirely solve this bottleneck. Reviewing my logs, I can see multiple calls to BA if some tracks have been rejected after the first call. There are times when BA is called 5 times in a row for that reason, yielding more than an hour of Ceres runtime for a single batch of images.

OpenMVG's goal is to make the computation as accurate as possible, so running multiple BAs can sometimes seem cumbersome.
But as some of my papers show, running multiple BAs helps.
In your case, since you have a very large dataset, I think decreasing the number of BAs will not change the result much.
Currently, a series of BAs runs until no observation has a residual above 4 pixels:
=> continue to perform another BA if more than 50 observations have been rejected.
=> You can change the loop to run fewer iterations, or only one.

---
Off topic:
1. If you want to give Global SfM a try on your dataset, you can tweak openMVG_main_SfMInit_ImageListing to set a default focal length for all your cameras (something like 1.2 * max(w, h) should provide decent results).

2. You can also calibrate only the cameras that have a valid focal length with the GlobalSfM pipeline, then try to localize the remaining images in the reconstruction.
I have not tested this functionality on a large dataset, but it should work if the cameras calibrated in the first round cover your whole scene.


 

Raúl Díaz

Jul 24, 2016, 9:06:07 PM
to Ceres Solver
Hi

I waited for a couple of days to get the pipeline running and gather some logs for when the problem had become big and slow:

The stdout looks like this (the Ceres call lasted 1405 seconds):

CHOLMOD version 3.0.11, May 4, 2016: Symbolic Analysis: status: OK
  Architecture: Linux
    sizeof(int):      4
    sizeof(SuiteSparse_long):  8
    sizeof(void *):   8
    sizeof(double):   8
    sizeof(Int):      4 (CHOLMOD's basic integer)
    sizeof(BLAS_INT): 4 (integer used in the BLAS)
  Results from most recent analysis:
    Cholesky flop count: 5.5713e+10
    Nonzeros in L:       2.828e+07
  memory blocks in use:          15
  memory in use (MB):         166.8
  peak memory usage (MB):     330.9
  maxrank:    update/downdate rank:   8
  supernodal control: 1 40 (supernodal if flops/lnz >= 40)
  nmethods:   number of ordering methods to try: 1
    method 0: natural
        flop count: 5.5713e+10
        nnz(L):     2.828e+07
  OK

Solver Summary (v 1.10.0-lapack-suitesparse-cxsparse-openmp)

                                     Original                  Reduced
Parameter blocks                       468899                   467646
Parameters                            1419270                  1411752
Residual blocks                       1884494                  1884494
Residual                              3768988                  3768988

Minimizer                        TRUST_REGION

Sparse linear algebra library    SUITE_SPARSE
Trust region strategy     LEVENBERG_MARQUARDT

                                        Given                     Used
Linear solver                    SPARSE_SCHUR             SPARSE_SCHUR
Threads                                    48                       48
Linear solver threads                      48                       48
Linear solver ordering              AUTOMATIC             464708, 2938

Cost:
Initial                          4.780442e+05
Final                            4.706488e+05
Change                           7.395402e+03

Minimizer iterations                       28
Successful steps                           18
Unsuccessful steps                         10

Time (in seconds):
Preprocessor                          28.2264

  Residual evaluation                  2.4845
  Jacobian evaluation                 32.8710
  Linear solver                     1310.4017
Minimizer                           1376.7207

Postprocessor                          0.3506
Total                               1405.2977

Termination:                      CONVERGENCE (Function tolerance reached. |cost_change|/cost: 7.995233e-07 <= 1.000000e-06)


Bundle Adjustment statistics (approximated RMSE):
 #views: 6844
 #poses: 1791
 #intrinsics: 2400
 #tracks: 464708
 #residuals: 3768988
 Initial RMSE: 0.356141
 Final RMSE: 0.353375
 Time (s): 1405.3


The stderr looks like this (it's much longer, but essentially it repeats these forty-something-second blocks):

I0724 17:23:22.476943 31239 block_sparse_matrix.cc:78] Allocating values array with 452278560 bytes.
I0724 17:23:29.882452 31239 block_random_access_sparse_matrix.cc:75] Matrix Size [17628,17628] 14446008
I0724 17:23:30.399112 31239 detect_structure.cc:107] Schur complement static structure <2,3,6>.
I0724 17:24:15.843502 31239 wall_time.cc:74] 

SchurComplementSolver::Solve
                                   Delta   Cumulative
                           Setup :    5.07889      5.07889
                       Eliminate :    1.94684      7.02573
                    ReducedSolve :   42.77079     49.79652
                  BackSubstitute :    0.10135     49.89786
                           Total :    0.00002     49.89788

I0724 17:25:15.193806 31239 wall_time.cc:74] 

SchurComplementSolver::Solve
                                   Delta   Cumulative
                           Setup :    0.00361      0.00361
                       Eliminate :    1.93419      1.93780
                    ReducedSolve :   53.92556     55.86336
                  BackSubstitute :    0.10428     55.96764
                           Total :    0.00002     55.96766

I0724 17:26:03.628803 31239 wall_time.cc:74] 

SchurComplementSolver::Solve
                                   Delta   Cumulative
                           Setup :    0.00670      0.00670
                       Eliminate :    1.97086      1.97755
                    ReducedSolve :   43.26899     45.24654
                  BackSubstitute :    0.10377     45.35032
                           Total :    0.00002     45.35034

I0724 17:26:49.655076 31239 wall_time.cc:74] 

SchurComplementSolver::Solve
                                   Delta   Cumulative
                           Setup :    0.00404      0.00404
                       Eliminate :    1.95039      1.95442
                    ReducedSolve :   41.20693     43.16135
                  BackSubstitute :    0.10206     43.26341
                           Total :    0.00002     43.26343


Let me know if I need to provide something else.

Thanks
Raúl

Sameer Agarwal

Jul 25, 2016, 12:52:11 AM
to Ceres Solver
Raul,
Thanks for sharing the logs. From these we can infer that almost all the time is being spent factoring the Schur complement. Your Schur complement matrix is 17.6k x 17.6k with 14M non-zeros. That's a sizeable matrix to be factorizing every iteration. Since CHOLMOD is single threaded, its performance is entirely governed by the performance of the underlying BLAS & LAPACK libraries. Which BLAS/LAPACK library are you linking to? I hope it is not the reference BLAS. I recommend using either ATLAS or OpenBLAS.
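One quick way to check which BLAS ends up linked (a suggestion; the binary name here is assumed to be OpenMVG's incremental SfM driver, substitute your own) is to inspect the shared libraries of the final executable:

```shell
# List the BLAS/LAPACK shared libraries the binary actually links against.
# Seeing libopenblas or ATLAS here is good; a plain libblas.so usually
# means the slow reference BLAS.
ldd ./openMVG_main_IncrementalSfM | grep -i -E 'blas|lapack'
```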

The second thing to try is the ITERATIVE_SCHUR solver with the SCHUR_JACOBI preconditioner. For small to moderate systems it is not a good choice, but once you go past a thousand or so images it works quite well. You can also try the JACOBI and CLUSTER_JACOBI preconditioners. The former is cheaper but, in my opinion, performs worse; the latter performs better but is more expensive to set up, and in many cases that cost is prohibitive compared to SCHUR_JACOBI. You should use the single-linkage-based clustering algorithm with it.
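For reference, a hedged sketch of how these suggestions map onto Ceres's C++ API (the Solver::Options fields below are standard Ceres options; OpenMVG configures the solver in its own code, so this is illustrative rather than a drop-in patch):

```cpp
#include <ceres/ceres.h>

// Illustrative sketch (not OpenMVG's actual code) of the solver
// configuration suggested above for large bundle adjustment problems.
void ConfigureForLargeBA(ceres::Solver::Options *options) {
  options->linear_solver_type = ceres::ITERATIVE_SCHUR;
  // Cheap preconditioner that tends to work well past ~1000 images:
  options->preconditioner_type = ceres::SCHUR_JACOBI;
  // Alternative: CLUSTER_JACOBI is often a better preconditioner but is
  // more expensive to set up; pair it with single-linkage clustering.
  // options->preconditioner_type = ceres::CLUSTER_JACOBI;
  // options->visibility_clustering_type = ceres::SINGLE_LINKAGE;
  options->num_threads = 48;
  options->num_linear_solver_threads = 48;
}
```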

Sameer

