Parallelizing Iterative Schur


Björn Piltz
May 31, 2017, 5:53:04 AM
to ceres-...@googlegroups.com
Hi,
I do classic bundle adjustment with Ceres, and I have recently been looking into boosting its performance a bit. It seems there is room for more parallelization.

My setup is as follows:
  • 2 K images
  • 20 M image observations
  • I am using ITERATIVE_SCHUR with the SCHUR_JACOBI preconditioner
  • I have enabled suite_sparse, custom blas, schur specializations and OpenMP (see the configure sketch after this list)
  • I have compiled Ceres with MSVC 2015 x64 for Windows 10
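For reference, a configure line along these lines is what I mean (a sketch only; SUITESPARSE, CUSTOM_BLAS, SCHUR_SPECIALIZATIONS and OPENMP are the Ceres CMake options, but the generator and paths here are placeholders):

    cmake path\to\ceres-solver -G "Visual Studio 14 2015 Win64" ^
        -DSUITESPARSE=ON ^
        -DCUSTOM_BLAS=ON ^
        -DSCHUR_SPECIALIZATIONS=ON ^
        -DOPENMP=ON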
I attach a log and a visual profiling session. A few things can be learned from these:
  • Virtually all time is spent in ImplicitSchurComplement::RightMultiply(). 
  • RightMultiply() consistently takes ~1.6 s to execute.
  • RightMultiply() utilizes 12.5% (1/8) of my processing power (I have 8 physical cores).
    • This is quite efficient. Single-threaded code often only reaches ~6% (1/16 logical cores).
  • RightMultiply() is called increasingly often (22x in the first iteration and 280x in the fourth).
So my question is: Does this setup look sane, and if yes, would it be a huge effort to implement a parallelized version of ImplicitSchurComplement::RightMultiply()? I'd be willing to contribute. I guess MatrixVectorMultiply<Dynamic, Dynamic>() is where the actual work is done.

Thanks for the good work!
Björn

Sameer Agarwal
May 31, 2017, 8:10:08 AM
to ceres-...@googlegroups.com
Hi Bjorn,
My comments are inline.

On Wed, May 31, 2017 at 2:53 AM Björn Piltz <bjorn...@gmail.com> wrote:
> Hi,
> I do classic bundle adjustment with Ceres, and I have recently been looking into boosting its performance a bit. It seems there is room for more parallelization.
>
> My setup is as follows:
>   • 2 K images
>   • 20 M image observations
>   • I am using ITERATIVE_SCHUR with the SCHUR_JACOBI preconditioner
>   • I have enabled suite_sparse, custom blas, schur specializations and OpenMP
>   • I have compiled Ceres with MSVC 2015 x64 for Windows 10
> I attach a log and a visual profiling session. A few things can be learned from these:
>   • Virtually all time is spent in ImplicitSchurComplement::RightMultiply().
>   • RightMultiply() consistently takes ~1.6 s to execute.
>   • RightMultiply() utilizes 12.5% (1/8) of my processing power (I have 8 physical cores).
>     • This is quite efficient. Single-threaded code often only reaches ~6% (1/16 logical cores).
>   • RightMultiply() is called increasingly often (22x in the first iteration and 280x in the fourth).

The increase in the number of calls to RightMultiply is expected as the iterations progress, but it is likely also indicative that the SchurJacobi preconditioner is not good enough for what you are doing. 

You may want to consider the CLUSTER_JACOBI or CLUSTER_TRIDIAGONAL preconditioners with SINGLE_LINKAGE clustering. That will trade off some preconditioner setup time for a better quality preconditioner.
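In code, that is just a couple of Solver::Options settings (a minimal sketch; visibility_clustering_type is the option that selects the clustering algorithm):

    ceres::Solver::Options options;
    options.linear_solver_type = ceres::ITERATIVE_SCHUR;
    options.preconditioner_type = ceres::CLUSTER_TRIDIAGONAL;  // or ceres::CLUSTER_JACOBI
    options.visibility_clustering_type = ceres::SINGLE_LINKAGE;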
 
> So my question is: Does this setup look sane, and if yes, would it be a huge effort to implement a parallelized version of ImplicitSchurComplement::RightMultiply()?
 
The functions you would want to speed up are the four matrix multiply routines in partitioned_matrix_view_impl.h:

PartitionedMatrixView::RightMultiplyE
PartitionedMatrixView::RightMultiplyF
PartitionedMatrixView::LeftMultiplyE
PartitionedMatrixView::LeftMultiplyF

Of these, the RightMultiply variants are straightforward to parallelize using OpenMP, but that only gives you a minor speedup. The reason is that these functions are all memory bound.
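To illustrate the pattern (a self-contained sketch, not the actual Ceres data structures): in a right multiply, each row block writes a disjoint slice of the output, so the outer loop parallelizes without locks. The LeftMultiply (transpose) variants scatter into shared output blocks and would need per-thread buffers or atomics.

    #include <vector>

    // Illustration only: minimal block-row sparse structure, not Ceres code.
    struct Cell {              // one dense sub-block within a row block
      int col_begin;           // first entry of the block in x
      int cols;                // block width
      const double* values;    // row-major block values (rows x cols)
    };
    struct RowBlock {
      int row_begin;           // first entry of the block in y
      int rows;                // block height
      std::vector<Cell> cells; // non-zero blocks in this row block
    };

    // y += A * x. Each iteration of the outer loop touches only
    // y[row_begin, row_begin + rows), so it is safe to parallelize.
    void RightMultiply(const std::vector<RowBlock>& A,
                       const double* x, double* y) {
    #pragma omp parallel for
      for (int r = 0; r < static_cast<int>(A.size()); ++r) {
        const RowBlock& row = A[r];
        for (const Cell& cell : row.cells) {
          for (int i = 0; i < row.rows; ++i) {
            double sum = 0.0;
            for (int j = 0; j < cell.cols; ++j) {
              sum += cell.values[i * cell.cols + j] * x[cell.col_begin + j];
            }
            y[row.row_begin + i] += sum;
          }
        }
      }
    }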

I have tried a variety of different things to speed these functions up over the years, and not much has worked. 

The only thing which gave a substantial speed up is a column re-ordering of the Jacobian matrix so that all of the E blocks are contiguous and all the F blocks are contiguous. This lovely bit of logic is in BuildJacobianLayout in block_jacobian_writer.cc.
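For intuition (my illustration of the idea, not code from Ceres): the partitioned view treats the Jacobian as J = [E F], and the re-ordering stores the two halves contiguously,

    columns:   [ E (point blocks) ... | F (camera blocks) ... ]

so RightMultiplyE and RightMultiplyF each stream through one contiguous region of memory instead of hopping over interleaved E and F cells.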

> I'd be willing to contribute. I guess MatrixVectorMultiply<Dynamic, Dynamic>() is where the actual work is done.

This already exists, but speaking of it, one more thing you could try is disabling the custom_blas routines with "-DCUSTOM_BLAS=OFF" and seeing whether Eigen's routines have caught up with the hand-written ones.

There used to be a significant difference between them, but it's been a few years since I benchmarked these routines.
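Concretely, that is just a reconfigure and rebuild (the source path is a placeholder):

    cmake -DCUSTOM_BLAS=OFF path/to/ceres-solver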

Sameer


Björn Piltz
Jun 1, 2017, 11:15:54 AM
to ceres-...@googlegroups.com

Thanks for the input!

I tried all combinations of preconditioner and linkage, and ITERATIVE_SCHUR with SCHUR_JACOBI still seems to be the fastest. I'll take your word for it that this is not a trivial thing to improve, since it mainly comes down to memory ordering.


CUSTOM_BLAS is still necessary. I think that is mainly because the dynamic-sized version of MatrixVectorMultiply() gets called a lot with r=2 and c=2 in my case, and the overhead of using dynamic-sized matrices there is big.
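To illustrate what I mean (a stand-alone sketch, not the Ceres internals): with the block size known at compile time, the whole product unrolls into a handful of multiply-adds, while the dynamic version pays loop and indexing overhead on every tiny block.

    // Fixed 2x2 case: fully unrolled, everything stays in registers.
    inline void MatVec2x2(const double* A, const double* x, double* y) {
      y[0] += A[0] * x[0] + A[1] * x[1];
      y[1] += A[2] * x[0] + A[3] * x[1];
    }

    // Dynamic case: the same y += A * x, but with runtime loop bounds
    // and index arithmetic that dominate when rows = cols = 2.
    inline void MatVecDyn(const double* A, int rows, int cols,
                          const double* x, double* y) {
      for (int i = 0; i < rows; ++i) {
        double sum = 0.0;
        for (int j = 0; j < cols; ++j) sum += A[i * cols + j] * x[j];
        y[i] += sum;
      }
    }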


Björn

Sameer Agarwal
Jun 1, 2017, 5:01:35 PM
to ceres-...@googlegroups.com
Bjorn,
 

> CUSTOM_BLAS is still necessary. I think that is mainly because the dynamic-sized version of MatrixVectorMultiply() gets called a lot with r=2 and c=2 in my case, and the overhead of using dynamic-sized matrices there is big.


I do not understand. In either case, the static version should get called if Ceres detects the static block sizes correctly.

 




Björn Piltz
Jun 2, 2017, 3:59:15 AM
to ceres-...@googlegroups.com

Yes, the static version gets called when the sizes are known. I was thinking of the calls to MatrixVectorMultiply<Eigen::Dynamic, Eigen::Dynamic>() in BlockSparseMatrix::RightMultiply() and PartitionedMatrixView<>::RightMultiplyF(). I'm not sure how large an impact these calls have. I was just thinking that to truly compare Eigen vs. custom BLAS it would make sense to specialize the dynamic case, since Eigen will always have a disadvantage there.

Sameer Agarwal
Jun 2, 2017, 9:50:55 AM
to ceres-...@googlegroups.com
Those get called if you have residual blocks that do not fit the Schur structure. In most problems the number of such blocks should be rather small. I wouldn't worry about them.
Sameer

