OpenMP doesn't work on Ceres Solver Android version


Frank Young

Dec 26, 2017, 9:02:51 PM
to Ceres Solver

Hi guys,


http://www.ceres-solver.org/installation.html says that OpenMP should be turned off on Android, and we did find that ChangeNumThreadsIfNeeded() forces num_linear_solver_threads to 1. However, I ran into a performance issue when running SLAM on Android and would like to turn on OpenMP for multi-threading. After forcibly enabling OpenMP, I got different results compared to the single-threaded run. Can you tell me why OpenMP doesn't work?
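For reference, a minimal sketch of the two threading fields being discussed; the values are only illustrative, and in a build without a threading backend Ceres forces both counts back to 1:

#include "ceres/ceres.h"

int main() {
  ceres::Solver::Options options;
  // Illustrative values; Ceres clamps these back to 1 when no threading backend is available.
  options.num_threads = 4;                // threads used for residual/Jacobian evaluation
  options.num_linear_solver_threads = 4;  // threads used by the linear solver
  return 0;
}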


I also noticed that TBB was added to Ceres Solver in the latest commits. Is it supported on the Android platform?


Cheers,
Frank

Keir Mierle

Dec 27, 2017, 1:07:29 AM
to ceres-...@googlegroups.com
Hi Frank,

In the distant past, OpenMP on Android was not well supported. However, this hasn't been revisited in a long time. There are some people using OpenMP on Android who are doing their own Ceres build as part of a larger project, so I believe it is possible. We are happy to accept patches that show how to use it, or update the Android.mk to use it.

For Intel TBB, it should be possible to use this on Android but there is no one testing / supporting it. If you care about this, we are happy to help you with reviews or suggestions and would welcome the contribution. I don't believe there are any fundamental barriers here; it's just a matter of doing the work and enhancing the documentation.

Thanks and happy optimizing,
Keir



Frank Young

Dec 27, 2017, 6:26:40 AM
to Ceres Solver
Keir,

Thanks for the quick feedback.
I will try TBB on the latest version for ARM/Android.

Frank

On Wednesday, December 27, 2017 at 2:07:29 PM UTC+8, Keir Mierle wrote:

Frank Young

Dec 29, 2017, 5:03:18 AM
to Ceres Solver
Hi Keir,

I am glad to let you know that both OpenMP and TBB now work correctly on my device (Android, 4-core ARM CPU).
But it seems that I don't get any performance boost from the multi-threading, even though I set num_linear_solver_threads to 4 (linear_solver_type = DENSE_SCHUR, trust_region_strategy_type = DOGLEG).
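A minimal sketch of the solver configuration described above (standard ceres::Solver::Options fields; the problem construction itself is omitted):

#include "ceres/ceres.h"

void ConfigureSolver(ceres::Solver::Options* options) {
  options->linear_solver_type = ceres::DENSE_SCHUR;
  options->trust_region_strategy_type = ceres::DOGLEG;
  options->num_linear_solver_threads = 4;
}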

For Ax = b, the A matrix is about 300x300.

TBB: only one thread is working on the linear solve; the other 3 threads are idle.
----------------------------------------------------------------------------------------------------------------------------
User 41%, System 11%, IOW 0%, IRQ 0%
User 521 + Nice 3 + Sys 145 + Idle 576 + IOW 0 + IRQ 7 + SIRQ 3 = 1255

  PID   TID USER     PR  NI CPU% S     VSS     RSS PCY Thread          Proc
24787 24788 root     20   0  22% S 110748K  24172K  fg estimator_test  estimator_test
24787 24787 root     20   0  17% R 110748K  24172K  fg estimator_test  estimator_test
24787 24789 root     20   0   2% S 110748K  24172K  fg estimator_test  estimator_test
24786 24786 root     20   0   2% R   9268K   2548K  fg top             top
24787 24791 root     20   0   1% S 110748K  24172K  fg estimator_test  estimator_test
24787 24790 root     20   0   1% S 110748K  24172K  fg estimator_test  estimator_test


OpenMP: even though all 4 linear solver threads are busy and the CPU usage is high, I did not see any performance boost.
-----------------------------------------------------------------------------------------------------------------------------
User 79%, System 13%, IOW 0%, IRQ 0%
User 1206 + Nice 2 + Sys 209 + Idle 87 + IOW 0 + IRQ 10 + SIRQ 3 = 1517

  PID   TID USER     PR  NI CPU% S     VSS     RSS PCY Thread          Proc
24944 24948 root     20   0  20% R 103956K  25456K  fg estimator_test  estimator_test
24944 24945 root     20   0  19% S 103956K  25456K  fg estimator_test  estimator_test
24944 24946 root     20   0  19% R 103956K  25456K  fg estimator_test  estimator_test
24944 24947 root     20   0  18% R 103956K  25456K  fg estimator_test  estimator_test
24944 24944 root     20   0  11% R 103956K  25456K  fg estimator_test  estimator_test
24950 24950 root     20   0   7% R   9268K   2540K  fg top             top

In fact, the performance of the Schur eliminator is crucial to the overall bundle adjustment in Ceres, and multi-threading is mainly applied to it.
Q1. Multi-threading doesn't bring any performance boost. Is it because the A matrix is too small?
Q2. Do you think it is feasible to implement the Schur eliminator on the GPU for a speedup? I also hope to move the computation load from the CPU to the GPU.

I am trying to read the Schur eliminator/complement code and it is really hard to understand. Are there any documents/papers you would recommend? Thanks so much.

Frank

On Wednesday, December 27, 2017 at 7:26:40 PM UTC+8, Frank Young wrote:

Sameer Agarwal

Dec 29, 2017, 5:11:55 AM
to ceres-...@googlegroups.com

Frank,
Can you share the output of Summary::FullReport()?
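For reference, a minimal sketch of how the report is printed (assuming a problem and options have already been set up):

#include <iostream>
#include "ceres/ceres.h"

void SolveAndReport(const ceres::Solver::Options& options, ceres::Problem* problem) {
  ceres::Solver::Summary summary;
  ceres::Solve(options, problem, &summary);
  std::cout << summary.FullReport() << "\n";
}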

Sameer


Frank Young

Dec 29, 2017, 8:50:14 AM
to Ceres Solver
Sameer,

I am on my New Year holiday. I will paste the output of FullReport() once I am back in the office next Tuesday.

I have read a lot about Ceres from your comments, e.g., http://thread.gmane.org/gmane.comp.lib.eigen/3901. You did a great job on Ceres. I am new to non-linear optimization and would appreciate it if you could recommend some papers about the Schur eliminator for further reading.

Cheers,
Frank

On Friday, December 29, 2017 at 6:11:55 PM UTC+8, Sameer Agarwal wrote:

Sameer Agarwal

Dec 29, 2017, 11:33:28 AM
to ceres-...@googlegroups.com
Frank,

Schur elimination as implemented in Ceres is not documented anywhere. It is slightly different from (but mathematically equivalent to) the classical Schur complement elimination as described by Lourakis et al. in their description of the SBA algorithm.

Schur elimination is not GPU friendly. It is highly irregular. If you tell us more about the structure and size of the problem you are solving, we may be able to suggest performance improvements.
Sameer




Frank Young

Jan 1, 2018, 9:03:26 PM
to Ceres Solver
Hello Sameer, FYI.

Ceres Solver Report: Iterations: 51, Initial cost: 7.581848e-03, Final cost: 1.155901e-03, Termination: NO_CONVERGENCE
summary.FullReport: 
Solver Summary (v 1.13.0-eigen-(3.3.4)-lapack-suitesparse-(((4) * 1000 + (0)))-openmp-no_tbb)

                                     Original                  Reduced
Parameter blocks                          113                      110
Parameters                                350                      340
Effective parameters                      339                      330
Residual blocks                           877                      877
Residual                                 1754                     1754

Minimizer                        TRUST_REGION

Dense linear algebra library            EIGEN
Trust region strategy                  DOGLEG (TRADITIONAL)

                                        Given                     Used
Linear solver                     DENSE_SCHUR              DENSE_SCHUR
Threads                                     1                        1
Linear solver threads                       4                        4
Linear solver ordering              AUTOMATIC                    91,19
Schur structure                         2,3,3                    d,d,d

Cost:
Initial                          7.581848e-03
Final                            1.155901e-03
Change                           6.425946e-03

Minimizer iterations                       51
Successful steps                           42
Unsuccessful steps                          9

Time (in seconds):
Preprocessor                         0.007121

  Residual evaluation                0.043112
  Jacobian evaluation                0.145522
  Linear solver                      0.347084
Minimizer                            0.596056

Postprocessor                        0.000102
Total                                0.603279

Termination:                   NO_CONVERGENCE (Maximum number of iterations reached. Number of iterations: 50.)

Frank

On Saturday, December 30, 2017 at 12:33:28 AM UTC+8, Sameer Agarwal wrote:

Sameer Agarwal

Jan 2, 2018, 1:50:00 AM
to ceres-...@googlegroups.com
Frank,
How many cameras do you have?
The first suggestion I have is for you to specify the linear_solver_ordering manually instead of having Ceres determine it automatically.
The other is that I noticed you are most likely not compiling with the Schur template specializations enabled. That will let you exploit the static Schur structure and speed up the linear solver.
Sameer


Frank Young

Jan 2, 2018, 5:05:43 AM
to Ceres Solver
Sameer,

Our window size is 10, which means 10 cameras.

Enabling the Schur template specializations gave us a performance boost of about 25%. That is great, thanks so much.
If the Schur structure is fixed as (2, 3, 3), Eigen has the chance to allocate the matrices on the stack at compile time instead of dynamically allocating them on the heap at runtime. This would avoid frequent allocation overhead and improve access efficiency. Am I right?

I will try linear_solver_ordering as the next step. In fact, I have no idea what the optimal linear_solver_ordering assignment should be at this time.

Frank 

On Tuesday, January 2, 2018 at 2:50:00 PM UTC+8, Sameer Agarwal wrote:

Sameer Agarwal

Jan 2, 2018, 9:52:47 AM
to ceres-...@googlegroups.com
I am glad that worked.

No, this has nothing to do with dynamic or static allocation. This is about exposing to the compiler the static size of these matrices, which allows it to do the linear algebra much more efficiently by using statically sized loops.
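To illustrate the point with plain Eigen (a generic example, not Ceres code): with fixed-size matrix types the compiler knows the loop bounds at compile time and can unroll and vectorize the product, which is what the Schur specializations expose for the (2, 3, 3) structure.

#include "Eigen/Dense"

// Dynamic sizes: the loop bounds are known only at run time.
Eigen::MatrixXd DynamicProduct(const Eigen::MatrixXd& a, const Eigen::MatrixXd& b) {
  return a * b;
}

// Static sizes: a 2x3 times 3x3 product compiled with statically sized loops.
Eigen::Matrix<double, 2, 3> StaticProduct(const Eigen::Matrix<double, 2, 3>& a,
                                          const Eigen::Matrix<double, 3, 3>& b) {
  return a * b;
}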

Are your camera intrinsics and extrinsics separate parameter blocks?

Or do you have shared intrinsics?

All the point parameter blocks should go in the first elimination group, and all the camera parameter blocks should go in the second elimination group. A sketch of this follows below.
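A minimal sketch of setting the ordering manually; the block pointers here are placeholders for the actual parameter block arrays in the problem:

#include <vector>
#include "ceres/ceres.h"

void SetManualOrdering(const std::vector<double*>& point_blocks,
                       const std::vector<double*>& camera_blocks,
                       ceres::Solver::Options* options) {
  ceres::ParameterBlockOrdering* ordering = new ceres::ParameterBlockOrdering;
  // Points go in the first elimination group (0), cameras in the second (1).
  for (double* point : point_blocks) {
    ordering->AddElementToGroup(point, 0);
  }
  for (double* camera : camera_blocks) {
    ordering->AddElementToGroup(camera, 1);
  }
  options->linear_solver_ordering.reset(ordering);
}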

Sameer

Frank Young

Jan 5, 2018, 2:08:28 AM
to Ceres Solver
Sameer,

There are extrinsic and point parameter blocks in our case. I added the point blocks to group 0 and the extrinsic blocks to group 1. It does not seem to help much.

                              Given            Used
Linear solver ordering        AUTOMATIC        91,19

After modification,
Linear solver ordering        91,22            91,19


Now Ceres is running entirely on the CPU and a high CPU load is observed. Do you think any functions in the minimizer and the linear solver could be moved to run on the DSP or GPU? Our platform is Snapdragon 82x; the DSP and GPU are almost idle when running Ceres.
 
Frank

On Tuesday, January 2, 2018 at 10:52:47 PM UTC+8, Sameer Agarwal wrote:

Sameer Agarwal

Jan 5, 2018, 5:51:42 AM
to ceres-...@googlegroups.com
Frank,
Can you share the current output of Summary::FullReport() after the Schur specializations were enabled?
Sameer


Frank Young

Jan 5, 2018, 8:27:05 AM
to Ceres Solver
FYI.

summary.FullReport: 
Solver Summary (v 1.13.0-eigen-(3.3.4)-lapack-suitesparse-(((4) * 1000 + (0)))-no_openmp-no_tbb)

                                     Original                  Reduced
Parameter blocks                          113                      110
Parameters                                350                      340
Effective parameters                      339                      330
Residual blocks                           877                      877
Residual                                 1754                     1754

Minimizer                        TRUST_REGION

Dense linear algebra library            EIGEN
Trust region strategy                  DOGLEG (TRADITIONAL)

                                        Given                     Used
Linear solver                     DENSE_SCHUR              DENSE_SCHUR
Threads                                     1                        1
Linear solver threads                       1                        1
Linear solver ordering                  91,22                    91,19
Schur structure                         2,3,3                    2,3,3

Cost:
Initial                          7.581848e-03
Final                            1.155901e-03
Change                           6.425947e-03

Minimizer iterations                       51
Successful steps                           42
Unsuccessful steps                          9

Time (in seconds):
Preprocessor                         0.001711

  Residual evaluation                0.013869
  Jacobian evaluation                0.054759
  Linear solver                      0.096086
Minimizer                            0.189615

Postprocessor                        0.000056
Total                                0.191384

Termination:                   NO_CONVERGENCE (Maximum number of iterations reached. Number of iterations: 50.)


On Friday, January 5, 2018 at 6:51:42 PM UTC+8, Sameer Agarwal wrote:

Sameer Agarwal

Jan 5, 2018, 9:22:23 AM
to ceres-...@googlegroups.com
Frank,
As a first step, why not use OpenMP threading? It works quite well.
That said, your problem is fairly small; threading is only going to take you so far.
Also, what is the performance you are trying to hit?
Sameer


Frank Young

Jan 6, 2018, 10:45:28 AM
to Ceres Solver
Hi Sameer,

I tried OpenMP and found that the same case took more time, even though I set num_linear_solver_threads=1.

Now the number of iterations is 50 and the linear solver time is 0.096 s, which means each iteration takes ~2 ms. Profiling shows that the second loop in SchurEliminator::Eliminate() is where most of the time is spent, and chunks_.size() is ~100 there. So I guess multi-threading should be helpful here (different threads could do the computation and write the results back to different parts of the same array or matrix simultaneously). In fact, I am not sure whether the problem is large enough to be worth threading.
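Schematically, the pattern being described is a parallel loop over the chunks, with each thread writing to its own slice of the output. The sketch below is only an illustration of that pattern with hypothetical names (ChunkWork, output_blocks); it is not the actual SchurEliminator code:

#include <vector>
#include "Eigen/Dense"

// Hypothetical per-chunk workspace standing in for the real chunk bookkeeping.
struct ChunkWork {
  Eigen::MatrixXd contribution;  // result computed from this chunk
};

void EliminateChunks(const std::vector<ChunkWork>& chunks,
                     std::vector<Eigen::MatrixXd>* output_blocks) {
  const int num_chunks = static_cast<int>(chunks.size());
  // Each iteration reads chunk i and writes only output block i, so the writes are disjoint.
#pragma omp parallel for
  for (int i = 0; i < num_chunks; ++i) {
    (*output_blocks)[i] = chunks[i].contribution;  // placeholder for the real elimination math
  }
}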

In our case I am trying my best to get the time below 150 ms. :-)

Frank
 



On Friday, January 5, 2018 at 10:22:23 PM UTC+8, Sameer Agarwal wrote:

Sameer Agarwal

Jan 8, 2018, 5:31:27 PM
to ceres-...@googlegroups.com
You may get some benefit from threading, but it is not a clear win. Writing to different parts of the Schur complement matrix is also a problem, since the way the computation is done is very incoherent.

I am thinking about ways of improving the performance of the Schur eliminator, but I do not have anything immediate to help right now.

Sameer


Keir Mierle

Jan 8, 2018, 7:46:54 PM
to ceres-...@googlegroups.com
Hi Frank,

As an aside, it looks like you are trying to use Ceres in real time. We didn't design Ceres for this case, but we have found that several people are using it in this context anyway. Can you explain more about your particular use case? What is your final application?

Thanks,
Keir


Frank Young

Jan 30, 2018, 3:06:42 AM
to Ceres Solver
Keir,

Sorry for the late reply.
We use Ceres in a modified VINS-Mono project, which runs on our AR glasses (Snapdragon 8XX series) for SLAM.
In one of our cases the Schur structure is (d, d, d), and I found that the bottleneck is the small matrix multiplications in SchurEliminator::Eliminate(). It uses the native calls in small_blas.h.
I am trying to do some optimization with unrolling and assembly on the AArch64 platform now, and hope for some performance boost.

Cheers,
Frank

On Tuesday, January 9, 2018 at 8:46:54 AM UTC+8, Keir Mierle wrote:


Sameer Agarwal

Jan 30, 2018, 8:38:11 AM
to ceres-...@googlegroups.com
Frank,
Is it ddd because the structure detection found it to be completely dynamic, or is Ceres missing a specialization?

You can also try disabling custom_blas, in which case we will fall back to Eigen, and it may work better.

Sameer


Frank Young

Feb 8, 2018, 5:02:42 AM
to Ceres Solver
Sameer,

It is ddd because the structure detection found it to be completely dynamic.
For A*B = C, the sub-blocks of C look like the following in one iteration:

1x1, 1x6, 2x1, 2x6, 6x6, 9x9, 9x6, 15x6, 15x9

I tried to use the Eigen calls (such as MatrixMatrixMultiplyEigen(), etc.) for these small matrix operations. Unfortunately the performance did not get better.
I did some optimizations for the triple for-loop matrix multiply in small_blas.h with unrolling/assembly, and got a performance improvement of about 15% on my arm64-v8a platform.
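As an illustration of the kind of change being described, here is a generic sketch of unrolling the inner loop of a row-major C += A * B by a factor of four, with a scalar tail; it is not the actual small_blas.h code or the patch itself:

// Row-major C (rows_a x cols_b) += A (rows_a x cols_a) * B (cols_a x cols_b),
// with the k loop unrolled by four. Purely illustrative.
void SmallGemmAddUnrolled(const double* A, int rows_a, int cols_a,
                          const double* B, int cols_b, double* C) {
  for (int i = 0; i < rows_a; ++i) {
    for (int j = 0; j < cols_b; ++j) {
      double sum = 0.0;
      int k = 0;
      for (; k + 4 <= cols_a; k += 4) {
        sum += A[i * cols_a + k + 0] * B[(k + 0) * cols_b + j];
        sum += A[i * cols_a + k + 1] * B[(k + 1) * cols_b + j];
        sum += A[i * cols_a + k + 2] * B[(k + 2) * cols_b + j];
        sum += A[i * cols_a + k + 3] * B[(k + 3) * cols_b + j];
      }
      for (; k < cols_a; ++k) {  // scalar tail
        sum += A[i * cols_a + k] * B[k * cols_b + j];
      }
      C[i * cols_b + j] += sum;
    }
  }
}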

Frank

On Tuesday, January 30, 2018 at 9:38:11 PM UTC+8, Sameer Agarwal wrote:

Sameer Agarwal

Feb 8, 2018, 10:38:59 AM
to ceres-...@googlegroups.com
Frank,

Thanks for the update. Would you be willing to contribute your implementations to Ceres? 
I am in the process of adding some benchmarks to Ceres for small_blas; they should help measure and improve performance.

Sameer

Frank Young

Feb 12, 2018, 7:23:50 PM
to Ceres Solver
Sameer,

I wrote this code for our company's project and needed to get approval from my team before contributing it to Ceres. Fortunately, after talking with Terry (my line manager), I am happy to let you know that it is OK to contribute the code to Ceres. I will follow up with you from my company email for more details.

Frank

On Thursday, February 8, 2018 at 11:38:59 PM UTC+8, Sameer Agarwal wrote:

Sameer Agarwal

Feb 12, 2018, 7:51:13 PM
to ceres-...@googlegroups.com

vincent yu

Apr 27, 2018, 10:07:20 PM
to Ceres Solver
Hi, Frank

Excuse me, I want to know how to enable TBB and use multi-threading in Ceres. Can you tell me how to configure the parameters in Android.mk and Application.mk? Thanks a lot.

Cheers,
vincent

On Wednesday, December 27, 2017 at 10:02:51 AM UTC+8, Frank Young wrote: