OpenMP doesn't work on Ceres Solver Android version


Frank Young

Dec 26, 2017, 9:02:51 PM
to Ceres Solver

Hi guys,


http://www.ceres-solver.org/installation.html says that OpenMP should be turned off on Android, and we did find that ChangeNumThreadsIfNeeded() forces num_linear_solver_threads to 1. However, I ran into a performance issue when running SLAM on Android and would like to turn on OpenMP for multi-threading. After forcibly enabling OpenMP, I got different results compared to the single-threaded run. Can you tell me why OpenMP doesn't work?
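For reference, a minimal sketch of the two threading fields being discussed; the values are only illustrative, and in a build without a threading backend Ceres forces both counts back to 1:

#include "ceres/ceres.h"

int main() {
  ceres::Solver::Options options;
  // Illustrative values; Ceres clamps these back to 1 when no threading backend is available.
  options.num_threads = 4;                // threads used for residual/Jacobian evaluation
  options.num_linear_solver_threads = 4;  // threads used by the linear solver
  return 0;
}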


I also noticed that TBB was added to Ceres Solver in the latest commits. Is it supported on the Android platform?


Cheers,
Frank

Keir Mierle

Dec 27, 2017, 1:07:29 AM
to ceres-...@googlegroups.com
Hi Frank,

In the distant past, OpenMP on Android was not well supported. However, this hasn't been revisited in a long time. There are some people using OpenMP on Android who are doing their own Ceres build as part of a larger project, so I believe it is possible. We are happy to accept patches that show how to use it, or update the Android.mk to use it.

For Intel TBB, it should be possible to use this on Android but there is no one testing / supporting it. If you care about this, we are happy to help you with reviews or suggestions and would welcome the contribution. I don't believe there are any fundamental barriers here; it's just a matter of doing the work and enhancing the documentation.

Thanks and happy optimizing,
Keir



Frank Young

Dec 27, 2017, 6:26:40 AM
to Ceres Solver
Keir,

Thanks for the quick feedback.
I will try TBB on the latest version for ARM/Android.

Frank

On Wednesday, December 27, 2017 at 2:07:29 PM UTC+8, Keir Mierle wrote:

Frank Young

Dec 29, 2017, 5:03:18 AM
to Ceres Solver
Hi Keir,

I am glad to let you know that both OpenMP and TBB now work correctly on my device (Android, 4-core ARM CPU).
But it seems that I don't get any performance boost from the multi-threading, even though I set num_linear_solver_threads to 4 (linear_solver_type = DENSE_SCHUR, trust_region_strategy_type = DOGLEG).
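A minimal sketch of the solver configuration described above (standard ceres::Solver::Options fields; the problem construction itself is omitted):

#include "ceres/ceres.h"

void ConfigureSolver(ceres::Solver::Options* options) {
  options->linear_solver_type = ceres::DENSE_SCHUR;
  options->trust_region_strategy_type = ceres::DOGLEG;
  options->num_linear_solver_threads = 4;
}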

For Ax = b, the A matrix is about 300x300.

TBB: only one thread is working on the linear solve; the other 3 threads are idle.
----------------------------------------------------------------------------------------------------------------------------
User 41%, System 11%, IOW 0%, IRQ 0%
User 521 + Nice 3 + Sys 145 + Idle 576 + IOW 0 + IRQ 7 + SIRQ 3 = 1255

  PID   TID USER     PR  NI CPU% S     VSS     RSS PCY Thread          Proc
24787 24788 root     20   0  22% S 110748K  24172K  fg estimator_test  estimator_test
24787 24787 root     20   0  17% R 110748K  24172K  fg estimator_test  estimator_test
24787 24789 root     20   0   2% S 110748K  24172K  fg estimator_test  estimator_test
24786 24786 root     20   0   2% R   9268K   2548K  fg top             top
24787 24791 root     20   0   1% S 110748K  24172K  fg estimator_test  estimator_test
24787 24790 root     20   0   1% S 110748K  24172K  fg estimator_test  estimator_test


OpenMP: even though all 4 linear solver threads are busy and the CPU usage is high, I did not see any performance boost.
-----------------------------------------------------------------------------------------------------------------------------
User 79%, System 13%, IOW 0%, IRQ 0%
User 1206 + Nice 2 + Sys 209 + Idle 87 + IOW 0 + IRQ 10 + SIRQ 3 = 1517

  PID   TID USER     PR  NI CPU% S     VSS     RSS PCY Thread          Proc
24944 24948 root     20   0  20% R 103956K  25456K  fg estimator_test  estimator_test
24944 24945 root     20   0  19% S 103956K  25456K  fg estimator_test  estimator_test
24944 24946 root     20   0  19% R 103956K  25456K  fg estimator_test  estimator_test
24944 24947 root     20   0  18% R 103956K  25456K  fg estimator_test  estimator_test
24944 24944 root     20   0  11% R 103956K  25456K  fg estimator_test  estimator_test
24950 24950 root     20   0   7% R   9268K   2540K  fg top             top

In fact, the performance of the Schur eliminator is crucial to the overall bundle adjustment in Ceres, and multi-threading is mainly applied to it.
Q1. Multi-threading doesn't bring any performance boost. Is it because the A matrix is too small?
Q2. Do you think it is feasible to implement the Schur eliminator on the GPU for a speedup? I also hope to move the computation load from the CPU to the GPU.

I am trying to read the Schur eliminator/complement code and it is really hard to understand. Are there any documents/papers you would recommend? Thanks so much.

Frank

On Wednesday, December 27, 2017 at 7:26:40 PM UTC+8, Frank Young wrote:

Sameer Agarwal

Dec 29, 2017, 5:11:55 AM
to ceres-...@googlegroups.com

Frank,
Can you share the output of Summary::FullReport()?
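For reference, a minimal sketch of how the report is printed (assuming a problem and options have already been set up):

#include <iostream>
#include "ceres/ceres.h"

void SolveAndReport(const ceres::Solver::Options& options, ceres::Problem* problem) {
  ceres::Solver::Summary summary;
  ceres::Solve(options, problem, &summary);
  std::cout << summary.FullReport() << "\n";
}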

Sameer


Frank Young

Dec 29, 2017, 8:50:14 AM
to Ceres Solver
Sameer,

I am on my New Year holiday. I will paste the output of FullReport() once I am back in the office next Tuesday.

I have read a lot about Ceres from your comments, e.g., http://thread.gmane.org/gmane.comp.lib.eigen/3901. You did a great job on Ceres. I am new to non-linear optimization and would appreciate it if you could recommend some papers about the Schur eliminator for further reading.

Cheers,
Frank

On Friday, December 29, 2017 at 6:11:55 PM UTC+8, Sameer Agarwal wrote:

Sameer Agarwal

Dec 29, 2017, 11:33:28 AM
to ceres-...@googlegroups.com
Frank,

Schur elimination as implemented in Ceres is not documented anywhere. It is slightly different from (but mathematically equivalent to) the classical Schur complement elimination as described by Lourakis et al. in their description of the SBA algorithm.

Schur elimination is not GPU friendly. It is highly irregular. If you tell us more about the structure and size of the problem you are solving, we may be able to suggest performance improvements.
Sameer




Frank Young

Jan 1, 2018, 9:03:26 PM
to Ceres Solver
Hello Sameer, FYI.

Ceres Solver Report: Iterations: 51, Initial cost: 7.581848e-03, Final cost: 1.155901e-03, Termination: NO_CONVERGENCE
summary.FullReport: 
Solver Summary (v 1.13.0-eigen-(3.3.4)-lapack-suitesparse-(((4) * 1000 + (0)))-openmp-no_tbb)

                                     Original                  Reduced
Parameter blocks                          113                      110
Parameters                                350                      340
Effective parameters                      339                      330
Residual blocks                           877                      877
Residual                                 1754                     1754

Minimizer                        TRUST_REGION

Dense linear algebra library            EIGEN
Trust region strategy                  DOGLEG (TRADITIONAL)

                                        Given                     Used
Linear solver                     DENSE_SCHUR              DENSE_SCHUR
Threads                                     1                        1
Linear solver threads                       4                        4
Linear solver ordering              AUTOMATIC                    91,19
Schur structure                         2,3,3                    d,d,d

Cost:
Initial                          7.581848e-03
Final                            1.155901e-03
Change                           6.425946e-03

Minimizer iterations                       51
Successful steps                           42
Unsuccessful steps                          9

Time (in seconds):
Preprocessor                         0.007121

  Residual evaluation                0.043112
  Jacobian evaluation                0.145522
  Linear solver                      0.347084
Minimizer                            0.596056

Postprocessor                        0.000102
Total                                0.603279

Termination:                   NO_CONVERGENCE (Maximum number of iterations reached. Number of iterations: 50.)

Frank

On Saturday, December 30, 2017 at 12:33:28 AM UTC+8, Sameer Agarwal wrote:

Sameer Agarwal

Jan 2, 2018, 1:50:00 AM
to ceres-...@googlegroups.com
Frank,
How many cameras do you have?
The first suggestion I have is for you to specify the linear_solver_ordering manually instead of having Ceres determine it automatically.
The other is that I noticed you are most likely not compiling with the Schur template specializations enabled. That will let you exploit the static Schur structure and speed up the linear solver.
Sameer


Frank Young

Jan 2, 2018, 5:05:43 AM
to Ceres Solver
Sameer,

Our window size is 10, which means 10 cameras.

Enabling the Schur template specializations gave us a performance boost of about 25%. That is great, thanks so much.
If the Schur structure is fixed as (2, 3, 3), Eigen has the chance to allocate the matrices on the stack at compile time instead of dynamically allocating them on the heap at runtime. This would avoid frequent allocation overhead and improve access efficiency. Am I right?

I will try linear_solver_ordering as the next step. In fact, I have no idea what the optimal linear_solver_ordering assignment should be at this time.

Frank 

On Tuesday, January 2, 2018 at 2:50:00 PM UTC+8, Sameer Agarwal wrote:

Sameer Agarwal

Jan 2, 2018, 9:52:47 AM
to ceres-...@googlegroups.com
I am glad that worked.

No, this has nothing to do with dynamic or static allocation. This is about exposing to the compiler the static size of these matrices, which allows it to do the linear algebra much more efficiently by using statically sized loops.
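To illustrate the point with plain Eigen (a generic example, not Ceres code): with fixed-size matrix types the compiler knows the loop bounds at compile time and can unroll and vectorize the product, which is what the Schur specializations expose for the (2, 3, 3) structure.

#include "Eigen/Dense"

// Dynamic sizes: the loop bounds are known only at run time.
Eigen::MatrixXd DynamicProduct(const Eigen::MatrixXd& a, const Eigen::MatrixXd& b) {
  return a * b;
}

// Static sizes: a 2x3 times 3x3 product compiled with statically sized loops.
Eigen::Matrix<double, 2, 3> StaticProduct(const Eigen::Matrix<double, 2, 3>& a,
                                          const Eigen::Matrix<double, 3, 3>& b) {
  return a * b;
}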

Are your camera intrinsics and extrinsics separate parameter blocks?

Or do you have shared intrinsics?

All the point parameter blocks should go in the first elimination group, and all the camera parameter blocks should go in the second elimination group. A sketch of this follows below.
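A minimal sketch of setting the ordering manually; the block pointers here are placeholders for the actual parameter block arrays in the problem:

#include <vector>
#include "ceres/ceres.h"

void SetManualOrdering(const std::vector<double*>& point_blocks,
                       const std::vector<double*>& camera_blocks,
                       ceres::Solver::Options* options) {
  ceres::ParameterBlockOrdering* ordering = new ceres::ParameterBlockOrdering;
  // Points go in the first elimination group (0), cameras in the second (1).
  for (double* point : point_blocks) {
    ordering->AddElementToGroup(point, 0);
  }
  for (double* camera : camera_blocks) {
    ordering->AddElementToGroup(camera, 1);
  }
  options->linear_solver_ordering.reset(ordering);
}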

Sameer

Frank Young

Jan 5, 2018, 2:08:28 AM
to Ceres Solver
Sameer,

There are extrinsic and point parameter blocks in our case. I added the point blocks to group 0 and the extrinsic blocks to group 1. It does not seem to help much.

                              Given            Used
Linear solver ordering        AUTOMATIC        91,19

After modification,
Linear solver ordering        91,22            91,19


Now Ceres is running entirely on the CPU and a high CPU load is observed. Do you think any functions in the minimizer and the linear solver could be moved to run on the DSP or GPU? Our platform is Snapdragon 82x; the DSP and GPU are almost idle when running Ceres.
 
Frank

On Tuesday, January 2, 2018 at 10:52:47 PM UTC+8, Sameer Agarwal wrote:

Sameer Agarwal

Jan 5, 2018, 5:51:42 AM
to ceres-...@googlegroups.com
Frank,
Can you share the current output of Summary::FullReport() after the Schur specializations were enabled?
Sameer


Frank Young

Jan 5, 2018, 8:27:05 AM
to Ceres Solver
FYI.

summary.FullReport: 
Solver Summary (v 1.13.0-eigen-(3.3.4)-lapack-suitesparse-(((4) * 1000 + (0)))-no_openmp-no_tbb)

                                     Original                  Reduced
Parameter blocks                          113                      110
Parameters                                350                      340
Effective parameters                      339                      330
Residual blocks                           877                      877
Residual                                 1754                     1754

Minimizer                        TRUST_REGION

Dense linear algebra library            EIGEN
Trust region strategy                  DOGLEG (TRADITIONAL)

                                        Given                     Used
Linear solver                     DENSE_SCHUR              DENSE_SCHUR
Threads                                     1                        1
Linear solver threads                       1                        1
Linear solver ordering                  91,22                    91,19
Schur structure                         2,3,3                    2,3,3

Cost:
Initial                          7.581848e-03
Final                            1.155901e-03
Change                           6.425947e-03

Minimizer iterations                       51
Successful steps                           42
Unsuccessful steps                          9

Time (in seconds):
Preprocessor                         0.001711

  Residual evaluation                0.013869
  Jacobian evaluation                0.054759
  Linear solver                      0.096086
Minimizer                            0.189615

Postprocessor                        0.000056
Total                                0.191384

Termination:                   NO_CONVERGENCE (Maximum number of iterations reached. Number of iterations: 50.)


On Friday, January 5, 2018 at 6:51:42 PM UTC+8, Sameer Agarwal wrote:

Sameer Agarwal

Jan 5, 2018, 9:22:23 AM
to ceres-...@googlegroups.com
Frank,
As a first step, why not use OpenMP threading? It works quite well.
That said, your problem is fairly small; threading is only going to take you so far.
Also, what is the performance you are trying to hit?
Sameer


Frank Young

Jan 6, 2018, 10:45:28 AM
to Ceres Solver
Hi Sameer,

I tried OpenMP and found that the same case took more time, even though I set num_linear_solver_threads=1.

Now the number of iterations is 50 and the linear solver time is 0.096 s, which means each iteration takes ~2 ms. Profiling shows that the second loop in SchurEliminator::Eliminate() is where most of the time is spent, and chunks_.size() is ~100 there. So I guess multi-threading should be helpful here (different threads could do the computation and write the results back to different parts of the same array or matrix simultaneously). In fact, I am not sure whether the problem is large enough to be worth threading.
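Schematically, the pattern being described is a parallel loop over the chunks, with each thread writing to its own slice of the output. The sketch below is only an illustration of that pattern with hypothetical names (ChunkWork, output_blocks); it is not the actual SchurEliminator code:

#include <vector>
#include "Eigen/Dense"

// Hypothetical per-chunk workspace standing in for the real chunk bookkeeping.
struct ChunkWork {
  Eigen::MatrixXd contribution;  // result computed from this chunk
};

void EliminateChunks(const std::vector<ChunkWork>& chunks,
                     std::vector<Eigen::MatrixXd>* output_blocks) {
  const int num_chunks = static_cast<int>(chunks.size());
  // Each iteration reads chunk i and writes only output block i, so the writes are disjoint.
#pragma omp parallel for
  for (int i = 0; i < num_chunks; ++i) {
    (*output_blocks)[i] = chunks[i].contribution;  // placeholder for the real elimination math
  }
}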

In our case I am trying my best to get the time below 150 ms. :-)

Frank
 



On Friday, January 5, 2018 at 10:22:23 PM UTC+8, Sameer Agarwal wrote:

Sameer Agarwal

Jan 8, 2018, 5:31:27 PM
to ceres-...@googlegroups.com
You may get some benefit from threading, but it is not a clear win. Writing to different parts of the Schur complement matrix is also a problem, since the way the computation is done is very incoherent.

I am thinking about ways of improving the performance of the Schur eliminator, but I do not have anything immediate to help right now.

Sameer


Keir Mierle

Jan 8, 2018, 7:46:54 PM
to ceres-...@googlegroups.com
Hi Frank,

As an aside, it looks like you are trying to use Ceres in real time. We didn't design Ceres for this case, but we have found that several people are using it in this context anyway. Can you explain more about your particular use case? What is your final application?

Thanks,
Keir


Frank Young

Jan 30, 2018, 3:06:42 AM
to Ceres Solver
Keir,

Sorry for the late reply.
We use Ceres in a modified VINS-Mono project, which runs on our AR glasses (Snapdragon 8XX series) for SLAM.
In one of our cases the Schur structure is (d, d, d), and I found that the bottleneck is the small matrix multiplications in SchurEliminator::Eliminate(). It uses the native calls in small_blas.h.
I am trying to do some optimization with unrolling and assembly on the AArch64 platform now, and hope for some performance boost.

Cheers,
Frank

On Tuesday, January 9, 2018 at 8:46:54 AM UTC+8, Keir Mierle wrote:


Sameer Agarwal

Jan 30, 2018, 8:38:11 AM
to ceres-...@googlegroups.com
Frank,
Is it ddd because the structure detection found it to be completely dynamic, or is Ceres missing a specialization?

You can also try disabling custom_blas, in which case we will fall back to Eigen, and it may work better.

Sameer


Frank Young

Feb 8, 2018, 5:02:42 AM
to Ceres Solver
Sameer,

It is ddd because the structure detection found it to be completely dynamic.
For A*B = C, the sub-blocks of C look like the following in one iteration:

1x1, 1x6, 2x1, 2x6, 6x6, 9x9, 9x6, 15x6, 15x9

I tried to use the Eigen calls (such as MatrixMatrixMultiplyEigen(), etc.) for these small matrix operations. Unfortunately the performance did not get better.
I did some optimizations for the triple for-loop matrix multiply in small_blas.h with unrolling/assembly, and got a performance improvement of about 15% on my arm64-v8a platform.
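As an illustration of the kind of change being described, here is a generic sketch of unrolling the inner loop of a row-major C += A * B by a factor of four, with a scalar tail; it is not the actual small_blas.h code or the patch itself:

// Row-major C (rows_a x cols_b) += A (rows_a x cols_a) * B (cols_a x cols_b),
// with the k loop unrolled by four. Purely illustrative.
void SmallGemmAddUnrolled(const double* A, int rows_a, int cols_a,
                          const double* B, int cols_b, double* C) {
  for (int i = 0; i < rows_a; ++i) {
    for (int j = 0; j < cols_b; ++j) {
      double sum = 0.0;
      int k = 0;
      for (; k + 4 <= cols_a; k += 4) {
        sum += A[i * cols_a + k + 0] * B[(k + 0) * cols_b + j];
        sum += A[i * cols_a + k + 1] * B[(k + 1) * cols_b + j];
        sum += A[i * cols_a + k + 2] * B[(k + 2) * cols_b + j];
        sum += A[i * cols_a + k + 3] * B[(k + 3) * cols_b + j];
      }
      for (; k < cols_a; ++k) {  // scalar tail
        sum += A[i * cols_a + k] * B[k * cols_b + j];
      }
      C[i * cols_b + j] += sum;
    }
  }
}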

Frank

On Tuesday, January 30, 2018 at 9:38:11 PM UTC+8, Sameer Agarwal wrote:

Sameer Agarwal

Feb 8, 2018, 10:38:59 AM
to ceres-...@googlegroups.com
Frank,

Thanks for the update. Would you be willing to contribute your implementations to Ceres? 
I am in the process of adding some benchmarks to Ceres for small_blas; they should help measure and improve performance.

Sameer

Frank Young

Feb 12, 2018, 7:23:50 PM
to Ceres Solver
Sameer,

I wrote this code for our company's project and needed to get approval from my team before contributing it to Ceres. Fortunately, after talking with Terry (my line manager), I am happy to let you know that it is OK to contribute the code to Ceres. I will follow up with you from my company email for more details.

Frank

On Thursday, February 8, 2018 at 11:38:59 PM UTC+8, Sameer Agarwal wrote:

Sameer Agarwal

Feb 12, 2018, 7:51:13 PM
to ceres-...@googlegroups.com

vincent yu

Apr 27, 2018, 10:07:20 PM
to Ceres Solver
Hi, Frank

Excuse me, I want to know how to enable TBB and use multi-threading in Ceres. Can you tell me how to configure the parameters in Android.mk and Application.mk? Thanks a lot.

Cheers,
vincent

On Wednesday, December 27, 2017 at 10:02:51 AM UTC+8, Frank Young wrote: