Mixed Precision Support for SuiteSparse

96 views
Skip to first unread message

Sameer Agarwal

unread,
Aug 26, 2024, 12:54:39 PM8/26/24
to ceres-...@googlegroups.com
Hi Folks,

Starting around 7.4.0 Suitesparse added single precision sparse cholesky factorization support to CHOLMOD.  I have just submitted a change to Ceres which uses it. This means single precision and mixed precision solves are now available with SUITE_SPARSE as the backend. This reduces peak memory usage as well as linear solver time.

Here are some sample performance numbers from my laptop

bundle_adjuster --input=../../Downloads/problem-3068-310854-pre.txt

Final Cost                            4.161838e+06
Total time                               57.448539
        616329634071  instructions retired
        929475980510  cycles elapsed
          5375034560  peak memory footprint

/usr/bin/time -l ./bin/bundle_adjuster --input=../../Downloads/problem-3068-310854-pre.txt  -mixed_precision_solves
Final Cost                            4.148930e+06
Total time                               27.031170
        395186936091  instructions retired
        368802808856  cycles elapsed
          4327029824  peak memory footprint

Here I am not even doing any refinement steps.  

Please try it out and let us know of your experiences. 
Sameer

enrique.fern...@gmail.com

unread,
Sep 13, 2024, 8:19:17 AM9/13/24
to Ceres Solver
Hi,

That's a great addition and improvement. However, I've tried to reproduce it and the result is almost the same with and without mixed precision solves:

$ build/ceres-solver/bin/bundle_adjuster --input=data/final/problem-3068-310854-pre.txt
iter      cost      cost_change  |gradient|   |step|    tr_ratio  tr_radius  ls_iter  iter_time  total_time
   0  9.099334e+07    0.00e+00    5.16e+12   0.00e+00   0.00e+00  1.00e+04        0    2.70e-01    5.04e+00
   1  4.649068e+07    4.45e+07    5.32e+14   0.00e+00   4.99e-01  1.00e+04        1    1.63e+01    2.13e+01
   2  8.014327e+06    3.85e+07    7.13e+13   1.70e+05   8.59e-01  1.58e+04        1    1.00e+01    3.13e+01
   3  4.455175e+08   -4.38e+08    7.13e+13   1.83e+05  -6.87e+01  7.92e+03        1    9.92e+00    4.12e+01
   4  4.161838e+06    3.85e+06    1.04e+13   1.04e+05   6.06e-01  8.00e+03        1    9.61e+00    5.08e+01
   5  5.239706e+12   -5.24e+12    1.04e+13   9.39e+04  -2.09e+06  4.00e+03        1    9.63e+00    6.04e+01

Solver Summary (v 2.3.0-eigen-(3.3.7)-lapack-suitesparse-(7.8.2)-metis-(5.1.0)-eigensparse-cuda-(12060)-cudss-(0.3.0))

                                     Original                  Reduced
Parameter blocks                       313922                   313922
Parameters                             960174                   960174
Residual blocks                       1653812                  1653812
Residuals                             3307624                  3307624

Minimizer                        TRUST_REGION
Trust region strategy     LEVENBERG_MARQUARDT
Sparse linear algebra library    SUITE_SPARSE + AMD

                                        Given                     Used
Linear solver                    SPARSE_SCHUR             SPARSE_SCHUR
Threads                                    16                       16
Linear solver ordering            310854,3068              310854,3068
Schur structure                         2,3,9                    2,3,9

Cost:
Initial                          9.099334e+07
Final                            4.161838e+06
Change                           8.683150e+07

Minimizer iterations                        6
Successful steps                            4
Unsuccessful steps                          2

Time (in seconds):
Preprocessor                         4.769383

  Residual only evaluation           0.218614 (5)
  Jacobian & residual evaluation     0.675102 (4)
  Linear solver                     54.346604 (5)
Minimizer                           55.711315

Postprocessor                        0.089452
Total                               60.570151

Termination:                   NO_CONVERGENCE (Maximum number of iterations reached. Number of iterations: 5.)

$ build/ceres-solver/bin/bundle_adjuster --input=data/final/problem-3068-310854-pre.txt -mixed_precision_solves
iter      cost      cost_change  |gradient|   |step|    tr_ratio  tr_radius  ls_iter  iter_time  total_time
   0  9.099334e+07    0.00e+00    5.16e+12   0.00e+00   0.00e+00  1.00e+04        0    2.66e-01    4.96e+00
   1  4.648716e+07    4.45e+07    5.32e+14   0.00e+00   4.99e-01  1.00e+04        1    1.58e+01    2.07e+01
   2  8.013994e+06    3.85e+07    7.13e+13   1.70e+05   8.59e-01  1.58e+04        1    9.70e+00    3.04e+01
   3  4.723025e+08   -4.64e+08    7.13e+13   1.83e+05  -7.29e+01  7.92e+03        1    9.97e+00    4.04e+01
   4  4.157745e+06    3.86e+06    1.03e+13   1.04e+05   6.06e-01  8.00e+03        1    9.95e+00    5.03e+01
   5  2.458755e+12   -2.46e+12    1.03e+13   9.39e+04  -9.81e+05  4.00e+03        1    9.72e+00    6.01e+01

Solver Summary (v 2.3.0-eigen-(3.3.7)-lapack-suitesparse-(7.8.2)-metis-(5.1.0)-eigensparse-cuda-(12060)-cudss-(0.3.0))

                                     Original                  Reduced
Parameter blocks                       313922                   313922
Parameters                             960174                   960174
Residual blocks                       1653812                  1653812
Residuals                             3307624                  3307624

Minimizer                        TRUST_REGION
Trust region strategy     LEVENBERG_MARQUARDT
Sparse linear algebra library    SUITE_SPARSE + AMD (Mixed Precision)

                                        Given                     Used
Linear solver                    SPARSE_SCHUR             SPARSE_SCHUR
Threads                                    16                       16
Linear solver ordering            310854,3068              310854,3068
Schur structure                         2,3,9                    2,3,9

Cost:
Initial                          9.099334e+07
Final                            4.157745e+06
Change                           8.683560e+07

Minimizer iterations                        6
Successful steps                            4
Unsuccessful steps                          2

Time (in seconds):
Preprocessor                         4.689345

  Residual only evaluation           0.213891 (5)
  Jacobian & residual evaluation     0.667013 (4)
  Linear solver                     54.054418 (5)
Minimizer                           55.396603

Postprocessor                        0.088574
Total                               60.174523

Termination:                   NO_CONVERGENCE (Maximum number of iterations reached. Number of iterations: 5.)

Sameer Agarwal

unread,
Sep 13, 2024, 9:11:22 AM9/13/24
to ceres-...@googlegroups.com

Let me take a look.


--
You received this message because you are subscribed to the Google Groups "Ceres Solver" group.
To unsubscribe from this group and stop receiving emails from it, send an email to ceres-solver...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/ceres-solver/fb6c30e5-16fd-4ecb-a966-499b50bb8d22n%40googlegroups.com.

enrique.fern...@gmail.com

unread,
Sep 13, 2024, 10:48:26 AM9/13/24
to Ceres Solver
I don't know if it really matters, but these are my CPU details according to lshw output:

     *-memory
          description: System memory
          physical id: 0
          size: 66GiB
     *-cpu
          product: AMD Ryzen 7 3700X 8-Core Processor
          vendor: Advanced Micro Devices [AMD]
          physical id: 1
          bus info: cpu@0
          size: 2368MHz
          capacity: 3600MHz
          width: 64 bits
          capabilities: fpu fpu_exception wp vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp x86-64 constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf rapl pni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb cat_l3 cdp_l3 hw_pstate ssbd mba ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 cqm rdt_a rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local clzero irperf xsaveerptr rdpru wbnoinvd arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic v_vmsave_vmload vgif v_spec_ctrl umip rdpid overflow_recov succor smca sme sev sev_es cpufreq

I should also mention I built SuiteSparse from source. I hope I didn't miss any flag that was required to exploit single precision.

Sameer Agarwal

unread,
Sep 13, 2024, 10:56:35 AM9/13/24
to ceres-...@googlegroups.com
Enrique,
What BLAS and LAPACK are you using?

On my machine, M1Pro macbook, where I build bundle adjuster using SuiteSparse 7.8.1 via homebrew and Apple's LAPACK & BLAS.
Here is what I observe.

Observe the "ReducedSolve" time, (everything else should be the same between the two types of solves).

[master] cmake-build: ./bin/bundle_adjuster --input=../../Downloads/problem-3068-310854-pre.txt  --stderrthreshold=0 --v=3 -mixed_precision_solves=false -max_num_refinement_iterations=0
<SNIP>

iter      cost      cost_change  |gradient|   |step|    tr_ratio  tr_radius  ls_iter  iter_time  total_time
   0  9.099334e+07    0.00e+00    5.16e+12   0.00e+00   0.00e+00  1.00e+04        0    1.95e-01    2.80e+00
<SNIP>

SchurComplementSolver::Solve
                                        Delta   Cumulative
                           Setup :    2.49219      2.49219
                       Eliminate :    1.51111      4.00330
                    ReducedSolve :   10.22955     14.23285
                  BackSubstitute :    0.09485     14.32770
                           Total :    0.00001     14.32771


   1  4.649068e+07    4.45e+07    5.32e+14   0.00e+00   4.99e-01  1.00e+04        1    1.46e+01    1.74e+01
I0913 07:51:09.415558 58465629 event_logger.cc:60]

SchurComplementSolver::Solve
                                        Delta   Cumulative
                           Setup :    0.00008      0.00008
                       Eliminate :    1.49946      1.49953
                    ReducedSolve :    8.87346     10.37300
                  BackSubstitute :    0.01754     10.39054
                           Total :    0.00000     10.39054


   2  8.014327e+06    3.85e+07    7.13e+13   1.70e+05   8.59e-01  1.58e+04        1    1.06e+01    2.80e+01
I0913 07:51:19.593790 58465629 event_logger.cc:60]

SchurComplementSolver::Solve
                                        Delta   Cumulative
                           Setup :    0.00007      0.00007
                       Eliminate :    1.37145      1.37152
                    ReducedSolve :    8.62894     10.00046
                  BackSubstitute :    0.00487     10.00533
                           Total :    0.00000     10.00534


./bin/bundle_adjuster --input=../../Downloads/problem-3068-310854-pre.txt  --stderrthreshold=0 --v=3 -mixed_precision_solves=true -max_num_refinement_iterations=0
<SNIP>

iter      cost      cost_change  |gradient|   |step|    tr_ratio  tr_radius  ls_iter  iter_time  total_time
   0  9.099334e+07    0.00e+00    5.16e+12   0.00e+00   0.00e+00  1.00e+04        0    1.78e-01    2.68e+00
<SNIP>
SchurComplementSolver::Solve
                                        Delta   Cumulative
                           Setup :    2.46577      2.46577
                       Eliminate :    1.44550      3.91127
                    ReducedSolve :    4.07641      7.98769
                  BackSubstitute :    0.09636      8.08404
                           Total :    0.00002      8.08406


   1  4.642861e+07    4.46e+07    5.31e+14   0.00e+00   4.99e-01  1.00e+04        1    8.32e+00    1.10e+01
I0913 07:52:48.368463 58469952 event_logger.cc:60]

SchurComplementSolver::Solve
                                        Delta   Cumulative
                           Setup :    0.00008      0.00008
                       Eliminate :    1.51176      1.51184
                    ReducedSolve :    2.76627      4.27811
                  BackSubstitute :    0.00496      4.28307
                           Total :    0.00000      4.28308


   2  8.005778e+06    3.84e+07    7.12e+13   1.70e+05   8.59e-01  1.58e+04        1    4.44e+00    1.54e+01
I0913 07:52:52.495034 58469952 event_logger.cc:60]

SchurComplementSolver::Solve
                                        Delta   Cumulative
                           Setup :    0.00007      0.00007
                       Eliminate :    1.42480      1.42487
                    ReducedSolve :    2.54446      3.96932
                  BackSubstitute :    0.00502      3.97435
                           Total :    0.00000      3.97435


   3  4.343577e+08   -4.26e+08    7.12e+13   1.83e+05  -6.70e+01  7.92e+03        1    4.03e+00    1.95e+01
I0913 07:52:56.489554 58469952 event_logger.cc:60]

SchurComplementSolver::Solve
                                        Delta   Cumulative
                           Setup :    0.00007      0.00007
                       Eliminate :    1.39895      1.39902
                    ReducedSolve :    2.55508      3.95410
                  BackSubstitute :    0.00453      3.95864
                           Total :    0.00000      3.95864

So, with regular solves which use double precision factorization, we spend ~8.7 seconds per reduced solve and with mixed precision solves, we bring that down ~2.6 seconds.
Sameer


enrique.fern...@gmail.com

unread,
Sep 13, 2024, 11:32:01 AM9/13/24
to Ceres Solver
I have

$ dpkg -l | grep liblapack-dev
ii  liblapack-dev:amd64                                 3.9.0-1build1                                       amd64        Library of linear algebra routines 3 - static version

$ dpkg -l | grep libblas-dev
ii  libblas-dev:amd64                                   3.9.0-1build1                                       amd64        Basic Linear Algebra Subroutines 3, static library

$ dpkg -l | grep libcublas-dev
ii  libcublas-dev-12-6                                  12.6.1.4-1                                          amd64        CUBLAS native dev links, headers

I tried to use the flags you used but they're not available for me:

$ build/ceres-solver/bin/bundle_adjuster --input=data/final/problem-3068-310854-pre.txt --stderrthreshold=0 --v=3 -mixed_precision_solves=false -max_num_refinement_iterations=0
ERROR: Unknown command line flag 'stderrthreshold'
ERROR: Unknown command line flag 'v'

Sameer Agarwal

unread,
Sep 13, 2024, 11:39:02 AM9/13/24
to ceres-...@googlegroups.com
I have a suspicion that you are not using an optimized BLAS & LAPACK. Please see

https://wiki.debian.org/DebianScience/LinearAlgebraLibraries

I recommend switching to using one of the more optimized alternatives.

Something is not right either about the version of ceres you are using. I fixed this issue with logging flags 


back in mid august.

Sameer




enrique.fern...@gmail.com

unread,
Sep 13, 2024, 11:46:22 AM9/13/24
to Ceres Solver
Thanks, I'd look into that. It'd take me some time to report my findings.
Reply all
Reply to author
Forward
0 new messages