Mixed Precision Support for SuiteSparse

Sameer Agarwal

unread,

Aug 26, 2024, 12:54:39 PM8/26/24

to ceres-...@googlegroups.com

Hi Folks,

Starting around 7.4.0 Suitesparse added single precision sparse cholesky factorization support to CHOLMOD. I have just submitted a change to Ceres which uses it. This means single precision and mixed precision solves are now available with SUITE_SPARSE as the backend. This reduces peak memory usage as well as linear solver time.

Here are some sample performance numbers from my laptop

bundle_adjuster --input=../../Downloads/problem-3068-310854-pre.txt

Final Cost                            4.161838e+06
Total time                               57.448539
        616329634071  instructions retired
        929475980510  cycles elapsed
          5375034560  peak memory footprint

/usr/bin/time -l ./bin/bundle_adjuster --input=../../Downloads/problem-3068-310854-pre.txt  -mixed_precision_solves
Final Cost                            4.148930e+06
Total time                               27.031170
        395186936091  instructions retired
        368802808856  cycles elapsed
          4327029824  peak memory footprint

Here I am not even doing any refinement steps.

Please try it out and let us know of your experiences.

Sameer

enrique.fern...@gmail.com

unread,

Sep 13, 2024, 8:19:17 AM9/13/24

to Ceres Solver

Hi,

That's a great addition and improvement. However, I've tried to reproduce it and the result is almost the same with and without mixed precision solves:

$ build/ceres-solver/bin/bundle_adjuster --input=data/final/problem-3068-310854-pre.txt
iter cost cost_change |gradient| |step| tr_ratio tr_radius ls_iter iter_time total_time
0 9.099334e+07 0.00e+00 5.16e+12 0.00e+00 0.00e+00 1.00e+04 0 2.70e-01 5.04e+00
1 4.649068e+07 4.45e+07 5.32e+14 0.00e+00 4.99e-01 1.00e+04 1 1.63e+01 2.13e+01
2 8.014327e+06 3.85e+07 7.13e+13 1.70e+05 8.59e-01 1.58e+04 1 1.00e+01 3.13e+01
3 4.455175e+08 -4.38e+08 7.13e+13 1.83e+05 -6.87e+01 7.92e+03 1 9.92e+00 4.12e+01
4 4.161838e+06 3.85e+06 1.04e+13 1.04e+05 6.06e-01 8.00e+03 1 9.61e+00 5.08e+01
5 5.239706e+12 -5.24e+12 1.04e+13 9.39e+04 -2.09e+06 4.00e+03 1 9.63e+00 6.04e+01

Solver Summary (v 2.3.0-eigen-(3.3.7)-lapack-suitesparse-(7.8.2)-metis-(5.1.0)-eigensparse-cuda-(12060)-cudss-(0.3.0))

Original Reduced
Parameter blocks 313922 313922
Parameters 960174 960174
Residual blocks 1653812 1653812
Residuals 3307624 3307624

Minimizer TRUST_REGION
Trust region strategy LEVENBERG_MARQUARDT
Sparse linear algebra library SUITE_SPARSE + AMD

Given Used
Linear solver SPARSE_SCHUR SPARSE_SCHUR
Threads 16 16
Linear solver ordering 310854,3068 310854,3068
Schur structure 2,3,9 2,3,9

Cost:
Initial 9.099334e+07
Final 4.161838e+06
Change 8.683150e+07

Minimizer iterations 6
Successful steps 4
Unsuccessful steps 2

Time (in seconds):
Preprocessor 4.769383

Residual only evaluation 0.218614 (5)
Jacobian & residual evaluation 0.675102 (4)
Linear solver 54.346604 (5)
Minimizer 55.711315

Postprocessor 0.089452
Total 60.570151

Termination: NO_CONVERGENCE (Maximum number of iterations reached. Number of iterations: 5.)

$ build/ceres-solver/bin/bundle_adjuster --input=data/final/problem-3068-310854-pre.txt -mixed_precision_solves
iter cost cost_change |gradient| |step| tr_ratio tr_radius ls_iter iter_time total_time
0 9.099334e+07 0.00e+00 5.16e+12 0.00e+00 0.00e+00 1.00e+04 0 2.66e-01 4.96e+00
1 4.648716e+07 4.45e+07 5.32e+14 0.00e+00 4.99e-01 1.00e+04 1 1.58e+01 2.07e+01
2 8.013994e+06 3.85e+07 7.13e+13 1.70e+05 8.59e-01 1.58e+04 1 9.70e+00 3.04e+01
3 4.723025e+08 -4.64e+08 7.13e+13 1.83e+05 -7.29e+01 7.92e+03 1 9.97e+00 4.04e+01
4 4.157745e+06 3.86e+06 1.03e+13 1.04e+05 6.06e-01 8.00e+03 1 9.95e+00 5.03e+01
5 2.458755e+12 -2.46e+12 1.03e+13 9.39e+04 -9.81e+05 4.00e+03 1 9.72e+00 6.01e+01

Solver Summary (v 2.3.0-eigen-(3.3.7)-lapack-suitesparse-(7.8.2)-metis-(5.1.0)-eigensparse-cuda-(12060)-cudss-(0.3.0))

Original Reduced
Parameter blocks 313922 313922
Parameters 960174 960174
Residual blocks 1653812 1653812
Residuals 3307624 3307624

Minimizer TRUST_REGION
Trust region strategy LEVENBERG_MARQUARDT
Sparse linear algebra library SUITE_SPARSE + AMD (Mixed Precision)

Given Used
Linear solver SPARSE_SCHUR SPARSE_SCHUR
Threads 16 16
Linear solver ordering 310854,3068 310854,3068
Schur structure 2,3,9 2,3,9

Cost:
Initial 9.099334e+07
Final 4.157745e+06
Change 8.683560e+07

Minimizer iterations 6
Successful steps 4
Unsuccessful steps 2

Time (in seconds):
Preprocessor 4.689345

Residual only evaluation 0.213891 (5)
Jacobian & residual evaluation 0.667013 (4)
Linear solver 54.054418 (5)
Minimizer 55.396603

Postprocessor 0.088574
Total 60.174523

Termination: NO_CONVERGENCE (Maximum number of iterations reached. Number of iterations: 5.)

Sameer Agarwal

unread,

Sep 13, 2024, 9:11:22 AM9/13/24

to ceres-...@googlegroups.com

Let me take a look.

--
You received this message because you are subscribed to the Google Groups "Ceres Solver" group.
To unsubscribe from this group and stop receiving emails from it, send an email to ceres-solver...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/ceres-solver/fb6c30e5-16fd-4ecb-a966-499b50bb8d22n%40googlegroups.com.

enrique.fern...@gmail.com

unread,

Sep 13, 2024, 10:48:26 AM9/13/24

to Ceres Solver

I don't know if it really matters, but these are my CPU details according to lshw output:

*-memory
description: System memory
physical id: 0
size: 66GiB
*-cpu
product: AMD Ryzen 7 3700X 8-Core Processor
vendor: Advanced Micro Devices [AMD]
physical id: 1
bus info: cpu@0
size: 2368MHz
capacity: 3600MHz
width: 64 bits
capabilities: fpu fpu_exception wp vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp x86-64 constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf rapl pni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb cat_l3 cdp_l3 hw_pstate ssbd mba ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 cqm rdt_a rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local clzero irperf xsaveerptr rdpru wbnoinvd arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic v_vmsave_vmload vgif v_spec_ctrl umip rdpid overflow_recov succor smca sme sev sev_es cpufreq

I should also mention I built SuiteSparse from source. I hope I didn't miss any flag that was required to exploit single precision.

Sameer Agarwal

unread,

Sep 13, 2024, 10:56:35 AM9/13/24

to ceres-...@googlegroups.com

Enrique,

What BLAS and LAPACK are you using?

On my machine, M1Pro macbook, where I build bundle adjuster using SuiteSparse 7.8.1 via homebrew and Apple's LAPACK & BLAS.

Here is what I observe.

Observe the "ReducedSolve" time, (everything else should be the same between the two types of solves).

[master] cmake-build: ./bin/bundle_adjuster --input=../../Downloads/problem-3068-310854-pre.txt --stderrthreshold=0 --v=3 -mixed_precision_solves=false -max_num_refinement_iterations=0

<SNIP>

iter cost cost_change |gradient| |step| tr_ratio tr_radius ls_iter iter_time total_time

0 9.099334e+07 0.00e+00 5.16e+12 0.00e+00 0.00e+00 1.00e+04 0 1.95e-01 2.80e+00
<SNIP>

SchurComplementSolver::Solve
Delta Cumulative
Setup : 2.49219 2.49219
Eliminate : 1.51111 4.00330
ReducedSolve : 10.22955 14.23285
BackSubstitute : 0.09485 14.32770
Total : 0.00001 14.32771

1 4.649068e+07 4.45e+07 5.32e+14 0.00e+00 4.99e-01 1.00e+04 1 1.46e+01 1.74e+01
I0913 07:51:09.415558 58465629 event_logger.cc:60]

SchurComplementSolver::Solve
Delta Cumulative
Setup : 0.00008 0.00008
Eliminate : 1.49946 1.49953
ReducedSolve : 8.87346 10.37300
BackSubstitute : 0.01754 10.39054
Total : 0.00000 10.39054

2 8.014327e+06 3.85e+07 7.13e+13 1.70e+05 8.59e-01 1.58e+04 1 1.06e+01 2.80e+01
I0913 07:51:19.593790 58465629 event_logger.cc:60]

SchurComplementSolver::Solve
Delta Cumulative
Setup : 0.00007 0.00007
Eliminate : 1.37145 1.37152
ReducedSolve : 8.62894 10.00046
BackSubstitute : 0.00487 10.00533
Total : 0.00000 10.00534

./bin/bundle_adjuster --input=../../Downloads/problem-3068-310854-pre.txt --stderrthreshold=0 --v=3 -mixed_precision_solves=true -max_num_refinement_iterations=0
<SNIP>

iter cost cost_change |gradient| |step| tr_ratio tr_radius ls_iter iter_time total_time

0 9.099334e+07 0.00e+00 5.16e+12 0.00e+00 0.00e+00 1.00e+04 0 1.78e-01 2.68e+00
<SNIP>
SchurComplementSolver::Solve
Delta Cumulative
Setup : 2.46577 2.46577
Eliminate : 1.44550 3.91127
ReducedSolve : 4.07641 7.98769
BackSubstitute : 0.09636 8.08404
Total : 0.00002 8.08406

1 4.642861e+07 4.46e+07 5.31e+14 0.00e+00 4.99e-01 1.00e+04 1 8.32e+00 1.10e+01
I0913 07:52:48.368463 58469952 event_logger.cc:60]

SchurComplementSolver::Solve
Delta Cumulative
Setup : 0.00008 0.00008
Eliminate : 1.51176 1.51184
ReducedSolve : 2.76627 4.27811
BackSubstitute : 0.00496 4.28307
Total : 0.00000 4.28308

2 8.005778e+06 3.84e+07 7.12e+13 1.70e+05 8.59e-01 1.58e+04 1 4.44e+00 1.54e+01
I0913 07:52:52.495034 58469952 event_logger.cc:60]

SchurComplementSolver::Solve
Delta Cumulative
Setup : 0.00007 0.00007
Eliminate : 1.42480 1.42487
ReducedSolve : 2.54446 3.96932
BackSubstitute : 0.00502 3.97435
Total : 0.00000 3.97435

3 4.343577e+08 -4.26e+08 7.12e+13 1.83e+05 -6.70e+01 7.92e+03 1 4.03e+00 1.95e+01
I0913 07:52:56.489554 58469952 event_logger.cc:60]

SchurComplementSolver::Solve
Delta Cumulative
Setup : 0.00007 0.00007
Eliminate : 1.39895 1.39902
ReducedSolve : 2.55508 3.95410
BackSubstitute : 0.00453 3.95864
Total : 0.00000 3.95864

So, with regular solves which use double precision factorization, we spend ~8.7 seconds per reduced solve and with mixed precision solves, we bring that down ~2.6 seconds.

Sameer

To view this discussion on the web visit https://groups.google.com/d/msgid/ceres-solver/CAK0oyEqkTLYPem3RvqkiYCOUZ15On75Mm0y6MTmG7jVtPOH9jw%40mail.gmail.com.

enrique.fern...@gmail.com

unread,

Sep 13, 2024, 11:32:01 AM9/13/24

to Ceres Solver

I have

$ dpkg -l | grep liblapack-dev
ii liblapack-dev:amd64 3.9.0-1build1 amd64 Library of linear algebra routines 3 - static version

$ dpkg -l | grep libblas-dev
ii libblas-dev:amd64 3.9.0-1build1 amd64 Basic Linear Algebra Subroutines 3, static library

$ dpkg -l | grep libcublas-dev
ii libcublas-dev-12-6 12.6.1.4-1 amd64 CUBLAS native dev links, headers

I tried to use the flags you used but they're not available for me:

$ build/ceres-solver/bin/bundle_adjuster --input=data/final/problem-3068-310854-pre.txt --stderrthreshold=0 --v=3 -mixed_precision_solves=false -max_num_refinement_iterations=0
ERROR: Unknown command line flag 'stderrthreshold'
ERROR: Unknown command line flag 'v'

I've built ceres solver from source (https://github.com/ceres-solver/ceres-solver/commit/487ce37fa775d57e1403c40eec7b69d0d91d7f2c)

Sameer Agarwal

unread,

Sep 13, 2024, 11:39:02 AM9/13/24

to ceres-...@googlegroups.com

I have a suspicion that you are not using an optimized BLAS & LAPACK. Please see

https://wiki.debian.org/DebianScience/LinearAlgebraLibraries

I recommend switching to using one of the more optimized alternatives.

Something is not right either about the version of ceres you are using. I fixed this issue with logging flags

https://github.com/ceres-solver/ceres-solver/issues/1090

back in mid august.

Sameer

To view this discussion on the web visit https://groups.google.com/d/msgid/ceres-solver/1d483f46-d258-43e2-8f95-4e84a5fd7a34n%40googlegroups.com.

enrique.fern...@gmail.com

unread,

Sep 13, 2024, 11:46:22 AM9/13/24

to Ceres Solver

Thanks, I'd look into that. It'd take me some time to report my findings.

Reply all

Reply to author

Forward