Ways to reduce memory consumption & improve multithreading on very big, sparse problems


Matthias Bühlmann

Jun 14, 2021, 11:37:43 AM6/14/21
to Ceres Solver
I have two questions regarding solving of large problems.

I'm solving a few hundred Ceres problems. Most of them are relatively small (hundreds to a few hundred thousand residuals and parameters), but a handful of them are quite big.

I'm running on a machine with 24 cores / 48 hyperthreads, two NUMA domains, and 160 GB of RAM.

The small problems I solve in parallel, 48 at a time with each problem getting one thread. When solving these smaller problems in parallel, I get 100% CPU usage on the system.

When getting to the larger problems, my application always gets killed due to running out of memory.

The first question concerns multithreading:
To save memory as the problems get larger, I switch to solving one problem at a time while setting solver_options.num_threads = 48. However, I only get about 10% CPU usage at this stage. I'm using linear_solver_type = SPARSE_NORMAL_CHOLESKY and have tried both sparse_linear_algebra_library_type = EIGEN_SPARSE (Eigen 3.3.9 compiled with default defines) and sparse_linear_algebra_library_type = SUITE_SPARSE (SuiteSparse 4.5.6), but with both of them I see very low CPU usage when solving one Ceres problem at a time with solver_options.num_threads = 48.

So the first question is: is there something I can do to get better CPU utilization when solving one big, sparse ceres::Problem at a time with solver_options.num_threads = 48?
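For reference, the configuration I'm describing looks roughly like this (a sketch; the ceres::Problem itself is built elsewhere and passed in):

```cpp
#include <iostream>

#include "ceres/ceres.h"

// Sketch of the configuration described above; the ceres::Problem itself
// is built elsewhere and passed in.
void SolveOneBigProblem(ceres::Problem* problem) {
  ceres::Solver::Options options;
  options.linear_solver_type = ceres::SPARSE_NORMAL_CHOLESKY;
  // Tried both backends, with equally low CPU utilization:
  options.sparse_linear_algebra_library_type = ceres::SUITE_SPARSE;
  // options.sparse_linear_algebra_library_type = ceres::EIGEN_SPARSE;
  options.num_threads = 48;

  ceres::Solver::Summary summary;
  ceres::Solve(options, problem, &summary);
  std::cout << summary.FullReport() << "\n";
}
```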

Second question: for the largest problems, I run out of memory even when solving them one by one, already while building the problem.

My currently largest problem has about a billion residuals (in a billion residual blocks) and about 50 million parameters (in about 50 million parameter blocks).
It is extremely sparse, though: each residual depends on only 80 parameters (in 11 parameter blocks), and a few hundred residuals depend on each parameter block. The cost functors and the observation data (which I have already compressed in memory as much as possible and share between cost functors) take about 50 GB, so the rest seems to be overhead from building the problem.

What options do I have to reduce the memory footprint of building and solving such a large, sparse problem?

Thanks!

Dmitriy Korchemkin

Jun 23, 2021, 8:41:51 AM6/23/21
to Ceres Solver
I've actually experienced a performance decrease when dealing with large problems on a NUMA machine (a dual Xeon 8176 with 384 GB of RAM each, if that matters).
I would suggest running your optimization on a single NUMA node (in terms of both memory allocation and thread scheduling), using numactl for example if you're on Linux, and setting the number of threads to the number of threads per socket.

> I'm using linear_solver_type = SPARSE_NORMAL_CHOLESKY
Does your problem allow you to exploit some structure (using schur complement solvers, for example)?
Iterative solvers might also be an answer (direct solvers are quite slow in my experience), but currently their performance is largely limited by the sparse matrix-vector product (especially in the case of a partitioned matrix). I've experimented with speeding it up via parallelization ( https://groups.google.com/g/ceres-solver/c/-AC_9wsYrW0/m/0itXGNY-AwAJ ) but was unable to get promising results on CPU (presumably due to memory throughput and access-pattern irregularity); GPGPU results, however, were quite promising for problem sizes that fit into GPGPU memory.
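For example, switching to a Schur-based solver is just a couple of option changes (a sketch, assuming your problem has a group of parameter blocks that can be eliminated first, as in camera/point problems):

```cpp
#include "ceres/ceres.h"

// Sketch: exploiting bipartite (e.g. camera/point-like) structure with a
// Schur-complement solver instead of plain sparse Cholesky on the normal
// equations.
ceres::Solver::Options MakeSchurOptions() {
  ceres::Solver::Options options;
  options.linear_solver_type = ceres::SPARSE_SCHUR;  // direct Schur solver
  // Or iterate on the (implicitly formed) Schur complement:
  // options.linear_solver_type = ceres::ITERATIVE_SCHUR;
  // options.preconditioner_type = ceres::SCHUR_JACOBI;
  return options;
}
```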

Sameer Agarwal

Jun 23, 2021, 12:14:38 PM6/23/21
to Ceres Solver
Something went wrong with Google Groups; I never saw the original message from Matthias, just Dmitriy's response.
In any case, here goes.
We use CHOLMOD for sparse Cholesky factorization. This solver is not threaded, so the only thing you see being parallelized is the Jacobian evaluation.

At a billion residuals you are going to run into problems. 

I will make the obvious but annoying suggestion: do you really need a billion residuals? And does your problem actually have 50 million parameter blocks? Is there a lower-dimensional structure that you are trying to optimize here which, when discretized/sampled, gives rise to this many millions of parameters? It may be worth looking at your problem formulation itself.

I will give you an example from bundle adjustment. 

The number of 3D points optimized in a bundle adjustment problem is usually just a function of the feature detector used to detect the image features. There is nothing intrinsic about this number: a scene with more texture will yield more 3D points than a scene with less texture, which says nothing about the 3D-ness of the scene itself. What matters in most cases is accurate camera parameters, and once you have them you can just triangulate points individually, which is what multi-view stereo is all about. So after a certain number of 3D points, adding more just increases computational complexity without really affecting the camera parameters. Of course, choosing the right small set of 3D points to optimize over is not an easy problem.

Sameer

Matthias Bühlmann

Jul 1, 2021, 3:31:51 AM7/1/21
to Ceres Solver
The parameters I'm solving for are the texel values of multiple large textures, as well as a bunch of vertex attributes at each vertex of a 3D mesh, which together parametrize the surface properties for a real-time shader. Since the vertex attributes are interpolated across adjacent triangles, I can't solve triangle by triangle.

Sameer Agarwal

Jul 1, 2021, 7:30:02 AM7/1/21
to ceres-...@googlegroups.com
Sure but do you need to solve the whole mesh at the same time?

One approach I have used in the past is to break up the graph/mesh into overlapping pieces and then iterate over them.

Another idea is to do a non-overlapping partition and then use something like ADMM or dual ascent.
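A rough sketch of the overlapping-pieces idea (all helper names here are hypothetical; the key is freezing the boundary parameter blocks of each piece while solving it):

```cpp
#include <vector>

#include "ceres/ceres.h"

// Hypothetical sketch of block-coordinate descent over overlapping pieces
// of the mesh; Piece and BuildSubproblem are placeholders.
struct Piece { /* a subset of parameter blocks and their residuals */ };

// Hypothetical helper: adds this piece's residual blocks to `problem` and
// calls SetParameterBlockConstant() on parameter blocks shared with
// neighbouring pieces, freezing them at their current values.
void BuildSubproblem(const Piece& piece, ceres::Problem* problem);

void SolveByPieces(const std::vector<Piece>& pieces, int num_sweeps) {
  for (int sweep = 0; sweep < num_sweeps; ++sweep) {
    for (const Piece& piece : pieces) {
      // Each subproblem is small enough to build and solve on its own;
      // the overlap propagates information between pieces across sweeps.
      ceres::Problem problem;
      BuildSubproblem(piece, &problem);

      ceres::Solver::Options options;
      options.linear_solver_type = ceres::SPARSE_NORMAL_CHOLESKY;
      ceres::Solver::Summary summary;
      ceres::Solve(options, &problem, &summary);
    }
  }
}
```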

Sameer




--
You received this message because you are subscribed to the Google Groups "Ceres Solver" group.
To unsubscribe from this group and stop receiving emails from it, send an email to ceres-solver...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/ceres-solver/a0e23063-167c-44c5-84d0-12fe61ab61a0n%40googlegroups.com.

yel...@gmail.com

Sep 23, 2022, 11:17:23 PM9/23/22
to Ceres Solver

Just want to know a bit more about the scalability of Jacobian evaluation given more cores. Suppose I have many residuals but far fewer parameters (so RAM should be sufficient); do you expect its strong-scaling efficiency to be high (say > 80%)? For example, if I want to solve BA for a few hundred cameras but with very dense point correspondences, would a big multi-core machine be helpful?

Sameer Agarwal

Sep 23, 2022, 11:57:45 PM9/23/22
to ceres-...@googlegroups.com
Yes.
