Ceres 2.1.0 CUDA Dense Linear Algebra Library causes preprocessor to take a VERY LONG TIME

Cameron Conroy

Jan 11, 2023, 3:01:46 PM
to Ceres Solver
Hello All,

For the problem I am using Ceres for, solves on the CPU take about 2-3 ms in total with the Eigen dense linear algebra library, and the preprocessor usually takes under 2 ms. However, when I use CUDA as the dense linear algebra library, the preprocessor takes 1+ seconds. After profiling with Nsight Systems, I see only 2 memcpy calls, each copying 112 bytes. Does anyone have any ideas as to what is causing the preprocessor to slow down? From what I have read, the host-to-device transfer doesn't even happen in the preprocessor.

I have attached the Full report and my CMake cache file.
CMakeCache.txt
ceres_output.txt

Sameer Agarwal

Jan 11, 2023, 3:06:55 PM
to ceres-...@googlegroups.com
I suspect this has to do with CUDA being initialized every time you call solve.
The way around it is to construct a Context object yourself with ceres::Context::Create() and hold on to it.
Construct your problem with this context passed via Problem::Options, and then you should pay this cost only once.
Sameer



Cameron Conroy

Jan 11, 2023, 3:30:50 PM
to Ceres Solver
Thank you Sameer, I will try this!

Cameron Conroy

Jan 12, 2023, 9:51:30 AM
to Ceres Solver
I am a little confused about how to pass the context to the solve. I see that context is inside the struct definition of Options, but neither Options nor context is a member of ceres::Problem. Could you please elaborate on how I do this? Thanks in advance!

Sameer Agarwal

Jan 12, 2023, 9:54:01 AM
to ceres-...@googlegroups.com
The context is passed to the problem not to the solve. This is done by constructing a Problem::Options struct with context in it and then constructing a problem by passing the options struct to the Problem constructor.

Sameer.

Cameron Conroy

Jan 12, 2023, 11:28:07 AM
to Ceres Solver
That seemed to work. CUDA is still slower than running with the Eigen dense library on the CPU. Any recommendations for a preconditioner to use with CUDA, and any tips on threading when running with CUDA? Also, my GPU is an RTX 5000 if that matters.

Joydeep Biswas

Jan 12, 2023, 12:09:27 PM
to ceres-...@googlegroups.com
Hi Cameron,

If I understood it correctly, the issue is a slow preprocessor, not a preconditioner - or are you trying out an iterative solver?
If it is the former, it would help to run with verbose debugging (`--v 3 --alsologtostderr`) to see what is taking time - can you share the result?
Your problem in the summary that you shared is also tiny - it'd be much faster to solve on the CPU, due to the GPU operation overhead. If you have a larger problem, it would help to look at the summary and the verbose result to identify potential improvements.

Regards,
Joydeep


Sameer Agarwal

Jan 12, 2023, 12:24:29 PM
to ceres-...@googlegroups.com
You are solving a tiny problem (your Jacobian is 8x12). Going to the GPU is not going to do anything for you. Your problem is essentially solver latency, not throughput.
I recommend trying TinySolver, which is meant for solving small problems like this. It has fewer features, but is very fast.
Sameer



Cameron Conroy

Aug 24, 2023, 3:30:45 PM
to Ceres Solver
Hey Sameer, could you tell me what the threshold is for what is considered a tiny solve and what is large enough to use the regular ceres solver? Thanks in advance. 

Sameer Agarwal

Aug 24, 2023, 3:38:58 PM
to ceres-...@googlegroups.com
Cameron,
There is no single answer to this question because the threshold where TinySolver gets slower than a full solve will depend on your particular running environment (CPU, linear algebra libraries, etc.).
That said, I expect TinySolver to do well up to 100s of parameters and on dense problems. As the sparsity in the problem becomes a substantial factor, the full solver will start doing better.
Sameer

