segfault when trying to optimize large graph


Yotam Stern

Mar 5, 2023, 3:02:10 AM
to gtsam users
Hi all, I am seeing some weird behavior from GTSAM when I try to optimize a large graph (~250k factors).
I think it is related to the size of the graph because all the individual factors and values seem sane: I don't see any NaNs, infs, or all-zero Jacobians when I linearize each factor individually, or even when I linearize the entire graph with the values.

If I take a subset of the graph and try to optimize it, everything goes well. For instance, the first half and the second half each optimize fine, and I managed to optimize up to ~80% of the graph without a problem. Only when I try to optimize it all together does it die with a segfault.

This reproduces with the LM, GN and DL optimizers.
I also tried optimizing both halves and then using the results as initial values for an optimization of the entire graph, and I get the same segfault, so I really doubt this is the result of one of the factors giving a bad error or Jacobian.
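
One way to narrow down a size-dependent crash like this is to run each attempt in a child process, so that a segfault kills only the child, and then bisect on the number of factors. The sketch below is only an illustration using the Python standard library: `run_isolated`, `smallest_crashing_size` and the `crashes` callback are hypothetical names, not part of GTSAM, and the bisection assumes the crash is monotonic in graph size (every graph at or above some size crashes), which may not hold here.

```python
import signal
import subprocess
import sys


def run_isolated(code):
    """Run a Python snippet in a child interpreter.

    Returns (returncode, segfaulted). On POSIX, a child killed by a signal
    has a negative return code, so -SIGSEGV indicates a segfault.
    """
    proc = subprocess.run([sys.executable, "-c", code],
                          capture_output=True, text=True)
    return proc.returncode, proc.returncode == -signal.SIGSEGV


def smallest_crashing_size(crashes, known_good, known_bad):
    """Bisect for the smallest size n in (known_good, known_bad] that crashes.

    Assumes crashes(known_good) is False, crashes(known_bad) is True, and
    that crashing is monotonic in n (an assumption, not a given).
    """
    lo, hi = known_good, known_bad
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if crashes(mid):
            hi = mid
        else:
            lo = mid
    return hi
```

In practice, `crashes(n)` could call `run_isolated` on a script that builds and optimizes only the first `n` factors and return the second element of the result.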

Unfortunately I cannot share an example of the graph because it uses factors I wrote in my local project, but basically the graph is a fusion of GPS and speed/yaw measurements for a vehicle. It contains a gpsFactor for each GPS measurement, a "between" factor that constrains the poses to propagate according to the speed/yaw measurements, and another prior factor on the speed/yaw so that it complies with the measurements.
In total I have 71k Pose3 values and 280k factors in the graph.

Has anyone encountered a problem like this before, or does anyone have a clue as to what the source could be?
As a workaround I can just optimize the graph in 2 chunks and use the combined result, but it still bothers me that it fails. I have optimized graphs with millions of factors with GTSAM in the past and had no problems.

Dellaert, Frank

Mar 5, 2023, 3:50:06 PM
to Yotam Stern, gtsam users
Hmm. Can you use gdb or lldb to see where it crashes and share the stack backtrace in a reply or (better) in a gist? That might give some clues.

Frank


Yotam Stern

Mar 6, 2023, 2:18:57 AM
to gtsam users
Hi, to use gdb, do I need to compile GTSAM in debug mode? And will it be of any use if my project runs in Python?

Dellaert, Frank

Mar 6, 2023, 10:09:41 AM
to Yotam Stern, gtsam users

Ah, good point.

I did this quite a long time ago, so I don’t fully remember, but indeed, I would try building GTSAM and the Python wrapper in Debug mode, then attaching gdb to the Python process.
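
Even before a Debug build is available, the Python standard library's `faulthandler` module can show which Python call was executing when the process received SIGSEGV. A minimal sketch follows; the helper name `enable_crash_traceback` is just illustrative, and this only reports the Python-level stack, not the C++ frames inside GTSAM (for those a native debugger such as gdb is still needed).

```python
import faulthandler


def enable_crash_traceback(log_path):
    """Dump Python tracebacks of all threads to log_path on fatal signals
    such as SIGSEGV. The returned file object must stay open for as long
    as the handler is enabled."""
    log_file = open(log_path, "w")
    faulthandler.enable(file=log_file, all_threads=True)
    return log_file
```

Calling this once at the top of the script, before building the graph and running the optimizer, should be enough. The same handler can also be turned on without code changes by starting the interpreter with `python -X faulthandler`.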

Frank

Bernd Pfrommer

Mar 7, 2023, 5:57:23 PM
to gtsam users
I've been struggling with segfaults, too. This is with TagSLAM, a fairly large and convoluted C++ codebase with plenty of opportunity to cause memory corruption *outside* of GTSAM. I have a fairly heavy test suite which I can run as often as I want against the old GTSAM 4.0 Ubuntu package without any segfaults. But GTSAM 4.1 and later gives intermittent segfaults.
The segfaults don't occur under e.g. Valgrind or gdb, i.e. it is a Heisenbug. Even more puzzling, Valgrind reports no memory allocation errors.
Running the entire test suite can take 1h+ and errors do not manifest themselves every time, so I'd have to run for nearly a day to reach some certainty that a given version of GTSAM does not crash my code. Bisection is quite tedious under such circumstances, so I have not gone down that route yet. And the bug may well be in my code, with later versions of GTSAM merely exposing it.
If anybody has seen similar elusive segfaults and has more details as to what caused them, I'd love to hear it.

José Luis Blanco-Claraco

Mar 7, 2023, 6:08:53 PM
to Bernd Pfrommer, gtsam users
Try the GCC thread and address sanitizers (-fsanitize=thread and -fsanitize=address), if you haven't tried them yet. They also run slower than regular code, but much faster than Valgrind.


JL

Bernd Pfrommer

Mar 23, 2023, 7:28:32 AM
to gtsam users
Thanks, that really helped! It was still a slugfest, but I was able to track down the problem. It was some weird interaction with the tf2_ros and eigen_conversions libraries: they use Eigen and presumably were compiled with -march=native, so bad things happened when linking against them. Removing the dependency on those libraries solved the problem, and I was finally able to move up from GTSAM 4.0.3 to 4.1.1. Interestingly, both versions of GTSAM triggered a memory corruption warning under ASAN, but only 4.1.1 crashed occasionally. So there was no bug in GTSAM, and actually none in my code either.
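
A mismatch like this can sometimes be spotted by comparing the architecture-related compiler flags used for each library, since Eigen chooses its vectorization and alignment based on the instruction sets enabled at compile time. The sketch below is illustrative only: the function names, the prefix list, and the idea of feeding it one compile command per target are assumptions, not taken from the TagSLAM or GTSAM builds.

```python
import shlex

# Flag prefixes treated as architecture-related (an illustrative, incomplete list).
ARCH_PREFIXES = ("-march=", "-mavx", "-msse", "-mfma")


def arch_flags(command):
    """Return the architecture-related flags in one compiler command line."""
    return frozenset(tok for tok in shlex.split(command)
                     if tok.startswith(ARCH_PREFIXES))


def group_by_arch_flags(commands):
    """Group target names by their architecture flags.

    `commands` maps a target name to its compile command. More than one
    group in the result means the targets were built with different flags.
    """
    groups = {}
    for target, command in commands.items():
        groups.setdefault(arch_flags(command), []).append(target)
    return groups
```

If the result contains more than one group, the targets in different groups may disagree on Eigen's alignment assumptions and are worth a closer look.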

Dellaert, Frank

Mar 23, 2023, 8:52:20 PM
to Bernd Pfrommer, gtsam users

Phew, many thanks for posting the results of this investigation back to the group!

Frank
