aarch64 Nonlinear Optimizer Issues

224 views
Skip to first unread message

Jay Elrod

unread,
Jun 16, 2021, 12:39:01 PM6/16/21
to gtsam users
Hi,

I have some unit tests that chain together some BetweenFactor using an odometry mock, inject some slightly contradictory supplemental evidence in the form of other factors, such as a unary factor akin to the "GPS-like" factor from LocalizationExample.cpp, and finally use LevenbergMarquardtOptimizer to `optimize()` the factor graph and assert the result.

For example, a test might start with some rather noisy mocked odometry moving forward at a consistent speed with a heading of pi/4, qualify the noisy estimates with some very not-noisy GPS-like factors that suggest movement is actually happening in the pi/2 direction, rather than the pi/4 direction, then optimize the graph and make sure that the optimized poses were along the trajectory suggested by the more reliable GPS-like factors.

Such a test and several others work fine on my desktop (x86_64, Ubuntu 18.04) but fail on my Jetson Xavier NX (aarch64, Ubuntu 18.04).

Checking out borglab gtsam source from github and running `make check` to invoke unit tests on the aarch64 device results in many failures, which I expect are related to my own, the following example of which I think is similar to one of the more basic included examples in gtsam :
```
gtsam/tests/testNonlinearOptimizer.cpp:484: Failure: "assert_equal(expected, dl_result, tol)" 
Not equal:
expected:
Values with 3 values:
Value x1: (gtsam::Pose2)
(0, 0, 0)
Value x2: (gtsam::Pose2)
(1.5, 0, 0)
Value x3: (gtsam::Pose2)
(3, 0, 0)

actual:
Values with 3 values:
Value x1: (gtsam::Pose2)
(-3.74995e-06, 0, 0)
Value x2: (gtsam::Pose2)
(1.49999, 0, 0)
Value x3: (gtsam::Pose2)
(3, 0, 0)
```

I am in the process of debugging gtsam source through verbose optimizer output and will provide updates as I make progress, but I am suspicious of Eigen anti-aliasing or use of `auto` keyword and type resolution on aarch64.

One thing I have observed so far only in aarch64 is:
```
Partial Cholesky on HessianFactor failed.

Frontal keys Position 0: 0
HessianFactor:
 keys: 0(3) 1(3) 
Augmented information matrix: [
1.00453889587e+81, -8.863578493e+80, 1.35333966459e+82, -1.22992576855e+42, -7.68703605343e+40, 0, -1.78961791309e+41;
-8.863578493e+80, 1.00453889587e+81, -1.19233688982e+82, 1.22992576855e+42, -7.68703605343e+40, 0, 3.06711287414e+41;
1.35333966459e+82, -1.19233688982e+82, 1.82326703797e+83, -1.65582451852e+43, -1.04723572027e+42, -400, -2.39906858576e+42;
-1.22992576855e+42, 1.22992576855e+42, -1.65582451852e+43, 1600, 0, 0, 315.903994301;
-7.68703605343e+40, -7.68703605343e+40, -1.04723572027e+42, 0, 100, 0, -83.0941179516;
0, 0, -400, 0, 0, 400, 628.318530718;
-1.78961791309e+41, 3.06711287414e+41, -2.39906858576e+42, 315.903994301, -83.0941179516, 628.318530718, 304968.261602
]
```

Another thing I have noticed in aarch64 is that the initial total error of the same system seems much lower than that of amd64, possibly resulting in less/no optimization iterations.

Any help is greatly appreciated, and thanks in advance!
Jay

Jay Elrod

unread,
Jun 17, 2021, 3:07:03 PM6/17/21
to gtsam users
Seems like the calculation of error of `BetweenFactor<Pose2>` is indeed incorrect on aarch64.

Debug output from amd64 of BetweenFactor<Pose2>::evaluateError for the very first error calculation based on estimates:
> p1: (11, 10.5, 0.392699081699), p2: (20, 23, 0.785398163397), err: [13.1184182938, 7.63899870558, 0.352566699052]

Debug output from aarch64 of BetweenFactor<Pose2>::evaluateError for almost the exact same poses:
> p1: (11, 10.5, 0.392699081699), p2: (20, 23, 0.785398163397), err: [-0.261580542738, -0.309698692039, -1.57079632679]

Playing around inside of `BetweenFactor` and `LieGroup` but haven't had much luck so far.

Dellaert, Frank

unread,
Jun 18, 2021, 6:43:25 AM6/18/21
to Jay Elrod, gtsam users
Would be good to find the root cause, as it could point to uninitialized memory bug on other platforms as well. It's a rare opportunity 🙂

From: gtsam...@googlegroups.com <gtsam...@googlegroups.com> on behalf of Jay Elrod <jelr...@gmail.com>
Sent: Thursday, June 17, 2021 15:07
To: gtsam users <gtsam...@googlegroups.com>
Subject: [GTSAM] Re: aarch64 Nonlinear Optimizer Issues
 
--
You received this message because you are subscribed to the Google Groups "gtsam users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to gtsam-users...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/gtsam-users/0145b558-bd88-4782-b29d-df86f49917f6n%40googlegroups.com.

Misaka Jiang

unread,
Jun 18, 2021, 3:36:18 PM6/18/21
to Jay Elrod, gtsam users
Hi Jay,

Could you submit this, along with how to reproduce, to the GTSAM GitHub issues?

Best,
Fan

--
You received this message because you are subscribed to the Google Groups "gtsam users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to gtsam-users...@googlegroups.com.

Jay Elrod

unread,
Jun 22, 2021, 10:43:48 AM6/22/21
to gtsam users
Sure I can submit an issue on Github. As for repro, any factor graph with BetweenFactor<Pose2> is going to fail in aarch64. I haven't 100% figured out why, but `Rot2 inverse()` in geometry/Rot2.h was returning an inverse with incorrect heading angle when inverting Pose2 as a part of factorization of `BetweenFactor<Pose2>`. 

Changing this line, https://github.com/borglab/gtsam/blob/aec2cf06a5501840f0302344bca03c197522a335/gtsam/geometry/Rot2.h#L110 to read the following fixes at least that issue (seems like there are more still to find):
    `Rot2 inverse() const { return fromCosSin(c_, -s_);}`

Like I said, not yet sure why this makes a difference, as all fromCosSin appears to be doing is normalizing c_ & s_, and those should have already been normalized...

Jay Elrod

unread,
Jun 22, 2021, 3:01:28 PM6/22/21
to gtsam users
Also, Rot2::operator*(const Rot2& R) does not behave as expected in aarch64. The following seems to relieve the problem: return fromAngle(theta() + R.theta());

Although, again, I am not sure why this works, as making use of the trig identity for cos(th1+th2) and sin(th1+th2) as is done in master branch should be identical in terms of resulting behavior...

Frank Dellaert

unread,
Jun 23, 2021, 4:40:20 AM6/23/21
to gtsam users
This is all great sleuthing - please do create the issue so we can track it with this info in it.
Also, I am hoping the sleuthing will lead to finding the root cause - the above fixes are great (and in extremis we could put them inside preprocessor directives) but they might cover up the real issue.
Frank

Jay Elrod

unread,
Jun 23, 2021, 10:58:35 AM6/23/21
to gtsam users
Thanks, created: https://github.com/borglab/gtsam/issues/803

I agree 100%. Normalizing the rotation appears to fix some things (enough things, even, that one of my semi-complex tests does pass), but there are still some unresolved issues. Currently have reverted those band-aid fixes and trying to find out the true root cause of why any Rot2 is ever not in a normalized state to begin with.

Somehow (I think it is maybe happening when the LM optimizer is initialized) the left operand of `Rot2*Rot2` has an uninitialized `c_` and `s_` during an invocation before the optimizer takes its first iteration. I think what's happening is that when this uninitialized rotation gets inverted and multiplied during the optimization iterations, the cosine and/or sine of the rotation component of pose estimates becomes extremely positive or negative, the result of which is `atan2()` for the pose heading angle returning one of its inflection points, as documented in cppreference. Have not figured out from where this uninitialized rotation is coming, but given that reproducing it repeatedly yields wildly different figures for `c_` and `s_` for that rotation that shows up before the optimization, that does appear to be what is happening.

Reply all
Reply to author
Forward
0 new messages