rover crash bug

Andrew Tridgell

Apr 21, 2014, 5:23:00 AM
to thomasc...@comcast.net, drones-...@googlegroups.com
Hi Tom,

I've pushed a few changes to git master today that may have an impact on
the rover crash bug that you and Mike have seen.

The first change ensures that if the FMU firmware crashes, the
PX4IO co-processor will output zero throttle, so the rover won't run
away. The video that Mike sent shows that the FMU firmware was
definitely dead, which means we know it is an FMU software bug. This
change (also made in the plane code) means that at least the throttle
will be cut.
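
A rough sketch of the failsafe idea in C, for illustration only. This is
not the actual PX4IO code: the struct, the function names and the 100 ms
timeout are all assumptions. The IO side remembers when it last heard
from the FMU and falls back to stored safe PWM values once the FMU has
been silent for too long.

```c
#include <stdbool.h>
#include <stdint.h>

#define NUM_CHANNELS   8
#define FMU_TIMEOUT_MS 100u  /* assumed timeout, not taken from the PX4IO source */

/* Hypothetical state kept by the IO co-processor. */
struct io_state {
    uint32_t last_fmu_update_ms;          /* time of the last valid FMU update */
    uint16_t fmu_pwm[NUM_CHANNELS];       /* last PWM values received from the FMU */
    uint16_t failsafe_pwm[NUM_CHANNELS];  /* safe values, e.g. neutral throttle */
};

/* Returns true when the FMU has been silent for longer than the timeout. */
static bool fmu_is_dead(const struct io_state *s, uint32_t now_ms)
{
    /* Unsigned subtraction stays correct across timer wrap-around. */
    return (uint32_t)(now_ms - s->last_fmu_update_ms) > FMU_TIMEOUT_MS;
}

/* Choose the PWM value to output on one channel. */
static uint16_t io_select_output(const struct io_state *s, uint32_t now_ms, int ch)
{
    return fmu_is_dead(s, now_ms) ? s->failsafe_pwm[ch] : s->fmu_pwm[ch];
}
```

With something like this, a dead FMU means the throttle channel drops to
its stored safe value instead of holding the last commanded value.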

The second change is to the EKF code, to make it not use the 'fly
forward' state of the AHRS object for rovers. One of the main changes
that happened at about the time these crashes started was the rapid
changing of the fly_forward state in AHRS when the rover does active
braking. Paul and I can't find any reason why this should make any
difference with respect to crashing, but we thought it was better not to
do this anyway so we fixed it.

I've been running a Pixhawk with the rover code in HIL for a few hours
today and haven't managed to reproduce the crash. I'll keep trying
different ways of producing the crash in the coming days.

Cheers, Tridge

Linus Penzlien

Apr 21, 2014, 7:42:06 AM
to drones-...@googlegroups.com
Hi Tridge,

Thank you for your work on this!

I hope to get the dualgps and debug console working on px4-v1, and can
send you a patch soon so I can add my findings as well.

Have fun!
Linus

Tom Coyle

Apr 21, 2014, 9:36:41 AM
to drones-...@googlegroups.com, thomasc...@comcast.net, and...@tridgell.net
Hi Tridge,

Nice work!
The weather here in southern FL is supposed to be good for most of the week so I will be able to get in plenty of testing.

Regards,
Tom C

Tom Coyle

Apr 21, 2014, 3:04:43 PM
to drones-...@googlegroups.com, thomasc...@comcast.net, and...@tridgell.net
Hi Tridge,

I was able to duplicate the runaway condition after the rover had made two circuits around the test course.

During those two runs, the rover did a loop-de-loop at the second-to-last chicane waypoint, but recovered and continued around the course.

I had EKF disabled, brake percent at 0, and cruise speed at 3 m/s.

Fortunately there was no runaway per se, as Tridge's change to the code last night to put the servo outputs at neutral when the FMU is lost worked perfectly.

I was lining the rover up for a third run, when the MP voice stated that there was a loss of telemetry.

I promptly ran over to the rover and examined the Pixhawk status lights:

FMU: All the LEDs were off and the telemetry transmitter was dead.

I/O: The B/E LED was flashing red, the ACT LED was flashing blue, and the PWR LED was green.

I believe the arming switch was still solid red, but I cannot be sure.

I started to shut the rover down by turning off the ESC and then pushing the arming switch, at which point the Pixhawk rebooted and the FMU section came back to life along with the telemetry radio.

This is beginning to look like an issue with the Pixhawk power selector circuitry, unless my ESC's BEC is going bad. The ESC is brand new, but I will check its BEC output voltage ASAP.

tlogs and data flash log attached.

Regards,
Tom C ArduRover2 Developer


[Attachment: 1.7z]

Robert Lefebvre

Apr 21, 2014, 3:58:36 PM
to drones-discuss, Tom Coyle, Andrew Tridgell
Tom, is it possible there is some connection between steering servo movement and this condition?  How big is your steering servo?

Not to jump the gun and diagnose this prematurely, but there's good info in this thread:


Particularly the last page.  You'll see some links with interesting testing.  While you only have a single servo, you probably don't have a very powerful BEC in that ESC.

If you are powering the Pixhawk from the same source as the servo, you might want to rework that.  I would have thought you were using the Power Module, though?


--
You received this message because you are subscribed to the Google Groups "drones-discuss" group.
To unsubscribe from this group and stop receiving emails from it, send an email to drones-discus...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Tom Coyle

Apr 21, 2014, 4:22:50 PM
to drones-...@googlegroups.com, Tom Coyle, Andrew Tridgell
Hi Rob,

Thanks for the input, much appreciated.

The servo on this rover is of medium size and I do not believe that it is the cause of the FMU shutdown.

Here is what I do believe:

When I was installing a Pixhawk in a Wild Thumper rover that used a Sabertooth 2X25 ESC to control the motors, I had a problem with the Pixhawk dual power supply function.

Since the Sabertooth BEC was powering up at the same time as the Pixhawk was powering up from the PM, the Pixhawk got into a never-ending boot cycle because it could not decide which power input to use, and killing the power was the only way to shut off the Pixhawk.

The solution to the startup issue was to not connect the Sabertooth BEC power to the Pixhawk servo output power rail and just use the PM as the only source of power.

In the case of my rover, I turn the ESC on after the Pixhawk has booted up and I have armed it. Therefore there is no coincident start up power issue.

However, I have found that my Traxxas ESC BEC is putting out 6.07vdc unloaded (I thought it was 5.0vdc) and I believe that this is what might be causing the problem with the Pixhawk.

Is the Pixhawk expecting only 5vdc from the BEC on the servo output power rail?

If so, then this must be the issue with the dual power input function.

I suspect that if I power the rover servo directly from the ESC BEC and not through the Pixhawk servo output power bus, the issue might go away.

However this means that the Pixhawk will only have one power input option and not two.

I do not think that this will be a problem as my APMs work fine on just one power supply and rovers do not fall out of the sky:-)

Paul Riseborough

Apr 21, 2014, 4:38:21 PM
to drones-...@googlegroups.com, Tom Coyle, Andrew Tridgell
Tom,

The IO unit can run from a much wider range of servo voltages than the FMU can. The bottom line is that the FMU will shut down if it is running off the servo power and that voltage goes above about 5.8 V (from memory). Even if you were using a lower-voltage BEC, there are enough voltage spikes on a servo rail, due to motor back EMF and inductance, that it cannot be relied on to power the FMU unless some form of voltage limiting has been applied. In addition, with the current component values, the switchover to servo rail power will not be fast enough if the PMU voltage is removed suddenly, depending on the total electrical load being drawn by the Pixhawk.

I lost a plane as a result of this issue, and it has been discussed extensively in the PX4 dev forum.

The bottom line is that if you run servos from the Pixhawk, you really have only one power source, unless you reorganise your wiring so that the BEC power goes direct to the servo and you use a separate 5V regulator to power the rail.

Andrew Tridgell

Apr 21, 2014, 4:49:32 PM
to Tom Coyle, drones-...@googlegroups.com, thomasc...@comcast.net
Hi Tom,

> Fortunately there was no runaway per se as Tridge's change to the code
> last night to put the servo outputs at neutral when there is a loss of the
> FMU worked perfectly.

ok, at least that works!

> FMU: All the LEDs were off and the telemetry transmitter was dead.
>
> I/O: The B/E LED was flashing red, the ACT LED was flashing blue and the PWR
> LED was green

Just to confirm, was the green FMU power LED off? (it is the one closest
to AUX1 servo output).

If I understand the schematic correctly then that LED (LED704) is not
under software control. It is connected directly to the FMU 5V rail. So
if it goes off then your FMU has lost power.

In the video of the crash from Mike the FMU did retain power, with both
the green FMU power light staying on, and the big multicolour LED
staying on.

This makes me think we have two different types of problem.

btw Tom, the log you sent shows the servo rail at 6.1V. Is that
deliberate? It is above the 5.8V limit for the power selector. That
should be OK, but it does mean the servo rail can't provide backup power
for the FMU.

Cheers, Tridge

Tom Coyle

Apr 21, 2014, 4:53:16 PM
to drones-...@googlegroups.com, Tom Coyle, Andrew Tridgell
Hi Paul,

From what you have said, it looks like I should power the servo directly from the ESC BEC and only let the PM power the Pixhawk.

I wonder why it took so long for this issue to occur as I have been running my Pixhawk/rover with this power configuration from the beginning.

However, I think I know what might be causing the problem.

I was originally running a Traxxas brushed motor and ESC on the Pixhawk rover and recently switched over to a Traxxas brushless motor and ESC.

I suspect that the brushed ESC BEC was providing 5vdc while the brushless ESC is providing 6.07vdc.

I will check the BEC output of the brushed BEC to see what it is.

Robert Lefebvre

Apr 21, 2014, 4:57:30 PM
to drones-discuss
Good info Paul, I wasn't aware of much of this.  How come the IO can use a wider range than the FMU?  

And the idea that the switchover might not be fast enough is new and troubling.  I had bench tested the switchover myself, and found it worked.  What condition is needed to make it fail?

What is the PX4 dev forum?  The PX4 User group email list?


Tom Coyle

Apr 21, 2014, 4:58:49 PM
to drones-...@googlegroups.com, Tom Coyle, thomasc...@comcast.net, and...@tridgell.net
Hi Tridge,

The FMU PWR LED was definitely off.

I think that you are right about there being two different problems: mine and Mike Roberts'.

See my analysis of what the BEC in my brushless ESC is providing to the Pixhawk servo output power rail.

I will power the steering servo separately from the ESC BEC and only use the PM to power the Pixhawk.

I hope that I have not damaged the FMU:-(

Comments?

Regards,
Tom C  

Andrew Tridgell

Apr 21, 2014, 5:02:00 PM
to Tom Coyle, drones-...@googlegroups.com, thomasc...@comcast.net
Hi Tom,

If you do more testing then I think it would be good to get the
following:

1) if the crash bug happens again then get a video of the state of the
pixhawk.

2) look at the state of not only the Pixhawk LEDs, but also the LEDs on
the peripherals, for example, whether the GPS power LED was on.

3) take a multimeter with you and check some key voltages after the
failure. Look at the voltage on the output of the 3DR power brick (you
can get a multimeter probe onto the metal tabs on the output side of
the brick). Also check the power pins on the I2C bus connector (it would
be easiest if you use an I2C expander), and on the spare telemetry UART
port. Maybe check all the voltages before you go to the field as well,
so you are sure you know which pins to probe.

4) once you have all the voltages written down, press the reset button
on the FMU and see if it reacts (the button is accessible through the
side of the case, on the same side as the USB connector). I'd like to
see if the reset button causes the Pixhawk to reboot, and if it does,
then I'd like to see the dataflash log from the reboot.

5) also make sure you have the buzzer attached and note down any sounds
it makes on the crash and after the reset.

Thanks!

Cheers, Tridge

Robert Lefebvre

Apr 21, 2014, 5:03:47 PM
to drones-discuss
Hi Tom, just something else to think about:  I recall you mentioned something about this happening under braking?  And now you say that you have changed motors/ESC.  There could also be some connection here.  When a car ESC goes into braking, it will dump a lot of energy back into the battery pack. This would cause a positive spike on the battery voltage, which maybe the BEC or PM is having trouble dealing with, so you could have too much voltage coming into the Pixhawk.

I wonder if maybe, when you hit the brakes, the battery voltage spikes, the PM can't remove it, and the PM output voltage goes over 5.8V, so the switch-over circuit changes to the servo rail. Then if that rail is at 6.07V, the FMU shuts down but the IO does not?

Just another idea.

On all my heli builds, I try to add at least 470uF if not 4700uF capacitance on the power input to the flight control board.  And I almost always use 2S direct to high-voltage servos.  I've never had any power supply problems or reboots or anything like that.  It's a pretty simple recipe that seems to work well.

Andrew Tridgell

Apr 21, 2014, 5:05:08 PM
to Tom Coyle, drones-...@googlegroups.com
Hi Tom,

> I will power the steering servo separately from the ESC BEC and only use
> the PM to power the Pixhawk.

I'd actually prefer you don't change your setup if you don't mind. The
setup you describe is within the normal operating specs of the Pixhawk,
so if the setup is a problem then we really want to get to the bottom of
it. If you change the hw setup then we may lose the ability to reproduce
it.

> I hope that I have not damaged the FMU:-(

I'm sure we can get you another one if you have, but first I want to
make sure we get as much data as we can. See my last email on data to
gather next time.

Cheers, Tridge

Andrew Tridgell

Apr 21, 2014, 5:07:54 PM
to Robert Lefebvre, drones-discuss
Hi Robert,

> Good info Paul, I wasn't aware of much of this. How come the IO can use a
> wider range than the FMU?

IO has its own voltage regulator that can take up to 10V. The difference
with the FMU is that the FMU needs to supply external peripherals (such
as the GPS) that may be voltage sensitive. The power selector shuts off
power to the FMU 5V rail from the servo rail if the power would go above
5.8V to prevent damage to those external peripherals.

It doesn't route it through an LDO on the FMU side, as that would involve a
voltage drop that could cause the peripherals to go below their minimum
voltage.
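
To make the selection rule concrete, here is a small C model of it. The
real selector is hardware, so this is only a sketch under assumptions:
the 5.8V upper limit is the figure quoted in this thread, while the 4.0V
lower limit, the preference for the power module over the servo rail,
and applying the same window to both inputs are guesses made purely for
illustration.

```c
#include <stdbool.h>

enum power_source { SRC_NONE, SRC_POWER_MODULE, SRC_SERVO_RAIL };

#define MAX_VALID_MV 5800  /* upper limit quoted in the thread */
#define MIN_VALID_MV 4000  /* assumed lower limit, purely illustrative */

/* A rail may feed the FMU 5V rail only inside the valid window. */
static bool rail_valid(int mv)
{
    return mv >= MIN_VALID_MV && mv <= MAX_VALID_MV;
}

/* Pick the source for the FMU 5V rail; the power module is tried first. */
static enum power_source select_fmu_source(int pm_mv, int servo_mv)
{
    if (rail_valid(pm_mv)) {
        return SRC_POWER_MODULE;
    }
    if (rail_valid(servo_mv)) {
        return SRC_SERVO_RAIL;
    }
    return SRC_NONE;  /* FMU loses power; IO has its own regulator */
}
```

In this model a 6.07V servo rail is never selected, so if the power
module input drops out the result is SRC_NONE: the FMU goes dark while
IO keeps running.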

Cheers, Tridge

Andrew Tridgell

Apr 21, 2014, 5:10:04 PM
to Robert Lefebvre, drones-discuss
Hi Robert,

> And the idea that the switchover might not be fast enough is new and
> troubling. I had bench tested the switchover myself, and found it worked.
> What condition is needed to make it fail?

If the voltage from the 3DR brick drops very rapidly then the switchover
may happen after the 3.3V rail on the FMU drops below its critical
level.

You can see some graphs here:

http://uav.tridgell.net/Pixhawk/Power/

Cheers, Tridge

Robert Lefebvre

Apr 21, 2014, 5:31:26 PM
to drones-discuss
I see. Thanks for the explanation.  On the next iteration of the control board, would it make sense to put the FMU processor on the same LDO as the IO processor?  Then both could stay alive over the higher voltage range, and you could still protect the 5V bus for the peripherals in the same way we do now.  The net result would be that the IO and FMU stay alive over a wider range and continue to provide basic functionality, even if the 5V GPS and other sensors are lost.

And as for the 3.3V dropping below critical too fast, sounds like having some capacitance on the PM output could also help with that.  Unfortunately that's a bit trickier to do than it is to add capacitance to the servo rail.

I wonder if a capacitor bank plugged into an unused 5V peripheral port would help?
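
For a rough feel of how much capacitance would matter: the hold-up time
of a capacitor feeding a roughly constant load is about t = C * dV / I.
A minimal sketch in C, where the function name is made up and any load
current or allowable droop you plug in is an assumption, not a Pixhawk
measurement:

```c
/* Approximate hold-up time, in seconds, of a capacitor feeding a
 * constant-current load while its voltage droops by delta_v volts.
 * Ignores ESR and any change in load current with voltage. */
static double holdup_time_s(double capacitance_f, double delta_v, double load_a)
{
    return capacitance_f * delta_v / load_a;
}
```

For example, holdup_time_s(4700e-6, 1.0, 0.5) gives about 9.4 ms, and
470uF under the same assumed 0.5 A load and 1 V droop gives about
0.94 ms. Whether that is enough depends on how fast the switchover
really is.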




mroberts

Apr 21, 2014, 6:33:01 PM
to drones-...@googlegroups.com
Tridge, you said my FMU firmware was dying, based on the lights. Is there any point in me pursuing the debug cable still?

Hope to have some more test data this afternoon.

Andrew Tridgell

Apr 21, 2014, 9:14:57 PM
to mroberts, drones-...@googlegroups.com
Hi Mike,

> Tridge, you said my FMU firmware was dying, based on the lights. Is
> there any point in me pursuing the debug cable still?

yes!

When the firmware dies it is quite possible it will write an "oops" crash
trace on the console. If we can catch that then we can use the numbers
in the trace to narrow down the bug (it may even tell us what line of
code the bug is on).

Even if we don't catch a console crash report then just running "perf"
and "ps" on the console after the crash may give us valuable clues.

It would actually be ideal if you had the debug console connected and
capturing output when the FMU crash happens, but I know that may be
difficult to arrange.

Cheers, Tridge

Tom Coyle

Apr 22, 2014, 12:50:28 PM
to drones-...@googlegroups.com, thomasc...@comcast.net, and...@tridgell.net
Hi Tridge,

This morning I charged up the battery on the Pixhawk-equipped rover and ran a bench-top test for 20 min with the ESC on, running the motor up and down unloaded and wagging the steering from side to side. There was no shutdown of the FMU during that time.

I then took the rover outdoors to run it manually to see if I could induce the FMU shutdown. Ten minutes of hard acceleration, braking, and hard-over steering did not cause the FMU to shut down.

Therefore the last option is to run the rover on my test course in successive Auto mode runs until I can induce the FMU to shut down, and collect symptom data per your request.

Collecting electrical symptom data will not be easy, as I am operating out of the back of my SUV and have no one to assist me. I may try to bring the rover home in its failed state (still powered up), where I have better troubleshooting facilities.

The fact that I could not induce an FMU shutdown with hard acceleration, braking, and hard-over steering in Manual mode may indicate that it is operation in Auto mode that induces the FMU shutdown.

Regards,
Tom C ArduRover2 Developer


john...@gmail.com

Apr 22, 2014, 3:44:23 PM
to drones-...@googlegroups.com, thomasc...@comcast.net, and...@tridgell.net
Try bogging down the motor and servos (prevent them from moving) while pulling the stick back and forth. That will draw the highest current possible from motor and servo.

- JAB

Tom Coyle

Apr 22, 2014, 4:20:54 PM
to drones-...@googlegroups.com, thomasc...@comcast.net, and...@tridgell.net
Hi Tridge,

Took the Pixhawk equipped rover out to the test track this afternoon, but could not get into the condition where the FMU would turn off.

However the rover was doing loop de loops through the chicanes and eventually would not navigate from the starting point to the first waypoint.

It would either go to the left or right of the intended course.

I tried reloading the waypoints, but no joy.

Finally something happened to the GPS and the MP was showing its location about 100 ft from where it was actually sitting.

Rebooting cured the GPS location issue and I got in one more run, but the rover was still doing the loop de loops through the chicanes.

Finally I shut down, brought the rover home, and downloaded the data flash logs.

I found that the compass is showing the correct heading on the MP, but the GPS track keeps moving round like the hands of a clock, even though I have 9 sats and an HDOP of <2.

So I loaded on an earlier build of the v2.45 firmware and still got random movement of the GPS track around the car icon.

I reloaded the latest build from Sunday and am still seeing the same erratic GPS track movement. The compass appears to be okay.

I suspect that the GPS may be going south unless there is a hardware issue with the Pixhawk.

As a reference my APM2.5 equipped rover has a stationary GPS track and good compass tracking.

tlog and data flash logs are in the Google drive: Rover Crash Bug log data



Andrew Tridgell

Apr 26, 2014, 5:06:25 PM
to thomasc...@comcast.net, drones-...@googlegroups.com
Hi Tom and Mike,

A bit of an update on the rover FMU firmware crash bug.

I have been successfully reproducing the bug for a couple of days now,
using a cable made from a piece of the cable that Mike gave me. It is a
1.3 meter 9 core cable:

http://www.jaycar.com.au/productView.asp?ID=WB1578

All 9 cores are connected as follows:

one core: common ground between I2C and UART
3 cores: remaining connectors for DF13-4 I2C
5 cores: remaining connectors for DF13-6 UART

The bug triggers on average about every 7 minutes or so, but varies
quite a lot in timing. I have reproduced it on a Pixhawk 2.1 board with
JTAG connectors, and have caught the error using gdb with a black magic
probe.

Despite all this I still haven't found the bug! It is one of the
weirdest bugs I've ever dealt with.

When the bug triggers I get a few 'impossible' things. It causes memory
corruption at very repeatable addresses with very repeatable values. So
I can use asserts and hardware watch points to catch the
bug. Unfortunately when the debugger catches the error I then look at
the variables and registers and it looks like the assert should not have
triggered.

For example, this line of code triggers an assert:

https://github.com/tridge/PX4Firmware/blob/the-bug-from-hell-wip/src/drivers/hmc5883/hmc5883.cpp#L751

When that triggers I then look at in_trampoline3 and it3 and they are
both 8945. This is despite it being in a region of code with interrupts
disabled, and surrounded by other code that should ensure it can never
happen. Note that this file has optimisation disabled.

Lorenz and I will keep poking at it, and I'm sure we'll find the issue,
but for now the bug is winning :-)

Cheers, Tridge

john...@gmail.com

Apr 26, 2014, 5:47:46 PM
to drones-...@googlegroups.com, thomasc...@comcast.net, and...@tridgell.net
Ahh. It's so much more fun dealing with potential hardware and software bugs at the same time, instead of just software... :)

Tom Coyle

Apr 26, 2014, 6:09:19 PM
to drones-...@googlegroups.com, thomasc...@comcast.net, and...@tridgell.net
Hi Tridge,

Nice detective work, but sounds frustrating that you still cannot pinpoint the cause.

Since the ArduPlane and ArduCopter users have not experienced similar symptoms, does this mean that it is isolated to just the rover code?

The laborious way to isolate the bug is to fall back to the code from before active braking was added and load one new build at a time until the bug appears.

This may be your fallback option if you cannot isolate the bug in the present version of the code.

Is there any connection between the APM runaway issue and this bug, even though the hardware and code are different?

Regards,
Tom C ArduRover2 Developer

MikeRover

Apr 27, 2014, 1:30:22 AM
to drones-...@googlegroups.com, thomasc...@comcast.net, and...@tridgell.net
Tridge et al,
  Glad to hear you've been able to reproduce it and it wasn't just dodgy soldering! I've run for about an hour or so with a new cable that hasn't produced the bug.  Heading out to the field tomorrow with that cable, so we'll see how it goes.
 
  The lesson at the moment seems to be not to run UART and I2C wiring together over an excessively long run.
 
  To assist your debugging, is there a specific build you'd like me to run?  I have your debug build loaded at the moment.
 
  I can also do an update halfway through the day, after I've exercised the current build with the new cable.
 
  Regarding the new IO failsafe stuff, does that use the RCx_TRIM values as the failsafe values? 
 
  Mike.

Andrew Tridgell

Apr 27, 2014, 1:33:46 AM
to Tom Coyle, drones-...@googlegroups.com
Hi Tom,

> Since the ArduPlane and ArduCopter users have not experienced similar
> symptoms, does this mean that it is isolated to just the rover code?

I suspect it is not Rover specific, but I can't prove that yet. I'll
keep working on it this evening to see if I can pin it down.

> Is there any connection between the APM runaway issue and this code even
> though the hardware and code is different?

My current suspicion is that they are separate bugs, but until I have
one of the bugs really pinned down I can't be certain.

I'm looking forward to us analysing your logs tomorrow morning in our
weekly hangout to see what we can find.

Cheers, Tridge

Roberto Navoni

Apr 27, 2014, 3:38:04 AM
to drones-discuss
Hmm,
not good news... :(
I'll ask Luca to check whether we have the same bug in the VR Brain port
of NuttX with APM Rover, to see whether our revision of the OS is immune
or not. If we find something, we'll try to help you.
Best,
Roberto

Andrew Tridgell

Apr 27, 2014, 7:52:34 AM
to Tom Coyle, drones-...@googlegroups.com
> I suspect it is not Rover specific, but I can't prove that yet. I'll
> keep working on it this evening to see if I can pin it down.

I've now proven the bug is not Rover specific, and in fact isn't in any
of the ardupilot code. It is in the core PX4 code somewhere.

I've reproduced the bug without starting ardupilot, starting just the
uorb, ms5611, hmc5883 and either l3gd20 or gps drivers (either of those
is sufficient).

Cheers, Tridge

Andrew Tridgell

Apr 27, 2014, 7:53:17 AM
to MikeRover, drones-...@googlegroups.com, thomasc...@comcast.net
> To assist your debugging, is there a specific build you'd like me to
> run? I have your debug build loaded at the moment.

just current git master.

> Regarding the new IO failsafe stuff, does that use the RCx_TRIM values as
> the failsafe values?

yep!

Tom Coyle

Apr 27, 2014, 3:44:11 PM
to drones-...@googlegroups.com, thomasc...@comcast.net, and...@tridgell.net
Hi Tridge

It was ugly today.

PX4 firmware: v2.45(20de5b30)  f05f42cc NuttX  ed45e813

The rover did the loop de loops in the chicanes and failed to complete the course after the first left hand turn.

After the first left hand turn, the rover would not navigate correctly to the next waypoint and complete the second left hand turn.

After making the first left hand turn, the rover would navigate diagonally into the curb on the right side of the roadway.

After each curb impact, the rover would not navigate correctly from the starting point and would veer off to the left side of the path to the first waypoint.

The only way to recover was to reboot the Pixhawk.

I tried both DCM and EKF.

In one instance with EKF enabled the rover went straight through the chicane, made the first left hand turn and then navigated diagonally away from the next waypoint and hit the right side roadway curb.

tlogs and data flash logs at Google Drive: Pixhawk test data


Regards,
Tom C



Andrew Tridgell

Apr 27, 2014, 9:55:11 PM
to Tom Coyle, drones-...@googlegroups.com, lom...@inf.ethz.ch
Hi All,

I believe I have now found and fixed this bug. The fix is here:

https://github.com/diydrones/PX4NuttX/commit/65cd7f85f31ac895f142771f1bb0b27a1a69832b

The bug was that the interrupt service routine for the I2C bus could
write to transfer buffers from a previous transfer after that transfer
had completed. So the sequence of events was:

1) HMC5883::collect() does an I2C read transfer to a stack buffer. This
sets up priv->ptr and priv->dcnt to point to an area on the stack of the
hpwork task

2) HMC5883::measure() gets called to set up the HMC5883 for the next
reading. It sets up priv->msgv, but leaves priv->ptr and priv->dcnt
at the values from the previous transfer

3) while in stm32_i2c_process() for the write_reg() in
HMC5883::measure() we get an unexpected interrupt from the I2C bus
before the start bit has been seen. This means priv->dcnt and
priv->ptr have not yet been set up for the new
transfer. Specifically, we get an I2C status which includes the
I2C_SR1_RXNE bit, which is for receiving a byte (remember that we
are in a send, not a receive). The code sees this status bit and
does this:

*priv->ptr++ = stm32_i2c_getreg(priv, STM32_I2C_DR_OFFSET);

that overwrites the previously set up stack area from collect(),
which is now a piece of stack used by another function.

4) that overwrite happens to be in the area of memory that holds the
dq_queue_t that is used to control the queueing of tasks to HPWORK

5) when the dq_rem() function is next called on the HPWORK queue, it
then uses that now corrupt queue structure, which causes an
overwrite of a different area of memory, which happens now to be in
the heap nodelist. It wipes out the high byte of the flink in a
nodelist element

6) the next malloc call that is of the right size to walk this part of
the heap (usually from starting a new task, such as running "perf")
then dereferences the invalid flink, and the processor faults. The
FMU firmware is then dead.
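
The failure mode in steps 1 to 3 can be shown in miniature. The C below
is a simplified stand-in, not the NuttX driver and not the actual patch:
the struct and function names are hypothetical, and the guard in the
receive path is an assumption about one reasonable way to close the
hole. It shows the two defensive ideas: forget the buffer pointer when a
transfer ends, and refuse to store a received byte when no receive is in
progress.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for the driver's private state. */
struct i2c_priv {
    uint8_t *ptr;   /* where the ISR stores the next received byte */
    int      dcnt;  /* bytes still expected in the current transfer */
};

/* Start a read into a caller-supplied (possibly stack) buffer. */
static void i2c_begin_read(struct i2c_priv *priv, uint8_t *buf, int len)
{
    priv->ptr = buf;
    priv->dcnt = len;
}

/* End of transfer: scrub the pointer so a late interrupt cannot use it. */
static void i2c_end_transfer(struct i2c_priv *priv)
{
    priv->ptr = NULL;
    priv->dcnt = 0;
}

/* Receive path of the ISR, guarded against stale or missing buffers. */
static void i2c_isr_rx(struct i2c_priv *priv, uint8_t data)
{
    if (priv->ptr != NULL && priv->dcnt > 0) {
        *priv->ptr++ = data;
        priv->dcnt--;
    }
    /* Otherwise this is an unexpected receive event: drop the byte. */
}
```

Without the scrub and the guard, a stray receive interrupt after the
transfer has ended would write through the old pointer into memory that
now belongs to someone else, which is the overwrite described in step 3.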

Lorenz, please check my logic and the patch.

As far as the practical impact goes, this bug affects all PX4 builds
since we started using the PX4 code. So it affects both PX4v1 and
Pixhawk, and affects all vehicle types and the PX4 native firmware as
well. Once Lorenz and the rest of the PX4 dev team have had a chance to
check over the fix I'd suggest we need to put out patch releases for all
vehicle types.

Many thanks to Mike and Tom for bringing this bug to our attention!

Cheers, Tridge

Kevin Hester

Apr 27, 2014, 10:13:13 PM
to drones-discuss, Tom Coyle, Lorenz Meier
Thanks Tridge!  (For others - he needed to spend a number of days of his valuable time on this super nasty bug)




Philip Rowse

Apr 27, 2014, 10:15:05 PM
to drones-...@googlegroups.com, Tom Coyle, Lorenz Meier
Great job Tridge :)

Philip Rowse
Electronics Engineering Dept
3DRobotics
Ballarat
Australia

Ben Nizette

Apr 27, 2014, 10:58:27 PM
to drones-...@googlegroups.com
Nice one, I've been following this on Skype, been a pleasure to watch (also a pleasure not to have to deal with myself!), I learned some neat GDB tricks for next time I'm hit with something similar, cheers! Reminds me of an IMU bug I had back in the day which was related to the data values; i.e. the bug triggered if and only if the device was in a specific pose on my desk.

Makes me think though, I wonder how widespread this programming pattern is within NuttX/PX4Firmware/etc.  This is an interesting and non-obvious case of exposing a pointer to a stack variable outside the scope of that function - in this case, to an ISR.  Lines like

ret = transfer(&cmd, 1, (uint8_t *)&hmc_report, sizeof(hmc_report));

Where hmc_report is on the stack /look/ perfectly innocuous but if transfer uses an ISR internally (it does) and can't guarantee that the ISR is synchronous with the function call (it can't) then we get problems as we all now know!  With your fix, the pointer to the stack is scrubbed when transfer returns so even if the ISR is not synchronous with the function call, at least the visibility of the pointer is.

Having said that, I guess the fact that it's a stack variable isn't strictly relevant to correctness here though, come to think of it.  If it were a heap variable it's likely to have been freed before the ISR tries to write, if it's static then the ISR is still going to race with a consumer, it's all about limiting the action of the function to between that function's call and return.

Once again, seems like pretty much any of the other bus drivers that use ISRs or DMA (i.e. any of the other bus drivers) would be worth a quick audit (along with a fair chunk of code I've written myself over the years!).

Anyway, great job, thanks!
  --Ben.

Tom Coyle

Apr 27, 2014, 11:18:51 PM
to drones-...@googlegroups.com, Tom Coyle, lom...@inf.ethz.ch, and...@tridgell.net
Hi Tridge,

Good work. Nice that you have put this one to bed!

Regards,
Tom C ArduRover2 Developer

Mark Colwell

Apr 28, 2014, 7:18:42 AM
to drones-...@googlegroups.com
Andrew, Thank you for finding this deep bug, wish I could have helped more.



Robert Lefebvre

Apr 28, 2014, 1:05:20 PM
to drones-discuss
Wow Tridge, great work finding this!

Just curious, I don't see any connection between this and the cable construction methods talked about on Saturday/Sunday (i.e. long, I2C and UART together).  This is the same issue, right?  So does it affect more than just those with long cables?  Is it just more likely with long cables, and if so, why?

Or is this a completely separate issue?


Andrew Tridgell

Apr 28, 2014, 10:24:01 PM
to Ben Nizette, drones-...@googlegroups.com
Hi Ben,

> Once again, seems like pretty much any of the other bus drivers that use
> ISRs or DMA (i.e. any of the other bus drivers) would be worth a quick
> audit (along with a fair chunk of code I've written myself over the years!).

I looked over the stm32_spi.c code and I couldn't see anything
equivalent. I haven't examined the UART code for this type of bug.

Cheers, Tridge