Possible link between EKF and thermal overload -> Pixhawk crash


mroberts

Apr 17, 2014, 1:23:35 AM
to drones-...@googlegroups.com
Hi all,
  This is currently just a theory, but it has potentially serious implications for Plane and Copter.
 
  From what Tom and I have seen in Rover - https://groups.google.com/forum/#!topic/drones-discuss/fhhgJ8z7zL4 - the EKF may be working the Pixhawk hard enough to overheat it and shut it down.
 
  I've had numerous Pixhawk shutdowns (to the point where it would last less than five minutes between resets) and in each case, the temperature (dataflash BARO->Temp) was getting up to 45-48 degrees, then the log would end.
 
  When this happens, I lose telemetry and RC control, and worst of all, the servos hold their last position rather than going into HOLD or RTL.
 
  If anyone has some dataflash logs from EKF testing on Plane or Copter, it could be worth checking the temperature plot. Contributing factors for me are running two large, heavily loaded servos through the IO side of the Pixhawk, two GPS, and running it inside a box in quite warm weather. I'd say it's unlikely on copter since no-one runs power through the IO side, and they're usually well ventilated.  Plane could be a serious problem.
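For anyone checking their own logs, here is a minimal sketch of the temperature check described above. It assumes the BARO Temp samples have already been exported from the dataflash log as (time in seconds, degrees C) pairs. The names `TempSummary` and `summarize_temps`, the 45 C threshold, and the demo samples are all made up for illustration and are not part of any ArduPilot tool.

```python
from typing import Iterable, NamedTuple, Optional, Tuple


class TempSummary(NamedTuple):
    start_c: float                 # first temperature sample in the log
    peak_c: float                  # highest temperature seen
    last_c: float                  # final sample before the log ends
    first_over_s: Optional[float]  # time of first sample >= threshold, if any


def summarize_temps(samples: Iterable[Tuple[float, float]],
                    threshold_c: float) -> TempSummary:
    """Summarize (time_s, temp_c) samples taken from a dataflash log."""
    samples = list(samples)
    if not samples:
        raise ValueError("no temperature samples")
    temps = [t for _, t in samples]
    # Time of the first sample at or above the threshold, or None if never reached.
    first_over = next((ts for ts, t in samples if t >= threshold_c), None)
    return TempSummary(temps[0], max(temps), temps[-1], first_over)


if __name__ == "__main__":
    # Made-up samples for illustration only, not taken from a real log.
    demo = [(0.0, 22.0), (60.0, 31.5), (120.0, 44.0), (180.0, 47.5)]
    print(summarize_temps(demo, threshold_c=45.0))
```

If the last sample is also the peak and the log stops shortly afterwards, that matches the pattern I described; on its own it proves nothing about cause.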
 
  If I'm wrong, then I still have something to debug and everyone else's planes are safe.  I wanted to post here rather than on Github or DIYDrones to enable some filtering and verification by developers without panicking anyone. 
 
  Mike.

Jonathan Challinger

Apr 17, 2014, 1:51:06 AM
to drones-...@googlegroups.com
Wow, temperatures do get high in the case. Over a few hours of flying, my baro temp started at something like 22C and made it up to a peak of 55C!

I don't like to see that. Never got a shutdown, but temp variation that big is bad. The sensors want a nice consistent temperature.

I'm going to de-case the Pixhawk in my plane. I can't replace this airframe; HobbyKing hasn't got any more.



Jonathan Challinger

Apr 17, 2014, 2:17:06 AM
to drones-...@googlegroups.com
[11:04:27 PM] Jonathan Challinger: have you toasted a pixhawk with a benchmark in its case, in a 50C+ environment?
[11:04:51 PM] Jonathan Challinger: because i think that's what it needs to withstand.
[11:05:28 PM] Jonathan Challinger: i get temps of 112F here (45C) and i don't want to watch any of my aircraft tumble out of the sky
[11:05:32 PM] Craig Elder: Yes

Looks like you guys should look at other causes. I had much higher temps than you without trouble.

mroberts

Apr 17, 2014, 2:25:50 AM
to drones-...@googlegroups.com
Bugger. Back to the drawing board.
 

mroberts

Apr 17, 2014, 2:27:30 AM
to drones-...@googlegroups.com

Andrew Tridgell

Apr 17, 2014, 2:28:00 AM
to mroberts, drones-...@googlegroups.com
Hi Mike,

I just wanted to correct a misconception. Enabling EKF does not change
the CPU load on a Pixhawk. On all releases since we added the EKF it is
running all the time. The DCM code is also running all the time. The
only difference is which of the two sets of outputs is actually used.

I'm not saying that there is no thermal problem (I can't prove that one
way or the other yet). But I can tell you that enabling the EKF won't
make a significant difference in the thermal behaviour of the CPU.

I've started looking through the log files from you and Tom. I don't
have a lot of time right now, but I'll have a look as I can.

Here are some things I noticed in your "lost 10.BIN" file:

- the servo rail voltage is averaging about 5.9V, and is fluctuating
above and below 5.8V. That is interesting, as 5.8V is the critical
transition voltage for the Pixhawk using the IO servo rail as a
backup power supply for the FMU board. As long as the main 3DR brick
power remains solid that shouldn't matter, but I thought I'd note it
as it is unusual, and when looking at bugs like this anything unusual
is worth noting. (see the POWR.Vservo graph)

- the barometer temperature is indeed rising, but not to particularly
high levels. I've certainly seen it much higher without any
issues. It rises to about 45C in your log, which should not be an
issue for any of the components
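A minimal sketch of how the servo rail observation could be quantified, assuming the POWR.Vservo samples have already been exported from the log as a list of volts. The function name `count_crossings` is hypothetical; only the 5.8V figure comes from the note above.

```python
from typing import Sequence

BACKUP_THRESHOLD_V = 5.8  # transition voltage quoted above


def count_crossings(volts: Sequence[float],
                    threshold: float = BACKUP_THRESHOLD_V) -> int:
    """Count how many times consecutive samples straddle the threshold."""
    crossings = 0
    for prev, cur in zip(volts, volts[1:]):
        # A crossing is any pair of neighbours on opposite sides of the threshold.
        if (prev < threshold) != (cur < threshold):
            crossings += 1
    return crossings


if __name__ == "__main__":
    # Made-up samples for illustration only.
    print(count_crossings([5.9, 5.7, 5.9, 6.0, 5.75]))
```

A high crossing count only says the rail spends time on both sides of the transition voltage; whether that matters still depends on the main power supply staying solid.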

There is nothing else really notable in the log that could explain a
bug of this type. If you or Tom can find a way to reproduce the problem
reliably then what we would need is you to connect a debug console cable
then to run the following commands and capture the output:

ps
perf

If NuttX is still running then those two commands will print quite a
bit of output on the console. If you can reproduce the crash and get
that output then it would enable us to eliminate a long list of possible
causes.

I'll keep looking at the log files from Tom as I get time.

Cheers, Tridge

mroberts

Apr 17, 2014, 9:55:08 AM
to drones-...@googlegroups.com
Tridge,
  Is there any way to run both a second GPS and the debug output easily, or should I just plan to yank the GPS cable from serial 4/5 and replace it with the debug cable when things go pear shaped?
 
 Hopefully I'll get the cable built on Tuesday, and give the Rover another run on Wednesday.  If it's anything like yesterday, I should be able to repeat the problem every five minutes!!
 
  Would you recommend going to the latest build, or sticking with the 08 April?  If the blue ACT light on the IO side is still flashing (with the red IO light flashing twice as fast), and the big light green, does that mean the IO firmware side is still alive?  

Randy Mackay

Apr 17, 2014, 10:11:43 AM
to drones-...@googlegroups.com

 

    I can answer one of those questions, which is the Serial 4/5 cable.   You should be able to make a Y cable: a cable with one 6-pin DF13 end that fits into the Pixhawk but then splits into two cables, one that goes to the GPS and one that goes to your FTDI cable or whatever you’re using to connect the debug output to the computer.

 

     We have a wiki page which shows how to build the debug cable, but I believe it’s now incorrect.  The orange and yellow wires need to be moved along two places.  The power, gnd and the slots where the orange and yellow wires are currently connected in that diagram would instead need to be connected to the GPS.

          http://dev.ardupilot.com/wiki/interfacing-with-pixhawk-using-the-nsh/

 

-Randy


Craig Elder

Apr 18, 2014, 1:07:34 PM
to drones-discuss
Alex has changed that image and has moved the wires to the correct positions in the connector

Craig Elder

Apr 18, 2014, 1:25:10 PM
to drones-discuss
Temperature rise above ambient measured in 3 places on the board: I/O processor, 3.3V regulator, and FMU processor.  The test was conducted at 25C ambient and shows about a 20C rise above ambient.


Inline image 1

Robert Lefebvre

Apr 18, 2014, 4:19:23 PM
to drones-discuss
That's in the case with no airflow?  If so, it seems pretty good.

Could the issue be more to do with the sun beating down on a flat black case?

Andrew Tridgell

Apr 18, 2014, 5:01:17 PM
to mroberts, drones-...@googlegroups.com
Hi Mike,

> Would you recommend going to the latest build, or sticking with the 08
> April?

go with whichever version produces the problem!

> If the blue ACT light on the IO side is still flashing (with the
> red IO light flashing twice as fast), and the big light green, does that
> mean the IO firmware side is still alive?

If the blue light is flashing then IO is running.

btw, another approach to finding the issue is a bisection search in the
firmwares.

If you look here:

http://firmware.diydrones.com/Rover/

you'll see every build we've done for Rover. With log(N) tries you could
narrow down exactly which change caused the problem to appear.
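A minimal sketch of that bisection, assuming the builds are listed in chronological order and that once the problem appears it stays in every later build. The function name `first_bad_build`, the `crashes` callback, and the build labels in the demo are hypothetical; in practice `crashes` means flashing that build and running the rover until you are confident either way.

```python
from typing import Callable, Optional, Sequence


def first_bad_build(builds: Sequence[str],
                    crashes: Callable[[str], bool]) -> Optional[str]:
    """Return the earliest build for which crashes(build) is True.

    Assumes builds are in chronological order and that every build after
    the first bad one is also bad. Returns None if no build crashes.
    """
    lo, hi = 0, len(builds)          # index of the first bad build lies in [lo, hi]
    while lo < hi:
        mid = (lo + hi) // 2
        if crashes(builds[mid]):
            hi = mid                 # first bad build is at mid or earlier
        else:
            lo = mid + 1             # first bad build is after mid
    return builds[lo] if lo < len(builds) else None


if __name__ == "__main__":
    # Hypothetical build labels; the predicate stands in for a manual test.
    demo_builds = ["2014-04-%02d" % day for day in range(1, 17)]
    print(first_bad_build(demo_builds, lambda b: b >= "2014-04-06"))
```

With an intermittent crash, a single clean run does not prove a build is good, so each step may need several runs before answering "no".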

I'd also be interested to know if disabling a particular feature stops
the problem happening. For example, does disabling the braking code
prevent the issue?

Cheers, Tridge

Lorenz Meier

Apr 18, 2014, 5:29:38 PM
to drones-...@googlegroups.com
Hi,

Just to make sure this is well understood in this thread: The temperature of 55 degrees reported is perfectly within the normal operating conditions. In fact the whole board should be fine up to 85 degrees and the main processor certainly is.
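As a rough worked example of the margin involved, here is a minimal sketch that combines the 85 degree figure with the roughly 20C rise above ambient from the bench test posted earlier in this thread. It assumes the rise is the same at any ambient temperature and ignores extra heating such as direct sun on the case, so treat the result as an estimate only. The function name `headroom_c` is hypothetical.

```python
BOARD_LIMIT_C = 85.0      # upper limit quoted above
MEASURED_RISE_C = 20.0    # rise above ambient from the bench test in this thread


def headroom_c(ambient_c: float,
               rise_c: float = MEASURED_RISE_C,
               limit_c: float = BOARD_LIMIT_C) -> float:
    """Margin below the limit, assuming the same rise at any ambient temperature."""
    return limit_c - (ambient_c + rise_c)


if __name__ == "__main__":
    for ambient in (25.0, 45.0, 50.0):
        print(ambient, headroom_c(ambient))
```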

Could you please confirm you are using the power brick? The servo rail is not a trusted primary power source (depending on the servos and BEC used on it, it can be, but we find that many setups have extremely noisy servo rails), and the brick needs to be in place and powered. It's what has been designed and tested for the autopilot.

The thermal load inside the autopilot is not changed significantly by more servos. The traces connecting the back rail are wide and on several layers and offer very low resistance (= very low thermal losses).

Cheers,
Lorenz


On 18.04.2014 at 22:19, Robert Lefebvre <robert....@gmail.com> wrote:

That's in the case with no airflow?  If so, it seems pretty good.

Could the issue be more to do with the sun beating down on a flat black case?
On 18 April 2014 13:25, Craig Elder <cr...@3drobotics.com> wrote:
Temperature rise above ambient measured in 3 places on the board I/O processor , 3.3V regulator, and FMU processor.  Test was conducted at 25C and shows about 20C rise above ambient


<Plot.png>

Tom Coyle

Apr 18, 2014, 6:10:38 PM
to drones-...@googlegroups.com, mroberts, and...@tridgell.net
Hi Tridge,

I experienced the runaway issue with the build that you completed on Sunday April 6th during the ArduRover2 Developers Code Hacking Hangout.

Regards,
Tom C

mroberts

Apr 18, 2014, 8:31:20 PM
to drones-...@googlegroups.com
@lorenz, running the standard power module.

@chris, I must have got the wrong end of the stick somewhere. At one point I thought I saw 50C as the limit for the Pixhawk.

@Tridge, GitHub shows the last Rover specific edits seemingly about 10 days ago, yet there are more than daily updates on the firmware site - is Rover rebuilt every time there's a commit in another platform even if there are no changes? Thus I only need to look at builds between the last rover edit and the Jan release?

Tom Coyle

Apr 18, 2014, 8:57:13 PM
to drones-...@googlegroups.com
Hi Mike,

The first appearance of the braking percentage parameter should be around April 6th. That was the first attempt to incorporate the active braking function.

Regards,
Tom C Ardurover2 Developer

Andrew Tridgell

Apr 18, 2014, 9:07:22 PM
to mroberts, drones-...@googlegroups.com
> @Tridge, GitHub shows the last Rover specific edits seemingly about 10
> days ago, yet there are more than daily updates on the firmware site -
> is Rover rebuilt every time there's a commit in another platform even
> if there are no changes?

yes

> Thus I only need to look at builds between the last rover edit and
> the Jan release?

if there is a library change (or PX4Firmware or PX4NuttX change) then
that can affect rover too, depending on the change. That is why it
rebuilds. It could be made smarter, avoiding some builds, but right now
it rebuilds every time.

Cheers, Tridge

Mike Ellery

Apr 19, 2014, 11:46:52 AM
to drones-...@googlegroups.com, mroberts, and...@tridgell.net
Do you know what rev of PX4Firmware was used when building the rover 2.45 release firmware?

Thanks

Craig Elder

Apr 20, 2014, 11:16:22 AM
to drones-discuss, Andrew Tridgell, mroberts

If you look in the first part of a dataflash log you can see the builds used to create the version of code including the px4 and nuttx versions.


mroberts

Apr 22, 2014, 7:50:53 AM
to drones-...@googlegroups.com
First of all, is there a mod who can change the title to something more correct and less scaremongery (since I was wrong about the thermal thing)?
 
Second, I have achieved debug output.  The curious thing is that the debug connection seems to crash the Pixhawk in exactly the same way as I was seeing before.
 
I had it powered via USB (direct from a wall charger), with no servos attached.  I connected to it through the debug cable and almost straight away it crashed with the same symptoms:
  • Big LED: Green
  • FMU PWR: Green
  • IO PWR: Green
  • IO B/E: Flashing Red
  • IO ACT: Flashing Blue
I've attached a screenshot of the debug output.
 
A couple of interesting things:
Re-arming doesn't reset things, but the safety light toggles between flashing and solid.
When it reboots, it automatically reconnects with Mission Planner.
 
I can crash it fairly reliably, so am open to suggestions of things to try.  This is all with April 08.
 
I still need to start working backwards on the builds.
Terminal Debug 2014-04-22_2115.png

Tom Coyle

Apr 22, 2014, 8:02:12 AM
to drones-...@googlegroups.com
Hi Mike,

You might want to start posting in the rover crash bug thread to achieve more visibility with your troubleshooting.

Regards,
Tom C ArduRover2 Developer

Andrew Tridgell

Apr 22, 2014, 8:04:23 AM
to Mike Ellery, drones-...@googlegroups.com, mroberts
Hi Mike,

> Do you know what rev of PX4Firmware was used when building the rover 2.45
> release firmware?

yep. It shows a banner on connection like this:

APM: ArduRover v2.45 (5b1ac474)
APM: PX4: 2699e15d NuttX: a6686464

the hex numbers are the git hashes of PX4Firmware and PX4NuttX
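A minimal sketch for pulling those hashes out of captured banner text, for example text copied from a console or log viewer. The regular expressions are written against the two example lines above and may not match other firmware versions. The function name `parse_banner` and the dictionary keys are hypothetical, and treating the hash in brackets on the first line as the vehicle code's own commit is an assumption.

```python
import re
from typing import Dict

# Patterns follow the two banner lines shown above.
_VEHICLE_RE = re.compile(r"APM:\s+(\S+)\s+v(\S+)\s+\(([0-9a-f]+)\)")
_PX4_RE = re.compile(r"APM:\s+PX4:\s+([0-9a-f]+)\s+NuttX:\s+([0-9a-f]+)")


def parse_banner(text: str) -> Dict[str, str]:
    """Extract vehicle name, version and git hashes from banner text."""
    info: Dict[str, str] = {}
    m = _VEHICLE_RE.search(text)
    if m:
        # vehicle_hash: assumed to be the commit of the vehicle code itself.
        info["vehicle"], info["version"], info["vehicle_hash"] = m.groups()
    m = _PX4_RE.search(text)
    if m:
        info["px4_hash"], info["nuttx_hash"] = m.groups()
    return info


if __name__ == "__main__":
    banner = ("APM: ArduRover v2.45 (5b1ac474)\n"
              "APM: PX4: 2699e15d NuttX: a6686464\n")
    print(parse_banner(banner))
```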

Cheers, Tridge

PS: This info is also in dataflash logs and tlogs

Andrew Tridgell

Apr 22, 2014, 8:23:49 AM
to mroberts, drones-...@googlegroups.com
> Second, I have achieved debug output. The curious thing is that the debug
> connection seems to crash the Pixhawk in exactly the same way as I was
> seeing before.

That's great, Mike!

> I've attached a screenshot of the debug output.

good. Now to interpret the hex numbers in that crash I need to know the
precise firmware you are running. A tlog or dataflash log would tell me
that, but it is also important that I be able to get the exact binary
with the same compiler. Did you use firmware loaded via MP? If so,
precisely which version? If you compiled it yourself I'll need the .elf
file from the build.

> I can crash it fairly reliably, so am open to suggestions of things to
> try. This is all with April 08.
>
> I still need to start working backwards on the builds.

bisecting backwards in the builds may indeed uncover the issue.

If you can get me a log from the current build then I can see if I can
find the exact elf file, and try to get some call information from the
above trace.

Otherwise maybe I'll pay you a visit and see if we can debug it in
person :)

Cheers, Tridge

mroberts

Apr 22, 2014, 9:30:36 AM
to drones-...@googlegroups.com, mroberts, and...@tridgell.net
Crash from latest build - 2014-04-22 02:04
 
Had to do it in two parts due to my screen res - a and b.
FW 2014-04-22-0204a.png
FW 2014-04-22-0204b.png

Andrew Tridgell

Apr 22, 2014, 5:11:03 PM
to mroberts, drones-...@googlegroups.com
Hi Mike,

> Crash from latest build - 2014-04-22 02:04

When I check this trace against an elf image from that build I get:

(gdb) info line *0x0808300b
Line 77 of "mm_size2ndx.c" starts at address 0x808300a <mm_size2ndx+46>
and ends at 0x8083014 <umm_givesemaphore>.
(gdb) info line *0x08082bae
Line 183 of "mm_malloc.c" starts at address 0x8082bae <mm_malloc+90>
and ends at 0x8082bb8 <mm_malloc+100>.

I'm not at all confident that we are getting valid information here
though. Looking at that code I don't see how an oops could be produced
on those lines.

So this may be an indication that we have a memory corruption problem,
or it may be a red herring.
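For anyone wanting to repeat the lookup above on their own crash trace, here is a minimal sketch that runs the same `info line` queries through gdb in batch mode. It assumes an ARM cross gdb is installed under the name `arm-none-eabi-gdb` and that you have the ELF file matching the exact firmware build. The function names and the `fw.elf` path are hypothetical.

```python
import subprocess
from typing import List, Sequence


def gdb_info_line_cmd(elf_path: str, addresses: Sequence[int],
                      gdb: str = "arm-none-eabi-gdb") -> List[str]:
    """Build a batch-mode gdb command that runs `info line *ADDR` per address."""
    cmd = [gdb, "-batch"]
    for addr in addresses:
        cmd += ["-ex", f"info line *0x{addr:08x}"]
    cmd.append(elf_path)
    return cmd


def lookup_lines(elf_path: str, addresses: Sequence[int]) -> str:
    """Run gdb and return its text output (needs gdb and the matching ELF)."""
    result = subprocess.run(gdb_info_line_cmd(elf_path, addresses),
                            capture_output=True, text=True, check=True)
    return result.stdout


if __name__ == "__main__":
    # Prints the command only; fw.elf is a placeholder path.
    print(gdb_info_line_cmd("fw.elf", [0x0808300B, 0x08082BAE]))
```

The output is only meaningful if the ELF comes from the same build and compiler as the firmware that crashed, which is why the exact binary matters.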

Are you available at all today to meet up and try to work through this
together? I can offer some fresh roasted coffee if you want to drop in :-)

btw, please don't try to fix the problem by changing anything on the
board apart from the firmware version. This bug may have some very
particular circumstances and we don't want to lose those.

For example, one area that I think is a possibility is that it is
related to the microSD card (possibly NuttX not handling some subtle FAT
corruption). If that is the case then reformatting the card may fix the
issue, but I wouldn't want you to do that as then we wouldn't be able to
pin down exactly what property of the card is causing it, so we wouldn't
be able to fix the bug.

If you can drop in then we'll either find/fix the problem, or if it
turns out to be board specific then I'll swap you for a Pixhawk that
doesn't show the problem.

The ideal would be to reproduce the issue on one of my boards that has a
JTAG connector, so we can catch the error in gdb. I've failed to
reproduce it so far though, which suggests there is something subtle in
the testing process or something in the board or environment that
triggers it.

Cheers, Tridge

mroberts

Apr 22, 2014, 8:33:00 PM
to drones-...@googlegroups.com, mroberts, and...@tridgell.net
Still crashes with a different SD Card.  Doing the bare board test now.