Hi All,
I believe I have now found and fixed this bug. The fix is here:
https://github.com/diydrones/PX4NuttX/commit/65cd7f85f31ac895f142771f1bb0b27a1a69832b
The bug was that the interrupt service routine for the I2C bus could
write to transfer buffers from a previous transfer after that transfer
had completed. So the sequence of events was:
1) HMC5883::collect() does an I2C read transfer to a stack buffer. This
sets up priv->ptr and priv->dcnt to describe an area on the stack of the
hpwork task
2) HMC5883::measure() gets called to set up the HMC5883 for the next
reading. It sets up priv->msgv, but leaves priv->ptr and priv->dcnt
at the values from the previous transfer
3) while in stm32_i2c_process() for the write_reg() in
HMC5883::measure() we get an unexpected interrupt from the I2C bus
before the start bit has been seen. This means priv->dcnt and
priv->ptr have not yet been set up for the new
transfer. Specifically, we get an I2C status which includes the
I2C_SR1_RXNE bit, which signals a received byte (remember that we
are in a send, not a receive). The code sees this status bit and
does this:
*priv->ptr++ = stm32_i2c_getreg(priv, STM32_I2C_DR_OFFSET);
That overwrites the stack area previously set up by collect(),
which is now a piece of stack used by another function.
4) that overwrite happens to be in the area of memory that holds the
dq_queue_t that is used to control the queueing of tasks to HPWORK
5) when the dq_rem() function is next called on the HPWORK queue, it
then uses that now corrupt queue structure, which causes an
overwrite of a different area of memory, which happens now to be in
the heap nodelist. It wipes out the high byte of the flink in a
nodelist element
6) the next malloc call that is of the right size to walk this part of
the heap (usually from starting a new task, such as running "perf")
then dereferences the invalid flink, and the processor faults. The
FMU firmware is then dead.
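
To make steps 1 to 3 concrete, here is a minimal host-side sketch in C. It is
not the NuttX driver code and not necessarily what the linked commit does: the
struct, isr_rxne(), begin_transfer() and run_scenario() are hypothetical names
made up for illustration, and an ordinary array stands in for the hpwork
stack. It only shows why a stale priv->ptr is dangerous, and how clearing it
when a new transfer is set up lets the interrupt handler drop an unexpected
byte.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Simplified stand-in for the driver state; only the two fields named
 * in the post are modelled. */
struct i2c_priv {
    uint8_t *ptr;   /* where the ISR stores the next received byte */
    int      dcnt;  /* bytes remaining in the current message */
};

/* Hypothetical RXNE handler: stores the data register value `dr`
 * through priv->ptr whenever the pointer is non-NULL. */
static void isr_rxne(struct i2c_priv *priv, uint8_t dr)
{
    if (priv->ptr != NULL) {
        *priv->ptr++ = dr;
        priv->dcnt--;
    }
}

/* Hypothetical transfer setup. With clear_stale == 0 the old ptr/dcnt
 * survive into the new transfer, as in steps 2 and 3. */
static void begin_transfer(struct i2c_priv *priv, int clear_stale)
{
    if (clear_stale) {
        priv->ptr = NULL;
        priv->dcnt = 0;
    }
}

/* Replays steps 1 to 3 on a fake "stack" array.
 * Returns 1 if the spurious RXNE byte corrupted reused memory. */
static int run_scenario(int clear_stale)
{
    uint8_t fake_stack[8];
    struct i2c_priv priv = { NULL, 0 };
    int i;

    /* Step 1: a 4-byte read into fake_stack[0..3]. */
    priv.ptr = fake_stack;
    priv.dcnt = 4;
    for (i = 0; i < 4; i++) {
        isr_rxne(&priv, (uint8_t)i);
    }

    /* The same memory is then reused by unrelated code. */
    memset(fake_stack, 0xAA, sizeof(fake_stack));

    /* Step 2: the next transfer is set up. */
    begin_transfer(&priv, clear_stale);

    /* Step 3: unexpected RXNE interrupt before the start bit. */
    isr_rxne(&priv, 0x55);

    /* priv.ptr was left just past the old buffer, at fake_stack[4]. */
    return fake_stack[4] != 0xAA;
}
```

Calling run_scenario(0) models the buggy sequence, where the stray byte lands
in memory that now belongs to something else. Calling run_scenario(1) models
the defensive variant, where the stray byte is discarded.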
Lorenz, please check my logic and the patch.
As far as the practical impact goes, this bug affects all PX4 builds
since we started using the PX4 code. So it affects both PX4v1 and
Pixhawk, and affects all vehicle types and the PX4 native firmware as
well. Once Lorenz and the rest of the PX4 dev team have had a chance to
check over the fix, I'd suggest we put out patch releases for all
vehicle types.
Many thanks to Mike and Tom for bringing this bug to our attention!
Cheers, Tridge