We have been getting similar reports from KASAN since at least January 2020.
Sometimes about reads, sometimes about writes. Originally in a vmalloc'd
area, then on the stack, then in a vmalloc'd area again, and now just
"out-of-bounds" without the memory type. What all these reports have in
common is that the problem happens in i801_isr_byte_done(). The exact
line differs, though. Today's error is on this line:
	} else if (priv->count < priv->len - 1) {
		/* Write next byte, except for IRQ after last byte */
		outb_p(priv->data[++priv->count], SMBBLKDAT(priv));  <<--- HERE
	}
while the previous report was on this line:
		/* Read next byte */
		if (priv->count < priv->len)
			priv->data[priv->count++] = inb(SMBBLKDAT(priv));  <<--- HERE
		else
			dev_dbg(&priv->pci_dev->dev,
				"Discarding extra byte on block read\n");
All earlier reports pointed to one of these two locations.
Another thing all reports have in common is that a transaction preceding
the bug trigger failed with a timeout:
[ 376.721552][T12286] i801_smbus 0000:00:1f.3: Timeout waiting for interrupt!
[ 376.743857][T12286] i801_smbus 0000:00:1f.3: Transaction timeout
[ 376.766812][T12286] i801_smbus 0000:00:1f.3: Failed terminating the transaction
I suspect that restoring the SMBus controller to a working state after
such a transaction failure never actually worked; typically the
controller would stop responding until the next reboot, most probably
because whatever caused the timeout (such as a broken I2C device holding
either SDA or SCL permanently high or low) is still present. But it
should definitely not cause a memory overrun.
Looking at the ICH9 datasheet, I see the following description for the
KILL bit (which is what we try to use to reset the SMBus controller):
"Kills the current host transaction taking place, sets the FAILED
status bit, and asserts the interrupt (or SMI#)."
At the time the recovery code was written, i2c-i801 was a polling-only
driver, interrupts were not supported, so asserting the interrupt had
no effect. Now that the driver does support interrupts, this would call
i801_isr(), right?
So my theory is that our attempt to kill a timed-out byte-by-byte block
transaction triggers an interrupt, which invokes i801_isr() with the
SMBHSTSTS_BYTE_DONE bit set. This in turn causes i801_isr_byte_done()
to be called while we are neither ready for, nor even supposed to
process, the next data byte.
I guess we should clear SMBHSTSTS_BYTE_DONE before setting
SMBHSTCNT_KILL. Alternatively, we could add a check at the beginning of
i801_isr() to bail out immediately if SMBHSTCNT_KILL is set. While
possibly more robust, that approach has the drawback of increasing the
processing time of all interrupts, even standard/legitimate ones. So
maybe just clearing SMBHSTSTS_BYTE_DONE is more reasonable. Something
like:
--- linux-5.11.orig/drivers/i2c/busses/i2c-i801.c
+++ linux-5.11/drivers/i2c/busses/i2c-i801.c
@@ -393,6 +393,8 @@ static int i801_check_post(struct i801_p
 		dev_err(&priv->pci_dev->dev, "Transaction timeout\n");
 		/* try to stop the current command */
 		dev_dbg(&priv->pci_dev->dev, "Terminating the current operation\n");
+		/* Clear BYTE_DONE so as to not confuse i801_isr() */
+		outb_p(SMBHSTSTS_BYTE_DONE, SMBHSTSTS(priv));
 		outb_p(inb_p(SMBHSTCNT(priv)) | SMBHSTCNT_KILL,
 		       SMBHSTCNT(priv));
 		usleep_range(1000, 2000);
I must say I wonder why SMBHSTCNT_KILL generates an interrupt in the
first place; I can't see who would need that.
Jarkko, Andy, Mika, what do you think? I would appreciate a second pair
of eyes looking at the issue, in case I'm on the wrong track.
Of course, another question is why we hit the timeout in the first
place. But we would need a reproducer to investigate this. Is there any
place where I could see the actual I2C commands being sent by the bot?
The log only traces the bot-level functions, and I could guess how some
of the parameters are being mapped to struct members that are then fed
to i2c-dev, but some of the parameters are still a mystery to me.
To be honest, I find the bot logs pretty hard to read, as apparently a
lot of tests are running in parallel for various subsystems, and
everything ends mixed up in the log. Wouldn't it be more valuable to
serialize the tests on one given virtual host, so that the developers
can then read the log and more easily see what's going on?
Thanks,
--
Jean Delvare
SUSE L3 Support