On Fri, 20 Feb 2026 at 02:06, 'Damien Le Moal' via syzkaller-bugs <syzkall...@googlegroups.com> wrote:
>
> On 2/20/26 09:55, Niklas Cassel wrote:
> > On Thu, Feb 19, 2026 at 10:33:22AM +0900, Damien Le Moal wrote:
> >>>> UBSAN: shift-out-of-bounds in drivers/ata/libata-core.c:5166:24
> >>>> shift exponent 4210818301 is too large for 64-bit type 'long long unsigned int'
> >>>
> >>> 4210818301 is 0xfafbfcfd
> >>>
> >>> 0xfafbfcfd is ATA_TAG_POISON.
> >>>
> >>> ATA_TAG_POISON is set by ata_qc_free(), so it appears that
> >>> ata_scsi_deferred_qc_work() is trying to issue a QC that has
> >>> already been freed.
> >>
> >> I checked the code but I fail to see any path that can lead to this happening.
> >> I did more tests using qemu q35 machine as used by syzbot, and everything looks
> >> fine. So not sure what is happening here. I will dig further.
> >
> > Hello Damien,
> >
> >
> > My best guess:
> > since qc->tag is ATA_TAG_POISON, ata_qc_free() must have been called
> > on ap->deferred_qc.
> >
> > If it was an NCQ abort, ata_eh_set_pending() would have been called to
> > clear ap->deferred_qc. Since ap->deferred_qc is apparently set, it
> > appears that we did not get an error IRQ.
> >
> > To me, that leaves a timeout as the most likely scenario.
>
> Good point. I think the timeout case was completely overlooked...
> That should be fairly easy to debug: I just need to have the deferred work
> do nothing to see the deferred qc time out.
>
> Let me hack something and come up with a fix.
>
> >
> > I.e. SCSI EH is called without ata_eh_set_pending() having been called.
> > (Currently ata_eh_set_pending() is the function that clears
> > ap->deferred_qc)
> >
> >
> >
> > If I look at ata_scsi_cmd_error_handler() it will only break if:
> >
> > if (qc->flags & ATA_QCFLAG_ACTIVE && qc->scsicmd == scmd)
> >
> > If the deferred QC times out, flag ATA_QCFLAG_ACTIVE will not be set
> > (because ATA_QCFLAG_ACTIVE is only set by qc_issue()).
> >
> > Since ATA_QCFLAG_ACTIVE is not set i == ATA_MAX_QUEUE, so we will enter the
> > else clause which calls:
> > scsi_eh_finish_cmd(scmd, &ap->eh_done_q);
> >
> >
> > That might potentially free the tag to the block layer to reuse,
> > while ap->deferred_qc is still set (with the same tag).
> >
> > Possibly, next time ata_scsi_qc_issue() is called, ap->deferred_qc is still set,
> > so it calls ata_qc_free(qc), which, since it wasn't cleared, might have the same
> > tag? because block layer has now reused the tag (since SCSI completed the
> > command).
> >
> > I would possibly have expected some kind of print from SCSI in this case.
> > (But since the else clause finishes the command normally, perhaps not?)
> >
> > But perhaps it is wise to add some code to ata_scsi_cmd_error_handler()
> > which clears ap->deferred_qc.
> >
> >
> >
> > Another possibility... again, timed out commands will not have called
> > ata_eh_set_pending(). scsi_timeout() will call scsi_abort_command()
> > which will queue delayed work, and the worker function scmd_eh_abort_handler()
> > will call scsi_eh_scmd_add(), which calls
> > scsi_host_set_state(shost, SHOST_RECOVERY).
> >
> > We did add a guard in libata in commit e20e81a24a4d ("ata: libata-core: do not
> > issue non-internal commands once EH is pending"), so that we will defer commands
> > even when EH is pending. But in the case of timeout, there will be no error IRQ,
> > so we will not do an early return in __ata_scsi_queuecmd(), so we could set
> > ap->deferred_qc up until the worker function scmd_eh_abort_handler() has called
> > scsi_host_set_state(shost, SHOST_RECOVERY).
> >
> > Again, adding some code to ata_scsi_cmd_error_handler() to clear ap->deferred_qc
> > should handle this case.
> >
> >
> > I would probably hack some QEMU to not send a reply, so that we will get block
> > layer timeouts, because right now, ata_scsi_cmd_error_handler() seems like the
> > most likely problematic code to me.

Hi,

Here is what I can infer from these 4 crashes.
Some kind of race or very rare timing is likely involved. Only 4 crashes
is not much; usually the fuzzer triggers them more often.
The crash happens in a kworker, which makes it impossible to infer which
test programs may be involved.
In all 4 cases there is a preceding USB disconnect message:
[ 644.391966][ T5992] usb 11-1: USB disconnect, device number 24
It may be related. These devices can be connected via USB, right?
Unfortunately, I cannot infer much more.
These USB device numbers may theoretically allow inferring the test
program, but I think that is currently not possible.
It may be possible to replay these logs for longer to see if they
trigger the crash.
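
To make the timeout hypothesis quoted above easier to reason about, here is a toy userspace C model. It is not libata code: every toy_* name is made up for illustration, and the model only encodes the guess from the quoted analysis, namely that a deferred QC never gets the ACTIVE flag, so the timeout path does not find it and leaves ap->deferred_qc pointing at a slot that is later freed and poisoned. The clear_deferred flag stands in for the suggested (untested) idea of clearing the pointer in the timeout path.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_MAX_QUEUE     32
#define TOY_TAG_POISON    0xfafbfcfdU  /* value the thread gives for ATA_TAG_POISON */
#define TOY_QCFLAG_ACTIVE (1U << 0)    /* hypothetical stand-in for ATA_QCFLAG_ACTIVE */

struct toy_qc {
	unsigned int tag;
	unsigned int flags;
	const void *scsicmd;
};

struct toy_port {
	struct toy_qc qcmd[TOY_MAX_QUEUE];
	struct toy_qc *deferred_qc;
};

/* Park a command as deferred: it is set up but never issued, so ACTIVE stays clear. */
static void toy_defer(struct toy_port *ap, unsigned int tag, const void *scmd)
{
	assert(tag < TOY_MAX_QUEUE);
	ap->qcmd[tag].tag = tag;
	ap->qcmd[tag].flags = 0;
	ap->qcmd[tag].scsicmd = scmd;
	ap->deferred_qc = &ap->qcmd[tag];
}

/* Free a slot and poison its tag, as the thread says ata_qc_free() does. */
static void toy_qc_free(struct toy_qc *qc)
{
	qc->flags = 0;
	qc->tag = TOY_TAG_POISON;
}

/*
 * Timeout path for scmd. Returns true if an ACTIVE QC owning scmd was found.
 * Otherwise the command is simply finished (the "else clause" in the quoted
 * analysis); only if clear_deferred is set is a matching deferred pointer dropped.
 */
static bool toy_timeout(struct toy_port *ap, const void *scmd, bool clear_deferred)
{
	for (size_t i = 0; i < TOY_MAX_QUEUE; i++) {
		struct toy_qc *qc = &ap->qcmd[i];

		if ((qc->flags & TOY_QCFLAG_ACTIVE) && qc->scsicmd == scmd)
			return true;
	}
	if (clear_deferred && ap->deferred_qc && ap->deferred_qc->scsicmd == scmd)
		ap->deferred_qc = NULL;
	return false;
}

/* Would the deferred work compute a valid shift (1ULL << tag) for this port? */
static bool toy_deferred_tag_ok(const struct toy_port *ap)
{
	return ap->deferred_qc == NULL || ap->deferred_qc->tag < TOY_MAX_QUEUE;
}
```

With clear_deferred false, deferring a command on tag 3, timing it out, and then freeing slot 3 leaves the deferred pointer looking at a poisoned tag, which matches the shift exponent in the UBSAN report. With clear_deferred true the pointer is gone before the slot is freed. Whether the real code follows this path is exactly what still needs to be confirmed.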