Re: [syzbot] [ide?] UBSAN: shift-out-of-bounds in ata_qc_issue

Niklas Cassel

Feb 20, 2026, 2:35:08 AM
to syzbot, syzk...@googlegroups.com, dle...@kernel.org, linu...@vger.kernel.org, linux-...@vger.kernel.org, syzkall...@googlegroups.com
Hello syzkaller folks,

syzkaller seems to have found a bug that it can trigger very easily.

Looking at the dashboard for this bug:
https://syzkaller.appspot.com/bug?extid=1f77b8ca15336fff21ff

It has so far been reproduced 4 times in 3 days.

However, there is no reproducer yet.

Any advice on how we can try to trigger this without an exact reproducer
available yet?


Kind regards,
Niklas


On Tue, Feb 17, 2026 at 12:55:35PM -0800, syzbot wrote:
> Hello,
>
> syzbot found the following issue on:
>
> HEAD commit: ca4ee40bf13d Partly revert "drm/hyperv: Remove reference t..
> git tree: upstream
> console output: https://syzkaller.appspot.com/x/log.txt?x=13c6c722580000
> kernel config: https://syzkaller.appspot.com/x/.config?x=a771bfd268751cd6
> dashboard link: https://syzkaller.appspot.com/bug?extid=1f77b8ca15336fff21ff
> compiler: gcc (Debian 14.2.0-19) 14.2.0, GNU ld (GNU Binutils for Debian) 2.44
>
> Unfortunately, I don't have any reproducer for this issue yet.

Dmitry Vyukov

Feb 20, 2026, 4:17:22 AM
to Damien Le Moal, Niklas Cassel, syzbot, linu...@vger.kernel.org, linux-...@vger.kernel.org, syzkall...@googlegroups.com, syzkaller
On Fri, 20 Feb 2026 at 02:06, 'Damien Le Moal' via syzkaller-bugs
<syzkall...@googlegroups.com> wrote:
>
> On 2/20/26 09:55, Niklas Cassel wrote:
> > On Thu, Feb 19, 2026 at 10:33:22AM +0900, Damien Le Moal wrote:
> >>>> UBSAN: shift-out-of-bounds in drivers/ata/libata-core.c:5166:24
> >>>> shift exponent 4210818301 is too large for 64-bit type 'long long unsigned int'
> >>>
> >>> 4210818301 is 0xfafbfcfd
> >>>
> >>> 0xfafbfcfd is ATA_TAG_POISON.
> >>>
> >>> ATA_TAG_POISON is set by ata_qc_free(), so it appears that
> >>> ata_scsi_deferred_qc_work() is trying to issue a QC that has
> >>> already been freed.
> >>
> >> I checked the code but I fail to see any path that can lead to this happening.
> >> I did more tests using qemu q35 machine as used by syzbot, and everything looks
> >> fine. So not sure what is happening here. I will dig further.
> >
> > Hello Damien,
> >
> >
> > My best guess:
> > since qc->tag is ATA_TAG_POISON, ata_qc_free() must have been called
> > on ap->deferred_qc.
> >
> > If it was an NCQ abort, ata_eh_set_pending() would have been called to
> > clear ap->deferred_qc. Since ap->deferred_qc is apparently set, it
> > appears that we did not get an error IRQ.
> >
> > To me, that leaves a timeout as the most likely scenario.
>
> Good point. I think the timeout case was completely overlooked...
> That should be fairly easy to debug: I just need to have the deferred work
> do nothing to see the deferred qc time out.
>
> Let me hack something and come up with a fix.
>
> >
> > I.e. SCSI EH is called without ata_eh_set_pending() having been called.
> > (Currently ata_eh_set_pending() is the function that clears
> > ap->deferred_qc)
> >
> >
> >
> > If I look at ata_scsi_cmd_error_handler() it will only break if:
> >
> > if (qc->flags & ATA_QCFLAG_ACTIVE && qc->scsicmd == scmd)
> >
> > If the deferred QC times out, flag ATA_QCFLAG_ACTIVE will not be set
> > (because ATA_QCFLAG_ACTIVE is only set by qc_issue()).
> >
> > Since ATA_QCFLAG_ACTIVE is not set i == ATA_MAX_QUEUE, so we will enter the
> > else clause which calls:
> > scsi_eh_finish_cmd(scmd, &ap->eh_done_q);
> >
> >
> > That might potentially free the tag to the block layer to reuse,
> > while ap->deferred_qc is still set (with the same tag).
> >
> > Possibly, next time ata_scsi_qc_issue() is called, ap->deferred_qc is still set,
> > so it calls ata_qc_free(qc), which, since it wasn't cleared, might have the same
> > tag? because block layer has now reused the tag (since SCSI completed the
> > command).
> >
> > I would possibly have expected some kind of print from SCSI in this case.
> > (But since the else clause finishes the command normally, perhaps not?)
> >
> > But perhaps it is wise to add some code to ata_scsi_cmd_error_handler()
> > which clears ap->deferred_qc.
> >
> >
> >
> > Another possibility... again, timed out commands will not have called
> > ata_eh_set_pending(). scsi_timeout() will call scsi_abort_command()
> > which will queue delayed work, and the worker function scmd_eh_abort_handler()
> > will call scsi_eh_scmd_add(), which calls
> > scsi_host_set_state(shost, SHOST_RECOVERY).
> >
> > We did add a guard in libata in commit e20e81a24a4d ("ata: libata-core: do not
> > issue non-internal commands once EH is pending"), so that we will defer commands
> > even when EH is pending. But in the case of timeout, there will be no error IRQ,
> > so we will not do an early return in __ata_scsi_queuecmd(), so we could set
> > ap->deferred_qc up until the worker function scmd_eh_abort_handler() has called
> > scsi_host_set_state(shost, SHOST_RECOVERY).
> >
> > Again, adding some code to ata_scsi_cmd_error_handler() to clear ap->deferred_qc
> > should handle this case.
> >
> >
> > I would probably hack some QEMU to not send a reply, so that we will get block
> > layer timeouts, because right now, ata_scsi_cmd_error_handler() seems like the
> > most likely problematic code to me.
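
The decoding of the shift exponent in the quoted analysis can be checked with
a small user-space sketch. Only the value 0xfafbfcfd and the name
ATA_TAG_POISON come from the thread; the two helper functions are
hypothetical illustrations, not libata code, and the sketch assumes that the
faulting expression shifts a 64-bit value by qc->tag, as the UBSAN message
suggests.

```c
#include <assert.h>   /* static_assert */
#include <stdbool.h>
#include <stdint.h>

/* Value quoted in the thread: 0xfafbfcfd is ATA_TAG_POISON. */
#define ATA_TAG_POISON 0xfafbfcfdU

/* The decimal exponent printed by UBSAN is exactly the poison value. */
static_assert(ATA_TAG_POISON == 4210818301U,
              "UBSAN shift exponent equals the poison tag");

/*
 * Hypothetical helper (not a kernel function): shifting a 64-bit value
 * is only defined for exponents 0..63.
 */
static bool shift_is_defined(uint32_t tag)
{
    return tag < 64;
}

/*
 * Hypothetical helper: compute the per-tag bit only when the shift is
 * defined, and return 0 for an invalid (for example poisoned) tag.
 */
static uint64_t tag_bit_or_zero(uint32_t tag)
{
    return shift_is_defined(tag) ? (UINT64_C(1) << tag) : 0;
}
```

With a poisoned tag, shift_is_defined() returns false, which is the condition
UBSAN reports as shift-out-of-bounds.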

Hi,

Some info I can infer from these 4 crashes.

There is some kind of race, or very rare timing is likely to be
involved. Only 4 crashes is not much. Usually the fuzzer triggers them
more often.

The crash happens in kworker, this makes it impossible to infer when
test programs may be involved.

In all 4 cases there is a preceding USB disconnect message:
[ 644.391966][ T5992] usb 11-1: USB disconnect, device number 24
It may be related. These devices can be connected via USB, right?

Unfortunately, I cannot infer much more.
These USB device numbers may theoretically allow to infer the test
program, but I think it's currently not possible.

It may be possible to replay these logs for longer to see if they
trigger the crash.

Niklas Cassel

Feb 20, 2026, 4:27:54 AM
to Dmitry Vyukov, Damien Le Moal, syzbot, linu...@vger.kernel.org, linux-...@vger.kernel.org, syzkall...@googlegroups.com, syzkaller
Hello Dmitry,

On Fri, Feb 20, 2026 at 10:17:05AM +0100, Dmitry Vyukov wrote:
> Some info I can infer from these 4 crashes.
>
> There is some kind of race, or very rare timing is likely to be
> involved. Only 4 crashes is not much. Usually the fuzzer triggers them
> more often.
>
> The crash happens in kworker, this makes it impossible to infer when
> test programs may be involved.
>
> In all 4 cases there is a preceding USB disconnect message:
> [ 644.391966][ T5992] usb 11-1: USB disconnect, device number 24
> It may be related. These devices can be connected via USB, right?
>
> Unfortunately, I cannot infer much more.
> These USB device numbers may theoretically allow to infer the test
> program, but I think it's currently not possible.
>
> It may be possible to replay these logs for longer to see if they
> trigger the crash.

It seems that my suspicion that the bug occurs after a block layer timeout
was correct.

Damien managed to reproduce the bug and has sent a fix:
https://lore.kernel.org/linux-ide/20260220050053....@kernel.org/T/#t

Many thanks to syzbot for finding this bug that we failed to find
during review.


Kind regards,
Niklas
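
The timeout scenario discussed in this thread can be modelled in user space
as a rough sketch. Every name prefixed with model_ or MODEL_ is a simplified
stand-in invented for illustration, not a libata definition, and
model_clear_deferred_qc() only mirrors the idea of clearing ap->deferred_qc
from the error handler; the fix that was actually posted may look different.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MODEL_MAX_QUEUE     32    /* stands in for ATA_MAX_QUEUE */
#define MODEL_QCFLAG_ACTIVE 0x1u  /* stands in for ATA_QCFLAG_ACTIVE */

/* Simplified stand-in for a queued command. */
struct model_qc {
    unsigned int flags;
    const void *scsicmd;
};

/* Simplified stand-in for a port with one deferred command slot. */
struct model_port {
    struct model_qc qcmd[MODEL_MAX_QUEUE];
    struct model_qc *deferred_qc;
};

/*
 * Mirrors the check quoted in the thread: stop at the first QC that is
 * both active and owned by scmd. Returns MODEL_MAX_QUEUE if none matches,
 * which is what happens for a deferred QC that was never issued.
 */
static size_t model_find_active_qc(const struct model_port *ap,
                                   const void *scmd)
{
    size_t i;

    for (i = 0; i < MODEL_MAX_QUEUE; i++) {
        const struct model_qc *qc = &ap->qcmd[i];

        if ((qc->flags & MODEL_QCFLAG_ACTIVE) && qc->scsicmd == scmd)
            break;
    }
    return i;
}

/*
 * Models the suggested mitigation: if the timed-out command owns the
 * deferred QC, drop the stale pointer. Returns true if it was cleared.
 */
static bool model_clear_deferred_qc(struct model_port *ap, const void *scmd)
{
    if (ap->deferred_qc && ap->deferred_qc->scsicmd == scmd) {
        ap->deferred_qc = NULL;
        return true;
    }
    return false;
}
```

In this model a deferred command never has the active flag set, so the lookup
runs to MODEL_MAX_QUEUE and the command is finished without anything touching
deferred_qc unless the extra clearing step is called.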