Bug#1036530: Regression from "ACPI: OSI: Remove Linux-Dell-Video _OSI string"? (was: Re: Bug#1036530: linux-signed-amd64: Hard lock up of system)

Salvatore Bonaccorso

May 28, 2023, 3:00:10 AM

Hi Mario,

Nick Hastings reported in Debian bug https://bugs.debian.org/1036530
that his system locks up after updating from a 6.0-based kernel to
6.1.y.

#regzbot ^introduced 24867516f06d

He bisected the issue and tracked it down to:

On Sun, May 28, 2023 at 10:14:51AM +0900, Nick Hastings wrote:
> Control: tags -1 - moreinfo
>
> Hi,
>
> I repeated the git bisect, and the bad commit seems to be:
>
> (git)-[v6.1-rc1~206^2~4^5~3|bisect] % git bisect bad
> 24867516f06dabedef3be7eea0ef0846b91538bc is the first bad commit
> commit 24867516f06dabedef3be7eea0ef0846b91538bc
> Author: Mario Limonciello <mario.li...@amd.com>
> Date: Tue Aug 23 13:51:31 2022 -0500
>
> ACPI: OSI: Remove Linux-Dell-Video _OSI string
>
> This string was introduced because drivers for NVIDIA hardware
> had bugs supporting RTD3 in the past.
>
> Before proprietary NVIDIA driver started to support RTD3, Ubuntu had
> had a mechanism for switching PRIME on and off, though it had required
> to logout/login to make the library switch happen.
>
> When the PRIME had been off, the mechanism had unloaded the NVIDIA
> driver and put the device into D3cold, but the GPU had never come back
> to D0 again which is why ODMs used the _OSI to expose an old _DSM
> method to switch the power on/off.
>
> That has been fixed by commit 5775b843a619 ("PCI: Restore config space
> on runtime resume despite being unbound"). so vendors shouldn't be
> using this string to modify ASL any more.
>
> Reviewed-by: Lyude Paul <ly...@redhat.com>
> Signed-off-by: Mario Limonciello <mario.li...@amd.com>
> Signed-off-by: Rafael J. Wysocki <rafael.j...@intel.com>
>
> drivers/acpi/osi.c | 9 ---------
> 1 file changed, 9 deletions(-)
>
> This machine is a Dell with an nvidia chip so it looks like this really
> could be the commit that is causing the problems. The description
> of the commit also seems (to my untrained eye) to be consistent with the
> error reported on the console when the lockup occurs:
>
> [ 58.729863] ACPI Error: Aborting method \_SB.PCI0.PGON due to previous error (AE_AML_LOOP_TIMEOUT) (20220331/psparse-529)
> [ 58.729904] ACPI Error: Aborting method \_SB.PCI0.PEG0.PG00._ON due to previous error (AE_AML_LOOP_TIMEOUT) (20220331/psparse-529)
> [ 60.083261] vfio-pci 0000:01:00.0 Unable to change power state from D3cold to D0, device inaccessible
>
> Hopefully this is enough information for experts to resolve this.

Does this ring a bell for you? Do you need any further information
from Nick?

Regards,
Salvatore

Nick Hastings

May 29, 2023, 12:00:12 AM

* Mario Limonciello <mario.li...@amd.com> [230529 10:14]:
> On 5/28/23 19:56, Nick Hastings wrote:
> > Hi,
> >
> > * Mario Limonciello <mario.li...@amd.com> [230528 21:44]:

> > > Have Nick try using "pcie_port_pm=off" and see if it helps the issue.
> >
> > I booted into a 6.1 kernel with this option. It has been running without
> > problems for 1.5 hours. Usually I would expect the lockup to have
> > occurred by now.

I let this run for 3 hours without issue.
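
When comparing boots with and without an option such as pcie_port_pm=off, it helps to confirm that the option really was on the running kernel's command line. A minimal sketch, assuming the command line is read from /proc/cmdline on the live system; the example string below is hypothetical apart from the pcie_port_pm=off option named in this thread:

```python
import shlex


def parse_cmdline(cmdline: str) -> dict:
    """Map each kernel parameter to its value (None for bare flags such as 'ro')."""
    params = {}
    for token in shlex.split(cmdline):
        key, sep, value = token.partition("=")
        params[key] = value if sep else None
    return params


# Hypothetical command line; only pcie_port_pm=off is taken from the thread.
example = "BOOT_IMAGE=/vmlinuz-6.1.0-9-amd64 root=/dev/sda1 ro quiet pcie_port_pm=off"
params = parse_cmdline(example)
print(params.get("pcie_port_pm"))
```

On a running machine the same function could be fed `open("/proc/cmdline").read()` instead of the example string.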

> > > Does this happen in the latest 6.4 RC as well?
> >
> > I have compiled that kernel and will boot into it after running this one
> > with the pcie_port_pm=off for another hour or so.

I'm now running 6.4.0-rc4 without seeing the problem after 1 hour.

I did however see two unrelated problems that I include here for
completeness:
1. The iwlwifi module did not load automatically.
2. Xwayland used a huge amount of CPU even though I was not running any X
programs. Recompiling my wayland compositor without XWayland support
"fixed" this.

> > > I think we need to see a full dmesg and acpidump to better
> > > characterize it.
> >
> > Please find attached. Let me know if there is anything else I can provide.
> >
> > Regards,
> >
> > Nick.
>
> I don't see nouveau loading, are you explicitly preventing it from
> loading?

Yes nouveau is blacklisted.

> Can I see the journal from a boot when it reproduced?

Hmm, I'm not sure which n for "journalctl -b n" maps to which kernel (is that
what you are requesting?). The commit hash does not seem to be
listed. I may have to boot into a bad kernel again.
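
One way to map a boot number to a kernel is to look at the "Linux version" banner that the kernel prints early in each boot; `journalctl --list-boots` shows the available boot offsets and `journalctl -k -b N` prints the kernel messages for boot N. A minimal sketch, assuming the journal text has already been captured to a string; the sample line is hypothetical and only modeled on the kernel release reported elsewhere in this thread:

```python
import re

# The kernel's boot banner starts with "Linux version <release> ...".
_VERSION_RE = re.compile(r"Linux version (\S+)")


def kernel_version(journal_text: str):
    """Return the kernel release from the first 'Linux version' line, or None."""
    match = _VERSION_RE.search(journal_text)
    return match.group(1) if match else None


# Hypothetical journal excerpt for illustration.
sample = "May 29 15:02:42 xps kernel: Linux version 6.1.0-9-amd64 (builder@host) #1 SMP"
print(kernel_version(sample))
```

Running this over the output of each candidate boot would show which boot number belongs to which kernel build.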

Regards,

Nick.

Nick Hastings

May 30, 2023, 3:10:10 AM

Hi,

* Mario Limonciello <mario.li...@amd.com> [230530 13:00]:

> On 5/29/23 18:01, Nick Hastings wrote:
> > Hi,
> >

> > * Nick Hastings <nicholas...@gmail.com> [230529 12:51]:

> > I did eventually see a lockup of this kernel. On the console I saw:
> >
> > [ 151.035036] vfio-pci 0000:01:00.0 Unable to change power state from D3cold to D0, device inaccessible
> >
> > I did not see the other two lines that were present in earlier lock ups.

> > > I did however see two unrelated problems that I include here for
> > > completeness:
> > > 1. The iwlwifi module did not load automatically.
> > > 2. Xwayland used a huge amount of CPU even though I was not running any X
> > > programs. Recompiling my wayland compositor without XWayland support
> > > "fixed" this.
> > >
> > > > > > I think we need to see a full dmesg and acpidump to better
> > > > > > characterize it.
> > > > >
> > > > > Please find attached. Let me know if there is anything else I can provide.
> > > > >
> > > > > Regards,
> > > > >
> > > > > Nick.
> > > >
> > > > I don't see nouveau loading, are you explicitly preventing it from
> > > > loading?
> > >
> > > Yes nouveau is blacklisted.
> > >
> > > > Can I see the journal from a boot when it reproduced?
> > >
> > > Hmm, I'm not sure which n for "journalctl -b n" maps to which kernel (is that
> > > what you are requesting?). The commit hash does not seem to be
> > > listed. I may have to boot into a bad kernel again.
> >

> > Please find attached the output from a "journalctl --system -bN" for a
> > kernel that has this issue.
> >
> > Regards,
> >
> > Nick.
>
> In this log I see nouveau loaded, but I also don't see the failure
> occurring.

I never saw anything in the logs from a lockup either. I had assumed it
was no longer able to write to disk. The failure did occur on that
occasion.

> As you're actually loading nouveau, can you please try nouveau.runpm=0 on
> the kernel command line?

I'm not intentionally loading it. This machine also has intel graphics
which is what I prefer. Checking my
/etc/modprobe.d/blacklist-nvidia-nouveau.conf
I see:

blacklist nvidia
blacklist nvidia-drm
blacklist nvidia-modeset
blacklist nvidia-uvm
blacklist ipmi_msghandler
blacklist ipmi_devintf

So I thought I had blacklisted it but it seems I did not. Since I do not
want to use it maybe it is better to check if the lock up occurs with
nouveau blacklisted. I will try that now.
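
The gap in the file above is easy to miss by eye. A minimal sketch of a check for it, assuming only the simple `blacklist <name>` line form shown in the file; the function name is hypothetical, and `-` is normalized to `_` because modprobe treats the two as equivalent in module names:

```python
def blacklisted_modules(conf_text: str) -> set:
    """Collect module names from 'blacklist <name>' lines, normalizing '-' to '_'."""
    names = set()
    for line in conf_text.splitlines():
        # Drop trailing comments, then split into whitespace-separated fields.
        fields = line.split("#", 1)[0].split()
        if len(fields) == 2 and fields[0] == "blacklist":
            names.add(fields[1].replace("-", "_"))
    return names


# Contents of the blacklist file as quoted in this message.
CONF = """\
blacklist nvidia
blacklist nvidia-drm
blacklist nvidia-modeset
blacklist nvidia-uvm
blacklist ipmi_msghandler
blacklist ipmi_devintf
"""

print("nouveau" in blacklisted_modules(CONF))  # prints False
```

A real check would read every file under /etc/modprobe.d/ rather than a single string.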

> If that helps the issue; I strongly suggest you cross reference the latest
> kernel to see if this bug still exists.

I did. See above.

Regards,

Nick.

Salvatore Bonaccorso

May 30, 2023, 7:30:09 AM

Hi Nick,

Thanks to you both for triaging the issue!

Can you check whether netconsole captures more output from the lockup?

https://www.kernel.org/doc/html/latest/networking/netconsole.html

Regards,
Salvatore
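
netconsole sends kernel log messages as UDP datagrams to a second machine, so something on that machine has to listen for them. A minimal sketch of such a receiver, assuming the target port is left at 6666, which the netconsole documentation gives as the default; the function names are hypothetical:

```python
import socket


def open_listener(host: str = "0.0.0.0", port: int = 6666) -> socket.socket:
    """Bind a UDP socket for incoming netconsole datagrams."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind((host, port))
    return sock


def receive_datagrams(sock: socket.socket, count: int, timeout: float = 5.0) -> list:
    """Read up to `count` datagrams, stopping early if none arrives within `timeout`."""
    sock.settimeout(timeout)
    messages = []
    for _ in range(count):
        try:
            data, _addr = sock.recvfrom(65535)
        except socket.timeout:
            break
        messages.append(data.decode("utf-8", errors="replace"))
    return messages
```

Tools such as netcat or a syslog daemon can play the same role; the point is only that the receiver runs on a different machine, so messages survive even when the locked-up system can no longer write to disk.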

Nick Hastings

May 31, 2023, 7:50:09 PM

Hi,

* Nick Hastings <nicholas...@gmail.com> [230530 16:01]:

>
> * Mario Limonciello <mario.li...@amd.com> [230530 13:00]:

<snip>

> > As you're actually loading nouveau, can you please try nouveau.runpm=0 on
> > the kernel command line?
>
> I'm not intentionally loading it. This machine also has intel graphics
> which is what I prefer. Checking my
> /etc/modprobe.d/blacklist-nvidia-nouveau.conf
> I see:
>
> blacklist nvidia
> blacklist nvidia-drm
> blacklist nvidia-modeset
> blacklist nvidia-uvm
> blacklist ipmi_msghandler
> blacklist ipmi_devintf
>
> So I thought I had blacklisted it but it seems I did not. Since I do not
> want to use it maybe it is better to check if the lock up occurs with
> nouveau blacklisted. I will try that now.

I blacklisted nouveau and booted into a 6.1 kernel:
% uname -a
Linux xps 6.1.0-9-amd64 #1 SMP PREEMPT_DYNAMIC Debian 6.1.27-1 (2023-05-08) x86_64 GNU/Linux

It has been running without problems for nearly two days now:
% uptime
08:34:48 up 1 day, 16:22, 2 users, load average: 1.33, 1.26, 1.27

Regards,

Nick.

Nick Hastings

Jun 1, 2023, 8:00:08 PM

Hi,

* Limonciello, Mario <mario.li...@amd.com> [230602 01:18]:
> +Lyude, Lukas, Karol

> Thanks, that makes a lot more sense now.
>
> Nick, Can you please test if nouveau works with runtime PM in the
> latest 6.4-rc?

I reported this twice already. I guess it was lost since for some
reason emails in this thread are not being trimmed. I'll repeat here:

I did eventually see a lockup of this kernel. On the console I saw:

[ 151.035036] vfio-pci 0000:01:00.0 Unable to change power state from D3cold to D0, device inaccessible

I did not see the other two lines that were present in earlier lock ups.
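
The difference between the 6.1 and 6.4-rc4 lockups comes down to which of the console messages appear. A minimal sketch that counts them in captured log text, assuming the two substrings below (taken from the console lines quoted in this thread) are sufficient to identify each message; the dictionary keys are hypothetical labels:

```python
# Substrings taken from the console output quoted in this thread.
SIGNATURES = {
    "aml_loop_timeout": "AE_AML_LOOP_TIMEOUT",
    "d3cold_to_d0_failed": "Unable to change power state from D3cold to D0",
}


def lockup_signatures(log_text: str) -> dict:
    """Count the lines of log_text that contain each known signature."""
    counts = {name: 0 for name in SIGNATURES}
    for line in log_text.splitlines():
        for name, needle in SIGNATURES.items():
            if needle in line:
                counts[name] += 1
    return counts


# The single line seen on the console with 6.4.0-rc4.
RC4_LOG = (
    "[ 151.035036] vfio-pci 0000:01:00.0 Unable to change power state "
    "from D3cold to D0, device inaccessible"
)
print(lockup_signatures(RC4_LOG))
```

Fed the three lines from the original 6.1 report, the same function would count two ACPI loop timeouts alongside the one power-state failure.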

Regards,

Nick.

Linux regression tracking (Thorsten Leemhuis)

Jun 26, 2023, 8:40:09 AM

Hi, Thorsten here, the Linux kernel's regression tracker. Top-posting
for once, to make this easily accessible to everyone.

Nick, what's the status/was there any progress? Did you do what Mario
suggested and file a nouveau bug?

I ask, as I still have this on my list of regressions and it seems there
was no progress in three+ weeks now.

Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)
--
Everything you wanna know about Linux kernel regression tracking:
https://linux-regtracking.leemhuis.info/about/#tldr
If I did something stupid, please tell me, as explained on that page.

#regzbot backburner: slow progress, likely just affects one machine
#regzbot poke

On 02.06.23 02:57, Limonciello, Mario wrote:
> [AMD Official Use Only - General]
>
>> -----Original Message-----
>> From: Nick Hastings <nicholas...@gmail.com>
>> Sent: Thursday, June 1, 2023 7:02 PM
>> To: Karol Herbst <khe...@redhat.com>
>> Cc: Limonciello, Mario <Mario.Li...@amd.com>; Lyude Paul
>> <ly...@redhat.com>; Lukas Wunner <lu...@wunner.de>; Salvatore
>> Bonaccorso <car...@debian.org>; 103...@bugs.debian.org; Rafael J.
>> Wysocki <raf...@kernel.org>; Len Brown <le...@kernel.org>; linux-
>> ac...@vger.kernel.org; linux-...@vger.kernel.org;
>> regre...@lists.linux.dev
>> Subject: Re: Regression from "ACPI: OSI: Remove Linux-Dell-Video _OSI
>> string"? (was: Re: Bug#1036530: linux-signed-amd64: Hard lock up of system)
>>
>> Hi,
>>
>> * Karol Herbst <khe...@redhat.com> [230602 03:10]:
>>> On Thu, Jun 1, 2023 at 7:21 PM Limonciello, Mario
>>> <Mario.Li...@amd.com> wrote:
>>>>> -----Original Message-----
>>>>> From: Karol Herbst <khe...@redhat.com>
>>>>> Sent: Thursday, June 1, 2023 12:19 PM
>>>>> To: Limonciello, Mario <Mario.Li...@amd.com>
>>>>> Cc: Nick Hastings <nicholas...@gmail.com>; Lyude Paul
>>>>> <ly...@redhat.com>; Lukas Wunner <lu...@wunner.de>; Salvatore
>>>>> Bonaccorso <car...@debian.org>; 103...@bugs.debian.org; Rafael J.
>>>>> Wysocki <raf...@kernel.org>; Len Brown <le...@kernel.org>; linux-
>>>>> ac...@vger.kernel.org; linux-...@vger.kernel.org;
>>>>> regre...@lists.linux.dev
>>>>> Subject: Re: Regression from "ACPI: OSI: Remove Linux-Dell-Video _OSI
>>>>> string"? (was: Re: Bug#1036530: linux-signed-amd64: Hard lock up of system)
>>>>>
>>>>> On Thu, Jun 1, 2023 at 6:54 PM Limonciello, Mario
>>>>> <Mario.Li...@amd.com> wrote:
>>>>>>
>>>>>> [AMD Official Use Only - General]
>>>>>>
>>>>>>> -----Original Message-----
>>>>>>> From: Karol Herbst <khe...@redhat.com>
>>>>>>> Sent: Thursday, June 1, 2023 11:33 AM
>>>>>>> To: Limonciello, Mario <Mario.Li...@amd.com>
>>>>>>> Cc: Nick Hastings <nicholas...@gmail.com>; Lyude Paul
>>>>>>> <ly...@redhat.com>; Lukas Wunner <lu...@wunner.de>; Salvatore
>>>>>>> Bonaccorso <car...@debian.org>; 103...@bugs.debian.org; Rafael J.
>>>>>>> Wysocki <raf...@kernel.org>; Len Brown <le...@kernel.org>;
>>>>>>> linux-ac...@vger.kernel.org; linux-...@vger.kernel.org;
>>>>>>> regre...@lists.linux.dev
>>>>>>> Subject: Re: Regression from "ACPI: OSI: Remove Linux-Dell-Video _OSI
>>>>>>> string"? (was: Re: Bug#1036530: linux-signed-amd64: Hard lock up of system)
>>>>>>>
>>>>>>> On Thu, Jun 1, 2023 at 6:18 PM Limonciello, Mario
>>>>>>>>
>>>>>>>> Lyude, Lukas, Karol
>>>>>>>>
>>>>>>>> This thread is in relation to this commit:
>>>>>>>>
>>>>>>>> 24867516f06d ("ACPI: OSI: Remove Linux-Dell-Video _OSI string")
>>>>>>>>
>>>>>>>> Nick has found that runtime PM is *not* working for nouveau.
>>>>>>>>
>>>>>>>
>>>>>>> keep in mind we have a list of PCIe controllers where we apply a
>>>>>>> workaround:
>>>>>>>
>>>>>>> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/gpu/drm/nouveau/nouveau_drm.c?h=v6.4-rc4#n682
>>>>>>>
>>>>>>> And I suspect there might be one or two more IDs we'll have to add
>>>>>>> there. Do we have any logs?
>>>>>>
>>>>>> There are some archived on the distro bug. Search this page for
>>>>>> "journalctl.log.gz":
>>>>>> https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1036530
>>>>>>
>>>>>
>>>>> interesting.. It seems to be the same controller used here. I wonder
>>>>> if the pci topology is different or if the workaround is applied at
>>>>> all.
>>>>
>>>> I didn't see the message in the log about the workaround being applied
>>>> in that log, so I guess PCI topology difference is a likely suspect.
>>>>
>>>
>>> yeah, but I also couldn't see a log with the usual nouveau messages,
>>> so it's kinda weird.
>>>
>>> Anyway, the output of `lspci -tvnn` would help
>>
>> % lspci -tvnn
>> -[0000:00]-+-00.0 Intel Corporation Device [8086:3e20]
>> +-01.0-[01]----00.0 NVIDIA Corporation TU117M [GeForce GTX 1650 Mobile / Max-Q] [10de:1f91]
>
> So the bridge it's connected to is the same that the quirk *should have been* triggering.
>
> May 29 15:02:42 xps kernel: pci 0000:00:01.0: [8086:1901] type 01 class 0x060400
>
> Since the quirk isn't working and this is still a problem in 6.4-rc4 I suggest opening a
> Nouveau drm bug to figure out why.
>
>> +-02.0 Intel Corporation CoffeeLake-H GT2 [UHD Graphics 630] [8086:3e9b]
>> +-04.0 Intel Corporation Xeon E3-1200 v5/E3-1500 v5/6th Gen Core Processor Thermal Subsystem [8086:1903]
>> +-08.0 Intel Corporation Xeon E3-1200 v5/v6 / E3-1500 v5 / 6th/7th/8th Gen Core Processor Gaussian Mixture Model [8086:1911]
>> +-12.0 Intel Corporation Cannon Lake PCH Thermal Controller [8086:a379]
>> +-14.0 Intel Corporation Cannon Lake PCH USB 3.1 xHCI Host Controller [8086:a36d]
>> +-14.2 Intel Corporation Cannon Lake PCH Shared SRAM [8086:a36f]
>> +-15.0 Intel Corporation Cannon Lake PCH Serial IO I2C Controller #0 [8086:a368]
>> +-15.1 Intel Corporation Cannon Lake PCH Serial IO I2C Controller #1 [8086:a369]
>> +-16.0 Intel Corporation Cannon Lake PCH HECI Controller [8086:a360]
>> +-17.0 Intel Corporation Cannon Lake Mobile PCH SATA AHCI Controller [8086:a353]
>> +-1b.0-[02-3a]----00.0-[03-3a]--+-00.0-[04]----00.0 Intel Corporation JHL6340 Thunderbolt 3 NHI (C step) [Alpine Ridge 2C 2016] [8086:15d9]
>> | +-01.0-[05-39]--
>> | \-02.0-[3a]----00.0 Intel Corporation JHL6340 Thunderbolt 3 USB 3.1 Controller (C step) [Alpine Ridge 2C 2016] [8086:15db]
>> +-1c.0-[3b]----00.0 Intel Corporation Wi-Fi 6 AX200 [8086:2723]
>> +-1c.4-[3c]----00.0 Realtek Semiconductor Co., Ltd. RTS525A PCI Express Card Reader [10ec:525a]
>> +-1d.0-[3d]----00.0 Samsung Electronics Co Ltd NVMe SSD Controller SM981/PM981/PM983 [144d:a808]
>> +-1f.0 Intel Corporation Cannon Lake LPC Controller [8086:a30e]
>> +-1f.3 Intel Corporation Cannon Lake PCH cAVS [8086:a348]
>> +-1f.4 Intel Corporation Cannon Lake PCH SMBus Controller [8086:a323]
>> \-1f.5 Intel Corporation Cannon Lake PCH SPI Controller [8086:a324]
>>
>>
>> Regards,
>>
>> Nick.
>
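
The bridge in question is identified in the quoted message by a single kernel log line giving its vendor and device ID. A minimal sketch that extracts those fields from such a line, assuming the `pci <addr>: [vvvv:dddd] type NN class 0x......` layout seen in that line; the set of bridge IDs is illustrative only and contains just the one ID mentioned in this thread, not the list the nouveau driver actually uses:

```python
import re

# Layout assumed from the log line quoted in the thread.
PCI_LINE_RE = re.compile(
    r"pci (?P<addr>[0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-7]): "
    r"\[(?P<vendor>[0-9a-f]{4}):(?P<device>[0-9a-f]{4})\] "
    r"type (?P<type>[0-9a-f]{2}) class (?P<cls>0x[0-9a-f]{6})"
)

# Illustrative only: the single bridge ID named in this thread.
BRIDGES_OF_INTEREST = {(0x8086, 0x1901)}


def parse_pci_line(line: str):
    """Return address, numeric IDs, and a bridge flag for a kernel 'pci' log line."""
    m = PCI_LINE_RE.search(line)
    if m is None:
        return None
    return {
        "addr": m["addr"],
        "vendor": int(m["vendor"], 16),
        "device": int(m["device"], 16),
        # Header type 01 denotes a PCI-to-PCI bridge.
        "is_bridge": m["type"] == "01",
    }


LINE = "May 29 15:02:42 xps kernel: pci 0000:00:01.0: [8086:1901] type 01 class 0x060400"
info = parse_pci_line(LINE)
print(info["is_bridge"] and (info["vendor"], info["device"]) in BRIDGES_OF_INTEREST)
```

This only confirms that the upstream bridge has the expected ID; it says nothing about why the driver's workaround did not take effect, which is the open question in the thread.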

Thorsten Leemhuis

Jun 30, 2023, 9:10:11 AM

On 27.06.23 00:34, Nick Hastings wrote:
> * Linux regression tracking (Thorsten Leemhuis) <regre...@leemhuis.info> [230626 21:09]:

>> Hi, Thorsten here, the Linux kernel's regression tracker. Top-posting
>> for once, to make this easily accessible to everyone.
>>
>> Nick, what's the status/was there any progress? Did you do what Mario
>> suggested and file a nouveau bug?
>

> It was not apparent that the suggestion to open "a Nouveau drm bug" was
> addressed to me.

I wish things were easier for reporters, but from what I can see this
is the only way forward if you or some silent bystander cares.

>> I ask, as I still have this on my list of regressions and it seems there
>> was no progress in three+ weeks now.
>

> I have not pursued this further since as far as I could tell I already
> provided all requested information and I don't actually use nouveau, so
> I blacklisted it.

I doubt any developer cares enough to take a closer look[1] without a
proper nouveau bug and some help & prodding from someone affected. And
it looks to me like reverting the culprit now might create even bigger
problems for users.

Hence I guess this won't be fixed in the end. In an ideal world this
would not happen, but we don't live in one and all have just 24 hours in
a day. :-/

Nevertheless: thanks for your report and your help throughout this thread.

[1] some points on the following page kinda explain this
https://linux-regtracking.leemhuis.info/post/frequent-reasons-why-linux-kernel-bug-reports-are-ignored/

Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)
--
Everything you wanna know about Linux kernel regression tracking:
https://linux-regtracking.leemhuis.info/about/#tldr
If I did something stupid, please tell me, as explained on that page.

#regzbot inconclusive: reporting deadlock (see thread for details)