
Bug#1036530: linux-signed-amd64: Hard lock up of system


Nick Hastings

May 21, 2023, 8:10:04 PM
Source: linux-signed-amd64
Severity: important
Tags: upstream
X-Debbugs-Cc: nicholas...@gmail.com

Dear Maintainer,

after upgrading from a 6.0.0 kernel to a 6.1.0 kernel I experienced
hard lockups on my Dell XPS 15 7590 a few minutes after each boot. On
more than one occasion I was on the console and was able to see the
error message. It was the same error on each occasion, and I reproduce
it here:

[ 58.729863] ACPI Error: Aborting method \_SB.PCI0.PGON due to previous error (AE_AML_LOOP_TIMEOUT) (20220331/psparse-529)
[ 58.729904] ACPI Error: Aborting method \_SB.PCI0.PEG0.PG00._ON due to previous error (AE_AML_LOOP_TIMEOUT) (20220331/psparse-529)
[ 60.083261] vfio-pci 0000:01:00.0 Unable to change power state from D3cold to D0, device inaccessible

N.B. the message on the console was recorded with a photograph and
then manually typed in, so it may contain one or more errors.

I ran git bisect as described at
https://wiki.debian.org/DebianKernel/GitBisect which seems to have
found the bad commit. It is a merge commit that deals with ACPI code.
However, I don't see what may actually be causing this issue.
The commit is e996c7e01892ac18ec0db447294d4f591c325efe

Please find the report from git bisect below.

Regards,

Nick.

-- System Information:
Debian Release: 12.0
APT prefers testing
APT policy: (990, 'testing'), (500, 'unstable')
Architecture: amd64 (x86_64)
Foreign Architectures: i386

Kernel: Linux 6.0.0-rc6-00001-g018d6711c26e (SMP w/16 CPU threads; PREEMPT)
Locale: LANG=en_AU.UTF-8, LC_CTYPE=en_AU.UTF-8 (charmap=UTF-8), LANGUAGE=en_AU:en
Shell: /bin/sh linked to /usr/bin/dash
Init: systemd (via /run/systemd/system)
LSM: AppArmor: enabled


% git bisect good
e996c7e01892ac18ec0db447294d4f591c325efe is the first bad commit
commit e996c7e01892ac18ec0db447294d4f591c325efe
Merge: c77f54a9bcec a1cf1fd62ae7 562163595a91 018d6711c26e 6cc401be1648
Author: Rafael J. Wysocki <rafael.j...@intel.com>
Date: Fri Sep 30 20:52:39 2022 +0200

Merge branches 'acpi-properties', 'acpi-tables', 'acpi-x86' and 'acpi-soc'

Merge changes related to ACPI data-only tables handling and ACPI device
properties management, x86-specific ACPI code changes and ACPI SoC driver
changes for 6.1-rc1:

- Clean up the ACPI LPSS (Intel SoC) driver (Andy Shevchenko).

- Add a quirk for Dell Inspiron 14 2-in-1 for StorageD3Enable (Mario
Limonciello).

- Drop unused dev_fmt() and redundant 'HMAT' prefix from the HMAT
parsing code (Liu Shixin).

- Make ACPI FPDT parsing code avoid calling acpi_os_map_memory() on
invalid physical addresses (Hans de Goede).

- Silence missing-declarations warning related to Apple device
properties management (Lukas Wunner).

* acpi-properties:
ACPI: property: Silence missing-declarations warning in apple.c

* acpi-tables:
ACPI: HMAT: Drop unused dev_fmt() and redundant 'HMAT' prefix
ACPI: tables: FPDT: Don't call acpi_os_map_memory() on invalid phys address

* acpi-x86:
ACPI: x86: Add a quirk for Dell Inspiron 14 2-in-1 for StorageD3Enable

* acpi-soc:
ACPI: LPSS: Deduplicate skipping device in acpi_lpss_create_device()
ACPI: LPSS: Replace loop with first entry retrieval

drivers/acpi/acpi_fpdt.c | 22 ++++++++++++++++++++++
drivers/acpi/acpi_lpss.c | 45 +++++++++++++++++++++------------------------
drivers/acpi/numa/hmat.c | 25 ++++++++++++-------------
drivers/acpi/x86/apple.c | 1 +
drivers/acpi/x86/utils.c | 19 ++++++++++++++++++-
5 files changed, 74 insertions(+), 38 deletions(-)
[0 running job(s)] {history#6810} 2023-05-20 20:54:16

Salvatore Bonaccorso

May 24, 2023, 6:40:05 AM
Control: tags -1 + moreinfo

Hi Nick,

On Mon, May 22, 2023 at 08:56:12AM +0900, Nick Hastings wrote:
> Source: linux-signed-amd64
> Severity: important
> Tags: upstream
> X-Debbugs-Cc: nicholas...@gmail.com
>
> Dear Maintainer,
>
> after upgrading from a 6.0.0 kernel to a 6.1.0 kernel I experienced
> hard lockups on my Dell XPS 15 7590 a few minutes after each boot. On
> more than one occasion I was on the console and was able to see the
> error message. It was the same error on each occasion, and I reproduce
> it here:
>
> [ 58.729863] ACPI Error: Aborting method \_SB.PCI0.PGON due to previous error (AE_AML_LOOP_TIMEOUT) (20220331/psparse-529)
> [ 58.729904] ACPI Error: Aborting method \_SB.PCI0.PEG0.PG00._ON due to previous error (AE_AML_LOOP_TIMEOUT) (20220331/psparse-529)
> [ 60.083261] vfio-pci 0000:01:00.0 Unable to change power state from D3cold to D0, device inaccessible
>
> N.B. the message on the console was recorded with a photograph and
> then manually typed in, so it may contain one or more errors.
>
> I ran git bisect as described at
> https://wiki.debian.org/DebianKernel/GitBisect which seems to have
> found the bad commit. It is a merge commit that deals with ACPI code.
> However, I don't see what may actually be causing this issue.
> The commit is e996c7e01892ac18ec0db447294d4f591c325efe
>
> Please find the report from git bisect below.

Given you were able to bisect it so far, can you try to isolate the
commit from the merge commit causing it? One remotely related might be
"ACPI: x86: Add a quirk for Dell Inspiron 14 2-in-1 for
StorageD3Enable".

Regards,
Salvatore

Nick Hastings

May 24, 2023, 7:30:05 PM
Hi,

* Salvatore Bonaccorso <car...@debian.org> [230524 19:26]:
>
> Given you were able to bisect it so far, can you try to isolate the
> commit from the merge commit causing it?

I guess I can try. The commit message states:

Merge: c77f54a9bcec a1cf1fd62ae7 562163595a91 018d6711c26e 6cc401be1648

Is there a way to extract each of those?

> One remotely related might be "ACPI: x86: Add a quirk for Dell
> Inspiron 14 2-in-1 for StorageD3Enable".

Manually looking at the diff with
git diff e996c7e01892ac18ec0db447294d4f591c325efe~ e996c7e01892ac18ec0db447294d4f591c325efe
I guess that means the following:

--- a/drivers/acpi/x86/utils.c
+++ b/drivers/acpi/x86/utils.c
@@ -207,9 +207,26 @@ static const struct x86_cpu_id storage_d3_cpu_ids[] = {
 	{}
 };

+static const struct dmi_system_id force_storage_d3_dmi[] = {
+	{
+		/*
+		 * _ADR is ambiguous between GPP1.DEV0 and GPP1.NVME
+		 * but .NVME is needed to get StorageD3Enable node
+		 * https://bugzilla.kernel.org/show_bug.cgi?id=216440
+		 */
+		.matches = {
+			DMI_MATCH(DMI_SYS_VENDOR, "Dell Inc."),
+			DMI_MATCH(DMI_PRODUCT_NAME, "Inspiron 14 7425 2-in-1"),
+		}
+	},
+	{}
+};
+
 bool force_storage_d3(void)
 {
-	return x86_match_cpu(storage_d3_cpu_ids);
+	const struct dmi_system_id *dmi_id = dmi_first_match(force_storage_d3_dmi);
+
+	return dmi_id || x86_match_cpu(storage_d3_cpu_ids);
 }


Thanks,

Nick.

Salvatore Bonaccorso

May 25, 2023, 11:31:20 AM
Hi Nick,

On Thu, May 25, 2023 at 08:23:15AM +0900, Nick Hastings wrote:
> Hi,
>
> * Salvatore Bonaccorso <car...@debian.org> [230524 19:26]:
> >
> > Given you were able to bisect it so far, can you try to isolate the
> > commit from the merge commit causing it?
>
> I guess I can try. The commit message states:
>
> Merge: c77f54a9bcec a1cf1fd62ae7 562163595a91 018d6711c26e 6cc401be1648
>
> Is there a way to extract each of those?

The way I usually get all commits from a merge is

git log --oneline $mergecommit^..$mergecommit^2

though here we have three merge commits, merged with one merge commit
on top, so you would go down the merges of the acpi-properties,
acpi-tables, acpi-x86 and acpi-soc branches. Those are:

* acpi-properties:
ACPI: property: Silence missing-declarations warning in apple.c

* acpi-tables:
ACPI: HMAT: Drop unused dev_fmt() and redundant 'HMAT' prefix
ACPI: tables: FPDT: Don't call acpi_os_map_memory() on invalid phys address

* acpi-x86:
ACPI: x86: Add a quirk for Dell Inspiron 14 2-in-1 for StorageD3Enable

* acpi-soc:
ACPI: LPSS: Deduplicate skipping device in acpi_lpss_create_device()
ACPI: LPSS: Replace loop with first entry retrieval

That probably won't work actually as the code has been refactored
substantially after the commit.

In the ideal case we could confirm the quirk change is the responsible
commit, so we can make upstream aware.
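
The per-branch listing described above can be sketched for an octopus merge such as this one. The helper below only builds the revision ranges from the hashes on the "Merge:" line; the name `merge_ranges` is hypothetical, and it assumes the first hash on that line is the mainline parent.

```shell
# merge_ranges: print one "first_parent..other_parent" range per merged branch.
# Arguments: the hashes from a "Merge:" line, first parent first (assumption).
merge_ranges() {
    first=$1
    shift
    for parent in "$@"; do
        printf '%s..%s\n' "$first" "$parent"
    done
}

# Usage inside a kernel checkout (not run here):
#   merge_ranges c77f54a9bcec a1cf1fd62ae7 562163595a91 018d6711c26e 6cc401be1648 |
#   while read -r range; do git log --oneline "$range"; done
```

Each range lists the commits reachable from that branch tip but not from the first parent, which is what a merged branch contributed.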

Regards,
Salvatore

Nick Hastings

May 25, 2023, 8:31:48 PM
Hi Salvatore,

thanks for your help. However, I'm now not sure if I really have
identified the commit that causes my problems. I fear I may have made
one or more mistakes when setting "git bisect good". I had been under
the impression that the lock up would happen no more than a few tens of
minutes after booting, however it seems that sometimes it can take a few
hours to occur.

So, I'm running the git bisect again and will be more careful before
marking "git bisect good". It could take a few days.
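
Because the lockup can take hours to appear, it may help to gate "git bisect good" on a minimum uptime. The following is a minimal sketch, not part of the Debian bisect instructions; the four-hour threshold and the function names are assumptions chosen for illustration.

```shell
# Assumed soak time: the thread only says lockups can take "a few hours".
MIN_SOAK_SECONDS=$((4 * 3600))

# uptime_seconds: whole seconds since boot, read from /proc/uptime (Linux).
uptime_seconds() {
    cut -d. -f1 /proc/uptime
}

# soaked_long_enough UPTIME REQUIRED: succeed once UPTIME >= REQUIRED.
soaked_long_enough() {
    [ "$1" -ge "$2" ]
}

# verdict_hint UPTIME: suggest what to do with the kernel under test.
verdict_hint() {
    if soaked_long_enough "$1" "$MIN_SOAK_SECONDS"; then
        echo "no lockup after soak: 'git bisect good' is reasonable"
    else
        echo "too early: keep running before marking good"
    fi
}

# Usage on the machine under test (not run here):
#   verdict_hint "$(uptime_seconds)"
```

A kernel that locks up is marked bad immediately; only the "good" verdict needs the waiting period.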

Should this particular bug be closed?

Thanks,

Nick.


* Salvatore Bonaccorso <car...@debian.org> [230526 00:19]:

Salvatore Bonaccorso

May 26, 2023, 7:40:06 AM
Control: tags -1 + moreinfo

Hi Nick,

On Fri, May 26, 2023 at 09:25:23AM +0900, Nick Hastings wrote:
> Hi Salvatore,
>
> thanks for your help. However, I'm now not sure if I really have
> identified the commit that causes my problems. I fear I may have made
> one or more mistakes when setting "git bisect good". I had been under
> the impression that the lock up would happen no more than a few tens of
> minutes after booting, however it seems that sometimes it can take a few
> hours to occur.
>
> So, I'm running the git bisect again and will be more careful before
> marking "git bisect good". It could take a few days.
>
> Should this particular bug be closed?

Thanks a lot for reporting back; the time you put into the bisect is very
much appreciated and valued! No, there is no need to close this one, as the
bug still persists. Please just follow up once you have identified the
culprit with the fresh bisect.

When you do, please also remove the moreinfo tag again (you can write
a control message with tag -1 - moreinfo), so the bug no longer appears
as needing information from the reporter.

Thank you!

Regards,
Salvatore

Nick Hastings

May 27, 2023, 9:20:04 PM
Control: tags -1 - moreinfo

Hi,

I repeated the git bisect, and the bad commit seems to be:

(git)-[v6.1-rc1~206^2~4^5~3|bisect] % git bisect bad
24867516f06dabedef3be7eea0ef0846b91538bc is the first bad commit
commit 24867516f06dabedef3be7eea0ef0846b91538bc
Author: Mario Limonciello <mario.li...@amd.com>
Date: Tue Aug 23 13:51:31 2022 -0500

ACPI: OSI: Remove Linux-Dell-Video _OSI string

This string was introduced because drivers for NVIDIA hardware
had bugs supporting RTD3 in the past.

Before proprietary NVIDIA driver started to support RTD3, Ubuntu had
had a mechanism for switching PRIME on and off, though it had required
to logout/login to make the library switch happen.

When the PRIME had been off, the mechanism had unloaded the NVIDIA
driver and put the device into D3cold, but the GPU had never come back
to D0 again which is why ODMs used the _OSI to expose an old _DSM
method to switch the power on/off.

That has been fixed by commit 5775b843a619 ("PCI: Restore config space
on runtime resume despite being unbound"). so vendors shouldn't be
using this string to modify ASL any more.

Reviewed-by: Lyude Paul <ly...@redhat.com>
Signed-off-by: Mario Limonciello <mario.li...@amd.com>
Signed-off-by: Rafael J. Wysocki <rafael.j...@intel.com>

drivers/acpi/osi.c | 9 ---------
1 file changed, 9 deletions(-)

This machine is a Dell with an NVIDIA chip, so it looks like this really
could be the commit that is causing the problems. The description
of the commit also seems (to my untrained eye) to be consistent with the
error reported on the console when the lockup occurs:

[ 58.729863] ACPI Error: Aborting method \_SB.PCI0.PGON due to previous error (AE_AML_LOOP_TIMEOUT) (20220331/psparse-529)
[ 58.729904] ACPI Error: Aborting method \_SB.PCI0.PEG0.PG00._ON due to previous error (AE_AML_LOOP_TIMEOUT) (20220331/psparse-529)
[ 60.083261] vfio-pci 0000:01:00.0 Unable to change power state from D3cold to D0, device inaccessible

Hopefully this is enough information for experts to resolve this.

Regards,

Nick.

* Salvatore Bonaccorso <car...@debian.org> [230526 20:30]:

Salvatore Bonaccorso

May 28, 2023, 3:00:05 AM
Hi Mario

Nick Hastings reported in Debian in https://bugs.debian.org/1036530
lockups of his system after updating from a 6.0 based version to
6.1.y.

#regzbot ^introduced 24867516f06d

He bisected the issue and tracked it down to commit 24867516f06d
("ACPI: OSI: Remove Linux-Dell-Video _OSI string").

Does this ring a bell for you? Do you need any further information
from Nick?

Regards,
Salvatore

Nick Hastings

May 29, 2023, 12:00:07 AM
* Mario Limonciello <mario.li...@amd.com> [230529 10:14]:
> On 5/28/23 19:56, Nick Hastings wrote:
> > Hi,
> >
> > * Mario Limonciello <mario.li...@amd.com> [230528 21:44]:
> > > > > [ 58.729863] ACPI Error: Aborting method \_SB.PCI0.PGON due to previous error (AE_AML_LOOP_TIMEOUT) (20220331/psparse-529)
> > > > > [ 58.729904] ACPI Error: Aborting method \_SB.PCI0.PEG0.PG00._ON due to previous error (AE_AML_LOOP_TIMEOUT) (20220331/psparse-529)
> > > > > [ 60.083261] vfio-pci 0000:01:00.0 Unable to change power state from D3cold to D0, device inaccessible
> > > > >
> > > > > Hopefully this is enough information for experts to resolve this.
> > > >
> > > > Does this ring some bell for you? Do you need any further information
> > > > from Nick?
> > > >
> > > > Regards,
> > > > Salvatore
> > >
> >
> > > Have Nick try using "pcie_port_pm=off" and see if it helps the issue.
> >
> > I booted into a 6.1 kernel with this option. It has been running without
> > problems for 1.5 hours. Usually I would expect the lockup to have
> > occurred by now.

I let this run for 3 hours without issue.
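
To confirm that a given boot really had the option active, the kernel command line can be checked. This is a small sketch; the helper name `has_kernel_param` is hypothetical, and the GRUB note in the comments assumes a default GRUB-based Debian install.

```shell
# has_kernel_param PARAM CMDLINE: succeed if PARAM is one of the
# space-separated words in CMDLINE.
has_kernel_param() {
    case " $2 " in
        *" $1 "*) return 0 ;;
        *) return 1 ;;
    esac
}

# Usage on the running system (not run here):
#   has_kernel_param pcie_port_pm=off "$(cat /proc/cmdline)" && echo "active"
#
# On a GRUB-based Debian install the option is typically added to
# GRUB_CMDLINE_LINUX_DEFAULT in /etc/default/grub, followed by update-grub.
```

The same check applies to any other boot option tried while narrowing down the problem.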

> > > Does this happen in the latest 6.4 RC as well?
> >
> > I have compiled that kernel and will boot into it after running this one
> > with the pcie_port_pm=off for another hour or so.

I'm now running 6.4.0-rc4 without seeing the problem after 1 hour.

I did however see two unrelated problems that I include here for
completeness:
1. The iwlwifi module did not load automatically.
2. Xwayland used a huge amount of CPU even though I was not running any X
programs. Recompiling my Wayland compositor without XWayland support
"fixed" this.

> > > I think we need to see a full dmesg and acpidump to better
> > > characterize it.
> >
> > Please find attached. Let me know if there is anything else I can provide.
> >
> > Regards,
> >
> > Nick.
>
> I don't see nouveau loading, are you explicitly preventing it from
> loading?

Yes nouveau is blacklisted.

> Can I see the journal from a boot when it reproduced?

Hmm, I'm not sure which n for "journalctl -b n" maps to which kernel (is that
what you are requesting?). The commit hash does not seem to be
listed. I may have to boot into a bad kernel again.
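
One way to map boot offsets to kernels is to read the "Linux version" banner that the kernel logs at the start of each boot. The sketch below assumes that banner is present in the journal for the boots of interest; the helper name `kernel_version_of` is hypothetical. For kernels built during a bisect, the release string may carry a `-g<hash>` suffix like the one in the System Information block of the original report.

```shell
# kernel_version_of: read kernel log lines on stdin and print the release
# from the first "Linux version <release> ..." banner found.
kernel_version_of() {
    sed -n 's/.*Linux version \([^ ]*\).*/\1/p' | head -n 1
}

# Usage (not run here): print the kernel release for the last five boots.
#   for n in 0 -1 -2 -3 -4; do
#       printf '%s\t' "$n"
#       journalctl -k -b "$n" --no-pager | kernel_version_of
#   done
```

`journalctl --list-boots` shows which offsets exist and when each boot started.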

Regards,

Nick.

Nick Hastings

May 30, 2023, 3:10:04 AM
Hi,

* Mario Limonciello <mario.li...@amd.com> [230530 13:00]:
> On 5/29/23 18:01, Nick Hastings wrote:
> > Hi,
> >
> > * Nick Hastings <nicholas...@gmail.com> [230529 12:51]:
> > > > > > > > [ 58.729863] ACPI Error: Aborting method \_SB.PCI0.PGON due to previous error (AE_AML_LOOP_TIMEOUT) (20220331/psparse-529)
> > > > > > > > [ 58.729904] ACPI Error: Aborting method \_SB.PCI0.PEG0.PG00._ON due to previous error (AE_AML_LOOP_TIMEOUT) (20220331/psparse-529)
> > > > > > > > [ 60.083261] vfio-pci 0000:01:00.0 Unable to change power state from D3cold to D0, device inaccessible
> > > > > > > >
> > > > > > > > Hopefully this is enough information for experts to resolve this.
> > > > > > >
> > > > > > > Does this ring some bell for you? Do you need any further information
> > > > > > > from Nick?
> > > > > > >
> > > > > > > Regards,
> > > > > > > Salvatore
> > > > > >
> > > > >
> > > > > > Have Nick try using "pcie_port_pm=off" and see if it helps the issue.
> > > > >
> > > > > I booted into a 6.1 kernel with this option. It has been running without
> > > > > problems for 1.5 hours. Usually I would expect the lockup to have
> > > > > occurred by now.
> > >
> > > I let this run for 3 hours without issue.
> > >
> > > > > > Does this happen in the latest 6.4 RC as well?
> > > > >
> > > > > I have compiled that kernel and will boot into it after running this one
> > > > > with the pcie_port_pm=off for another hour or so.
> > >
> > > I'm now running 6.4.0-rc4 without seeing the problem after 1 hour.
> >
> > I did eventually see a lockup of this kernel. On the console I saw:
> >
> > [ 151.035036] vfio-pci 0000:01:00.0 Unable to change power state from D3cold to D0, device inaccessible
> >
> > I did not see the other two lines that were present in earlier lock ups.
> >
> > > I did however see two unrelated problems that I include here for
> > > completeness:
> > > 1. iwlwifi module did not automatically load
> > > 2. Xwayland used huge amount of CPU even though was not running any X
> > > programs. Recompiling my wayland compositor without XWayland support
> > > "fixed" this.
> > >
> > > > > > I think we need to see a full dmesg and acpidump to better
> > > > > > characterize it.
> > > > >
> > > > > Please find attached. Let me know if there is anything else I can provide.
> > > > >
> > > > > Regards,
> > > > >
> > > > > Nick.
> > > >
> > > > I don't see nouveau loading, are you explicitly preventing it from
> > > > loading?
> > >
> > > Yes nouveau is blacklisted.
> > >
> > > > Can I see the journal from a boot when it reproduced?
> > >
> > > Hmm not sure which n for "journalctl -b n" maps to which kernel (is that
> > > what you are requesting?). The commit hash doesn't not seem to be
> > > listed. I may have to boot into a bad kernel again.
> >
> > Please find attached the output from a "journalctl --system -bN" for a
> > kernel that has this issue.
> >
> > Regards,
> >
> > Nick.
>
> In this log I see nouveau loaded, but I also don't see the failure
> occurring.

I never saw anything in the logs from a lockup either. I had assumed it
was no longer able to write to disk. The failure did occur on that
occasion.

> As you're actually loading nouveau, can you please try nouveau.runpm=0 on
> the kernel command line?

I'm not intentionally loading it. This machine also has intel graphics
which is what I prefer. Checking my
/etc/modprobe.d/blacklist-nvidia-nouveau.conf
I see:

blacklist nvidia
blacklist nvidia-drm
blacklist nvidia-modeset
blacklist nvidia-uvm
blacklist ipmi_msghandler
blacklist ipmi_devintf

So I thought I had blacklisted it but it seems I did not. Since I do not
want to use it maybe it is better to check if the lock up occurs with
nouveau blacklisted. I will try that now.
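
A quick check like the following would have shown that the file above does not cover nouveau. It is a sketch only: the helper name `is_blacklisted` is hypothetical, and it looks at plain "blacklist" lines without handling other modprobe.d directives.

```shell
# is_blacklisted MODULE: read modprobe.d-style text on stdin and succeed if
# it contains a "blacklist MODULE" line. modprobe treats '-' and '_' in
# module names as equivalent, so both are normalised to '_' first.
is_blacklisted() {
    want=$(printf '%s' "$1" | sed 's/-/_/g')
    while read -r keyword name rest; do
        [ "$keyword" = "blacklist" ] || continue
        [ "$(printf '%s' "$name" | sed 's/-/_/g')" = "$want" ] && return 0
    done
    return 1
}

# Usage (not run here):
#   is_blacklisted nouveau < /etc/modprobe.d/blacklist-nvidia-nouveau.conf ||
#       echo "nouveau is not blacklisted in this file"
```

After adding a "blacklist nouveau" line, `lsmod` on the next boot shows whether the module still loads; on Debian the initramfs may also need regenerating for the change to apply early in boot.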

> If that helps the issue; I strongly suggest you cross reference the latest
> kernel to see if this bug still exists.

I did. See above.

Regards,

Nick.

Salvatore Bonaccorso

May 30, 2023, 7:30:04 AM
Hi Nick,

Thanks to you both for triaging the issue!
Could you try netconsole to see whether it captures more output from the
lockup?

https://www.kernel.org/doc/html/latest/networking/netconsole.html
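
As a rough sketch of what that setup involves, the helper below assembles the `netconsole=` module parameter in the format given in the linked documentation. The helper name, ports, addresses and interface are all placeholder assumptions; netconsole also generally needs a network driver with netpoll support, which typically means a wired interface.

```shell
# netconsole_param SRC_PORT SRC_IP DEV TGT_PORT TGT_IP [TGT_MAC]
# Prints: netconsole=<src-port>@<src-ip>/<dev>,<tgt-port>@<tgt-ip>/<tgt-mac>
netconsole_param() {
    printf 'netconsole=%s@%s/%s,%s@%s/%s\n' "$1" "$2" "$3" "$4" "$5" "${6:-}"
}

# Usage (not run here; every value below is a placeholder):
#   modprobe netconsole "$(netconsole_param 6665 192.168.1.10 eth0 6666 192.168.1.20)"
# On the receiving machine, a UDP listener on the target port collects the
# messages, for example with netcat.
```

Leaving the MAC address empty makes the kernel fall back to the broadcast address, according to the same documentation.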

Regards,
Salvatore

Nick Hastings

May 31, 2023, 7:50:04 PM
Hi,

* Nick Hastings <nicholas...@gmail.com> [230530 16:01]:
>
> * Mario Limonciello <mario.li...@amd.com> [230530 13:00]:
<snip>
> > As you're actually loading nouveau, can you please try nouveau.runpm=0 on
> > the kernel command line?
>
> I'm not intentionally loading it. This machine also has intel graphics
> which is what I prefer. Checking my
> /etc/modprobe.d/blacklist-nvidia-nouveau.conf
> I see:
>
> blacklist nvidia
> blacklist nvidia-drm
> blacklist nvidia-modeset
> blacklist nvidia-uvm
> blacklist ipmi_msghandler
> blacklist ipmi_devintf
>
> So I thought I had blacklisted it but it seems I did not. Since I do not
> want to use it maybe it is better to check if the lock up occurs with
> nouveau blacklisted. I will try that now.

I blacklisted nouveau and booted into a 6.1 kernel:
% uname -a
Linux xps 6.1.0-9-amd64 #1 SMP PREEMPT_DYNAMIC Debian 6.1.27-1 (2023-05-08) x86_64 GNU/Linux

It has been running without problems for nearly two days now:
% uptime
08:34:48 up 1 day, 16:22, 2 users, load average: 1.33, 1.26, 1.27

Regards,

Nick.

Nick Hastings

Jun 1, 2023, 8:00:04 PM
Hi,

* Limonciello, Mario <mario.li...@amd.com> [230602 01:18]:
> +Lyude, Lukas, Karol
> Thanks, that makes a lot more sense now.
>
> Nick, Can you please test if nouveau works with runtime PM in the
> latest 6.4-rc?

I reported this twice already. I guess it was lost since for some
reason emails in this thread are not being trimmed. I'll repeat here:

I did eventually see a lockup of this kernel. On the console I saw:

[ 151.035036] vfio-pci 0000:01:00.0 Unable to change power state from D3cold to D0, device inaccessible

I did not see the other two lines that were present in earlier lock ups.

Regards,

Nick.

Linux regression tracking (Thorsten Leemhuis)

Jun 26, 2023, 8:40:04 AM
Hi, Thorsten here, the Linux kernel's regression tracker. Top-posting
for once, to make this easily accessible to everyone.

Nick, what's the status/was there any progress? Did you do what Mario
suggested and file a nouveau bug?

I ask, as I still have this on my list of regressions and it seems there
was no progress in three+ weeks now.

Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)
--
Everything you wanna know about Linux kernel regression tracking:
https://linux-regtracking.leemhuis.info/about/#tldr
If I did something stupid, please tell me, as explained on that page.

#regzbot backburner: slow progress, likely just affects one machine
#regzbot poke


On 02.06.23 02:57, Limonciello, Mario wrote:
> [AMD Official Use Only - General]
>
>> -----Original Message-----
>> From: Nick Hastings <nicholas...@gmail.com>
>> Sent: Thursday, June 1, 2023 7:02 PM
>> To: Karol Herbst <khe...@redhat.com>
>> Cc: Limonciello, Mario <Mario.Li...@amd.com>; Lyude Paul
>> <ly...@redhat.com>; Lukas Wunner <lu...@wunner.de>; Salvatore
>> Bonaccorso <car...@debian.org>; 103...@bugs.debian.org; Rafael J.
>> Wysocki <raf...@kernel.org>; Len Brown <le...@kernel.org>; linux-
>> ac...@vger.kernel.org; linux-...@vger.kernel.org;
>> regre...@lists.linux.dev
>> Subject: Re: Regression from "ACPI: OSI: Remove Linux-Dell-Video _OSI
>> string"? (was: Re: Bug#1036530: linux-signed-amd64: Hard lock up of system)
>>
>> Hi,
>>
>> * Karol Herbst <khe...@redhat.com> [230602 03:10]:
>>> On Thu, Jun 1, 2023 at 7:21 PM Limonciello, Mario
>>> <Mario.Li...@amd.com> wrote:
>>>>> -----Original Message-----
>>>>> From: Karol Herbst <khe...@redhat.com>
>>>>> Sent: Thursday, June 1, 2023 12:19 PM
>>>>> To: Limonciello, Mario <Mario.Li...@amd.com>
>>>>> Cc: Nick Hastings <nicholas...@gmail.com>; Lyude Paul
>>>>> <ly...@redhat.com>; Lukas Wunner <lu...@wunner.de>; Salvatore
>>>>> Bonaccorso <car...@debian.org>; 103...@bugs.debian.org; Rafael J.
>>>>> Wysocki <raf...@kernel.org>; Len Brown <le...@kernel.org>; linux-
>>>>> ac...@vger.kernel.org; linux-...@vger.kernel.org;
>>>>> regre...@lists.linux.dev
>>>>> Subject: Re: Regression from "ACPI: OSI: Remove Linux-Dell-Video _OSI
>>>>> string"? (was: Re: Bug#1036530: linux-signed-amd64: Hard lock up of
>> system)
>>>>>
>>>>> On Thu, Jun 1, 2023 at 6:54 PM Limonciello, Mario
>>>>> <Mario.Li...@amd.com> wrote:
>>>>>>
>>>>>> [AMD Official Use Only - General]
>>>>>>
>>>>>>> -----Original Message-----
>>>>>>> From: Karol Herbst <khe...@redhat.com>
>>>>>>> Sent: Thursday, June 1, 2023 11:33 AM
>>>>>>> To: Limonciello, Mario <Mario.Li...@amd.com>
>>>>>>> Cc: Nick Hastings <nicholas...@gmail.com>; Lyude Paul
>>>>>>> <ly...@redhat.com>; Lukas Wunner <lu...@wunner.de>; Salvatore
>>>>>>> Bonaccorso <car...@debian.org>; 103...@bugs.debian.org; Rafael
>> J.
>>>>>>> Wysocki <raf...@kernel.org>; Len Brown <le...@kernel.org>; linux-
>>>>>>> ac...@vger.kernel.org; linux-...@vger.kernel.org;
>>>>>>> regre...@lists.linux.dev
>>>>>>> Subject: Re: Regression from "ACPI: OSI: Remove Linux-Dell-Video
>> _OSI
>>>>>>> string"? (was: Re: Bug#1036530: linux-signed-amd64: Hard lock up of
>>>>> system)
>>>>>>>
>>>>>>> On Thu, Jun 1, 2023 at 6:18 PM Limonciello, Mario
>>>>>>>>
>>>>>>>> Lyude, Lukas, Karol
>>>>>>>>
>>>>>>>> This thread is in relation to this commit:
>>>>>>>>
>>>>>>>> 24867516f06d ("ACPI: OSI: Remove Linux-Dell-Video _OSI string")
>>>>>>>>
>>>>>>>> Nick has found that runtime PM is *not* working for nouveau.
>>>>>>>>
>>>>>>>
>>>>>>> keep in mind we have a list of PCIe controllers where we apply a
>>>>>>> workaround:
>>>>>>>
>>>>>>> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/gpu/drm/nouveau/nouveau_drm.c?h=v6.4-rc4#n682
>>>>>>>
>>>>>>> And I suspect there might be one or two more IDs we'll have to add
>>>>>>> there. Do we have any logs?
>>>>>>
>>>>>> There's some archived onto the distro bug. Search this page for
>>>>> "journalctl.log.gz"
>>>>>> https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1036530
>>>>>>
>>>>>
>>>>> interesting.. It seems to be the same controller used here. I wonder
>>>>> if the pci topology is different or if the workaround is applied at
>>>>> all.
>>>>
>>>> I didn't see the message in the log about the workaround being applied
>>>> in that log, so I guess PCI topology difference is a likely suspect.
>>>>
>>>
>>> yeah, but I also couldn't see a log with the usual nouveau messages,
>>> so it's kinda weird.
>>>
>>> Anyway, the output of `lspci -tvnn` would help
>>
>> % lspci -tvnn
>> -[0000:00]-+-00.0 Intel Corporation Device [8086:3e20]
>> +-01.0-[01]----00.0 NVIDIA Corporation TU117M [GeForce GTX 1650
>> Mobile / Max-Q] [10de:1f91]
>
> So the bridge it's connected to is the same that the quirk *should have been* triggering.
>
> May 29 15:02:42 xps kernel: pci 0000:00:01.0: [8086:1901] type 01 class 0x060400
>
> Since the quirk isn't working and this is still a problem in 6.4-rc4 I suggest opening a
> Nouveau drm bug to figure out why.
>
>> +-02.0 Intel Corporation CoffeeLake-H GT2 [UHD Graphics 630]
>> [8086:3e9b]
>> +-04.0 Intel Corporation Xeon E3-1200 v5/E3-1500 v5/6th Gen Core
>> Processor Thermal Subsystem [8086:1903]
>> +-08.0 Intel Corporation Xeon E3-1200 v5/v6 / E3-1500 v5 /
>> 6th/7th/8th Gen Core Processor Gaussian Mixture Model [8086:1911]
>> +-12.0 Intel Corporation Cannon Lake PCH Thermal Controller
>> [8086:a379]
>> +-14.0 Intel Corporation Cannon Lake PCH USB 3.1 xHCI Host Controller
>> [8086:a36d]
>> +-14.2 Intel Corporation Cannon Lake PCH Shared SRAM [8086:a36f]
>> +-15.0 Intel Corporation Cannon Lake PCH Serial IO I2C Controller #0
>> [8086:a368]
>> +-15.1 Intel Corporation Cannon Lake PCH Serial IO I2C Controller #1
>> [8086:a369]
>> +-16.0 Intel Corporation Cannon Lake PCH HECI Controller [8086:a360]
>> +-17.0 Intel Corporation Cannon Lake Mobile PCH SATA AHCI Controller
>> [8086:a353]
>> +-1b.0-[02-3a]----00.0-[03-3a]--+-00.0-[04]----00.0 Intel Corporation
>> JHL6340 Thunderbolt 3 NHI (C step) [Alpine Ridge 2C 2016] [8086:15d9]
>> | +-01.0-[05-39]--
>> | \-02.0-[3a]----00.0 Intel Corporation JHL6340
>> Thunderbolt 3 USB 3.1 Controller (C step) [Alpine Ridge 2C 2016]
>> [8086:15db]
>> +-1c.0-[3b]----00.0 Intel Corporation Wi-Fi 6 AX200 [8086:2723]
>> +-1c.4-[3c]----00.0 Realtek Semiconductor Co., Ltd. RTS525A PCI
>> Express Card Reader [10ec:525a]
>> +-1d.0-[3d]----00.0 Samsung Electronics Co Ltd NVMe SSD Controller
>> SM981/PM981/PM983 [144d:a808]
>> +-1f.0 Intel Corporation Cannon Lake LPC Controller [8086:a30e]
>> +-1f.3 Intel Corporation Cannon Lake PCH cAVS [8086:a348]
>> +-1f.4 Intel Corporation Cannon Lake PCH SMBus Controller
>> [8086:a323]
>> \-1f.5 Intel Corporation Cannon Lake PCH SPI Controller
>> [8086:a324]
>>
>>
>> Regards,
>>
>> Nick.
>

Thorsten Leemhuis

Jun 30, 2023, 9:10:05 AM
On 27.06.23 00:34, Nick Hastings wrote:
> * Linux regression tracking (Thorsten Leemhuis) <regre...@leemhuis.info> [230626 21:09]:
>> Hi, Thorsten here, the Linux kernel's regression tracker. Top-posting
>> for once, to make this easily accessible to everyone.
>>
>> Nick, what's the status/was there any progress? Did you do what Mario
>> suggested and file a nouveau bug?
>
> It was not apparent that the suggestion to open "a Nouveau drm bug" was
> addressed to me.

I wish things were easier for reporters, but from what I can see this
is the only way forward if you or some silent bystander cares.

>> I ask, as I still have this on my list of regressions and it seems there
>> was no progress in three+ weeks now.
>
> I have not pursued this further since as far as I could tell I already
> provided all requested information and I don't actually use nouveau, so
> I blacklisted it.

I doubt any developer cares enough to take a closer look[1] without a
proper nouveau bug and some help & prodding from someone affected. And
it looks to me like reverting the culprit now might create even bigger
problems for users.

Hence I guess this won't be fixed in the end. In an ideal world this
would not happen, but we don't live in one and all have just 24 hours in
a day. :-/

Nevertheless: thanks for your report and your help throughout this thread.

[1] some points on the following page kinda explain this
https://linux-regtracking.leemhuis.info/post/frequent-reasons-why-linux-kernel-bug-reports-are-ignored/

Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)
--
Everything you wanna know about Linux kernel regression tracking:
https://linux-regtracking.leemhuis.info/about/#tldr
If I did something stupid, please tell me, as explained on that page.

#regzbot inconclusive: reporting deadlock (see thread for details)