
Bug#993948: kernel/amd64: system hang on HPE ProLiant BL460c Gen9

YunQiang Su

Sep 8, 2021, 9:00:03 AM
Package: src:linux
Version: 5.10

After upgrading to bullseye's kernel, the system always hangs after about 10 minutes,
with the following error in the IML log:

An Unrecoverable System Error (NMI) has occurred (Service Information:
0x00000008, 0x89480000)

Kernel 5.14 from experimental also has this problem.
Kernel 4.19 works fine.
Fedora 34 seems to be working well.

--
YunQiang Su

suyunqiang

Sep 8, 2021, 11:50:03 PM
On Thu, 9 Sep 2021 11:11:45 +0800 Yunqiang Su <wzs...@gmail.com> wrote:
> This is the output of dmesg and lspci from both Fedora 34 and Debian bullseye.
> I hope it is useful.

The problem seems to be caused by the bnx2x driver or its firmware:
if I purge firmware-bnx2x, the OS no longer hangs (although there is then no network connection).

I checked the md5sums of the firmware files in bullseye: they match the Fedora ones.
Note: the Fedora files are compressed with xz; I compared them after decompressing.

My hardware requires: bnx2x-e2-7.13.15.0.fw
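
As a rough illustration of the checksum comparison described above, the sketch below hashes a firmware blob and transparently decompresses it first when the file name ends in .xz. The function names `firmware_md5` and `same_firmware` are hypothetical helpers, not part of any tool mentioned in this thread.

```python
import hashlib
import lzma
from pathlib import Path


def firmware_md5(path):
    """Return the MD5 hex digest of a firmware blob, decompressing .xz files first."""
    path = Path(path)
    # Hash the decompressed payload, not the .xz archive itself.
    opener = lzma.open if path.suffix == ".xz" else open
    digest = hashlib.md5()
    with opener(path, "rb") as blob:
        # Read in 1 MiB chunks so large files are not loaded at once.
        for chunk in iter(lambda: blob.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def same_firmware(path_a, path_b):
    """True if both files hold byte-identical firmware after decompression."""
    return firmware_md5(path_a) == firmware_md5(path_b)
```

It could be called as `same_firmware("debian/bnx2x-e2-7.13.15.0.fw", "fedora/bnx2x-e2-7.13.15.0.fw.xz")`, where the two directory names are placeholders for wherever the files were copied.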

YunQiang Su

Sep 9, 2021, 9:50:03 PM
On Thu, Sep 9, 2021 at 11:11 AM, Yunqiang Su <wzs...@gmail.com> wrote:
>
>
> On Wed, 8 Sep 2021 20:53:27 +0800 YunQiang Su <wzs...@gmail.com> wrote:
> This is the output of dmesg and lspci from both Fedora 34 and Debian bullseye.
> I hope it is useful.
>

We have finally found the problem:

https://github.com/torvalds/linux/commit/8343b1f8b97ac016150c8303f95b63b20b98edf8
https://github.com/torvalds/linux/commit/65161c35554f7135e6656b3df1ce2c500ca0bdcf

In the first patch, they thought `err' was not used at all and removed it.
In the second patch, they added it back, but gave it the wrong value, -EINVAL.

Well, somebody got a better KPI out of it.
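
To make the failure mode concrete, here is a small sketch of the general pattern, not the actual bnx2x code. It assumes (and this is only one plausible reading of the two commits) that an early "nothing to set up" exit originally returned 0 and later returned -EINVAL, so a caller that checks the return code treats a harmless case as a failure. All function names are hypothetical.

```python
import errno


def init_optional_feature_original(feature_count):
    """Hypothetical original: the 'nothing to set up' exit reports success."""
    err = 0
    if feature_count == 0:
        return err  # no optional feature present: not an error
    # ... real setup would happen here ...
    return err


def init_optional_feature_regressed(feature_count):
    """Hypothetical regression: `err` is back, but initialised to -EINVAL."""
    err = -errno.EINVAL
    if feature_count == 0:
        return err  # the same harmless case now looks like a failure
    # ... real setup would happen here ...
    return 0


def bring_up_device(init_feature, feature_count):
    """Hypothetical caller that aborts bring-up on any negative return code."""
    return init_feature(feature_count) >= 0
```

Under these assumptions, `bring_up_device(init_optional_feature_original, 0)` succeeds while `bring_up_device(init_optional_feature_regressed, 0)` does not, even though nothing about the hardware changed.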

--
YunQiang Su

Yunqiang Su

Oct 21, 2021, 4:50:03 AM
On Fri, 10 Sep 2021 09:40:41 +0800 YunQiang Su <wzs...@gmail.com> wrote:
> On Thu, Sep 9, 2021 at 11:11 AM, Yunqiang Su <wzs...@gmail.com> wrote:
> >
> >
> > On Wed, 8 Sep 2021 20:53:27 +0800 YunQiang Su <wzs...@gmail.com> wrote:
> > This is the output of dmesg and lspci from both Fedora 34 and Debian bullseye.
> > I hope it is useful.
> >
>
> We have finally found the problem:
>
> https://github.com/torvalds/linux/commit/8343b1f8b97ac016150c8303f95b63b20b98edf8
> https://github.com/torvalds/linux/commit/65161c35554f7135e6656b3df1ce2c500ca0bdcf
>
> In the first patch, they thought `err' was not used at all and removed it.
> In the second patch, they added it back, but gave it the wrong value, -EINVAL.
>
> Well, somebody got a better KPI out of it.
>

The NICs can be detected now, but the machine continues to hang.
4.19.y works fine, while 5.10 and 5.14 do not.

I think we need to dig further.

Claudio Kuenzler

Oct 22, 2021, 1:40:03 AM

YunQiang Su

Oct 22, 2021, 2:00:03 AM
On Fri, Oct 22, 2021 at 1:18 PM, Claudio Kuenzler <c...@claudiokuenzler.com> wrote:
I built a kernel myself (5.14.12), the same version as the current Debian sid one.
In fact, I also tested 5.14.14.
Neither triggers this problem.
And I made sure that the hpwdt module was loaded.

I have no idea why Debian's kernel does not work.
--
YunQiang Su

Claudio Kuenzler

Oct 22, 2021, 2:10:04 AM
I have not tested sid or a newer kernel on our HP machines, though.
If you've compiled your own kernel and it works (did you do a multiple-reboot test?), maybe there's a difference in the kernel config?

What happens if you disable the hpwdt module as mentioned in the other bug reports? Do bullseye with 5.10 and experimental with 5.14 work in that case?
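
One way to look for such a config difference is sketched below: it parses two kernel config files and lists the options whose values differ. `parse_kernel_config` and `diff_kernel_configs` are hypothetical helper names, and treating an absent option as "n" is a simplifying assumption that suits bool/tristate options better than string or numeric ones.

```python
def parse_kernel_config(text):
    """Map CONFIG_* names to values; '# CONFIG_X is not set' becomes 'n'."""
    suffix = " is not set"
    options = {}
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("# CONFIG_") and line.endswith(suffix):
            # Drop the leading "# " and the trailing " is not set".
            options[line[2:-len(suffix)]] = "n"
        elif line.startswith("CONFIG_") and "=" in line:
            name, value = line.split("=", 1)
            options[name] = value
    return options


def diff_kernel_configs(text_a, text_b):
    """Return {option: (value_a, value_b)} for every option that differs."""
    a = parse_kernel_config(text_a)
    b = parse_kernel_config(text_b)
    return {
        name: (a.get(name, "n"), b.get(name, "n"))  # absent is assumed to mean "n"
        for name in sorted(a.keys() | b.keys())
        if a.get(name, "n") != b.get(name, "n")
    }
```

The inputs would be the contents of the two config files, for example the `/boot/config-<version>` file of the distribution kernel and the `.config` of the self-built one. The kernel tree's `scripts/diffconfig` serves a similar purpose.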

YunQiang Su

Oct 22, 2021, 10:40:04 AM
On Fri, Oct 22, 2021 at 2:03 PM, Claudio Kuenzler <c...@claudiokuenzler.com> wrote:
I tested upstream linux and debian-linux with the same config.
All of the upstream builds work fine, while debian-linux has this problem.
I guess it is due to one of Debian's patches.

--
YunQiang Su

YunQiang Su

Oct 23, 2021, 12:00:03 PM
On Fri, Oct 22, 2021 at 10:36 PM, YunQiang Su <wzs...@gmail.com> wrote:
I found the real problem: intel_iommu is enabled by default.
https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=934309
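
For anyone wanting to check whether a given boot has the Intel IOMMU switched on, the sketch below combines the build-time default with the `intel_iommu=` boot parameter. It assumes the default is controlled by `CONFIG_INTEL_IOMMU_DEFAULT_ON` (the thread does not name the option), it ignores whether `CONFIG_INTEL_IOMMU` itself is built in, and `intel_iommu_enabled` is a hypothetical helper name.

```python
def intel_iommu_enabled(config_text, cmdline):
    """Guess whether the Intel IOMMU is on, from kernel config text and boot command line."""
    # Build-time default: assumed to be CONFIG_INTEL_IOMMU_DEFAULT_ON=y.
    enabled = any(
        line.strip() == "CONFIG_INTEL_IOMMU_DEFAULT_ON=y"
        for line in config_text.splitlines()
    )
    # A boot parameter overrides the default; later occurrences win in this sketch.
    for token in cmdline.split():
        if token.startswith("intel_iommu="):
            flags = token.split("=", 1)[1].split(",")
            if "on" in flags:
                enabled = True
            elif "off" in flags:
                enabled = False
    return enabled
```

The two arguments would typically be the contents of the running kernel's config file and of `/proc/cmdline`.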