Bullseye (mostly) not booting on Proliant DL380 G7

Claudio Kuenzler

Jun 28, 2021, 2:40:04 PM
Hello!

I am currently testing the new Bullseye release (using firmware-bullseye-DI-rc2-amd64-netinst.iso) and see a strange phenomenon on an HP ProLiant DL380 G7 server.

During boot, the following messages show up in the console:

[63.063844] pcc_cpufreq_init: Too many CPUs, dynamic performance scaling disabled
[63.063895] pcc_cpufreq_init: Try to enable another scaling driver through BIOS settings
[63.063943] pcc_cpufreq_init: and complain to the system vendor

According to Andreas Herrmann, the settings can be defined in the HP server BIOS:

Power Management -> Advanced Power Options -> Collaborative Power Control = enabled

This is enabled (the default, I believe). The Power Regulator is set to "Dynamic Power Savings Mode".

After these messages show up on the console, no login prompt appears. No network started. The server seems frozen - doesn't even react to CTRL+ALT+DEL on the console anymore. Not sure if this is caused by cpufreq or something else though.

This boot problem happened on 2 out of 3 server boots.

Is this a bug in Bullseye?

Thanks for any hints.

Claudio Kuenzler

Jun 29, 2021, 4:10:05 AM
Meanwhile I was able to find out more by removing "quiet" from the kernel command line in GRUB.
The pcc_cpufreq_init messages do not seem to affect booting; they are just warnings.
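
A minimal sketch of that step, assuming the stock Debian layout where the option lives in GRUB_CMDLINE_LINUX_DEFAULT in /etc/default/grub. The helper name strip_quiet is made up for illustration; it only prints the edited file and writes nothing back.

```shell
#!/bin/sh
# Hypothetical helper (not from the original post): print the given GRUB
# defaults file with the word "quiet" removed from GRUB_CMDLINE_LINUX_DEFAULT.
# The file itself is left untouched.
strip_quiet() {
    sed -E '/^GRUB_CMDLINE_LINUX_DEFAULT=/ {
        s/(["[:space:]])quiet([[:space:]"])/\1\2/
        s/"[[:space:]]+/"/
        s/[[:space:]]+"$/"/
    }' "$1"
}

# Example (read-only preview of the change):
#   strip_quiet /etc/default/grub
```

If the preview looks right, the same change can be made in /etc/default/grub by hand, followed by update-grub as root to regenerate the boot menu.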

The following messages appear on the console before the server freezes:

[ OK ] Finished Load Kernel Module fuse.
[ 62.887855] systemd[1]: Mounting FUSE Control File System...
   Mounting FUSE Control File System...
[ 62.891852] systemd[1]: Finished Apply Kernel Variables.
[ OK ] Finished Apply Kernel Variables.
[ 62.892237] systemd[1]: Mounted FUSE Control File System.
[ OK ] Mounted FUSE Control File System.
[ 62.900668] systemd[1]: Finished Create System Users.
[ OK ] Finished Create System Users.
[ 62.902224] systemd[1]: Starting Create Static Device Nodes in /dev...
  Starting Create Static Device Nodes in /dev...
[ 62.920767] systemd[1]: modp...@drm.service: Succeeded.
[ 62.921202] systemd[1]: Finished Load Kernel Module drm.
[ OK ] Finished Load Kernel Module drm.
[ 62.921979] systemd[1]: Finished Create Static Device Nodes in /dev.
[ OK ] Finished Create Static Device Nodes in /dev.
[ 62.925007] systemd[1]: Starting Rule-based Manager for Device Events and Files...
   Starting Rule-based Manager for Device Events and Files...
[ 62.955322] systemd[1]: Finished Monitoring of LVM2 mirrors, snapshots etc. using dmeventd or progress polling.
[ OK ] Finished Monitoring of LVM2 mirrors, snapshots etc. using dmeventd or progress polling.
[ 62.962186] systemd[1]: Started Rule-based Manager for Device Events and Files.

After this, no further messages, no login prompt, server does not react to keyboard input anymore. Only a hardware reset works in this case.
Out of ~10 server reboots this problem occurred 4 or 5 times.

Could it have something to do with drm? I've seen a drm driver error during an earlier boot phase.

Jun 28 16:15:05 irczsrvp08 kernel: [   63.182074] [drm] radeon kernel modesetting enabled.
Jun 28 16:15:05 irczsrvp08 kernel: [   63.182197] radeon 0000:01:03.0: vgaarb: deactivate vga console
Jun 28 16:15:05 irczsrvp08 kernel: [   63.183720] Console: switching to colour dummy device 80x25
Jun 28 16:15:05 irczsrvp08 kernel: [   63.184088] [drm] initializing kernel modesetting (RV100 0x1002:0x515E 0x103C:0x31FB 0x02).
Jun 28 16:15:05 irczsrvp08 kernel: [   63.184208] radeon 0000:01:03.0: VRAM: 128M 0x00000000E8000000 - 0x00000000EFFFFFFF (64M used)
Jun 28 16:15:05 irczsrvp08 kernel: [   63.184210] radeon 0000:01:03.0: GTT: 512M 0x00000000C8000000 - 0x00000000E7FFFFFF
Jun 28 16:15:05 irczsrvp08 kernel: [   63.184219] [drm] Detected VRAM RAM=128M, BAR=128M
Jun 28 16:15:05 irczsrvp08 kernel: [   63.184220] [drm] RAM width 16bits DDR
Jun 28 16:15:05 irczsrvp08 kernel: [   63.184302] [TTM] Zone  kernel: Available graphics memory: 49487844 KiB
Jun 28 16:15:05 irczsrvp08 kernel: [   63.184304] [TTM] Zone   dma32: Available graphics memory: 2097152 KiB
Jun 28 16:15:05 irczsrvp08 kernel: [   63.184305] [TTM] Initializing pool allocator
Jun 28 16:15:05 irczsrvp08 kernel: [   63.184310] [TTM] Initializing DMA pool allocator
Jun 28 16:15:05 irczsrvp08 kernel: [   63.184333] [drm] radeon: 64M of VRAM memory ready
Jun 28 16:15:05 irczsrvp08 kernel: [   63.184334] [drm] radeon: 512M of GTT memory ready.
Jun 28 16:15:05 irczsrvp08 kernel: [   63.184371] [drm] GART: num cpu pages 131072, num gpu pages 131072
Jun 28 16:15:05 irczsrvp08 kernel: [   63.205645] [drm] PCI GART of 512M enabled (table at 0x00000000FFF00000).
Jun 28 16:15:05 irczsrvp08 kernel: [   63.205890] radeon 0000:01:03.0: WB disabled
Jun 28 16:15:05 irczsrvp08 kernel: [   63.205894] radeon 0000:01:03.0: fence driver on ring 0 use gpu addr 0x00000000c8000000
Jun 28 16:15:05 irczsrvp08 kernel: [   63.205967] [drm] radeon: irq initialized.
Jun 28 16:15:05 irczsrvp08 kernel: [   63.205980] [drm] Loading R100 Microcode
Jun 28 16:15:05 irczsrvp08 kernel: [   63.206233] radeon 0000:01:03.0: firmware: failed to load radeon/R100_cp.bin (-2)
Jun 28 16:15:05 irczsrvp08 kernel: [   63.206241] firmware_class: See https://wiki.debian.org/Firmware for information about missing firmware
Jun 28 16:15:05 irczsrvp08 kernel: [   63.206246] radeon 0000:01:03.0: Direct firmware load for radeon/R100_cp.bin failed with error -2
Jun 28 16:15:05 irczsrvp08 kernel: [   63.206311] [drm:r100_cp_init [radeon]] *ERROR* Failed to load firmware!
Jun 28 16:15:05 irczsrvp08 kernel: [   63.206318] radeon 0000:01:03.0: failed initializing CP (-2).
Jun 28 16:15:05 irczsrvp08 kernel: [   63.206321] radeon 0000:01:03.0: Disabling GPU acceleration
Jun 28 16:15:05 irczsrvp08 kernel: [   63.206329] [drm] radeon: cp finalized
Jun 28 16:15:05 irczsrvp08 kernel: [   63.206961] [drm] No TV DAC info found in BIOS
Jun 28 16:15:05 irczsrvp08 kernel: [   63.206996] [drm] Radeon Display Connectors
Jun 28 16:15:05 irczsrvp08 kernel: [   63.206997] [drm] Connector 0:
Jun 28 16:15:05 irczsrvp08 kernel: [   63.206998] [drm]   VGA-1
Jun 28 16:15:05 irczsrvp08 kernel: [   63.206999] [drm]   DDC: 0x60 0x60 0x60 0x60 0x60 0x60 0x60 0x60
Jun 28 16:15:05 irczsrvp08 kernel: [   63.207000] [drm]   Encoders:
Jun 28 16:15:05 irczsrvp08 kernel: [   63.207001] [drm]     CRT1: INTERNAL_DAC1
Jun 28 16:15:05 irczsrvp08 kernel: [   63.207002] [drm] Connector 1:
Jun 28 16:15:05 irczsrvp08 kernel: [   63.207003] [drm]   VGA-2
Jun 28 16:15:05 irczsrvp08 kernel: [   63.207004] [drm]   DDC: 0x6c 0x6c 0x6c 0x6c 0x6c 0x6c 0x6c 0x6c
Jun 28 16:15:05 irczsrvp08 kernel: [   63.207004] [drm]   Encoders:
Jun 28 16:15:05 irczsrvp08 kernel: [   63.207005] [drm]     CRT2: INTERNAL_DAC2
Jun 28 16:15:05 irczsrvp08 kernel: [   63.236242] kvm: VM_EXIT_LOAD_IA32_PERF_GLOBAL_CTRL does not work properly. Using workaround
Jun 28 16:15:05 irczsrvp08 kernel: [   63.245005] EXT4-fs (dm-0): mounted filesystem with ordered data mode. Opts: (null)
Jun 28 16:15:05 irczsrvp08 kernel: [   63.250269] [drm] fb mappable at 0xE8040000
Jun 28 16:15:05 irczsrvp08 kernel: [   63.250270] [drm] vram apper at 0xE8000000
Jun 28 16:15:05 irczsrvp08 kernel: [   63.250271] [drm] size 1572864
Jun 28 16:15:05 irczsrvp08 kernel: [   63.250271] [drm] fb depth is 16
Jun 28 16:15:05 irczsrvp08 kernel: [   63.250272] [drm]    pitch is 2048

Maybe related to the known bullseye errata https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=989863 ?


Claudio Kuenzler

Jun 29, 2021, 5:10:05 AM
Sorry for replying to myself all the time ;-)
I was just able to catch a "freeze" followed by a successful boot afterwards.

The successful boot continues with these lines:

[   62.922169] systemd[1]: Finished Create System Users.
[   62.923633] systemd[1]: Starting Create Static Device Nodes in /dev...
[   62.941753] systemd[1]: Finished Create Static Device Nodes in /dev.
[   62.944691] systemd[1]: Starting Rule-based Manager for Device Events and Files...
[   62.953082] systemd[1]: modp...@drm.service: Succeeded.
[   62.953539] systemd[1]: Finished Load Kernel Module drm.
[   62.983630] systemd[1]: Started Rule-based Manager for Device Events and Files.
[   62.991307] systemd[1]: Finished Set the console keyboard layout.
[   63.015898] input: Power Button as /devices/LNXSYSTM:00/LNXPWRBN:00/input/input5
[   63.016490] systemd[1]: Finished Coldplug All udev Devices.
[   63.018250] systemd[1]: Starting Helper to synchronize boot up for ifupdown...
[   63.020119] power_meter ACPI000D:00: Found ACPI power meter.
[   63.020214] power_meter ACPI000D:00: Ignoring unsafe software power cap!
[   63.020280] power_meter ACPI000D:00: hwmon_device_register() is deprecated. Please convert the driver to use hwmon_device_register_with_info().
[   63.029971] systemd[1]: Finished Monitoring of LVM2 mirrors, snapshots etc. using dmeventd or progress polling.
[   63.030392] systemd[1]: Reached target Local File Systems (Pre).
[   63.031784] IPMI message handler: version 39.2
[   63.035060] ipmi device interface
[   63.036149] ACPI: Power Button [PWRF]
[   63.038539] EDAC MC1: Giving out device to module i7core_edac.c controller i7 core #1: DEV 0000:3e:03.0 (INTERRUPT)
[   63.038670] EDAC PCI0: Giving out device to module i7core_edac controller EDAC PCI controller: DEV 0000:3e:03.0 (POLLED)
[   63.039204] EDAC MC0: Giving out device to module i7core_edac.c controller i7 core #0: DEV 0000:3f:03.0 (INTERRUPT)
[   63.039315] EDAC PCI1: Giving out device to module i7core_edac controller EDAC PCI controller: DEV 0000:3f:03.0 (POLLED)
[   63.039405] EDAC i7core: Driver loaded, 2 memory controller(s) found.
[   63.044910] ipmi_si: IPMI System Interface driver
[   63.044996] ipmi_si dmi-ipmi-si.0: ipmi_platform: probing via SMBIOS
[   63.045059] ipmi_platform: ipmi_si: SMBIOS: io 0xca2 regsize 1 spacing 1 irq 0
[   63.045134] ipmi_si: Adding SMBIOS-specified kcs state machine
[   63.045263] ipmi_si IPI0001:00: ipmi_platform: probing via ACPI
[   63.045393] ipmi_si IPI0001:00: ipmi_platform: [io  0x0ca2-0x0ca3] regsize 1 spacing 1 irq 0
[   63.045652] iTCO_vendor_support: vendor-support=0
[   63.046504] hpwdt 0000:02:00.0: HPE Watchdog Timer Driver: NMI decoding initialized

This line catches my attention:

[   62.953082] systemd[1]: modp...@drm.service: Succeeded.

This is missing (doesn't show) when the freeze happens.
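
A comparison like this can be done mechanically instead of by eye. The following is a rough sketch, assuming both console captures were saved to text files (the names good.log and frozen.log are hypothetical, as are the function names). It strips the leading kernel timestamp and prints every message from the good boot that never shows up in the frozen one.

```shell
#!/bin/sh
# Remove a leading "[   62.953082] " style timestamp from each line of a file.
strip_ts() {
    sed -E 's/^\[ *[0-9]+\.[0-9]+ *\] *//' "$1"
}

# $1 = log of a successful boot, $2 = log of a frozen boot.
# Prints messages from the good boot that are absent from the frozen boot,
# in the order in which the good boot logged them.
missing_in_frozen() {
    frozen_msgs=$(mktemp)
    strip_ts "$2" > "$frozen_msgs"
    # -F: fixed strings, -x: whole-line match, -v: keep only non-matching lines
    strip_ts "$1" | grep -vxF -f "$frozen_msgs"
    rm -f "$frozen_msgs"
}

# Example:
#   missing_in_frozen good.log frozen.log
```

Everything the good boot logged after the point of the freeze is listed as well, so the first few lines of the output are the interesting ones.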

FYI: in the meantime I also installed firmware-amd-graphics, but the behaviour (sometimes freeze, sometimes boot) is still the same.

I will continue troubleshooting, but if anyone has experienced something similar, has some hints, or can point to existing bugs, please let me know.

Georgi Naplatanov

Jun 29, 2021, 5:50:04 AM
Hi Claudio,

I noticed that kernel logs you posted are between 62nd - 64th second
after kernel loading. Why is the boot process so slow?

If you think that video driver can be an issue then you can try to
configure the system not to use framebuffer (if the system
doesn't use GUI).

Kind regards
Georgi

Claudio Kuenzler

Jun 29, 2021, 6:40:05 AM
Hi Georgi

> I noticed that kernel logs you posted are between 62nd - 64th second
> after kernel loading. Why is the boot process so slow?

Due to a disabled SATA device in BIOS, the kernel tries to do an ERST and SRST and does this until 60s after boot. 
That's OK, it's been the same on Buster, too.
 
> If you think that video driver can be an issue then you can try to
> configure the system not to use framebuffer (if the system
> doesn't use GUI).

Could you tell me how? Or a reference to it?

In the meantime I re-configured grub to boot with the following parameters:
debug rootwait earlyprintk=vga,keep earlycon pause_on_oops=5 panic=60 no_console_suspend

The last two boots now show a crash in the console - even with firmware-amd-graphics and firmware-linux-nonfree installed.

[ 69.546005 ] asm_call_irq_on_stack+0x12/0x20
[ 69.546005 ] </IRQ>
[ 69.546006 ] common_interrupt+0xb0/0x130
[ 69.546006 ] asm_common_interrupt+0x1e/0x40
[ 69.546006 ] RIP: 0010:cpuidle_enter_state+0xc4/0x350
[ 69.546007 ] Code: a2 ff 65 8b 3d 6d 39 f7 7b e8 78 2e a2 ff 49 89 c5 66 66 66 66 90 31 ff e8 09 39 a2 ff 45 84 ff 0f 85 fa 00 00 00 fb 66 66 90 <66> 66 90 45 85 f6 0f 88 06 01 00 00 49 63 c6 4c 2b 2c 24 48 8d 14
[ 69.546008 ] RSP: 0018:ffff9dec062cfea8 EFLAGS: 00000246
[ 69.546008 ] RAX: ffff89266fa2bc00 RBX: 0000000000000004 RCX: 000000000000001f
[ 69.546009 ] RDX: 0000000000000000 RSI: 0000000024eefefa RDI: 0000000000000000
[ 69.546009 ] RBP: ffffbdebff218e00 R08: 0000001022521b20 R09: 0000000000000018
[ 69.546010 ] R10: 00000000000004fa R11: 000000000000006cd R12: ffffffff851ae680
[ 69.546010 ] R13: 0000001022521b20 R14: 00000000000000004 R15: 0000000000000000
[ 69.546010 ] ? cpuidle_enter_state+0xb7/0x350
[ 69.546011 ] cpuidle_enter+0x29/0x40
[ 69.546011 ] do_idle+0x1ef/0x2b0
[ 69.546011 ] cpu_startup_entry+0x19/0x20
[ 69.546012 ] secondary_startup_64_no_verify+0xb0/0xbb
[ 69.546012 ] ---[ end trace 96fbf4be0200356d ]---

And on another crash almost the same but slightly different:

[ 69.331313 ] ? mwait_idle_with_hints.constprop.0+0x4b/0x90
[ 69.331313 ] ? mwait_idle_with_hints.constprop.0+0x4b/0x90
[ 69.331313 ] </NMI>
[ 69.331314 ] intel_idle+0x1f/0x30
[ 69.331314 ] cpuidle_enter_state+0x89/0x350
[ 69.331314 ] cpuidle_enter+0x29/0x40
[ 69.331315 ] do_idle+0x1ef/0x2b0
[ 69.331315 ] cpu_startup_entry+0x19/0x20
[ 69.331315 ] start_kernel+0x587/0x5a8
[ 69.331315 ] secondary_startup_64_no_verify+0xb0/0xbb
[ 69.511534 ] DMAR: [DMA Read] Request device [00:1e.0] PASID ffffffff fault addr 3000 [fault reason 06] PTE Read access is not set
[ 69.511541 ] Kernel Offset: 0x28a00000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)

I have recorded the boot crash: https://youtu.be/TIfX-isjM3E (see between second 47 and 48).

Alexander V. Makartsev

Jun 29, 2021, 7:20:04 AM
On 29.06.2021 15:29, Claudio Kuenzler wrote:
> Hi Georgi
>
> > I noticed that kernel logs you posted are between 62nd - 64th second
> > after kernel loading. Why is the boot process so slow?
>
> Due to a disabled SATA device in BIOS, the kernel tries to do an ERST and SRST and does this until 60s after boot.
> That's OK, it's been the same on Buster, too.
>
> > If you think that video driver can be an issue then you can try to
> > configure the system not to use framebuffer (if the system
> > doesn't use GUI).
>
> Could you tell me how? Or a reference to it?
>
> In the meantime I re-configured grub to boot with the following parameters:
> debug rootwait earlyprintk=vga,keep earlycon pause_on_oops=5 panic=60 no_console_suspend
>
> The last two boots now show a crash in the console - even with firmware-amd-graphics and firmware-linux-nonfree installed.
>
> [ 69.546005 ] asm_call_irq_on_stack+0x12/0x20
> [ 69.546005 ] </IRQ>
> [ 69.546006 ] common_interrupt+0xb0/0x130
> [ 69.546006 ] asm_common_interrupt+0x1e/0x40
> [ 69.546006 ] RIP: 0010:cpuidle_enter_state+0xc4/0x350

The trace dump suggests that the crash occurs while executing the cpuidle code.
Try booting with the "intel_pstate=force" kernel parameter [1] to force a different CPU driver (if the CPU supports it) and/or "cpuidle.off=1" to disable the cpuidle subsystem.

Also, it is good practice to update the BIOS/firmware of a server. [2] Doing this might solve a lot of low-level, close-to-hardware problems like these.
Installing the "intel-microcode" package is also a good idea.
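
Before and after changing these parameters it can help to see which drivers are actually in use. A small sketch, assuming the usual sysfs locations for the cpuidle and cpufreq drivers; the function names are illustrative only.

```shell
#!/bin/sh
# Print "<label>: <content of file>", or "<label>: none" if the file is missing.
show_driver() {  # $1 = label, $2 = sysfs file
    if [ -r "$2" ]; then
        printf '%s: %s\n' "$1" "$(cat "$2")"
    else
        printf '%s: none\n' "$1"
    fi
}

# Report the active cpuidle and cpufreq drivers.
# $1 = sysfs CPU directory (optional, defaults to the standard location).
show_cpu_drivers() {
    root=${1:-/sys/devices/system/cpu}
    show_driver cpuidle "$root/cpuidle/current_driver"
    show_driver cpufreq "$root/cpu0/cpufreq/scaling_driver"
}

# Example:
#   show_cpu_drivers
```

Comparing the output of a boot with and without the extra parameters shows whether they changed anything at all.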


[1] https://www.kernel.org/doc/html/latest/admin-guide/pm/intel_pstate.html?#kernel-command-line-options-for-intel-pstate
[2] https://support.hpe.com/hpesc/public/km/product/4091567/hpe-proliant-dl380-g7-server-models#t=DriversandSoftware&f:@kmswsoftwaretypekey=[swt8000012]
-- 
With kindest regards, Alexander.

⢀⣴⠾⠻⢶⣦⠀ 
⣾⠁⢠⠒⠀⣿⡁ Debian - The universal operating system
⢿⡄⠘⠷⠚⠋⠀ https://www.debian.org
⠈⠳⣄⠀⠀⠀⠀ 

Georgi Naplatanov

Jun 29, 2021, 8:50:04 AM
On 6/29/21 1:29 PM, Claudio Kuenzler wrote:
> Hi Georgi
>
> I noticed that kernel logs you posted are between 62nd - 64th second
> after kernel loading. Why is the boot process so slow?
>
>
> Due to a disabled SATA device in BIOS, the kernel tries to do an ERST
> and SRST and does this until 60s after boot. 
> That's OK, it's been the same on Buster, too.
>  
>
> If you think that video driver can be an issue then you can try to
> configure the system not to use framebuffer (if the system
>  doesn't use GUI).
>
>
> Could you tell me how? Or a reference to it?
>
> In the meantime I re-configured grub to boot with the following parameters:
>
> debug rootwait earlyprintk=vga,keep earlycon pause_on_oops=5 panic=60 no_console_suspend
>

I have not done this myself, but I found this article:

https://atkdinosaurus.wordpress.com/2017/03/23/how-to-almost-disable-framebuffer-in-ubuntu/

Kind regards
Georgi

Claudio Kuenzler

Jun 29, 2021, 4:30:04 PM

> Trace dump suggests that crash occurs while executing cpuidle module.
> Try to boot with "intel_pstate=force" kernel parameter [1] to force different CPU driver (if CPU supports it) and/or "cpuidle.off=1" to disable cpuidle subsystem.


Thank you Alexander and Georgi (thanks for the link!) for your answers. I highly appreciate it!

I have tried the additional kernel parameters intel_pstate=force and cpuidle.off=1, but unfortunately this didn't solve the problem. The freeze still happened on around 50% of the boots.

I have now wiped Bullseye and installed Buster. The very same server was rebooted at least 10 times without any hiccup/freeze/crash.
It seems there are indeed some major issues which are not solved yet. If they come from the Debian Installer, they may be related to bug #987441 (https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=987441).
Or if they are kernel-related (Buster uses 4.19, Bullseye 5.10), it might be a completely different problem.

Michael Stone

Jun 29, 2021, 4:40:04 PM
On Tue, Jun 29, 2021 at 11:08:09AM +0200, Claudio Kuenzler wrote:
>This line catches my attention:
>
>[   62.953082] systemd[1]: modp...@drm.service: Succeeded.
>
>This is missing (doesn't show) when the freeze happens.

I tend to suspect it's unrelated, but if you add "nomodeset nofb" to
your boot command line it will turn off the graphics drivers.
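
To confirm that parameters such as these really reached the kernel after a reboot, the running command line can be inspected. A minimal sketch, assuming /proc/cmdline is readable; has_param is a made-up helper name.

```shell
#!/bin/sh
# Succeed if the exact word $1 appears on the kernel command line.
# $2 = file to read (optional, defaults to /proc/cmdline).
has_param() {
    # Put each parameter on its own line, then look for an exact whole-line match.
    tr -s ' \t' '\n\n' < "${2:-/proc/cmdline}" | grep -qxF -- "$1"
}

# Example:
#   has_param nomodeset && echo "nomodeset is active"
```

Because the match is on whole words, a parameter such as "nomodeset" is not confused with a longer one that merely contains it.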

Claudio Kuenzler

Jun 30, 2021, 2:10:05 AM
> I tend to suspect it's unrelated, but if you add "nomodeset nofb" to
> your boot command line it will turn off the graphics drivers.

Yes, I guess it is indeed unrelated. With buster I can see the same messages during boot:
- *ERROR* Failed to load firmware! on drm
- pcc_cpufreq_init: Too many CPUs, dynamic performance scaling disabled

But no crash happens, Buster boots correctly (every time) and I can use the system.
I currently suspect a Kernel bug in 5.10. I might be dead wrong, but that's my current guess.


 

Paul Wise

Jun 30, 2021, 4:00:05 AM
Claudio Kuenzler wrote:

> I currently suspect a Kernel bug in 5.10.

You could try booting a bullseye install with the buster kernel,
if that works then it sounds like you need a kernel git bisect.

https://wiki.debian.org/DebianKernel/GitBisect

--
bye,
pabs

https://wiki.debian.org/PaulWise

Claudio Kuenzler

Sep 16, 2021, 5:20:04 AM
On Wed, Jun 30, 2021 at 9:51 AM Paul Wise <pa...@debian.org> wrote:
> Claudio Kuenzler wrote:
>
> > I currently suspect a Kernel bug in 5.10.

Thanks to everyone for hints and suggestions!
In the end it turned out to be an issue with the hpwdt module. After blacklisting this module, no further boot or stability issues were seen with Bullseye.
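
The post does not say how the module was blacklisted. One common way on Debian is a modprobe.d fragment; the sketch below assumes that approach, and both the helper name and the file name blacklist-hpwdt.conf are arbitrary choices.

```shell
#!/bin/sh
# Hypothetical helper: write a modprobe.d fragment that blacklists one module.
# $1 = module name
# $2 = target directory (optional, defaults to /etc/modprobe.d, which needs root)
blacklist_module() {
    dir=${2:-/etc/modprobe.d}
    printf 'blacklist %s\n' "$1" > "$dir/blacklist-$1.conf"
}

# Example (as root):
#   blacklist_module hpwdt
#   update-initramfs -u    # so an initramfs copy of the config is refreshed too
```

After a reboot, "lsmod | grep hpwdt" printing nothing indicates that the module was not loaded automatically.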

Linux-Fan

Sep 16, 2021, 6:30:05 AM
Thanks for sharing and digging up all that information. I found the article
worth reading despite not having had this issue - a nicely structured
approach to tackle such problems!

Linux-Fan