
Re: Current panic on boot on H11DSI motherboard with epyc cpu


Vitalij Satanivskij

Apr 16, 2018, 4:30:39 PM
Oh, the BIOS.
It's already the latest BIOS available, with AGESA 1.0.0.5 in it.
It's dated 2/14/2018, so most likely a new version will not appear soon.


Stephen Hurd wrote:
SH> Yeah, this looks like some sort of general MSI issue, not igb specific.
SH> I'm not familiar with that part of the kernel, but maybe check if there's a
SH> BIOS update available?
SH>
SH> On Mon, Apr 16, 2018 at 3:51 PM, Vitalij Satanivskij <sa...@ukr.net> wrote:
SH>
SH> > Dear Stephen
SH> >
SH> > I disabled msix on both igb0 and igb1
SH> > and enabled HPET in the BIOS.
SH> >
SH> > I get an hpet_attach panic: http://hell.ukr.net/panic/recorder_hpet.webm
SH> > So I disabled HPET again and got a panic in msi_alloc and so on:
SH> > http://hell.ukr.net/panic/recorder_msi.webm
SH> >
SH> > So as a test I set hw.pci.enable_msi=0 and got a panic in ccp_hw_attach,
SH> > which is autoloaded later while the system runs the rc scripts.
SH> >
SH> > The panic is here: http://hell.ukr.net/panic/recorder_ccp.webm
SH> >
SH> > To me it looks like some kind of resource management problem?
SH> >
SH> >
SH> > Stephen Hurd wrote:
SH> > SH> If you disable msix just for igb0, does it crash somewhere else?
SH> > SH>
SH> > SH> On Mon, Apr 16, 2018 at 3:13 PM, Stephen Hurd <sh...@llnw.com> wrote:
SH> > SH>
SH> > SH> > Oh, you may need to disable msix to boot...
SH> > SH> >
SH> > SH> > dev.igb.0.iflib.disable_msix=1
SH> > SH> >
SH> > SH> > On Mon, Apr 16, 2018 at 3:02 PM, Stephen Hurd <sh...@llnw.com> wrote:
SH> > SH> >
SH> > SH> >> Hrm, it should be trying to allocate three msi-x vectors there, and it
SH> > SH> >> appears that it's reported that 10 are available. What's the output of
SH> > SH> >> ``pciconf -lcv pci1:0:0''?
SH> > SH> >>
SH> > SH> >> On Mon, Apr 16, 2018 at 1:27 PM, Conrad Meyer <c...@freebsd.org> wrote:
SH> > SH> >>
SH> > SH> >>> Hi Vitalij,
SH> > SH> >>>
SH> > SH> >>> On Mon, Apr 16, 2018 at 3:27 AM, Vitalij Satanivskij <sa...@ukr.net> wrote:
SH> > SH> >>> > DUMP can be found here http://hell.ukr.net/panic/panic.jpg
SH> > SH> >>> > or even video record from screen http://hell.ukr.net/panic/recorder.webm
SH> > SH> >>>
SH> > SH> >>> Looks like the panic message is printed directly after: "igb0: using 2
SH> > SH> >>> rx queues 2 tx queues" (iflib_msix_init(), called by
SH> > SH> >>> iflib_device_register()).
SH> > SH> >>>
SH> > SH> >>> And stack is indeed coming from iflib in probe (0:17 in linked video):
SH> > SH> >>>
SH> > SH> >>> panic()
SH> > SH> >>> nexus_add_irq()
SH> > SH> >>> msix_alloc()
SH> > SH> >>> pci_alloc_msix_method()
SH> > SH> >>> iflib_device_register()
SH> > SH> >>> iflib_device_attach()
SH> > SH> >>> device_attach()
SH> > SH> >>> ...
SH> > SH> >>>
SH> > SH> >>> Stephen, Matt, or Sean might be able to help diagnose further.
SH> > SH> >>>
SH> > SH> >>> Best,
SH> > SH> >>> Conrad
SH> > SH> >>>
SH> > SH> >>
SH> > SH> >>
SH> > SH> >>
SH> > SH> >> --
SH> > SH> >> Stephen Hurd, Principal Engineer
SH> > SH> >> Limelight Networks
SH> > SH> >> +1 616 848 0643
SH> > SH> >> www.limelight.com
SH> > SH> >>
_______________________________________________
freebsd...@freebsd.org mailing list
https://lists.freebsd.org/mailman/listinfo/freebsd-hackers
To unsubscribe, send any mail to "freebsd-hacke...@freebsd.org"

John Baldwin

Apr 17, 2018, 12:35:35 PM
On Monday, April 16, 2018 10:12:13 PM Vitalij Satanivskij wrote:
>
> igb0@pci0:1:0:0: class=0x020000 card=0x152115d9 chip=0x15218086 rev=0x01 hdr=0x00
> vendor = 'Intel Corporation'
> device = 'I350 Gigabit Network Connection'
> class = network
> subclass = ethernet
> cap 01[40] = powerspec 3 supports D0 D3 current D0
> cap 05[50] = MSI supports 1 message, 64 bit, vector masks
> cap 11[70] = MSI-X supports 10 messages
> Table in map 0x1c[0x0], PBA in map 0x1c[0x2000]
> cap 10[a0] = PCI-Express 2 endpoint max data 512(512) FLR RO NS
> link x4(x4) speed 5.0(5.0) ASPM L1(L0s/L1)
> ecap 0001[100] = AER 2 0 fatal 0 non-fatal 1 corrected
> ecap 0003[140] = Serial 1 ac1f6bffff620e0c
> ecap 000e[150] = ARI 1
> ecap 0010[160] = SR-IOV 1 IOV disabled, Memory Space disabled, ARI disabled
> 0 VFs configured out of 8 supported
> First VF RID Offset 0x0180, VF RID Stride 0x0004
> VF Device ID 0x1520
> Page Sizes: 4096 (enabled), 8192, 65536, 262144, 1048576, 4194304
> ecap 0017[1a0] = TPH Requester 1
> ecap 0018[1c0] = LTR 1
> ecap 000d[1d0] = ACS 1
>
> This info is from the system booted with HPET disabled and
> hw.pci.enable_msix: 0
> hw.pci.enable_msi: 0
>
> If either of these parameters is not set as described, the system does not boot ^(

Please try the patch from here https://reviews.freebsd.org/P165

--
John Baldwin

Vitalij Satanivskij

Apr 17, 2018, 3:21:10 PM
Dear John

I tried the patch, with no success:

http://hell.ukr.net/panic/recorder_patch165.webm

I also enabled verbose boot and recorded the boot process (HPET was disabled, so the crash is in another driver's attach):
http://hell.ukr.net/panic/recorder_patch_verbose.webm

root@test:/usr/src # svnlite diff
Index: sys/x86/x86/msi.c
===================================================================
--- sys/x86/x86/msi.c (revision 332650)
+++ sys/x86/x86/msi.c (working copy)
@@ -404,7 +404,7 @@
 	/* Do we need to create some new sources? */
 	if (cnt < count) {
 		/* If we would exceed the max, give up. */
-		if (i + (count - cnt) > FIRST_MSI_INT + NUM_MSI_INTS) {
+		if (i + (count - cnt) >= FIRST_MSI_INT + NUM_MSI_INTS) {
 			mtx_unlock(&msi_lock);
 			free(mirqs, M_MSI);
 			return (ENXIO);
@@ -645,7 +645,7 @@
 	/* Do we need to create a new source? */
 	if (msi == NULL) {
 		/* If we would exceed the max, give up. */
-		if (i + 1 > FIRST_MSI_INT + NUM_MSI_INTS) {
+		if (i + 1 >= FIRST_MSI_INT + NUM_MSI_INTS) {
 			mtx_unlock(&msi_lock);
 			return (ENXIO);
 		}
root@test:/usr/src

If you need any additional information, please let me know.



JB> > If either of these parameters is not set as described, the system does not boot ^(
JB>
JB> Please try the patch from here https://reviews.freebsd.org/P165
JB>
JB> --
JB> John Baldwin

Vitalij Satanivskij

Apr 18, 2018, 7:00:43 AM
JB> > If you need any additional information, please let me know.
JB>
JB> Can you perhaps turn off the stack trace on boot to not lose the panic messages
JB> (remove KDB_TRACE from kernel config) and maybe modify the panic message to
JB> include the IRQ number passed to nexus_add_irq?


Hm, it looks like it's always the IRQ with number 256,
e.g. hpet - 256
igb - 256

The change made for this was:

Index: sys/x86/x86/nexus.c
===================================================================
--- sys/x86/x86/nexus.c (revision 332663)
+++ sys/x86/x86/nexus.c (working copy)
@@ -698,7 +698,7 @@
 {

 	if (rman_manage_region(&irq_rman, irq, irq) != 0)
-		panic("%s: failed", __func__);
+		panic("%s: failed irq is: %lu", __func__, irq);
 }

John Baldwin

Apr 18, 2018, 2:46:30 PM
On Wednesday, April 18, 2018 01:56:49 PM Vitalij Satanivskij wrote:
> JB> > If you need any additional information, please let me know.
> JB>
> JB> Can you perhaps turn off the stack trace on boot to not lose the panic messages
> JB> (remove KDB_TRACE from kernel config) and maybe modify the panic message to
> JB> include the IRQ number passed to nexus_add_irq?
>
>
> Hm, it looks like it's always the IRQ with number 256,
> e.g. hpet - 256
> igb - 256
>
> The change made for this was:
>
> Index: sys/x86/x86/nexus.c
> ===================================================================
> --- sys/x86/x86/nexus.c (revision 332663)
> +++ sys/x86/x86/nexus.c (working copy)
> @@ -698,7 +698,7 @@
>  {
>
> 	if (rman_manage_region(&irq_rman, irq, irq) != 0)
> -		panic("%s: failed", __func__);
> +		panic("%s: failed irq is: %lu", __func__, irq);
>  }

Ohhhh, this is a different issue. Sorry. As a hack, try changing
'FIRST_MSI_INT' to 512 in sys/amd64/include/intr_machdep.h. The issue
is that some systems now include more than 256 interrupt pins on I/O
APICs, so IRQ 256 is already reserved for use by one of those
interrupt pins. The real fix is that I need to make FIRST_MSI_INT
dynamic instead of a constant and just define it as the first free IRQ
after the I/O APICs have probed.

--
John Baldwin
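
To illustrate the explanation above, here is a minimal toy model in C. It is not FreeBSD kernel code: every name in it (toy_add_irq, toy_register_ioapic_pins, TOY_MAX_IRQS) is hypothetical, and it assumes the I/O APIC pins are simply given IRQs 0 .. npins-1. It only sketches why a fixed first MSI IRQ of 256 collides once there are more than 256 pins, and how using the first free IRQ after the I/O APICs have been registered would avoid that.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model only: none of these names exist in the FreeBSD kernel. */
#define TOY_MAX_IRQS 2048

static bool toy_irq_managed[TOY_MAX_IRQS];

/*
 * Register a single IRQ with the toy manager.
 * Returns 0 on success, -1 if the IRQ is out of range or already
 * registered (the kind of failure that made nexus_add_irq() panic
 * in this thread).
 */
static int
toy_add_irq(unsigned long irq)
{
	if (irq >= TOY_MAX_IRQS || toy_irq_managed[irq])
		return (-1);
	toy_irq_managed[irq] = true;
	return (0);
}

/*
 * Register IRQs 0 .. npins-1 for the I/O APIC pins (an assumed
 * numbering) and return npins, which is the first free IRQ number
 * as long as nothing above the pins has been registered yet.
 */
static unsigned long
toy_register_ioapic_pins(unsigned long npins)
{
	unsigned long irq;

	assert(npins <= TOY_MAX_IRQS);
	for (irq = 0; irq < npins; irq++)
		(void)toy_add_irq(irq);
	return (npins);
}
```

With an assumed pin count of 320, toy_register_ioapic_pins(320) returns 320. A later toy_add_irq(256) then fails because 256 already belongs to a pin, while toy_add_irq(320) succeeds. The pin count is made up for the example; the thread does not say how many pins the H11DSI board actually exposes.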

Kevin Day

May 18, 2018, 7:10:46 PM

> On Apr 18, 2018, at 1:42 PM, John Baldwin <j...@freebsd.org> wrote:
>>
>> The change made for this was:
>>
>> Index: sys/x86/x86/nexus.c
>> ===================================================================
>> --- sys/x86/x86/nexus.c (revision 332663)
>> +++ sys/x86/x86/nexus.c (working copy)
>> @@ -698,7 +698,7 @@
>>  {
>>
>> 	if (rman_manage_region(&irq_rman, irq, irq) != 0)
>> -		panic("%s: failed", __func__);
>> +		panic("%s: failed irq is: %lu", __func__, irq);
>>  }
>
> Ohhhh, this is a different issue. Sorry. As a hack, try changing
> 'FIRST_MSI_INT' to 512 in sys/amd64/include/intr_machdep.h. The issue
> is that some systems now include more than 256 interrupt pins on I/O
> APICs, so IRQ 256 is already reserved for use by one of those
> interrupt pins. The real fix is that I need to make FIRST_MSI_INT
> dynamic instead of a constant and just define it as the first free IRQ
> after the I/O APICs have probed.

I'm testing a very large AMD Epyc system, and I had to change FIRST_MSI_INT to 768, but that fixed this issue for me.