> On Fri, Jul 9, 2010 at 6:53 PM, Markus Gebert <markus...@hostpoint.ch> wrote:
> [snip]
>
> Yes, this hardware comes from Sun directly, but getting Sun (/Oracle) support for this issue is going to be tough. FreeBSD is unsupported, and in a short test we couldn't reproduce the problem with a Linux kernel. While I agree that a hardware issue has always been and still is a possibility to be considered, the fact remains that we tested this on two machines, and that 6.x and 7.x do not show the behavior. Another possibility is, of course, that the X4100 is prone to such issues and 6.x and 7.x somehow have workarounds we're not aware of, or just do something differently so that the issue does not get triggered.
>
>
> 8.1 is our first release to have the driver for configuring and reporting machine check exceptions enabled by default. Prior to 8.1, you had to explicitly enable the driver at boot time.
I was aware of that, but I don't think it's the cause. Disabling MCA just makes the reporting go away, but the MCE and subsequent fatal trap remain. With default BIOS settings, the OS does not even get a chance to panic; the system just forces a reset before the OS can do anything. And, as far as I can tell, that did not happen on previous stable branches.
Don't know though whether MCA changes the situation even when disabled in loader.conf (hw.mca.enabled=0). I just checked our 7.2 setup, and MCA does not seem to be in a 7.2 kernel, so I guess this was added in 8.0 and activated by default in 8.1. To be honest, we did not check whether 8.0 shows the same behavior, but I guess running 8.1 with hw.mca.enabled=0 should pretty much give the same situation as far as MCA is concerned.
Is there a way to get rid of MCA completely? (as opposed to just "turning it off" via loader.conf)
Markus
_______________________________________________
freebsd...@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-stable
To unsubscribe, send any mail to "freebsd-stabl...@freebsd.org"
Hmm, with MCA disabled in the loader you should not be getting any MCEs at all,
as we don't enable the MCE interrupt in the CPU in that case. Are you
disabling it in the BIOS rather than loader.conf?
> Don't know though whether MCA changes the situation even when disabled in
> loader.conf (hw.mca.enabled=0). I just checked our 7.2 setup, and MCA does not
> seem to be in a 7.2 kernel, so I guess this was added in 8.0 and activated by
> default in 8.1. To be honest, we did not check whether 8.0 shows the same
> behavior, but I guess running 8.1 with hw.mca.enabled=0 should pretty much
> give the same situation as far as MCA is concerned.
7.3 has MCA support, but disabled by default.
> Is there a way to get rid of MCA completely? (as opposed to just "turning it
> off" via loader.conf)
Turning it off in loader.conf does get rid of it completely as it prevents us
from initializing the MSRs.
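For reference, this is the tunable in question; as mentioned above it has to be set at boot via loader.conf, since the MSRs are programmed during kernel startup:

```
# /boot/loader.conf -- disable machine check support before the
# kernel initializes the machine check MSRs at boot
hw.mca.enabled=0
```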
--
John Baldwin
Ok, so maybe the SMI# interrupts do play a role somehow, at least as far as
altering the timing.
> Panic message:
>
> ----
> MCA: Bank 4, Status 0xb400004000030c2b
> MCA: Global Cap 0x0000000000000105, Status 0x0000000000000007
> MCA: Vendor "AuthenticAMD", ID 0x40f13, APIC ID 2
> MCA: CPU 2 UNCOR BUSLG Observer WR I/O
> MCA: Address 0xfd00000000
> panic: blockable sleep lock (sleep mutex) 128 @ /usr/src/sys/vm/uma_core.c:1992
> cpuid = 2
> KDB: enter: panic
> [thread pid 12 tid 100039 ]
> Stopped at kdb_enter+0x3d: movq $0,0x69ccb0(%rip)
> ----
>
> Don't know why it's not a fatal trap 28 this time even though an MCE was
> detected. Seen this before though, also with kernels that have ehci and with
> usb legacy support, so seeing a different panic this time seems not related to
> the way the kernel was configured. Maybe a symptom? Or may it even be useful?
> If yes, what should I pull out of DDB?
>
> In the meantime, I'll try harder to reproduce the MCE on current...
Well, it panic'd trying to malloc something in a non-safe place, because the
machine check can happen at any time, like an NMI. The panic was caused by the
MCE, however.
>> Well, the situation has changed. Machine died over the weekend running our
>> test load with above kernel configuration. It seems that not having ehci in
>> the kernel at boot just makes the MCE much more unlikely to occur, but it
>> occurs. With ehci, I can panic the machine within a minute, without ehci it
>> seems to take at least hours. Still, I don't get why not having the ehci
>> driver in the kernel should have any effect, especially because nothing is
>> attached to it.
>
> Ok, so maybe the SMI# interrupts do play a role somehow, at least as far as
> altering the timing.
Hm, if I've understood your other email correctly, disabling usb legacy support should get rid of SMIs just as well as loading the ehci driver does. What I tested was a kernel with ehci (panic within a minute) versus a kernel without ehci (panic within hours), but in both cases with usb legacy support disabled in the BIOS. So, again, if I understand this correctly, the SMI rate should have been the same in both cases, because usb legacy support was turned off entirely, and loading or not loading ehci should therefore not affect it. If that's the case, why would the timing differ between these two test cases?
Since SMM is out of the OS's control, I guess there's no good way to track SMIs?
Markus
> Hmm, with MCA disabled in the loader you should not be getting any MCEs at all,
> as we don't enable the MCE interrupt in the CPU in that case. Are you
> disabling it in the BIOS rather than loader.conf?
I disabled it in loader.conf. Just tested again, and you're right: it just reboots in that case.
> 7.3 has MCA support, but disabled by default.
IIRC we also tested 7.3 without being able to reproduce, but I'm not sure (didn't do all the tests myself).
But I guess we can rule out MCA at this point, since we just get a forced reboot instead of a panic, right?
Markus
Yes.
--
John Baldwin
Oh, I didn't know that USB legacy support was disabled in both cases. That
should disable all the SMIs in both cases as you say.
Are you using Cx states other than C1 for the CPUs at all?
> Are you using Cx states other than C1 for the CPUs at all?
Not sure how to find out, but I did not change anything in the BIOS settings (if even possible) or through sysctl regarding cpu idle modes. Anyway, here's what I found:
# sysctl machdep.idle machdep.idle_available
machdep.idle: amdc1e
machdep.idle_available: spin, amdc1e, hlt, acpi,
Not sure if "amdc1e" qualifies for something "other than C1". I tried "hlt" once, which didn't make a difference IIRC. And if that's not what you needed, here's more:
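(For the record, switching the idle method is a one-liner; the valid values are the ones machdep.idle_available lists above:

```
# try plain HLT instead of AMD C1E (revert with machdep.idle=amdc1e)
sysctl machdep.idle=hlt
```

This only changes the running system; it would need an /etc/sysctl.conf entry to survive a reboot.)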
# sysctl dev.cpu
dev.cpu.0.%desc: ACPI CPU
dev.cpu.0.%driver: cpu
dev.cpu.0.%location: handle=\_PR_.P001
dev.cpu.0.%pnpinfo: _HID=none _UID=0
dev.cpu.0.%parent: acpi0
dev.cpu.0.freq: 2786
dev.cpu.0.freq_levels: 2786/95000 2587/81800 2388/69811 2189/58977 1990/49240 1791/44316 995/22525
dev.cpu.0.cx_supported: C1/0
dev.cpu.0.cx_lowest: C1
dev.cpu.0.cx_usage: 100.00% last 500us
dev.cpu.1.%desc: ACPI CPU
dev.cpu.1.%driver: cpu
dev.cpu.1.%location: handle=\_PR_.P002
dev.cpu.1.%pnpinfo: _HID=none _UID=0
dev.cpu.1.%parent: acpi0
dev.cpu.1.cx_supported: C1/0
dev.cpu.1.cx_lowest: C1
dev.cpu.1.cx_usage: 100.00% last 500us
dev.cpu.2.%desc: ACPI CPU
dev.cpu.2.%driver: cpu
dev.cpu.2.%location: handle=\_PR_.P003
dev.cpu.2.%pnpinfo: _HID=none _UID=0
dev.cpu.2.%parent: acpi0
dev.cpu.2.cx_supported: C1/0
dev.cpu.2.cx_lowest: C1
dev.cpu.2.cx_usage: 100.00% last 500us
dev.cpu.3.%desc: ACPI CPU
dev.cpu.3.%driver: cpu
dev.cpu.3.%location: handle=\_PR_.P004
dev.cpu.3.%pnpinfo: _HID=none _UID=0
dev.cpu.3.%parent: acpi0
dev.cpu.3.cx_supported: C1/0
dev.cpu.3.cx_lowest: C1
dev.cpu.3.cx_usage: 100.00% last 500us
Markus
cx_supported indicates your CPU only supports C1 and not lower
power-saving states (C2/C3/C4, etc.). Non-C1 states can sometimes do
"interesting" things when it comes to interrupt handling. I believe
your system may support the C1E state (given what machdep.idle_available
shows), but that's often controlled by the system BIOS (on both Intel
and AMD processors, but I'm trying to focus on AMD here). C1E, as far
as I know, is the same as the C1 state except it can save a little bit more
power.
I believe neither C1 nor C1E do anything with interrupts, instead just
halting the core when idle/not in use. HLT mode, at least on multi-core
AMD CPUs, equates to C1E.
Shot in the dark: you're not running powerd(8) on this system are you?
--
| Jeremy Chadwick j...@parodius.com |
| Parodius Networking http://www.parodius.com/ |
| UNIX Systems Administrator Mountain View, CA, USA |
| Making life hard for others since 1977. PGP: 4BD6C0CB |
> cx_supported indicates your CPU only supports C1 and not lower
> power-saving states (C2/C3/C4, etc.). Non-C1 states can sometimes do
> "interesting" things when it comes to interrupt handling. I believe
> your system may support the C1E state (given what machdep.idle_available
> shows), but that's often controlled by the system BIOS (on both Intel
> and AMD processors, but I'm trying to focus on AMD here). C1E, as far
> as I know, is the same as the C1 state except it can save a little bit more
> power.
>
> I believe neither C1 nor C1E do anything with interrupts, instead just
> halting the core when idle/not in use. HLT mode, at least on multi-core
> AMD CPUs, equates to C1E.
I see.
> Shot in the dark: you're not running powerd(8) on this system are you?
No, I'm not. But once, in our long series of trial & error, I tried enabling it, just to see whether it would trigger something. It didn't, but the system was not loaded at that time.
But I just remembered that I once tried to reproduce the problem with kern.smp.disabled=1 in loader.conf, and with the test load running only on the BSP, the problem did not seem to occur. Don't know if this is of any help though.
Markus
> I also get MCE on x4100m2 when causing significant disk activity in mpt
> while also downloading through em0 or em1.
Could you reproduce this on 6.x or 7.x? Because whatever we try here, we simply couldn't so far. A short test with Ubuntu also didn't show any sign of problems.
> I was not able to trigger it
> while using nfe, however nfe locked up on me during normal DNS server
> traffic so that was a wash.
We had issues with nfe pre-8.x; that's why we have been using the em NICs, which seem to be part of the problem now in 8.x.
> What seemed to work for me was to add an
> Intel PCIE nic to the server and use it instead of the onboard NICS.
Thanks for the hint.
> For whatever reason I never experienced this problem until using ZFS.
We were able to reproduce it with UFS on 8.x with just one disk (no gmirror), but I guess it's easier to trigger with ZFS, especially in a mirror setup.
> I triggered it by downloading a 200m tgz file via http repeatedly
> over gigabit and it would reliably crash within a minute or two.
Our test case is basically:
1. fetch a large file using wget over em0 (a 100 Mbit link seems to be enough)
2. cp a large file locally to stress mpt
3. wait for MCE
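In case it helps anyone trying to reproduce this, here's a rough sketch of that recipe as a script, using fetch(1) from the base system instead of wget. The URL, the local file path, and the iteration count are placeholders, not anything we actually use:

```shell
#!/bin/sh
# Sketch of the reproduction recipe above (placeholder URL/paths).
repro() {
    url=$1       # large file reachable over em0 (100 Mbit seems enough)
    src=$2       # large local file, big enough to keep mpt busy
    n=${3:-100}  # bound the loop instead of running forever

    i=0
    while [ "$i" -lt "$n" ]; do
        fetch -q -o /dev/null "$url" &   # 1. stress em0 in the background
        cp "$src" /tmp/mce-repro.$$      # 2. stress mpt with a local copy
        rm -f /tmp/mce-repro.$$
        wait                             # 3. still alive? go again
        i=$((i + 1))
    done
}

# example (placeholders): repro http://somehost/big.tgz /var/tmp/big.file
```

On our machines the MCE usually fires well before the loop finishes.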
> I ordered a dozen nics for probably around $20 each and was satisfied
> with this workaround given the age of the servers. I'm pinched on time
> for work so I often don't get around to reporting issues where I've
> found a workaround, I'm glad you can get that started.
"Glad" we're not the only ones :-)
Markus
> In the meantime, I'll try harder to reproduce the MCE on current...
Well, I can't. It's been running the test load for almost 24 hours without any sign of problems. The kernel config is the same as under 8.1, GENERIC+ipfw (USB not excluded!). Our 8.1 test config includes the CURRENT debug options, so as far as I can tell we should be testing under the same conditions. Yet I don't know whether CURRENT includes any additional debug magic I'm not aware of, which could slow the system down and prevent the issue from occurring.
Another difference from our primary 8.1 test machine is that we didn't use ZFS on CURRENT. But since we are able to crash 8.1 with UFS, I don't think this will matter. Still, I'm going to try to put together a CURRENT ZFS setup on the second machine.
But, so far, bottom line seems to be:
- no problems on 6.x, 7.x and CURRENT
- no problems on Linux (that was only a short test though)
- crash within a minute on 8.1 (with ehci in the kernel)
Unfortunately, I have not been able to get anything useful out of the svn commit logs that could explain this. Maybe someone else has an idea what could have changed between 7 and 8 to break it, and again between 8 and CURRENT to magically fix it again.
Markus