8.0RC2 amd64 - kernel panic running make buildworld

Kai Gallasch

Oct 31, 2009, 6:15:45 PM

Hi.

I installed 8.0RC2-amd64 on an 8-core opteron server a few days ago.

When I try to do a make buildworld or make buildkernel the server
reboots without any message left in the logs. The same happens
when building bigger ports (for example ruby18 or perl58).

With 8.0-RC2 debug flags and witness seem to be disabled in the
standard GENERIC kernel, so unfortunately it is not possible for me to
build a debug kernel without my server crashing..

Now my idea was to install the old 8.0-BETA4 and upgrade to RC2 through
makeworld + buildkernel (gdb+witness). But no luck. When trying to
upgrade to RC2 the 8.0-BETA4 also crashes. At least 8.0-BETA4 has debug
+ witness active in the GENERIC kernel..

So below is some debug output from 8.0-BETA4 crashing. Has the vfs/ffs LOR
problem in BETA4 already been fixed?

Does it make sense to send in a PR against the old 8.0-BETA4?

BTW. I installed 7.2-STABLE on this same server and did a "make
buildworld" and "make buildkernel" which completed without any problem.

Cheers,
--Kai

----- make buildworld -j7 crash, freebsd 8.0-amd64-beta4 -----

lock order reversal:
 1st 0xffffff00073d5ba8 ufs (ufs) @ /usr/src/sys/ufs/ffs/ffs_snapshot.c:423
 2nd 0xffffff819d921558 bufwait (bufwait) @ /usr/src/sys/kern/vfs_bio.c:2559
 3rd 0xffffff00070c19d0 ufs (ufs) @ /usr/src/sys/ufs/ffs/ffs_snapshot.c:544
KDB: stack backtrace:
db_trace_self_wrapper() at db_trace_self_wrapper+0x2a
_witness_debugger() at _witness_debugger+0x2e
witness_checkorder() at witness_checkorder+0x81e
__lockmgr_args() at __lockmgr_args+0xcf3
ffs_lock() at ffs_lock+0x8c
VOP_LOCK1_APV() at VOP_LOCK1_APV+0x9b
_vn_lock() at _vn_lock+0x47
ffs_snapshot() at ffs_snapshot+0x1b9d
ffs_mount() at ffs_mount+0x666
vfs_donmount() at vfs_donmount+0xcde
nmount() at nmount+0x63
syscall() at syscall+0x1af
Xfast_syscall() at Xfast_syscall+0xe1
--- syscall (378, FreeBSD ELF64, nmount), rip = 0x8007b14fc, rsp = 0x7fffffffe9b8, rbp = 0x800902530 ---
lock order reversal:
 1st 0xffffff819d921558 bufwait (bufwait) @ /usr/src/sys/kern/vfs_bio.c:2559
 2nd 0xffffff0007d9fa30 snaplk (snaplk) @ /usr/src/sys/ufs/ffs/ffs_snapshot.c:793
KDB: stack backtrace:
db_trace_self_wrapper() at db_trace_self_wrapper+0x2a
_witness_debugger() at _witness_debugger+0x2e
witness_checkorder() at witness_checkorder+0x81e
__lockmgr_args() at __lockmgr_args+0xcf3
ffs_lock() at ffs_lock+0x8c
VOP_LOCK1_APV() at VOP_LOCK1_APV+0x9b
_vn_lock() at _vn_lock+0x47
ffs_snapshot() at ffs_snapshot+0x1a6a
ffs_mount() at ffs_mount+0x666
vfs_donmount() at vfs_donmount+0xcde
nmount() at nmount+0x63
syscall() at syscall+0x1af
Xfast_syscall() at Xfast_syscall+0xe1
--- syscall (378, FreeBSD ELF64, nmount), rip = 0x8007b14fc, rsp = 0x7fffffffe9b8, rbp = 0x800902530 ---
lock order reversal:
 1st 0xffffff0007d9fa30 snaplk (snaplk) @ /usr/src/sys/kern/vfs_vnops.c:296
 2nd 0xffffff00073d5ba8 ufs (ufs) @ /usr/src/sys/ufs/ffs/ffs_snapshot.c:1587
KDB: stack backtrace:
db_trace_self_wrapper() at db_trace_self_wrapper+0x2a
_witness_debugger() at _witness_debugger+0x2e
witness_checkorder() at witness_checkorder+0x81e
__lockmgr_args() at __lockmgr_args+0xcf3
ffs_snapremove() at ffs_snapremove+0xe7
softdep_releasefile() at softdep_releasefile+0x139
ufs_inactive() at ufs_inactive+0x1a5
vinactive() at vinactive+0x72
vput() at vput+0x230
vn_close() at vn_close+0x118
vn_closefile() at vn_closefile+0x5a
_fdrop() at _fdrop+0x23
closef() at closef+0x5b
kern_close() at kern_close+0x110
syscall() at syscall+0x1af
Xfast_syscall() at Xfast_syscall+0xe1
--- syscall (6, FreeBSD ELF64, close), rip = 0x80084cf9c, rsp = 0x7fffffffe9b8, rbp = 0 ---
_______________________________________________
freebsd...@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-current
To unsubscribe, send any mail to "freebsd-curre...@freebsd.org"

Gavin Atkinson

Nov 3, 2009, 5:42:40 AM

On Sat, 2009-10-31 at 23:15 +0100, Kai Gallasch wrote:
> Hi.
>
> I installed 8.0RC2-amd64 on an 8-core opteron server a few days ago.
>
> When I try to do a make buildworld or make buildkernel the server
> reboots without any message left in the logs. The same happens
> when building bigger ports (for example ruby18 or perl58)
>
> With 8.0-RC2 debug flags and witness seem to be disabled in the
> standard GENERIC kernel, so unfortunately it is not possible for me to
> build a debug kernel without my server crashing..

First place I think I'd start is by running memtest86 on the machine
overnight. This sounds like a possible hardware issue to me, it would be
good to see if we can confirm that that is the case.

> Now my idea was to install the old 8.0-BETA4 and upgrade to RC2 through
> makeworld + buildkernel (gdb+witness). But no luck. When trying to
> upgrade to RC2 the 8.0-BETA4 also crashes. At least 8.0-BETA4 has debug
> + witness active in the GENERIC kernel..
>
> So below some debug output of 8.0-BETA4 crashing. Has a vfs/ffs LOR
> problem with the BETA4 already been fixed?

The debug output you included was just lock order reversals, and doesn't
seem to be related to your crash.

I think 8.0-BETA4 still had the debugger compiled in (you can test by
pressing ctrl-alt-escape on the console, if you do drop to the
debugger, give the "c" command to continue).

If the debugger is compiled in, then the spontaneous reboot without
dropping to the debugger suggests even more that it may be hardware
related. If you do get to the debugger, a copy of all of the messages
on screen and the output of the "bt" command would be very useful. When
you do your kernel recompile, please include full debugging, including
WITNESS, INVARIANTS, KDB, DDB etc.

FWIW, don't worry about building world now, a BETA4 world should work
fine with an RC2 kernel. You may be able to get a kernel built even
though it keeps crashing by clearing out /usr/obj to start with and then
just repeating
cd /usr/src && make buildkernel -DKERNFAST
after every crash.

> Does it make sense to send in a pr with the old 8.0-BETA4?

It depends what the bug is to be honest. So far there isn't really
enough information to determine the cause, and therefore there isn't
really enough info for a PR.

Gavin

Mark Atkinson

Nov 3, 2009, 12:26:15 PM

Kai Gallasch wrote:
> Hi.
>
> I installed 8.0RC2-amd64 on an 8-core opteron server a few days ago.
>
> When I try to do a make buildworld or make buildkernel the server
> reboots without any message left in the logs. The same happens
> when building bigger ports (for example ruby18 or perl58)
>
> With 8.0-RC2 debug flags and witness seem to be disabled in the
> standard GENERIC kernel, so unfortunately it is not possible for me to
> build a debug kernel without my server crashing..
>

> Now my idea was to install the old 8.0-BETA4 and upgrade to RC2 through
> makeworld + buildkernel (gdb+witness). But no luck. When trying to
> upgrade to RC2 the 8.0-BETA4 also crashes. At least 8.0-BETA4 has debug
> + witness active in the GENERIC kernel..
>
> So below some debug output of 8.0-BETA4 crashing. Has a vfs/ffs LOR
> problem with the BETA4 already been fixed?
>

> Does it make sense to send in a pr with the old 8.0-BETA4?
>

> BTW. I installed 7.2-STABLE on this same server and did a "make
> buildworld" and "make buildkernel" which completed without any problem.
>
> Cheers,
> --Kai
>
>
> ----- make buildworld -j7 crash, freebsd 8.0-amd64-beta4 -----

Definitely try the usual memory testing, power supply testing, etc.

I had a similar problem, but with an HP DL385G5 that has some sort of
"memory issue," and it would just silently reboot (which turned out to
be a machine check exception). I could never finger the problem, be it
with the BIOS, the actual memory, or the fact that there's only one 4-core
CPU on a two-socket board and only the associated memory bank filled.

I did various memory swaps to no avail, it would run memtest86 all day
with no errors, and in the end I just turned superpages off and it works.
Like a champ.

If vm.pmap.pg_ps_enabled is 1 in 8.0-rc2, you might try rebooting
with

vm.pmap.pg_ps_enabled="0"

in /boot/loader.conf and try another buildworld.
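To double-check that the line really is active in /boot/loader.conf (and not commented out), something along these lines may help. It is only a rough sketch: the helper name `loader_tunable` is made up for illustration, and it assumes simple `name="value"` lines where a later assignment overrides an earlier one.

```python
import re

# Matches simple assignments such as: vm.pmap.pg_ps_enabled="0"
_ASSIGNMENT = re.compile(r'^\s*([A-Za-z0-9_.]+)\s*=\s*"?([^"#\s]*)"?')


def loader_tunable(conf_text, name):
    """Return the last value assigned to `name` in loader.conf-style text, or None."""
    value = None
    for line in conf_text.splitlines():
        if line.lstrip().startswith("#"):
            continue  # a commented-out line has no effect
        match = _ASSIGNMENT.match(line)
        if match and match.group(1) == name:
            # Assumption: a later assignment overrides an earlier one.
            value = match.group(2)
    return value


sample = '''
# superpages workaround
vm.pmap.pg_ps_enabled="0"
'''
print(loader_tunable(sample, "vm.pmap.pg_ps_enabled"))
```

If the function returns None, the tunable is not set in that text and the kernel default applies.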

Kai Gallasch

Nov 3, 2009, 7:17:16 PM

Am Tue, 03 Nov 2009 10:42:40 +0000
schrieb Gavin Atkinson <ga...@FreeBSD.org>:

> On Sat, 2009-10-31 at 23:15 +0100, Kai Gallasch wrote:
> > Hi.
> >
> > I installed 8.0RC2-amd64 on an 8-core opteron server a few days ago.
> >
> > When I try to do a make buildworld or make buildkernel the server
> > reboots without any message left in the logs. The same happens
> > when building bigger ports (for example ruby18 or perl58)

> First place I think I'd start is by running memtest86 on the machine
> overnight. This sounds like a possible hardware issue to me, it would
> be good to see if we can confirm that that is the case.

I will do so tomorrow. I have already taken the following actions to rule
out a hardware problem:

- ran several passes with diagnostic software from the manufacturer
- reset BIOS settings to default
- upgraded BIOS to newest release
- booted server from 2 year old backup BIOS
- took out the only pair of RAM modules that was different from the
rest of the modules
- installed freebsd 7.2-STABLE on the server to repeat the kernel
panic (no panic with 7.2)
- installed 8.0-BETA4 (crash)

Besides: The server was in production with 7.2 for some time, without
showing any such problems.

> > Now my idea was to install the old 8.0-BETA4 and upgrade to RC2
> > through makeworld + buildkernel (gdb+witness). But no luck. When
> > trying to upgrade to RC2 the 8.0-BETA4 also crashes. At least
> > 8.0-BETA4 has debug
> > + witness active in the GENERIC kernel..
> >
> > So below some debug output of 8.0-BETA4 crashing. Has a vfs/ffs LOR
> > problem with the BETA4 already been fixed?
>

> The debug output you included were just lock order reversals, and
> don't seem to be related to your crash.

Sorry for causing possible confusion about this. I realized this after
my mail was already out.

> I think 8.0-BETA4 still had the debugger compiled in (you can test by
> pressing ctrl-alt-escape on the console, if you do drop to the
> debugger, give the "c" command to continue).
>
> If the debugger is compiled in, then the spontaneous reboot without
> dropping to the debugger suggests even more that it may be hardware
> related. If you do get to the debugger, a copy of all of the messages
> on screen and the output of the "bt" command would be very useful.
> When you do your kernel recompile, please include full debugging,
> including WITNESS, INVARIANTS, KDB, DDB etc.

In the meantime I managed to install a RELENG_8 world + GENERIC
kernel with all debug options enabled on the crashing server. (mounted
/usr/src and /usr/obj on another server running 8.0RC1 through NFS and
did buildworld + buildkernel over there.)

So now I have a debug kernel available with dumpdev + dumpdir defined.

Here are my latest findings on this issue:

- In about 80% of cases, running makeworld leads to a server crash without
the server writing a crashdump to dumpdir. The server just reboots.
- In about 20% of cases makeworld gets stuck in a non-terminating
process that eats up 100% CPU. This process cannot be killed. When
restarting makeworld the server then reboots again.
- It makes no difference doing makeworld -j1 or -j8, the result is the same.

> It depends what the bug is to be honest. So far there isn't really
> enough information to determine the cause, and therefore there isn't
> really enough info for a PR.

Mark Atkinson also commented on my mail and he gave the
hint: "If vm.pmap.pg_ps_enabled is 1 in 8.0-rc2, you might try
rebooting with vm.pmap.pg_ps_enabled="0" in /boot/loader.conf and try
another buildworld."

So I thought why not and just tried it - and surprise:

Setting vm.pmap.pg_ps_enabled="0" in loader.conf (i.e. turning
superpages off) resolves my problem with 8.0RC2 crashing when doing a
makeworld.

After a successful buildworld and buildkernel I rebooted the server
again with the vm.pmap.pg_ps_enabled="0" line commented out and the
problem was there again. Then I re-enabled that line in loader.conf,
rebooted and ran make buildworld: no problem.

Seems to be deterministic. With vm.pmap.pg_ps_enabled=1 the server
crashes without being able to write crashdumps to dumpdev. (at least on
this specific Proliant DL385G2 server)

--Kai.

--
You need more time; and you probably always will.

S.N.Grigoriev

Nov 4, 2009, 5:44:10 AM

Hi list,

I can confirm I've seen the same problem. After upgrading from 7-stable
to 8.0-RC2 my machine just reboots during 'make buildworld' without
diagnostics. But switching vm.pmap.pg_ps_enabled on/off does not
work for me. My machine reboots every time I try to build world.
I don't think I have a hardware problem: under 7-stable I can build
world/kernel for both 7-stable and 8.0-RC2 without problems.

--
Regards,
S.Grigoriev.

Dag-Erling Smørgrav

Nov 4, 2009, 10:24:01 AM

Kai Gallasch <gall...@free.de> writes:
> I installed 8.0RC2-amd64 on an 8-core opteron server a few days ago.
>
> When I try to do a make buildworld or make buildkernel the server
> reboots without any message left in the logs. The same happens
> when building bigger ports (for example ruby18 or perl58)

Could it be related to this? What's your CPUID?

Author: attilio
Date: Wed Nov 4 01:32:59 2009
New Revision: 198868
URL: http://svn.freebsd.org/changeset/base/198868

Log:
Opteron rev E family of processors expose a bug where, on very rare
occasions, memory barrier semantics are not honoured by the hardware
itself. As a result, some random breakage can happen in uninvestigable
ways (for further explanation see the content of the commit itself).

As long as just a specific family of an entire architecture
is broken, a complete fix-up is impractical without harming to some
extent the other correct cases.
Considering that (and considering the frequency of the bug exposure)
just print out a warning message if the affected machine is identified.

Pointed out by: Samy Al Bahra <sbahra at repnop dot org>
Help on wordings by: jeff
MFC: 3 days

Modified:
head/sys/amd64/amd64/identcpu.c
head/sys/i386/i386/identcpu.c

DES
--
Dag-Erling Smørgrav - d...@des.no

Kai Gallasch

Nov 4, 2009, 10:36:21 AM

Am Wed, 04 Nov 2009 16:24:01 +0100
schrieb Dag-Erling Smørgrav <d...@des.no>:

> Kai Gallasch <gall...@free.de> writes:
> > I installed 8.0RC2-amd64 on an 8-core opteron server a few days ago.
> >
> > When I try to do a make buildworld or make buildkernel the server
> > reboots without any message left in the logs. The same happens
> > when building bigger ports (for example ruby18 or perl58)
>
> Could it be related to this? What's your CPUID?

Found this in dmesg. Is this the CPUID? "Id = 0x100f23"

--Kai.

CPU: Quad-Core AMD Opteron(tm) Processor 2352 (2100.09-MHz K8-class CPU)
Origin = "AuthenticAMD" Id = 0x100f23 Stepping = 3
Features=0x178bfbff<FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,CLFLUSH,MMX,FXSR,SSE,SSE2,HTT>
Features2=0x802009<SSE3,MON,CX16,POPCNT>
AMD Features=0xee400800<SYSCALL,MMX+,FFXSR,Page1GB,RDTSCP,LM,3DNow!+,3DNow!>
AMD Features2=0x7ff<LAHF,CMP,SVM,ExtAPIC,CR8,ABM,SSE4A,MAS,Prefetch,OSVW,IBS>
TSC: P-state invariant
real memory = 21474836480 (20480 MB)
avail memory = 20701110272 (19742 MB)
ACPI APIC Table: <HP ProLiant>
FreeBSD/SMP: Multiprocessor System Detected: 8 CPUs
FreeBSD/SMP: 2 package(s) x 4 core(s)
cpu0 (BSP): APIC ID: 0
cpu1 (AP): APIC ID: 1
cpu2 (AP): APIC ID: 2
cpu3 (AP): APIC ID: 3
cpu4 (AP): APIC ID: 4
cpu5 (AP): APIC ID: 5
cpu6 (AP): APIC ID: 6
cpu7 (AP): APIC ID: 7

--
If it wasn't for the last minute, nothing would get done.

Mark Atkinson

Nov 4, 2009, 4:31:35 PM

Kai Gallasch wrote:
> Am Wed, 04 Nov 2009 16:24:01 +0100
> schrieb Dag-Erling Smørgrav <d...@des.no>:
>
>> Kai Gallasch <gall...@free.de> writes:
>>> I installed 8.0RC2-amd64 on an 8-core opteron server a few days ago.
>>>
>>> When I try to do a make buildworld or make buildkernel the server
>>> reboots without any message left in the logs. The same happens
>>> when building bigger ports (for example ruby18 or perl58)
>> Could it be related to this? What's your CPUID?
>
> Found this in dmesg. Is this the CPUID? "Id = 0x100f23"

That's generation 16 model 2 stepping 3. This erratum only affects
generation 0xe or 15. BTW, I have the same processor/stepping/MHz in
my system, but only a single physical processor.

> --Kai.
>
>
> CPU: Quad-Core AMD Opteron(tm) Processor 2352 (2100.09-MHz K8-class CPU)
> Origin = "AuthenticAMD" Id = 0x100f23 Stepping = 3
> Features=0x178bfbff<FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,CLFLUSH,MMX,FXSR,SSE,SSE2,HTT>
> Features2=0x802009<SSE3,MON,CX16,POPCNT>
> AMD Features=0xee400800<SYSCALL,MMX+,FFXSR,Page1GB,RDTSCP,LM,3DNow!+,3DNow!>
> AMD Features2=0x7ff<LAHF,CMP,SVM,ExtAPIC,CR8,ABM,SSE4A,MAS,Prefetch,OSVW,IBS>
> TSC: P-state invariant
> real memory = 21474836480 (20480 MB)
> avail memory = 20701110272 (19742 MB)
> ACPI APIC Table: <HP ProLiant>
> FreeBSD/SMP: Multiprocessor System Detected: 8 CPUs
> FreeBSD/SMP: 2 package(s) x 4 core(s)
> cpu0 (BSP): APIC ID: 0
> cpu1 (AP): APIC ID: 1
> cpu2 (AP): APIC ID: 2
> cpu3 (AP): APIC ID: 3
> cpu4 (AP): APIC ID: 4
> cpu5 (AP): APIC ID: 5
> cpu6 (AP): APIC ID: 6
> cpu7 (AP): APIC ID: 7
>
>


Mark Atkinson

Nov 4, 2009, 4:34:00 PM

Kai Gallasch wrote:
> Am Wed, 04 Nov 2009 16:24:01 +0100
> schrieb Dag-Erling Smørgrav <d...@des.no>:
>
>> Kai Gallasch <gall...@free.de> writes:
>>> I installed 8.0RC2-amd64 on an 8-core opteron server a few days ago.
>>>
>>> When I try to do a make buildworld or make buildkernel the server
>>> reboots without any message left in the logs. The same happens
>>> when building bigger ports (for example ruby18 or perl58)
>> Could it be related to this? What's your CPUID?
>

> Found this in dmesg. Is this the CPUID? "Id = 0x100f23"

That's generation 16 (0xf) model 2, stepping 3. This erratum apparently
only affects gen 15 (0xe) and some pre-release -- never released to
public (0xf). I have the same processor in my system btw.

Mark Atkinson

Nov 4, 2009, 4:45:32 PM

Mark Atkinson wrote:
> Kai Gallasch wrote:
>> Am Wed, 04 Nov 2009 16:24:01 +0100
>> schrieb Dag-Erling Smørgrav <d...@des.no>:
>>
>>> Kai Gallasch <gall...@free.de> writes:
>>>> I installed 8.0RC2-amd64 on an 8-core opteron server a few days ago.
>>>>
>>>> When I try to do a make buildworld or make buildkernel the server
>>>> reboots without any message left in the logs. The same happens
>>>> when building bigger ports (for example ruby18 or perl58)
>>> Could it be related to this? What's your CPUID?
>
>> Found this in dmesg. Is this the CPUID? "Id = 0x100f23"
>
> That's generation 16 (0xf) model 2, stepping 3. This erratum apparently
> only affects gen 15 (0xe) and some pre-release -- never released to
> public (0xf). I have the same processor in my system btw.

Sorry for the double wrong posting. I see several webpages refer to 15
as f and 16 as f. /usr/ports/misc/cpuid refers to it as 15.

The pages referenced via the bugzilla entry in the commit refer to it as
0xf but between 32 and 63. Does the model 2 correctly put us in the
range in the commit 0x20 and 0x3f? (i.e. stepping is included?)

Alexandre "Sunny" Kovalenko

Nov 4, 2009, 4:51:39 PM

On Wed, 2009-11-04 at 13:44 +0300, S.N.Grigoriev wrote:
> Hi list,
>
> I can confirm I've seen the same problem. After upgrading from 7-stable
> to 8.0-RC2 my machine just reboots during 'make buildworld' without
> diagnostics. But switching vm.pmap.pg_ps_enabled on/off does not
> work for me. My machine reboots every time I try to build world.
> I don't think I have a hardware problem: under 7-stable I can build
> world/kernel for both 7-stable and 8.0-RC2 without problems.
>

Is it by any chance possible that you have 'debug.debugger_on_panic' set
to '0' and no valid dump device configured?

--
Alexandre Kovalenko (Олександр Коваленко)

Mark Atkinson

Nov 4, 2009, 5:36:46 PM

Mark Atkinson wrote:
> Mark Atkinson wrote:
>> Kai Gallasch wrote:
>>> Am Wed, 04 Nov 2009 16:24:01 +0100
>>> schrieb Dag-Erling Smørgrav <d...@des.no>:
>>>
>>>> Kai Gallasch <gall...@free.de> writes:
>>>>> I installed 8.0RC2-amd64 on an 8-core opteron server a few days ago.
>>>>>
>>>>> When I try to do a make buildworld or make buildkernel the server
>>>>> reboots without any message left in the logs. The same happens
>>>>> when building bigger ports (for example ruby18 or perl58)
>>>> Could it be related to this? What's your CPUID?
>>> Found this in dmesg. Is this the CPUID? "Id = 0x100f23"
>> That's generation 16 (0xf) model 2, stepping 3. This erratum apparently
>> only affects gen 15 (0xe) and some pre-release -- never released to
>> public (0xf). I have the same processor in my system btw.
>
> sorry for the double wrong posting. I see several webpages refer to 15
> as f and 16 as f. usr/ports/misc/cpuid refers to it as 15.
>
> The pages referenced via the bugzilla entry in the commit refer to it as
> 0xf but between 32 and 63. Does the model 2 correctly put us in the
> range in the commit 0x20 and 0x3f? (i.e. stepping is included?)

I'll answer my own question, no:

http://support.amd.com/us/Processor_TechDocs/25481.pdf

Although some of the posts in

http://bugzilla.kernel.org/show_bug.cgi?id=11305

indicate any model < 0x40. Someone must have actually narrowed the range.
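For the archives, here is a rough sketch of how the "Id" value from dmesg splits into family, model and stepping. It assumes the usual AMD CPUID layout, where the extended family and extended model fields only count when the base family is 0xf. The function names are made up for illustration, and the family 0xf / model 0x20 to 0x3f check simply restates the range mentioned in this thread; it has not been verified against the commit.

```python
def decode_cpuid(signature):
    """Split a CPUID signature (the "Id" value in dmesg) into (family, model, stepping)."""
    stepping = signature & 0xF
    base_model = (signature >> 4) & 0xF
    base_family = (signature >> 8) & 0xF
    ext_model = (signature >> 16) & 0xF
    ext_family = (signature >> 20) & 0xFF

    family = base_family
    model = base_model
    if base_family == 0xF:
        # Assumed AMD convention: extended fields apply only when base family is 0xf.
        family = base_family + ext_family
        model = (ext_model << 4) | base_model
    return family, model, stepping


def in_range_discussed(signature):
    """Family 0xf with model 0x20..0x3f, the range mentioned in this thread."""
    family, model, _ = decode_cpuid(signature)
    return family == 0xF and 0x20 <= model <= 0x3F


family, model, stepping = decode_cpuid(0x100F23)
print(hex(family), hex(model), stepping)
print(in_range_discussed(0x100F23))
```

Under these assumptions 0x100f23 comes out as family 0x10 (16 decimal), model 2, stepping 3, so it falls outside the family 0xf range.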

Adrian Chadd

Nov 4, 2009, 6:47:25 PM

2009/11/5 Mark Atkinson <atki...@yahoo.com>:

>
> I'll answer my own question, no:
>
> http://support.amd.com/us/Processor_TechDocs/25481.pdf
>
> Although some of the posts in
>
> http://bugzilla.kernel.org/show_bug.cgi?id=11305
>
> indicate any model < 0x40. Someone must have actually narrowed the range.

Is there a FreeBSD PR or errata URL which can be linked to instead,
complete with copies of the above in it?

Adrian

Kai Gallasch

Nov 5, 2009, 10:21:18 AM

Am Tue, 03 Nov 2009 10:42:40 +0000
schrieb Gavin Atkinson <ga...@FreeBSD.org>:

> On Sat, 2009-10-31 at 23:15 +0100, Kai Gallasch wrote:
> > Hi.
> >

> > I installed 8.0RC2-amd64 on an 8-core opteron server a few days ago.
> >
> > When I try to do a make buildworld or make buildkernel the server
> > reboots without any message left in the logs. The same happens
> > when building bigger ports (for example ruby18 or perl58)
> >

> > With 8.0-RC2 debug flags and witness seem to be disabled in the
> > standard GENERIC kernel, so unfortunately it is not possible for me
> > to build a debug kernel without my server crashing..
>

> First place I think I'd start is by running memtest86 on the machine
> overnight. This sounds like a possible hardware issue to me, it would
> be good to see if we can confirm that that is the case.

Gavin.

memtest86 ran for 18 hours and showed no problem with RAM.

--Kai.

S.N.Grigoriev

Nov 5, 2009, 11:40:03 AM

04.11.09, 16:51, "Alexandre \"Sunny\" Kovalenko" <gaij...@gmail.com>
wrote:

> On Wed, 2009-11-04 at 13:44 +0300, S.N.Grigoriev wrote:
> > Hi list,
> >
> > I can confirm I've seen the same problem. After upgrading from 7-stable
> > to 8.0-RC2 my machine just reboots during 'make buildworld' without
> > diagnostics. But switching vm.pmap.pg_ps_enabled on/off does not
> > work for me. My machine reboots every time I try to build world.
> > I don't think I have a hardware problem: under 7-stable I can build
> > world/kernel for both 7-stable and 8.0-RC2 without problems.
> >
> Is it by any chance possible that you have 'debug.debugger_on_panic' set
> to '0' and no valid dump device configured?

Hi Alexandre,

I've not found 'debug.debugger_on_panic' variable in 'sysctl -a'
output. Where can I find it? All my sysctl variables are set by
default.
--
Regards,
S.Grigoriev.

Mark Atkinson

Nov 5, 2009, 12:43:00 PM

Adrian Chadd wrote:
> 2009/11/5 Mark Atkinson <atki...@yahoo.com>:
>
>> I'll answer my own question, no:
>>
>> http://support.amd.com/us/Processor_TechDocs/25481.pdf
>>
>> Although some of the posts in
>>
>> http://bugzilla.kernel.org/show_bug.cgi?id=11305
>>
>> indicate any model < 0x40. Someone must have actually narrowed the range.
>
> Is there a FreeBSD PR or errata URL which can be linked to instead,
> complete with copies of the above in it?

If you read the mysql related blog post on it:

http://timetobleed.com/mysql-doesnt-always-suck-this-time-its-amd/

Someone in the comments suggests this is AMD errata 147 and quotes the
text. I'll include a copy of the comment here below for the mail
archives (and since urls tend to disappear).

#
silverjam
The kernel bug:

http://bugzilla.kernel.org/show_bug.cgi?id=11305

Which references an AMD "errata 147" from "Revision Guide for AMD
Athlon 64 and AMD Opteron Processors."

http://support.amd.com/us/Processor_TechDocs/25759.pdf

Which says:
"""
Potential Violation of Read Ordering Rules Between Semaphore Operations
and Unlocked Read-Modify-Write Instructions

Description

Under a highly specific set of internal timing circumstances, the memory
read ordering between a semaphore operation and a subsequent
read-modify-write instruction (an instruction that uses the same memory
location as both a source and destination) may be incorrect and allow the
read-modify-write instruction to operate on the memory location ahead of
the completion of the semaphore operation. The erratum will not occur if
there is a LOCK prefix on the read-modify-write instruction. This erratum
does not apply if the read-only value in MSRC001_1023h[33] is 1b.

Potential Effect on System

In the unlikely event that the condition described above occurs, the
read-modify-write instruction (in the critical section) may operate on
data that existed prior to the semaphore operation. This erratum can only
occur in multiprocessor or multicore configurations.

Suggested Workaround

To provide a workaround for this unlikely event, software can perform any
of the following actions for multiprocessor or multicore systems:
- Place a LFENCE instruction between the semaphore operation and any
  subsequent read-modify-write instruction(s) in the critical section.
- Use a LOCK prefix with the read-modify-write instruction.
- Decompose the read-modify-write instruction into separate instructions.

No workaround is necessary if software checks that MSRC001_1023h[33] is
set on all processors that may execute the code. The value in
MSRC001_1023h[33] may not be the same on all processors in a
multi-processor system.

Fix Planned: Yes
"""
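The "MSRC001_1023h[33]" condition in the quoted text boils down to testing one bit per processor. A minimal sketch, assuming the per-CPU MSR values have already been read by some other means (reading the MSR itself is not shown, and the function names and sample values are hypothetical):

```python
ERRATUM_147_FIXED_BIT = 33  # bit index taken from "MSRC001_1023h[33]" in the quoted text


def erratum_147_fixed(msr_value):
    """True if bit 33 of an MSR C001_1023h value is set."""
    return bool((msr_value >> ERRATUM_147_FIXED_BIT) & 1)


def workaround_needed(per_cpu_msr_values):
    """A workaround is needed unless every CPU reports the bit as set."""
    return not all(erratum_147_fixed(value) for value in per_cpu_msr_values)


# Hypothetical readings for a two-CPU system: only the first has the bit set.
print(workaround_needed([1 << 33, 0]))
# Hypothetical readings where both CPUs have the bit set.
print(workaround_needed([1 << 33, (1 << 33) | 0x5]))
```

The check runs over every CPU because, per the quoted text, the value may differ between processors in the same system.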

Gary Jennejohn

Nov 5, 2009, 12:49:25 PM

On Thu, 05 Nov 2009 19:40:03 +0300
S.N.Grigoriev <serguey-...@yandex.ru> wrote:

>
> 04.11.09, 16:51, "Alexandre \"Sunny\" Kovalenko" <gaij...@gmail.com>
> wrote:
>
> > On Wed, 2009-11-04 at 13:44 +0300, S.N.Grigoriev wrote:
> > > Hi list,
> > >
> > > I can confirm I've seen the same problem. After upgrading from 7-stable
> > > to 8.0-RC2 my machine just reboots during 'make buildworld' without
> > > diagnostics. But switching vm.pmap.pg_ps_enabled on/off does not
> > > work for me. My machine reboots every time I try to build world.
> > > I don't think I have a hardware problem: under 7-stable I can build
> > > world/kernel for both 7-stable and 8.0-RC2 without problems.
> > >
> > Is it by any chance possible that you have 'debug.debugger_on_panic' set
> > to '0' and no valid dump device configured?
>
> Hi Alexandre,
>
> I've not found 'debug.debugger_on_panic' variable in 'sysctl -a'
> output. Where can I find it? All my sysctl variables are set by
> default.

Do you have "options DDB" in your kernel config file?

---
Gary Jennejohn

S.N.Grigoriev

Nov 5, 2009, 1:34:23 PM

05.11.09, 18:49, "Gary Jennejohn" <gary.je...@freenet.de>:

> On Thu, 05 Nov 2009 19:40:03 +0300
> S.N.Grigoriev wrote:
> >
> > 04.11.09, 16:51, "Alexandre \"Sunny\" Kovalenko"

> > wrote:
> >
> > > On Wed, 2009-11-04 at 13:44 +0300, S.N.Grigoriev wrote:
> > > > Hi list,
> > > >
> > > > I can confirm I've seen the same problem. After upgrading from 7-stable
> > > > to 8.0-RC2 my machine just reboots during 'make buildworld' without
> > > > diagnostics. But switching vm.pmap.pg_ps_enabled on/off does not
> > > > work for me. My machine reboots every time I try to build world.
> > > > I don't think I have a hardware problem: under 7-stable I can build
> > > > world/kernel for both 7-stable and 8.0-RC2 without problems.
> > > >
> > > Is it by any chance possible that you have 'debug.debugger_on_panic' set
> > > to '0' and no valid dump device configured?
> >
> > Hi Alexandre,
> >
> > I've not found 'debug.debugger_on_panic' variable in 'sysctl -a'
> > output. Where can I find it? All my sysctl variables are set by
> > default.
> Do you have "options DDB" in your kernel config file?
> ---
> Gary Jennejohn

Hi Gary,

my current kernel is GENERIC, so I don't have "options DDB".
--
Regards,
S.Grigoriev.

S.N.Grigoriev

Nov 5, 2009, 2:08:13 PM

05.11.09, 13:46, "Etienne Robillard" <robillar...@gmail.com>:

> S.N.Grigoriev wrote:
> >
> > 05.11.09, 18:49, "Gary Jennejohn" :

> >
> >> On Thu, 05 Nov 2009 19:40:03 +0300
> >> S.N.Grigoriev wrote:
> >>> 04.11.09, 16:51, "Alexandre \"Sunny\" Kovalenko"
> >>> wrote:
> >>>
> >>>> On Wed, 2009-11-04 at 13:44 +0300, S.N.Grigoriev wrote:
> >>>>> Hi list,
> >>>>>
> >>>>> I can confirm I've seen the same problem. After upgrading from 7-stable
> >>>>> to 8.0-RC2 my machine just reboots during 'make buildworld' without
> >>>>> diagnostics. But switching vm.pmap.pg_ps_enabled on/off does not
> >>>>> work for me. My machine reboots every time I try to build world.
> >>>>> I don't think I have a hardware problem: under 7-stable I can build
> >>>>> world/kernel for both 7-stable and 8.0-RC2 without problems.
> >>>>>
> >>>> Is it by any chance possible that you have 'debug.debugger_on_panic' set
> >>>> to '0' and no valid dump device configured?
> >>> Hi Alexandre,
> >>>
> >>> I've not found 'debug.debugger_on_panic' variable in 'sysctl -a'
> >>> output. Where can I find it? All my sysctl variables are set by
> >>> default.
> >> Do you have "options DDB" in your kernel config file?
> >> ---
> >> Gary Jennejohn
> >
> > Hi Gary,
> >
> > my current kernel is GENERIC, so I don't have "options DDB".

> I have RC2 with amd64 and buildworld/installworld runs fine.
> Maybe you have memory (RAM) problems? I had to remove one 512mb clib
> in order to boot... ;-)
> Hope this helps,
> Etienne

Hi Etienne,

I think it is unlikely. I've done a great many compilations on this machine
(under FreeBSD 7.1 and 7.2 and some Linux versions) without issues.

Etienne Robillard

Nov 5, 2009, 1:46:31 PM

S.N.Grigoriev wrote:
>
> 05.11.09, 18:49, "Gary Jennejohn" <gary.je...@freenet.de>:

Hope this helps,

Etienne

--
Etienne Robillard <robillar...@gmail.com>
Green Tea Hackers Club <http://gthc.org/>
Blog: <http://gthc.org/blog/>
PGP Fingerprint: 178A BF04 23F0 2BF5 535D 4A57 FD53 FD31 98DC 4E57

Gary Jennejohn

Nov 6, 2009, 4:19:43 AM

Well, you need that option to see the DDB (debug) sysctls.

---
Gary Jennejohn

Alexandre "Sunny" Kovalenko

Nov 6, 2009, 7:55:51 AM

On Thu, 2009-11-05 at 19:40 +0300, S.N.Grigoriev wrote:
> 04.11.09, 16:51, "Alexandre \"Sunny\" Kovalenko" <gaij...@gmail.com>

> wrote:
>
> > On Wed, 2009-11-04 at 13:44 +0300, S.N.Grigoriev wrote:
> > > Hi list,
> > >
> > > I can confirm I've seen the same problem. After upgrading from 7-stable
> > > to 8.0-RC2 my machine just reboots during 'make buildworld' without
> > > diagnostics. But switching vm.pmap.pg_ps_enabled on/off does not
> > > work for me. My machine reboots every time I try to build world.
> > > I don't think I have a hardware problem: under 7-stable I can build
> > > world/kernel for both 7-stable and 8.0-RC2 without problems.
> > >
> > Is it by any chance possible that you have 'debug.debugger_on_panic' set
> > to '0' and no valid dump device configured?
>
> Hi Alexandre,
>
> I've not found 'debug.debugger_on_panic' variable in 'sysctl -a'
> output. Where can I find it? All my sysctl variables are set by
> default.

If your system does not have "options DDB", it would not have the sysctl
above, which might be just as well.

Does the savecore -v output make sense?

Where I am heading with all of that is the possibility that your system
panic(9)'ed and saved no core, hence, looking like it just rebooted.

--
Alexandre Kovalenko (Олександр Коваленко)

S.N.Grigoriev

Nov 6, 2009, 11:24:24 AM

06.11.09, 07:55, "Alexandre \"Sunny\" Kovalenko" :

> On Thu, 2009-11-05 at 19:40 +0300, S.N.Grigoriev wrote:
> > 04.11.09, 16:51, "Alexandre \"Sunny\" Kovalenko"

> > wrote:
> >
> > > On Wed, 2009-11-04 at 13:44 +0300, S.N.Grigoriev wrote:
> > > > Hi list,
> > > >
> > > > I can confirm I've seen the same problem. After upgrading from 7-stable
> > > > to 8.0-RC2 my machine just reboots during 'make buildworld' without
> > > > diagnostics. But switching vm.pmap.pg_ps_enabled on/off does not
> > > > work for me. My machine reboots every time I try to build world.
> > > > I don't think I have a hardware problem: under 7-stable I can build
> > > > world/kernel for both 7-stable and 8.0-RC2 without problems.
> > > >
> > > Is it by any chance possible that you have 'debug.debugger_on_panic' set
> > > to '0' and no valid dump device configured?
> >
> > Hi Alexandre,
> >
> > I've not found 'debug.debugger_on_panic' variable in 'sysctl -a'
> > output. Where can I find it? All my sysctl variables are set by
> > default.
> If your system does not have "options DDB", it would not have the sysctl
> above, which might be just as well.
> Does the savecore -v output make sense?

This is 'savecore -v' output on my machine:

unable to open bounds file, using 0
checking for kernel dump on device /dev/ad8s1b
mediasize = 4294967296
sectorsize = 512
magic mismatch on last dump header on /dev/ad8s1b
savecore: no dumps found
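
A minimal sketch for reading output like the above. It assumes only the
two messages that appear in this thread; the function name
explain_savecore is hypothetical, and it is a plain text filter that does
not call savecore itself.

```shell
#!/bin/sh
# Hypothetical helper: read `savecore -v` output on stdin and print a hint
# for each message seen in this thread. Returns 1 if savecore reported
# "no dumps found", 0 otherwise.
explain_savecore() {
    awk '
        /magic mismatch/ {
            print "hint: no valid dump header on the device"
        }
        /no dumps found/ {
            print "hint: savecore has nothing to extract into /var/crash"
            none = 1
        }
        END { exit (none ? 1 : 0) }
    '
}
```

It could be used as: savecore -v 2>&1 | explain_savecore
A "magic mismatch" here suggests that nothing was written to the dump
device at panic time, which fits a reboot that never went through the
kernel's panic path or a dump device that was never configured.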

--
Regards,
S.Grigoriev.

S.N.Grigoriev

Nov 7, 2009, 4:20:51 AM

06.11.09, 10:19, "Gary Jennejohn" <gary.je...@freenet.de> wrote:

> On Thu, 05 Nov 2009 21:34:23 +0300

> S.N.Grigoriev wrote:
> >
> >
> > 05.11.09, 18:49, "Gary Jennejohn" :

> >
> > > On Thu, 05 Nov 2009 19:40:03 +0300

> > > S.N.Grigoriev wrote:
> > > >
> > > > 04.11.09, 16:51, "Alexandre \"Sunny\" Kovalenko"
> > > > wrote:
> > > >
> > > > > On Wed, 2009-11-04 at 13:44 +0300, S.N.Grigoriev wrote:
> > > > > > Hi list,
> > > > > >
> > > > > > I can confirm I've seen the same problem. After upgrading from 7-stable
> > > > > > to 8.0-RC2 my machine just reboots during 'make buildworld' without
> > > > > > diagnostics. But switching vm.pmap.pg_ps_enabled on/off does not
> > > > > > work for me. My machine reboots every time I try to build world.
> > > > > > I don't think I have a hardware problem: under 7-stable I can build
> > > > > > world/kernel for both 7-stable and 8.0-RC2 without problems.
> > > > > >
> > > > > Is it by any chance possible that you have 'debug.debugger_on_panic' set
> > > > > to '0' and no valid dump device configured?
> > > >
> > > > Hi Alexandre,
> > > >
> > > > I've not found 'debug.debugger_on_panic' variable in 'sysctl -a'
> > > > output. Where can I find it? All my sysctl variables are set by
> > > > default.

> > > Do you have "options DDB" in your kernel config file?
> > > ---
> > > Gary Jennejohn
> >
> > Hi Gary,
> >
> > my current kernel is GENERIC, so I don't have "options DDB".
> >
> Well, you need that option to see the DDB (debug) sysctl's.
> ---

I've recompiled the kernel with 'options DDB' (using 7-stable).
The sysctl variable 'debug.debugger_on_panic' is now set to '1'.
But I still have silent reboots every time I try to build world/kernel.
'savecore -v' reports: 'no dumps found'. What did I do incorrectly?

Gary Jennejohn

Nov 7, 2009, 5:52:56 AM

On Sat, 07 Nov 2009 12:20:51 +0300
S.N.Grigoriev <serguey-...@yandex.ru> wrote:

> I've recompiled the kernel with 'options DDB' (using 7-stable).
> The sysctl variable 'debug.debugger_on_panic' is now set to '1'.
> But I still have silent reboots every time I try to build world/kernel.
> 'savecore -v' reports: 'no dumps found'. What did I do incorrectly?
>

It could be that you also need "options KDB" for the kernel to enter
ddb on panic, but I'm not 100% sure about that.

Can't hurt to try it.
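
A rough way to check a kernel configuration file for both options,
assuming they are written directly in that file as "options DDB" and
"options KDB" lines. The function name missing_debug_options is
hypothetical, and the check does not follow "include" lines, so options
inherited from another config are not seen.

```shell
#!/bin/sh
# Print which of the debugger options discussed here are absent from the
# kernel configuration file given as $1 (one option name per line).
missing_debug_options() {
    conf=$1
    for opt in KDB DDB; do
        # Match lines such as "options   DDB"; commented-out lines
        # (starting with #) do not count.
        if ! grep -Eq "^[[:space:]]*options[[:space:]]+$opt([[:space:]]|#|\$)" "$conf"; then
            echo "$opt"
        fi
    done
}
```

For example, missing_debug_options /usr/src/sys/amd64/conf/MYKERNEL
(MYKERNEL being a placeholder name) prints nothing when both options are
present.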

Do you also have dumpdev defined in /etc/rc.conf? It's set to AUTO
by default, but who knows whether that really works?

---
Gary Jennejohn

Alexandre "Sunny" Kovalenko

Nov 7, 2009, 1:32:38 PM

On Sat, 2009-11-07 at 11:52 +0100, Gary Jennejohn wrote:
> On Sat, 07 Nov 2009 12:20:51 +0300
> S.N.Grigoriev <serguey-...@yandex.ru> wrote:
>
> > I've recompiled the kernel with 'options DDB' (using 7-stable).
> > The sysctl variable 'debug.debugger_on_panic' is now set to '1'.
> > But I still have silent reboots every time I try to build world/kernel.
> > 'savecore -v' reports: 'no dumps found'. What did I do incorrectly?
> >
>
> It could be that you also need "options KDB" for the kernel to enter
> ddb on panic, but I'm not 100% sure about that.
>
> Can't hurt to try it.
>
> Do you also have dumpdev defined in /etc/rc.conf? It's set to AUTO

> by default, <snip>
Apparently not on 8.0RC...

RabbitsDen# grep dumpdev defaults/rc.conf
dumpdev="NO" # Device to crashdump to (device name, AUTO, or NO).
savecore_flags="" # Used if dumpdev is enabled above, and present.
RabbitsDen# uname -a
FreeBSD RabbitsDen.RabbitsLawn.verizon.net 8.0-RC2 FreeBSD 8.0-RC2 #0
r198931: Sat Nov 7 12:07:32 EST 2009
ro...@RabbitsDen.RabbitsLawn.verizon.net:/usr/obj/usr/src/sys/TPX60 i386

while on 7.2

twinhead# grep dumpdev defaults/rc.conf
dumpdev="AUTO" # Device to crashdump to (device name, AUTO, or NO).
savecore_flags="" # Used if dumpdev is enabled above, and present.
twinhead# uname -a
FreeBSD twinhead.rabbitslawn.verizon.net 7.2-RELEASE-p3 FreeBSD
7.2-RELEASE-p3 #0: Sat Aug 22 12:43:26 EDT 2009
ro...@twinhead.rabbitslawn.verizon.net:/usr/obj/usr/src/sys/TWINHEAD
amd64

So one needs explicit dumpdev="AUTO" in the /etc/rc.conf. That caught me
by surprise too...
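
The same comparison can be sketched as a small function that reports
which dumpdev value would win. This assumes the usual ordering, in which
/etc/defaults/rc.conf is read first and /etc/rc.conf overrides it; the
function name effective_dumpdev is hypothetical, and rc.conf.local is
ignored here.

```shell
#!/bin/sh
# Print the dumpdev value that results from reading the defaults file ($2)
# and then rc.conf ($1). Both files are sourced as shell, which is how
# they are normally consumed, inside a subshell so nothing leaks out.
effective_dumpdev() {
    (
        dumpdev=unset
        for f in "$2" "$1"; do      # defaults first, rc.conf overrides
            [ -r "$f" ] && . "$f"
        done
        echo "$dumpdev"
    )
}
```

On a system like the 8.0-RC2 box above, with no dumpdev line in
/etc/rc.conf, effective_dumpdev /etc/rc.conf /etc/defaults/rc.conf would
be expected to print NO.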

Sergey, I think it would be best if you follow

http://www.freebsd.org/doc/en_US.ISO8859-1/books/developers-handbook/kerneldebug.html#KERNELDEBUG-OBTAIN

and then do an ultimate test:

* quiesce your system
* switch to the console
* sync (a few times, if you are really old school ;)
* sysctl debug.kdb.panic=1 (this would *panic* the system and, given
everything is set up properly, produce the crash dump)

If you do not have the debug.kdb.panic sysctl, please add "options KDB" to
your kernel configuration.
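
The command part of that checklist can be sketched as a dry-run wrapper.
The function name panic_test and the RUN variable are hypothetical; by
default nothing is executed, because the last command deliberately
panics the machine.

```shell
#!/bin/sh
# Dry run by default: print the commands from the checklist.
# With RUN=1 in the environment they are executed for real, and the
# final sysctl intentionally panics the system.
panic_test() {
    for cmd in "sync" "sync" "sysctl debug.kdb.panic=1"; do
        if [ "${RUN:-0}" = 1 ]; then
            $cmd        # unquoted on purpose so the words are split
        else
            echo "would run: $cmd"
        fi
    done
}
```

Quiescing the system and switching to the console remain manual steps.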

If you get a crash dump from the kernel-induced panic but your system
keeps rebooting without a trace, I would suspect some hardware testing
might be in order.

--
Alexandre Kovalenko (Олександр Коваленко)

S.N.Grigoriev

Nov 10, 2009, 4:41:59 AM

07.11.09, 13:32, "Alexandre \"Sunny\" Kovalenko" <gaij...@gmail.com>
wrote:

> Sergey, I think it would be best if you follow

> http://www.freebsd.org/doc/en_US.ISO8859-1/books/developers-handbook/kerneldebug.html#KERNELDEBUG-OBTAIN
> and then do an ultimate test:
> * quiesce your system
> * switch to the console
> * sync (few times, if you are really old school ;)
> * sysctl debug.kdb.panic=1 (this would *panic* the system and, given
> everything is set-up properly, produce the crash dump)
>
> if you do not have debug.kdb.panic sysctl, please, add option KDB to
> your kernel configuration.
> If you get crash dump from the kernel-induced panic and your system
> keeps rebooting without a trace, I would suspect some hardware testing
> might be in order.

Alexandre,

I followed your tips. The kernel configuration now contains options DDB
and KDB. The sysctl variable 'debug.debugger_on_panic' is set to '1'.
After the 'sysctl debug.kdb.panic=1' command the debugger prompt
appears.
To have a crash dump I should type 'panic' on the debugger prompt.
If I type 'reboot' instead, there are no crash dumps. Is that behaviour
correct?
Another question: must all panics go to the debugger prompt?
I still have neither crash dumps nor debugger prompt during
world/kernel compilations. Just reboots.

--
Regards,
S.Grigoriev.

Gary Jennejohn

Nov 10, 2009, 4:58:56 AM

On Tue, 10 Nov 2009 12:41:59 +0300
S.N.Grigoriev <serguey-...@yandex.ru> wrote:

> I followed your tips. The kernel configuration now contains options DDB
> and KDB. The sysctl variable 'debug.debugger_on_panic' is set to '1'.
> After the 'sysctl debug.kdb.panic=1' command the debugger prompt
> appears.
> To have a crash dump I should type 'panic' on the debugger prompt.
> If I type 'reboot' instead, there are no crash dumps. Is that behaviour
> correct?
>

In this case yes, because you forced the panic by setting the sysctl.

> Another question: must all panics go to the debugger prompt?
>

They should, if the kernel itself causes the panic. As Alexandre
wrote, a hardware problem probably would not cause a kernel panic.

I've seen panics like this myself and they were almost always caused
by faulty memory. Running "make buildworld" really stresses the
system and reveals hardware faults like that.

> I still have neither crash dumps nor debugger prompt during
> world/kernel compilations. Just reboots.
>

Are you in the console or running Xorg when it happens?

---
Gary Jennejohn

S.N.Grigoriev

Nov 10, 2009, 8:57:49 AM

10.11.09, 10:58, "Gary Jennejohn" <gary.je...@freenet.de>
wrote:

> I've seen panics like this myself and they were almost always caused
> by faulty memory. Running "make buildworld" really stresses the
> system and reveals hardware faults like that.

Is there a non-zero probability that 7-stable irrespective of faulty memory
ALWAYS works fine on the same hardware (including 'make -j 8 buildworld')
but 8.0 ALWAYS crashes?

> Are you in the console or running Xorg when it happens?

Yes, I'm in the console. Xorg is not installed at all.

--
Regards,
S.Grigoriev.

Gary Jennejohn

Nov 10, 2009, 9:07:05 AM

On Tue, 10 Nov 2009 16:57:49 +0300
S.N.Grigoriev <serguey-...@yandex.ru> wrote:

>
>
> 10.11.09, 10:58, "Gary Jennejohn" <gary.je...@freenet.de>
> wrote:
>
> > I've seen panics like this myself and they were almost always caused
> > by faulty memory. Running "make buildworld" really stresses the
> > system and reveals hardware faults like that.
>
> Is there a non-zero probability that 7-stable irrespective of faulty memory
> ALWAYS works fine on the same hardware (including 'make -j 8 buildworld')
> but 8.0 ALWAYS crashes?
>

Who can say? Newer versions of FreeBSD could be more aggressive in their
memory usage. Other parts of the kernel like the ATA/AHCI subsystem also
use hardware capabilities more fully.

It seems that the kernel is not seeing any panic, otherwise it should
land in the debugger. This points to some hardware problem, especially
since other users are not reporting crashes like you see.

---
Gary Jennejohn

Andriy Gapon

Nov 10, 2009, 9:28:46 AM

on 10/11/2009 15:57 S.N.Grigoriev said the following:

>
> 10.11.09, 10:58, "Gary Jennejohn" <gary.je...@freenet.de>
> wrote:
>
>> I've seen panics like this myself and they were almost always caused
>> by faulty memory. Running "make buildworld" really stresses the
>> system and reveals hardware faults like that.
>
> Is there a non-zero probability that 7-stable irrespective of faulty memory
> ALWAYS works fine on the same hardware (including 'make -j 8 buildworld')
> but 8.0 ALWAYS crashes?
>

>> Are you in the console or running Xorg when it happens?
>
> Yes, I'm in the console. Xorg is not installed at all.

I can not provide any help with your problem, sorry.

But I'd like to point out one thing. I think that it was your mistake to report
your problem in this thread as opposed to a new thread or a PR. This thread
started with a report by Kai Gallasch about a lock order reversal panic. Hence
the subject of this thread. Then you said that you have "the same problem", but
the symptoms you described were a spontaneous reboot without any panic messages.
So some people spent their time trying to help you to set up kernel dumps, some
people missed your report because of the subject, some people missed Kai's report
because the discussion went the wrong way, some people were tempted to point to
the latest hardware issue discovered.
Apologies that I picked on you, but it's not helpful when such different stuff
is piled up into the same thread.

Just for the record, here is the original message:
http://groups.google.com/group/mailing.freebsd.current/browse_thread/thread/7cdc98adfb88a34d
And it's a clear lock order reversal panic that was lost in the noise generated by
the followups.

Yours ranting,
--
Andriy Gapon

Andriy Gapon

Nov 10, 2009, 9:30:28 AM

on 10/11/2009 16:28 Andriy Gapon said the following:
> Yours ranting,

Sorry for the rant, it seems that I've missed half the text in the original report :-(

Alexandre "Sunny" Kovalenko

Nov 10, 2009, 9:43:58 AM

On Tue, 2009-11-10 at 12:41 +0300, S.N.Grigoriev wrote:
>
> 07.11.09, 13:32, "Alexandre \"Sunny\" Kovalenko" <gaij...@gmail.com>
> wrote:
>
> > Sergey, I think it would be best if you follow
> > http://www.freebsd.org/doc/en_US.ISO8859-1/books/developers-handbook/kerneldebug.html#KERNELDEBUG-OBTAIN
> > and then do an ultimate test:
> > * quiesce your system
> > * switch to the console
> > * sync (few times, if you are really old school ;)
> > * sysctl debug.kdb.panic=1 (this would *panic* the system and, given
> > everything is set-up properly, produce the crash dump)
> >
> > if you do not have debug.kdb.panic sysctl, please, add option KDB to
> > your kernel configuration.
> > If you get crash dump from the kernel-induced panic and your system
> > keeps rebooting without a trace, I would suspect some hardware testing
> > might be in order.
>
> Alexandre,
>

> I followed your tips. The kernel configuration now contains options DDB
> and KDB. The sysctl variable 'debug.debugger_on_panic' is set to '1'.
> After the 'sysctl debug.kdb.panic=1' command the debugger prompt
> appears.
> To have a crash dump I should type 'panic' on the debugger prompt.
> If I type 'reboot' instead, there are no crash dumps. Is that behaviour
> correct?

This is correct: if you use reboot from the debugger prompt, you will
not get a crash dump. I don't know whether this is the "right" behavior or
not, but I have definitely seen this before.

> Another question: must all panics go to the debugger prompt?

No, set debug.debugger_on_panic to 0 and the system will reboot on panic,
saving the crash dump at the next startup, provided you have the crash dump
device configured properly. If I understood you correctly, after you typed
'panic' at the debugger prompt, your system restarted and, eventually,
a usable dump found its way to /var/crash. Is this correct? If yes, you
can change your system to skip the debugger on panic.
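
A small sketch for checking whether a dump actually arrived, assuming
saved dumps are stored as vmcore.N files in the crash directory. The
function name list_crash_dumps is hypothetical.

```shell
#!/bin/sh
# List saved kernel dump files in a crash directory
# (defaults to /var/crash when no argument is given).
list_crash_dumps() {
    dir=${1:-/var/crash}
    for f in "$dir"/vmcore.*; do
        [ -e "$f" ] || continue     # glob did not match: no dumps saved
        echo "${f##*/}"
    done
}
```

An empty result after a test panic would suggest that the dump device or
savecore is still not set up correctly.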

> I still have neither crash dumps nor debugger prompt during
> world/kernel compilations. Just reboots.
>

I am not qualified to tell you that there is absolutely no possibility
that FreeBSD could reboot without leaving any trace on the perfectly
functioning hardware, but I would think exploring other options might
not be out of order now.

Unscientifically, I have observed that 8.0RC2 runs hotter on long builds
than 7-STABLE did before. I have lowered _PSV, because handrest gets
uncomfortably warm and I seem to see it being hit (CPU is being
throttled) more often than I remember from the days of 7-STABLE. Again,
this is *completely* unscientific. It is also possible that dust has
accumulated in the air duct and that this has nothing to do with differences
at the OS level.

If you still dual boot on that machine and have some device to check
power draw during buildworld for both 7.2 and 8.0 it might be a
worthwhile exercise. Another thing would be to monitor whatever thermal
sensor you have available to you; I think k8temp(8) provides the necessary
services for AMD64.
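
One way to compare readings between the two releases is to record the
highest temperature seen during a build. The sketch below assumes the
sensor driver exposes lines of the form "dev.cpu.0.temperature: 45.0C"
through sysctl; both that name and format are assumptions and may differ
on a given machine. The function name max_temperature is hypothetical.

```shell
#!/bin/sh
# Read sysctl-style lines on stdin and print the highest temperature
# reading, e.g. "52.5". Lines without "temperature" are ignored.
max_temperature() {
    awk -F': ' '
        /temperature/ {
            t = $2
            sub(/C$/, "", t)        # drop the trailing unit
            t += 0                  # force numeric comparison
            if (!seen || t > max) { max = t; seen = 1 }
        }
        END { if (seen) printf "%.1f\n", max }
    '
}
```

Something like: sysctl dev.cpu | max_temperature
run periodically during buildworld on both 7.2 and 8.0 would give a
rough comparison.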

Connecting either serial or firewire console to the machine and
capturing anything sent to the console right before the reboot might be
yet another thing to try.

--
Alexandre Kovalenko (Олександр Коваленко)

Alexandre "Sunny" Kovalenko

Nov 10, 2009, 9:47:32 AM

On Tue, 2009-11-10 at 10:58 +0100, Gary Jennejohn wrote:
> On Tue, 10 Nov 2009 12:41:59 +0300
> S.N.Grigoriev <serguey-...@yandex.ru> wrote:
>

> > I followed your tips. The kernel configuration now contains options DDB
> > and KDB. The sysctl variable 'debug.debugger_on_panic' is set to '1'.
> > After the 'sysctl debug.kdb.panic=1' command the debugger prompt
> > appears.
> > To have a crash dump I should type 'panic' on the debugger prompt.
> > If I type 'reboot' instead, there are no crash dumps. Is that behaviour
> > correct?
> >
>

> In this case yes, because you forced the panic by setting the sysctl.
>

> > Another question: must all panics go to the debugger prompt?
> >
>

> They should, if the kernel itself causes the panic. As Alexandre
> wrote, a hardware problem probably would not cause a kernel panic.

I really hope I have not written anything like that ;-) I have seen
plenty of panics induced by hardware problems -- from faulty memory to
overheated CPU and beyond. What I was trying to say is that a reboot
*without a panic* is more likely a hardware problem than not.

gary.je...@freenet.de

Nov 10, 2009, 10:22:05 AM

On Tue, 10 Nov 2009 09:47:32 -0500

"Alexandre \"Sunny\" Kovalenko" <gaij...@gmail.com> wrote:

> > They should, if the kernel itself causes the panic. As Alexandre
> > wrote, a hardware problem probably would not cause a kernel panic.
>
> I really hope I have not written anything like that ;-) I have seen
> plenty of panics induced by hardware problems -- from faulty memory to
> overheated CPU and beyond. What I was trying to say is that a reboot
> *without a panic* is more likely a hardware problem than not.
>

Well, OK, I may have misinterpreted what you wrote or have chosen bad
wording myself to convey the same message. Nonetheless it looks like
a hardware problem to me.

---
Gary Jennejohn

Andriy Gapon

Nov 10, 2009, 12:05:23 PM

on 10/11/2009 17:22 gary.je...@freenet.de said the following:

> Well, OK, I may have misinterpreted what you wrote or have chosen bad
> wording myself to convey the same message. Nonetheless it looks like
> a hardware problem to me.

[Trying to make up for my previous mistake.]

The symptom certainly looks like misbehaving hardware, but other information from
the reports seems to suggest that it is possible that this misbehavior might be
caused by software misconfiguring the hardware.

I would re-test vm.pmap.pg_ps_enabled=0 just to be sure that it was done correctly the
first time.
I would try to see how 8.0-RC1 kernel behaves and in general try to find last
working, first non-working version.
It would be useful to know any (if any) non-default loader.conf and rc.conf
settings or kernel config (if not GENERIC).

Not a trivial issue unless it is hardware indeed.

--
Andriy Gapon

Kai Gallasch

Nov 10, 2009, 12:48:21 PM

Am Tue, 10 Nov 2009 19:05:23 +0200
schrieb Andriy Gapon <a...@icyb.net.ua>:

> on 10/11/2009 17:22 gary.je...@freenet.de said the following:
> > Well, OK, I may have misinterpreted what you wrote or have chosen
> > bad wording myself to convey the same message. Nonetheless it
> > looks like a hardware problem to me.
>
> [Trying to make up for my previous mistake.]
>
> The symptom certainly looks like misbehaving hardware, but other
> information from the reports seems to suggest that it is possible
> that this misbehavior might be caused by software misconfiguring the
> hardware.

Hi.

This thread was started by me. In the meantime I filed a PR:
http://www.freebsd.org/cgi/query-pr.cgi?pr=140338

> I would re-test vm.pmap.pg_ps_enabled=0 just to be sure that it was
> done correctly the first time.

I toggled vm.pmap.pg_ps_enabled three times between reboots and the
result is always the same. superpages enabled: reboot, superpages not
enabled: server stable
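
Toggling the setting between reboots could be scripted roughly as below,
assuming vm.pmap.pg_ps_enabled is set as a boot-time tunable in
/boot/loader.conf, in the same way hw.mca.enabled is set elsewhere in
this thread. The function name set_loader_tunable is hypothetical.

```shell
#!/bin/sh
# Set name="value" in a loader.conf-style file: any existing assignment
# for that name is dropped and the new one is appended.
set_loader_tunable() {
    file=$1; name=$2; value=$3
    [ -f "$file" ] || : > "$file"
    tmp=$(mktemp) || return 1
    # Keep every line that does not start with "name=".
    awk -v n="$name" 'index($0, n "=") != 1' "$file" > "$tmp"
    printf '%s="%s"\n' "$name" "$value" >> "$tmp"
    cat "$tmp" > "$file"
    rm -f "$tmp"
}
```

For example: set_loader_tunable /boot/loader.conf vm.pmap.pg_ps_enabled 0
followed by a reboot, then the same with 1 for the comparison run.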

> I would try to see how 8.0-RC1 kernel behaves and in general try to
> find last working, first non-working version.

8.0RC1, 8.0BETA4 already showed the same behaviour

> It would be useful to know any (if any) non-default loader.conf and
> rc.conf settings or kernel config (if not GENERIC).

loader.conf untouched, rc.conf had just settings for networking active
when testing. In the end I enabled some other stuff to have it ready for
8.0 RELEASE, *after* I found out that disabling superpages helped
against the crashes.

Ah yes. I also ran memtest86 on the server for about half a day - no
problems.

But read for yourself in the PR.

I don't rule out that this behaviour with vm.pmap.pg_ps_enabled may be
hardware related, but why then is the server running stable
with RELENG_7 and memtest and server diagnostics don't report any
problem?

--Kai.

--
I am NOMAD!

Adam Vande More

Nov 10, 2009, 12:55:20 PM

>
>
> I don't rule out that this behaviour with vm.pmap.pg_ps_enabled may be
> hardware related, but why then is the server running stable
> with RELENG_7 and memtest and server diagnostics don't report any
> problem?
>
> --Kai.
>
>

If memtest fails a module, you can generally trust the result. There are
many occasions when faulty modules will pass, however, so the only reliable
course of action when diagnosing memory issues is to replace them with known-good
modules.

--
Adam Vande More

Mark Atkinson

Nov 10, 2009, 12:15:42 PM

Andriy Gapon wrote:
> on 10/11/2009 17:22 gary.je...@freenet.de said the following:
>> Well, OK, I may have misinterpreted what you wrote or have chosen bad
>> wording myself to convey the same message. Nonetheless it looks like
>> a hardware problem to me.
>
> [Trying to make up for my previous mistake.]
>
> The symptom certainly looks like misbehaving hardware, but other information from
> the reports seems to suggest that it is possible that this misbehavior might be
> caused by software misconfiguring the hardware.
>

> I would re-test vm.pmap.pg_ps_enabled=0 just to be sure that it was done correctly the
> first time.

> I would try to see how 8.0-RC1 kernel behaves and in general try to find last
> working, first non-working version.

> It would be useful to know any (if any) non-default loader.conf and rc.conf
> settings or kernel config (if not GENERIC).
>

> Not a trivial issue unless it is hardware indeed.
>

Also, you can try adding:

hw.mca.enabled="1" in /boot/loader.conf, reboot, and then see if there
is a machine check exception on the console during the buildworld.

Mark Atkinson

Nov 10, 2009, 2:29:00 PM

Kai Gallasch wrote:
> Am Tue, 10 Nov 2009 19:05:23 +0200
> schrieb Andriy Gapon <a...@icyb.net.ua>:
>

>> on 10/11/2009 17:22 gary.je...@freenet.de said the following:
>>> Well, OK, I may have misinterpreted what you wrote or have chosen
>>> bad wording myself to convey the same message. Nonetheless it
>>> looks like a hardware problem to me.
>> [Trying to make up for my previous mistake.]
>>
>> The symptom certainly looks like misbehaving hardware, but other
>> information from the reports seems to suggest that it is possible
>> that this misbehavior might be caused by software misconfiguring the
>> hardware.
>

> Hi.
>
> This thread was started by me. In the meantime I filed a PR:
> http://www.freebsd.org/cgi/query-pr.cgi?pr=140338
>

>> I would re-test vm.pmap.pg_ps_enabled=0 just to be sure that it was
>> done correctly the first time.
>

> I toggled vm.pmap.pg_ps_enabled three times between reboots and the
> result is always the same. superpages enabled: reboot, superpages not
> enabled: server stable
>

>> I would try to see how 8.0-RC1 kernel behaves and in general try to
>> find last working, first non-working version.

> 8.0RC1, 8.0BETA4 already showed the same behaviour
>

>> It would be useful to know any (if any) non-default loader.conf and
>> rc.conf settings or kernel config (if not GENERIC).
>

> loader.conf untouched, rc.conf had just settings for networking active
> when testing. In the end I enabled some other stuff to have it ready for
> 8.0 RELEASE, *after* I found out that disabling superpages helped
> against the crashes.
>
> Ah yes. I also ran memtest86 on the server for about half a day - no
> problems.
>
> But read for yourself in the PR.
>

> I don't rule out that this behaviour with vm.pmap.pg_ps_enabled may be
> hardware related, but why then is the server running stable
> with RELENG_7 and memtest and server diagnostics don't report any
> problem?

See the following, where I first noticed this problem a long time
ago on my HPDL385g5. It also passed memtest86 for days, and swapping
out memory modules gave the same result.

http://article.gmane.org/gmane.os.freebsd.current/111307

I suspect this is actually a machine check exception you're seeing,
which you'll notice if you enable

hw.mca.enabled="1", and superpages, then do buildworld. Using -j doesn't
matter, it just takes longer to throw an exception.

I'm hoping this is the rev E lfence problem, even though my chips are
not targeted. When and if a patch goes into -current, I'll try it out
to see if the problem with superpages goes away.

-Mark

Alexandre "Sunny" Kovalenko

Nov 10, 2009, 3:29:35 PM

On Tue, 2009-11-10 at 19:05 +0200, Andriy Gapon wrote:
> on 10/11/2009 17:22 gary.je...@freenet.de said the following:
> > Well, OK, I may have misinterpreted what you wrote or have chosen bad
> > wording myself to convey the same message. Nonetheless it looks like
> > a hardware problem to me.
>
> [Trying to make up for my previous mistake.]
>
> The symptom certainly looks like misbehaving hardware, but other information from
> the reports seems to suggest that it is possible that this misbehavior might be
> caused by software misconfiguring the hardware.
>

> I would re-test vm.pmap.pg_ps_enabled=0 just to be sure that it was done correctly the
> first time.

> I would try to see how 8.0-RC1 kernel behaves and in general try to find last
> working, first non-working version.

To that end, Rui Paulo committed a utility to do binary search on the
revisions, which will likely come in handy.

>>>> 1. install devel/p5-App-SVN-Bisect
>>>> 2. svn checkout http://svn.freebsd.org/base/stable/8 freebsd
>>>> 3. svn-bisect --start 198443 --end 198831 start
>>>> 4. Build a kernel and test.
>>>> If it works, type 'svn-bisect good'
>>>> If it doesn't work, type 'svn-bisect bad'
>>>> 5. Repeat process from step 4 until svn-bisect finds the culprit
>>>> revision.
>>>>
>>>>http://search.cpan.org/~infinoid/App-SVN-Bisect-0.8/bin/svn-bisect#EXAMPLE
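
The idea behind those steps is an ordinary binary search over revision
numbers. The sketch below is not how svn-bisect is implemented; it only
illustrates the search, assuming the lower revision is known good, the
upper one is known bad, and a test command exits 0 for a good revision.
The function name first_bad_revision is hypothetical.

```shell
#!/bin/sh
# Usage: first_bad_revision GOOD BAD test-command [args...]
# Prints the first revision in (GOOD, BAD] for which the test command,
# called with the revision appended as its last argument, fails.
first_bad_revision() {
    good=$1; bad=$2; shift 2
    while [ $((bad - good)) -gt 1 ]; do
        mid=$(( (good + bad) / 2 ))
        if "$@" "$mid"; then
            good=$mid       # mid works: the culprit is later
        else
            bad=$mid        # mid fails: the culprit is mid or earlier
        fi
    done
    echo "$bad"
}
```

For the range quoted above (198443 to 198831) this needs about nine
build-and-test rounds, since each round halves the interval.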

--
Alexandre Kovalenko (Олександр Коваленко)

Andriy Gapon

Nov 11, 2009, 8:09:06 AM

on 10/11/2009 19:48 Kai Gallasch said the following:

> Am Tue, 10 Nov 2009 19:05:23 +0200
> schrieb Andriy Gapon <a...@icyb.net.ua>:
>

>> on 10/11/2009 17:22 gary.je...@freenet.de said the following:
>>> Well, OK, I may have misinterpreted what you wrote or have chosen
>>> bad wording myself to convey the same message. Nonetheless it
>>> looks like a hardware problem to me.
>> [Trying to make up for my previous mistake.]
>>
>> The symptom certainly looks like misbehaving hardware, but other
>> information from the reports seems to suggest that it is possible
>> that this misbehavior might be caused by software misconfiguring the
>> hardware.
>

> Hi.
>
> This thread was started by me. In the meantime I filed a PR:
> http://www.freebsd.org/cgi/query-pr.cgi?pr=140338
>

>> I would re-test vm.pmap.pg_ps_enabled=0 just to be sure that it was
>> done correctly the first time.
>

> I toggled vm.pmap.pg_ps_enabled three times between reboots and the
> result is always the same. superpages enabled: reboot, superpages not
> enabled: server stable

Yes, I saw your report.
I was following up on the other report, where the symptoms are very similar but
pg_ps_enabled does not seem to help.

>> I would try to see how 8.0-RC1 kernel behaves and in general try to
>> find last working, first non-working version.

> 8.0RC1, 8.0BETA4 already showed the same behaviour
>
>> It would be useful to know any (if any) non-default loader.conf and
>> rc.conf settings or kernel config (if not GENERIC).
>
> loader.conf untouched, rc.conf had just settings for networking active
> when testing. In the end I enabled some other stuff to have it ready for
> 8.0 RELEASE, *after* I found out that disabling superpages helped
> against the crashes.
>
> Ah yes. I also ran memtest86 on the server for about half a day - no
> problems.
>
> But read for yourself in the PR.
>
> I don't rule out that this behaviour with vm.pmap.pg_ps_enabled may be
> hardware related, but why then is the server running stable
> with RELENG_7 and memtest and server diagnostics don't report any
> problem?

What I meant is that sometimes software can incorrectly configure hardware, or
configure it in a way that was not thoroughly tested by the manufacturer. I didn't mean
to say that your hardware has any defect (but perhaps it does; hardware errata
don't exist for nothing).

--
Andriy Gapon

S.N.Grigoriev

Nov 11, 2009, 2:15:18 PM

10.11.09, 09:15, "Mark Atkinson" <atki...@yahoo.com>
wrote:

> Andriy Gapon wrote:
> > on 10/11/2009 17:22 gary.je...@freenet.de said the following:
> >> Well, OK, I may have misinterpreted what you wrote or have chosen bad
> >> wording myself to convey the same message. Nonetheless it looks like
> >> a hardware problem to me.
> >
> > [Trying to make up for my previous mistake.]
> >
> > The symptom certainly looks like misbehaving hardware, but other information from
> > the reports seems to suggest that it is possible that this misbehavior might be
> > caused by software misconfiguring the hardware.
> >

> > I would re-test vm.pmap.pg_ps_enabled=0 just to be sure that it was done correctly the
> > first time.

> > I would try to see how 8.0-RC1 kernel behaves and in general try to find last
> > working, first non-working version.

> > It would be useful to know any (if any) non-default loader.conf and rc.conf
> > settings or kernel config (if not GENERIC).
> >

> > Not a trivial issue unless it is hardware indeed.
> >
> Also, you can try adding:
> hw.mca.enabled="1" in /boot/loader.conf, reboot, and then see if there
> is a machine check exception on the console during the buildworld.

Mark,

I've added hw.mca.enabled="1" in /boot/loader.conf and got the following
screen during the buildworld:

.....
-c /usr/src/gnu/usr.bin/binutils/as/../../../../contrib/binutils/gas/sb.c

MCA: CPU3 UNCOR PCC OVER DTLIB L1 error
MCA: Address 0x8015fb000

Fatal trap 28: machine check trap while in user mode

Fatal trap 28: machine check trap while in user mode
cpuid = 3, apic id = 03
..... etc.

I've typed 'panic' at the 'db>' prompt but got no crash dump.
Please tell me what I can do to investigate the problem further.
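
For a captured console log, the machine check lines can be summarised
with a small filter. This sketch assumes the "MCA: CPUn ..." format shown
above and only distinguishes events flagged UNCOR from everything else;
the function name mca_summary is hypothetical.

```shell
#!/bin/sh
# Read console output on stdin and print one line per "MCA: CPUn ..."
# event: the CPU number, then "uncorrectable" if the UNCOR flag is
# present and "other" if it is not.
mca_summary() {
    sed -n 's/^MCA: CPU\([0-9][0-9]*\) \(.*\)$/\1 \2/p' |
    while read -r cpu rest; do
        case " $rest " in
            *" UNCOR "*) echo "$cpu uncorrectable" ;;
            *)           echo "$cpu other" ;;
        esac
    done
}
```

Seeing the same CPU number on every event, across several runs, would
point at that one processor rather than at memory in general.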

--
Regards,
S.Grigoriev.

John Baldwin

Nov 11, 2009, 3:04:14 PM

Your hardware is broken and it is telling you so. You have had multiple
machine checks, with the most severe one being an uncorrectable error in your
data TLB (i.e. in the CPU itself).

--
John Baldwin

Mark Atkinson

Nov 11, 2009, 3:13:55 PM

> ......

> -c /usr/src/gnu/usr.bin/binutils/as/../../../../contrib/binutils/gas/sb.c
>
> MCA: CPU3 UNCOR PCC OVER DTLIB L1 error
> MCA: Address 0x8015fb000
>
>

> Fatal trap 28: machine check trap while in user mode
>
> Fatal trap 28: machine check trap while in user mode
> cpuid = 3, apic id = 03

> ...... etc.

>
>
> I've typed 'panic' at the 'db>' prompt but got no crash dump.
> Tell me please what can I do for further problem investigation.
>

Well, you're about at the point I am now with my HP dl385g5: only
turning off superpages would result in a successful buildworld. Mine
would often machine check during gas compilation as well.

You can try issuing 'where' or 'bt' to see the backtrace, but it
probably wouldn't reveal anything useful.

If 'panic' doesn't create a dump, you can try 'call doadump', then
'reset' to reset the machine.

All the best,

Mark

Emil Mikulic

Nov 11, 2009, 6:17:53 PM

On Tue, Nov 10, 2009 at 09:15:42AM -0800, Mark Atkinson wrote:
> Also, you can try adding:
>
> hw.mca.enabled="1" in /boot/loader.conf

This is a little off-topic, but:
Why is this disabled by default in FreeBSD?

Giovanni Trematerra

Nov 12, 2009, 3:48:35 AM

On Thu, Nov 12, 2009 at 12:17 AM, Emil Mikulic <emik...@gmail.com> wrote:
> On Tue, Nov 10, 2009 at 09:15:42AM -0800, Mark Atkinson wrote:
>> Also, you can try adding:
>>
>> hw.mca.enabled="1" in /boot/loader.conf
>
> This is a little off-topic, but:
> Why is this disabled by default in FreeBSD?

I guess because it introduces an overhead.

--
Trematerra Giovanni

Andriy Gapon

Nov 12, 2009, 8:36:42 AM

on 11/11/2009 21:15 S.N.Grigoriev said the following:

>
> MCA: CPU3 UNCOR PCC OVER DTLIB L1 error
> MCA: Address 0x8015fb000
>
>
> Fatal trap 28: machine check trap while in user mode
>
> Fatal trap 28: machine check trap while in user mode
> cpuid = 3, apic id = 03
> ..... etc.
>
>
> I've typed 'panic' at the 'db>' prompt but got no crash dump.
> Tell me please what can I do for further problem investigation.

Serguey,

are you sure that setting vm.pmap.pg_ps_enabled=0 doesn't help you?
I know that I already asked you this once.
But could you please try again with vm.pmap.pg_ps_enabled=0 and hw.mca.enabled=1
and see what kind of behavior you get?
I am curious whether it would be the same kind of machine check condition.

--
Andriy Gapon

Andriy Gapon

Nov 12, 2009, 8:38:59 AM

on 12/11/2009 10:48 Giovanni Trematerra said the following:

> On Thu, Nov 12, 2009 at 12:17 AM, Emil Mikulic <emik...@gmail.com> wrote:
>> On Tue, Nov 10, 2009 at 09:15:42AM -0800, Mark Atkinson wrote:
>>> Also, you can try adding:
>>>
>>> hw.mca.enabled="1" in /boot/loader.conf
>> This is a little off-topic, but:
>> Why is this disabled by default in FreeBSD?
>
> I guess because it introduces an overhead.

I don't think that there is any noticeable overhead.
My guess is that it is because this is a new feature which has not been widely tested yet.

Andriy Gapon

Nov 12, 2009, 8:59:26 AM

on 11/11/2009 22:13 Mark Atkinson said the following:

>
> Well, you're about at the point I am now with my HP dl385g5, only
> turning off superpages would result in a successful buildworld. Mine
> would often machine check during gas compilation as well.

Mark,

your mentioning MCA was a magic moment for me.
I was debugging a problem which seemed to be quite different, but now I think that
it converges with the problem discussed in this thread (if indeed it's the same
problem for all reporters).
The difference is that I use a "consumer level" system based on a family 10h Athlon
II and you use Opterons, seemingly also of the 10h or Fh families.
I guess that means that you and Kai both use "high end"/"server grade" systems or
some such. It's possible that firmware/BIOS on your systems enables and monitors
MCA by default, even when the OS is not MCA-enabled.
As such, I am curious if you have any BIOS settings that look related
to Machine Check. Or perhaps there is something like an Event Log in the BIOS. Maybe it
even records something useful.
Could you please check?

About my problem: it seems that I was working from the opposite end. I have been
using head/CURRENT with pg_ps_enabled=1 for quite a while now. Then I decided
to try hw.mca.enabled=1, and after that I started having the same symptoms as
described here. Unfortunately, I never got a Machine Check trap; it's always
something that looks like a CPU halt followed by a reset by the watchdog (if it is enabled).
So, for me:
superpages and no machine check - works
machine check and no superpages - works
machine check and superpages - problem

Mark Atkinson

Nov 12, 2009, 10:57:40 AM

Andriy Gapon wrote:
> on 11/11/2009 22:13 Mark Atkinson said the following:
>> Well, you're about at the point I am now with my HP dl385g5, only
>> turning off superpages would result in a successful buildworld. Mine
>> would often machine check during gas compilation as well.
>
> Mark,
>
> you mentioning MCA was magic moment for me.
> I was debugging a problem which seemed to be quite different, but now I think that
> it converges to the problem discussed in this thread (if indeed it's the same
> problem for all reporters).

I'm not sure it's exactly the same. I only know a couple of things:

- my memory tests good.
- turning off superpages allows this machine to function properly.

I suspect there's a problem with one of the following:

- the bios of my machine
- the on-die memory controller/instructions of the cpu
- the motherboard electrical interface to memory or bus in some shape or
form.

> Or perhaps there is something like Event Log in BIOS. Maybe it
> even gets something useful.
> Could you please check?

Yes. When you receive an MCE on the HP machines, the BIOS notices and
prints a message on the next bootup, something like "an unhandled memory
error has occurred since last power on."

In my current job, where I work with hardware, we occasionally see
MCEs during development. It's easy to say the memory is bad, and it is
the first thing we replace to test. However, it can also be the
electrical interface to the hardware, which may or may not be
fixable or worked around in firmware. I have also seen cases where the
software initializing or controlling the hardware causes an unhandled
condition that triggers an MCE.

> About my problem - it seems that I was working from the opposite end. I have been
> using head/CURRENT with pg_ps_enabled=1 for quite a while now. And then I decided
> to try hw.mca.enabled=1 and after that I started having the same symptoms as
> described here. Unfortunately, I never did get Machine Check trap, it's always
> something that looks like CPU halt and then reset by watchdog (if it is enabled).
> So, for me:
> superpages and no machine check - works
> machine check and no superpages - works
> machine check and superpages - problem

That's not quite the same, for sure. Definitely try replacing the memory
first if you haven't already.

All the best,

Mark

S.N.Grigoriev

Nov 12, 2009, 11:34:06 AM

12.11.09, 15:36, "Andriy Gapon" <a...@icyb.net.ua>
wrote:

> Serguey,
> are you sure that setting vm.pmap.pg_ps_enabled=0 doesn't help you?
> I know that already asked you this once.
> But, could you please try again with vm.pmap.pg_ps_enabled=0 and hw.mca.enabled=1
> and see what kind of behavior you get?
> I am curious what would happen, would it be the same kind of machine check condition.

Andriy,

I've done the world compilation with 'vm.pmap.pg_ps_enabled=0'.
The first attempt ended with a silent reboot. The second one was
caught by the debugger:

panic: backgroundwritedone: lost buffer
cpuid = 0
KDB: enter: panic
[thread pid 3 tid 100014 ]
Stopped at breakpoint+0x5: leave

--
Regards,
S.Grigoriev.

John Baldwin

Nov 12, 2009, 11:33:30 AM

On Wednesday 11 November 2009 6:17:53 pm Emil Mikulic wrote:
> On Tue, Nov 10, 2009 at 09:15:42AM -0800, Mark Atkinson wrote:
> > Also, you can try adding:
> >
> > hw.mca.enabled="1" in /boot/loader.conf
>
> This is a little off-topic, but:
> Why is this disabled by default in FreeBSD?

Because it is still a new feature and on some machines it causes
panics/freezes out of the box. I'd be happier turning it on once it has been
tested more and once we've figured out the reasons for the various freezes,
etc. that seem to be related to enabling it.

--
John Baldwin

Gary Jennejohn

Nov 12, 2009, 12:28:25 PM

On Thu, 12 Nov 2009 11:33:30 -0500
John Baldwin <j...@freebsd.org> wrote:

> On Wednesday 11 November 2009 6:17:53 pm Emil Mikulic wrote:
> > On Tue, Nov 10, 2009 at 09:15:42AM -0800, Mark Atkinson wrote:
> > > Also, you can try adding:
> > >
> > > hw.mca.enabled="1" in /boot/loader.conf
> >
> > This is a little off-topic, but:
> > Why is this disabled by default in FreeBSD?
>
> Because it is still a new feature and on some machines it causes
> panics/freezes out of the box. I'd be happier turning it on once it has been
> tested more and once we've figured out the reasons for the various freezes,
> etc. that seem to be related to enabling it.
>

A safer way to test this is to enter
set hw.mca.enabled="1"
on the loader command line rather than putting it into loader.conf.

I just did this on my "consumer" AMD64 board without ill effect.

---
Gary Jennejohn

Kai Gallasch

Nov 12, 2009, 1:13:44 PM

Am Thu, 12 Nov 2009 15:59:26 +0200
schrieb Andriy Gapon <a...@icyb.net.ua>:

> on 11/11/2009 22:13 Mark Atkinson said the following:

> >
> > Well, you're about at the point I am now with my HP dl385g5, only
> > turning off superpages would result in a successful buildworld.
> > Mine would often machine check during gas compilation as well.
>
> Mark,
>
> you mentioning MCA was magic moment for me.
> I was debugging a problem which seemed to be quite different, but now
> I think that it converges to the problem discussed in this thread (if
> indeed it's the same problem for all reporters).

> The difference is that I use a "consumer level" system based on
> family 10h Athlon II and you use Opterons, seemingly also 10h or Fh
> families. I guess that means that you and Kai both use "high
> end"/"server grade" systems or some such. It's possible that
> firmware/BIOS on your systems enables and monitors MCA by default,
> even when the OS is not MCA-enabled. As such, I am curious if you
> have any BIOS settings that look like being related to Machine

> Check. Or perhaps there is something like Event Log in BIOS. Maybe

> it even gets something useful. Could you please check?

Hi.

Here is one BIOS option on my server that may be relevant to
this kind of problem:

Advanced Options ->
Processor Options:

No-Execute Page-Protection (DISABLED)

"Enables the HW portion of a feature that allows systems to be protected
against malicious code and viruses. In combination with an OS that
supports this feature, memory is marked as non-executable unless the
location explicitly contains executable code. Some viruses attempt to
insert and execute code from non-executable memory locations.
These attacks are intercepted and an exception is raised."

--Kai.

Andriy Gapon

Nov 12, 2009, 1:52:10 PM

on 12/11/2009 19:28 Gary Jennejohn said the following:

>
> A safer way to test this is to enter
> set hw.mca.enabled="1"
> on the loader command line rather than putting it into loader.conf.
>
> I just did this on my "consumer" AMD64 board without ill effect.

What CPU do you have? Could you provide its description from dmesg?
Did you stress-test it enough (for example with a parallel buildworld)?

--
Andriy Gapon

Kai Gallasch

Nov 12, 2009, 1:59:32 PM

Am Wed, 11 Nov 2009 15:04:14 -0500
schrieb John Baldwin <j...@freebsd.org>:

> On Wednesday 11 November 2009 2:15:18 pm S.N.Grigoriev wrote:
> >
> > 10.11.09, 09:15, "Mark Atkinson" <atki...@yahoo.com>
> > wrote:
> >
> > > Andriy Gapon wrote:
> > > > on 10/11/2009 17:22 gary.je...@freenet.de said the
> > > > following:

> > > > Not a trivial issue unless it is hardware indeed.
> > > >

> > > Also, you can try adding:

> > > hw.mca.enabled="1" in /boot/loader.conf, reboot, and then see if
> > > there is a machine check exception on the console during the
> > > buildworld.
> >
> > Mark,
> >
> > I've added hw.mca.enabled="1" in /boot/loader.conf and got the
> > following screen during the buildworld:
> >
> > .....

> > -c /usr/src/gnu/usr.bin/binutils/as/../../../../contrib/binutils/gas/sb.c

> >
> > MCA: CPU3 UNCOR PCC OVER DTLIB L1 error
> > MCA: Address 0x8015fb000
>

> Your hardware is broken and it is telling you so. You have had
> multiple machine checks with the most severe one being an
> uncorrectable error in your data TLB (i.e. in the CPU itself).

John,

I also set hw.mca.enabled="1" and vm.pmap.pg_ps_enabled="1"
in /boot/loader.conf on my (under load) spontaneously rebooting
opteron proliant server.

Server was upgraded to FREEBSD-8.0-PRERELEASE today.

This is what happened..

---- machine check trap, first run ----

sonnenkraft:/usr/obj # MCA: CPU 5 UNCOR PCC OVER DTLB L1 error
MCA: Address 0x80e5c8000

Fatal trap 28: machine check trap while in user mode

cpuid = 5; apic id = 05
instruction pointer = 0x43:0x691688
stack pointer = 0x3b:0x7fffffffdf90
frame pointer = 0x3b:0x6a2
code segment = base 0x0, limit 0xfffff, type 0x1b
= DPL 3, pres 1, long 1, def32 0, gran 1
processor eflags = interrupt enabled, IOPL = 0
current process = 29319 (cc1)
[thread pid 29319 tid 100086 ]
Stopped at 0x691688: leal 0x1(%rax),%edx
db> where
Tracing pid 29319 tid 100086 td 0xffffff000e065390
WAKEUP_cpu() at 0x691688
*** error reading from address 6aa ***
db> bt
Tracing pid 29319 tid 100086 td 0xffffff000e065390
WAKEUP_cpu() at 0x691688
*** error reading from address 6aa ***
db> call doadump
Cannot dump. Device not defined or unavailable.
= 0x30

---- machine check trap, second run - this
time with dumpdev defined ----

sonnenkraft:~ # MCA: CPU 2 UNCOR PCC OVER DTLB L1 error
MCA: Address 0x8011d3000

Fatal trap 28: machine check trap while in user mode

cpuid = 2; apic id = 02
instruction pointer = 0x43:0x6b1241
stack pointer = 0x3b:0x7fffffffe200
frame pointer = 0x3b:0x7fffffffe240
code segment = base 0x0, limit 0xfffff, type 0x1b
= DPL 3, pres 1, long 1, def32 0, gran 1
processor eflags = interrupt enabled, IOPL = 0
current process = 69498 (cc1)
[thread pid 69498 tid 100338 ]
Stopped at 0x6b1241: call 0x6af140
db> where
Tracing pid 69498 tid 100338 td 0xffffff000ef75720
WAKEUP_cpu() at 0x6b1241
db> bt
Tracing pid 69498 tid 100338 td 0xffffff000ef75720
WAKEUP_cpu() at 0x6b1241
db> call doadump
Physical memory: 20462 MB
Dumping 2303 MB: 2288 2272 2256 2240 2224 2208 2192 2176 2160 2144 2128
2112 2096 2080 2064 2048 2032 2016 2000 1984 1968 1952 1936 1920 1904
1888 1872 1856 1840 1824 1808 1792 1776 1760 1744 1728 1712 1696 1680
1664 1648 1632 1616 1600 1584 1568 1552 1536 1520 1504 1488 1472 1456
1440 1424 1408 1392 1376 1360 1344 1328 1312 1296 1280 1264 1248 1232
1216 1200 1184 1168 1152 1136 1120 1104 1088 1072 1056 1040 1024 1008
992 976 960 944 928 912 896 880 864 848 832 816 800 784 768 752 736 720
704 688 672 656 640 624 608 592 576 560 544 528 512 496 480 464 448 432
416 400 384 368 352 336 320 304 288 272 256 240 224 208 192 176 160 144
128 112 96 80 64 48 32 16
Dump complete
= 0
db> reboot
cpu_reset: Restarting BSP
cpu_reset_proxy: Stopped CPU 2

---- machine check trap, third run - BIOS: static low
power mode enabled, to rule out power/heat issue ----

sonnenkraft:~ # MCA: CPU 4 UNCOR PCC OVER DTLB L1 error
MCA: Address 0x8011fd000

Fatal trap 28: machine check trap while in user mode

cpuid = 4; apic id = 04
instruction pointer = 0x43:0x76127d
stack pointer = 0x3b:0x7fffffffe068
frame pointer = 0x3b:0x7fffffffe090
code segment = base 0x0, limit 0xfffff, type 0x1b
= DPL 3, pres 1, long 1, def32 0, gran 1
processor eflags = interrupt enabled, IOPL = 0
current process = 73135 (cc1)
[thread pid 73135 tid 100146 ]
Stopped at 0x76127d: xorl %edx,%edx
db> where
Tracing pid 73135 tid 100146 td 0xffffff00071caab0
WAKEUP_cpu() at 0x76127d
db> bt
Tracing pid 73135 tid 100146 td 0xffffff00071caab0
WAKEUP_cpu() at 0x76127d
db> call doadump
Physical memory: 20462 MB
Dumping 2335 MB: 2320 2304 2288 2272 2256 2240 2224 2208 2192 2176 2160
2144 2128 2112 2096 2080 2064 2048 2032 2016 2000 1984 1968 1952 1936
1920 1904 1888 1872 1856 1840 1824 1808 1792 1776 1760 1744 1728 1712
1696 1680 1664 1648 1632 1616 1600 1584 1568 1552 1536 1520 1504 1488
1472 1456 1440 1424 1408 1392 1376 1360 1344 1328 1312 1296 1280 1264
1248 1232 1216 1200 1184 1168 1152 1136 1120 1104 1088 1072 1056 1040
1024 1008 992 976 960 944 928 912 896 880 864 848 832 816 800 784 768
752 736 720 704 688 672 656 640 624 608 592 576 560 544 528 512 496 480
464 448 432 416 400 384 368 352 336 320 304 288 272 256 240 224 208 192
176 160 144 128 112 96 80 64 48 32 16
Dump complete
= 0
db> reboot
cpu_reset: Restarting BSP
cpu_reset_proxy: Stopped CPU 4

---- END: ----

What hardware parts are defective and need replacement? CPU, memory
or mainboard?

I now have two vmcore's + crashinfo core.txt available on the server.
Are they of any use to get further information?

--Kai.

--
Draft beer, not people.

Gary Jennejohn

Nov 12, 2009, 2:48:05 PM

On Thu, 12 Nov 2009 20:52:10 +0200
Andriy Gapon <a...@icyb.net.ua> wrote:

> on 12/11/2009 19:28 Gary Jennejohn said the following:
> >
> > A safer way to test this is to enter
> > set hw.mca.enabled="1"
> > on the loader command line rather than putting it into loader.conf.
> >
> > I just did this on my "consumer" AMD64 board without ill effect.
>
> What CPU do you have, could you provide its description from dmesg?
> Did you stress-tested it enough (like parallel buildworld)?
>

I did about 1/2 of a parallel build world (-j3) and then stopped it. I
figured, if it lasts that long, it should make it all the way through.

CPU: AMD Athlon(tm) Dual Core Processor 4850e (2505.35-MHz K8-class CPU)
Origin = "AuthenticAMD" Id = 0x60fb2 Stepping = 2
Features=0x178bfbff<FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,CLFLUSH,MMX,FXSR,SSE,SSE2,HTT>
Features2=0x2001<SSE3,CX16>
AMD Features=0xea500800<SYSCALL,NX,MMX+,FFXSR,RDTSCP,LM,3DNow!+,3DNow!>
AMD Features2=0x11f<LAHF,CMP,SVM,ExtAPIC,CR8,Prefetch>
TSC: P-state invariant

---
Gary Jennejohn

Andriy Gapon

Nov 12, 2009, 5:09:25 PM

on 12/11/2009 21:48 Gary Jennejohn said the following:

> On Thu, 12 Nov 2009 20:52:10 +0200
> Andriy Gapon <a...@icyb.net.ua> wrote:
>
>> on 12/11/2009 19:28 Gary Jennejohn said the following:
>>> A safer way to test this is to enter
>>> set hw.mca.enabled="1"
>>> on the loader command line rather than putting it into loader.conf.
>>>
>>> I just did this on my "consumer" AMD64 board without ill effect.
>> What CPU do you have, could you provide its description from dmesg?
>> Did you stress-tested it enough (like parallel buildworld)?
>>
>
> I did about 1/2 of a parallel build world (-j3) and then stopped it. I
> figured, if it lasts that long, it should make it all the way through.
>
> CPU: AMD Athlon(tm) Dual Core Processor 4850e (2505.35-MHz K8-class CPU)
> Origin = "AuthenticAMD" Id = 0x60fb2 Stepping = 2 Features=0x178bfbff<FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,CLFLUSH,MMX,FXSR,SSE,SSE2,HTT>
> Features2=0x2001<SSE3,CX16>
> AMD Features=0xea500800<SYSCALL,NX,MMX+,FFXSR,RDTSCP,LM,3DNow!+,3DNow!>
> AMD Features2=0x11f<LAHF,CMP,SVM,ExtAPIC,CR8,Prefetch>
> TSC: P-state invariant

Hmm, all the people who have the problem and who provided CPU information seem
to have family 16 (0x10, 10h) AMD processors. You have a family 15 one. Maybe
it's that, maybe not.

BTW, here is my info:
CPU: AMD Athlon(tm) II X2 250 Processor (3013.73-MHz K8-class CPU)
Origin = "AuthenticAMD" Id = 0x100f62 Stepping = 2

Features=0x178bfbff<FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,CLFLUSH,MMX,FXSR,SSE,SSE2,HTT>
Features2=0x802009<SSE3,MON,CX16,POPCNT>
AMD Features=0xee500800<SYSCALL,NX,MMX+,FFXSR,Page1GB,RDTSCP,LM,3DNow!+,3DNow!>
AMD Features2=0x37ff<LAHF,CMP,SVM,ExtAPIC,CR8,ABM,SSE4A,MAS,Prefetch,OSVW,IBS,SKINIT,WDT>
TSC: P-state invariant
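
The family numbers above can be read directly from the "Id" value that dmesg
prints. A minimal Python sketch, assuming the AMD rule that the extended
family and extended model fields only count when the base family field is
0xF; decode_signature() is a name made up for this illustration:

```python
def decode_signature(cpu_id):
    """Split a CPUID function 1 EAX value into (family, model, stepping)."""
    stepping = cpu_id & 0xF
    base_model = (cpu_id >> 4) & 0xF
    base_family = (cpu_id >> 8) & 0xF
    ext_model = (cpu_id >> 16) & 0xF
    ext_family = (cpu_id >> 20) & 0xFF

    family = base_family
    model = base_model
    if base_family == 0xF:
        # Extended fields apply only when the base family is saturated.
        family += ext_family
        model |= ext_model << 4
    return family, model, stepping


# The two Id values quoted in this message.
for cpu_id in (0x60FB2, 0x100F62):
    family, model, stepping = decode_signature(cpu_id)
    print(f"Id {cpu_id:#x}: family {family:#x}, model {model:#x}, "
          f"stepping {stepping}")
```

With these inputs the Athlon 4850e (Id 0x60fb2) decodes to family 0xf and
the Athlon II X2 250 (Id 0x100f62) to family 0x10.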

--
Andriy Gapon

Etienne Robillard

Nov 12, 2009, 6:11:47 PM

Here's my dmesg output. No problems at all with buildworld/installworld. I
also didn't play much with any debugging options; this is the GENERIC
kernel... ;-)

Copyright (c) 1992-2009 The FreeBSD Project.
Copyright (c) 1979, 1980, 1983, 1986, 1988, 1989, 1991, 1992, 1993, 1994
The Regents of the University of California. All rights reserved.
FreeBSD is a registered trademark of The FreeBSD Foundation.
FreeBSD 8.0-RC2 #0: Sun Oct 25 07:27:19 UTC 2009
ro...@mason.cse.buffalo.edu:/usr/obj/usr/src/sys/GENERIC
Timecounter "i8254" frequency 1193182 Hz quality 0
CPU: AMD Athlon(tm) 64 X2 Dual Core Processor 5200+ (2613.41-MHz
K8-class CPU)
Origin = "AuthenticAMD" Id = 0x40f32 Stepping = 2

Features=0x178bfbff<FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,CLFLUSH,MMX,FXSR,SSE,SSE2,HTT>
Features2=0x2001<SSE3,CX16>
AMD Features=0xea500800<SYSCALL,NX,MMX+,FFXSR,RDTSCP,LM,3DNow!+,3DNow!>

AMD Features2=0x1f<LAHF,CMP,SVM,ExtAPIC,CR8>
real memory = 536870912 (512 MB)
avail memory = 499974144 (476 MB)
ACPI APIC Table: <Nvidia ASUSACPI>
FreeBSD/SMP: Multiprocessor System Detected: 2 CPUs
FreeBSD/SMP: 1 package(s) x 2 core(s)
cpu0 (BSP): APIC ID: 0
cpu1 (AP): APIC ID: 1
ioapic0: Changing APIC ID to 4
ioapic0 <Version 1.1> irqs 0-23 on motherboard
kbd1 at kbdmux0
acpi0: <Nvidia ASUSACPI> on motherboard
acpi0: [ITHREAD]
acpi0: Power Button (fixed)
acpi0: reservation of 0, a0000 (3) failed
acpi0: reservation of 100000, 1fde0000 (3) failed
Timecounter "ACPI-fast" frequency 3579545 Hz quality 1000
acpi_timer0: <24-bit timer at 3.579545MHz> port 0x4008-0x400b on acpi0
acpi_hpet0: <High Precision Event Timer> iomem 0xfefff000-0xfefff3ff on
acpi0
Timecounter "HPET" frequency 25000000 Hz quality 900
acpi_button0: <Power Button> on acpi0
pcib0: <ACPI Host-PCI bridge> port 0xcf8-0xcff on acpi0
pci0: <ACPI PCI bus> on pcib0
pci0: <memory> at device 0.0 (no driver attached)
isab0: <PCI-ISA bridge> at device 1.0 on pci0
isa0: <ISA bus> on isab0
pci0: <serial bus, SMBus> at device 1.1 (no driver attached)
ohci0: <OHCI (generic) USB controller> mem 0xfe02f000-0xfe02ffff irq 21
at device 2.0 on pci0
ohci0: [ITHREAD]
usbus0: <OHCI (generic) USB controller> on ohci0
ehci0: <NVIDIA nForce4 USB 2.0 controller> mem 0xfeb00000-0xfeb000ff irq
22 at device 2.1 on pci0
ehci0: [ITHREAD]
usbus1: EHCI version 1.0
usbus1: <NVIDIA nForce4 USB 2.0 controller> on ehci0
atapci0: <nVidia nForce CK804 UDMA133 controller> port
0x1f0-0x1f7,0x3f6,0x170-0x177,0x376,0xe800-0xe80f at device 6.0 on pci0
ata0: <ATA channel 0> on atapci0
ata0: [ITHREAD]
ata1: <ATA channel 1> on atapci0
ata1: [ITHREAD]
atapci1: <nVidia nForce CK804 SATA300 controller> port
0x9f0-0x9f7,0xbf0-0xbf3,0x970-0x977,0xb70-0xb73,0xd400-0xd40f mem
0xfe02c000-0xfe02cfff irq 23 at device 7.0 on pci0
atapci1: [ITHREAD]
ata2: <ATA channel 0> on atapci1
ata2: [ITHREAD]
ata3: <ATA channel 1> on atapci1
ata3: [ITHREAD]
atapci2: <nVidia nForce CK804 SATA300 controller> port
0x9e0-0x9e7,0xbe0-0xbe3,0x960-0x967,0xb60-0xb63,0xc000-0xc00f mem
0xfe02b000-0xfe02bfff irq 21 at device 8.0 on pci0
atapci2: [ITHREAD]
ata4: <ATA channel 0> on atapci2
ata4: [ITHREAD]
ata5: <ATA channel 1> on atapci2
ata5: [ITHREAD]
pcib1: <ACPI PCI-PCI bridge> at device 9.0 on pci0
pci1: <ACPI PCI bus> on pcib1
fwohci0: <VIA Fire II (VT6306)> port 0xac00-0xac7f mem
0xfdfff000-0xfdfff7ff irq 17 at device 1.0 on pci1
fwohci0: [ITHREAD]
fwohci0: OHCI version 1.10 (ROM=1)
fwohci0: No. of Isochronous channels is 4.
fwohci0: EUI64 00:1e:8c:00:00:22:dc:f8
fwohci0: Phy 1394a available S400, 2 ports.
fwohci0: Link S400, max_rec 2048 bytes.
firewire0: <IEEE1394(FireWire) bus> on fwohci0
dcons_crom0: <dcons configuration ROM> on firewire0
dcons_crom0: bus_addr 0x1ef1c000
fwe0: <Ethernet over FireWire> on firewire0
if_fwe0: Fake Ethernet address: 02:1e:8c:22:dc:f8
fwe0: Ethernet address: 02:1e:8c:22:dc:f8
fwip0: <IP over FireWire> on firewire0
fwip0: Firewire address: 00:1e:8c:00:00:22:dc:f8 @ 0xfffe00000000, S400,
maxrec 2048
sbp0: <SBP-2/SCSI over FireWire> on firewire0
fwohci0: Initiate bus reset
fwohci0: fwohci_intr_core: BUS reset
fwohci0: fwohci_intr_core: node_id=0x00000000, SelfID Count=1,
CYCLEMASTER mode
dc0: <ADMtek AN985 10/100BaseTX> port 0xa800-0xa8ff mem
0xfdffe000-0xfdffe3ff irq 16 at device 6.0 on pci1
miibus0: <MII bus> on dc0
ukphy0: <Generic IEEE 802.3u media interface> PHY 1 on miibus0
ukphy0: 10baseT, 10baseT-FDX, 100baseTX, 100baseTX-FDX, auto
dc0: Ethernet address: 00:12:17:51:31:ac
dc0: [ITHREAD]
nfe0: <NVIDIA nForce4 CK804 MCP9 Networking Adapter> port 0xbc00-0xbc07
mem 0xfe02a000-0xfe02afff irq 22 at device 10.0 on pci0
miibus1: <MII bus> on nfe0
atphy0: <Atheros F1 10/100/1000 PHY> PHY 0 on miibus1
atphy0: 10baseT, 10baseT-FDX, 100baseTX, 100baseTX-FDX, 1000baseT-FDX, auto
nfe0: Ethernet address: 00:1f:c6:be:65:76
nfe0: [FILTER]
pcib2: <ACPI PCI-PCI bridge> at device 11.0 on pci0
pci2: <ACPI PCI bus> on pcib2
pcib3: <ACPI PCI-PCI bridge> at device 12.0 on pci0
pci3: <ACPI PCI bus> on pcib3
pcib4: <ACPI PCI-PCI bridge> at device 13.0 on pci0
pci4: <ACPI PCI bus> on pcib4
pcib5: <ACPI PCI-PCI bridge> at device 14.0 on pci0
pci5: <ACPI PCI bus> on pcib5
vgapci0: <VGA-compatible display> port 0x9c00-0x9cff mem
0xd0000000-0xdfffffff,0xfdef0000-0xfdefffff irq 18 at device 0.0 on pci5
vgapci1: <VGA-compatible display> mem 0xfdee0000-0xfdeeffff at device
0.1 on pci5
acpi_tz0: <Thermal Zone> on acpi0
ACPI Warning: \\_TZ_.THRM._PSL: Return Package type mismatch at index 0
- found [NULL Object Descriptor], expected Reference 20090521 nspredef-1058
atrtc0: <AT realtime clock> port 0x70-0x73 on acpi0
fdc0: <floppy drive controller> port 0x3f0-0x3f5,0x3f7 irq 6 drq 2 on acpi0
fdc0: [FILTER]
uart0: <16550 or compatible> port 0x3f8-0x3ff irq 4 flags 0x10 on acpi0
uart0: [FILTER]
ppc0: <Parallel port> port 0x378-0x37f irq 7 on acpi0
ppc0: Generic chipset (EPP/NIBBLE) in COMPATIBLE mode
ppc0: [ITHREAD]
ppbus0: <Parallel port bus> on ppc0
plip0: <PLIP network interface> on ppbus0
plip0: [ITHREAD]
lpt0: <Printer> on ppbus0
lpt0: [ITHREAD]
lpt0: Interrupt-driven port
ppi0: <Parallel I/O> on ppbus0
atkbdc0: <Keyboard controller (i8042)> port 0x60,0x64 irq 1 on acpi0
atkbd0: <AT Keyboard> irq 1 on atkbdc0
kbd0 at atkbd0
atkbd0: [GIANT-LOCKED]
atkbd0: [ITHREAD]
cpu0: <ACPI CPU> on acpi0
powernow0: <PowerNow! K8> on cpu0
cpu1: <ACPI CPU> on acpi0
powernow1: <PowerNow! K8> on cpu1
sc0: <System console> at flags 0x100 on isa0
sc0: VGA <16 virtual consoles, flags=0x300>
vga0: <Generic ISA VGA> at port 0x3c0-0x3df iomem 0xa0000-0xbffff on isa0
Timecounters tick every 1.000 msec
firewire0: 1 nodes, maxhop <= 0 cable IRM irm(0) (me)
firewire0: bus manager 0
usbus0: 12Mbps Full Speed USB v1.0
usbus1: 480Mbps High Speed USB v2.0
ugen0.1: <nVidia> at usbus0
uhub0: <nVidia OHCI root HUB, class 9/0, rev 1.00/1.00, addr 1> on usbus0
ugen1.1: <nVidia> at usbus1
uhub1: <nVidia EHCI root HUB, class 9/0, rev 2.00/1.00, addr 1> on usbus1
acd0: DMA limited to UDMA33, device found non-ATA66 cable
acd0: DVDR <LITE-ON DVDRW LH-20A1P/KL0N> at ata0-slave UDMA33
ad8: 152627MB <WDC WD1600AAJS-22PSA0 05.06H05> at ata4-master SATA300
uhub0: 9 ports with 9 removable, self powered
uhub1: 9 ports with 9 removable, self powered
SMP: AP CPU #1 Launched!
Trying to mount root from ufs:/dev/ad8s1a
WARNING: / was not properly dismounted
ugen0.2: <Dell> at usbus0
ums0: <Dell Dell USB Optical Mouse, class 0/0, rev 2.00/43.01, addr 2>
on usbus0
ums0: 3 buttons and [XYZ] coordinates ID=0
ugen0.3: <vendor 0x0d8c> at usbus0
ipfw2 (+ipv6) initialized, divert loadable, nat loadable, rule-based
forwarding disabled, default to deny, logging disabled
nfe0: link state changed to UP

hope this helps,

Etienne

--
Etienne Robillard <robillar...@gmail.com>
Green Tea Hackers Club <http://gthc.org/>
Blog: <http://gthc.org/blog/>
PGP Fingerprint: 178A BF04 23F0 2BF5 535D 4A57 FD53 FD31 98DC 4E57

Andriy Gapon

Nov 13, 2009, 2:53:50 AM

on 13/11/2009 01:11 Etienne Robillard said the following:

> CPU: AMD Athlon(tm) 64 X2 Dual Core Processor 5200+ (2613.41-MHz
> K8-class CPU)
> Origin = "AuthenticAMD" Id = 0x40f32 Stepping = 2

Your CPU is different, older family too.

--
Andriy Gapon

Andriy Gapon

Nov 13, 2009, 2:57:04 AM

on 12/11/2009 17:57 Mark Atkinson said the following:

>> superpages and no machine check - works
>> machine check and no superpages - works
>> machine check and superpages - problem
>
> That's not quite the same for sure, definitely try replacing the memory
> first if you haven't already.

I am not sure why you would say this.
I still see more similarities than differences:
1. in all cases it's family 10h CPUs
2. in all cases turning off superpages helps
3. in all cases the failure seems to happen through the MCA mechanism

The hardware is quite well tested too.

Andriy Gapon

Nov 13, 2009, 3:08:45 AM

on 12/11/2009 20:59 Kai Gallasch said the following:

> sonnenkraft:~ # MCA: CPU 4 UNCOR PCC OVER DTLB L1 error

Kai,

very interesting info. It matches what Serguey reported too. Thank you for the test!
So in all cases where MCE information was captured, it seems to be an L1 data TLB error.

John,
BTW, OVER may be incorrectly reported by hardware in this case, see erratum 60:
http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/41322.pdf

Kai,
I have a hunch, could you please try the following _sledgehammer_ patch (only
kernel build/install is needed):
diff --git a/sys/amd64/amd64/pmap.c b/sys/amd64/amd64/pmap.c
index 44b71f3..a456609 100644
--- a/sys/amd64/amd64/pmap.c
+++ b/sys/amd64/amd64/pmap.c
@@ -2981,6 +2981,7 @@ setpte:
 	 * Map the superpage.
 	 */
 	pde_store(pde, PG_PS | newpde);
+	pmap_invalidate_all(pmap);
 
 	pmap_pde_promotions++;
 	CTR2(KTR_PMAP, "pmap_promote_pde: success for va %#lx"

This will slow down an act of promotion to a superpage, but should not have any
visible impact on overall performance.

Serguey,

your problem does not seem to be limited to superpages, so I am not sure if this
patch would be of much help to you.

Andriy Gapon

Nov 13, 2009, 3:10:16 AM

on 12/11/2009 18:34 S.N.Grigoriev said the following:

>
> 12.11.09, 15:36, "Andriy Gapon" <a...@icyb.net.ua>
> wrote:
>
>> Serguey,
>> are you sure that setting vm.pmap.pg_ps_enabled=0 doesn't help you?
>> I know that already asked you this once.
>> But, could you please try again with vm.pmap.pg_ps_enabled=0 and hw.mca.enabled=1
>> and see what kind of behavior you get?
>> I am curious what would happen, would it be the same kind of machine check condition.
>
> Andriy,
>
> I've done the world compilation with 'vm.pmap.pg_ps_enabled=0'.
> The first attempt has finished with silent reboot. The second one has
> been captured by the debugger:
>
> panic: backgroundwritedone: lost buffer
> cpuid = 0
> KDB: enter: panic
> [thread pid 3 tid 100014 ]
> Stopped at breakpoint+0x5: leave
>

This is a really weird panic.
Looks like perhaps there are multiple problems with your system. I wouldn't
rule out a real hardware problem. Not sure what can be done.

Kai Gallasch

Nov 13, 2009, 8:48:04 AM

Am Fri, 13 Nov 2009 10:08:45 +0200
schrieb Andriy Gapon <a...@icyb.net.ua>:

> on 12/11/2009 20:59 Kai Gallasch said the following:

> > sonnenkraft:~ # MCA: CPU 4 UNCOR PCC OVER DTLB L1 error

> Kai,

> I have a hunch, could you please try the following _sledgehammer_
> patch (only kernel build/install is needed):
> diff --git a/sys/amd64/amd64/pmap.c b/sys/amd64/amd64/pmap.c
> index 44b71f3..a456609 100644
> --- a/sys/amd64/amd64/pmap.c
> +++ b/sys/amd64/amd64/pmap.c
> @@ -2981,6 +2981,7 @@ setpte:
> * Map the superpage.
> */
> pde_store(pde, PG_PS | newpde);
> + pmap_invalidate_all(pmap);
>
> pmap_pde_promotions++;
> CTR2(KTR_PMAP, "pmap_promote_pde: success for va %#lx"
>
> This will slow down an act of promotion to a superpage, but should
> not have any visible impact on overall performance.

Andriy,

I tried the patch with vm.pmap.pg_ps_enabled="1" and hw.mca.enabled="1",
rebuilt the kernel (although normally I never build kernels on Friday
13th :-) and ran buildworld -j8 five times in a row. No sign of a
machine check exception, no reboot.

--Kai.

--
The final screw holding up a rackmount server is always possessed by
demons.

Andriy Gapon

Nov 13, 2009, 8:55:42 AM

on 13/11/2009 15:48 Kai Gallasch said the following:

> On Fri, 13 Nov 2009 10:08:45 +0200,
> Andriy Gapon <a...@icyb.net.ua> wrote:

>> Kai,
>> I have a hunch, could you please try the following _sledgehammer_
>> patch (only kernel build/install is needed):
>> diff --git a/sys/amd64/amd64/pmap.c b/sys/amd64/amd64/pmap.c
>> index 44b71f3..a456609 100644
>> --- a/sys/amd64/amd64/pmap.c
>> +++ b/sys/amd64/amd64/pmap.c
>> @@ -2981,6 +2981,7 @@ setpte:
>>  	 * Map the superpage.
>>  	 */
>>  	pde_store(pde, PG_PS | newpde);
>> +	pmap_invalidate_all(pmap);
>>
>>  	pmap_pde_promotions++;
>>  	CTR2(KTR_PMAP, "pmap_promote_pde: success for va %#lx"
>>
>> This will slow down each promotion to a superpage, but should
>> not have any visible impact on overall performance.
>
> Andriy,
>
> I tried the patch with vm.pmap.pg_ps_enabled="1" and hw.mca.enabled="1",
> rebuilt the kernel (although normally I never build kernels on Friday
> 13th :-) and ran buildworld -j8 five times in a row. No sign of a
> machine check exception, no reboot.

I think that this is good news.
This is not a fix, but the fact that it helps should help us find a proper solution.

Thanks!
--
Andriy Gapon

John Baldwin

Nov 13, 2009, 9:49:08 AM

> sonnenkraft:/usr/obj # MCA: CPU 5 UNCOR PCC OVER DTLB L1 error
> MCA: Address 0x80e5c8000

Hmm, normally I would suspect the CPU, but avg@ has been looking into a
possible interaction between the superpages code and the machine check
registers on AMD CPUs (either a CPU bug, or perhaps a superpages bug). I
would wait to see if he finds something. An isolated MCA would most likely
indicate a hardware error, but the fact that several people are reporting
this exact machine check, and only when superpages are enabled, indicates
it might be something else.

--
John Baldwin

Mark Atkinson

Nov 13, 2009, 1:13:24 PM

Andriy Gapon wrote:
> on 12/11/2009 17:57 Mark Atkinson said the following:
>>> superpages and no machine check - works
>>> machine check and no superpages - works
>>> machine check and superpages - problem
>> That's not quite the same for sure, definitely try replacing the memory
>> first if you haven't already.
>
> I am not sure why you would say this.
> I still see more similarities than differences:
> 1. in all cases it's family 10h CPUs
> 2. in all cases turning off super-pages helps
> 3. in all cases the failure seems to happen through the MCA mechanism
>
> The hardware is quite tested too.
>

Sorry, I only mentioned the difference because, in my case, the
configuration you listed as:

'superpages and no machine check - works'

always ends up in either

* hardware reset on my machine (watchdog or other), or
* bus error during compilation

when doing buildworld with superpages on my affected machine. You seem
to say here that turning off superpages always helps, and that matches
the behavior I see. I tried looking back through your posts to find the
discrepancy, but could not find enough detail.

All the best,

Mark