2.6.31-rc5 regression: x86 MCE malfunction on Thinkpad T42p

Johannes Stezenbach

unread,

Aug 7, 2009, 1:10:11 PM8/7/09

to

Hi,

I'm currently running linux-2.6.31-rc5-246-g90bc1a6 on
an old Thinkpad T42p. During boot I get the following:

Local APIC disabled by BIOS -- you can enable it with "lapic"
APIC: disable apic facility
...
mce: CPU supports 5 MCE banks
Disabling lock debugging due to kernel taint
------------[ cut here ]------------
WARNING: at arch/x86/kernel/apic/apic.c:247 native_apic_write_dummy+0x2d/0x39()
Hardware name: 2373Y4M
Modules linked in:
Pid: 0, comm: swapper Tainted: G M 2.6.31-rc5 #1
Call Trace:
[<c10248c1>] warn_slowpath_common+0x60/0x90
[<c10248fe>] warn_slowpath_null+0xd/0x10
[<c1013139>] native_apic_write_dummy+0x2d/0x39
[<c100dcd2>] intel_init_thermal+0xb6/0x144
[<c100d517>] ? mce_init+0x33/0xb0
[<c100db4b>] mce_intel_feature_init+0xb/0x4c
[<c14fc31e>] mcheck_init+0x1e2/0x253
[<c14faef4>] identify_cpu+0x30b/0x31b
[<c14d9af0>] identify_boot_cpu+0xd/0x23
[<c14d9b3c>] check_bugs+0xb/0xd4
[<c104f929>] ? delayacct_init+0x42/0x49
[<c14d493c>] start_kernel+0x25e/0x26d
[<c14d430b>] i386_start_kernel+0x65/0x6a
---[ end trace 4eaa2a86a8e2da22 ]---
...
CPU: Intel(R) Pentium(R) M processor 1.80GHz stepping 06

mcelog reports:

HARDWARE ERROR. This is *NOT* a software problem!
Please contact your hardware vendor
MCE 0
CPU 0 BANK 1
TIME 1249662514 Fri Aug 7 18:28:34 2009
MCG status:
MCi status:
Error overflow
Uncorrected error
Error enabled
Processor context corrupt
MCA: Unknown Error 30
STATUS f200000000000030 MCGSTATUS 0
MCGCAP 5 APICID 0 SOCKETID 0
CPUID Vendor Intel Family 6 Model 13

In .config I have:

CONFIG_X86_UP_APIC=y
# CONFIG_X86_UP_IOAPIC is not set
CONFIG_X86_LOCAL_APIC=y
CONFIG_X86_IO_APIC=y
# CONFIG_X86_REROUTE_FOR_BROKEN_BOOT_IRQS is not set
CONFIG_X86_MCE=y
# CONFIG_X86_OLD_MCE is not set
CONFIG_X86_NEW_MCE=y
CONFIG_X86_MCE_INTEL=y
# CONFIG_X86_MCE_AMD is not set
# CONFIG_X86_ANCIENT_MCE is not set
CONFIG_X86_MCE_THRESHOLD=y
# CONFIG_X86_MCE_INJECT is not set
CONFIG_X86_THERMAL_VECTOR=y

I guess I should try to boot with "lapic"? But I think
MCE worked without "lapic" in earlier kernels. On a 2.6.29.1
kernel dmesg said:

Local APIC disabled by BIOS -- you can enable it with "lapic"
...
Intel machine check architecture supported.
Intel machine check reporting enabled on CPU#0.

2.6.29.1 doesn't log any MCE events, so I doubt this is a HW problem.

Johannes
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majo...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

Johannes Stezenbach

unread,

Aug 9, 2009, 6:10:08 AM8/9/09

to

On Fri, Aug 07, 2009 at 07:09:42PM +0200, Johannes Stezenbach wrote:
>
> I'm currently running linux-2.6.31-rc5-246-g90bc1a6 on
> an old Thinkpad T42p. During boot I get the following:

...

> I guess I should try to boot with "lapic"? But I think
> MCE worked without "lapic" in earlier kernels. On a 2.6.29.1
> kernel dmesg said:
>
> Local APIC disabled by BIOS -- you can enable it with "lapic"
> ...
> Intel machine check architecture supported.
> Intel machine check reporting enabled on CPU#0.
>
> 2.6.29.1 doesn't log any MCE events, so I doubt this is a HW problem.

I booted with "lapic", then the backtrace is gone
but it still logs machine check events which I think
are bogus since 2.6.29.1 does not log any. I also
tried 2.6.30, no machine check messages in dmesg. But
I noticed that mcelog complains about missing /dev/mcelog,
it seems that /dev/mcelog support for 32bit kernels is new.
However, I guess the old kernels should still have printed
a messge to dmesg, right? "Uncorrected error" + "Processor
context corrupt" sounds pretty serious, but the machine
runs without problems.

These are from 2.6.31-rc5 by running mcelog (dmesg
just says "Machine check events logged"):

HARDWARE ERROR. This is *NOT* a software problem!
Please contact your hardware vendor
MCE 0
CPU 0 BANK 1

TIME 1249725930 Sat Aug 8 12:05:30 2009

MCG status:
MCi status:
Error overflow
Uncorrected error
Error enabled
Processor context corrupt

MCA: MEMORY CONTROLLER AC_CHANNEL0_ERR
Transaction: Address/Command error
STATUS f2000000000000b0 MCGSTATUS 0

MCGCAP 5 APICID 0 SOCKETID 0
CPUID Vendor Intel Family 6 Model 13

HARDWARE ERROR. This is *NOT* a software problem!
Please contact your hardware vendor

MCE 1
CPU 0 BANK 1
TIME 1249728066 Sat Aug 8 12:41:06 2009

MCG status:
MCi status:
Error overflow
Uncorrected error
Error enabled
Processor context corrupt

MCA: Unknown Error 20
STATUS f200000000000020 MCGSTATUS 0

MCGCAP 5 APICID 0 SOCKETID 0
CPUID Vendor Intel Family 6 Model 13

HARDWARE ERROR. This is *NOT* a software problem!
Please contact your hardware vendor

MCE 2
CPU 0 BANK 1
TIME 1249747923 Sat Aug 8 18:12:03 2009

MCG status:
MCi status:
Error overflow
Uncorrected error
Error enabled
Processor context corrupt
MCA: Unknown Error 30
STATUS f200000000000030 MCGSTATUS 0
MCGCAP 5 APICID 0 SOCKETID 0
CPUID Vendor Intel Family 6 Model 13

HARDWARE ERROR. This is *NOT* a software problem!
Please contact your hardware vendor

MCE 3
CPU 0 BANK 1
TIME 1249765938 Sat Aug 8 23:12:18 2009

MCG status:
MCi status:
Error overflow
Uncorrected error
Error enabled
Processor context corrupt

MCA: Unknown Error 20
STATUS f200000000000020 MCGSTATUS 0

MCGCAP 5 APICID 0 SOCKETID 0
CPUID Vendor Intel Family 6 Model 13

Bartlomiej Zolnierkiewicz

unread,

Aug 9, 2009, 6:40:05 AM8/9/09

to

On Sunday 09 August 2009 12:03:48 Johannes Stezenbach wrote:
> On Fri, Aug 07, 2009 at 07:09:42PM +0200, Johannes Stezenbach wrote:
> >
> > I'm currently running linux-2.6.31-rc5-246-g90bc1a6 on
> > an old Thinkpad T42p. During boot I get the following:
> ...
> > I guess I should try to boot with "lapic"? But I think
> > MCE worked without "lapic" in earlier kernels. On a 2.6.29.1
> > kernel dmesg said:
> >
> > Local APIC disabled by BIOS -- you can enable it with "lapic"
> > ...
> > Intel machine check architecture supported.
> > Intel machine check reporting enabled on CPU#0.
> >
> > 2.6.29.1 doesn't log any MCE events, so I doubt this is a HW problem.

http://patchwork.kernel.org/patch/37908/

Johannes Stezenbach

unread,

Aug 9, 2009, 12:50:11 PM8/9/09

to

On Sun, Aug 09, 2009 at 12:34:03PM +0200, Bartlomiej Zolnierkiewicz wrote:
> On Sunday 09 August 2009 12:03:48 Johannes Stezenbach wrote:
> > On Fri, Aug 07, 2009 at 07:09:42PM +0200, Johannes Stezenbach wrote:
> > >
> > > I'm currently running linux-2.6.31-rc5-246-g90bc1a6 on
> > > an old Thinkpad T42p. During boot I get the following:
> > ...
> > >

> > > 2.6.29.1 doesn't log any MCE events, so I doubt this is a HW problem.
>
> http://patchwork.kernel.org/patch/37908/

This patch fixes the issue for me.

Thanks,
Johannes

Andi Kleen

unread,

Aug 10, 2009, 6:40:14 AM8/10/09

to

Johannes Stezenbach <j...@sig21.net> writes:

> Hi,
>
> I'm currently running linux-2.6.31-rc5-246-g90bc1a6 on
> an old Thinkpad T42p. During boot I get the following:

Thanks for the report.

>
> Local APIC disabled by BIOS -- you can enable it with "lapic"
> APIC: disable apic facility
> ...
> mce: CPU supports 5 MCE banks
> Disabling lock debugging due to kernel taint
> ------------[ cut here ]------------
> WARNING: at arch/x86/kernel/apic/apic.c:247 native_apic_write_dummy+0x2d/0x39()
> Hardware name: 2373Y4M
> Modules linked in:
> Pid: 0, comm: swapper Tainted: G M 2.6.31-rc5 #1

The mcelog below is already worked around with Bart's patch he posted
the link to (it's really a BIOS bug in your case that the BIOS leaves
junks in the machine check registers on boot)

[for the x86 maintainers:]
One thing that would be good to make sure that Bart's patch is queued
for .31 too, not only for .32, since this BIOS problem seems
to be common (already two reports)

But still need to fix that warning too, which is independent
[another .31 candidate]

> Call Trace:
> [<c10248c1>] warn_slowpath_common+0x60/0x90
> [<c10248fe>] warn_slowpath_null+0xd/0x10
> [<c1013139>] native_apic_write_dummy+0x2d/0x39
> [<c100dcd2>] intel_init_thermal+0xb6/0x144
> [<c100d517>] ? mce_init+0x33/0xb0
> [<c100db4b>] mce_intel_feature_init+0xb/0x4c
> [<c14fc31e>] mcheck_init+0x1e2/0x253
> [<c14faef4>] identify_cpu+0x30b/0x31b
> [<c14d9af0>] identify_boot_cpu+0xd/0x23
> [<c14d9b3c>] check_bugs+0xb/0xd4
> [<c104f929>] ? delayacct_init+0x42/0x49
> [<c14d493c>] start_kernel+0x25e/0x26d
> [<c14d430b>] i386_start_kernel+0x65/0x6a
> ---[ end trace 4eaa2a86a8e2da22 ]---

The appended patch should remove the warning. Can you please test it?

> 2.6.29.1 doesn't log any MCE events, so I doubt this is a HW problem.

It actually is a BIOS bug, but not really broken hardware.

-Andi

---

Don't try to enable thermal throttling on 32bit systems without apic

When the local APIC isn't enabled don't try to enable thermal throttling.
The APIC writes would WARN_ON.

Fixes

> Disabling lock debugging due to kernel taint
> ------------[ cut here ]------------
> WARNING: at arch/x86/kernel/apic/apic.c:247 native_apic_write_dummy+0x2d/0x39()
> Hardware name: 2373Y4M
> Modules linked in:
> Pid: 0, comm: swapper Tainted: G M 2.6.31-rc5 #1

Originally reported by Johannes Stezenbach

This is a 2.6.31 candidate because it fixes a regression.

Signed-off-by: Andi Kleen <a...@linux.intel.com>

---
arch/x86/kernel/cpu/mcheck/therm_throt.c | 3 +++
1 file changed, 3 insertions(+)

Index: linux/arch/x86/kernel/cpu/mcheck/therm_throt.c
===================================================================
--- linux.orig/arch/x86/kernel/cpu/mcheck/therm_throt.c
+++ linux/arch/x86/kernel/cpu/mcheck/therm_throt.c
@@ -236,6 +236,9 @@ void intel_init_thermal(struct cpuinfo_x
int tm2 = 0;
u32 l, h;

+ if (!cpu_has_apic || disable_apic)
+ return;
+
/* Thermal monitoring depends on ACPI and clock modulation*/
if (!cpu_has(c, X86_FEATURE_ACPI) || !cpu_has(c, X86_FEATURE_ACC))
return;

--
a...@linux.intel.com -- Speaking for myself only.

Johannes Stezenbach

unread,

Aug 10, 2009, 8:30:16 AM8/10/09

to

On Mon, Aug 10, 2009 at 12:31:47PM +0200, Andi Kleen wrote:
> Johannes Stezenbach <j...@sig21.net> writes:
>
> > Call Trace:
> > [<c10248c1>] warn_slowpath_common+0x60/0x90
> > [<c10248fe>] warn_slowpath_null+0xd/0x10
> > [<c1013139>] native_apic_write_dummy+0x2d/0x39
> > [<c100dcd2>] intel_init_thermal+0xb6/0x144
> > [<c100d517>] ? mce_init+0x33/0xb0
> > [<c100db4b>] mce_intel_feature_init+0xb/0x4c
> > [<c14fc31e>] mcheck_init+0x1e2/0x253
> > [<c14faef4>] identify_cpu+0x30b/0x31b
> > [<c14d9af0>] identify_boot_cpu+0xd/0x23
> > [<c14d9b3c>] check_bugs+0xb/0xd4
> > [<c104f929>] ? delayacct_init+0x42/0x49
> > [<c14d493c>] start_kernel+0x25e/0x26d
> > [<c14d430b>] i386_start_kernel+0x65/0x6a
> > ---[ end trace 4eaa2a86a8e2da22 ]---
>
>
> The appended patch should remove the warning. Can you please test it?

It works (no more warning), but I feel I'm better off using "lapic"
because it enables thermal monitoring and performance counters.

Thanks,
Johannes

Andi Kleen

unread,

Aug 10, 2009, 8:40:15 AM8/10/09

to

> > The appended patch should remove the warning. Can you please test it?
>
> It works (no more warning), but I feel I'm better off using "lapic"

Thanks for testing.

> because it enables thermal monitoring and performance counters.

When the BIOS doesn't enable it then force enabling lapic might not work.

This could cause either boot failures (obvious) or more subtle
problems like SMM doing something unexpected. Just saying that if you have
strange problems later first try disabling this option again.

-Andi

--
a...@linux.intel.com -- Speaking for myself only.

Johannes Stezenbach

unread,

Aug 10, 2009, 9:00:16 AM8/10/09

to

On Mon, Aug 10, 2009 at 02:32:28PM +0200, Andi Kleen wrote:
> > > The appended patch should remove the warning. Can you please test it?
> >
> > It works (no more warning), but I feel I'm better off using "lapic"
>
> Thanks for testing.
>
> > because it enables thermal monitoring and performance counters.
>
> When the BIOS doesn't enable it then force enabling lapic might not work.
>
> This could cause either boot failures (obvious) or more subtle
> problems like SMM doing something unexpected. Just saying that if you have
> strange problems later first try disabling this option again.

Thanks for the heads-up. I remember I tried to use oprofile
in the past on this machine and was disappointed that it only
got the timer event. I'll keep lapic for now unless I see
signs of instability.

Johannes

Ingo Molnar

unread,

Aug 10, 2009, 9:40:09 AM8/10/09

to

* Johannes Stezenbach <j...@sig21.net> wrote:

> On Mon, Aug 10, 2009 at 02:32:28PM +0200, Andi Kleen wrote:
> > > > The appended patch should remove the warning. Can you please test it?
> > >
> > > It works (no more warning), but I feel I'm better off using "lapic"
> >
> > Thanks for testing.
> >
> > > because it enables thermal monitoring and performance counters.
> >
> > When the BIOS doesn't enable it then force enabling lapic might not work.
> >
> > This could cause either boot failures (obvious) or more subtle
> > problems like SMM doing something unexpected. Just saying that
> > if you have strange problems later first try disabling this
> > option again.
>
> Thanks for the heads-up. I remember I tried to use oprofile in
> the past on this machine and was disappointed that it only got the
> timer event. I'll keep lapic for now unless I see signs of
> instability.

What's the output of something like 'perf stat true', and does 'perf
top' output something - i.e. do perfcounters work in general? Once
you get to that stage and it works then it should be fine.

Ingo

Johannes Stezenbach

unread,

Aug 10, 2009, 3:30:12 PM8/10/09

to

On Mon, Aug 10, 2009 at 03:29:23PM +0200, Ingo Molnar wrote:
> * Johannes Stezenbach <j...@sig21.net> wrote:
> > On Mon, Aug 10, 2009 at 02:32:28PM +0200, Andi Kleen wrote:
> > >
> > > When the BIOS doesn't enable it then force enabling lapic might not work.
> > >
> > > This could cause either boot failures (obvious) or more subtle
> > > problems like SMM doing something unexpected. Just saying that
> > > if you have strange problems later first try disabling this
> > > option again.
> >
> > Thanks for the heads-up. I remember I tried to use oprofile in
> > the past on this machine and was disappointed that it only got the
> > timer event. I'll keep lapic for now unless I see signs of
> > instability.
>
> What's the output of something like 'perf stat true', and does 'perf
> top' output something - i.e. do perfcounters work in general? Once
> you get to that stage and it works then it should be fine.

# ./perf stat true

Performance counter stats for 'true':

0.985808 task-clock-msecs # 0.779 CPUs
0 context-switches # 0.000 M/sec
0 CPU-migrations # 0.000 M/sec
110 page-faults # 0.112 M/sec
583873 cycles # 592.279 M/sec
500937 instructions # 0.858 IPC
<not counted> cache-references
<not counted> cache-misses

0.001265524 seconds time elapsed

# ./perf top
------------------------------------------------------------------------------
PerfTop: 172 irqs/sec kernel:43.6% [100000 cycles], (all, 1 CPUs)
------------------------------------------------------------------------------

samples pcnt RIP kernel function
______ _______ _____ ________________ _______________

96.00 - 16.4% - 00000000c129be10 : acpi_pm_read
66.00 - 11.3% - 00000000c116fbb1 : delay_tsc
59.00 - 10.1% - 00000000c1172a83 : ioread32
26.00 - 4.4% - 00000000c116f567 : vsnprintf
21.00 - 3.6% - 00000000c11a4721 : acpi_os_read_port
20.00 - 3.4% - 00000000c136612c : schedule
19.00 - 3.2% - 00000000c10054b9 : mask_and_ack_8259A
18.00 - 3.1% - 00000000c11ce8bc : acpi_idle_enter_bm
17.00 - 2.9% - 00000000c1090dd1 : do_select
16.00 - 2.7% - 00000000c133f380 : unix_poll
14.00 - 2.4% - 00000000c116e2b4 : number
13.00 - 2.2% - 00000000c1002847 : sysenter_past_esp
10.00 - 1.7% - 00000000c1085b84 : fget_light
9.00 - 1.5% - 00000000c13666a7 : preempt_schedule
9.00 - 1.5% - 00000000c102cf1b : get_next_timer_interrupt
^C

First I tried oprofile while running an endless while loop in bash:

# opreport
CPU: Pentium M (P6 core), speed 1800 MHz (estimated)
Counted CPU_CLK_UNHALTED events (clocks processor is not halted, and not in a thermal trip) with a unit mask of 0x00 (No unit mask) count 100000
CPU_CLK_UNHALT...|
samples| %|
------------------
282940 76.5545 bash
78266 21.1763 libc-2.9.so
1730 0.4681 Xorg
1069 0.2892 oprofiled

Looks plausible.

But in demsg I got this:

Delta way too big! 18446744022868427516 ts=18446744022868427516 write stamp = 0
------------[ cut here ]------------
WARNING: at kernel/trace/ring_buffer.c:1392 rb_reserve_next_event+0x150/0x309()
Hardware name: 2373Y4M
Modules linked in: ath5k mac80211 ath cfg80211 oprofile bnep sco rfcomm l2cap bluetooth ehci_hcd uhci_hc
Pid: 13478, comm: opcontrol Not tainted 2.6.31-rc5 #5
Call Trace:
[<c10248dd>] warn_slowpath_common+0x60/0x90
[<c102491a>] warn_slowpath_null+0xd/0x10
[<c1054129>] rb_reserve_next_event+0x150/0x309
[<c1068c06>] ? get_page_from_freelist+0x86/0x35a
[<c10544f7>] ring_buffer_lock_reserve+0xe7/0x135
[<f88622c0>] op_cpu_buffer_write_reserve+0x1a/0x4b [oprofile]
[<f886239d>] op_add_code+0x57/0x98 [oprofile]
[<c1068c06>] ? get_page_from_freelist+0x86/0x35a
[<c1367c08>] ? page_fault+0x0/0x8
[<f8862409>] log_sample+0x2b/0x6c [oprofile]
[<c1064c76>] ? filemap_fault+0x74/0x32b
[<f886249c>] oprofile_add_sample+0x3b/0x6b [oprofile]
[<f88641d8>] ppro_check_ctrs+0x66/0xdb [oprofile]
[<c1074bde>] ? __do_fault+0x303/0x32f
[<f8863907>] profile_exceptions_notify+0x1f/0x26 [oprofile]
[<c103953b>] notifier_call_chain+0x2b/0x55
[<c10398e3>] __atomic_notifier_call_chain+0x1a/0x3a
[<c103990f>] atomic_notifier_call_chain+0xc/0xe
[<c103993e>] notify_die+0x2d/0x2f
[<c1003c54>] do_nmi+0x63/0x222
[<c1367d1d>] nmi_stack_correct+0x28/0x2d
[<c1367c08>] ? page_fault+0x0/0x8
---[ end trace d174f39c63495e01 ]---

Thanks
Johannes

Andi Kleen

unread,

Aug 10, 2009, 3:50:11 PM8/10/09

to

> But in demsg I got this:

AFAIK that's already fixed. cc Robert to confirm. Not sure if the fix made it into 2.6.31 though?

-Andi

--

a...@linux.intel.com -- Speaking for myself only.

Robert Richter

unread,

Aug 10, 2009, 4:10:15 PM8/10/09

to

On 10.08.09 21:44:13, Andi Kleen wrote:
> > But in demsg I got this:
>
> AFAIK that's already fixed. cc Robert to confirm. Not sure if the fix made it into 2.6.31 though?

The warning I fixed was in rb_advance_reader(). Fix is already
upstream:

193df82 ring-buffer: Fix advance of reader in rb_buffer_peek()

But this seems to be different, a time stamp issue?

CC'ing Steve.

-Robert

--
Advanced Micro Devices, Inc.
Operating System Research Center
email: robert....@amd.com

Ingo Molnar

unread,

Aug 10, 2009, 4:20:16 PM8/10/09

to

That looks almost normal - except for cache-references and
cache-misses that is not counted. Could you send the /proc/cpuinfo
info please?

>
>
> # ./perf top
> ------------------------------------------------------------------------------
> PerfTop: 172 irqs/sec kernel:43.6% [100000 cycles], (all, 1 CPUs)
> ------------------------------------------------------------------------------
>
> samples pcnt RIP kernel function
> ______ _______ _____ ________________ _______________
>
> 96.00 - 16.4% - 00000000c129be10 : acpi_pm_read
> 66.00 - 11.3% - 00000000c116fbb1 : delay_tsc
> 59.00 - 10.1% - 00000000c1172a83 : ioread32
> 26.00 - 4.4% - 00000000c116f567 : vsnprintf
> 21.00 - 3.6% - 00000000c11a4721 : acpi_os_read_port
> 20.00 - 3.4% - 00000000c136612c : schedule
> 19.00 - 3.2% - 00000000c10054b9 : mask_and_ack_8259A
> 18.00 - 3.1% - 00000000c11ce8bc : acpi_idle_enter_bm
> 17.00 - 2.9% - 00000000c1090dd1 : do_select
> 16.00 - 2.7% - 00000000c133f380 : unix_poll
> 14.00 - 2.4% - 00000000c116e2b4 : number
> 13.00 - 2.2% - 00000000c1002847 : sysenter_past_esp
> 10.00 - 1.7% - 00000000c1085b84 : fget_light
> 9.00 - 1.5% - 00000000c13666a7 : preempt_schedule
> 9.00 - 1.5% - 00000000c102cf1b : get_next_timer_interrupt
> ^C

Ok, this looks normal.

> First I tried oprofile while running an endless while loop in bash:
>
> # opreport
> CPU: Pentium M (P6 core), speed 1800 MHz (estimated)
> Counted CPU_CLK_UNHALTED events (clocks processor is not halted, and not in a thermal trip) with a unit mask of 0x00 (No unit mask) count 100000
> CPU_CLK_UNHALT...|
> samples| %|
> ------------------
> 282940 76.5545 bash
> 78266 21.1763 libc-2.9.so
> 1730 0.4681 Xorg
> 1069 0.2892 oprofiled
>
> Looks plausible.

Yeah.

That's a new warning i havent seen before - i've Cc:-ed Robert
(oprofile maintainer) and Steve (ftrace/ring-buffer maintainer) for
that.

The warning is probably harmless - oprofile sampling still works
fine, right?

Ingo

Johannes Stezenbach

unread,

Aug 10, 2009, 4:40:12 PM8/10/09

to

On Mon, Aug 10, 2009 at 10:14:06PM +0200, Ingo Molnar wrote:
> * Johannes Stezenbach <j...@sig21.net> wrote:
> >
> > # ./perf stat true
> >
> > Performance counter stats for 'true':
> >
> > 0.985808 task-clock-msecs # 0.779 CPUs
> > 0 context-switches # 0.000 M/sec
> > 0 CPU-migrations # 0.000 M/sec
> > 110 page-faults # 0.112 M/sec
> > 583873 cycles # 592.279 M/sec
> > 500937 instructions # 0.858 IPC
> > <not counted> cache-references
> > <not counted> cache-misses
> >
> > 0.001265524 seconds time elapsed
>
> That looks almost normal - except for cache-references and
> cache-misses that is not counted. Could you send the /proc/cpuinfo
> info please?

# cat /proc/cpuinfo
processor : 0
vendor_id : GenuineIntel
cpu family : 6
model : 13
model name : Intel(R) Pentium(R) M processor 1.80GHz
stepping : 6
cpu MHz : 600.000
cache size : 2048 KB
fdiv_bug : no
hlt_bug : no
f00f_bug : no
coma_bug : no
fpu : yes
fpu_exception : yes
cpuid level : 2
wp : yes
flags : fpu vme de pse tsc msr mce cx8 apic sep mtrr pge mca cmov clflush dts acpi mmx fxsr sse sse2 ss tm pbe bts est tm2
bogomips : 1196.15
clflush size : 64
power management:

> The warning is probably harmless - oprofile sampling still works
> fine, right?

I haven't done much testing so far, but so far it looks promising.

Could the warning be caused by the cpufreq ondemand governor?
ISTR that one should switch to the performance governor before doing
any profiling, but I forgot for this test.

Johannes

Ingo Molnar

unread,

Aug 10, 2009, 5:40:11 PM8/10/09

to

* Johannes Stezenbach <j...@sig21.net> wrote:

> On Mon, Aug 10, 2009 at 10:14:06PM +0200, Ingo Molnar wrote:
> > * Johannes Stezenbach <j...@sig21.net> wrote:
> > >
> > > # ./perf stat true
> > >
> > > Performance counter stats for 'true':
> > >
> > > 0.985808 task-clock-msecs # 0.779 CPUs
> > > 0 context-switches # 0.000 M/sec
> > > 0 CPU-migrations # 0.000 M/sec
> > > 110 page-faults # 0.112 M/sec
> > > 583873 cycles # 592.279 M/sec
> > > 500937 instructions # 0.858 IPC
> > > <not counted> cache-references
> > > <not counted> cache-misses
> > >
> > > 0.001265524 seconds time elapsed
> >
> > That looks almost normal - except for cache-references and
> > cache-misses that is not counted. Could you send the /proc/cpuinfo
> > info please?
>
> # cat /proc/cpuinfo
> processor : 0
> vendor_id : GenuineIntel
> cpu family : 6
> model : 13
> model name : Intel(R) Pentium(R) M processor 1.80GHz

ah, yes. There's no cache-references/misses, because in
arch/x86/kernel/cpu/perf_counter.c we have two zero entries:

static const u64 p6_perfmon_event_map[] =
{
[PERF_COUNT_HW_CPU_CYCLES] = 0x0079,
[PERF_COUNT_HW_INSTRUCTIONS] = 0x00c0,
[PERF_COUNT_HW_CACHE_REFERENCES] = 0x0000, <----------
[PERF_COUNT_HW_CACHE_MISSES] = 0x0000, <----------
[PERF_COUNT_HW_BRANCH_INSTRUCTIONS] = 0x00c4,
[PERF_COUNT_HW_BRANCH_MISSES] = 0x00c5,
[PERF_COUNT_HW_BUS_CYCLES] = 0x0062,
};

i.e. PERF_COUNT_HW_CACHE_REFERENCES and PERF_COUNT_HW_CACHE_MISSES
is not filled in yet.

Could you try something like:

perf stat -e r0f2e true

(0x2e: L2 requests, 0x0f: all units)

if i checked the docs right that counter would give us L2 cache
stats - does it display non-zero values?

> stepping : 6
> cpu MHz : 600.000
> cache size : 2048 KB
> fdiv_bug : no
> hlt_bug : no
> f00f_bug : no
> coma_bug : no
> fpu : yes
> fpu_exception : yes
> cpuid level : 2
> wp : yes
> flags : fpu vme de pse tsc msr mce cx8 apic sep mtrr pge mca cmov clflush dts acpi mmx fxsr sse sse2 ss tm pbe bts est tm2
> bogomips : 1196.15
> clflush size : 64
> power management:
>
>
> > The warning is probably harmless - oprofile sampling still works
> > fine, right?
>
> I haven't done much testing so far, but so far it looks promising.
>
> Could the warning be caused by the cpufreq ondemand governor? ISTR
> that one should switch to the performance governor before doing
> any profiling, but I forgot for this test.

there might be a connection - it could in theory cause sched_clock()
transients and confuse the ring-buffer time-stamping.

Ingo

Johannes Stezenbach

unread,

Aug 10, 2009, 6:20:07 PM8/10/09

to

# ./perf stat -e r0f2e true

Performance counter stats for 'true':

10584 raw 0xf2e

0.001159924 seconds time elapsed

The number also increases for larger programs than "true".

According to /usr/share/oprofile/i386/p6_mobile/events and
http://oprofile.sourceforge.net/docs/intel-p6-mobile-events.php
0x2e + 0x0f is "L2 requests, all units", but I couldn't say how
to count cache references vs. misses. Or does it work
with unit mask 0x0e vs. 0x01?

# ./perf stat -e r0e2e true

Performance counter stats for 'true':

10147 raw 0xe2e

0.001121651 seconds time elapsed

# ./perf stat -e r012e true

Performance counter stats for 'true':

468 raw 0x12e

0.001130870 seconds time elapsed

> > Could the warning be caused by the cpufreq ondemand governor? ISTR
> > that one should switch to the performance governor before doing
> > any profiling, but I forgot for this test.
>
> there might be a connection - it could in theory cause sched_clock()
> transients and confuse the ring-buffer time-stamping.

I'll try tomorrow after a fresh boot if the warning also appears
with the performance governor.

Thanks
Johannes

Peter Zijlstra

unread,

Aug 11, 2009, 8:10:14 AM8/11/09

to

On Tue, 2009-08-11 at 13:06 +0200, Ingo Molnar wrote:
> * Peter Zijlstra <a.p.zi...@chello.nl> wrote:
>
> > On Tue, 2009-08-11 at 11:34 +0200, Ingo Molnar wrote:
> >
> > > @@ -116,8 +116,8 @@ static const u64 p6_perfmon_event_map[]

> > > {
> > > [PERF_COUNT_HW_CPU_CYCLES] = 0x0079,
> > > [PERF_COUNT_HW_INSTRUCTIONS] = 0x00c0,

> > > - [PERF_COUNT_HW_CACHE_REFERENCES] = 0x0000,
> > > - [PERF_COUNT_HW_CACHE_MISSES] = 0x0000,
> > > + [PERF_COUNT_HW_CACHE_REFERENCES] = 0x0f2e,
> > > + [PERF_COUNT_HW_CACHE_MISSES] = 0x012e,
> >
> > 2e is total numer of L2 events,
> >
> > 0f is all mesi states
> > 01 is invalid states
>
> here's Intel's own description:
>
> I_STATE 0x01 Counts how many times requests miss the cache.
> MESI 0x0F Counts how many times cache lines in any state are accessed.
>
> so it's pretty close in practice. The only counts that are a bit
> inapplicable are fetches/prefetches it initiates on its own (they
> are included here) - but those too are related to the workload in
> general, so it's good as an approximation.
>
> It's definitely better than 0x00 IMO. What do you think?

Well, if they say so. I was thinking that counting I states would count
invalidates due to remote S->{E,M} transitions and invpg ins' and such.

And hitting an invalidated line is a whole different thing than plain
missing it due to it not being present.

Anyway, if this is the Intel recommended thing for cache misses, who am
I to argue.

Peter Zijlstra

unread,

Aug 11, 2009, 8:10:21 AM8/11/09

to

On Tue, 2009-08-11 at 11:34 +0200, Ingo Molnar wrote:

> @@ -116,8 +116,8 @@ static const u64 p6_perfmon_event_map[]
> {

> [PERF_COUNT_HW_CPU_CYCLES] = 0x0079,
> [PERF_COUNT_HW_INSTRUCTIONS] = 0x00c0,

> - [PERF_COUNT_HW_CACHE_REFERENCES] = 0x0000,
> - [PERF_COUNT_HW_CACHE_MISSES] = 0x0000,
> + [PERF_COUNT_HW_CACHE_REFERENCES] = 0x0f2e,
> + [PERF_COUNT_HW_CACHE_MISSES] = 0x012e,

2e is total numer of L2 events,

0f is all mesi states
01 is invalid states

Yes they are numbers, but no they are not full cache refs and misses.

I don't think this is a good patch.

Ingo Molnar

unread,

Aug 11, 2009, 8:50:06 AM8/11/09

to

Ok. That definitely looks like the right event to use.

Could you try the patch below, does it do the trick? Note, since
there's just two generic counters and perf stat uses four counters,
you'll need to run longer commands than 'true' or something like:

perf stat -a sleep 1

or:

perf stat --repeat 10 /bin/ls -R /usr/bin >/dev/null

to get all counters excercised and time-shared on your CPU.

Ingo

-------------->
Subject: perf_counter, x86: Add generic cache events to P6-mobile CPUs
From: Ingo Molnar <mi...@elte.hu>
Date: Tue Aug 11 10:26:33 CEST 2009

Johannes Stezenbach reported that 'perf stat' does not count
cache-miss and cache-references events on his Pentium-M based
laptop.

Add the events.

Reported-by: Johannes Stezenbach <j...@sig21.net>
Signed-off-by: Ingo Molnar <mi...@elte.hu>
---
arch/x86/kernel/cpu/perf_counter.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)

Index: linux/arch/x86/kernel/cpu/perf_counter.c
===================================================================
--- linux.orig/arch/x86/kernel/cpu/perf_counter.c
+++ linux/arch/x86/kernel/cpu/perf_counter.c

@@ -116,8 +116,8 @@ static const u64 p6_perfmon_event_map[]
{

[PERF_COUNT_HW_CPU_CYCLES] = 0x0079,
[PERF_COUNT_HW_INSTRUCTIONS] = 0x00c0,

- [PERF_COUNT_HW_CACHE_REFERENCES] = 0x0000,
- [PERF_COUNT_HW_CACHE_MISSES] = 0x0000,
+ [PERF_COUNT_HW_CACHE_REFERENCES] = 0x0f2e,
+ [PERF_COUNT_HW_CACHE_MISSES] = 0x012e,

[PERF_COUNT_HW_BRANCH_INSTRUCTIONS] = 0x00c4,
[PERF_COUNT_HW_BRANCH_MISSES] = 0x00c5,
[PERF_COUNT_HW_BUS_CYCLES] = 0x0062,

Ingo Molnar

unread,

Aug 11, 2009, 9:20:16 AM8/11/09

to

* Peter Zijlstra <a.p.zi...@chello.nl> wrote:

> On Tue, 2009-08-11 at 11:34 +0200, Ingo Molnar wrote:
>
> > @@ -116,8 +116,8 @@ static const u64 p6_perfmon_event_map[]
> > {
> > [PERF_COUNT_HW_CPU_CYCLES] = 0x0079,
> > [PERF_COUNT_HW_INSTRUCTIONS] = 0x00c0,
> > - [PERF_COUNT_HW_CACHE_REFERENCES] = 0x0000,
> > - [PERF_COUNT_HW_CACHE_MISSES] = 0x0000,
> > + [PERF_COUNT_HW_CACHE_REFERENCES] = 0x0f2e,
> > + [PERF_COUNT_HW_CACHE_MISSES] = 0x012e,
>
> 2e is total numer of L2 events,
>
> 0f is all mesi states
> 01 is invalid states

here's Intel's own description:

I_STATE 0x01 Counts how many times requests miss the cache.
MESI 0x0F Counts how many times cache lines in any state are accessed.

so it's pretty close in practice. The only counts that are a bit
inapplicable are fetches/prefetches it initiates on its own (they
are included here) - but those too are related to the workload in
general, so it's good as an approximation.

It's definitely better than 0x00 IMO. What do you think?

Ingo

Johannes Stezenbach

unread,

Aug 11, 2009, 11:50:09 AM8/11/09

to

On Tue, Aug 11, 2009 at 12:13:07AM +0200, Johannes Stezenbach wrote:
> On Mon, Aug 10, 2009 at 11:31:33PM +0200, Ingo Molnar wrote:
> > * Johannes Stezenbach <j...@sig21.net> wrote:
> > >
> > > Could the warning be caused by the cpufreq ondemand governor? ISTR
> > > that one should switch to the performance governor before doing
> > > any profiling, but I forgot for this test.
> >
> > there might be a connection - it could in theory cause sched_clock()
> > transients and confuse the ring-buffer time-stamping.
>
> I'll try tomorrow after a fresh boot if the warning also appears
> with the performance governor.

Nope, cpufreq was not the culprit. The "Delta way too big!"
warning in rb_reserve_next_event() happens after suspend-to-RAM.

Johannes Stezenbach

unread,

Aug 11, 2009, 12:00:22 PM8/11/09

to

On Tue, Aug 11, 2009 at 11:34:05AM +0200, Ingo Molnar wrote:
>
> Could you try the patch below, does it do the trick? Note, since
> there's just two generic counters and perf stat uses four counters,
> you'll need to run longer commands than 'true' or something like:
>
> perf stat -a sleep 1
>
> or:
>
> perf stat --repeat 10 /bin/ls -R /usr/bin >/dev/null
>
> to get all counters excercised and time-shared on your CPU.

OK, here are the results:

# ./perf stat -a true

Performance counter stats for 'true':

1.332145 task-clock-msecs # 1.057 CPUs
4 context-switches # 0.003 M/sec

0 CPU-migrations # 0.000 M/sec

121 page-faults # 0.091 M/sec
788333 cycles # 591.777 M/sec
610408 instructions # 0.774 IPC

<not counted> cache-references
<not counted> cache-misses

0.001260775 seconds time elapsed

# ./perf stat -a sleep 1

Performance counter stats for 'sleep 1':

1001.358010 task-clock-msecs # 1.000 CPUs
253 context-switches # 0.000 M/sec

0 CPU-migrations # 0.000 M/sec

145 page-faults # 0.000 M/sec
8841883 cycles # 8.830 M/sec (scaled from 85.04%)
6361047 instructions # 0.719 IPC (scaled from 78.03%)
801847 cache-references # 0.801 M/sec (scaled from 14.96%)
16925 cache-misses # 0.017 M/sec (scaled from 21.97%)

1.001540692 seconds time elapsed

# ./perf stat --repeat 10 /bin/ls -R /usr/bin >/dev/null

Performance counter stats for '/bin/ls -R /usr/bin' (10 runs):

8.642859 task-clock-msecs # 0.939 CPUs ( +- 2.382% )
1 context-switches # 0.000 M/sec ( +- 0.000% )
0 CPU-migrations # 0.000 M/sec ( +- 0.000% )
375 page-faults # 0.043 M/sec ( +- 0.000% )
14796621 cycles # 1712.005 M/sec ( +- 0.306% )
15526008 instructions # 1.049 IPC ( +- 0.020% )

<not counted> cache-references
<not counted> cache-misses

0.009207932 seconds time elapsed ( +- 4.020% )

# ./perf stat --repeat 10 /bin/ls -lR /usr/bin >/dev/null

Performance counter stats for '/bin/ls -lR /usr/bin' (10 runs):

49.894190 task-clock-msecs # 0.762 CPUs ( +- 8.231% )
45 context-switches # 0.001 M/sec ( +- 52.174% )
0 CPU-migrations # 0.000 M/sec ( +- 0.000% )
476 page-faults # 0.010 M/sec ( +- 0.000% )
81036819 cycles # 1624.173 M/sec ( +- 3.166% ) (scaled from 96.03%)
88822455 instructions # 1.096 IPC ( +- 2.108% ) (scaled from 96.42%)
573228 cache-references # 11.489 M/sec ( +- 37.947% )
<not counted> cache-misses

0.065464625 seconds time elapsed ( +- 19.349% )

# ./perf stat --repeat 10 /bin/ls -R /usr/lib >/dev/null

Performance counter stats for '/bin/ls -R /usr/lib' (10 runs):

136.771643 task-clock-msecs # 0.075 CPUs ( +- 19.292% )
757 context-switches # 0.006 M/sec ( +- 56.332% )
0 CPU-migrations # 0.000 M/sec ( +- 0.000% )
343 page-faults # 0.003 M/sec ( +- 0.000% )
173125500 cycles # 1265.800 M/sec ( +- 5.034% ) (scaled from 78.21%)
174597635 instructions # 1.009 IPC ( +- 1.829% ) (scaled from 78.34%)
2441413 cache-references # 17.850 M/sec ( +- 18.289% ) (scaled from 21.79%)
21112 cache-misses # 0.154 M/sec ( +- 11.183% ) (scaled from 21.66%)

1.824204481 seconds time elapsed ( +- 29.781% )

Johannes

Ingo Molnar

unread,

Aug 11, 2009, 1:00:15 PM8/11/09

to

* Johannes Stezenbach <j...@sig21.net> wrote:

Ok, this all looks pretty good!

Ingo

Ingo Molnar

unread,

Aug 12, 2009, 8:00:17 AM8/12/09

to

* Andi Kleen <an...@firstfloor.org> wrote:

> *PING* can someone please take the patch in there and process the
> other suggestions?

Firstly, a basic patch submission technical matter: could you please
stop spamming maintainers of the x86/MCE code with such '*PING*'
private mails, for patches you never properly submitted to begin
with?

The proper way to submit an upstream kernel fix, as you should well
be aware of, is to send a patch with a proper title and to Cc: it to
lkml and the maintainers affected. You never did that, you only
posted a for-testing patch into a discussion. Please stop this
self-important posturing, it's somewhat annoying.

Also, another, patch log quality issue, please credit Johannes
properly. You put this into the changelog:

> Originally reported by Johannes Stezenbach
>
> This is a 2.6.31 candidate because it fixes a regression.
>
> Signed-off-by: Andi Kleen <a...@linux.intel.com>

The proper way is to put this into the changelog:

Reported-by: Johannes Stezenbach <j...@sig21.net>

And given that Johannes also tested the patch, another line of:

Tested-by: Johannes Stezenbach <j...@sig21.net>

Would be appropriate as well.

Thirdly, we can do better with the fix itself too:

> +++ linux/arch/x86/kernel/cpu/mcheck/therm_throt.c
> @@ -236,6 +236,9 @@ void intel_init_thermal(struct cpuinfo_x
> int tm2 = 0;
> u32 l, h;
>
> + if (!cpu_has_apic || disable_apic)
> + return;
> +
> /* Thermal monitoring depends on ACPI and clock modulation*/
> if (!cpu_has(c, X86_FEATURE_ACPI) || !cpu_has(c, X86_FEATURE_ACC))
> return;

we already have that X86_FEATURE_ACPI and X86_FEATURE_ACC check and
a return statement. Would be better to expand that with the APIC
checks. Plus update the comment to also mention APIC as a
requirement plus fix the small error in the comment too while at it.

If these problems are fixed i'll apply the fix to tip:x86/urgent.

Thanks,

Ingo

Steven Rostedt

unread,

Aug 17, 2009, 10:50:13 AM8/17/09

to

On Tue, 11 Aug 2009, Johannes Stezenbach wrote:

> On Tue, Aug 11, 2009 at 12:13:07AM +0200, Johannes Stezenbach wrote:
> > On Mon, Aug 10, 2009 at 11:31:33PM +0200, Ingo Molnar wrote:
> > > * Johannes Stezenbach <j...@sig21.net> wrote:
> > > >
> > > > Could the warning be caused by the cpufreq ondemand governor? ISTR
> > > > that one should switch to the performance governor before doing
> > > > any profiling, but I forgot for this test.
> > >
> > > there might be a connection - it could in theory cause sched_clock()
> > > transients and confuse the ring-buffer time-stamping.
> >
> > I'll try tomorrow after a fresh boot if the warning also appears
> > with the performance governor.
>
> Nope, cpufreq was not the culprit. The "Delta way too big!"
> warning in rb_reserve_next_event() happens after suspend-to-RAM.

Yeah, the ring buffer expects the time delta of two different events on
the same page to be less than 2^59 nanosecs apart. But something must have
screwed things up, since 2^59 nanosecs is 6671 days (18 years).

Perhaps coming out of suspend to ram changes the timestamp? Does it reset
the result of sched_clock?

-- Steve