
MCE/Package power limit notification


Udo Steinberg

Nov 23, 2011, 10:20:01 AM
Hi,

After half an hour of compiling code on an Intel SNB machine, at which time
the machine was more or less permanently running with turbo boost active,
I've gotten the following in my dmesg with Linux-3.1.0:

CPU2: Package power limit notification (total events = 1)
CPU3: Package power limit notification (total events = 1)
CPU1: Package power limit notification (total events = 1)
CPU0: Package power limit notification (total events = 1)
CPU2: Package power limit normal
CPU3: Package power limit normal
CPU1: Package power limit normal
CPU0: Package power limit normal
[Hardware Error]: Machine check events logged
CPU1: Package power limit notification (total events = 655)
CPU2: Package power limit notification (total events = 655)
CPU3: Package power limit notification (total events = 655)
CPU0: Package power limit notification (total events = 655)
CPU3: Package power limit normal
CPU2: Package power limit normal
CPU1: Package power limit normal
CPU0: Package power limit normal
[Hardware Error]: Machine check events logged

Below is the output from mcelog. Is there anything that can be done about it?

Cheers,

- Udo

mcelog: Unsupported new Family 6 Model 2a CPU: only decoding architectural errors
HARDWARE ERROR. This is *NOT* a software problem!
Please contact your hardware vendor
MCE 0
CPU 3 THERMAL EVENT TSC 179fecf42d7
TIME 1322060688 Wed Nov 23 16:04:48 2011
Processor 3 below trip temperature. Throttling disabled
STATUS c0000000880c0c00 MCGSTATUS 0
MCGCAP c07 APICID 3 SOCKETID 0
CPUID Vendor Intel Family 6 Model 42
mcelog: Unsupported new Family 6 Model 2a CPU: only decoding architectural errors
HARDWARE ERROR. This is *NOT* a software problem!
Please contact your hardware vendor
MCE 1
CPU 1 THERMAL EVENT TSC 179fecf5d95
TIME 1322060688 Wed Nov 23 16:04:48 2011
Processor 1 below trip temperature. Throttling disabled
STATUS c0000000880c0c00 MCGSTATUS 0
MCGCAP c07 APICID 1 SOCKETID 0
CPUID Vendor Intel Family 6 Model 42
mcelog: Unsupported new Family 6 Model 2a CPU: only decoding architectural errors
HARDWARE ERROR. This is *NOT* a software problem!
Please contact your hardware vendor
MCE 2
CPU 0 THERMAL EVENT TSC 179fecf72fb
TIME 1322060688 Wed Nov 23 16:04:48 2011
Processor 0 below trip temperature. Throttling disabled
STATUS c0000000880c0c00 MCGSTATUS 0
MCGCAP c07 APICID 0 SOCKETID 0
CPUID Vendor Intel Family 6 Model 42
mcelog: Unsupported new Family 6 Model 2a CPU: only decoding architectural errors
HARDWARE ERROR. This is *NOT* a software problem!
Please contact your hardware vendor
MCE 3
CPU 2 THERMAL EVENT TSC 179fecf88a9
TIME 1322060688 Wed Nov 23 16:04:48 2011
Processor 2 below trip temperature. Throttling disabled
STATUS c0000000880c0c00 MCGSTATUS 0
MCGCAP c07 APICID 2 SOCKETID 0
CPUID Vendor Intel Family 6 Model 42
mcelog: Unsupported new Family 6 Model 2a CPU: only decoding architectural errors
HARDWARE ERROR. This is *NOT* a software problem!
Please contact your hardware vendor
MCE 4
CPU 3 THERMAL EVENT TSC 179fef2e964
TIME 1322060688 Wed Nov 23 16:04:48 2011
Processor 3 below trip temperature. Throttling disabled
STATUS c0000000880c0800 MCGSTATUS 0
MCGCAP c07 APICID 3 SOCKETID 0
CPUID Vendor Intel Family 6 Model 42
mcelog: Unsupported new Family 6 Model 2a CPU: only decoding architectural errors
HARDWARE ERROR. This is *NOT* a software problem!
Please contact your hardware vendor
MCE 5
CPU 1 THERMAL EVENT TSC 179fef2fd56
TIME 1322060688 Wed Nov 23 16:04:48 2011
Processor 1 below trip temperature. Throttling disabled
STATUS c0000000880c0800 MCGSTATUS 0
MCGCAP c07 APICID 1 SOCKETID 0
CPUID Vendor Intel Family 6 Model 42
mcelog: Unsupported new Family 6 Model 2a CPU: only decoding architectural errors
HARDWARE ERROR. This is *NOT* a software problem!
Please contact your hardware vendor
MCE 6
CPU 0 THERMAL EVENT TSC 179fef30d22
TIME 1322060688 Wed Nov 23 16:04:48 2011
Processor 0 below trip temperature. Throttling disabled
STATUS c0000000880c0800 MCGSTATUS 0
MCGCAP c07 APICID 0 SOCKETID 0
CPUID Vendor Intel Family 6 Model 42
mcelog: Unsupported new Family 6 Model 2a CPU: only decoding architectural errors
HARDWARE ERROR. This is *NOT* a software problem!
Please contact your hardware vendor
MCE 7
CPU 2 THERMAL EVENT TSC 179fef311bd
TIME 1322060688 Wed Nov 23 16:04:48 2011
Processor 2 below trip temperature. Throttling disabled
STATUS c0000000880c0800 MCGSTATUS 0
MCGCAP c07 APICID 2 SOCKETID 0
CPUID Vendor Intel Family 6 Model 42
mcelog: Unsupported new Family 6 Model 2a CPU: only decoding architectural errors
HARDWARE ERROR. This is *NOT* a software problem!
Please contact your hardware vendor
MCE 8
CPU 2 THERMAL EVENT TSC 31a70403488
TIME 1322060688 Wed Nov 23 16:04:48 2011
Processor 2 below trip temperature. Throttling disabled
STATUS c000000088080c02 MCGSTATUS 0
MCGCAP c07 APICID 2 SOCKETID 0
CPUID Vendor Intel Family 6 Model 42
mcelog: Unsupported new Family 6 Model 2a CPU: only decoding architectural errors
HARDWARE ERROR. This is *NOT* a software problem!
Please contact your hardware vendor
MCE 9
CPU 3 THERMAL EVENT TSC 31a7040468b
TIME 1322060688 Wed Nov 23 16:04:48 2011
Processor 3 below trip temperature. Throttling disabled
STATUS c000000088080c02 MCGSTATUS 0
MCGCAP c07 APICID 3 SOCKETID 0
CPUID Vendor Intel Family 6 Model 42
mcelog: Unsupported new Family 6 Model 2a CPU: only decoding architectural errors
HARDWARE ERROR. This is *NOT* a software problem!
Please contact your hardware vendor
MCE 10
CPU 0 THERMAL EVENT TSC 31a704059b1
TIME 1322060688 Wed Nov 23 16:04:48 2011
Processor 0 below trip temperature. Throttling disabled
STATUS c000000088080c02 MCGSTATUS 0
MCGCAP c07 APICID 0 SOCKETID 0
CPUID Vendor Intel Family 6 Model 42
mcelog: Unsupported new Family 6 Model 2a CPU: only decoding architectural errors
HARDWARE ERROR. This is *NOT* a software problem!
Please contact your hardware vendor
MCE 11
CPU 1 THERMAL EVENT TSC 31a70407082
TIME 1322060688 Wed Nov 23 16:04:48 2011
Processor 1 below trip temperature. Throttling disabled
STATUS c000000088080c02 MCGSTATUS 0
MCGCAP c07 APICID 1 SOCKETID 0
CPUID Vendor Intel Family 6 Model 42
mcelog: Unsupported new Family 6 Model 2a CPU: only decoding architectural errors
HARDWARE ERROR. This is *NOT* a software problem!
Please contact your hardware vendor
MCE 12
CPU 2 THERMAL EVENT TSC 31a705416f8
TIME 1322060688 Wed Nov 23 16:04:48 2011
Processor 2 below trip temperature. Throttling disabled
STATUS c000000088080802 MCGSTATUS 0
MCGCAP c07 APICID 2 SOCKETID 0
CPUID Vendor Intel Family 6 Model 42
mcelog: Unsupported new Family 6 Model 2a CPU: only decoding architectural errors
HARDWARE ERROR. This is *NOT* a software problem!
Please contact your hardware vendor
MCE 13
CPU 1 THERMAL EVENT TSC 31a70542907
TIME 1322060688 Wed Nov 23 16:04:48 2011
Processor 1 below trip temperature. Throttling disabled
STATUS c000000088080802 MCGSTATUS 0
MCGCAP c07 APICID 1 SOCKETID 0
CPUID Vendor Intel Family 6 Model 42
mcelog: Unsupported new Family 6 Model 2a CPU: only decoding architectural errors
HARDWARE ERROR. This is *NOT* a software problem!
Please contact your hardware vendor
MCE 14
CPU 0 THERMAL EVENT TSC 31a70543945
TIME 1322060688 Wed Nov 23 16:04:48 2011
Processor 0 below trip temperature. Throttling disabled
STATUS c000000088080802 MCGSTATUS 0
MCGCAP c07 APICID 0 SOCKETID 0
CPUID Vendor Intel Family 6 Model 42
mcelog: Unsupported new Family 6 Model 2a CPU: only decoding architectural errors
HARDWARE ERROR. This is *NOT* a software problem!
Please contact your hardware vendor
MCE 15
CPU 3 THERMAL EVENT TSC 31a70543cab
TIME 1322060688 Wed Nov 23 16:04:48 2011
Processor 3 below trip temperature. Throttling disabled
STATUS c000000088080802 MCGSTATUS 0
MCGCAP c07 APICID 3 SOCKETID 0
CPUID Vendor Intel Family 6 Model 42
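
Since mcelog only decodes architectural errors on this CPU model, the STATUS words above are left raw. The following is a minimal sketch of how they might be decoded by hand. It rests on two assumptions that are not stated in this thread: that the kernel places its own event type in bits 63:62 of the logged value, and that the low 32 bits follow the thermal status MSR layout described in the Intel SDM (bit 10 power limitation status, bit 11 power limitation log, bits 22:16 digital readout). The function name and the dictionary keys are hypothetical.

```python
# Hypothetical helper: decode the STATUS word of an mcelog "THERMAL EVENT".
# Assumption: bits 63:62 hold the kernel's event type, and the low 32 bits
# mirror the thermal status MSR layout from the Intel SDM.

EVENT_TYPES = {
    0: "core throttled",
    1: "core power limit",
    2: "package throttled",
    3: "package power limit",
}


def decode_thermal_status(status: int) -> dict:
    """Split a 64-bit STATUS value into the fields of interest."""
    low = status & 0xFFFFFFFF
    return {
        "event": EVENT_TYPES[(status >> 62) & 0x3],
        "thermal_status": bool(low & (1 << 0)),
        "thermal_status_log": bool(low & (1 << 1)),
        "power_limit_status": bool(low & (1 << 10)),
        "power_limit_log": bool(low & (1 << 11)),
        # Assumed to be the distance from the throttling temperature.
        "digital_readout": (low >> 16) & 0x7F,
    }


if __name__ == "__main__":
    # The four distinct STATUS values that appear in the mcelog output.
    for raw in ("c0000000880c0c00", "c0000000880c0800",
                "c000000088080c02", "c000000088080802"):
        print(raw, decode_thermal_status(int(raw, 16)))
```

Under those assumptions, the records with STATUS ending in `0c00` or `0c02` have the power limit status bit set, which would match the "Package power limit notification" messages. The records ending in `0800` or `0802` have only the log bit set, which would match "Package power limit normal".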

Luck, Tony

Nov 28, 2011, 5:50:04 PM
> After half an hour of compiling code on an Intel SNB machine, at which time
> the machine was more or less permanently running with turbo boost active,
> I've gotten the following in my dmesg with Linux-3.1.0:
>
> CPU2: Package power limit notification (total events = 1)
> CPU3: Package power limit notification (total events = 1)
> CPU1: Package power limit notification (total events = 1)
> CPU0: Package power limit notification (total events = 1)
> ...

Udo,

Fenghua is looking at options to tone down these messages - reporting
as machine checks is unduly scary.

-Tony

Yu, Fenghua

Nov 28, 2011, 6:00:02 PM
> -----Original Message-----
> From: Luck, Tony
> Sent: Monday, November 28, 2011 2:46 PM
> To: Udo Steinberg; Linux Kernel Mailing List
> Cc: Yu, Fenghua
> Subject: RE: MCE/Package power limit notification
>
> > After half an hour of compiling code on an Intel SNB machine, at which time
> > the machine was more or less permanently running with turbo boost active,
> > I've gotten the following in my dmesg with Linux-3.1.0:
> >
> > CPU2: Package power limit notification (total events = 1)
> > CPU3: Package power limit notification (total events = 1)
> > CPU1: Package power limit notification (total events = 1)
> > CPU0: Package power limit notification (total events = 1)
> > ...
>
> Udo,
>
> Fenghua is looking at options to tone down these messages - reporting
> as machine checks is unduly scary.

Hi, Udo,

I sent out a patch to remove the mcelog info. Could you try it and see if it works for you?
https://lkml.org/lkml/2011/11/14/239

Thanks.

-Fenghua

Udo Steinberg

Nov 29, 2011, 4:30:02 PM
On Mon, 28 Nov 2011 14:50:47 -0800 Yu, Fenghua (YF) wrote:

YF> I sent out a patch to remove the mcelog info. Could you try it and see if it works for you?
YF> https://lkml.org/lkml/2011/11/14/239
YF>
YF> Thanks.
YF>
YF> -Fenghua

Hi Fenghua,

Thanks for the patch. It works and eliminates the MCE warnings. What exactly
are the BIOS issues mentioned in the patch description? Is the BIOS programming
some MSRs the wrong way?

Cheers,

- Udo

Fenghua Yu

Nov 29, 2011, 4:50:02 PM
Hi, Udo,

Could you please check counters in /sys/devices/system/cpu/cpu#/thermal_throttle
and see which counters report the thermal events?

The idea of the patch is to stop logging these events to mcelog and to record
them in the respective counters instead. That way the events are no longer
reported as scary hardware errors but are still captured in the counters.

I think the BIOS/firmware sets up the power limit or thermal throttling
incorrectly and so triggers the events incorrectly. You could try an updated
BIOS to see whether the events go away.

Udo Steinberg

Nov 29, 2011, 5:20:01 PM
Hi Fenghua,

On Tue, 29 Nov 2011 13:43:43 -0800 Fenghua Yu (FY) wrote:

FY> Could you please check counters in /sys/devices/system/cpu/cpu#/thermal_throttle
FY> and see which counters report the thermal events?

cpu0/thermal_throttle/core_power_limit_count : 0
cpu0/thermal_throttle/core_throttle_count : 102536
cpu0/thermal_throttle/package_power_limit_count : 384
cpu0/thermal_throttle/package_throttle_count : 183429
cpu1/thermal_throttle/core_power_limit_count : 0
cpu1/thermal_throttle/core_throttle_count : 102536
cpu1/thermal_throttle/package_power_limit_count : 384
cpu1/thermal_throttle/package_throttle_count : 183429
cpu2/thermal_throttle/core_power_limit_count : 0
cpu2/thermal_throttle/core_throttle_count : 104859
cpu2/thermal_throttle/package_power_limit_count : 384
cpu2/thermal_throttle/package_throttle_count : 183429
cpu3/thermal_throttle/core_power_limit_count : 0
cpu3/thermal_throttle/core_throttle_count : 104859
cpu3/thermal_throttle/package_power_limit_count : 384
cpu3/thermal_throttle/package_throttle_count : 183429
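
A listing like the one above can be collected with a few lines of Python. This is only a sketch: the function name and the `cpu_root` parameter are hypothetical, and it assumes each `*_count` file under `thermal_throttle` holds a single decimal integer, as the values shown here suggest.

```python
from pathlib import Path


def read_throttle_counters(cpu_root="/sys/devices/system/cpu"):
    """Return {cpu_name: {counter_name: value}} for every thermal_throttle counter.

    cpu_root is a parameter so the function can be pointed at a copy of the
    sysfs tree; the default is the path mentioned in this thread.
    """
    counters = {}
    pattern = "cpu[0-9]*/thermal_throttle/*_count"
    for path in sorted(Path(cpu_root).glob(pattern)):
        cpu = path.parts[-3]  # e.g. "cpu0"
        # Assumption: each counter file contains one decimal integer.
        counters.setdefault(cpu, {})[path.name] = int(path.read_text().strip())
    return counters


if __name__ == "__main__":
    for cpu, values in read_throttle_counters().items():
        for name, value in sorted(values.items()):
            print(f"{cpu}/thermal_throttle/{name} : {value}")
```

Running it before and after a long compile and comparing the two results would show which counters actually advance under load.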

FY> The thought of the patch is to remove the errors in mcelog and report the errors
FY> in respective counters. Therefore, the events are not reported as scary hardware
FY> issues but are still captured in counters.

I'm still seeing the following messages:

CPU2: Package temperature above threshold, cpu clock throttled (total events = 146147)
CPU3: Package temperature above threshold, cpu clock throttled (total events = 146147)
CPU1: Package temperature above threshold, cpu clock throttled (total events = 146147)
CPU0: Package temperature above threshold, cpu clock throttled (total events = 146147)
CPU0: Package temperature/speed normal
CPU2: Package temperature/speed normal
CPU1: Package temperature/speed normal
CPU3: Package temperature/speed normal
CPU3: Core temperature above threshold, cpu clock throttled (total events = 81740)
CPU2: Core temperature above threshold, cpu clock throttled (total events = 81740)
CPU2: Core temperature/speed normal
CPU3: Core temperature/speed normal
[Hardware Error]: Machine check events logged

FY> I think BIOS/firmware sets up power limit or thermal throttle incorrectly and
FY> triggers events incorrectly. You may try updated BIOS to see if the events go
FY> away.

I'm running the latest BIOS on my Lenovo Thinkpad X220. Someone should talk
to Lenovo about getting this fixed. My machine reports:

DMI: LENOVO 4290W4H/4290W4H, BIOS 8DET54WW (1.24 ) 10/18/2011

Cheers,

- Udo