[Patch V0] x86, mce: Ensure offline CPU's don't participate in mce rendezvous process.

Ashok Raj

Dec 3, 2015, 6:20:05 PM

Linux has logical cpu offline capability. That can be triggered by:

# echo 0 > /sys/devices/system/cpu/cpuX/online

In Intel Architecture, MCE's are broadcasted to all CPUs in the system.

This includes the CPUs marked offline by Linux. Unless the CPU's were removed
via an ACPI notification, in which case the cpu's are removed from the
cpu_present_map.

This patch ensures offline CPU's don't participate in MCE rendezvous, but
simply perform clearing some status bits to ensure a second MCE wont cause
automatic shutdown.

Without the patch, mce_start will increment mce_callin, but mce_start would
wait for all online_cpus. So offline cpu's should avoid participating in the
rendezvous process.

Reviewed-by: Tony Luck <tony...@intel.com>
Signed-off-by: Ashok Raj <asho...@intel.com>
---
arch/x86/kernel/cpu/mcheck/mce.c | 11 ++++++++++-
1 file changed, 10 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kernel/cpu/mcheck/mce.c b/arch/x86/kernel/cpu/mcheck/mce.c
index c5b0d56..82a0c8b 100644
--- a/arch/x86/kernel/cpu/mcheck/mce.c
+++ b/arch/x86/kernel/cpu/mcheck/mce.c
@@ -998,6 +998,7 @@ void do_machine_check(struct pt_regs *regs, long error_code)
u64 recover_paddr = ~0ull;
int flags = MF_ACTION_REQUIRED;
int lmce = 0;
+ unsigned int cpu = smp_processor_id();

ist_enter(regs);

@@ -1008,6 +1009,14 @@ void do_machine_check(struct pt_regs *regs, long error_code)

mce_gather_info(&m, regs);

+ /*
+ * if this cpu is offline, just bail out.
+ * TBD: looking into adding any logs this offline CPU might have,
+ * to be collected and reported by the rendezvous master.
+ */
+ if (cpu_is_offline(cpu) && (m.mcgstatus & MCG_STATUS_RIPV))
+ goto out;
+
final = this_cpu_ptr(&mces_seen);
*final = m;

@@ -1142,8 +1151,8 @@ void do_machine_check(struct pt_regs *regs, long error_code)

if (worst > 0)
mce_report_event(regs);
- mce_wrmsrl(MSR_IA32_MCG_STATUS, 0);
out:
+ mce_wrmsrl(MSR_IA32_MCG_STATUS, 0);
sync_core();

if (recover_paddr == ~0ull)
--
2.4.3


Greg KH

Dec 3, 2015, 6:40:08 PM

On Thu, Dec 03, 2015 at 07:16:10PM -0500, Ashok Raj wrote:
> Linux has logical cpu offline capability. That can be triggered by:
>
> # echo 0 > /sys/devices/system/cpu/cpuX/online
>
> In Intel Architecture, MCE's are broadcasted to all CPUs in the system.
>
> This includes the CPUs marked offline by Linux. Unless the CPU's were removed
> via an ACPI notification, in which case the cpu's are removed from the
> cpu_present_map.
>
> This patch ensures offline CPU's don't participate in MCE rendezvous, but
> simply perform clearing some status bits to ensure a second MCE wont cause
> automatic shutdown.
>
> Without the patch, mce_start will increment mce_callin, but mce_start would
> wait for all online_cpus. So offline cpu's should avoid participating in the
> rendezvous process.
>
> Reviewed-by: Tony Luck <tony...@intel.com>
> Signed-off-by: Ashok Raj <asho...@intel.com>
> ---
> arch/x86/kernel/cpu/mcheck/mce.c | 11 ++++++++++-
> 1 file changed, 10 insertions(+), 1 deletion(-)
>

<formletter>

This is not the correct way to submit patches for inclusion in the
stable kernel tree. Please read Documentation/stable_kernel_rules.txt
for how to do this properly.

</formletter>

Borislav Petkov

Dec 4, 2015, 9:40:06 AM

On Thu, Dec 03, 2015 at 07:16:10PM -0500, Ashok Raj wrote:

This CPU - it being offline and all - is not doing the minimal amount of
work possible IMO.

Why does it have to do ist_enter(), this_cpu_inc(mce_exception_count),
etc?

IMO the only thing it should do is this:

	if (cpu_is_offline(smp_processor_id())) {
		mce_wrmsrl(MSR_IA32_MCG_STATUS, 0);
		return;
	}

and that should be at the very beginning of do_machine_check(). So
that the hardware is happy. Concerning Linux, it is offline so no data
structures on it are valid.

Hmmm?
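
The placement argument can be illustrated with a small user-space model. This is only a sketch under stated assumptions: `struct fake_cpu` and `fake_do_machine_check()` are hypothetical stand-ins invented for illustration, not kernel code. The point it shows is that an early return leaves per-CPU bookkeeping untouched while still clearing the status register.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for one logical CPU; not a kernel structure. */
struct fake_cpu {
	bool     online;           /* what cpu_is_offline() would consult   */
	uint64_t mcg_status;       /* models the MSR_IA32_MCG_STATUS value  */
	unsigned exception_count;  /* models per-CPU mce_exception_count    */
};

/*
 * Models a handler with the offline check as its very first step.
 * Returns true if the full (online) path ran, false on early bail-out.
 */
static bool fake_do_machine_check(struct fake_cpu *c)
{
	if (!c->online) {
		/* Models mce_wrmsrl(MSR_IA32_MCG_STATUS, 0); nothing else runs. */
		c->mcg_status = 0;
		return false;
	}

	/* Online path: bookkeeping, rendezvous, bank scan (all elided here). */
	c->exception_count++;
	c->mcg_status = 0;
	return true;
}
```

In this model an offline CPU never touches `exception_count`, which is exactly the trade-off debated later in the thread: less code runs, but /proc/interrupts would not show that the CPU saw the event.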

P.S., please don't put stable@ to CC - add it as a "CC: " line in the
SOB section instead.

--
Regards/Gruss,
Boris.

ECO tip #101: Trim your mails when you reply.

Raj, Ashok

Dec 4, 2015, 11:20:10 AM

Hi Boris

On Fri, Dec 04, 2015 at 03:34:04PM +0100, Borislav Petkov wrote:
> > @@ -1008,6 +1009,14 @@ void do_machine_check(struct pt_regs *regs, long error_code)

> > + if (cpu_is_offline(cpu) && (m.mcgstatus & MCG_STATUS_RIPV))
> > + goto out;
>
> This CPU - it being offline and all - is not doing the minimal amount of
> work possible IMO.
>
> Why does it have to do ist_enter(), this_cpu_inc(mce_exception_count),
> etc?

Yes, it's possible to skip ist_enter() and the exception count.

I tried to keep most of the code as is and leverage the code that already
reads MCG_STATUS. Architecturally we also need to check RIPV, and if it is
clear we should initiate shutdown.

When we add the logging from offline cpus as next step it would be safe to
use interrupt stack, and the offline

I liked the observability that comes from keeping the exception count. If and
when we online the cpu again, it might otherwise look as if it noticed nothing.
Now we can check /proc/interrupts and see that the offline cpu also observed
the MCE.

>
> IMO the only things it should do is this:
>
> if (cpu_is_offline(smp_processor_id())) {
> mce_wrmsrl(MSR_IA32_MCG_STATUS, 0);
> return;
> }
>
> and that should be at the very beginning of do_machine_check(). So
> that the hardware is happy. Concerning Linux, it is offline so no data
> structures on it are valid.

>

> P.S., please don't put stable@ to CC - add it as a "CC: " line in the
> SOB section instead.

Let me know what you think; I can resend with the Cc: stable line. I
did add the stable line in the right section in an earlier version, but
while deleting some extraneous commit messages I accidentally removed it
from this one :(.

Cheers,
Ashok

Borislav Petkov

Dec 4, 2015, 12:00:06 PM

On Fri, Dec 04, 2015 at 12:14:20PM -0500, Raj, Ashok wrote:
> Yes, it's possible to skip ist_enter() and the exception count.
>
> I tried to keep most of the code as is and leverage the code that already
> reads MCG_STATUS. Architecturally we also need to check RIPV, and if it is
> clear we should initiate shutdown.

So add that check too.

> When we add the logging from offline cpus as next step it would be safe to
> use interrupt stack, and the offline

Frankly, I'm not sure at all and very very wary of adding *any* code
which runs on an offlined CPU. Because *no one* does that and it hasn't
been tested at all. So who knows what happens.

What we should be doing is execute the *minimal* amount of code possible
and get out. No counting, no per-cpu variables. No nothing.

> I liked the observability that comes from keeping the exception count. If
> and when we online the cpu again, it might otherwise look as if it noticed
> nothing. Now we can check /proc/interrupts and see that the offline cpu
> also observed the MCE.

And? Tell us what? That SMM fondled the hardware under our feet. TBH,
I'd tend to be much more drastic here and even taint the kernel. I mean,
seriously, what kind of MCEs which happen as a result of OS execution
are you expecting to get reported on an offlined CPU?

I can't think of very many.

Because we have been considering offlining a core as one possible RAS
action. So what happens is a user or a RAS agent offlines a core and
yet, that offlined core still reports MCEs. Something's terribly wrong
with that picture, IMO.

--
Regards/Gruss,
Boris.

ECO tip #101: Trim your mails when you reply.

Luck, Tony

Dec 4, 2015, 12:30:08 PM

> Frankly, I'm not sure at all and very very wary of adding *any* code
> which runs on an offlined CPU. Because *no one* does that and it hasn't
> been tested at all. So who knows what happens.
>
> What we should be doing is execute the *minimal* amount of code possible
> and get out. No counting, no per-cpu variables. No nothing.

The minimal code requires we use:

smp_processor_id() [to get our cpu number]
cpu_is_offline() [to find out the cpu is offline]

The first of those looks more dangerous in that it accesses a per-cpu variable.

I don't think we need to be totally paranoid here. We know that the offline cpus
were once online and went through normal kernel initialization code (if they didn't,
then we can't possibly be executing this code ... their CR4.MCE bit would be zero so their
response to a machine check would have been to reset the system).

> Because we have been considering offlining a core as one possible RAS
> action. So what happens is a user or a RAS agent offlines a core and
> yet, that offlined core still reports MCEs. Something's terribly wrong
> with that picture, IMO.

Agreed. It would be more pleasant if we had some way to *really* offline a cpu,
including telling the rest of the system not to send it any more broadcast events
like MCE, SMI. But the h/w guys like to give the s/w guys job security by making
these corner cases that we have to work around in s/w :-)

-Tony

Borislav Petkov

Dec 4, 2015, 12:40:11 PM

On Fri, Dec 04, 2015 at 05:23:18PM +0000, Luck, Tony wrote:
> > Frankly, I'm not sure at all and very very wary of adding *any* code
> > which runs on an offlined CPU. Because *no one* does that and it hasn't
> > been tested at all. So who knows what happens.
> >
> > What we should be doing is execute the *minimal* amount of code possible
> > and get out. No counting, no per-cpu variables. No nothing.
>
> The minimal code requires we use:
>
> smp_processor_id() [to get our cpu number]
> cpu_is_offline() [to find out the cpu is offline]
>
> The first of those looks more dangerous in that it accesses a per-cpu variable.
>
> I don't think we need to be totally paranoid here. We know that the offline cpus
> were once online and went through normal kernel initialization code (if they didn't,
> then we can't possibly be executing this code ... their CR4.MCE bit would be zero so their
> response to a machine check would have been to reset the system).

I don't mean that - I mean the stuff we do before we call
cpu_is_offline() like ist_enter, this_cpu_inc(mce_exception_count),
etc. Then we do a whole another bunch of stuff at the "out:" label like
printk and whatnot which shouldn't run on an offlined CPU.

I.e., the check whether a CPU is offline should be the first thing we do
in do_machine_check and get the hell out if so.

> Agreed. It would be more pleasant if we had some way to *really* offline a cpu,
> including telling the rest of the system not to send it any more broadcast events
> like MCE, SMI. But the h/w guys like to give the s/w guys job security by making
> these corner cases that we have to work around in s/w :-)

Mind you, this is unintentional from the hw guys. But ha(!), I know
*exactly* what you mean.

:-)

Luck, Tony

Dec 4, 2015, 1:00:06 PM

> I don't mean that - I mean the stuff we do before we call
> cpu_is_offline() like ist_enter, this_cpu_inc(mce_exception_count),
> etc. Then we do a whole another bunch of stuff at the "out:" label like
> printk and whatnot which shouldn't run on an offlined CPU.

ist_enter() is black magic to me. Andy? Would you be worried about executing
ist_{enter,exit}() on a cpu that was once online, but is currently marked offline
by Linux?

Bumping mce_exception_count doesn't look like a big deal either way. It is visible in
/proc/interrupts so I'd like to keep that honest (if the cpu comes back online again).
But we could do the offline check before this.

There will be no printk() executed in the tail of the function. After we clear MCG_STATUS
at the (new location of) the out: label we will see recover_paddr is still ~0ull and "goto done".

-Tony

Borislav Petkov

Dec 4, 2015, 1:10:07 PM

On Fri, Dec 04, 2015 at 05:53:33PM +0000, Luck, Tony wrote:
> > I don't mean that - I mean the stuff we do before we call
> > cpu_is_offline() like ist_enter, this_cpu_inc(mce_exception_count),
> > etc. Then we do a whole another bunch of stuff at the "out:" label like
> > printk and whatnot which shouldn't run on an offlined CPU.
>
> ist_enter() is black magic to me. Andy? Would you be worried about executing
> ist_{enter,exit}() on a cpu that was once online, but is currently marked offline
> by Linux?

ist_enter() is context tracking functionality.

> Bumping mce_exception_count doesn't look like a big deal either way. It is visible in
> /proc/interrupts so I'd like to keep that honest (if the cpu comes back online again).
> But we could do the offline check before this.
>
> There will be no printk() executed in the tail of the function. after we clear MCG_STATUS
> at the (new location of) the out: label we will see recover_paddr is still ~0ull and "goto done".

Whether it is kosher or not is beside the point. Why should an offlined
CPU even noodle through all that code if it doesn't need/have to? It can
return immediately instead.

Luck, Tony

Dec 4, 2015, 1:40:06 PM

> Whether it is kosher or not is beside the point. Why should an offlined
> CPU even noodle through all that code if it doesn't need/have to? It can
> return immediately instead.

Ashok wants to move in stage 2 to having the offline cpu scan banks and report
any errors seen there. To do that we'll have to run through a fair bit of the
do_machine_check() code.

But ... if you want a super safe version to put the stable tag on ... we could
just have something like this at the head of do_machine_check()

	int cpu = smp_processor_id();

	if (cpu_is_offline(cpu)) {
		rdmsr(MCG_STATUS);
		if (RIPV bit set) {
			wrmsr(MCG_STATUS, 0);
			return;
		}
		// can we do anything here? Offline cpu has no place to return to.
		// There are no good answers ... falling into the regular code is
		// what we did historically
	}

-Tony
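
The RIPV decision in the pseudocode above can be written as a small, testable function. This is a sketch, not the kernel's code: the enum and `offline_cpu_action()` are hypothetical names introduced here, and the bit positions are the architectural IA32_MCG_STATUS layout (RIPV bit 0, EIPV bit 1, MCIP bit 2) as I understand it from the Intel SDM.

```c
#include <assert.h>
#include <stdint.h>

/* IA32_MCG_STATUS bits (architectural layout; defined locally for the sketch). */
#define MCG_STATUS_RIPV (1ULL << 0)  /* restart IP valid                */
#define MCG_STATUS_EIPV (1ULL << 1)  /* error IP valid                  */
#define MCG_STATUS_MCIP (1ULL << 2)  /* machine check in progress       */

/* Hypothetical result type: what an offline CPU should do next. */
enum offline_action {
	OFFLINE_CLEAR_AND_RETURN,  /* safe to resume: clear MCG_STATUS, return   */
	OFFLINE_FALL_THROUGH       /* nowhere to return to: take the normal path */
};

/* Decide based only on the RIPV bit of the MCG_STATUS value that was read. */
static enum offline_action offline_cpu_action(uint64_t mcgstatus)
{
	if (mcgstatus & MCG_STATUS_RIPV)
		return OFFLINE_CLEAR_AND_RETURN;
	return OFFLINE_FALL_THROUGH;
}
```

Keeping the decision separate from the MSR access makes the open question visible: the `OFFLINE_FALL_THROUGH` case has no good answer, so it simply preserves the historical behaviour.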

Borislav Petkov

Dec 4, 2015, 2:40:06 PM

On Fri, Dec 04, 2015 at 06:30:39PM +0000, Luck, Tony wrote:
> Ashok wants to move in stage 2 to having the offline cpu scan banks and report
> any errors seen there. To do that we'll have to run through a fair bit of the
> do_machine_check() code.

I'm still very sceptical whether logging those errors is worth the risk
of running code on an offlined CPU.

> But ... if you want a super safe version to put the stable tag on ... we could
> just have something like this at the head of do_machine_check()

It would be prudent IMO. We don't want to disrupt stable kernels
unnecessarily.

Andy Lutomirski

Dec 4, 2015, 5:40:06 PM

On Fri, Dec 4, 2015 at 9:53 AM, Luck, Tony <tony...@intel.com> wrote:
>> I don't mean that - I mean the stuff we do before we call
>> cpu_is_offline() like ist_enter, this_cpu_inc(mce_exception_count),
>> etc. Then we do a whole another bunch of stuff at the "out:" label like
>> printk and whatnot which shouldn't run on an offlined CPU.
>
> ist_enter() is black magic to me. Andy? Would you be worried about executing
> ist_{enter,exit}() on a cpu that was once online, but is currently marked offline
> by Linux?

Offline CPUs are black magic to me. But as long as the CPU works the
way that the normal specs say it should, then ist_enter is fair game.
In any event, if context tracking blows up on an offline CPU, I'd
argue that's a context tracking bug and needs to be fixed.

But maybe offlined CPUs are supposed to have all interrupts off
(including MCE?) and the argument goes the other way? Dunno.

--Andy

Ashok Raj

Dec 4, 2015, 6:00:06 PM

Linux has logical cpu offline capability. That can be triggered by:

# echo 0 > /sys/devices/system/cpu/cpuX/online

In Intel Architecture, MCE's are broadcasted to all CPUs in the system.

This includes the CPUs marked offline by Linux. Unless the CPU's were removed
via an ACPI notification, in which case the cpu's are removed from the
cpu_present_map.

This patch ensures offline CPU's don't participate in MCE rendezvous, but
simply perform clearing some status bits to ensure a second MCE wont cause
automatic shutdown.

Without the patch, mce_start will increment mce_callin, but mce_start would
only wait for all online_cpus. So offline cpu's should avoid participating
in the rendezvous process.

Reviewed-by: Tony Luck <tony...@intel.com>
Cc: sta...@vger.kernel.org
Signed-off-by: Ashok Raj <asho...@intel.com>
---

arch/x86/kernel/cpu/mcheck/mce.c | 15 ++++++++++++++-
1 file changed, 14 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kernel/cpu/mcheck/mce.c b/arch/x86/kernel/cpu/mcheck/mce.c
index c5b0d56..72ea044 100644
--- a/arch/x86/kernel/cpu/mcheck/mce.c
+++ b/arch/x86/kernel/cpu/mcheck/mce.c
@@ -998,7 +998,20 @@ void do_machine_check(struct pt_regs *regs, long error_code)
u64 recover_paddr = ~0ull;
int flags = MF_ACTION_REQUIRED;
int lmce = 0;
+ unsigned int cpu = smp_processor_id();

+ /*
+ * if this cpu is offline, just bail out.
+ */
+ if (cpu_is_offline(cpu)) {
+ u64 mcgstatus;
+
+ mcgstatus = mce_rdmsrl(MSR_IA32_MCG_STATUS);
+ if (mcgstatus & MCG_STATUS_RIPV) {
+ mce_wrmsrl(MSR_IA32_MCG_STATUS, 0);
+ return;
+ }
+ }
ist_enter(regs);

this_cpu_inc(mce_exception_count);
@@ -1142,8 +1155,8 @@ void do_machine_check(struct pt_regs *regs, long error_code)

if (worst > 0)
mce_report_event(regs);
- mce_wrmsrl(MSR_IA32_MCG_STATUS, 0);
out:
+ mce_wrmsrl(MSR_IA32_MCG_STATUS, 0);
sync_core();

if (recover_paddr == ~0ull)
--
2.4.3

Borislav Petkov

Dec 4, 2015, 6:10:07 PM

With that hunk here you want to clear MSR_IA32_MCG_STATUS in the
!cfg->banks case, right?

If so, this should be a separate patch. Also, CC: stable with an
explanation why.

Thanks.

--
Regards/Gruss,
Boris.

SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton, HRB 21284 (AG Nürnberg)
--

Raj, Ashok

Dec 4, 2015, 6:10:07 PM

On Fri, Dec 04, 2015 at 02:34:52PM -0800, Andy Lutomirski wrote:
> On Fri, Dec 4, 2015 at 9:53 AM, Luck, Tony <tony...@intel.com> wrote:
> > ist_enter() is black magic to me. Andy? Would you be worried about executing
> > ist_{enter,exit}() on a cpu that was once online, but is currently marked offline
> > by Linux?
>
> Offline CPUs are black magic to me. But as long as the CPU works the
> way that the normal specs say it should, then ist_enter is fair game.
> In any event, if context tracking blows up on an offline CPU, I'd
> argue that's a context tracking bug and needs to be fixed.
>
> But maybe offlined CPUs are supposed to have all interrupts off
> (including MCE?) and the argument goes the other way? Dunno.

MCE's are broadcast by the hardware and cannot be blocked. Offline
is only a Linux specific state. Now if the offline was a result of an ACPI
event (eject) that triggered the CPU removal (offline in Linux, as it would
have in a platform that supports true hotplug) then the platform would
remove this cpu from the broadcast list.

if kernel were to set CR4.MCE=0 that would cause system shutdown when
an MCE is broadcast and hits this cpu.

Cheers,
Ashok
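
The two shutdown conditions mentioned in this thread (CR4.MCE clear, and a second machine check arriving while MCG_STATUS still shows one in progress) can be modelled in a few lines. This is a simplified sketch of architectural behaviour as I understand it from the Intel SDM, not kernel code; `machine_check_outcome()` and its enum are hypothetical names, and the constants are defined locally.

```c
#include <assert.h>
#include <stdint.h>

#define X86_CR4_MCE     (1UL << 6)   /* CR4 bit 6: machine-check enable    */
#define MCG_STATUS_MCIP (1ULL << 2)  /* IA32_MCG_STATUS: MC in progress    */

/* Hypothetical outcome of a machine check arriving at one logical CPU. */
enum mc_outcome {
	MC_HANDLER_RUNS,  /* #MC is delivered to the OS handler  */
	MC_SHUTDOWN       /* processor enters shutdown           */
};

static enum mc_outcome machine_check_outcome(unsigned long cr4,
					      uint64_t mcg_status)
{
	/* With CR4.MCE clear, a machine check cannot be delivered. */
	if (!(cr4 & X86_CR4_MCE))
		return MC_SHUTDOWN;

	/* A new machine check while MCIP is still set also shuts down. */
	if (mcg_status & MCG_STATUS_MCIP)
		return MC_SHUTDOWN;

	return MC_HANDLER_RUNS;
}
```

This is why the patch clears MCG_STATUS on the offline CPU rather than ignoring the event: leaving MCIP set would turn the next broadcast into a shutdown.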

Borislav Petkov

Dec 4, 2015, 6:20:06 PM

On Fri, Dec 04, 2015 at 11:11:12PM +0000, Luck, Tony wrote:
> > With that hunk here you want to clear MSR_IA32_MCG_STATUS in the
> > !cfg->banks case, right?
>

> I can't imagine how we'd get into do_machine_check without any banks.
>
> Would indeed be a separate patch ... but value seems limited.

Then we should kill that hunk. I'm all for not covering improbable
cases.

Luck, Tony

Dec 4, 2015, 6:20:06 PM

> With that hunk here you want to clear MSR_IA32_MCG_STATUS in the
> !cfg->banks case, right?

I can't imagine how we'd get into do_machine_check without any banks.

Would indeed be a separate patch ... but value seems limited.

-Tony

Andy Lutomirski

Dec 4, 2015, 6:20:08 PM

On Fri, Dec 4, 2015 at 4:08 PM, Raj, Ashok <asho...@intel.com> wrote:
> On Fri, Dec 04, 2015 at 02:34:52PM -0800, Andy Lutomirski wrote:
>> On Fri, Dec 4, 2015 at 9:53 AM, Luck, Tony <tony...@intel.com> wrote:
>> > ist_enter() is black magic to me. Andy? Would you be worried about executing
>> > ist_{enter,exit}() on a cpu that was once online, but is currently marked offline
>> > by Linux?
>>
>> Offline CPUs are black magic to me. But as long as the CPU works the
>> way that the normal specs say it should, then ist_enter is fair game.
>> In any event, if context tracking blows up on an offline CPU, I'd
>> argue that's a context tracking bug and needs to be fixed.
>>
>> But maybe offlined CPUs are supposed to have all interrupts off
>> (including MCE?) and the argument goes the other way? Dunno.
>
> MCE's are broadcast by the hardware and cannot be blocked. Offline
> is only a Linux specific state. Now if the offline was a result of an ACPI
> event (eject) that triggered the CPU removal (offline in Linux, as it would
> have in a platform that supports true hotplug) then the platform would
> remove this cpu from the broadcast list.
>
> if kernel were to set CR4.MCE=0 that would cause system shutdown when
> an MCE is broadcast and hits this cpu.

I meant "supposed" as in Linux might expect arch code to prevent the
CPU from receiving interrupts.

Anyway, I think that would be silly and we should just expect
ist_enter to work regardless of online state.

This does mean that if we plug in a new CPU and online it, then
there's a window before we set up percpu memory and enable CR4.MCE in
which an MCE on any CPU will kill the system, at least on hardware for
which MCE broadcast can't be turned off.

--Andy

Ashok Raj

Dec 4, 2015, 6:30:06 PM

Linux has logical cpu offline capability. That can be triggered by:

# echo 0 > /sys/devices/system/cpu/cpuX/online

In Intel Architecture, MCE's are broadcasted to all CPUs in the system.

This includes the CPUs marked offline by Linux. Unless the CPU's were removed
via an ACPI notification, in which case the cpu's are removed from the
cpu_present_map.

This patch ensures offline CPU's don't participate in MCE rendezvous, but
simply perform clearing some status bits to ensure a second MCE wont cause
automatic shutdown.

Without the patch, mce_start will increment mce_callin, but mce_start would
only wait for all online_cpus. So offline cpu's should avoid participating
in the rendezvous process.

Reviewed-by: Tony Luck <tony...@intel.com>
Cc: sta...@vger.kernel.org
Signed-off-by: Ashok Raj <asho...@intel.com>
---

arch/x86/kernel/cpu/mcheck/mce.c | 13 +++++++++++++
1 file changed, 13 insertions(+)

diff --git a/arch/x86/kernel/cpu/mcheck/mce.c b/arch/x86/kernel/cpu/mcheck/mce.c
index c5b0d56..23ecb1d 100644
--- a/arch/x86/kernel/cpu/mcheck/mce.c
+++ b/arch/x86/kernel/cpu/mcheck/mce.c
@@ -998,7 +998,20 @@ void do_machine_check(struct pt_regs *regs, long error_code)
u64 recover_paddr = ~0ull;
int flags = MF_ACTION_REQUIRED;
int lmce = 0;
+ unsigned int cpu = smp_processor_id();
+
+ /*
+ * if this cpu is offline, just bail out.
+ */
+ if (cpu_is_offline(cpu)) {
+ u64 mcgstatus;

+ mcgstatus = mce_rdmsrl(MSR_IA32_MCG_STATUS);
+ if (mcgstatus & MCG_STATUS_RIPV) {
+ mce_wrmsrl(MSR_IA32_MCG_STATUS, 0);
+ return;
+ }
+ }
ist_enter(regs);

this_cpu_inc(mce_exception_count);

--
2.4.3

Borislav Petkov

Dec 7, 2015, 3:10:07 PM

On Fri, Dec 04, 2015 at 07:29:36PM -0500, Ashok Raj wrote:
> Linux has logical cpu offline capability. That can be triggered by:
>
> # echo 0 > /sys/devices/system/cpu/cpuX/online
>
> In Intel Architecture, MCE's are broadcasted to all CPUs in the system.
>
> This includes the CPUs marked offline by Linux. Unless the CPU's were removed
> via an ACPI notification, in which case the cpu's are removed from the
> cpu_present_map.
>
> This patch ensures offline CPU's don't participate in MCE rendezvous, but
> simply perform clearing some status bits to ensure a second MCE wont cause
> automatic shutdown.
>
> Without the patch, mce_start will increment mce_callin, but mce_start would
> only wait for all online_cpus. So offline cpu's should avoid participating
> in the rendezvous process.
>
> Reviewed-by: Tony Luck <tony...@intel.com>
> Cc: sta...@vger.kernel.org
> Signed-off-by: Ashok Raj <asho...@intel.com>
> ---
> arch/x86/kernel/cpu/mcheck/mce.c | 13 +++++++++++++
> 1 file changed, 13 insertions(+)

Tested on a box here, massaged commit message and queued for 4.4,
thanks.

---
From: Ashok Raj <asho...@intel.com>
Date: Thu, 3 Dec 2015 19:16:10 -0500
Subject: [PATCH] x86/mce: Ensure offline CPUs don't participate in rendezvous
process

Intel's MCA implementation broadcasts MCEs to all CPUs on the node.
This poses a problem for offlined CPUs which cannot participate in the
rendezvous process:

Kernel panic - not syncing: Timeout: Not all CPUs entered broadcast exception handler
Kernel Offset: disabled
Rebooting in 100 seconds..

More specifically, Linux does a soft offline of a CPU when writing a 0
to /sys/devices/system/cpu/cpuX/online, which doesn't prevent the #MC
exception from being broadcasted to that CPU.

Ensure that offline CPUs don't participate in the MCE rendezvous and
clear the RIP valid status bit so that a second MCE won't cause a
shutdown.

Without the patch, mce_start() will increment mce_callin and wait for
all CPUs. Offlined CPUs should avoid participating in the rendezvous
process altogether.

Reviewed-by: Tony Luck <tony...@intel.com>
Signed-off-by: Ashok Raj <asho...@intel.com>

Cc: "H. Peter Anvin" <h...@zytor.com>
Cc: Ingo Molnar <mi...@redhat.com>
Cc: linux-edac <linux...@vger.kernel.org>
Cc: sta...@vger.kernel.org
Cc: Thomas Gleixner <tg...@linutronix.de>
Cc: x86-ml <x...@kernel.org>
Link: http://lkml.kernel.org/r/1449188170-3909-1-git...@intel.com
[ Massage commit message. ]
Signed-off-by: Borislav Petkov <b...@suse.de>
---
arch/x86/kernel/cpu/mcheck/mce.c | 11 +++++++++++
1 file changed, 11 insertions(+)

diff --git a/arch/x86/kernel/cpu/mcheck/mce.c b/arch/x86/kernel/cpu/mcheck/mce.c
index 3865e95cc5ec..a006f4cd792b 100644
--- a/arch/x86/kernel/cpu/mcheck/mce.c
+++ b/arch/x86/kernel/cpu/mcheck/mce.c
@@ -1002,6 +1002,17 @@ void do_machine_check(struct pt_regs *regs, long error_code)
 	int flags = MF_ACTION_REQUIRED;
 	int lmce = 0;

+	/* If this CPU is offline, just bail out. */
+	if (cpu_is_offline(smp_processor_id())) {
+		u64 mcgstatus;
+
+		mcgstatus = mce_rdmsrl(MSR_IA32_MCG_STATUS);
+		if (mcgstatus & MCG_STATUS_RIPV) {
+			mce_wrmsrl(MSR_IA32_MCG_STATUS, 0);
+			return;
+		}
+	}
+
 	ist_enter(regs);

 	this_cpu_inc(mce_exception_count);
--
2.3.5

--
Regards/Gruss,
Boris.

ECO tip #101: Trim your mails when you reply.

Luck, Tony

Dec 7, 2015, 3:10:07 PM

> Kernel panic - not syncing: Timeout: Not all CPUs entered broadcast exception handler

Is that what we printed in this case? ... boy is that a misleading message ... we got *extra*
cpus (the offline ones), not "Not all".

Good job we have a fix :-)

-Tony
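
The "extra CPUs" failure mode can be illustrated with a deliberately simplified, single-threaded model of the call-in count. It assumes, as the commit message states, that every CPU entering the handler increments a shared counter while the waiter expects exactly the number of online CPUs. `rendezvous_completes()` is a hypothetical function for illustration; real timing, atomics and the actual timeout value are ignored.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Hypothetical model of the call-in phase.
 *   online_cpus            CPUs Linux considers online
 *   offline_cpus_hit       offline CPUs that still receive the broadcast
 *   offline_bail_out_early true if offline CPUs return before calling in
 * Returns true if the wait would finish, false if it would time out.
 */
static bool rendezvous_completes(int online_cpus, int offline_cpus_hit,
				 bool offline_bail_out_early)
{
	int callin = 0;  /* stands in for mce_callin */

	for (int i = 0; i < online_cpus; i++)
		callin++;

	if (!offline_bail_out_early)
		for (int i = 0; i < offline_cpus_hit; i++)
			callin++;

	/* Bounded wait standing in for the timeout loop. */
	for (int spins = 0; spins < 1000; spins++)
		if (callin == online_cpus)
			return true;

	return false;  /* counter overshot: the wait can never be satisfied */
}
```

With 62 online CPUs and 2 offline ones calling in, the counter ends at 64 and the equality test never holds, which would surface as the "Not all CPUs entered" timeout even though too many CPUs showed up.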

Borislav Petkov

Dec 7, 2015, 3:30:07 PM

Well, we still have that printk string in there.

And that is incorrect too, because the MCE (at least the one I'm
injecting) gets broadcasted to the CPUs on the *node* and not to the
whole system.

If we had to be precise, text should say "Not all CPUs which the MCE was
broadcasted to entered the exception handler..."

Luck, Tony

Dec 7, 2015, 5:10:07 PM

> And that is incorrect too, because the MCE (at least the one I'm
> injecting) gets broadcasted to the CPUs on the *node* and not to the
> whole system.

Which system? What kind of machine check? On Intel we expect machine checks
to be broadcast to all logical cpus on all nodes (unless local machine check is enabled,
in which case SRAR style machine checks go only to the logical cpu that hit the error).

The code is written to that expectation ... and we don't report things as well if
something else happens (like too many or too few cpus showing up).

-Tony

Borislav Petkov

Dec 7, 2015, 5:40:12 PM

Box logs below.

BIOS is doing funny cores enumeration:

node #0, CPUs 0-7
node #1, CPUs 8-15
node #2, CPUs 16-23
node #3, CPUs 24-31

and then starts from node 0 again:

.... node #0, CPUs: #32 #33 #34 #35 #36 #37 #38 #39
.... node #1, CPUs: #40 #41 #42 #43 #44 #45 #46 #47
.... node #2, CPUs: #48 #49 #50 #51 #52 #53 #54 #55
.... node #3, CPUs: #56 #57 #58 #59 #60 #61 #62 #63

So I went and offlined cores 5 and 34 which are on node 0.

Why node 0? Well, when I inject error type 0x10 which is

0x00000010 Memory Uncorrectable non-fatal

it generates an MCE only on the node 0 cores. For that log see the end
of this mail. The gist of it is that the CPUs on which #MC gets raised
are the cores on node 0, i.e., 0-7 and 32-39.

Cores 5 and 34 are gone, of course.

I mean, even if the #MC gets raised only on the node, the fix still
works.

$ grep -Ei "hardware.*CPU" /tmp/mce | sed 's/^.*CPU//' | sort -n
0: Machine Check Exception: 5 Bank 5: be00000000010090
1: Machine Check Exception: 5 Bank 5: be00000000010090
2: Machine Check Exception: 5 Bank 5: be00000000010090
3: Machine Check Exception: 5 Bank 5: be00000000010090
4: Machine Check Exception: 5 Bank 5: be00000000010090
6: Machine Check Exception: 5 Bank 5: be00000000010090
7: Machine Check Exception: 5 Bank 5: be00000000010090
32: Machine Check Exception: 5 Bank 5: be00000000010090
33: Machine Check Exception: 5 Bank 5: be00000000010090
35: Machine Check Exception: 5 Bank 5: be00000000010090
36: Machine Check Exception: 5 Bank 5: be00000000010090
37: Machine Check Exception: 5 Bank 5: be00000000010090
38: Machine Check Exception: 5 Bank 5: be00000000010090
39: Machine Check Exception: 5 Bank 5: be00000000010090
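
The list above matches what the enumeration on this box predicts. As a rough check, the following sketch encodes that enumeration (8 consecutive CPU numbers per node, wrapping over 4 nodes, 64 CPUs total) and lists the CPUs on a node that are still online. The formula and both function names are assumptions derived only from the boot log in this mail; they do not describe any other machine.

```c
#include <assert.h>
#include <stdbool.h>

#define BOX_NR_CPUS 64  /* this particular test box only */

/* CPUs 0-7 and 32-39 map to node 0, 8-15 and 40-47 to node 1, and so on. */
static int node_of_cpu(int cpu)
{
	return (cpu / 8) % 4;
}

/*
 * Fill out[] with the CPUs on 'node' that are not listed in offline[].
 * Returns how many CPUs were written; out[] must hold BOX_NR_CPUS entries.
 */
static int reporting_cpus_on_node(int node, const int *offline, int n_offline,
				  int *out)
{
	int count = 0;

	for (int cpu = 0; cpu < BOX_NR_CPUS; cpu++) {
		bool is_offline = false;

		if (node_of_cpu(cpu) != node)
			continue;

		for (int i = 0; i < n_offline; i++)
			if (offline[i] == cpu)
				is_offline = true;

		if (!is_offline)
			out[count++] = cpu;
	}
	return count;
}
```

Calling `reporting_cpus_on_node(0, offline, 2, out)` with `offline = {5, 34}` yields the 14 CPUs 0-4, 6, 7, 32, 33 and 35-39, the same set that logged the exception.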

[ 0.859060] smpboot: CPU0: Intel(R) Xeon(R) CPU E5-4650 0 @ 2.70GHz (family: 0x6, model: 0x2d, stepping: 0x7
...
[ 0.981593] x86: Booting SMP configuration:
[ 0.991092] .... node #0, CPUs: #1
[ 1.013485] microcode: CPU1 microcode updated early to revision 0x710, date = 2013-06-17
[ 1.034219] #2
[ 1.049577] microcode: CPU2 microcode updated early to revision 0x710, date = 2013-06-17
[ 1.070309] #3
[ 1.085865] microcode: CPU3 microcode updated early to revision 0x710, date = 2013-06-17
[ 1.106618] #4
[ 1.121978] microcode: CPU4 microcode updated early to revision 0x710, date = 2013-06-17
[ 1.142720] #5
[ 1.158079] microcode: CPU5 microcode updated early to revision 0x710, date = 2013-06-17
[ 1.178833] #6
[ 1.194191] microcode: CPU6 microcode updated early to revision 0x710, date = 2013-06-17
[ 1.214914] #7
[ 1.230471] microcode: CPU7 microcode updated early to revision 0x710, date = 2013-06-17
[ 1.251309]
[ 1.254854] .... node #1, CPUs: #8
[ 1.275173] microcode: CPU8 microcode updated early to revision 0x710, date = 2013-06-17
[ 1.390509] #9
[ 1.406859] microcode: CPU9 microcode updated early to revision 0x710, date = 2013-06-17
[ 1.427735] #10
[ 1.444303] microcode: CPU10 microcode updated early to revision 0x710, date = 2013-06-17
[ 1.465343] #11
[ 1.481718] microcode: CPU11 microcode updated early to revision 0x710, date = 2013-06-17
[ 1.502779] #12
[ 1.519156] microcode: CPU12 microcode updated early to revision 0x710, date = 2013-06-17
[ 1.540171] #13
[ 1.556536] microcode: CPU13 microcode updated early to revision 0x710, date = 2013-06-17
[ 1.577587] #14
[ 1.594127] microcode: CPU14 microcode updated early to revision 0x710, date = 2013-06-17
[ 1.615131] #15
[ 1.631471] microcode: CPU15 microcode updated early to revision 0x710, date = 2013-06-17
[ 1.652590]
[ 1.656132] .... node #2, CPUs: #16
[ 1.676518] microcode: CPU16 microcode updated early to revision 0x710, date = 2013-06-17
[ 1.791812] #17
[ 1.808189] microcode: CPU17 microcode updated early to revision 0x710, date = 2013-06-17
[ 1.829292] #18
[ 1.845868] microcode: CPU18 microcode updated early to revision 0x710, date = 2013-06-17
[ 1.866925] #19
[ 1.883311] microcode: CPU19 microcode updated early to revision 0x710, date = 2013-06-17
[ 1.904386] #20
[ 1.920765] microcode: CPU20 microcode updated early to revision 0x710, date = 2013-06-17
[ 1.941810] #21
[ 1.958169] microcode: CPU21 microcode updated early to revision 0x710, date = 2013-06-17
[ 1.979242] #22
[ 1.995787] microcode: CPU22 microcode updated early to revision 0x710, date = 2013-06-17
[ 2.016842] #23
[ 2.033182] microcode: CPU23 microcode updated early to revision 0x710, date = 2013-06-17
[ 2.054314]
[ 2.057854] .... node #3, CPUs: #24
[ 2.078330] microcode: CPU24 microcode updated early to revision 0x710, date = 2013-06-17
[ 2.193513] #25
[ 2.209874] microcode: CPU25 microcode updated early to revision 0x710, date = 2013-06-17
[ 2.230996] #26
[ 2.247563] microcode: CPU26 microcode updated early to revision 0x710, date = 2013-06-17
[ 2.268627] #27
[ 2.284998] microcode: CPU27 microcode updated early to revision 0x710, date = 2013-06-17
[ 2.306061] #28
[ 2.322437] microcode: CPU28 microcode updated early to revision 0x710, date = 2013-06-17
[ 2.343433] #29
[ 2.359780] microcode: CPU29 microcode updated early to revision 0x710, date = 2013-06-17
[ 2.380855] #30
[ 2.397397] microcode: CPU30 microcode updated early to revision 0x710, date = 2013-06-17
[ 2.418432] #31
[ 2.434759] microcode: CPU31 microcode updated early to revision 0x710, date = 2013-06-17
[ 2.455792]
[ 2.459336] .... node #0, CPUs: #32 #33 #34 #35 #36 #37 #38 #39
[ 2.583817] .... node #1, CPUs: #40 #41 #42 #43 #44 #45 #46 #47
[ 2.710873] .... node #2, CPUs: #48 #49 #50 #51 #52 #53 #54 #55
[ 2.838069] .... node #3, CPUs: #56 #57 #58 #59 #60 #61 #62 #63
[ 2.964288] x86: Booted up 4 nodes, 64 CPUs
[ 2.974471] smpboot: Total of 64 processors activated (344907.86 BogoMIPS)

[ 5290.635126] Broke affinity for irq 82
[ 5290.643222] Broke affinity for irq 111
[ 5290.651507] Broke affinity for irq 125
[ 5290.664107] smpboot: CPU 5 is now offline
[ 5298.371336] Broke affinity for irq 31
[ 5298.379528] Broke affinity for irq 82
[ 5298.387627] Broke affinity for irq 103
[ 5298.395908] Broke affinity for irq 110
[ 5298.404187] Broke affinity for irq 111
[ 5298.412450] Broke affinity for irq 112
[ 5298.420733] Broke affinity for irq 118
[ 5298.429017] Broke affinity for irq 124
[ 5298.437295] Broke affinity for irq 125
[ 5298.445584] Broke affinity for irq 127
[ 5298.453880] Broke affinity for irq 137
[ 5298.466543] smpboot: CPU 34 is now offline
[ 5302.187338] EINJ: Error INJection is initialized.
[ 5318.897170] Disabling lock debugging due to kernel taint
[ 5318.910775] mce: [Hardware Error]: CPU 37: Machine Check Exception: 5 Bank 5: be00000000010090
[ 5318.931171] mce: [Hardware Error]: RIP !INEXACT! 10:<ffffffff8135de7f> {intel_idle+0xbf/0x130}
[ 5318.951567] mce: [Hardware Error]: TSC bab9f2d8a4e00 ADDR bb68ec00 MISC 20403ebe86
[ 5318.969835] mce: [Hardware Error]: PROCESSOR 0:206d7 TIME 1449517966 SOCKET 0 APIC b microcode 710
[ 5318.990959] EDAC sbridge MC0: HANDLING MCE MEMORY ERROR
[ 5319.003825] EDAC sbridge MC0: CPU 37: Machine Check Exception: 5 Bank 5: be00000000010090
[ 5319.023215] EDAC sbridge MC0: TSC bab9f2d8a4e00
[ 5319.033036] EDAC sbridge MC0: ADDR bb68ec00 EDAC sbridge MC0: MISC 20403ebe86
[ 5319.050338] EDAC sbridge MC0: PROCESSOR 0:206d7 TIME 1449517966 SOCKET 0 APIC b
[ 5319.069542] EDAC MC0: 0 UE memory read error on CPU_SrcID#0_Ha#0_Chan#0_DIMM#0 (channel:0 slot:0 page:0xbb68e offset:0xc00 grain:32 - area:DRAM err_code:0001:0090 socket:0 ha:0 channel_mask:1 rank:0)
[ 5319.122943] mce: [Hardware Error]: CPU 3: Machine Check Exception: 5 Bank 5: be00000000010090
[ 5319.143355] mce: [Hardware Error]: RIP !INEXACT! 10:<ffffffff8135de7f> {intel_idle+0xbf/0x130}
[ 5319.163846] mce: [Hardware Error]: TSC bab9f2d8a51c1 ADDR bb68ec00 MISC 20403ebe86
[ 5319.182249] mce: [Hardware Error]: PROCESSOR 0:206d7 TIME 1449517966 SOCKET 0 APIC 6 microcode 710
[ 5319.203539] EDAC sbridge MC0: HANDLING MCE MEMORY ERROR
[ 5319.216586] EDAC sbridge MC0: CPU 3: Machine Check Exception: 5 Bank 5: be00000000010090
[ 5319.235994] EDAC sbridge MC0: TSC bab9f2d8a51c1
[ 5319.245814] EDAC sbridge MC0: ADDR bb68ec00 EDAC sbridge MC0: MISC 20403ebe86
[ 5319.263348] EDAC sbridge MC0: PROCESSOR 0:206d7 TIME 1449517966 SOCKET 0 APIC 6
[ 5319.283041] EDAC MC0: 0 UE memory read error on CPU_SrcID#0_Ha#0_Chan#0_DIMM#0 (channel:0 slot:0 page:0xbb68e offset:0xc00 grain:32 - area:DRAM err_code:0001:0090 socket:0 ha:0 channel_mask:1 rank:0)
[ 5319.337311] mce: [Hardware Error]: CPU 2: Machine Check Exception: 5 Bank 5: be00000000010090
[ 5319.357960] mce: [Hardware Error]: RIP !INEXACT! 10:<ffffffff8159a4d0> {mutex_lock+0x10/0x27}
[ 5319.378519] mce: [Hardware Error]: TSC bab9f2d8a3feb ADDR bb68ec00 MISC 20403ebe86
[ 5319.397151] mce: [Hardware Error]: PROCESSOR 0:206d7 TIME 1449517966 SOCKET 0 APIC 4 microcode 710
[ 5319.418650] EDAC sbridge MC0: HANDLING MCE MEMORY ERROR
[ 5319.431902] EDAC sbridge MC0: CPU 2: Machine Check Exception: 5 Bank 5: be00000000010090
[ 5319.451491] EDAC sbridge MC0: TSC bab9f2d8a3feb
[ 5319.461311] EDAC sbridge MC0: ADDR bb68ec00 EDAC sbridge MC0: MISC 20403ebe86
[ 5319.479022] EDAC sbridge MC0: PROCESSOR 0:206d7 TIME 1449517966 SOCKET 0 APIC 4
[ 5319.499014] EDAC MC0: 0 UE memory read error on CPU_SrcID#0_Ha#0_Chan#0_DIMM#0 (channel:0 slot:0 page:0xbb68e offset:0xc00 grain:32 - area:DRAM err_code:0001:0090 socket:0 ha:0 channel_mask:1 rank:0)
[ 5319.553209] mce: [Hardware Error]: CPU 6: Machine Check Exception: 5 Bank 5: be00000000010090
[ 5319.574029] mce: [Hardware Error]: RIP !INEXACT! 10:<ffffffff8135de7f> {intel_idle+0xbf/0x130}
[ 5319.594953] mce: [Hardware Error]: TSC bab9f2d8a87ea ADDR bb68ec00 MISC 20403ebe86
[ 5319.613756] mce: [Hardware Error]: PROCESSOR 0:206d7 TIME 1449517966 SOCKET 0 APIC c microcode 710
[ 5319.635431] EDAC sbridge MC0: HANDLING MCE MEMORY ERROR
[ 5319.648873] EDAC sbridge MC0: CPU 6: Machine Check Exception: 5 Bank 5: be00000000010090
[ 5319.668661] EDAC sbridge MC0: TSC bab9f2d8a87ea
[ 5319.678483] EDAC sbridge MC0: ADDR bb68ec00 EDAC sbridge MC0: MISC 20403ebe86
[ 5319.696422] EDAC sbridge MC0: PROCESSOR 0:206d7 TIME 1449517966 SOCKET 0 APIC c
[ 5319.716789] EDAC MC0: 0 UE memory read error on CPU_SrcID#0_Ha#0_Chan#0_DIMM#0 (channel:0 slot:0 page:0xbb68e offset:0xc00 grain:32 - area:DRAM err_code:0001:0090 socket:0 ha:0 channel_mask:1 rank:0)
[ 5319.771531] mce: [Hardware Error]: CPU 38: Machine Check Exception: 5 Bank 5: be00000000010090
[ 5319.792743] mce: [Hardware Error]: RIP !INEXACT! 10:<ffffffff8135de7f> {intel_idle+0xbf/0x130}
[ 5319.813836] mce: [Hardware Error]: TSC bab9f2d8a87ce ADDR bb68ec00 MISC 20403ebe86
[ 5319.832819] mce: [Hardware Error]: PROCESSOR 0:206d7 TIME 1449517966 SOCKET 0 APIC d microcode 710
[ 5319.854654] EDAC sbridge MC0: HANDLING MCE MEMORY ERROR
[ 5319.868243] EDAC sbridge MC0: CPU 38: Machine Check Exception: 5 Bank 5: be00000000010090
[ 5319.888366] EDAC sbridge MC0: TSC bab9f2d8a87ce
[ 5319.898186] EDAC sbridge MC0: ADDR bb68ec00 EDAC sbridge MC0: MISC 20403ebe86
[ 5319.916192] EDAC sbridge MC0: PROCESSOR 0:206d7 TIME 1449517966 SOCKET 0 APIC d
[ 5319.936752] EDAC MC0: 0 UE memory read error on CPU_SrcID#0_Ha#0_Chan#0_DIMM#0 (channel:0 slot:0 page:0xbb68e offset:0xc00 grain:32 - area:DRAM err_code:0001:0090 socket:0 ha:0 channel_mask:1 rank:0)
[ 5319.991752] mce: [Hardware Error]: CPU 35: Machine Check Exception: 5 Bank 5: be00000000010090
[ 5320.013034] mce: [Hardware Error]: RIP !INEXACT! 10:<ffffffff8135de7f> {intel_idle+0xbf/0x130}
[ 5320.034166] mce: [Hardware Error]: TSC bab9f2d8a59dd ADDR bb68ec00 MISC 20403ebe86
[ 5320.053149] mce: [Hardware Error]: PROCESSOR 0:206d7 TIME 1449517966 SOCKET 0 APIC 7 microcode 710
[ 5320.074972] EDAC sbridge MC0: HANDLING MCE MEMORY ERROR
[ 5320.088567] EDAC sbridge MC0: CPU 35: Machine Check Exception: 5 Bank 5: be00000000010090
[ 5320.108688] EDAC sbridge MC0: TSC bab9f2d8a59dd
[ 5320.118511] EDAC sbridge MC0: ADDR bb68ec00 EDAC sbridge MC0: MISC 20403ebe86
[ 5320.136527] EDAC sbridge MC0: PROCESSOR 0:206d7 TIME 1449517966 SOCKET 0 APIC 7
[ 5320.157079] EDAC MC0: 0 UE memory read error on CPU_SrcID#0_Ha#0_Chan#0_DIMM#0 (channel:0 slot:0 page:0xbb68e offset:0xc00 grain:32 - area:DRAM err_code:0001:0090 socket:0 ha:0 channel_mask:1 rank:0)
[ 5320.212025] mce: [Hardware Error]: CPU 39: Machine Check Exception: 5 Bank 5: be00000000010090
[ 5320.233316] mce: [Hardware Error]: RIP !INEXACT! 10:<ffffffff8135de7f> {intel_idle+0xbf/0x130}
[ 5320.254462] mce: [Hardware Error]: TSC bab9f2d8a4f5c ADDR bb68ec00 MISC 20403ebe86
[ 5320.273455] mce: [Hardware Error]: PROCESSOR 0:206d7 TIME 1449517966 SOCKET 0 APIC f microcode 710
[ 5320.295303] EDAC sbridge MC0: HANDLING MCE MEMORY ERROR
[ 5320.308905] EDAC sbridge MC0: CPU 39: Machine Check Exception: 5 Bank 5: be00000000010090
[ 5320.329026] EDAC sbridge MC0: TSC bab9f2d8a4f5c
[ 5320.338847] EDAC sbridge MC0: ADDR bb68ec00 EDAC sbridge MC0: MISC 20403ebe86
[ 5320.356858] EDAC sbridge MC0: PROCESSOR 0:206d7 TIME 1449517966 SOCKET 0 APIC f
[ 5320.377433] EDAC MC0: 0 UE memory read error on CPU_SrcID#0_Ha#0_Chan#0_DIMM#0 (channel:0 slot:0 page:0xbb68e offset:0xc00 grain:32 - area:DRAM err_code:0001:0090 socket:0 ha:0 channel_mask:1 rank:0)
[ 5320.432474] mce: [Hardware Error]: CPU 7: Machine Check Exception: 5 Bank 5: be00000000010090
[ 5320.453569] mce: [Hardware Error]: RIP !INEXACT! 10:<ffffffff8135de7f> {intel_idle+0xbf/0x130}
[ 5320.474703] mce: [Hardware Error]: TSC bab9f2d8a4d60 ADDR bb68ec00 MISC 20403ebe86
[ 5320.493689] mce: [Hardware Error]: PROCESSOR 0:206d7 TIME 1449517966 SOCKET 0 APIC e microcode 710
[ 5320.515532] EDAC sbridge MC0: HANDLING MCE MEMORY ERROR
[ 5320.529139] EDAC sbridge MC0: CPU 7: Machine Check Exception: 5 Bank 5: be00000000010090
[ 5320.549050] EDAC sbridge MC0: TSC bab9f2d8a4d60
[ 5320.558870] EDAC sbridge MC0: ADDR bb68ec00 EDAC sbridge MC0: MISC 20403ebe86
[ 5320.576890] EDAC sbridge MC0: PROCESSOR 0:206d7 TIME 1449517966 SOCKET 0 APIC e
[ 5320.597478] EDAC MC0: 0 UE memory read error on CPU_SrcID#0_Ha#0_Chan#0_DIMM#0 (channel:0 slot:0 page:0xbb68e offset:0xc00 grain:32 - area:DRAM err_code:0001:0090 socket:0 ha:0 channel_mask:1 rank:0)
[ 5320.652525] mce: [Hardware Error]: CPU 36: Machine Check Exception: 5 Bank 5: be00000000010090
[ 5320.673804] mce: [Hardware Error]: RIP !INEXACT! 10:<ffffffff8135de7f> {intel_idle+0xbf/0x130}
[ 5320.694918] mce: [Hardware Error]: TSC bab9f2d8a5823 ADDR bb68ec00 MISC 20403ebe86
[ 5320.713916] mce: [Hardware Error]: PROCESSOR 0:206d7 TIME 1449517966 SOCKET 0 APIC 9 microcode 710
[ 5320.735759] EDAC sbridge MC0: HANDLING MCE MEMORY ERROR
[ 5320.749347] EDAC sbridge MC0: CPU 36: Machine Check Exception: 5 Bank 5: be00000000010090
[ 5320.769452] EDAC sbridge MC0: TSC bab9f2d8a5823
[ 5320.779273] EDAC sbridge MC0: ADDR bb68ec00 EDAC sbridge MC0: MISC 20403ebe86
[ 5320.797296] EDAC sbridge MC0: PROCESSOR 0:206d7 TIME 1449517966 SOCKET 0 APIC 9
[ 5320.817877] EDAC MC0: 0 UE memory read error on CPU_SrcID#0_Ha#0_Chan#0_DIMM#0 (channel:0 slot:0 page:0xbb68e offset:0xc00 grain:32 - area:DRAM err_code:0001:0090 socket:0 ha:0 channel_mask:1 rank:0)
[ 5320.872972] mce: [Hardware Error]: CPU 33: Machine Check Exception: 5 Bank 5: be00000000010090
[ 5320.894249] mce: [Hardware Error]: RIP !INEXACT! 10:<ffffffff8135de7f> {intel_idle+0xbf/0x130}
[ 5320.915390] mce: [Hardware Error]: TSC bab9f2d8a5326 ADDR bb68ec00 MISC 20403ebe86
[ 5320.934374] mce: [Hardware Error]: PROCESSOR 0:206d7 TIME 1449517966 SOCKET 0 APIC 3 microcode 710
[ 5320.956222] EDAC sbridge MC0: HANDLING MCE MEMORY ERROR
[ 5320.969807] EDAC sbridge MC0: CPU 33: Machine Check Exception: 5 Bank 5: be00000000010090
[ 5320.989913] EDAC sbridge MC0: TSC bab9f2d8a5326
[ 5320.999734] EDAC sbridge MC0: ADDR bb68ec00 EDAC sbridge MC0: MISC 20403ebe86
[ 5321.017750] EDAC sbridge MC0: PROCESSOR 0:206d7 TIME 1449517966 SOCKET 0 APIC 3
[ 5321.038284] EDAC MC0: 0 UE memory read error on CPU_SrcID#0_Ha#0_Chan#0_DIMM#0 (channel:0 slot:0 page:0xbb68e offset:0xc00 grain:32 - area:DRAM err_code:0001:0090 socket:0 ha:0 channel_mask:1 rank:0)
[ 5321.093686] mce: [Hardware Error]: CPU 1: Machine Check Exception: 5 Bank 5: be00000000010090
[ 5321.114770] mce: [Hardware Error]: RIP !INEXACT! 10:<ffffffff8135de7f> {intel_idle+0xbf/0x130}
[ 5321.135925] mce: [Hardware Error]: TSC bab9f2d8a5562 ADDR bb68ec00 MISC 20403ebe86
[ 5321.154918] mce: [Hardware Error]: PROCESSOR 0:206d7 TIME 1449517966 SOCKET 0 APIC 2 microcode 710
[ 5321.176765] EDAC sbridge MC0: HANDLING MCE MEMORY ERROR
[ 5321.190369] EDAC sbridge MC0: CPU 1: Machine Check Exception: 5 Bank 5: be00000000010090
[ 5321.210303] EDAC sbridge MC0: TSC bab9f2d8a5562
[ 5321.220123] EDAC sbridge MC0: ADDR bb68ec00 EDAC sbridge MC0: MISC 20403ebe86
[ 5321.238146] EDAC sbridge MC0: PROCESSOR 0:206d7 TIME 1449517966 SOCKET 0 APIC 2
[ 5321.258723] EDAC MC0: 0 UE memory read error on CPU_SrcID#0_Ha#0_Chan#0_DIMM#0 (channel:0 slot:0 page:0xbb68e offset:0xc00 grain:32 - area:DRAM err_code:0001:0090 socket:0 ha:0 channel_mask:1 rank:0)
[ 5321.303358] mce: [Hardware Error]: CPU 4: Machine Check Exception: 5 Bank 5: be00000000010090
[ 5321.324279] mce: [Hardware Error]: RIP !INEXACT! 10:<ffffffff8135de7f> {intel_idle+0xbf/0x130}
[ 5321.345397] mce: [Hardware Error]: TSC bab9f2d8a572f ADDR bb68ec00 MISC 20403ebe86
[ 5321.364380] mce: [Hardware Error]: PROCESSOR 0:206d7 TIME 1449517966 SOCKET 0 APIC 8 microcode 710
[ 5321.386184] EDAC sbridge MC0: HANDLING MCE MEMORY ERROR
[ 5321.399729] EDAC sbridge MC0: CPU 4: Machine Check Exception: 5 Bank 5: be00000000010090
[ 5321.419624] EDAC sbridge MC0: TSC bab9f2d8a572f
[ 5321.429445] EDAC sbridge MC0: ADDR bb68ec00 EDAC sbridge MC0: MISC 20403ebe86
[ 5321.447454] EDAC sbridge MC0: PROCESSOR 0:206d7 TIME 1449517966 SOCKET 0 APIC 8
[ 5321.467989] EDAC MC0: 0 UE memory read error on CPU_SrcID#0_Ha#0_Chan#0_DIMM#0 (channel:0 slot:0 page:0xbb68e offset:0xc00 grain:32 - area:DRAM err_code:0001:0090 socket:0 ha:0 channel_mask:1 rank:0)
[ 5321.511475] mce: [Hardware Error]: CPU 32: Machine Check Exception: 5 Bank 5: be00000000010090
[ 5321.532587] mce: [Hardware Error]: RIP !INEXACT! 10:<ffffffff8135de7f> {intel_idle+0xbf/0x130}
[ 5321.553689] mce: [Hardware Error]: TSC bab9f2d8a50f4 ADDR bb68ec00 MISC 20403ebe86
[ 5321.572681] mce: [Hardware Error]: PROCESSOR 0:206d7 TIME 1449517966 SOCKET 0 APIC 1 microcode 710
[ 5321.594500] EDAC sbridge MC0: HANDLING MCE MEMORY ERROR
[ 5321.608057] EDAC sbridge MC0: CPU 32: Machine Check Exception: 5 Bank 5: be00000000010090
[ 5321.628161] EDAC sbridge MC0: TSC bab9f2d8a50f4
[ 5321.637982] EDAC sbridge MC0: ADDR bb68ec00 EDAC sbridge MC0: MISC 20403ebe86
[ 5321.655998] EDAC sbridge MC0: PROCESSOR 0:206d7 TIME 1449517966 SOCKET 0 APIC 1
[ 5321.676524] EDAC MC0: 0 UE memory read error on CPU_SrcID#0_Ha#0_Chan#0_DIMM#0 (channel:0 slot:0 page:0xbb68e offset:0xc00 grain:32 - area:DRAM err_code:0001:0090 socket:0 ha:0 channel_mask:1 rank:0)
[ 5321.720020] mce: [Hardware Error]: CPU 0: Machine Check Exception: 5 Bank 5: be00000000010090
[ 5321.740939] mce: [Hardware Error]: RIP !INEXACT! 10:<ffffffff8135de7f> {intel_idle+0xbf/0x130}
[ 5321.762058] mce: [Hardware Error]: TSC bab9f2d8a5034 ADDR bb68ec00 MISC 20403ebe86
[ 5321.781022] mce: [Hardware Error]: PROCESSOR 0:206d7 TIME 1449517966 SOCKET 0 APIC 0 microcode 710
[ 5321.802837] EDAC sbridge MC0: HANDLING MCE MEMORY ERROR
[ 5321.816395] EDAC sbridge MC0: CPU 0: Machine Check Exception: 5 Bank 5: be00000000010090
[ 5321.836300] EDAC sbridge MC0: TSC bab9f2d8a5034
[ 5321.846121] EDAC sbridge MC0: ADDR bb68ec00 EDAC sbridge MC0: MISC 20403ebe86
[ 5321.864127] EDAC sbridge MC0: PROCESSOR 0:206d7 TIME 1449517966 SOCKET 0 APIC 0
[ 5321.884647] EDAC MC0: 0 UE memory read error on CPU_SrcID#0_Ha#0_Chan#0_DIMM#0 (channel:0 slot:0 page:0xbb68e offset:0xc00 grain:32 - area:DRAM err_code:0001:0090 socket:0 ha:0 channel_mask:1 rank:0)
[ 5321.928136] mce: [Hardware Error]: Machine check: Processor context corrupt
[ 5321.945589] Kernel panic - not syncing: Fatal machine check
[ 5321.985122] Kernel Offset: disabled
[ 5322.008492] Rebooting in 100 seconds..
[ 5421.226077] ACPI MEMORY or I/O RESET_REG.
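
The status word repeated throughout the log above (be00000000010090) can be decoded by hand. The sketch below is only an illustration: the bit positions are the architectural IA32_MCi_STATUS flag bits, while the macro, struct and function names are made up for this example and are not the kernel's. For this value the PCC bit comes out set, which is consistent with the final "Processor context corrupt" line, and the two 16-bit codes match the err_code:0001:0090 that EDAC prints.

```c
#include <stdbool.h>
#include <stdint.h>

/* Flag bits of IA32_MCi_STATUS. The ST_* names are illustrative only. */
#define ST_VAL   (1ULL << 63) /* register holds a valid error */
#define ST_OVER  (1ULL << 62) /* an earlier error was overwritten */
#define ST_UC    (1ULL << 61) /* uncorrected error */
#define ST_EN    (1ULL << 60) /* error reporting was enabled */
#define ST_MISCV (1ULL << 59) /* MISC register contents are valid */
#define ST_ADDRV (1ULL << 58) /* ADDR register contents are valid */
#define ST_PCC   (1ULL << 57) /* processor context corrupt */

/* Hypothetical container for the decoded fields. */
struct mci_decoded {
	bool val, over, uc, en, miscv, addrv, pcc;
	uint16_t model_code; /* bits 31:16, model-specific error code */
	uint16_t mca_code;   /* bits 15:0, MCA error code */
};

/* Split a raw status value into its flag bits and error codes. */
static struct mci_decoded mci_status_decode(uint64_t status)
{
	struct mci_decoded d;

	d.val   = (status & ST_VAL) != 0;
	d.over  = (status & ST_OVER) != 0;
	d.uc    = (status & ST_UC) != 0;
	d.en    = (status & ST_EN) != 0;
	d.miscv = (status & ST_MISCV) != 0;
	d.addrv = (status & ST_ADDRV) != 0;
	d.pcc   = (status & ST_PCC) != 0;
	d.model_code = (uint16_t)(status >> 16);
	d.mca_code   = (uint16_t)status;
	return d;
}
```

Calling mci_status_decode(0xbe00000000010090ULL) gives VAL, UC, EN, MISCV, ADDRV and PCC set, OVER clear, model_code 0x0001 and mca_code 0x0090.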

Raj, Ashok

Dec 7, 2015, 5:50:11 PM

On Mon, Dec 07, 2015 at 11:34:27PM +0100, Borislav Petkov wrote:
>
> Box logs below.

Do you have the dmidecode strings to find which platform this is?

>
> BIOS is doing funny cores enumeration:
>
> node #0, CPUs 0-7
> node #1, CPUs 8-15
> node #2, CPUs 16-23
> node #3, CPUs 24-31
>
> and then starts from node 0 again:
>
> .... node #0, CPUs: #32 #33 #34 #35 #36 #37 #38 #39
> .... node #1, CPUs: #40 #41 #42 #43 #44 #45 #46 #47
> .... node #2, CPUs: #48 #49 #50 #51 #52 #53 #54 #55
> .... node #3, CPUs: #56 #57 #58 #59 #60 #61 #62 #63
>
> So I went and offlined cores 5 and 34 which are on node 0.
>
> Why node 0? Well, when I inject error type 0x10 which is
>
> 0x00000010 Memory Uncorrectable non-fatal
>
> it generates an MCE only on the node 0 cores. For that log see the end
> of this mail. The gist of it is that the CPUs on which #MC gets raised
> are the cores on node 0, i.e., 0-7 and 32-39.
>
> Cores 5 and 34 are gone, of course.
>
> I mean, even if the #MC gets raised only on the node, the fix still
> works.

Not sure how the fix works, since we excluded only the offline ones. So
unless all online CPUs check in, the code should give you the old behavior.
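
The check-in accounting can be modelled in user space. This is a deliberately simplified sketch based on the patch description (the rendezvous waits until the check-in count equals the number of online CPUs); it ignores the RIPV condition in the actual patch, and all function names are hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>

/* Number of CPUs the kernel considers online. */
static int count_online(const bool *online, size_t ncpus)
{
	int n = 0;

	for (size_t cpu = 0; cpu < ncpus; cpu++)
		if (online[cpu])
			n++;
	return n;
}

/*
 * Every CPU in online[] receives the broadcast #MC, online or not.
 * skip_offline == true models the patch: offline CPUs bail out before
 * bumping the check-in counter.
 */
static int count_callins(const bool *online, size_t ncpus, bool skip_offline)
{
	int callin = 0;

	for (size_t cpu = 0; cpu < ncpus; cpu++) {
		if (skip_offline && !online[cpu])
			continue;
		callin++;
	}
	return callin;
}

/* In this model the rendezvous completes only on an exact match. */
static bool rendezvous_completes(const bool *online, size_t ncpus,
				 bool skip_offline)
{
	return count_callins(online, ncpus, skip_offline) ==
	       count_online(online, ncpus);
}
```

With one of eight CPUs offline, the unpatched model counts eight check-ins against seven online CPUs and never matches; with skip_offline set the two numbers agree.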

What does cat /proc/interrupts | grep MCE show?

In a system that broadcasts, all CPU counts should be the same. Since we didn't
increment the stats on the offline CPUs, if you were to bring such a CPU back
up, its count should be one less than the other CPUs'.
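
A small helper can do that comparison on the MCE row of /proc/interrupts. This is a sketch, assuming the usual row layout of "MCE:" followed by one count per online CPU and then a text description; the function names are invented for the example.

```c
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

/*
 * Parse one "MCE:" row, e.g.
 *   "MCE:   3   3   2   3   Machine check exceptions"
 * Stores up to max per-CPU counts in out[] and returns how many were
 * found, or -1 if the row does not start with "MCE:".
 */
static int parse_mce_row(const char *row, unsigned long *out, int max)
{
	int n = 0;

	while (isspace((unsigned char)*row))
		row++;
	if (strncmp(row, "MCE:", 4) != 0)
		return -1;
	row += 4;

	while (n < max) {
		char *end;

		while (isspace((unsigned char)*row))
			row++;
		if (!isdigit((unsigned char)*row))
			break; /* reached the trailing description */
		out[n++] = strtoul(row, &end, 10);
		row = end;
	}
	return n;
}

/* Index of the first column that differs from column 0, or -1 if none. */
static int first_mismatch(const unsigned long *counts, int n)
{
	for (int i = 1; i < n; i++)
		if (counts[i] != counts[0])
			return i;
	return -1;
}
```

A column that lags the others by one after a CPU is brought back online would show up as the index returned by first_mismatch().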

Cheers,
Ashok

Luck, Tony

Dec 7, 2015, 6:30:06 PM

On Mon, Dec 07, 2015 at 11:34:27PM +0100, Borislav Petkov wrote:

> BIOS is doing funny cores enumeration:
>
> node #0, CPUs 0-7
> node #1, CPUs 8-15
> node #2, CPUs 16-23
> node #3, CPUs 24-31
>
> and then starts from node 0 again:
>
> .... node #0, CPUs: #32 #33 #34 #35 #36 #37 #38 #39
> .... node #1, CPUs: #40 #41 #42 #43 #44 #45 #46 #47
> .... node #2, CPUs: #48 #49 #50 #51 #52 #53 #54 #55
> .... node #3, CPUs: #56 #57 #58 #59 #60 #61 #62 #63

That's normal. BIOS writers are encouraged to list all the
hyperthread 0 cpus from each core, and then add the hyperthread
1 cpus later in the table. That way an OS that boots less than
all the cpus will get the maximum number of real cores into
play.
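
For the box in this thread, that enumeration order can be written down as simple arithmetic. The sketch below assumes exactly the layout shown in the boot log (4 nodes, 8 cores per node, 2 threads per core, all thread-0 CPUs numbered first); the sibling relation (CPU n and CPU n+32) is inferred from the APIC IDs in the MCE log, and the struct and function names are hypothetical.

```c
/* Hypothetical description of where a logical CPU sits. */
struct cpu_loc {
	int node;   /* NUMA node / socket */
	int core;   /* core index within the node */
	int thread; /* hyperthread index within the core */
};

/* Map a logical CPU number to node, core and thread for this layout. */
static struct cpu_loc cpu_locate(int cpu, int nodes, int cores_per_node)
{
	int per_thread = nodes * cores_per_node; /* 32 on this box */
	struct cpu_loc l;

	l.thread = cpu / per_thread;
	l.node   = (cpu % per_thread) / cores_per_node;
	l.core   = (cpu % per_thread) % cores_per_node;
	return l;
}

/* The other hyperthread of the same core, assuming 2 threads per core. */
static int cpu_sibling(int cpu, int nodes, int cores_per_node)
{
	int per_thread = nodes * cores_per_node;

	return (cpu + per_thread) % (2 * per_thread);
}
```

Under these assumptions CPU 34 is thread 1 of core 2 on node 0, and the sibling of CPU 5 is CPU 37.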

> 0x00000010 Memory Uncorrectable non-fatal
>
> it generates an MCE only on the node 0 cores. For that log see the end
> of this mail. The gist of it is that the CPUs on which #MC gets raised
> are the cores on node 0, i.e., 0-7 and 32-39.

I think all the threads on all the sockets must have shown up
in the machine check handler ... but only the ones on socket0
printed anything (they can all see the error in bank5 which is
shared across the socket ... but cpus 8-15 etc. will see no errors
in any banks ... so will be silent.)

-Tony

Borislav Petkov

Dec 7, 2015, 6:30:07 PM

On Mon, Dec 07, 2015 at 06:46:40PM -0500, Raj, Ashok wrote:
> On Mon, Dec 07, 2015 at 11:34:27PM +0100, Borislav Petkov wrote:
> >
> > Box logs below.
>
> Do you have the dmidecode strings to find which platform this is?

Is this enough, or do you want the complete dmidecode dump?

DMI: Intel Corporation LH Pass/S4600LH...., BIOS SE5C600.86B.99.99.2050.043020121425 04/30/2012

> Not sure how the fix works, since we excluded only the offline ones.
> So unless all online CPUs check in, the code should give you the old
> behavior.

Did you miss my statement in my previous mail where I said that the MCE
is being raised only on the cores of node 0?

> What does cat /proc/interrupts | grep MCE show?

Can't. Shell on the box is dead after the injection.

> In a system that broadcasts, all CPU counts should be the same. Since we didn't
> increment the stats on the offline CPUs, if you were to bring such a CPU back
> up, its count should be one less than the other CPUs'.

See the logs at the end of my previous email. #MC gets raised - or at
least output from mce_panic() comes out only - on the cores of node 0.

--
Regards/Gruss,
Boris.

ECO tip #101: Trim your mails when you reply.

Raj, Ashok

Dec 7, 2015, 7:50:06 PM

On Tue, Dec 08, 2015 at 12:25:24AM +0100, Borislav Petkov wrote:
>
> Did you miss my statement in my previous mail where I said that the MCE
> is being raised only on the cores of node 0?
>

That's right, but I think if the MCE is delivered only to node 0, then the
system would panic every time, with or without the patch, which is why I got
confused.

I had somehow understood that the system didn't panic with this patch.

Cheers,
Ashok

Borislav Petkov

Dec 8, 2015, 4:20:07 AM

On Mon, Dec 07, 2015 at 08:41:43PM -0500, Raj, Ashok wrote:
> On Tue, Dec 08, 2015 at 12:25:24AM +0100, Borislav Petkov wrote:
> >
> > Did you miss my statement in my previous mail where I said that the MCE
> > is being raised only on the cores of node 0?
> >
>
> That's right, but I think if the MCE is delivered only to node 0, then the
> system would panic every time, with or without the patch, which is why I got
> confused.
>
> I had somehow understood that the system didn't panic with this patch.

No, the system panicked both times. The "strange" observation is
that the MCE gets reported only on the cores of node 0. Or at least only
the printks from mce_panic() on the cores of node 0 reach the serial
console.

If we really broadcast only on node 0, then that would be a problem if
the corrupted data leaves the node and manages to corrupt storage when
written out on some of the other nodes. I'm not sure whether the kernel
panics the whole system in time, or whether there is a small window
between detection and panic in which the corruption might happen.

If so, this'd defeat the purpose of MCE broadcasting but I'm just
hypothesizing here.

--
Regards/Gruss,
Boris.

ECO tip #101: Trim your mails when you reply.

Luck, Tony

Dec 8, 2015, 11:10:06 AM

> No, the system panicked both times. The "strange" observation is
> that the MCE gets reported only on the cores of node 0. Or at least only
> the printks from mce_panic() on the cores of node 0 reach the serial
> console.

You only see messages and logs from node0, because the cpus there are
the only ones that see any errors logged in their banks.

The cpus on node 1, 2, 3 scan all banks and find nothing, so say nothing.
There are no system-wide banks ... just core-wide (in recent generations
banks 0-3) and socket-wide (banks >=4). But don't code those numbers
into any generic code ... we will change them sooner or later.

-Tony

Borislav Petkov

Dec 8, 2015, 2:00:06 PM

On Tue, Dec 08, 2015 at 03:59:58PM +0000, Luck, Tony wrote:
> > No, the system panicked both times. The "strange" observation is
> > that the MCE gets reported only on the cores of node 0. Or at least only
> > the printks from mce_panic() on the cores of node 0 reach the serial
> > console.
>
> You only see messages and logs from node0, because the cpus there are
> the only ones that see any errors logged in their banks.
>
> The cpus on node 1, 2, 3 scan all banks and find nothing, so say nothing.

Right, sure, of course. Doh!

Confirmation:

[ 183.840517] mce: do_machine_check: CPU: 30
[ 183.840531] mce: do_machine_check: CPU: 27
[ 183.840536] mce: do_machine_check: CPU: 29
[ 183.840541] mce: do_machine_check: CPU: 56
[ 183.840546] mce: do_machine_check: CPU: 28
[ 183.840548] mce: do_machine_check: CPU: 60
[ 183.840550] mce: do_machine_check: CPU: 24
[ 183.840557] mce: do_machine_check: CPU: 12
[ 183.840561] mce: do_machine_check: CPU: 45
[ 183.840565] mce: do_machine_check: CPU: 59
[ 183.840569] mce: do_machine_check: CPU: 57
[ 183.840572] mce: do_machine_check: CPU: 61
[ 183.840584] mce: do_machine_check: CPU: 0
[ 183.840587] mce: do_machine_check: CPU: 32
[ 183.840593] mce: do_machine_check: CPU: 63
[ 183.840596] mce: do_machine_check: CPU: 31
[ 183.840602] mce: do_machine_check: CPU: 42
[ 183.840606] mce: do_machine_check: CPU: 11
[ 183.840611] mce: do_machine_check: CPU: 41
[ 183.840613] mce: do_machine_check: CPU: 9
[ 183.840617] mce: do_machine_check: CPU: 62
[ 183.840619] mce: do_machine_check: CPU: 25
[ 183.840624] mce: do_machine_check: CPU: 58
[ 183.840627] mce: do_machine_check: CPU: 26
[ 183.840633] mce: do_machine_check: CPU: 5
[ 183.840638] mce: do_machine_check: CPU: 1
[ 183.840642] mce: do_machine_check: CPU: 37
[ 183.840648] mce: do_machine_check: CPU: 15
[ 183.840650] mce: do_machine_check: CPU: 47
[ 183.840653] mce: do_machine_check: CPU: 44
[ 183.840657] mce: do_machine_check: CPU: 14
[ 183.840659] mce: do_machine_check: CPU: 46
[ 183.840666] mce: do_machine_check: CPU: 52
[ 183.840670] mce: do_machine_check: CPU: 50
[ 183.840675] mce: do_machine_check: CPU: 48
[ 183.840677] mce: do_machine_check: CPU: 16
[ 183.840682] mce: do_machine_check: CPU: 54
[ 183.840686] mce: do_machine_check: CPU: 18
[ 183.840692] mce: do_machine_check: CPU: 40
[ 183.840695] mce: do_machine_check: CPU: 8
[ 183.840701] mce: do_machine_check: CPU: 2
[ 183.840705] mce: do_machine_check: CPU: 20
[ 183.840710] mce: do_machine_check: CPU: 13
[ 183.840712] mce: do_machine_check: CPU: 43
[ 183.840716] mce: do_machine_check: CPU: 10
[ 183.840722] mce: do_machine_check: CPU: 3
[ 183.840724] mce: do_machine_check: CPU: 35
[ 183.840727] mce: do_machine_check: CPU: 33
[ 183.840730] mce: do_machine_check: CPU: 34
[ 183.840734] mce: do_machine_check: CPU: 6
[ 183.840738] mce: do_machine_check: CPU: 38
[ 183.840743] mce: do_machine_check: CPU: 53
[ 183.840745] mce: do_machine_check: CPU: 21
[ 183.840750] mce: do_machine_check: CPU: 23
[ 183.840752] mce: do_machine_check: CPU: 55
[ 183.840755] mce: do_machine_check: CPU: 22
[ 183.840759] mce: do_machine_check: CPU: 49
[ 183.840761] mce: do_machine_check: CPU: 17
[ 183.840767] mce: do_machine_check: CPU: 19
[ 183.840770] mce: do_machine_check: CPU: 51
[ 183.840776] mce: do_machine_check: CPU: 39
[ 183.840778] mce: do_machine_check: CPU: 7
[ 183.840784] mce: do_machine_check: CPU: 36
[ 183.840786] mce: do_machine_check: CPU: 4
[ 184.485104] Disabling lock debugging due to kernel taint
[ 184.498006] mce: [Hardware Error]: CPU 32: Machine Check Exception: 5 Bank 5: be00000000010090
[ 184.498023] mce: [Hardware Error]: Machine check events logged
[ 184.531428] mce: [Hardware Error]: RIP !INEXACT! 10:<ffffffff8135de7f> {intel_idle+0xbf/0x130}
[ 184.551126] mce: [Hardware Error]: TSC c760ad064ccce ADDR bb68ec00 MISC 421c8c86
[ 184.568358] mce: [Hardware Error]: PROCESSOR 0:206d7 TIME 1449600598 SOCKET 0 APIC 1 microcode 710
[ 184.588862] EDAC sbridge MC0: HANDLING MCE MEMORY ERROR
...

mce: [Hardware Error]: CPU 0: Machine Check Exception: 5 Bank 5: be00000000010090

mce: [Hardware Error]: CPU 1: Machine Check Exception: 5 Bank 5: be00000000010090

mce: [Hardware Error]: CPU 2: Machine Check Exception: 5 Bank 5: be00000000010090

mce: [Hardware Error]: CPU 32: Machine Check Exception: 5 Bank 5: be00000000010090

mce: [Hardware Error]: CPU 33: Machine Check Exception: 5 Bank 5: be00000000010090

mce: [Hardware Error]: CPU 34: Machine Check Exception: 5 Bank 5: be00000000010090

mce: [Hardware Error]: CPU 35: Machine Check Exception: 5 Bank 5: be00000000010090

mce: [Hardware Error]: CPU 36: Machine Check Exception: 5 Bank 5: be00000000010090

mce: [Hardware Error]: CPU 37: Machine Check Exception: 5 Bank 5: be00000000010090

mce: [Hardware Error]: CPU 38: Machine Check Exception: 5 Bank 5: be00000000010090

mce: [Hardware Error]: CPU 39: Machine Check Exception: 5 Bank 5: be00000000010090

mce: [Hardware Error]: CPU 3: Machine Check Exception: 5 Bank 5: be00000000010090

mce: [Hardware Error]: CPU 4: Machine Check Exception: 5 Bank 5: be00000000010090

mce: [Hardware Error]: CPU 5: Machine Check Exception: 5 Bank 5: be00000000010090

mce: [Hardware Error]: CPU 6: Machine Check Exception: 5 Bank 5: be00000000010090

mce: [Hardware Error]: CPU 7: Machine Check Exception: 5 Bank 5: be00000000010090

CPUs:

[ 1.103200] x86: Booting SMP configuration:
[ 1.112441] .... node #0, CPUs: #1 #2 #3 #4 #5 #6 #7
[ 1.227835] .... node #1, CPUs: #8 #9 #10 #11 #12 #13 #14 #15
[ 1.451861] .... node #2, CPUs: #16 #17 #18 #19 #20 #21 #22 #23
[ 1.674819] .... node #3, CPUs: #24 #25 #26 #27 #28 #29 #30 #31
[ 1.899011] .... node #0, CPUs: #32 #33 #34 #35 #36 #37 #38 #39
[ 2.026616] .... node #1, CPUs: #40 #41 #42 #43 #44 #45 #46 #47
[ 2.152645] .... node #2, CPUs: #48 #49 #50 #51 #52 #53 #54 #55
[ 2.276782] .... node #3, CPUs: #56 #57 #58 #59 #60 #61 #62 #63
[ 2.402263] x86: Booted up 4 nodes, 64 CPUs

Ok, all clear.

Thanks!
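
The confirmation log above can also be checked mechanically rather than by eye. The following is a rough sketch that scans dmesg lines for the debug message shown in that log ("do_machine_check: CPU: N") and records each CPU in a 64-bit mask; the helper names are hypothetical and the 64-CPU limit is an assumption that happens to fit this box.

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/*
 * Extract N from a line containing "do_machine_check: CPU: N".
 * Returns -1 if the line does not contain that message.
 */
static int parse_dmc_cpu(const char *line)
{
	const char *tag = "do_machine_check: CPU:";
	const char *p = strstr(line, tag);
	int cpu;

	if (!p || sscanf(p + strlen(tag), "%d", &cpu) != 1)
		return -1;
	return cpu;
}

/* Set bit N in the mask for a matching line; other lines are ignored. */
static uint64_t mark_seen(uint64_t seen, const char *line)
{
	int cpu = parse_dmc_cpu(line);

	if (cpu >= 0 && cpu < 64)
		seen |= 1ULL << cpu;
	return seen;
}
```

Feeding every line of the log through mark_seen() and comparing the result with UINT64_MAX tells whether all 64 CPUs entered the handler.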