
Hardware failure?: Now what?


Charles Curley

Mar 20, 2021, 5:00:04 PM
My syslog is reporting things like:

Mar 20 13:58:29 hawk rasdaemon[892]: Calling ras_mc_event_opendb()
Mar 20 13:58:29 hawk rasdaemon[892]: cpu 03:rasdaemon: mce_record store: 0x55c124c9b148
Mar 20 13:58:29 hawk kernel: [ 300.407406] mce: [Hardware Error]: Machine check events logged
Mar 20 13:58:29 hawk kernel: [ 300.407410] mce: [Hardware Error]: CPU 3: Machine Check: 0 Bank 0: 90000040000f0005
Mar 20 13:58:29 hawk kernel: [ 300.407411] mce: [Hardware Error]: TSC f442c87fda
Mar 20 13:58:29 hawk kernel: [ 300.407413] mce: [Hardware Error]: PROCESSOR 0:306c3 TIME 1616270309 SOCKET 0 APIC 6 microcode 19
Mar 20 13:58:29 hawk rasdaemon[892]: rasdaemon: register inserted at db

root@hawk:/crc/back# ras-mc-ctl --errors
No Memory errors.

No PCIe AER errors.

No Extlog errors.

MCE events:
1 2021-03-20 13:58:30 -0600 error: Internal parity error, mcg mcgstatus=0, mci Corrected_error Error_enabled, mcgcap=0x00000c09, status=0x90000040000f0005, tsc=0xf442c87fda, walltime=0x605653e5, cpu=0x00000003, cpuid=0x000306c3, apicid=0x00000006
2 2021-03-20 14:07:07 -0600 error: Internal parity error, mcg mcgstatus=0, mci Corrected_error Error_enabled, mcgcap=0x00000c09, status=0x90000040000f0005, tsc=0x274d9e61020, walltime=0x605655ea, cpu=0x00000003, cpuid=0x000306c3, apicid=0x00000006
3 2021-03-20 14:07:07 -0600 error: Internal parity error, mcg mcgstatus=0, mci Corrected_error Error_enabled, mcgcap=0x00000c09, status=0x90000040000f0005, tsc=0x27517a5dacb, walltime=0x605655eb, cpu=0x00000003, cpuid=0x000306c3, apicid=0x00000006
4 2021-03-20 14:10:34 -0600 error: Internal parity error, mcg mcgstatus=0, mci Corrected_error Error_enabled, mcgcap=0x00000c09, status=0x90000040000f0005, tsc=0x30ea8517bee, walltime=0x605656b9, cpuid=0x000306c3

root@hawk:/crc/back#


If I read that correctly, CPU 3 is seeing and correcting internal parity
errors.
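
As a cross-check on that reading, the status word can be decoded by hand. This is only a sketch: it assumes the architectural IA32_MCi_STATUS layout from Intel's Software Developer's Manual (bit 63 valid, bit 62 overflow, bit 61 uncorrected, bit 60 enabled, bits 52:38 corrected-error count, bits 31:16 model-specific code, bits 15:0 MCA error code). The names `decode_mci_status` and `walltime_to_utc` are made up for illustration, and the count field is only meaningful on CPUs that report corrected-error counts.

```python
from datetime import datetime, timezone


def decode_mci_status(status: int) -> dict:
    """Split an MCi_STATUS value into the fields rasdaemon summarises.

    Bit positions are an assumption taken from the architectural
    register layout, not from rasdaemon's own source.
    """
    return {
        "valid": bool((status >> 63) & 1),
        "overflow": bool((status >> 62) & 1),
        "uncorrected": bool((status >> 61) & 1),
        "enabled": bool((status >> 60) & 1),
        "corrected_count": (status >> 38) & 0x7FFF,
        "model_specific_code": (status >> 16) & 0xFFFF,
        "mca_error_code": status & 0xFFFF,
    }


def walltime_to_utc(walltime: int) -> str:
    """Convert the hex 'walltime' field (Unix seconds) to an ISO string."""
    return datetime.fromtimestamp(walltime, tz=timezone.utc).isoformat()


if __name__ == "__main__":
    # Values copied from the first MCE event above.
    for name, value in decode_mci_status(0x90000040000F0005).items():
        print(f"{name}: {value}")
    print("walltime:", walltime_to_utc(0x605653E5))
```

With the values from the first event, the uncorrected bit comes out clear and the enabled bit set, which agrees with "Corrected_error Error_enabled" in the rasdaemon output, and the walltime matches the TIME 1616270309 shown in the kernel log.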

The board is an ASUS H97M-E, bios date 05/15/2015. Processor is
Intel(R) Core(TM) i7-4790S CPU @ 3.20GHz, with eight processors.

Now what?


--
Does anybody read signatures any more?

https://charlescurley.com
https://charlescurley.com/blog/

Sven Hartge

Mar 20, 2021, 5:20:04 PM
Charles Curley <charle...@charlescurley.com> wrote:

> Mar 20 13:58:29 hawk rasdaemon[892]: Calling ras_mc_event_opendb()
> Mar 20 13:58:29 hawk rasdaemon[892]: cpu 03:rasdaemon: mce_record store: 0x55c124c9b148
> Mar 20 13:58:29 hawk kernel: [ 300.407406] mce: [Hardware Error]: Machine check events logged
> Mar 20 13:58:29 hawk kernel: [ 300.407410] mce: [Hardware Error]: CPU 3: Machine Check: 0 Bank 0: 90000040000f0005
> Mar 20 13:58:29 hawk kernel: [ 300.407411] mce: [Hardware Error]: TSC f442c87fda
> Mar 20 13:58:29 hawk kernel: [ 300.407413] mce: [Hardware Error]: PROCESSOR 0:306c3 TIME 1616270309 SOCKET 0 APIC 6 microcode 19
> Mar 20 13:58:29 hawk rasdaemon[892]: rasdaemon: register inserted at db

> 1 2021-03-20 13:58:30 -0600 error: Internal parity error, mcg mcgstatus=0, mci Corrected_error Error_enabled, mcgcap=0x00000c09, status=0x90000040000f0005, tsc=0xf442c87fda, walltime=0x605653e5, cpu=0x00000003, cpuid=0x000306c3, apicid=0x00000006
> 2 2021-03-20 14:07:07 -0600 error: Internal parity error, mcg mcgstatus=0, mci Corrected_error Error_enabled, mcgcap=0x00000c09, status=0x90000040000f0005, tsc=0x274d9e61020, walltime=0x605655ea, cpu=0x00000003, cpuid=0x000306c3, apicid=0x00000006
> 3 2021-03-20 14:07:07 -0600 error: Internal parity error, mcg mcgstatus=0, mci Corrected_error Error_enabled, mcgcap=0x00000c09, status=0x90000040000f0005, tsc=0x27517a5dacb, walltime=0x605655eb, cpu=0x00000003, cpuid=0x000306c3, apicid=0x00000006
> 4 2021-03-20 14:10:34 -0600 error: Internal parity error, mcg mcgstatus=0, mci Corrected_error Error_enabled, mcgcap=0x00000c09, status=0x90000040000f0005, tsc=0x30ea8517bee, walltime=0x605656b9, cpuid=0x000306c3

> If I read that correctly, CPU 3 is seeing and correcting internal parity
> errors.

Correct.

> The board is an ASUS H97M-E, bios date 05/15/2015. Processor is
> Intel(R) Core(TM) i7-4790S CPU @ 3.20GHz, with eight processors.

> Now what?

Nothing really.

Check if there is a BIOS/Firmware update available.

Check if the voltages are set correctly in the BIOS/Firmware. (Usually
by loading the defaults and setting everything to "auto".)

Check temperature of the CPU.

Check if the latest intel-microcode package from Debian is installed
(3.20201118.1~deb10u1 at the moment) or grab the newest one from testing
(3.20210216.1).

Try running mprime (the Linux build of Prime95) in test mode for some
time to see if it complains and if errors occur more often under load.

Also run memtest86+ for some time to verify the correctness of your RAM.

In the end, if the error is in one of the caches inside the CPU, there
is really nothing you can do.

Regards,
Sven.

--
Sigmentation fault. Core dumped.

Andy Smith

Mar 20, 2021, 6:20:05 PM
Hi,

On Sat, Mar 20, 2021 at 02:29:25PM -0600, Charles Curley wrote:
> MCE events:
> 1 2021-03-20 13:58:30 -0600 error: Internal parity error, mcg mcgstatus=0, mci Corrected_error Error_enabled, mcgcap=0x00000c09, status=0x90000040000f0005, tsc=0xf442c87fda, walltime=0x605653e5, cpu=0x00000003, cpuid=0x000306c3, apicid=0x00000006

This could be a RAM error, but it could also be a memory error for
the cache inside the CPU, so a CPU error. But it could also be a
spurious CPU bug:

https://trick77.com/qemu-on-haswell-causes-spurious-mce-events/

Are you running qemu or KVM or some other kind of virtualisation? If
yes and if there doesn't appear to be any actual instability then it
may be spurious.
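
To check whether a given machine is in the affected family, the cpuid value from the log can be split into family, model and stepping. The following is a rough sketch, assuming the standard CPUID leaf 1 EAX layout; the function name `decode_cpuid_signature` is hypothetical, and the mapping of family 6, model 0x3c to Haswell client parts is stated from memory rather than taken from the log.

```python
def decode_cpuid_signature(eax: int) -> dict:
    """Decode a CPUID leaf 1 EAX signature such as 0x000306c3.

    Assumes the standard layout: stepping in bits 3:0, model in 7:4,
    family in 11:8, extended model in 19:16, extended family in 27:20.
    """
    stepping = eax & 0xF
    model = (eax >> 4) & 0xF
    family = (eax >> 8) & 0xF
    ext_model = (eax >> 16) & 0xF
    ext_family = (eax >> 20) & 0xFF

    # The extended model only applies to base families 6 and 15.
    if family in (0x6, 0xF):
        model |= ext_model << 4
    # The extended family only applies to base family 15.
    if family == 0xF:
        family += ext_family

    return {"family": family, "model": model, "stepping": stepping}


if __name__ == "__main__":
    sig = decode_cpuid_signature(0x000306C3)
    print(f"family {sig['family']}, model {sig['model']:#x}, "
          f"stepping {sig['stepping']}")
```

For the cpuid=0x000306c3 reported in this thread that gives family 6, model 0x3c, stepping 3, which is consistent with the i7-4790S being a Haswell part.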

Cheers,
Andy

--
https://bitfolk.com/ -- No-nonsense VPS hosting

Dan Ritter

Mar 20, 2021, 7:00:06 PM
Charles Curley wrote:
>
> The board is an ASUS H97M-E, bios date 05/15/2015. Processor is
> Intel(R) Core(TM) i7-4790S CPU @ 3.20GHz, with eight processors.
>
> Now what?

4 cores, 8 threads.

As others are pointing out, this could be thermal. Clean the
fan, consider replacing the power supply, consider removing the
heatsink and cleaning it then re-applying thermal paste.

If the problem recurs and it isn't thermal, you can replace the
CPU.

i7-5775C or another i7-4790S will go for about $200; a used
i7-4770K will be nearly unnoticeably faster for about $180.


-dsr-

Charles Curley

Mar 20, 2021, 10:40:04 PM
On Sat, 20 Mar 2021 22:13:17 +0000
Andy Smith <an...@strugglers.net> wrote:

> This could be a RAM error, but it could also be a memory error for
> the cache inside the CPU, so a CPU error. But it could also be a
> spurious CPU bug:
>
> https://trick77.com/qemu-on-haswell-causes-spurious-mce-events/
>
> Are you running qemu or KVM or some other kind of virtualisation? If
> yes and if there doesn't appear to be any actual instability then it
> may be spurious.

Quite likely this is it. Thank you.

On Friday I created an i386 VM to test Bullseye installations. The
first such event shows up in syslog shortly after I created it.

I ran memtest86+ for about 5 hours just now, and it showed no errors. I
have since rebooted the machine into Linux. I have not started that VM
up again, and will leave it alone for a day to see if I get any such
events.

I have a number of amd64 VMs, and I do not recall seeing this error
before. If I can run those without this error, that will narrow things
down to the i386 VM, and that may be worth a bug report.

Charles Curley

Mar 22, 2021, 4:00:05 PM
On Sat, 20 Mar 2021 20:09:24 -0600
Charles Curley <charle...@charlescurley.com> wrote:

> I have a number of amd64 VMs, and I do not recall seeing this error
> before. If I can run those without this error, that will narrow things
> down to the i386 VM, and that may be worth a bug report.

I ran an amd64 VM for 24 hours, and no errors. I just fired up a 486
VM, and no errors. I will let that run 24 hours and see what that does.

The i386 VM is "qemu32". I see a kvm32 in my list of options. I may try
that as well.

Charles Curley

Mar 23, 2021, 4:40:05 PM
On Mon, 22 Mar 2021 13:52:27 -0600
Charles Curley <charle...@charlescurley.com> wrote:

> I ran an amd64 VM for 24 hours, and no errors. I just fired up a 486
> VM, and no errors. I will let that run 24 hours and see what that
> does.
>
> The i386 VM is "qemu32". I see a kvm32 in my list of options. I may
> try that as well.

I ran the VM as qemu32 for a few seconds, and got 7 more errors. I then
switched the VM to "kvm32". I have been running that for about an hour
and seen 27 more errors. Still, that strikes me as pretty conclusive.
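
For anyone who wants to tally events like these rather than count them by eye, here is a small sketch that groups the MCE lines from `ras-mc-ctl --errors` by error type and CPU. It assumes the line format shown in the first message of this thread (the four sample lines are copied from there); the function name `summarize` and the regular expressions are my own and have not been tested against other rasdaemon versions.

```python
import re
from collections import Counter

# The four MCE lines from the first message in this thread.
SAMPLE = """\
1 2021-03-20 13:58:30 -0600 error: Internal parity error, mcg mcgstatus=0, mci Corrected_error Error_enabled, mcgcap=0x00000c09, status=0x90000040000f0005, tsc=0xf442c87fda, walltime=0x605653e5, cpu=0x00000003, cpuid=0x000306c3, apicid=0x00000006
2 2021-03-20 14:07:07 -0600 error: Internal parity error, mcg mcgstatus=0, mci Corrected_error Error_enabled, mcgcap=0x00000c09, status=0x90000040000f0005, tsc=0x274d9e61020, walltime=0x605655ea, cpu=0x00000003, cpuid=0x000306c3, apicid=0x00000006
3 2021-03-20 14:07:07 -0600 error: Internal parity error, mcg mcgstatus=0, mci Corrected_error Error_enabled, mcgcap=0x00000c09, status=0x90000040000f0005, tsc=0x27517a5dacb, walltime=0x605655eb, cpu=0x00000003, cpuid=0x000306c3, apicid=0x00000006
4 2021-03-20 14:10:34 -0600 error: Internal parity error, mcg mcgstatus=0, mci Corrected_error Error_enabled, mcgcap=0x00000c09, status=0x90000040000f0005, tsc=0x30ea8517bee, walltime=0x605656b9, cpuid=0x000306c3
"""

# "<n> <date> <time> <tz> error: <description>, ..."
EVENT = re.compile(
    r"^\d+ \d{4}-\d\d-\d\d \d\d:\d\d:\d\d [+-]\d{4} error: (?P<kind>[^,]+),"
)
# Matches "cpu=0x..." but not "cpuid=0x...".
CPU = re.compile(r"\bcpu=0x([0-9a-fA-F]+)")


def summarize(text: str) -> Counter:
    """Count MCE events per (error description, cpu number or None)."""
    counts = Counter()
    for line in text.splitlines():
        event = EVENT.match(line)
        if event is None:
            continue  # skip headers such as "MCE events:" and blank lines
        cpu = CPU.search(line)
        cpu_number = int(cpu.group(1), 16) if cpu else None
        counts[(event.group("kind"), cpu_number)] += 1
    return counts


if __name__ == "__main__":
    for (kind, cpu_number), n in summarize(SAMPLE).items():
        print(f"{n:3d}  {kind}  cpu={cpu_number}")
```

Note that the fourth sample line carries no cpu= field, so it is counted under cpu=None rather than being attributed to CPU 3.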

Is this even worth pursuing?

Sven Hartge

Mar 23, 2021, 5:10:04 PM
Charles Curley <charle...@charlescurley.com> wrote:
> On Mon, 22 Mar 2021 13:52:27 -0600 Charles Curley <charle...@charlescurley.com> wrote:

>> I ran an amd64 VM for 24 hours, and no errors. I just fired up a 486
>> VM, and no errors. I will let that run 24 hours and see what that
>> does.
>>
>> The i386 VM is "qemu32". I see a kvm32 in my list of options. I may
>> try that as well.

> I ran the VM as qemu32 for a few seconds, and got 7 more errors. I
> then switched the VM to "kvm32". I have been running that for about an
> hour and seen 27 more errors. Still, that strikes me as pretty
> conclusive.

> Is this even worth pursuing?

Only as an amusing party trick: "Look here folks, I can create an MCE on
demand."

Other than that: Intel has acknowledged the defect as an official
erro^Werratum and documented it. So "case closed" in that regard.

Charles Curley

Mar 23, 2021, 5:30:06 PM
On Tue, 23 Mar 2021 22:03:08 +0100
Sven Hartge <sv...@svenhartge.de> wrote:

> Other than that: Intel has acknowledged the defect as an official
> erro^Werratum and documented it. So "case closed" in that regard.

Agreed. Thanks.