Test ECC memory

kry...@ibse.cz

Feb 20, 2023, 12:50:07 PM
Dear Debian community,
we recently started using AMD Ryzen CPUs, ASRock Rack motherboards and Kingston unbuffered ECC DIMMs for our small business servers. All the servers run on ZFS, for which ECC memory is recommended, so I naively tried to test that it actually works. I read EVERY discussion on EVERY forum I was able to find (and there are a lot of them, believe me), but I did not find a satisfying answer.
According to the legendary tweet from AMD (which is linked in every discussion), Ryzen CPUs should support ECC memory, but it is not a tested feature since they are consumer CPUs. The funny thing is that, according to their spec sheets, even EPYC-class CPUs do not support it (the only CPUs with stated ECC support I found are the Ryzen Embedded ones, for example the V1605B in the UDOO Bolt). Nevertheless the system reports that it works: dmidecode and lshw show it, the kernel loads the driver, an EDAC MC is present in /sys/devices/system/edac/mc, and even memtest86+ v6.0 and above reports ECC memory.
In forum discussions Intel guys are saying that correctable ECC errors are relatively common. The stated counts vary, but I got the impression that at least one a week should appear. Yet our hypervisor, which has been running for over half a year with more than 80% memory utilization, has not logged a single one, neither in sysfs nor in the UEFI event log. I understand that the error count rises with height above mean sea level due to cosmic radiation and we are at an altitude of 246 m, but at least one error would be nice.
The only thing I had success with was memory overclocking: I tightened the timings as far as the system would still POST, and once Debian was running it reported correctable errors from different memory regions (13 within 30 minutes). Raising the memory frequency did not work. However, all this was done on an Asus motherboard, albeit with the same memory and CPU. When I change any memory-related setting on the ASRock Rack motherboard, it will not POST.
The kernel documentation describes that Intel CPUs have the ability to inject errors for driver testing, but I did not find anything like it for AMD. Does anyone know a way to test that ECC works without breaking the system first? Thank you for your answers.

PS: Some commercial memtests should allegedly be able to inject ECC errors (for example the one from PassMark). Has anyone tried those?

Best regards,
Kryštof

John Hasler

Feb 20, 2023, 1:20:05 PM
Tape the Americium-241 button out of a smoke detector to a RAM chip.
--
John Hasler
jo...@sugarbit.com
Elmwood, WI USA

John Hasler

Feb 20, 2023, 3:20:06 PM
> Hi, thank you for the answer. Honestly it came to my mind I could make
> some kind of neutrino emitter, since according to most articles it is
> the main source of ECC errors,

Neutrons, not neutrinos. The latter rarely interact with matter at
all. A neutron source is fairly difficult to make.

> In school I was taught that a sheet of paper is enough to absorb alpha
> rays - will they penetrate a chip package?

Some will. Attenuation of such radiation is proportional to thickness
and the level at the surface of the button, while too low to be
dangerous, is much higher than background.

A reference from https://en.wikipedia.org/wiki/ECC_memory :

https://web.archive.org/web/20131021190327/http://pdf.yuri.se/files/art/2.pdf

gene heskett

Feb 20, 2023, 5:10:06 PM
On 2/20/23 13:12, John Hasler wrote:
> Tape the Americium-241 button out of a smoke detector to a RAM chip.

Ooooh, that would be nasty ;o(> But it ought to do the trick.

Cheers, Gene Heskett.
--
"There are four boxes to be used in defense of liberty:
soap, ballot, jury, and ammo. Please use in that order."
-Ed Howdershelt (Author, 1940)
If we desire respect for the law, we must first make the law respectable.
- Louis D. Brandeis
Genes Web page <http://geneslinuxbox.net:6309/>

kry...@ibse.cz

Feb 20, 2023, 5:50:05 PM
DdB wrote:
> Did you really read, that epycs cannot support ECC?
> At least i can say, that my pools did not report any faults (which ofc
> would be several layers above ecc) either in 3 years, which did help in
> falling asleep. ;-)

I am sorry, that was a little misleading: it is not that they cannot support it, but that there is no official document stating that they do. The only official spec sheet I saw that explicitly mentions ECC support is this one: https://www.amd.com/en/products/specifications/embedded

Dan Ritter

Feb 20, 2023, 7:10:07 PM
We see ECC errors irregularly and infrequently on both Intel and
AMD CPUs. One a week would be very concerning if we're talking
about one system, but not too concerning if we are discussing a
thousand systems.

-dsr-
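
To put rough numbers on the distinction between one system and a thousand, here is a minimal sketch. It assumes correctable errors arrive independently at a constant rate (a Poisson model); that model, the 26-week window, and the function names are illustrative assumptions, not figures from anyone in this thread.

```python
import math


def per_system_rate(fleet_errors_per_week: float, systems: int) -> float:
    """Average errors per week for one machine, given a fleet-wide rate."""
    return fleet_errors_per_week / systems


def prob_no_errors(rate_per_week: float, weeks: float) -> float:
    """Poisson probability of observing zero errors over `weeks`."""
    return math.exp(-rate_per_week * weeks)


if __name__ == "__main__":
    weeks = 26  # roughly half a year of uptime (assumed)

    # Hypothetical case A: one error per week on a single machine.
    print(f"single system: {prob_no_errors(1.0, weeks):.2e}")

    # Hypothetical case B: one error per week spread over 1000 machines.
    rate = per_system_rate(1.0, 1000)
    print(f"one of 1000:   {prob_no_errors(rate, weeks):.3f}")
```

Under these assumptions, half a year without a single logged error would be practically impossible at one error per week per machine, but entirely expected if that rate applies to a whole fleet.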

kry...@ibse.cz

Feb 21, 2023, 2:10:05 AM
On Tuesday, February 21, 2023 at 1:20:50 CET, DdB wrote:
> Lucky me, i just looked up my hardware (Dual CPU on server MB):
>
> > https://versus.com/en/amd-epyc-7282
> > Supports ECC memory

Yes, but it is sad that you have to search for this information somewhere other than on the vendor's website. And this is not the only thing AMD is not telling us: for example, when using four dual-rank DIMMs, Ryzen can run them only at 2666 MT/s instead of 3200 MT/s, which is information I found only on ASRock's website and in its manuals. One would think that AMD would state it somewhere, since the memory controller is part of the CPU, but again I have never seen it anywhere in their spec sheets.

Anssi Saari

Feb 21, 2023, 4:50:06 AM
Dan Ritter <d...@randomstring.org> writes:

> We see ECC errors irregularly and infrequently on both Intel and
> AMD CPUs.

How/where do you see those on a Debian system? I looked into this
briefly but didn't get anywhere.

Anssi Saari

Feb 21, 2023, 5:20:06 AM
kry...@ibse.cz writes:

> PS: Some commercial memtests should allegedly be able to inject ECC
> errors (for example the one from passmark), have anyone tried those?

I've tried Passmark's memory tester (the commercial one which includes
ECC error injection), but I've had no luck. My desktop has issues with
mouse and keyboard support in it and Grub as well, it's so bad it's
practically impossible to do anything. It's my only "modern" system with
a Ryzen 5600X and ECC RAM.

My router and file server have ECC RAM but those systems are BIOS only
and I can't boot Passmark's thing on them since it requires UEFI. I have
an oldish Intel laptop with ECC RAM also which is apparently the only
one of my computers where Passmark's thing could run. Haven't tried it
though.

Dan Ritter

Feb 21, 2023, 10:10:07 AM
The kernel announces readiness during boot with:
dmesg:[ 18.331561] EDAC amd64: Node 0: DRAM ECC enabled.

and then an event looks like this:
Message from syslogd@HOSTNAME at Jan 25 15:05:51 ...
kernel:[5964975.397283] [Hardware Error]: Corrected error, no
action required.

Message from syslogd@HOSTNAME at Jan 25 15:05:51 ...
kernel:[5964975.406226] [Hardware Error]: CPU:0 (15:2:0)
MC4_STATUS[-|CE|MiscV|-|AddrV|-|-|CECC]: 0x9c04400040080a13

Message from syslogd@HOSTNAME at Jan 25 15:05:51 ...
kernel:[5964975.418574] [Hardware Error]: Error Addr:
0x0000001ed405ef50

Message from syslogd@HOSTNAME at Jan 25 15:05:51 ...
kernel:[5964975.426919] [Hardware Error]: MC4 Error (node 0):
DRAM ECC error detected on the NB.

Message from syslogd@HOSTNAME at Jan 25 15:05:51 ...
kernel:[5964975.437370] [Hardware Error]: cache level: L3/GEN,
mem/io: MEM, mem-tx: RD, part-proc: RES (no timeout)
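
The bracketed flags after MC4_STATUS can be recovered from the raw hex value. The sketch below decodes a few status bits, assuming the bit positions used in the Linux kernel's x86 MCE header (VAL 63, OVER 62, UC 61, EN 60, MISCV 59, ADDRV 58, PCC 57, and on AMD CECC 46, UECC 45). The function and table names are my own, and the positions are worth checking against AMD's documentation for your CPU family.

```python
# Bit positions are an assumption based on the Linux kernel's
# arch/x86/include/asm/mce.h; verify them for your CPU family.
MCA_STATUS_BITS = {
    "VAL": 63,    # register contains a valid error
    "OVER": 62,   # an earlier error was overwritten
    "UC": 61,     # uncorrected error (clear means corrected)
    "EN": 60,     # error reporting was enabled
    "MISCV": 59,  # MISC register holds valid data
    "ADDRV": 58,  # ADDR register holds a valid address
    "PCC": 57,    # processor context corrupt
    "CECC": 46,   # correctable ECC error (AMD)
    "UECC": 45,   # uncorrectable ECC error (AMD)
}


def decode_mca_status(status: int) -> dict[str, bool]:
    """Return a mapping of flag name to whether its bit is set."""
    return {name: bool((status >> bit) & 1) for name, bit in MCA_STATUS_BITS.items()}


if __name__ == "__main__":
    # Value taken from the log excerpt above.
    flags = decode_mca_status(0x9C04400040080A13)
    print([name for name, is_set in flags.items() if is_set])
    # prints ['VAL', 'EN', 'MISCV', 'ADDRV', 'CECC']
```

With UC clear and CECC set, this matches the "CE ... CECC" summary the kernel printed, i.e. a corrected single-bit ECC event.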


If you see a bunch of these, you want to install edac-utils and
run edac-util to see if you have a bad DIMM.
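
For the original question of where the counts live: the EDAC driver exposes per-controller totals under /sys/devices/system/edac/mc (the path mentioned in the first post). A minimal sketch that reads them directly, assuming the usual ce_count and ue_count files are present for each mcN directory; the function name is illustrative.

```python
from pathlib import Path

EDAC_MC_ROOT = Path("/sys/devices/system/edac/mc")


def read_edac_counts(root: Path = EDAC_MC_ROOT) -> dict[str, dict[str, int]]:
    """Collect corrected (ce) and uncorrected (ue) error totals per memory controller."""
    counts: dict[str, dict[str, int]] = {}
    if not root.is_dir():
        return counts  # EDAC driver not loaded, or no sysfs available
    for mc in sorted(root.glob("mc[0-9]*")):
        entry: dict[str, int] = {}
        for name in ("ce_count", "ue_count"):
            counter = mc / name
            if counter.is_file():
                entry[name] = int(counter.read_text().strip())
        counts[mc.name] = entry
    return counts


if __name__ == "__main__":
    result = read_edac_counts()
    if not result:
        print("No EDAC memory controllers found (is the EDAC driver loaded?)")
    for mc_name, entry in result.items():
        print(mc_name, entry)
```

A non-zero ce_count that keeps growing on one controller is the situation where looking for a bad DIMM makes sense; a count of zero only tells you nothing has been reported so far.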

-dsr-

kry...@ibse.cz

Feb 21, 2023, 4:30:06 PM
Dan Ritter wrote:
> The kernel announces readiness during boot with:
> dmesg:[ 18.331561] EDAC amd64: Node 0: DRAM ECC enabled.
>
> and then an event looks like this:
> Message from syslogd@HOSTNAME at Jan 25 15:05:51 ...
> kernel:[5964975.397283] [Hardware Error]: Corrected error, no
> action required.
>
> Message from syslogd@HOSTNAME at Jan 25 15:05:51 ...
> kernel:[5964975.406226] [Hardware Error]: CPU:0 (15:2:0)
> MC4_STATUS[-|CE|MiscV|-|AddrV|-|-|CECC]: 0x9c04400040080a13
>
> Message from syslogd@HOSTNAME at Jan 25 15:05:51 ...
> kernel:[5964975.418574] [Hardware Error]: Error Addr:
> 0x0000001ed405ef50
>
> Message from syslogd@HOSTNAME at Jan 25 15:05:51 ...
> kernel:[5964975.426919] [Hardware Error]: MC4 Error (node 0):
> DRAM ECC error detected on the NB.
>
> Message from syslogd@HOSTNAME at Jan 25 15:05:51 ...
> kernel:[5964975.437370] [Hardware Error]: cache level: L3/GEN,
> mem/io: MEM, mem-tx: RD, part-proc: RES (no timeout)

Do you keep track of how often these errors occur?

Dan Ritter

Feb 21, 2023, 5:00:06 PM
Yes, but I'm not allowed to give much more precision than I've
already said. Rare, unless you have a failing DIMM.

-dsr-