SSD Optimization - Crucial CT1000MX500SSD1

Marcelo Laia

Sep 29, 2022, 10:20:05 AM

Hi,

I recently bought a Crucial CT1000MX500SSD1 SATA SSD.

About a week ago, my Debian testing system crashed because a
partition went read-only.

I suspect this was caused by some upgrade, and that an SSD
optimization could solve the problem.

The SSD firmware is up to date: version M3CR043.

Here is some information:

:~$ sudo smartctl -a /dev/sda

https://pastebin.com/Jyrhn1A2

:~$ sudo journalctl --since "2022-09-25 00:00:00" | grep sda

https://pastebin.com/QtCqpJPm

:~$ sudo journalctl --since "2022-09-25 00:00:00" | grep ata1

https://pastebin.com/x5QdaYQU

:~$ sudo journalctl --since "2022-09-25 00:00:00" | grep error

https://pastebin.com/1hEPm7YX

Do you have any clues or advice?

Thank you so much!

--
Marcelo

Dan Ritter

Sep 29, 2022, 10:50:05 AM

You are having hardware errors. This will not be solved by
software.

The error may be in the controller (unlikely unless it is very
hot), the data cable (possible, cheap to switch) or the disk
(likely, will need replacement).

-dsr-

Andy Smith

Sep 29, 2022, 12:00:06 PM

Hi,

On Thu, Sep 29, 2022 at 10:54:19AM -0300, Marcelo Laia wrote:
> Recently, I bought a SSD SATA Crucial CT1000MX500SSD1.
>
> Nowadays, one week a go, debian testing system got crashed because partition
> got read only.
>
> I suspect this is a reason from some upgrade and I suspect that an SSD
> optimization can solve this problem.

Modern SSDs don't really need any optimisation. They shouldn't go
read-only after a short period of normal use. I've got a few of
these drives and don't see such errors, and don't need to change any
settings to make them work properly.

> :~$ sudo journalctl --since "2022-09-25 00:00:00" | grep sda
>
> https://pastebin.com/QtCqpJPm

These errors look quite serious and could be at the hardware level.

I'd try to make the drive do a "long" SMART self-test, and if it
failed I'd immediately RMA it.

# smartctl -t long /dev/sda

Then to view progress/result:

# smartctl -l selftest /dev/sda
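
As a rough sketch of the sequence above: the self-test is started on the
whole drive (assumed here to be /dev/sda) and the newest log entry is picked
out once the test has finished. The helper name latest_selftest is made up
for this example, and it only assumes the log layout that smartctl 7.x
prints, where the most recent test is entry "# 1".

```shell
#!/bin/sh
# Hypothetical helper: print only the newest entry ("# 1") of a
# `smartctl -l selftest` log read from standard input.
latest_selftest() {
    awk '/^# *1 / { print; exit }'
}

# Usage on real hardware (as root; /dev/sda is an assumption):
#   smartctl -t long /dev/sda                        # start the extended test
#   smartctl -l selftest /dev/sda | latest_selftest  # check the newest result
```

A "Short offline" entry at the top means the extended test has not been run
or has not been logged yet.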

Cheers,
Andy

--
https://bitfolk.com/ -- No-nonsense VPS hosting

David Christensen

Sep 29, 2022, 3:50:05 PM

The many "ATA bus error" messages indicate that you have a bad
connection between your motherboard and the SSD. You need to correct
this problem first.


I suggest replacing the SATA cable with a new SATA III cable with
locking connectors that is clearly marked "SATA III" and/or "6 Gbps". I
buy the black version of these:

https://www.cablematters.com/pc-187-156-3-pack-straight-60-gbps-sata-iii-cable.aspx

https://www.cablematters.com/pc-188-156-cable-matters-3-pack-90-degree-right-angle-60-gbps-sata-iii-cable-18-inches.aspx


David

Alexander V. Makartsev

Sep 29, 2022, 4:10:05 PM

On 29.09.2022 18:54, Marcelo Laia wrote:
> Hi,
>
> Recently, I bought a SSD SATA Crucial CT1000MX500SSD1.
>
> Nowadays, one week a go, debian testing system got crashed because
> partition got read only.
>
Filesystems are expected to fall back to read-only mode when the
driver reports errors.
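
For ext4 this is commonly the effect of the errors=remount-ro mount option.
A minimal sketch for listing block-device filesystems that are currently
mounted read-only, assuming the /proc/mounts field layout (device, mount
point, type, options). The function name ro_mounts is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: read /proc/mounts-style lines on standard input and
# print "device mountpoint" for every /dev/* filesystem whose option list
# contains the standalone flag "ro".
ro_mounts() {
    awk '$1 ~ /^\/dev\// && $4 ~ /(^|,)ro(,|$)/ { print $1, $2 }'
}

# Usage:
#   ro_mounts < /proc/mounts
```

The pattern only matches "ro" as a complete option, so a read-write mount
that merely carries errors=remount-ro is not reported.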

> I suspect this is a reason from some upgrade and I suspect that an SSD
> optimization can solve this problem.
>
> SSD firmware is up to date. This version is M3CR043.
Based on the info you've sent, there are a few options to try:
1. Replace the SATA cable with a known working one. I suggest this because
a few errors were registered in SMART attribute 199.
2. Test the drive under the most basic conditions, such as one GPT
partition and an ext4 filesystem without the "discard" option. Leave extra
layers like LVM, LUKS, TRIM, etc. aside for now.
3. A BIOS/firmware update for your motherboard could possibly improve
compatibility with the SSD's internal controller.
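
For option 2, one way to check whether a filesystem is currently mounted
with "discard" is sketched below. The helper name has_discard is invented
for this example; it only looks for the plain ext4-style "discard" flag in a
comma-separated option string, so variants such as "discard=async" are
deliberately not matched.

```shell
#!/bin/sh
# Hypothetical helper: succeed if the comma-separated mount option string
# given as $1 contains the standalone option "discard".
has_discard() {
    case ",$1," in
        *,discard,*) return 0 ;;
        *)           return 1 ;;
    esac
}

# Usage (findmnt is part of util-linux; "/" is just an example mount point):
#   if has_discard "$(findmnt -no OPTIONS /)"; then
#       echo "/ is mounted with discard"
#   fi
```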

No "SSD optimization" is required for an SSD to function. They are
"it just works" devices.
Overall the SMART data looks clean to me, like that of a brand-new SSD.
There are multiple similar reports on the Internet, so I'd suspect this is
more of a hardware/firmware compatibility issue than a faulty drive.
E.g. there is a high chance this SSD would work just fine inside another
PC (with a different motherboard, ICH, BIOS, etc.)

Can you show us more info about your PC/Laptop? A report from "inxi"
would be great.
You can get all possible info and filter out private data, with these
parameters:
    $ sudo inxi -a -v 8 -za

>
> Here is some informations:
>
> :~$ sudo smartctl -a /dev/sda
>
> https://pastebin.com/Jyrhn1A2
>
> :~$ sudo journalctl --since "2022-09-25 00:00:00" | grep sda
>
> https://pastebin.com/QtCqpJPm
>
> :~$ sudo journalctl --since "2022-09-25 00:00:00" | grep ata1
>
> https://pastebin.com/x5QdaYQU
>
> :~$ sudo journalctl --since "2022-09-25 00:00:00" | grep error
>
> https://pastebin.com/1hEPm7YX
>
> Have you some clue and/or advise here?
>
> Thank you so much!
>


--
With kindest regards, Alexander.

⢀⣴⠾⠻⢶⣦⠀
⣾⠁⢠⠒⠀⣿⡁ Debian - The universal operating system
⢿⡄⠘⠷⠚⠋⠀ https://www.debian.org
⠈⠳⣄⠀⠀⠀⠀

piorunz

Sep 29, 2022, 6:50:06 PM

On 29/09/2022 21:03, Alexander V. Makartsev wrote:

> Based on the info you've sent, there are a few options to try:
> 1. Replace SATA cable with a known working one. I suggest this because
> there are a few errors were registered in SMART Attribute 199.

And also 183 SATA_Interfac_Downshift = 9.

These SATA interface downshifts to 3.0 Gbps or lower may correspond to
the read-only events. I'd definitely try another SATA cable and a different
SATA port on the motherboard.

I have two identical SSDs, and on both of them attributes 183 and 199 are zero.
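
A small sketch for pulling just these two link-related attributes out of the
attribute table. It assumes the ten-column layout that `smartctl -A` prints
(ID, name, flag, value, worst, threshold, type, updated, when-failed, raw
value); the function name link_error_counts is made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: read `smartctl -A` output on standard input and print
# name=raw_value for attributes 183 (interface downshift) and 199 (CRC errors).
link_error_counts() {
    awk '$1 == 183 || $1 == 199 { print $2 "=" $10 }'
}

# Usage (as root; /dev/sda is an assumption):
#   smartctl -A /dev/sda | link_error_counts
```

On a healthy link both raw values are expected to stay at zero.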

--
With kindest regards, Piotr.

⢀⣴⠾⠻⢶⣦⠀
⣾⠁⢠⠒⠀⣿⡁ Debian - The universal operating system
⢿⡄⠘⠷⠚⠋⠀ https://www.debian.org/
⠈⠳⣄⠀⠀⠀⠀

Marcelo Laia

Sep 30, 2022, 12:20:05 PM

> Can you show us more info about your PC/Laptop? A report from "inxi"
> would be great. You can get all possible info and filter out private
> data, with these parameters:
>
> $ sudo inxi -a -v 8 -za

https://pastebin.com/iAvJrXdB

--
Marcelo

Marcelo Laia

Sep 30, 2022, 12:20:05 PM

> I'd try to make the drive do a "long" SMART self-test and if it
> failed it I'd immediately RMA it.
>
> # smartctl -t long /dev/sda
>
> Then to view progress/result:
>
> # smartctl -l selftest /dev/sda

smartctl 7.3 2022-02-28 r5338 [x86_64-linux-5.19.0-2-amd64] (local build)
Copyright (C) 2002-22, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART Self-test log structure revision number 1
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
# 1 Short offline Completed without error 00% 507 -


--
Marcelo

David Christensen

Sep 30, 2022, 3:40:07 PM

> On 9/29/22 06:54, Marcelo Laia wrote:

>> Recently, I bought a SSD SATA Crucial CT1000MX500SSD1.

>> :~$ sudo smartctl -a /dev/sda
>>
>> https://pastebin.com/Jyrhn1A2
>>
>> :~$ sudo journalctl --since "2022-09-25 00:00:00" | grep sda
>>
>> https://pastebin.com/QtCqpJPm
>>
>> :~$ sudo journalctl --since "2022-09-25 00:00:00" | grep ata1
>>
>> https://pastebin.com/x5QdaYQU
>>
>> :~$ sudo journalctl --since "2022-09-25 00:00:00" | grep error
>>
>> https://pastebin.com/1hEPm7YX


On 9/29/22 12:40, David Christensen wrote:

> The many "ATA bus error" messages indicate that you have a bad
> connection between your motherboard and the SSD.  You need to correct
> this problem first.


On 9/30/22 08:57, Marcelo Laia wrote:

>> $ sudo inxi -a -v 8 -za
>
> https://pastebin.com/iAvJrXdB


So, a 2014 Dell Inspiron 5547 laptop. You did not state that earlier,
so I assumed it was a desktop/ server system...


Laptop HDD/SSD cables are very specific. Getting a replacement could be
easy or could be hard. Contact Dell to see if the part is available. I
have never concluded a laptop HDD/SSD cable was bad, but I have not
encountered many. You will have to decide if you want to replace yours.
I would try to eliminate other possibilities first.


Please post:

# cat /etc/debian_version ; uname -a


Connect an Ethernet cable. Disable the Wi-Fi via CMOS Setup. Boot a
recent Debian installer into a rescue shell. For example (you may see
different questions; adjust your answers as needed):

debian-11.3.0-amd64-netinst.iso

Debian GNU/Linux installer menu
-> Advanced options
-> Rescue mode
-> Language -> C
-> Continent or region -> North America
-> Country, territory or area -> United States
-> Keymap to use -> American English
-> Load missing firmware from removable media -> No
-> Primary network interface -> eth0: Ethernet
-> Hostname -> debian
-> Domain name -> <blank>
-> Select your time zone -> Pacific
-> Passphrase for /dev/sda3 -> <blank>
-> Device to use as root file system -> Do not use a root file system
-> Rescue operations -> Execute a shell in the installer environment
-> Executing a shell -> Continue


Does dmesg(1) show any errors?

# dmesg | grep error


If you read 10 GiB from the SSD:

# date ; dd if=/dev/sda of=/dev/null bs=1M count=10k ; date


How long does it take? Were there any error messages? Does dmesg(1)
show any errors?
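
One way to make the read test more conclusive is to compare the drive's own
CRC error counter before and after it. The sketch below assumes smartctl is
available in the environment where the test runs (it may not be in the
installer's rescue shell) and that attribute 199 is reported in the usual
ten-column `smartctl -A` layout. The function names crc_count and crc_delta
are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: print the raw value of attribute 199
# (UDMA_CRC_Error_Count) from `smartctl -A` output on standard input.
crc_count() {
    awk '$1 == 199 { print $10; exit }'
}

# Hypothetical helper: number of new CRC errors between two readings.
crc_delta() {
    echo $(( $2 - $1 ))
}

# Usage (as root; /dev/sda is an assumption):
#   before=$(smartctl -A /dev/sda | crc_count)
#   dd if=/dev/sda of=/dev/null bs=1M count=10k
#   after=$(smartctl -A /dev/sda | crc_count)
#   echo "new CRC errors: $(crc_delta "$before" "$after")"
```

A non-zero difference would point at the link even if dd and dmesg report
nothing, since the transfer is normally retried transparently.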


Power off when done:

# poweroff


If you remove the Crucial SSD, install the previous HDD/SSD, and
exercise it with the Debian installer rescue shell, do you see any errors?


David

Andy Smith

Sep 30, 2022, 5:10:06 PM

Hello,

On Fri, Sep 30, 2022 at 12:56:22PM -0300, Marcelo Laia wrote:
> === START OF READ SMART DATA SECTION ===
> SMART Self-test log structure revision number 1
> Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
> # 1 Short offline Completed without error 00% 507 -

Well, I did suggest a "long" self-test whereas you appear to have
only done a short one.

Others' suggestions to replace the SATA cables and try different
ports were a better call that slipped my mind, but I think I read
further down the thread that this is a laptop so maybe you can't do
that.

Cheers,
Andy

Marcelo Laia

Oct 1, 2022, 7:30:05 AM

Hi Andy, so sorry for that. Here is the long one.

smartctl 7.3 2022-02-28 r5338 [x86_64-linux-5.19.0-2-amd64] (local build)
Copyright (C) 2002-22, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART Self-test log structure revision number 1
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
# 1 Extended offline Completed without error 00% 544 -
# 2 Short offline Completed without error 00% 507 -


--
Marcelo

David Christensen

Oct 1, 2022, 4:00:05 PM

Please run the following command and post the complete console session
-- prompt, command entered, output displayed -- on a web site
(pastebin.com, etc.):

# smartctl -x /dev/sda


David

Marcelo Laia

Oct 1, 2022, 9:00:05 PM

> Please run the following command and post the complete console session
> -- prompt, command entered, output displayed -- on a web site
> (pastebin.com, etc.):
>
>
> # smartctl -x /dev/sda

Here it is:

https://pastebin.com/znfuz82t

Thank you so much!

--
Marcelo

David Christensen

Oct 1, 2022, 9:30:06 PM

These attributes all indicate that the internal functions of the drive
are correct:

1 Raw_Read_Error_Rate POSR-K 100 100 000 - 0
5 Reallocate_NAND_Blk_Cnt -O--CK 100 100 010 - 0

171 Program_Fail_Count -O--CK 100 100 000 - 0
172 Erase_Fail_Count -O--CK 100 100 000 - 0

180 Unused_Reserve_NAND_Blk PO--CK 000 000 000 - 64

184 Error_Correction_Count -O--CK 100 100 000 - 0
187 Reported_Uncorrect -O--CK 100 100 000 - 0

196 Reallocated_Event_Count -O--CK 100 100 000 - 0
197 Current_Pending_ECC_Cnt -O--CK 100 100 000 - 0
198 Offline_Uncorrectable ----CK 100 100 000 - 0

202 Percent_Lifetime_Remain ----CK 099 099 001 - 1

206 Write_Error_Rate -OSR-- 100 100 000 - 0


This attribute indicates that the drive is having problems
communicating with the motherboard:

199 UDMA_CRC_Error_Count -O--CK 100 100 000 - 38


So, I would still do the steps I suggested earlier:

https://lists.debian.org/debian-user/2022/09/msg00772.html


David

David

Oct 1, 2022, 10:30:06 PM

On Sun, 2 Oct 2022 at 06:55, David Christensen
<dpch...@holgerdanske.com> wrote:

> Please run the following command and post the complete console session
> -- prompt, command entered, output displayed -- on a web site
> (pastebin.com, etc.):

Or paste.debian.net [1][2], which is a better match with the
philosophy [3] of the Debian project.

Maybe this is worth a mention in the mailing list Monthly FAQ?

[1] http://paste.debian.net
[2] http://paste.debian.net/paste.pl?show_template=about
[3] https://www.debian.org/intro/philosophy

Marcelo Laia

Oct 2, 2022, 9:30:05 AM

> Or paste.debian.net

Thank you so much! I didn't know about it yet.

--
Marcelo

Marcelo Laia

Oct 2, 2022, 9:40:06 AM

> # cat /etc/debian_version ; uname -a

bookworm/sid
Linux marcelo 5.19.0-2-amd64 #1 SMP PREEMPT_DYNAMIC Debian 5.19.11-1 (2022-09-24) x86_64 GNU/Linux

> Disable the Wi-Fi via CMOS Setup.

This isn't possible in the BIOS.

> # dmesg | grep error

Only errors related to iwlwifi (Wi-Fi card) and r8169 (Ethernet)

> # date ; dd if=/dev/sda of=/dev/null bs=1M count=10k ; date

started at 13:01:03
finished at 13:01:27

> How long does it take?

24 s

> Were there any error messages?

No

> Does dmesg(1) show any errors?

The same errors as before: iwlwifi (Wi-Fi card) and r8169 (Ethernet)

> If you remove the Crucial SSD, install the previous HDD/SSD, and
> exercise it with the Debian installer rescue shell, do you see any
> errors?

I will do that tonight.

Thank you!

--
Marcelo

David Christensen

Oct 2, 2022, 4:40:06 PM

On 10/2/22 06:19, Marcelo Laia wrote:
>> # cat /etc/debian_version ; uname -a
>
> bookworm/sid
> Linux marcelo 5.19.0-2-amd64 #1 SMP PREEMPT_DYNAMIC Debian 5.19.11-1
> (2022-09-24) x86_64 GNU/Linux


Please install Debian Stable.


>
>> Disable the Wi-Fi via CMOS Setup.
>
> This isn't possible in the BIOS.


Please reset your CMOS settings:


https://www.dell.com/support/kbdoc/en-us/000130200/how-to-use-and-troubleshoot-the-inspiron-15-5547#Issue0_7


>> # dmesg | grep error
>
> Only error related to iwlwifi and r8169 (wifi card)
>
>> # date ; dd if=/dev/sda of=/dev/null bs=1M count=10k ; date
>
> started at 13:01:03
> finished at  13:01:27
>
>> How long does it take?
>
> 24 s
>
>> Were there any error messages?
>
> No
>
>> Does dmesg(1) show any errors?
>
> The same previous errors: iwlwifi and r8169 (wifi card)


That supports my theory that the problem is Debian Unstable.


David

Rodrigo Cunha

Oct 2, 2022, 5:20:06 PM

Your Debian is a testing version (bookworm/sid). Check your SSD under Debian Stable.
--
Best regards,
Rodrigo da Silva Cunha
São Gonçalo, RJ - Brasil

piorunz

Oct 3, 2022, 12:30:06 PM

On 02/10/2022 21:33, David Christensen wrote:
> On 10/2/22 06:19, Marcelo Laia wrote:
>>> # cat /etc/debian_version ; uname -a
>>
>> bookworm/sid
>> Linux marcelo 5.19.0-2-amd64 #1 SMP PREEMPT_DYNAMIC Debian 5.19.11-1
>> (2022-09-24) x86_64 GNU/Linux
>
>
> Please install Debian Stable.

Why would he?
I have exactly the same SSD (two of them) in my machine, on Debian
Testing, with the drives in Btrfs RAID1 mode, and everything works
perfectly. But I have good SATA cables.
The OS version has nothing to do with cabling errors in the SSD's SMART log.
He may as well be using DOS, Windows, FreeBSD, or any Linux: cabling errors
must never happen.

$ uname -a
Linux ryzen 5.19.0-2-amd64 #1 SMP PREEMPT_DYNAMIC Debian 5.19.11-1 (2022-09-24) x86_64 GNU/Linux

$ sudo smartctl /dev/sda --all | grep "Device Model\|SATA_Interfac\|DMA_CRC_Error"
Device Model: CT1000MX500SSD1
183 SATA_Interfac_Downshift 0x0032 100 100 000 Old_age Always - 0
199 UDMA_CRC_Error_Count 0x0032 100 100 000 Old_age Always - 0

$ sudo smartctl /dev/sdb --all | grep "Device Model\|SATA_Interfac\|DMA_CRC_Error"
Device Model: CT1000MX500SSD1
183 SATA_Interfac_Downshift 0x0032 100 100 000 Old_age Always - 0
199 UDMA_CRC_Error_Count 0x0032 100 100 000 Old_age Always - 0

David Christensen

Oct 3, 2022, 11:00:06 PM

Even if you and the OP ran identical OS instances (e.g. clones), I do
not believe you two have the same make and model computers. Therefore,
different code paths will be executed -- e.g. device drivers. So, the
OP's computer may be hitting a bug that your computer does not.


I am applying a trouble-shooting strategy -- change one variable, apply
a stimulus, and measure the result. If the result is the same as it was
before, then the result is unlikely to be related to the variable and/or
change. But if the result is different, then the result is likely to be
related to the variable and/or change.


Of course, this is all premised upon devising a stimulus that reliably
reproduces the result. When my HDD's/SSD's were having SATA cable
and/or drive rack problems, reading 10 GB from them typically produced
at least one error.


When the OP read 10 GB of the SSD using the d-i rescue shell, he was
applying a stimulus after changing the variable "OS instance". The
result was different. Therefore, the SATA UDMA CRC errors are related
to changing the OS instance.


But, the above experiment has significant flaws (here are a few; I expect
there are more):

1. We cannot reproduce the OP's hardware and software.

2. We do not know what Debian installer the OP used (but we could
obtain it if he told us).

3. The stimulus read from the SSD. The UDMA CRC errors may only occur
during writes.

4. The SMART reports indicate 38 UDMA CRC errors for 1296000877 Logical
Sectors Written and 801097450 Logical Sectors Read. So, an average of 1
error per 5.52E+7 sectors. The test read 2.10E+7 sectors (10 GiB at 512
bytes per sector). That might be too few sectors.

5. Similarly, for Number of Read Commands -- 1 error per 4.43E+5
commands vs. 1.02E+4 test commands.

6. The Debian installer rescue shell is single-user (single-process?),
but the UDMA errors were seen during multi-user operation (SMP). If the
SATA UDMA errors are caused by concurrency/parallel execution, the d-i
rescue shell environment may not be capable of reproducing the error.
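
The sector arithmetic in the list above can be checked with a few lines of
shell and awk. This is only a sketch under loud assumptions: 512-byte
logical sectors, and CRC errors spread evenly and independently over all
sectors transferred (a Poisson model, which real link errors need not
follow). The function name read_test_odds is invented for this example.

```shell
#!/bin/sh
# Hypothetical helper.
#   $1 = CRC errors so far      $2 = logical sectors written
#   $3 = logical sectors read   $4 = bytes read by the test
read_test_odds() {
    awk -v errs="$1" -v wr="$2" -v rd="$3" -v bytes="$4" 'BEGIN {
        per_error    = (wr + rd) / errs          # sectors transferred per CRC error
        test_sectors = bytes / 512               # assumes 512-byte logical sectors
        expected     = test_sectors / per_error  # mean number of errors in the test
        printf "sectors_per_error=%.2e\n", per_error
        printf "test_sectors=%d\n", test_sectors
        printf "expected_errors=%.2f\n", expected
        printf "p_at_least_one=%.2f\n", 1 - exp(-expected)   # Poisson assumption
    }'
}

# Figures from the SMART report and a 10 GiB read (10 * 1024^3 bytes):
read_test_odds 38 1296000877 801097450 10737418240
```

With these inputs the expected number of errors in a single 10 GiB read
comes out at about 0.38, so under this simple model a clean run would not be
surprising even if the link were still marginal.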


If the OP installs Debian Stable on the SSD, runs the 10 GB sequential
read test, uses the system interactively, and the SATA UDMA errors are
not seen for some period of time (a week?), then I would be reasonably
confident the problem was the Debian Testing instance on the SSD. But if the
errors persist, then we will have to think up another hypothesis and
experiment.


David

piorunz

Oct 4, 2022, 7:40:05 AM

Agree.

> I am applying a trouble-shooting strategy -- change one variable, apply
> a stimulus, and measure the result.  If the result is the same as it was
> before, then the result is unlikely to be related to the variable and/or
> change.  But if the result is different, then the result is likely to be
> related to the variable and/or change.

Agreed, a troubleshooting strategy must proceed by elimination. In
general, when dealing with issues, the OS is a good variable to change (for
example, run a different system from a live CD for a couple of days).

> Of course, this is all premised upon devising a stimulus that reliably
> reproduces the result.  When my HDD's/SSD's were having SATA cable
> and/or drive rack problems, reading 10 GB from them typically produced
> at least one error.

I cannot comment on that; I very rarely have this kind of issue.
The last time I had one, it turned out to be faulty RAM (very rare bit rot),
not the SATA cable.

> When the OP read 10 GB of the SSD using the d-i rescue shell, he was
> applying a stimulus after changing the variable "OS instance".  The
> result was different.  Therefore, the SATA UDMA CRC errors are related
> to changing the OS instance

Maybe. But my bet is that this is a hardware error: RAM, the SATA cable, or
the SSD itself. Or the motherboard, in which case the laptop is dead in the
water and would need to be recycled or sold for spares.

I would also suggest that the OP, at this point, run a full MemTest86 by
PassMark (UEFI only): https://www.memtest86.com/
Or the old standby (but sometimes buggy) memtest86+: https://www.memtest.org/

A full run of either, to make sure the CPU/RAM is good.

David Christensen

Oct 4, 2022, 8:10:05 PM

On 10/4/22 04:39, piorunz wrote:
> On 04/10/2022 03:56, David Christensen wrote:


>> Of course, [my trouble-shooting strategy] is all premised upon devising a stimulus that reliably
>> reproduces the result.  When my HDD's/SSD's were having SATA cable
>> and/or drive rack problems, reading 10 GB from them typically produced
>> at least one error.
>
> I cannot comment on that, I very very rarely have [SATA UDMA CRC] issues.
> Last time I had it turned out to be faulty RAM (very rare bit rots), not
> SATA cable.


I have moved the majority of my data to servers with ECC memory and ZFS
mirrors, but I have little to no defense against memory errors on my
desktops and laptops without ECC memory. So, I keep as little data as
possible on the latter, and backup/ archive daily or sooner.


>> When the OP read 10 GB of the SSD using the d-i rescue shell, he was
>> applying a stimulus after changing the variable "OS instance".  The
>> result was different.  Therefore, the SATA UDMA CRC errors are related
>> to changing the OS instance
>
> Maybe. But my bet this is hardware error. RAM or SATA Cable, or SSD
> itself. Or motherboard, then this is dead in the water and laptop would
> need to be recycled/sold as spares if this happens.
>
> I would also suggest to OP at this point, to do full memtest86 by
> passmark (UEFI only) https://www.memtest86.com/
> Or old typical (but bugged sometimes) memtest86+ https://www.memtest.org/
>
> Full run of either to make sure CPU/RAM is good.


+1 for the OP testing memory by any means. The BIOS in the OP's
computer may include a test suite that includes memory tests. (I have
STFW and the Dell Support web site for Inspiron 5547 System Setup
documentation, but have not found anything.) If the OP has a HDD/SSD
with a Windows instance, the OP should be able to run diagnostics via
Microsoft Edge and the Dell Support web site.


Do you have a URL that documents bugs in memtest86+?


David

piorunz

Oct 4, 2022, 9:50:04 PM

On 05/10/2022 01:00, David Christensen wrote:

> I have moved the majority of my data to servers with ECC memory and ZFS
> mirrors, but I have little to no defense against memory errors on my
> desktops and laptops without ECC memory.  So, I keep as little data as
> possible on the latter, and backup/ archive daily or sooner.

Fortunately, we are blessed with AMD Ryzen processors that have unlocked
ECC functionality; nothing stupid like Intel processors that fully support
ECC but have it disabled because there is no "Xeon" in the name.

I use this in my desktop PC:
CPU: 8-core model: AMD Ryzen 7 5800X
Mobo: ASUSTeK model: PRIME B550-PLUS
Memory:
RAM: total: 62.72 GiB used: 19.94 GiB (31.8%)
Array-1: capacity: 128 GiB slots: 4 EC: Multi-bit ECC
Device-1: DIMM_A1 type: no module installed
Device-2: DIMM_A2 type: DDR4 size: 32 GiB speed: 3600 MT/s
Device-3: DIMM_B1 type: no module installed
Device-4: DIMM_B2 type: DDR4 size: 32 GiB speed: 3600 MT/s

Works amazingly. Not a single bit lost (due to memory) since I built it. But
I have had files corrupted due to an SSD, though. Btrfs detects all checksum
errors, so I know right away what is happening. No long-term bit rot and
no important data lost, as I would have had with other filesystems and
non-ECC memory.

And my home server is similar, older build:
CPU: Info: 8-Core model: AMD Ryzen 7 1700
Mobo: ASUSTeK model: PRIME B350-PLUS
Memory: RAM: total: 62.81 GiB used: 12.13 GiB (19.3%)
Array-1: capacity: 128 GiB slots: 4 EC: Multi-bit ECC
Device-1: DIMM_A1 size: 16 GiB speed: 2666 MT/s
Device-2: DIMM_A2 size: 16 GiB speed: 2666 MT/s
Device-3: DIMM_B1 size: 16 GiB speed: 2666 MT/s
Device-4: DIMM_B2 size: 16 GiB speed: 2666 MT/s

I had at least two critical file corruption losses before I went all ECC;
now, for two years, it's been perfectly stable.


> Do you have a URL that documents bugs in memtest86+?

In Debian? Not that I am aware of. But throughout my last two years of
refurbishing somewhat older machines, about 50 various desktops and
laptops, memtest86+ would often just hang (no progress, but the
small ASCII characters keep moving around as if it's doing something).
That happens especially when the maximum RAM (for an old PC BIOS) is installed.
Windows and Linux work fine with that maximum amount of memory, for
example 4 GB (for older machines) or 8 GB (for slightly newer ones).
The Windows 10 memory tester would even test it fine. And Linux
"memtester", which runs while the system is running, also gives a perfect
result with no errors.
But memtest86+ would just hang, always at exactly the same percentage in
the first test, and the computer would need to be reset.
It happens with memtest86+ from the Linux Mint ISO too, so perhaps this is one
for the software authors, not for the Debian BTS. A rare error, on old exotic
builds, but annoying. That's why, where I can (on UEFI), I prefer to use
PassMark's proprietary MemTest86.
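
For the userspace "memtester" mentioned above, a hedged sketch of picking a
test size from a running system. It assumes the /proc/meminfo layout with a
MemAvailable line in kB, and the "half of available memory" rule is only an
arbitrary choice for this example. The function name half_available_mb is
made up.

```shell
#!/bin/sh
# Hypothetical helper: read /proc/meminfo-style text on standard input and
# print half of MemAvailable, converted from kB to whole megabytes.
half_available_mb() {
    awk '/^MemAvailable:/ { printf "%d\n", $2 / 1024 / 2; exit }'
}

# Usage (as root, so memtester can lock the memory it tests):
#   mb=$(half_available_mb < /proc/meminfo)
#   memtester "$mb" 1      # size in megabytes, one loop
```

Because memtester can only exercise memory the running system is able to
hand over, it covers less than a boot-time tester does.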

David Christensen

Oct 5, 2022, 12:10:06 AM

On 10/4/22 18:41, piorunz wrote:

> ... fully supported ECC on Intel
> processors but disabled because no "Xeon" in the name.


AIUI memory support on Intel platforms depends upon the chipset and the
processor. For example, my Intel S1200V3RPS (C222 chipset) and
S1200V3RPL (C226 chipset) server boards support ECC memory modules with
specific Celeron, Pentium, Core i3, and Xeon processors. Similarly, my
Dell PowerEdge T30 (Intel C236 chipset) supports specific combinations
of non-ECC or ECC memory with Pentium, Core i3, and Xeon processors.


STFW I see:

https://www.tomshardware.com/news/intel-enables-ecc-on-12th-gen-core-cpus


David

piorunz

Oct 5, 2022, 7:40:05 AM

Good for you, proper server-grade boards and Xeons.
I went rebel: both my machines are consumer grade, cheap Ryzens and ASUS
boards. Heck, my home server is made almost entirely of USED parts from
eBay. Used first-gen 16-thread Ryzen, used ASUS board with ECC, 4x4TB
hard drives from my previous builds, used PSU, used NVIDIA card for
basic nouveau display. Only the ECC RAM and the 2x SSDs are brand new.
Very cheap and affordable for everyone. Intel server-grade equipment is
sometimes an order of magnitude more expensive.

I am glad Intel feels the breath of competition on its neck and is starting
to unlock ECC for *some* consumer-grade CPUs and motherboards. *Some* being:
"Speaking of Intel’s W680, it is necessary to note that this chipset has
essentially the same features as Z690, but given its workstation nature,
it lacks support for overclocking."

So that's still way behind AMD. I got both ECC and OC. I overclocked the ECC
memory from stock 3200 to the sweet spot for the 5800X CPU: 3600 MT/s. No
hassle, no errors, just a few clicks in a drop-down BIOS menu. I also
disabled the aggressive CPU single-core boost, which takes it to an over-spec
~150W power draw, and underclocked it a little. The entire system is
very quiet even under heavy 100% 16-threaded load thanks to these
modifications.

David Christensen

Oct 5, 2022, 9:50:06 PM

On 05/10/2022 05:07, David Christensen wrote:

> https://www.tomshardware.com/news/intel-enables-ecc-on-12th-gen-core-cpus


On 10/5/22 04:32, piorunz wrote:

> "Speaking of Intel’s W680, it is necessary to note that this chipset has
> essentially the same features as Z690, but given its workstation nature,
> it lacks support for overclocking."
>
> So that's still way behind AMD. I got both ECC and OC.


I later found this AnandTech article with the title "The Intel W680
Chipset Overview: Alder Lake Workstations Get ECC Memory and
Overclocking Support":

https://www.anandtech.com/show/17308/the-intel-w680-chipset-overview-ecc-for-alder-lake-workstations


It is surprising that Tom's Hardware and AnandTech state the opposite
regarding the Intel W680 chipset and overclocking support. I assume
AnandTech is correct, as they mention announced motherboards.


AnandTech also indicates that those boards do not have the over-sized
power regulation circuitry that is required for serious overclocking.
Perhaps another vendor has built such a motherboard since then.


Here are some additional articles on the subject of memory integrity
detection and correction:

https://en.wikipedia.org/wiki/ECC_memory

https://www.zdnet.com/article/dram-error-rates-nightmare-on-dimm-street/


While researching ECC memory a few years ago, I recall seeing an article
that discussed memory errors vs. memory size. 16 GB was the unity
threshold -- computers with less than 16 GB of memory had an error
likelihood of less than 1 bit per day (?) and computers with more than
16 GB had a likelihood greater than 1. I cannot remember the system
loading model (e.g. light to moderate desktop 8 hours vs. heavy
workstation 8 hours vs. heavy server 24 hours).


David

Anssi Saari

Oct 6, 2022, 7:30:06 AM

piorunz <pio...@gmx.com> writes:

> I am glad intel feels breath of competition on their neck and starting
> to unlock ECC for *some* customer grade CPUs and motherboards. *Some* being:
> "Speaking of Intel’s W680, it is necessary to note that this chipset has
> essentially the same features as Z690, but given its workstation nature,
> it lacks support for overclocking."

It looks like overclocking is there, at least to some extent, but the
Supermicro W680 boards I could find in the retail channel
cost 400-550 US dollars, so not exactly something I'd like to buy as an
all-around motherboard. So, I'm happy with my consumer ASRock board, ECC
RAM and Ryzen.

> So that's still way behind AMD. I got both ECC and OC. I overclocked ECC
> memory from stock 3200 to sweet spot for 5800X CPU - 3600 MT/s. No
> hassle, no errors, just few clicks from drop-down BIOS menu. And
> disabled aggressive CPU single core boost which takes it to over-spec
> ~150W power draw, and also underclocked it a little. Entire system is
> very quiet even under heavy 100% 16-threaded load thanks to these
> modifications.

What do you use for cooling if I may ask? I've found my 5600X reaches a
thermal limit pretty quickly with any kind of torture test software and
slows down, with the stock cooler. I haven't really checked if that
happens in actual use but I already got a beefier Noctua
cooler. Unfortunately it doesn't install itself :)

piorunz

Oct 6, 2022, 10:10:06 AM

On 06/10/2022 12:29, Anssi Saari wrote:
> piorunz <pio...@gmx.com> writes:
>
>> I am glad intel feels breath of competition on their neck and starting
>> to unlock ECC for *some* customer grade CPUs and motherboards. *Some* being:
>> "Speaking of Intel’s W680, it is necessary to note that this chipset has
>> essentially the same features as Z690, but given its workstation nature,
>> it lacks support for overclocking."
>
> Looks like overclocking is there to at least to some extent but at least
> what Supermicro W680 boards I could find in the retail channel, they
> cost 400-550 US dollars so not exactly something I'd like to buy for an
> all-around motherboard. So, happy with my consumer Asrock board and ECC
> RAM and Ryzen.

400-500 USD for a motherboard is way beyond my budget. The server I
put together has been working nonstop for 4 or 5 years now. It has
eaten through one UPS battery (which just died of age), but the server itself
has never had any fundamental issues. And it's very cheap; it has paid for
itself many times over.

>> So that's still way behind AMD. I got both ECC and OC. I overclocked ECC
>> memory from stock 3200 to sweet spot for 5800X CPU - 3600 MT/s. No
>> hassle, no errors, just few clicks from drop-down BIOS menu. And
>> disabled aggressive CPU single core boost which takes it to over-spec
>> ~150W power draw, and also underclocked it a little. Entire system is
>> very quiet even under heavy 100% 16-threaded load thanks to these
>> modifications.
>
> What do you use for cooling if I may ask? I've found my 5600X reaches a
> thermal limit pretty quickly with any kind of torture test software and
> slows down, with the stock cooler. I haven't really checked if that
> happens in actual use but I already got a beefier Noctua
> cooler. Unfortunately it doesn't install itself :)
>
My home server with an AMD Ryzen 7 1700 (65W TDP) runs on the stock cooler.
It works very well, within a normal temperature range even under heavier
CPU load. My server pic: https://i.imgur.com/nguQRAC.jpeg

The workstation with the 5800X (105W rated CPU) is a very different story.

Motherboard manufacturers quietly overclock Ryzens by default,
and by a huge margin. They want to be the best in motherboard rankings.
They do it in several ways: by misreporting the temperature sensor and
total power draw readings. So the AMD CPU thinks it's nice and cold and
not too power hungry, thinks it has won the silicon lottery, then goes
crazy and overclocks itself by boosting cores. I remember reading on
AnandTech or somewhere that this practice is so bad that a 105W CPU can go
as high as 150W while neither the CPU nor the user knows what is going on.

So for my 5800X I use a Cooler Master MasterLiquid ML280 all-in-one water
cooler. With this cooler and all settings on default (meaning ASUS
default settings), the CPU will go crazy boosting to 4.7 or 4.8 GHz (that's
more than boost spec!) on some cores and the fans on the AIO will spin
quite fast. Under heavy load, the temperature hits 90C and stays there.
This is the thermal limit of the processor; it throttles to maintain this
temperature.

I didn't like that, so I decreased the voltage offset slightly in the BIOS
and disabled ASUS boost overclocking (PBOC? Or something? I don't remember
the name but I am sure your ASRock may have a similar setting). Now I
never reach 90C, meaning the CPU never throttles by temperature, but rather
by wattage. And without crazy boosting. I hope TDP actually stays in
the range it's supposed to (105W).
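
One way to check under Linux whether package power really stays near the
rated figure: a minimal sketch, assuming the kernel exposes the RAPL energy
counter under /sys/class/powercap/intel-rapl:0. That path is an assumption;
on Ryzen it depends on the kernel version, and on recent kernels reading
energy_uj typically needs root. The script reads the counter twice and
divides the energy difference by the elapsed time. The function names are
made up for illustration.

```python
import time
from pathlib import Path

# Assumed location of the package energy counter; it may be absent or
# numbered differently depending on kernel version and CPU.
RAPL_DIR = Path("/sys/class/powercap/intel-rapl:0")


def average_watts(start_uj, end_uj, seconds, max_range_uj):
    """Average power in watts from two energy readings in microjoules."""
    delta = end_uj - start_uj
    if delta < 0:
        # The counter wrapped around between the two readings.
        delta += max_range_uj
    return delta / 1_000_000 / seconds


def sample_package_power(rapl_dir=RAPL_DIR, seconds=5.0):
    """Read the counter twice, `seconds` apart, and return average watts."""
    energy_file = rapl_dir / "energy_uj"
    max_range = int((rapl_dir / "max_energy_range_uj").read_text())
    start = int(energy_file.read_text())
    t0 = time.monotonic()
    time.sleep(seconds)
    end = int(energy_file.read_text())
    elapsed = time.monotonic() - t0
    return average_watts(start, end, elapsed, max_range)


if __name__ == "__main__":
    try:
        print(f"package power: {sample_package_power():.1f} W")
    except OSError as exc:
        # Missing file, or no permission to read the counter.
        print(f"could not read RAPL counter: {exc}")
```

Running it while a torture test is going, once with the board defaults and
once with the tweaked BIOS settings, should show whether the changes made a
difference.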

Your 5600X is a 12-thread 65W processor. So it should behave like my
1700, same TDP range, but the 1700 never reaches throttling temperature.
Your newer gen. CPU probably goes far beyond 65W in power draw. See if
you can tweak the BIOS settings; that may not only save you buying a new
cooler, but also help your electricity bill and CPU longevity :)

David Christensen

Oct 6, 2022, 9:40:05 PM
On 10/6/22 07:09, piorunz wrote:

> Home server with AMD Ryzen 7 1700 (65W TDP) is with stock cooler. Works
> very well with normal temperature range even on heavier CPU load. My
> server pic: https://i.imgur.com/nguQRAC.jpeg


What is the device with the fan on top of the middle 3.5" drive bays,
next to the power supply?


David

piorunz

Oct 7, 2022, 5:40:05 AM
Good eyes! This is an ASIC Bitcoin miner. I stripped it of its case
and put it there, a perfect place. It participates in the bitcoin mining
lottery without noise or huge power draw. The device is called Apollo BTC,
made by FutureBit, a company from the USA.

Anssi Saari

Oct 7, 2022, 10:20:05 AM
piorunz <pio...@gmx.com> writes:

> Your 5600X is 12 threaded 65W processor. So it should behave like my
> 1700, same TDP range, but 1700 never reaches throttling temperature.
> Your newer gen. CPU probably goes far beyond 65W in power draw. See if
> you can tweak BIOS settings, that may save you not only buying new
> cooler, but in electricity bill and CPU longevity as well :)

I actually turned on the PBO optimization myself since I wanted that for
gaming. I don't think Asrock's goal is to win those reviews since it
wasn't even on by default. The motherboard model is B550 Extreme4 so
budget chipset and nothing fancy. Mostly I picked it for the minimal but
apparently sufficient information that ECC is supported and the
interface options.

I should really take some time to test the system a little more to see
where the clock speeds get up to and how the temperature behaves.
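
For that kind of check on Linux, a minimal sketch that takes one snapshot
of per-core clocks and sensor temperatures from sysfs. It assumes the
cpufreq and hwmon interfaces are present; on Ryzen the temperatures would
normally come from the k10temp driver, but that is an assumption about the
particular setup. The function names are made up for illustration.

```python
from pathlib import Path


def read_core_clocks_mhz(cpu_root=Path("/sys/devices/system/cpu")):
    """Return {cpu_name: MHz} from each core's scaling_cur_freq (in kHz)."""
    clocks = {}
    for freq_file in sorted(cpu_root.glob("cpu[0-9]*/cpufreq/scaling_cur_freq")):
        cpu_name = freq_file.parent.parent.name
        clocks[cpu_name] = int(freq_file.read_text()) / 1000
    return clocks


def read_temperatures_c(hwmon_root=Path("/sys/class/hwmon")):
    """Return {"chip/label": degrees C} from temp*_input (millidegrees)."""
    temps = {}
    for temp_file in sorted(hwmon_root.glob("hwmon*/temp*_input")):
        chip_dir = temp_file.parent
        name_file = chip_dir / "name"
        chip = name_file.read_text().strip() if name_file.exists() else chip_dir.name
        label_file = chip_dir / temp_file.name.replace("_input", "_label")
        label = label_file.read_text().strip() if label_file.exists() else temp_file.name
        try:
            temps[f"{chip}/{label}"] = int(temp_file.read_text()) / 1000
        except OSError:
            # Some sensors refuse reads; skip them.
            continue
    return temps


if __name__ == "__main__":
    clocks = read_core_clocks_mhz()
    if clocks:
        print(f"fastest core right now: {max(clocks.values()):.0f} MHz")
    for sensor, celsius in read_temperatures_c().items():
        print(f"{sensor}: {celsius:.1f} C")
```

Calling it in a loop once a second during a torture test, and again during
a gaming session, would show whether the clocks drop as the temperature
approaches the limit.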

Also as I have the commercial memtest86, it should be possible to inject
ECC errors with it. I just haven't tried that yet.
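
Since memtest86 runs outside the OS, a separate check on the Linux side is
to look at the EDAC error counters in sysfs. A minimal sketch, assuming an
EDAC driver is loaded for the memory controller (amd64_edac on Ryzen would
be the usual one, but that is an assumption here); if no controller
directories exist, ECC reporting is probably not active. The function name
is made up for illustration.

```python
from pathlib import Path


def read_edac_counts(edac_root=Path("/sys/devices/system/edac/mc")):
    """Return {controller: (corrected, uncorrected)} from EDAC counters."""
    counts = {}
    for mc_dir in sorted(edac_root.glob("mc[0-9]*")):
        corrected = int((mc_dir / "ce_count").read_text())
        uncorrected = int((mc_dir / "ue_count").read_text())
        counts[mc_dir.name] = (corrected, uncorrected)
    return counts


if __name__ == "__main__":
    counts = read_edac_counts()
    if not counts:
        print("no EDAC memory controllers found")
    for controller, (corrected, uncorrected) in counts.items():
        print(f"{controller}: corrected={corrected} uncorrected={uncorrected}")
```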