warning: "This [Crucial M.2] storage device is likely to fail soon"

jkn

Apr 28, 2022, 4:46:10 AM
Hi all
Just yesterday I upgraded my main desktop from Kubuntu 20.04 to 22.04 LTS. All went pretty well ... except this morning when I powered it up I got a worrying desktop warning message along the lines of:

The Storage Device /dev/nvme0n1 is likely to fail soon

erk!

I have not seen this before, but TBH yesterday was the first time I have powered this machine off in a long time; I'm checking 'quiescent' household power consumption.

I did a quick "sudo smartctl /dev/nvme0n1 -a" which doesn't look too bad, although I haven't delved deep. See below for the output.
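
As an aside on reading smartctl results: besides the text output, smartctl reports a summary through its exit status, which is a bitmask described in the EXIT STATUS section of the smartctl(8) man page. Desktop monitoring tools may look at that status rather than at the detailed fields, though whether that is what happened here is an assumption and not something the thread confirms. A minimal sketch of decoding it follows; the function name `decode_smartctl_exit` is made up for illustration.

```python
# Bit meanings paraphrased from the EXIT STATUS section of smartctl(8).
SMARTCTL_EXIT_BITS = {
    0: "command line did not parse",
    1: "device open failed",
    2: "a SMART or other command to the disk failed",
    3: "SMART status check returned DISK FAILING",
    4: "prefail attributes at or below threshold",
    5: "attributes were at or below threshold at some time in the past",
    6: "device error log contains records of errors",
    7: "device self-test log contains records of errors",
}


def decode_smartctl_exit(status):
    """Return the descriptions of every bit set in a smartctl exit status."""
    return [text for bit, text in SMARTCTL_EXIT_BITS.items() if status & (1 << bit)]


# Illustrative value only: 64 means just bit 6 is set.
print(decode_smartctl_exit(64))
# prints: ['device error log contains records of errors']
```

In a shell the status is available as `$?` immediately after the smartctl command finishes. The value 64 above is only an example; the thread does not show the actual exit status on this machine.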

So a few thoughts:

- any idea if this is related to my recent upgrade? i.e. a new feature in [K]Ubuntu 22.04?
- is this likely to be a real issue, or an over-zealous warning?

I am thinking of doing two things: buying a new, larger (1TB) M.2 drive and dd'ing everything over; and upgrading the firmware on this Crucial M.2 drive.

- Is it particularly risky to update the M.2 firmware without backing up the drive first?

Thanks for any thoughts
Jon N

{{{ output from: sudo smartctl /dev/nvme0n1 -a
smartctl 7.2 2020-12-30 r5155 [x86_64-linux-5.15.0-27-generic] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Number: CT500P2SSD8
Serial Number: 2043E4BD82DC
Firmware Version: P2CR010
PCI Vendor/Subsystem ID: 0xc0a9
IEEE OUI Identifier: 0x6479a7
Total NVM Capacity: 500,107,862,016 [500 GB]
Unallocated NVM Capacity: 0
Controller ID: 1
NVMe Version: 1.3
Number of Namespaces: 1
Namespace 1 Size/Capacity: 500,107,862,016 [500 GB]
Namespace 1 Formatted LBA Size: 512
Namespace 1 IEEE EUI-64: 6479a7 fff0000000
Local Time is: Thu Apr 28 09:39:36 2022 BST
Firmware Updates (0x12): 1 Slot, no Reset required
Optional Admin Commands (0x001f): Security Format Frmw_DL NS_Mngmt Self_Test
Optional NVM Commands (0x005e): Wr_Unc DS_Mngmt Wr_Zero Sav/Sel_Feat Timestmp
Log Page Attributes (0x0e): Cmd_Eff_Lg Ext_Get_Lg Telmtry_Lg
Maximum Data Transfer Size: 64 Pages
Warning Comp. Temp. Threshold: 70 Celsius
Critical Comp. Temp. Threshold: 85 Celsius

Supported Power States
St Op     Max   Active     Idle   RL RT WL WT  Ent_Lat  Ex_Lat
 0 +     4.50W       -        -    0  0  0  0        0       0
 1 +     2.70W       -        -    1  1  1  1        0       0
 2 +     2.16W       -        -    2  2  2  2        0       0
 3 -   0.0700W       -        -    3  3  3  3     1000    1000
 4 -   0.0020W       -        -    4  4  4  4     5000   55000

Supported LBA Sizes (NSID 0x1)
Id Fmt  Data  Metadt  Rel_Perf
 0 +     512       0         1
 1 -    4096       0         0

=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

SMART/Health Information (NVMe Log 0x02)
Critical Warning: 0x00
Temperature: 39 Celsius
Available Spare: 100%
Available Spare Threshold: 5%
Percentage Used: 0%
Data Units Read: 1,140,109 [583 GB]
Data Units Written: 2,357,594 [1.20 TB]
Host Read Commands: 14,832,947
Host Write Commands: 25,228,486
Controller Busy Time: 11,685
Power Cycles: 236
Power On Hours: 8,833
Unsafe Shutdowns: 38
Media and Data Integrity Errors: 0
Error Information Log Entries: 275
Warning Comp. Temperature Time: 0
Critical Comp. Temperature Time: 0

Error Information (NVMe Log 0x01, 16 of 16 entries)
Num   ErrCount  SQId   CmdId  Status  PELoc          LBA  NSID    VS
  0        275     0  0x1008  0x4005  0x028            0     0     -
}}}

Theo

Apr 28, 2022, 5:12:12 AM
jkn <jkn...@nicorp.f9.co.uk> wrote:
> So a few thoughts:
>
> - any idea if this is related to my recent upgrade? i.e. a new feature in [K]Ubuntu 22.04?
> - is this likely to be a real issue, or an over-zealous warning?

I checked my NVMe and I don't have an 'error log' section. I don't know
what yours means. You could try nvme-cli (that's the package in Ubuntu) eg:

$ sudo nvme error-log /dev/nvme0n1
$ sudo nvme smart-log /dev/nvme0n1
(other *-log commands available)

and see if it reports anything interesting. eg for me smart-log says:

$ sudo nvme smart-log /dev/nvme1n1
Smart Log for NVME device:nvme1n1 namespace-id:ffffffff
critical_warning : 0
temperature : 25 C
available_spare : 100%
available_spare_threshold : 5%
percentage_used : 10%
endurance group critical warning summary: 0

so I seem to have used 10% of my write endurance (I think).

It is possible doing an upgrade has eaten some of your available writes and
pushed it over some threshold.
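
To make "over some threshold" concrete: the SMART/Health log has a handful of fields that signal trouble directly. The sketch below decodes the critical_warning bitfield (bit meanings as defined for the SMART/Health log in the NVMe base specification) and adds three simple comparisons. The function name `health_flags` and the choice of extra checks are assumptions for illustration, not what any desktop tool is known to do.

```python
# Critical Warning bits from the NVMe SMART / Health Information log.
CRITICAL_WARNING_BITS = {
    0: "available spare has fallen below the threshold",
    1: "temperature outside the configured thresholds",
    2: "reliability degraded by media or internal errors",
    3: "media placed in read-only mode",
    4: "volatile memory backup device failed",
    5: "persistent memory region read-only or unreliable",
}


def health_flags(critical_warning, available_spare, spare_threshold,
                 percentage_used, media_errors):
    """Collect human-readable reasons for concern; an empty list means none found."""
    flags = [text for bit, text in CRITICAL_WARNING_BITS.items()
             if critical_warning & (1 << bit)]
    if available_spare < spare_threshold:
        flags.append("available_spare is below available_spare_threshold")
    if percentage_used >= 100:
        flags.append("percentage_used has reached the vendor endurance estimate")
    if media_errors > 0:
        flags.append("media errors have been recorded")
    return flags


# Values from the smartctl output earlier in the thread:
# Critical Warning 0x00, spare 100%, threshold 5%, used 0%, media errors 0.
print(health_flags(0x00, 100, 5, 0, 0))
# prints: []
```

By these particular checks the posted numbers raise no flag, which suggests (but does not prove) that the desktop warning is reacting to something else, such as the error log entry count.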

> I am thinking of doing two things: buying a new/larger(1TB) M.2 drive, and
> dd'ing everything over; and upgrading the firmware on this Crucial M.2
> drive
>
> - Is it particularly risky to update the M.2 firmware without backing up the drive first?

In theory it shouldn't be a risk to update the firmware (it happens in
production all the time), but if the drive is exhibiting failure signs I'd
want to make a backup first just in case.

Theo

(who hadn't come across nvme-cli before and thinks it could be a useful way
of using cheaper NVMe in servers and replacing drives when they start
running out of writes)

jkn

Apr 28, 2022, 5:24:05 AM
Thanks a lot Theo, very useful.
I installed nvme-cli and get this:

{{{ $ sudo nvme error-log /dev/nvme0n1
Error Log Entries for device:nvme0n1 entries:16
.................
Entry[ 0]
.................
error_count : 275
sqid : 0
cmdid : 0x1008
status_field : 0x2002(INVALID_FIELD: A reserved coded value or an unsupported value in a defined field)
phase_tag : 0x1
parm_err_loc : 0x28
lba : 0
nsid : 0
vs : 0
trtype : The transport type is not indicated or the error is not transport related.
cs : 0
trtype_spec_info: 0

# (all other log entries seem 'empty')
}}}
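
For anyone comparing the two tools: smartctl printed Status 0x4005 for this entry while nvme-cli prints status_field 0x2002 with phase_tag 0x1. These look like the same 16-bit word, with smartctl showing it raw and nvme-cli shifting out the phase bit. A minimal sketch of the decoding, assuming the status layout from the NVMe base specification (status code in the low 8 bits after the phase tag, then a 3-bit status code type); the function names and the short lookup table are illustrative only.

```python
# A few Generic Command Status codes (status code type 0) from the NVMe spec.
GENERIC_STATUS_CODES = {
    0x00: "Successful Completion",
    0x01: "Invalid Command Opcode",
    0x02: "Invalid Field in Command",
}


def decode_error_status(raw):
    """Split the raw 16-bit status word of an error log entry into its parts."""
    status = raw >> 1                  # drop the phase tag held in bit 0
    sct = (status >> 8) & 0x7          # status code type
    sc = status & 0xFF                 # status code
    if sct == 0:
        name = GENERIC_STATUS_CODES.get(sc, "not in this short table")
    else:
        name = "non-generic status code type"
    return {
        "phase_tag": raw & 0x1,
        "status_field": status,
        "status_code_type": sct,
        "status_code": sc,
        "name": name,
    }


def decode_param_error_location(value):
    """Byte and bit offset within the command that the controller objected to."""
    return {"byte": value & 0xFF, "bit": (value >> 8) & 0x7}


entry = decode_error_status(0x4005)
print(hex(entry["status_field"]), entry["phase_tag"], entry["name"])
# prints: 0x2002 1 Invalid Field in Command
print(decode_param_error_location(0x28))
# prints: {'byte': 40, 'bit': 0}
```

Byte 40 is the start of Command Dword 10, and sqid 0 is the admin queue, so the entry reads as the drive rejecting a field in an admin command. That is a different kind of event from a media error, although nothing in the thread establishes which command triggered it.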

{{{ $ sudo nvme smart-log /dev/nvme0n1
Smart Log for NVME device:nvme0n1 namespace-id:ffffffff
critical_warning : 0
temperature : 31 C (304 Kelvin)
available_spare : 100%
available_spare_threshold : 5%
percentage_used : 0%
endurance group critical warning summary: 0
data_units_read : 1,140,151
data_units_written : 2,357,758
host_read_commands : 14,833,879
host_write_commands : 25,231,374
controller_busy_time : 11,687
power_cycles : 236
power_on_hours : 8,834
unsafe_shutdowns : 38
media_errors : 0
num_err_log_entries : 275
Warning Temperature Time : 0
Critical Composite Temperature Time : 0
Thermal Management T1 Trans Count : 0
Thermal Management T2 Trans Count : 0
Thermal Management T1 Total Time : 0
Thermal Management T2 Total Time : 0
}}}
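
The data_units_* counters are not bytes: per the NVMe specification each unit is 1000 blocks of 512 bytes, which is how smartctl arrived at its bracketed [583 GB] and [1.20 TB] figures. A small sketch of the conversion using the numbers above; the helper name `data_units_to_bytes` is made up for illustration.

```python
# One NVMe "data unit" is 1000 blocks of 512 bytes.
DATA_UNIT_BYTES = 512 * 1000


def data_units_to_bytes(units):
    """Convert a data_units_read / data_units_written counter to bytes."""
    return units * DATA_UNIT_BYTES


written = data_units_to_bytes(2_357_758)   # data_units_written above
read = data_units_to_bytes(1_140_151)      # data_units_read above
power_on_hours = 8_834                     # power_on_hours above

print(f"written: {written / 1e12:.2f} TB")
# prints: written: 1.21 TB
print(f"read: {read / 1e9:.0f} GB")
# prints: read: 584 GB

# Average write volume per powered-on hour, in decimal megabytes.
print(f"average: {written / power_on_hours / 1e6:.0f} MB written per hour")
```

Roughly 1.2 TB written in total on a 500 GB drive that reports percentage_used 0% does not look like a drive close to its write endurance, so a distribution upgrade is unlikely to have tipped it over a wear limit on these figures alone.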

> It is possible doing an upgrade has eaten some of your available writes and
> pushed it over some threshold.

That is a good thought...

I think I will press on with buying a new 1TB M.2 drive (I was thinking of doing
that anyway, as it happens), and updating the firmware on this one
only after I have dd'd everything over and swapped to the new one.
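
For reference, what a dd clone does is simple enough to sketch: read the source in fixed-size blocks and write them to the target in order. The version below is only an illustration of that loop with a checksum added. The function name `clone_image` and the block size are assumptions, and for a real drive dd or ddrescue, run from a live system with the source unmounted, is the sensible tool.

```python
import hashlib
import os
import tempfile


def clone_image(src_path, dst_path, block_size=4 * 1024 * 1024):
    """Copy src to dst block by block; return (bytes copied, SHA-256 of the data)."""
    digest = hashlib.sha256()
    copied = 0
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        while True:
            chunk = src.read(block_size)
            if not chunk:              # end of the source reached
                break
            dst.write(chunk)
            digest.update(chunk)
            copied += len(chunk)
        dst.flush()
        os.fsync(dst.fileno())         # make sure the data has left the page cache
    return copied, digest.hexdigest()


if __name__ == "__main__":
    # Demonstration on ordinary temporary files, not on real devices.
    with tempfile.TemporaryDirectory() as tmp:
        src = os.path.join(tmp, "source.img")
        dst = os.path.join(tmp, "target.img")
        with open(src, "wb") as f:
            f.write(os.urandom(1024 * 1024))
        size, checksum = clone_image(src, dst, block_size=64 * 1024)
        print(size, checksum)
```

After cloning a 500 GB drive onto a 1 TB one, the partition table still describes the old layout, so the extra space stays unused until a partition is grown or added.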

Any recommendations for a decent M.2 1TB drive? I see a lot of slagging off
on Amazon on the Crucial P2 I have here...

Thanks, J^n

Theo

Apr 28, 2022, 6:14:08 AM
jkn <jkn...@nicorp.f9.co.uk> wrote:
> I think I will press on with buying a new 1TB M.2 drive (I was thinking of
> doing that anyway, as it happens), and updating the firmware on this one
> only after I have dd'd everything over and swapped to the new one.

Sounds reasonable.

> Any recommendations for a decent M.2 1TB drive? I see a lot of slagging off
> on Amazon on the Crucial P2 I have here...

Samsung Evo are my standard fit. I've also been using Sabrent as they have
been better at producing PCIe Gen4 drives at a decent price, although I'm a
bit more uncertain about reliability. (I have 8 of them in a server,
everything is fine thus far...)

I would avoid QLC drives (cheap but slow, sometimes HDD-slow). TLC isn't a
great deal more expensive.

Previously I would have aimed for an SSD with DRAM rather than a DRAM-less
one but they seem to be harder to find these days. DRAMless is probably
fine unless you're serving databases or similar.

Theo

Theo

Apr 28, 2022, 6:33:07 AM
Theo <theom...@chiark.greenend.org.uk> wrote:
> Previously I would have aimed for an SSD with DRAM rather than a DRAM-less
> one but they seem to be harder to find these days. DRAMless is probably
> fine unless you're serving databases or similar.

One other thing... if it's going to be under any kind of intense workload
(eg compiling) what I tend to do is look for 'performance consistency'
graphs, eg this is a cheap and old drive:

https://www.anandtech.com/show/9258/crucial-mx200-250gb-500gb-1tb-ssd-review/2

You can see the IOPS fall off a cliff once the buffer cache is exhausted.

One way to improve this is to leave some portion of the drive unwritten - eg
partition it to 900GB not 1TB and leave the last 100GB as unwritten blocks.
This gives the drive a bit more breathing space as it can have more spare
blocks to play with. Anandtech's benchmarks sometimes incorporate such
overprovisioning, eg:
https://www.anandtech.com/show/9451/the-2tb-samsung-850-pro-evo-ssd-review/2
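
The arithmetic for the 900GB-of-1TB suggestion can be sketched as below: take the drive capacity, subtract the share to leave unwritten, and round down to a 1 MiB boundary so the last partition ends on an aligned offset. The function name `partition_end_bytes`, the 10% default and the nominal 10**12-byte capacity are assumptions for illustration; a real drive reports its own exact size.

```python
MIB = 1024 * 1024


def partition_end_bytes(capacity_bytes, reserve_percent=10, align=MIB):
    """Largest aligned offset that leaves reserve_percent of the drive unpartitioned."""
    usable = capacity_bytes * (100 - reserve_percent) // 100   # integer maths, no floats
    return usable - (usable % align)                           # round down to the alignment


nominal_1tb = 10**12                    # nominal figure, not a measured capacity
end = partition_end_bytes(nominal_1tb)
print(end, "bytes partitioned,", nominal_1tb - end, "bytes left unpartitioned")
```

The reserved region only helps while its blocks stay unwritten (or have been trimmed), which is naturally the case on a new drive that has never been partitioned that far.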

You probably don't care about performance to that level but it's something I
look at when selecting drives.

Theo

jkn

Apr 28, 2022, 7:07:35 AM
Thanks Theo, that's all very useful. I do do some semi-intensive compiling so
the point is well taken.

As it happens I've just placed an order for a Samsung 970 EVO Plus (1TB),
and I will take your advice and leave some part unpartitioned.

Can anyone suggest any semi-decent NVMe USB-C enclosures to move my
Crucial P2 drive to once I have updated the firmware etc?

Thanks again, Jon N