
Re: using ddrescue on the root partition - boot with / as read-only


Michael Kjörling

Sep 13, 2023, 8:10:06 AM
On 13 Sep 2023 13:54 +0200, from vin...@vinc17.net (Vincent Lefevre):
> I need to use ddrescue on the root partition of my laptop.
>
> So I need to have the root partition mounted in read-only mode.
> How can I do that?

Boot a separate environment. For example Debian installation media
offers a rescue environment which can be used for the purpose, or you
can use live media for just about any distribution. You may need to
install ddrescue into the live environment.

--
Michael Kjörling 🔗 https://michael.kjorling.se
“Remember when, on the Internet, nobody cared that you were a dog?”

Vincent Lefevre

Sep 13, 2023, 8:10:06 AM
Hi,

I need to use ddrescue on the root partition of my laptop.

So I need to have the root partition mounted in read-only mode.
How can I do that?

Note that "mount -o remount,ro /" gives an error "mount point is busy"
apparently because various log files are open in write mode.

Using the recovery mode via GRUB (which mounts / in read-only mode)
is useless because the system remounts it later as rw.

Or is there a way to force a remount in read-only mode?
(I could probably trigger a disk error to make the kernel remount /
as read-only, but well...)

--
Vincent Lefèvre <vin...@vinc17.net> - Web: <https://www.vinc17.net/>
100% accessible validated (X)HTML - Blog: <https://www.vinc17.net/blog/>
Work: CR INRIA - computer arithmetic / AriC project (LIP, ENS-Lyon)

to...@tuxteam.de

Sep 13, 2023, 8:20:06 AM
On Wed, Sep 13, 2023 at 01:54:04PM +0200, Vincent Lefevre wrote:
> Hi,
>
> I need to use ddrescue on the root partition of my laptop.
>
> So I need to have the root partition mounted in read-only mode.
> How can I do that?

In roughly ascending order of comfort (but also of "external tools
needed"):

- break out in the initramfs, before root is pivoted to
your customary root partition. At this point your system
is running on the loaded initramfs (add 'break=premount'
to the kernel command line in GRUB ** NOTE: please, double check this,
it's from memory!)
- use a "live" OS (either explicitly built for that, like
Knoppix, or, e.g. Debian's "rescue" mode)
- extract the disk and put it (perhaps in a USB case) into
another computer

I'm sure there are others :-)

Cheers
--
t

to...@tuxteam.de

Sep 13, 2023, 8:30:06 AM
On Wed, Sep 13, 2023 at 02:15:30PM +0200, to...@tuxteam.de wrote:

[...]

> - break out in the initramfs [...]

More details on how to do that:

https://wiki.debian.org/InitramfsDebug

Cheers
--
t

Vincent Lefevre

Sep 13, 2023, 10:00:07 AM
On 2023-09-13 14:15:30 +0200, to...@tuxteam.de wrote:
> On Wed, Sep 13, 2023 at 01:54:04PM +0200, Vincent Lefevre wrote:
> > I need to use ddrescue on the root partition of my laptop.
> >
> > So I need to have the root partition mounted in read-only mode.

BTW, in recovery mode, it is systemd that remounts the partition in rw
mode. From the "journalctl -b" output:

Sep 13 13:20:13 zira kernel: EXT4-fs (dm-2): mounted filesystem fb1e7272-f798-4ae9-a53b-e62e3139e239 ro with ordered data mode. Quota mode: none.

so this is initially "ro" as wanted. But later:

Sep 13 13:20:14 zira systemd[1]: Starting systemd-remount-fs.service - Remount Root and Kernel File Systems...
Sep 13 13:20:14 zira kernel: EXT4-fs (dm-2): re-mounted fb1e7272-f798-4ae9-a53b-e62e3139e239 r/w. Quota mode: none.

> > How can I do that?
>
> In roughly ascending order of comfort (but also of "external tools
> needed"):
>
> - break out in the initramfs, before root is pivoted to
> your customary root partition. At this point your system
> is running on the loaded initramfs (enter 'break=pre-mount'
> at the grub command line ** NOTE: please, double check this,
> it's from memory!)

I suppose that it would be better to break after the mount
(break=mountroot or perhaps bottom or init): so the root partition
would already be mounted in read-only mode (otherwise, since this
is an encrypted partition, it would be more complex).

But this would also mean that I would have to run ddrescue from
the initramfs.

Or perhaps I could use /bin/sh as init, so that systemd (and its
remount as rw) would be avoided?

> - use a "live" OS (either explicitly built for that, like
> Knoppix,

If I could install it on the disk used for the backup (there is enough
space) and boot from the USB drive, this could be a solution.

> or, e.g. Debian's "rescue" mode)

I suppose that this is what is actually called "recovery mode"
in GRUB. But I suppose that I would need to add init=/bin/sh
to avoid systemd (see above).

> - extract the disk and put it (perhaps in a USB case) into
> another computer

Not really possible for me (except if everything else fails).

Stefan Monnier

Sep 13, 2023, 11:10:07 PM
> Or perhaps I could use /bin/sh as init, so that systemd (and its
> remount as rw) would be avoided?

Indeed booting with `init=/bin/bash` can be a handy option I've used in
the past: you get into the normal root (so you don't have to figure out
how to find and mount root from the initramfs), mounted read-only.
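Before starting ddrescue from that shell, it may be worth confirming that / really is still read-only. A minimal sketch, assuming /proc is mounted (the initramfs normally moves it onto the real root before running init); the helper name root_mount_mode is made up for illustration:

```shell
# Print "ro" or "rw" for the root filesystem as recorded in a mounts table.
# root_mount_mode is a hypothetical helper; it reads /proc/mounts by default.
root_mount_mode() {
    mounts_file="${1:-/proc/mounts}"
    # If "/" is listed more than once, the last entry wins.
    awk '$2 == "/" {
             n = split($4, opts, ",")
             for (i = 1; i <= n; i++)
                 if (opts[i] == "ro" || opts[i] == "rw") mode = opts[i]
         }
         END { if (mode != "") print mode }' "$mounts_file"
}

# Usage: root_mount_mode
```

If it prints "rw", something has already remounted the root filesystem and ddrescue should not be started yet.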


Stefan

David Christensen

Sep 14, 2023, 12:00:06 AM
On 9/13/23 04:54, Vincent Lefevre wrote:
> Hi,
>
> I need to use ddrescue on the root partition of my laptop.
>
> So I need to have the root partition mounted in read-only mode.
> How can I do that?
>
> Note that "mount -o remount,ro /" gives an error "mount point is busy"
> apparently because various log files are open in write mode.
>
> Using the recovery mode via GRUB (which mounts / in read-only mode)
> is useless because the system remounts it later as rw.
>
> Or is there a way to force a remount in read-only mode?
> (I could probably trigger a disk error to make the kernel remount /
> as read-only, but well...)


What symptom(s) is your laptop exhibiting that make you think that you
need to use ddrescue(1) on the root partition?

https://en.wikipedia.org/wiki/XY_problem


Have you read the "GNU ddrescue Manual"?

https://www.gnu.org/software/ddrescue/manual/ddrescue_manual.html


David

Vincent Lefevre

Sep 14, 2023, 5:30:05 AM
I've used init=/bin/sh, but bash (or zsh) would have been more
practical (as dash doesn't have completions).

Two other things are needed:
* The USB drive must be connected before booting so that the device
is recognized (otherwise lsblk doesn't list it).
* At some point, I got too many kernel messages in the console, and
I couldn't stop them (I thought that they were due to disk errors
because of ddrescue, but Ctrl-C had no effect). The solution
I found on the web was to set kernel.printk to 3 3 3 3. I did
that with:
sysctl kernel.printk="3 3 3 3"
Then I resumed ddrescue. Still ongoing...

Vincent Lefevre

Sep 14, 2023, 6:20:05 AM
On 2023-09-13 20:52:43 -0700, David Christensen wrote:
> On 9/13/23 04:54, Vincent Lefevre wrote:
> > Hi,
> >
> > I need to use ddrescue on the root partition of my laptop.
> >
> > So I need to have the root partition mounted in read-only mode.
> > How can I do that?
> >
> > Note that "mount -o remount,ro /" gives an error "mount point is busy"
> > apparently because various log files are open in write mode.
> >
> > Using the recovery mode via GRUB (which mounts / in read-only mode)
> > is useless because the system remounts it later as rw.
> >
> > Or is there a way to force a remount in read-only mode?
> > (I could probably trigger a disk error to make the kernel remount /
> > as read-only, but well...)
>
> What symptom(s) is your laptop exhibiting that make you think that you need
> to use ddrescue(1) on the root partition?

I get UNC errors like

2023-09-10T11:50:59.858670+0200 zira kernel: ata1.00: exception Emask 0x0 SAct 0xc00 SErr 0x40000 action 0x0
2023-09-10T11:51:00.117366+0200 zira kernel: ata1.00: irq_stat 0x40000008
2023-09-10T11:51:00.117431+0200 zira kernel: ata1: SError: { CommWake }
2023-09-10T11:51:00.117474+0200 zira kernel: ata1.00: failed command: READ FPDMA QUEUED
2023-09-10T11:51:00.117511+0200 zira kernel: ata1.00: cmd 60/00:50:b8:12:c5/02:00:1f:00:00/40 tag 10 ncq dma 262144 in
res 41/40:00:90:13:c5/00:02:1f:00:00/00 Emask 0x409 (media error) <F>
2023-09-10T11:51:00.117537+0200 zira kernel: ata1.00: status: { DRDY ERR }
2023-09-10T11:51:00.117560+0200 zira kernel: ata1.00: error: { UNC }
2023-09-10T11:51:00.117583+0200 zira kernel: ata1.00: supports DRM functions and may not be fully accessible
2023-09-10T11:51:00.117614+0200 zira kernel: ata1.00: supports DRM functions and may not be fully accessible
2023-09-10T11:51:00.117651+0200 zira kernel: ata1.00: configured for UDMA/133
2023-09-10T11:51:00.117681+0200 zira kernel: sd 0:0:0:0: [sda] tag#10 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK cmd_age=0s
2023-09-10T11:51:00.117953+0200 zira kernel: sd 0:0:0:0: [sda] tag#10 Sense Key : Medium Error [current]
2023-09-10T11:51:00.118165+0200 zira kernel: sd 0:0:0:0: [sda] tag#10 Add. Sense: Unrecovered read error - auto reallocate failed
2023-09-10T11:51:00.118366+0200 zira kernel: sd 0:0:0:0: [sda] tag#10 CDB: Read(10) 28 00 1f c5 12 b8 00 02 00 00
2023-09-10T11:51:00.118557+0200 zira kernel: I/O error, dev sda, sector 533009296 op 0x0:(READ) flags 0x80700 phys_seg 37 prio class 2
2023-09-10T11:51:00.118582+0200 zira kernel: ata1: EH complete
2023-09-10T11:51:00.118608+0200 zira kernel: ata1.00: Enabling discard_zeroes_data

and after these errors, the kernel remounts the root partition as
read-only. Due to these errors, some files are unreadable.

badblocks says that there are 25252 bad blocks.

I'm using ddrescue before doing anything else (mainly in case things
get worse), but I would essentially be interested in knowing
which files are affected.

The laptop is in the process of being replaced, so I don't plan to
replace the disk (unless things go really wrong).

> Have you read the "GNU ddrescue Manual"?
>
> https://www.gnu.org/software/ddrescue/manual/ddrescue_manual.html

Yes.

Michael Kjörling

Sep 14, 2023, 8:50:06 AM
On 14 Sep 2023 12:17 +0200, from vin...@vinc17.net (Vincent Lefevre):
> badblocks says that there are 25252 bad blocks.
>
> I'm using ddrescue before doing anything else (mainly in case things
> would go worse), but I would essentially be interested in knowing
> which files are affected.

There's always the brute-force way; for each ordinary file, try to cat
it to /dev/null, and log the name of the file if that fails.
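A minimal sketch of that brute-force pass, assuming a POSIX shell and find; the helper name list_unreadable_files is made up for illustration:

```shell
# List regular files under a directory that cannot be read completely.
# list_unreadable_files is a hypothetical helper name.
list_unreadable_files() {
    # -xdev keeps find on one filesystem, so /proc, /sys, etc. are skipped.
    find "$1" -xdev -type f -exec sh -c '
        for f in "$@"; do
            cat -- "$f" > /dev/null 2>&1 || printf "%s\n" "$f"
        done' sh {} +
}

# Usage (the log should live on another disk; the path is a placeholder):
#   list_unreadable_files / > /mnt/backup/unreadable-files.txt
```

Note that this reads every file in full, so it puts additional load on an already failing disk.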

Alternatively, since you have an on-disk offset, you might be able to
use `debugfs -R "stat <filename>"` or `hdparm --fibmap <filename>` or
`filefrag -e <filename>` for each file visible to figure out which
file name(s) map to that location on disk.

I couldn't immediately find a convenient way to go from an on-disk
offset to a file name directly.
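
For ext2/3/4, one possible route is debugfs's icheck (block to inode) and ncheck (inode to path) requests. The kernel log gives a 512-byte sector counted from the start of the whole disk, so it first has to be turned into a filesystem block number. A rough sketch of that arithmetic follows; the helper name sector_to_fs_block, the start sector 2048 and the 4096-byte block size are assumptions for illustration only. With LUKS and LVM in between (as with dm-2 here), the data offsets of those layers would have to be subtracted as well.

```shell
# Convert a sector number (relative to the whole disk) to a filesystem
# block number.
#   $1 = failing sector from the kernel log (512-byte units)
#   $2 = first sector of the filesystem on that disk (assumed value)
#   $3 = filesystem block size in bytes (see: tune2fs -l <device>)
sector_to_fs_block() {
    echo $(( ($1 - $2) * 512 / $3 ))
}

# Hypothetical layout: filesystem starts at sector 2048, 4096-byte blocks.
sector_to_fs_block 533009296 2048 4096

# The result could then be fed to debugfs, which opens the filesystem
# read-only unless told otherwise:
#   debugfs -R "icheck <block>" <device>
#   debugfs -R "ncheck <inode>" <device>
```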

Max Nikulin

Sep 14, 2023, 10:50:06 AM
On 14/09/2023 19:48, Michael Kjörling wrote:
> On 14 Sep 2023 12:17 +0200, from vin...@vinc17.net (Vincent Lefevre):
>> badblocks says that there are 25252 bad blocks.
>>
>> I'm using ddrescue before doing anything else (mainly in case things
>> would go worse), but I would essentially be interested in knowing
>> which files are affected.
>
> There's always the brute-force way; for each ordinary file, try to cat
> it to /dev/null, and log the name of the file if that fails.

ddrescue should be a better way to recover data from a failing disk. It
tries to minimize seeks and may retry fragments that failed during
earlier passes. It may write a session/log file (the mapfile) to track
data recovered during previous runs.
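
That file is plain text, so the unreadable areas can be listed from it afterwards. A sketch, assuming the layout described in the ddrescue manual (comment lines starting with "#", one line for the current position, then "pos size status" lines where "-" marks bad sectors); the helper name list_bad_areas is made up:

```shell
# Print "pos size" (as written in the mapfile, normally hex byte offsets)
# for every area that ddrescue marked as a bad sector ("-").
# list_bad_areas is a hypothetical helper name.
list_bad_areas() {
    awk '/^#/ || NF == 0 { next }                 # skip comments and blank lines
         !seen_status { seen_status = 1; next }   # first data line: current position
         $3 == "-" { print $1, $2 }' "$1"
}

# Usage (paths are placeholders):
#   ddrescue /dev/sdX /mnt/backup/disk.img /mnt/backup/disk.map
#   list_bad_areas /mnt/backup/disk.map
```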

Not only the root partition should be mounted read-only. No partitions
from this disk should be mounted at all. Even reads may cause more
severe damage. Boot from live media and run ddrescue from it. Check
that no partitions are automatically mounted or used as swap.
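
A small sketch of such a check, assuming the usual /proc/mounts and /proc/swaps tables; the helper name disk_in_use is made up. It only matches device paths by prefix, so LVM or LUKS volumes sitting on the disk (listed under /dev/mapper) will not show up here and are better checked with lsblk.

```shell
# Print every mount or swap entry whose source device starts with the
# given prefix (e.g. /dev/sda). No output means nothing from that disk
# is listed. disk_in_use is a hypothetical helper name.
disk_in_use() {
    dev="$1"
    mounts="${2:-/proc/mounts}"
    swaps="${3:-/proc/swaps}"
    awk -v d="$dev" 'index($1, d) == 1 { print FILENAME ": " $0 }' "$mounts" "$swaps"
}

# Usage: disk_in_use /dev/sda
```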

It is better to postpone badblocks until the disk image has been created.

If the data are really precious, then seek out a specialized service.

Vincent Lefevre

Sep 14, 2023, 11:40:34 AM
On 2023-09-14 21:44:18 +0700, Max Nikulin wrote:
> If data are really precious then seek for a specialized service.

I normally have 2+ backups for important data. But I'd like to
double-check with what is no longer readable on the laptop disk.

Stefan Monnier

Sep 14, 2023, 12:10:06 PM
>> Indeed booting with `init=/bin/bash` can be a handy option I've used in
>> the past: you get into the normal root (so you don't have to figure out
>> how to find and mount root from the initramfs), mounted read-only.

[ One other advantage over `break=premount` and friends is that I find
it much easier to remember for some reason. I still need to resort
to things like `break=premount` every once in a while, e.g. when
I need to rename the root LVM volume group. ]

> I've used init=/bin/sh, but bash (or zsh) would have been more
> practical (as dash doesn't have completions).

You can always launch `bash` from `/bin/sh` :-)

> Then I resumed ddrescue. Still onging...

Good luck!


Stefan

David Christensen

Sep 15, 2023, 1:30:07 AM
What is the make and model of the laptop?


What is the make and model of the disk drive?


When and where do you see the above error messages?


> and after these errors, the kernel remount the root partition as
> read-only.


That sounds like a reasonable boot loader response to an OS drive error
during boot.


> Due to these errors, some files are unreadable.
>
> badblocks says that there are 25252 bad blocks.
>
> I'm using ddrescue before doing anything else (mainly in case things
> would go worse), but I would essentially be interested in knowing
> which files are affected.


Was the computer working correctly in the past?


When did you first notice the error messages? What was the computer
doing at the time?


Did you make any changes to the computer (hardware, software,
configuration, apps, other) immediately prior to the start of the error
messages?


Does the computer now generate error messages? Consistently? What is
it doing when the error messages are generated?


> The laptop is in the process of being replaced, so I don't plan to
> replace the disk (unless things get really wrong).


Then perhaps you should get the replacement and decommission the old
laptop (remove the disk drive, have it shredded, and resell, recycle, or
reuse the laptop).


>> Have you read the "GNU ddrescue Manual"?
>>
>> https://www.gnu.org/software/ddrescue/manual/ddrescue_manual.html
>
> Yes.


Okay.


It sounds like you are booting a computer with OS drive issues and
attempting to use that computer to trouble-shoot itself. If the issue
seems minor, or is familiar, I might do the same. Otherwise, I do one
or more of the following:

* Browse the disk drive manufacturer's web site for a bootable drive
diagnostic tool. If available, I download the tool, burn it to media,
and use it to trouble-shoot the disk drive.

* I install Debian with Xfce onto a good USB 3.0 flash drive. I then
install my favorite system administration and trouble-shooting packages.
I boot the USB flash drive in suitable computers and use it to
trouble-shoot the computer and/or components (including disk drives).

* I remove the disk from the computer, install it in another computer, and
use the other computer to trouble-shoot the disk.


David

Vincent Lefevre

Sep 15, 2023, 8:50:06 AM
On 2023-09-14 22:24:59 -0700, David Christensen wrote:
> On 9/14/23 03:17, Vincent Lefevre wrote:
> > I get UNC errors like
> >
> > 2023-09-10T11:50:59.858670+0200 zira kernel: ata1.00: exception Emask 0x0 SAct 0xc00 SErr 0x40000 action 0x0
> > 2023-09-10T11:51:00.117366+0200 zira kernel: ata1.00: irq_stat 0x40000008
> > 2023-09-10T11:51:00.117431+0200 zira kernel: ata1: SError: { CommWake }
> > 2023-09-10T11:51:00.117474+0200 zira kernel: ata1.00: failed command: READ FPDMA QUEUED
> > 2023-09-10T11:51:00.117511+0200 zira kernel: ata1.00: cmd 60/00:50:b8:12:c5/02:00:1f:00:00/40 tag 10 ncq dma 262144 in
> > res 41/40:00:90:13:c5/00:02:1f:00:00/00 Emask 0x409 (media error) <F>
> > 2023-09-10T11:51:00.117537+0200 zira kernel: ata1.00: status: { DRDY ERR }
> > 2023-09-10T11:51:00.117560+0200 zira kernel: ata1.00: error: { UNC }
> > 2023-09-10T11:51:00.117583+0200 zira kernel: ata1.00: supports DRM functions and may not be fully accessible
> > 2023-09-10T11:51:00.117614+0200 zira kernel: ata1.00: supports DRM functions and may not be fully accessible
> > 2023-09-10T11:51:00.117651+0200 zira kernel: ata1.00: configured for UDMA/133
> > 2023-09-10T11:51:00.117681+0200 zira kernel: sd 0:0:0:0: [sda] tag#10 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK cmd_age=0s
> > 2023-09-10T11:51:00.117953+0200 zira kernel: sd 0:0:0:0: [sda] tag#10 Sense Key : Medium Error [current]
> > 2023-09-10T11:51:00.118165+0200 zira kernel: sd 0:0:0:0: [sda] tag#10 Add. Sense: Unrecovered read error - auto reallocate failed
> > 2023-09-10T11:51:00.118366+0200 zira kernel: sd 0:0:0:0: [sda] tag#10 CDB: Read(10) 28 00 1f c5 12 b8 00 02 00 00
> > 2023-09-10T11:51:00.118557+0200 zira kernel: I/O error, dev sda, sector 533009296 op 0x0:(READ) flags 0x80700 phys_seg 37 prio class 2
> > 2023-09-10T11:51:00.118582+0200 zira kernel: ata1: EH complete
> > 2023-09-10T11:51:00.118608+0200 zira kernel: ata1.00: Enabling discard_zeroes_data
>
> What is the make and model of the laptop?

HP ZBook 15 G2 (2015)

> What is the make and model of the disk drive?

Samsung 870 EVO 1TB SATA (since January 2022)

> When and where do you see the above error messages?

It seems that this occurs when bad sectors are read, either when some
files (using these bad sectors) are read or when I use the badblocks
utility (until now, I've used it only with the read test, i.e. with
no options). The messages appear in the journalctl output.

> > and after these errors, the kernel remount the root partition as
> > read-only.
>
> That sounds like a reasonable boot loader response to an OS drive error
> during boot.

There are no errors during boot. Only when I read the affected files
or use badblocks, but only after some given number of errors.

> > Due to these errors, some files are unreadable.
> >
> > badblocks says that there are 25252 bad blocks.
> >
> > I'm using ddrescue before doing anything else (mainly in case things
> > would go worse), but I would essentially be interested in knowing
> > which files are affected.
>
> Was the computer working correctly in the past?

Yes, except a few days before the first disk errors on 6 December 2022:
I got crashes from time to time (which never happened before). About
2 hours before the first errors, I upgraded the kernel and the NVIDIA
drivers from 390.154 to 390.157. In the changelog of 390.157-1:

nvidia-graphics-drivers-legacy-390xx (390.157-1) unstable; urgency=medium

* New upstream legacy branch release 390.157 (2022-11-22).
* Fixed CVE-2022-34670, CVE-2022-34674, CVE-2022-34675, CVE-2022-34677,
CVE-2022-34680, CVE-2022-42257, CVE-2022-42258, CVE-2022-42259.
https://nvidia.custhelp.com/app/answers/detail/a_id/5415
(Closes: #1025281)
* Improved compatibility with recent Linux kernels.

[ Andreas Beckmann ]
* Refresh patches.
* Rename the internally used ARCH variable which might clash on externally
set values.
* Use substitutions for ${nvidia-kernel} and friends (510.108.03-1).
* Try to compile a kernel module at package build time (510.108.03-1).

-- Andreas Beckmann <an...@debian.org> Sat, 03 Dec 2022 22:17:01 +0100

I'm wondering whether the crashes were due to the compatibility
with the kernel (which was the latest Debian/unstable one).

> When did you first notice the error messages? What was the computer doing
> at the time?

I first got errors on 6 December 2022 when I was reading these files.
At that time, I identified 5 files, which I put in a
private/unreadable-files directory. Then everything was OK
until a few days ago, when I wanted to duplicate a big directory
(to try to reproduce a bug).

> Did you make any changes to the computer (hardware, software, configuration,
> apps, other) immediately prior to the start of the error messages?

See above (and no hardware change).

> Does the computer now generate error messages? Consistently? What is it
> doing when the error messages are generated?

I get errors only when I read some particular files.

David Christensen

Sep 15, 2023, 5:00:06 PM
On 9/15/23 05:46, Vincent Lefevre wrote:
> On 2023-09-14 22:24:59 -0700, David Christensen wrote:
>> On 9/14/23 03:17, Vincent Lefevre wrote:
>>> I get UNC errors like
>>>
>>> 2023-09-10T11:50:59.858670+0200 zira kernel: ata1.00: exception Emask 0x0 SAct 0xc00 SErr 0x40000 action 0x0
>>> 2023-09-10T11:51:00.117366+0200 zira kernel: ata1.00: irq_stat 0x40000008
>>> 2023-09-10T11:51:00.117431+0200 zira kernel: ata1: SError: { CommWake }
>>> 2023-09-10T11:51:00.117474+0200 zira kernel: ata1.00: failed command: READ FPDMA QUEUED
>>> 2023-09-10T11:51:00.117511+0200 zira kernel: ata1.00: cmd 60/00:50:b8:12:c5/02:00:1f:00:00/40 tag 10 ncq dma 262144 in
>>> res 41/40:00:90:13:c5/00:02:1f:00:00/00 Emask 0x409 (media error) <F>
>>> 2023-09-10T11:51:00.117537+0200 zira kernel: ata1.00: status: { DRDY ERR }
>>> 2023-09-10T11:51:00.117560+0200 zira kernel: ata1.00: error: { UNC }
>>> 2023-09-10T11:51:00.117583+0200 zira kernel: ata1.00: supports DRM functions and may not be fully accessible
>>> 2023-09-10T11:51:00.117614+0200 zira kernel: ata1.00: supports DRM functions and may not be fully accessible
>>> 2023-09-10T11:51:00.117651+0200 zira kernel: ata1.00: configured for UDMA/133
>>> 2023-09-10T11:51:00.117681+0200 zira kernel: sd 0:0:0:0: [sda] tag#10 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK cmd_age=0s
>>> 2023-09-10T11:51:00.117953+0200 zira kernel: sd 0:0:0:0: [sda] tag#10 Sense Key : Medium Error [current]
>>> 2023-09-10T11:51:00.118165+0200 zira kernel: sd 0:0:0:0: [sda] tag#10 Add. Sense: Unrecovered read error - auto reallocate failed
>>> 2023-09-10T11:51:00.118366+0200 zira kernel: sd 0:0:0:0: [sda] tag#10 CDB: Read(10) 28 00 1f c5 12 b8 00 02 00 00
>>> 2023-09-10T11:51:00.118557+0200 zira kernel: I/O error, dev sda, sector 533009296 op 0x0:(READ) flags 0x80700 phys_seg 37 prio class 2
>>> 2023-09-10T11:51:00.118582+0200 zira kernel: ata1: EH complete
>>> 2023-09-10T11:51:00.118608+0200 zira kernel: ata1.00: Enabling discard_zeroes_data
>>
>> What is the make and model of the laptop?
>
> HP ZBook 15 G2 (2015)


That is a good laptop.


>
>> What is the make and model of the disk drive?
>
> Samsung 870 EVO 1TB SATA (since January 2022)


That is a good SSD.


>
>> When and where do you see the above error messages?
>
> It seems that this occurs when bad sectors are read, either when some
> files (using these bad sectors) are read or when I use the badblocks
> utility (until now, I've used it only with the read test, i.e. with
> no options). The messages appear in the journalctl output.


Okay.


>
>>> and after these errors, the kernel remount the root partition as
>>> read-only.
>>
>> That sounds like a reasonable boot loader response to an OS drive error
>> during boot.
>
> There are no errors during boot. Only when I read the affected files
> or use badblocks, but only after some given number of errors.


Oops -- I misread "remount" as "mount".


>>> Due to these errors, some files are unreadable.
>>>
>>> badblocks says that there are 25252 bad blocks.


That number is large enough to make me worry.


>>>
>>> I'm using ddrescue before doing anything else (mainly in case things
>>> would go worse), but I would essentially be interested in knowing
>>> which files are affected.
>>
>> Was the computer working correctly in the past?
>
> Yes, except a few days before the first disk errors on 6 December 2022:
> I got crashes from time to time (which never happened before). About
> 2 hours before the first errors, I upgraded the kernel and the NVIDIA
> drivers from 390.154 to 390.157. In the changelog of 390.157-1:
>
> nvidia-graphics-drivers-legacy-390xx (390.157-1) unstable; urgency=medium
>
> * New upstream legacy branch release 390.157 (2022-11-22).
> * Fixed CVE-2022-34670, CVE-2022-34674, CVE-2022-34675, CVE-2022-34677,
> CVE-2022-34680, CVE-2022-42257, CVE-2022-42258, CVE-2022-42259.
> https://nvidia.custhelp.com/app/answers/detail/a_id/5415
> (Closes: #1025281)
> * Improved compatibility with recent Linux kernels.
>
> [ Andreas Beckmann ]
> * Refresh patches.
> * Rename the internally used ARCH variable which might clash on externally
> set values.
> * Use substitutions for ${nvidia-kernel} and friends (510.108.03-1).
> * Try to compile a kernel module at package build time (510.108.03-1).
>
> -- Andreas Beckmann <an...@debian.org> Sat, 03 Dec 2022 22:17:01 +0100
>
> I'm wondering whether the crashes were due to the compatibility
> with the kernel (which was the latest Debian/unstable one).


The sum total of the clues make me think the SSD is failing.


>
>> When did you first notice the error messages? What was the computer doing
>> at the time?
>
> I first got errors on 6 December 2022 when I was reading these files.
> At that time, I identified 5 files, which I put in a
> private/unreadable-files directory. Then everything was OK
> until a few days ago, when I wanted to duplicate a big directory
> (to try to reproduce a bug).
>
>> Did you make any changes to the computer (hardware, software, configuration,
>> apps, other) immediately prior to the start of the error messages?
>
> See above (and no hardware change).
>
>> Does the computer now generate error messages? Consistently? What is it
>> doing when the error messages are generated?
>
> I get errors only when I read some particular files.


I suggest:

1. Keep your backups safe. Run an incremental backup to get newer
files that can be read. Forget about files that cannot be read.

2. If you do not have a backup of a file that cannot be read and you
need that data, send the SSD to the manufacturer or a service for data
recovery.

3. Otherwise, get a SMART extended report for the SSD:

# smartctl -x /dev/disk/by-id/ata-pick-the-correct-disk

4. Get disk partitioning, etc., information for the SSD:

# fdisk -l /dev/disk/by-id/ata-pick-the-correct-disk

(Relevant LVM, LUKS, or other commands, as appropriate).

5. Use the SSD manufacturer diagnostic tool to gather information, run
tests, update the firmware, secure erase the SSD, and test again.


If the manufacturer diagnostic cannot secure erase the SSD, physically
destroy the SSD.


If the manufacturer diagnostic can secure erase the SSD, but the SSD
cannot pass all tests, recycle the SSD.


If the manufacturer diagnostic can secure erase the SSD and the SSD can
pass all tests, get an updated SMART report, pick a suitable use for the
drive (ZFS cache device comes to mind), deploy the SSD, and monitor the
SSD frequently going forward.


David