smartctl reports an HDD error, but the controller does not

Jan Novak

Oct 6, 2021, 9:56:07 AM

Hello,

I have a Dell server running here with a SCSI controller that has 5
disks attached. perccli shows nothing unusual. The controller's BIOS
does not complain about any disk having a problem either (it also
reads the SMART table).
According to the smartmontools daemon (smartd), however, there are these errors:

---

This message was generated by the smartd daemon running on:

host name: [name]
DNS domain: [domain]

The following warning/error was logged by the smartd daemon:

Device: /dev/bus/1 [megaraid_disk_02] [SAT], 2 Offline uncorrectable sectors

Device info:
WDC WD6003FFBX-68MU3N0, S/N:V8GYGA7R, WWN:5-000cca-098cd648e,
FW:83.00A83, 6.00 TB

For details see host's SYSLOG.

You can also use the smartctl utility for further investigation.
The original message about this issue was sent at Fri Jun 5 14:35:16
2020 CEST
Another message will be sent in 24 hours if the problem persists.

---

On the console I see this:

smartctl -a /dev/sda -d sat+megaraid,02|grep -i error
was completed without error.
Error logging capability: (0x01) Error logging supported.
SCT Error Recovery Control supported.
1 Raw_Read_Error_Rate 0x000b 100 100 016 Pre-fail Always - 0
7 Seek_Error_Rate 0x000a 100 100 067 Old_age Always - 0
199 UDMA_CRC_Error_Count 0x000a 200 200 000 Old_age Always - 0
SMART Error Log Version: 1
ATA Error Count: 16 (device log contains only the most recent five errors)
ER = Error register [HEX]
Error 16 occurred at disk power-on lifetime: 17690 hours (737 days + 2 hours)
When the command that caused the error occurred, the device was active or idle.
40 51 00 a7 01 0b 40 Error: UNC at LBA = 0x000b01a7 = 721319
Commands leading to the command that caused the error were:
Error 15 occurred at disk power-on lifetime: 17690 hours (737 days + 2 hours)
When the command that caused the error occurred, the device was active or idle.
40 51 00 a6 01 0b 40 Error: UNC at LBA = 0x000b01a6 = 721318
Commands leading to the command that caused the error were:
Error 14 occurred at disk power-on lifetime: 17690 hours (737 days + 2 hours)
When the command that caused the error occurred, the device was active or idle.
40 51 00 a5 01 0b 40 Error: UNC at LBA = 0x000b01a5 = 721317
Commands leading to the command that caused the error were:
Error 13 occurred at disk power-on lifetime: 17690 hours (737 days + 2 hours)
When the command that caused the error occurred, the device was active or idle.
40 51 00 a4 01 0b 40 Error: UNC at LBA = 0x000b01a4 = 721316
Commands leading to the command that caused the error were:
Error 12 occurred at disk power-on lifetime: 17690 hours (737 days + 2 hours)
When the command that caused the error occurred, the device was active or idle.
40 51 00 a3 01 0b 40 Error: UNC at LBA = 0x000b01a3 = 721315
Commands leading to the command that caused the error were:

Can anyone interpret this?

Jan

Marco Moock

Oct 6, 2021, 1:45:17 PM

On Wed, 6 Oct 2021 15:56:07 +0200
Jan Novak <rep...@gmail.com> wrote:

> Device: /dev/bus/1 [megaraid_disk_02] [SAT], 2 Offline uncorrectable
> sectors

That is not exactly good and points to an impending failure.
If the value increases, this disk is close to failing.
More information is available here: https://techoverflow.net/2016/07/25/how-to-interpret-smartctl-messages-like-error-unc-at-lba/
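
As a rough illustration of how the "Error: UNC at LBA" lines in the posted log can be read: the seven hex bytes are the ATA registers in the order smartctl prints them (ER ST SC SN CL CH DH), and for 28-bit commands the LBA is assembled from DH, CH, CL and SN. The helper name `decode_error_registers` is made up for this sketch, and the 28-bit layout is an assumption about the command that failed.

```python
def decode_error_registers(line):
    """Decode the seven leading hex bytes of an ATA error log line.

    Assumed register order (as printed by smartctl): ER ST SC SN CL CH DH.
    Assumes a 28-bit LBA command; 48-bit commands use a different layout.
    """
    er, st, sc, sn, cl, ch, dh = (int(b, 16) for b in line.split()[:7])
    # 28-bit LBA: low nibble of DH is the top 4 bits, then CH, CL, SN.
    lba = ((dh & 0x0F) << 24) | (ch << 16) | (cl << 8) | sn
    return {
        "unc": bool(er & 0x40),         # error register bit 6: uncorrectable data
        "status_err": bool(st & 0x01),  # status register bit 0: ERR
        "lba": lba,
    }


# Line copied from the posted output.
info = decode_error_registers(
    "40 51 00 a7 01 0b 40 Error: UNC at LBA = 0x000b01a7 = 721319"
)
print(info)
```

For the line above, the bytes `a7 01 0b` (SN, CL, CH) read back to front give 0x0b01a7, which matches the LBA 721319 that smartctl itself prints.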

I recommend checking the disks individually with testdisk.
If errors show up then, get rid of the disk; it only causes trouble.
PS: You should always have a backup.

--
Marco

Jan Novak

Oct 7, 2021, 1:02:58 AM

On 06.10.21 at 19:45, Marco Moock wrote:

> On Wed, 6 Oct 2021 15:56:07 +0200
> Jan Novak <rep...@gmail.com> wrote:
>
>> Device: /dev/bus/1 [megaraid_disk_02] [SAT], 2 Offline uncorrectable
>> sectors
> That is not exactly good and points to an impending failure.
> If the value increases, this disk is close to failing.
> More information is available here: https://techoverflow.net/2016/07/25/how-to-interpret-smartctl-messages-like-error-unc-at-lba/

My question is rather: how do I get at exactly this (SMART) data?
On the console I do not see these errors at all.

> I recommend checking the disks individually with testdisk.
> If errors show up then, get rid of the disk; it only causes trouble.

Sure. But it has to be certain that the disk really is defective and
is not being replaced because of a misinterpretation.
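
One way to reduce the room for misinterpretation is to restate what the posted error log actually contains. The sketch below parses the five entries from the first message (wrapped lines rejoined) and reports the power-on hours and the LBA range. The function name `summarize` is invented for this example, and the result says nothing about the current state of those sectors.

```python
import re

# Error entries copied from the posted smartctl output (wrapped lines rejoined).
LOG = """\
Error 16 occurred at disk power-on lifetime: 17690 hours (737 days + 2 hours)
40 51 00 a7 01 0b 40 Error: UNC at LBA = 0x000b01a7 = 721319
Error 15 occurred at disk power-on lifetime: 17690 hours (737 days + 2 hours)
40 51 00 a6 01 0b 40 Error: UNC at LBA = 0x000b01a6 = 721318
Error 14 occurred at disk power-on lifetime: 17690 hours (737 days + 2 hours)
40 51 00 a5 01 0b 40 Error: UNC at LBA = 0x000b01a5 = 721317
Error 13 occurred at disk power-on lifetime: 17690 hours (737 days + 2 hours)
40 51 00 a4 01 0b 40 Error: UNC at LBA = 0x000b01a4 = 721316
Error 12 occurred at disk power-on lifetime: 17690 hours (737 days + 2 hours)
40 51 00 a3 01 0b 40 Error: UNC at LBA = 0x000b01a3 = 721315
"""


def summarize(log):
    """Collect the distinct power-on hours and the LBA range of UNC errors."""
    hours = sorted({int(h) for h in re.findall(r"power-on lifetime: (\d+) hours", log)})
    lbas = sorted(int(n) for n in re.findall(r"UNC at LBA = 0x[0-9a-f]+ = (\d+)", log))
    # True if every LBA directly follows the previous one.
    contiguous = all(b - a == 1 for a, b in zip(lbas, lbas[1:]))
    return {"hours": hours, "lba_range": (lbas[0], lbas[-1]), "contiguous": contiguous}


summary = summarize(LOG)
print(summary)
```

All five logged errors fall into the same power-on hour (17690) and cover the adjacent LBAs 721315 to 721319. Comparing that hour count with the disk's current power-on hours would show how long ago the errors were recorded.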

> PS: You should always have a backup.

RAID 5 on a SCSI controller ... that's fine.
(and of course backups elsewhere)

Jan

Marco Moock

Oct 7, 2021, 1:55:11 AM

On Thu, 7 Oct 2021 07:02:57 +0200
Jan Novak <rep...@gmail.com> wrote:

> My question is rather: how do I get at exactly this (SMART) data?
> On the console I do not see these errors at all.

sudo smartctl /dev/sda -a

shows them to you. Adjust the name of the disk if necessary.
This has to be done for each disk individually.
These values are explained on Wikipedia:
https://de.wikipedia.org/wiki/Self-Monitoring,_Analysis_and_Reporting_Technology
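
For the setup in this thread, the original command addressed a physical disk behind the controller with `-d sat+megaraid,02`, so a per-disk query would presumably vary that number rather than the device name. The sketch below only builds the command lines and does not run them. The function name `smartctl_commands` and the device IDs 0 to 4 are assumptions; the real IDs depend on the controller and may not be consecutive.

```python
import shlex


def smartctl_commands(device, disk_ids):
    """Build one smartctl argument list per physical disk behind the controller."""
    return [
        ["smartctl", "-a", "-d", f"sat+megaraid,{n}", device]
        for n in disk_ids
    ]


# Assumed: five disks with device IDs 0..4 (hypothetical, check the controller).
commands = smartctl_commands("/dev/sda", range(5))
for argv in commands:
    print(shlex.join(argv))
```

Each printed line can be run as root, or the argument lists can be passed to `subprocess.run` to collect the output per disk.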

--
Marco

Laurenz Trossel

Oct 9, 2021, 1:43:29 PM

On 2021-10-06, Jan Novak <rep...@gmail.com> wrote:

> Device: /dev/bus/1 [megaraid_disk_02] [SAT], 2 Offline uncorrectable sectors
                                                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> smartctl -a /dev/sda -d sat+megaraid,02|grep -i error
                                          ^^^^^^^^^^^^^
> Can anyone interpret this?

You are filtering the information out yourself.
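
The effect of the filter can be shown with a small emulation of `grep -i`. The first sample line is taken from the posted output. The second is hypothetical: it stands in for the attribute behind smartd's "2 Offline uncorrectable sectors" message, with the columns between name and raw value left out. The helper `grep_i` and the wider pattern are only illustrative.

```python
import re

LINES = [
    # From the posted output:
    "1 Raw_Read_Error_Rate 0x000b 100 100 016 Pre-fail Always - 0",
    # HYPOTHETICAL row, not from the thread; only the name and the raw value 2
    # (taken from the smartd message) matter here:
    "198 Offline_Uncorrectable [columns omitted] 2",
]


def grep_i(pattern, lines):
    """Return the lines matching pattern, ignoring case (like grep -i)."""
    rx = re.compile(pattern, re.IGNORECASE)
    return [line for line in lines if rx.search(line)]


narrow = grep_i("error", LINES)              # what the original command kept
wide = grep_i("error|uncorrect", LINES)      # a wider, illustrative pattern

print(narrow)
print(wide)
```

The attribute name contains "Uncorrectable" but not "error", so the narrow filter drops exactly the line smartd was warning about. Running smartctl without the grep, or with a wider pattern, keeps it.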