How to fix a bad block on one disk in a RAID1 array?

Andrey Suprun

May 10, 2017, 10:49:31 AM
to Alt-F
Hello.

Today I got an e-mail from my D-Link DNS-323 rev. B1 stating that the self-test log error count on /dev/sda increased from 0 to 1. I checked the status of this HDD and got this:

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%     19071         -
# 2  Short offline       Completed: read failure       60%     19049         6112
# 3  Extended offline    Completed without error       00%     19015         -
# 4  Extended offline    Completed without error       00%     18972         -
# 5  Short offline       Completed without error       00%     18953         -
# 6  Short offline       Completed without error       00%     18934         -
# 7  Short offline       Completed without error       00%     18910         -
# 8  Short offline       Completed without error       00%     18881         -
# 9  Short offline       Completed without error       00%     18862         -
#10  Extended offline    Completed without error       00%     18823         -
#11  Short offline       Completed without error       00%     18785         -
#12  Short offline       Completed without error       00%     18767         -
#13  Short offline       Completed without error       00%     18743         -
#14  Short offline       Completed without error       00%     18713         -
#15  Short offline       Completed without error       00%     18695         -
#16  Short offline       Completed without error       00%     18641         -
#17  Short offline       Completed without error       00%     18617         -
#18  Extended offline    Completed without error       00%     18602         -
#19  Extended offline    Completed without error       00%     18576         -
#20  Short offline       Completed without error       00%     18546         -
#21  Short offline       Completed without error       00%     18522         -

/dev/sda is one of two disks in a RAID1 array. The other disk is fine.

Please tell me what I should do to fix this.

João Cardoso

May 10, 2017, 12:14:34 PM
to Alt-F
I would need the full report for a better diagnosis.
There is little one can do about it other than keep an eye on some parameters and replace the disk if any of them deteriorate -- it could have been a single glitch or the start of a trend. As it is RAID1, you can wait for one disk to fail completely and replace it then.

Andrey Suprun

May 11, 2017, 6:25:36 AM
to Alt-F
How can I get the full report?

I think that I should force the failed HDD to reallocate its bad sectors, so that they will not be used in the array, no?

João Cardoso

May 11, 2017, 1:06:46 PM
to al...@googlegroups.com


On Thursday, 11 May 2017 11:25:36 UTC+1, Andrey Suprun wrote:
How can I get the full report?

Disk->Utilities, Health, Show Status
 

I think that I should force the failed HDD to reallocate its bad sectors,

That is done automatically by the drive: when a sector read fails, the sector is marked as pending and is later relocated/mapped to a spare sector. That is what "Reallocated_Sector_Ct" and "Current_Pending_Sector" in the SMART log track. Most of the time the content of that sector is lost.

All disks have bad sectors, and more bad sectors develop over time. You should be concerned if that keeps happening or the count grows at a "fast" rate.

To force a remapping you have to read every sector of the disk, or even write bit patterns to each sector. That's what the (outdated) 'badblocks' program from the e2fsprogs-badblocks package does; it takes days to complete and requires some user expertise.
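
To make "reading every sector" concrete, here is a minimal read-only sketch in Python. It is not how 'badblocks' is implemented and not a replacement for it: the function name and the 4096-byte block size are my own assumptions, and it ignores caching and the write-pattern mode entirely. It only shows the idea of touching every block and noting the ones that cannot be read.

```python
import os


def scan_readable(path, block_size=4096):
    """Read `path` block by block and return (total_blocks, bad_blocks).

    Hypothetical helper for illustration only. `path` can be a regular
    file or, with enough privileges, a block device. Nothing is written.
    """
    bad_blocks = []
    fd = os.open(path, os.O_RDONLY)
    try:
        # Seeking to the end gives the size for files and Linux block devices.
        size = os.lseek(fd, 0, os.SEEK_END)
        total_blocks = (size + block_size - 1) // block_size
        for block in range(total_blocks):
            try:
                os.pread(fd, block_size, block * block_size)
            except OSError:
                # The read failed: remember the block index and carry on.
                bad_blocks.append(block)
    finally:
        os.close(fd)
    return total_blocks, bad_blocks
```

On a 3 TB disk a loop like this has to issue hundreds of millions of reads, which gives a feeling for why a full scan takes so long.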

If the error is still developing, more errors can appear after "badblocks" finishes, and you will be faced with the problem again. Some people question whether running "badblocks" stresses the disk more than its normal (home/desktop) usage does. Even RAID resyncs put an extra load on the disks.

Eventually the spare sector area is used up and the disk becomes unusable.
You should receive more smartd e-mails when it predicts that the disk will fail (but remember weather forecasts :-) 

Andrey Suprun

May 11, 2017, 4:13:18 PM
to Alt-F
Thank you, João, for the detailed answer. Here is the full report:

smartctl 6.4 2015-06-04 r4109 [armv5tel-linux-3.18.28] (local build)
Copyright (C) 2002-15, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Western Digital Green
Device Model:     WDC WD30EZRX-00MMMB0
Serial Number:    WD-WCAWZ1950221
LU WWN Device Id: 5 0014ee 206cb7224
Firmware Version: 80.00A80
User Capacity:    3,000,592,982,016 bytes [3.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ATA8-ACS (minor revision not indicated)
SATA Version is:  SATA 3.0, 6.0 Gb/s (current: 3.0 Gb/s)
Local Time is:    Thu May 11 23:07:04 2017 EEST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       4
  3 Spin_Up_Time            0x0027   155   150   021    Pre-fail  Always       -       9250
  4 Start_Stop_Count        0x0032   097   097   000    Old_age   Always       -       3284
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   074   074   000    Old_age   Always       -       19101
 10 Spin_Retry_Count        0x0032   100   100   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   100   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       123
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       20
193 Load_Cycle_Count        0x0032   199   199   000    Old_age   Always       -       5019
194 Temperature_Celsius     0x0022   125   103   000    Old_age   Always       -       27
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%     19091         -
# 2  Short offline       Completed without error       00%     19071         -
# 3  Short offline       Completed: read failure       60%     19049         6112
# 4  Extended offline    Completed without error       00%     19015         -
# 5  Extended offline    Completed without error       00%     18972         -
# 6  Short offline       Completed without error       00%     18953         -
# 7  Short offline       Completed without error       00%     18934         -
# 8  Short offline       Completed without error       00%     18910         -
# 9  Short offline       Completed without error       00%     18881         -
#10  Short offline       Completed without error       00%     18862         -
#11  Extended offline    Completed without error       00%     18823         -
#12  Short offline       Completed without error       00%     18785         -
#13  Short offline       Completed without error       00%     18767         -
#14  Short offline       Completed without error       00%     18743         -
#15  Short offline       Completed without error       00%     18713         -
#16  Short offline       Completed without error       00%     18695         -
#17  Short offline       Completed without error       00%     18641         -
#18  Short offline       Completed without error       00%     18617         -
#19  Extended offline    Completed without error       00%     18602         -
#20  Extended offline    Completed without error       00%     18576         -
#21  Short offline       Completed without error       00%     18546         -
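
For reference, the LBA_of_first_error in the log above (6112) can be translated into a byte offset and a physical sector using the "Sector Sizes" line of the same report (512 bytes logical, 4096 bytes physical). This is only a small arithmetic sketch; the function name is made up, and the LBA is counted from the start of the whole disk, not from the start of a partition.

```python
def locate_lba(lba, logical_bytes=512, physical_bytes=4096):
    """Return (byte_offset, physical_sector, first_lba_of_that_sector).

    Hypothetical helper; the default sizes come from the "Sector Sizes"
    line of the smartctl report.
    """
    lbas_per_physical = physical_bytes // logical_bytes
    byte_offset = lba * logical_bytes
    physical_sector = lba // lbas_per_physical
    first_lba = physical_sector * lbas_per_physical
    return byte_offset, physical_sector, first_lba


# LBA 6112 is byte offset 3129344 and falls in physical sector 764,
# which covers logical sectors 6112 to 6119.
print(locate_lba(6112))
```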

Should I do something or just keep an eye on that disk?

João Cardoso

May 12, 2017, 10:52:42 AM
to Alt-F


On Thursday, 11 May 2017 21:13:18 UTC+1, Andrey Suprun wrote:
Thank you, João, for the detailed answer. Here is the full report:

... 
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  ...
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  ...
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
...
... 

Should I do something or just keep an eye on that disk?

I don't see anything that would worry me. And you have RAID1.

Just keep an eye on it, particularly "Reallocated_Sector_Ct". Its current value is 200 and the threshold is 140; if the current or worst value starts decreasing (higher is better) and approaches the 140 threshold, start looking for a new disk.
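
If you want to automate that check, something along these lines could work. It is a rough Python sketch that assumes the ten-column attribute table layout shown in the report; the function names are my own, and the "margin" is simply the lower of VALUE and WORST minus THRESH.

```python
def parse_attributes(report):
    """Parse the 'ID# ATTRIBUTE_NAME ...' table from smartctl text output.

    Hypothetical helper: rows are recognised by having at least ten
    columns and a numeric ID in the first one.
    """
    attrs = {}
    for line in report.splitlines():
        fields = line.split()
        if len(fields) >= 10 and fields[0].isdigit():
            attrs[fields[1]] = {
                "value": int(fields[3]),
                "worst": int(fields[4]),
                "thresh": int(fields[5]),
                "raw": fields[9],
            }
    return attrs


def margin(attr):
    """Distance between the lower of value/worst and the threshold."""
    return min(attr["value"], attr["worst"]) - attr["thresh"]


# Two rows copied from the report in this thread.
SAMPLE = """\
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
"""

attrs = parse_attributes(SAMPLE)
print(margin(attrs["Reallocated_Sector_Ct"]))  # 200 - 140 = 60
```

A margin that shrinks from one report to the next is the thing to watch for.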

"Current_Pending_Sector" counts possible bad sectors that were detected (on read) and are waiting to be remapped, which happens when a later write to them fails; that in turn lowers the current value of Reallocated_Sector_Ct. Sometimes the write succeeds, the Current_Pending_Sector raw value returns to zero, and the current/worst values of Reallocated_Sector_Ct don't decrease. But seeing its value change is not a good sign.
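
Since it is the change over time that matters, one simple approach is to save the raw values from each report and compare them with the previous ones. A minimal Python sketch, assuming you have already extracted the raw values into dictionaries; the function name is made up and the "today" numbers are invented for the example, not taken from this disk.

```python
WATCHED = (
    "Reallocated_Sector_Ct",
    "Reallocated_Event_Count",
    "Current_Pending_Sector",
)


def raw_changes(previous, current, watched=WATCHED):
    """Return {name: (before, after)} for watched raw values that changed.

    Hypothetical helper; attributes missing from a snapshot count as 0.
    """
    changes = {}
    for name in watched:
        before = previous.get(name, 0)
        after = current.get(name, 0)
        if after != before:
            changes[name] = (before, after)
    return changes


# Raw values from the report in this thread (all zero).
last_report = {
    "Reallocated_Sector_Ct": 0,
    "Reallocated_Event_Count": 0,
    "Current_Pending_Sector": 0,
}
# Invented later snapshot, for illustration only.
today = dict(last_report, Current_Pending_Sector=1)

print(raw_changes(last_report, today))
```

An empty result means nothing moved since the last report; anything else deserves a closer look at the full SMART output.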

Andrey Suprun

May 12, 2017, 12:27:06 PM
to Alt-F
Thank you very much, João.