RAID1 drive replacement in DNS323 with ALT-F 1.0 = always degraded


mar...@betterdeveloper.net

Dec 24, 2018, 4:46:43 PM
to Alt-F
Hi all,

I have a DNS323 with ALT-F 1.0 in RAID1 mode, 2 HDs, 4TB each. One of them failed so I purchased a replacement. I followed the instructions at https://sourceforge.net/p/alt-f/wiki/How%20to%20fix%20a%20degraded%20RAID1%20array/ and the result is always the same:

- The Status page shows a degraded drive, nothing listed under ETA:
Dev.  Capacity  Level  State  Status    Action  Done  ETA
md0   3725.1GB  raid1  clean  degraded  idle

- Raid Creation Maintenance shows sda2 (partition in newly inserted drive) in red:
RAID Maintenance
Dev. Capacity Level Ver. Components Array RAID Operations Component Operations
md0 3725.1GB raid1 1.0     sda2 sdb2

Level=raid1 is in red. sda2 is in red. sdb2 is in black.

Operations in sda2/examine produces this:

/dev/sda2:
          Magic : a92b4efc
        Version : 1.0
    Feature Map : 0x1
     Array UUID : 00b69b0a:6bc129ea:ebcb9df8:5818936f
           Name : dlink-raid2:0  (local to host dlink-raid2)
  Creation Time : Sun Jun 25 12:06:04 2017
     Raid Level : raid1
   Raid Devices : 2

 Avail Dev Size : 7812128160 (3725.11 GiB 3999.81 GB)
     Array Size : 7812128160 (3725.11 GiB 3999.81 GB)
   Super Offset : 7812128416 sectors
          State : clean
    Device UUID : d3a098f6:54ec2ba2:c9c5002b:17fb7a67

Internal Bitmap : -16 sectors from superblock
    Update Time : Sat Dec 22 13:10:24 2018
       Checksum : 650f4a07 - correct
         Events : 480011


   Device Role : spare
   Array State : A. ('A' == active, '.' == missing)
   
   
   Any ideas? Can the instructions be fixed so they work? For now I entered a bug report at https://sourceforge.net/p/alt-f/tickets/409/ . Thanks!

João Cardoso

Dec 26, 2018, 12:21:22 PM
to Alt-F
 Can you please also post the sdb2/Examine and RAID Operations/Details output?

Marcio Marchini

Dec 26, 2018, 4:22:11 PM
to al...@googlegroups.com

I updated https://sourceforge.net/p/alt-f/tickets/409/ with the info you asked for (sdb2/Examine and RAID Operations/Details output).

Thanks!

marcio


--
You received this message because you are subscribed to the Google Groups "Alt-F" group.
To unsubscribe from this group and stop receiving emails from it, send an email to alt-f+un...@googlegroups.com.
Visit this group at https://groups.google.com/group/alt-f.
For more options, visit https://groups.google.com/d/optout.

João Cardoso

Dec 27, 2018, 1:32:10 PM
to Alt-F


On Wednesday, 26 December 2018 21:22:11 UTC, Marcio Marchini wrote:

> I updated https://sourceforge.net/p/alt-f/tickets/409/ with the info you asked for (sdb2/Examine and RAID Operations/Details output).

The wiki is fine; that is the procedure to follow. It didn't work for you for some reason that we might try to uncover.

The sda2 partition is marked, both in the array and on the drive itself, as a 'spare' for the array, and should be displayed in green, not red, in the webUI. Have you used any other Alt-F version before 1.0?

In any case, the spare drive should have triggered a rebuild when it was first added to the array -- did you notice that? The rebuild might have failed because of errors on sdb2. I don't know; only kernel logs taken at the time could tell.

You should run short SMART tests (for a start) on both the sda and sdb drives, using Disk->Utilities, Health. After they complete (use a clock, you will not be warned when they finish), post the Status results for both disks. If any errors show up, a kernel log (System->Utilities, View logs) will be helpful.
Once you are sure that there are no errors on either drive, you can Fail, then Remove, then Clear the new drive -- don't get it wrong! After that you can stop the array, or just reboot the box, and try step F in the wiki again.

Worked?

If the above doesn't work, you will have to resort to the command line; see e.g. this post (although it is old, it might apply to the mdadm version that Alt-F uses for memory-space reasons).
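
For reference, the webUI sequence Fail / Remove / Clear / Add corresponds roughly to the mdadm calls below. This is only a sketch: mapping "Clear" to --zero-superblock is an assumption about what the webUI does, readd_plan is a hypothetical helper, and md0 / sda2 are the names used in this thread. The helper prints the commands instead of running them, so they can be reviewed first.

```sh
#!/bin/sh
# Print (do not run) the mdadm commands for re-adding a RAID1 component.
# readd_plan is a hypothetical helper name, not part of Alt-F or mdadm.
readd_plan() {
    md=$1     # array device, e.g. /dev/md0
    part=$2   # component on the NEW drive, e.g. /dev/sda2 (never the good one)
    echo "mdadm $md --fail $part"
    echo "mdadm $md --remove $part"
    # Assumption: the webUI "Clear" wipes the RAID superblock on the component.
    echo "mdadm --zero-superblock $part"
    echo "mdadm $md --add $part"
}

readd_plan /dev/md0 /dev/sda2
```

Printing the plan rather than executing it is deliberate: pointing --zero-superblock at the wrong partition would drop the only good copy from the array.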
 


Marcio Marchini

Dec 27, 2018, 4:08:41 PM
to al...@googlegroups.com
Hi Joao,

   I have repeated the steps multiple times, from the start, over 3 or 4 days. I remove the drive from the array, copy the partition table from the good drive, etc. Basically I follow the wiki steps. I am travelling now so I am typing this particular email from memory. I recall a “and if your drive was formatted with Alt-F you can skip to step F” near the end.

    It always looks like it is going to work. If I recall properly it prints something like “synchronizing” or a similar word when I add the drive to the array. That one page never refreshes. Many hours later I open a new browser tab on the status page and it shows sda2 in green but the ETA column is always blank, if I recall properly. When I come again hours later the sda2 part  is red and I can see it is a spare.

   I am on the road and will be back Jan 07, so if you can think of particular steps I can try them when I am back.

    I have a guess. Maybe a “fsck” should be run before adding the new drive? I did try that on my second or third attempt though. Couldn’t think of anything else.

    Thanks,

Marcio 

João Cardoso

Dec 29, 2018, 1:27:08 PM
to Alt-F


On Thursday, 27 December 2018 21:08:41 UTC, Marcio Marchini wrote:
> Hi Joao,
>
> I have repeated the steps multiple times, from the start, over 3 or 4 days. I remove the drive from the array, copy the partition table from the good drive, etc. Basically I follow the wiki steps.

Once you have done steps A to E you don't need to repeat them; specifically, they are done as soon as the new disk's RAID partition appears under "Components" in the Alt-F RAID webUI.
 
> I am travelling now so I am typing this particular email from memory. I recall a “and if your drive was formatted with Alt-F you can skip to step F” near the end.
>
> It always looks like it is going to work. If I recall properly it prints something like “synchronizing” or a similar word when I add the drive to the array. That one page never refreshes.

No, it only shows the status at the time the page was generated (as most pages do). You have to go to the Status page to see progress; the operation has been started and it will continue. Yes, I agree that showing the wording with ellipses might make the user infer that he must wait for completion.
 
> Many hours later I open a new browser tab on the status page and it shows sda2 in green

The status page doesn't show disk partitions (sda2 or sdb2) in green or red, only the RAID page does.
The expected behaviour when a "spare" drive is added to the array is for it to be displayed in green (spare) in the RAID page. If a resync or rebuild happens, that is shown in the Status page, in the RAID section; in the RAID page, the partition/component should be in green until the resync/rebuild succeeds.
 
> but the ETA column is always blank, if I recall properly.

You have to repeat that when you return, just to be sure.
As the partition/component already belongs to the array (as yours does), you have to Fail it, Remove it, and Clear it first, then Add it. Now go to the Status page, and see the RAID status, go to the RAID page and see more details about the RAID array.
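
From a shell on the box, rebuild progress can also be read from /proc/mdstat. The sketch below pulls out the percentage; recovery_pct is a hypothetical helper, and the sample text is illustrative (typical mdstat layout, not captured from this NAS).

```sh
#!/bin/sh
# Extract the rebuild percentage from /proc/mdstat-style text on stdin.
# Prints nothing when no recovery/resync is in progress (array idle).
recovery_pct() {
    sed -n -e 's/.*recovery *= *\([0-9.]*\)%.*/\1/p' \
           -e 's/.*resync *= *\([0-9.]*\)%.*/\1/p'
}

# Illustrative sample only; on the box use: recovery_pct < /proc/mdstat
sample='md0 : active raid1 sda2[2] sdb2[0]
      3906064080 blocks super 1.0 [2/1] [_U]
      [=>...................]  recovery =  7.5% (292954806/3906064080) finish=610.2min speed=98688K/sec'

printf '%s\n' "$sample" | recovery_pct
```

An empty result some time after the Add, while the array is still degraded, would match the symptom described in this thread: the rebuild started and then stopped.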
 
> When I come again hours later the sda2 part  is red and I can see it is a spare.

That should happen only if the rebuild/resync failed for some reason, so that the component/partition didn't become active and remains a spare.


> I am on the road and will be back Jan 07, so if you can think of particular steps I can try them when I am back.

Start by doing a backup, if you don't have one (yes, I know, but RAID is not necessarily about backups, it's about 24/7 availability).
Then perform the SMART short tests (a couple of minutes) on both drives, and post their Status output (go to Disk->Utilities, Health). Hopefully you don't see "read error" under the self-test section; "# 1  Short offline       Completed without error" is the right output. If you want to be completely sure, do the long test as well, but that takes several hours to complete.
Make sure that SMART and mdadm are continuously monitoring your disks: under Services->System, "mdadm" and "smart" should be running and boot active (that is for the future, it will not solve any problem now).
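
On the command line the same check is `smartctl -t short /dev/sdX` followed, a few minutes later, by `smartctl -l selftest /dev/sdX`. A small sketch for scanning that log for read failures follows; count_read_failures and failed_lbas are hypothetical helper names, and the two sample lines use the smartctl self-test log layout.

```sh
#!/bin/sh
# Both helpers read `smartctl -l selftest /dev/sdX` output on stdin.

# Number of logged self-tests that ended in a read failure.
count_read_failures() {
    awk '/Completed: read failure/ { n++ } END { print n + 0 }'
}

# LBA_of_first_error (last column) of each failed self-test.
failed_lbas() {
    awk '/Completed: read failure/ { print $NF }'
}

log='# 1  Short offline       Completed: read failure       90%     13456         1839680
# 7  Short offline       Completed without error       00%     13329         -'

printf '%s\n' "$log" | count_read_failures
printf '%s\n' "$log" | failed_lbas
```

Any non-zero count on the surviving drive is worth resolving before attempting the rebuild.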
 
I have seen your problem reported more than once, e.g.

but there is no definitive answer on how to solve it, or on whether it is a mdadm or kernel bug.
If there is no "read error" in either drive's SMART tests, we will try the command line (and fix the RAID webUI).


> I have a guess. Maybe a “fsck” should be run before adding the new drive? I did try that on my second or third attempt though. Couldn’t think of anything else.

No. fsck is intended for filesystems, i.e., the management of user files, while RAID is only concerned with keeping bytes synced across drives; it doesn't even care about the contents of those bytes. A filesystem is created once, on top of a RAID array, after building it.

Marcio Marchini

Jan 6, 2019, 10:28:59 AM
to al...@googlegroups.com
Hi Joao,

   I am back.

> Start doing a backup, if you don't have one. (yes, I know, but RAID is not necessarily about backups, it's about 24/7 availability).

This drive is a TimeMachine drive for a Mac. The Mac already alternates backing up to 3 drives: 2 USB-attached drives and this NAS as a network drive, in another room. The NAS has the longest history, due to its higher capacity. I’d really like to prevent losing this long-lasting TimeMachine data.


> perform the SMART short tests


sda: (new drive)

Fail: smartctl 6.5 2016-05-07 r4318 [armv5tel-linux-4.4.86] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF OFFLINE IMMEDIATE AND SELF-TEST SECTION ===
Can't start self-test without aborting current test (10% remaining),
add '-t force' option to override, or run 'smartctl -X' to abort test.

Weird. Even so, it displays:
smartctl 6.5 2016-05-07 r4318 [armv5tel-linux-4.4.86] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Western Digital Red
Device Model:     WDC WD40EFRX-68N32N0
Serial Number:    WD-WCC7K6VCNHZ7
LU WWN Device Id: 5 0014ee 2104174b3
Firmware Version: 82.00A82
User Capacity:    4,000,787,030,016 bytes [4.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    5400 rpm
Form Factor:      3.5 inches
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ACS-3 T13/2161-D revision 5
SATA Version is:  SATA 3.1, 6.0 Gb/s (current: 3.0 Gb/s)
Local Time is:    Sun Jan  6 13:13:52 2019 BRST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0027   195   167   021    Pre-fail  Always       -       5216
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       313
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       574
 10 Spin_Retry_Count        0x0032   100   100   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   253   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       5
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       0
193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always       -       791
194 Temperature_Celsius     0x0022   107   103   000    Old_age   Always       -       43
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   253   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%       543         -
# 2  Short offline       Completed without error       00%       519         -
# 3  Short offline       Completed without error       00%       495         -
# 4  Short offline       Completed without error       00%       471         -
# 5  Short offline       Completed without error       00%       447         -
# 6  Short offline       Completed without error       00%       423         -
# 7  Extended offline    Completed without error       00%       408         -
# 8  Short offline       Completed without error       00%       375         -
# 9  Short offline       Completed without error       00%       351         -
#10  Short offline       Completed without error       00%       327         -
#11  Short offline       Completed without error       00%       304         -
#12  Short offline       Completed without error       00%       280         -
#13  Short offline       Completed without error       00%       256         -
#14  Extended offline    Completed without error       00%       241         -
#15  Short offline       Completed without error       00%       213         -
#16  Short offline       Completed without error       00%       202         -
#17  Short offline       Completed without error       00%       171         -
#18  Short offline       Completed without error       00%       129         -
#19  Extended offline    Completed without error       00%       113         -
#20  Short offline       Completed without error       00%        82         -
#21  Short offline       Completed without error       00%        52         -





sdb: (drive with the data)

smartctl 6.5 2016-05-07 r4318 [armv5tel-linux-4.4.86] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Western Digital Red
Device Model:     WDC WD40EFRX-68WT0N0
Serial Number:    WD-WCC4E7TAACPF
LU WWN Device Id: 5 0014ee 2625e655f
Firmware Version: 82.00A82
User Capacity:    4,000,787,030,016 bytes [4.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    5400 rpm
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ACS-2 (minor revision not indicated)
SATA Version is:  SATA 3.0, 6.0 Gb/s (current: 3.0 Gb/s)
Local Time is:    Sun Jan  6 13:15:49 2019 BRST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       471
  3 Spin_Up_Time            0x0027   198   179   021    Pre-fail  Always       -       7091
  4 Start_Stop_Count        0x0032   096   096   000    Old_age   Always       -       4692
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   200   199   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   082   082   000    Old_age   Always       -       13456
 10 Spin_Retry_Count        0x0032   100   100   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   253   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       52
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       21
193 Load_Cycle_Count        0x0032   193   193   000    Old_age   Always       -       23322
194 Temperature_Celsius     0x0022   109   100   000    Old_age   Always       -       43
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       106
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       99
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       356

SMART Error Log Version: 1
ATA Error Count: 168 (device log contains only the most recent five errors)
	CR = Command Register [HEX]
	FR = Features Register [HEX]
	SC = Sector Count Register [HEX]
	SN = Sector Number Register [HEX]
	CL = Cylinder Low Register [HEX]
	CH = Cylinder High Register [HEX]
	DH = Device/Head Register [HEX]
	DC = Device Command Register [HEX]
	ER = Error register [HEX]
	ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 168 occurred at disk power-on lifetime: 2706 hours (112 days + 18 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  04 61 46 00 00 00 a0  Device Fault; Error: ABRT

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  ef 03 46 00 00 00 a0 08  16d+21:27:47.375  SET FEATURES [Set transfer mode]
  ec 00 00 00 00 00 a0 08  16d+21:27:47.360  IDENTIFY DEVICE
  c8 00 08 a0 09 00 e0 08  16d+21:27:46.822  READ DMA
  ec 00 00 00 00 00 a0 08  16d+21:27:46.795  IDENTIFY DEVICE
  ef 03 46 00 00 00 a0 08  16d+21:27:46.776  SET FEATURES [Set transfer mode]

Error 167 occurred at disk power-on lifetime: 2706 hours (112 days + 18 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  04 61 08 a0 09 00 e0  Device Fault; Error: ABRT 8 sectors at LBA = 0x000009a0 = 2464

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  c8 00 08 a0 09 00 e0 08  16d+21:27:46.822  READ DMA
  ec 00 00 00 00 00 a0 08  16d+21:27:46.795  IDENTIFY DEVICE
  ef 03 46 00 00 00 a0 08  16d+21:27:46.776  SET FEATURES [Set transfer mode]
  ec 00 00 00 00 00 a0 08  16d+21:27:46.760  IDENTIFY DEVICE
  c8 00 08 a0 09 00 e0 08  16d+21:27:46.223  READ DMA

Error 166 occurred at disk power-on lifetime: 2706 hours (112 days + 18 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  04 61 46 00 00 00 a0  Device Fault; Error: ABRT

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  ef 03 46 00 00 00 a0 08  16d+21:27:46.776  SET FEATURES [Set transfer mode]
  ec 00 00 00 00 00 a0 08  16d+21:27:46.760  IDENTIFY DEVICE
  c8 00 08 a0 09 00 e0 08  16d+21:27:46.223  READ DMA
  ec 00 00 00 00 00 a0 08  16d+21:27:46.196  IDENTIFY DEVICE
  ef 03 46 00 00 00 a0 08  16d+21:27:46.176  SET FEATURES [Set transfer mode]

Error 165 occurred at disk power-on lifetime: 2706 hours (112 days + 18 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  04 61 08 a0 09 00 e0  Device Fault; Error: ABRT 8 sectors at LBA = 0x000009a0 = 2464

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  c8 00 08 a0 09 00 e0 08  16d+21:27:46.223  READ DMA
  ec 00 00 00 00 00 a0 08  16d+21:27:46.196  IDENTIFY DEVICE
  ef 03 46 00 00 00 a0 08  16d+21:27:46.176  SET FEATURES [Set transfer mode]
  ec 00 00 00 00 00 a0 08  16d+21:27:46.161  IDENTIFY DEVICE
  c8 00 08 a0 09 00 e0 08  16d+21:27:45.623  READ DMA

Error 164 occurred at disk power-on lifetime: 2706 hours (112 days + 18 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  04 61 46 00 00 00 a0  Device Fault; Error: ABRT

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  ef 03 46 00 00 00 a0 08  16d+21:27:46.176  SET FEATURES [Set transfer mode]
  ec 00 00 00 00 00 a0 08  16d+21:27:46.161  IDENTIFY DEVICE
  c8 00 08 a0 09 00 e0 08  16d+21:27:45.623  READ DMA
  ec 00 00 00 00 00 a0 08  16d+21:27:45.597  IDENTIFY DEVICE
  ef 03 46 00 00 00 a0 08  16d+21:27:45.597  SET FEATURES [Set transfer mode]

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed: read failure       90%     13456         1839680
# 2  Extended offline    Completed: read failure       90%     13449         1839682
# 3  Short offline       Completed: read failure       90%     13425         1839685
# 4  Short offline       Completed: read failure       90%     13401         1839681
# 5  Short offline       Completed: read failure       90%     13377         1839680
# 6  Short offline       Completed: read failure       90%     13353         1839680
# 7  Short offline       Completed without error       00%     13329         -
# 8  Short offline       Completed without error       00%     13305         -
# 9  Extended offline    Completed: read failure       90%     13281         1883176
#10  Short offline       Completed without error       00%     13257         -
#11  Short offline       Completed without error       00%     13234         -
#12  Short offline       Completed without error       00%     13210         -
#13  Short offline       Completed without error       00%     13186         -
#14  Short offline       Completed: read failure       90%     13162         1886732
#15  Short offline       Completed without error       00%     13138         -
#16  Extended offline    Completed: read failure       90%     13114         1890287
#17  Short offline       Completed: read failure       90%     13092         1890285
#18  Short offline       Completed: read failure       90%     13082         1890280
#19  Short offline       Completed: read failure       90%     13047         1890282
#20  Short offline       Completed: read failure       50%     13009         1900936
#21  Extended offline    Completed: read failure       90%     12991         1900936


> Make sure that SMART and "mdadm" are continuously monitoring your disks, Services->System, "mdadm" and "smart" should be running and boot active (that is for the future, it will not solve any problem now).
 

Yes, this was on/setup already.


> you have to Fail it, Remove it, and Clear it first, then Add it. Now go to the Status page, and see the RAID status, go to the RAID page and see more details about the RAID array.

Done: Fail, Remove, Clear. All ok.


Added. I see this at the top:

awk: Shortcuts*.men: No such file or directory







The page shows:

RAID Maintenance
Dev.  Capacity  Level  Ver.  Components  Array  RAID Operations  Component Operations
md0   3725.1GB  raid1  1.0   sda2 sdb2
recovering



The status page shows:

RAID
Dev.  Capacity  Level  State  Status    Action  Done  ETA
md0   3725.1GB  raid1  clean  degraded  idle

Mounted Filesystems
Dev.  Label  Capacity  Available  FS    Mode  Dirty  Automatic FSCK in
md0          3.6TB     347.7GB    ext4  RW           48 mounts or 14 days


Going back to the RAID page shows this:

RAID Maintenance
Dev.  Capacity  Level  Ver.  Components  Array  RAID Operations  Component Operations
md0   3725.1GB  raid1  1.0   sda2 sdb2


So, is the RAID1 reconstruction failing because the sdb drive is showing errors? This is yet another reason why I need the copy to work asap before sdb also fails and I lose everything :-(

Suggestions? Thanks!

marcio




João Cardoso

Jan 7, 2019, 1:53:23 PM
to Alt-F


On Sunday, 6 January 2019 15:28:59 UTC, Marcio Marchini wrote:
> Hi Joao,
>
> I am back.
>
>> Start doing a backup, if you don't have one. (yes, I know, but RAID is not necessarily about backups, it's about 24/7 availability).
>
> This drive is a TimeMachine drive for a Mac. The Mac already alternates backing up to 3 drives, 2 USB-attached drives and this NAS as a network drive, in another room. The NAS has the longest history, due to its higher capacity. I’d really like to prevent losing this long-lasting TimeMachine data.

Yes, there are read errors on your sdb drive, and that explains why the RAID resync starts but fails later on: *all* bytes on the sdb drive, regardless of their meaning or contents, have to be read in order to be written to the new sda drive.
The read errors occur at low LBAs, within the first gigabyte of the drive (the 90% in the self-test log is the part of the test that was still remaining when it hit the error). Those sectors might not even contain user data, so you might not notice any errors when reading files from the degraded RAID, but the resync has to fail.

You see that on the SMART log:
   197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       106
some sectors can't be read and are waiting for a write so that they can be remapped to a known-good spare area, none of which has been used yet:
   5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0

The 
  198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       99
parameter is also really bad.
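
As a sketch of how these attributes could be checked from a script, the helper below extracts a raw value from `smartctl -A` output. smart_raw is a hypothetical name, and the code assumes the raw value is the last field on the line, which holds for the three attributes quoted here but not for every attribute on every drive.

```sh
#!/bin/sh
# Print the RAW_VALUE of one attribute from `smartctl -A /dev/sdX` output (stdin).
# Assumption: the raw value is the last whitespace-separated field.
smart_raw() {
    awk -v name="$1" '$2 == name { print $NF }'
}

# Attribute lines copied from the sdb report in this thread.
attrs='  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       106
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       99'

pending=$(printf '%s\n' "$attrs" | smart_raw Current_Pending_Sector)
if [ "$pending" -gt 0 ]; then
    echo "$pending pending sectors: a RAID1 rebuild reading this disk is likely to fail"
fi
```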

The issue, read errors, started occurring some 19 days ago (assuming the box is powered on 24/7) in several disk areas (LBAs) close to the start of the disk (most of the failed tests stopped with 90% still remaining):

Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed: read failure       90%     13456         1839680
... several other errors
#21  Extended offline    Completed: read failure       90%     12991         1900936
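
The arithmetic behind those figures can be sketched with shell integer math. hours_to_days and lba_to_mib are hypothetical helpers; the 512-byte sector size is the logical sector size reported for this drive. Note that the Remaining column is the share of the self-test left when it stopped, while LBA_of_first_error is an absolute position on the disk.

```sh
#!/bin/sh
# Days between two Power_On_Hours readings (integer division).
hours_to_days() {
    echo $(( ($1 - $2) / 24 ))
}

# Offset from the start of the disk, in MiB, of a 512-byte LBA.
lba_to_mib() {
    echo $(( $1 * 512 / 1024 / 1024 ))
}

# Entries #1 and #21 of the sdb self-test log.
echo "failures span about $(hours_to_days 13456 12991) days of power-on time"
echo "LBA 1839680 is about $(lba_to_mib 1839680) MiB into the disk"
```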

Oddly, the overall SMART health status still says "PASSED"; it might be that it only applies to the pre-fail attributes, I don't know.

The disk is relatively new, 13456 power-on hours, roughly 1.5 years -- is it still under warranty?

So, with the diagnosis made, what is the cure for the RAID? A new drive, of course.
If you don't want to lose the old historic data still available in your TM, keep it in read-only mode, if that is possible, and replace the drive only when your history data becomes prehistoric :)

That said, it is possible to force the remap of the currently bad sectors by writing to them (you know their LBA, i.e. their location on the disk, not on the partition), with some possible data loss. I have done that already and the drive hasn't developed new errors since then, but I only use it for tests anyway.

You might also want to run the vendor's diagnostics program, but that will almost certainly wipe out the drive's contents.

The wiki has to be complemented with something like:
If one disk of your RAID1 array fails, run a SMART test on the good one to be sure that it is not going to fail shortly afterwards, especially if both disks are the same model from the same manufacturer and you bought them at the same time from the same store; they are probably from the same manufacturing lot, so the probability of a double failure increases.

sorry for not helping more

Marcio Marchini

Jan 9, 2019, 1:05:25 PM
to al...@googlegroups.com
Hi Joao,

   Thanks for your help!!!!

   Yes, updating the wiki would be nice. I would be more emphatic and say that if there is *any* read error in the SMART test, the RAID mirroring will fail for sure. Your text proposal suggests it is just a precaution, when in fact it is a guarantee of frustration and failure for the hard drive replacement.

    Yes, the drives are fairly new and were bought together. I am not sure what the warranty is on these drives. I would suspect one year? I will check.

   I am interested in this:

"That said, it is possible to force the remap of the currently bad sectors by writing to them (you know their LBA, location on the disk, not partition), with some possible data loss. I have done that already and the drive hasn't develop new errors since them, but I only use it for tests anyway.

   I am not a Linux wizard. Can you help with commands I could/should run?

   Thanks!

marcio



João Cardoso

Jan 10, 2019, 2:34:33 PM
to Alt-F


On Wednesday, 9 January 2019 18:05:25 UTC, Marcio Marchini wrote:
> Hi Joao,
>
> Thanks for your help!!!!
>
> Yes, updating the wiki would be nice. I would be more emphatic and say that if there is *any* read error in the SMART test, the RAID mirroring will fail for sure. Your text proposal suggests it is just a precaution, when in fact it is a guarantee of frustration and failure for the hard drive replacement.

I have done that, thanks. Can a native English speaker review and fix it, please?


> Yes, the drives are fairly new and were bought together. I am not sure what the warranty is on these drives. I would suspect one year? I will check.
>
> I am interested in this:
>
>> That said, it is possible to force the remap of the currently bad sectors by writing to them (you know their LBA, location on the disk, not partition), with some possible data loss. I have done that already and the drive hasn't develop new errors since them, but I only use it for tests anyway.
>
> I am not a Linux wizard. Can you help with commands I could/should run?

The amount of work to do and the knowledge it requires depend strongly on the motivation -- just for fun/learning, or for salvaging real data?

The 'badblocks' command has existed for decades and was a mandatory tool when one bought a new 20MB drive :-) (spoiler alert -- fisherman talking)
badblocks is very slow and has an even slower non-destructive test mode. You can use it in a faster way to test just the area of the disk where errors are developing. But that doesn't make sense, does it? badblocks writes and reads bit patterns for every disk sector.
There is a badblockhowto somewhere (it was on the smartmontools site, I believe), but I believe that it concentrates on identifying the affected files, which is filesystem dependent and complex. A simple non-destructive test with the filesystems unmounted and the RAID stopped is simpler. Again, it depends on the user's motivation.
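
A sketch of restricting badblocks to the suspect area: the helper below only prints a read-only badblocks command for a range of 512-byte LBAs. badblocks_cmd is a hypothetical name, the 4096-byte block size is a choice made here to match the drive's physical sectors, and whether the badblocks shipped with Alt-F accepts exactly these options is an assumption (they are the e2fsprogs ones).

```sh
#!/bin/sh
# Print (do not run) a read-only badblocks command covering a range of
# 512-byte LBAs. badblocks expects "device last-block first-block",
# counted in blocks of the size given with -b.
badblocks_cmd() {
    dev=$1
    first=$(( $2 / 8 ))   # 8 x 512-byte LBAs per 4096-byte block
    last=$(( $3 / 8 ))
    echo "badblocks -sv -b 4096 $dev $last $first"
}

# Range spanned by the failing self-tests on sdb in this thread.
badblocks_cmd /dev/sdb 1839680 1900936
```

Without -n or -w, badblocks only reads, so a command of this shape should not change the disk contents.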

The other tool is the ubiquitous dd command, which you can use to read and write any data on a device. E.g., if a SMART test tells you that a read error occurred at LBA 123456 on device sda, you can use
dd if=/dev/zero of=/dev/sda bs=512 count=1 seek=123456
to write 512 bytes (bs=512 and count=1) of zeros (if=/dev/zero) to the sda device (of=/dev/sda) at sector 123456 (seek=). Note that it has to be seek=, which positions the output; skip= only skips blocks of the input and would leave the write at the very start of the disk. LBAs and dd block offsets both count from zero, so no off-by-one adjustment is needed when bs matches the 512-byte logical sector size. On drives with a 4K internal sector size, as yours is, things get more complex: it is probably necessary to overwrite the whole 4096-byte physical sector (eight consecutive LBAs) rather than a single 512-byte one.
Of course, if you blindly write to the disk and it happens to be a filesystem maintenance area such as a superblock, you might not only lose a file but make the whole filesystem inconsistent. Alternate superblocks exist, and other tools can reveal and use them.
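
For a 4K-sector drive, the dd arguments can be derived as in the sketch below. zero_sector_cmd is a hypothetical helper that only prints the command; it assumes 512-byte logical sectors and physical sectors aligned on multiples of eight LBAs, which is the usual layout but is not confirmed for this drive.

```sh
#!/bin/sh
# Print (do not run) a dd command that zeroes the 4096-byte physical sector
# containing a given 512-byte LBA. seek= positions the OUTPUT; skip= would
# only affect the input and leave the write at the start of the disk.
zero_sector_cmd() {
    dev=$1
    lba=$2
    echo "dd if=/dev/zero of=$dev bs=4096 count=1 seek=$(( lba / 8 ))"
}

# First failing LBA from the sdb self-test log in this thread.
zero_sector_cmd /dev/sdb 1839680
```

The data in that sector is lost when the command is actually run, so it is worth re-reading the device name and the seek value before executing it.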

