RAID 5 USB Disk Failure - RAID Degraded - What to do now?


Richard Lehun

Mar 5, 2015, 8:47:19 AM

Hi all,


After the external USB drive stopped working, the RAID degraded. I turned everything off and on. The USB drive's light is back on, but the 323's drive lights are both amber. How can I tell Alt-F to try to re-establish the RAID 5 with the USB drive? Below are excerpts from Status and a short test of the drive.


Thanks

Richard

Name: dlink-7B910E
Model: DNS-323-B1


Disks
Bay    Dev.  Model                Capacity  Power Status    Temp          Health
usb    sda   WDC WD10EACS-00ZJB0  1.0TB     --              33°C/91.4°F   passed
right  sdb   ST31000340NS         1.0TB     active or idle  43°C/109.4°F  passed
left   sdc   ST31000528AS         1.0TB     active or idle  40°C/104°F    passed



Dev.  Capacity  Level  State  Status    Action  Done  ETA
md0   1862.0GB  raid5  clean  degraded  idle

smartctl 6.2 2013-07-26 r3841 [armv5tel-linux-3.10.32] (local build)
Copyright (C) 2002-13, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Western Digital Caviar Green
Device Model:     WDC WD10EACS-00ZJB0
Serial Number:    WD-WCASJ1243256
LU WWN Device Id: 5 0014ee 2567a0768
Firmware Version: 01.01B01
User Capacity:    1,000,204,886,016 bytes [1.00 TB]
Sector Size:      512 bytes logical/physical
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ATA8-ACS (minor revision not indicated)
SATA Version is:  SATA 2.5, 3.0 Gb/s
Local Time is:    Thu Mar  5 08:39:49 2015 EST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   200   200   051    Pre-fail  Always       -       778
  3 Spin_Up_Time            0x0003   186   176   021    Pre-fail  Always       -       7700
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       865
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000e   200   200   051    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   074   074   000    Old_age   Always       -       19412
 10 Spin_Retry_Count        0x0012   100   100   051    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0012   100   100   051    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       109
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       73
193 Load_Cycle_Count        0x0032   159   159   000    Old_age   Always       -       123891
194 Temperature_Celsius     0x0022   120   090   000    Old_age   Always       -       32
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0012   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   200   200   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   200   197   051    Old_age   Offline      -       0

SMART Error Log Version: 1
ATA Error Count: 215 (device log contains only the most recent five errors)
        CR = Command Register [HEX]
        FR = Features Register [HEX]
        SC = Sector Count Register [HEX]
        SN = Sector Number Register [HEX]
        CL = Cylinder Low Register [HEX]
        CH = Cylinder High Register [HEX]
        DH = Device/Head Register [HEX]
        DC = Device Command Register [HEX]
        ER = Error register [HEX]
        ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 215 occurred at disk power-on lifetime: 19081 hours (795 days + 1 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 cf d0 18 e0  Error: UNC at LBA = 0x0018d0cf = 1626319

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  25 00 00 cf d0 18 00 00   3d+13:25:03.625  READ DMA EXT
  25 00 00 cf d0 18 00 00   3d+13:24:54.048  READ DMA EXT
  25 00 00 cf d0 18 00 00   3d+13:24:41.706  READ DMA EXT
  25 00 00 cf ad 18 00 00   3d+13:24:37.668  READ DMA EXT
  25 00 00 cf d0 18 00 00   3d+13:24:26.105  READ DMA EXT

Error 214 occurred at disk power-on lifetime: 19081 hours (795 days + 1 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 cf d0 18 e0  Error: UNC at LBA = 0x0018d0cf = 1626319

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  25 00 00 cf d0 18 00 00   3d+13:24:54.048  READ DMA EXT

João Cardoso

Mar 6, 2015, 10:49:54 AM
to al...@googlegroups.com


On Thursday, March 5, 2015 at 1:47:19 PM UTC, Richard Lehun wrote:

Hi all,


After the external USB drive stopped working, the RAID degraded. I turned everything off and on. The USB drive's light is back on, but the 323's drive lights are both amber. How can I tell Alt-F to try to re-establish the RAID 5 with the USB drive? Below are excerpts from Status and a short test of the drive.


The drive seems to be reasonably "old", with 19412 hours of continuous operation (about 26 months), and the SMART errors are recent: the last ones were logged at 19081 hours, only 331 power-on hours ago.
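For reference, the arithmetic behind those figures can be checked with a few lines of Python. This is only a sketch: the two hour counts are taken from the smartctl output above, while the average month length of 30.44 days is my own assumption.

```python
HOURS_PER_DAY = 24
DAYS_PER_MONTH = 30.44  # assumption: average calendar month length


def hours_to_months(hours):
    """Convert power-on hours to (approximate) months of continuous operation."""
    return hours / HOURS_PER_DAY / DAYS_PER_MONTH


power_on_hours = 19412    # raw value of attribute 9, Power_On_Hours
last_error_hours = 19081  # "Error 215 occurred at disk power-on lifetime"

months_on = hours_to_months(power_on_hours)
hours_since_error = power_on_hours - last_error_hours

print(f"{months_on:.1f} months powered on")
print(f"{hours_since_error} power-on hours since the last logged error "
      f"({hours_since_error / HOURS_PER_DAY:.1f} days)")
```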

You need to be sure that the drive errors were due to some glitch and are not permanent or a sign that the drive is starting to fail. Run a long SMART test and enable the SMART service (Services->System), so that the drive's health is checked periodically.
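If you want to check the attribute table by script instead of by eye, something like the following could work. It is a minimal sketch, not an Alt-F feature: the three attributes in WATCHED are my own choice of the counters usually associated with bad sectors, and the sample rows are copied from the output you posted.

```python
# Counters usually associated with bad sectors (my selection, not an official list).
WATCHED = ("Reallocated_Sector_Ct", "Current_Pending_Sector", "Offline_Uncorrectable")

SAMPLE = """\
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
193 Load_Cycle_Count        0x0032   159   159   000    Old_age   Always       -       123891
197 Current_Pending_Sector  0x0012   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   200   200   000    Old_age   Offline      -       0
"""


def parse_attributes(text):
    """Return {attribute_name: raw_value} from smartctl attribute rows."""
    attrs = {}
    for line in text.splitlines():
        fields = line.split()
        # Row layout: ID NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW
        if len(fields) >= 10 and fields[0].isdigit():
            try:
                attrs[fields[1]] = int(fields[9])
            except ValueError:
                pass  # skip raw values that are not plain integers
    return attrs


def nonzero_watched(attrs):
    """Return the watched counters whose raw value is not zero."""
    return {name: attrs[name] for name in WATCHED if attrs.get(name, 0) != 0}


print(nonzero_watched(parse_attributes(SAMPLE)))
```

An empty result is encouraging but not proof of health: the error log above still records uncorrectable reads (UNC) at LBA 1626319, which is why the long test matters.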

Do you always perform the box/USB drive power on/off sequence in the same way? That is important to avoid a lengthy resync on each power-on.

To add the drive to the RAID again, you should add the drive's RAID partition to the array as a component:

Go to Disk->RAID, RAID Maintenance section. Under "Component Operations", select the USB drive's RAID partition under "Partition", then select "Add" under "Operation".

That will probably fail, as you have been using the RAID in the degraded state. If that happens, you first have to use "Clear" under "Operation", and afterwards "Add". This will start a lengthy (tens of hours) rebuild. Be sure that the USB drive is really OK (long SMART test) before doing that: if it is failing, you will have to add a new drive and go through another rebuild. RAID5 is particularly susceptible to data loss if another drive fails during a rebuild, so you want to keep the number of rebuilds to a minimum.
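For those who prefer the command line, the steps above roughly correspond to plain mdadm commands. The sketch below only builds and prints the commands, it does not run anything. Please treat it as an assumption rather than a description of what the web page does: I am assuming that "Clear" behaves like mdadm's --zero-superblock and "Add" like --add, and /dev/sda2 is a hypothetical name for the USB drive's RAID partition, so check the real name on the Disk->RAID page first.

```python
import shlex


def readd_commands(array, partition, clear_first=False):
    """Build (not run) the mdadm commands for putting a partition back into an array."""
    cmds = []
    if clear_first:
        # Assumed equivalent of "Clear": wipe the stale RAID metadata on the partition.
        cmds.append(["mdadm", "--zero-superblock", partition])
    # Assumed equivalent of "Add": add the partition as a component, starting a rebuild.
    cmds.append(["mdadm", array, "--add", partition])
    # Show the array state so the rebuild can be followed.
    cmds.append(["mdadm", "--detail", array])
    return cmds


# /dev/sda2 is a hypothetical partition name used only for illustration.
for cmd in readd_commands("/dev/md0", "/dev/sda2", clear_first=True):
    print(shlex.join(cmd))
```

Printing the commands instead of executing them is deliberate: zeroing a superblock on the wrong partition is destructive, so it is worth reading each line before typing it.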

If you have to replace the drive, the procedure is similar (not identical) to the one described in the Degraded RAID1 wiki. In short, you have to partition the new drive with a partition of type RAID that has the same size as the RAID partition on each of the existing RAID components (use Disk->Partitioner for that) and "Add" it to the array.
To clarify "the same size as the RAID partition on each of the existing RAID components": in the Disk Partitioner, the RAID partition must have the same size on all disks.
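The size rule can be expressed as a small check. This is an illustrative sketch only: the partition names and sector counts below are made up, and you would replace them with the values shown by the Disk Partitioner.

```python
def odd_sized_partitions(sizes, reference):
    """Return the partitions whose size differs from the reference partition's size."""
    expected = sizes[reference]
    return sorted(name for name, size in sizes.items() if size != expected)


# Hypothetical sizes in sectors; sdd2 stands for the RAID partition on a new drive.
sizes = {"sdb2": 1952000000, "sdc2": 1952000000, "sdd2": 1951999000}

print(odd_sized_partitions(sizes, reference="sdb2"))
```

Any name in the result is a partition that should be recreated with the reference size before it is added to the array.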
