Raid 5 USB Disk Failure - Raid Degraded - What to do now?

Richard Lehun

Mar 5, 2015, 8:47:19 AM

Hi all,

After the external USB drive stopped working, the RAID degraded. I turned everything off and on. The USB drive's light is back on, but the 323's drive lights are both amber. How can I tell Alt-F to try to re-establish the RAID 5 with the USB drive? Below are excerpts from Status and a short test of the drive.

Thanks

Richard

Name: dlink-7B910E

Model: DNS-323-B1

Disks

Bay	Dev.	Model	Capacity	Power Status	Temp	Health
usb	sda	WDC WD10EACS-00ZJB0	1.0TB	--	33°C/91.4°F	passed
right	sdb	ST31000340NS	1.0TB	active or idle	43°C/109.4°F	passed
left	sdc	ST31000528AS	1.0TB	active or idle	40°C/104°F	passed

Dev.	Capacity	Level	State	Status	Action	Done	ETA
md0	1862.0GB	raid5	clean	degraded	idle

smartctl 6.2 2013-07-26 r3841 [armv5tel-linux-3.10.32] (local build)

=== START OF INFORMATION SECTION ===

Model Family: Western Digital Caviar Green

Device Model: WDC WD10EACS-00ZJB0

Serial Number: WD-WCASJ1243256

LU WWN Device Id: 5 0014ee 2567a0768

Firmware Version: 01.01B01

User Capacity: 1,000,204,886,016 bytes [1.00 TB]

Sector Size: 512 bytes logical/physical

Device is: In smartctl database [for details use: -P show]

ATA Version is: ATA8-ACS (minor revision not indicated)

SATA Version is: SATA 2.5, 3.0 Gb/s

Local Time is: Thu Mar 5 08:39:49 2015 EST

SMART support is: Available - device has SMART capability.

SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===

SMART overall-health self-assessment test result: PASSED

SMART Attributes Data Structure revision number: 16

Vendor Specific SMART Attributes with Thresholds:

ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE

1 Raw_Read_Error_Rate 0x000f 200 200 051 Pre-fail Always - 778

3 Spin_Up_Time 0x0003 186 176 021 Pre-fail Always - 7700

4 Start_Stop_Count 0x0032 100 100 000 Old_age Always - 865

5 Reallocated_Sector_Ct 0x0033 200 200 140 Pre-fail Always - 0

7 Seek_Error_Rate 0x000e 200 200 051 Old_age Always - 0

9 Power_On_Hours 0x0032 074 074 000 Old_age Always - 19412

10 Spin_Retry_Count 0x0012 100 100 051 Old_age Always - 0

11 Calibration_Retry_Count 0x0012 100 100 051 Old_age Always - 0

12 Power_Cycle_Count 0x0032 100 100 000 Old_age Always - 109

192 Power-Off_Retract_Count 0x0032 200 200 000 Old_age Always - 73

193 Load_Cycle_Count 0x0032 159 159 000 Old_age Always - 123891

194 Temperature_Celsius 0x0022 120 090 000 Old_age Always - 32

196 Reallocated_Event_Count 0x0032 200 200 000 Old_age Always - 0

197 Current_Pending_Sector 0x0012 200 200 000 Old_age Always - 0

198 Offline_Uncorrectable 0x0010 200 200 000 Old_age Offline - 0

199 UDMA_CRC_Error_Count 0x003e 200 200 000 Old_age Always - 0

200 Multi_Zone_Error_Rate 0x0008 200 197 051 Old_age Offline - 0

SMART Error Log Version: 1

ATA Error Count: 215 (device log contains only the most recent five errors)

CR = Command Register [HEX]

FR = Features Register [HEX]

SC = Sector Count Register [HEX]

SN = Sector Number Register [HEX]

CL = Cylinder Low Register [HEX]

CH = Cylinder High Register [HEX]

DH = Device/Head Register [HEX]

DC = Device Command Register [HEX]

ER = Error register [HEX]

ST = Status register [HEX]

Powered_Up_Time is measured from power on, and printed as

DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,

SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 215 occurred at disk power-on lifetime: 19081 hours (795 days + 1 hours)

When the command that caused the error occurred, the device was active or idle.

After command completion occurred, registers were:

ER ST SC SN CL CH DH

-- -- -- -- -- -- --

40 51 00 cf d0 18 e0 Error: UNC at LBA = 0x0018d0cf = 1626319

Commands leading to the command that caused the error were:

CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name

-- -- -- -- -- -- -- -- ---------------- --------------------

25 00 00 cf d0 18 00 00 3d+13:25:03.625 READ DMA EXT

25 00 00 cf d0 18 00 00 3d+13:24:54.048 READ DMA EXT

25 00 00 cf d0 18 00 00 3d+13:24:41.706 READ DMA EXT

25 00 00 cf ad 18 00 00 3d+13:24:37.668 READ DMA EXT

25 00 00 cf d0 18 00 00 3d+13:24:26.105 READ DMA EXT

Error 214 occurred at disk power-on lifetime: 19081 hours (795 days + 1 hours)

When the command that caused the error occurred, the device was active or idle.

After command completion occurred, registers were:

ER ST SC SN CL CH DH

-- -- -- -- -- -- --

40 51 00 cf d0 18 e0 Error: UNC at LBA = 0x0018d0cf = 1626319

Commands leading to the command that caused the error were:

CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name

-- -- -- -- -- -- -- -- ---------------- --------------------

25 00 00 cf d0 18 00 00 3d+13:24:54.048 READ DMA EXT

João Cardoso

Mar 6, 2015, 10:49:54 AM

to al...@googlegroups.com

On Thursday, March 5, 2015 at 1:47:19 PM UTC, Richard Lehun wrote:

Hi all,

After the external USB drive stopped working, the RAID degraded. I turned everything off and on. The USB drive's light is back on, but the 323's drive lights are both amber. How can I tell Alt-F to try to re-establish the RAID 5 with the USB drive? Below are excerpts from Status and a short test of the drive.

The drive seems to be reasonably "old": 19412 power-on hours (about 26 months of continuous operation), and the SMART errors occurred recently (the last one was logged at 19081 hours).
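
As a rough check of those figures, here is a small sketch of the arithmetic. The hour counts are copied from the smartctl output in the original post; the 30.44-day average month is only my assumption for the conversion.

```python
# Values copied from the smartctl output in the original post.
POWER_ON_HOURS = 19412      # attribute 9, Power_On_Hours
LAST_ERROR_HOURS = 19081    # "Error 215 occurred at disk power-on lifetime"

HOURS_PER_DAY = 24
DAYS_PER_MONTH = 30.44      # assumption: average calendar month


def hours_to_months(hours: float) -> float:
    """Convert powered-on hours to months of continuous operation."""
    return hours / HOURS_PER_DAY / DAYS_PER_MONTH


age_months = hours_to_months(POWER_ON_HOURS)
hours_since_error = POWER_ON_HOURS - LAST_ERROR_HOURS
days_since_error = hours_since_error / HOURS_PER_DAY

print(f"Drive age: {age_months:.1f} months of power-on time")
print(f"Last logged error: {hours_since_error} power-on hours ago "
      f"(about {days_since_error:.1f} days)")
```

Note that these are power-on hours, not calendar time, so the errors may be older on the calendar if the drive is often powered off.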

You need to be sure that the drive errors were due to some glitch and are not permanent or a sign that the drive is starting to fail. Run a long SMART test and enable the SMART service (Services->System), so that the drive health is checked periodically.
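
To illustrate what to look at before and after the long test, here is a minimal sketch that reads a few attribute lines copied from the smartctl output above and lists any with a non-zero raw value. The choice of these four attributes is my own (they are commonly watched for failing sectors or cabling problems), and the function names are made up for this example. Zero counts are encouraging, but they do not replace the long test, since the error log still shows UNC read errors.

```python
from typing import Dict, List

# Attribute lines copied from the smartctl output in the original post.
SMART_LINES = """\
5 Reallocated_Sector_Ct 0x0033 200 200 140 Pre-fail Always - 0
197 Current_Pending_Sector 0x0012 200 200 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0010 200 200 000 Old_age Offline - 0
199 UDMA_CRC_Error_Count 0x003e 200 200 000 Old_age Always - 0
"""


def parse_attributes(text: str) -> Dict[str, int]:
    """Map attribute name -> raw value for each smartctl attribute line."""
    attrs: Dict[str, int] = {}
    for line in text.splitlines():
        fields = line.split()
        # Columns: ID NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW
        if len(fields) >= 10 and fields[0].isdigit():
            attrs[fields[1]] = int(fields[9])
    return attrs


def nonzero_attributes(attrs: Dict[str, int]) -> List[str]:
    """Return the names of attributes whose raw value is not zero."""
    return [name for name, raw in attrs.items() if raw != 0]


attrs = parse_attributes(SMART_LINES)
suspicious = nonzero_attributes(attrs)
if suspicious:
    print("Non-zero raw values:", ", ".join(suspicious))
else:
    print("All watched attributes are zero")
```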

Do you always power the box and the USB drive on and off in the same sequence? That is important, to avoid a lengthy resync on each power-on.

To add the drive to the RAID again, you should add the drive's RAID partition to the array as a component:

Disk->RAID, RAID Maintenance section, under "Component Operations", under "Partition" select the USB drive RAID partition, then select "Add" under "Operation".

That will probably fail, as you have been using the RAID in the degraded state. If that happens, you first have to use "Clear" under "Operation", and afterwards "Add". This will initiate a lengthy (tens of hours) rebuild. Be sure that the USB drive is really OK (long SMART test) before doing that, because if it is failing you will have to add a new drive and another rebuild will happen. RAID5 is particularly susceptible to data loss if a second drive fails during a rebuild, so you want to minimize the number of rebuilds.

If you have to replace the drive, the procedure is similar (not identical) to the one described in the Degraded RAID1 wiki. In short, you have to partition the new drive with a partition of type RAID with the same size as each RAID partition of the existing RAID components (use the Disk->Partitioner for that) and "Add" it to the array.

To clarify "with the same size as each RAID partition of the existing RAID components": under the Disk Partitioner, the RAID partition must have the same size on all disks.
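
A small sketch of why the sizes matter: the usable capacity of a RAID 5 array is (number of members - 1) times the smallest member. The 931.0GB per-partition figure below is my assumption, derived from the 1862.0GB reported for md0 with three members; check the real partition sizes in the Disk Partitioner.

```python
from typing import List


def raid5_usable_capacity(member_sizes_gb: List[float]) -> float:
    """Usable RAID 5 capacity: (members - 1) * smallest member."""
    if len(member_sizes_gb) < 2:
        # The formula needs at least two members to make sense.
        raise ValueError("need at least two RAID 5 members")
    return (len(member_sizes_gb) - 1) * min(member_sizes_gb)


# Assumption: three equal 931.0GB RAID partitions (1862.0GB / 2).
equal_members = raid5_usable_capacity([931.0, 931.0, 931.0])

# Hypothetical mismatch: one partition is only 900.0GB.
mismatched_members = raid5_usable_capacity([931.0, 931.0, 900.0])

print(f"Equal partitions:      {equal_members:.1f}GB")
print(f"One smaller partition: {mismatched_members:.1f}GB")
```

The smallest partition sets the capacity, so a replacement partition smaller than the existing components cannot hold its share of the existing array.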
