fixing error in disk on RAID 1


Pornchai R.

Apr 26, 2013, 11:54:21 PM
to al...@googlegroups.com
Hi Joao,
First of all, I love the firmware and your quick support.

I'm starting to try more features in the firmware, and recently I used Disk Utilities to check my disks.

After using "start short test" to test a hard disk, there are errors, as shown below.

I think there are bad sectors, and I would like to fix them, but I don't know how to do it. Could I use "ForceFix" in Filesystem Maintenance against md0?

I've tried "Verify" and "Repair" in Raid.
 
smartctl 6.0 2012-10-10 r3643 [armv5tel-linux-2.6.35.14] (local build)
Copyright (C) 2002-12, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Western Digital Caviar Blue Serial ATA
Device Model:     WDC WD10EALX-009BA0
Serial Number:    WD-WCATR5128813
LU WWN Device Id: 5 0014ee 2b00b38f9
Firmware Version: 15.01H15
User Capacity:    1,000,204,886,016 bytes [1.00 TB]
Sector Size:      512 bytes logical/physical
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ATA8-ACS (minor revision not indicated)
SATA Version is:  SATA 3.0, 6.0 Gb/s (current: 3.0 Gb/s)
Local Time is:    Sat Apr 27 10:41:21 2013 ICT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0027   195   170   021    Pre-fail  Always       -       3225
  4 Start_Stop_Count        0x0032   099   099   000    Old_age   Always       -       1049
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   096   096   000    Old_age   Always       -       3364
 10 Spin_Retry_Count        0x0032   100   100   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   100   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       209
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       186
193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always       -       862
194 Temperature_Celsius     0x0022   097   094   000    Old_age   Always       -       50
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       5
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       5
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       5

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                   Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed: read failure        90%       3364         1790849419
# 2  Short offline       Completed: read failure        90%       3325         1790849419
# 3  Extended offline    Completed: read failure        90%       3254         1790849419
# 4  Short offline       Completed: read failure        90%       3254         1790849419
# 5  Short offline       Completed: read failure        90%       3232         1790848116

My NAS information
Disks
Bay    Dev.  Model                Capacity   Power Status    Temp          Health
right  sda   WDC WD10EALX-009BA0  1000.2 GB  active or idle  49°C/120.2°F  passed
left   sdb   WDC WD10EALX-009BA0  1000.2 GB  active or idle  50°C/122°F    passed
usb    sdc   STORAGE DEVICE       8.2 GB     --              --            --
RAID
Dev. Capacity   Level  State  Status  Action  Done  ETA
md0  931.0 GB   raid1  clean    OK     idle

João Cardoso

Apr 27, 2013, 10:42:48 AM
to al...@googlegroups.com


On Saturday, April 27, 2013 4:54:21 AM UTC+1, Pornchai R. wrote:
Hi Joao,
First of all, I love the firmware and your quick support.

I'm starting to try more features in the firmware, and recently I used Disk Utilities to check my disks.

After using "start short test" to test a hard disk, there are errors, as shown below.

I think there are bad sectors,

Probably yes, although "read failure" might mean something else; I don't know.
The error occurs very early in the test (90% remaining) and always in roughly the same area, around sector 1790849419.
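
To see how close together the reported sectors are, the LBA values from the self-test log can be converted to byte offsets. This is only a rough sketch in Python: it uses the sector size (512 bytes) and capacity printed by smartctl above, and the function names are made up for illustration. It gives a position on the raw disk only; mapping a sector to a file would also need the partition layout, which is not shown in this thread.

```python
SECTOR_SIZE = 512                   # bytes, from "Sector Size" in the smartctl output
CAPACITY_BYTES = 1_000_204_886_016  # from "User Capacity" in the smartctl output

# LBA_of_first_error values reported in the self-test log
FAILED_LBAS = [1790849419, 1790848116]


def lba_to_byte_offset(lba, sector_size=SECTOR_SIZE):
    """Byte offset of a sector, counted from the start of the raw disk."""
    return lba * sector_size


def position_fraction(lba, capacity_bytes=CAPACITY_BYTES, sector_size=SECTOR_SIZE):
    """How far into the disk the sector lies, as a fraction between 0 and 1."""
    return lba_to_byte_offset(lba, sector_size) / capacity_bytes


def lba_spread(lbas):
    """Distance in sectors between the lowest and highest failing LBA."""
    return max(lbas) - min(lbas)


if __name__ == "__main__":
    for lba in FAILED_LBAS:
        print(lba, lba_to_byte_offset(lba), f"{position_fraction(lba):.1%}")
    print("spread in sectors:", lba_spread(FAILED_LBAS))
```

The two LBAs are 1303 sectors apart, which fits the observation that the failures sit in the same area of the disk.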
 
and I would like to fix them, but I don't know how to do it. Could I use "ForceFix" in Filesystem Maintenance against md0?

No, SMART works at a very, very low level.
SMART reveals errors or tendencies that are not correctable, indicating a disk failure in the near future. How near, nobody knows.
 

I've tried "Verify" and "Repair" in Raid.

No user-level repair is possible, at least that I'm aware of.

Bad sectors should be handled by the disk itself, which keeps a list of bad sectors and remaps them to good sectors. This should happen automatically until the list of bad sectors is full.

But this does not seem to be happening, as "5 Reallocated_Sector_Ct   0x0033   200   200   140" is still far from the threshold and its raw value is 0 (low values in the VALUE and WORST columns, approaching the THRESH value, indicate a possible issue).
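
The VALUE/WORST/THRESH comparison described above can be sketched in a few lines of Python. This is only an illustration: it assumes the ten-column attribute layout shown in the smartctl output in this thread, and `SmartAttribute`, `parse_attribute` and `margin` are hypothetical names, not part of smartmontools or Alt-F. Raw values are vendor specific, so the raw counts are shown without interpretation.

```python
from typing import NamedTuple


class SmartAttribute(NamedTuple):
    attr_id: int
    name: str
    value: int   # normalized current value
    worst: int   # lowest normalized value seen so far
    thresh: int  # failure threshold for the normalized value
    raw: int     # vendor-specific raw counter


def parse_attribute(line: str) -> SmartAttribute:
    """Parse one attribute row with the columns
    ID# NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE."""
    f = line.split()
    return SmartAttribute(int(f[0]), f[1], int(f[3]), int(f[4]), int(f[5]), int(f[9]))


def margin(attr: SmartAttribute) -> int:
    """Distance between the normalized value and its threshold;
    a small or negative margin suggests a problem."""
    return attr.value - attr.thresh


if __name__ == "__main__":
    rows = [
        "  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0",
        "197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       5",
    ]
    for row in rows:
        attr = parse_attribute(row)
        print(attr.name, "margin:", margin(attr), "raw:", attr.raw)
```

With the values posted here, Reallocated_Sector_Ct has a margin of 60 and a raw count of 0, while Current_Pending_Sector has a raw count of 5.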

Another way to deal with bad sectors is at the filesystem level: running a program (badblocks) that scans the whole disk for bad sectors and then letting the filesystem avoid using them. There are no provisions in Alt-F to do this automatically, and the scan takes days.
It is not used much nowadays, as the drive should remap bad sectors automatically and the remap list is big enough to hold the bad sectors developed during the expected life of the disk.
Your disk has 3364 power-on hours, roughly 5 months of continuous usage, so it is almost new.
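
As a quick check of the age estimate, the Power_On_Hours raw value from the SMART output (3364) can be converted to days and months. A tiny Python sketch, assuming a 30-day month; the function name is invented for illustration.

```python
def power_on_age(hours, days_per_month=30):
    """Convert power-on hours to (days, months) of continuous use."""
    days = hours / 24
    return days, days / days_per_month


if __name__ == "__main__":
    days, months = power_on_age(3364)  # Power_On_Hours raw value from smartctl
    print(f"{days:.0f} days, about {months:.1f} months")
```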

Something that worries me is your disk temperature; 50°C is a high value. Is the fan running? At low, medium or high speed? Try to solve that first.

I would stop most services and retry the test.
While it is running, don't use the box; take note of the test's expected time to completion and only access the box after that. As the test is executed within the drive, by the disk firmware itself and not by Alt-F, you will not see any LED blinking.

Try googling for SMART together with your disk model and the error.

Good luck

 
 
smartctl 6.0 2012-10-10 r3643 [armv5tel-linux-2.6.35.14] (local build)
Copyright (C) 2002-12, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Western Digital Caviar Blue Serial ATA
Device Model:     WDC WD10EALX-009BA0
Serial Number:    WD-WCATR5128813
LU WWN Device Id: 5 0014ee 2b00b38f9
Firmware Version: 15.01H15
User Capacity:    1,000,204,886,016 bytes [1.00 TB]
Sector Size:      512 bytes logical/physical
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ATA8-ACS (minor revision not indicated)
SATA Version is:  SATA 3.0, 6.0 Gb/s (current: 3.0 Gb/s)
Local Time is:    Sat Apr 27 10:41:21 2013 ICT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0027   195   170   021    Pre-fail  Always       -       3225
  4 Start_Stop_Count        0x0032   099   099   000    Old_age   Always       -       1049
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0

There doesn't seem to be a problem here.
 

  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   096   096   000    Old_age   Always       -       3364
 10 Spin_Retry_Count        0x0032   100   100   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   100   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       209
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       186
193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always       -       862
194 Temperature_Celsius     0x0022   097   094   000    Old_age   Always       -       50

50°C? This seems excessive to me.

Pornchai R.

Apr 27, 2013, 11:25:14 AM
to al...@googlegroups.com
Thanks a lot for your prompt reply; your response makes me feel better.

About the high temperature, I think it's because it is summer here in Thailand and it's hot, 35°C now.

I solved the issue by making the High Fan Speed start earlier.
System Temperature / Fan Speed relationship
Low Temp. °C     Low Fan Speed RPM
High Temp. °C    High Fan Speed RPM


Now the temperature is 48.5°C and the fan speed is around 5200 RPM.

Cheers,
Pornchai

João Cardoso

Apr 27, 2013, 2:55:06 PM
to al...@googlegroups.com


On Saturday, April 27, 2013 4:25:14 PM UTC+1, Pornchai R. wrote:
Thanks a lot for your prompt reply; your response makes me feel better.

I'm not so sure about that, as you have several bad blocks (at least 2).
This means that some file(s) have corrupted data, and when you try to read or write them you will get an error, possibly not being able to read the file at all.

You might live with that, or not. I would recommend that you try to fix the bad blocks, but that is a somewhat advanced topic:


You might also set up regular SMART tests, System->Services->smart->Configure/Start.
The problem is that all future tests will stop as soon as the first error shows up.
So you have to fix the bad blocks or access the files that contain them, which will trigger the drive's remapping feature.
As you don't know which file(s) contain bad blocks, the only easy solution is to access all of them.
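
"Accessing all files" could be sketched like this in Python: walk a directory tree, read every regular file to the end, and collect the paths that raise an I/O error. This is only an illustration and assumes a Python interpreter is available on the box; `read_all_files` and the `/mnt/md0` path are hypothetical. On a RAID 1 array a read may be served by either disk, and recently used files may come from the cache, so a clean run does not prove the bad sectors were touched.

```python
import os


def read_all_files(root, chunk_size=1024 * 1024):
    """Read every regular file under root and return a list of
    (path, error message) for files that could not be read."""
    failures = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            # Skip symlinks, sockets, device nodes and other special files
            if os.path.islink(path) or not os.path.isfile(path):
                continue
            try:
                with open(path, "rb") as fh:
                    while fh.read(chunk_size):
                        pass  # discard the data; only the read itself matters
            except OSError as exc:
                failures.append((path, str(exc)))
    return failures


if __name__ == "__main__":
    # Hypothetical mount point; replace with where md0 is actually mounted
    for path, error in read_all_files("/mnt/md0"):
        print("read error:", path, error)
```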

 

About the high temperature, I think it's because it is summer here in Thailand and it's hot, 35°C now.

I solved the issue by making the High Fan Speed start earlier.
System Temperature / Fan Speed relationship
Low Temp. °C     Low Fan Speed RPM
High Temp. °C    High Fan Speed RPM


Now the temperature is 48.5°C and the fan speed is around 5200 RPM.

According to the drive datasheet, the maximum temperature is 60°C, so you are "safe".

The 1.5°C decrease in temperature probably does not justify the noise and fan wear of the high fan speed; only you can tell.