fixing error in disk on RAID 1

Pornchai R.

Apr 26, 2013, 11:54:21 PM

to al...@googlegroups.com

Hi Joao,

First of all, I love the firmware and your quick support.

I have started trying more features in the firmware, and recently I used Disk Utilities to check my disks.

After using "start short test" to test a hard disk, there were errors, as shown below.

I think there are bad sectors, and I would like to fix them, but I don't know how to do it. Could I use "ForceFix" in Filesystem Maintenance against md0?

I've tried "Verify" and "Repair" in RAID.


smartctl 6.0 2012-10-10 r3643 [armv5tel-linux-2.6.35.14] (local build)
Copyright (C) 2002-12, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Western Digital Caviar Blue Serial ATA
Device Model:     WDC WD10EALX-009BA0
Serial Number:    WD-WCATR5128813
LU WWN Device Id: 5 0014ee 2b00b38f9
Firmware Version: 15.01H15
User Capacity:    1,000,204,886,016 bytes [1.00 TB]
Sector Size:      512 bytes logical/physical
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ATA8-ACS (minor revision not indicated)
SATA Version is:  SATA 3.0, 6.0 Gb/s (current: 3.0 Gb/s)
Local Time is:    Sat Apr 27 10:41:21 2013 ICT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0027   195   170   021    Pre-fail  Always       -       3225
  4 Start_Stop_Count        0x0032   099   099   000    Old_age   Always       -       1049
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   096   096   000    Old_age   Always       -       3364
 10 Spin_Retry_Count        0x0032   100   100   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   100   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       209
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       186
193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always       -       862
194 Temperature_Celsius     0x0022   097   094   000    Old_age   Always       -       50
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       5
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       5
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       5

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed: read failure       90%      3364         1790849419
# 2  Short offline       Completed: read failure       90%      3325         1790849419
# 3  Extended offline    Completed: read failure       90%      3254         1790849419
# 4  Short offline       Completed: read failure       90%      3254         1790849419
# 5  Short offline       Completed: read failure       90%      3232         1790848116

My NAS information

Disks

Bay	Dev.	Model	Capacity	Power Status	Temp	Health
right	sda	WDC WD10EALX-009BA0	1000.2 GB	active or idle	49°C/120.2°F	passed
left	sdb	WDC WD10EALX-009BA0	1000.2 GB	active or idle	50°C/122°F	passed
usb	sdc	STORAGE DEVICE	8.2 GB	--	--	--

RAID

Dev.	Capacity	Level	State	Status	Action	Done	ETA
md0	931.0 GB	raid1	clean	OK	idle

João Cardoso

Apr 27, 2013, 10:42:48 AM

to al...@googlegroups.com

On Saturday, April 27, 2013 4:54:21 AM UTC+1, Pornchai R. wrote:

Hi Joao,
First of all, I love the firmware and your quick support.

I have started trying more features in the firmware, and recently I used Disk Utilities to check my disks.

After using "start short test" to test a hard disk, there were errors, as shown below.

I think there are bad sectors,

Probably yes, although "read failure" might mean something else; I don't know.

The error occurs very early in the test and always in about the same area, around sector 1790849419.
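
To see where the failing sectors sit on the disk, the numbers from the smartctl report can be combined. This is a minimal Python sketch, assuming only the values printed in the report (user capacity, 512-byte sectors, and the two LBAs from the self-test log); the function names are illustrative and not part of smartmontools or Alt-F:

```python
# Values copied from the smartctl report in this thread.
CAPACITY_BYTES = 1_000_204_886_016        # "User Capacity"
SECTOR_BYTES = 512                        # "Sector Size"
FAILING_LBAS = [1790849419, 1790848116]   # "LBA_of_first_error" column


def total_sectors(capacity_bytes, sector_bytes):
    """Number of addressable sectors on the drive."""
    return capacity_bytes // sector_bytes


def position_percent(lba, sectors):
    """How far into the disk an LBA lies, as a percentage."""
    return 100.0 * lba / sectors


if __name__ == "__main__":
    sectors = total_sectors(CAPACITY_BYTES, SECTOR_BYTES)
    for lba in FAILING_LBAS:
        print(f"LBA {lba}: {position_percent(lba, sectors):.1f}% into the disk")
    # Distance between the two reported LBAs, in sectors and KiB.
    spread = max(FAILING_LBAS) - min(FAILING_LBAS)
    print(f"Spread: {spread} sectors ({spread * SECTOR_BYTES / 1024:.1f} KiB)")
```

The two reported LBAs are 1303 sectors apart, which fits the observation that the failures cluster in one small region of the disk.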

and I would like to fix them, but I don't know how to do it. Could I use "ForceFix" in Filesystem Maintenance against md0?

No, SMART works at a very, very low level.

SMART reveals errors or tendencies that are not correctable, indicating a disk failure in the near future. How near, nobody knows.

I've tried "Verify" and "Repair" in RAID.

No user-level repair is possible, at least none that I'm aware of.

Bad sectors should be handled by the disk itself, storing a bad sectors list and remapping them to good sectors. This should happen automatically until the list of bad sectors is full.

But this does not seem to be happening: "5 Reallocated_Sector_Ct 0x0033 200 200 140" is still far from the threshold, and its raw value is 0 (low values in the VALUE and WORST columns, approaching the THRESH value, indicate a possible issue).
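
The comparison between VALUE and THRESH can be sketched in Python. This is a minimal sketch, assuming the row layout printed by smartctl in this thread (ten whitespace-separated columns); `parse_attributes` and `margin` are illustrative names, not part of smartmontools or Alt-F:

```python
# Three attribute rows copied from the smartctl report in this thread.
REPORT = """\
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       5
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       5
"""


def parse_attributes(text):
    """Turn smartctl attribute rows into dictionaries keyed by column."""
    rows = []
    for line in text.splitlines():
        fields = line.split()
        # A data row has at least 10 columns and starts with the numeric ID.
        if len(fields) < 10 or not fields[0].isdigit():
            continue
        rows.append({
            "id": int(fields[0]),
            "name": fields[1],
            "value": int(fields[3]),
            "worst": int(fields[4]),
            "thresh": int(fields[5]),
            "raw": fields[9],
        })
    return rows


def margin(attr):
    """Distance between the normalized VALUE and its THRESH."""
    return attr["value"] - attr["thresh"]


if __name__ == "__main__":
    for attr in parse_attributes(REPORT):
        print(f'{attr["name"]}: VALUE {attr["value"]}, THRESH {attr["thresh"]}, '
              f'margin {margin(attr)}, raw {attr["raw"]}')
```

For this disk, Reallocated_Sector_Ct has a margin of 60 and a raw count of 0, while Current_Pending_Sector and Offline_Uncorrectable both show a raw count of 5.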

Another way to deal with bad sectors is at the filesystem level: running a program (badblocks) that scans the whole disk for bad sectors and then letting the filesystem avoid using them. There are no provisions in Alt-F to do this automatically, and the scan can take days.

It is not used much nowadays, as the drive should remap bad sectors automatically and the remap list is big enough to hold the bad sectors developed during the expected life of the disk.

Your disk has 3364 power-on hours (attribute 9), about 5 months of continuous usage, so it is almost new.
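
As a quick check of that estimate, the raw value of attribute 9 (Power_On_Hours) in the report, 3364, can be converted to days and months. A small Python sketch, assuming an average month of 30.44 days; the helper names are illustrative:

```python
HOURS_PER_DAY = 24
DAYS_PER_MONTH = 30.44  # average month length, an approximation


def hours_to_days(hours):
    """Convert power-on hours to days of continuous operation."""
    return hours / HOURS_PER_DAY


def hours_to_months(hours):
    """Convert power-on hours to approximate months of continuous operation."""
    return hours_to_days(hours) / DAYS_PER_MONTH


if __name__ == "__main__":
    power_on_hours = 3364  # raw value of attribute 9 in the report
    print(f"{hours_to_days(power_on_hours):.0f} days, "
          f"about {hours_to_months(power_on_hours):.1f} months")
```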

Something that worries me is your disk temperature: 50ºC is high. Is the fan running? At low, medium, or high speed? Try to solve that first.

I would stop most services and retry the test.

While it is running, don't use the box. Take note of the test's expected time to completion and only access the box after that. As the test is executed within the drive, by the disk firmware itself and not by Alt-F, you will not see any LED blinking.

Try googling for SMART together with your disk model and the error.

Good luck


smartctl 6.0 2012-10-10 r3643 [armv5tel-linux-2.6.35.14] (local build)
Copyright (C) 2002-12, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Western Digital Caviar Blue Serial ATA
Device Model:     WDC WD10EALX-009BA0
Serial Number:    WD-WCATR5128813
LU WWN Device Id: 5 0014ee 2b00b38f9
Firmware Version: 15.01H15
User Capacity:    1,000,204,886,016 bytes [1.00 TB]
Sector Size:      512 bytes logical/physical
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ATA8-ACS (minor revision not indicated)
SATA Version is:  SATA 3.0, 6.0 Gb/s (current: 3.0 Gb/s)
Local Time is:    Sat Apr 27 10:41:21 2013 ICT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0027   195   170   021    Pre-fail  Always       -       3225
  4 Start_Stop_Count        0x0032   099   099   000    Old_age   Always       -       104

  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0

There don't seem to be any problems here.


  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   096   096   000    Old_age   Always       -       3364
 10 Spin_Retry_Count        0x0032   100   100   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   100   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       209
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       186
193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always       -       862
194 Temperature_Celsius     0x0022   097   094   000    Old_age   Always       -       50

50ºC? This seems excessive to me.

Pornchai R.

Apr 27, 2013, 11:25:14 AM

to al...@googlegroups.com

Thanks a lot for your prompt reply; your response makes me feel better.

About the high temperature: I think it's because it is summer here in Thailand, and it's hot, 35 °C now.

I solved the issue by making High Fan Speed start earlier.

System Temperature / Fan Speed relationship

Low Temp.	°C		Low Fan Speed	RPM
High Temp.	°C		High Fan Speed	RPM

Now the temperature is 48.5 °C and the fan speed is around 5200 RPM.

Cheers,

Pornchai

João Cardoso

Apr 27, 2013, 2:55:06 PM

to al...@googlegroups.com

On Saturday, April 27, 2013 4:25:14 PM UTC+1, Pornchai R. wrote:

Thanks a lot for your prompt reply; your response makes me feel better.

I'm not so sure about that, as you have several bad blocks (at least 2).

This means that some file(s) have corrupted data, and when you try to read or write them you will get an error, possibly not being able to read the file at all.

You might live with that, or not. I would recommend that you try to fix the bad blocks, but that is a somewhat advanced topic:

http://smartmontools.sourceforge.net/badblockhowto.html

You might also set up regular SMART tests, System->Services->smart->Configure/Start.

The problem is that all future tests will stop as soon as the first error shows up.

So you have to fix the bad blocks, or access the files that contain them, which will trigger the drive's remapping feature.

As you don't know which file(s) contain bad blocks, the only easy solution is to access all of them.
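
One way to access every file is a small script that reads each one from start to end and lists those that fail. This is a minimal Python sketch, assuming Python is available on the box (or that the share is mounted on another machine); the default path `/mnt/md0` and the name `read_all_files` are hypothetical and must be adapted:

```python
import os
import sys


def read_all_files(root, chunk_size=1024 * 1024):
    """Read every regular file under root; return (path, error) for failures."""
    failures = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            # Skip symlinks and anything that is not a regular file.
            if os.path.islink(path) or not os.path.isfile(path):
                continue
            try:
                with open(path, "rb") as handle:
                    # Read in chunks so large files do not fill memory.
                    while handle.read(chunk_size):
                        pass
            except OSError as exc:
                failures.append((path, str(exc)))
    return failures


if __name__ == "__main__":
    # Hypothetical default; pass the real mount point as the first argument.
    root = sys.argv[1] if len(sys.argv) > 1 else "/mnt/md0"
    for path, error in read_all_files(root):
        print(f"READ ERROR: {path}: {error}")
```

On a RAID 1 array a read may be served by the healthy mirror or from cache, so a run without errors does not prove that the bad sectors were actually touched.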

About the high temperature: I think it's because it is summer here in Thailand, and it's hot, 35 °C now.

I solved the issue by making High Fan Speed start earlier.
System Temperature / Fan Speed relationship
Low Temp. °C Low Fan Speed RPM
High Temp. °C High Fan Speed RPM

Now the temperature is 48.5 °C and the fan speed is around 5200 RPM.

According to the drive datasheet, max temp is 60ºC, so you are "safe".

http://www.wdc.com/wdproducts/library/SpecSheet/ENG/2879-771436.pdf

The 1.5ºC decrease in temperature probably does not justify the noise or wear of running the fan at high speed, but only you can tell.
