It is not resyncing... yet the status page says it is? And shows an ETA and a
completion bar? That is moving? Odd...
Green means that it is a spare drive, not in use, but ready to be.
When a device is in red it means that it has failed.
Black means that it is in use.
For some reason /dev/sda2 (possibly the left disk) was removed from the array
and became a spare. This should not have happened.
Has anybody else had a similar problem? Have you ejected the left disk
(Disk->Utils)?
*IF* all your data on the degraded array is still available:
You should go to Disk->RAID and in the 'RAID Maintenance' section,
Under 'Component Operations', Partition, select sdb2 (the spare), then
Under 'Component Operations', Operations, select 'add'
Notice: this is the standard way of adding a spare drive when one disk fails.
I can't fully diagnose your present situation. I'm just telling you what *I*
would do. You might prefer to just reboot and go to dlink firmware.
Now, a lengthy rebuild will start.
To avoid the lengthy rebuild in the future in this situation, add a write-
intent bitmap to the array when the rebuild finishes (see my previous post).
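For reference, the GUI 'add' above and the bitmap advice boil down to two mdadm commands. This is a sketch only: the device names (/dev/md0, /dev/sdb2) are assumptions taken from this thread, and the script prints the commands instead of running them; execute them yourself as root on the box after checking your own device names.

```shell
#!/bin/sh
# Sketch: mdadm equivalents of the GUI steps. Device names are
# assumptions from this thread -- check yours before running anything.
MD=/dev/md0
SPARE=/dev/sdb2

# What the GUI 'add' does: rejoin the spare, a rebuild then starts.
add_cmd="mdadm $MD --add $SPARE"
# After the rebuild finishes: add an internal write-intent bitmap,
# so future re-adds resync only the regions written in the meantime.
bitmap_cmd="mdadm --grow --bitmap=internal $MD"

echo "$add_cmd"
echo "$bitmap_cmd"
```

With the bitmap in place, a disk that drops out and comes back only resyncs the blocks that changed while it was out, instead of the whole partition.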
I have already done this on a dlink-created raid1 array and, on return to the
dlink firmware, the raid array was successfully assembled.
Just for reference, compare your results with mine:
# mdadm --detail /dev/md0
/dev/md0:
Version : 0.90
Creation Time : Wed Mar 30 15:26:23 2011
Raid Level : raid1
Array Size : 38085824 (36.32 GiB 39.00 GB)
Used Dev Size : 38085824 (36.32 GiB 39.00 GB)
Raid Devices : 2
Total Devices : 2
Preferred Minor : 0
Persistence : Superblock is persistent
Intent Bitmap : Internal
Update Time : Wed Mar 30 19:28:14 2011
State : active
Active Devices : 2
Working Devices : 2
Failed Devices : 0
Spare Devices : 0
UUID : 36b27a78:6ece8ec8:d026b8df:cc91afe2 (local to host
nas.homenet)
Events : 0.222
Number Major Minor RaidDevice State
0 8 3 0 active sync /dev/sda3
1 8 19 1 active sync /dev/sdb3
And still degraded? With sdb2 in green?
The command output you posted was captured after the resync finished, right?
> After that, with
> both disk plugged in, I rebooted in Alt-F, and waited again for 4
> hours for resyncing. It's where we are there now.
What I don't understand is how the resync finished (or rather, started) with
only one disk.
As per the command output you posted:
Active Devices : 1
Working Devices : 2
Failed Devices : 0
Spare Devices : 1
> > *IF* all your data on the degraded array is still available:
> It seems to be there, because I did not (and could not) make changes
> to files on the disks.
Yes, you could. The RAID is available/writable even while rebuilding/resyncing,
although slower.
> > You should go to Disk->RAID and in the 'RAID Maintenance' section,
> >
> > Under 'Component Operations', Partition, select sdb2 (the spare), then
> > Under 'Component Operations', Operations, select 'add'
> >
> > Notice: this is the standard way of adding a spare drive when one disk
> > fails. I can't fully diagnose your present situation. I'm just telling
> > you what *I* would do. You might prefer to just reboot and go to dlink
> > firmware.
> >
> > Now, a lengthy rebuild will start.
>
> OK, I'll try.
> I got an error message:
> "Adding the sdb2 partition to the md0 RAID device failed:
> mdadm: Cannot open /dev/sdb2: Device or resource busy"
-unmount the filesystem: Disk->Filesystem->FS Operations, select 'unmount'
(if it says only 'mount', then it is already un-mounted, skip)
-stop the array: Disk->RAID->Raid Maintenance, under 'Array' hit "Stop"
It now shows two devices under 'Components', sda2 and sdb2, right? Any in
green?
-start the array: Disk->RAID->Raid Maintenance, under Array, hit "Start"
Working now? Still degraded? Any component in green?
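For reference, a sketch of the command-line equivalents of those buttons. The device and partition names are assumptions from this thread, and the script only prints the commands:

```shell
#!/bin/sh
# Sketch: CLI equivalents of the unmount/stop/start GUI buttons.
# Device names assumed from this thread; printed, not executed.
MD=/dev/md0
stop_seq="umount $MD && mdadm --stop $MD"
start_cmd="mdadm --assemble $MD /dev/sda2 /dev/sdb2"
echo "$stop_seq"    # filesystem off, then array stopped
echo "$start_cmd"   # re-assemble the array from both partitions
```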
I tried to reproduce what happened to you, but couldn't:
-eject left disk (Disk->Utils)
the array enters the degraded state, only sda3 under components
physically ejected left disk
wait a few seconds
physically inserted left disk
still degraded, but no sd* in green; only sda3 appeared
-add left disk to array (Disk->Raid)
the array left the degraded state. No rebuild/resync (I have an intent bitmap)
By now you must be understanding how raid1 works. It has two or more device
components (disk partitions), and each component can be in use, failed,
removed or spare.
The array enters the degraded state when one of the two active components
fails and/or is removed. If a spare is available at that moment, rebuilding
automatically starts on it. If no spare is available, one must make a partition
of equal size on a new disk, insert the disk and add the new partition to
the array.
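All of this can be read straight from /proc/mdstat: "[2/1]" means two devices expected, one working, and "[U_]" marks the missing mirror half. A small sketch that extracts the state from such a line (the sample line is copied from the outputs in this thread):

```shell
#!/bin/sh
# Sketch: decide degraded/ok from an mdstat-style status line.
# Sample line taken from this thread's /proc/mdstat output.
line="38085824 blocks [2/1] [U_]"
expected=$(echo "$line" | sed 's/.*\[\([0-9]*\)\/[0-9]*\].*/\1/')  # devices wanted
working=$(echo "$line" | sed 's/.*\[[0-9]*\/\([0-9]*\)\].*/\1/')   # devices in sync
if [ "$working" -lt "$expected" ]; then
    state=degraded
else
    state=ok
fi
echo "$state"
```

For the sample line above this prints "degraded".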
Perhaps all this was caused because dlink firmware needs a reboot in order to
complete the resync? It was not Alt-F that put one component as spare.
I managed to reproduce your exact situation (*):
Status page:
=============
md0   36.3 GB   raid1   clean   degraded   recover   35%   15.4min
Raid page:
=======
md0   36.3 GB   raid1   sda3 sdb3   recovering
(sdb3 is in green)
/proc/mdstat:
=========
md0 : active raid1 sdb3[2] sda3[0]
38085824 blocks [2/1] [U_]
[=========>...........] recovery = 47.1% (17968896/38085824) finish=8.9min speed=37330K/sec
bitmap: 0/1 pages [0KB], 65536KB chunk
unused devices: <none>
mdadm --examine
=============
this 2 8 19 2 spare /dev/sdb3
0 0 8 3 0 active sync /dev/sda3
1 1 0 0 1 faulty removed
2 2 8 19 2 spare /dev/sdb3
mdadm --detail
===========
State : active, degraded, recovering
Active Devices : 1
Working Devices : 2
Failed Devices : 0
Spare Devices : 1
...
Number Major Minor RaidDevice State
0 8 3 0 active sync /dev/sda3
2 8 19 1 spare rebuilding /dev/sdb3
I'm now waiting for the recovery to finish.
It finished and went to the OK status; no component appears in green (spare).
So I expect the same to happen to you.
It remains to be explained how the spare component appeared in the first place. As I said, Alt-F doesn't do that except when creating an array with more disks than necessary. Also, recovery should start immediately when an array becomes degraded and a spare is available.
(*) How I reproduced your situation: (don't do this at home :-)
fail, then remove, then --zero-superblock, then add; all to sdb3.
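Spelled out as commands, in case it helps follow along. This is a sketch using sdb3 as in my test (sdb2 in your case); the zero-superblock step destroys the raid metadata on that partition, so the script only prints the sequence instead of running it:

```shell
#!/bin/sh
# Sketch: the fail / remove / zero-superblock / add sequence used to
# reproduce the situation. DESTRUCTIVE if actually run -- printed only.
MD=/dev/md0
PART=/dev/sdb3
for step in \
    "mdadm $MD --fail $PART" \
    "mdadm $MD --remove $PART" \
    "mdadm --zero-superblock $PART" \
    "mdadm $MD --add $PART"
do
    echo "$step"
done
```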
What does the kernel log say at the end of the recovery?
As I said in my previous post, I was able to reproduce (through manipulation) your situation, and after the recovery finished my array went to the OK status.
My kernel log shows no errors at the end:
md/raid1:md0: active with 1 out of 2 mirrors
...
RAID1 conf printout:
--- wd:1 rd:2
disk 0, wo:0, o:1, dev:sda3
disk 1, wo:1, o:1, dev:sdb3
md: recovery of RAID array md0
...
md: md0: recovery done.
RAID1 conf printout:
--- wd:2 rd:2
disk 0, wo:0, o:1, dev:sda3
disk 1, wo:0, o:1, dev:sdb3
I was able to reproduce your situation by removing the RAID information (metadata) from sdb3 (sdb2 in your case); this is the same as adding a pristine partition to the raid array, see Topic "Windows cannot see my dns-323 with Alt-F firmware" after post #8.
I'm almost sure that your array's current situation is a consequence of an incomplete recovery when using the dlink firmware, but I don't really want to play with your data remotely.
> (*) How I reproduced your situation: (don't do this at home :-)
> fail, then remove, then --zero-superblock, then add; all to sdb3.
If you want more detailed instructions about this procedure, which removes the
raid information from your sdb2 partition, please say so.
It is the same as replacing your left disk (sdb) with a brand new disk.
How is the health of your disks? Disk->Utilities->Health->Show Status
I recommend doing a long SMART test, Disk->Utilities->Health->"Start long
test". It can take a while to complete, it is done in the background, the
disks should still be usable.
You can start testing both disks simultaneously. Don't check the log until the
indicated time elapses.
Perhaps you should disable the spindown timeout (enter 0 and submit) before
starting the SMART test.
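If you prefer the command line over the Health page, the same can be done with smartctl (these are standard smartmontools flags; again a sketch that only prints the commands, run them yourself as root):

```shell
#!/bin/sh
# Sketch: smartctl equivalents of the Health page. Standard
# smartmontools flags; commands printed, not executed.
DISK=/dev/sda
start_cmd="smartctl -t long $DISK"     # start the long self-test
log_cmd="smartctl -l selftest $DISK"   # read the results when done
health_cmd="smartctl -H $DISK"         # overall pass/fail verdict
echo "$start_cmd"
echo "$log_cmd"
echo "$health_cmd"
```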
The kernel log shows that for sda, which usually is your right disk (please
confirm), there are many errors. Just two:
md/raid1:md0: sda: unrecoverable I/O read error
media error
How old are the disks?
The http://pastebin.com/AhemxfUz log (sda?)
Num  Test_Description  Status                   Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline  Completed: read failure  90%        14556            1938838581
It fails with a read error at about 90% of the test, at sector 1938838581.
Even short tests fail at the same sector.
The http://pastebin.com/MxuHBg0a log (sdb?)
The test did not complete:
Short offline Aborted by host 90% 14645
Extended offline Aborted by host 90% 14645
Short offline Completed without error 00% 11
Only a short test completed successfully, back when the drive was purchased and
had been working for 11 hours.
The kernel log shows a lot of errors, all for sda. Starting from the last:
md/raid1:md0: sda: unrecoverable I/O read error
This has probably prevented the recovery from succeeding; that is why sdb is
always kept as a spare drive (in green) and the degraded state is never left.
Other kernel log errors, sorted by sector:
end_request: I/O error, dev sda, sector 1938838581
end_request: I/O error, dev sda, sector 1938838658
end_request: I/O error, dev sda, sector 1938838786
end_request: I/O error, dev sda, sector 1938838914
end_request: I/O error, dev sda, sector 1938839042
end_request: I/O error, dev sda, sector 1938839170
end_request: I/O error, dev sda, sector 1938839298
end_request: I/O error, dev sda, sector 1938839426
end_request: I/O error, dev sda, sector 1938839554
end_request: I/O error, dev sda, sector 1938839682
end_request: I/O error, dev sda, sector 1938839810
end_request: I/O error, dev sda, sector 1938839938
end_request: I/O error, dev sda, sector 1938845186
end_request: I/O error, dev sda, sector 1938845442
end_request: I/O error, dev sda, sector 1938845698
Notice that the first sector where an error is reported is the same one
reported by SMART.
So, it looks like sda (is it your right disk?) has media (surface) errors near
its end.
This can be a disk error or a kernel bug. The only way to be sure is to do a
'smart' test on sdb and see if it completes without errors. If it does, the
problem is not the kernel but the disk.
If it is a kernel bug, the best is for you to return to dlink firmware.
If it is a disk problem, rebuild/resync/recover will never succeed.
Fortunately the error is at the disk end, and unless it is full of data, the
data is still intact. How full is your disk? Status page.
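A quick back-of-the-envelope shows how close to the end that bad sector is. Assumptions: 512-byte sectors and 1953525168 total sectors, which is typical for a 1 TB disk; check your real total with 'fdisk -l /dev/sda':

```shell
#!/bin/sh
# Sketch: locate sector 1938838581 on the disk. The sector size and
# the disk's total sector count are assumptions -- check yours.
BAD=1938838581        # LBA reported by SMART and the kernel log
TOTAL=1953525168      # assumed total sectors of a 1 TB disk
offset_gb=$((BAD * 512 / 1000000000))   # byte offset, in GB
percent=$((BAD * 100 / TOTAL))          # how far into the disk
echo "error at ~${offset_gb} GB, ${percent}% into the disk"
```

Under those assumptions the error sits at roughly 992 GB, about 99% into the disk, which matches "near its end".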
What are your options? Several. But if an error has developed in a disk, then
more errors are likely to appear. This is not always true; the defect can be
localized on the disk surface and need not spread like a virus.
-Do a backup/buy a new disk
-See if WD has a disk repair tool for your disk model. Bad sectors can be
remapped/reallocated to good ones (the 'smart' log says that this is still
possible:
5 Reallocated_Sector_Ct 0x0033 200 200 140)
-See if the data is still intact on sdb2. Yes, it is possible. If it is, then
sda could be reformatted to eliminate the defective area and a new array built
(or a new disk used)
-Format the sdb disk as a standard disk and copy all data from the degraded
array to sdb. This should only be done if the smart test on sdb completes
without error. Then sda could be reformatted to eliminate the defective area
and a new array built (or a new disk used)
-Try to shrink the raid array by 10 or 20%, so that the zone of the disk in
error is left out of the raid. I don't know if this will succeed. I have
shrunk raid arrays before, but there was no disk error then. Afterwards sdb2
could be added to the array.
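A rough outline of that last shrink option, as a sketch only: the new size below is a made-up placeholder, the real numbers must come from your array and filesystem, and the filesystem must be shrunk before the array, never after. The script only prints the commands:

```shell
#!/bin/sh
# Sketch: shrink the filesystem, then the array, so the bad zone at
# the disk end falls outside the raid. NEW_KB is a placeholder value
# (mdadm --size takes KiB); printed only, never run blindly.
MD=/dev/md0
NEW_KB=800000000
fsck_cmd="e2fsck -f $MD"                     # filesystem must be clean
resize_cmd="resize2fs $MD ${NEW_KB}K"        # shrink the filesystem first
shrink_cmd="mdadm --grow $MD --size=$NEW_KB" # then shrink the array
echo "umount $MD"
echo "$fsck_cmd"
echo "$resize_cmd"
echo "$shrink_cmd"
```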
> > How old are the disks?
>
> Both are around 2 years old. One a bit more, the other a bit less.
>
> Are you telling me that finally ALT-F had reasons to complain and stay
> degraded?
I have presented the facts that the logs show.
> Weird, why D-Link firmware never complained ?
You are the only one to doubt, see the Topic "SATA issues"
> I even had smartd installed with ffp.
And have you looked at the logs? Was it doing regular tests?
> Is there any other test I could do or
> repair operations, or my disks need to be changed? Thanks for your
> help because this is way beyound my knowledge.
-I would repeat the smart test on sdb to make sure it is OK and the problem is
not a kernel bug. Remember to disable disk spindown before starting the test.
-I would see if a WD disk tool exists.
If I didn't have the expertise, I would switch to Dlink firmware to see what
happens there.
Good luck
I said:
> So, it looks like sda (is it your right disk?) has media (surface) errors near its end.
> This can be a disk error or a kernel bug. The only way to be sure is to do a
> 'smart' test on sdb and see if it completes without errors. If it does, the
> problem is not the kernel but the disk.
Stupid of me. I need to sleep a little bit more.
SMART tests are executed internally by the drive itself; the kernel has nothing to do with it. The smart programs only instruct the drive to perform the test and later collect the results.
So the problem is certainly in the drive.
But this is a community, and I'm sure someone else might be able to clarify: has anyone done a short or long smart test (Disk->Utils->Health) on disks of 1TB or greater capacity?