It is not resyncing... yet the status page says it is? And shows an ETA and a
completion bar? That is moving? Odd...
Green means that it is a spare drive, not in use, but ready to be.
When a device is in red it means that it has failed.
Black means that it is in use.
For some reason /dev/sda2 (possibly the left disk) was removed from the array
and became a spare. This should not have happened.
Has anybody else had a similar problem? Have you ejected the left disk
(Disk->Utils)?
*IF* all your data on the degraded array is still available:
You should go to Disk->RAID and in the 'RAID Maintenance' section,
Under 'Component Operations', Partition, select sdb2 (the spare), then
Under 'Component Operations', Operations, select 'add'
Notice: this is the standard way of adding a spare drive when one disk fails.
I can't fully diagnose your present situation. I'm just telling you what *I*
would do. You might prefer to just reboot and go to dlink firmware.
Now, a lengthy rebuild will start.
To avoid the lengthy rebuild in the future in this situation, add a write-
intent bitmap to the array when the rebuild finishes (see my previous post).
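For reference, the GUI 'add' above and the bitmap advice boil down to two mdadm commands. This is a sketch only: the device names (/dev/md0, /dev/sdb2) are assumptions taken from this thread, and the script prints the commands instead of running them; execute them yourself as root on the box after checking your own device names.

```shell
#!/bin/sh
# Sketch: mdadm equivalents of the GUI steps. Device names are
# assumptions from this thread -- check yours before running anything.
MD=/dev/md0
SPARE=/dev/sdb2

# What the GUI 'add' does: rejoin the spare, a rebuild then starts.
add_cmd="mdadm $MD --add $SPARE"
# After the rebuild finishes: add an internal write-intent bitmap,
# so future re-adds resync only the regions written in the meantime.
bitmap_cmd="mdadm --grow --bitmap=internal $MD"

echo "$add_cmd"
echo "$bitmap_cmd"
```

With the bitmap in place, a disk that drops out and comes back only resyncs the blocks that changed while it was out, instead of the whole partition.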
I have already done this on a dlink-created raid1 array and, on return to the
dlink firmware, the raid array was successfully assembled.
Just for reference, compare your results with mine:
# mdadm --detail /dev/md0
/dev/md0:
Version : 0.90
Creation Time : Wed Mar 30 15:26:23 2011
Raid Level : raid1
Array Size : 38085824 (36.32 GiB 39.00 GB)
Used Dev Size : 38085824 (36.32 GiB 39.00 GB)
Raid Devices : 2
Total Devices : 2
Preferred Minor : 0
Persistence : Superblock is persistent
Intent Bitmap : Internal
Update Time : Wed Mar 30 19:28:14 2011
State : active
Active Devices : 2
Working Devices : 2
Failed Devices : 0
Spare Devices : 0
UUID : 36b27a78:6ece8ec8:d026b8df:cc91afe2 (local to host
nas.homenet)
Events : 0.222
Number Major Minor RaidDevice State
0 8 3 0 active sync /dev/sda3
1 8 19 1 active sync /dev/sdb3
And still degraded? With sdb2 in green?
The command output you posted was captured after the resync finished, right?
> After that, with
> both disk plugged in, I rebooted in Alt-F, and waited again for 4
> hours for resyncing. It's where we are there now.
What I don't understand is how the resync finished (or rather, started) with
only one disk.
As per the command output you posted:
Active Devices : 1
Working Devices : 2
Failed Devices : 0
Spare Devices : 1
> > *IF* all your data on the degraded array is still available:
> It seems to be there, because I did not (and could not) make changes
> to files on the disks.
Yes, you could. The RAID is available/writable even while rebuilding/resyncing,
although slower.
> > You should go to Disk->RAID and in the 'RAID Maintenance' section,
> >
> > Under 'Component Operations', Partition, select sdb2 (the spare), then
> > Under 'Component Operations', Operations, select 'add'
> >
> > Notice: this is the standard way of adding a spare drive when one disk
> > fails. I can't fully diagnose your present situation. I'm just telling
> > you what *I* would do. You might prefer to just reboot and go to dlink
> > firmware.
> >
> > Now, a lengthy rebuild will start.
>
> OK, I'll try.
> I got an error message:
> "Adding the sdb2 partition to the md0 RAID device failed:
> mdadm: Cannot open /dev/sdb2: Device or resource busy"
-unmount the filesystem: Disk->Filesystem->FS Operations, select 'unmount'
(if it says only 'mount', then it is already un-mounted, skip)
-stop the array: Disk->RAID->Raid Maintenance, under 'Array' hit "Stop"
It now shows two devices under 'Components', sda2 and sdb2, right? Any in
green?
-start the array: Disk->RAID->Raid Maintenance, under Array, hit "Start"
Working now? Still degraded? Any component in green?
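For reference, a sketch of the command-line equivalents of those buttons. The device and partition names are assumptions from this thread, and the script only prints the commands:

```shell
#!/bin/sh
# Sketch: CLI equivalents of the unmount/stop/start GUI buttons.
# Device names assumed from this thread; printed, not executed.
MD=/dev/md0
stop_seq="umount $MD && mdadm --stop $MD"
start_cmd="mdadm --assemble $MD /dev/sda2 /dev/sdb2"
echo "$stop_seq"    # filesystem off, then array stopped
echo "$start_cmd"   # re-assemble the array from both partitions
```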
I tried to reproduce what happened to you, but couldn't:
-eject left disk (Disk->Utils)
the array enters the degraded state, only sda3 under components
physically ejected left disk
wait a few seconds
physically inserted left disk
still degraded, but no sd* in green; only sda3 appeared
-add left disk to array (Disk->Raid)
the array left the degraded state. No rebuild/resync (I have an intent bitmap)
By now you must be understanding how raid1 works. It has two or more device
components (disk partitions), and each component can be in use, failed,
removed or spare.
The array enters the degraded state when one of the two active components
fails and/or is removed. If a spare is available at that moment, rebuilding
automatically starts on it. If no spare is available, one must make a partition
of equal size on a new disk, insert the disk and add the new partition to
the array.
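All of this can be read straight from /proc/mdstat: "[2/1]" means two devices expected, one working, and "[U_]" marks the missing mirror half. A small sketch that extracts the state from such a line (the sample line is copied from the outputs in this thread):

```shell
#!/bin/sh
# Sketch: decide degraded/ok from an mdstat-style status line.
# Sample line taken from this thread's /proc/mdstat output.
line="38085824 blocks [2/1] [U_]"
expected=$(echo "$line" | sed 's/.*\[\([0-9]*\)\/[0-9]*\].*/\1/')  # devices wanted
working=$(echo "$line" | sed 's/.*\[[0-9]*\/\([0-9]*\)\].*/\1/')   # devices in sync
if [ "$working" -lt "$expected" ]; then
    state=degraded
else
    state=ok
fi
echo "$state"
```

For the sample line above this prints "degraded".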
Perhaps all this was caused because dlink firmware needs a reboot in order to
complete the resync? It was not Alt-F that put one component as spare.
I managed to reproduce your exact situation (*):
Status page:
=============
md0   36.3 GB   raid1   clean   degraded   recover   35%   15.4min
Raid page:
=======
md0   36.3 GB   raid1   sda3 sdb3   recovering
(sdb3 is in green)
/proc/mdstat:
=========
md0 : active raid1 sdb3[2] sda3[0]
38085824 blocks [2/1] [U_]
[=========>...........] recovery = 47.1% (17968896/38085824) finish=8.9min speed=37330K/sec
bitmap: 0/1 pages [0KB], 65536KB chunk
unused devices: <none>
mdadm --examine
=============
this 2 8 19 2 spare /dev/sdb3
0 0 8 3 0 active sync /dev/sda3
1 1 0 0 1 faulty removed
2 2 8 19 2 spare /dev/sdb3
mdadm --detail
===========
State : active, degraded, recovering
Active Devices : 1
Working Devices : 2
Failed Devices : 0
Spare Devices : 1
...
Number Major Minor RaidDevice State
0 8 3 0 active sync /dev/sda3
2 8 19 1 spare rebuilding /dev/sdb3
I'm now waiting for the recovery to finish.
It finished and went to the OK status; no component appears in green (spare).
So I expect the same to happen to you.
It remains to be explained how the spare component appeared in the first place. As I said, Alt-F doesn't do that except when creating an array with more disks than necessary. Also, recovery should start immediately when an array becomes degraded and a spare is available.
(*) How I reproduced your situation: (don't do this at home :-)
fail, then remove, then --zero-superblock, then add; all to sdb3.
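Spelled out as commands, in case it helps follow along. This is a sketch using sdb3 as in my test (sdb2 in your case); the zero-superblock step destroys the raid metadata on that partition, so the script only prints the sequence instead of running it:

```shell
#!/bin/sh
# Sketch: the fail / remove / zero-superblock / add sequence used to
# reproduce the situation. DESTRUCTIVE if actually run -- printed only.
MD=/dev/md0
PART=/dev/sdb3
for step in \
    "mdadm $MD --fail $PART" \
    "mdadm $MD --remove $PART" \
    "mdadm --zero-superblock $PART" \
    "mdadm $MD --add $PART"
do
    echo "$step"
done
```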
What does the kernel log say at the end of the recovery?
As I said in my previous post, I was able to reproduce (through manipulation) your situation, and after the recovery finished my array went to the OK status.
My kernel log shows no errors at the end:
md/raid1:md0: active with 1 out of 2 mirrors
...
RAID1 conf printout:
--- wd:1 rd:2
disk 0, wo:0, o:1, dev:sda3
disk 1, wo:1, o:1, dev:sdb3
md: recovery of RAID array md0
...
md: md0: recovery done.
RAID1 conf printout:
--- wd:2 rd:2
disk 0, wo:0, o:1, dev:sda3
disk 1, wo:0, o:1, dev:sdb3
I was able to reproduce your situation by removing the RAID information (metadata) from sdb3 (sdb2 in your case); this is the same as adding a pristine partition to the raid array, see Topic "Windows cannot see my dns-323 with Alt-F firmware" after post #8.
I'm almost sure that your array's current situation is a consequence of an incomplete recovery when using the dlink firmware, but I don't really want to play with your data remotely.
> (*) How I reproduced your situation: (don't do this at home :-)
> fail, then remove, then --zero-superblock, then add; all to sdb3.
If you want more detailed instructions about this procedure, which removes the
raid information from your sdb2 partition, please say so.
It is the same as replacing your left disk (sdb) with a brand new disk.
How is the health of your disks? Disk->Utilities->Health->Show Status
I recommend doing a long SMART test, Disk->Utilities->Health->"Start long
test". It can take a while to complete, it is done in the background, the
disks should still be usable.
You can start testing both disks simultaneously. Don't check the log until the
indicated time elapses.
Perhaps you should disable the spindown timeout (enter 0 and submit) before
starting the SMART test.
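If you prefer the command line over the Health page, the same can be done with smartctl (these are standard smartmontools flags; again a sketch that only prints the commands, run them yourself as root):

```shell
#!/bin/sh
# Sketch: smartctl equivalents of the Health page. Standard
# smartmontools flags; commands printed, not executed.
DISK=/dev/sda
start_cmd="smartctl -t long $DISK"     # start the long self-test
log_cmd="smartctl -l selftest $DISK"   # read the results when done
health_cmd="smartctl -H $DISK"         # overall pass/fail verdict
echo "$start_cmd"
echo "$log_cmd"
echo "$health_cmd"
```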
The kernel log shows that for sda, which usually is your right disk (please
confirm), there are many errors. Just two:
md/raid1:md0: sda: unrecoverable I/O read error
media error
How old are the disks?
The http://pastebin.com/AhemxfUz log (sda?)
Num  Test_Description  Status                   Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline  Completed: read failure  90%        14556            1938838581
It fails with a read error at about 90% of the test, at sector 1938838581.
Even short tests fail at the same sector.
The http://pastebin.com/MxuHBg0a log (sdb?)
The test did not complete:
Short offline Aborted by host 90% 14645
Extended offline Aborted by host 90% 14645
Short offline Completed without error 00% 11
Only a short test completed successfully, back when the drive was purchased and
had been working for 11 hours.
The kernel log shows a lot of errors, all for sda. Starting from the last:
md/raid1:md0: sda: unrecoverable I/O read error
This has probably prevented the recovery from succeeding; that is why sdb is
always kept as a spare drive (in green) and the degraded state is never left.
Other kernel log errors, sorted by sector:
end_request: I/O error, dev sda, sector 1938838581
end_request: I/O error, dev sda, sector 1938838658
end_request: I/O error, dev sda, sector 1938838786
end_request: I/O error, dev sda, sector 1938838914
end_request: I/O error, dev sda, sector 1938839042
end_request: I/O error, dev sda, sector 1938839170
end_request: I/O error, dev sda, sector 1938839298
end_request: I/O error, dev sda, sector 1938839426
end_request: I/O error, dev sda, sector 1938839554
end_request: I/O error, dev sda, sector 1938839682
end_request: I/O error, dev sda, sector 1938839810
end_request: I/O error, dev sda, sector 1938839938
end_request: I/O error, dev sda, sector 1938845186
end_request: I/O error, dev sda, sector 1938845442
end_request: I/O error, dev sda, sector 1938845698
Notice that the first sector where an error is reported is the same one
reported by SMART.
So, it looks like sda (is it your right disk?) has media (surface) errors near
its end.
This can be a disk error or a kernel bug. The only way to be sure is to do a
'smart' test on sdb and see if it completes without errors. If it does, the
problem is not the kernel but the disk.
If it is a kernel bug, the best is for you to return to dlink firmware.
If it is a disk problem, rebuild/resync/recover will never succeed.
Fortunately the error is at the disk end, and unless it is full of data, the
data is still intact. How full is your disk? Status page.
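A quick back-of-the-envelope shows how close to the end that bad sector is. Assumptions: 512-byte sectors and 1953525168 total sectors, which is typical for a 1 TB disk; check your real total with 'fdisk -l /dev/sda':

```shell
#!/bin/sh
# Sketch: locate sector 1938838581 on the disk. The sector size and
# the disk's total sector count are assumptions -- check yours.
BAD=1938838581        # LBA reported by SMART and the kernel log
TOTAL=1953525168      # assumed total sectors of a 1 TB disk
offset_gb=$((BAD * 512 / 1000000000))   # byte offset, in GB
percent=$((BAD * 100 / TOTAL))          # how far into the disk
echo "error at ~${offset_gb} GB, ${percent}% into the disk"
```

Under those assumptions the error sits at roughly 992 GB, about 99% into the disk, which matches "near its end".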
What are your options? Several. But if an error has developed in a disk, then
more errors are likely to appear. This is not always true; the defect can be
localized on the disk surface and need not spread like a virus.
-Do a backup/buy a new disk
-See if WD has a disk repair tool for your disk model. Bad sectors can be
remapped/reallocated to good ones (the 'smart' log says that this is still
possible:
5 Reallocated_Sector_Ct 0x0033 200 200 140)
-See if the data is still intact on sdb2. Yes, it is possible. If it is, then
sda could be reformatted to eliminate the defective area and a new array built
(or a new disk used)
-Format the sdb disk as a standard disk and copy all data from the degraded
array to sdb. This should only be done if the smart test on sdb completes
without error. Then sda could be reformatted to eliminate the defective area
and a new array built (or a new disk used)
-Try to shrink the raid array by 10 or 20%, so that the zone of the disk in
error is left out of the raid. I don't know if this will succeed. I have
shrunk raid arrays before, but there was no disk error then. Afterwards sdb2
could be added to the array.
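A rough outline of that last shrink option, as a sketch only: the new size below is a made-up placeholder, the real numbers must come from your array and filesystem, and the filesystem must be shrunk before the array, never after. The script only prints the commands:

```shell
#!/bin/sh
# Sketch: shrink the filesystem, then the array, so the bad zone at
# the disk end falls outside the raid. NEW_KB is a placeholder value
# (mdadm --size takes KiB); printed only, never run blindly.
MD=/dev/md0
NEW_KB=800000000
fsck_cmd="e2fsck -f $MD"                     # filesystem must be clean
resize_cmd="resize2fs $MD ${NEW_KB}K"        # shrink the filesystem first
shrink_cmd="mdadm --grow $MD --size=$NEW_KB" # then shrink the array
echo "umount $MD"
echo "$fsck_cmd"
echo "$resize_cmd"
echo "$shrink_cmd"
```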
> > How old are the disks?
>
> Both are around 2 years old. One a bit more, the other a bit less.
>
> Are you telling me that finally ALT-F had reasons to complain and stay
> degraded?
I have presented the facts that the logs show.
> Weird, why D-Link firmware never complained ?
You are the only one to doubt, see the Topic "SATA issues"
> I even had smartd installed with ffp.
And have you looked at the logs? Was it doing regular tests?
> Is there any other test I could do or
> repair operations, or my disks need to be changed? Thanks for your
> help because this is way beyound my knowledge.
-I would repeat the smart test on sdb to make sure it is OK and the problem is
not a kernel bug. Remember to disable disk spindown before starting the test.
-I would see if a WD disk tool exists.
If I didn't have the expertise, I would switch to Dlink firmware to see what
happens there.
Good luck
I said:
> So, it looks like sda (is it your right disk?) has media (surface) errors near its end.
> This can be a disk error or a kernel bug. The only way to be sure is to do a
> 'smart' test on sdb and see if it completes without errors. If it does, the
> problem is not the kernel but the disk.
Stupid of me. I need to sleep a little bit more.
SMART tests are executed internally by the drive itself; the kernel has nothing to do with it. The smart programs only instruct the drive to perform the test and later collect the results.
So the problem is certainly in the drive.
But this is a community, and I'm sure someone else might be able to clarify: has anyone done a short or long smart test (Disk->Utils->Health) on disks of 1TB or greater capacity?