
Daily crashes, incorrect RAID behaviour


Carsten Otto

Aug 15, 2006, 7:37:19 AM
to linux-kernel
Hello!

System specs below (iCH7R, software raid 5)

My problems continue, even with a new and good power supply.
1) The system loses a disk about every week; only a hard reboot solves that.
2) In the last three nights the system lost all disk access and
trashed the file systems.

Regarding 1)
The system works normally and suddenly one disk does not respond.
After a soft reboot the BIOS does not recognize the disk, here a hard
reboot helps. Whenever I start my normal system in this situation, my
file systems get trashed. I think the software raid thinks the failed
disk (which lost several hours of write accesses) is OK and then
merges its stale data. When I wipe the disk (or create another raid on the
partitions) I can add the disk without problems. This might be a bug,
at least it is _very_ annoying.
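One hedged way to check for this before re-adding a disk that dropped out is to compare the md superblock event counters of the array members. The sketch below is illustrative only: the device names are hypothetical, and the parsing assumes the `Events` line as printed by `mdadm --examine`.

```shell
# Sketch: before re-adding a disk that dropped out of a software raid,
# compare the md superblock "Events" counters. A disk far behind the
# rest of the array holds stale data and should be re-synced, not merged.

events_of() {
    # Extract the Events counter from `mdadm --examine` output on stdin.
    awk '/Events/ {print $NF}'
}

stale_gap() {
    # Print "stale" if the returned disk is behind the array by more
    # than the allowed margin (default 1 event), else "ok".
    array_events=$1; disk_events=$2; margin=${3:-1}
    if [ $((array_events - disk_events)) -gt "$margin" ]; then
        echo stale
    else
        echo ok
    fi
}

# Hypothetical usage (requires root; sda1 is a healthy member, sdd1 the
# disk that came back after the hard reboot):
#   a=$(mdadm --examine /dev/sda1 | events_of)
#   d=$(mdadm --examine /dev/sdd1 | events_of)
#   stale_gap "$a" "$d"
```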

Regarding 2)
The system works as usual, but stops whenever disk access is needed
(some cached webpages work, but ssh login does not). On the screen I
see some scrolling messages telling me:
DriveReadySeekComplete (I do not recall the exact wording, sorry) for one disk,
and many ext3 errors ("something % 4 != 0, inode ..., something ...").
After a reboot with the failed disk removed (to avoid the problem of
1), the system's file system is totally corrupt; fsck.ext3 finds a lot
of errors.
In my opinion this should not happen in a raid 5, am I correct?

Sidenotes:
The hard disks are all OK; I checked them.
The system chooses the failing disks at random; I do not see a pattern here.
After I reported similar problems here, I got the hint to get a better
power supply. I did that (600 W now), but it did not help.
However, after the upgrade to the new power supply the system worked
fine for almost two weeks (then the weekly crashes started).

System specs:
Kernel 2.6.17.8 and newer
Software raid 5
Asus P5LB2 with iCH7R
Pentium D 805 (Dual Core)
2 GB PC533
4x Maxtor 300GB (Sata2)
1x Samsung 200GB (Pata)
Intel PCIe network

Thanks for _every_ hint, I am desperate. The system is quite new and
causes nothing but problems.
--
Carsten Otto
carste...@gmail.com
www.c-otto.de
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majo...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

Michael Tokarev

Aug 15, 2006, 8:34:17 AM
to Carsten Otto
Carsten Otto wrote:
> Hello!
>
> System specs below (iCH7R, software raid 5)
>
> My problems continue, even with a new and good power supply.
> 1) The system loses a disk about every week, only a hard reboot solves that

We've seen this in a lot of cases in the past. The issue was in a single
batch of Seagate 9 GB drives (yes, old): from time to time, one disk
just disappears from the system completely, and only a power-off/on cycle
forces it to reappear. This happens without any pattern, i.e., randomly;
sometimes a disk disappears several minutes after a power-on,
without any system load, and sometimes it works just fine for several
months.

We tried to replace (RMA) the bad drives one by one, with the same
scenario every time: they test the drive for a day and call us back
saying everything is OK; we collect the drive and return it the next
day (because we *know* it is NOT OK), and they send it for replacement.
The replaced drives (even refurbished ones) all work fine (we replaced
about 20 drives in total, all from the same batch).

I talked with Seagate techs about this issue, but there was no conclusion
(they called it "typical mishandling", like static electricity etc., but
that does not match the behaviour at all). And since the drives are very
old now (though quite a few of them are still in production ;) and were
already quite old when the problem started happening (about 6 years ago),
it is simpler to scrap them and replace them with more modern drives.

That was only one batch of drives. And the drives were excellent (for their
age anyway): not a single disk failure in many years, not even a single bad
block on about 50 drives! Not counting those sporadic disappearances, of
course ;) And the Seagate guys say this is something they have never heard
of before, too.

That is all to say: sometimes disk drives do strange things. Rarely, very
rarely, but it happens... ;)

/mjt

Alan Cox

Aug 15, 2006, 8:36:46 AM
to Carsten Otto
Ar Maw, 2006-08-15 am 13:36 +0200, ysgrifennodd Carsten Otto:
> The system works normally and suddenly one disk does not respond.
> After a soft reboot the BIOS does not recognize the disk, here a hard
> reboot helps. Whenever I start my normal system in this situation, my

Rule of thumb (and a good one). If the soft reboot and BIOS cannot
recover the disk then the disk is the problem. There isn't really
anything we can tell the drive to do which should make it take a hike
and ignore a reset sequence. (Should.. however..)

> DriveReadySeekComplete (I do not recall the exact words, sorry) for one disk

A pity; the exact text is essential.

> However, after the upgrade to the new power supply the system worked
> fine for almost two weeks (then the weekly crashes started).

I assume you've run memtest86 and also checked temperatures look good
around all the disks.

Carsten Otto

Aug 15, 2006, 8:42:52 AM
to Alan Cox
> Rule of thumb (and a good one). If the soft reboot and BIOS cannot
> recover the disk then the disk is the problem. There isn't really
> anything we can tell the drive to do which should make it take a hike
> and ignore a reset sequence. (Should.. however..)

Makes sense. I will focus my attention on the disks now (which makes
sense not only because of your information).

> > DriveReadySeekComplete (I do not recall the exact words, sorry) for one disk
> Pity the exact text is essential.

Here is the exact message I saw a few weeks ago (posted in here):

ata4: handling error/timeout
ata4: port reset, p_is 0 is 0 pis 0 cmd c017 tf 7f ss 0 se 0
ata4: status=0x50 { DriveReady SeekComplete }
sdd: Current: sense key=0x0
ASC=0x0 ASCQ=0x0
Info fid=0x0

To my knowledge, this time it did not look any different.

> I assume you've run memtest86 and also checked temperatures look good
> around all the disks.

Of course. I even replaced the mainboard (screwdriver accident..) and
power supply (too weak). And I now know that the sata cables I used at
first did not cause the problems :)

Thanks,

Jan Engelhardt

Aug 15, 2006, 9:10:13 AM
to Carsten Otto
>> > DriveReadySeekComplete (I do not recall the exact words, sorry) for
>> > one disk
>> Pity the exact text is essential.
>
> Here is the exact message I saw a few weeks ago (posted in here):
>
> ata4: handling error/timeout
> ata4: port reset, p_is 0 is 0 pis 0 cmd c017 tf 7f ss 0 se 0
> ata4: status=0x50 { DriveReady SeekComplete }
> sdd: Current: sense key=0x0
> ASC=0x0 ASCQ=0x0
> Info fid=0x0


Although I do not want to accuse libata, is it possible that a libata bug
is involved?


Jan Engelhardt
--

Ralf Müller

Aug 15, 2006, 9:46:25 AM
to linux-kernel
On Tuesday 15 August 2006 13:36, you wrote:
> My problems continue, even with a new and good power supply.
> 1) The system loses a disk about every week, only a hard reboot
> solves that 2) In the last three nights the system lost all disk
> access and trashed the file systems
>
> Regarding 1)
> The system works normally and suddenly one disk does not respond.
> After a soft reboot the BIOS does not recognize the disk, here a hard
> reboot helps. Whenever I start my normal system in this situation, my
> file systems get trashed. I think the software raid thinks the failed
> disks (which lost several hours of write accesses) is OK and then
> merges the data. When I delete the disk (or create another raid on
> the partitions) I can add the disk without problems. This might be a
> bug, at least it is _very_ annoying.

> 4x Maxtor 300GB (Sata2)

I have a similar problem with perhaps the same type of disk. My analysis
of the problem is still not complete, so I have not asked on the
kernel mailing list yet.

The disk type we use here in a ten disk RAID6 is:
Maxtor 7V300F0 300GB Sata2
on Promise Sata 300 TX4 controllers

Once every week or two a random disk stops responding and needs
a complete power off/on cycle to recover. After the power cycle the disk
works without problems and does not report any SMART problems...
I am quite sure it is not a problem with the power supply, motherboard,
backplane or controllers. Still open are the cabling and the disks. Nearly
the same hardware setup, just with different disks, has been running
smoothly for about 8 months in a different system.

If this is the same disk type, we should perhaps return the disks to our
hardware vendors, as it may be a disk problem.

The only messages I get look like this:
Aug 13 17:25:10 backup-core kernel: ata7: command timeout
Aug 13 17:25:10 backup-core kernel: ata7: translated ATA stat/err
0xff/00 to SCSI SK/ASC/ASCQ 0xb/47/00
Aug 13 17:25:10 backup-core kernel: ata7: status=0xff { Busy }
Aug 13 17:25:42 backup-core kernel: ata7: command timeout
Aug 13 17:25:42 backup-core kernel: ata7: translated ATA stat/err
0xff/00 to SCSI SK/ASC/ASCQ 0xb/47/00
Aug 13 17:25:42 backup-core kernel: ata7: status=0xff { Busy }
Aug 13 17:26:43 backup-core kernel: ata7: command timeout
Aug 13 17:26:43 backup-core kernel: ata7: translated ATA stat/err
0xff/00 to SCSI SK/ASC/ASCQ 0xb/47/00
Aug 13 17:26:43 backup-core kernel: ata7: status=0xff { Busy }
Aug 13 17:26:43 backup-core kernel: end_request: I/O error, dev sdg,
sector 2104383
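Scanning the kernel log for which port accumulates timeouts can help show whether the failures cluster on one slot/cable or really hit random disks. A minimal sketch, assuming the message format shown above (the log path in the usage comment is hypothetical):

```shell
# Count "command timeout" lines per ATA port in a kernel log, to see
# whether the timeouts cluster on one port or hit ports at random.
count_timeouts() {
    awk '/command timeout/ {
             for (i = 1; i <= NF; i++)
                 if ($i ~ /^ata[0-9]+:$/) {
                     sub(":", "", $i)
                     port[$i]++
                 }
         }
         END { for (p in port) print p, port[p] }'
}

# Hypothetical usage:
#   count_timeouts < /var/log/kern.log
```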

Regards
Ralf

--
Van Roy's Law: -------------------------------------------------------
An unbreakable toy is useful for breaking other toys.

Carsten Otto

Aug 15, 2006, 11:31:34 AM
to linux-kernel
Okay, after Ralf's message I found this newsgroup post:
http://groups.google.de/group/linux.debian.user/msg/f12dec920523a629?hl=de&

> You should be aware that currently
> Maxtor Maxline III's(7v300F0's) do not work properly due to a firmware
> bug. The current version shipping is VA111630, an update is available to
> VA111670 which merely reduces the frequency of timeouts that get the drive
> kicked out from the array.

I got a new firmware from Maxtor today. My disks now have firmware
VA111900, before that I had VA111630. Let's see what happens...
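After flashing, the firmware string the drive actually reports can be read back from its identify data. A sketch only: the device name is hypothetical and the parsing assumes `smartctl -i` style output:

```shell
# Read back the firmware revision a drive reports after flashing.
firmware_of() {
    # Extract the firmware string from `smartctl -i` style output on stdin.
    awk -F': *' 'tolower($1) ~ /firmware version/ {print $2}'
}

# Hypothetical usage (requires root; /dev/sdd is one of the Maxtors):
#   smartctl -i /dev/sdd | firmware_of
```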

PS: Maxtor's hotline guy had no record of firmware-related
problems. I would like to report them (with the two additional
references), but the hotline currently has technical difficulties...

Bye,

Mike Dresser

Aug 15, 2006, 2:29:23 PM
to Carsten Otto
On Tue, 15 Aug 2006, Carsten Otto wrote:

> Okay, after Ralf's message I found this newsgroup post:
> http://groups.google.de/group/linux.debian.user/msg/f12dec920523a629?hl=de&
>
>> You should be aware that currently
>> Maxtor Maxline III's(7v300F0's) do not work properly due to a firmware
>> bug. The current version shipping is VA111630, an update is available to
>> VA111670 which merely reduces the frequency of timeouts that get the drive
>> kicked out from the array.

I'm running 680 now, and the 15 drives have been up for something like two
months without any issues. It seems the firmware fixes the
problem.

Mike

Alistair John Strachan

Aug 15, 2006, 3:27:56 PM
to Mike Dresser
On Tuesday 15 August 2006 19:28, Mike Dresser wrote:
> On Tue, 15 Aug 2006, Carsten Otto wrote:
> > Okay, after Ralf's message I found this newsgroup post:
> > http://groups.google.de/group/linux.debian.user/msg/f12dec920523a629?hl=de&
> >
> >> You should be aware that currently
> >> Maxtor Maxline III's(7v300F0's) do not work properly due to a firmware
> >> bug. The current version shipping is VA111630, an update is available
> >> to VA111670 which merely reduces the frequency of timeouts that get the
> >> drive kicked out from the array.
>
> I'm running 680 now, and the 15 drives have been up for something like two
> months or so without issues at all.. Seems like the firmware fixes the
> problem.

Still, the RAID rebuild problem is worrying. I might have to deliberately
fail my RAID5 partitions and see if they rebuild correctly.
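Such a test could be driven with mdadm's fail/remove/add cycle. The commands below are a sketch under the assumption of an array /dev/md0 with member /dev/sdd1 (device names hypothetical; do this only with backups), plus a small helper to watch /proc/mdstat:

```shell
# Sketch: deliberately fail one raid5 member and watch the rebuild.
# Destructive on a live array; device names are hypothetical.
#   mdadm /dev/md0 --fail /dev/sdd1
#   mdadm /dev/md0 --remove /dev/sdd1
#   mdadm /dev/md0 --add /dev/sdd1

rebuild_state() {
    # Report "rebuilding" or "idle" from /proc/mdstat text on stdin.
    if grep -q recovery; then echo rebuilding; else echo idle; fi
}

# Hypothetical usage:
#   rebuild_state < /proc/mdstat
```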

--
Cheers,
Alistair.

Final year Computer Science undergraduate.
1F2 55 South Clerk Street, Edinburgh, UK.

Andrew Baker

Aug 19, 2006, 7:45:50 AM
to linux-...@vger.kernel.org
We too are having the same problem, and the only obviously common factor is
the Maxtor SATA HDDs.

We have two identical systems (64-bit, 2 x dual Opterons, 8 GB RAM) running
Novell/SUSE SLES10. Both systems show the problem.

In our case the RAID controller is

3ware Escalade 9550SX - 8LP

And the HDD are:

Maxtor MaxLine III (7V250F0) 250GB SATA II

The symptoms here are almost exactly as you describe: a disc "drops out" once
every week or two, and the only way to clear the problem is a power cycle, or
removing and reinserting the HDD (our system is hot-swap).

Regards

Andrew

Justin Piszcz

Aug 19, 2006, 7:48:05 AM
to Andrew Baker

I had the same problem with a 3ware 2-port IDE raid controller, a 7006-2.
One drive would always drop out under heavy I/O. Made me sick. I moved to
SW raid and all problems went away.

Justin.

Andrew Baker

Aug 19, 2006, 2:53:53 PM
to linux-...@vger.kernel.org
For various complex reasons, software RAID is not a viable option on these
systems.