WDC HDD RAID Failure / Intel SATA controller (random, every 2-3 weeks)

news.tpi.pl

Dec 8, 2009, 11:28:46 PM

I'm getting WDC HDDs dropping out of the RAID. Every 2-3 weeks, seemingly at random,
only one drive drops out, on a random partition (MD1, MD2).

SMART is CLEAN for both drives. There are no errors in either the short or the long
SMART tests, on both drives.
BADBLOCKS returns no errors in the read-only, non-destructive read-write and
destructive write modes, on both drives.

The error is always same:
ata2.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0

Full error logs are below. Yesterday SDB on MD2 failed. I ran a
rebuild that took > 20 hours, and during the rebuild SDA on MD1 failed. So after
the first rebuild finished I was forced to rebuild MD1.

Kernel: 2.6.27.10-grsec-xxxx-grs-ipv4-64

Kernel / libata bug? Any comments?

FIRST sdb fails:
Dec 7 09:50:14 kernel: ata2.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
Dec 7 09:50:14 kernel: ata2.00: cmd ea/00:00:00:00:00/00:00:00:00:00/a0 tag 0
Dec 7 09:50:14 kernel: res 51/04:01:01:00:00/10:00:57:00:00/a0 Emask 0x1 (device error)
Dec 7 09:50:14 kernel: ata2.00: status: { DRDY ERR }
Dec 7 09:50:14 kernel: ata2.00: error: { ABRT }
Dec 7 09:50:14 kernel: ata2.00: configured for UDMA/133
Dec 7 09:50:14 kernel: ata2: EH complete
Dec 7 09:50:14 kernel: sd 1:0:0:0: [sdb] 1465149168 512-byte hardware sectors (750156 MB)
Dec 7 09:50:14 kernel: end_request: I/O error, dev sdb, sector 1464099661
Dec 7 09:50:14 kernel: md: super_written gets error=-5, uptodate=0
Dec 7 09:50:14 kernel: raid1: Disk failure on sdb2, disabling device.
Dec 7 09:50:14 kernel: raid1: Operation continuing on 1 devices.
Dec 7 09:50:14 kernel: sd 1:0:0:0: [sdb] Write Protect is off
Dec 7 09:50:14 kernel: sd 1:0:0:0: [sdb] Mode Sense: 00 3a 00 00
Dec 7 09:50:14 kernel: sd 1:0:0:0: [sdb] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA

THEN (during rebuild) drive SDA failed:
Dec 8 09:00:30 kernel: ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
Dec 8 09:00:30 kernel: ata1.00: cmd ea/00:00:00:00:00/00:00:00:00:00/a0 tag 0
Dec 8 09:00:30 kernel: res 51/04:01:01:00:00/10:00:57:00:00/a0 Emask 0x1 (device error)
Dec 8 09:00:30 kernel: ata1.00: status: { DRDY ERR }
Dec 8 09:00:30 kernel: ata1.00: error: { ABRT }
Dec 8 09:00:30 kernel: ata1.00: configured for UDMA/133
Dec 8 09:00:30 kernel: ata1: EH complete
Dec 8 09:00:30 kernel: sd 0:0:0:0: [sda] 1465149168 512-byte hardware sectors (750156 MB)
Dec 8 09:00:30 kernel: end_request: I/O error, dev sda, sector 10490303
Dec 8 09:00:30 kernel: md: super_written gets error=-5, uptodate=0
Dec 8 09:00:30 kernel: raid1: Disk failure on sda1, disabling device.
Dec 8 09:00:30 kernel: raid1: Operation continuing on 1 devices.

HARDWARE:

ata1.00: ATA-8: WDC WD7501AALS-00J7B1, 05.00K05, max UDMA/133
ata1.00: 1465149168 sectors, multi 16: LBA48 NCQ (depth 0/32)
ata1.00: configured for UDMA/133
ata2.00: ATA-8: WDC WD7501AALS-00J7B0, 05.00K05, max UDMA/133
ata2.00: 1465149168 sectors, multi 16: LBA48 NCQ (depth 0/32)
ata2.00: configured for UDMA/133

00:1f.2 IDE interface: Intel Corporation 82801GB/GR/GH (ICH7 Family) SATA IDE Controller (rev 01) (prog-if 8f [Master SecP SecO PriP PriO])
Subsystem: Intel Corporation Device d613
Control: I/O+ Mem- BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
Status: Cap+ 66MHz+ UDF- FastB2B+ ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
Latency: 0
Interrupt: pin B routed to IRQ 19
Region 0: I/O ports at f0e0 [size=8]
Region 1: I/O ports at f0d0 [size=4]
Region 2: I/O ports at f0c0 [size=8]
Region 3: I/O ports at f0b0 [size=4]
Region 4: I/O ports at f0a0 [size=16]
Capabilities: [70] Power Management version 2
Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot+,D3cold-)
Status: D0 PME-Enable- DSel=0 DScale=0 PME-
Kernel driver in use: ata_piix
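
For anyone reading the `cmd` and `res` lines in the logs above: libata prints the ATA taskfile registers as hex fields, with the command opcode first on the `cmd` line and the status and error registers first on the `res` line. The following is a minimal Python sketch of decoding them, assuming that field order and the standard ATA status/error bit assignments. The function names are made up for illustration, and the opcode table lists only the two commands that appear in this thread.

```python
# Illustrative decoder for libata "cmd"/"res" taskfile strings.
# Assumption: 12 hex fields, opcode (or status) first, feature (or error) second.

# Only the opcodes that appear in this thread; not a complete table.
ATA_COMMANDS = {0xEA: "FLUSH CACHE EXT", 0x25: "READ DMA EXT"}

# Standard ATA status and error register bits, limited to the ones
# the kernel names in its "status:" and "error:" lines.
STATUS_BITS = {0x80: "BSY", 0x40: "DRDY", 0x20: "DF", 0x08: "DRQ", 0x01: "ERR"}
ERROR_BITS = {0x80: "ICRC", 0x40: "UNC", 0x10: "IDNF", 0x04: "ABRT"}


def parse_taskfile(text):
    """Turn 'ea/00:00:00:00:00/00:00:00:00:00/a0' into a list of 12 ints."""
    fields = text.replace("/", ":").split(":")
    if len(fields) != 12:
        raise ValueError("expected 12 hex fields, got %d" % len(fields))
    return [int(field, 16) for field in fields]


def flag_names(value, table):
    """Return the names of all bits from `table` that are set in `value`."""
    return [name for bit, name in table.items() if value & bit]


def describe(cmd_text, res_text):
    """Summarise one cmd/res pair as a command name plus status/error flags."""
    cmd = parse_taskfile(cmd_text)
    res = parse_taskfile(res_text)
    return {
        "command": ATA_COMMANDS.get(cmd[0], "unknown (0x%02x)" % cmd[0]),
        "status": flag_names(res[0], STATUS_BITS),
        "error": flag_names(res[1], ERROR_BITS),
    }


if __name__ == "__main__":
    print(describe("ea/00:00:00:00:00/00:00:00:00:00/a0",
                   "51/04:01:01:00:00/10:00:57:00:00/a0"))
```

Decoded this way, both the Dec 7 and Dec 8 failures show a FLUSH CACHE EXT command that the drive answered with ERR and ABRT set, which agrees with the `status:` and `error:` lines the kernel printed.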

philo

Dec 9, 2009, 8:35:40 AM

news.tpi.pl wrote:
> I'm getting WDC HDDs dropping out of the RAID. Every 2-3 weeks, seemingly at random,
> only one drive drops out, on a random partition (MD1, MD2).
>
> SMART is CLEAN for both drives. There are no errors in either the short or the long
> SMART tests, on both drives.
> BADBLOCKS returns no errors in the read-only, non-destructive read-write and
> destructive write modes, on both drives.

I'd go further than that and run the manufacturer's diagnostic on the
drive in question.

If the diagnostic finds any errors, obviously you will have to replace
the drive.

OTOH: Even if the manufacturer's diagnostic does not find any errors...
I'd err on the side of caution and replace the drive.

Obviously I assume all data are backed up!

>
> The error is always same:
> ata2.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
>
> Full error logs are below. Yesterday SDB on MD2 failed. I ran a
> rebuild that took > 20 hours, and during the rebuild SDA on MD1 failed. So after
> the first rebuild finished I was forced to rebuild MD1.
>
> Kernel: 2.6.27.10-grsec-xxxx-grs-ipv4-64
>
> Kernel / libata bug? Any comments?
>
> FI

<snip>

Hactar

Dec 9, 2009, 2:07:34 PM

In article <hfo93c$3gf$2...@news.eternal-september.org>,
philo <ph...@privacy.invalid> wrote:
> news.tpi.pl wrote:
> > I'm getting WDC HDDs dropping out of the RAID. Every 2-3 weeks, seemingly at random,
> > only one drive drops out, on a random partition (MD1, MD2).
> >
> > SMART is CLEAN for both drives. There are no errors in either the short or the long
> > SMART tests, on both drives.
> > BADBLOCKS returns no errors in the read-only, non-destructive read-write and
> > destructive write modes, on both drives.
>
> I'd go further than that and run the manufacturer's diagnostic on the
> drive in question.
>
> If the diagnostic finds any errors, obviously you will have to replace
> the drive.
>
> OTOH: Even if the manufacturer's diagnostic does not find any errors...
> I'd err on the side of caution and replace the drive.

So, no matter what the manufacturer's diagnostic says, you'd replace the
drive. Why bother running it at all? I replace my drives too when they
start to act up, because I figure that's the beginning of the end.

> Obviously I assume all data are backed up!

As it should be.

--
-eben QebWe...@vTerYizUonI.nOetP royalty.mine.nu:81

This message was created using recycled electrons.

philo

Dec 9, 2009, 5:24:28 PM

Hactar wrote:
> In article <hfo93c$3gf$2...@news.eternal-september.org>,
> philo <ph...@privacy.invalid> wrote:
>> news.tpi.pl wrote:
>>> I'm getting WDC HDDs dropping out of the RAID. Every 2-3 weeks, seemingly at random,
>>> only one drive drops out, on a random partition (MD1, MD2).
>>>
>>> SMART is CLEAN for both drives. There are no errors in either the short or the long
>>> SMART tests, on both drives.
>>> BADBLOCKS returns no errors in the read-only, non-destructive read-write and
>>> destructive write modes, on both drives.
>> I'd go further than that and run the manufacturer's diagnostic on the
>> drive in question.
>>
>> If the diagnostic finds any errors, obviously you will have to replace
>> the drive.
>>
>> OTOH: Even if the manufacturer's diagnostic does not find any errors...
>> I'd err on the side of caution and replace the drive.
>
> So, no matter what the manufacturer's diagnostic says, you'd replace the
> drive. Why bother running it at all? I replace my drives too when they
> start to act up, because I figure that's the beginning of the end.
>

If the drive is going to be replaced under warranty,
the manufacturer will want the diagnostic error code.

But no matter what, I'd replace it.
I have seen drives that passed the manufacturer's diagnostic
but were definitely bad (that's rare, though).

news.tpi.pl

Dec 11, 2009, 7:35:35 AM

Yes, data is backed up.

But I can't replace the drives (no manufacturer will take 2 HDDs back because of
some bugs reported by the kernel, when the drives look 100% healthy and
there are no errors).

Any other ideas?

User "philo" <ph...@privacy.invalid> wrote in message
news:hfp82u$ucd$1...@news.eternal-september.org...

Jon Solberg

Dec 11, 2009, 7:48:43 AM

On 2009-12-11, news.tpi.pl <pslawek> wrote:
>
> User "philo" <ph...@privacy.invalid> wrote in message
> news:hfp82u$ucd$1...@news.eternal-september.org...
>> Hactar wrote:
>>> In article <hfo93c$3gf$2...@news.eternal-september.org>,
>>> philo <ph...@privacy.invalid> wrote:
>>>> news.tpi.pl wrote:
>>>>
>>>> [snipped]
>>>>
>>>> Obviously I assume all data are backed up!
>>>
>>> As it should be.
>
> Yes, data is backed up.
>
> But I can't replace the drives (no manufacturer will take 2 HDDs back
> because of some bugs reported by the kernel, when the drives look
> 100% healthy and there are no errors).
>
> Any other ideas?

I can't help you with your original problem but, pretty please, don't
top post. It makes it unnecessarily hard to follow the thread.

See, for example,
http://www.google.se/#hl=sv&source=hp&q=why+top+posting+is+bad&btnG=Google-s%C3%B6kning&meta=&aq=f&oq=why+top+posting+is+bad&fp=af2e0ae02f7c4ab7
for more information on posting styles.

Thanks.

--
Jon Solberg (remove "nospam." from email address).

AZ Nomad

Dec 11, 2009, 9:27:10 AM

On Fri, 11 Dec 2009 13:35:35 +0100, news.tpi.pl <pslawek> wrote:
>Yes, data is backed up.

>But I can't replace the drives (no manufacturer will take 2 HDDs back because of
>some bugs reported by the kernel, when the drives look 100% healthy and
>there are no errors).

>Any other ideas?

Replace them one at a time. Tell WD that the drive is dead.

philo

Dec 11, 2009, 10:09:07 AM

news.tpi.pl wrote:
> Yes, data is backed up.
>
> But I can't replace the drives (no manufacturer will take 2 HDDs back because of
> some bugs reported by the kernel, when the drives look 100% healthy and
> there are no errors).
>
> Any other ideas?
>

First off...I am not sure if I understood your first post correctly.
I thought the error was only on *one* of the drives. I may have mis-read
you. Is the error just on *one* drive...or does the error occur on both
drives (but one at a time)?

If the error can occur on either drive...then *maybe* the problem is
with the controller.

philo

Dec 11, 2009, 12:22:36 PM

Hard drive manufacturers will want the drive to be tested first with
their diagnostic utility and they will want the error code.

OTOH: I once did get a drive RMA'ed that did not give an error code...
yet I had carefully documented the exact problem.

news.tpi.pl

Dec 15, 2009, 12:35:43 AM

> First off...I am not sure if I understood your first post correctly.
> I thought the error was only on *one* of the drives. I may have mis-read
> you. Is the error just on *one* drive...or does the error occur on both
> drives (but one at a time)?

It happens on random partitions and on both drives, but more often on SDA.

I just got a different error; this time the drive wasn't dropped from the
array.

Dec 14 17:02:32 kernel: ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
Dec 14 17:02:32 kernel: ata1.00: cmd 25/00:08:cd:a7:36/00:00:57:00:00/e0 tag 0 dma 4096 in
Dec 14 17:02:32 kernel: res 40/00:00:09:4f:c2/10:00:57:00:00/00 Emask 0x4 (timeout)
Dec 14 17:02:32 kernel: ata1.00: status: { DRDY }
Dec 14 17:02:37 kernel: ata1: link is slow to respond, please be patient (ready=0)
Dec 14 17:02:42 kernel: ata1: device not ready (errno=-16), forcing hardreset
Dec 14 17:02:42 kernel: ata1: soft resetting link
Dec 14 17:02:42 kernel: ata1.00: configured for UDMA/133
Dec 14 17:02:42 kernel: ata1: EH complete
Dec 14 17:02:42 kernel: sd 0:0:0:0: [sda] 1465149168 512-byte hardware sectors (750156 MB)
Dec 14 17:02:42 kernel: sd 0:0:0:0: [sda] Write Protect is off
Dec 14 17:02:42 kernel: sd 0:0:0:0: [sda] Mode Sense: 00 3a 00 00
Dec 14 17:02:42 kernel: sd 0:0:0:0: [sda] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA

Could it be an Intel chipset driver bug? What do you think?
http://lkml.indiana.edu/hypermail/linux/kernel/0808.3/2716.html
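
The `cmd` line of this timeout also says which sectors were being read. Here is a small Python sketch that rebuilds the 48-bit LBA and the sector count from it, assuming libata's field order (opcode/feature:count:LBA low:mid:high/high-order feature:count:LBA low:mid:high/device) and the 512-byte sectors the kernel reports for these drives. `lba48_from_cmd` is a hypothetical helper name, not part of any tool.

```python
SECTOR_SIZE = 512  # bytes per sector, as reported by the kernel for these drives


def lba48_from_cmd(text):
    """Return (opcode, lba, sector_count) from a libata 'cmd' taskfile string.

    Assumed layout: the first group holds the low-order registers and the
    second group holds the high-order ("hob") bytes of the same registers.
    """
    fields = [int(part, 16) for part in text.replace("/", ":").split(":")]
    if len(fields) != 12:
        raise ValueError("expected 12 hex fields, got %d" % len(fields))

    opcode, _feature, nsect, lbal, lbam, lbah = fields[0:6]
    _hob_feature, hob_nsect, hob_lbal, hob_lbam, hob_lbah = fields[6:11]

    # LBA48: high-order bytes occupy bits 24-47, low-order bytes bits 0-23.
    lba = ((hob_lbah << 40) | (hob_lbam << 32) | (hob_lbal << 24)
           | (lbah << 16) | (lbam << 8) | lbal)

    # A 16-bit count of 0 means 65536 sectors for LBA48 data commands;
    # that special case is not handled in this sketch.
    count = (hob_nsect << 8) | nsect
    return opcode, lba, count


if __name__ == "__main__":
    opcode, lba, count = lba48_from_cmd("25/00:08:cd:a7:36/00:00:57:00:00/e0")
    print(hex(opcode), lba, count, count * SECTOR_SIZE)
```

For the line above this gives opcode 0x25 (READ DMA EXT), 8 sectors (4096 bytes, matching `dma 4096 in`) at LBA 1463199693, which lies inside the 1465149168-sector disk and fairly close to its end.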

philo

Dec 15, 2009, 7:56:43 AM

It *could* be a bug,
but really it's going to need some investigating to narrow it down.

news.tpi.pl

Dec 15, 2009, 1:51:31 PM

> It *could* be a bug,
> but really it's going to need some investigating to narrow it down.

OK, so how can it be done?

philo

Dec 15, 2009, 2:27:02 PM

There is really only one way to know for sure,
and that is by experimentation.

Of course, that would involve experimenting with different drivers,
perhaps a different kernel, and different hardware.

As long as you are 100% certain all data are backed up,
you can afford to experiment.

If it were my own machine I'd probably try a different controller
and not use RAID...
but I can't presume to tell you what to do with your own system.
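
One low-risk first step in that kind of experimentation is to tally the errors already logged, to see whether they follow one port, one drive, or one error class. Below is a minimal Python sketch, assuming syslog lines shaped like the excerpts in this thread, with each `res ... Emask ... (...)` entry on a single line. The function name is made up for illustration, and the log path in the usage comment is an assumption that varies by distribution.

```python
import re
from collections import Counter

# "ata2.00: exception Emask ..." opens an error report for a port.
EXCEPTION_RE = re.compile(r"(ata\d+)\.\d+: exception Emask")
# The "res" line ends with the error class, e.g. "Emask 0x1 (device error)".
RESULT_RE = re.compile(r"\bres [0-9a-f/:]+\s+Emask 0x[0-9a-f]+ \(([^)]+)\)")


def tally_ata_errors(lines):
    """Count (port, error class) pairs from an iterable of log lines."""
    counts = Counter()
    port = None  # port of the most recent "exception" line, if any
    for line in lines:
        match = EXCEPTION_RE.search(line)
        if match:
            port = match.group(1)
            continue
        match = RESULT_RE.search(line)
        if match and port is not None:
            counts[(port, match.group(1))] += 1
            port = None  # wait for the next "exception" line
    return counts


# Possible usage (the path is an assumption; adjust for your system):
#
#     with open("/var/log/messages") as log:
#         for (port, kind), n in sorted(tally_ata_errors(log).items()):
#             print(port, kind, n)
```

If the counts stay with one port after the two drives swap cables, that points at the port or cable; if they move with the drive, that points at the drive.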