
RAID1 problem - server freezes on md data-check


George Chelidze

Jan 4, 2010, 3:50:02 AM
Hello,

I've got an HP ML110 Intel Dual-Core E2160 server with 2 HDDs:

GB0250EAFYK - HP 250GB 3G SATA 7.2K 3.5" MDL 250 GB SATA Hard Drive
GB0250C8045 - HP 250GB 7.2K SATA Hard Disk Drive

So, I use SATA 3.0 Gb/s and SATA 1.5 Gb/s drives in a RAID-1 configuration. I
have configured 4 MD volumes and it has been running fine for some time;
however, every now and then the server freezes. When that happens I can still
ping the server from the network, but I can't ssh into it and even the
keyboard is useless, so I have to hard-reset the server. Below are the
last messages from my kern.log:

Jan 3 00:57:01 barambo1 kernel: [986475.159596] md: data-check of RAID
array md0
Jan 3 00:57:01 barambo1 kernel: [986475.159600] md: minimum
_guaranteed_ speed: 1000 KB/sec/disk.
Jan 3 00:57:01 barambo1 kernel: [986475.159602] md: using maximum
available idle IO bandwidth (but not more than 200000 KB/sec) for
data-check.
Jan 3 00:57:01 barambo1 kernel: [986475.159606] md: using 128k window,
over a total of 3903680 blocks.
Jan 3 00:57:01 barambo1 kernel: [986475.162041] md: delaying data-check
of md1 until md0 has finished (they share one or more physical units)
Jan 3 00:57:01 barambo1 kernel: [986475.164449] md: delaying data-check
of md2 until md0 has finished (they share one or more physical units)
Jan 3 00:57:01 barambo1 kernel: [986475.164455] md: delaying data-check
of md1 until md2 has finished (they share one or more physical units)
Jan 3 00:57:01 barambo1 kernel: [986475.166695] md: delaying data-check
of md3 until md0 has finished (they share one or more physical units)
Jan 3 00:57:01 barambo1 kernel: [986475.166699] md: delaying data-check
of md1 until md3 has finished (they share one or more physical units)
Jan 3 00:57:01 barambo1 kernel: [986475.166705] md: delaying data-check
of md2 until md3 has finished (they share one or more physical units)
Jan 3 00:58:13 barambo1 kernel: [986547.257883] md: md0: data-check
done.
Jan 3 00:58:13 barambo1 kernel: [986547.276663] md: delaying data-check
of md1 until md3 has finished (they share one or more physical units)
Jan 3 00:58:13 barambo1 kernel: [986547.276668] md: data-check of RAID
array md3
Jan 3 00:58:13 barambo1 kernel: [986547.276671] md: minimum
_guaranteed_ speed: 1000 KB/sec/disk.
Jan 3 00:58:13 barambo1 kernel: [986547.276674] md: using maximum
available idle IO bandwidth (but not more than 200000 KB/sec) for
data-check.
Jan 3 00:58:13 barambo1 kernel: [986547.276678] md: using 128k window,
over a total of 122126016 blocks.
Jan 3 00:58:13 barambo1 kernel: [986547.276681] md: delaying data-check
of md2 until md3 has finished (they share one or more physical units)

The OS is Debian 5.0.3 Lenny stable with the linux-image-2.6.30-bpo.2-686
kernel. I had the same results with the stock linux-image-2.6.26-2-686
kernel. My basic question is: can this happen because I use 2 different
drives? I have a chance to replace the GB0250C8045 with a GB0250EAFYK (or the
GB0250EAFYK with a GB0250C8045) and have 2 identical drives. Is that a good
idea, and will it solve my problem?

Thank you in advance for any input,

Best Regards,

George Chelidze


Ross Halliday

Jan 4, 2010, 5:40:02 AM
The total locking up sounds like a problem that someone who develops the
software might be able to help with (I am reminded of a bug in Ubuntu
where checkarray would completely freeze or reboot certain systems on
Linux 2.6.24 or so). Aside from any bugs, the checkarray function is
definitely a pain on a production system.

You can try changing the disks out so that both of them run at 3.0 Gbps;
this may speed up the process. Otherwise I would suggest checking the
help for /usr/share/mdadm/checkarray and modifying the system cron
job (see /etc/cron.d/mdadm) so the checks are timed and staggered per
array the way you like, as in the sketch below.
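
For illustration only, a staggered setup in /etc/cron.d/mdadm might look
roughly like this (the array names and times are made up, and you should
confirm with --help that your version of checkarray accepts individual
array names as arguments):

  # sketch: check one array per night early in the month, instead of all at once
  30 1 1 * * root /usr/share/mdadm/checkarray --cron --quiet md0
  30 1 2 * * root /usr/share/mdadm/checkarray --cron --quiet md1
  30 1 3 * * root /usr/share/mdadm/checkarray --cron --quiet md2
  30 1 4 * * root /usr/share/mdadm/checkarray --cron --quiet md3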

Cheers

---
Ross Halliday
Network Operations
WTC Communications

Thomas Goirand

Jan 4, 2010, 8:00:02 AM
Ross Halliday wrote:
> Aside from any bugs that checkarray
> function is definitely a pain on a production system.

Well, it's even more of a pain to have no monthly check at all and have
your drive silently die without a warning. Also, my finding is that
most of the time such lock-ups happen only with certain kinds of
controllers, or with a defective (half-working) HDD.

Thomas

George Chelidze

Jan 5, 2010, 12:30:02 AM
First let me say thank you to all who shared their experience and
knowledge. It was really helpful.

Yesterday I managed to replace the 1.5 Gb/s drive with a 3.0 Gb/s drive, and
now both drives are identical. The replacement required rebuilding the array,
which completed, but with one exception: at the end of the reconstruction
process I got "task * blocked for more than 120 seconds" messages in my
logs:

Jan 4 23:38:35 barambo1 kernel: [12517.683173] INFO: task
kjournald:1088 blocked for more than 120 seconds.
Jan 4 23:38:35 barambo1 kernel: [12517.683227] "echo 0
> /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jan 4 23:38:35 barambo1 kernel: [12517.683310] kjournald D 0735ccb1
0 1088 2
Jan 4 23:38:35 barambo1 kernel: [12517.683313] f78ef0c0 00000046
f7817c28 0735ccb1 00000a43 f78ef24c c180bfc0 00000000
Jan 4 23:38:35 barambo1 kernel: [12517.683319] f7495edc 008e3f0c
000006b2 00000000 008e3f0c f7495edc 008e3f0c f7939f18
Jan 4 23:38:35 barambo1 kernel: [12517.683326] c180bfc0 01451000
f7939f18 c1801688 c02b8a70 f7939f10 00000000 c019098e
Jan 4 23:38:35 barambo1 kernel: [12517.683332] Call Trace:
Jan 4 23:38:35 barambo1 kernel: [12517.683339] [<c02b8a70>]
io_schedule+0x49/0x80
Jan 4 23:38:35 barambo1 kernel: [12517.683343] [<c019098e>]
sync_buffer+0x30/0x33
Jan 4 23:38:35 barambo1 kernel: [12517.683347] [<c02b8c5e>]
__wait_on_bit+0x33/0x58
Jan 4 23:38:35 barambo1 kernel: [12517.683351] [<c019095e>]
sync_buffer+0x0/0x33
Jan 4 23:38:35 barambo1 kernel: [12517.683355] [<c019095e>]
sync_buffer+0x0/0x33
Jan 4 23:38:35 barambo1 kernel: [12517.683358] [<c02b8ce2>]
out_of_line_wait_on_bit+0x5f/0x67
Jan 4 23:38:35 barambo1 kernel: [12517.683364] [<c01319c9>]
wake_bit_function+0x0/0x3c
Jan 4 23:38:35 barambo1 kernel: [12517.683369] [<c019092a>]
__wait_on_buffer+0x16/0x18
Jan 4 23:38:35 barambo1 kernel: [12517.683373] [<f894fd7a>]
journal_commit_transaction+0x6cf/0xb3d [jbd]
Jan 4 23:38:35 barambo1 kernel: [12517.683386] [<c0129b2c>]
lock_timer_base+0x19/0x35
Jan 4 23:38:35 barambo1 kernel: [12517.683393] [<f8952468>] kjournald
+0xa5/0x1c6 [jbd]
Jan 4 23:38:35 barambo1 kernel: [12517.683402] [<c013199c>]
autoremove_wake_function+0x0/0x2d
Jan 4 23:38:35 barambo1 kernel: [12517.683406] [<f89523c3>] kjournald
+0x0/0x1c6 [jbd]
Jan 4 23:38:35 barambo1 kernel: [12517.683414] [<c01318db>] kthread
+0x38/0x5d
Jan 4 23:38:35 barambo1 kernel: [12517.683417] [<c01318a3>] kthread
+0x0/0x5d
Jan 4 23:38:35 barambo1 kernel: [12517.683421] [<c01044f3>]
kernel_thread_helper+0x7/0x10
Jan 4 23:38:35 barambo1 kernel: [12517.683426] =======================

(Please check the attached file for similar messages from different
processes.) However, after several minutes the server returned to its normal
state and has been working fine since. It's now running the stock
linux-image-2.6.26-2-686 kernel. Any ideas?

Best Regards,

George Chelidze

kern.log.gz

Thomas Goirand

Jan 5, 2010, 7:30:02 AM
George Chelidze wrote:
> First let me say thank you to all who shared their experience and
> knowledge. It was really helpful.
>
> Yesterday I managed to replace 1.5Gb/s drive with 3.0Gb/s drive and now
> both drives are identical. The replacement required to rebuild an array
> and it passed but with one exception: at the end of reconstruction
> process I got "task * blocked for more than 120 seconds" messages in my
> logs:
>
> Jan 4 23:38:35 barambo1 kernel: [12517.683173] INFO: task
> kjournald:1088 blocked for more than 120 seconds.

I never had this; however, my understanding is that it is related to the
ext3 journaling filesystem (as it is kjournald that is blocked for 2
minutes), not to RAID (which would be mdadm, mdX_raidY and the like...),
and that it shouldn't be blocking anything apart from writing the journal.
Was the server in a frozen state when this appeared in your log?

Just my 2 cents guess here,

Thomas

micah anderson

Jan 5, 2010, 10:20:03 AM
On Mon, 04 Jan 2010 20:34:09 +0800, Thomas Goirand <tho...@goirand.fr> wrote:
> Ross Halliday wrote:
> > Aside from any bugs that checkarray
> > function is definitely a pain on a production system.

I have this same problem with the Lenny kernels on certain machines. I
have not yet been able to identify anything specific that is identical
across the machines where this happens. Essentially, on these systems, the
monthly RAID check requires a reboot, as the drive subsystem becomes so
blocked that the load goes over 500 and the RAID resync never
completes. I can wait for days and it won't finish.

If I reboot the system and sync the RAID arrays before anything starts
to use that particular partition, then everything works fine.

On these systems I disable the monthly RAID check. It's not the right
solution, obviously, but it sucks to wake up on Sunday morning to find
multiple outages due to this scheduled RAID check.
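
For anyone doing the same, the check can also be started or aborted by hand
through sysfs at a time of your choosing; a minimal sketch (md0 is just an
example):

  echo check > /sys/block/md0/md/sync_action   # start a check manually
  cat /proc/mdstat                             # watch progress
  echo idle > /sys/block/md0/md/sync_action    # abort a running check/resync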

> Well, it's even more a pain to have no monthly check at all, and have
> your drive silently die without a warning. Also, my findings is that
> most of the time, such lock-up happens only on certain kind of
> controllers, or with defective (half working) HDD.

I agree silent drive death is bad, but in a RAID mirror setup, if one of
the drives dies, won't you be fine?

I am pretty certain it's not a particular type of controller, because I
have a number of machines with duplicate hardware: some have this problem,
some do not. The 'half working' HDD was my theory as well, but SMART
tests and badblocks don't seem to turn anything up.
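
(For reference, a typical non-destructive read-only badblocks scan looks
roughly like this; the device name is only an example:

  # read-only surface scan; -s shows progress, -v prints errors as they are found
  badblocks -sv /dev/sda
)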

m

Peter Vratny

Jan 5, 2010, 11:20:02 AM
micah anderson wrote:
> I have this same problem with the Lenny kernels on certain machines. I
> have not been able to identify anything specific that is identical on
> the machines where this happens yet. Essentially, on these systems, the

Here it is the same; the problem appeared with Lenny. All our
machines where this happens are IBM HS20 blades with IDE, hardware RAID
disabled (using md and ext3).

> On these systems I disable the monthly raid check, its not the right
> solution obviously, but it sucks to wake up on Sunday morning to find
> multiple outages due to this scheduled raid check.

That's what we did, too (you are lucky that your monitoring lets you
sleep until the morning :-)).


>> Well, it's even more a pain to have no monthly check at all, and have
>> your drive silently die without a warning. Also, my findings is that
>> most of the time, such lock-up happens only on certain kind of
>> controllers, or with defective (half working) HDD.
>
> I agree silent drive death is bad, but in a raid mirror setup, if one of
> the drives dies, wont you be fine?
>
> I am pretty certain its not a particular type of controller, because I
> have a number of duplicate hardware machines, some have this problem,
> some do not. The 'half working' HDD was my theory as well, but smart
> tests, badblocks doesn't seem to do anything.

I second this. IMHO it's a problem in the kernel (or rather, in some driver).
I hoped it would go away with some upgrade, but it did not (we're using the
stock kernel).

ys
Peter

--
"Wer nichts zu verbergen hat, hat bereits alles verloren"
http://klicklich.at


Thomas Goirand

Jan 5, 2010, 12:20:02 PM
Peter Vratny wrote:
> I second this. Imho its a problem of the kernel (resp. some driver). i
> hoped this would end with some upgrade, it did not (we're using stock
> kernel).

So you guys are saying this is in the SATA driver? If so, what SATA
controller are you running? It would be interesting to know whether you
all have the same hardware (and are thus using the same driver).

Thomas Goirand

Jan 5, 2010, 12:20:02 PM
micah anderson wrote:
> I am pretty certain its not a particular type of controller, because I
> have a number of duplicate hardware machines, some have this problem,
> some do not. The 'half working' HDD was my theory as well, but smart
> tests, badblocks doesn't seem to do anything.
>
> m

There are some very interesting statistics that Google has published
about SMART on their web farm. The result was that SMART catches about 60%
of failures, and for the other 40% it doesn't see anything coming. So it's
not a very reliable test (I'm not saying it's useless, just that it won't
catch all failures).

As for the badblocks, how did you check for them?

Altogether, I really think that RAID and HDD failure are a big issue that
we, as providers, have to deal with. I wish there were some reliable
solutions out there, considering the imperfections of RAID (hardware OR
software, both have issues...).

Peter Vratny

Jan 5, 2010, 5:10:02 PM
Thomas Goirand wrote:
> So you guys are saying this is in the sata driver? If so, then what's
> the SATA controler that you are running? It would be interesting to know
> if you all got the same hardware (and then using the same driver).

No, the opposite; that's why I mentioned that it's IDE (PATA) on our blades...

I found 3 of them in a quick search, all with this controller:

[ 2.258305] SvrWks CSB6: IDE controller (0x1166:0x0213 rev 0xb0) at
PCI slot 0000:00:0f.1

yours


Ross Halliday

Jan 5, 2010, 7:30:03 PM
(Sorry Thomas - hit Reply instead of Reply All by mistake)

> -----Original Message-----
> From: Thomas Goirand [mailto:tho...@goirand.fr]
> Sent: Monday, January 04, 2010 7:34 AM
> To: debia...@lists.debian.org
> Subject: Re: RAID1 problem - server freezes on md data-check
>
> Ross Halliday wrote:
> > Aside from any bugs that checkarray
> > function is definitely a pain on a production system.
>
> Well, it's even more a pain to have no monthly check at all, and have
> your drive silently die without a warning. Also, my findings is that
> most of the time, such lock-up happens only on certain kind of
> controllers, or with defective (half working) HDD.
>
> Thomas

Yes and no: as I see it, RAID1 has been less about protecting the data
itself and more of a 'hot spare' idea, so that if one disk bites the dust
there is instantaneous failover. It's a very basic design and I would
say it holds true to its name: "Redundant Array of Independent Disks".
Technologies like RAID5, which have parity checking, will tell you the
instant one disk is behaving badly and drop it from the array - this is
more suited to protecting against partial failures and data corruption.

I have to seriously question the value of this once-a-month check, as for the
other 27-30 days of the month your disk could be half-dead, spewing
corrupt data, and you'd never know until it was Sunday at 1:06 AM. It
seems like a sort of afterthought hack that renders your disks unusable
for a few hours. It would make more sense if the check were run nightly,
but you would probably see a lot of upset people complaining that
backups take forever and that after-hours performance takes a massive
hit. Perhaps there is some other way to check data integrity more often than
approximately once per full moon that doesn't destroy I/O performance?

I'm not trying to start a war or anything, and I apologize if I sound
like a complete idiot. Of course any corrections are welcome. However
the above is my opinion built on my knowledge of RAID systems, some of
it probably inaccurate as I am just a lowly sys admin :)


Cheers

---
Ross Halliday
Network Operations
WTC Communications

George Chelidze

Jan 6, 2010, 12:50:02 AM
> So you guys are saying this is in the sata driver? If so, then what's
> the SATA controler that you are running? It would be interesting to know
> if you all got the same hardware (and then using the same driver).

HP Embedded 6 Port SATA Controller with Embedded RAID

After more investigation I found that the second drive was unable to
negotiate a 3.0 Gb/s link and dropped to 1.5 Gb/s.

[ 3.433967] ata1: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
...
[ 4.518560] ata3: SATA link up 1.5 Gbps (SStatus 113 SControl 300)


That's another issue, which triggered a question:

Will the RAID be more stable if I force both drives to 1.5 Gb/s? For
Seagate drives this can be done by setting a jumper; I have no idea how
it's done on WD drives.
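
(As a side note, apart from jumpers, the link speed can also be capped in
software with the libata.force boot parameter; this is only a sketch, and it
assumes the slow drive sits on ata3 and that your kernel is new enough to
support the option:

  # append to the kernel command line in the boot loader configuration
  libata.force=3:1.5Gbps    # cap only port ata3
  libata.force=1.5Gbps      # or cap every port
)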

Anders Breindahl

Jan 6, 2010, 10:20:01 AM
Hello Ross,

> > Ross Halliday wrote:
> > > Aside from any bugs that checkarray
> > > function is definitely a pain on a production system.
> >
> > Well, it's even more a pain to have no monthly check at all, and have

> > your drive silently die without a warning. [...]

[...]

> I have to seriously question the value of this once-a-month check as the
> other 27-30 days of the month your disk could be half-dead, spewing
> corrupt data and you'd never know until it was Sunday at 1:06 AM. It
> seems like a sort of after-thought hack that renders your disks unusable
> for a few hours. It would make more sense if the check was run nightly,
> but you would probably see a lot of upset people complaining that
> backups take forever and after-hours performance would take a massive
> hit. Perhaps there is some other way to check data integrity more than
> approximately once per full moon that doesn't destroy I/O performance?

Since the re-check is comparable to (identical to?) the re-sync that's done
after a spare has been added, you ought to be able to control the I/O rate
using the usual /proc/sys/dev/raid/speed_limit_{min,max}-controls. I, however,
have never seen those controls actually have any effect.
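
For reference, the controls are plain files under /proc; values are in KB/s
per device, and the numbers below are only examples:

  echo 1000  > /proc/sys/dev/raid/speed_limit_min   # guaranteed floor
  echo 20000 > /proc/sys/dev/raid/speed_limit_max   # ceiling during check/resync
  sysctl -w dev.raid.speed_limit_max=20000          # equivalent sysctl form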


The re-check is, however, a guard against silent failures, in that it guarantees
that all logical chunks have been touched on all member disks. Had it not
been there, some server somewhere might find that the i'th block on the j'th
participant disk has somehow become unreadable---except this isn't known yet,
because nobody has accessed it for years.

When the k'th (for j != k) participant disk then experiences e.g. a head crash,
the recombination algorithm (be it raid1, raid5, whatever) cannot reconstruct the
logical data that has come to depend on the i'th block on the j'th disk. The
result is that a silent failure combines with a non-silent failure, and there
isn't enough information available to reconstruct the stored data. Data loss
results.


As a side note, the sort of corruption we're talking about here is something
that the disk itself reports (as opposed to some checksum that doesn't add up
in the md layer). So the same purpose as the first-Sunday-of-the-month check
could be fulfilled by telling each disk to run a self-test:

mdadm --detail /dev/md0 | egrep -o "/dev/sd[a-z]+" | sort -u | \
while read disk; do smartctl -t long $disk; done

This should not cause I/O death, since no data is ever communicated over the
I/O channel. But the disk itself will see that its CRCs don't add up, and
you'd be getting mails from smartd. Of course, the md layer won't take notice.
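
A long self-test can take hours; once it has finished, the result can be
read back per disk, e.g.:

  smartctl -l selftest /dev/sda   # prints the drive's self-test log, including failures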


Regards, skrewz.

Henrique de Moraes Holschuh

Jan 10, 2010, 12:40:02 PM
On Wed, 06 Jan 2010, George Chelidze wrote:
> > So you guys are saying this is in the sata driver? If so, then what's
> > the SATA controler that you are running? It would be interesting to know
> > if you all got the same hardware (and then using the same driver).
>
> HP Embedded 6 Port SATA Controller with Embedded RAID
>
> After more investigation I found that second drive was unable to
> negotiate 3.0Gb/s speed and dropped to 1.5Gb/s.
>
> [ 3.433967] ata1: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
> ...
> [ 4.518560] ata3: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
>
>
> That's another issue which triggered a question:
>
> Will it make RAID more stable if I force both drives to be 1.5Gb/s? For
> seagate drives it can be done by setting a jumper, no idea how it's
> possible on WD.

If that is about the same model of disk, connected to the same SATA host,
you do realise something is broken, don't you?

--
"One disk to rule them all, One disk to find them. One disk to bring
them all and in the darkness grind them. In the Land of Redmond
where the shadows lie." -- The Silicon Valley Tarot
Henrique Holschuh

Henrique de Moraes Holschuh

Jan 10, 2010, 12:50:01 PM
On Tue, 05 Jan 2010, Ross Halliday wrote:
> Technologies like RAID5 which have parity checking will tell you the
> instant one disk is behaving badly and kill it from the array - this is
> more suited to protecting against partial failures and data corruption.

You want RAID6 for that. The math is simple: you need n>1, n odd, to use a
simple majority vote to know which data set is correct. RAID5 cannot do
that. RAID1 can, but not with 2 devices.

RAID5 is only useful for known failure (i.e. you get information about WHICH
component device is bad, e.g., through sector IO errors). If it is silent
corruption, you're screwed.

> I have to seriously question the value of this once-a-month check as the
> other 27-30 days of the month your disk could be half-dead, spewing

So do I. At work we use short SMART scans *daily* to locate bad sectors (they
are good at finding a cluster of weakening sectors, but not perfect), and a
repair scrub once a week to reduce bitrot. But you're better off with a
scrub once a month than with never scrubbing at all.
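
As an illustration only, that kind of schedule could be wired up roughly like
this in a cron.d file (device and array names are examples):

  # daily short SMART self-test on both mirror members
  15 4 * * * root smartctl -t short /dev/sda; smartctl -t short /dev/sdb
  # weekly repair scrub of the array (tries to fix mismatches it finds)
  30 3 * * 0 root echo repair > /sys/block/md0/md/sync_action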

But that has nothing to do with silent corruption protection. If that is
what you're afraid of, your problem needs a very different solution than RAID
can give you.

--
"One disk to rule them all, One disk to find them. One disk to bring
them all and in the darkness grind them. In the Land of Redmond
where the shadows lie." -- The Silicon Valley Tarot
Henrique Holschuh

George Chelidze

Jan 11, 2010, 5:20:03 AM
> > After more investigation I found that second drive was unable to
> > negotiate 3.0Gb/s speed and dropped to 1.5Gb/s.
> >
> > [ 3.433967] ata1: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
> > ...
> > [ 4.518560] ata3: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
> >
> >
> > That's another issue which triggered a question:
> >
> > Will it make RAID more stable if I force both drives to be 1.5Gb/s? For
> > seagate drives it can be done by setting a jumper, no idea how it's
> > possible on WD.
>
> If that is about the same model of disk, connected to the same SATA host,
> you do realise something is broken, don't you?

$ lsscsi
[0:0:0:0] disk ATA GB0250EAFYK HPG1 /dev/sda
[0:0:1:0] disk ATA GB0250EAFYK HPG1 /dev/sdb
[2:0:0:0] cd/dvd HL-DT-ST DVD-RAM GH40L LA00 /dev/sr0

And yes, I realize that something is "broken", whether it's hardware or
some misconfiguration, but I am not well versed in all these
SATA things. What kind of tests are appropriate in this situation? Do
you think it's more of an HDD-related problem, the SATA host, or
something else?

Thanks in advance,

George

Henrique de Moraes Holschuh

Feb 7, 2010, 8:20:01 AM
On Mon, 11 Jan 2010, George Chelidze wrote:
> > > After more investigation I found that second drive was unable to
> > > negotiate 3.0Gb/s speed and dropped to 1.5Gb/s.
> > >
> > > [ 3.433967] ata1: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
> > > ...
> > > [ 4.518560] ata3: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
> > >
> > >
> > > That's another issue which triggered a question:
> > >
> > > Will it make RAID more stable if I force both drives to be 1.5Gb/s? For
> > > seagate drives it can be done by setting a jumper, no idea how it's
> > > possible on WD.
> >
> > If that is about the same model of disk, connected to the same SATA host,
> > you do realise something is broken, don't you?
>
> $ lsscsi
> [0:0:0:0] disk ATA GB0250EAFYK HPG1 /dev/sda
> [0:0:1:0] disk ATA GB0250EAFYK HPG1 /dev/sdb
> [2:0:0:0] cd/dvd HL-DT-ST DVD-RAM GH40L LA00 /dev/sr0
>
> And yes, I realize that something is "broken", whether it's hardware or
> some misconfiguration of anything, but I am not good enough in all these
> SATA things. What kind of tests are appropriate for this situation? Do
> you think it's more HDD related problem, or maybe SATA host, or
> something else?

It could be anything. You could ask for help on the linux-ide mailing list.
If you do that, you should provide them with:

1. A succinct description of the problem. Be direct and concise.

2. Full kernel logs showing the boot, and the problem

3. lspci -v and lspci -vvv

4. hdparm -i and hdparm -I output for each device.

5. smartctl -d ata -a output for each device.

6. Look at your HDs, and make sure any "jumpers" are in the correct
position, and disclose their configuration in the report.
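
If it helps, items 2-5 can be collected into a single file with something
like this sketch (run as root; the device names are examples and should be
adjusted to your system):

  {
    dmesg
    lspci -v
    lspci -vvv
    for d in /dev/sda /dev/sdb; do
      hdparm -i "$d"
      hdparm -I "$d"
      smartctl -d ata -a "$d"
    done
  } > ata-report.txt 2>&1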

And any extra info they ask of you. I hope they will be able to help you.

But the usual first suspect is the SATA cables; it is probably worth
replacing them before you report any problems to linux-ide, as they might
ask you to do just that...

--
"One disk to rule them all, One disk to find them. One disk to bring
them all and in the darkness grind them. In the Land of Redmond
where the shadows lie." -- The Silicon Valley Tarot
Henrique Holschuh
