9.1-stable: ATI IXP600 AHCI: CAM timeout

Oliver Fromme

May 29, 2013, 4:10:37 AM

to freebsd...@freebsd.org

Hi,

Yesterday I downloaded the latest 9.1 snapshot (May 15th)
from ftp.freebsd.org and installed it on a machine that was
previously running Linux. It works fine, except that I get
many of the following messages when there is heavy disk I/O,
e.g. when building world or ports:

ahcich0: Timeout on slot 23 port 0
ahcich0: is 00000000 cs f07fffff ss ffffffff rs ffffffff tfd c0 serr 00000000 cmd 0004bc17
(ada0:ahcich0:0:0:0): WRITE_FPDMA_QUEUED. ACB: 61 40 00 c9 e0 40 04 00 00 00 00 00
(ada0:ahcich0:0:0:0): CAM status: Command timeout
(ada0:ahcich0:0:0:0): Retrying command

It happens for *both* ahcich0/ada0 and ahcich1/ada1 equally
often (it's a gmirror), sometimes even at exactly the same
time so the messages for ada0 and ada1 are interleaved in
the dmesg output.

The worst thing is that the whole system seems to freeze
completely for about 10 seconds each time it happens.
Other than that, I haven't seen any ill effects, i.e. no
processes dying and no panics (so far). But the system is
quite unusable because of the freezes.

I'm pretty sure the hardware has no defects. The machine
was running Linux fine until recently.

Are there any known issues with FreeBSD + ATI IXP600?

The kernel is the default GENERIC from the snapshot, the
only additional modules loaded are geom_mirror and linux.ko.
The dmesg messages related to disks are copied below, and
the full dmesg can be found here:
http://www.secnetix.de/olli/tmp/dmesg.nox.txt

Best regards
Oliver

FreeBSD 9.1-STABLE #0: Mon May 13 05:10:23 UTC 2013
ro...@snap.freebsd.org:/usr/obj/usr/src/sys/GENERIC amd64
.
ahci0: <ATI IXP600 AHCI SATA controller> port 0xb000-0xb007,0xa000-0xa003,0x9000-0x9007,0x8000-0x8003,0x7000-0x700f mem 0xfe7ff800-0xfe7ffbff irq 22 at device 18.0 on pci0
ahci0: AHCI v1.10 with 4 3Gbps ports, Port Multiplier supported
ahcich0: <AHCI channel> at channel 0 on ahci0
ahcich1: <AHCI channel> at channel 1 on ahci0
ahcich2: <AHCI channel> at channel 2 on ahci0
ahcich3: <AHCI channel> at channel 3 on ahci0
.
.
(aprobe0:ahcich0:0:15:0): NOP. ACB: 00 00 00 00 00 00 00 00 00 00 00 00
(aprobe0:ahcich0:0:15:0): CAM status: Command timeout
(aprobe0:ahcich0:0:15:0): Error 5, Retries exhausted
(aprobe1:ahcich1:0:15:0): NOP. ACB: 00 00 00 00 00 00 00 00 00 00 00 00
(aprobe1:ahcich1:0:15:0): CAM status: Command timeout
(aprobe1:ahcich1:0:15:0): Error 5, Retries exhausted
ada0 at ahcich0 bus 0 scbus0 target 0 lun 0
ada0: <SAMSUNG HD403LJ CT100-12> ATA-8 SATA 2.x device
ada0: 300.000MB/s transfers (SATA 2.x, UDMA6, PIO 8192bytes)
ada0: Command Queueing enabled
ada0: 381554MB (781422768 512 byte sectors: 16H 63S/T 16383C)
ada0: Previously was known as ad4
ada1 at ahcich1 bus 0 scbus1 target 0 lun 0
ada1: <SAMSUNG HD403LJ CT100-12> ATA-8 SATA 2.x device
ada1: 300.000MB/s transfers (SATA 2.x, UDMA6, PIO 8192bytes)
ada1: Command Queueing enabled
ada1: 381554MB (781422768 512 byte sectors: 16H 63S/T 16383C)
ada1: Previously was known as ad6
.
GEOM_MIRROR: Device mirror/gm0 launched (2/2).
.
Trying to mount root from ufs:/dev/mirror/gm0s1a [rw]...
.
ahcich0: Timeout on slot 23 port 0
ahcich0: is 00000000 cs f07fffff ss ffffffff rs ffffffff tfd c0 serr 00000000 cmd 0004bc17
(ada0:ahcich0:0:0:0): WRITE_FPDMA_QUEUED. ACB: 61 40 00 c9 e0 40 04 00 00 00 00 00
(ada0:ahcich0:0:0:0): CAM status: Command timeout
(ada0:ahcich0:0:0:0): Retrying command
ahcich1: Timeout on slot 12 port 0
ahcich1: is 00000000 cs ffff8fff ss ffffffff rs ffffffff tfd 40 serr 00000000 cmd 0004ee17
(ada1:ahcich1:0:0:0): WRITE_FPDMA_QUEUED. ACB: 61 20 80 85 e3 40 04 00 00 00 00 00
(ada1:ahcich1:0:0:0): CAM status: Command timeout
(ada1:ahcich1:0:0:0): Retrying command
ahcich1: Timeout on slot 2 port 0
ahcich1: is 00000000 cs 00000000 ss 0000001c rs 0000001c tfd 40 serr 00000000 cmd 0004e417
ahcich0: Timeout on slot 12 port 0
(ada1:ahcich1:0:0:0): WRITE_FPDMA_QUEUED. ACB: 61 08 e0 04 e3 40 04 00 00 00 00 00
(ada1:ahcich1:0:0:0): CAM status: Command timeout
ahcich0: is 00000000 cs 00000000 ss 00007000 rs 00007000 tfd 40 serr 00000000 cmd 0004ee17
(ada1:ahcich1:0:0:0): Retrying command
(ada0:ahcich0:0:0:0): WRITE_FPDMA_QUEUED. ACB: 61 08 e0 04 e3 40 04 00 00 00 00 00
(ada0:ahcich0:0:0:0): CAM status: Command timeout
(ada0:ahcich0:0:0:0): Retrying command
pid 40615 (try), uid 0: exited on signal 10 (core dumped)
ahcich1: Timeout on slot 7 port 0
ahcich1: is 00000000 cs fffff07f ss ffffffff rs ffffffff tfd c0 serr 00000000 cmd 0004ac17
ahcich0: (ada1:ahcich1:0:0:0): WRITE_FPDMA_QUEUED. ACB: 61 40 00 7d 92 40 02 00 00 00 00 00
Timeout on slot 19 port 0
(ada1:ahcich1:0:0:0): CAM status: Command timeout
ahcich0: is 00000000 cs ff07ffff ss ffffffff rs ffffffff tfd c0 serr 00000000 cmd 0004b817
(ada1:ahcich1:0:0:0): Retrying command
(ada0:ahcich0:0:0:0): WRITE_FPDMA_QUEUED. ACB: 61 40 00 7d 92 40 02 00 00 00 00 00
(ada0:ahcich0:0:0:0): CAM status: Command timeout
(ada0:ahcich0:0:0:0): Retrying command
ahcich1: Timeout on slot 12 port 0
ahcich1: is 00000000 cs 00000000 ss 0000f000 rs 0000f000 tfd 40 serr 00000000 cmd 0004ef17
ahcich0: Timeout on slot 24 port 0
(ada1:ahcich1:0:0:0): WRITE_FPDMA_QUEUED. ACB: 61 20 d8 78 e4 40 04 00 00 00 00 00
(ada1:ahcich1:0:0:0): CAM status: Command timeout
ahcich0: is 00000000 cs 00000000 ss 0f000000 rs 0f000000 tfd 40 serr 00000000 cmd 0004fb17
(ada1:ahcich1:0:0:0): Retrying command
(ada0:ahcich0:0:0:0): WRITE_FPDMA_QUEUED. ACB: 61 20 d8 78 e4 40 04 00 00 00 00 00
(ada0:ahcich0:0:0:0): CAM status: Command timeout
(ada0:ahcich0:0:0:0): Retrying command
ahcich1: Timeout on slot 1 port 0
ahcich1: is 00000000 cs 00000000 ss 0000003e rs 0000003e tfd 40 serr 00000000 cmd 0004e517
ahcich0: Timeout on slot 13 port 0
(ada1:ahcich1:0:0:0): WRITE_FPDMA_QUEUED. ACB: 61 08 30 e0 e4 40 04 00 00 00 00 00
(ada1:ahcich1:0:0:0): CAM status: Command timeout
ahcich0: is 00000000 cs 00000000 ss 0003e000 rs 0003e000 tfd 40 serr 00000000 cmd 0004f117
(ada1:ahcich1:0:0:0): Retrying command
(ada0:ahcich0:0:0:0): WRITE_FPDMA_QUEUED. ACB: 61 08 30 e0 e4 40 04 00 00 00 00 00
(ada0:ahcich0:0:0:0): CAM status: Command timeout
(ada0:ahcich0:0:0:0): Retrying command
.
.
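
For readers following the register dumps above, here is a minimal Python sketch that turns the hex masks into slot numbers. It assumes that cs, ss and rs are 32-bit per-slot bitmasks (read here as command issue, SATA active, and the driver's own record of running slots). The names LINE_RE, slots and decode are made up for this illustration and are not part of any FreeBSD tool.

```python
import re

# Matches the second line of each timeout report, e.g.
# "ahcich1: is 00000000 cs 00000000 ss 0000f000 rs 0000f000 tfd 40 ..."
LINE_RE = re.compile(
    r"(?P<chan>ahcich\d+): is (?P<is>[0-9a-f]{8}) cs (?P<cs>[0-9a-f]{8}) "
    r"ss (?P<ss>[0-9a-f]{8}) rs (?P<rs>[0-9a-f]{8})"
)


def slots(mask):
    """Return the slot numbers (0..31) whose bit is set in a 32-bit mask."""
    return [bit for bit in range(32) if (mask >> bit) & 1]


def decode(line):
    """Decode one register-dump line; return None if it does not match."""
    m = LINE_RE.search(line)
    if m is None:
        return None
    return {
        "channel": m.group("chan"),
        # Assumption: each of these fields is a per-slot bitmask.
        "cs": slots(int(m.group("cs"), 16)),
        "ss": slots(int(m.group("ss"), 16)),
        "rs": slots(int(m.group("rs"), 16)),
    }


if __name__ == "__main__":
    sample = ("ahcich1: is 00000000 cs 00000000 ss 0000f000 rs 0000f000 "
              "tfd 40 serr 00000000 cmd 0004ef17")
    print(decode(sample))
```

Applied to the dumps above, `ss 0000f000` decodes to slots 12 to 15 with the timeout reported on slot 12, and in `cs f07fffff` the clear bits are 23 to 27 with the timeout reported on slot 23.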

--
Oliver Fromme, secnetix GmbH & Co. KG, Marktplatz 29, 85567 Grafing
Handelsregister: Amtsgericht Muenchen, HRA 74606, Geschäftsfuehrung:
secnetix Verwaltungsgesellsch. mbH, Handelsreg.: Amtsgericht München,
HRB 125758, Geschäftsführer: Maik Bachmann, Olaf Erb, Ralf Gebhart

FreeBSD-Dienstleistungen/-Produkte + mehr: http://www.secnetix.de/bsd

"The most important decision in [programming] language design
concerns what is to be left out." -- Niklaus Wirth
_______________________________________________
freebsd...@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-stable
To unsubscribe, send any mail to "freebsd-stabl...@freebsd.org"

Steven Hartland

May 29, 2013, 9:12:12 AM

to Oliver Fromme, freebsd...@freebsd.org

Have you checked your sata cables and psu outputs?

Both of these could be the underlying cause of poor signalling.

Oliver Fromme

May 29, 2013, 9:17:18 AM

to freebsd...@freebsd.org

Now I have some more information ...

The problem disappears when I disable NCQ, i.e. set the
number of tags to 1 with camcontrol. Using binary search
I found out that the problem also disappears with 2 tags,
but with 3 tags I get the same amount of errors as with
the default of 32 tags.

Interestingly, the problem also disappears when I reduce
the SATA level from II to I (i.e. from 3 to 1.5 Gbit/s),
even if the NCQ tags are left at the default of 32.

Now the question is: Is it better to reduce the NCQ tags
from 32 to 2, or to reduce the SATA bandwidth from 3 Gbps
to 1.5 Gbps? What is more likely to impact performance
on a mixed server with shell users, apache, sendmail, DNS
and a few other things?

Best regards
Oliver


Oliver Fromme

May 29, 2013, 10:21:55 AM

to freebsd...@freebsd.org, kil...@multiplay.co.uk

Steven Hartland wrote:
> Have you checked your sata cables and psu outputs?
>
> Both of these could be the underlying cause of poor signalling.

I can't easily check that because it is a cheap rented
server in a remote location.

But I don't believe it is bad cabling or PSU anyway, or
otherwise the problem would occur intermittently all the
time if the load on the disks is sufficiently high.
But it only occurs at tags=3 and above. At tags=2 it does
not occur at all, no matter how hard I hammer on the disks.

At the moment I'm inclined to believe that it is either
a bug in the HDD firmware or in the controller. The disks
aren't exactly new, they're 400 GB Samsung ones that are
several years old. I think it's not uncommon to have bugs
in the NCQ implementation in such disks.

The only thing that puzzles me is the fact that the problem
also disappears completely when I reduce the SATA rev from
II to I, even at tags=32.

Best regards
Oliver


Ian Lepore

May 29, 2013, 11:16:38 AM

to Oliver Fromme, kil...@multiplay.co.uk, freebsd...@freebsd.org

On Wed, 2013-05-29 at 16:21 +0200, Oliver Fromme wrote:
> Steven Hartland wrote:
> > Have you checked your sata cables and psu outputs?
> >
> > Both of these could be the underlying cause of poor signalling.
>
> I can't easily check that because it is a cheap rented
> server in a remote location.
>
> But I don't believe it is bad cabling or PSU anyway, or
> otherwise the problem would occur intermittently all the
> time if the load on the disks is sufficiently high.
> But it only occurs at tags=3 and above. At tags=2 it does
> not occur at all, no matter how hard I hammer on the disks.
>
> At the moment I'm inclined to believe that it is either
> a bug in the HDD firmware or in the controller. The disks
> aren't exactly new, they're 400 GB Samsung ones that are
> several years old. I think it's not uncommon to have bugs
> in the NCQ implementation in such disks.
>
> The only thing that puzzles me is the fact that the problem
> also disappears completely when I reduce the SATA rev from
> II to I, even at tags=32.
>

It seems to me that you dismiss signaling problems too quickly.
Consider the possibilities... A bad cable leads to intermittent errors
at higher speeds. When NCQ is disabled or limited the software handles
these errors pretty much transparently. When NCQ is not limited and
there are many outstanding requests, suddenly the error handling in the
software breaks down somehow and a minor recoverable problem becomes an
in-your-face error.

I'm not saying any of the foregoing is true, just that you should
consider the possibility that you're dealing with multiple problems
which are only loosely coupled, but together can seem like a single more
serious problem. You don't know enough yet to casually dismiss
anything.

-- Ian

Oliver Fromme

May 29, 2013, 2:44:31 PM

to freebsd...@freebsd.org, i...@freebsd.org

Well ... I also can't dismiss the possibility that there is
a mouse in the machine that is pulling the SATA cables twice
every minute. :-)

But seriously ... I don't see how bad cabling could cause
errors at tags=3 and no errors at all at tags=2. It shouldn't
make a difference for the cables if there are two or three
tags used. And by the way, it doesn't make a difference at
all whether I use tags=3 or tags=32; the rate of errors is the
same in both cases (about two per minute during buildworld).

I have googled a bit; the Samsung HD401LJ and HD403LJ don't
seem to be innocent ... There are lots of pages mentioning
problems with NCQ and SATA I vs. II.

Best regards
Oliver


Adam McDougall

May 29, 2013, 3:53:46 PM

to freebsd...@freebsd.org

On 05/29/13 10:21, Oliver Fromme wrote:
> Steven Hartland wrote:
> > Have you checked your sata cables and psu outputs?
> >
> > Both of these could be the underlying cause of poor signalling.
>
> I can't easily check that because it is a cheap rented
> server in a remote location.
>
> But I don't believe it is bad cabling or PSU anyway, or
> otherwise the problem would occur intermittently all the
> time if the load on the disks is sufficiently high.
> But it only occurs at tags=3 and above. At tags=2 it does
> not occur at all, no matter how hard I hammer on the disks.
>
> At the moment I'm inclined to believe that it is either
> a bug in the HDD firmware or in the controller. The disks
> aren't exactly new, they're 400 GB Samsung ones that are
> several years old. I think it's not uncommon to have bugs
> in the NCQ implementation in such disks.
>
> The only thing that puzzles me is the fact that the problem
> also disappears completely when I reduce the SATA rev from
> II to I, even at tags=32.
>
> Best regards
> Oliver
>
>

Jeremy Chadwick knows of some hardware faults with the IXP600/700.
There may be more information in the freebsd-fs mailing list archives,
or you could discuss it with him:

http://docs.freebsd.org/cgi/mid.cgi?20130414194440.GB38338

That email mentions port multipliers, but the problems may extend beyond them.

Jeremy Chadwick

May 29, 2013, 6:54:33 PM

to Oliver Fromme, freebsd...@freebsd.org

On Wed, May 29, 2013 at 10:09:14AM +0200, Oliver Fromme wrote:
> Hi,
>
> Yesterday I have downloaded the latest 9.1 snapshot (May 15th)
> from ftp.freebsd.org and installed it on a machine that was
> previously running Linux. It works fine, except that I get
> many the following when there is heavy disk I/O, e.g. when
> building world or ports:
>
> ahcich0: Timeout on slot 23 port 0
> ahcich0: is 00000000 cs f07fffff ss ffffffff rs ffffffff tfd c0 serr 00000000 cmd 0004bc17
> (ada0:ahcich0:0:0:0): WRITE_FPDMA_QUEUED. ACB: 61 40 00 c9 e0 40 04 00 00 00 00 00
> (ada0:ahcich0:0:0:0): CAM status: Command timeout
> (ada0:ahcich0:0:0:0): Retrying command

The messages above indicate two things:

1) The AHCI driver is reporting an internal timeout when trying to speak
to the underlying device (disk) attached to whatever maps to ahcich0;
this is an "AHCI-level timeout", and the 2nd line shows all of the
AHCI-level status conditions at that time,

2) CAM reports what it was trying to do when that happened,
specifically issue WRITE_FPDMA_QUEUED (an NCQ-based write to ada0),
which timed out after 30 seconds (kern.cam.ada.default_timeout).
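
As an aside, the ACB bytes in the CAM line can be unpacked to see where and how much each failed write was. The following is a minimal Python sketch under two assumptions that are not stated in this thread: that the twelve bytes are printed in the order command, features, LBA low/mid/high, device, the three extended LBA bytes, extended features, sector count, extended sector count; and that for NCQ commands the transfer length travels in the features fields. The function name decode_acb is hypothetical.

```python
def decode_acb(acb):
    """Decode the 12 hex bytes printed after 'ACB:' for an NCQ read/write.

    The byte order below is an assumption of this sketch, not something
    confirmed in the thread. The special case of a zero sector count is
    not handled.
    """
    b = [int(x, 16) for x in acb.split()]
    if len(b) != 12:
        raise ValueError("expected 12 hex bytes, got %d" % len(b))
    (command, features, lba_low, lba_mid, lba_high, device,
     lba_low_exp, lba_mid_exp, lba_high_exp,
     features_exp, sector_count, sector_count_exp) = b
    # 48-bit LBA: the three low bytes first, then the three extended bytes.
    lba = (lba_low | (lba_mid << 8) | (lba_high << 16) |
           (lba_low_exp << 24) | (lba_mid_exp << 32) | (lba_high_exp << 40))
    # Assumed NCQ convention: the length is carried in features/features_exp.
    sectors = features | (features_exp << 8)
    return {
        "command": command,
        "lba": lba,
        "sectors": sectors,
        "bytes": sectors * 512,  # 512-byte sectors, as reported for ada0
    }


if __name__ == "__main__":
    info = decode_acb("61 40 00 c9 e0 40 04 00 00 00 00 00")
    print(hex(info["command"]), hex(info["lba"]), info["sectors"], info["bytes"])
```

Under those assumptions, the first logged command is a 64-sector (32 KiB) write at LBA 0x04e0c900, and the pairs of identical ACBs for ada0 and ada1 are the two halves of the same mirrored write.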

> It happens for *both* ahcich0/ada0 and ahcich1/ada1 equally
> often (it's a gmirror), sometimes even at exactly the same
> time so the messages for ada0 and ada1 are interleaved in
> the dmesg output.

Both surprising and not surprising (to me anyway), on numerous levels.

> The worst thing is that the whole system seems to freeze
> completely for about 10 seconds each time it happens.
> Other than that, I haven't seen any ill effects, i.e. no
> processes dying and no panics (so far). But the system is
> quite unusable because of the freezes.

There isn't much you can do about this. I get the impression from your
statement this is the first time you've ever encountered an I/O timeout
in your life? :-) This is just how it works -- pretty much the entire
I/O subsystem (for the device(s) involved) "stalls" until a response to
the CDB gets received. It's like this on all OSes, all systems; it's
how I/O works.

The AHCI driver may have different timeout settings; I haven't looked.

The same CDB gets resubmitted to the controller 5 times
(kern.cam.ada.retry_count will say 4, but it starts at 0 if I remember
right), in hopes that the I/O transaction will eventually go through.

Repeated device timeouts with no successful responses will eventually
cause CAM or AHCI (I forget which driver/subsystem) to drop the disk.
In your case, this could mean ada0 and ada1 eventually getting dropped,
which would induce a panic since you're using them for your root
filesystem. (I wonder if there are readers of this thread who are
starting to see why I use a single disk for my main OS drive...)

> I'm pretty sure the hardware has no defects. The machine
> was running Linux fine until recently.
>
> Are there any known issues with FreeBSD + ATI IXP600?

This is opening a can of worms, which I've discussed in the past.
Please see my posts to freebsd-fs and/or freebsd-stable archives
(another person in this thread mentioned it as well).

Fact: there is still not enough low-level, hard evidence at this time to
determine if the problem is with the AHCI driver, the AMD/ATI IXP600
controller, or Samsung disks. The situations I have dealt with in the
past always were inconclusive. There have been reports of problems with
non-Samsung disks as well, but the report count there is extremely low
in comparison.

Fact: You will find complaints on Linux lists about both the controller
and the drives as well (in combo). Take that to mean whatever you wish.
Use Google and search for "SB600 HD403LJ Linux" or "SB600 Samsung Linux"
and see for yourself.

Fact: Samsung's SpinPoint series has had a troubling past of firmware
bugs. Things have gotten better on their newer-ish drives, but the
"slightly older" ones, to me, seemed more like a learning experience
for engineers. I am not picking on Samsung exclusively; all drive
vendors have had problems historically, there is no such thing as a
"reliable" drive vendor in this day and age. You go with whatever works
for you/whatever your experiences justify.

All that said:

There is some code in sys/dev/ahci/ahci.c that indicates "one-off"
behaviour for the SB600/IXP600, pertaining to NCQ. However this was
committed a long time ago (r196777 and r196796). I look at this
code and I can think of one problem with it, but answers to my questions
below will provide what I need.

> The kernel is the default GENERIC from the snapshot, the
> only additional modules loaded are geom_mirror and linux.ko.
> The dmesg messages related to disks are copied below, and
> the full dmesg can be found here:
> http://www.secnetix.de/olli/tmp/dmesg.nox.txt
>
> Best regards
> Oliver
>
> FreeBSD 9.1-STABLE #0: Mon May 13 05:10:23 UTC 2013
> ro...@snap.freebsd.org:/usr/obj/usr/src/sys/GENERIC amd64

> ..

> ahci0: <ATI IXP600 AHCI SATA controller> port 0xb000-0xb007,0xa000-0xa003,0x9000-0x9007,0x8000-0x8003,0x7000-0x700f mem 0xfe7ff800-0xfe7ffbff irq 22 at device 18.0 on pci0
> ahci0: AHCI v1.10 with 4 3Gbps ports, Port Multiplier supported
> ahcich0: <AHCI channel> at channel 0 on ahci0
> ahcich1: <AHCI channel> at channel 1 on ahci0
> ahcich2: <AHCI channel> at channel 2 on ahci0
> ahcich3: <AHCI channel> at channel 3 on ahci0
> ..

> ..

> (aprobe0:ahcich0:0:15:0): NOP. ACB: 00 00 00 00 00 00 00 00 00 00 00 00
> (aprobe0:ahcich0:0:15:0): CAM status: Command timeout
> (aprobe0:ahcich0:0:15:0): Error 5, Retries exhausted
> (aprobe1:ahcich1:0:15:0): NOP. ACB: 00 00 00 00 00 00 00 00 00 00 00 00
> (aprobe1:ahcich1:0:15:0): CAM status: Command timeout
> (aprobe1:ahcich1:0:15:0): Error 5, Retries exhausted
> ada0 at ahcich0 bus 0 scbus0 target 0 lun 0
> ada0: <SAMSUNG HD403LJ CT100-12> ATA-8 SATA 2.x device
> ada0: 300.000MB/s transfers (SATA 2.x, UDMA6, PIO 8192bytes)
> ada0: Command Queueing enabled
> ada0: 381554MB (781422768 512 byte sectors: 16H 63S/T 16383C)
> ada0: Previously was known as ad4
> ada1 at ahcich1 bus 0 scbus1 target 0 lun 0
> ada1: <SAMSUNG HD403LJ CT100-12> ATA-8 SATA 2.x device
> ada1: 300.000MB/s transfers (SATA 2.x, UDMA6, PIO 8192bytes)
> ada1: Command Queueing enabled
> ada1: 381554MB (781422768 512 byte sectors: 16H 63S/T 16383C)
> ada1: Previously was known as ad6

> ..

> GEOM_MIRROR: Device mirror/gm0 launched (2/2).

> ..

> Trying to mount root from ufs:/dev/mirror/gm0s1a [rw]...

> ..

> ..

It's worth pointing out that all of the events you provided are writes.
In my experience, historically, that has usually been the case. If a
drive firmware screws around when handling an NCQ write, taking too long
to do something (think firmware bug), this can happen. If that's the
case, the fact it happens on 2 disks of the same type thus wouldn't
surprise me.

I've mentioned in the past that I know of a few situations where this
can happen, particularly with 4KByte sector drives, depending on how the
user set up the system. In this case, the Samsung HD403LJ is supposedly
a 512-byte sector drive, but the drive probably complies with an older
ATA specification and thus only provides the logical sector size in ATA
IDENTIFY output, thus the system must assume physical=logical
(camcontrol and smartmontools will both say something to the effect of
"512 bytes logical/physical").

I would appreciate the following:

1. smartctl -x {ada0,ada1} output using a recent version of
smartmontools (6.1 if possible please),

2. camcontrol identify {ada0,ada1} -v output (note the -v),

3. If you are running smartd(8) or not,

4. pciconf -lvbc output.

Anecdotal story:

A lot of people forget the infamous nVidia nForce 4 vs. Maxtor NCQ issue
that circulated "PC enthusiast" sites during the mid-2000s. Neither
company wanted to own up to the problem, blaming each other instead.
There was never any official statement made as to where the problem was,
only that nVidia updated their nForce 4 controller drivers with some
sort of workaround (details were not disclosed), and Maxtor also quietly
added a document to their website stating that you could get a firmware
from Technical Support that would address the problem as well. I had
a combination of the two at the time, which is why I remember it. Still to
this day nobody knows who was really responsible. I won't get into the
whole political/societal aspects of why vendors always blame one another
rather than solve real problems.

There is no way at this time (in real-time or via loader.conf) to
disable NCQ within the AHCI driver. It is possible to add an entry to
the AHCI quirks table for your controller that sets AHCI_Q_NONCQ, if you
want to try that. I can give you a patch for that, but I need to see
the output from the above (4) commands first -- it may not be necessary
to try, depending on the results.

I have probably left out key/important information within this mail,
which is an indicator of how tired I have grown of seeing it come up.
:-(

--
| Jeremy Chadwick j...@koitsu.org |
| UNIX Systems Administrator http://jdc.koitsu.org/ |
| Mountain View, CA, US |
| Making life hard for others since 1977. PGP 4BD6C0CB |

Mike Pumford

Jun 3, 2013, 10:08:21 AM

to freebsd...@freebsd.org

It could also be a software bug in the way CAM handles the failure of
NCQ commands. When command queueing is used on a SCSI drive and a queued
command fails only that command fails. A queued command failure on a
SATA device fails ALL currently queued commands. I've not looked at the
code but do the SATA CAM drivers do the right thing here?

Fewer queued commands make it less likely that multiple commands will be
in progress when a failure occurs. A lower link rate also makes you
more immune to signal failures.

Mike

Jeremy Chadwick

Jun 3, 2013, 4:22:43 PM

to Mike Pumford, Alexander Motin, freebsd...@freebsd.org

Quoting T13/2015-D ATA8-ACS2 WD spec:

"If an error occurs while the device is processing an NCQ command, then
the device shall return command aborted for all NCQ commands that are in
the queue and shall return command aborted for any new commands, except
a READ LOG EXT command requesting log address 10h, until the device
completes a READ LOG EXT command requesting log address 10h (i.e.,
reading the NCQ Command Error log) without error."
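
To make the quoted rule concrete, here is a toy Python model of it. It is only a sketch of the behaviour described in the quotation, not of any real drive firmware or of the ahci(4) driver; the class NcqDevice and its method names are invented for this illustration.

```python
class NcqDevice:
    """Toy model of the quoted rule; not a model of any real drive."""

    def __init__(self):
        self.queue = set()        # tags of outstanding NCQ commands
        self.error_state = False  # set after an NCQ error, cleared by the log read

    def submit(self, tag):
        """Queue a new NCQ command unless the device is in the error state."""
        if self.error_state:
            return "aborted"
        self.queue.add(tag)
        return "queued"

    def fail(self, tag):
        """One queued command hits an error: every queued command is aborted."""
        if tag not in self.queue:
            raise KeyError(tag)
        aborted = sorted(self.queue)
        self.queue.clear()
        self.error_state = True
        return aborted

    def read_ncq_error_log(self):
        """Stands in for READ LOG EXT, log address 10h, completing without error."""
        self.error_state = False
        return "ok"


if __name__ == "__main__":
    dev = NcqDevice()
    for tag in range(3):
        dev.submit(tag)
    print(dev.fail(1))               # all three tags are aborted together
    print(dev.submit(5))             # rejected until the error log is read
    print(dev.read_ncq_error_log())
    print(dev.submit(5))             # accepted again
```

In this model, the number of commands lost to a single error grows with the queue depth, which is the collateral effect Mike describes.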

While I can't easily provide an answer to your question, I can tell you
that sys/dev/ahci/ahci.c does execute READ LOG EXT (command 0x2f) for
certain scenarios (the code is in function ahci_issue_recovery()).

The one person who can answer this question is mav@, who is now CC'd.

> Less commands queued makes it less likely that multiple commands
> will be in progress when a failure occurs. A lower link rate also
> makes you more immune to signal failures.

He isn't seeing SATA-level signal/link failure; the AHCI driver would
complain about that, and those messages aren't there. Unless, of
course, those messages are only visible when verbose booting is enabled
(I hope not).


Alexander Motin

Jun 4, 2013, 2:06:05 AM

to Jeremy Chadwick, freebsd...@freebsd.org, Mike Pumford

I am not aware of any flaws in the present CAM ATA error recovery logic.
Sending READ LOG EXT is indeed implemented at the ahci(4) driver level
(as in siis(4) and mvs(4)), since it was complicated or impossible to do
in shared code: the higher levels have no idea about the tag allocation
done by the lower-level drivers.

> The one person who can answer this question is mav@, who is now CC'd.
>
>> Less commands queued makes it less likely that multiple commands
>> will be in progress when a failure occurs. A lower link rate also
>> makes you more immune to signal failures.
>
> He isn't seeing SATA-level signal/link failure; the AHCI driver would
> complain about that, and those messages aren't there. Unless, of
> course, those messages are only visible when verbose booting is enabled
> (I hope not).

Just a curious historical point: I had an old system with an NVIDIA MCP55
chipset where Linux had worked well, but FreeBSD had problems with
SATA: all disk transfers were really slow, without any errors being
reported, and after some point the system started to hang. That series of
chipsets had a long history of problems, so for some time I was looking
for a way to handle it in software. But after many experiments I
accidentally found out that disabling 6 small but very powerful fans
worked around the problem. I checked the PSU voltages, and they were fine.
Switching the fans to a separate PSU also helped. Finally I just replaced
the system's main PSU with a different one and the problems were gone. My
best guess was that the capacitors in that PSU, due to old age, were unable
to filter the fans' electrical noise, which started to interfere with SATA
and later with other signals. Now the same PSU works perfectly fine in the
same case with a smaller Atom-based motherboard.

I am not saying that the ahci(4) driver is perfect, but hardware issues are
always possible, even if the system worked fine before.

--
Alexander Motin