SMART problems in 2.6.22

Bruce Allen

Jul 8, 2007, 8:44:21 PM
to Mark Lord, David Greaves, Douglas Gilbert, Tejun Heo, Alan Cox, Jeff Garzik, Linux Kernel Mailing List, Jan Dvorak, Smartmontools Mailing List, Klaus Fuerstberger, Bruce Allen
Mark, David, Doug, Tejun, Alan, Jeff, LKML,

I'm afraid that there may be some problem with SMART + libata in the
2.6.22 kernel. An hour ago I discovered that I missed a month of
correspondence (some LKML, some private) about this problem which Alan,
Tejun, Jeff, Mark and others copied to me -- it was automatically shoved
into one of my mailboxes by my mail client. Sorry about that. So I am
trying to catch up to see if there is some real problem or not.

Here is a typical bug report that worries me:
http://article.gmane.org/gmane.linux.utilities.smartmontools/4712

Here is another similar report:
http://thread.gmane.org/gmane.linux.utilities.smartmontools/4713

And another report:
http://www.mail-archive.com/debian-b...@lists.debian.org/msg358354.html

From some of the earlier threads that I missed (below) I have the
impression that the problem may be a very simple one, namely that starting
with 2.6.22 one needs to run a command to enable SMART when a box is first
booted -- the kernel no longer does this as part of the init/setup of the
disks. But that is NOT consistent with the first two reports above, which
show 'SMART ENABLED'.

Here are some of the earlier threads that I completely missed:

http://www.ussg.iu.edu/hypermail/linux/kernel/0706.1/0849.html
http://www.mail-archive.com/linux-...@vger.kernel.org/msg164863.html

Before I go off half-cocked, could anyone shed some light on this? Is
there a real problem here or just something dumb?

Cheers,
Bruce

Bruce Allen

Jul 8, 2007, 9:15:39 PM
to Mark Lord, David Greaves, Douglas Gilbert, Tejun Heo, Alan Cox, Kai Makisara, Jeff Garzik, Linux Kernel Mailing List, Jan Dvorak, Smartmontools Mailing List, Klaus Fuerstberger, Bruce Allen
Here is another similar report:

http://article.gmane.org/gmane.linux.utilities.smartmontools/4704/match=diamondmax

Again, this indicates that SMART is enabled. But it's not clear what the
kernel version here is. The report indicates that the problem started
with an FC7 kernel upgrade.

Bruce

Jeff Garzik

Jul 8, 2007, 10:10:26 PM
to Bruce Allen, Mark Lord, David Greaves, Douglas Gilbert, Tejun Heo, Alan Cox, Kai Makisara, Linux Kernel Mailing List, Jan Dvorak, Smartmontools Mailing List, Klaus Fuerstberger, Bruce Allen
On the base point, libata has never enabled SMART on its own. That's
always up to the BIOS, etc.

It's possible that the recent addition of ACPI support will cause disks
to be in different modes than previously expected. ACPI supplies ATA
taskfiles to be pushed to the disk, and who knows what's in there...

Jeff

David Greaves

Jul 9, 2007, 7:55:37 AM
to Bruce Allen, Mark Lord, Douglas Gilbert, Tejun Heo, Alan Cox, Jeff Garzik, Linux Kernel Mailing List, Jan Dvorak, Smartmontools Mailing List, Klaus Fuerstberger, Bruce Allen
Hi Bruce

> From some of the earlier threads that I missed (below) I have the
> impression that the problem may be a very simple one, namely that
> starting with 2.6.22 one needs to run a command to enable SMART when a
> box is first booted -- the kernel no longer does this as part of the
> init/setup of the disks. But that is NOT consistent with the first two
> reports above, which show 'SMART ENABLED'.
>
> Here are some of the earlier threads that I completely missed:
>
> http://www.mail-archive.com/linux-...@vger.kernel.org/msg164863.html
This is mine and although it's a 'real' problem, it is something that's easy to
hack around by having the suspend script turn on smart after it is resumed. (Of
course I can't use resume until a skge wol bug is fixed so I won't see/test this
unless asked to.)

The smart init scripts run '-s on' when the system boots anyway for my system -
this problem only occurs for me during suspend/resume. Maybe smartd should
detect that as Alan says.
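
As a rough illustration of the workaround described above, the following Python sketch re-enables SMART on a list of disks after a resume. Only the `smartctl -s on` invocation comes from this thread; the device names, the helper names, and the idea of calling this from a resume hook are assumptions. Treating any non-zero exit status as a failure is also a simplification.

```python
import subprocess


def smart_enable_commands(devices):
    """Build one 'smartctl -s on <device>' command line per disk."""
    return [["smartctl", "-s", "on", dev] for dev in devices]


def reenable_smart(devices, run=subprocess.run):
    """Run each command and return the devices for which smartctl failed.

    'run' is injectable so the logic can be exercised without real disks.
    """
    failed = []
    for cmd in smart_enable_commands(devices):
        result = run(cmd, capture_output=True, text=True)
        if result.returncode != 0:
            failed.append(cmd[-1])
    return failed


# Hypothetical use from a resume hook (device names are placeholders):
#     reenable_smart(["/dev/sda", "/dev/sdb"])
```

Keeping command construction separate from execution makes it easy to log or dry-run the exact commands before wiring them into a suspend/resume script.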

Please let me know if there's anything else you need.

David

Bruce Allen

Jul 9, 2007, 1:36:21 PM
to Jeff Garzik, Mark Lord, David Greaves, Douglas Gilbert, Tejun Heo, Alan Cox, Kai Makisara, Linux Kernel Mailing List, Jan Dvorak, Smartmontools Mailing List, Klaus Fuerstberger, Bruce Allen
On Sun, 8 Jul 2007, Jeff Garzik wrote:

Jeff, thanks for the quick feedback.

> On the base point, libata has never enabled SMART on its own. That's
> always up to the BIOS, etc.

OK, clear.

> It's possible that the recent addition of ACPI support will cause disks
> to be in different modes than previously expected. ACPI supplies ATA
> taskfiles to be pushed to the disk, and who knows what's in there...

Is there a simple way I can have affected users test this? Is there a
kernel boot flag or sysctl setting or something else they can use to
disable the ACPI stuff to see if the problem then goes away?

Cheers,
Bruce

Jeff Garzik

Jul 9, 2007, 1:53:30 PM
to Bruce Allen, Mark Lord, David Greaves, Douglas Gilbert, Tejun Heo, Alan Cox, Kai Makisara, Linux Kernel Mailing List, Jan Dvorak, Smartmontools Mailing List, Klaus Fuerstberger, Bruce Allen
Bruce Allen wrote:
> On Sun, 8 Jul 2007, Jeff Garzik wrote:
>
> Jeff, thanks for the quick feedback.
>
>> On the base point, libata has never enabled SMART on its own. That's
>> always up to the BIOS, etc.
>
> OK, clear.
>
>> It's possible that the recent addition of ACPI support will cause
>> disks to be in different modes than previously expected. ACPI
>> supplies ATA taskfiles to be pushed to the disk, and who knows what's
>> in there...
>
> Is there a simple way I can have affected users test this? Is there a
> kernel boot flag or sysctl setting or something else they can use to
> disable the ACPI stuff to see if the problem then goes away?

The 'noacpi' module option.

Jeff

Bruce Allen

Jul 9, 2007, 1:59:50 PM
to Jeff Garzik, Linux Kernel Mailing List, Jan Dvorak, Klaus Fuerstberger, David Greaves, Mark Lord, Douglas Gilbert, Tejun Heo, Alan Cox, Smartmontools Mailing List, Bruce Allen
Hi David,

>> http://www.mail-archive.com/linux-...@vger.kernel.org/msg164863.html

> This is mine and although it's a 'real' problem, it is something that's easy
> to hack around by having the suspend script turn on smart after it is
> resumed. (Of course I can't use resume until a skge wol bug is fixed so I
> won't see/test this unless asked to.)
>
> The smart init scripts run '-s on' when the system boots anyway for my system
> - this problem only occurs for me during suspend/resume. Maybe smartd should
> detect that as Alan says.

OK, that should be easy to do. So let's forget about the 'SMART disabled'
issue. This is easy to fix in multiple ways and is not a LKML issue.

David: can you reproduce the more serious problem
http://article.gmane.org/gmane.linux.utilities.smartmontools/4712 reported
by Jan Dvorak?

Jeff: this is the problem that really has me concerned.

Jan: what happens if you replace '-d ata' with '-d sat'? This option
should be available in the 5.37 release of smartmontools that you are
using unless the Suse package maintainer is playing games with the version
numbers.

Unfortunately I don't think this will fix the problem, as the bug report
by Klaus Fuerstberger
http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=428975 is using '-d sat'.

Jeff: I am still concerned, because both links given above report the same
bug in two different settings, and the bug goes away when reverting from
2.6.22 to 2.6.21.

Cheers,
Bruce

David Greaves

Jul 9, 2007, 2:01:08 PM
to Bruce Allen, Jeff Garzik, Linux Kernel Mailing List, Jan Dvorak, Klaus Fuerstberger, Mark Lord, Douglas Gilbert, Tejun Heo, Alan Cox, Smartmontools Mailing List, Bruce Allen
Bruce Allen wrote:
> Hi David,
>
>>> http://www.mail-archive.com/linux-...@vger.kernel.org/msg164863.html
>
>> This is mine and although it's a 'real' problem, it is something
>> that's easy to hack around by having the suspend script turn on smart
>> after it is resumed. (Of course I can't use resume until a skge wol
>> bug is fixed so I won't see/test this unless asked to.)
>>
>> The smart init scripts run '-s on' when the system boots anyway for my
>> system - this problem only occurs for me during suspend/resume. Maybe
>> smartd should detect that as Alan says.
>
> OK, that should be easy to do. So let's forget about the 'SMART
> disabled' issue. This is easy to fix in multiple ways and is not a LKML
> issue.
Sure.


> David: can you reproduce the more serious problem
> http://article.gmane.org/gmane.linux.utilities.smartmontools/4712
> reported by Jan Dvorak?

Sorry, I haven't seen that problem.

David

Bruce Allen

Jul 9, 2007, 2:03:24 PM
to Jan Dvorak, Klaus Fuerstberger, Linux Kernel Mailing List, Jeff Garzik, Mark Lord, David Greaves, Douglas Gilbert, Tejun Heo, Smartmontools Mailing List, Alan Cox, Kai Makisara, Bruce Allen
Hi Jeff,

>>> It's possible that the recent addition of ACPI support will cause disks to
>>> be in different modes than previously expected. ACPI supplies ATA
>>> taskfiles to be pushed to the disk, and who knows what's in there...
>>
>> Is there a simple way I can have affected users test this? Is there a
>> kernel boot flag or sysctl setting or something else they can use to
>> disable the ACPI stuff to see if the problem then goes away?
>
> The 'noacpi' module option.

OK, thanks.

Klaus, Jan: could you please see if your problem with 2.6.22 goes away
with noacpi passed as a flag to libata?

Jeff: I will add the noacpi test suggestion into the Debian bug report
here http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=428975 to try to
ensure that Klaus sees it.

Cheers,
Bruce

Jeff Garzik

Jul 9, 2007, 2:43:30 PM
to Bruce Allen, Linux Kernel Mailing List, Jan Dvorak, Klaus Fuerstberger, David Greaves, Mark Lord, Douglas Gilbert, Tejun Heo, Alan Cox, Smartmontools Mailing List, Bruce Allen
Bruce Allen wrote:
> http://article.gmane.org/gmane.linux.utilities.smartmontools/4712
> reported by Jan Dvorak?


Relevant lspci and dmesg output would be useful... that gives enhanced
error diagnostics.

Jeff

Kai Makisara

Jul 9, 2007, 5:11:56 PM
to Bruce Allen, Mark Lord, David Greaves, Douglas Gilbert, Tejun Heo, Alan Cox, Jeff Garzik, Linux Kernel Mailing List, Jan Dvorak, Klaus Fuerstberger, Bruce Allen, Robert Hancock
On Sun, 8 Jul 2007, Bruce Allen wrote:

> Mark, David, Doug, Tejun, Alan, Jeff, LKML,
>
> I'm afraid that there may be some problem with SMART + libata in the 2.6.22
> kernel. An hour ago I discovered that I missed a month of correspondence
> (some LKML, some private) about this problem which Alan, Tejun, Jeff, Mark and
> others copied to me -- it was automatically shoved into one of my mailboxes by
> my mail client. Sorry about that. So I am trying to catch up to see if there
> is some real problem or not.
>
> Here is a typical bug report that worries me:
> http://article.gmane.org/gmane.linux.utilities.smartmontools/4712
>
> Here is another similar report:
> http://thread.gmane.org/gmane.linux.utilities.smartmontools/4713
>
> And another report:
> http://www.mail-archive.com/debian-b...@lists.debian.org/msg358354.html
>
> From some of the earlier threads that I missed (below) I have the impression
> that the problem may be a very simple one, namely that starting with 2.6.22
> one needs to run a command to enable SMART when a box is first booted -- the
> kernel no longer does this as part of the init/setup of the disks. But that is
> NOT consistent with the first two reports above, which show 'SMART ENABLED'.
>
> Here are some of the earlier threads that I completely missed:
>
> http://www.ussg.iu.edu/hypermail/linux/kernel/0706.1/0849.html

I have done some more debugging on this one. An easy way to reproduce the
problem is to use 'smartctl -H /dev/sdb'. If I enable debugging with '-r
ioctl,2', I find the following difference between outputs using 2.6.21.1
(works OK) and 2.6.22 (fails):

--- sm-2.6.21.1b.log 2007-07-09 23:47:28.000000000 +0300
+++ sm-2.6.22.log 2007-07-09 23:39:56.000000000 +0300
@@ -11,7 +11,7 @@
status=0x0
[ata pass-through(16): 85 08 0e 00 00 00 01 00 00 00 00 00 00 00 ec 00 ]
scsi_status=0x0, host_status=0x0, driver_status=0x0
- info=0x0 duration=0 milliseconds resid=0
+ info=0x0 duration=4 milliseconds resid=0
Incoming data, len=512 [only first 256 bytes shown]:
00 5a 0c ff 3f 37 c8 10 00 00 00 00 00 3f 00 00 00
10 00 00 00 00 20 20 20 20 20 20 20 20 20 20 20 20
@@ -97,11 +97,11 @@
scsi_status=0x2, host_status=0x0, driver_status=0x8
info=0x1 duration=48 milliseconds resid=0
>>> Sense buffer, len=22:
- 00 72 00 00 00 00 00 00 0e 09 0c 00 00 00 00 00 00
- 10 00 4f 00 c2 00 50
+ 00 72 00 00 00 00 00 00 0e 09 0c 00 00 00 01 00 00
+ 10 00 00 00 00 00 50
status=2: [desc] sense_key=0 asc=0 ascq=0
Values from ATA status return descriptor are:
- 00 09 0c 00 00 00 00 00 00 00 4f 00 c2 00 50
+ 00 09 0c 00 00 00 01 00 00 00 00 00 00 00 50
REPORT-IOCTL: DeviceFD=3 Command=SMART STATUS returned 0

REPORT-IOCTL: DeviceFD=3 Command=SMART STATUS CHECK
@@ -110,9 +110,13 @@
info=0x1 duration=40 milliseconds resid=0
>>> Sense buffer, len=22:
00 72 00 00 00 00 00 00 0e 09 0c 00 00 00 00 00 00
- 10 00 4f 00 c2 00 50
+ 10 00 00 00 00 00 50
status=2: [desc] sense_key=0 asc=0 ascq=0
Values from ATA status return descriptor are:
- 00 09 0c 00 00 00 00 00 00 00 4f 00 c2 00 50
-REPORT-IOCTL: DeviceFD=3 Command=SMART STATUS CHECK returned 0
-
+ 00 09 0c 00 00 00 00 00 00 00 00 00 00 00 50
+Error SMART Status command failed
+Please get assistance from http://smartmontools.sourceforge.net/
+Values from ATA status return descriptor are:
+ 00 09 0c 00 00 00 00 00 00 00 00 00 00 00 50
+REPORT-IOCTL: DeviceFD=3 Command=SMART STATUS CHECK returned -1
+A mandatory SMART command failed: exiting. To continue, add one or more
'-T permissive' options.


The log shows that the sense data returned by the commands differ: with
2.6.22 the bytes 4f and c2 (tf.lbam and tf.lbah) are not returned. Both of
the status commands fail to return these bytes but the tests in smartctl
are more strict for the second case. This is why the second status command
seems to be failing.
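
To make the comparison concrete, here is a minimal Python sketch that pulls the ATA return descriptor out of the two sense buffers shown in the log above (with the leading offset column stripped). The byte offsets follow the ATA return descriptor layout in the SAT draft; the function name and dictionary keys are made up for illustration, and this is not the smartmontools code.

```python
def parse_ata_return_descriptor(sense):
    """Return the ATA registers from descriptor-format sense data, or None."""
    # Response codes 0x72/0x73 mean descriptor-format sense data.
    if len(sense) < 8 or (sense[0] & 0x7F) not in (0x72, 0x73):
        raise ValueError("not descriptor-format sense data")
    end = min(len(sense), 8 + sense[7])  # byte 7: additional sense length
    pos = 8
    while pos + 2 <= end:
        code, length = sense[pos], sense[pos + 1]
        desc = sense[pos:pos + 2 + length]
        # Descriptor code 0x09 is the ATA return descriptor (14 bytes).
        if code == 0x09 and len(desc) >= 14:
            return {
                "error": desc[3],
                "sector_count": desc[5],
                "lbal": desc[7],
                "lbam": desc[9],
                "lbah": desc[11],
                "device": desc[12],
                "status": desc[13],
            }
        pos += 2 + length
    return None


# Sense buffers copied from the log above (second SMART STATUS command).
SENSE_2_6_21 = bytes.fromhex(
    "72 00 00 00 00 00 00 0e 09 0c 00 00 00 00 00 00 00 4f 00 c2 00 50")
SENSE_2_6_22 = bytes.fromhex(
    "72 00 00 00 00 00 00 0e 09 0c 00 00 00 00 00 00 00 00 00 00 00 50")

good = parse_ata_return_descriptor(SENSE_2_6_21)
bad = parse_ata_return_descriptor(SENSE_2_6_22)
print(hex(good["lbam"]), hex(good["lbah"]))  # 0x4f 0xc2
print(hex(bad["lbam"]), hex(bad["lbah"]))    # 0x0 0x0
```

In both cases the status byte is 0x50, so the command itself completed; only the lbam/lbah values differ between the two kernels.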

Next I added printks to the function ata_qc_complete() in libata-core.c.
The changed code from 2.6.22 at line 5222 looked like this:

/* read result TF if requested */
if (qc->flags & ATA_QCFLAG_RESULT_TF) {
	/* debug only: 0xda is the SMART RETURN STATUS feature value */
	if (qc->tf.feature == 0xda)
		printk("ata_qc_complete before: %02x %02x %02x %02x\n",
		       qc->result_tf.feature,
		       qc->result_tf.lbam, qc->result_tf.lbah,
		       qc->result_tf.command);
	fill_result_tf(qc);
	/* debug only: print the same registers after the driver's tf_read */
	if (qc->tf.feature == 0xda)
		printk("ata_qc_complete %ld: %02x %02x %02x %02x\n",
		       qc->flags & ATA_QCFLAG_RESULT_TF,
		       qc->result_tf.feature,
		       qc->result_tf.lbam, qc->result_tf.lbah,
		       qc->result_tf.command);
}

The output from 2.6.21.6 looks like this:

Jul 9 18:37:44 kai kernel: [ 193.443874] ata_qc_complete before: 00 00 00 40
Jul 9 18:37:44 kai kernel: [ 193.443880] ata_qc_complete 16: 00 4f c2 50
Jul 9 18:37:44 kai kernel: [ 193.462802] ata_qc_complete before: 00 4f c2 40
Jul 9 18:37:44 kai kernel: [ 193.462807] ata_qc_complete 16: 00 4f c2 50

i.e., the bytes are returned.

The output from 2.6.22 is different:

Jul 9 18:44:35 kai kernel: [ 147.765965] ata_qc_complete before: 00 00 00 40
Jul 9 18:44:35 kai kernel: [ 147.765970] ata_qc_complete 16: 00 00 00 50
Jul 9 18:44:35 kai kernel: [ 147.784890] ata_qc_complete before: 00 00 00 40
Jul 9 18:44:35 kai kernel: [ 147.784894] ata_qc_complete 16: 00 00 00 50

The lbam and lbah bytes are not returned but the command byte is.
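
For anyone collecting similar traces, the following Python sketch extracts the values printed after fill_result_tf() from log lines in the format above. The field order (feature, lbam, lbah, command) is taken from the printk calls in this message; the regular expression and function name are assumptions made for illustration.

```python
import re

# Matches the debug printk output: "ata_qc_complete <tag>: ff mm hh cc"
LINE = re.compile(
    r"ata_qc_complete (?P<tag>before|\d+): "
    r"(?P<feature>[0-9a-f]{2}) (?P<lbam>[0-9a-f]{2}) "
    r"(?P<lbah>[0-9a-f]{2}) (?P<command>[0-9a-f]{2})"
)


def result_taskfiles(log_text):
    """Return (lbam, lbah, command) for each line printed after fill_result_tf()."""
    results = []
    for match in LINE.finditer(log_text):
        if match.group("tag") == "before":
            continue  # skip the snapshot taken before fill_result_tf()
        results.append(tuple(int(match.group(name), 16)
                             for name in ("lbam", "lbah", "command")))
    return results


LOG_2_6_21 = """\
Jul 9 18:37:44 kai kernel: [ 193.443874] ata_qc_complete before: 00 00 00 40
Jul 9 18:37:44 kai kernel: [ 193.443880] ata_qc_complete 16: 00 4f c2 50
Jul 9 18:37:44 kai kernel: [ 193.462802] ata_qc_complete before: 00 4f c2 40
Jul 9 18:37:44 kai kernel: [ 193.462807] ata_qc_complete 16: 00 4f c2 50
"""
LOG_2_6_22 = """\
Jul 9 18:44:35 kai kernel: [ 147.765965] ata_qc_complete before: 00 00 00 40
Jul 9 18:44:35 kai kernel: [ 147.765970] ata_qc_complete 16: 00 00 00 50
Jul 9 18:44:35 kai kernel: [ 147.784890] ata_qc_complete before: 00 00 00 40
Jul 9 18:44:35 kai kernel: [ 147.784894] ata_qc_complete 16: 00 00 00 50
"""

print(result_taskfiles(LOG_2_6_21))  # [(79, 194, 80), (79, 194, 80)]
print(result_taskfiles(LOG_2_6_22))  # [(0, 0, 80), (0, 0, 80)]
```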

fill_result_tf() basically calls the tf_read() method of the low-level
driver. sata_nv has been changed between 2.6.21 and 2.6.22-rc1 and this
particular smartctl problem may or may not be specific to CK804. Disabling
adma did not change the results.

--
Kai

Adam Spiers

Jul 9, 2007, 7:57:56 PM
to Bruce Allen, Mark Lord, David Greaves, Douglas Gilbert, Tejun Heo, Alan Cox, Kai Makisara, Jeff Garzik, Linux Kernel Mailing List, Klaus Fuerstberger, Jan Dvorak, Smartmontools Mailing List, Bruce Allen
On Sun, Jul 08, 2007 at 08:14:10PM -0500, Bruce Allen wrote:
> On Sun, 8 Jul 2007, Bruce Allen wrote:
> > I'm afraid that there may be some problem with SMART + libata in the 2.6.22
> > kernel. An hour ago I discovered that I missed a month of correspondence
> > (some LKML, some private) about this problem which Alan, Tejun, Jeff, Mark
> > and others copied to me -- it was automatically shoved into one of my
> > mailboxes by my mail client. Sorry about that. So I am trying to catch up
> > to see if there is some real problem or not.
> >
> > Here is a typical bug report that worries me:
> > http://article.gmane.org/gmane.linux.utilities.smartmontools/4712
> >
> > Here is another similar report:
> > http://thread.gmane.org/gmane.linux.utilities.smartmontools/4713
> >
> > And another report:
> > http://www.mail-archive.com/debian-b...@lists.debian.org/msg358354.html
> >
> > From some of the earlier threads that I missed (below) I have the impression
> > that the problem may be a very simple one, namely that starting with 2.6.22
> > one needs to run a command to enable SMART when a box is first booted -- the
> > kernel no longer does this as part of the init/setup of the disks. But that
> > is NOT consistent with the first two reports above, which show 'SMART
> > ENABLED'.

[snipped]

> Here is another similar report:
>
> http://article.gmane.org/gmane.linux.utilities.smartmontools/4704/match=diamondmax
>
> Again, this indicates that SMART is enabled. But it's not clear what the
> kernel version here is. The report indicates that the problem started
> with an FC7 kernel upgrade

That was me, and the kernel in question is 2.6.21-1.3194.fc7. I tried
Jeff's noacpi suggestion, and here is the outcome. I am sure it comes
as no surprise that his patch to support the boot-time parameter
libata.noacpi is not included in this kernel:

Kernel command line: ro root=/dev/vg0/fc-root rhgb selinux=0 nodmraid libata.noacpi=1
Unknown boot option `libata.noacpi=1': ignoring

However, the module option is there:

# modinfo libata
filename: /lib/modules/2.6.21-1.3194.fc7/kernel/drivers/ata/libata.ko
version: 2.20
license: GPL
description: Library module for ATA devices
author: Jeff Garzik
srcversion: 44DAFFD701701A15EB2D574
depends: scsi_mod
vermagic: 2.6.21-1.3194.fc7 SMP mod_unload 686 4KSTACKS
parm: atapi_enabled:Enable discovery of ATAPI devices (0=off, 1=on) (int)
parm: atapi_dmadir:Enable ATAPI DMADIR bridge support (0=off, 1=on) (int)
parm: fua:FUA support (0=off, 1=on) (int)
parm: ignore_hpa:Ignore HPA (0=keep BIOS setting 1=ignore it) (int)
parm: ata_probe_timeout:Set ATA probing timeout (seconds) (int)
parm: noacpi:Disables the use of ACPI in suspend/resume when set (int)

And when used via:

# cat /etc/modprobe.d/libata
options libata noacpi=1

I still see the same problem:

smartctl version 5.37 [i686-redhat-linux-gnu] Copyright (C) 2002-6 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

=== START OF INFORMATION SECTION ===
Model Family: Maxtor DiamondMax 10 family (ATA/133 and SATA/150)
Device Model: Maxtor 6L250S0
Serial Number: L50A1B8H
Firmware Version: BANC1G10
User Capacity: 251,000,193,024 bytes
Device is: In smartctl database [for details use: -P show]
ATA Version is: 7
ATA Standard is: ATA/ATAPI-7 T13 1532D revision 0
Local Time is: Mon Jul 9 23:39:25 2007 BST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

Error SMART Status command failed


Please get assistance from http://smartmontools.sourceforge.net/

Register values returned from SMART Status command are:
CMD=0x50
FR =0x00
NS =0x00
SC =0x00
CL =0xc2
CH =0x00
SEL=0x00


A mandatory SMART command failed: exiting. To continue, add one or more '-T permissive' options.

Regarding Kai's very recent analysis elsewhere in this thread:

> sata_nv has been changed between 2.6.21 and 2.6.22-rc1 and this
> particular smartctl problem may or may not be specific to CK804.

I should note that this particular machine is indeed using that chipset:

00:07.0 IDE interface: nVidia Corporation CK804 Serial ATA Controller (rev a3)
00:08.0 IDE interface: nVidia Corporation CK804 Serial ATA Controller (rev a3)

HTH,
Adam

Douglas Gilbert

Jul 10, 2007, 12:26:25 AM
to Kai Makisara, Bruce Allen, Mark Lord, David Greaves, Tejun Heo, Alan Cox, Jeff Garzik, Linux Kernel Mailing List, Jan Dvorak, Klaus Fuerstberger, Bruce Allen, Robert Hancock

Kai,
Thanks for the analysis.
Some background documents for those interested:
The SCSI to ATA Translation (SAT) draft:
http://www.t10.org/ftp/t10/drafts/sat/sat-r09.pdf
in which the relevant section is 12.2.6 (page 110)
table 93.
A modern descriptor-based SCSI sense buffer is being used
to convey the "ATA (status) return descriptor" back from
the ATA device after the command has been completed.
My SAT code in smartmontools requests this descriptor
so it should be returned irrespective of whether the
ATA command succeeded or failed.

Now from the ATA side the command being executed is
"SMART RETURN STATUS B0h/DAh, non-data". For
reference I use this draft from www.t13.org:
D1699r3f-ATA8-ACS.pdf. See that command's
_description_ section. It explains that 4f and c2
in the LBA field indicate the disk is healthy. "Threshold
exceeded" is indicated by putting f4 and 2c in the same
positions. [Whoever specified that must have hated people
with dyslexia.] No ATA command error is indicated ("abort"
is the only one listed for that ATA command) in the reports
that I have seen.

So when smartmontools sees 0 and 0 in those positions it
pulls out the red card for that device. My guess is that
libata in lk 2.6.22 is corrupting those FIS device to
host register values.
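
The check described above can be sketched in a few lines of Python. The two signature pairs are the ones quoted from the ATA8-ACS draft; the function name and return strings are invented for illustration, and this is not the smartmontools implementation.

```python
def smart_return_status(lbam, lbah):
    """Interpret the LBA mid/high registers returned by SMART RETURN STATUS."""
    if (lbam, lbah) == (0x4F, 0xC2):
        return "healthy"
    if (lbam, lbah) == (0xF4, 0x2C):
        return "threshold exceeded"
    # Anything else is not a valid answer from the drive.
    return "invalid"


print(smart_return_status(0x4F, 0xC2))  # healthy
print(smart_return_status(0xF4, 0x2C))  # threshold exceeded
print(smart_return_status(0x00, 0x00))  # invalid
```

The zeroed registers reported on 2.6.22 fall into the third case, which is why the tool reports a failed mandatory command rather than a failing disk.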


Doug Gilbert

Mark Lord

Jul 10, 2007, 2:23:26 PM
to Bruce Allen, Mark Lord, David Greaves, Douglas Gilbert, Tejun Heo, Alan Cox, Kai Makisara, Jeff Garzik, Linux Kernel Mailing List, Klaus Fuerstberger, Jan Dvorak, Smartmontools Mailing List, Bruce Allen
Adam Spiers wrote:
> On Sun, Jul 08, 2007 at 08:14:10PM -0500, Bruce Allen wrote:
>> On Sun, 8 Jul 2007, Bruce Allen wrote:
>>..

>> http://article.gmane.org/gmane.linux.utilities.smartmontools/4704/match=diamondmax
>>
>> Again, this indicates that SMART is enabled. But it's not clear what the
>> kernel version here is. The report indicates that the problem started
>> with an FC7 kernel upgrade

So it's a bug in the sata_nv.c port driver.

In particular, I see this in the bug report:

>Jun 30 10:23:42 atlantic kernel: ata1: EH in ADMA mode, notifier 0x1 notifier_error 0x0 gen_ctl 0x1501000 status 0x1540 next cpb count 0x0 next cpb idx 0x0

That looks a bit strange, because the driver goes to some effort
to prevent these kinds of commands from ever being issued "in ADMA mode",
precisely because there's no way to do a tf_read in that mode.

Mmm.. buggy somewhere in there.

JohnyDog

Jul 10, 2007, 2:28:49 PM
> Jan: what happens if you replace '-d ata' with '-d sat'? This option
> should be available in the 5.37 release of smartmontools that you are
> using unless the Suse package maintainer is playing games with the version
> numbers.

Sorry for the delay. Booting with noapic or acpi=off, or disabling
APIC/ACPI in the BIOS, has no effect on the bug.
With -d sat only the output formatting changes:

SMART support is: Available - device has SMART capability.
SMART support is: Enabled

Error SMART Status command failed

Please get assistance from http://smartmontools.sourceforge.net/

Values from ATA status return descriptor are:

00 09 0c 00 00 00 01 00 00 00 00 00 00 00 50

A mandatory SMART command failed: exiting. To continue, add one or
more '-T permissive' options.


Also, other than the error, everything works fine (when -T
permissive is added).

Jan Dvorak

Bruce Allen

Jul 10, 2007, 9:32:21 PM
to Jeff Garzik, Kai Makisara, Douglas Gilbert, Mark Lord, David Greaves, Tejun Heo, Alan Cox, Linux Kernel Mailing List, Jan Dvorak, Klaus Fuerstberger, Bruce Allen, Robert Hancock
On Tue, 10 Jul 2007, Douglas Gilbert wrote:

> Kai Makisara wrote:

>> I have done some more debugging on this one. An easy way to reproduce
>> the

<SNIP>

>> The log shows that the sense data returned by the commands differ: with
>> 2.6.22 the bytes 4f and c2 (tf.lbam and tf.lbah) are not returned. Both
>> of the status commands fail to return these bytes but the tests in
>> smartctl are more strict for the second case. This is why the second
>> status command seems to be failing.

<SNIP>

> Kai, Thanks for the analysis.

<SNIP>

> So when smartmontools sees 0 and 0 in those positions it pulls out the
> red card for that device. My guess is that libata in lk 2.6.22 is
> corrupting those FIS device to host register values.

Kai, Doug: thank you very much for tracking down the source of this
problem.

Jeff: OK, from what I am reading here I think that this is a genuine
libata/kernel bug. But I'm out of my depth here, so the ball is in your
court. Hopefully you'll understand what's going on and how to fix it.

Cheers,
Bruce

Kai Makisara

Jul 16, 2007, 6:49:14 AM
to Bruce Allen, Mark Lord, David Greaves, Douglas Gilbert, Tejun Heo, Alan Cox, Jeff Garzik, Linux Kernel Mailing List, Jan Dvorak, Klaus Fuerstberger, Bruce Allen, Robert Hancock
On Tue, 10 Jul 2007, Kai Makisara wrote:

> On Sun, 8 Jul 2007, Bruce Allen wrote:
>
> > Mark, David, Doug, Tejun, Alan, Jeff, LKML,
> >
> > I'm afraid that there may be some problem with SMART + libata in the 2.6.22
> > kernel. An hour ago I discovered that I missed a month of correspondence
> > (some LKML, some private) about this problem which Alan, Tejun, Jeff, Mark and
> > others copied to me -- it was automatically shoved into one of my mailboxes by
> > my mail client. Sorry about that. So I am trying to catch up to see if there
> > is some real problem or not.
> >

..


> > http://www.ussg.iu.edu/hypermail/linux/kernel/0706.1/0849.html
>
> I have done some more debugging on this one. An easy way to reproduce the
> problem is to use 'smartctl -H /dev/sdb'. If I enable debugging with '-r
> ioctl,2', I find the following difference between outputs using 2.6.21.1
> (works OK) and 2.6.22 (fails):
>

..


> The log shows that the sense data returned by the commands differ: with
> 2.6.22 the bytes 4f and c2 (tf.lbam and tf.lbah) are not returned. Both of
> the status commands fail to return these bytes but the tests in smartctl
> are more strict for the second case. This is why the second status command
> seems to be failing.
>
> Next I added printks to the function ata_qc_complete() in libata-core.c.
> The changed code from 2.6.22 at line 5222 looked like this:
>

..


> The output from 2.6.21.6 looks like this:
>
> Jul 9 18:37:44 kai kernel: [ 193.443874] ata_qc_complete before: 00 00 00 40
> Jul 9 18:37:44 kai kernel: [ 193.443880] ata_qc_complete 16: 00 4f c2 50
> Jul 9 18:37:44 kai kernel: [ 193.462802] ata_qc_complete before: 00 4f c2 40
> Jul 9 18:37:44 kai kernel: [ 193.462807] ata_qc_complete 16: 00 4f c2 50
>
> i.e., the bytes are returned.
>
> The output from 2.6.22 is different:
>
> Jul 9 18:44:35 kai kernel: [ 147.765965] ata_qc_complete before: 00 00 00 40
> Jul 9 18:44:35 kai kernel: [ 147.765970] ata_qc_complete 16: 00 00 00 50
> Jul 9 18:44:35 kai kernel: [ 147.784890] ata_qc_complete before: 00 00 00 40
> Jul 9 18:44:35 kai kernel: [ 147.784894] ata_qc_complete 16: 00 00 00 50
>
> The lbam and lbah bytes are not returned but the command byte is.
>

The other system with the Maxtor disk fails in a slightly different way
(it correctly returns the c2 byte but not in the correct location):

[ 162.896173] ata_qc_complete before: 00 00 00 40
[ 162.896179] ata_qc_complete 16: 00 c2 00 50

My earlier 'git bisect' suggested that this problem surfaced after the
patch

1e999736cafdffc374f22eed37b291129ef82e4e is first bad commit
commit 1e999736cafdffc374f22eed37b291129ef82e4e
Author: Alan Cox <al...@lxorguk.ukuu.org.uk>
Date: Wed Apr 11 00:23:13 2007 +0100

libata: HPA support

I have now done some further tests to see what is happening.
It turned out that after commenting out the call (at line 1956 in
drivers/ata/libata-core.c in 2.6.22)

	if (ata_id_hpa_enabled(dev->id))
		dev->n_sectors = ata_hpa_resize(dev);

'smartctl -H' worked again without problems. This applied to both of the
systems where I see the problem. The disks in both systems support hpa but
nothing is hidden. Next I commented out only the call to
ata_read_native_max_address_ext() in ata_hpa_resize(). This was enough
to remove the problem (as was expected).

So, the question is: why does calling ata_read_native_max_address_ext()
when booting the system cause SMART RETURN STATUS to fail much later?

Tejun Heo

Jul 16, 2007, 7:22:36 AM
to Kai Makisara, Bruce Allen, Mark Lord, David Greaves, Douglas Gilbert, Alan Cox, Jeff Garzik, Linux Kernel Mailing List, Jan Dvorak, Klaus Fuerstberger, Bruce Allen, Robert Hancock
Please try the patch in the following message.

http://article.gmane.org/gmane.linux.ide/20799/raw

--
tejun

Kai Makisara

Jul 16, 2007, 7:59:30 AM
to Tejun Heo, Bruce Allen, Mark Lord, David Greaves, Douglas Gilbert, Alan Cox, Jeff Garzik, Linux Kernel Mailing List, Jan Dvorak, Klaus Fuerstberger, Bruce Allen, Robert Hancock
On Mon, 16 Jul 2007, Tejun Heo wrote:

> Please try the patch in the following message.
>
> http://article.gmane.org/gmane.linux.ide/20799/raw
>

This solves the 'smartctl -H' problem on both of my systems (one with Nvidia
CK804 and one with MCP51).

Tested-by: Kai Makisara <Kai.Ma...@kolumbus.fi>

Thanks for pointing out the patch.

--
Kai

Klaus Fuerstberger

Jul 16, 2007, 9:43:30 AM
to Kai Makisara, Tejun Heo, Bruce Allen, Mark Lord, David Greaves, Douglas Gilbert, Alan Cox, Jeff Garzik, Linux Kernel Mailing List, Jan Dvorak, Klaus Fuerstberger, Bruce Allen, Robert Hancock
Kai Makisara said the following on 16.07.2007 13:58:

>> Please try the patch in the following message.
>> http://article.gmane.org/gmane.linux.ide/20799/raw

> This solves the 'smartctl -H' problem both of my systems (one with Nvidia
> CK804 and one with MCP51).

This patch also solved the problem I reported here:
http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=428975

Thanks!

Bye Klaus

Bruce Allen

Jul 16, 2007, 5:31:24 PM
to Tejun Heo, pe...@vandrovec.name, Kai Makisara, Klaus Fuerstberger, Jeff Garzik, Mark Lord, David Greaves, Douglas Gilbert, Alan Cox, Linux Kernel Mailing List, Jan Dvorak, Bruce Allen, Robert Hancock
Tejun: thanks for pointing out this patch.

Kai, Klaus: thanks for testing the patch!

Petr: thanks for fixing the SMART 2.6.22 problems!

Jeff: two users (Kai, Klaus) both saw the SMART STATUS problem disappear
when they tested this libata patch. I hope you stick it into your own
source tree.

thread_exit(:-);

Cheers,
Bruce
