Bug#625922: SATA devices get reset without real hardware failure

Javier Ortega Conde (Malkavian)

Oct 17, 2011, 6:40:01 PM
This bug (in general, not just the one reported on this page) has been present
in GNU/Linux for a long time, across various disks, mainboards, SATA
controllers, distros and kernels (maybe since changes after 2.6.24).

In https://bugzilla.redhat.com/show_bug.cgi?id=684599 David Zeuthen says
"it's most probably caused by this commit
http://git.kernel.org/?p=linux/hotplug/udev.git;a=commitdiff;h=560de575148b7efda3b34a7f7073abd483c5f08e
"

Possible workarounds I have read about for this bug:
-1: Add "libata.atapi_passthru16=0" to the kernel boot options (because some
devices may not support 16-byte ATA commands) (
https://bugzilla.redhat.com/show_bug.cgi?id=684599 )
-2: (Same as 1) Add options libata atapi_passthru16=0 to
/etc/modprobe.d/modprobe.conf and add FILES="/etc/modprobe.d/modprobe.conf" to
/etc/mkinitcpio.conf ( https://bbs.archlinux.org/viewtopic.php?pid=895404 )
-3: Somebody called Fujisan said in 2009 "adding 'acpi=off noapic' to the
kernel in /etc.grub.conf seems to have solved the problem for me" (
https://bugzilla.redhat.com/show_bug.cgi?format=multiple&id=462425 ). Raman
Gupta and Andreas M. Kirchwitz say in other forums that adding 'acpi=off'
doesn't work ( https://bugzilla.redhat.com/show_bug.cgi?id=549981 )
-4: (Similar to 3) Completely disable ACPI in mainboard BIOS. (
http://lists.debian.org/debian-user/2010/01/msg00023.html )
-5: Gaetan Cambier says "add the option line to grub to disable ncq :
'libata.force=noncq' for me, with this, i have no froze". (
https://bugzilla.redhat.com/show_bug.cgi?id=549981 ). Others reply that it
doesn't work for them. PsYcHoK9 says it works for him, but John Doe replies that
it does not work for him ( https://bugs.launchpad.net/ubuntu/+source/linux/+bug/285892 ).
-6: Reartes Guillermo says "booting with the kernel parameter: pcie_aspm=off ?
For me it worked (nvidia)". Raman Gupta replies that "I tried this and it did
not fix the problem." ( https://bugzilla.redhat.com/show_bug.cgi?id=549981 )
-7: A. Mani says "For the SB600 controller, the right thing to do is to
restrict all drives to 1.5Gbps by jumpers or with a boot option." Raman Gupta
replies "I also tried this -- but with this setting all drives attached to my
Marvell controller could not even be started by the kernel -- permanent
"failed to IDENTIFY" errors." (
https://bugzilla.redhat.com/show_bug.cgi?id=549981 )
-8: DjznBR (djzn-br) says he tried several things WITHOUT success before
finding one that works. Doesn't work: TURNED HDPARM OFF, CHANGED CABLE,
EXPERIMENTED AHCI & RAID MODES, DISABLED NCQ, COMPILED KERNEL WITH
CONFIG_SATA_PMP DISABLED, TRYING NOW LIBATA.FORCE=1.5GBPS, changed the cables
to different routes... SATA1 -> SATA2 SATA2 -> SATA3. Works (but still
gives "softreset failed (device not ready)" messages in dmesg, after which it
recovers without data loss): Added option for kernel in grub configuration
"libata.noacpi=1". He also says "libata.force=norst ... prevents soft and hard
link resettings. If you have that switch on, when this bug comes up, there is
a system lock down (because obviously the kernel prevented the soft & hard
resetting." ( https://bugs.launchpad.net/ubuntu/+source/linux/+bug/285892 )
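
As a rough illustration of the boot-option workarounds listed above, the following Python sketch reports which of them are present on a kernel command line. The helper name `active_workarounds` and the `WORKAROUNDS` tuple are hypothetical, made up for this example; the parameter strings are taken from the list above. It assumes that `libata.force` takes a comma-separated list of values, each optionally prefixed with a port ID and a colon, and it compares values case-insensitively as a convenience.

```python
from __future__ import annotations

# Kernel parameters mentioned in the workaround list above.
WORKAROUNDS = (
    "libata.atapi_passthru16=0",
    "acpi=off",
    "noapic",
    "libata.force=noncq",
    "pcie_aspm=off",
    "libata.force=1.5Gbps",
    "libata.noacpi=1",
    "libata.force=norst",
)


def active_workarounds(cmdline: str) -> list[str]:
    """Return the listed workaround parameters found in a kernel command line."""
    tokens = cmdline.split()
    found = []
    for wanted in WORKAROUNDS:
        key, _, value = wanted.partition("=")
        for token in tokens:
            token_key, _, token_value = token.partition("=")
            if token_key != key:
                continue
            # Assumption: values may be comma-separated and may carry an
            # "ID:" prefix (e.g. "1.00:noncq"); keep only the part after it.
            values = [v.rsplit(":", 1)[-1].lower() for v in token_value.split(",")]
            if not value or value.lower() in values:
                found.append(wanted)
                break
    return found
```

To check the running kernel, the contents of /proc/cmdline can be passed in, for example `active_workarounds(open("/proc/cmdline").read())`. This only shows which options are set; it says nothing about whether any of them helps on a given system.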


I see the same problem on my old PC/server (Pentium II MMX) running Debian
6.0.3 (stable) with kernel 2.6.32-5-686 and libata version 3.00, on an
"IBM-DTLA-305010" 10 GB IDE disk (configured by Debian as sda) on an old
mainboard. No RAID is used, and I only get soft resets, no hard resets, so I
don't lose data. I could send logs, but I don't think they would give any more
information.

I see the same problem on my desktop PC every 2 or 3 months, running Debian
testing with kernels 3.0.0-1-amd64, 3.0.0-rc2-amd64, 2.6.39-2-amd64,
2.6.39-amd64, 2.6.38-2-amd64, 2.6.38-amd64 and possibly older ones, and libata
3.00, on two Seagate 7200.11 "ST3500320AS" 500 GB SATA2 disks (with the latest
firmware) that are part of a RAID10. Fortunately the other two Western Digital
"WDC WD1002FAEX-00Z3A0" 1 TB SATA3 disks don't fail, but I have to reboot and
re-add the disk to rebuild the RAID. I could send logs, but I don't think they
would give any more information.

Possibly these are the same bug: #539059, #603061, #524876

Same bug in other distros and kernels:
-Archlinux with udev-165 and udev-166:
https://bbs.archlinux.org/viewtopic.php?pid=895404
-Fedora with kernel 2.6.38-0.rc8.git0.1.fc15.x86_64 and udev-166 in a DVD
reader: https://bugzilla.redhat.com/show_bug.cgi?id=684599
-Fedora 13 with kernel 2.6.33.8-149.fc13.i686.PAE or Fedora 13 64bit on a Mac
Mini
-Fedora 14 with kernels 2.6.31.6-166.fc12@x86_64, 2.6.32.11-99.fc12.x86_64,
2.6.35.9-64.fc14.x86_64, 2.6.35.10-72.fc14.i686 and 2.6.35.10-74.fc14.x86_64
and 2.6.35.11-83.fc14.x86_64 and 2.6.35.14-95.fc14.x86_64:
https://bugzilla.redhat.com/show_bug.cgi?id=549981
-Fedora 15 (updated from Fedora 14):
https://bugzilla.redhat.com/show_bug.cgi?id=549981
-Centos5.5-x64 with kernel 2.6.18-194-x64:
https://bugzilla.redhat.com/show_bug.cgi?id=549981
-RHEL5 with vanilla kernel 2.6.37.3:
https://bugzilla.redhat.com/show_bug.cgi?id=549981
-Ubuntu since 8.10 64bit with kernels 2.6.27-7, 2.6.28-15-generic,
2.6.31-14-generic, 2.6.31-15-generic (on a Macbook2), 2.6.38-7-generic
(kernel-ppa):
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/285892
-Ubuntu 10.04: https://bugzilla.redhat.com/show_bug.cgi?id=549981


--
Bye: Javier Ortega Conde (Malkavian)
________________________________________________________________________
The Malkavian's webpage: Many things http://malkavian.dyndns.org
Member of LinUxers Group from Bizkaia (GLUB) http://glub.biz
Member of GoBi Go Club, Eghost, Itsas, Aske, Guardianes del Túmulo...
________________________________________________________________________
Microsoft is to operating systems and security what McDonald's to gourmet food
and healthy nutrition. (Javier Ortega Conde (Malkavian))




Ben Hutchings

Oct 17, 2011, 10:10:01 PM
On Tue, 2011-10-18 at 00:37 +0200, Javier Ortega Conde (Malkavian)
wrote:
> This bug (in general, not just this on this web) have been in GNU/Linux since
> a long time with various disks, mainboards, SATA controllers, distros and
> kernels (maybe since changes after 2.6.24).

Just because you see the same error messages, that does not mean you are
seeing the same bug.
So that's a bug in some drives, though we need to work around it.

> Possible workarounds readed to this bug:
> -1: Add "libata.atapi_passthru16=0" to the kernel boot options (because some
> devices may not support 16-byte ATA commands) (
> https://bugzilla.redhat.com/show_bug.cgi?id=684599 )
> -2: (Same as 1) Add options libata atapi_passthru16=0 to
> /etc/modprobe.d/modprobe.conf and add FILES="/etc/modprobe.d/modprobe.conf" to
> /etc/mkinitcpio.conf ( https://bbs.archlinux.org/viewtopic.php?pid=895404 )

OK.

> -3: Somebody called Fujisan said in 2009 "adding 'acpi=off noapic' to the
> kernel in /etc.grub.conf seems to have solved the problem for me" (
> https://bugzilla.redhat.com/show_bug.cgi?format=multiple&id=462425 ). Raman
> Gupta and Andreas M. Kirchwitz say in other forums that adding 'acpi=off'
> doesn't work ( https://bugzilla.redhat.com/show_bug.cgi?id=549981 )
> -4: (Similar to 3) Completely disable ACPI in mainboard BIOS. (
> http://lists.debian.org/debian-user/2010/01/msg00023.html )

These are workarounds for bugs in IRQ routing on some motherboards.

They are also outdated advice. 10 years ago when both ACPI and the APIC
architecture were quite new, there were a lot of bugs in both BIOS and
kernel support for them. It was therefore sensible to try disabling them
when a new system seemed unstable. Today, this is not the case.

> -5: Gaetan Cambier says "add the option line to grub to disable ncq :
> 'libata.force=noncq' for me, with this, i have no froze". (
> https://bugzilla.redhat.com/show_bug.cgi?id=549981 ). Others reply that it
> doesn't work for them. PsYcHoK9 sys it works for him but John Doe replies that
> not for him ( https://bugs.launchpad.net/ubuntu/+source/linux/+bug/285892 ).

Not even the same symptoms.

> -6: Reartes Guillermo says "booting with the kernel parameter: pcie_aspm=off ?
> For me it worked (nvidia)". Raman Gupta replies that "I tried this and it did
> not fix the problem." ( https://bugzilla.redhat.com/show_bug.cgi?id=549981 )

This is a workaround for a controller or chipset bug.

[...]
> Same problem in my old PC/Server Pentium II MMX with Debian 6.0.3 (stable)
> with kernel 2.6.32-5-686 and libata version 3.00 in an "IBM-DTLA-305010" 10Gb
> IDE disk (configured by debian as sda) in an old mainboard . No RAID used, but
> only soft reset, and no hard reset, so I don't lose data. Could send logs, but
> I think they wouldn't give any more info.
>
> Same problem in my desktop PC every 2 or 3 months in Debian testing with
> kernels 3.0.0-1-amd64, 3.0.0-rc2-amd64, 2.6.39-2-amd64, 2.6.39-amd64,
> 2.6.38-2-amd64, 2.6.38-amd64 and maybe others older, and libata 3.00 in two
> Seagate 7200.11 "ST3500320AS" 500Gb SATA2 disks (with last firmware) from a
> RAID10. Fortunately the other two Western Digital "WDC WD1002FAEX-00Z3A0" 1Tb
> SATA3 disks don't fail, but I have to reboot and re-add disk to reconstruct
> raid. Could send logs, but I think they wouldn't give any more info.
[...]

Use reportbug to open a *separate* bug report for *each* of these
systems. Do send the logs. Please do not try to find connections with
other bug reports.
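
A minimal sketch of how the relevant lines might be located in a saved kernel log before attaching it to a report. It only looks for the two message fragments quoted earlier in this thread ("softreset failed" and "failed to IDENTIFY"); the function name `reset_lines` is hypothetical, and the complete log should still be sent with the report.

```python
from __future__ import annotations

# Message fragments quoted earlier in this thread; not an exhaustive list.
FRAGMENTS = ("softreset failed", "failed to IDENTIFY")


def reset_lines(log_text: str) -> list[str]:
    """Return the log lines that contain one of the quoted fragments."""
    wanted = [fragment.lower() for fragment in FRAGMENTS]
    matches = []
    for line in log_text.splitlines():
        lowered = line.lower()
        # Case-insensitive substring match, so capitalisation does not matter.
        if any(fragment in lowered for fragment in wanted):
            matches.append(line)
    return matches
```

For example, after saving the output of dmesg to a file, `reset_lines(open("dmesg.txt").read())` would list the matching lines; the file name here is only an illustration.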

Ben.

--
Ben Hutchings
No political challenge can be met by shopping. - George Monbiot