
[gentoo-user] 3.7.1 SATA errors

fe...@crowfix.com

Dec 23, 2012, 2:30:02 PM
A few weeks ago I had a scare when a reboot panicked the kernel with a complaint that it could not find the root device (/dev/sde), and further reboots couldn't even see the USB keyboard. Leaving the system powered off overnight "fixed" the problem and the system has been working fine ever since.

I have since had some time to explore this and find it related to the kernel; 3.6.10 works fine, while 3.7.1 fails. If I reset during the 3.7.1 boot while it is spewing its error messages, but before the kernel ultimately panics, I can reboot with 3.6.10, but if 3.7.1 goes all the way to the panic, I have to power off and wait a few minutes before a 3.6.10 reboot is successful. This is repeatable, but I haven't bothered to see how long the system must be off; "a few minutes" is enough.

This is a ~amd64 system, dual Opterons, Tyan S2882, Thunder K8S Pro. The dmesg times here start around 30 seconds because it spends 15 seconds on each of two SCSI hosts probing for nonexistent drives. udev etc are all frozen pre-systemd nonsense. Disks are two SSDs, two 4T drives, two 300G drives, and one 320G IDE/PATA drive; the main board is so old that there are only three boot options: IDE, DVD, network.

There are two error messages during the 3.7.1 boot, repeated for all SATA drives:

ata5.00: qc timeout (cmd 0x2f)
ata5.00: failed to set xfermode (err_mask=0x40)

Google does not enlighten me. One suggestion was to change the SATA cable, but this is definitely a change from 3.6.10 to 3.7.1.

So here are some details ... You can see everything at https://www.dropbox.com/sh/o8j80rps3agvvcf/FBjJLcykRS

I am willing to try reasonable config changes for a new reboot attempt, but it is my main home server, not an experimental toy :-)

================ dmesg differences

I took some pictures during the boot process and transcribed the results. The 3.6.10 dmesg matches, but of course I can't get a 3.7.1 dmesg.

Both 3.6.10 and 3.7.1 appear to be the same up to this point:

ata13.00: ATA-8: WDC WD3200AAJB-00J3A0, 01.03E01, max UDMA/133
ata13.00: 625142448 sectors, multi 16: LBA48
ata13.00: configured for UDMA/133
ata1: SATA link down (SStatus 0 SControl 300)
ata9: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
ata9.00: ATA-9: M4-CT512M4SD2, 000F, max UDMA/100
ata9.00: 1000215216 sectors, multi 16: LBA48 NCQ (depth 0/32)
ata9.00: configured for UDMA/100
ata2: SATA link down (SStatus 0 SControl 300)
ata3: SATA link down (SStatus 0 SControl 300)
ata4: SATA link down (SStatus 0 SControl 300)
ata5: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
ata5.00: ATA-7: Maxtor 6B300S0, BANC17M0, max UDMA/133
ata5.00: 586114704 sectors, multi 0: LBA48 NCQ (not used)

Around here 3.6.10 begins scrolling so fast that I could not get any pictures, so this is from the 3.6.10 dmesg, where it diverges from 3.7.1:

ata5.00: configured for UDMA/133
scsi 6:0:0:0: Direct-Access ATA Maxtor 6B300S0 BANC PQ: 0 ANSI: 5
sd 6:0:0:0: [sda] 586114704 512-byte logical blocks: (300 GB/279 GiB)
sd 6:0:0:0: [sda] Write Protect is off
sd 6:0:0:0: [sda] Mode Sense: 00 3a 00 00
sd 6:0:0:0: [sda] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
sda:
sd 6:0:0:0: [sda] Attached SCSI disk
ata6: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
ata6.00: ATA-7: Maxtor 6B300S0, BANC17M0, max UDMA/133
ata6.00: 586114704 sectors, multi 0: LBA48 NCQ (not used)
ata6.00: configured for UDMA/133
scsi 7:0:0:0: Direct-Access ATA Maxtor 6B300S0 BANC PQ: 0 ANSI: 5
sd 7:0:0:0: [sdb] 586114704 512-byte logical blocks: (300 GB/279 GiB)
sd 7:0:0:0: [sdb] Write Protect is off
sd 7:0:0:0: [sdb] Mode Sense: 00 3a 00 00
sd 7:0:0:0: [sdb] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
sdb: unknown partition table
sd 7:0:0:0: [sdb] Attached SCSI disk
.... and on and on until it boots. (The unknown partition table is an LVM volume.)

But 3.7.1 pokes along slowly enough while generating its errors that I did get some pictures to transcribe, and this is where it diverges from 3.6.10.

ata5.00: qc timeout (cmd 0x2f)
ata5.00: failed to set xfermode (err_mask=0x40)
ata5: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
ata5.00: qc timeout (cmd 0x2f)
ata5.00: failed to set xfermode (err_mask=0x40)
ata5: limiting SATA link speed to 1.5 Gbps
ata5.00: limiting speed to UDMA/133:PIO3
ata5: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
ata5.00: qc timeout (cmd 0x2f)
ata5.00: failed to set xfermode (err_mask=0x40)
ata5.00: disabled
ata5: hard resetting link
ata5: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
ata5: EH complete
... for all ATA drives until it eventually panics because the root device, /dev/sde, is not found.


================ 3.6.10 ---> 3.7.1 conf changes

I rebuilt the 3.7.1 kernel and logged all the new config items.

Cputime accounting
> 1. Simple tick based cputime accounting (TICK_CPU_ACCOUNTING) (NEW)
2. Fine granularity task level IRQ time accounting (IRQ_TIME_ACCOUNTING)
choice[1-2]:

Consider userspace as in RCU extended quiescent state (RCU_USER_QS) [N/y/?] (NEW)

Module signature verification (MODULE_SIG) [N/y/?] (NEW)

Supervisor Mode Access Prevention (X86_SMAP) [Y/n/?] (NEW) n

Legacy cpb sysfs knob support for AMD CPUs (X86_ACPI_CPUFREQ_CPB) [Y/n/?] (NEW)

Enable core dump support (COREDUMP) [Y/n/?] (NEW)

Packet: sockets monitoring interface (PACKET_DIAG) [N/m/y/?] (NEW) m

IPv4 NAT (NF_NAT_IPV4) [N/m/?] (NEW) m

OMAP OCP2SCP DRIVER (OMAP_OCP2SCP) [N/m/y/?] (NEW) m

Calxeda Highbank SATA support (SATA_HIGHBANK) [N/m/y/?] (NEW) m

Virtual eXtensible Local Area Network (VXLAN) (VXLAN) [N/m/y/?] (NEW) m

Solarflare SFC9000-family PTP support (SFC_PTP) [Y/n/?] (NEW)

Microchip MRF24J40 transceiver driver (IEEE802154_MRF24J40) [N/m/?] (NEW) m

8250/16550 PNP device support (SERIAL_8250_PNP) [Y/n/?] (NEW)

MAX310X support (SERIAL_MAX310X) [N/y/?] (NEW)

SCCNXP serial port support (SERIAL_SCCNXP) [N/m/y/?] (NEW) m

TPM HW Random Number Generator support (HW_RANDOM_TPM) [M/n/?] (NEW)

TPM Interface Specification 1.2 Interface (I2C - Infineon) (TCG_TIS_I2C_INFINEON) [N/m/?] (NEW) m

NXP SC18IS602/602B/603 I2C to SPI bridge (SPI_SC18IS602) [N/m/y/?] (NEW) m

Dialog DA9052 GPIO (GPIO_DA9052) [N/m/y/?] (NEW) m

TWL6040 GPO (GPIO_TWL6040) [N/m/y/?] (NEW) m

OMAP HDQ driver (HDQ_MASTER_OMAP) [N/m/?] (NEW) m

Marvell 88PM860x battery driver (BATTERY_88PM860X) [N/m/y/?] (NEW) m

Dialog DA9052 Battery (BATTERY_DA9052) [N/m/y/?] (NEW) m

Marvell 88PM860x Charger driver (CHARGER_88PM860X) [N/m/?] (NEW) m

Analog Devices ADT7410 (SENSORS_ADT7410) [N/m/?] (NEW) m

Maxim MAX197 and compatibles (SENSORS_MAX197) [N/m/?] (NEW) m

generic cpu cooling support (CPU_THERMAL) [N/y/?] (NEW)

Support for the SMSC ECE1099 series chips (MFD_SMSC) [N/y/?] (NEW)

Dialog Semiconductor DA9055 PMIC Support (MFD_DA9055) [N/y/?] (NEW)

Texas Instruments LP8788 Power Management Unit Driver (MFD_LP8788) [N/y/?] (NEW)

Maxim Semiconductor MAX8907 PMIC Support (MFD_MAX8907) [N/m/y/?] (NEW) m

Fairchild FAN53555 Regulator (REGULATOR_FAN53555) [N/m/y/?] (NEW) m

Maxim 8907 voltage regulator (REGULATOR_MAX8907) [N/m/?] (NEW) m

TechnoTrend USB IR Receiver (IR_TTUSBIR) [N/m/?] (NEW) m

Media USB Adapters (MEDIA_USB_SUPPORT) [N/y/?] (NEW) y

STK1160 USB video capture support (VIDEO_STK1160) [N/m/?] (NEW) m

STK1160 AC97 codec support (VIDEO_STK1160_AC97) [N/y/?] (NEW) y

Support for various USB DVB devices v2 (DVB_USB_V2) [N/m/?] (NEW) m

Enable debug for the B2C2 FlexCop drivers (DVB_B2C2_FLEXCOP_USB_DEBUG) [N/y/?] (NEW)

Media PCI Adapters (MEDIA_PCI_SUPPORT) [N/y/?] (NEW)

Media test drivers (V4L_TEST_DRIVERS) [N/y] (NEW)

ISA and parallel port devices (MEDIA_PARPORT_SUPPORT) [N/y/?] (NEW)

Autoselect tuners and i2c modules to build (MEDIA_SUBDRV_AUTOSELECT) [Y/n/?] (NEW)

Maximum debug level (NOUVEAU_DEBUG) [5] (NEW)

Default debug level (NOUVEAU_DEBUG_DEFAULT) [3] (NEW)

Backlight Driver for LM3630 (BACKLIGHT_LM3630) [N/m/y/?] (NEW) m

Backlight Driver for LM3639 (BACKLIGHT_LM3639) [N/m/y/?] (NEW) m

TPS65217 Backlight (BACKLIGHT_TPS65217) [N/m/?] (NEW) m

Default time-out for HD-audio power-save mode (SND_HDA_POWER_SAVE_DEFAULT) [0] (NEW)

CIR via RC class (HID_PICOLCD_CIR) [N/y/?] (NEW)

Sony PS3 BD Remote Control (HID_PS3REMOTE) [N/m/?] (NEW) m

HID Sensors framework support (HID_SENSOR_HUB) [N/m/?] (NEW) m

ZTE USB serial driver (USB_SERIAL_ZTE) [N/m/?] (NEW) m

OMAP USB2 PHY Driver (OMAP_USB2) [N/m/y/?] (NEW) m

LED support for LM3642 Chip (LEDS_LM3642) [N/m/y/?] (NEW) m

LED support for LM355x Chips, LM3554 and LM3556 (LEDS_LM355x) [N/m/y/?] (NEW) m

LED CPU Trigger (LEDS_TRIGGER_CPU) [N/y/?] (NEW)

Dynamic compression of swap pages and clean pagecache pages (ZCACHE2) [N/y/?] (NEW)

Silicom devices (NET_VENDOR_SILICOM) [Y/n/?] (NEW)

Silicom BypassCTL library support (SBYPASS) [N/m/?] (NEW) m

Silicom BypassCTL net support (BPCTL) [N/m/?] (NEW) m

Cambridge Electronic Design 1401 USB support (CED1401) [N/m/?] (NEW) m

Digi Realport driver (DGRP) [N/m/y/?] (NEW) m

STE-Modem remoteproc support (STE_MODEM_RPROC) [N/m/y/?] (NEW) m

SMB2 network file system support (EXPERIMENTAL) (CIFS_SMB2) [N/y/?] (NEW)

RCU debugging: preemptible RCU race provocation (PROVE_RCU_DELAY) [N/y/?] (NEW)

Red-Black tree test (RBTREE_TEST) [N/m/?] (NEW) m

Interval tree test (INTERVAL_TREE_TEST) [N/m/?] (NEW) m

CAST5 (CAST-128) cipher algorithm (x86_64/AVX) (CRYPTO_CAST5_AVX_X86_64) [N/m/y/?] (NEW) m

CAST6 (CAST-256) cipher algorithm (x86_64/AVX) (CRYPTO_CAST6_AVX_X86_64) [N/m/y/?] (NEW) m

Asymmetric (public-key cryptographic) key type (ASYMMETRIC_KEY_TYPE) [N/m/y/?] (NEW) m

Asymmetric public-key crypto algorithm subtype (ASYMMETRIC_PUBLIC_KEY_SUBTYPE) [N/m/?] (NEW) m

RSA public-key algorithm (PUBLIC_KEY_ALGO_RSA) [N/m/?] (NEW) m

X.509 certificate parser (X509_CERTIFICATE_PARSER) [N/m/?] (NEW) m

--
... _._. ._ ._. . _._. ._. ___ .__ ._. . .__. ._ .. ._.
Felix Finch: scarecrow repairman & rocket surgeon / fe...@crowfix.com
GPG = E987 4493 C860 246C 3B1E 6477 7838 76E9 182E 8151 ITAR license #4933
I've found a solution to Fermat's Last Theorem but I see I've run out of room o

Nikos Chantziaras

Dec 23, 2012, 4:00:01 PM
On 23/12/12 21:23, fe...@crowfix.com wrote:
> A few weeks ago I had a scare when a reboot panicked the kernel with a
> complaint that it could not find the root device (/dev/sde), and
> further reboots couldn't even see the USB keyboard. Leaving the
> system powered off overnight "fixed" the problem and the system has
> been working fine ever since.

Do a memtest first. emerge sys-apps/memtest86+ and then add an entry
for it in Grub:

title=Memtest86+
root (hd0,0) # <- adapt this to your partition
kernel /boot/memtest86plus/memtest.bin

Then boot that entry and see if you get any errors in the first 5
minutes or so.

fe...@crowfix.com

Dec 23, 2012, 4:10:02 PM
Starting the emerge etc. But why would this be a memory problem when
it is so clearly 3.6 vs 3.7?

fe...@crowfix.com

Dec 23, 2012, 6:40:01 PM
On Sun, Dec 23, 2012 at 10:49:46PM +0200, Nikos Chantziaras wrote:

> Then boot that entry and see if you get any errors in the first 5
> minutes or so.

Let it run a complete pass, about an hour, no errors.

Nikos Chantziaras

Dec 23, 2012, 10:20:01 PM
On 23/12/12 23:00, fe...@crowfix.com wrote:
> On Sun, Dec 23, 2012 at 10:49:46PM +0200, Nikos Chantziaras wrote:
>> On 23/12/12 21:23, fe...@crowfix.com wrote:
>>> A few weeks ago I had a scare when a reboot panicked the kernel with a
>>> complaint that it could not find the root device (/dev/sde), and
>>> further reboots couldn't even see the USB keyboard. Leaving the
>>> system powered off overnight "fixed" the problem and the system has
>>> been working fine ever since.
>>
>> Do a memtest first. emerge sys-apps/memtest86+ and then add an entry
>> for it in Grub:
>>
>> title=Memtest86+
>> root (hd0,0) # <- adapt this to your partition
>> kernel /boot/memtest86plus/memtest.bin
>>
>> Then boot that entry and see if you get any errors in the first 5
>> minutes or so.
>
> Starting the emerge etc. But why would this be a memory problem when
> it is so clearly 3.6 vs 3.7?

It's simply an easy check to do and can rule RAM failure out early on.
When RAM dies, various seemingly unrelated issues can pop up.

But since your RAM seems clean, it's not the issue.

Nilesh Govindrajan

Dec 24, 2012, 12:00:02 AM
On an interesting note, I'm on 3.7.1 pf-kernel and uptime is more than
11 hours. No such issue.

--
Nilesh Govindarajan
http://nileshgr.com

Bruce Hill

Dec 24, 2012, 9:40:01 AM
On Sun, Dec 23, 2012 at 11:23:35AM -0800, fe...@crowfix.com wrote:
<snip, whack, d200d, cough, spit>

Puhleeeze don't put such long stuff in an email. Have you heard of attachments?
pastebins?

Your dropbox postings lost me after reading:

Please enable browser-cookies to use the Dropbox website.
--
Happy Penguin Computers >')
126 Fenco Drive ( \
Tupelo, MS 38801 ^^
sup...@happypenguincomputers.com
662-269-2706 662-205-6424
http://happypenguincomputers.com/

Don't top-post: http://en.wikipedia.org/wiki/Top_post#Top-posting

fe...@crowfix.com

Dec 24, 2012, 10:50:01 AM
On Mon, Dec 24, 2012 at 08:35:20AM -0600, Bruce Hill wrote:

> Puhleeeze don't put such long stuff in an email. Have you heard of attachments?
> pastebins?

I was under the impression that gentoo strips attachments. At any
rate, I summarized as much as possible and only put the full logs
at the end.

As for the cookies, <shrug> so many sites require cookies and/or
javascript these days that I won't waste my time trying to find one
that doesn't. I just make sure they are temporary.

Dale

Dec 24, 2012, 11:10:02 AM
fe...@crowfix.com wrote:
> On Mon, Dec 24, 2012 at 08:35:20AM -0600, Bruce Hill wrote:
>
>> Puhleeeze don't put such long stuff in an email. Have you heard of attachments?
>> pastebins?
> I was under the impression that gentoo strips attachments. At any
> rate, I summarized as much as possible and only put the full logs
> at the end.
>
> As for the cookies, <shrug> so many sites require cookies and/or
> javascript these days that I won't waste my time trying to find one
> that doesn't. I just make sure they are temporary.
>

One bad thing about paste bins, they get removed. Most people on this
list prefer them included or attached. That way the error is always
available for future reference in the archives. If it is on a paste bin
site and it gets removed, then that reference is gone, usually forever.

I might add, I don't have a paste bin account either. ;-)

Dale

:-) :-)

--
I am only responsible for what I said ... Not for what you understood or how you interpreted my words!

Bruce Hill

Dec 24, 2012, 11:20:01 AM
On Mon, Dec 24, 2012 at 07:41:10AM -0800, fe...@crowfix.com wrote:
>
> I was under the impression that gentoo strips attachments. At any
> rate, I summarized as much as possible and only put the full logs
> at the end.
>
> As for the cookies, <shrug> so many sites require cookies and/or
> javascript these days that I won't waste my time trying to find one

Would you consider our own pastebin from portage?

emerge -av app-text/wgetpaste && wgetpaste /path/to/3.6/.config /path/to/3.7/.config

You can pastebin them both at the same time, in the same paste, and include a
link. I ask for both because there might be other options other than the ones
you noted, and we can use vimdiff on the two files side-by-side, which IMO
makes it very easy to see the differences.

Also can you "dmesg | wgetpaste" and note the "uname -srm" output?

Thanks,
Bruce

fe...@crowfix.com

Dec 24, 2012, 11:40:02 AM
On Mon, Dec 24, 2012 at 10:07:04AM -0600, Bruce Hill wrote:

> Would you consider our own pastebin from portage?

Sure, in progress. I'll have to read up on this pastebin stuff.

Bruce Hill

Dec 24, 2012, 12:00:02 PM
<content omitted from reply>

This time it has 4 attachments; afaik there were zero attachments the first
time (deleted email here so can't check now). No worries, files here now.

Do you have a /var/log/messages (might be in rotated, gzipped one even) that
includes the 3.6.10 *and* 3.7.1 boot?

fe...@crowfix.com

Dec 24, 2012, 12:00:02 PM
On Mon, Dec 24, 2012 at 10:07:04AM -0600, Bruce Hill wrote:

> emerge -av app-text/wgetpaste && wgetpaste /path/to/3.6/.config
> /path/to/3.7/.config

3.6.10 .config -- http://bpaste.net/show/66307/
3.7.1 .config -- http://bpaste.net/show/66309/

> Also can you "dmesg | wgetpaste" and note the "uname -srm" output?

3.6.10 dmesg -- http://bpaste.net/show/66310/

uname -srm: Linux 3.6.10-gentoo x86_64

A couple of others:

My partial transcription of the 3.7.1 boot error messages: http://bpaste.net/show/66311/

3.6.10 emerge --info: http://bpaste.net/show/66312/

I also added all this to the Dropbox dir.

fe...@crowfix.com

Dec 24, 2012, 12:10:01 PM
On Mon, Dec 24, 2012 at 07:41:10AM -0800, fe...@crowfix.com wrote:
>
> I was under the impression that gentoo strips attachments. At any
> rate, I summarized as much as possible and only put the full logs
> at the end.

Looks like the attachments got thru. I will try to remember that.

fe...@crowfix.com

Dec 24, 2012, 12:30:02 PM
On Mon, Dec 24, 2012 at 10:53:34AM -0600, Bruce Hill wrote:

> This time it has 4 attachments; afaik there were zero attachments the first
> time (deleted email here so can't check now). No worries, files here now.

Yes, I originally sent no attachments, since I thought the mailing list stripped them.

> Do you have a /var/log/messages (might be in rotated, gzipped one even) that
> includes the 3.6.10 *and* 3.7.1 boot?

Can't do anything for 3.7.1, since it never boots. Here is the 3.6.10
file, from boot until all disks are found: http://bpaste.net/show/66317/

Mark Knecht

Dec 24, 2012, 1:30:01 PM
On Mon, Dec 24, 2012 at 6:35 AM, Bruce Hill
<da...@happypenguincomputers.com> wrote:
> On Sun, Dec 23, 2012 at 11:23:35AM -0800, fe...@crowfix.com wrote:
> <snip, whack, d200d, cough, spit>
>
> Puhleeeze don't put such long stuff in an email. Have you heard of attachments?
> pastebins?
>

Felix,
Personally, after years reading LKML, I have no problem with
in-line text of _any_ length, especially on the initial post or when
you are asked to respond with detailed info. While I understand
Bruce's comment I don't think it represents a democratic picture of
what this list has been comfortable with over the years.

That said, what I do have a BIG problem with is people responding
and not taking the time to edit the response down to a few lines that
make it clear about what their point is. Many responses to 1000 line
emails are 1001 lines - the responder adds a one-liner. That's a real
waste.

It's a trade off. It's less likely that some of us will go read
pastebin stuff, and if we want to respond technically then that's
leaving us to copy/paste responses which I'm personally less likely to
do.

Anyway, you pays your money, you takes your chance... ;-)

Cheers,
Mark

Michael Mol

Dec 24, 2012, 1:30:01 PM
On Mon, Dec 24, 2012 at 1:21 PM, Mark Knecht <markk...@gmail.com> wrote:
> On Mon, Dec 24, 2012 at 6:35 AM, Bruce Hill
> <da...@happypenguincomputers.com> wrote:
>> On Sun, Dec 23, 2012 at 11:23:35AM -0800, fe...@crowfix.com wrote:
>> <snip, whack, d200d, cough, spit>
>>
>> Puhleeeze don't put such long stuff in an email. Have you heard of attachments?
>> pastebins?
>>
>
> Felix,
> Personally, after years reading LKML, I have no problem with
> in-line text of _any_ length, especially on the initial post or when
> you are asked to respond with detailed info. While I understand
> Bruce's comment I don't think it represents a democratic picture of
> what this list has been comfortable with over the years.

Agreed.

>
> That said, what I do have a BIG problem with is people responding
> and not taking the time to edit the response down to a few lines that
> make it clear about what their point is. Many responses to 1000 line
> emails are 1001 lines - the responder adds a one-liner. That's a real
> waste.

Guilty. To be fair, I try to properly snip and edit when I can, but if
I'm responding from my phone (more often than not, of late), getting
that kind of editing work in is very difficult.

--
:wq

Florian Philipp

Dec 25, 2012, 7:20:01 AM
Am 23.12.2012 20:23, schrieb fe...@crowfix.com:
>
> I have since had some time to explore this and find it related to the
> kernel; 3.6.10 works fine, while 3.7.1 fails. If I reset during the
> 3.7.1 boot while it is spewing its error messages, but before the
> kernel ultimately panics, I can reboot with 3.6.10, but if 3.7.1 goes
> all the way to the panic, I have to power off and wait a few minutes
> before a 3.6.10 reboot is successful. This is repeatable, but I
> haven't bothered to see how long the system must be off; "a few
> minutes" is enough.
>
> There are two error messages during the 3.7.1 boot, repeated for all
> SATA drives:
>
> ata5.00: qc timeout (cmd 0x2f)
> ata5.00: failed to set xfermode (err_mask=0x40)
>

The code that prints these messages has not been changed since 2011 so I
guess it is a driver issue. You never posted which driver you use
exactly and your kernel config enables all. Therefore I cannot look further.

The best way to find out what's wrong is to bisect the kernel, i.e.
finding the exact commit that caused the issue to appear.

http://wiki.gentoo.org/wiki/Kernel_git-bisect

Unfortunately, there have been 1545 commits between 3.6 and 3.7. With
blind bisection you need about 11 kernels to find the issue, since each
step halves the range and log2(1545) is roughly 10.6. Maybe `git log`
can give you a hint which commits might be relevant.
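
As a rough check on the cost of a bisect: every tested kernel discards about half of the remaining commits, so the worst case grows with log2 of the commit count. A minimal sketch in plain POSIX shell; the name bisect_steps is made up for illustration and is not a git command, and git's own estimate can differ by a step:

```shell
# bisect_steps N: worst-case number of kernels to build and boot when
# bisecting N candidate commits (hypothetical helper, not part of git).
bisect_steps() {
    n=$1
    steps=0
    while [ "$n" -gt 1 ]; do
        n=$(( (n + 1) / 2 ))   # each test discards about half the range
        steps=$(( steps + 1 ))
    done
    echo "$steps"
}

bisect_steps 1545   # prints 11
```

Doubling the number of commits only adds one more build, which is why bisecting stays practical even across a whole release.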

Regards,
Florian Philipp

Bruce Hill

Dec 25, 2012, 10:00:01 AM
On Mon, Dec 24, 2012 at 08:53:33AM -0800, fe...@crowfix.com wrote:
> On Mon, Dec 24, 2012 at 10:07:04AM -0600, Bruce Hill wrote:
>
> > emerge -av app-text/wgetpaste && wgetpaste /path/to/3.6/.config
> > /path/to/3.7/.config
>
> 3.6.10 .config -- http://bpaste.net/show/66307/
> 3.7.1 .config -- http://bpaste.net/show/66309/
>
> > Also can you "dmesg | wgetpaste" and note the "uname -srm" output?
>
> 3.6.10 dmesg -- http://bpaste.net/show/66310/
>
> uname -srm: Linux 3.6.10-gentoo x86_64
>
> A couple of others:
>
> My partial transcription of the 3.7.1 boot error messages: http://bpaste.net/show/66311/
>
> 3.6.10 emerge --info: http://bpaste.net/show/66312/
>
> I also added all this to the Dropbox dir.

We're on the road, getting ready to pack, and not in a good position to do
much on this issue atm.

I would suggest you run "lspci -nnk" with your running 3.6.10 kernel and save
that output. Then go into the kernel source directory for 3.7.1, run "make
mrproper" then "make defconfig" and enable all the kernel drivers listed in
the "lspci -nnk" output, as well as the drivers for your IDE/SATA controllers,
and / filesystem. That kernel should boot you, and will get rid of a lot of
the cruft from the present bloated kernels.
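
To pull just the driver names out of the "lspci -nnk" output, something along these lines should do. It is only a sketch: drivers_in_use is a made-up helper name, and the three input lines in the demo are a made-up sample, not real output from any machine.

```shell
# drivers_in_use: read `lspci -nnk` output on stdin and print each bound
# kernel driver once (hypothetical helper name).
drivers_in_use() {
    awk -F': ' '/Kernel driver in use/ { print $2 }' | sort -u
}

# On the real machine: lspci -nnk | drivers_in_use
# Demo on a made-up sample:
drivers_in_use <<'EOF'
        Kernel driver in use: sata_sil
        Kernel driver in use: pata_amd
        Kernel driver in use: sata_sil
EOF
```

The resulting list is what needs to be enabled on top of "make defconfig".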

Todd Goodman

Dec 25, 2012, 11:10:01 AM
* Florian Philipp <li...@binarywings.net> [121225 07:16]:
A me too on the problem the original poster is seeing.

I too am seeing this on a server I have. 3.7.0 and 3.7.1 both don't work
but 3.6.10 works fine.

I'm using the sata_mv driver with SuperMicro cards (two, actually) with
Marvell MV88SX6081's. These chips and their driver have had some issues
in the past.

I also looked for changes in the driver and didn't see any. Though I
did see some libata changes.

I haven't had time to do a git bisect yet.

Todd

fe...@crowfix.com

Dec 25, 2012, 11:20:02 AM
On Tue, Dec 25, 2012 at 08:56:56AM -0600, Bruce Hill wrote:

> We're on the road, getting ready to pack, and not in a good position to do
> much on this issue atm.

Nevertheless, a most unexpected Christmas present! In progress, and thank you.

My dilemma certainly isn't urgent, since 3.6.10 still works.

fe...@crowfix.com

Dec 25, 2012, 11:50:01 AM
On Tue, Dec 25, 2012 at 10:58:54AM -0500, Todd Goodman wrote:
> A me too on the problem the original poster is seeing.
>
> I too am seeing this on a server I have. 3.7.0 and 3.7.1 both don't work
> but 3.6.10 works fine.
>
> I'm using the sata_mv driver with SuperMicro cards (two, actually) with
> Marvell MV88SX6081's. These chips and their driver have had some issues
> in the past.

A pruned lspci -nnk:

00:07.1 IDE interface [0101]: Advanced Micro Devices [AMD] AMD-8111 IDE [1022:7469] (rev 03)
        Subsystem: Advanced Micro Devices [AMD] AMD-8111 IDE [1022:7469]
        Kernel driver in use: pata_amd
01:03.0 SCSI storage controller [0100]: Marvell Technology Group Ltd. MV88SX6081 8-port SATA II PCI-X Controller [11ab:6081] (rev 09)
        Subsystem: Marvell Technology Group Ltd. Device [11ab:11ab]
        Kernel driver in use: sata_mv
02:06.0 SCSI storage controller [0100]: Adaptec AIC-7902B U320 [9005:801d] (rev 10)
        Subsystem: Adaptec Device [9005:005e]
        Kernel driver in use: aic79xx
02:06.1 SCSI storage controller [0100]: Adaptec AIC-7902B U320 [9005:801d] (rev 10)
        Subsystem: Adaptec Device [9005:005e]
        Kernel driver in use: aic79xx
03:05.0 Mass storage controller [0180]: Silicon Image, Inc. SiI 3114 [SATALink/SATARaid] Serial ATA Controller [1095:3114] (rev 02)
        Subsystem: Silicon Image, Inc. SiI 3114 SATALink Controller [1095:3114]
        Kernel driver in use: sata_sil

pata_amd   /dev/sdg      320G for boot which seems happy

sata_mv    /dev/sd[ab]   2 x 300G LVM mounted automatically from fstab
           /dev/sd[cd]   2 x 4T ditto

sata_sil   /dev/sde      512G SSD with / and swap
           /dev/sdf      512G SSD with LVM for /home, /encfs, and mail spool

aic79xx    no drives

The sata_mv drives are not necessary for boot, but they do take up
/dev/sd? namespace. Might be interesting to try Bruce Hill's idea of
a pruned 3.7.1 kernel without that driver.

fe...@crowfix.com

Dec 25, 2012, 5:00:02 PM
On Tue, Dec 25, 2012 at 08:56:56AM -0600, Bruce Hill wrote:

> I would suggest you run "lspci -nnk" with your running 3.6.10 kernel and save
> that output. Then go into the kernel source directory for 3.7.1, run "make
> mrproper" then "make defconfig" and enable all the kernel drivers listed in
> the "lspci -nnk" output, as well as the drivers for your IDE/SATA controllers,
> and / filesystem. That kernel should boot you, and will get rid of a lot of
> the cruft from the present bloated kernels.

Made a minimal 3.7.1 kernel, much smaller and compiled nice and fast.
Hung just like the bloated one, drat.

So I guess I will read up on bisecting. I know the principle, but
have never tried it. I suppose one starting point is make sure a
pure-vanilla 3.6.10 kernel boots.

Dale

Dec 25, 2012, 5:30:02 PM
fe...@crowfix.com wrote:
> On Tue, Dec 25, 2012 at 08:56:56AM -0600, Bruce Hill wrote:
>
>> I would suggest you run "lspci -nnk" with your running 3.6.10 kernel and save
>> that output. Then go into the kernel source directory for 3.7.1, run "make
>> mrproper" then "make defconfig" and enable all the kernel drivers listed in
>> the "lspci -nnk" output, as well as the drivers for your IDE/SATA controllers,
>> and / filesystem. That kernel should boot you, and will get rid of a lot of
>> the cruft from the present bloated kernels.
> Made a minimal 3.7.1 kernel, much smaller and compiled nice and fast.
> Hung just like the bloated one, drat.
>
> So I guess I will read up on bisecting. I know the principle, but
> have never tried it. I suppose one starting point is make sure a
> pure-vanilla 3.6.10 kernel boots.
>

This is what I would try:

Do a lspci -k from whatever Linux you can boot, sysrescue CD or stick
comes to mind here. That should list the drivers you need for
hardware. Then mount partitions so you can get to /usr/src/<kernel
here> and cat the config file and make sure the results from lspci are
built INTO the kernel, not modules but built INTO the kernel. You could
even do: 'cat .config | grep -i <driver name from lspci -k here>'
Repeat that for each driver. Remember, arrow up keys for that one.
Saves you some typing. lol

If you have those built in, the only thing to check then is that the
file system for / is also built INTO the kernel. That has always got me
to at least a console login. Some other hardware may not work but you
can boot and fix from inside the OS instead of booting DVD, USB stick or
whatever and having to mount and such. That is such a pain to do.

Maybe that will help. At least get you to a console. That alone makes
fixing something else easier.
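
The grep check above can be wrapped up so it says outright whether a driver is built in, a module, or off. This is only a sketch: config_state is a made-up name, and /usr/src/linux/.config and SATA_SIL in the usage comment are example values, so substitute your own path and the driver names from lspci -k.

```shell
# config_state FILE SYMBOL: print y, m, or n for CONFIG_SYMBOL in a kernel
# .config (hypothetical helper; "n" covers both "is not set" and absent).
config_state() {
    line=$(grep "^CONFIG_$2=" "$1") || { echo n; return 0; }
    echo "${line#*=}"
}

# Usage on a real tree (path and symbol are examples only):
#   config_state /usr/src/linux/.config SATA_SIL
```

Anything that comes back as "m" or "n" for a disk controller or the / filesystem is a candidate for why the root device is not found.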

fe...@crowfix.com

Dec 25, 2012, 6:20:02 PM
On Tue, Dec 25, 2012 at 04:20:23PM -0600, Dale wrote:

> This is what I would try:
> ...
> Maybe that will help. At least get you to a console. That alone makes
> fixing something else easier.

Checked all that -- it boots into the same ATA driver failures as the
bloated version of the kernel. Even have to power off and wait a
while before it resets properly for a 3.6.10 reboot. So I think it is
bisecting for me.

Paul Hartman

Dec 25, 2012, 6:30:02 PM
On Sun, Dec 23, 2012 at 1:23 PM, <fe...@crowfix.com> wrote:
> Google does not enlighten me. One suggestion was to change the SATA cable, but this is definitely a change from 3.6.10 to 3.7.1.

I can't find where I read it, but just yesterday I was reading a
somewhat recent LKML post which mentioned SATA errors introduced in
3.7.x series, especially problems with JMicron controllers (surprise,
surprise), but perhaps others as well, and also some new warnings
thrown out in the kernel log that didn't used to be there. Sorry I
have nothing more than that anecdote...

Dale

Dec 25, 2012, 7:10:02 PM
fe...@crowfix.com wrote:
> On Tue, Dec 25, 2012 at 04:20:23PM -0600, Dale wrote:
>
>> This is what I would try:
>> ...
>> Maybe that will help. At least get you to a console. That alone makes
>> fixing something else easier.
> Checked all that -- it boots into the same ATA driver failures as the
> bloated version of the kernel. Even have to power off and wait a
> while before it resets properly for a 3.6.10 reboot. So I think it is
> bisecting for me.
>

Is it possible that you have two SATA drivers enabled and the two
conflict each other? I read, I think on this list, where someone had to
disable one driver for the correct driver to work. You may want to go here:

http://kmuto.jp/debian/hcl/

Get the driver list from that and try it. On that site, you can use
lspci or look up by model brand on the left. This is weird. If I think
of anything else, I'll post but I'm sort of stumped. My previous post
always gets me to a console login if nothing else. Once you get that,
you can work out the rest.

One other thought: have you tried an even more recent kernel version? Maybe
that version is bad or something. Back to my stump.

fe...@crowfix.com

Dec 25, 2012, 7:30:02 PM
On Tue, Dec 25, 2012 at 06:03:12PM -0600, Dale wrote:

> Is it possible that you have two SATA drivers enabled and the two
> conflict each other? I read, I think on this list, where someone had to
> disable one driver for the correct driver to work. You may want to go here:
>
> http://kmuto.jp/debian/hcl/

I'm not sure what good it would do me to find driver incompatibility
like that, since I need all the drivers working at once, and I'd still
have to bisect them.

> One other thought: have you tried an even more recent kernel version? Maybe
> that version is bad or something. Back to my stump.

3.7.0 failed, then 3.7.1. I haven't tried anything more recent. I'm
trying to download the kernel git, but my satellite link is kinda
slow, and it's snowing on and off and temporarily clogging the dish
until the heater kicks in.

Once I get it downloaded, I'll make sure the 3.6.10 equivalent works
and the 3.7.0 fails, then start bisecting.

fe...@crowfix.com

Dec 25, 2012, 8:20:01 PM
On Tue, Dec 25, 2012 at 01:11:04PM +0100, Florian Philipp wrote:

> The best way to find out what's wrong is to bisect the kernel, i.e.
> finding the exact commit that caused the issue to appear.
>
> http://wiki.gentoo.org/wiki/Kernel_git-bisect

Got the repository cloned:

# git clone git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable.git linux-stable

Tried to start the bisect, but ran into a problem:

# git bisect start
# git bisect bad v3.7.0
fatal: Needed a single revision
Bad rev input: v3.7.0

Tried v3.7.0.0 for fun, same error.

Tried good first, guessing it can't do much harm that a git bisect reset can't fix.

# git bisect good v3.6.10
a63a7cf3fc2ac1aff657f58ea446c34f3252209a was both good and bad
# git bisect bad v3.7.0
fatal: Needed a single revision
Bad rev input: v3.7.0

Have I grabbed a repository which doesn't include 3.7.0?

Google research continues.

Florian Philipp

Dec 26, 2012, 7:10:02 AM
Am 26.12.2012 02:11, schrieb fe...@crowfix.com:
> On Tue, Dec 25, 2012 at 01:11:04PM +0100, Florian Philipp wrote:
>
>> The best way to find out what's wrong is to bisect the kernel, i.e.
>> finding the exact commit that caused the issue to appear.
>>
>> http://wiki.gentoo.org/wiki/Kernel_git-bisect
>
> Got the repository cloned:
>
> # git clone git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable.git linux-stable
>
> Tried to start the bisect, but ran into a problem:
>
> # git bisect start
> # git bisect bad v3.7.0
> fatal: Needed a single revision
> Bad rev input: v3.7.0
>
> Tried v3.7.0.0 for fun, same error.
>
> Tried good first, guessing it can't do much harm that a git bisect reset can't fix.
>
> # git bisect good v3.6.10
> a63a7cf3fc2ac1aff657f58ea446c34f3252209a was both good and bad
> # git bisect bad v3.7.0
> fatal: Needed a single revision
> Bad rev input: v3.7.0
>
> Have I grabbed a repository which doesn't include 3.7.0?
>
> Google research continues.
>

`git tag` should give you a list of version numbers. The tag you are
searching for is "v3.7".
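
For reference, a minimal sketch of mapping a reported kernel version to the tag name git expects. It assumes the 3.x numbering, where a mainline release reports itself as 3.7.0 but is tagged v3.7, while stable point releases such as 3.6.10 keep all three components. The helper name kernel_tag is made up for illustration and is not part of git.

```shell
# kernel_tag: hypothetical helper, turns a version string into a tag name.
# Assumption: 3.x-style numbering, where "3.7.0" is tagged "v3.7" and
# point releases like "3.6.10" are tagged "v3.6.10".
kernel_tag() {
    ver=${1#v}                 # accept input with or without a leading "v"
    case "$ver" in
        *.*.0) ver=${ver%.0} ;;   # drop the trailing ".0" of a mainline release
    esac
    printf 'v%s\n' "$ver"
}

# Typical use inside the cloned linux-stable tree (not run here):
#   git tag -l 'v3.7*'                        # list the tags that actually exist
#   git bisect start
#   git bisect bad  "$(kernel_tag 3.7.0)"     # v3.7
#   git bisect good "$(kernel_tag 3.6.10)"    # v3.6.10
```

Listing the tags first with git tag -l is the safer habit, since it shows what the repository really contains instead of relying on a naming rule.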

Regards,
Florian Philipp

fe...@crowfix.com

Dec 26, 2012, 8:20:02 AM
On Wed, Dec 26, 2012 at 12:56:39PM +0100, Florian Philipp wrote:

> `git tag` should give you a list of version numbers. The tag you are
> searching for is "v3.7".

Thanks -- power went out, standby generator kicked in and woke me up
at 0430, and I woke realizing that. Bisect is happy. My git-fu is
weak, since I mostly use it for personal projects. Work only uses
subversion, blecch. Didn't know about git tag, and git bisect help
doesn't mention it.

fe...@crowfix.com

Dec 26, 2012, 10:30:02 PM
Finished the bisect between 3.6.10 and 3.7. Here's the log. The suspect patch has an interesting name:

ahci: implement aggressive SATA device sleep support

I'll send email to the patch author too.

I should make it clear that this is not urgent for me, since 3.6.10 isn't obsolete yet.

================
Bisecting: a merge base must be tested
[a0d271cbfed1dd50278c6b06bead3d00ba0a88f9] Linux 3.6
Bisecting: 6499 revisions left to test after this (roughly 13 steps)
[d66e6737d454553e1e62109d8298ede5351178a4] Merge git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6
Bisecting: 3257 revisions left to test after this (roughly 12 steps)
[6d55d5968a8622f3ea20ec40737aea1cfba6438c] Merge branch 'next/soc' into HEAD
Bisecting: 1329 revisions left to test after this (roughly 11 steps)
[aecdc33e111b2c447b622e287c6003726daa1426] Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next
Bisecting: 684 revisions left to test after this (roughly 9 steps)
[65b99c74fdd325d1ffa2e5663295888704712604] Merge tag 'upstream-3.7-rc1' of git://git.infradead.org/linux-ubi
Bisecting: 337 revisions left to test after this (roughly 8 steps)
[16642a2e7be23bbda013fc32d8f6c68982eab603] Merge tag 'pm-for-3.7-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Bisecting: 164 revisions left to test after this (roughly 7 steps)
[7a9a2970b5c1c2ce73d4bb84edaa7ebf13e0c841] Merge tag 'rdma-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/roland/infiniband
Bisecting: 81 revisions left to test after this (roughly 6 steps)
[c26d4114aac55b57078caf83e261621d22e4596d] Merge branch 'pm-qos'
Bisecting: 45 revisions left to test after this (roughly 5 steps)
[c09b890b763df3ccd79a2c34c2f1abeb73179caf] spi/imx: set the inactive state of the clock according to the clock polarity
Bisecting: 24 revisions left to test after this (roughly 5 steps)
[7fe0b14b725d6d09a1d9e1409bd465cb88b587f9] Merge tag 'spi-3.7' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/misc
Bisecting: 12 revisions left to test after this (roughly 4 steps)
[f1e70c2c535923de253eea2021376a936eb8d478] ata/ahci_platform: Add clock framework support
Bisecting: 5 revisions left to test after this (roughly 3 steps)
[583661a89ed2e484bd295e7b4606099340478c38] ata: define enum constants for IDENTIFY DEVICE
Bisecting: 2 revisions left to test after this (roughly 2 steps)
[8996b89d6bc98ae2f6d6e6e624a42a3f89d06949] ata: add platform driver for Calxeda AHCI controller
Bisecting: 0 revisions left to test after this (roughly 1 step)
[100f586bd0959fe0e52b8a0b8cb49a3df1c6b044] sata_fsl: add workaround for data length mismatch on freescale V2 controller
Bisecting: 0 revisions left to test after this (roughly 0 steps)
[65fe1f0f66a57380229a4ced844188103135f37b] ahci: implement aggressive SATA device sleep support
65fe1f0f66a57380229a4ced844188103135f37b is the first bad commit
commit 65fe1f0f66a57380229a4ced844188103135f37b
Author: Shane Huang <shane...@amd.com>
Date: Fri Sep 7 22:40:01 2012 +0800

ahci: implement aggressive SATA device sleep support

Device Sleep is a feature as described in AHCI 1.3.1 Technical Proposal.
This feature enables an HBA and SATA storage device to enter the DevSleep
interface state, enabling lower power SATA-based systems.

Aggressive Device Sleep enables the HBA to assert the DEVSLP signal as
soon as there are no commands outstanding to the device and the port
specific Device Sleep idle timer has expired. This enables autonomous
entry into the DevSleep interface state without waiting for software
in power sensitive systems.

This patch enables Aggressive Device Sleep only if both host controller
and device support it.

Tested on AMD reference board together with Device Sleep supported device
sample.

Signed-off-by: Shane Huang <shane...@amd.com>
Reviewed-by: Aaron Lu <aaro...@gmail.com>
Signed-off-by: Jeff Garzik <jga...@redhat.com>

:040000 040000 9441b703760224de98a80546977129214d9528f8 436fe4f42392a48b4564f09cad69dafbe82be2c1 M drivers
:040000 040000 3177c859173da3d15f3c2fb287364f063aa420d9 a39a26dc3f6c0b21433688420a820b121a921cec M include
================
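
The "roughly N steps" figures in the log come from the fact that each test cuts the remaining candidates about in half. A rough sketch of that arithmetic, assuming plain repeated halving; git computes its own estimate differently and it is sometimes one lower (the log above shows 9 steps for 684 revisions). The function name halving_steps is made up for illustration.

```shell
# halving_steps: count how many times N can be halved before nothing is left.
# This is only a back-of-the-envelope estimate, not git's own calculation.
halving_steps() {
    n=$1
    steps=0
    while [ "$n" -gt 0 ]; do
        n=$((n / 2))           # each good/bad verdict discards about half
        steps=$((steps + 1))
    done
    echo "$steps"
}

halving_steps 6499   # prints 13, matching the first "roughly 13 steps" line
```

So even with several thousand commits between 3.6.10 and 3.7, the number of kernels to build and boot stays in the low teens.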

Mark Knecht

Dec 27, 2012, 12:00:01 AM
On Wed, Dec 26, 2012 at 7:19 PM, <fe...@crowfix.com> wrote:
> Finished the bisect between 3.6.10 and 3.7. Here's the log. The suspect patch has an interesting name:
>
> ahci: implement aggressive SATA device sleep support
>
> I'll send email to the patch author too.
>
> I should make it clear that this is not urgent for me, since 3.6.10 isn't obsolete yet.
>
<SNIP>

Possibly related?

https://bugzilla.kernel.org/show_bug.cgi?id=51881

fe...@crowfix.com

Dec 27, 2012, 12:20:02 AM
On Wed, Dec 26, 2012 at 08:53:14PM -0800, Mark Knecht wrote:

> Possibly related?
>
> https://bugzilla.kernel.org/show_bug.cgi?id=51881

Indeed :-) The patch author directed me there, I've applied the 51881
patch to the 3.7.1 sources, and it just started compiling.
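
A minimal sketch of applying such a patch with a dry run first, so a patch that does not fit leaves the tree untouched. The paths in the usage comment are assumptions, as is the -p1 strip level (the usual one for kernel patches with a/ and b/ prefixes); the function name apply_checked is made up for illustration.

```shell
# apply_checked SRC_DIR PATCH_FILE
# Runs a dry run of the patch first and only applies it if that succeeds.
# PATCH_FILE must be an absolute path, because the function changes
# directory into SRC_DIR before reading it.
apply_checked() {
    src_dir=$1
    patch_file=$2
    # Dry run: report problems without modifying any file.
    ( cd "$src_dir" && patch -p1 --dry-run < "$patch_file" ) >/dev/null || return 1
    # Real run: only reached when the dry run was clean.
    ( cd "$src_dir" && patch -p1 < "$patch_file" )
}

# Hypothetical usage (both paths are assumptions, adjust to your setup):
#   apply_checked /usr/src/linux-3.7.1 /tmp/bug51881.patch
```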

fe...@crowfix.com

Dec 27, 2012, 12:50:01 AM
On Wed, Dec 26, 2012 at 09:14:33PM -0800, fe...@crowfix.com wrote:

> Indeed :-) The patch author directed me there, I've applied the 51881
> patch to the 3.7.1 sources, and it just started compiling.

I configured a minimal kernel to test it sooner, and it booted to a
prompt. Now I am compiling with my normal config, including encfs and
a lot of other gorp, and will try it in the morning.

Pandu Poluan

Dec 27, 2012, 1:40:02 AM

On Dec 27, 2012 12:45 PM, <fe...@crowfix.com> wrote:
>
> On Wed, Dec 26, 2012 at 09:14:33PM -0800, fe...@crowfix.com wrote:
>
> > Indeed :-) The patch author directed me there, I've applied the 51881
> > patch to the 3.7.1 sources, and it just started compiling.
>
> I configured a minimal kernel to test it sooner, and it booted to a
> prompt.  Now I am compiling with my normal config, including encfs and
> a lot of other gorp, and will try it in the morning.
>

I much appreciate your efforts, and greatly admire your willingness to do bisection.

Do tell us the results... you're doing great service for all of us.

Rgds,
--

Mark Knecht

Dec 27, 2012, 9:40:01 AM
On Wed, Dec 26, 2012 at 9:14 PM, <fe...@crowfix.com> wrote:
> On Wed, Dec 26, 2012 at 08:53:14PM -0800, Mark Knecht wrote:
>
>> Possibly related?
>>
>> https://bugzilla.kernel.org/show_bug.cgi?id=51881
>
> Indeed :-) The patch author directed me there, I've applied the 51881
> patch to the 3.7.1 sources, and it just started compiling.

Glad you're finding a solution. Hope it works out.

Cheers,
Mark

fe...@crowfix.com

Dec 27, 2012, 10:10:01 AM
On Wed, Dec 26, 2012 at 09:41:54PM -0800, fe...@crowfix.com wrote:
>
> I configured a minimal kernel to test it sooner, and it booted to a
> prompt. Now I am compiling with my normal config, including encfs and
> a lot of other gorp, and will try it in the morning.

My bloated fully-larded normal config version of the patched 3.7.1
kernel also works. dmesg logs match with the usual differences in USB
assignments and a few messages which changed wording.
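
For anyone wanting to repeat that comparison, a small sketch of stripping the leading timestamps so two dmesg captures can be diffed line by line. It assumes the usual "[  seconds.micros] " prefix at the start of each line; the file names in the usage comment and the function name strip_dmesg_ts are made up for illustration.

```shell
# strip_dmesg_ts: remove the leading "[   12.345678] " timestamp from each
# line on stdin, so logs from two boots can be compared by content only.
strip_dmesg_ts() {
    sed -E 's/^\[ *[0-9]+\.[0-9]+\] //'
}

# Hypothetical usage with two saved logs:
#   strip_dmesg_ts < dmesg-3.6.10.txt > a.txt
#   strip_dmesg_ts < dmesg-3.7.1-patched.txt > b.txt
#   diff -u a.txt b.txt
```

Lines without a timestamp pass through unchanged, so the filter is safe to run on logs captured with timestamps disabled.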

The patch author says the patch is just waiting for the maintainers to
approve it up the line. I do not know if that means it will be in 3.7.2.

Thanks to everyone who helped here, especially with git bisect.

Todd Goodman

Dec 27, 2012, 11:10:02 AM
* fe...@crowfix.com <fe...@crowfix.com> [121227 10:08]:
> On Wed, Dec 26, 2012 at 09:41:54PM -0800, fe...@crowfix.com wrote:
> >
> > I configured a minimal kernel to test it sooner, and it booted to a
> > prompt. Now I am compiling with my normal config, including encfs and
> > a lot of other gorp, and will try it in the morning.
>
> My bloated fully-larded normal config version of the patched 3.7.1
> kernel also works. dmesg logs match with the usual differences in USB
> assignments and a few messages which changed wording.
>
> The patch author says the patch is just waiting for the maintainers to
> approve it up the line. I do not know if that means it will be in 3.7.2.
>
> Thanks to everyone who helped here, especially with git bisect.

Thank you for doing the git bisect and tracking this down (and Mark for
pointing to the bugzilla.kernel.org bug.)

Todd