Google Groups no longer supports new Usenet posts or subscriptions. Historical content remains viewable.
Dismiss

Re: [BUG] kernel 2.6.15.4: soft lockup detected on CPU#0!

50 views
Skip to first unread message

Folkert van Heusden

unread,
Feb 14, 2006, 6:50:12 AM2/14/06
to
I've got the same soft lockups, not while cdrom access but when sa-learn
runs (from spamassassin):

[189597.723886] BUG: soft lockup detected on CPU#0!
[189597.723960]
[189597.724007] Pid: 14, comm: kblockd/0
[189597.724061] EIP: 0060:[<c022dacc>] CPU: 0
[189597.724117] EIP is at ide_pio_sector+0xb2/0xdf
[189597.724169] EFLAGS: 00200206 Not tainted (2.6.15.4-pwc-ara)
[189597.724225] EAX: c2188000 EBX: ebdc5600 ECX: 00000000 EDX: 00000005
[189597.724278] ESI: f7cf1000 EDI: c03f2780 EBP: c03f2814 DS: 007b ES: 007b
[189597.724374] CR0: 8005003b CR2: 4d5591e4 CR3: 003b2000 CR4: 000006d0
[189597.724456] [<c022db2b>] ide_pio_multi+0x32/0x3b
[189597.724616] [<c022ddd3>] task_out_intr+0xc3/0xe5
[189597.724782] [<c01231de>] del_timer+0x56/0x58
[189597.725022] [<c022dd10>] task_out_intr+0x0/0xe5
[189597.725172] [<c02296df>] ide_intr+0xbe/0x139
[189597.725318] [<c013a041>] handle_IRQ_event+0x26/0x59
[189597.725461] [<c013a0f4>] __do_IRQ+0x80/0xe0
[189597.725600] [<c0104dc6>] do_IRQ+0x36/0x66
[189597.725794] [<c01034f2>] common_interrupt+0x1a/0x20
[189597.725935] [<f886b0d4>] ipt_do_table+0xac/0x315 [ip_tables]
[189597.726098] [<f88d1029>] ipt_hook+0x29/0x31 [iptable_filter]
[189597.726231] [<c0286ff1>] nf_iterate+0x69/0x83
[189597.726359] [<c028c893>] ip_local_deliver_finish+0x0/0x1cc
[189597.726568] [<c028c893>] ip_local_deliver_finish+0x0/0x1cc
[189597.726696] [<c0287068>] nf_hook_slow+0x5d/0xea
[189597.726831] [<c028c893>] ip_local_deliver_finish+0x0/0x1cc
[189597.726972] [<c028c328>] ip_local_deliver+0x220/0x250
[189597.727134] [<c028c893>] ip_local_deliver_finish+0x0/0x1cc
[189597.727298] [<c028c5ed>] ip_rcv+0x295/0x53b
[189597.727555] [<c028ca5f>] ip_rcv_finish+0x0/0x295
[189597.727757] [<c0273111>] netif_receive_skb+0x283/0x352
[189597.727930] [<c0273266>] process_backlog+0x86/0x106
[189597.728081] [<c027336e>] net_rx_action+0x88/0x160
[189597.728229] [<c011fb34>] __do_softirq+0xbc/0xce
[189597.728443] [<c011fb78>] do_softirq+0x32/0x34
[189597.728561] [<c0104dcb>] do_IRQ+0x3b/0x66
[189597.728685] [<c01034f2>] common_interrupt+0x1a/0x20
[189597.728817] [<c0228e61>] start_request+0xfc/0x201
[189597.728955] [<c0229190>] ide_do_request+0x209/0x351
[189597.729163] [<c01c072a>] as_work_handler+0x40/0x42
[189597.729296] [<c01297a6>] worker_thread+0x185/0x21a
[189597.729433] [<c01c06ea>] as_work_handler+0x0/0x42
[189597.729570] [<c011770a>] default_wake_function+0x0/0xc
[189597.729711] [<c011770a>] default_wake_function+0x0/0xc
[189597.729856] [<c0129621>] worker_thread+0x0/0x21a
[189597.730083] [<c012d2ce>] kthread+0x9c/0xa1
[189597.730239] [<c012d232>] kthread+0x0/0xa1
[189597.730419] [<c0101065>] kernel_thread_helper+0x5/0xb


On Sun, Feb 12, 2006 at 07:50:48PM +0100, Charles-Edouard Ruault wrote:
> Hi All,
> i was trying to rip a CD when the whole machine started to freeze
> periodicaly, i then looked at the logs and found the following :
>
> Feb 12 19:23:39 ruault kernel: hdc: irq timeout: status=0x80 { Busy }
> Feb 12 19:23:39 ruault kernel: ide: failed opcode was: unknown
> Feb 12 19:23:39 ruault kernel: hdd: status timeout: status=0x80 { Busy }
> Feb 12 19:23:39 ruault kernel: ide: failed opcode was: unknown
> Feb 12 19:23:39 ruault kernel: hdd: drive not ready for command
> Feb 12 19:23:39 ruault kernel: BUG: soft lockup detected on CPU#0!
> Feb 12 19:23:39 ruault kernel:
> Feb 12 19:23:39 ruault kernel: Pid: 0, comm: swapper
> Feb 12 19:23:39 ruault kernel: EIP: 0060:[<c0272b75>] CPU: 0
> Feb 12 19:23:39 ruault kernel: EIP is at ide_inb+0x5/0x10
> Feb 12 19:23:39 ruault kernel: EFLAGS: 00000216 Tainted: P
> (2.6.15.4)
> Feb 12 19:23:39 ruault kernel: EAX: 00000180 EBX: 01aefc21 ECX: cfd3beae
> EDX: 00000177
> Feb 12 19:23:39 ruault kernel: ESI: 00000088 EDI: c0419e30 EBP: c0419ec4
> DS: 007b ES: 007b
> Feb 12 19:23:39 ruault kernel: CR0: 8005003b CR2: b7274c4c CR3: 2086e000
> CR4: 000006d0
> Feb 12 19:23:39 ruault kernel: [<c0273656>] ide_wait_stat+0xf6/0x140
> Feb 12 19:23:39 ruault kernel: [<c0271cf6>] start_request+0xe6/0x2e0
> Feb 12 19:23:39 ruault kernel: [<c027220b>] ide_do_request+0x2eb/0x400
> Feb 12 19:23:39 ruault kernel: [<c0280ff0>] cdrom_newpc_intr+0x0/0x470
> Feb 12 19:23:39 ruault kernel: [<c027255c>] ide_timer_expiry+0xec/0x290
> Feb 12 19:23:39 ruault kernel: [<c0272470>] ide_timer_expiry+0x0/0x290
> Feb 12 19:23:39 ruault kernel: [<c0128afc>] run_timer_softirq+0xbc/0x210
> Feb 12 19:23:39 ruault kernel: [<c0123f12>] __do_softirq+0x52/0xa0
> Feb 12 19:23:39 ruault kernel: [<c0123f8d>] do_softirq+0x2d/0x40
> Feb 12 19:23:39 ruault kernel: [<c0124068>] irq_exit+0x38/0x40
> Feb 12 19:23:39 ruault kernel: [<c0105a77>] do_IRQ+0x37/0x70
> Feb 12 19:23:39 ruault kernel: [<c0103d4e>] common_interrupt+0x1a/0x20
> Feb 12 19:23:39 ruault kernel: [<c010105b>] default_idle+0x2b/0x60
> Feb 12 19:23:39 ruault kernel: [<c01010f5>] cpu_idle+0x45/0x80
> Feb 12 19:23:39 ruault kernel: [<c03b084f>] start_kernel+0x17f/0x1e0
> Feb 12 19:23:39 ruault kernel: [<c03b0390>] unknown_bootoption+0x0/0x1e0
> Feb 12 19:23:39 ruault kernel: hdc: status timeout: status=0x80 { Busy }
> Feb 12 19:23:39 ruault kernel: ide: failed opcode was: unknown
> Feb 12 19:23:39 ruault kernel: hdc: drive not ready for command
> Feb 12 19:23:39 ruault kernel: hdd: status timeout: status=0x80 { Busy }
> Feb 12 19:23:39 ruault kernel: ide: failed opcode was: unknown
> Feb 12 19:23:39 ruault kernel: hdd: drive not ready for command
> Feb 12 19:23:39 ruault kernel: hdd: ATAPI reset complete
> Feb 12 19:23:49 ruault kernel: hdc: irq timeout: status=0x80 { Busy }
> Feb 12 19:23:49 ruault kernel: ide: failed opcode was: unknown
> Feb 12 19:23:49 ruault kernel: hdd: status timeout: status=0x80 { Busy }
> Feb 12 19:24:09 ruault kernel: hdc: irq timeout: status=0x80 { Busy }
>
> After a few seconds, the whole machine was totally locked up and the
> only option i had was a hard reset.
> I was able to reproduce it once more.
>
> I'm running vanilla kernel 2.6.15.4 on an athlon 32 bits. here's generic
> info about my configuration :
>
> processor : 0
> vendor_id : AuthenticAMD
> cpu family : 6
> model : 8
> model name : AMD Athlon(TM) XP 2400+
> stepping : 1
> cpu MHz : 2000.448
> cache size : 256 KB
> fdiv_bug : no
> hlt_bug : no
> f00f_bug : no
> coma_bug : no
> fpu : yes
> fpu_exception : yes
> cpuid level : 1
> wp : yes
> flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge
> mca cmov pat pse36 mmx fxsr sse syscall mmxext 3dnowext 3dnow
> bogomips : 4004.40
>
> /sbin/lspci -v
> 00:00.0 Host bridge: VIA Technologies, Inc. VT8377 [KT400/KT600 AGP]
> Host Bridge
> Subsystem: ASUSTeK Computer Inc. A7V8X motherboard
> Flags: bus master, 66Mhz, medium devsel, latency 0
> Memory at f0000000 (32-bit, prefetchable) [size=128M]
> Capabilities: <available only to root>
>
> 00:01.0 PCI bridge: VIA Technologies, Inc. VT8235 PCI Bridge (prog-if 00
> [Normal decode])
> Flags: bus master, 66Mhz, medium devsel, latency 0
> Bus: primary=00, secondary=01, subordinate=01, sec-latency=0
> I/O behind bridge: 0000d000-0000dfff
> Memory behind bridge: d6800000-d7dfffff
> Prefetchable memory behind bridge: d8000000-efffffff
> Capabilities: <available only to root>
>
> 00:07.0 FireWire (IEEE 1394): VIA Technologies, Inc. IEEE 1394 Host
> Controller (rev 46) (prog-if 10 [OHCI])
> Subsystem: VIA Technologies, Inc. IEEE 1394 Host Controller
> Flags: bus master, stepping, medium devsel, latency 32, IRQ 20
> Memory at d6000000 (32-bit, non-prefetchable) [size=2K]
> I/O ports at b800 [size=128]
> Capabilities: <available only to root>
>
> 00:08.0 RAID bus controller: Promise Technology, Inc. PDC20376 (FastTrak
> 376) (rev 02)
> Subsystem: ASUSTeK Computer Inc. A7V8X motherboard
> Flags: bus master, 66Mhz, medium devsel, latency 96, IRQ 10
> I/O ports at b400 [size=64]
> I/O ports at b000 [size=16]
> I/O ports at a800 [size=128]
> Memory at d5800000 (32-bit, non-prefetchable) [size=4K]
> Memory at d5000000 (32-bit, non-prefetchable) [size=128K]
> Capabilities: <available only to root>
>
> 00:09.0 Ethernet controller: Broadcom Corporation BCM4401 100Base-T (rev 01)
> Subsystem: ASUSTeK Computer Inc. A7V8X motherboard
> Flags: bus master, fast devsel, latency 32, IRQ 16
> Memory at d4800000 (32-bit, non-prefetchable) [size=8K]
> Expansion ROM at d7ff0000 [disabled] [size=16K]
> Capabilities: <available only to root>
>
> 00:0d.0 Ethernet controller: Intel Corporation 82557/8/9 [Ethernet Pro
> 100] (rev 08)
> Subsystem: Intel Corporation EtherExpress PRO/100+ Management
> Adapter
> Flags: bus master, medium devsel, latency 32, IRQ 17
> Memory at d4000000 (32-bit, non-prefetchable) [size=4K]
> I/O ports at a400 [size=64]
> Memory at d3800000 (32-bit, non-prefetchable) [size=1M]
> [virtual] Expansion ROM at 50000000 [disabled] [size=1M]
> Capabilities: <available only to root>
>
> 00:10.0 USB Controller: VIA Technologies, Inc. VT82xxxxx UHCI USB 1.1
> Controller (rev 80) (prog-if 00 [UHCI])
> Subsystem: ASUSTeK Computer Inc. VT6202 USB2.0 4 port controller
> Flags: bus master, medium devsel, latency 32, IRQ 19
> I/O ports at a000 [size=32]
> Capabilities: <available only to root>
>
> 00:10.1 USB Controller: VIA Technologies, Inc. VT82xxxxx UHCI USB 1.1
> Controller (rev 80) (prog-if 00 [UHCI])
> Subsystem: ASUSTeK Computer Inc. VT6202 USB2.0 4 port controller
> Flags: bus master, medium devsel, latency 32, IRQ 19
> I/O ports at 9800 [size=32]
> Capabilities: <available only to root>
>
> 00:10.2 USB Controller: VIA Technologies, Inc. VT82xxxxx UHCI USB 1.1
> Controller (rev 80) (prog-if 00 [UHCI])
> Subsystem: ASUSTeK Computer Inc. VT6202 USB2.0 4 port controller
> Flags: bus master, medium devsel, latency 32, IRQ 19
> I/O ports at 9400 [size=32]
> Capabilities: <available only to root>
>
> 00:10.3 USB Controller: VIA Technologies, Inc. USB 2.0 (rev 82) (prog-if
> 20 [EHCI])
> Subsystem: ASUSTeK Computer Inc. A7V8X motherboard
> Flags: bus master, medium devsel, latency 32, IRQ 19
> Memory at d3000000 (32-bit, non-prefetchable) [size=256]
> Capabilities: <available only to root>
>
> 00:11.0 ISA bridge: VIA Technologies, Inc. VT8235 ISA Bridge
> Subsystem: ASUSTeK Computer Inc. A7V8X motherboard
> Flags: bus master, stepping, medium devsel, latency 0
> Capabilities: <available only to root>
>
> 00:11.1 IDE interface: VIA Technologies, Inc.
> VT82C586A/B/VT82C686/A/B/VT823x/A/C PIPC Bus Master IDE (rev 06)
> (prog-if 8a [Master SecP PriP])
> Subsystem: ASUSTeK Computer Inc. A7V8X / A7V333 motherboard
> Flags: bus master, medium devsel, latency 32, IRQ 255
> I/O ports at 9000 [size=16]
> Capabilities: <available only to root>
>
> 00:11.5 Multimedia audio controller: VIA Technologies, Inc.
> VT8233/A/8235/8237 AC97 Audio Controller (rev 50)
> Subsystem: ASUSTeK Computer Inc. A7V8X Motherboard (Realtek
> ALC650 codec)
> Flags: medium devsel, IRQ 18
> I/O ports at e000 [size=256]
> Capabilities: <available only to root>
>
> 01:00.0 VGA compatible controller: ATI Technologies Inc Radeon R300 NE
> [Radeon 9500 Pro] (prog-if 00 [VGA])
> Subsystem: ATI Technologies Inc Radeon R300 NE [Radeon 9500 Pro]
> Flags: bus master, stepping, 66Mhz, medium devsel, latency 255,
> IRQ 17
> Memory at e8000000 (32-bit, prefetchable) [size=128M]
> I/O ports at d800 [size=256]
> Memory at d7000000 (32-bit, non-prefetchable) [size=64K]
> Expansion ROM at e7fe0000 [disabled] [size=128K]
> Capabilities: <available only to root>
>
> 01:00.1 Display controller: ATI Technologies Inc Radeon R300 [Radeon
> 9500 Pro] (Secondary)
> Subsystem: ATI Technologies Inc Radeon R300 NE [Radeon 9500 Pro]
> Flags: bus master, stepping, 66Mhz, medium devsel, latency 64
> Memory at d8000000 (32-bit, prefetchable) [size=128M]
> Memory at d6800000 (32-bit, non-prefetchable) [size=64K]
> Capabilities: <available only to root>
>
> /sbin/lsmod
> Module Size Used by
> rfcomm 36504 0
> l2cap 24196 5 rfcomm
> bluetooth 44900 4 rfcomm,l2cap
> deflate 3328 0
> zlib_deflate 23576 1 deflate
> twofish 47104 0
> serpent 20224 0
> aes_i586 38272 0
> blowfish 8192 0
> des 16384 0
> lp 10696 0
> sha256 10880 0
> sha1 2816 0
> crypto_null 2560 0
> xfrm4_tunnel 3204 0
> ipcomp 7176 0
> esp4 7168 0
> ah4 5632 0
> af_key 31248 0
> autofs4 18052 1
> vmnet 35748 3
> parport_pc 38852 1
> parport 35528 2 lp,parport_pc
> vmmon 111468 0
> sunrpc 139836 1
> ipt_MARK 2432 1
> iptable_mangle 2560 1
> ipt_MASQUERADE 3072 1
> iptable_nat 6916 1
> ip_nat 17068 2 ipt_MASQUERADE,iptable_nat
> ipt_LOG 6656 1
> ipt_limit 2432 1
> ipt_mark 1664 1
> ipt_state 1920 1
> ip_conntrack 48556 4 ipt_MASQUERADE,iptable_nat,ip_nat,ipt_state
> nfnetlink 5272 2 ip_nat,ip_conntrack
> iptable_filter 2560 1
> ip_tables 21120 9
> ipt_MARK,iptable_mangle,ipt_MASQUERADE,iptable_nat,ipt_LOG,ipt_limit,ipt_mark,ipt_state,iptable_filter
> binfmt_misc 9864 1
> fglrx 454604 7
> ipv6 259840 23
> sd_mod 17296 0
> dm_mod 54072 0
> video 14468 0
> button 5264 0
> ac 3716 0
> quickcam 77348 0
> videodev 7936 1 quickcam
> usbhid 37344 0
> usb_storage 32516 0
> scsi_mod 126824 2 sd_mod,usb_storage
> ohci1394 33460 0
> ieee1394 92600 1 ohci1394
> uhci_hcd 31888 0
> ehci_hcd 33416 0
> via_agp 8320 1
> agpgart 29768 2 fglrx,via_agp
> shpchp 44224 0
> i2c_viapro 7828 0
> i2c_core 18192 1 i2c_viapro
> snd_via82xx 24468 4
> snd_ac97_codec 94112 1 snd_via82xx
> snd_ac97_bus 2176 1 snd_ac97_codec
> snd_seq_dummy 3076 0
> snd_seq_oss 34176 0
> snd_seq_midi_event 6784 1 snd_seq_oss
> snd_seq 52112 5 snd_seq_dummy,snd_seq_oss,snd_seq_midi_event
> snd_pcm_oss 50848 0
> snd_mixer_oss 18048 1 snd_pcm_oss
> snd_pcm 85128 5 snd_via82xx,snd_ac97_codec,snd_pcm_oss
> snd_timer 22788 3 snd_seq,snd_pcm
> snd_page_alloc 8968 2 snd_via82xx,snd_pcm
> snd_mpu401_uart 6784 1 snd_via82xx
> snd_rawmidi 22176 1 snd_mpu401_uart
> snd_seq_device 7436 4
> snd_seq_dummy,snd_seq_oss,snd_seq,snd_rawmidi
> snd 49252 16
> snd_via82xx,snd_ac97_codec,snd_seq_oss,snd_seq,snd_pcm_oss,snd_mixer_oss,snd_pcm,snd_timer,snd_mpu401_uart,snd_rawmidi,snd_seq_device
> soundcore 8416 1 snd
> eepro100 27920 0
> b44 23948 0
> mii 5376 2 eepro100,b44
> floppy 60100 0
> usbcore 120836 6
> quickcam,usbhid,usb_storage,uhci_hcd,ehci_hcd
>
> relevant info regarding drives etc :
>
> ide: Assuming 33MHz system bus speed for PIO modes; override with idebus=xx
> Feb 12 19:37:12 ruault kernel: VP_IDE: IDE controller at PCI slot
> 0000:00:11.1
> Feb 12 19:37:12 ruault kernel: ACPI: PCI Interrupt 0000:00:11.1[A]: no GSI
> Feb 12 19:37:12 ruault kernel: PCI: Via IRQ fixup for 0000:00:11.1, from
> 255 to 15
> Feb 12 19:37:12 ruault kernel: VP_IDE: chipset revision 6
> Feb 12 19:37:12 ruault kernel: VP_IDE: not 100% native mode: will probe
> irqs later
> Feb 12 19:37:12 ruault kernel: VP_IDE: VIA vt8235 (rev 00) IDE UDMA133
> controller on pci0000:00:11.1
> Feb 12 19:37:12 ruault kernel: ide0: BM-DMA at 0x9000-0x9007, BIOS
> settings: hda:DMA, hdb:pio
> Feb 12 19:37:12 ruault kernel: ide1: BM-DMA at 0x9008-0x900f, BIOS
> settings: hdc:DMA, hdd:DMA
> Feb 12 19:37:12 ruault kernel: hda: Maxtor 6Y080L0, ATA DISK drive
> Feb 12 19:37:12 ruault kernel: ide0 at 0x1f0-0x1f7,0x3f6 on irq 14
> Feb 12 19:37:12 ruault kernel: hdc: PLEXTOR CD-R PX-W1610A, ATAPI
> CD/DVD-ROM drive
> Feb 12 19:37:12 ruault kernel: hdd: _NEC DVD_RW ND-3500AG, ATAPI
> CD/DVD-ROM drive
> Feb 12 19:37:12 ruault kernel: ide1 at 0x170-0x177,0x376 on irq 15
> Feb 12 19:37:12 ruault kernel: hda: max request size: 128KiB
> Feb 12 19:37:12 ruault kernel: hda: 160086528 sectors (81964 MB)
> w/2048KiB Cache, CHS=65535/16/63, UDMA(133)
> Feb 12 19:37:12 ruault kernel: hda: cache flushes supported
> Feb 12 19:37:12 ruault kernel: hda: hda1 hda2 hda3 hda4 < hda5 hda6
> hda7 hda8 hda9 hda10 >
> Feb 12 19:37:12 ruault kernel: hdc: ATAPI 40X CD-ROM CD-R/RW drive,
> 2048kB Cache, DMA
> Feb 12 19:37:12 ruault kernel: Uniform CD-ROM driver Revision: 3.20
> Feb 12 19:37:12 ruault kernel: hdd: ATAPI 32X DVD-ROM DVD-R CD-R/RW
> drive, 2048kB Cache, UDMA(33)
> Feb 12 19:37:12 ruault kernel: ide-floppy driver 0.99.newide
>
>
> I'm also attaching my kernel config.
>
> Let me know if you need anything else to troobleshoot the problem.
> Regards.
>
> PS: i'm not on the list so please CC me when answering !
> Thanks
>
> --
> Charles-Edouard Ruault
> PGP Key ID E4D2B80C
>


Folkert van Heusden

--
www.vanheusden.com/multitail - multitail is tail on steroids. multiple
windows, filtering, coloring, anything you can think of
----------------------------------------------------------------------
Phone: +31-6-41278122, PGP-key: 1F28D8AE, www.vanheusden.com
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majo...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

Lee Revell

unread,
Feb 14, 2006, 11:40:13 AM2/14/06
to
On Tue, 2006-02-14 at 12:41 +0100, Folkert van Heusden wrote:
> I've got the same soft lockups, not while cdrom access but when sa-learn
> runs (from spamassassin):

Why is DMA disabled for your IDE disks?

Lee

Folkert van Heusden

unread,
Feb 15, 2006, 10:10:17 AM2/15/06
to
> > I've got the same soft lockups, not while cdrom access but when sa-learn
> > runs (from spamassassin):
> Why is DMA disabled for your IDE disks?

root@muur:/var/www/htdocs# hdparm -i /dev/hda

/dev/hda:

Model=QUANTUM FIREBALLP AS20.5, FwRev=A2R.0600, SerialNo=692124031358
Config={ HardSect NotMFM HdSw>15uSec Fixed DTR>10Mbs }
RawCHS=16383/16/63, TrkSize=32256, SectSize=21298, ECCbytes=4
BuffType=DualPortCache, BuffSize=1902kB, MaxMultSect=16, MultSect=16
CurCHS=16383/16/63, CurSects=16514064, LBA=yes, LBAsects=40132503
IORDY=on/off, tPIO={min:120,w/IORDY:120}, tDMA={min:120,rec:120}
PIO modes: pio0 pio1 pio2 pio3 pio4
DMA modes: mdma0 mdma1 mdma2
UDMA modes: udma0 udma1 udma2 udma3 udma4 *udma5
AdvancedPM=no WriteCache=enabled
Drive conforms to: ATA/ATAPI-5 T13 1321D revision 1:

* signifies the current active mode

On the other hand:
root@muur:/var/www/htdocs# hdparm -d /dev/hda

/dev/hda:
using_dma = 0 (off)

That is puzzling me as I explicitly swith that on in the startupscripts:
0 root@muur:/var/www/htdocs# grep hdparm /etc/rc.d/*
/etc/rc.d/rc.local:/usr/sbin/hdparm -c 3 -d 1 -X 69 -u 1 /dev/hda &

So it seems it switches back to pio somewhere in time.

But on the other hand: shouldn't pio run ok as well without softlockup
errors?


Folkert van Heusden

--
www.biglumber.com <- site where one can exchange PGP key signatures

----------------------------------------------------------------------
Phone: +31-6-41278122, PGP-key: 1F28D8AE, www.vanheusden.com

Lee Revell

unread,
Feb 15, 2006, 1:30:26 PM2/15/06
to
On Wed, 2006-02-15 at 16:07 +0100, Folkert van Heusden wrote:
>
> So it seems it switches back to pio somewhere in time.
>
> But on the other hand: shouldn't pio run ok as well without softlockup
> errors?

AFAICT in PIO mode if a huge IO request is submitted and/or the device
is slow to respond there's nothing to prevent us spending a long time in
the IDE driver.

IIRC this can't be made preemptible due to data corruption issues on
some hardware. Maybe we should touch the watchdog in there.

Lee

Alan Cox

unread,
Feb 15, 2006, 8:40:09 PM2/15/06
to
On Mer, 2006-02-15 at 13:19 -0500, Lee Revell wrote:
> AFAICT in PIO mode if a huge IO request is submitted and/or the device
> is slow to respond there's nothing to prevent us spending a long time in
> the IDE driver.

IDE is basically bits of ISA bus decode stuffed down a cable and then
sped up/fancified later. In PIO0 each cycle is pretty slow and we may
issue quite a few in one transfer. If IORDY is enabled as is normally
the case the drive also has the power to tell us to shut up and can do
so for as long as it likes.

> IIRC this can't be made preemptible due to data corruption issues on
> some hardware. Maybe we should touch the watchdog in there.

Some chips disable IRQs here and need to so that the stream of data
doesn't stall and stuff fall apart. Touching the watchdog would be
sensible.

Possibly the same is needed in the new libata pata bits too

Andrew Morton

unread,
Feb 15, 2006, 10:00:15 PM2/15/06
to
Charles-Edouard Ruault <c...@ruault.com> wrote:
>
> i was trying to rip a CD when the whole machine started to freeze
> periodicaly, i then looked at the logs and found the following :
>
> Feb 12 19:23:39 ruault kernel: hdc: irq timeout: status=0x80 { Busy }
> Feb 12 19:23:39 ruault kernel: ide: failed opcode was: unknown
> Feb 12 19:23:39 ruault kernel: hdd: status timeout: status=0x80 { Busy }
> Feb 12 19:23:39 ruault kernel: ide: failed opcode was: unknown
> Feb 12 19:23:39 ruault kernel: hdd: drive not ready for command

No idea what caused that.

> Feb 12 19:23:39 ruault kernel: BUG: soft lockup detected on CPU#0!

The following was merged today. Hopefully it suppresses this false
positive.

--- devel/drivers/ide/ide-taskfile.c~ide-touch-softlockup-detector-during-pio 2006-02-15 14:57:05.000000000 -0800
+++ devel-akpm/drivers/ide/ide-taskfile.c 2006-02-15 14:57:05.000000000 -0800
@@ -34,6 +34,7 @@
#include <linux/kernel.h>
#include <linux/timer.h>
#include <linux/mm.h>
+#include <linux/sched.h>
#include <linux/interrupt.h>
#include <linux/major.h>
#include <linux/errno.h>
@@ -314,6 +315,8 @@ static void ide_pio_datablock(ide_drive_
if (rq->bio) /* fs request */
rq->errors = 0;

+ touch_softlockup_watchdog();
+
switch (drive->hwif->data_phase) {
case TASKFILE_MULTI_IN:
case TASKFILE_MULTI_OUT:
_

Charles-Edouard Ruault

unread,
Feb 16, 2006, 4:50:14 AM2/16/06
to
Andrew Morton wrote:

>Charles-Edouard Ruault <c...@ruault.com> wrote:
>
>
>>i was trying to rip a CD when the whole machine started to freeze
>> periodicaly, i then looked at the logs and found the following :
>>
>> Feb 12 19:23:39 ruault kernel: hdc: irq timeout: status=0x80 { Busy }
>> Feb 12 19:23:39 ruault kernel: ide: failed opcode was: unknown
>> Feb 12 19:23:39 ruault kernel: hdd: status timeout: status=0x80 { Busy }
>> Feb 12 19:23:39 ruault kernel: ide: failed opcode was: unknown
>> Feb 12 19:23:39 ruault kernel: hdd: drive not ready for command
>>
>>
>
>No idea what caused that.
>
>
>
>> Feb 12 19:23:39 ruault kernel: BUG: soft lockup detected on CPU#0!
>>
>>
>
>The following was merged today. Hopefully it suppresses this false
>positive.
>
>

From what i understand it will fix the problem only if the drive is in
PIO mode, which is the case for Folkert van Heusden, who reported the
same BUG output.
However it does not appear that my cdrom drives are using PIO, from the
logs i have they're supposed to use DMA :

>Feb 12 19:37:12 ruault kernel: hdc: ATAPI 40X CD-ROM CD-R/RW drive,2048kB Cache, DMA


> Feb 12 19:37:12 ruault kernel: Uniform CD-ROM driver Revision: 3.20
> Feb 12 19:37:12 ruault kernel: hdd: ATAPI 32X DVD-ROM DVD-R CD-R/RW
> drive, 2048kB Cache, UDMA(33)

sudo /sbin/hdparm -i /dev/hdc

/dev/hdc:

Model=PLEXTOR CD-R PX-W1610A, FwRev=1.05, SerialNo=
Config={ Fixed Removeable DTR<=5Mbs DTR>10Mbs nonMagnetic }
RawCHS=0/0/0, TrkSize=0, SectSize=0, ECCbytes=0
BuffType=unknown, BuffSize=0kB, MaxMultSect=0
(maybe): CurCHS=0/0/0, CurSects=0, LBA=yes, LBAsects=0
IORDY=on/off, tPIO={min:180,w/IORDY:120}, tDMA={min:120,rec:120}


PIO modes: pio0 pio1 pio2 pio3 pio4

DMA modes: sdma0 sdma1 sdma2 mdma0 mdma1 *mdma2
AdvancedPM=no

sudo /sbin/hdparm -i /dev/hdd

/dev/hdd:

Model=_NEC DVD_RW ND-3500AG, FwRev=2.1A, SerialNo=
Config={ Removeable DTR<=5Mbs DTR>10Mbs nonMagnetic }
RawCHS=0/0/0, TrkSize=0, SectSize=0, ECCbytes=0
BuffType=unknown, BuffSize=0kB, MaxMultSect=0
(maybe): CurCHS=0/0/0, CurSects=0, LBA=yes, LBAsects=0
IORDY=yes, tPIO={min:120,w/IORDY:120}, tDMA={min:120,rec:120}


PIO modes: pio0 pio1 pio2 pio3 pio4
DMA modes: mdma0 mdma1 mdma2

UDMA modes: udma0 udma1 *udma2
AdvancedPM=no


So i think this patch won't do me any good :(
any other idea ????

Thanks for the answer.

--
Charles-Edouard Ruault
GPG key Id E4D2B80C

Folkert van Heusden

unread,
Feb 16, 2006, 5:40:08 AM2/16/06
to
> From what i understand it will fix the problem only if the drive is in
> PIO mode, which is the case for Folkert van Heusden, who reported the
> same BUG output.
> However it does not appear that my cdrom drives are using PIO, from the
> logs i have they're supposed to use DMA :
> >Feb 12 19:37:12 ruault kernel: hdc: ATAPI 40X CD-ROM CD-R/RW drive,2048kB Cache, DMA
> > Feb 12 19:37:12 ruault kernel: Uniform CD-ROM driver Revision: 3.20
> > Feb 12 19:37:12 ruault kernel: hdd: ATAPI 32X DVD-ROM DVD-R CD-R/RW
> > drive, 2048kB Cache, UDMA(33)
> sudo /sbin/hdparm -i /dev/hdc
> /dev/hdc:
> DMA modes: sdma0 sdma1 sdma2 mdma0 mdma1 *mdma2

Try /sbin/hdparm -d /dev/hdc
and see if it returns "using_dma = 1 (on)".
I noticed that altough -i says using dma, that -d tells me it really is
off.


Folkert van Heusden

--
www.biglumber.com <- site where one can exchange PGP key signatures
----------------------------------------------------------------------
Phone: +31-6-41278122, PGP-key: 1F28D8AE, www.vanheusden.com

Andrew Morton

unread,
Feb 16, 2006, 5:50:14 AM2/16/06
to
Folkert van Heusden <fol...@vanheusden.com> wrote:
>
> I noticed that altough -i says using dma, that -d tells me it really is
> off.

Yup, `hdparm -i' tells you what the drive _can_ do, not what it's actually
doing.

Charles-Edouard Ruault

unread,
Feb 16, 2006, 6:20:15 AM2/16/06
to
Andrew Morton wrote:

>Folkert van Heusden <fol...@vanheusden.com> wrote:
>
>
>>I noticed that altough -i says using dma, that -d tells me it really is
>> off.
>>
>>
>
>Yup, `hdparm -i' tells you what the drive _can_ do, not what it's actually
>doing.
>
>
>

here's what i have:

sudo /sbin/hdparm -d /dev/hdd
/dev/hdd:
using_dma = 1 (on)

sudo /sbin/hdparm -d /dev/hdc
/dev/hdc:
using_dma = 1 (on)

So it appears that both CDROM/DVDROM drives are using DMA, which
confirms what i was thinking (since the traceback i have is not the same
as the one Folker provided which clearly indicates that it happens
during a pio operation ).

--
Charles-Edouard Ruault
GPG key Id E4D2B80C

-

Alan Cox

unread,
Feb 16, 2006, 8:00:13 AM2/16/06
to
On Iau, 2006-02-16 at 10:44 +0100, Charles-Edouard Ruault wrote:
> >From what i understand it will fix the problem only if the drive is in
> PIO mode, which is the case for Folkert van Heusden, who reported the
> same BUG output.

Timeouts will cause PIO fallbacks as the driver tries to beat the device
back into sanity so it seems reasonable to expect the fix to work in
this case.

Bartlomiej Zolnierkiewicz

unread,
Feb 16, 2006, 8:40:16 AM2/16/06
to
On 2/16/06, Andrew Morton <ak...@osdl.org> wrote:
> Charles-Edouard Ruault <c...@ruault.com> wrote:
> >
> > i was trying to rip a CD when the whole machine started to freeze
> > periodicaly, i then looked at the logs and found the following :
> >
> > Feb 12 19:23:39 ruault kernel: hdc: irq timeout: status=0x80 { Busy }
> > Feb 12 19:23:39 ruault kernel: ide: failed opcode was: unknown
> > Feb 12 19:23:39 ruault kernel: hdd: status timeout: status=0x80 { Busy }
> > Feb 12 19:23:39 ruault kernel: ide: failed opcode was: unknown
> > Feb 12 19:23:39 ruault kernel: hdd: drive not ready for command
>
> No idea what caused that.
>
> > Feb 12 19:23:39 ruault kernel: BUG: soft lockup detected on CPU#0!
>
> The following was merged today. Hopefully it suppresses this false
> positive.

Unfortunately it won't. Charles' problem is different (and the BUG
output is different!) - soft lockup got triggered for PIO handling in
ide-cd driver. This patch fixes the problem only for generic ATA PIO
routines (disks and [P]IDENTIFY), ATAPI PIO still needs fixing
(ide-cd/floppy/tape/scsi drivers).

Andrew, there is no "high level" function for ATAPI PIO as
ide_pio_datablock() for ATA PIO so fixing soft lockup false positives
would require going through all drivers and adding bunch of
touch_softlockup_watchdog() or using sledge-hammer approach
and touching watchdog in ide-iops.c:atapi_[input,output]_bytes().

BTW Charles' OOPS is for tainted (P) kernel (fglrx loaded)

Bartlomiej

Andrew Morton

unread,
Feb 16, 2006, 3:30:28 PM2/16/06
to
Bartlomiej Zolnierkiewicz <bzol...@gmail.com> wrote:
>
> On 2/16/06, Andrew Morton <ak...@osdl.org> wrote:
> > Charles-Edouard Ruault <c...@ruault.com> wrote:
> > >
> > > i was trying to rip a CD when the whole machine started to freeze
> > > periodicaly, i then looked at the logs and found the following :
> > >
> > > Feb 12 19:23:39 ruault kernel: hdc: irq timeout: status=0x80 { Busy }
> > > Feb 12 19:23:39 ruault kernel: ide: failed opcode was: unknown
> > > Feb 12 19:23:39 ruault kernel: hdd: status timeout: status=0x80 { Busy }
> > > Feb 12 19:23:39 ruault kernel: ide: failed opcode was: unknown
> > > Feb 12 19:23:39 ruault kernel: hdd: drive not ready for command
> >
> > No idea what caused that.
> >
> > > Feb 12 19:23:39 ruault kernel: BUG: soft lockup detected on CPU#0!
> >
> > The following was merged today. Hopefully it suppresses this false
> > positive.
>
> Unfortunately it won't. Charles' problem is different (and the BUG
> output is different!) - soft lockup got triggered for PIO handling in
> ide-cd driver. This patch fixes the problem only for generic ATA PIO
> routines (disks and [P]IDENTIFY), ATAPI PIO still needs fixing
> (ide-cd/floppy/tape/scsi drivers).

argh. We do need to bop that warning on the head - it's consuming people's
time and causing unneeded concern.

> Andrew, there is no "high level" function for ATAPI PIO as
> ide_pio_datablock() for ATA PIO so fixing soft lockup false positives
> would require going through all drivers and adding bunch of
> touch_softlockup_watchdog() or using sledge-hammer approach
> and touching watchdog in ide-iops.c:atapi_[input,output]_bytes().

Send fixup patch, please?

Bartlomiej Zolnierkiewicz

unread,
Feb 17, 2006, 6:50:12 AM2/17/06
to
On 2/16/06, Andrew Morton <ak...@osdl.org> wrote:
> Bartlomiej Zolnierkiewicz <bzol...@gmail.com> wrote:
> >
> > On 2/16/06, Andrew Morton <ak...@osdl.org> wrote:
> > > Charles-Edouard Ruault <c...@ruault.com> wrote:
> > > >
> > > > i was trying to rip a CD when the whole machine started to freeze
> > > > periodicaly, i then looked at the logs and found the following :
> > > >
> > > > Feb 12 19:23:39 ruault kernel: hdc: irq timeout: status=0x80 { Busy }
> > > > Feb 12 19:23:39 ruault kernel: ide: failed opcode was: unknown
> > > > Feb 12 19:23:39 ruault kernel: hdd: status timeout: status=0x80 { Busy }
> > > > Feb 12 19:23:39 ruault kernel: ide: failed opcode was: unknown
> > > > Feb 12 19:23:39 ruault kernel: hdd: drive not ready for command
> > >
> > > No idea what caused that.
> > >
> > > > Feb 12 19:23:39 ruault kernel: BUG: soft lockup detected on CPU#0!
> > >
> > > The following was merged today. Hopefully it suppresses this false
> > > positive.
> >
> > Unfortunately it won't. Charles' problem is different (and the BUG
> > output is different!) - soft lockup got triggered for PIO handling in
> > ide-cd driver. This patch fixes the problem only for generic ATA PIO
> > routines (disks and [P]IDENTIFY), ATAPI PIO still needs fixing
> > (ide-cd/floppy/tape/scsi drivers).
>
> argh. We do need to bop that warning on the head - it's consuming people's
> time and causing unneeded concern.

Yes, it is definitely consuming time...
this is 3-rd time we hit false-positive in IDE subsystem.

"soft lockup during probe" problem was fixed by touching
watchdog. I believe proper fix would require using msleep() instead
of mdelay() and we knew this before soft lockup issue.

So from my IDE work POV it was just wasted time...

"ATA PIO" problem was handled in the same way.
I'm still not 100% sure if it was false positive - it looked like
from the trace but I find it hard to believe that users wouldn't
complain about 10sec stalls [ Soft lockup detector claims to
trigger if after 10sec it hasn't been touched - is it really working
as advertised? How can we verify this? ].

Again it only wasted time from IDE POV.

And now ATAPI PIO issues requires me to audit all ATAPI
device drivers to quiet soft lockup detector...

> > Andrew, there is no "high level" function for ATAPI PIO as
> > ide_pio_datablock() for ATA PIO so fixing soft lockup false positives
> > would require going through all drivers and adding bunch of
> > touch_softlockup_watchdog() or using sledge-hammer approach
> > and touching watchdog in ide-iops.c:atapi_[input,output]_bytes().
>
> Send fixup patch, please?

Other things have much bigger priority than quieting soft lockup
detector because somebody decided upon driver maintainers that
soft lockup during hardware access is a _major_ problem (-> system
stays locked) and needs to be fixed ASAP (-> it is enabled by default).

This feature could be useful but why driver maintainers are _forced_ to
pay maintenance cost for it?

I can send you another sledge-hammer fix which is quite disgusting as
it touches watchdog on every ATAPI related PIO access (i.e. many times
in ATAPI PIO IRQ handlers, during packet command transfer).

Alternatively I can send you patch making soft lockup detector depend on
"CONFIG_IDE=n".

Bartlomiej

Ingo Molnar

unread,
Feb 17, 2006, 8:20:16 AM2/17/06
to

* Bartlomiej Zolnierkiewicz <bzol...@gmail.com> wrote:

> I'm still not 100% sure if it was false positive - it looked like from
> the trace but I find it hard to believe that users wouldn't complain
> about 10sec stalls [ Soft lockup detector claims to trigger if after
> 10sec it hasn't been touched - is it really working as advertised?
> How can we verify this? ].

the watchdog is quite simple: it consists of per-CPU SCHED_FIFO prio 99
[i.e. highest RT priority] threads that do nothing but:

while (!kthread_should_stop()) {
msleep_interruptible(1000);
touch_softlockup_watchdog();
}

i can think of only one (pretty theoretical) scenario for a false
positive here: msleep uses timers, which are processed by softirq
context, which context itself might be delayed. Under extreme load, if
softirqs get delayed for more than 9 seconds, this _might_ lead to false
positives. But that i think is highly unlikely in the reported IDE
cases.

in any case, the patch below gets rid of the softirq involvement, and
makes the soft-watchdog purely timer-irq driven (and a few minor
cleanups). Could you try it? I have tested it - it correctly detected a
11-seconds delay and stayed silent during a 9-seconds delay.

If you still get warnings even with this patch applied, then my very
strong suspicion is that the 10+ seconds delays in the IDE code are
real, and not false-positives. If there are such places then the minimum
we should do is to document them via touch_softlockup_watchdog() ...
even if you "knew" about such places already.

Andrew, could we try this patch in -mm?

Ingo

---------------

this patch makes the softlockup detector purely timer-interrupt driven,
removing softirq-context (timer) dependencies. This means that if the
softlockup watchdog triggers, it has truly observed a longer than 10
seconds scheduling delay of a SCHED_FIFO prio 99 task.

(the patch also turns off the softlockup detector during the initial
bootup phase and does small style fixes)

Signed-off-by: Ingo Molnar <mi...@elte.hu>

----

kernel/softlockup.c | 47 +++++++++++++++++++++++------------------------
1 files changed, 23 insertions(+), 24 deletions(-)

Index: linux/kernel/softlockup.c
===================================================================
--- linux.orig/kernel/softlockup.c
+++ linux/kernel/softlockup.c
@@ -1,12 +1,11 @@
/*
* Detect Soft Lockups
*
- * started by Ingo Molnar, (C) 2005, Red Hat
+ * started by Ingo Molnar, Copyright (C) 2005, 2006 Red Hat, Inc.
*
* this code detects soft lockups: incidents in where on a CPU
* the kernel does not reschedule for 10 seconds or more.
*/
-
#include <linux/mm.h>
#include <linux/cpu.h>
#include <linux/init.h>
@@ -17,13 +16,14 @@

static DEFINE_SPINLOCK(print_lock);

-static DEFINE_PER_CPU(unsigned long, timestamp) = 0;
-static DEFINE_PER_CPU(unsigned long, print_timestamp) = 0;
+static DEFINE_PER_CPU(unsigned long, touch_timestamp);
+static DEFINE_PER_CPU(unsigned long, print_timestamp);
static DEFINE_PER_CPU(struct task_struct *, watchdog_task);

static int did_panic = 0;
-static int softlock_panic(struct notifier_block *this, unsigned long event,
- void *ptr)
+
+static int
+softlock_panic(struct notifier_block *this, unsigned long event, void *ptr)
{
did_panic = 1;

@@ -36,7 +36,7 @@ static struct notifier_block panic_block

void touch_softlockup_watchdog(void)
{
- per_cpu(timestamp, raw_smp_processor_id()) = jiffies;
+ per_cpu(touch_timestamp, raw_smp_processor_id()) = jiffies;
}
EXPORT_SYMBOL(touch_softlockup_watchdog);

@@ -47,17 +47,20 @@ EXPORT_SYMBOL(touch_softlockup_watchdog)
void softlockup_tick(struct pt_regs *regs)
{
int this_cpu = smp_processor_id();
- unsigned long timestamp = per_cpu(timestamp, this_cpu);
-
- if (per_cpu(print_timestamp, this_cpu) == timestamp)
- return;
+ unsigned long touch_timestamp = per_cpu(touch_timestamp, this_cpu);

- /* Do not cause a second panic when there already was one */
- if (did_panic)
+ /* Do not warn during bootup and prevent double reports: */
+ if (system_state != SYSTEM_RUNNING || did_panic ||
+ per_cpu(print_timestamp, this_cpu) == touch_timestamp)
return;

- if (time_after(jiffies, timestamp + 10*HZ)) {
- per_cpu(print_timestamp, this_cpu) = timestamp;
+ /* Wake up the high-prio watchdog task every second: */
+ if (time_after(jiffies, touch_timestamp + HZ))
+ wake_up_process(per_cpu(watchdog_task, this_cpu));
+
+ /* Warn about unreasonable 10+ seconds delays: */
+ if (time_after(jiffies, touch_timestamp + 10*HZ)) {
+ per_cpu(print_timestamp, this_cpu) = touch_timestamp;

spin_lock(&print_lock);
printk(KERN_ERR "BUG: soft lockup detected on CPU#%d!\n",
@@ -77,18 +80,16 @@ static int watchdog(void * __bind_cpu)
sched_setscheduler(current, SCHED_FIFO, &param);
current->flags |= PF_NOFREEZE;

- set_current_state(TASK_INTERRUPTIBLE);
-
/*
- * Run briefly once per second - if this gets delayed for
- * more than 10 seconds then the debug-printout triggers
- * in softlockup_tick():
+ * Run briefly once per second to reset the softlockup timestamp.
+ * If this gets delayed for more than 10 seconds then the
+ * debug-printout triggers in softlockup_tick().
*/
while (!kthread_should_stop()) {
- msleep_interruptible(1000);
+ set_current_state(TASK_INTERRUPTIBLE);
touch_softlockup_watchdog();
+ schedule();
}
- __set_current_state(TASK_RUNNING);

return 0;
}
@@ -114,7 +115,6 @@ cpu_callback(struct notifier_block *nfb,
kthread_bind(p, hotcpu);
break;
case CPU_ONLINE:
-
wake_up_process(per_cpu(watchdog_task, hotcpu));
break;
#ifdef CONFIG_HOTPLUG_CPU
@@ -146,4 +146,3 @@ __init void spawn_softlockup_task(void)

notifier_chain_register(&panic_notifier_list, &panic_block);
}
-

Bartlomiej Zolnierkiewicz

unread,
Feb 17, 2006, 9:50:05 AM2/17/06
to
On 2/17/06, Ingo Molnar <mi...@elte.hu> wrote:
>
> * Bartlomiej Zolnierkiewicz <bzol...@gmail.com> wrote:
>
> > I'm still not 100% sure if it was false positive - it looked like from
> > the trace but I find it hard to believe that users wouldn't complain
> > about 10sec stalls [ Soft lockup detector claims to trigger if after
> > 10sec it hasn't been touched - is it really working as advertised?
> > How can we verify this? ].
>
> the watchdog is quite simple: it consists of per-CPU SCHED_FIFO prio 99
> [i.e. highest RT priority] threads that do nothing but:
>
> while (!kthread_should_stop()) {
> msleep_interruptible(1000);
> touch_softlockup_watchdog();
> }
>
> i can think of only one (pretty theoretical) scenario for a false
> positive here: msleep uses timers, which are processed by softirq
> context, which context itself might be delayed. Under extreme load, if
> softirqs get delayed for more than 9 seconds, this _might_ lead to false
> positives. But that i think is highly unlikely in the reported IDE
> cases.

Actually it seems very likely [ if I'm reading the code right ]: ide_intr()
IRQ handler contains "optimization" that it calls ide_do_request() for
the next request if the previous request is completed. Processing the
next request involves waiting for device to become ready (up to 5sec)
and sending first chunk of data out in case of PIO-out protocol.
Moreover if the requests are short we can be hitting "optimization"
case few times in the row. Hard IRQs will still have their chance to be
processed but softirq context can be delayed quite a lot as it is executed
after hard IRQ handler completes.

> in any case, the patch below gets rid of the softirq involvement, and
> makes the soft-watchdog purely timer-irq driven (and a few minor
> cleanups). Could you try it? I have tested it - it correctly detected a
> 11-seconds delay and stayed silent during a 9-seconds delay.
>
> If you still get warnings even with this patch applied, then my very
> strong suspicion is that the 10+ seconds delays in the IDE code are
> real, and not false-positives. If there are such places then the minimum
> we should do is to document them via touch_softlockup_watchdog() ...
> even if you "knew" about such places already.

Fully agreed, where is the patch?

Sorry but I have enough more high priority issues to take care of and
I'm not going to spend any more time on soft lockups even if they are
really problems in IDE subsystem. If this is not fixed before 2.6.16
I'm submitting patch to Linus making DETECT_SOFTLOCKUP depend
on "CONFIG_IDE=n"... at least users will be able to use their systems
instead of seeing lockups.

DETECT_SOFTLOCKUP should be an aim in development not a
method of forcing driver maintainers to work on specific issues...

Thank you for understanding.

Bartlomiej

Ingo Molnar

unread,
Feb 17, 2006, 3:00:11 PM2/17/06
to

* Bartlomiej Zolnierkiewicz <bzol...@gmail.com> wrote:

> > If you still get warnings even with this patch applied, then my very
> > strong suspicion is that the 10+ seconds delays in the IDE code are
> > real, and not false-positives. If there are such places then the minimum
> > we should do is to document them via touch_softlockup_watchdog() ...
> > even if you "knew" about such places already.
>
> Fully agreed, where is the patch?

it's included in the mail you replied to :)

Ingo

Ingo Molnar

unread,
Feb 17, 2006, 3:20:06 PM2/17/06
to

* Bartlomiej Zolnierkiewicz <bzol...@gmail.com> wrote:

> Sorry but I have enough more high priority issues to take care of and
> I'm not going to spend any more time on soft lockups even if they are
> really problems in IDE subsystem. If this is not fixed before 2.6.16
> I'm submitting patch to Linus making DETECT_SOFTLOCKUP depend on
> "CONFIG_IDE=n"... at least users will be able to use their systems
> instead of seeing lockups.

i have lots of IDE based systems (they dont use PIO though) and i'm not
seeing these. I'll oppose such a patch if it's to hide genuine issues -
the 10 seconds tolerance is already generous i think. I'll of course fix
any false positives which are the fault of the softlockup-watchdog, but
from your mails it appears to me that the IDE warnings are indeed
genuine.

If the source of the delay is hard to fix you can temporarily work it
around in the code by putting in the touch_softlockup_watchdog() lines -
that will also document the places that cause long delays - which is a
good thing.

It is entirely feasible to put a touch_softlockup_watchdog() call into
every PIO OP - even a single-byte PIO related IN/OUT instruction takes a
couple of microseconds, so a touch_softlockup_watchdog() wont even show
up on the radar.

> DETECT_SOFTLOCKUP should be an aim in development not a method of
> forcing driver maintainers to work on specific issues...

well, 10+ seconds delays on a running system are not really acceptable,
and can cause other problems. The softlockup-watchdog is optional and
can be easily turned off in the .config.

Ingo

Bartlomiej Zolnierkiewicz

unread,
Feb 17, 2006, 5:20:14 PM2/17/06
to
On 2/17/06, Ingo Molnar <mi...@elte.hu> wrote:
>
> * Bartlomiej Zolnierkiewicz <bzol...@gmail.com> wrote:
>
> > Sorry but I have enough more high priority issues to take care of and
> > I'm not going to spend any more time on soft lockups even if they are
> > really problems in IDE subsystem. If this is not fixed before 2.6.16
> > I'm submitting patch to Linus making DETECT_SOFTLOCKUP depend on
> > "CONFIG_IDE=n"... at least users will be able to use their systems
> > instead of seeing lockups.
>
> i have lots of IDE based systems (they dont use PIO though) and i'm not
> seeing these. I'll oppose such a patch if it's to hide genuine issues -
> the 10 seconds tolerance is already generous i think. I'll of course fix
> any false positives which are the fault of the softlockup-watchdog, but
> from your mails it appears to me that the IDE warnings are indeed
> genuine.
>
> If the source of the delay is hard to fix you can temporarily work it
> around in the code by putting in the touch_softlockup_watchdog() lines -
> that will also document the places that cause long delays - which is a
> good thing.
>
> It is entirely feasible to put a touch_softlockup_watchdog() call into
> every PIO OP - even a single-byte PIO related IN/OUT instruction takes a
> couple of microseconds, so a touch_softlockup_watchdog() wont even show
> up on the radar.

OK, I'll just add touch_softlockup_watchdog() if needed but first lets
wait for results of your patch.

Note that I'll invest my time on this which could be invested into other
things and I don't see it as top-priority issue if you differ in your opinion
you should be the person fixing affected drivers.

The conclusion of the rant is: people making changes at higher layers
should start paying maintenance costs of their changes. Over few years
of maintaining IDE I learned quite a lot about block layer, VFS, VM, ACPI,
PM, IRQ routing, scheduling, sysfs etc (I'm not talking about interface
changes but about bugs/changes which are reported by end users
and driver maintainers are end-point). This is all good but distracts me
from my primary task and now it is turn for people hacking on generic
code to learn few driver specific things... :)

No wonder that nobody wants to hack drivers: less fame, more flames,
and actually besides knowing hardware you need to know a lot about
kernel in general to do your job right. I hope that Andrew is reading this.

End of whining.

> > DETECT_SOFTLOCKUP should be an aim in development not a method of
> > forcing driver maintainers to work on specific issues...
>
> well, 10+ seconds delays on a running system are not really acceptable,
> and can cause other problems. The softlockup-watchdog is optional and
> can be easily turned off in the .config.

It is "y" by default so anybody saying "y" to DEBUG_KERNEL will get it as
added bonus and moreover DEBUG_KERNEL is "y" in x86_64 defconfig.

Bartlomiej

Lee Revell

unread,
Feb 17, 2006, 6:00:22 PM2/17/06
to
On Thu, 2006-02-16 at 01:38 +0000, Alan Cox wrote:
> Some chips disable IRQs here and need to so that the stream of data
> doesn't stall and stuff fall apart. Touching the watchdog would be
> sensible.
>
> Possibly the same is needed in the new libata pata bits too

Are you saying this might hit on non-legacy, non-broken hardware?

Lee

Charles-Edouard Ruault

unread,
Mar 13, 2006, 9:50:10 AM3/13/06
to
Bartlomiej Zolnierkiewicz wrote:

Guys,
sorry for having been silent for so long ... since i had the problem, i
upgraded to 2.6.15.6 and i experienced the problem again over the week-end.
My point it that it appears that the BUG: soft lockup detected on CPU#0!
message is legitimate in my case since my machine really freezes
periodically because of ide timeouts.
What i'm trying to figure out is why are the IDE timeouts occuring ?
Both my DVD burner & CD burner usually work fine ( i had been able to
burn a DVD a few hours before the problem occured , i'm able to read CDs
& DVDs fine ).
However, the problem seems to occur only when trying to rip CDs. It
appears that both drives get confused .... ( see logs, both are
reporting timeouts/erros ).
Mar 11 17:41:47 ruault kernel: hdd: ATAPI reset
complete
Mar 11 17:41:57 ruault kernel: hdc: irq timeout: status=0x80 { Busy }
Mar 11 17:41:57 ruault kernel: ide: failed opcode was:
unknown
Mar 11 17:41:57 ruault kernel: hdd: status timeout: status=0x80 { Busy }
Mar 11 17:41:57 ruault kernel: ide: failed opcode was:
unknown
Mar 11 17:41:57 ruault kernel: hdd: drive not ready for command
Mar 11 17:41:57 ruault kernel: hdd: ATAPI reset
complete
Mar 11 17:42:17 ruault kernel: hdc: irq timeout: status=0x80 { Busy }
Mar 11 17:42:17 ruault kernel: ide: failed opcode was:
unknown
Mar 11 17:42:17 ruault kernel: hdd: status timeout: status=0x80 { Busy }
Mar 11 17:42:17 ruault kernel: ide: failed opcode was:
unknown
Mar 11 17:42:17 ruault kernel: hdd: drive not ready for command
Mar 11 17:42:17 ruault kernel: BUG: soft lockup detected on
CPU#0!
Mar 11 17:42:17 ruault kernel:
Mar 11 17:42:17 ruault kernel: Pid: 0, comm:
swapper
Mar 11 17:42:17 ruault kernel: EIP: 0060:[<c0272c55>] CPU: 0
Mar 11 17:42:17 ruault kernel: EIP is at
ide_inb+0x5/0x10

Mar 11 17:42:17 ruault kernel: EFLAGS: 00000206 Tainted: P
(2.6.15.6)
Mar 11 17:42:17 ruault kernel: EAX: 00000180 EBX: 03aba395 ECX: f2b9d721
EDX: 00000177
Mar 11 17:42:17 ruault kernel: ESI: 00000088 EDI: c0419e30 EBP: c0419ec4
DS: 007b ES: 007b
Mar 11 17:42:17 ruault kernel: CR0: 8005003b CR2: b7fe2000 CR3: 32e1b000
CR4: 000006d0

And i can confirm that when this happens the machine is totally
unresponsive, as before i had to reboot the machine with a hard reset
after a while.
So my question is what could be causing this behaviour ? is this a bug
in the IDE driver ? a hardware problem with my drives ( as i said they
appear to be working fine otherwise ), another hardware problem with my
motherboard ?
What could i do to help troubleshoot the problem ?
Thanks in advance.

PS: I do not subscribe to the list so please CC me when replying ....

--
Charles-Edouard Ruault
GPG key Id E4D2B80C

-

0 new messages