Linux 2.6.13-rc5

Linus Torvalds

Aug 2, 2005, 1:11:21 AM

to Linux Kernel Mailing List

Ok, one more in the series towards final 2.6.13.

This one is small enough that both shortlog and diffstat fit comfortably:
no big architecture updates or anything like that. Some input, x86-64 and
ppc updates, and various fairly small fixes in random places.

Some reverts back to 2.6.12 behaviour - you've seen the discussions, and
I'm sure we'll end up discussing things further for a long while still,
but the plan is to release 2.6.13 with known behaviour characteristics.

Give it a good testing, I'm hoping this can really turn into 2.6.13.

Linus

---

Adam Kropelin:
Input: HID - only report events coming from interrupts to hiddev

Adrian Bunk:
include/linux/dcookies.h: dummy functions must be "static inline"
SCSI_SATA has to be a tristate
USB: drivers/usb/net/: remove two unused multicast_filter_limit variables
DEBUG_FS must depend on SYSFS

Alan Stern:
USB: Usbcore: Don't try to delete unregistered interfaces
USB: usbfs: Don't leak uninitialized data

Alasdair G Kergon:
device-mapper: fix md->lock deadlocks in core
device-mapper: fix deadlocks in core
device-mapper: fix deadlocks in core (prep)

Alexander Nyberg:
x86_64: cpu hotplug changes kills nmi watchdog

Alexey Kuznetsov:
[NET]: fix oops after tunnel module unload

Andi Kleen:
x86_64: Remove unused variable in k8-bus.c
x86_64: Fix SRAT handling on non dual core systems
x86_64: Switch to the interrupt stack when running a softirq in local_bh_enable()
x86_64: Small assembly improvements
x86_64: Remove unnecessary include in fault.c
x86_64: When running cpuid4 need to run on the correct CPU
x86_64: Turn BUG data into valid instruction
x86_64: Support more than 8 cores on AMD systems
x86_64: Remove the broadcast options that were added for cpuhotplug
x86_64: Remove IA32_* build tools in Makefile
x86_64: Create per CPU machine check sysfs directories
x86_64: Print a boot message for hotplug memory zones
x86_64: Fix incorrectly defined MSR_K8_SYSCFG
x86_64: Fix some typos in system.h comments
x86_64: Remove obsolete eat_key prototype
x86_64: Fix some comments in tlbflush.h
x86_64: Some updates for boot-options.txt
x86_64: Improve CONFIG_GART_IOMMU description and make it default y
x86_64: Remove unused variable in delay.c
x86_64: Some cleanup in setup64.c
x86_64: Clarify Booting processor ... message
x86_64: Minor clean up to CPU setup - use smp_processor_id instead of custom hack
x86_64: Move cpu_present/possible_map parsing earlier
x86_64: i386/x86_64: remove prototypes for not existing functions in smp.h
x86_64: Use for_each_cpu_mask for clustered IPI flush
x86_64: Update defconfig
x86_64: Always ack IPIs even on errors

Andreas Gruenbacher:
x86_64: Icecream has no way of detecting assembler-level includes

Andrew Morton:
shm: CONFIG_SHMEM=n build fix
transmeta: CONFIG_PROC_FS=n build fix
skge build fix
i2c-mpc.c: revert duplicate patch
revert bogus softirq changes
Input: cannot refer to __exit from within __init.

Anton Blanchard:
ppc64: topology API fix

Antonino A. Daplas:
tridentfb: Fix scrolling artifacts during disk IO
tridentfb: Fix scrolling artifacts if acceleration is enabled
vesafb: Document mtrr boot option usage
fbdev: Replace memcpy with for-loop when preparing bitmap
vesafb: Fix mtrr bugs

Baruch Even:
[NET]: Spelling mistakes threshoulds -> thresholds

Ben Dooks:
USB: add S3C24XX USB Host driver support

Benjamin Herrenschmidt:
ppc64: Fix CONFIG_ALTIVEC not set

Bjorn Helgaas:
serial: add MMIO support to 8250_pnp

Bodo Stroesser:
uml: Fix typo
uml: Fix skas0 stub return

Christophe Lucas:
uml: Clean up prink calls

Conger, Chris A:
USB: fix Bug in usb-skeleton.c

Cornelia Huck:
s390: device recognition

Dan Streetman:
USB: fix in usb_calc_bus_time

Daniel Walker:
stable_api_nonsense.txt fixes

Daniele Gaffuri:
PCI: Hidden SMBus bridge on Toshiba Tecra M2

Dave Hansen:
re-disable TSC on NUMAQ

Dave Jones:
arch/i386/kernel/cpu/cpufreq/powernow-k8.c: In function `powernow_k8_cpu_init_acpi':
Fix up powernow-k8 compile. (Missing definitions).
powernow-k8.c: In function `query_current_values_with_pending_wait':
Here are two possible cleanups in cpufreq.c:
Opteron revision F will support higher frequencies than
powernow-k8 requires that a data structure for

Dave Peterson:
x86_64: fix bug in csum_partial_copy_generic()

David Moore:
Input: ALPS - fix resume (for DualPoints)

David S. Miller:
[NET]: Fix busy waiting in dev_close().

David Shaohua Li:
[ACPI] suspend/resume ACPI PCI Interrupt Links
[ACPI] address boot-freeze with updated DMI blacklist for c-states

Denis Lunev:
[NET] Fix too aggressive backoff in dst garbage collection

Denis Vlasenko:
silence cs89x0

Dmitry Torokhov:
Input: i8042 - don't use negation to mark AUX data
Input: i8042 - add Alienware Sentia to NOMUX blacklist.
Input: serio_raw - link serio_raw misc device to corresponding
Merge rsync://www.kernel.org/.../torvalds/linux-2.6
Input: make name, phys and uniq be 'const char *' because once
Input: rearrange procfs code to reduce number of #ifdefs
Sonypi: make sure that input_work is not running when unloading
Input: introduce usb_to_input_id() to uniformly produce
Input: acecad - drop unneeded cast and couple unneeded spaces.
Input: serio - add modalias attribute and environment variable to
Input: uinput - use completions instead of events and manual
Input: clean up uinput driver (formatting, extra braces)

Dominik Brodowski:
pcmcia: fix multiple insertion of multifunction cards
pcmcia: defer ide-cs initialization after other IDE drivers started up
[ACPI] Always set P-state on initialization

Eric Dumazet:
sys_set_mempolicy() doesnt check if mode < 0

Eric Lammerts:
disable addres space randomization default on transmeta CPUs

Eric W. Biederman:
Fix sync_tsc hang
x86_64 machine_kexec: Use standard pagetable helpers
x86_64 machine_kexec: Cleanup inline assembly.
i386 machine_kexec: Cleanup inline assembly
reboot: remove device_suspend(PMSG_FREEZE) from kernel_kexec

Eugene Surovegin:
ppc32: add missing 4xx EMAC sysfs nodes
ppc32: fix 44x early serial debug for configurations with more than 512M of RAM

Evgeniy Polyakov:
w1: kconfig/Makefile fix.

George Anzinger:
posix timers: fix normalization problem

Gerald Schaefer:
s390: fix inline assembly in appldata

Greg Felix:
libata: Check PCI sub-class code before disabling AHCI

Greg Kroah-Hartman:
Add the rules about the -stable kernel releases to the Documentation directory

Harald Welte:
[NETFILTER] Inherit masq_index to slave connections

Heiko Carstens:
s390: kexec fixes and improvements.
s390: check for interrupt before waiting

Hirokazu Takata:
m32r: Fix local-timer event handling

Hugh Dickins:
x86_64: access of some bad address

Ian Abbott:
USB: ftdi_sio: fix a couple of timeouts
USB: ftdi_sio: Update RTS and DTR simultaneously
USB: ftdi_sio: new microHAM and Evolution Robotics devices

Ingo Molnar:
remove sys_set_zone_reclaim()

Ivan Kokshaysky:
PCI: remove PCI_BRIDGE_CTL_VGA handling from setup-bus.c

James Simmons:
Display name of fbdev device

Jay Vosburgh:
bonding: documentation update

Jean Delvare:
I2C: 24RF08 corruption prevention (again)
I2C: missing new lines in i2c-core messages
I2C: use time_after in 3 chip drivers
I2C: Missing space in split strings

Jeff Dike:
uml: fix vsyscall brokenness
uml: Fix load average >=1
uml: Fix redundant assignment
uml: vm86 compile fix
uml: fix TT mode by reverting "use fork instead of clone"

Jesse Millan:
x86_64: Fix gcc 4 warning in sched_find_first_bit

John McCutchan:
inotify: fix race between the kernel and user space
inotify: fix file deletion by rename detection

Jon Smirl:
PCI: Adjust PCI rom code to handle more broken ROMs
fbdev: colormap fixes fix

Keith Mannthey:
x86_64: Fix overflow in NUMA hash function setup

Kumar Gala:
ppc32: Mark boards that don't build as BROKEN
PCI: fix up errors after dma bursting patch and CONFIG_PCI=n -- bug?
I2C-MPC: Restore code removed

Ladislav Michl:
I2C: ds1337 - fix 12/24 hour mode bug

Len Brown:
merge 2.6.13-rc4 with ACPI's to-linus tree
/home/lenb/src/to-linus branch 'acpi-2.6.12'

Linus Torvalds:
Linux v2.6.13-rc5
Revert ACPI interrupt resume changes
Fix get_user_pages() race for write access
Merge head 'upstream-fixes' of master.kernel.org:/.../jgarzik/netdev-2.6
Merge head 'upstream-fixes' of master.kernel.org:/.../jgarzik/libata-dev
Revert "yenta free_irq on suspend"
Merge master.kernel.org:/.../lenb/to-linus
Merge master.kernel.org:/.../davej/cpufreq
x86: fix new find_first_bit()
Merge master.kernel.org:/.../davej/cpufreq
Merge master.kernel.org:/.../dtor/input
Merge master.kernel.org:/home/rmk/linux-2.6-arm-smp
Merge head 'upstream' of master.kernel.org:/.../jgarzik/libata-dev
Merge master.kernel.org:/.../davem/net-2.6

Luca T:
Input: HID - add a quirk for Aashima Trust (06d6:0025) gamepad

Luming Yu:
[ACPI] Add "ec_polling" boot option

Maneesh Soni:
sysfs: fix sysfs_setattr
sysfs: fix sysfs_chmod_file

Mark Haverkamp:
aacraid: Fix for controller load based timeouts

Martin J. Bligh:
Fix NUMA node sizing in nr_free_zone_pages

Martin Schwidefsky:
s390: ioprio & inotify system calls.
s390: default configuration

Masahito Omote:
USB: Patch for KYOCERA AH-K3001V support

Mathieu:
USB: drivers/net/usb/zd1201.c: Gigabyte GN-WLBZ201 dongle usbid

Matt Porter:
ppc32: add bamboo defconfig
ppc32: add bamboo platform
ppc32: add 440ep support

Matthew Garrett:
agp: restore APBASE after setting APSIZE

Mauro Carvalho Chehab:
v4l: bug fix to correct tea5767 autodetection
V4L: Miscellaneous fixes

Michael Hund:
USB: ldusb fixes

Michael Kerrisk:
MAINTAINERS record -- MAN-PAGES

Michael Krufky:
v4l: cx88 card support and documentation finishing touches

Michael Prokop:
Input: elo - fix help in Kconfig (wrong module name)

Mike Kravetz:
ppc64: POWER 4 fails to boot with NUMA

Natalie.P...@unisys.com:
x86_64: avoid wasting IRQs patch update
x86: avoid wasting IRQs patch update

Neil Brown:
Input: serio_raw - fix Kconfig help

NeilBrown:
md: make sure raid5/raid6 resync uses correct 'max_sectors'

Nishanth Aravamudan:
x86_64: Use msleep in smpboot.c

Paolo 'Blaisorblade' Giarrusso:
uml: implement hostfs syncing
uml: avoid unnecessary pcap rebuild

Paul Jackson:
plug MAN-PAGES maintainer in Documentation/SubmittingPatches

Pete Zaitcev:
USB: hidinput_hid_event() oops fix

Peter Osterlund:
Input: ALPS - unconditionally enable tapping mode

Rafael J. Wysocki:
sk98lin: basic suspend/resume support fixes
[ACPI] fix resume issues on Asus L5D

Ralf Baechle:
SMP fix for 6pack driver

Ravikiran G Thirumalai:
mm: Ensure proper alignment for node_remap_start_pfn
x86_64: fix cpu_to_node setup for sparse apic_ids

Robert Love:
ppc64: inotify syscalls
ppc32: inotify syscalls

Roman Zippel:
hfs: don't reference missing page
hfs: don't dirty unchanged inode

Russell King:
[ARM SMP] Ensure secondary CPUs see their pen release
[ARM SMP] Fix another ARMv6 bitop problem
[ARM SMP] Ensure secondary CPUs have a clean TLB

Rusty Russell:
Module per-cpu alignment cannot always be met

Sergey Vlasov:
Input: synaptics - fix setting packet size on passthrough port.

Simon Horman:
Input: synaptics - limit rate to 40pps on Toshiba Dynabooks

Stephen Hemminger:
sk98lin: fix workaround for yukon-lite chipset (> rev 7)
skge: version 0.8
skge: led toggle cleanup
skge: ignore phy interrupts during negotiation
skge: fifo control register access fix
skge: whitespace fixes
skge: support yukon lite rev 4
skge: phy lock deadlock
skge: disable tranmitter on shutdown
skge: remove SK-9EE support
skge: silence mac data parity messages

Stephen Smalley:
selinux: Fix address length checks in connect hook

Tobias Klauser:
Input: joydev - remove custom conversion from jiffies to msecs

Tony Lindgren:
Fix OMAP specific typo in smc91x.h

Venkatesh Pallipadi:
[ACPI] delete boot-time printk()s from processor_idle.c
[ACPI] Fix memset arguments in acpi processor_idle.c
[ACPI] Fix the regression with c1_default_handler on some systems

Vojtech Pavlik:
Input: check keycodesize when adjusting keymaps
Input: psmouse - wheel mice (imps, exps) always have 3rd button
Input: i8042 - add Fujitsu T3010 to NOMUX blacklist.

-- diffstat ---

Jan De Luyck

Aug 2, 2005, 2:20:01 AM

to linux-...@vger.kernel.org

On Tuesday 02 August 2005 07:07, Linus Torvalds wrote:
> Ok, one more in the series towards final 2.6.13.
>
> This one is small enough that both shortlog and diffstat fit comfortably:
> no big architecture updates or anything like that. Some input, x86-64 and
> ppc updates, and various fairly small fixes in random places.
>
> Some reverts back to 2.6.12 behaviour - you've seen the discussions, and
> I'm sure we'll end up discussing things further for a long while still,
> but the plan is to release 2.6.13 with known behaviour characteristics.

Built & runs fine, built using GCC 4.0.0-2 (Debian Sid) on Pentium M.

Jan

--
I marvel at the strength of human weakness.

Jan De Luyck

Aug 2, 2005, 2:44:43 AM

to linux-...@vger.kernel.org

On Tuesday 02 August 2005 07:07, Linus Torvalds wrote:

> Ok, one more in the series towards final 2.6.13.

One thing that seems like a regression: doing

$ cat /proc/acpi/battery/BAT1/info

causes a two-second pause and then gives me the information, while in 2.6.12.3
that was near-instant.

$ date; cat /proc/acpi/battery/BAT1/info ; date
Tue Aug 2 08:41:03 CEST 2005
present: yes
design capacity: 4400 mAh
last full capacity: 418 mAh
battery technology: rechargeable
design voltage: 14800 mV
design capacity warning: 250 mAh
design capacity low: 100 mAh
capacity granularity 1: 10 mAh
capacity granularity 2: 25 mAh
model number: 03ZG
serial number:
battery type: LION
OEM info: SANYO
Tue Aug 2 08:41:05 CEST 2005
$
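A rough sketch for anyone repeating this measurement: `date` only resolves whole seconds, so reading the file several times in a row and timing the whole batch makes a slowdown easier to see. The function name `elapsed_reads` and the repeat count are illustrative assumptions, not something from the report.

```shell
# elapsed_reads FILE COUNT
# Read FILE COUNT times and print the total elapsed wall-clock time in
# whole seconds. The name and interface are hypothetical, for illustration.
elapsed_reads() {
    file=$1
    count=$2
    start=$(date +%s)
    i=0
    while [ "$i" -lt "$count" ]; do
        # Stop early if the file cannot be read at all.
        cat "$file" > /dev/null || return 1
        i=$((i + 1))
    done
    end=$(date +%s)
    echo $((end - start))
}
```

For example, `elapsed_reads /proc/acpi/battery/BAT1/info 5` could be run on both kernels and the totals compared.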

Jan

--
FLASH!
Intelligence of mankind decreasing.
Details at ... uh, when the little hand is on the ....

Olaf Hering

Aug 2, 2005, 3:58:56 AM

to Linus Torvalds, Linux Kernel Mailing List

On Mon, Aug 01, Linus Torvalds wrote:

> Give it a good testing, I'm hoping this can really turn into 2.6.13.

aic doesn't work anymore; the poweroff thing should also be fixed in some
way.

http://marc.theaimsgroup.com/?l=linux-scsi&m=112180348617932&w=2

(Google did not find that posting, but it did find the commit to
http://www.kernel.org/pub/linux/kernel/people/jejb/scsi-misc-2.6.changelog)

Rafael J. Wysocki

Aug 2, 2005, 6:48:25 AM

to Jan De Luyck, linux-...@vger.kernel.org

On Tuesday, 2 of August 2005 08:43, Jan De Luyck wrote:
> On Tuesday 02 August 2005 07:07, Linus Torvalds wrote:
> > Ok, one more in the series towards final 2.6.13.
>
> One thing that seems like a regression: doing
>
> $ cat /proc/acpi/battery/BAT1/info
>
> causes a two-second pause and then gives me the information, while in 2.6.12.3
> that was near-instant.

Please try to add the ec_polling parameter to the kernel command line and
retest.
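A minimal sketch for confirming, after reboot, that the parameter actually reached the kernel: the running kernel's command line can be read from /proc/cmdline. The helper name `has_boot_param` is made up for illustration; how the parameter gets onto the command line depends on the boot loader and is not shown.

```shell
# has_boot_param CMDLINE NAME
# Succeed if NAME appears in CMDLINE as a whole word, either bare or in
# NAME=value form. Hypothetical helper, not part of any kernel tooling.
has_boot_param() {
    # Rely on word splitting to walk the space-separated parameters.
    for word in $1; do
        case $word in
            "$2"|"$2"=*) return 0 ;;
        esac
    done
    return 1
}
```

Typical use would be `has_boot_param "$(cat /proc/cmdline)" ec_polling && echo "ec_polling is set"`.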

Greets,
Rafael

--
- Would you tell me, please, which way I ought to go from here?
- That depends a good deal on where you want to get to.
-- Lewis Carroll "Alice's Adventures in Wonderland"

Jan De Luyck

Aug 3, 2005, 7:03:24 AM

to Rafael J. Wysocki, linux-...@vger.kernel.org

On Tuesday 02 August 2005 12:50, Rafael J. Wysocki wrote:
> Please try to add the ec_polling parameter to the kernel command line and
> retest.

That helps a lot. Thanks, it's back to the 'old way'.

Jan
--
..Deep Hack Mode -- that mysterious and frightening state of
consciousness where Mortal Users fear to tread.
-- Matt Welsh

Helge Hafting

Aug 5, 2005, 6:38:38 AM

to Linus Torvalds, Linux Kernel Mailing List

2.6.13-rc5 seemed to kill a scsi disk (sdb) for me, where 2.6.13-rc4-mm1
has no problems with the same disk.

Machine: opteron running an x86-64 kernel, with built-in SATA as well as
a symbios scsi controller. Two videocards running independent xservers.
The sdb disk is on the symbios controller.

Using 2.6.13-rc5 I suddenly got this in my logs:

Aug 3 22:06:00 tenkende-august -- MARK --
Aug 3 22:17:15 tenkende-august kernel: sd 0:0:0:0: ABORT operation started.
Aug 3 22:17:15 tenkende-august kernel: sd 0:0:0:0: ABORT operation timed-out.
Aug 3 22:17:15 tenkende-august kernel: sd 0:0:1:0: ABORT operation started.
Aug 3 22:17:15 tenkende-august kernel: sd 0:0:1:0: ABORT operation timed-out.
Aug 3 22:17:15 tenkende-august kernel: sd 0:0:0:0: DEVICE RESET operation started.
Aug 3 22:17:15 tenkende-august kernel: sd 0:0:0:0: DEVICE RESET operation timed-out.
Aug 3 22:17:15 tenkende-august kernel: sd 0:0:1:0: DEVICE RESET operation started.
Aug 3 22:17:15 tenkende-august kernel: sd 0:0:1:0: DEVICE RESET operation timed-out.
Aug 3 22:17:15 tenkende-august kernel: sd 0:0:0:0: BUS RESET operation started.
Aug 3 22:17:15 tenkende-august kernel: sym0: SCSI BUS reset detected.
Aug 3 22:17:15 tenkende-august kernel: sym0: SCSI BUS has been reset.
Aug 3 22:17:15 tenkende-august kernel: sd 0:0:0:0: BUS RESET operation complete.
Aug 3 22:17:15 tenkende-august kernel: target0:0:1: FAST-40 WIDE SCSI 80.0 MB/s ST (25 ns, offset 31)
Aug 3 22:17:15 tenkende-august kernel: sdb: Current: sense key: No Sense
Aug 3 22:17:15 tenkende-august kernel: Additional sense: No additional sense information
Aug 3 22:17:15 tenkende-august kernel: sdb: Current: sense key: No Sense
Aug 3 22:17:15 tenkende-august kernel: Additional sense: No additional sense information

This "no additional sense" then repeats for many screenfuls.
Two sdb partitions got dropped from RAID-1 as they failed, the
md devices got remounted read-only.
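A hedged sketch of how dropped RAID members can be spotted from the shell: /proc/mdstat shows a per-array status such as `[UU]`, with `_` standing in for a missing or failed member. The function name `degraded_arrays` is an assumption for illustration; the report above does not include the mdstat output itself.

```shell
# degraded_arrays [FILE]
# Print the name of every md array whose member status (for example "[U_]")
# contains an underscore, meaning at least one member is missing or failed.
# Reads /proc/mdstat unless another file is given. Hypothetical helper.
degraded_arrays() {
    awk '
        /^md/ { name = $1 }                       # remember the current array
        /\[[U_]*_[U_]*\]/ { print name }          # status with a failed slot
    ' "${1:-/proc/mdstat}"
}
```

Running `degraded_arrays` with no argument on the affected machine would list the arrays that lost their sdb partitions.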

I thought the disk had died - it was my oldest so it'd be reasonable.
Rebooting 2.6.13-rc5 did not bring the disk back - it came up useless again.

I switched back to 2.6.13-rc4-mm1 at this point for another reason:
my X display acquired a nasty tendency to go blank for no reason during work,
something I could fix by changing resolution back and forth. X also tended to get
stuck for a minute now and then - a problem I haven't seen since early 2.6.

These troubles disappeared by going back to 2.6.13-rc4-mm1. Even more interesting,
the sdb disk seems fine again. There were no errors as I copied
all data to another disk, and no errors when I ran a badblocks write-test
(the nondestructive write test) on it.

The two kernels have some config differences. The 2.6.13-rc5 kernel
has ACPI+CPUFREQ configured, which the 2.6.13-rc4-mm1 kernel doesn't have.
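A small sketch for pinning down such differences between two builds, assuming both .config files are still around. The helper names are invented for illustration, and reading "ACPI+CPUFREQ" as the symbols CONFIG_ACPI and CONFIG_CPU_FREQ is an assumption.

```shell
# config_value FILE OPTION
# Print the value of OPTION in a kernel .config file; prints nothing when
# the option is absent or commented out as "# OPTION is not set".
config_value() {
    sed -n "s/^$2=//p" "$1"
}

# compare_options OLD NEW OPTION...
# Print one line for each listed option whose value differs between the
# two config files, treating an unset option as "n".
compare_options() {
    old=$1
    new=$2
    shift 2
    for opt in "$@"; do
        a=$(config_value "$old" "$opt")
        b=$(config_value "$new" "$opt")
        [ "$a" = "$b" ] || echo "$opt: ${a:-n} -> ${b:-n}"
    done
}
```

For instance, `compare_options rc4-mm1/.config rc5/.config CONFIG_ACPI CONFIG_CPU_FREQ`, where the two paths are placeholders for wherever the build trees live.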

An lspci, in case hw driver trouble is suspected:
0000:00:00.0 Host bridge: VIA Technologies, Inc. VT8385 [K8T800 AGP] Host Bridge (rev 01)
0000:00:01.0 PCI bridge: VIA Technologies, Inc. VT8237 PCI bridge [K8T800 South]
0000:00:05.0 SCSI storage controller: LSI Logic / Symbios Logic 53c895 (rev 01)
0000:00:06.0 Multimedia audio controller: Trident Microsystems 4DWave NX (rev 02)
0000:00:08.0 VGA compatible controller: ATI Technologies Inc RV280 [Radeon 9200 SE] (rev 01)
0000:00:08.1 Display controller: ATI Technologies Inc RV280 [Radeon 9200 SE] (Secondary) (rev 01)
0000:00:0b.0 Ethernet controller: Broadcom Corporation NetXtreme BCM5705 Gigabit Ethernet (rev 03)
0000:00:0f.0 RAID bus controller: VIA Technologies, Inc. VIA VT6420 SATA RAID Controller (rev 80)
0000:00:0f.1 IDE interface: VIA Technologies, Inc. VT82C586A/B/VT82C686/A/B/VT823x/A/C PIPC Bus Master IDE (rev 06)
0000:00:10.0 USB Controller: VIA Technologies, Inc. VT82xxxxx UHCI USB 1.1 Controller (rev 81)
0000:00:10.1 USB Controller: VIA Technologies, Inc. VT82xxxxx UHCI USB 1.1 Controller (rev 81)
0000:00:10.2 USB Controller: VIA Technologies, Inc. VT82xxxxx UHCI USB 1.1 Controller (rev 81)
0000:00:10.4 USB Controller: VIA Technologies, Inc. USB 2.0 (rev 86)
0000:00:11.0 ISA bridge: VIA Technologies, Inc. VT8237 ISA bridge [K8T800 South]
0000:00:11.5 Multimedia audio controller: VIA Technologies, Inc. VT8233/A/8235/8237 AC97 Audio Controller (rev 60)
0000:00:18.0 Host bridge: Advanced Micro Devices [AMD] K8 NorthBridge
0000:00:18.1 Host bridge: Advanced Micro Devices [AMD] K8 NorthBridge
0000:00:18.2 Host bridge: Advanced Micro Devices [AMD] K8 NorthBridge
0000:00:18.3 Host bridge: Advanced Micro Devices [AMD] K8 NorthBridge
0000:01:00.0 VGA compatible controller: Matrox Graphics, Inc. MGA G550 AGP (rev 01)

I can run more tests, but don't know what would be the most interesting.
rc5 without powermanagement? rc4-mm1 with it? Or the newest git kernel?
Or is this the effect of some known problem?

Helge Hafting

Andrew Morton

Aug 5, 2005, 6:24:19 PM

to Helge Hafting, torv...@osdl.org, linux-...@vger.kernel.org

Helge Hafting <helg...@aitel.hist.no> wrote:
>
> 2.6.13-rc5 seemed to kill a scsi disk (sdb) for me, where 2.6.13-rc4-mm1
> has no problems with the same disk.
>
> Machine: opteron running an x86-64 kernel, with built-in SATA as well as
> a symbios scsi controller. Two videocards running independent xservers.
> The sdb disk is on the symbios controller.
>
>
> Using 2.6.13-rc5 I suddenly got this in my logs:
>
> Aug 3 22:06:00 tenkende-august -- MARK --
> Aug 3 22:17:15 tenkende-august kernel: sd 0:0:0:0: ABORT operation started.
> Aug 3 22:17:15 tenkende-august kernel: sd 0:0:0:0: ABORT operation timed-out.
>

> ...

> This "no additional sense" then repeats for many screenfuls.
> Two sdb partitions got dropped from RAID-1 as they failed, the
> md devices got remounted read-only.
>
> I thought the disk had died - it was my oldest so it'd be reasonable.
> Rebooting 2.6.13-rc5 did not bring the disk back - it came up useless again.
>
> I switched back to 2.6.13-rc4-mm1 at this point for another reason:
> my X display acquired a nasty tendency to go blank for no reason during work,
> something I could fix by changing resolution back and forth. X also tended to get
> stuck for a minute now and then - a problem I haven't seen since early 2.6.
>
> These troubles disappeared by going back to 2.6.13-rc4-mm1. Even more interesting,
> the sdb disk seems fine again. There were no errors as I copied
> all data to another disk, and no errors when I ran a badblocks write-test
> (the nondestructive write test) on it.
>
> The two kernels have some config differences. The 2.6.13-rc5 kernel
> has ACPI+CPUFREQ configured, which the 2.6.13-rc4-mm1 kernel doesn't have.

That's a pretty big difference ;)

> ...

> I can run more tests, but don't know what would be the most interesting.
> rc5 without powermanagement? rc4-mm1 with it? Or the newest git kernel?
> Or is this the effect of some known problem?

The latest -git kernel (or 2.6.13-rc6 if it's there) with ACPI enabled is
the one to test, please.

Helge Hafting

Aug 7, 2005, 5:34:30 AM

to Andrew Morton, torv...@osdl.org, linux-...@vger.kernel.org

On Fri, Aug 05, 2005 at 03:05:06PM -0700, Andrew Morton wrote:
> Helge Hafting <helg...@aitel.hist.no> wrote:
[...]

> > The two kernels have some config differences. The 2.6.13-rc5 kernel
> > has ACPI+CPUFREQ configured, which the 2.6.13-rc4-mm1 kernel doesn't have.
>
> That's a pretty big difference ;)
>

Sure.

> > ...
> > I can run more tests, but don't know what would be the most interesting.
> > rc5 without powermanagement? rc4-mm1 with it? Or the newest git kernel?
> > Or is this the effect of some known problem?
>
> The latest -git kernel (or 2.6.13-rc6 if it's there) with ACPI enabled is
> the one to test, please.
>

I tried 2.6.13-rc5-git4, with and without ACPI. They seem to behave
identically:

I haven't seen any more disk trouble, but one of my X displays goes black
now and then, forcing me to do a display resize to get it back. This
does not happen at all with 2.6.13-rc4-mm1.

The display that goes black uses the evdev protocol to read the
second keyboard which is not connected to any tty. That is a very
new option which could have its own issues, but there were no problems
in rc4-mm1. It is annoying enough that I don't want to run rc5 for
anything but tests.

Helge Hafting

Danny ter Haar

Aug 7, 2005, 1:08:07 PM

to linux-...@vger.kernel.org

Andrew Morton <ak...@osdl.org> wrote:
>Helge Hafting <helg...@aitel.hist.no> wrote:
>> 2.6.13-rc5 seemed to kill a scsi disk (sdb) for me, where 2.6.13-rc4-mm1
>> has no problems with the same disk.

Sort of the same for me:
2.6.12-mm1 runs for _weeks_ where others keep crashing:

>The latest -git kernel (or 2.6.13-rc6 if it's there) with ACPI enabled is
>the one to test, please.

No rc6 yet; I did, however, experience the following:

reboot   system boot  2.6.12-mm1       Sun Aug 7 18:20          (00:36)
dth      pts/1        zaphod.dth.net   Sun Aug 7 15:41 - crash  (02:38)
reboot   system boot  2.6.13-rc5-git5  Sun Aug 7 14:04          (04:52)
reboot   system boot  2.6.13-rc5-git4  Sun Aug 7 10:05          (03:43)
reboot   system boot  2.6.13-rc5-git3  Fri Aug 5 16:55          (1+17:07)

git3 lasted nearly 2 days
git4 ran for nearly 5 hours, then I upgraded to
git5, which didn't last longer than 2.5 hours

Fortunately some info was found in the log files.
What I don't "get" is that ethernet also goes down when the scsi
controller goes berserk.
I'm pretty sure it's not a hardware problem since 2.6.12-mm1 survives
and brings this usenet host into the worldwide top 1000.

From the log files:

scsi1: Transmission error detected
LQISTAT1[0x8]:(LQICRCI_NLQ) LASTPHASE[0x1]:(P_DATAOUT|P_BUSFREE)
SCSISIGI[0x60]:(P_DATAIN_DT) PERRDIAG[0x4]:(CRCERR)
>>>>>>>>>>>>>>>>>> Dump Card State Begins <<<<<<<<<<<<<<<<<
scsi1: Dumping Card State at program address 0x11 Mode 0x33
Card was paused
HS_MAILBOX[0x0] INTCTL[0xc0]:(SWTMINTEN|SWTMINTMASK)
SEQINTSTAT[0x10]:(SEQ_SWTMRTO) SAVED_MODE[0x11]
DFFSTAT[0x24]:(CURRFIFO_0|FIFO1FREE) SCSISIGI[0xb6]:(P_MESGOUT|REQI|BSYI|ATNI)
SCSIPHASE[0x4]:(MSG_OUT_PHASE) SCSIBUS[0xff] LASTPHASE[0x1]:(P_DATAOUT|P_BUSFREE)
SCSISEQ0[0x0] SCSISEQ1[0x12]:(ENAUTOATNP|ENRSELI)
SEQCTL0[0x0] SEQINTCTL[0x0] SEQ_FLAGS[0x0] SEQ_FLAGS2[0x0]
SSTAT0[0x2]:(SPIORDY) SSTAT1[0x19]:(REQINIT|BUSFREE|PHASEMIS)
SSTAT2[0x20]:(NONPACKREQ) SSTAT3[0x0] PERRDIAG[0x0]
SIMODE1[0xa4]:(ENSCSIPERR|ENSCSIRST|ENSELTIMO)
LQISTAT0[0x0] LQISTAT1[0x8]:(LQICRCI_NLQ) LQISTAT2[0xc0]:(LQIPHASE_OUTPKT|PACKETIZED)
LQOSTAT0[0x0] LQOSTAT1[0x0] LQOSTAT2[0xe1]:(LQOSTOP0|LQOPKT)

SCB Count = 128 CMDS_PENDING = 2 LASTSCB 0x34 CURRSCB 0x8 NEXTSCB 0xffc0
qinstart = 15426 qinfifonext = 15426
QINFIFO:
WAITING_TID_QUEUES:
Pending list:
8 FIFO_USE[0x0] SCB_CONTROL[0x60]:(TAG_ENB|DISCENB) SCB_SCSIID[0x47]
66 FIFO_USE[0x0] SCB_CONTROL[0x60]:(TAG_ENB|DISCENB) SCB_SCSIID[0x47]
Total 2
Kernel Free SCB list: 53 94 52 127 115 73 21 114 63 37 22 117 93 92 64 1 41 88 43 121 68 50 91 46 122 56 80 30 104 116 34 48 7 105 3 39 58 81 112 119 28 27 62 4 17 120 24 103 0 49 101 106 32 47 75 69 11 95 42 65 96 14 67 72 89 108 13 36 125 44 51 71 20 54 38 90 82 85 31 59 76 60 6 97 33 5 124 16 25 111 18 15 19 87 107 23 123 99 110 10 84 29 100 74 55 118 40 9 126 113 12 61 77 98 79 109 2 78 102 57 70 35 83 45 86 26
Sequencer Complete DMA-inprog list:
Sequencer Complete list:
Sequencer DMA-Up and Complete list:

scsi1: FIFO0 Active, LONGJMP == 0x232, SCB 0x42
SEQIMODE[0x3f]:(ENCFG4TCMD|ENCFG4ICMD|ENCFG4TSTAT|ENCFG4ISTAT|ENCFG4DATA|ENSAVEPTRS)
SEQINTSRC[0x0] DFCNTRL[0x8]:(HDMAEN) DFSTATUS[0xc9]:(FIFOEMP|HDONE|PKT_PRELOAD_AVAIL|PRELOAD_AVAIL)
SG_CACHE_SHADOW[0x78] SG_STATE[0x3]:(SEGS_AVAIL|LOADING_NEEDED)
DFFSXFRCTL[0x0] SOFFCNT[0x0] MDFFSTAT[0x6]:(DATAINFIFO|DLZERO)
SHADDR = 0x048b4000, SHCNT = 0x0 HADDR = 0x048b4000, HCNT = 0x0
CCSGCTL[0x10]:(SG_CACHE_AVAIL)
scsi1: FIFO1 Free, LONGJMP == 0x8063, SCB 0x3
SEQIMODE[0x3f]:(ENCFG4TCMD|ENCFG4ICMD|ENCFG4TSTAT|ENCFG4ISTAT|ENCFG4DATA|ENSAVEPTRS)
SEQINTSRC[0x0] DFCNTRL[0x0] DFSTATUS[0x89]:(FIFOEMP|HDONE|PRELOAD_AVAIL)
SG_CACHE_SHADOW[0x2]:(LAST_SEG) SG_STATE[0x0] DFFSXFRCTL[0x0]
SOFFCNT[0x0] MDFFSTAT[0x5]:(FIFOFREE|DLZERO) SHADDR = 0x00, SHCNT = 0x0
HADDR = 0x00, HCNT = 0x0 CCSGCTL[0x0]
LQIN: 0x4 0x0 0x0 0x42 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x28 0x0 0x0 0x0 0x2 0x0
scsi1: LQISTATE = 0x2b, LQOSTATE = 0x0, OPTIONMODE = 0x52
scsi1: OS_SPACE_CNT = 0x20 MAXCMDCNT = 0x1
SIMODE0[0xc]:(ENOVERRUN|ENIOERR)
CCSCBCTL[0x0]
scsi1: REG0 == 0x42, SINDEX = 0x178, DINDEX = 0x10a
scsi1: SCBPTR == 0x35, SCB_NEXT == 0xff00, SCB_NEXT2 == 0xff39
CDB 28 0 c 80 70 a4
STACK: 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0
<<<<<<<<<<<<<<<<< Dump Card State Ends >>>>>>>>>>>>>>>>>>
DevQ(0:4:0): 0 waiting
DevQ(0:5:0): 0 waiting
DevQ(0:14:0): 0 waiting
DevQ(0:15:0): 0 waiting
LQICRC_NLQ
eth0: Optical link DOWN
eth0: Optical link UP (Full Duplex, Flow Control: TX RX)
scsi1: Unexpected PKT busfree condition
>>>>>>>>>>>>>>>>>> Dump Card State Begins <<<<<<<<<<<<<<<<<
scsi1: Dumping Card State at program address 0x11 Mode 0x33
Card was paused
HS_MAILBOX[0x0] INTCTL[0xc0]:(SWTMINTEN|SWTMINTMASK)
SEQINTSTAT[0x10]:(SEQ_SWTMRTO) SAVED_MODE[0x11]
DFFSTAT[0x24]:(CURRFIFO_0|FIFO1FREE) SCSISIGI[0xb6]:(P_MESGOUT|REQI|BSYI|ATNI)
SCSIPHASE[0x4]:(MSG_OUT_PHASE) SCSIBUS[0xff] LASTPHASE[0x1]:(P_DATAOUT|P_BUSFREE)
SCSISEQ0[0x0] SCSISEQ1[0x12]:(ENAUTOATNP|ENRSELI)
SEQCTL0[0x0] SEQINTCTL[0x0] SEQ_FLAGS[0x0] SEQ_FLAGS2[0x0]
SSTAT0[0x2]:(SPIORDY) SSTAT1[0x19]:(REQINIT|BUSFREE|PHASEMIS)
SSTAT2[0x20]:(NONPACKREQ) SSTAT3[0x0] PERRDIAG[0x0]
SIMODE1[0xa4]:(ENSCSIPERR|ENSCSIRST|ENSELTIMO)
LQISTAT0[0x0] LQISTAT1[0x0] LQISTAT2[0xc0]:(LQIPHASE_OUTPKT|PACKETIZED)
LQOSTAT0[0x0] LQOSTAT1[0x0] LQOSTAT2[0xe1]:(LQOSTOP0|LQOPKT)

SCB Count = 128 CMDS_PENDING = 2 LASTSCB 0x34 CURRSCB 0x8 NEXTSCB 0xffc0
qinstart = 15426 qinfifonext = 15426
QINFIFO:
WAITING_TID_QUEUES:
Pending list:
8 FIFO_USE[0x0] SCB_CONTROL[0x60]:(TAG_ENB|DISCENB) SCB_SCSIID[0x47]
66 FIFO_USE[0x0] SCB_CONTROL[0x60]:(TAG_ENB|DISCENB) SCB_SCSIID[0x47]
Total 2
Kernel Free SCB list: 53 94 52 127 115 73 21 114 63 37 22 117 93 92 64 1 41 88 43 121 68 50 91 46 122 56 80 30 104 116 34 48 7 105 3 39 58 81 112 119 28 27 62 4 17 120 24 103 0 49 101 106 32 47 75 69 11 95 42 65 96 14 67 72 89 108 13 36 125 44 51 71 20 54 38 90 82 85 31 59 76 60 6 97 33 5 124 16 25 111 18 15 19 87 107 23 123 99 110 10 84 29 100 74 55 118 40 9 126 113 12 61 77 98 79 109 2 78 102 57 70 35 83 45 86 26
Sequencer Complete DMA-inprog list:
Sequencer Complete list:
Sequencer DMA-Up and Complete list:

scsi1: FIFO0 Active, LONGJMP == 0x232, SCB 0x42
SEQIMODE[0x3f]:(ENCFG4TCMD|ENCFG4ICMD|ENCFG4TSTAT|ENCFG4ISTAT|ENCFG4DATA|ENSAVEPTRS)
SEQINTSRC[0x0] DFCNTRL[0x8]:(HDMAEN) DFSTATUS[0xc9]:(FIFOEMP|HDONE|PKT_PRELOAD_AVAIL|PRELOAD_AVAIL)
SG_CACHE_SHADOW[0x78] SG_STATE[0x3]:(SEGS_AVAIL|LOADING_NEEDED)
DFFSXFRCTL[0x0] SOFFCNT[0x0] MDFFSTAT[0x6]:(DATAINFIFO|DLZERO)
SHADDR = 0x048b4000, SHCNT = 0x0 HADDR = 0x048b4000, HCNT = 0x0
CCSGCTL[0x10]:(SG_CACHE_AVAIL)
scsi1: FIFO1 Free, LONGJMP == 0x8063, SCB 0x3
SEQIMODE[0x3f]:(ENCFG4TCMD|ENCFG4ICMD|ENCFG4TSTAT|ENCFG4ISTAT|ENCFG4DATA|ENSAVEPTRS)
SEQINTSRC[0x0] DFCNTRL[0x0] DFSTATUS[0x89]:(FIFOEMP|HDONE|PRELOAD_AVAIL)
SG_CACHE_SHADOW[0x2]:(LAST_SEG) SG_STATE[0x0] DFFSXFRCTL[0x0]
SOFFCNT[0x0] MDFFSTAT[0x5]:(FIFOFREE|DLZERO) SHADDR = 0x00, SHCNT = 0x0
HADDR = 0x00, HCNT = 0x0 CCSGCTL[0x0]
LQIN: 0x4 0x0 0x0 0x42 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x28 0x0 0x0 0x0 0x2 0x0
scsi1: LQISTATE = 0x2b, LQOSTATE = 0x0, OPTIONMODE = 0x52
scsi1: OS_SPACE_CNT = 0x20 MAXCMDCNT = 0x1
SIMODE0[0xc]:(ENOVERRUN|ENIOERR)
CCSCBCTL[0x0]
scsi1: REG0 == 0x42, SINDEX = 0x178, DINDEX = 0x10a
scsi1: SCBPTR == 0x35, SCB_NEXT == 0xff00, SCB_NEXT2 == 0xff39
CDB 28 0 c 80 70 a4
STACK: 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0
<<<<<<<<<<<<<<<<< Dump Card State Ends >>>>>>>>>>>>>>>>>>
DevQ(0:4:0): 0 waiting
DevQ(0:5:0): 0 waiting
DevQ(0:14:0): 0 waiting
DevQ(0:15:0): 0 waiting

Helge Hafting

Aug 8, 2005, 7:20:43 AM

to Danny ter Haar, linux-...@vger.kernel.org, ak...@osdl.org

Danny ter Haar wrote:

>Andrew Morton <ak...@osdl.org> wrote:
>
>
>>Helge Hafting <helg...@aitel.hist.no> wrote:
>>
>>
>>>2.6.13-rc5 seemed to kill a scsi disk (sdb) for me, where 2.6.13-rc4-mm1
>>>have no problems with the same disk.
>>>
>>>
>
>Sort of same with me:
>2.6.12-mm1 runs for _weeks_ where others keep crashing:
>
>
>
>>The latest -git kernel (or 2.6.13-rc6 if it's there) with APCI enabled is
>>the one to test, please.
>>
>>
>
>no rc6 yet, i did however experience the following:
>
>reboot system boot 2.6.12-mm1 Sun Aug 7 18:20 (00:36)
>dth pts/1 zaphod.dth.net Sun Aug 7 15:41 - crash (02:38)
>reboot system boot 2.6.13-rc5-git5 Sun Aug 7 14:04 (04:52)
>reboot system boot 2.6.13-rc5-git4 Sun Aug 7 10:05 (03:43)
>reboot system boot 2.6.13-rc5-git3 Fri Aug 5 16:55 (1+17:07)
>
>git3 lasted nearly 2 days
>git4 ran for nearly 5 hours, then i upgraded to
>git5 didn't last longer than 2.5 hours
>
>Fortunately some info was found in the log files.
>What i don't "get" is that ethernet also goes down when the scsi
>controller goes berserk.
>I'm pretty sure it's not a hardware problem since 2.6.12-mm1 survives
>and brings this usenet host in the worldwide top 1000.
>
>

Interesting.
I have no idea what the core problem is, but one problem will often lead
to others. My scsi problem froze some apps that couldn't be paged in
from the "failing" disk, for example.

Something going wrong in the kernel can delay other devices for too long;
maybe your network driver was hit by nasty latency in the middle of
something as the scsi controller reset itself. It may also be memory
scribbling.

I sometimes get X lockups with rc5. Sometimes they just lock one display,
sometimes the whole machine locks solid, necessitating a reset. sysrq+B
did not work on either keyboard.

rc5 is no good for amd64, and it doesn't need power management to go wrong.

Helge Hafting

Danny ter Haar

Aug 8, 2005, 8:15:36 AM8/8/05

to linux-...@vger.kernel.org

Helge Hafting <helge....@aitel.hist.no> wrote:

>Danny ter Haar wrote:
>>What i don't "get" is that ethernet also goes down when the scsi
>>controller goes berserk.
>>I'm pretty sure it's not a hardware problem since 2.6.12-mm1 survives
>>and brings this usenet host in the worldwide top 1000.
>Interesting.
>I have no idea what the core problem is, but one problem will often lead
>to others. My scsi problem froze some apps that couldn't be paged in
>from the "failing" disk, for example.

I found out in the meantime that the ethernet and scsi controllers share
the same IRQ, so it's even sort of logical, I guess

irq 25: aic79xx, eth3 (although ath0 was complaining)

>rc5 is no good for amd64, and it doesn't need power management to go wrong.

rc6 [KNOCK WOOD] seems to work just (so far)

dth@newsgate:~$ procinfo
Linux 2.6.13-rc6 (root@newsgate) (gcc [can't parse]) #??? 1CPU [newsgate.(none)]
Memory:      Total        Used        Free      Shared     Buffers
Mem:       2058040     2040104       17936           0         476
Swap:            0           0           0

Bootup: Sun Aug  7 22:06:08 2005    Load average: 3.62 3.64 3.55 4/66 1277

user  :       1:44:42.21  10.8%      page in :        0
nice  :       0:11:21.95   1.2%      page out:        0
system:       4:56:19.28  30.7%      swap in :        0
idle  :       0:06:15.61   0.6%      swap out:        0
uptime:      16:06:08.94             context :169218538

irq  0:  14486202 timer                 irq 12:         3
irq  1:         8 i8042                 irq 24:  18536343 aic79xx
irq  2:         0 cascade [4]           irq 25: 175472322 aic79xx, eth3
irq  4:       350 serial                irq 28: 286219024 acenic

-------

Linux newsgate 2.6.13-rc6 #1 Sun Aug 7 21:27:42 CEST 2005 x86_64 GNU/Linux
Gnu C 4.0.2
Gnu make 3.80
binutils 2.16.1
util-linux 2.12p
mount 2.12p
module-init-tools 3.2-pre1
e2fsprogs 1.38
reiserfsprogs line
reiser4progs line
nfs-utils 1.0.7
Linux C Library 2.3.5
Dynamic linker (ldd) 2.3.2
Procps 3.2.5
Net-tools 1.60
Console-tools 0.2.3
Sh-utils 5.2.1
Modules Loaded genrtc evdev hw_random i2c_amd8111 tg3 e100 mii w83627hf eeprom lm85 i2c_sensor i2c_isa i2c_amd756 i2c_core rawfs psmouse

Looks promising.

Danny
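
The shared interrupt line mentioned in this message can be spotted without procinfo. The sketch below is only an illustration, assuming the usual /proc/interrupts layout in which devices sharing a line are listed comma-separated at the end; `shared_irqs` is a hypothetical helper name, not an existing tool.

```shell
# List the IRQ numbers that have more than one device attached.
# $1: file to read (defaults to /proc/interrupts).
# Assumption: shared lines show their devices as "dev1, dev2".
shared_irqs() {
    awk -F: '/,/ { gsub(/^ +/, "", $1); print $1 }' "${1:-/proc/interrupts}"
}
```

Run as `shared_irqs` on the affected machine; a line such as "25: ... aic79xx, eth3" would be reported as 25.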

Danny ter Haar

Aug 8, 2005, 10:59:55 AM8/8/05

to linux-...@vger.kernel.org

I <d...@picard.cistron.nl> wrote:
>rc6 [KNOCK WOOD] seems to work just (so far)

It barfed after 18 hours:

scsi1:0:14:0: Attempting to abort cmd ffff810038f6dd40: 0x2a 0x0 0x3 0x91 0x45 0x10 0x0 0x0 0x1 0x0
scsi1: At time of recovery, card was not paused
>>>>>>>>>>>>>>>>>> Dump Card State Begins <<<<<<<<<<<<<<<<<
scsi1: Dumping Card State at program address 0x13 Mode 0x11
Card was paused

I will follow up on Linus's announcement with more details.

Helge Hafting

Aug 9, 2005, 8:16:54 AM8/9/05

to Dave Airlie, Linus Torvalds, Linux Kernel Mailing List

Dave Airlie wrote:

> I switched back to 2.6.13-rc4-mm1 at this point for another reason,
> my X display acquired a nasty tendency to go blank for no reason
> during work, something I could fix by changing resolution back and
> forth. X also tended to get stuck for a minute now and then - a
> problem I haven't seen since early 2.6.
>
> which head the radeon or MGA or both?

The radeon 9200SE-pci gets stuck. The MGA-agp seems to be fine. I have
compiled dri support for both, but I can't use it at the moment. I think
that is caused by having ubuntu's xorg installed on debian. I needed xorg
in order to run an xserver that doesn't use any tty - this way I can use
two keyboards and have two simultaneous users. Debian's xorg wasn't ready
at the time. The setup is fine with 2.6.13-rc4-mm1 x86-64, no problems
there.

Helge Hafting

Aug 12, 2005, 5:54:34 AM8/12/05

to Linux Kernel Mailing List, Dave Airlie, Linus Torvalds, ak...@osdl.org

Helge Hafting wrote:

> Dave Airlie wrote:
>
>> I switched back to 2.6.13-rc4-mm1 at this point for another reason,
>> my X display acquired a nasty tendency to go blank for no reason
>> during work, something I could fix by changing resolution back and
>> forth. X also tended to get stuck for a minute now and then - a
>> problem I haven't seen since early 2.6.
>>
>> which head the radeon or MGA or both?
>
> The radeon 9200SE-pci gets stuck. The MGA-agp seems to be fine. I have
> compiled dri support for both, but I can't use it at the moment. I think
> that is caused by having ubuntu's xorg installed on debian. I needed xorg
> in order to run an xserver that doesn't use any tty - this way I can use
> two keyboards and have two simultaneous users. Debian's xorg wasn't ready
> at the time. The setup is fine with 2.6.13-rc4-mm1 x86-64, no
> problems there.

The problem still exists in 2.6.13-rc6. Usually, all I get is a
suddenly black display, solvable by resizing. But the machine will
occasionally hang, forcing me to use the reset button. I lost my mbox
file to this (from an ext3 fs, on raid-1 on scsi.)

It is hard to say whether the fs problem is merely an effect of hanging
with rc6. With rc5, there definitely was some sort of io/scsi problem, as
one disk was "lost" until I booted a working kernel.

Currently, it seems like I won't be able to use 2.6.13.

Alan Cox

Aug 12, 2005, 6:09:01 AM8/12/05

to Helge Hafting, Linux Kernel Mailing List, Dave Airlie, Linus Torvalds, ak...@osdl.org

On Gwe, 2005-08-12 at 12:01 +0200, Helge Hafting wrote:
> solvable by resizing. But the machine will occasionally hang, forcing
> me to use the reset button. I lost my mbox file to this (from an ext3
> fs, on raid-1 on scsi.)

Unless you are using data=journal and have turned write cache off on
your IDE drives that is expected. Metadata journalling protects your
file system integrity. Data journalling is more expensive but will
protect your file integrity if the disk layer is also correctly set up.
Unfortunately the IDE layer defaults the wrong way and despite many
complaints has not been changed. In later 2.6 with modern drives you can
also enable barrier mode on the IDE layer which gives better results
than turning off the write cache.

Alan

Linus Torvalds

Aug 12, 2005, 12:54:50 PM8/12/05

to Helge Hafting, Linux Kernel Mailing List, Dave Airlie, ak...@osdl.org

On Fri, 12 Aug 2005, Helge Hafting wrote:
>
> > at the moment. The setup is fine with 2.6.13-rc4-mm1 x86-64, no
> > problems there.
>
> The problem still exists in 2.6.13-rc6. Usually, all I get is a
> suddenly black display, solveable by resizing.

Is there any chance you could try bisecting the problem? Either just
binary-searching the patches or by using the git bisect helper scripts?

Obviously the git approach needs a "good" kernel in git, but if
2.6.13-rc4-mm1 is ok, then I assume that 2.6.13-rc4 is ok too? That's a
fair number of changes:

git-rev-list v2.6.13-rc4..v2.6.13-rc6 | wc
340 340 13940

but if you can tighten it up a bit (you already had trouble at rc5, I
think), it shouldn't require testing more than a few kernels.

Git has had bisection support for a while, but the helper scripts to use
it sanely are fairly new, so I think you'd need the git-0.99.4 release for
those. But then you'd just do

git bisect start
git bisect bad v2.6.13-rc5
git bisect good v2.6.13-rc4

and start bisecting (that will check out a mid-way point automatically,
you build it, and then do "git bisect bad" or "git bisect good" depending
on whether the result is bad or good - it will continue to try to find
half-way points until it has found the point that turns from good to
bad..)

Linus
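
As a rough check on the "not more than a few kernels" estimate: each bisection round halves the range of candidate commits, so the 340 commits counted above need on the order of nine builds in the worst case. The sketch below only illustrates that arithmetic. It assumes a linear history, and `bisect_steps` is a hypothetical helper, not a git command; real git bisect works on a commit graph, so its counts can differ slightly.

```shell
# Estimate how many kernels have to be built and tested when $1 candidate
# commits lie between the known-good and the known-bad point.
bisect_steps() {
    remaining=$1
    steps=0
    while [ "$remaining" -gt 1 ]; do
        # Worst case: the culprit is always in the larger half.
        remaining=$(( (remaining + 1) / 2 ))
        steps=$(( steps + 1 ))
    done
    echo "$steps"
}

bisect_steps 340
```

Tightening the range first (for example to rc4..rc5 only) removes one or more rounds, which is why narrowing the known-bad point is worth doing before starting.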

Helge Hafting

Aug 15, 2005, 8:30:44 AM8/15/05

to Linus Torvalds, Linux Kernel Mailing List, Dave Airlie, ak...@osdl.org

Linus Torvalds wrote:

Ok, I have downloaded git and started the first compile.
Git will tell when the correct point is found (assuming I
do the "git bisect bad/good" right), by itself?

Is there any way to make git tell exactly where between rc4 and rc5
each kernel is, so I can name the bzimages accordingly?

It takes some time to trigger the bug, so I could possibly end up with
a falsely ok kernel. Is there a simple way to restart the search from
that point, or will I have to start over with rc4 and rc5 and say
git bisect good/bad until I reach the point of mistake?

Helge Hafting

Bartlomiej Zolnierkiewicz

Aug 15, 2005, 8:54:28 AM8/15/05

to Alan Cox, Helge Hafting, Linux Kernel Mailing List, Dave Airlie, Linus Torvalds, ak...@osdl.org

On 8/12/05, Alan Cox <al...@lxorguk.ukuu.org.uk> wrote:
> On Gwe, 2005-08-12 at 12:01 +0200, Helge Hafting wrote:
> > solvable by resizing. But the machine will occasionally hang, forcing
> > me to use the reset button. I lost my mbox file to this (from an ext3
> > fs, on raid-1 on scsi.)
>
> Unless you are using data=journal and have turned write cache off on
> your IDE drives that is expected. Metadata journalling protects your
> file system integrity. Data journalling is more expensive but will
> protect your file integrity if the disk layer is also correctly set up.
> Unfortunately the IDE layer defaults the wrong way and despite many
> complaints has not been changed. In later 2.6 with modern drives you can

Changing defaults is not that easy, disabling write-cache shortens HDD
life considerably (discussed on LKML).

Recommended solution is to disable write-cache w/ hdparm or use barrier mode.

> also enable barrier mode on the IDE layer which gives better results
> than turning off the write cache.

Bartlomiej Zolnierkiewicz

Aug 15, 2005, 9:01:53 AM8/15/05

to Alan Cox, Helge Hafting, Linux Kernel Mailing List, Dave Airlie, Linus Torvalds, ak...@osdl.org

On 8/15/05, Bartlomiej Zolnierkiewicz <bzol...@gmail.com> wrote:
> On 8/12/05, Alan Cox <al...@lxorguk.ukuu.org.uk> wrote:
> > On Gwe, 2005-08-12 at 12:01 +0200, Helge Hafting wrote:
> > > solvable by resizing. But the machine will occasionally hang, forcing
> > > me to use the reset button. I lost my mbox file to this (from an ext3
> > > fs, on raid-1 on scsi.)
> >
> > Unless you are using data=journal and have turned write cache off on
> > your IDE drives that is expected. Metadata journalling protects your
> > file system integrity. Data journalling is more expensive but will
> > protect your file integrity if the disk layer is also correctly set up.
> > Unfortunately the IDE layer defaults the wrong way and despite many
> > complaints has not been changed. In later 2.6 with modern drives you can
>
> Changing defaults is not that easy, disabling write-cache shortens HDD
> life considerably (discussed on LKML).
>
> Recommended solution is to disable write-cache w/ hdparm or use barrier mode.
>
> > also enable barrier mode on the IDE layer which gives better results
> > than turning off the write cache.

Moreover, Helge is using RAID-1 on SCSI, so IDE is out of the picture here.

Linus Torvalds

Aug 15, 2005, 11:53:23 AM8/15/05

to Helge Hafting, Linux Kernel Mailing List, Dave Airlie, ak...@osdl.org

On Mon, 15 Aug 2005, Helge Hafting wrote:
>
> Ok, I have downloaded git and started the first compile.
> Git will tell when the correct point is found (assuming I
> do the "git bisect bad/good" right), by itself?

Yes. You should see

Bisecting: xxx revisions left to test after this

and the "xxx" should hopefully decrease by half during each round. And at
the end of it, you should get

<sha1> is first bad commit

followed by the actual patch that caused the problem.

> Is there any way to make git tell exactly where between rc4 and rc5
> each kernel is, so I can name the bzimages accordingly?

You'd have to use the raw commit names, since these things don't have any
symbolic names. You can get that by just doing

cat .git/HEAD

which will give you a 40-character hex string (representing the 160-bit
SHA1 of the top commit). Not very readable, but it's unique, and if you
report that hex string to other git users, they can trivially recreate the
tree you have.

> It takes some time to trigger the bug, so I could possibly end up with
> a falsely ok kernel. Is there a simple way to restart the search from
> that point, or will I have to start over with rc4 and rc5 and say
> git bisect good/bad until I reach the point of mistake?

If you remember/save the good/bad commit ID's, you can restart the whole
process and just feed the correct state for the ID's:

git bisect start
git bisect bad v2.6.13-rc5
git bisect good v2.6.13-rc4

.. here bisect will start narrowing things down ..
git bisect bad <sha1 of known bad>
git bisect good <sha1 of known good>
..

ie you can always feed an arbitrary number of known good/bad points by
naming them by their SHA1 ID (or their symbolic name, as in the
v2.6.13-rcX releases).

Linus
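
The restart procedure described above can be scripted once the verdicts are written down. The sketch below is an illustration only: it assumes a hand-kept text file with one "<sha1> good" or "<sha1> bad" entry per line, and `replay_commands` is a hypothetical helper that prints the commands rather than running them, so they can be reviewed first.

```shell
# Turn a saved list of "<sha1> good|bad" lines into the git bisect
# commands that feed those verdicts back in.
# $1: file holding the saved verdicts (format is an assumption).
replay_commands() {
    echo "git bisect start"
    while read -r sha verdict rest; do
        case "$verdict" in
            good|bad) echo "git bisect $verdict $sha" ;;
        esac                    # blank or malformed lines are skipped
    done < "$1"
}
```

Dropping a doubtful "good" line from the file before replaying restarts the search without that verdict. Later git versions also ship `git bisect log` and `git bisect replay`, which cover the same need.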

Ryan Anderson

Aug 15, 2005, 1:00:43 PM8/15/05

to Linus Torvalds, Helge Hafting, Linux Kernel Mailing List, Dave Airlie, ak...@osdl.org

On Mon, Aug 15, 2005 at 08:50:12AM -0700, Linus Torvalds wrote:
> > Is there any way to make git tell exactly where between rc4 and rc5
> > each kernel is, so I can name the bzimages accordingly?
>
> You'd have to use the raw commit names, since these things don't have any
> symbolic names. You can get that by just doing
>
> cat .git/HEAD
>
> which will give you a 40-character hex string (representing the 160-bit
> SHA1 of the top commit). Not very readable, but it's unique, and if you
> report that hex string to other git users, they can trivially recreate the
> tree you have.

The following patch (which Sam has in the kbuild tree for 2.6.14, IIRC)
will make that automatic, or you can just do:

ln -s .git/HEAD localversion-git

(My patch will notice when you are at a tag and not append anything
special in that case.)

Index: linux-git/Makefile
===================================================================
--- linux-git.orig/Makefile 2005-07-31 04:30:00.000000000 -0400
+++ linux-git/Makefile 2005-07-31 04:32:16.000000000 -0400
@@ -551,6 +551,26 @@ export KBUILD_IMAGE ?= vmlinux
# images. Default is /boot, but you can set it to other values
export INSTALL_PATH ?= /boot

+# If CONFIG_LOCALVERSION_AUTO is set, we automatically perform some tests
+# and try to determine if the current source tree is a release tree, of any sort,
+# or if is a pure development tree.
+#
+# A 'release tree' is any tree with a git TAG associated
+# with it. The primary goal of this is to make it safe for a native
+# git/CVS/SVN user to build a release tree (i.e, 2.6.9) and also to
+# continue developing against the current Linus tree, without having the Linus
+# tree overwrite the 2.6.9 tree when installed.
+#
+# Currently, only git is supported.
+# Other SCMs can edit scripts/setlocalversion and add the appropriate
+# checks as needed.
+
+
+ifdef CONFIG_LOCALVERSION_AUTO
+ localversion-auto := $(shell $(PERL) $(srctree)/scripts/setlocalversion $(srctree))
+ LOCALVERSION := $(LOCALVERSION)$(localversion-auto)
+endif
+
#
# INSTALL_MOD_PATH specifies a prefix to MODLIB for module directory
# relocations required by build roots. This is not defined in the
Index: linux-git/init/Kconfig
===================================================================
--- linux-git.orig/init/Kconfig 2005-07-31 04:30:00.000000000 -0400
+++ linux-git/init/Kconfig 2005-07-31 04:32:16.000000000 -0400
@@ -77,6 +77,22 @@ config LOCALVERSION
object and source tree, in that order. Your total string can
be a maximum of 64 characters.

+config LOCALVERSION_AUTO
+ bool "Automatically append version information to the version string"
+ default y
+ help
+ This will try to automatically determine if the current tree is a
+ release tree by looking for git tags that
+ belong to the current top of tree revision.
+
+ A string of the format -gxxxxxxxx will be added to the localversion
+ if a git based tree is found. The string generated by this will be
+ appended after any matching localversion* files, and after the value
+ set in CONFIG_LOCALVERSION
+
+ Note: This requires Perl, and a git repository, but not necessarily
+ the git or cogito tools to be installed.
+
config SWAP
bool "Support for paging of anonymous memory (swap)"
depends on MMU
Index: linux-git/scripts/setlocalversion
===================================================================
--- /dev/null 1970-01-01 00:00:00.000000000 +0000
+++ linux-git/scripts/setlocalversion 2005-07-31 04:32:16.000000000 -0400
@@ -0,0 +1,56 @@
+#!/usr/bin/perl
+# Copyright 2004 - Ryan Anderson <ry...@michonline.com> GPL v2
+
+use strict;
+use warnings;
+use Digest::MD5;
+require 5.006;
+
+if (@ARGV != 1) {
+ print <<EOT;
+Usage: setlocalversion <srctree>
+EOT
+ exit(1);
+}
+
+my ($srctree) = @ARGV;
+chdir($srctree);
+
+my @LOCALVERSIONS = ();
+
+# We are going to use the following commands to try and determine if this
+# repository is at a Version boundary (i.e, 2.6.10 vs 2.6.10 + some patches) We
+# currently assume that all meaningful version boundaries are marked by a tag.
+# We don't care what the tag is, just that something exists.
+
+# Git/Cogito store the top-of-tree "commit" in .git/HEAD
+# A list of known tags sits in .git/refs/tags/
+#
+# The simple trick here is to just compare the two of these, and if we get a
+# match, return nothing, otherwise, return a subset of the SHA-1 hash in
+# .git/HEAD
+
+sub do_git_checks {
+ open(H,"<.git/HEAD") or return;
+ my $head = <H>;
+ chomp $head;
+ close(H);
+
+ opendir(D,".git/refs/tags") or return;
+ foreach my $tagfile (grep !/^\.{1,2}$/, readdir(D)) {
+ open(F,"<.git/refs/tags/" . $tagfile) or return;
+ my $tag = <F>;
+ chomp $tag;
+ close(F);
+ return if ($tag eq $head);
+ }
+ closedir(D);
+
+ push @LOCALVERSIONS, "g" . substr($head,0,8);
+}
+
+if ( -d ".git") {
+ do_git_checks();
+}
+
+printf "-%s\n", join("-",@LOCALVERSIONS) if (scalar @LOCALVERSIONS > 0);

--

Ryan Anderson
sometimes Pug Majere
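
For readers who want the idea without reading the Perl, the comparison that the script performs can be sketched in shell. This is an illustration, not the patch itself: `local_suffix` is a hypothetical name, the paths are passed in as arguments, and it assumes the layout described in the patch comments, where .git/HEAD and each file under .git/refs/tags/ hold a raw 40-character id.

```shell
# Print "-g<first 8 hex digits of HEAD>" unless a tag file holds the same
# id, in which case the tree is treated as a release and nothing is printed.
# $1: file holding the HEAD commit id, $2: directory of tag files.
local_suffix() {
    head_id=$(cat "$1") || return 1
    for tagfile in "$2"/*; do
        [ -f "$tagfile" ] || continue
        if [ "$(cat "$tagfile")" = "$head_id" ]; then
            return 0            # tagged release: append nothing
        fi
    done
    printf '%s%.8s\n' '-g' "$head_id"
}
```

Typical use under those assumptions would be `local_suffix .git/HEAD .git/refs/tags`. Repositories where .git/HEAD is a symbolic reference would need the id resolved first, for example with `git rev-parse HEAD`.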

Helge Hafting

Aug 15, 2005, 1:41:11 PM8/15/05

to Linus Torvalds, Linux Kernel Mailing List

On Mon, Aug 15, 2005 at 08:50:12AM -0700, Linus Torvalds wrote:
>
>

> On Mon, 15 Aug 2005, Helge Hafting wrote:
> >
> > Ok, I have downloaded git and started the first compile.
> > Git will tell when the correct point is found (assuming I
> > do the "git bisect bad/good" right), by itself?
>
> Yes. You should see
>
> Bisecting: xxx revisions left to test after this
>
> and the "xxx" should hopefully decrease by half during each round. And at
> the end of it, you should get
>
> <sha1> is first bad commit
>
> followed by the actual patch that caused the problem.
>
> > Is there any way to make git tell exactly where between rc4 and rc5
> > each kernel is, so I can name the bzimages accordingly?
>
> You'd have to use the raw commit names, since these things don't have any
> symbolic names. You can get that by just doing
>
> cat .git/HEAD
>
> which will give you a 40-character hex string (representing the 160-bit
> SHA1 of the top commit). Not very readable, but it's unique, and if you
> report that hex string to other git users, they can trivially recreate the
> tree you have.
>

Good. I save those .git/HEAD strings to a separate file.
The first iteration
a46e812620bd7db457ce002544a1a6572c313d8a
seemed to turn out "good". I test further during the compile of
the next one.

Thanks for all the instructions on using git.

Helge Hafting

Sanjoy Mahajan

Aug 15, 2005, 5:48:48 PM8/15/05

to Helge Hafting, Linux Kernel Mailing List, Linus Torvalds

>> Is there any way to make git tell exactly where between rc4 and rc5
>> each kernel is, so I can name the bzimages accordingly?
>
> You'd have to use the raw commit names, since these things don't have any
> symbolic names. You can get that by just doing
>
> cat .git/HEAD

Also, don't name the local version something like
2.6.13-rc6:e63b6d5ac1e17d0d9e5112bd9c0e5f17199b23da otherwise LILO
complains. For example, this bit in lilo.conf

image=/boot/vmlinuz-2.6.12:b5e43913cfe95a18ad8929585a0bb58e46cf3390
label=bisect1

produces when you run lilo:

:BIOS syntax is no longer supported. Please use a DISK section
Fatal: Not a number: "b5e43913cfe95a18ad8929585a0bb58e46cf3390"

So in my kernel tree used for bisections, 'localversion' contains

-b5e43913cfe95a18ad8929585a0bb58e46cf3390

I don't fully understand when git (doing the checkout that is implicit
in git bisect) will overwrite or not overwrite local files, or when it
will create files not in a previous version, or delete files not in a
current version. So, to be sure I'm getting a clean compile from
exactly the source files I want (probably overkill), I use 'git
bisect' to get the SHA1 id's, and then do:

#!/bin/bash
sha1=`cat .git/HEAD`
dest="/usr/src/bisect/$sha1"
cg-export $dest $sha1
cp dot-config-to-test $dest/.config
cd $dest
echo "-$sha1" > localversion
# accept defaults for all new config options:
yes "" | make oldconfig
make -j 4 >& compile.log &
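
The LILO complaint above comes from the ":" in the image path, so one way to keep image names safe is to build them with "-" only. A minimal sketch, assuming the vmlinuz-<version>-<id> naming used in this message; `image_name` is a hypothetical helper:

```shell
# Build a boot image file name from a kernel version and a commit id,
# replacing any ":" with "-" because LILO misparses ":" in image paths.
image_name() {
    printf 'vmlinuz-%s-%s\n' "$1" "$2" | tr ':' '-'
}

image_name 2.6.12 b5e43913cfe95a18ad8929585a0bb58e46cf3390
```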

Helge Hafting

Aug 15, 2005, 6:04:02 PM8/15/05

to Linus Torvalds, Linux Kernel Mailing List, Dave Airlie, ak...@osdl.org

On Mon, Aug 15, 2005 at 08:50:12AM -0700, Linus Torvalds wrote:
>
>

> On Mon, 15 Aug 2005, Helge Hafting wrote:
> >
> > Ok, I have downloaded git and started the first compile.
> > Git will tell when the correct point is found (assuming I
> > do the "git bisect bad/good" right), by itself?
>
> Yes. You should see
>
> Bisecting: xxx revisions left to test after this
>
> and the "xxx" should hopefully decrease by half during each round. And at
> the end of it, you should get
>
> <sha1> is first bad commit
>
> followed by the actual patch that caused the problem.
>

This was interesting. At first, lots of kernels just kept working,
I almost suspected I was doing something wrong. Then the second last kernel
recompiled a lot of DRM stuff - and the crash came back!
The kernel after that worked again, and so the final message was:

561fb765b97f287211a2c73a844c5edb12f44f1d is first bad commit
diff-tree 561fb765b97f287211a2c73a844c5edb12f44f1d (from
6ade43fbbcc3c12f0ddba112351d14d6c82ae476)
Author: Anton Blanchard <an...@samba.org>
Date: Mon Aug 1 21:11:46 2005 -0700

[PATCH] ppc64: topology API fix

Dont include asm-generic/topology.h unconditionally, we end up overriding
all the ppc64 specific functions when NUMA is on.

Signed-off-by: Anton Blanchard <an...@samba.org>
Acked-by: Paul Mackerras <pau...@samba.org>
Signed-off-by: Andrew Morton <ak...@osdl.org>
Signed-off-by: Linus Torvalds <torv...@osdl.org>

:040000 040000 a760521110f862aecbee74cffa674993b6dca4a3
66b9cb2db119ab029ca7b8f71bd06507fca63921 M include

I'm a little surprised, as a ppc64 fix theoretically shouldn't matter for
x86_64? But perhaps they share something?

I hope this is of help,
Helge Hafting

Linus Torvalds

Aug 15, 2005, 7:00:43 PM8/15/05

to Helge Hafting, Linux Kernel Mailing List, Dave Airlie, ak...@osdl.org

On Tue, 16 Aug 2005, Helge Hafting wrote:
>
> This was interesting. At first, lots of kernels just kept working,
> I almost suspected I was doing something wrong. Then the second last kernel
> recompiled a lot of DRM stuff - and the crash came back!
> The kernel after that worked again, and so the final message was:
>
> 561fb765b97f287211a2c73a844c5edb12f44f1d is first bad commit

Ok, that definitely looks bogus.

That commit should not matter at _all_, it only changes ppc64 specific
things.

If the bug is sometimes hard to trigger, maybe one of the "good" kernels
wasn't good after all. That would definitely throw a wrench in the
bisection.

Anyway, with something like this, where there may be false positives
(false "good" kernels), the only thing you can _really_ trust is a kernel
that got marked bad, because that one definitely has the problem. So make
sure that you remember all known-bad kernels.

Btw, we haven't had a lot of testing of the termination condition for "git
bisect", so it's possible it's off by a commit or two. However, the commit
you actually ended up on is literally just two commits before 2.6.13-rc5,
which makes me suspect that it's not the termination condition, as much as
the fact that it really was an earlier kernel that had the problem, but
you bisected it as "good" because the problem just didn't trigger quickly
enough..

Linus
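
The effect of a single false "good" can be seen in a toy model. The sketch below is an illustration under simplifying assumptions, not how git itself is implemented: history is linear and numbered 1..N, `simulate_bisect` is a hypothetical helper, and one bad commit can be made to look good, standing in for a bug that did not trigger during testing.

```shell
# Bisect a linear history numbered 1..$1 whose real first bad commit is $2.
# Commit $3 (optional) is wrongly reported "good" although it is bad.
simulate_bisect() {
    good=0                      # highest commit believed good (0 = base)
    bad=$1                      # lowest commit known bad
    while [ $(( bad - good )) -gt 1 ]; do
        mid=$(( (good + bad) / 2 ))
        if [ "$mid" -ge "$2" ] && [ "$mid" != "${3:-}" ]; then
            bad=$mid
        else
            good=$mid
        fi
    done
    echo "$bad"                 # reported "first bad commit"
}
```

With 340 commits and the real culprit at 100, `simulate_bisect 340 100` reports 100, while `simulate_bisect 340 100 170` reports 171: the search ends on the commit right after the falsely good one, far from the real cause. Every "bad" verdict along the way still points at a genuinely bad commit, which is why those are the ones worth recording.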

Dave Airlie

Aug 15, 2005, 7:19:26 PM8/15/05

to Helge Hafting, Linus Torvalds, Linux Kernel Mailing List, ak...@osdl.org

>
> I'm a little surprised, as a ppc64 fix theoretically shouldn't matter for
> x86_64? But perhaps they share something?

My guess is that it is maybe the DRM changes that have done it... the
32/64-bit code in 2.6.13-rc6 may have issues, but they've been tested
on a number of configurations (none of them by me... I can't test what
I don't have...)

Can you do me a favour and check 2.6.13-rc6 with the git-drm.patch from -mm?

http://www.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.13-rc5/2.6.13-rc5-mm1/broken-out/git-drm.patch

If this is a 32/64-bit issue I think that patch might help. I'm not
convinced, though: I can't see how the DRM would ever start blanking the
screen, it doesn't have any code in that area at all.. but stranger
things have surprised me...

Is there any difference in your Xorg.0.log files before/after this?

There is also an issue at:
http://bugme.osdl.org/show_bug.cgi?id=4965

which was caused by the pci assign resources patch on x86... I'm not
sure if this is similar..

Dave.

Dave Airlie

Aug 15, 2005, 7:25:04 PM8/15/05

to Helge Hafting, Linus Torvalds, Linux Kernel Mailing List, ak...@osdl.org

> > I'm a little surprised, as a ppc64 fix theoretically shouldn't matter for
> > x86_64? But perhaps they share something?
>
> My guess is that it is maybe the DRM changes that have done it... the
> 32/64-bit code in 2.6.13-rc6 may have issues, but they've been tested
> on a number of configurations (none of them by me... I can't test what
> I don't have...)
>

Actually after looking back 2.6.13-rc4-mm1 which you say works doesn't
contain any of the later 32/64-bit changes.. so maybe you can try just
applying the git-drm.patch from that tree to see if it makes a
difference...

I'm getting less and less sure this is caused by the drm, (have you
built with DRM disabled completely??)

Do you have any fb support in-kernel (I know you might have answered
this already but I'm getting a bit lost on this thread...)

Helge Hafting

Aug 16, 2005, 3:28:14 AM8/16/05

to Dave Airlie, Linus Torvalds, Linux Kernel Mailing List, ak...@osdl.org

On Tue, Aug 16, 2005 at 09:24:25AM +1000, Dave Airlie wrote:
> > > I'm a little surprised, as a ppc64 fix theoretically shouldn't matter for
> > > x86_64? But perhaps they share something?
> >
> > My guess is that it is maybe the DRM changes that have done it... the
> > 32/64-bit code in 2.6.13-rc6 may have issues, but they've been tested
> > on a number of configurations (none of them by me... I can't test what
> > I don't have...)
> >
>
> Actually after looking back 2.6.13-rc4-mm1 which you say works doesn't
> contain any of the later 32/64-bit changes.. so maybe you can try just
> applying the git-drm.patch from that tree to see if it makes a
> difference...
>
> I'm getting less and less sure this is caused by the drm, (have you
> built with DRM disabled completely??)
>

No, but I can try that after work today.

> Do you have any fb support in-kernel (I know you might have answered
> this already but I'm getting a bit lost on this thread...)

There is no fb support at all. I have the vga console,
agp support (which obviously only applies to the agp g550)
drm/dri support for g550 and for the pci radeon.
Could the new patches possibly have issues with the case
where AGP support is compiled into the kernel, but
the card is pci so it isn't supposed to _use_ it?
Also, the two cards aren't used by the same user, it
is two desktops, not one big one.

The X freeze comes fast if I play "cuyo", a nice 2D game
somewhat similar to tetris. I don't think it
uses DRM, unless x.org 6.8.2 somehow uses it to
speed up 2D operations.

The bisection search:
a46e812620bd7db457ce002544a1a6572c313d8a good
e0b98c79e605f64f263ede53344f283f5e0548be good
fd3113e84e188781aa2935fbc4351d64ccdd171b good
2757a71c3122c7653e3dd8077ad6ca71efb1d450 good
ba17101b41977f124948e0a7797fdcbb59e19f3e good
saw lots of drm recompile for the next one:
561fb765b97f287211a2c73a844c5edb12f44f1d bad

6ade43fbbcc3c12f0ddba112351d14d6c82ae476 good
And then the final one also seemed good.
If the stop condition could be off by one,
wonder what the next patch is?

Helge Hafting

Aug 16, 2005, 4:38:49 AM8/16/05

to Linus Torvalds, Helge Hafting, Linux Kernel Mailing List, Dave Airlie, ak...@osdl.org

Linus Torvalds wrote:

>On Tue, 16 Aug 2005, Helge Hafting wrote:
>
>
>>This was interesting. At first, lots of kernels just kept working,
>>I almost suspected I was doing something wrong. Then the second last kernel
>>recompiled a lot of DRM stuff - and the crash came back!
>>The kernel after that worked again, and so the final message was:
>>
>>561fb765b97f287211a2c73a844c5edb12f44f1d is first bad commit
>>
>>
>
>Ok, that definitely looks bogus.
>
>That commit should not matter at _all_, it only changes ppc64 specific
>things.
>
>If the bug is sometimes hard to trigger, maybe one of the "good" kernels
>wasn't good after all. That would definitely throw a wrench in the
>bisection.
>
>

The bisection search:

a46e812620bd7db457ce002544a1a6572c313d8a good
e0b98c79e605f64f263ede53344f283f5e0548be good
fd3113e84e188781aa2935fbc4351d64ccdd171b good
2757a71c3122c7653e3dd8077ad6ca71efb1d450 good

ba17101b41977f124948e0a7797fdcbb59e19f3e good, this one has got more testing,
as my default kernel to boot for the moment.

saw lots of drm recompile for the next one:
561fb765b97f287211a2c73a844c5edb12f44f1d bad

6ade43fbbcc3c12f0ddba112351d14d6c82ae476 good
I'll test this one more to see if it is a false positive, and I'll
also test a known bad kernel without DRM.

Helge Hafting

Aug 16, 2005, 12:45:42 PM8/16/05

to Dave Airlie, Linus Torvalds, Linux Kernel Mailing List, ak...@osdl.org

On Tue, Aug 16, 2005 at 09:24:25AM +1000, Dave Airlie wrote:

> > > I'm a little surprised, as a ppc64 fix theoretically shouldn't matter for
> > > x86_64? But perhaps they share something?
> >
> > My guess is that it is maybe the DRM changes that have done it... the
> > 32/64-bit code in 2.6.13-rc6 may have issues, but they've been tested
> > on a number of configurations (none of them by me... I can't test what
> > I don't have...)
> >
>
> Actually after looking back 2.6.13-rc4-mm1 which you say works doesn't
> contain any of the later 32/64-bit changes.. so maybe you can try just
> applying the git-drm.patch from that tree to see if it makes a
> difference...
>
> I'm getting less and less sure this is caused by the drm, (have you
> built with DRM disabled completely??)
>

I tried rc6 with DRM turned off. That kernel consistently _died_ when
trying to start xdm. Xorg logs for both cards ended like this:

(II) LoadModule: "pcidata"
(II) Loading /usr/X11R6/lib/modules/libpcidata.a

Of course the last block of the log may be lost, as this crash
blocked even sysrq so it is reasonable to assume that the disk drivers
and filesystems froze up too.

I can retry this with a synchronously mounted /var, if the last lines
of the Xorg logs might be interesting.

Helge Hafting

Linus Torvalds

Aug 16, 2005, 1:01:32 PM

to Helge Hafting, Dave Airlie, Linux Kernel Mailing List, ak...@osdl.org

On Tue, 16 Aug 2005, Helge Hafting wrote:
>

> I tried rc6 with DRM turned off. That kernel consistently _died_ when
> trying to start xdm. Xorg logs for both cards ended like this:
>
> (II) LoadModule: "pcidata"
> (II) Loading /usr/X11R6/lib/modules/libpcidata.a

Ok, it does sound like your X server is doing something nasty on the PCI
bus.

> I can retry this with a synchronously mounted /var, if the last lines
> of the Xorg logs might be interesting.

It would be even more interesting if you have a serial console, but if
this is the X server stomping on the PCI bus, you might just have a total
lockup - no oops, no nothing.

One thing that might be interesting is to see if the old working kernel
has a different IO-map than the broken ones. A simple

cat /proc/ioports /proc/iomem > iomaps.kernel-version

and diffing the two might be an interesting thing to try. X has been known
to sometimes just try to re-configure things on its own without telling
(or asking) the kernel.

Linus
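
The snapshot-and-diff step Linus suggests can also be scripted. The following is a minimal sketch using Python's standard `difflib`; the function names and the sample snapshot text are illustrative assumptions, not output from the machine discussed here.

```python
import difflib


def diff_iomaps(old_text, new_text, old_name="iomaps.old", new_name="iomaps.new"):
    """Return unified-diff lines between two saved iomap snapshots."""
    return list(difflib.unified_diff(
        old_text.splitlines(), new_text.splitlines(),
        fromfile=old_name, tofile=new_name, lineterm="", n=0))


def changed_lines(diff_lines):
    """Keep only added/removed entries, dropping file headers and hunk markers."""
    return [line for line in diff_lines
            if line.startswith(("+", "-"))
            and not line.startswith(("+++", "---"))]


# Illustrative snapshots (made up for this example).
old = "0000-001f : dma1\n00100000-0041a94c : Kernel code\n"
new = ("0000-001f : dma1\n"
       "5000-5007 : viapro-smbus\n"
       "00100000-003fed39 : Kernel code\n")

for line in changed_lines(diff_iomaps(old, new)):
    print(line)
```

In practice the two strings would be read from the files written by the `cat` command above, one per kernel version.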

Helge Hafting

Aug 16, 2005, 3:22:35 PM

to Linus Torvalds, Linux Kernel Mailing List, Dave Airlie, ak...@osdl.org

On Mon, Aug 15, 2005 at 03:59:07PM -0700, Linus Torvalds wrote:
>
>
> On Tue, 16 Aug 2005, Helge Hafting wrote:
> >
> > This was interesting. At first, lots of kernels just kept working,
> > I almost suspected I was doing something wrong. Then the second last kernel
> > recompiled a lot of DRM stuff - and the crash came back!
> > The kernel after that worked again, and so the final message was:
> >
> > 561fb765b97f287211a2c73a844c5edb12f44f1d is first bad commit
>
> Ok, that definitely looks bogus.
>
> That commit should not matter at _all_, it only changes ppc64 specific
> things.
>
> If the bug is sometimes hard to trigger, maybe one of the "good" kernels
> wasn't good after all. That would definitely throw a wrench in the
> bisection.
>

The hang, or at least an X "pause", tends to happen within 5-10 minutes of
playing cuyo (2D game). I have now had the last good kernel
(6ade43fbbcc3c12f0ddba112351d14d6c82ae476) running for almost 24
hours, only interrupted by the brief test of drm-less rc6.

Normal use hasn't provoked anything. Since DRM sort of works with this
kernel, I tried tuxracer on the radeon. (Trouble is always with
the radeon, never the mga xserver). I played several games, ok
except for the usual lousy 5-9 fps. One time I had a "pause", the 3D-game
just froze for about half a minute. The other xserver kept
displaying firefox (and updating the page too) but I could not
start any processes there. I tried starting an xterm - it did not
appear until tuxracer "unfroze" and continued as if nothing happened.
Perhaps the frozen process held a lock?

Disk io seemed sluggish after that incident, and the load meter in
icewm seemed to indicate more waiting than usual. The logs tell
me of SCSI aborts and a bus reset. I booted into drm-less rc6 after
that.

Some interrupts are shared on this machine:
$ cat /proc/interrupts
           CPU0
  0:   10113154    IO-APIC-edge  timer
  1:        371    IO-APIC-edge  i8042
  2:          0          XT-PIC  cascade
  4:       5735    IO-APIC-edge  serial
  8:          0    IO-APIC-edge  rtc
 12:      11024    IO-APIC-edge  i8042
 14:         21    IO-APIC-edge  ide0
 16:     803248   IO-APIC-level  sym53c8xx, eth0, mga@pci:0000:01:00.0
 17:          0   IO-APIC-level  Trident Audio
 19:     755535   IO-APIC-level  radeon@pci:0000:00:08.0
 20:       5946   IO-APIC-level  libata
 21:       9448   IO-APIC-level  ehci_hcd:usb1, uhci_hcd:usb2, uhci_hcd:usb3, uhci_hcd:usb4
NMI:        234
LOC:   10111810
ERR:          0
MIS:          0

The troublesome radeon has an irq of its own. The scsi controller
shares irq with the matrox g550, but that card never seems to
cause any trouble, other than saturating the cpu during games. :-)
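
Which interrupt lines are shared can be read off mechanically from such a listing. This is a small sketch that assumes the single-CPU column layout shown above (IRQ number, one count per CPU, controller type, comma-separated device list); other kernels and multi-CPU machines lay the file out differently, and the function name is hypothetical.

```python
def shared_irqs(text, n_cpus=1):
    """Map IRQ number -> device names for lines with more than one device."""
    shared = {}
    for line in text.splitlines():
        fields = line.split()
        # Skip the CPU header and the NMI/LOC/ERR/MIS summary rows.
        if not fields or not fields[0].rstrip(":").isdigit():
            continue
        irq = int(fields[0].rstrip(":"))
        # Layout: irq, one count per CPU, controller type, then the devices.
        devices = " ".join(fields[2 + n_cpus:])
        names = [d.strip() for d in devices.split(",") if d.strip()]
        if len(names) > 1:
            shared[irq] = names
    return shared


sample = """\
CPU0
16: 803248 IO-APIC-level sym53c8xx, eth0, mga@pci:0000:01:00.0
17: 0 IO-APIC-level Trident Audio
19: 755535 IO-APIC-level radeon@pci:0000:00:08.0
NMI: 234
"""

print(shared_irqs(sample))
```

On the three sample rows only IRQ 16 is reported, carrying the SCSI controller, the network card and the Matrox card; the radeon on IRQ 19 is alone.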

On to look at iomem and that rc6 crash.

Helge Hafting

Aug 16, 2005, 5:07:23 PM

to Linus Torvalds, Dave Airlie, Linux Kernel Mailing List, ak...@osdl.org

On Tue, Aug 16, 2005 at 10:00:50AM -0700, Linus Torvalds wrote:
>
>
> On Tue, 16 Aug 2005, Helge Hafting wrote:
> >
> > I tried rc6 with DRM turned off. That kernel consistently _died_ when
> > trying to start xdm. Xorg logs for both cards ended like this:
> >
> > (II) LoadModule: "pcidata"
> > (II) Loading /usr/X11R6/lib/modules/libpcidata.a
>
> Ok, it does sound like your X server is doing something nasty on the PCI
> bus.
>
> > I can retry this with a synchronously mounted /var, if the last lines
> > of the Xorg logs might be interesting.
>
> It would be even more interesting if you have a serial console, but if
> this is the X server stomping on the PCI bus, you might just have a total
> lockup - no oops, no nothing.
>

Tricky - I have nothing to connect to the serial port.

> One thing that might be interesting is to see if the old working kernel
> has a different IO-map than the broken ones. A simple
>
> cat /proc/ioports /proc/iomem > iomaps.kernel-version
>
> and diffing the two might be an interesting thing to try. X has been known
> to sometimes just try to re-configure things on its own without telling
> (or asking) the kernel.

Diffing the iomaps thus obtained for
2.6.13-rc4-6ade43fbbcc3c12f0ddba112351d14d6c82ae476
and 2.6.13-rc6 produces this:
ba112351d14d6c82ae476 iomaps.2.6.13-rc6
17a18
> 5000-5007 : viapro-smbus
52,53c53,54
< 00100000-0041a94c : Kernel code
< 0041a94d-00695337 : Kernel data
---
> 00100000-003fed39 : Kernel code
> 003fed3a-00662f77 : Kernel data

rc6 has a somewhat smaller kernel, and a viapro-smbus.

The X.org logs also got further, with the synchronous mount:

The radeon log ended like this:
[31] -1 0 0x00009000 - 0x000090ff (0x100) IX[B]
[32] -1 0 0x00009800 - 0x000098ff (0x100) IX[B](B)
[33] 0 0 0x000003b0 - 0x000003bb (0xc) IS[B]
[34] 0 0 0x000003c0 - 0x000003df (0x20) IS[B]
(II) Setting vga for screen 0.
(II) RADEON(0): MMIO registers at 0xf6000000
(II) RADEON(0): PCI bus 0 card 8 func 0
(**) RADEON(0): Depth 24, (--) framebuffer bpp 32
(II) RADEON(0): Pixel depth = 24 bits stored in 4 bytes (32 bpp pixmaps)
(==) RADEON(0): Default visual is TrueColor
(**) RADEON(0): Option "EnablePageFlip" "off"
(**) RADEON(0): Option "DynamicClocks" "off"
(II) Loading sub module "vgahw"
(II) LoadModule: "vgahw"
(II) Loading /usr/X11R6/lib/modules/libvgahw.a
(II) Module vgahw: vendor="X.Org Foundation"
compiled for 6.8.2, module version = 0.1.0
ABI class: X.Org Video Driver, version 0.7
(II) RADEON(0): vgaHWGetIOBase: hwp->IOBase is 0x03b0, hwp->PIOOffset is 0x0000
(==) RADEON(0): RGB weight 888
(II) RADEON(0): Using 8 bits per RGB (8 bit DAC)
(II) Loading sub module "int10"
(II) LoadModule: "int10"
(II) Reloading /usr/X11R6/lib/modules/libint10.a
(II) RADEON(0): initializing int10
(**) RADEON(0): Option "InitPrimary" "on"

It stopped here, while it normally goes on with:
(II) Truncating PCI BIOS Length to 53248
(--) RADEON(0): Chipset: "ATI Radeon 9200SE 5964 (AGP)" (ChipID = 0x5964)
(--) RADEON(0): Linear framebuffer at 0xe0000000
(--) RADEON(0): BIOS at 0x1ff00000
(--) RADEON(0): VideoRAM: 131072 kByte (64 bit DDR SDRAM)
(II) RADEON(0): PCI card detected
(II) Loading sub module "ddc"
..

Seems like it died trying to perform int10 initialization?

The matrox log stopped inside a listing of resource ranges after preInit:
[29] -1 0 0x0000ac00 - 0x0000ac0f (0x10) IX[B]
[30] -1 0 0x0000a800 - 0x0000a803 (0x4) IX[B]
[31] -1 0 0x0000a400 - 0x0000a407 (0x8) IX[B]
[32] -1 0 0x0000a000 - 0x0000a003 (0x4) IX[B]
[33] -1 0 0x00009c00 - 0x00009c07 (0x8) IX[B]
[34] -1 0 0x00009400 - 0x000094ff (0x100) IX[B]
[35] -1 0 0x00009000 - 0x000090ff (0x100) IX[B]
[36] 0 0 0x000003b0 - 0x000003bb (0xc) IS[B]

Normally, this continues with:
[37] 0 0 0x000003c0 - 0x000003df (0x20) IS[B](OprU)
(==) MGA(0): Write-combining range (0xf0000000,0x2000000)
(II) MGA(0): vgaHWGetIOBase: hwp->IOBase is 0x03d0, hwp->PIOOffset is 0x0000
(--) MGA(0): 16 DWORD fifo
(==) MGA(0): Default visual is TrueColor
(II) MGA(0): [drm] bpp: 16 depth: 16
(II) MGA(0): [drm] Sarea 2200+664: 2864
drmOpenDevice: node name is /dev/dri/card0
drmOpenDevice: open result is 7, (OK)

I guess the radeon hung the machine, and the matrox xserver simply wasn't
scheduled after that.

The lockup wasn't total - the numlock LED responded to the numlock key
(and similar for capslock) until I did the sysrq+B. There seemed to be
no reaction, other than no more LED responses.
This kernel doesn't have ACPI so it can't turn the machine off
when doing a normal shutdown, but it is usually capable of rebooting.
The console was black of course, no dumps of any kind.

I can try running the radeon xserver only, as the vga console is on the matrox
card.

Helge Hafting

Dave Airlie

Aug 16, 2005, 7:51:36 PM

to Helge Hafting, Linus Torvalds, Linux Kernel Mailing List, ak...@osdl.org

> ...

>
> Seems like it died trying to perform int10 initialization?

I'm still pointing towards that assign pci resources patch from Greg's
tree that I mentioned earlier..

the fact that disabling the DRM stops things from working is really
bad, maybe the pci_enable_device in the DRM is setting up the devices,
whereas without it X tries and fails...

>
> I can try running the radeon xserver only, as the vga console is on the matrox
> card.
>

I'm running low on ideas, I'm also having a hard time tracking what is
actually happening, the MGA bugs I've tracked are related to that
assign pci resources patch, and I really can't see what is happening
if the DRM isn't in the mix..

If you build a working kernel (i.e. like 2.6.13 without DRM) does it
hang similarly?

Dave.

Helge Hafting

Aug 17, 2005, 6:58:38 AM

to Dave Airlie, Helge Hafting, Linus Torvalds, Linux Kernel Mailing List, ak...@osdl.org

Dave Airlie wrote:

>>...
>>
>>Seems like it died trying to perform int10 initialization?
>>
>>
>
>I'm still pointing towards that assign pci resources patch from Greg's
>tree that I mentioned earlier..
>
>

git is completely new to me - is there a git-specific way to get this
patch, or should I download it the usual way from somewhere?

>the fact that disabling the DRM stops things from working is really
>bad, maybe the pci_enable_device in the DRM is setting up the devices,
>whereas without it X tries and fails...
>
>
>

That was strange, sure. Could be a different bug too.

>>I can try running the radeon xserver only, as the vga console is on the matrox
>>card.
>>
>>
>>
>
>I'm running low on ideas, I'm also having a hard time tracking what is
>actually happening, the MGA bugs I've tracked are related to that
>assign pci resources patch, and I really can't see what is happening
>if the DRM isn't in the mix..
>
>If you build a working kernel (i.e. like 2.6.13 without DRM) does it
>hang similarly?
>
>
>

2.6.13 isn't released, so I assume you meant some earlier kernel?
I'll see if I can get a drm-less kernel running.

Helge Hafting

Dave Airlie

Aug 17, 2005, 7:06:29 AM

to Helge Hafting, Helge Hafting, Linus Torvalds, Linux Kernel Mailing List, ak...@osdl.org

> >
> >I'm still pointing towards that assign pci resources patch from Greg's
> >tree that I mentioned earlier..
> >
> >
> git is completely new to me - is there a git-specific way to get this
> patch, or should I download it the usual way from somewhere?

Just grab it from the link to comment #16 on
http://bugzilla.kernel.org/show_bug.cgi?id=4965

and revert it if you could, thanks...

> That was strange, sure. Could be a different bug too.

oh it more than likely is a different bug...

>
> >I'm running low on ideas, I'm also having a hard time tracking what is
> >actually happening, the MGA bugs I've tracked are related to that
> >assign pci resources patch, and I really can't see what is happening
> >if the DRM isn't in the mix..
> >
> >If you build a working kernel (i.e. like 2.6.13 without DRM) does it
> >hang similarly?
> >
> >
> >
> 2.6.13 isn't released, so I assume you meant some earlier kernel?
> I'll see if I can get a drm-less kernel running.

Oh yeah sorry, I meant 2.6.12 or some kernel you know works...

Dave.

Rolf Eike Beer

Aug 17, 2005, 7:25:04 AM

to Linux Kernel Mailing List, Helge Hafting, Dave Airlie, Linus Torvalds

Helge Hafting wrote:
>Dave Airlie wrote:
>> I switched back to 2.6.13-rc4-mm1 at this point for another reason,
>> my X display acquired a nasty tendency to go blank for no reason
>> during work,
>> something I could fix by changing resolution back and forth. X
>> also tended to get
>> stuck for a minute now and then - a problem I haven't seen since
>> early 2.6.
>>
>>
>>
>> which head the radeon or MGA or both?
>
>The radeon 9200SE-pci gets stuck. The MGA-agp seems to be fine. I have
>compiled
>dri support for both, but I can't use it at the moment. I think that is
>caused by having ubuntu's xorg installed on debian. I needed xorg
>in order to run an xserver that doesn't use any tty - this way I can use
>two keyboards and have two simultaneous users. Debian's xorg wasn't ready
>at the moment. The setup is fine with 2.6.13-rc4-mm1 x86-64, no problems
>there.

I have some other issue with a MGA card (don't know exactly which, I have only
access to this on the weekend). With rc5 and rc6 kdm will not start on
bootup, X complains about some unresolved symbols in the X mga driver. If I
log in as user and do startx it works fine, also if I switch back to
2.6.12-rc-something. Something seems to confuse X somehow.

It's a PII-350 with more or less SuSE 9.3. The machine has no net access, so I
can only try to narrow it down to one rc at the weekend.

Eike

Linus Torvalds

Aug 17, 2005, 11:21:18 AM

to Dave Airlie, Helge Hafting, Helge Hafting, Linux Kernel Mailing List, ak...@osdl.org

On Wed, 17 Aug 2005, Dave Airlie wrote:
>
> > git is completely new to me - is there a git-specific way to get this
> > patch, or should I download it the usual way from somewhere?
>
> Just grab it from the link to comment #16 on
> http://bugzilla.kernel.org/show_bug.cgi?id=4965

That's a good one to try (and if it matters, can you please do a full
"lspci -vvx" for before-and-after? In fact, it would probably be good to
do that _regardless_ - do it with an old known-good kernel, and with one
recent kernel).

At the same time, something struck me. Does it happen to be much warmer in
your room lately? As in due to a heatwave? I'm just wondering if it might
be something as silly as a thermal shutdown.

Linus

Linus Torvalds

Aug 22, 2005, 4:07:45 PM

to Rolf Eike Beer, Linux Kernel Mailing List, Helge Hafting, Dave Airlie, Benjamin Herrenschmidt, Michael Ellerman, Greg Kroah-Hartman, Andrew Morton

On Mon, 22 Aug 2005, Linus Torvalds wrote:
>
> Eike, maybe you could change the ">=" to just ">" instead?

Ahh, I think you'd need to change the "i < PCI_ROM_RESOURCE" a few lines
above that to use "<=" too.

Linus Torvalds

Aug 22, 2005, 4:15:06 PM

to Rolf Eike Beer, Linux Kernel Mailing List, Helge Hafting, Dave Airlie, Benjamin Herrenschmidt, Michael Ellerman, Greg Kroah-Hartman, Andrew Morton

On Mon, 22 Aug 2005, Rolf Eike Beer wrote:

> >It's a PII-350 with more or less SuSE 9.3. The machine has no net access, so
> > I can only try to narrow it down to one rc at the weekend.
>

> 2.6.12 works fine, everything since 2.6.13-rc1 breaks it.

Gaah. I don't see anything really obvious in that range. However, I notice
that pci_mmap_resource() (in drivers/pci/pci-sysfs.c) now has

+	if (i >= PCI_ROM_RESOURCE)
+		return -ENODEV;

which seems a bit bogus. Why wouldn't we allow the ROM resource to be
mapped? I could imagine that the X server would very much like to mmap it,
although I don't know if modern X actually does that. The fact that it
works when root runs the X server and causes problems for normal users
does seem like there's something that root can do that users can't do, and
doing a mmap() on /dev/mem might be just that.

Eike, maybe you could change the ">=" to just ">" instead?

PS. The patch that introduced this was billed as "no change for anything
but ppc". Tssk.
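
The effect of the comparison operator is easy to see in a toy model. The sketch below is plain Python, not kernel code; it assumes `PCI_ROM_RESOURCE` is 6 (the six BARs occupying indices 0-5, as in kernels of that era), and the helper name is hypothetical.

```python
PCI_ROM_RESOURCE = 6  # assumed: BARs are resources 0-5, the expansion ROM is 6


def mmap_allowed(i, comparison=">="):
    """Model the bounds check: True when resource index i would not get -ENODEV."""
    if comparison == ">=":
        rejected = i >= PCI_ROM_RESOURCE
    elif comparison == ">":
        rejected = i > PCI_ROM_RESOURCE
    else:
        raise ValueError("comparison must be '>=' or '>'")
    return not rejected


for op in (">=", ">"):
    allowed = [i for i in range(8) if mmap_allowed(i, op)]
    print(op, allowed)
# >= [0, 1, 2, 3, 4, 5]
# > [0, 1, 2, 3, 4, 5, 6]
```

The only index whose treatment changes between the two operators is the ROM resource itself.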

Benjamin Herrenschmidt

Aug 22, 2005, 5:02:38 PM

to Linus Torvalds, Rolf Eike Beer, Linux Kernel Mailing List, Helge Hafting, Dave Airlie, Michael Ellerman, Greg Kroah-Hartman, Andrew Morton

On Mon, 2005-08-22 at 10:44 -0700, Linus Torvalds wrote:
>
> On Mon, 22 Aug 2005, Rolf Eike Beer wrote:
>
> > >It's a PII-350 with more or less SuSE 9.3. The machine has no net access, so
> > > I can only try to narrow it down to one rc at the weekend.
> >
> > 2.6.12 works fine, everything since 2.6.13-rc1 breaks it.
>
> Gaah. I don't see anything really obvious in that range. However, I notice
> that pci_mmap_resource() (in drivers/pci/pci-sysfs.c) now has
>
> + if (i >= PCI_ROM_RESOURCE)
> + return -ENODEV;
>
> which seems a bit bogus. Why wouldn't we allow the ROM resource to be
> mapped? I could imagine that the X server would very much like to mmap it,
> although I don't know if modern X actually does that. The fact that it
> works when root runs the X server and causes problems for normal users
> does seem like there's something that root can do that users can't do, and
> doing a mmap() on /dev/mem might be just that.
>
> Eike, maybe you could change the ">=" to just ">" instead?
>
> PS. The patch that introduced this was billed as "no change for anything
> but ppc". Tssk.

X uses /dev/mem, it doesn't use sysfs nor proc on x86, though it does
use proc for config space access (and that only) on ppc. Do you have the
sha1 ID of the above change at hand btw ?

(And yes, X should/will be fixed)

Helge Hafting

Aug 22, 2005, 5:38:26 PM

to Linus Torvalds, Dave Airlie, Linux Kernel Mailing List, ak...@osdl.org

On Wed, Aug 17, 2005 at 08:19:36AM -0700, Linus Torvalds wrote:
>
>
> On Wed, 17 Aug 2005, Dave Airlie wrote:
> > Just grab it from the link to comment #16 on
> > http://bugzilla.kernel.org/show_bug.cgi?id=4965
>
> That's a good one to try (and if it matters, can you please do a full
> "lspci -vvx" for before-and-after? In fact, it would probably be good to
> do that _regardless_ - do it with an old known-good kernel, and with one
> recent kernel).
>
> At the same time, something struck me. Does it happen to be much warmer in
> your room lately? As in due to a heatwave? I'm just wondering if it might
> be something as silly as a thermal shutdown.

Not warmer than usual, but the machine is always hot to the touch,
it is sitting in a small closet where I have taken the door off. Air
circulation still isn't perfect, but there is a strong fan on the cpu,
almost as noisy as a vacuum cleaner. :-(

Cpu loads never killed it before, so I don't suspect that unless the
radeon 9200SE has a thermal shutdown of its own.

I have found that the crash and the blanking may be different problems.
It seems that any kernel with a _working_ drm sooner or later will cause
a hang on the radeon display, possibly but not necessarily freezing the
machine for a while or forever. This happens more often if I actually
stress drm, such as playing tuxracer. But it can happen with
plain firefox/xterm/thunderbird work too. (no opengl screensavers
or animated window managers here.)

My rock solid 2.6.13-rc4-mm1 has drm compiled in, but drm fails when X
starts, and therefore drm isn't used. And therefore, a stable kernel.
From Xorg.2.log:
drmOpenDevice: open result is 6, (OK)
drmOpenByBusid: drmOpenMinor returns 6
drmOpenByBusid: drmGetBusid reports pci:0000:00:08.0
(II) RADEON(0): [drm] DRM interface version 1.2
(II) RADEON(0): [drm] created "radeon" driver at busid "pci:0000:00:08.0"
(II) RADEON(0): [drm] added 8192 byte SAREA at 0xffffc20000147000
(II) RADEON(0): [drm] drmMap failed
(EE) RADEON(0): [dri] DRIScreenInit failed. Disabling DRI.
(II) RADEON(0): Memory manager initialized to (0,0) (1280,8191)
(II) RADEON(0): Reserved area from (0,1024) to (1280,1026)
(II) RADEON(0): Largest offscreen area available: 1280 x 7165
(II) RADEON(0): Render acceleration enabled
(II) RADEON(0): Using XFree86 Acceleration Architecture (XAA)
drmMap failed for this kernel.

Seems like replacing the radeon is a good idea, it will probably never
do stable 3D as even old kernels have this particular problem. The
performance is appalling too, the old g550 gets a 3x-5x better framerate...

The blank display problem is different. That problem follows the
bisect search, i.e. the "good" kernels never ever blank the
display for me, and the "bad" kernels always do so after a little while.
Even if all I use is 2D stuff. (All with drm configured)

As for the patch to revert - it did fix things so an rc6 without drm
came up. I'm using that kernel now. I guess it'll be fine,
with no drm. I'll keep it running tomorrow, for stability testing.

What is the next logical step? rc6 with both drm and this patch reverted?
Or is there any new development?

There are three lspci-vvx files attached.
One for plain 2.6.13rc6, which can't run X.
One for 2.6.13rc6 with the patch reverted, and one for
the same kernel after I rebooted the machine and also
started X. The lspci-vvx was slightly different then.

Helge Hafting

lspci-vvx-2.6.13rc6

lspci-vvx-2.6.13rc6p

lspci-vvx-2.6.13rc6p-afterX

Rolf Eike Beer

Aug 22, 2005, 6:28:44 PM

to Linux Kernel Mailing List, Helge Hafting, Dave Airlie, Linus Torvalds

2.6.12 works fine, everything since 2.6.13-rc1 breaks it.

Eike

Dave Airlie

Aug 22, 2005, 7:08:18 PM

to Helge Hafting, Linus Torvalds, Linux Kernel Mailing List, ak...@osdl.org, Greg KH

>
>
> I have found that the crash and the blanking may be different problems.
> It seems that any kernel with a _working_ drm sooner or later will cause
> a hang on the radeon display, possibly but not necessarily freezing the
> machine for a while or forever. This happens more often if I actually
> stress drm, such as playing tuxracer. But it can happen with
> plain firefox/xterm/thunderbird work too. (no opengl screensavers
> or animated window managers here.)

yes there are still some unknown issues in the r200 drivers even on my
own 9200, this is probably more of an issue with the userspace
driver (granted I think the DRM could have some race condition as
well...)

>
> My rock solid 2.6.13-rc4-mm1 has drm compiled in, but drm fails when X
> starts, and therefore drm isn't used. And therefore, a stable kernel.
> From Xorg.2.log:
> drmOpenDevice: open result is 6, (OK)
> drmOpenByBusid: drmOpenMinor returns 6
> drmOpenByBusid: drmGetBusid reports pci:0000:00:08.0
> (II) RADEON(0): [drm] DRM interface version 1.2
> (II) RADEON(0): [drm] created "radeon" driver at busid "pci:0000:00:08.0"
> (II) RADEON(0): [drm] added 8192 byte SAREA at 0xffffc20000147000
> (II) RADEON(0): [drm] drmMap failed
> (EE) RADEON(0): [dri] DRIScreenInit failed. Disabling DRI.
> (II) RADEON(0): Memory manager initialized to (0,0) (1280,8191)
> (II) RADEON(0): Reserved area from (0,1024) to (1280,1026)
> (II) RADEON(0): Largest offscreen area available: 1280 x 7165
> (II) RADEON(0): Render acceleration enabled
> (II) RADEON(0): Using XFree86 Acceleration Architecture (XAA)
> drmMap failed for this kernel.
>

Should be fixed in -mm now, this was a problem on x86-64

>
> Seems like replacing the radeon is a good idea, it will probably never
> do stable 3D as even old kernels have this particular problem. The
> performance is appalling too, the old g550 gets a 3x-5x better framerate...
>
> The blank display problem is different. That problem follows the
> bisect search, i.e. the "good" kernels never ever blank the
> display for me, and the "bad" kernels always do so after a little while.
> Even if all I use is 2D stuff. (All with drm configured)
>
> As for the patch to revert - it did fix things so an rc6 without drm
> came up. I'm using that kernel now. I guess it'll be fine,
> with no drm. I'll keep it running tomorrow, for stability testing.
>
> What is the next logical step? rc6 with both drm and this patch reverted?
> Or is there any new development?

Linus,
Can we revert the PCI assign resources patch? this is 2-3 bug reports
with it and they all centre on the MGA G550 card, we need to be able
to do some sort of blacklisting perhaps so these cards don't get
touched until we can figure out why it is breaking X...

Dave.

Linus Torvalds

Aug 22, 2005, 7:41:32 PM

to Dave Airlie, Helge Hafting, Linux Kernel Mailing List, Andrew Morton, Greg KH, Ivan Kokshaysky

On Tue, 23 Aug 2005, Dave Airlie wrote:
>
> Can we revert the PCI assign resources patch?

I'd rather not revert the whole PCI assign thing, because it's good.

But disabling the ROM assignment might be a good idea. Almost nobody ever
really wants to assign the ROM anyway, and there are cards where there are
some strange rules about ROM alignment (read: doesn't follow spec).

That may be the problem with MGA - I think some gfx cards used the same
decoder for ROM and for the video RAM aperture, so that you were supposed
to only enable ROM when the RAM thing was quiescent or something, and
always use the same address too (the current code doesn't _enable_ the
ROM, but I think it allocates and programs the base address. Which
should be harmless, but..).

Ivan? Does something like this make a difference?

Linus

---
diff --git a/drivers/pci/setup-res.c b/drivers/pci/setup-res.c
--- a/drivers/pci/setup-res.c
+++ b/drivers/pci/setup-res.c
@@ -52,10 +52,12 @@ pci_update_resource(struct pci_dev *dev,
 
 	if (resno < 6) {
 		reg = PCI_BASE_ADDRESS_0 + 4 * resno;
+#if 0
 	} else if (resno == PCI_ROM_RESOURCE) {
 		new |= res->flags & IORESOURCE_ROM_ENABLE;
 		reg = dev->rom_base_reg;
 	} else {
+#endif
 		/* Hmm, non-standard resource. */
 
 		return;		/* kill uninitialised var warning */

Rolf Eike Beer

Aug 23, 2005, 2:50:35 AM

to Linux Kernel Mailing List, Linus Torvalds, Helge Hafting, Dave Airlie, Benjamin Herrenschmidt, Michael Ellerman, Greg Kroah-Hartman, Andrew Morton

Linus Torvalds wrote:
>On Mon, 22 Aug 2005, Rolf Eike Beer wrote:
>> >It's a PII-350 with more or less SuSE 9.3. The machine has no net access,
>> > so I can only try to narrow it down to one rc at the weekend.
>>
>> 2.6.12 works fine, everything since 2.6.13-rc1 breaks it.
>
>Gaah. I don't see anything really obvious in that range. However, I notice
>that pci_mmap_resource() (in drivers/pci/pci-sysfs.c) now has
>
>+ if (i >= PCI_ROM_RESOURCE)
>+ return -ENODEV;
>
>which seems a bit bogus. Why wouldn't we allow the ROM resource to be
>mapped? I could imagine that the X server would very much like to mmap it,
>although I don't know if modern X actually does that. The fact that it
>works when root runs the X server and causes problems for normal users
>does seem like there's something that root can do that users can't do, and
>doing a mmap() on /dev/mem might be just that.

No, it's a bit more obscure. The kdm daemon does not start, but if I log in as
user and do startx everything is fine.

>Eike, maybe you could change the ">=" to just ">" instead?

Will test this.

Eike

Alan Cox

Aug 23, 2005, 11:06:56 AM

to Linus Torvalds, Dave Airlie, Helge Hafting, Linux Kernel Mailing List, Andrew Morton, Greg KH, Ivan Kokshaysky

On Llu, 2005-08-22 at 16:40 -0700, Linus Torvalds wrote:
> That may be the problem with MGA - I think some gfx cards used the same
> decoder for ROM and for the video RAM aperture, so that you were supposed

MGA requires the ROM can be mapped temporarily in order to read the data
tables. X itself solves this by mapping the ROM over the RAM addresses
assigned by the OS then mapping the RAM back after finishing with the
ROM. It's a fairly standard video card trick.

What X does in the presence of kernel support it recognizes, however,
I'm not sure.

Alan

Linus Torvalds

Aug 24, 2005, 2:07:00 AM

to Dave Airlie, Helge Hafting, Linux Kernel Mailing List, Andrew Morton, Greg KH, Ivan Kokshaysky

On Mon, 22 Aug 2005, Linus Torvalds wrote:
>
> But disabling the ROM assignment might be a good idea. Almost nobody ever
> really wants to assign the ROM anyway, and there are cards where there are
> some strange rules about ROM alignment (read: doesn't follow spec).

Here's an even better idea.

Let's do the assignment internally in the kernel, but just not write it to
the device unless it's actually enabled. IOW, we'll be doing all the
resource allocation, but devices won't be affected. Modern lspci versions
will show this as a "[virtual] Expansion ROM".

The patch might look something like this. Helge, does this make any
difference?

Ivan, opinions?

Linus
---
diff --git a/drivers/pci/setup-res.c b/drivers/pci/setup-res.c
--- a/drivers/pci/setup-res.c
+++ b/drivers/pci/setup-res.c
@@ -53,7 +53,9 @@ pci_update_resource(struct pci_dev *dev,
 	if (resno < 6) {
 		reg = PCI_BASE_ADDRESS_0 + 4 * resno;
 	} else if (resno == PCI_ROM_RESOURCE) {
-		new |= res->flags & IORESOURCE_ROM_ENABLE;
+		if (!(res->flags & IORESOURCE_ROM_ENABLE))
+			return;
+		new |= PCI_ROM_ADDRESS_ENABLE;
 		reg = dev->rom_base_reg;
 	} else {
 		/* Hmm, non-standard resource. */
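
The behavioural difference the patch aims for can be modelled outside the kernel. The sketch below is a Python approximation of the branch shown in the diff, not the kernel function itself: `plan_update` is a hypothetical name, and the constant values are assumptions recalled from the kernel headers (with 0x30 standing in for `dev->rom_base_reg` on an ordinary device).

```python
PCI_BASE_ADDRESS_0 = 0x10
PCI_ROM_ADDRESS = 0x30          # assumed stand-in for dev->rom_base_reg
PCI_ROM_RESOURCE = 6
IORESOURCE_ROM_ENABLE = 0x1     # assumed value
PCI_ROM_ADDRESS_ENABLE = 0x1    # assumed value


def plan_update(resno, new, flags, patched):
    """Return (config register, value) to write, or None if nothing is written."""
    if resno < 6:
        return (PCI_BASE_ADDRESS_0 + 4 * resno, new)
    if resno == PCI_ROM_RESOURCE:
        if patched:
            # Patched: a disabled ROM keeps its kernel-side allocation,
            # but the device register is left untouched.
            if not flags & IORESOURCE_ROM_ENABLE:
                return None
            new |= PCI_ROM_ADDRESS_ENABLE
        else:
            # Unpatched: the base address is written even when disabled.
            new |= flags & IORESOURCE_ROM_ENABLE
        return (PCI_ROM_ADDRESS, new)
    return None  # non-standard resource


print(plan_update(6, 0xF2020000, 0, patched=False))  # (48, 4060217344)
print(plan_update(6, 0xF2020000, 0, patched=True))   # None
```

Ordinary BARs and an explicitly enabled ROM are handled the same way before and after; only the disabled-ROM case stops touching the hardware.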

Helge Hafting

Aug 24, 2005, 4:30:01 AM

to Linus Torvalds, Dave Airlie, Linux Kernel Mailing List, Andrew Morton, Greg KH, Ivan Kokshaysky

Linus Torvalds wrote:

>On Mon, 22 Aug 2005, Linus Torvalds wrote:
>
>
>>But disabling the ROM assignment might be a good idea. Almost nobody ever
>>really wants to assign the ROM anyway, and there are cards where there are
>>some strange rules about ROM alignment (read: doesn't follow spec).
>>
>>
>
>Here's an even better idea.
>
>Let's do the assignment internally in the kernel, but just not write it to
>the device unless it's actually enabled. IOW, we'll be doing all the
>resource allocation, but devices won't be affected. Modern lspci versions
>will show this as a "[virtual] Expansion ROM".
>
>The patch might look something like this. Helge, does this make any
>difference?
>
>

Tried it. (More risky than it sounds, I am at work, the machine is at home,
and if it didn't come up again I'd get an angry phonecall . . .)

But it came up fine, and the xservers came up too. :-)

lspci -vvx shows the disabled ROMs too:

0000:01:00.0 VGA compatible controller: Matrox Graphics, Inc. MGA G550
AGP (rev 01) (prog-if 00 [VGA])
Subsystem: Matrox Graphics, Inc. Millennium G550 Dual Head DDR 32Mb
Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop-
ParErr- Stepping- SERR- FastB2B-
Status: Cap+ 66MHz- UDF- FastB2B+ ParErr- DEVSEL=medium >TAbort-
<TAbort- <MAbort- >SERR- <PERR-
Latency: 32 (4000ns min, 8000ns max), Cache Line Size: 0x08 (32
bytes)
Interrupt: pin A routed to IRQ 16
Region 0: Memory at f0000000 (32-bit, prefetchable) [size=32M]
Region 1: Memory at f2000000 (32-bit, non-prefetchable) [size=16K]
Region 2: Memory at f3000000 (32-bit, non-prefetchable) [size=8M]
Expansion ROM at f2020000 [disabled] [size=128K]
Capabilities: [dc] Power Management version 2
Flags: PMEClk- DSI+ D1- D2- AuxCurrent=0mA
PME(D0-,D1-,D2-,D3hot-,D3cold-)
Status: D0 PME-Enable- DSel=0 DScale=0 PME-
Capabilities: [f0] AGP version 2.0
Status: RQ=32 Iso- ArqSz=0 Cal=0 SBA+ ITACoh- GART64-
HTrans- 64bit- FW- AGP3- Rate=x1,x2,x4
Command: RQ=32 ArqSz=0 Cal=0 SBA+ AGP+ GART64- 64bit-
FW- Rate=x1
00: 2b 10 27 25 07 00 90 02 01 00 00 03 08 20 00 00
10: 08 00 00 f0 00 00 00 f2 00 00 00 f3 00 00 00 00
20: 00 00 00 00 00 00 00 00 00 00 00 00 2b 10 84 0f
30: 00 00 00 00 dc 00 00 00 00 00 00 00 0b 01 10 20

0000:00:08.0 VGA compatible controller: ATI Technologies Inc RV280 [Radeon 9200 SE] (rev 01) (prog-if 00 [VGA])
        Subsystem: PC Partner Limited: Unknown device 7c25
        Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B-
        Status: Cap+ 66MHz- UDF- FastB2B+ ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <PERR-
        Latency: 32 (2000ns min), Cache Line Size: 0x08 (32 bytes)
        Interrupt: pin A routed to IRQ 19
        Region 0: Memory at e0000000 (32-bit, prefetchable) [size=128M]
        Region 1: I/O ports at 9800 [size=256]
        Region 2: Memory at f6000000 (32-bit, non-prefetchable) [size=64K]
        Expansion ROM at 1ff00000 [disabled] [size=128K]
        Capabilities: [50] Power Management version 2
                Flags: PMEClk- DSI- D1+ D2+ AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
                Status: D0 PME-Enable- DSel=0 DScale=0 PME-
00: 02 10 64 59 07 00 90 02 01 00 00 03 08 20 80 00
10: 08 00 00 e0 01 98 00 00 00 00 00 f6 00 00 00 00
20: 00 00 00 00 00 00 00 00 00 00 00 00 4b 17 25 7c
30: 00 00 00 00 50 00 00 00 00 00 00 00 05 01 08 00

0000:00:08.1 Display controller: ATI Technologies Inc RV280 [Radeon 9200 SE] (Secondary) (rev 01)
        Subsystem: PC Partner Limited: Unknown device 7c24
        Control: I/O- Mem- BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B-
        Status: Cap+ 66MHz- UDF- FastB2B+ ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <PERR-
        Region 0: Memory at e8000000 (32-bit, prefetchable) [disabled] [size=128M]
        Region 1: Memory at f6010000 (32-bit, non-prefetchable) [disabled] [size=64K]
        Capabilities: [50] Power Management version 2
                Flags: PMEClk- DSI- D1+ D2+ AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
                Status: D0 PME-Enable- DSel=0 DScale=0 PME-
00: 02 10 44 5d 00 00 90 02 01 00 80 03 08 20 00 00
10: 08 00 00 e8 00 00 01 f6 00 00 00 00 00 00 00 00
20: 00 00 00 00 00 00 00 00 00 00 00 00 4b 17 24 7c
30: 00 00 00 00 50 00 00 00 00 00 00 00 ff 00 08 00
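
For anyone who wants to cross-check the hex dump against the decoded lines, here is a minimal Python sketch that parses the config-space bytes of the first device (0000:01:00.0). The helper names (parse_dump, decode) are made up for illustration; the offsets (0x10 for the six BARs, 0x30 for the expansion ROM base) are the standard PCI type 0 header layout. The ROM register in the dump reads as zero even though lspci prints an address, which is presumably the kernel-side allocation described in the quoted text rather than something read back from the device.

```python
import struct

# Config-space bytes for 0000:01:00.0, copied from the lspci -vvx output above.
DUMP = """
00: 2b 10 27 25 07 00 90 02 01 00 00 03 08 20 00 00
10: 08 00 00 f0 00 00 00 f2 00 00 00 f3 00 00 00 00
20: 00 00 00 00 00 00 00 00 00 00 00 00 2b 10 84 0f
30: 00 00 00 00 dc 00 00 00 00 00 00 00 0b 01 10 20
"""


def parse_dump(text):
    """Turn 'offset: xx xx ...' lines into a bytes object."""
    data = bytearray()
    for line in text.strip().splitlines():
        _, hexbytes = line.split(":", 1)
        data.extend(int(b, 16) for b in hexbytes.split())
    return bytes(data)


def decode(cfg):
    """Decode a few little-endian fields of a type 0 PCI header."""
    vendor, device = struct.unpack_from("<HH", cfg, 0x00)
    bars = struct.unpack_from("<6I", cfg, 0x10)
    (rom,) = struct.unpack_from("<I", cfg, 0x30)
    return {
        "vendor": vendor,
        "device": device,
        "bars": bars,
        "rom_base": rom & 0xFFFFF800,   # bits 31:11 hold the ROM address
        "rom_enabled": bool(rom & 1),   # bit 0 is the ROM enable bit
    }


if __name__ == "__main__":
    info = decode(parse_dump(DUMP))
    print("vendor %04x device %04x" % (info["vendor"], info["device"]))
    for n, bar in enumerate(info["bars"]):
        print("BAR%d raw %08x" % (n, bar))
    print("ROM base %08x enabled=%s" % (info["rom_base"], info["rom_enabled"]))
```

The raw BAR values still carry their low flag bits (for example the prefetchable bit in BAR0), so they will not match the "Memory at ..." addresses digit for digit.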

Helge Hafting

Rolf Eike Beer

Aug 30, 2005, 4:08:38 AM

to Linus Torvalds, Linux Kernel Mailing List, Helge Hafting, Dave Airlie, Benjamin Herrenschmidt, Michael Ellerman, Greg Kroah-Hartman, Andrew Morton

Linus Torvalds wrote:
>On Mon, 22 Aug 2005, Rolf Eike Beer wrote:
>> >It's a PII-350 with more or less SuSE 9.3. The machine has no net access,
>> > so I can only try to narrow it down to one rc at the weekend.
>>
>> 2.6.12 works fine, everything since 2.6.13-rc1 breaks it.
>
>Gaah. I don't see anything really obvious in that range. However, I notice
>that pci_mmap_resource() (in drivers/pci/pci-sysfs.c) now has
>
>+ if (i >= PCI_ROM_RESOURCE)
>+ return -ENODEV;
>
>which seems a bit bogus. Why wouldn't we allow the ROM resource to be
>mapped? I could imagine that the X server would very much like to mmap it,
>although I don't know if modern X actually does that. The fact that it
>works when root runs the X server and causes problems for normal users
>does seem like there's something that root can do that users can't do, and
>doing a mmap() on /dev/mem might be just that.
>
>Eike, maybe you could change the ">=" to just ">" instead?
>
>PS. The patch that introduced this was billed as "no change for anything
>but ppc". Tssk.

This does not fix the problem. I'll narrow it down to one git snapshot next
weekend (forgot the tarball on friday).

Eike
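
To make the suggested ">=" versus ">" change concrete, here is a small Python model of the bounds check. It is only a sketch: it assumes PCI_ROM_RESOURCE is 6 (the six BARs being resources 0 to 5), and the function names are invented for illustration, not taken from pci-sysfs.c.

```python
PCI_ROM_RESOURCE = 6  # assumption: BARs are resources 0-5, the ROM is resource 6


def rejected(i, original=True):
    """Return True if the check would refuse to mmap resource index i.

    original=True  models `if (i >= PCI_ROM_RESOURCE) return -ENODEV;`
    original=False models the suggested `if (i > PCI_ROM_RESOURCE) ...`
    """
    if original:
        return i >= PCI_ROM_RESOURCE
    return i > PCI_ROM_RESOURCE


def mappable(original):
    """List the resource indices (0 through the ROM) that pass the check."""
    return [i for i in range(PCI_ROM_RESOURCE + 1) if not rejected(i, original)]


if __name__ == "__main__":
    print("with >= :", mappable(True))
    print("with >  :", mappable(False))
```

The only index whose treatment differs between the two variants is the ROM resource itself.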

Rolf Eike Beer

Sep 5, 2005, 3:50:12 AM

to Linus Torvalds, Linux Kernel Mailing List, Helge Hafting, Dave Airlie, Benjamin Herrenschmidt, Michael Ellerman, Greg Kroah-Hartman, Andrew Morton

The problem appeared between 2.6.12-git3 and 2.6.12-git4.

Eike

Linus Torvalds

Sep 5, 2005, 4:46:22 AM

to Rolf Eike Beer, Linux Kernel Mailing List, Helge Hafting, Dave Airlie, Benjamin Herrenschmidt, Michael Ellerman, Greg Kroah-Hartman, Andrew Morton

On Mon, 5 Sep 2005, Rolf Eike Beer wrote:
>
> The problem appeared between 2.6.12-git3 and 2.6.12-git4.

Just for reference, that's git ID's

1d345dac1f30af1cd9f3a1faa12f9f18f17f236e..2a5a68b840cbab31baab2d9b2e1e6de3b289ae1e

and that's 225 commits and the diff is 55,781 lines long.

It would be very good if you could try to use raw git and narrow it down a
bit more. It's really easy these days with a recent git version, just do

git bisect start
git bisect good 1d345dac1f30af1cd9f3a1faa12f9f18f17f236e
git bisect bad 2a5a68b840cbab31baab2d9b2e1e6de3b289ae1e

and off you go.. That will select a new kernel for you to try, which
basically cuts down the commits to ~110 - and if you can test just a few
kernels and binary-search a bit more, we'd have it down to just a couple.
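
As a rough illustration of why "just a few kernels" is enough, here is a Python sketch of the halving on a simplified, strictly linear history. Real git history is a graph, so this is not how git picks its test points; first_bad and the integer "commits" are stand-ins. With 225 candidates the first test leaves 112 or 113, and the search needs at most 8 tests, since 2^7 < 225 <= 2^8.

```python
def first_bad(commits, is_bad):
    """Binary-search a linear list of commits (oldest first).

    Assumes every commit before the culprit tests good and every commit
    from the culprit onwards tests bad, and that the last one is bad.
    Returns (culprit, number_of_kernels_tested).
    """
    lo, hi = 0, len(commits) - 1    # invariant: the culprit is in commits[lo..hi]
    tests = 0
    while lo < hi:
        mid = (lo + hi) // 2
        tests += 1
        if is_bad(commits[mid]):
            hi = mid                # culprit is at mid or earlier
        else:
            lo = mid + 1            # culprit is after mid
    return commits[lo], tests


if __name__ == "__main__":
    commits = list(range(225))      # stand-ins for the 225 commits in the range
    culprit, tests = first_bad(commits, lambda c: c >= 137)
    print("culprit", culprit, "found after", tests, "test kernels")
```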

If you want to try to work smarter (rather than a brute-force binary search
thing), this command line:

git-whatchanged -p \
1d345dac1f30af1cd9f3a1faa12f9f18f17f236e..2a5a68b840cbab31baab2d9b2e1e6de3b289ae1e \
drivers/video

will actually give you some very good information on what to try (I forget
your exact original problem - I'm writing this from Italy, and I don't
have my full email archives here. It was some MGA card that stopped
working, no? Or was there something else?).

Anyway, git users really have a lot of nifty tools to help chase down bugs
like this. I used that "git bisect" thing twice myself last week. And the
"git-whatchanged" thing really is pretty flexible: as you can see, you can
limit it to both a range of commits and a certain subdirectory (or a _set_
of subdirectories and/or individual files - you can have as many pathname
limits as you want).

And that "-p" thing makes it show the whole diff for the thing (replace it
with a "-s" if you just want to see the descriptions and be silent about
the actual diff).
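
If you end up scripting this, the command line is easy to build programmatically. The sketch below is an assumption-laden helper, not part of git: history_cmd is a made-up name, and it uses "git log" rather than "git-whatchanged", assuming a git in which "git log" accepts the same commit range, the "-p" and "-s" options, and pathname limits after "--".

```python
import subprocess


def history_cmd(good, bad, paths, show_patch=True):
    """Build an argv list that shows the commits in good..bad touching `paths`.

    show_patch=True asks for full diffs (-p); False asks for descriptions
    only (-s). The "--" separates revisions from pathname limits.
    """
    cmd = ["git", "log", "-p" if show_patch else "-s", "%s..%s" % (good, bad), "--"]
    cmd.extend(paths)
    return cmd


if __name__ == "__main__":
    cmd = history_cmd(
        "1d345dac1f30af1cd9f3a1faa12f9f18f17f236e",
        "2a5a68b840cbab31baab2d9b2e1e6de3b289ae1e",
        ["drivers/video", "drivers/pci"],   # any number of pathname limits
        show_patch=False,
    )
    print(" ".join(cmd))
    # Only meaningful inside a kernel git tree:
    # subprocess.run(cmd, check=True)
```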

All in my never-ending quest to make people more aware of how they can use
git to pinpoint the source of kernel bugs.

Linus

Sonny Rao

Sep 5, 2005, 4:04:24 PM

to Linus Torvalds, Rolf Eike Beer, Linux Kernel Mailing List, Helge Hafting, Dave Airlie, Benjamin Herrenschmidt, Michael Ellerman, Greg Kroah-Hartman, Andrew Morton

On Mon, Sep 05, 2005 at 01:45:28AM -0700, Linus Torvalds wrote:
>
> On Mon, 5 Sep 2005, Rolf Eike Beer wrote:
> >
> > The problem appeared between 2.6.12-git3 and 2.6.12-git4.
>
> Just for reference, that's git ID's
>
> 1d345dac1f30af1cd9f3a1faa12f9f18f17f236e..2a5a68b840cbab31baab2d9b2e1e6de3b289ae1e
>
> and that's 225 commits and the diff is 55,781 lines long.
>
> It would be very good if you could try to use raw git and narrow it down a
> bit more. It's really easy these days with a recent git version, just do
>
> git bisect start
> git bisect good 1d345dac1f30af1cd9f3a1faa12f9f18f17f236e
> git bisect bad 2a5a68b840cbab31baab2d9b2e1e6de3b289ae1e
>
> and off you go.. That will select a new kernel for you to try, which
> basically cuts down the commits to ~110 - and if you can test just a few
> kernels and binary-search a bit more, we'd have it down to just a couple.

Can this method detect breakages that are spread across more than one
patch? I suppose it'll just trigger on the last patch committed in the
set in this case?

Sonny

Linus Torvalds

Sep 6, 2005, 3:45:10 AM

to Sonny Rao, Rolf Eike Beer, Linux Kernel Mailing List, Helge Hafting, Dave Airlie, Benjamin Herrenschmidt, Michael Ellerman, Greg Kroah-Hartman, Andrew Morton

On Mon, 5 Sep 2005, Sonny Rao wrote:
>
> Can this method detect breakages that are spread across more than one
> patch? I suppose it'll just trigger on the last patch committed in the
> set in this case?

It will trigger on just the commit that introduces the user-visible
breakage, so yes, it's usually the last in a series (or the first one, for
that matter).

And it's not perfect. A problem that fades in and out is not something you
can do binary searching on. For example, sometimes a bug gets introduced
and ends up being dependent on things like cache alignment or some
variable layout etc, so you only _see_ the problem occasionally, and it
ends up happening due to totally unrelated changes - then the bisection
algorithm ends up being totally useless.
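
A toy Python model shows how that goes wrong. Everything here is invented for illustration: 16 commits numbered 0 to 15, a bug really introduced in commit 4, and an unrelated layout change that hides the symptom in commits 6 to 9. A plain binary search then converges on commit 10, the one that merely made the symptom visible again.

```python
INTRODUCED = 4          # hypothetical commit that really introduced the bug
MASKED = range(6, 10)   # hypothetical commits where the symptom is hidden


def looks_bad(commit):
    """What a tester observes: the bug is present but not always visible."""
    return commit >= INTRODUCED and commit not in MASKED


def first_bad(n, is_bad):
    """Plain binary search over commits 0..n-1, assuming good-then-bad."""
    lo, hi = 0, n - 1
    while lo < hi:
        mid = (lo + hi) // 2
        if is_bad(mid):
            hi = mid
        else:
            lo = mid + 1
    return lo


if __name__ == "__main__":
    print("blamed commit:", first_bad(16, looks_bad))
    print("real culprit: ", INTRODUCED)
```

The search itself is fine; it is the "everything before is good, everything after is bad" assumption that no longer holds.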

Linus

Andrew Morton

Sep 8, 2005, 7:48:15 PM

to Linus Torvalds, helge....@aitel.hist.no, linux-...@vger.kernel.org, air...@gmail.com

Linus Torvalds <torv...@osdl.org> wrote:
>
> If you remember/save the good/bad commit ID's, you can restart the whole
> process and just feed the correct state for the ID's:
>
> git bisect start
> git bisect bad v2.6.13-rc5
> git bisect good v2.6.13-rc4
> .. here bisect will start narrowing things down ..
> git bisect bad <sha1 of known bad>
> git bisect good <sha1 of known good>
> ..

What do you suggest should be done if you hit a compile error partway
through the bisection search? Is there some way to go forward or backward
a few csets while keeping the search markers sane?

Linus Torvalds

Sep 8, 2005, 8:18:43 PM

to Andrew Morton, helge....@aitel.hist.no, Linux Kernel Mailing List, air...@gmail.com

On Thu, 8 Sep 2005, Andrew Morton wrote:
> Linus Torvalds <torv...@osdl.org> wrote:
> >
> > If you remember/save the good/bad commit ID's, you can restart the whole
> > process and just feed the correct state for the ID's:
> >
> > git bisect start
> > git bisect bad v2.6.13-rc5
> > git bisect good v2.6.13-rc4
> > .. here bisect will start narrowing things down ..
> > git bisect bad <sha1 of known bad>
> > git bisect good <sha1 of known good>
> > ..
>
> What do you suggest should be done if you hit a compile error partway
> through the bisection search? Is there some way to go forward or backward
> a few csets while keeping the search markers sane?

Hmm.. There's no really nice interface for doing it, but since bisection
uses a perfectly normal git branch (it's a special "bisect" branch) you
can use other git commands to move around the head of that branch and try
at any other point than the one it selected for you automatically.

In other words, you can "git reset" the head point of the branch to any
point you want to, and the only problem is to pick what point to try next
(since you don't want to mark the current point good or bad). One thing to
do is perhaps to just do:

git bisect visualize

which just starts "gitk" with the proper arguments so that you can see the
range we're currently bisecting. Then you can pick a new bisection point by
hand, and then do

git reset --hard <sha-of-that-point>

by just selecting that commit in gitk and pasting the result into that
"git reset --hard xyz.." command line.

("git reset --hard ..." will reset the current branch to the selected
point and force a checkout of the new state while it's at it. It's pretty
much equivalent to "git reset ..." followed by a "git checkout -f").

Of course, you can pick the bisection point by any other means too. So
if you just do "git log" and you know which commit broke the compile, just
pick its parent by hand.

The only important point is that you should obviously pick something that
is within the current known good/bad range, and that's where the
aforementioned "git bisect visualize" can help.
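
The "pick another point inside the known good/bad range" idea can be sketched in Python on a simplified linear history. This is only a model with made-up names (bisect_with_skips, builds); it does not reproduce what git does internally. Later git versions grew a "git bisect skip" subcommand for this situation, and the sketch does not claim to match its selection heuristics.

```python
def bisect_with_skips(n, is_bad, builds):
    """Bisect commits 0..n-1 (oldest first) when some commits cannot be tested.

    Assumes the state before commit 0 is known good and commit n-1 is known bad.
    `builds(c)` says whether commit c compiles and can be tested at all.
    Returns the list of commits that may be the culprit.
    """
    good, bad = -1, n - 1                    # newest known-good, oldest known-bad
    while bad - good > 1:
        mid = (good + bad) // 2
        # Candidates strictly inside the range, nearest to the midpoint first.
        candidates = sorted(range(good + 1, bad), key=lambda c: abs(c - mid))
        pick = next((c for c in candidates if builds(c)), None)
        if pick is None:
            # Nothing testable is left between good and bad.
            return list(range(good + 1, bad + 1))
        if is_bad(pick):
            bad = pick
        else:
            good = pick
    return [bad]


if __name__ == "__main__":
    broken_build = {7, 8}                    # hypothetical commits that do not compile
    suspects = bisect_with_skips(
        16,
        is_bad=lambda c: c >= 9,             # hypothetical culprit: commit 9
        builds=lambda c: c not in broken_build,
    )
    print("possible culprits:", suspects)
```

When the untestable commits sit right next to the culprit, the best any search can do is report a short list of suspects rather than a single commit.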

Oh, and the "git bisect visualize" thing is fairly new: if you have an
older version of git that doesn't have that nice helper function, you can
always do it by hand with the following magic command line.

gitk bisect/bad --not $(cd .git/refs && echo bisect/good-*)

(you can see how "git bisect visualize" is a bit simpler to type and
remember ;)
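
What that command line asks gitk to show is "everything reachable from the bad commit, minus everything reachable from any good commit". Here is a small Python sketch of that set computation on a toy commit graph; the graph, the commit names and the function names are all invented for illustration.

```python
def ancestors(parents, tip):
    """Return the set of commits reachable from `tip`, including `tip` itself."""
    seen, stack = set(), [tip]
    while stack:
        commit = stack.pop()
        if commit not in seen:
            seen.add(commit)
            stack.extend(parents.get(commit, ()))
    return seen


def suspects(parents, bad, goods):
    """Commits reachable from `bad` but from none of the `goods`."""
    excluded = set()
    for good in goods:
        excluded |= ancestors(parents, good)
    return ancestors(parents, bad) - excluded


if __name__ == "__main__":
    # Toy history: A <- B <- C, A <- B <- D, and E merges C and D.
    parents = {"A": [], "B": ["A"], "C": ["B"], "D": ["B"], "E": ["C", "D"]}
    print(sorted(suspects(parents, bad="E", goods=["C"])))
```

Marking more commits good only ever shrinks the suspect set, which is why the view gets smaller at each bisection step.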

Linus