2.6.27-rc6-git6: Reported regressions from 2.6.26

Rafael J. Wysocki

Sep 21, 2008, 4:40:17 PM

This message contains a list of some regressions from 2.6.26, for which there
are no fixes in the mainline I know of. If any of them have been fixed already,
please let me know.

If you know of any other unresolved regressions from 2.6.26, please let me know
as well and I'll add them to the list. Also, please let me know if any of the
entries below are invalid.

Each entry from the list will also be sent in an automatic reply to this
message, with CCs to the people involved in reporting and handling the
issue.

Listed regressions statistics:

Date          Total  Pending  Unresolved
----------------------------------------
2008-09-21      169       45          36

Unresolved regressions
----------------------

Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=11611
Subject : Commit 2344abbcbdb82140050e8be29d3d55e4f6fe860b breaks resume on nx6325
Submitter : Rafael J. Wysocki <r...@sisk.pl>
Date : 2008-09-20 23:24 (2 days old)
References : http://marc.info/?l=linux-kernel&m=122195277606974&w=4

Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=11610
Subject : Problem with kernel commit 664d080c41463570b95717b5ad86e79dc1be0877
Submitter : Michal 'vorner' Vaner <vor...@ucw.cz>
Date : 2008-09-21 17:35 (1 day old)
References : http://marc.info/?l=linux-acpi&m=122201853409501&w=4

Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=11609
Subject : oops in find_get_page
Submitter : Marcin Slusarz <marcin....@gmail.com>
Date : 2008-09-20 14:53 (2 days old)
References : http://marc.info/?l=linux-kernel&m=122192251101892&w=4

Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=11608
Subject : 2.6.27-rc6 BUG: unable to handle kernel paging request
Submitter : John Daiker <daike...@gmail.com>
Date : 2008-09-16 23:00 (6 days old)
References : http://marc.info/?l=linux-kernel&m=122160611517267&w=4

Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=11607
Subject : 2.6.27-rc6 Bug in tty_chars_in_buffer
Submitter : John Daiker <daike...@gmail.com>
Date : 2008-09-15 2:26 (7 days old)
References : http://marc.info/?l=linux-kernel&m=122144565514490&w=4

Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=11590
Subject : Nokia 5310 Xpress usb-storage not mounting
Submitter : David Almaroad <dalm...@gmail.com>
Date : 2008-09-18 21:35 (4 days old)

Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=11569
Subject : Don't complain about disabled irqs when the system has paniced
Submitter : Andi Kleen <an...@firstfloor.org>
Date : 2008-09-02 13:49 (20 days old)
References : http://marc.info/?l=linux-kernel&m=122036356127282&w=4

Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=11568
Subject : spontaneous reboot on resume with 2.6.27
Submitter : Andy Wettstein <ajw...@gmail.com>
Date : 2008-09-14 20:00 (8 days old)

Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=11551
Subject : Semi-repeatable hard lockup on 2.6.27-rc6
Submitter : Steven Noonan <ste...@uplinklabs.net>
Date : 2008-09-10 18:07 (12 days old)
References : http://marc.info/?l=linux-kernel&m=122107007407994&w=4

Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=11548
Subject : kernel BUG at drivers/pci/intel-iommu.c:1373!
Submitter : Chris Mason <chris...@oracle.com>
Date : 2008-09-08 14:26 (14 days old)
References : http://marc.info/?l=linux-kernel&m=122088566310440&w=4

Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=11543
Subject : kernel panic: softlockup in tick_periodic() ???
Submitter : Joshua Hoblitt <j_ke...@hoblitt.com>
Date : 2008-09-11 16:46 (11 days old)
References : http://marc.info/?l=linux-kernel&m=122117786124326&w=4
Handled-By : Thomas Gleixner <tg...@linutronix.de>
             Cyrill Gorcunov <gorc...@gmail.com>
             Ingo Molnar <mi...@elte.hu>

Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=11516
Subject : severe performance degradation on x86_64 going from 2.6.26-rc9 -> 2.6.27-rc5
Submitter : Jason Vas Dias <jason.v...@gmail.com>
Date : 2008-09-07 13:59 (15 days old)

Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=11512
Subject : sort-of regression due to "kconfig: speed up all*config + randconfig"
Submitter : Alexey Dobriyan <adob...@gmail.com>
Date : 2008-09-05 22:50 (17 days old)
References : http://marc.info/?l=linux-kernel&m=122065498013858&w=4

Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=11506
Subject : oops during unmount - ext3? (2.6.27-rc5)
Submitter : Marcin Slusarz <marcin....@gmail.com>
Date : 2008-09-04 19:14 (18 days old)
References : http://marc.info/?l=linux-kernel&m=122055573123449&w=4

Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=11504
Subject : reiserfs BUG in 2.6.27-rc5
Submitter : Randy Dunlap <randy....@oracle.com>
Date : 2008-09-03 16:35 (19 days old)
References : http://marc.info/?l=linux-kernel&m=122045982120138&w=4

Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=11501
Subject : Failed to open destination file: Permission deniedihex2fw
Submitter : Andrew Morton <ak...@linux-foundation.org>
Date : 2008-09-04 18:34 (18 days old)
References : http://marc.info/?l=linux-kernel&m=122055342419068&w=4

Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=11476
Subject : failure to associate after resume from suspend to ram
Submitter : Michael S. Tsirkin <m.s.t...@gmail.com>
Date : 2008-09-01 13:33 (21 days old)
References : http://marc.info/?l=linux-kernel&m=122028529415108&w=4
Handled-By : Zhu Yi <yi....@intel.com>
             Dan Williams <dc...@redhat.com>
             Jouni Malinen <j...@w1.fi>

Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=11465
Subject : Linux-2.6.27-rc5, drm errors in log
Submitter : Gene Heskett <gene.h...@verizon.net>
Date : 2008-08-30 18:52 (23 days old)
References : http://marc.info/?l=linux-kernel&m=122012238925775&w=4
Handled-By : Dave Airlie <air...@gmail.com>

Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=11459
Subject : kernel crash after wifi connection established
Submitter : Alexey Kuznetsov <a...@axet.ru>
Date : 2008-08-30 03:08 (23 days old)

Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=11407
Subject : suspend: unable to handle kernel paging request
Submitter : Vegard Nossum <vegard...@gmail.com>
Date : 2008-08-21 17:28 (32 days old)
References : http://marc.info/?l=linux-kernel&m=121933974928881&w=4
Handled-By : Rafael J. Wysocki <r...@sisk.pl>
             Pekka Enberg <pen...@cs.helsinki.fi>
             Pavel Machek <pa...@suse.cz>

Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=11404
Subject : BUG: in 2.6.23-rc3-git7 in do_cciss_intr
Submitter : rdunlap <randy....@oracle.com>
Date : 2008-08-21 5:52 (32 days old)
References : http://marc.info/?l=linux-kernel&m=121929819616273&w=4
             http://marc.info/?l=linux-kernel&m=121932889105368&w=4
Handled-By : Miller, Mike (OS Dev) <Mike....@hp.com>
             James Bottomley <James.B...@hansenpartnership.com>

Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=11382
Subject : e1000e: 2.6.27-rc1 corrupts EEPROM/NVM
Submitter : David Vrabel <david....@csr.com>
Date : 2008-08-08 10:47 (45 days old)
References : http://marc.info/?l=linux-kernel&m=121819267211679&w=4
Handled-By : Christopher Li <chr...@vmware.com>

Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=11380
Subject : lockdep warning: cpu_add_remove_lock at:cpu_maps_update_begin+0x14/0x16
Submitter : Ingo Molnar <mi...@elte.hu>
Date : 2008-08-20 6:44 (33 days old)
References : http://marc.info/?l=linux-kernel&m=121921480931970&w=4

Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=11357
Subject : Can not boot up with zd1211rw USB-Wlan Stick
Submitter : uwe <ken...@freenet.de>
Date : 2008-08-16 14:17 (37 days old)

Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=11340
Subject : LTP overnight run resulted in unusable box
Submitter : Alexey Dobriyan <adob...@gmail.com>
Date : 2008-08-13 9:24 (40 days old)
References : http://marc.info/?l=linux-kernel&m=121861951902949&w=4

Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=11335
Subject : 2.6.27-rc2-git5 BUG: unable to handle kernel paging request
Submitter : Randy Dunlap <randy....@oracle.com>
Date : 2008-08-12 4:18 (41 days old)
References : http://marc.info/?l=linux-kernel&m=121851477201960&w=4
             http://lkml.org/lkml/2008/8/16/274
Handled-By : Hugh Dickins <hu...@veritas.com>

Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=11308
Subject : tbench regression on each kernel release from 2.6.22 -> 2.6.28
Submitter : Christoph Lameter <c...@linux-foundation.org>
Date : 2008-08-11 18:36 (42 days old)
References : http://marc.info/?l=linux-kernel&m=121847986119495&w=4
             http://marc.info/?l=linux-kernel&m=122125737421332&w=4

Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=11272
Subject : BUG: parport_serial in 2.6.27-rc1 for NetMos Technology PCI 9835
Submitter : Jaswinder Singh <jaswind...@gmail.com>
Date : 2008-08-05 15:12 (48 days old)
References : http://marc.info/?l=linux-kernel&m=121794900319776&w=4

Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=11271
Subject : BUG: fealnx in 2.6.27-rc1
Submitter : Jaswinder Singh <jaswind...@gmail.com>
Date : 2008-08-05 14:58 (48 days old)
References : http://marc.info/?l=linux-netdev&m=121794762016830&w=4
             http://lkml.org/lkml/2008/8/10/98
Handled-By : Francois Romieu <rom...@fr.zoreil.com>

Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=11264
Subject : Invalid op opcode in kernel/workqueue
Submitter : Jean-Luc Coulon <jean.lu...@gmail.com>
Date : 2008-08-07 04:18 (46 days old)

Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=11230
Subject : Kconfig no longer outputs a .config with freshly updated defconfigs
Submitter : Josh Boyer <jwb...@linux.vnet.ibm.com>
Date : 2008-08-02 16:03 (51 days old)
References : http://marc.info/?l=linux-kernel&m=121769306319391&w=4

Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=11224
Subject : Only three cores found on quad-core machine.
Submitter : Dave Jones <da...@redhat.com>
Date : 2008-08-01 18:15 (52 days old)
References : http://marc.info/?l=linux-kernel&m=121761475224719&w=4

Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=11220
Subject : Screen stays black after resume
Submitter : Nico Schottelius <ni...@schottelius.org>
Date : 2008-07-31 21:05 (53 days old)
References : http://marc.info/?l=linux-kernel&m=121753882422899&w=4

Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=11215
Subject : INFO: possible recursive locking detected ps2_command
Submitter : Zdenek Kabelac <zdenek....@gmail.com>
Date : 2008-07-31 9:41 (53 days old)
References : http://marc.info/?l=linux-kernel&m=121749737011637&w=4
Handled-By : Peter Zijlstra <a.p.zi...@chello.nl>

Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=11210
Subject : libata badness
Submitter : Kumar Gala <ga...@kernel.crashing.org>
Date : 2008-07-31 18:53 (53 days old)
References : http://marc.info/?l=linux-ide&m=121753059307310&w=4
Handled-By : Kumar Gala <ga...@kernel.crashing.org>

Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=11207
Subject : VolanoMark regression with 2.6.27-rc1
Submitter : Zhang, Yanmin <yanmin...@linux.intel.com>
Date : 2008-07-31 3:20 (53 days old)
References : http://marc.info/?l=linux-kernel&m=121747464114335&w=4
Handled-By : Zhang, Yanmin <yanmin...@linux.intel.com>
             Peter Zijlstra <a.p.zi...@chello.nl>
             Dhaval Giani <dha...@linux.vnet.ibm.com>
             Miao Xie <mi...@cn.fujitsu.com>

Regressions with patches
------------------------

Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=11555
Subject : rmmod ide-cd_mod: tried to init an initialized object, something is seriously wrong.
Submitter : Mariusz Kozlowski <m.koz...@tuxland.pl>
Date : 2008-07-16 2:22 (68 days old)
References : http://marc.info/?l=linux-ide&m=122061839713526&w=4
Handled-By : Jens Axboe <jens....@oracle.com>
Patch : http://marc.info/?l=linux-kernel&m=122095622602315&w=4

Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=11552
Subject : Disabling IRQ #23
Submitter : Justin Mattock <justin...@gmail.com>
Date : 2008-09-09 19:08 (13 days old)
References : http://marc.info/?l=linux-kernel&m=122098735230906&w=4
             http://marc.info/?l=linux-kernel&m=122107367715361&w=4
Handled-By : David Brownell <dav...@pacbell.net>
             Alan Stern <st...@rowland.harvard.edu>
Patch : http://marc.info/?l=linux-kernel&m=122187222705195&w=4

Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=11550
Subject : pnp: Huge number of "io resource overlap" messages
Submitter : Frans Pop <ele...@planet.nl>
Date : 2008-09-09 10:50 (13 days old)
References : http://marc.info/?l=linux-kernel&m=122095745403793&w=4
Handled-By : Rene Herman <rene....@keyaccess.nl>
             Bjorn Helgaas <bjorn....@hp.com>
Patch : http://marc.info/?l=linux-kernel&m=122098498125536&w=4

Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=11549
Subject : 2.6.27-rc5 acpi: EC Storm error message on bootup
Submitter : <jme...@wolfmountaingroup.com>
Date : 2008-09-02 21:27 (20 days old)
References : http://marc.info/?l=linux-kernel&m=122039255517586&w=4
Handled-By : Alexey Starikovskiy <astari...@suse.de>
Patch : http://marc.info/?l=linux-kernel&m=122098180019264&w=4

Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=11507
Subject : usb: sometimes dead keyboard after boot
Submitter : Frans Pop <ele...@planet.nl>
Date : 2008-08-26 21:03 (27 days old)
References : http://marc.info/?l=linux-kernel&m=121977815018224&w=2
Handled-By : Alan Stern <st...@rowland.harvard.edu>
Patch : http://www.spinics.net/lists/linux-usb/msg09735.html

Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=11505
Subject : oltp ~10% regression with 2.6.27-rc5 on stoakley machine
Submitter : Lin Ming <ming....@intel.com>
Date : 2008-09-04 7:06 (18 days old)
References : http://marc.info/?l=linux-kernel&m=122051202202373&w=4
             http://marc.info/?t=122089704700005&r=1&w=4
Handled-By : Peter Zijlstra <a.p.zi...@chello.nl>
             Gregory Haskins <ghas...@novell.com>
             Ingo Molnar <mi...@elte.hu>
Patch : http://marc.info/?l=linux-kernel&m=122194673932703&w=4

Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=11442
Subject : btusb hibernation/suspend breakage in current -git
Submitter : Rafael J. Wysocki <r...@sisk.pl>
Date : 2008-08-25 11:37 (28 days old)
References : http://marc.info/?l=linux-bluetooth&m=121966402012074&w=4
Handled-By : Oliver Neukum <oli...@neukum.org>
Patch : http://marc.info/?l=linux-bluetooth&m=121967226027323&w=4

Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=11439
Subject : [2.6.27-rc4-git4] compilation warnings
Submitter : Rufus & Azrael <rufus-...@numericable.fr>
Date : 2008-08-26 9:37 (27 days old)
References : http://marc.info/?l=linux-kernel&m=121974353815440&w=4
Handled-By : Greg KH <gre...@suse.de>
Patch : http://marc.info/?l=linux-kernel&m=121976424221858&w=4

Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=11237
Subject : corrupt PMD after resume
Submitter : Alan Jenkins <alan-j...@tuffmail.co.uk>
Date : 2008-08-02 9:51 (51 days old)
References : http://marc.info/?l=linux-kernel&m=121767073424952&w=4
Handled-By : Hugh Dickins <hu...@veritas.com>
             Jeremy Fitzhardinge <jer...@goop.org>
Patch : http://marc.info/?l=linux-kernel&m=122001615314700&w=2

For details, please visit the bug entries and follow the links given in
references.

As you can see, there is a Bugzilla entry for each of the listed regressions.
There also is a Bugzilla entry used for tracking the regressions from 2.6.26,
unresolved as well as resolved, at:

http://bugzilla.kernel.org/show_bug.cgi?id=11167

Please let me know if there are any Bugzilla entries that should be added to
the list in there.

Thanks,
Rafael

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majo...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
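The entries in the report above follow a simple "Field : value" layout, with extra references or handlers on continuation lines. As a rough illustration only, here is a minimal Python sketch of how such entries could be split into records. The field names are taken from the report itself; the function name `parse_entries`, the constants, and the sample text are assumptions made for this example, not part of any tool used to generate the list.

```python
import re

# Field names as they appear in the report; everything else here is hypothetical.
FIELD_RE = re.compile(
    r"^(Bug-Entry|Subject|Submitter|Date|References|Handled-By|Patch)\s*:\s*(.*)$"
)
MULTI_VALUED = {"References", "Handled-By"}  # fields that may span several lines


def parse_entries(text):
    """Split report text into a list of dicts, one per Bug-Entry record."""
    entries = []
    current = None      # record being filled in, if any
    last_field = None   # most recent field name, used for continuation lines
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            # A blank line ends the current record.
            current = None
            last_field = None
            continue
        match = FIELD_RE.match(line)
        if match:
            name, value = match.groups()
            if name == "Bug-Entry":
                current = {"Bug-Entry": value}
                entries.append(current)
            elif current is not None:
                if name in MULTI_VALUED:
                    current.setdefault(name, []).append(value)
                else:
                    current[name] = value
            last_field = name
        elif current is not None and last_field in MULTI_VALUED:
            # Continuation line: one more reference or handler.
            current[last_field].append(line)
    return entries


SAMPLE = """\
Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=11543
Subject : kernel panic: softlockup in tick_periodic() ???
Submitter : Joshua Hoblitt <j_ke...@hoblitt.com>
Date : 2008-09-11 16:46 (11 days old)
References : http://marc.info/?l=linux-kernel&m=122117786124326&w=4
Handled-By : Thomas Gleixner <tg...@linutronix.de>
Cyrill Gorcunov <gorc...@gmail.com>
Ingo Molnar <mi...@elte.hu>

Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=11590
Subject : Nokia 5310 Xpress usb-storage not mounting
Submitter : David Almaroad <dalm...@gmail.com>
Date : 2008-09-18 21:35 (4 days old)
"""

if __name__ == "__main__":
    for entry in parse_entries(SAMPLE):
        print(entry["Bug-Entry"], "|", entry["Subject"])
```

Multi-valued fields are kept as lists so that an entry with several handlers or references does not lose any of them; entries without a "References" or "Handled-By" line simply lack that key.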

Rafael J. Wysocki

Sep 21, 2008, 4:50:08 PM

This message has been generated automatically as a part of a report
of recent regressions.

The following bug entry is on the current list of known regressions
from 2.6.26. Please verify if it still should be listed and let me know
(either way).

Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=11380
Subject : lockdep warning: cpu_add_remove_lock at:cpu_maps_update_begin+0x14/0x16
Submitter : Ingo Molnar <mi...@elte.hu>
Date : 2008-08-20 6:44 (33 days old)
References : http://marc.info/?l=linux-kernel&m=121921480931970&w=4

Rafael J. Wysocki

Sep 21, 2008, 4:50:10 PM

This message has been generated automatically as a part of a report
of recent regressions.

The following bug entry is on the current list of known regressions
from 2.6.26. Please verify if it still should be listed and let me know
(either way).

Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=11516
Subject : severe performance degradation on x86_64 going from 2.6.26-rc9 -> 2.6.27-rc5
Submitter : Jason Vas Dias <jason.v...@gmail.com>
Date : 2008-09-07 13:59 (15 days old)

Rafael J. Wysocki

Sep 21, 2008, 4:50:11 PM

This message has been generated automatically as a part of a report
of recent regressions.

The following bug entry is on the current list of known regressions
from 2.6.26. Please verify if it still should be listed and let me know
(either way).

Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=11550
Subject : pnp: Huge number of "io resource overlap" messages
Submitter : Frans Pop <ele...@planet.nl>
Date : 2008-09-09 10:50 (13 days old)
References : http://marc.info/?l=linux-kernel&m=122095745403793&w=4
Handled-By : Rene Herman <rene....@keyaccess.nl>
             Bjorn Helgaas <bjorn....@hp.com>
Patch : http://marc.info/?l=linux-kernel&m=122098498125536&w=4

Rafael J. Wysocki

Sep 21, 2008, 4:50:08 PM

This message has been generated automatically as a part of a report
of recent regressions.

The following bug entry is on the current list of known regressions
from 2.6.26. Please verify if it still should be listed and let me know
(either way).

Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=11590
Subject : Nokia 5310 Xpress usb-storage not mounting
Submitter : David Almaroad <dalm...@gmail.com>
Date : 2008-09-18 21:35 (4 days old)

Rafael J. Wysocki

Sep 21, 2008, 4:50:10 PM

This message has been generated automatically as a part of a report
of recent regressions.

The following bug entry is on the current list of known regressions
from 2.6.26. Please verify if it still should be listed and let me know
(either way).

Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=11549
Subject : 2.6.27-rc5 acpi: EC Storm error message on bootup
Submitter : <jme...@wolfmountaingroup.com>
Date : 2008-09-02 21:27 (20 days old)
References : http://marc.info/?l=linux-kernel&m=122039255517586&w=4
Handled-By : Alexey Starikovskiy <astari...@suse.de>
Patch : http://marc.info/?l=linux-kernel&m=122098180019264&w=4

Rafael J. Wysocki

Sep 21, 2008, 4:50:06 PM

This message has been generated automatically as a part of a report
of recent regressions.

The following bug entry is on the current list of known regressions
from 2.6.26. Please verify if it still should be listed and let me know
(either way).

Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=11308
Subject : tbench regression on each kernel release from 2.6.22 -> 2.6.28
Submitter : Christoph Lameter <c...@linux-foundation.org>
Date : 2008-08-11 18:36 (42 days old)
References : http://marc.info/?l=linux-kernel&m=121847986119495&w=4
             http://marc.info/?l=linux-kernel&m=122125737421332&w=4

Rafael J. Wysocki

Sep 21, 2008, 4:50:11 PM

This message has been generated automatically as a part of a report
of recent regressions.

The following bug entry is on the current list of known regressions
from 2.6.26. Please verify if it still should be listed and let me know
(either way).

Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=11382
Subject : e1000e: 2.6.27-rc1 corrupts EEPROM/NVM
Submitter : David Vrabel <david....@csr.com>
Date : 2008-08-08 10:47 (45 days old)
References : http://marc.info/?l=linux-kernel&m=121819267211679&w=4
Handled-By : Christopher Li <chr...@vmware.com>

Rafael J. Wysocki

Sep 21, 2008, 4:50:16 PM

This message has been generated automatically as a part of a report
of recent regressions.

The following bug entry is on the current list of known regressions
from 2.6.26. Please verify if it still should be listed and let me know
(either way).

Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=11548
Subject : kernel BUG at drivers/pci/intel-iommu.c:1373!
Submitter : Chris Mason <chris...@oracle.com>
Date : 2008-09-08 14:26 (14 days old)
References : http://marc.info/?l=linux-kernel&m=122088566310440&w=4

Rafael J. Wysocki

Sep 21, 2008, 4:50:16 PM

This message has been generated automatically as a part of a report
of recent regressions.

The following bug entry is on the current list of known regressions
from 2.6.26. Please verify if it still should be listed and let me know
(either way).

Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=11506
Subject : oops during unmount - ext3? (2.6.27-rc5)
Submitter : Marcin Slusarz <marcin....@gmail.com>
Date : 2008-09-04 19:14 (18 days old)
References : http://marc.info/?l=linux-kernel&m=122055573123449&w=4

Rafael J. Wysocki

Sep 21, 2008, 4:50:17 PM

This message has been generated automatically as a part of a report
of recent regressions.

The following bug entry is on the current list of known regressions
from 2.6.26. Please verify if it still should be listed and let me know
(either way).

Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=11568
Subject : spontaneous reboot on resume with 2.6.27
Submitter : Andy Wettstein <ajw...@gmail.com>
Date : 2008-09-14 20:00 (8 days old)

Rafael J. Wysocki

Sep 21, 2008, 4:50:14 PM

This message has been generated automatically as a part of a report
of recent regressions.

The following bug entry is on the current list of known regressions
from 2.6.26. Please verify if it still should be listed and let me know
(either way).

Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=11608
Subject : 2.6.27-rc6 BUG: unable to handle kernel paging request
Submitter : John Daiker <daike...@gmail.com>
Date : 2008-09-16 23:00 (6 days old)
References : http://marc.info/?l=linux-kernel&m=122160611517267&w=4

Rafael J. Wysocki

Sep 21, 2008, 4:50:15 PM

This message has been generated automatically as a part of a report
of recent regressions.

The following bug entry is on the current list of known regressions
from 2.6.26. Please verify if it still should be listed and let me know
(either way).

Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=11569
Subject : Don't complain about disabled irqs when the system has paniced
Submitter : Andi Kleen <an...@firstfloor.org>
Date : 2008-09-02 13:49 (20 days old)
References : http://marc.info/?l=linux-kernel&m=122036356127282&w=4

Rafael J. Wysocki

Sep 21, 2008, 4:50:18 PM

This message has been generated automatically as a part of a report
of recent regressions.

The following bug entry is on the current list of known regressions
from 2.6.26. Please verify if it still should be listed and let me know
(either way).

Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=11404
Subject : BUG: in 2.6.23-rc3-git7 in do_cciss_intr
Submitter : rdunlap <randy....@oracle.com>
Date : 2008-08-21 5:52 (32 days old)
References : http://marc.info/?l=linux-kernel&m=121929819616273&w=4
             http://marc.info/?l=linux-kernel&m=121932889105368&w=4
Handled-By : Miller, Mike (OS Dev) <Mike....@hp.com>
             James Bottomley <James.B...@hansenpartnership.com>

Rafael J. Wysocki

Sep 21, 2008, 4:50:17 PM

This message has been generated automatically as a part of a report
of recent regressions.

The following bug entry is on the current list of known regressions
from 2.6.26. Please verify if it still should be listed and let me know
(either way).

Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=11439
Subject : [2.6.27-rc4-git4] compilation warnings
Submitter : Rufus & Azrael <rufus-...@numericable.fr>
Date : 2008-08-26 9:37 (27 days old)
References : http://marc.info/?l=linux-kernel&m=121974353815440&w=4
Handled-By : Greg KH <gre...@suse.de>
Patch : http://marc.info/?l=linux-kernel&m=121976424221858&w=4

Rafael J. Wysocki

Sep 21, 2008, 4:50:18 PM

This message has been generated automatically as a part of a report
of recent regressions.

The following bug entry is on the current list of known regressions
from 2.6.26. Please verify if it still should be listed and let me know
(either way).

Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=11609
Subject : oops in find_get_page
Submitter : Marcin Slusarz <marcin....@gmail.com>
Date : 2008-09-20 14:53 (2 days old)
References : http://marc.info/?l=linux-kernel&m=122192251101892&w=4

Mariusz Kozlowski

Sep 21, 2008, 5:00:12 PM

Hi,

> This message has been generated automatically as a part of a report
> of recent regressions.
>
> The following bug entry is on the current list of known regressions
> from 2.6.26. Please verify if it still should be listed and let me know
> (either way).

It was present in 2.6.27-rc6. A day or two later I checked mainline and
it was gone. I will recheck this and notify if needed.

> Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=11555
> Subject : rmmod ide-cd_mod: tried to init an initialized object, something is seriously wrong.
> Submitter : Mariusz Kozlowski <m.koz...@tuxland.pl>
> Date : 2008-07-16 2:22 (68 days old)
> References : http://marc.info/?l=linux-ide&m=122061839713526&w=4
> Handled-By : Jens Axboe <jens....@oracle.com>
> Patch : http://marc.info/?l=linux-kernel&m=122095622602315&w=4

Thanks,

Mariusz

jme...@wolfmountaingroup.com

Sep 21, 2008, 5:50:09 PM

> This message has been generated automatically as a part of a report
> of recent regressions.
>
> The following bug entry is on the current list of known regressions
> from 2.6.26. Please verify if it still should be listed and let me know
> (either way).
>
>
> Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=11549
> Subject : 2.6.27-rc5 acpi: EC Storm error message on bootup
> Submitter : <jme...@wolfmountaingroup.com>
> Date : 2008-09-02 21:27 (20 days old)
> References : http://marc.info/?l=linux-kernel&m=122039255517586&w=4
> Handled-By : Alexey Starikovskiy <astari...@suse.de>
> Patch : http://marc.info/?l=linux-kernel&m=122098180019264&w=4
>
>
>

This bug is corrected by Alexey's patch and has passed all regression tests.

Jeff

Alexey Starikovskiy

Sep 21, 2008, 6:00:24 PM

Hi Rafael,
The correct patch is the one attached to the Bugzilla entry,
not the one you mention.

Regards,
Alex.

Rafael J. Wysocki wrote:
> Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=11549
> Subject : 2.6.27-rc5 acpi: EC Storm error message on bootup
> Submitter : <jme...@wolfmountaingroup.com>
> Date : 2008-09-02 21:27 (20 days old)
> References : http://marc.info/?l=linux-kernel&m=122039255517586&w=4
> Handled-By : Alexey Starikovskiy <astari...@suse.de>
> Patch : http://marc.info/?l=linux-kernel&m=122098180019264&w=4
>

Michal 'vorner' Vaner

Sep 21, 2008, 7:20:18 PM

Hello

On Sun, Sep 21, 2008 at 08:54:23PM +0200, Rafael J. Wysocki wrote:
> The following bug entry is on the current list of known regressions
> from 2.6.26. Please verify if it still should be listed and let me know
> (either way).

Yes, it still does this with the newest kernel
(9824b8f11373b0df806c135a342da9319ef1d893). At least for me.

With regards

--
Please enter password:

Michal 'vorner' Vaner

Justin Mattock

Sep 21, 2008, 7:20:21 PM

On Sun, Sep 21, 2008 at 11:54 AM, Rafael J. Wysocki <r...@sisk.pl> wrote:
> This message has been generated automatically as a part of a report
> of recent regressions.
>

> The following bug entry is on the current list of known regressions
> from 2.6.26. Please verify if it still should be listed and let me know
> (either way).
>
>

> Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=11552
> Subject : Disabling IRQ #23
> Submitter : Justin Mattock <justin...@gmail.com>
> Date : 2008-09-09 19:08 (13 days old)
> References : http://marc.info/?l=linux-kernel&m=122098735230906&w=4
>              http://marc.info/?l=linux-kernel&m=122107367715361&w=4
> Handled-By : David Brownell <dav...@pacbell.net>
>              Alan Stern <st...@rowland.harvard.edu>
> Patch : http://marc.info/?l=linux-kernel&m=122187222705195&w=4
>
>
>

Not sure if it should be.
On my end, I did a bad install
of isight-firmware-tools, causing hal and udev
to clash. After making sure the package was using either
hal or udev, the "Disabling IRQ #23" message no longer appears.
If it's not too much trouble, is there a way to verify that this was
the case? E.g. if udev creates a device and then hal creates the same device,
will this cause ehci_hcd to emit messages of this kind? If so,
then that's what happened; if not, then there's something else causing this.

--
Justin P. Mattock

David Miller

Sep 21, 2008, 8:00:09 PM

From: "Rafael J. Wysocki" <r...@sisk.pl>
Date: Sun, 21 Sep 2008 20:54:13 +0200 (CEST)

> Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=11382
> Subject : e1000e: 2.6.27-rc1 corrupts EEPROM/NVM
> Submitter : David Vrabel <david....@csr.com>
> Date : 2008-08-08 10:47 (45 days old)
> References : http://marc.info/?l=linux-kernel&m=121819267211679&w=4
> Handled-By : Christopher Li <chr...@vmware.com>

Fixed by:

commit 78566fecbb12a7616ae9a88b2ffbc8062c4a89e3
Author: Christopher Li <chr...@vmware.com>
Date: Fri Sep 5 14:04:05 2008 -0700

    e1000: prevent corruption of EEPROM/NVM

    Andrey reports e1000 corruption, and that a patch in vmware's ESX fixed
    it.

    The EEPROM corruption is triggered by concurrent access of the EEPROM
    read/write. Putting a lock around it solve the problem.

    [ak...@linux-foundation.org: use DEFINE_SPINLOCK to avoid confusing lockdep]
    Signed-off-by: Christopher Li <chr...@vmware.com>
    Reported-by: Andrey Borzenkov <arvi...@mail.ru>
    Cc: Zach Amsden <za...@vmware.com>
    Cc: Pratap Subrahmanyam <pra...@vmware.com>
    Cc: Jeff Kirsher <jeffrey....@intel.com>
    Cc: Jesse Brandeburg <jesse.br...@intel.com>
    Cc: Bruce Allan <bruce....@intel.com>
    Cc: PJ Waskiewicz <peter.p.wa...@intel.com>
    Cc: John Ronciak <john.r...@intel.com>
    Cc: Jeff Garzik <je...@garzik.org>
    Signed-off-by: Andrew Morton <ak...@linux-foundation.org>
    Signed-off-by: Jeff Garzik <jga...@redhat.com>
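
The commit message above attributes the corruption to concurrent EEPROM reads and writes and says the fix was to put a lock around them. The following is only a userspace analogy in Python, not the driver change itself: the kernel patch uses a spinlock (the message mentions DEFINE_SPINLOCK), whereas this sketch uses `threading.Lock`. The names `FakeEeprom`, `hammer`, and `run_demo` are hypothetical and exist only for this illustration.

```python
import threading


class FakeEeprom:
    """Hypothetical stand-in for a device whose access is a multi-step sequence."""

    def __init__(self, size=16):
        self._cells = [0] * size
        self._lock = threading.Lock()  # serializes every read and write

    def add_to_word(self, offset, delta):
        # The read-modify-write runs as one unit: no other thread can
        # touch the cells between the read and the write-back.
        with self._lock:
            value = self._cells[offset]
            self._cells[offset] = value + delta

    def read_word(self, offset):
        with self._lock:
            return self._cells[offset]


def hammer(eeprom, offset, rounds):
    """Increment one word repeatedly, as a concurrent accessor would."""
    for _ in range(rounds):
        eeprom.add_to_word(offset, 1)


def run_demo(workers=4, rounds=1000):
    """Run several threads against the same word and return its final value."""
    eeprom = FakeEeprom()
    threads = [
        threading.Thread(target=hammer, args=(eeprom, 0, rounds))
        for _ in range(workers)
    ]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return eeprom.read_word(0)


if __name__ == "__main__":
    print(run_demo())
```

Because every access goes through the same lock, the final value equals workers times rounds regardless of how the threads are scheduled. Without the lock, two threads could read the same old value and one update would be lost, which is the general shape of the problem the commit describes.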

Rafael J. Wysocki

Sep 21, 2008, 4:50:09 PM

This message has been generated automatically as a part of a report
of recent regressions.

The following bug entry is on the current list of known regressions
from 2.6.26. Please verify if it still should be listed and let me know
(either way).

Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=11407
Subject : suspend: unable to handle kernel paging request
Submitter : Vegard Nossum <vegard...@gmail.com>
Date : 2008-08-21 17:28 (32 days old)
References : http://marc.info/?l=linux-kernel&m=121933974928881&w=4
Handled-By : Rafael J. Wysocki <r...@sisk.pl>
             Pekka Enberg <pen...@cs.helsinki.fi>
             Pavel Machek <pa...@suse.cz>

Steven Noonan

Sep 21, 2008, 4:50:11 PM

On Sun, Sep 21, 2008 at 11:54 AM, Rafael J. Wysocki <r...@sisk.pl> wrote:

> This message has been generated automatically as a part of a report
> of recent regressions.
>
> The following bug entry is on the current list of known regressions
> from 2.6.26. Please verify if it still should be listed and let me know
> (either way).
>
>

> Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=11551
> Subject : Semi-repeatable hard lockup on 2.6.27-rc6
> Submitter : Steven Noonan <ste...@uplinklabs.net>
> Date : 2008-09-10 18:07 (12 days old)
> References : http://marc.info/?l=linux-kernel&m=122107007407994&w=4
>
>

The machine with these symptoms was sent in for service on Friday. I
suspect there may have been dodgy hardware involved on this one. I
think this bug should be closed for the time being. Once I get the
machine back, I'll reopen the bug if I can still reproduce it.

- Steven

Rafael J. Wysocki

Sep 21, 2008, 4:50:08 PM

This message has been generated automatically as a part of a report
of recent regressions.

The following bug entry is on the current list of known regressions
from 2.6.26. Please verify if it still should be listed and let me know
(either way).

Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=11459
Subject : kernel crash after wifi connection established
Submitter : Alexey Kuznetsov <a...@axet.ru>
Date : 2008-08-30 03:08 (23 days old)

Rafael J. Wysocki

Sep 21, 2008, 4:50:12 PM

This message has been generated automatically as a part of a report
of recent regressions.

The following bug entry is on the current list of known regressions
from 2.6.26. Please verify if it still should be listed and let me know
(either way).

Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=11505
Subject : oltp ~10% regression with 2.6.27-rc5 on stoakley machine
Submitter : Lin Ming <ming....@intel.com>
Date : 2008-09-04 7:06 (18 days old)
References : http://marc.info/?l=linux-kernel&m=122051202202373&w=4
             http://marc.info/?t=122089704700005&r=1&w=4
Handled-By : Peter Zijlstra <a.p.zi...@chello.nl>
             Gregory Haskins <ghas...@novell.com>
             Ingo Molnar <mi...@elte.hu>
Patch : http://marc.info/?l=linux-kernel&m=122194673932703&w=4

Rafael J. Wysocki

Sep 21, 2008, 4:50:15 PM

This message has been generated automatically as a part of a report
of recent regressions.

The following bug entry is on the current list of known regressions
from 2.6.26. Please verify if it still should be listed and let me know
(either way).

Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=11551
Subject : Semi-repeatable hard lockup on 2.6.27-rc6
Submitter : Steven Noonan <ste...@uplinklabs.net>
Date : 2008-09-10 18:07 (12 days old)
References : http://marc.info/?l=linux-kernel&m=122107007407994&w=4

Rafael J. Wysocki

Sep 21, 2008, 4:50:16 PM

This message has been generated automatically as a part of a report
of recent regressions.

The following bug entry is on the current list of known regressions
from 2.6.26. Please verify if it still should be listed and let me know
(either way).

Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=11610
Subject : Problem with kernel commit 664d080c41463570b95717b5ad86e79dc1be0877
Submitter : Michal 'vorner' Vaner <vor...@ucw.cz>
Date : 2008-09-21 17:35 (1 days old)
References : http://marc.info/?l=linux-acpi&m=122201853409501&w=4

Rafael J. Wysocki

Sep 21, 2008, 4:50:13 PM

This message has been generated automatically as a part of a report
of recent regressions.

The following bug entry is on the current list of known regressions
from 2.6.26. Please verify if it still should be listed and let me know
(either way).

Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=11501
Subject : Failed to open destination file: Permission deniedihex2fw
Submitter : Andrew Morton <ak...@linux-foundation.org>
Date : 2008-09-04 18:34 (18 days old)
References : http://marc.info/?l=linux-kernel&m=122055342419068&w=4

Rafael J. Wysocki

Sep 21, 2008, 4:50:15 PM

This message has been generated automatically as a part of a report
of recent regressions.

The following bug entry is on the current list of known regressions
from 2.6.26. Please verify if it still should be listed and let me know
(either way).

Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=11552
Subject : Disabling IRQ #23
Submitter : Justin Mattock <justin...@gmail.com>
Date : 2008-09-09 19:08 (13 days old)
References : http://marc.info/?l=linux-kernel&m=122098735230906&w=4
             http://marc.info/?l=linux-kernel&m=122107367715361&w=4
Handled-By : David Brownell <dav...@pacbell.net>
             Alan Stern <st...@rowland.harvard.edu>
Patch : http://marc.info/?l=linux-kernel&m=122187222705195&w=4

Rafael J. Wysocki

Sep 21, 2008, 4:50:12 PM

This message has been generated automatically as a part of a report
of recent regressions.

The following bug entry is on the current list of known regressions
from 2.6.26. Please verify if it still should be listed and let me know
(either way).

Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=11476
Subject : failure to associate after resume from suspend to ram
Submitter : Michael S. Tsirkin <m.s.t...@gmail.com>
Date : 2008-09-01 13:33 (21 days old)
References : http://marc.info/?l=linux-kernel&m=122028529415108&w=4
Handled-By : Zhu Yi <yi....@intel.com>
             Dan Williams <dc...@redhat.com>
             Jouni Malinen <j...@w1.fi>

Rafael J. Wysocki

Sep 21, 2008, 4:50:15 PM

This message has been generated automatically as a part of a report
of recent regressions.

The following bug entry is on the current list of known regressions
from 2.6.26. Please verify if it still should be listed and let me know
(either way).

Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=11507
Subject : usb: sometimes dead keyboard after boot
Submitter : Frans Pop <ele...@planet.nl>
Date : 2008-08-26 21:03 (27 days old)
References : http://marc.info/?l=linux-kernel&m=121977815018224&w=2
Handled-By : Alan Stern <st...@rowland.harvard.edu>
Patch : http://www.spinics.net/lists/linux-usb/msg09735.html

Rafael J. Wysocki

Sep 21, 2008, 4:50:14 PM

This message has been generated automatically as a part of a report
of recent regressions.

The following bug entry is on the current list of known regressions
from 2.6.26. Please verify if it still should be listed and let me know
(either way).

Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=11442
Subject : btusb hibernation/suspend breakage in current -git
Submitter : Rafael J. Wysocki <r...@sisk.pl>
Date : 2008-08-25 11:37 (28 days old)
References : http://marc.info/?l=linux-bluetooth&m=121966402012074&w=4
Handled-By : Oliver Neukum <oli...@neukum.org>
Patch : http://marc.info/?l=linux-bluetooth&m=121967226027323&w=4

Rafael J. Wysocki

Sep 21, 2008, 4:50:13 PM

This message has been generated automatically as a part of a report
of recent regressions.

The following bug entry is on the current list of known regressions
from 2.6.26. Please verify if it still should be listed and let me know
(either way).

Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=11465
Subject : Linux-2.6.27-rc5, drm errors in log
Submitter : Gene Heskett <gene.h...@verizon.net>
Date : 2008-08-30 18:52 (23 days old)
References : http://marc.info/?l=linux-kernel&m=122012238925775&w=4
Handled-By : Dave Airlie <air...@gmail.com>

Rafael J. Wysocki

Sep 21, 2008, 4:50:17 PM

This message has been generated automatically as a part of a report
of recent regressions.

The following bug entry is on the current list of known regressions
from 2.6.26. Please verify if it still should be listed and let me know
(either way).

Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=11543
Subject : kernel panic: softlockup in tick_periodic() ???
Submitter : Joshua Hoblitt <j_ke...@hoblitt.com>
Date : 2008-09-11 16:46 (11 days old)
References : http://marc.info/?l=linux-kernel&m=122117786124326&w=4
Handled-By : Thomas Gleixner <tg...@linutronix.de>
             Cyrill Gorcunov <gorc...@gmail.com>
             Ingo Molnar <mi...@elte.hu>

Rafael J. Wysocki

Sep 21, 2008, 4:50:12 PM

This message has been generated automatically as a part of a report
of recent regressions.

The following bug entry is on the current list of known regressions
from 2.6.26. Please verify if it still should be listed and let me know
(either way).

Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=11512
Subject : sort-of regression due to "kconfig: speed up all*config + randconfig"
Submitter : Alexey Dobriyan <adob...@gmail.com>
Date : 2008-09-05 22:50 (17 days old)
References : http://marc.info/?l=linux-kernel&m=122065498013858&w=4

Rafael J. Wysocki

Sep 21, 2008, 4:50:10 PM

This message has been generated automatically as a part of a report
of recent regressions.

The following bug entry is on the current list of known regressions
from 2.6.26. Please verify if it still should be listed and let me know
(either way).

Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=11611
Subject : Commit 2344abbcbdb82140050e8be29d3d55e4f6fe860b breaks resume on nx6325
Submitter : Rafael J. Wysocki <r...@sisk.pl>
Date : 2008-09-20 23:24 (2 days old)
References : http://marc.info/?l=linux-kernel&m=122195277606974&w=4

Cyrill Gorcunov

Sep 22, 2008, 2:10:08 AM

[Rafael J. Wysocki - Sun, Sep 21, 2008 at 08:54:19PM +0200]

| This message has been generated automatically as a part of a report
| of recent regressions.
|
| The following bug entry is on the current list of known regressions
| from 2.6.26. Please verify if it still should be listed and let me know
| (either way).
|
|
| Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=11543
| Subject : kernel panic: softlockup in tick_periodic() ???
| Submitter : Joshua Hoblitt <j_ke...@hoblitt.com>
| Date : 2008-09-11 16:46 (11 days old)
| References : http://marc.info/?l=linux-kernel&m=122117786124326&w=4
| Handled-By : Thomas Gleixner <tg...@linutronix.de>
|              Cyrill Gorcunov <gorc...@gmail.com>
|              Ingo Molnar <mi...@elte.hu>
|
|

There are really multiple issues touched on in the report: nmi_watchdog
hangs, rtc device creation, NULL deref...

I've asked Joshua for more information. Since he has to use the
netdev tree for a while, maybe we could wait until the next merge
window closes and then check whether nmi_watchdog works.
So this is work in progress.

- Cyrill -

Dave Airlie

Sep 22, 2008, 3:00:17 AM

On Mon, Sep 22, 2008 at 9:51 AM, David Miller <da...@davemloft.net> wrote:
> From: "Rafael J. Wysocki" <r...@sisk.pl>
> Date: Sun, 21 Sep 2008 20:54:13 +0200 (CEST)
>
>> Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=11382
>> Subject : e1000e: 2.6.27-rc1 corrupts EEPROM/NVM
>> Submitter : David Vrabel <david....@csr.com>
>> Date : 2008-08-08 10:47 (45 days old)
>> References : http://marc.info/?l=linux-kernel&m=121819267211679&w=4
>> Handled-By : Christopher Li <chr...@vmware.com>
>
> Fixed by:
>
> commit 78566fecbb12a7616ae9a88b2ffbc8062c4a89e3
> Author: Christopher Li <chr...@vmware.com>
> Date: Fri Sep 5 14:04:05 2008 -0700
>
> e1000: prevent corruption of EEPROM/NVM
>
> Andrey reports e1000 corruption, and that a patch in vmware's ESX fixed
> it.
>
> The EEPROM corruption is triggered by concurrent access of the EEPROM
> read/write. Putting a lock around it solve the problem.
>

Just noticed I replied to davem and not to everyone.. so I did some
further hunting.

Okay so e1000e seems to have a problem in this area, that this *DOESN'T* fix.

I've reconstructed my boot timeline from message logs

Sep 3rd, I booted rawhide kernel 2.6.27-0.290.rc5.fc10.i686

I suspended/resumed a few times in between with no issues.

Sep 8th I booted my own 2.6.27-rc5 kernel based from
ec0c15afb41fd9ad45b53468b60db50170e22346

This got a corrupted e1000e checksum and every kernel since has.

Dave.

David Miller

Sep 22, 2008, 3:10:09 AM

From: "Dave Airlie" <air...@gmail.com>
Date: Mon, 22 Sep 2008 16:59:51 +1000

> I've reconstructed my boot timeline from message logs
>
> Sep 3rd, I booted rawhide kernel 2.6.27-0.290.rc5.fc10.i686

What upstream SHA1 is this based upon?

> I suspended/resume a few times in between with no issues.
>
> Sep 8th I booted my own 2.6.27-rc5 kernel based from
> ec0c15afb41fd9ad45b53468b60db50170e22346
>
> This got a corrupted e1000e checksum and every kernel since has.

Ok.

Alan Stern

Sep 22, 2008, 7:00:21 AM

On Sun, 21 Sep 2008, Justin Mattock wrote:

> From over here, I did a bad install
> of isight-firmware-tools, causing hal and udev
> to clash. After making sure the package was either
> using hal or udev, there is no message of disable irq #23.
> If its not too much trouble is there a way to verify that this was
> the case, i.g. if udev creates a dev, then hal creates the same device
> will this cause ehci_hcd to have messages of this kind? If so
> then thats what happened, if not then theres something else causing this.

You didn't read what I wrote earlier, did you? The "HC died" message
should NEVER occur! It doesn't matter what games you play with hal and
udev -- it should NEVER occur. Not ever.

And since the "HC died" is what causes IRQ #23 to be disabled, that
shouldn't happen either.

Alan Stern

Justin Mattock

Sep 22, 2008, 12:30:12 PM

On Mon, Sep 22, 2008 at 3:53 AM, Alan Stern <st...@rowland.harvard.edu> wrote:
> On Sun, 21 Sep 2008, Justin Mattock wrote:
>
>> From over here, I did a bad install
>> of isight-firmware-tools, causing hal and udev
>> to clash. After making sure the package was either
>> using hal or udev, there is no message of disable irq #23.
>> If its not too much trouble is there a way to verify that this was
>> the case, i.g. if udev creates a dev, then hal creates the same device
>> will this cause ehci_hcd to have messages of this kind? If so
>> then thats what happened, if not then theres something else causing this.
>
> You didn't read what I wrote earlier, did you? The "HC died" message
> should NEVER occur! It doesn't matter what games you play with hal and
> udev -- it should NEVER occur. Not ever.
>
> And since the "HC died" is what causes IRQ #23 to be disabled, that
> shouldn't happen either.
>
> Alan Stern
>
>

I apologize for not fully understanding;
I'm just getting confused about why and what is causing this
to occur. The only reason for playing with hal and udev
is to have this message appear; if I leave them out of the picture
the system runs fine.
Anyway, I'm up for trying anything at this point, and again
I apologize for causing any heat.

--
Justin P. Mattock

Jiri Kosina

Sep 22, 2008, 6:20:06 PM

On Mon, 22 Sep 2008, Dave Airlie wrote:

> Sep 8th I booted my own 2.6.27-rc5 kernel based from
> ec0c15afb41fd9ad45b53468b60db50170e22346
> This got a corrupted e1000e checksum and every kernel since has.

Have you restored the EEPROM contents after it got corrupted for the first
time?

Once the EEPROM contents get corrupted, the card will then be broken
forever, even on a kernel that gets this fixed one day.

This is a pretty serious bug in fact, as it renders the hardware of poor users
unusable, and just patching the kernel is then not enough to put things back
in shape.

--
Jiri Kosina
SUSE Labs

David Miller

Sep 22, 2008, 6:30:09 PM

From: Jiri Kosina <jko...@suse.cz>
Date: Tue, 23 Sep 2008 00:15:08 +0200 (CEST)

> On Mon, 22 Sep 2008, Dave Airlie wrote:
>
> > Sep 8th I booted my own 2.6.27-rc5 kernel based from
> > ec0c15afb41fd9ad45b53468b60db50170e22346
> > This got a corrupted e1000e checksum and every kernel since has.
>
> Have you restored the EEPROM contents after it got corrupted for the first
> time?
>
> Once the EEPROM contents get corrupted, the card will then be broken
> forever even on kernel that gets this fixed one day.
>
> This is pretty serious bug in fact, as it renders hardware of poor users
> unusable, and just patching kernel is then not enough to put things back
> to shape.

The top priority is to root cause this, so that we can stop the
problem from happening as fast as possible, and I'm still waiting for
the SHA1 ID that was used for the last kernel Dave booted before the
problem occurred which is pretty damn critical for making forward
progress here.

It could even be some PCI or x86 layer change that caused the corruption,
we don't even know yet.

Dave Airlie

Sep 22, 2008, 9:30:11 PM

On Tue, Sep 23, 2008 at 8:28 AM, David Miller <da...@davemloft.net> wrote:
> From: Jiri Kosina <jko...@suse.cz>
> Date: Tue, 23 Sep 2008 00:15:08 +0200 (CEST)
>
>> On Mon, 22 Sep 2008, Dave Airlie wrote:
>>
>> > Sep 8th I booted my own 2.6.27-rc5 kernel based from
>> > ec0c15afb41fd9ad45b53468b60db50170e22346
>> > This got a corrupted e1000e checksum and every kernel since has.
>>
>> Have you restored the EEPROM contents after it got corrupted for the first
>> time?
>>
>> Once the EEPROM contents get corrupted, the card will then be broken
>> forever even on kernel that gets this fixed one day.
>>
>> This is pretty serious bug in fact, as it renders hardware of poor users
>> unusable, and just patching kernel is then not enough to put things back
>> to shape.
>
> The top priority is to root cause this, so that we can stop the
> problem from happening as fast as possible, and I'm still waiting for
> the SHA1 ID that was used for the last kernel Dave booted before the
> problem occurred which is pretty damn critical for making forward
> progress here.

It was exactly 2.6.27-rc5 + Fedora at the time, but we rarely touch
these areas; most of the extra code is in other places, and since
people are seeing it on !Fedora also, I would assume it wasn't these.

I think people have seen it on earlier kernels maybe, but I'm not sure.

Really, Intel needs to get a fix of some sort out so we can repair the
hw and root cause the problem.

Dave.

David Miller

Sep 22, 2008, 10:00:14 PM

From: "Dave Airlie" <air...@gmail.com>
Date: Tue, 23 Sep 2008 11:26:52 +1000

> On Tue, Sep 23, 2008 at 8:28 AM, David Miller <da...@davemloft.net> wrote:
> > From: Jiri Kosina <jko...@suse.cz>
> > Date: Tue, 23 Sep 2008 00:15:08 +0200 (CEST)
> >
> >> On Mon, 22 Sep 2008, Dave Airlie wrote:
> >>
> >> > Sep 8th I booted my own 2.6.27-rc5 kernel based from
> >> > ec0c15afb41fd9ad45b53468b60db50170e22346
> >> > This got a corrupted e1000e checksum and every kernel since has.

...

> It was exactly 2.6.27-rc5 + Fedora at the time but we rarely touch
> these areas, most of the extra code is in other places, and since
> people are seeing it on !Fedora
> also I would assume it wasn't these.
>
> I think people have seen it on earlier kernels maybe but not sure.

So I went through the changes from 2.6.27-rc5 until the SHA1
ID ec0c15afb41fd9ad45b53468b60db50170e22346 and there were
definitely no E1000 or E1000E changes during that time.

Included in there is the HPET revert and other similarly themed
changes.

commit b4609472116bb806a95e98d04767189406c74c70
Author: Linus Torvalds <torv...@linux-foundation.org>
Date: Fri Aug 29 14:38:03 2008 -0700

Revert "x86: fix HPET regression in 2.6.26 versus 2.6.25, check hpet against BAR, v3"

This reverts commit a2bd7274b47124d2fc4dfdb8c0591f545ba749dd.

Some power management related changes stand out slightly:

commit 9d3593574702ae1899e23a1535da1ac71f928042
Author: John Kacur <jka...@gmail.com>
Date: Tue Sep 2 14:36:13 2008 -0700

pm_qos_requirement might sleep

and

commit 74c4633da7994eddcfcd2762a448c6889cc2b5bd
Author: Rafael J. Wysocki <r...@sisk.pl>
Date: Tue Sep 2 14:36:11 2008 -0700

rtc-cmos: wake again from S5

The rest of the changes in that range look completely benign.

Andy Wettstein

Sep 22, 2008, 10:20:07 PM

On Sun, Sep 21, 2008 at 08:54:21PM +0200, Rafael J. Wysocki wrote:
> This message has been generated automatically as a part of a report
> of recent regressions.
>
> The following bug entry is on the current list of known regressions
> from 2.6.26. Please verify if it still should be listed and let me know
> (either way).
>
>

> Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=11568
> Subject : spontaneous reboot on resume with 2.6.27
> Submitter : Andy Wettstein <ajw...@gmail.com>
> Date : 2008-09-14 20:00 (8 days old)

Just verified it is still a problem with 2.6.27-rc7.

Thomas Gleixner

Sep 23, 2008, 7:00:10 AM

On Sun, 21 Sep 2008, Rafael J. Wysocki wrote:
> This message has been generated automatically as a part of a report
> of recent regressions.
>
> The following bug entry is on the current list of known regressions
> from 2.6.26. Please verify if it still should be listed and let me know
> (either way).
>
>
> Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=11543
> Subject : kernel panic: softlockup in tick_periodic() ???

The softlockup issue itself is fixed, but there are issues with
nmi_watchdog. I think we should remove the regression and keep the bug
alive to chase the other issues.

Thanks,

tglx

Rafael J. Wysocki

Sep 23, 2008, 10:00:20 AM

On Tuesday, 23 of September 2008, Thomas Gleixner wrote:
> On Sun, 21 Sep 2008, Rafael J. Wysocki wrote:
> > This message has been generated automatically as a part of a report
> > of recent regressions.
> >
> > The following bug entry is on the current list of known regressions
> > from 2.6.26. Please verify if it still should be listed and let me know
> > (either way).
> >
> >
> > Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=11543
> > Subject : kernel panic: softlockup in tick_periodic() ???
>
> The softlockup issue itself is fixed, but there are issues with
> nmi_watchdog. I think we should remove the regression and keep the bug
> alive to chase the other issues.

Well, for the sake of documentation I'd prefer to close this bug and create a
new non-regression one for the other issues if that's not a problem.

Thanks,
Rafael

Jiri Kosina

Sep 23, 2008, 10:30:09 AM

On Mon, 22 Sep 2008, David Miller wrote:

> So I went through the changes from 2.6.27-rc5 until the SHA1
> ID ec0c15afb41fd9ad45b53468b60db50170e22346 and there were
> definitely no E1000 or E1000E changes during that time.

Some recent comments on [1] seem to indicate that this is somehow coupled
into prior problems/panics with Intel graphics.

David, was this also your case, or did the EEPROM get garbled all of a
sudden?

[1] https://bugzilla.novell.com/show_bug.cgi?id=425480

--
Jiri Kosina
SUSE Labs

Renato S. Yamane

Sep 23, 2008, 12:50:07 PM

Jiri Kosina wrote:
> Some recent comments on [1] seem to indicate that this is somehow coupled
> into prior problems/panics with Intel graphics.
>
> David, was this also your case, or did the EEPROM got garbled out of a
> sudden?
> [1] https://bugzilla.novell.com/show_bug.cgi?id=425480

And...
<http://lwn.net/Articles/299787>
<http://bugzilla.kernel.org/show_bug.cgi?id=11382>
<https://bugzilla.redhat.com/show_bug.cgi?id=459202>
<https://bugs.launchpad.net/ubuntu/+source/linux/+bug/263555>

Best regards,
Renato

Dave Airlie

Sep 23, 2008, 5:10:09 PM

On Wed, Sep 24, 2008 at 12:29 AM, Jiri Kosina <jko...@suse.cz> wrote:
> On Mon, 22 Sep 2008, David Miller wrote:
>
>> So I went through the changes from 2.6.27-rc5 until the SHA1
>> ID ec0c15afb41fd9ad45b53468b60db50170e22346 and there were
>> definitely no E1000 or E1000E changes during that time.
>
> Some recent comments on [1] seem to indicate that this is somehow coupled
> into prior problems/panics with Intel graphics.
>
> David, was this also your case, or did the EEPROM got garbled out of a
> sudden?

I have no evidence in my logs of a graphics panic, but I do do a lot
of graphics devel, so it might be a possibility, but I'd hate to
handwave it away at that.

Dave.

David Miller

Sep 23, 2008, 5:10:10 PM

From: Jiri Kosina <jko...@suse.cz>
Date: Tue, 23 Sep 2008 16:29:16 +0200 (CEST)

> On Mon, 22 Sep 2008, David Miller wrote:
>
> > So I went through the changes from 2.6.27-rc5 until the SHA1
> > ID ec0c15afb41fd9ad45b53468b60db50170e22346 and there were
> > definitely no E1000 or E1000E changes during that time.
>
> Some recent comments on [1] seem to indicate that this is somehow coupled
> into prior problems/panics with Intel graphics.

My current suspicion in all of this is either the GEM kernel patches
or recent X server.

However, the eeprom/nvram programming sequence seems non-trivial on
the e1000e. You have to execute a set of precise register writes
and register polls to successfully write things out to the nvram.

This makes something like a random scribble out to MMIO space less
likely to cause this problem.

Is there some linear mapping of the nvram that could be written to
on these cards?

Dave Airlie

Sep 23, 2008, 5:10:13 PM

On Wed, Sep 24, 2008 at 7:05 AM, David Miller <da...@davemloft.net> wrote:
> From: Jiri Kosina <jko...@suse.cz>
> Date: Tue, 23 Sep 2008 16:29:16 +0200 (CEST)
>
>> On Mon, 22 Sep 2008, David Miller wrote:
>>
>> > So I went through the changes from 2.6.27-rc5 until the SHA1
>> > ID ec0c15afb41fd9ad45b53468b60db50170e22346 and there were
>> > definitely no E1000 or E1000E changes during that time.
>>
>> Some recent comments on [1] seem to indicate that this is somehow coupled
>> into prior problems/panics with Intel graphics.
>
> My current suspicion in all of this is either the GEM kernel patches
> or recent X server.
>

I don't think OpenSUSE was shipping any of the GEM bits.

Dave.

David Miller

Sep 23, 2008, 6:10:15 PM

From: "Dave Airlie" <air...@gmail.com>
Date: Wed, 24 Sep 2008 07:09:09 +1000

> On Wed, Sep 24, 2008 at 7:05 AM, David Miller <da...@davemloft.net> wrote:
> > From: Jiri Kosina <jko...@suse.cz>
> > Date: Tue, 23 Sep 2008 16:29:16 +0200 (CEST)
> >
> >> On Mon, 22 Sep 2008, David Miller wrote:
> >>
> >> > So I went through the changes from 2.6.27-rc5 until the SHA1
> >> > ID ec0c15afb41fd9ad45b53468b60db50170e22346 and there were
> >> > definitely no E1000 or E1000E changes during that time.
> >>
> >> Some recent comments on [1] seem to indicate that this is somehow coupled
> >> into prior problems/panics with Intel graphics.
> >
> > My current suspicion in all of this is either the GEM kernel patches
> > or recent X server.
> >
>
> I don't think OpenSUSE was shipping any of the GEM bits.

Good data point, can someone confirm this? Also, what X server version
is the affected OpenSUSE shipping?

David Miller

Sep 23, 2008, 6:10:16 PM

From: "Dave Airlie" <air...@gmail.com>
Date: Wed, 24 Sep 2008 07:03:53 +1000

> I have no evidence in my logs of a graphics panic, but I do do a lot
> of graphics devel,
> so it might be a possiblity but I'd hate to handwave it away at that.

Let's not handwave, but rather try to figure out if that is part of
the pattern.

Right now we don't have any real leads, so data acquisition is really
important at this phase.

Jiri Kosina

Sep 23, 2008, 6:20:06 PM

On Tue, 23 Sep 2008, Jeff Kirsher wrote:

> >> I don't think OpenSUSE was shipping any of the GEM bits.
> > Good data point, can someone confirm this? Also, what X server version
> > is the effected OpenSUSE shipping?

> OpenSuSE 11 ships x server version 7.3.

Opensuse 11 is fine.

The problem can be reproduced [not only] on opensuse 11.1 beta1, which has

xorg-x11-7.4-1.6.x86_64.rpm

--
Jiri Kosina

Jeff Kirsher

Sep 23, 2008, 6:20:08 PM

On Tue, Sep 23, 2008 at 3:07 PM, David Miller <da...@davemloft.net> wrote:
> From: "Dave Airlie" <air...@gmail.com>
> Date: Wed, 24 Sep 2008 07:09:09 +1000
>
>> On Wed, Sep 24, 2008 at 7:05 AM, David Miller <da...@davemloft.net> wrote:
>> > From: Jiri Kosina <jko...@suse.cz>
>> > Date: Tue, 23 Sep 2008 16:29:16 +0200 (CEST)
>> >
>> >> On Mon, 22 Sep 2008, David Miller wrote:
>> >>
>> >> > So I went through the changes from 2.6.27-rc5 until the SHA1
>> >> > ID ec0c15afb41fd9ad45b53468b60db50170e22346 and there were
>> >> > definitely no E1000 or E1000E changes during that time.
>> >>
>> >> Some recent comments on [1] seem to indicate that this is somehow coupled
>> >> into prior problems/panics with Intel graphics.
>> >
>> > My current suspicion in all of this is either the GEM kernel patches
>> > or recent X server.
>> >
>>
>> I don't think OpenSUSE was shipping any of the GEM bits.
>
> Good data point, can someone confirm this? Also, what X server version
> is the effected OpenSUSE shipping?
> --

OpenSuSE 11 ships x server version 7.3.

--
Cheers,
Jeff

Chris Mason

Sep 23, 2008, 9:20:07 PM

On Sun, 2008-09-21 at 20:54 +0200, Rafael J. Wysocki wrote:
> This message has been generated automatically as a part of a report
> of recent regressions.
>
> The following bug entry is on the current list of known regressions
> from 2.6.26. Please verify if it still should be listed and let me know
> (either way).
>

I'm unable to reproduce this on 2.6.27-rc7. I don't think it has been
fixed, but I'm having a hard time finding a reliable way to trigger it
on newer kernels.

-chris

David Miller

Sep 24, 2008, 12:20:08 AM

From: Jiri Kosina <jko...@suse.cz>
Date: Wed, 24 Sep 2008 00:19:00 +0200 (CEST)

> On Tue, 23 Sep 2008, Jeff Kirsher wrote:
>
> > >> I don't think OpenSUSE was shipping any of the GEM bits.
> > > Good data point, can someone confirm this? Also, what X server version
> > > is the effected OpenSUSE shipping?
> > OpenSuSE 11 ships x server version 7.3.
>
> Opensuse 11 is fine.
>
> The problem can be reproduced [not only] on opensuse 11.1 beta1, which has
>
> xorg-x11-7.4-1.6.x86_64.rpm

I did some snooping around, and while doing so I noticed that the PCI
mmap code for x86 doesn't do one bit of range checking on the size, or
any other aspect of the request, wrt. the MMIO regions actually mapped
in the BARs of the PCI device.

Yikes!

It just does a reserve_memtype() on the address range, and says "ok".

So if, for example, the X server tries to mmap() more than an MMIO bar
actually maps, the kernel lets the user do this.

It would be very interesting to add the appropriate checks to
pci_mmap_page_range() in arch/x86/pci/i386.c, anyone who wants to do
this can use the code in arch/sparc64/kernel/pci.c:
__pci_mmap_make_offset() as a guide, and see what happens.

If the MMIO space regions of the video cards sit right before the
E1000E ones on the affected systems, that would pretty much
convince me that this is the kind of problem we are having here.

This also reminds me that there was that whole set of issues that
had to get worked out wrt. write-caching of mappings on x86.
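
The missing check described above can be sketched in plain C as follows. This is only a model under stated assumptions: `struct bar_range`, `request_within_bar()` and `mmap_request_ok()` are hypothetical names, and a real check would live in pci_mmap_page_range() and work on the device's actual resources rather than on a hand-built array.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for one PCI BAR covering [start, start + len). */
struct bar_range {
    uint64_t start;
    uint64_t len;
};

/* True if [req_start, req_start + req_len) lies entirely inside one BAR. */
static bool request_within_bar(const struct bar_range *bar,
                               uint64_t req_start, uint64_t req_len)
{
    if (req_len == 0 || req_start < bar->start)
        return false;

    uint64_t off = req_start - bar->start;

    /* Compare against the remaining space to avoid overflow in off + req_len. */
    return off < bar->len && req_len <= bar->len - off;
}

/* Accept an mmap request only if some BAR of the device fully contains it. */
static bool mmap_request_ok(const struct bar_range *bars, int nbars,
                            uint64_t req_start, uint64_t req_len)
{
    for (int i = 0; i < nbars; i++)
        if (request_within_bar(&bars[i], req_start, req_len))
            return true;
    return false;
}
```

With a video device whose only BAR is 1 MiB at 0xf0000000, a request for 0x120000 bytes starting at that address would be refused, because it spills into whatever is mapped next (in the scenario above, possibly the NIC's MMIO region).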

Dave Airlie

Sep 24, 2008, 1:50:07 AM

I'm still dubious about this; wouldn't we see other weird-ass side
effects if X was trashing the BARs on other devices?

I think tglx is on the right path, same problem as e1000, code is
stupid, it can reenter the nvram read/write code from irq
context, and pwn itself.

Dave.

David Newall

Sep 24, 2008, 2:10:09 AM

David Miller wrote:
> Right now we don't have any real leads, so data acquisition is really
> important at this phase.

Isn't this reliably reproducible? Assuming yes, Intel are such swell
guys that you might ask them to ship a few dozen cards to you to break
until you've tracked down the problem. I mean, it's a lot easier to
find this sort of fault when you can see it first hand than trying to
guess from third parties' reports, isn't it? For some reasonable value
of "you", that is so.

David Miller

Sep 24, 2008, 3:40:10 AM

From: "Dave Airlie" <air...@gmail.com>
Date: Wed, 24 Sep 2008 15:45:46 +1000

> I'm still dubious about this, wouldn't we see other wierdass side
> effects if X was trashing the BARs on other devices?

Sure. My theory is that it's a recent xorg change causing this,
so I've been going through GIT history for xserver, libpciaccess,
and the intel driver for the past year looking for clues.

If there is usually a gap after the video device, there would just
be no response from the PCI bus, and the way that's handled is
chipset specific. At least a while back, most x86 systems would
silently ignore writes and return all 1's in such a case, but
they may be generating bus error events these days. I simply don't
know.

> I think tglx is on the right path, same problem as e1000, code is
> stupid, it can reenter the nvram read/write code from irq
> context, and pwn itself.

The e1000e side here is reproducible way too easily for it to be the
same case, as far as I see it.

The e1000 driver has probably had this problem for years and we've
only recently had some concrete cases of it triggering.

Also, what utility are you running on your system that is even
accessing the NVRAM on the e1000e card? Knowing that might help
us understand why this problem has appeared now. Maybe there is
some diagnostic or monitoring tool that is now becoming prevalent
in these distributions where it triggers.

This problem started happening seemingly "all of a sudden", even to
people who have been keeping sort-of recent with their kernels, such
as yourself.

Yet we can't get any sense of what range of kernel versions are in
use when the problem triggers.

I'm about to leave for a week or so in Paris for the netfilter
workshop, so I hope that someone other than myself will do some data
mining like I have instead of (merely) tossing theories around and
finger pointing.

Dave Airlie

Sep 24, 2008, 5:00:17 AM

On Wed, Sep 24, 2008 at 5:36 PM, David Miller <da...@davemloft.net> wrote:
> From: "Dave Airlie" <air...@gmail.com>
> Date: Wed, 24 Sep 2008 15:45:46 +1000
>
>> I'm still dubious about this, wouldn't we see other wierdass side
>> effects if X was trashing the BARs on other devices?
>
> Sure. My theory is that it's a recent xorg change causing this,
> so I've been going through GIT history for xserver, libpciaccess,
> and the intel driver for the past year looking for clues.
>
> If there is usually a gap after the video device, there would just
> be no response from the PCI bus, and the way that's handled is
> chipset specific. At least a while back, most x86 systems would
> silently ignore writes and return all 1's in such a case, but
> they may be generating bus error events these days. I simply don't
> know.

The only thing I can think of then is either the pciaccess conversion
of the intel Xorg driver, or maybe something going wrong since PAT
support was added.

>
>> I think tglx is on the right path, same problem as e1000, code is
>> stupid, it can reenter the nvram read/write code from irq
>> context, and pwn itself.
>
> The e1000e side here is reproducable way too easily for it to be the
> same case, as far as I see it.
>
> The e1000 driver has probably had this problem for years and we've
> only recently had some concrete cases of it triggering.
>
> Also, what utility are you running on your system that is even
> accessing the NVRAM on the e1000e card? Knowing that might help
> us understand why this problem has appeared now. Maybe there is
> some diagnostic or monitoring tool that is now becoming prevalent
> in these distributions where it triggers.

The driver seems quite happy to access the NVRAM, I think Thomas has
some backtraces that show it clearly doing silly reentrant things...

>
> This problem started happening seemingly "all of a sudden", even to
> people who have been keeping sort-of recent with their kernels, such
> as yourself.
>
> Yet we can't get any sense yet what range of kernel versions are in
> use when the problem triggers.

I've seen it reported at least at 2.6.27-rc1 and maybe even one of
Fedora's -rc0 kernels.

Dave.

David Miller

Sep 24, 2008, 5:10:07 AM

From: "Dave Airlie" <air...@gmail.com>
Date: Wed, 24 Sep 2008 18:59:34 +1000

> On Wed, Sep 24, 2008 at 5:36 PM, David Miller <da...@davemloft.net> wrote:
> The driver seems quite happy to access the NVRAM, I think Thomas has
> some backtraces that show
> it clearly doing silly reentrant things...

I don't dispute that the locking is dodgy and likely needs to be fixed
like e1000.

I'm asking what userland tool or kernel event is triggering the nvram
access.

It shouldn't even touch the thing after probing and initializing
the card.

Dave Airlie

Sep 24, 2008, 5:20:05 AM

On Wed, Sep 24, 2008 at 7:01 PM, David Miller <da...@davemloft.net> wrote:
> From: "Dave Airlie" <air...@gmail.com>
> Date: Wed, 24 Sep 2008 18:59:34 +1000
>
>> On Wed, Sep 24, 2008 at 5:36 PM, David Miller <da...@davemloft.net> wrote:
>> The driver seems quite happy to access the NVRAM, I think Thomas has
>> some backtraces that show
>> it clearly doing silly reentrant things...
>
> I don't dispute that the locking is dodgy and likely needs to be fixed
> like e1000.
>
> I'm asking what userland tool or kernel event is triggering the nvram
> access.
>
> It shouldn't even touch the thing after probing and initializing
> the card.

Hopefully tglx can supply some traces, I think getting an interrupt
during device startup can possibly access the nvram

http://www.tglx.de/~tglx/wtf2.txt

seems to suggest bad things could happen.

Dave.

Jonathan Corbet

Sep 24, 2008, 12:30:13 PM

On Wed, 24 Sep 2008 00:36:38 -0700 (PDT)
David Miller <da...@davemloft.net> wrote:

> I'm about to leave for a week or so in Paris for the netfilter
> workshop, so I hope that someone other than myself will do some data
> mining like I have instead of (merely) tossing theories around and
> finger pointing.

A data point, just in case it helps... I've not had time to update my
desktop system, so this all-Intel, ICH9, e1000e-based box has been stuck
at 2.6.27-rc3. It has rawhide as of shortly after the floodgates
reopened (but with my own kernel); that means X server 1.5.0 and i810
2.4.2-3.

It's happy as a clam. I'm not sure how often this problem bites, but it
hasn't gotten me.

jon

Jiri Kosina

Sep 24, 2008, 12:40:09 PM

On Wed, 24 Sep 2008, Dave Airlie wrote:

> Hopefully tglx can supply some traces, I think getting an interrupt
> during device startup can possibly access the nvram
> http://www.tglx.de/~tglx/wtf2.txt
> seems to suggest bad things could happen.

Actually another user has just reported [1] that his e1000e card got
screwed up exactly at the point when the installer was probing the X
configuration. So this really seems a lot like some lethal interaction
between intel graphics and the network card.

Dave (Airlie, too many Daves on CC here really), do you by any chance see
any recent change in the Intel graphics parts of the kernel DRM that could
be causing this breakage?

[1] https://bugzilla.novell.com/show_bug.cgi?id=425480#c69

--
Jiri Kosina
SUSE Labs

Jiri Kosina

Sep 24, 2008, 12:40:19 PM

On Wed, 24 Sep 2008, Jiri Kosina wrote:

> Dave (Airlie, too many Daves on CC here really), do you by any chance
> see any recent change in kernel intel graphic parts of DRM be causing
> this breakage?

BTW, why is the PAT fix implemented in commit 242e3df80 needed only for
radeons?

Jiri Kosina

Sep 24, 2008, 1:00:20 PM

On Wed, 24 Sep 2008, Jonathan Corbet wrote:

> A data point, just in case it helps... I've not had time to update my
> desktop system, so this all-Intel, ICH9, e1000e-based box has been stuck
> at 2.6.27-rc3. It has rawhide as of shortly after the floodgates
> reopened (but with my own kernel); that means X server 1.5.0 and i810
> 2.4.2-3.
> It's happy as a clam. I'm not sure how often this problem bites, but it
> hasn't gotten me.

Thanks for the information.

It seems to trigger quite often during the very first probing of the
graphics card at initial X startup. Karsten is currently writing a
tool that will safely restore the EEPROM contents to the card. When this
gets done, testing will get much easier and hopefully we'll be able to
isolate whether it is the e1000e driver (I currently don't think so), the
DRM kernel code, or xorg 7.4 causing this.

Thanks,

--
Jiri Kosina
SUSE Labs

Jiri Kosina

Sep 24, 2008, 2:10:09 PM

On Wed, 24 Sep 2008, Jiri Kosina wrote:

> > Dave (Airlie, too many Daves on CC here really), do you by any chance
> > see any recent change in kernel intel graphic parts of DRM be causing
> > this breakage?
> BTW, why is the PAT fix implented in commit 242e3df80 needed only for
> radeons?

A further important observation: as far as I can see, every machine
affected by this bug (and the number of reports is increasing) was
using i915 DRM.

Rafael J. Wysocki

Sep 24, 2008, 2:20:06 PM

On Wednesday, 24 of September 2008, Chris Mason wrote:
> On Sun, 2008-09-21 at 20:54 +0200, Rafael J. Wysocki wrote:
> > This message has been generated automatically as a part of a report
> > of recent regressions.
> >
> > The following bug entry is on the current list of known regressions
> > from 2.6.26. Please verify if it still should be listed and let me know
> > (either way).
> >
>
> I'm unable to reproduce this on 2.6.27-rc7. I don't think it has been
> fixed, but I'm having a hard time finding a reliable way to trigger it
> on newer kernels.

Thanks for the update.

For now, I'll close it as 'unreproducible'. Please reopen if it happens again.

Thanks,
Rafael

Kyle McMartin

Sep 24, 2008, 3:20:17 PM

On Wed, Sep 24, 2008 at 12:36:38AM -0700, David Miller wrote:
> The e1000e side here is reproducable way too easily for it to be the
> same case, as far as I see it.
>

I've been working on a patch to detect (using a timer and checking at
up/down) whether or not the flash has been corrupted and, if it has,
rewrite it with the saved good copy (which obviously only helps if
it's the same boot.)

Unfortunately, I don't have enough time to finish it before I go away
for the weekend, so I'll toss it over the wall and see if it sticks to
anything.

At a glance, one would need to add support for rewriting
adapter->hw.flash from ethtool if someone reprograms the good firmware
back, and writing the good flash back on down/remove if it detects
a change.

Bear in mind, super quick hack, and I haven't even run-tested it yet.

If nobody decides to run with it, I'll probably give it another poke
late tonight.

Definitely-not-signed-off-by-or-tested-by: Kyle

At the very least, if someone pokes in a hexdump of the firmware, we
might be able to see some of the method to the madness of the
corruption pattern.

diff --git a/drivers/net/e1000e/e1000.h b/drivers/net/e1000e/e1000.h
index ac4e506..08cce8c 100644
--- a/drivers/net/e1000e/e1000.h
+++ b/drivers/net/e1000e/e1000.h
@@ -168,6 +168,7 @@ struct e1000_adapter {
struct timer_list watchdog_timer;
struct timer_list phy_info_timer;
struct timer_list blink_timer;
+ struct timer_list flash_timer;

struct work_struct reset_task;
struct work_struct watchdog_task;
diff --git a/drivers/net/e1000e/hw.h b/drivers/net/e1000e/hw.h
index 74f263a..ca3f645 100644
--- a/drivers/net/e1000e/hw.h
+++ b/drivers/net/e1000e/hw.h
@@ -863,6 +863,11 @@ struct e1000_hw {

u8 __iomem *hw_addr;
u8 __iomem *flash_address;
+ int flash_len;
+
+ u8 *flash;
+ u8 *flash_backup;
+ spinlock_t flashlock;

struct e1000_mac_info mac;
struct e1000_fc_info fc;
diff --git a/drivers/net/e1000e/netdev.c b/drivers/net/e1000e/netdev.c
index d266510..13f05f8 100644
--- a/drivers/net/e1000e/netdev.c
+++ b/drivers/net/e1000e/netdev.c
@@ -2535,6 +2535,7 @@ void e1000e_down(struct e1000_adapter *adapter)

del_timer_sync(&adapter->watchdog_timer);
del_timer_sync(&adapter->phy_info_timer);
+ del_timer_sync(&adapter->flash_timer);

netdev->tx_queue_len = adapter->tx_queue_len;
netif_carrier_off(netdev);
@@ -2922,6 +2923,33 @@ static void e1000_update_phy_info(unsigned long data)
e1000_get_phy_info(&adapter->hw);
}

+static inline int e1000_test_flash(struct e1000_adapter *adapter)
+{
+ int ret = 0;
+
+ if (adapter->hw.flash && adapter->hw.flash_backup) {
+ spin_lock(&adapter->hw.flashlock);
+ memcpy(adapter->hw.flash_backup, adapter->hw.flash_address,
+ adapter->hw.flash_len);
+ ret = memcmp(adapter->hw.flash, adapter->hw.flash_backup,
+ adapter->hw.flash_len);
+ spin_unlock(&adapter->hw.flashlock);
+ if (ret) {
+ /* dump_eeprom(adapter); */
+ printk(KERN_ERR "AWOOOGA AWOOOGA flash changed\n");
+ }
+ }
+
+ return ret;
+}
+
+static void e1000_flash_test(unsigned long data)
+{
+ struct e1000_adapter *adapter = (struct e1000_adapter *) data;
+ e1000_test_flash(adapter);
+ mod_timer(&adapter->flash_timer, jiffies+(20*HZ));
+}
+
/**
* e1000e_update_stats - Update the board statistics counters
* @adapter: board private structure
@@ -4439,6 +4467,22 @@ static int __devinit e1000_probe(struct pci_dev *pdev,
adapter->hw.flash_address = ioremap(flash_start, flash_len);
if (!adapter->hw.flash_address)
goto err_flashmap;
+
+ adapter->hw.flash_len = (int)flash_len;
+ /* stash away a copy of the flash, and allocate
+ space for a second copy... */
+ if (!adapter->hw.flash) {
+ u8 *flash = kmalloc(flash_len, GFP_KERNEL);
+ u8 *flash_backup = kmalloc(flash_len, GFP_KERNEL);
+ if (flash && flash_backup) {
+ memcpy(flash, adapter->hw.flash_address,
+ adapter->hw.flash_len);
+ adapter->hw.flash = flash;
+ adapter->hw.flash_backup = flash_backup;
+ spin_lock_init(&adapter->hw.flashlock);
+ }
+ }
+
}

/* construct the net_device struct */
@@ -4570,6 +4614,10 @@ static int __devinit e1000_probe(struct pci_dev *pdev,
adapter->phy_info_timer.function = &e1000_update_phy_info;
adapter->phy_info_timer.data = (unsigned long) adapter;

+ init_timer(&adapter->flash_timer);
+ adapter->flash_timer.function = &e1000_flash_test;
+ adapter->flash_timer.data = (unsigned long) adapter;
+
INIT_WORK(&adapter->reset_task, e1000_reset_task);
INIT_WORK(&adapter->watchdog_task, e1000_watchdog_task);

@@ -4641,6 +4689,9 @@ static int __devinit e1000_probe(struct pci_dev *pdev,

e1000_print_device_info(adapter);

+ /* every twenty seconds, test the flash */
+ mod_timer(&adapter->flash_timer, jiffies+(HZ*20));
+
return 0;

err_register:
@@ -4690,6 +4741,7 @@ static void __devexit e1000_remove(struct pci_dev *pdev)
set_bit(__E1000_DOWN, &adapter->state);
del_timer_sync(&adapter->watchdog_timer);
del_timer_sync(&adapter->phy_info_timer);
+ del_timer_sync(&adapter->flash_timer);

flush_scheduled_work();
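
The snapshot-and-compare idea in the patch above, extended with a summary of what changed, might look roughly like this in plain userspace C. It is a sketch under assumptions, not driver code: `flash_compare` and `struct flash_diff` are hypothetical names, and the `all_ff` field is there only because reports in this thread describe 0xff being written over the EEPROM.

```c
#include <stddef.h>
#include <stdint.h>

/* Summary of how the current flash image differs from a known-good snapshot. */
struct flash_diff {
    size_t first;   /* offset of the first differing byte (valid if count > 0) */
    size_t count;   /* number of differing bytes */
    int all_ff;     /* 1 if every differing byte now reads 0xff */
};

/* Hypothetical helper: compare 'now' against 'good' over 'len' bytes. */
struct flash_diff flash_compare(const uint8_t *good, const uint8_t *now,
                                size_t len)
{
    struct flash_diff d = { 0, 0, 1 };

    for (size_t i = 0; i < len; i++) {
        if (good[i] == now[i])
            continue;
        if (d.count == 0)
            d.first = i;
        d.count++;
        if (now[i] != 0xff)
            d.all_ff = 0;
    }
    if (d.count == 0)
        d.all_ff = 0;   /* nothing changed, so nothing was overwritten */
    return d;
}
```

Logging `first`, `count` and `all_ff` alongside the "flash changed" message would say where the damage starts and whether it matches the all-ones pattern, without needing a full hexdump.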

Jesse Brandeburg

Sep 24, 2008, 3:30:27 PM

On Wed, Sep 24, 2008 at 12:10 PM, Kyle McMartin <ky...@mcmartin.ca> wrote:
> At the very least, if someone pokes in a hexdump of the firmware, at
> least we might be able to see some of the method to the madness of the
> corruption pattern.

Thanks Kyle!

Attached is a patch to dump the eeprom to dmesg (first 64 bytes) at
boot for e1000e, which goes along with the AWOOGA part of your
patch.

e1000e-dump-eeprom-to-dmesg.txt

David Miller

Sep 24, 2008, 4:00:17 PM

From: Kyle McMartin <ky...@mcmartin.ca>
Date: Wed, 24 Sep 2008 15:10:22 -0400

> I've been working on a patch to detect (using a timer and checking at
> up/down) whether or not the flash has been corrupted, and, if it is
> rewrite it with the saved good copy (which obviously only helps if
> it's the same boot.)

Looks interesting, I hope someone runs with it :-)

If the flash is seen as corrupt, we should print the current process
that is running at the time, and perhaps a pt_regs dump, as these
might provide the most important clues to diagnosing this.

Dave Airlie

Sep 24, 2008, 4:10:09 PM

On Thu, Sep 25, 2008 at 2:33 AM, Jiri Kosina <jko...@suse.cz> wrote:
> On Wed, 24 Sep 2008, Dave Airlie wrote:
>
>> Hopefully tglx can supply some traces, I think getting an interrupt
>> during device startup can possibly access the nvram
>> http://www.tglx.de/~tglx/wtf2.txt
>> seems to suggest bad things could happen.
>
> Actually another user has just reported [1] that his e1000e card got
> screwed up exactly at the point when the installer was probing the X
> configuration. So this really seems a lot like some lethal interaction
> between intel graphics and the network card.
>
> Dave (Airlie, too many Daves on CC here really), do you by any chance see
> any recent change in kernel intel graphic parts of DRM be causing this
> breakage?
>

Okay, so from the kernel side: if this isn't in 2.6.26, the drm has
introduced no patches I can even remotely claim might affect this.
So it's either userspace or PAT related.

Dave.

Dave Airlie

Sep 24, 2008, 4:20:08 PM

On Thu, Sep 25, 2008 at 2:37 AM, Jiri Kosina <jko...@suse.cz> wrote:
> On Wed, 24 Sep 2008, Jiri Kosina wrote:
>
>> Dave (Airlie, too many Daves on CC here really), do you by any chance
>> see any recent change in kernel intel graphic parts of DRM be causing
>> this breakage?
>
> BTW, why is the PAT fix implented in commit 242e3df80 needed only for
> radeons?
>

Good question. Mainly because only radeons showed the illegal mapping
crash, where mapping via the sysfs _wc files and then doing a UC mapping
in the kernel over the same address space would fail. However, this was
VRAM related and these things don't have VRAM.

Dave.

Theodore Tso

Sep 24, 2008, 4:50:12 PM

On Wed, Sep 24, 2008 at 10:27:30AM -0600, Jonathan Corbet wrote:
> On Wed, 24 Sep 2008 00:36:38 -0700 (PDT)
> David Miller <da...@davemloft.net> wrote:
>
> > I'm about to leave for a week or so in Paris for the netfilter
> > workshop, so I hope that someone other than myself will do some data
> > mining like I have instead of (merely) tossing theories around and
> > finger pointing.
>
> A data point, just in case it helps... I've not had time to update my
> desktop system, so this all-Intel, ICH9, e1000e-based box has been stuck
> at 2.6.27-rc3. It has rawhide as of shortly after the floodgates
> reopened (but with my own kernel); that means X server 1.5.0 and i810
> 2.4.2-3.

I'm running a 2.6.26-rc6 kernel on a X61s laptop, which is an
all-Intel ICH8, using the e1000e driver, and I haven't been bitten
by the problem either. I'm using an Ubuntu Hardy userspace,
which means I'm using an 1.4.0.90 X Server with an i915 drm version
1.6.0 20060119, and my e1000 EEPROM hasn't been blasted to oblivion
yet!

Personally, I don't plan on upgrading to a newer userspace until we
figure out what the heck is going on. :-)

- Ted

Jiri Kosina

Sep 24, 2008, 6:40:07 PM

On Wed, 24 Sep 2008, Kyle McMartin wrote:

> I've been working on a patch to detect (using a timer and checking at
> up/down) whether or not the flash has been corrupted, and, if it is
> rewrite it with the saved good copy (which obviously only helps if
> it's the same boot.)

Thanks, this looks like an interesting e1000e hack that might be of some help.

BUT! please have a look at

http://lkml.org/lkml/2008/9/24/133

Looks like this device got a lot of 0xff written somewhere in its config
space, right? But it isn't an Intel card at all.

--
Jiri Kosina
SUSE Labs

Parag Warudkar

Sep 24, 2008, 7:00:21 PM

On Wed, Sep 24, 2008 at 12:33 PM, Jiri Kosina <jko...@suse.cz> wrote:

> Actually another user has just reported [1] that his e1000e card got
> screwed up exactly at the point when the installer was probing the X
> configuration. So this really seems a lot like some lethal interaction
> between intel graphics and the network card.
>

Another data point in support of this theory: I've been running
various 2.6.27-rc releases (including rc7) on my HP machine which
has an embedded 82566 and Radeon x1650 graphics, and so far I have
not seen any problems.

Parag

Jiri Kosina

Sep 24, 2008, 7:20:10 PM

On Tue, 23 Sep 2008, David Miller wrote:

> I did some snooping around, and while doing so I noticed that the PCI
> mmap code for x86 doesn't do one bit of range checking on the size, or
> any other aspect of the request, wrt. the MMIO regions actually mapped
> in the BARs of the PCI device.

Ugh, indeed. Added Ingo and Jesse to CC.

> Yikes!
>
> It just does a reserve_memtype() on the address range, and says "ok".
>
> So if, for example, the X server tries to mmap() more than an MMIO bar
> actually maps, the kernel lets the user do this.
>
> It would be very interesting to add the appropriate checks to
> pci_mmap_page_range() in arch/x86/pci/i386.c, anyone who wants to do
> this can use the code in arch/sparc64/kernel/pci.c:
> __pci_mmap_make_offset() as a guide, and see what happens.

Absolutely. Or we can even do some dirty hackery in userspace, like
LD_PRELOADing X server and checking mmaps() that are close to MMIO regions
of affected devices.

> If the MMIO space regions of the video cards sit right before the
> E1000E ones on the effected systems, that would pretty much
> convince me that this is the kind of problem we are having here.

Unfortunately, looking at the lspci outputs that are in
https://bugzilla.novell.com/show_bug.cgi?id=425480 it seems to me that the
MMIO regions are quite far away from each other.
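
For reference, the core of the range check David describes (whether done in the kernel or in an LD_PRELOAD shim) is a containment test of the requested window against the device's BARs. A minimal, self-contained sketch follows; it is not the kernel's `pci_mmap_page_range()` nor the sparc64 code, and `struct bar_region` and `mmap_request_in_bar` are hypothetical names used only for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

/* One memory BAR of a device: bus address and length in bytes. */
struct bar_region {
    uint64_t start;
    uint64_t len;
};

/* True only if [offset, offset + size) lies entirely inside a single BAR.
 * Written to avoid overflow when offset + size would wrap. */
bool mmap_request_in_bar(const struct bar_region *bars, unsigned nbars,
                         uint64_t offset, uint64_t size)
{
    if (size == 0)
        return false;

    for (unsigned i = 0; i < nbars; i++) {
        const struct bar_region *b = &bars[i];

        if (b->len == 0 || offset < b->start)
            continue;

        uint64_t rel = offset - b->start;   /* position inside this BAR */
        if (rel < b->len && size <= b->len - rel)
            return true;
    }
    return false;
}
```

A request that starts inside a BAR but runs past its end is rejected, which is exactly the case the current x86 code lets through.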

--
Jiri Kosina
SUSE Labs

Jesse Barnes

Sep 24, 2008, 8:30:07 PM

On Wednesday, September 24, 2008 4:15 pm Jiri Kosina wrote:
> On Tue, 23 Sep 2008, David Miller wrote:
> > I did some snooping around, and while doing so I noticed that the PCI
> > mmap code for x86 doesn't do one bit of range checking on the size, or
> > any other aspect of the request, wrt. the MMIO regions actually mapped
> > in the BARs of the PCI device.
>
> Ugh, indeed. Added Ingo and Jesse to CC.
>
> > Yikes!
> >
> > It just does a reserve_memtype() on the address range, and says "ok".
> >
> > So if, for example, the X server tries to mmap() more than an MMIO bar
> > actually maps, the kernel lets the user do this.
> >
> > It would be very interesting to add the appropriate checks to
> > pci_mmap_page_range() in arch/x86/pci/i386.c, anyone who wants to do
> > this can use the code in arch/sparc64/kernel/pci.c:
> > __pci_mmap_make_offset() as a guide, and see what happens.
>
> Absolutely. Or we can even do some dirty hackery in userspace, like
> LD_PRELOADing X server and checking mmaps() that are close to MMIO regions
> of affected devices.
>
> > If the MMIO space regions of the video cards sit right before the
> > E1000E ones on the effected systems, that would pretty much
> > convince me that this is the kind of problem we are having here.
>
> Unfortunately, looking at the lspci outputs that are in
> https://bugzilla.novell.com/show_bug.cgi?id=425480 it seems to me that the
> MMIO regions are quite far away from each other.

Moreover, we don't actually do any writing (that I know of) of the ROM image
from the X drivers or the kernel. In fact, in many cases X should be
accessing the RAM copy of the ROM at 0xc0000 rather than via the ROM BAR.

That said, adding a check to the x86 code would be a good thing to do; I'll
hack up a patch tomorrow unless someone beats me to it.

--
Jesse Barnes, Intel Open Source Technology Center

Dave Airlie

Sep 24, 2008, 8:30:09 PM

On Thu, Sep 25, 2008 at 9:15 AM, Jiri Kosina <jko...@suse.cz> wrote:
> On Tue, 23 Sep 2008, David Miller wrote:
>
>> I did some snooping around, and while doing so I noticed that the PCI
>> mmap code for x86 doesn't do one bit of range checking on the size, or
>> any other aspect of the request, wrt. the MMIO regions actually mapped
>> in the BARs of the PCI device.
>
> Ugh, indeed. Added Ingo and Jesse to CC.
>
>> Yikes!
>>
>> It just does a reserve_memtype() on the address range, and says "ok".
>>
>> So if, for example, the X server tries to mmap() more than an MMIO bar
>> actually maps, the kernel lets the user do this.
>>
>> It would be very interesting to add the appropriate checks to
>> pci_mmap_page_range() in arch/x86/pci/i386.c, anyone who wants to do
>> this can use the code in arch/sparc64/kernel/pci.c:
>> __pci_mmap_make_offset() as a guide, and see what happens.
>
> Absolutely. Or we can even do some dirty hackery in userspace, like
> LD_PRELOADing X server and checking mmaps() that are close to MMIO regions
> of affected devices.
>
>> If the MMIO space regions of the video cards sit right before the
>> E1000E ones on the effected systems, that would pretty much
>> convince me that this is the kind of problem we are having here.
>
> Unfortunately, looking at the lspci outputs that are in
> https://bugzilla.novell.com/show_bug.cgi?id=425480 it seems to me that the
> MMIO regions are quite far away from each other.
>

Yup, on my laptop these were far away and I wondered what could mangle
things that badly.

Well, I'm out of the race: my attempts to re-write my eeprom using an
eeprom from an equivalent laptop have totally failed and my BIOS won't
boot anymore - so my laptop is == a brick.

Dave.

Jiri Kosina

Sep 24, 2008, 8:40:07 PM

On Wed, 24 Sep 2008, Jesse Barnes wrote:

> That said, adding a check to the x86 code would be a good thing to do;
> I'll hack up a patch tomorrow unless someone beats me to it.

The problem here is that what we desperately need first is a method to
restore the original EEPROM contents after it gets corrupted (David Airlie
has, sadly, apparently bricked his notebook while trying to do so).
Without this, we can put a lot of debugging/protecting patches into the
kernel, but we won't be able to successfully verify anything, because
testing wouldn't be possible.

Added Jesse and Karsten to CC, as they are working on such a tool right
now, as far as I know.

--
Jiri Kosina
SUSE Labs

Chuck Ebbert

Sep 24, 2008, 8:50:08 PM

On Sun, 21 Sep 2008 20:54:23 +0200 (CEST)
"Rafael J. Wysocki" <r...@sisk.pl> wrote:

> This message has been generated automatically as a part of a report
> of recent regressions.
>
> The following bug entry is on the current list of known regressions
> from 2.6.26. Please verify if it still should be listed and let me know
> (either way).
>
>

> Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=11608
> Subject : 2.6.27-rc6 BUG: unable to handle kernel paging request
> Submitter : John Daiker <daike...@gmail.com>
> Date : 2008-09-16 23:00 (6 days old)
> References : http://marc.info/?l=linux-kernel&m=122160611517267&w=4
>
>

As I said in the bugzilla entry:

Oops: 000b

Bit 3 is set -- the processor detected 1's in reserved bits of the page directory.

That can't be good...
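
For anyone decoding along, here is a small sketch of how the x86 page-fault error code breaks down, using the bit layout from the Intel SDM (bit 0 present/protection, bit 1 write, bit 2 user mode, bit 3 reserved-bit violation, bit 4 instruction fetch). The struct and function names are hypothetical, chosen only for this example.

```c
#include <stdbool.h>
#include <stdint.h>

/* x86 page-fault error code bits. */
#define PF_PRESENT (1u << 0)  /* 0: page not present, 1: protection violation */
#define PF_WRITE   (1u << 1)  /* access was a write */
#define PF_USER    (1u << 2)  /* access came from user mode */
#define PF_RSVD    (1u << 3)  /* reserved bit set in a paging-structure entry */
#define PF_INSTR   (1u << 4)  /* fault on an instruction fetch */

struct pf_info {
    bool present, write, user, rsvd, instr;
};

/* Split a raw error code into its individual flags. */
struct pf_info decode_pf_error(uint32_t code)
{
    struct pf_info info = {
        .present = (code & PF_PRESENT) != 0,
        .write   = (code & PF_WRITE) != 0,
        .user    = (code & PF_USER) != 0,
        .rsvd    = (code & PF_RSVD) != 0,
        .instr   = (code & PF_INSTR) != 0,
    };
    return info;
}
```

For `0x000b` this yields present, write and rsvd set, with user and instr clear: a kernel-mode write to a mapped page whose paging entry has a reserved bit set.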

Jiri Kosina

Sep 24, 2008, 9:30:12 PM

On Thu, 25 Sep 2008, Dave Airlie wrote:

> Well I'm out of the race, my attempts to re-write my eeprom using an
> eeprom from an equivalent laptop have totally failed and my BIOS won't
> boot anymore - so my laptop is == a brick.

Uh oh. Shouldn't we put something like the patch below in Linus' tree
until we get this sorted out? Otherwise more and more people who use -rc
kernels will run into this, and will get their hardware [hopefully
temporarily, but not all users are able to re-flash their network card
EEPROMs, right] bricked.

I know that it is quite aggressive and is going to disable wired
networking on a lot of systems that have been functioning properly,
therefore RFC ...

From: Jiri Kosina <jko...@suse.cz>
Subject: [PATCH] [RFC] E1000E: temporarily disable e1000e driver

E1000E: temporarily disable e1000e driver

There is a serious bug somewhere, that renders e1000e network cards
unusable on certain hardware configurations by rewriting EEPROM with 0xff
all over. Debugging this is not trivial, because:

- it is not yet even clear whether the bug is caused by userspace (new
version of xorg drivers, bad interaction with PAT, ...) or some bug in
kernel code; it's even not yet certain at which exact combination of
software versions and hardware configuration this started to trigger
- you have only one attempt to test potential fix. If the fix doesn't
work, the eeprom of the card is hosed

and therefore fixing this has potential to take some time.

The tool that will safely restore the previous contents of EEPROM is
currently being written, but even this is not trivial (Dave Airlie has
turned his notebook into brick while trying to restore the EEPROM
contents).

Let's therefore mark this driver as broken (though it is very well
possible that this particular driver is not at fault at all) until this
gets resolved, so that users of -rc kernels don't get their network cards
totally unusable.

References (information about sw/hw configurations of affected systems
might be found in the bugzillas):

http://lkml.org/lkml/2008/8/8/123
http://lkml.org/lkml/2008/9/22/23

http://bugzilla.kernel.org/show_bug.cgi?id=11382

https://bugzilla.novell.com/show_bug.cgi?id=425480
https://bugzilla.redhat.com/show_bug.cgi?id=459202
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/263555
https://qa.mandriva.com/show_bug.cgi?id=44147

Signed-off-by: Jiri Kosina <jko...@suse.cz>

---

diff --git a/drivers/net/Kconfig b/drivers/net/Kconfig
index 4a11296..2d7a7f2 100644
--- a/drivers/net/Kconfig
+++ b/drivers/net/Kconfig
@@ -1938,7 +1938,7 @@ config E1000_DISABLE_PACKET_SPLIT

config E1000E
tristate "Intel(R) PRO/1000 PCI-Express Gigabit Ethernet support"
- depends on PCI && (!SPARC32 || BROKEN)
+ depends on PCI && BROKEN
---help---
This driver supports the PCI-Express Intel(R) PRO/1000 gigabit
ethernet family of adapters. For PCI or PCI-X e1000 adapters,

Frans Pop

Sep 24, 2008, 10:10:13 PM

> Uh oh. Shouldn't we put something like the patch below in Linus' tree
> unless we get this sorted out? Otherwise more and more people who
> use -rc kernels will run into this, and will get their hardware
> [hopefully temporarily, but not all users are able to re-flash their
> network card EEPROMs, right] bricked.

Something else to worry about is bisections. People seeing an unrelated
issue with .27 after release may well be asked to do a bisection and
could then run into the issue even if it is fixed before the release.

Guess we'll need to wait and see what the root cause is to know if that's
a real concern or not.

> - it is not yet even clear whether the bug is caused by userspace (new
> version of xorg drivers, bad interaction with PAT, ...) or some bug in
> kernel code; it's even not yet certain at which exact combination of
> software versions and hardware configuration this started to trigger

Extra datapoint. As far as I've seen this problem has not yet been
reported by any people running Debian. This could point to X.Org as
Debian currently has 7.3 while I think the reports so far have been with
7.4.

I have been running .27-rc kernels myself on a HP 2510p laptop running
Debian/lenny which does have the "bad" NIC (ICH9), but it's still working
for me. I do have some vague resume from suspend problems, but for now
I'm assuming those are unrelated.
I have been running the kernels both with and without PAT enabled.

Cheers,
FJP

Jeff Garzik

Sep 24, 2008, 10:30:14 PM

Jiri Kosina wrote:
> On Thu, 25 Sep 2008, Dave Airlie wrote:
>
>> Well I'm out of the race, my attempts to re-write my eeprom using an
>> eeprom from an equivalent laptop have totally failed and my BIOS won't
>> boot anymore - so my laptop is == a brick.
>
> Uh oh. Shouldn't we put something like the patch below in Linus' tree
> unless we get this sorted out? Otherwise more and more people who use -rc
> kernels will run into this, and will get their hardware [hopefully
> temporarily, but not all users are able to re-flash their network card
> EEPROMs, right] bricked.
>
> I know that it is quite aggressive and is going to disable wired
> networking on a lot of systems that have been functioning properly,
> therefore RFC ...
>
>
>
> From: Jiri Kosina <jko...@suse.cz>
> Subject: [PATCH] [RFC] E1000E: temporarily disable e1000e driver
>
> E1000E: temporarily disable e1000e driver

That seems a bit drastic, particularly when the debugging was beginning
to point to another culprit.

We have an equal case at this point to disable r8169 and i915_drm, no?

Jeff

Nick Piggin

Sep 24, 2008, 11:10:10 PM

On Wed, Sep 24, 2008 at 08:46:55PM -0400, Chuck Ebbert wrote:
> On Sun, 21 Sep 2008 20:54:23 +0200 (CEST)
> "Rafael J. Wysocki" <r...@sisk.pl> wrote:
>
> > This message has been generated automatically as a part of a report
> > of recent regressions.
> >
> > The following bug entry is on the current list of known regressions
> > from 2.6.26. Please verify if it still should be listed and let me know
> > (either way).
> >
> >
> > Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=11608
> > Subject : 2.6.27-rc6 BUG: unable to handle kernel paging request
> > Submitter : John Daiker <daike...@gmail.com>
> > Date : 2008-09-16 23:00 (6 days old)
> > References : http://marc.info/?l=linux-kernel&m=122160611517267&w=4
> >
> >
>
> As I said in the bugzilla entry:
>
> Oops: 000b
>
> Bit 3 is set -- the processor detected 1's in reserved bits of the page directory.
>
> That can't be good...

[54384.988151] BUG: unable to handle kernel paging request at ffff8800601dd000
[54384.992095] IP: [<ffffffff80375457>] clear_page_c+0x7/0x10
[54384.992095] PGD 202063 PUD 8067 PMD 65d54163 PTE 80002020601dd163
[54384.992095] Oops: 000b [1] SMP DEBUG_PAGEALLOC

I initially suspected PAT (maybe via DEBUG_PAGEALLOC)... but let's see if the
3rd line here is useful.

     xRRRRRRRRRRRRRRRRRRRRRRR|40b|<--MAXPHYS     PHYS-->|...RR.actuwp
PGD:                                         001000000010000001100011

     xRRRRRRRRRRRRRRRRRRRRRRR|40b|<--MAXPHYS     PHYS-->|...RR.actuwp
PUD:                                                 1000000001100111

     xRRRRRRRRRRRRRRRRRRRRRRR|40b|<--MAXPHYS     PHYS-->|...Rs.actuwp
PMD:                                 01100101110101010100000101100011

     xRRRRRRRRRRRRRRRRRRRRRRR|40b|<--MAXPHYS     PHYS-->|...gP.actuwp
PTE: 1000000000000000001000000010000001100000000111011101000101100011
     3210987654321098765432109876543210987654321098765432109876543210

Is this a 36-bit physical address CPU? In which case you have 2 bits in
the pte that are outside "maxphys". Or if it is a 40-bit CPU, then you
have just 1 bit outside maxphys, in which case I'd say it is memory
corruption (maybe a hardware bug, maybe a scribble from elsewhere). So
I'm wrong about PAT.
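
For anyone who wants to repeat the decoding, here is a minimal Python sketch of the two checks: which flags the error code 000b carries, and which PTE bits fall outside the physical address width. It is only an illustration under stated assumptions. The flag labels and helper names are made up for this sketch, and it treats bits from MAXPHYS up to bit 51 as the reserved address bits, which is narrower than the R range drawn in the diagram.

```python
# x86 page-fault error code bits; the short labels are this sketch's own.
PF_FLAGS = {
    0: "present",
    1: "write",
    2: "user",
    3: "reserved-bit",
    4: "instruction-fetch",
}


def decode_pf_error(code):
    """Return the labels of the flags set in a page-fault error code."""
    return [name for bit, name in sorted(PF_FLAGS.items()) if (code >> bit) & 1]


def bits_above_maxphys(entry, maxphys, top=51):
    """List the bit positions in [maxphys, top] that are set in a table entry.

    Assumption: physical address bits at or above maxphys, up to bit 51,
    must be zero; any set bit in that range triggers the reserved-bit fault.
    """
    return [bit for bit in range(maxphys, top + 1) if (entry >> bit) & 1]


PTE = 0x80002020601DD163  # value from the oops above

print(decode_pf_error(0x000B))
print(bits_above_maxphys(PTE, 36))  # if the CPU has 36 physical address bits
print(bits_above_maxphys(PTE, 40))  # if the CPU has 40 physical address bits
```

With a 36-bit MAXPHYS the sketch reports bits 37 and 45; with 40 bits only bit 45 remains, which matches the "2 bits" versus "1 bit" reasoning above.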

Interestingly, the PMD also has a 1 set in a reserved bit (page global),
but according to the Intel docs, the CPU doesn't check that bit, so it
is not faulting there.

Does the machine survive memtest? Is the bug reproducible? If the
answer is no to either of these, I think we can take it off the
regression list. Otherwise, is it possible to track down to a specific
commit?

Thanks,
Nick

Dave Airlie

Sep 25, 2008, 12:00:17 AM

On Thu, Sep 25, 2008 at 12:28 PM, Jeff Garzik <je...@garzik.org> wrote:
> Jiri Kosina wrote:
>>
>> On Thu, 25 Sep 2008, Dave Airlie wrote:
>>
>>> Well I'm out of the race, my attempts to re-write my eeprom using an
>>> eeprom from an equivalent laptop have totally failed and my BIOS won't boot
>>> anymore - so my laptop is == a brick.
>>
>> Uh oh. Shouldn't we put something like the patch below in Linus' tree
>> unless we get this sorted out? Otherwise more and more people who use -rc
>> kernels will run into this, and will get their hardware [hopefully
>> temporarily, but not all users are able to re-flash their network card
>> EEPROMs, right] bricked.
>>
>> I know that it is quite aggressive and is going to disable wired
>> networking on a lot of systems that have been functioning properly,
>> therefore RFC ...
>>
>>
>>
>> From: Jiri Kosina <jko...@suse.cz>
>> Subject: [PATCH] [RFC] E1000E: temporarily disable e1000e driver
>>
>> E1000E: temporarily disable e1000e driver
>
> That seems a bit drastic, particularly when the debugging was beginning to
> point to another culprit.
>
> We have an equal case at this point to disable r8169 and i915_drm, no?
>

No, we are actually more likely to be unable to do anything from the kernel
if it's happening from userspace.

First, we need a reflash utility that is safe; otherwise people who
have the issue can't reproduce it, and people who don't have the issue
don't want to play with it.

I think e1000e may enable a BAR or something that causes the issue to
break this hw. I haven't seen it broken on any machine where e1000e
wasn't loaded yet. Again, the r8169 might be the same issue, but it may
be because the BAR was enabled.

Dave.

David Miller

Sep 25, 2008, 12:10:06 AM

From: "Dave Airlie" <air...@gmail.com>
Date: Thu, 25 Sep 2008 13:51:23 +1000

> I think e1000e may enable a BAR or something that causes the issue to
> break this hw. I haven't seen it broken on any
> machine where e1000e wasn't loaded yet. Again, the r8169 might be the
> same issue, but it may be because the BAR was enabled.

All PCI device drivers in the kernel first do pci_enable_device()
which essentially enables all BARs.

The flash lives in BAR 1 of the E1000E, for example.

Jesse Brandeburg

Sep 25, 2008, 12:30:12 AM

Dave Airlie wrote:
>>> If the MMIO space regions of the video cards sit right before the
>>> E1000E ones on the effected systems, that would pretty much
>>> convince me that this is the kind of problem we are having here.
>> Unfortunately, looking at the lspci outputs that are in
>> https://bugzilla.novell.com/show_bug.cgi?id=425480 it seems to me that the
>> MMIO regions are quite far away from each other.

On my ICH9-based system the e1000e BAR1 regions are back to back with
both the VGA memory map and the audio memory, either of which could be the
mangler, but more likely the VGA device (X, maybe), since it is mapped
directly in front of the e1000e BAR1 space.
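
A quick way to check for this layout on other machines is to compare the per-device `resource` files under /sys/bus/pci/devices/. The Python sketch below assumes the usual three-column format (start, end, flags in hex, one line per resource); the helper names are invented for this sketch and the sample addresses are hypothetical, not taken from any of the affected systems.

```python
def parse_resource(text):
    """Return [(index, start, end)] for the populated lines of a sysfs
    'resource' file (three hex columns: start, end, flags)."""
    regions = []
    for index, line in enumerate(text.splitlines()):
        fields = line.split()
        if len(fields) != 3:
            continue
        start, end, _flags = (int(field, 16) for field in fields)
        if start == 0 and end == 0:
            continue  # unused resource slot
        regions.append((index, start, end))
    return regions


def back_to_back(first, second):
    """Pairs (i, j) where region i of `first` ends right before region j
    of `second` starts."""
    return [
        (a_index, b_index)
        for a_index, _a_start, a_end in first
        for b_index, b_start, _b_end in second
        if a_end + 1 == b_start
    ]


# Hypothetical contents of two devices' resource files.
VGA = """\
0x00000000fe400000 0x00000000fe7fffff 0x0000000000040200
0x0000000000000000 0x0000000000000000 0x0000000000000000
"""
NIC = """\
0x00000000fe9e0000 0x00000000fe9fffff 0x0000000000040200
0x00000000fe800000 0x00000000fe800fff 0x0000000000040200
"""

print(back_to_back(parse_resource(VGA), parse_resource(NIC)))
```

In the made-up sample, the first VGA region ends exactly where the NIC's second resource begins, so the pair (0, 1) is reported.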

> Yup on my laptop these were far away and I wondered what could mangle
> things that badly.
>
> Well I'm out of the race, my attempts to re-write my eeprom using an
> eeprom from an equivalent laptop
> have totally failed and my BIOS won't boot anymore - so my laptop is == a brick.

I'm really sorry to hear that. I wonder if the laptop has an "emergency
BIOS update" mode, like many PCs used to have through a jumper. Dave A., let
us know if you make any recovery progress.

I plan to try some random writes tomorrow to my BAR1 space and see if my
flash gets erased.

Jesse

Jiri Kosina

Sep 25, 2008, 8:30:17 AM

On Wed, 24 Sep 2008, Dave Airlie wrote:

> > My current suspicion in all of this is either the GEM kernel patches
> > or recent X server.
> I don't think OpenSUSE was shipping any of the GEM bits.

Actually, there is no way of not shipping GEM when shipping xorg 7.4, is
there?

So GEM could definitely be a potential cause here, I think.

--
Jiri Kosina
SUSE Labs

Jesse Barnes

Sep 25, 2008, 12:10:14 PM

On Wednesday, September 24, 2008 5:33 pm Jiri Kosina wrote:
> On Wed, 24 Sep 2008, Jesse Barnes wrote:
> > That said, adding a check to the x86 code would be a good thing to do;
> > I'll hack up a patch tomorrow unless someone beats me to it.
>
> The problem here is that what we desperately need first is a method to
> restore the original EEPROM contents after it gets corrupted (David Airlie
> has, sadly, apparently bricked his notebook while trying to do so).
> Without this, we can put a lot of debugging/protecting patches into the
> kernel, but we won't be able to successfully verify anything, because
> testing wouldn't be possible.
>
> Added Jesse and Karsten to CC, as they are working on such a tool right
> now, as far as I know.

I should be able to test the mmap fix independently of the e1000 breakage at
least... lemme try it out now...

--
Jesse Barnes, Intel Open Source Technology Center

Krzysztof Halasa

Sep 25, 2008, 12:30:23 PM

Jesse Brandeburg <jesse.br...@gmail.com> writes:

> I'm really sorry to hear that, I wonder if the laptop has an
> "emergency bios update" mode like many PCs used to through a jumper.
> Dave A., let us know if you make any recovery progress.

I guess it's more about the E1000's serial configuration EEPROM; the
registers seem to live in BAR0 (EECD and for reading perhaps EERD).
A corrupted EEPROM (and thus PCI config registers) can easily result in
a dead machine.

I will be writing a tool for writing 82541PI EEPROMs on a custom
board soon (unless there is one available, for Linux, of course),
I only have to fight non-working JTAG first :-)

> I plan to try some random writes tomorrow to my BAR1 space and see if
> my flash gets erased.

I'm not sure it's the flash that is corrupted. Anyway booting the
laptop should be quite easy (physically disabling the EEPROM on boot
should do the trick), though it would require taking the machine
apart.
--
Krzysztof Halasa

Jiri Kosina

Sep 25, 2008, 1:30:11 PM

On Thu, 25 Sep 2008, Frans Pop wrote:

> Extra datapoint. As far as I've seen this problem has not yet been
> reported by any people running Debian. This could point to X.Org as
> Debian currently has 7.3 while I think the reports so far have been with
> 7.4.

Yes, I think that xorg/xorg i915 driver/libdrm/GEM/whatever are the
biggest suspect currently, according to the data that has been gathered so
far.

Still, what confuses me a little bit -- the EEPROM of the card is set to
all 0xff, once the corruption happens. Isn't that quite a coincidence,
that bytes representing "nothing" in this context are used?

If it were set to 0 (it's so easy to call memset(0) on a bogus pointer,
there are usually lots of them in the code) or to random garbage, that
would seem much more understandable than 0xff.
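
For anyone collecting dumps, here is a small Python sketch that tells a wiped image from a sane one. It works on a raw binary dump (for instance from "ethtool -e ethX raw on"). The checksum rule is an assumption stated from memory of the e1000 family drivers: the little-endian 16-bit words 0x00 through 0x3F should sum to 0xBABA. The function names are invented for this sketch.

```python
import struct

NVM_SUM = 0xBABA       # assumed expected sum, as in the e1000 family drivers
CHECKSUM_WORDS = 0x40  # words 0x00..0x3F inclusive


def looks_erased(dump):
    """True if every byte of the dump is 0xff."""
    return len(dump) > 0 and all(byte == 0xFF for byte in dump)


def checksum_ok(dump):
    """True if the first 0x40 little-endian words sum to NVM_SUM (mod 2^16)."""
    if len(dump) < 2 * CHECKSUM_WORDS:
        return False
    words = struct.unpack_from("<%dH" % CHECKSUM_WORDS, dump)
    return (sum(words) & 0xFFFF) == NVM_SUM


# Synthetic images for illustration only (not real NIC contents).
good = struct.pack("<64H", *([0] * 63 + [NVM_SUM]))
wiped = b"\xff" * 128

print(looks_erased(good), checksum_ok(good))
print(looks_erased(wiped), checksum_ok(wiped))
```

An all-0xff image fails the checksum as well, so either test flags the corruption described here; the erased check simply says more about what kind of damage it is.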

--
Jiri Kosina
SUSE Labs

Jiri Kosina

Sep 25, 2008, 2:10:14 PM

On Wed, 24 Sep 2008, Jonathan Corbet wrote:

> A data point, just in case it helps... I've not had time to update my
> desktop system, so this all-Intel, ICH9, e1000e-based box has been stuck
> at 2.6.27-rc3. It has rawhide as of shortly after the floodgates
> reopened (but with my own kernel); that means X server 1.5.0 and i810
> 2.4.2-3.

If any of you has a Lenovo ThinkPad (ideally a T60p) with an 8086:104b
revision 3 card, could you please send me the respective "ethtool -e"
dump?

Thanks,

H. Peter Anvin

Sep 25, 2008, 2:50:14 PM

Jiri Kosina wrote:
>
> Yes, I think that xorg/xorg i915 driver/libdrm/GEM/whatever are the
> biggest suspect currently, according to the data that has been gathered so
> far.
>
> Still, what confuses me a little bit -- the EEPROM of the card is set to
> all 0xff, once the corruption happens. Isn't that quite a coincidence,
> that bytes representing "nothing" in this context are used?
>

Typical card EEPROMs are serial - either I2C or SPI. I believe the
Intel cards use SPI EEPROMs, but I'm not sure.

[Disclaimer: I don't actually know SPI all that well; I know I2C better.
However, I'm pretty sure the following argument does apply to both.]

Consider a corruption which turns a read command into a write command --
often just a single bit difference. Now, the EEPROM will expect data in
to write, but nothing will be driving the data line, so it will
typically be a 1. As the host tries to read, it will therefore fill the
EEPROM with all ones.
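
To make the argument concrete, here is a toy Python model of it. It is a sketch under loud assumptions, not a model of the actual Intel part: it borrows the READ (0x03) and WRITE (0x02) opcodes of common 25xx-series SPI EEPROMs, which differ in one bit, it ignores the write-enable latch and page boundaries, and it assumes an undriven data line samples as all ones. The class and method names are invented.

```python
READ = 0x03   # 25xx-series SPI EEPROM opcode (assumed device type)
WRITE = 0x02  # differs from READ only in bit 0


class ToyEeprom:
    """Very small serial EEPROM model: just an opcode, an address and data."""

    def __init__(self, contents):
        self.mem = bytearray(contents)

    def transfer(self, opcode, addr, count, data_in=None):
        """Run one transaction and return what the host samples."""
        count = min(count, len(self.mem) - addr)
        if opcode == READ:
            return bytes(self.mem[addr:addr + count])
        if opcode == WRITE:
            # The host believes it is reading, so nothing drives the data
            # line; assumption: the floating line is sampled as ones.
            if data_in is None:
                data_in = b"\xff" * count
            self.mem[addr:addr + count] = data_in[:count]
            return b"\xff" * count
        return b""


eeprom = ToyEeprom(bytes(range(8)))
print(eeprom.transfer(READ, 0, 4))  # normal read returns the stored bytes

corrupted = READ ^ 0x01             # single bit flip in the command
eeprom.transfer(corrupted, 0, 8)    # host "reads" the whole part
print(bytes(eeprom.mem))            # contents are now all 0xff
```

The point of the toy is only that no code ever has to write 0xff deliberately; a flipped command bit plus an idle-high line is enough.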

-hpa

H. Peter Anvin

Sep 25, 2008, 2:50:35 PM

Jiri Kosina wrote:
> On Wed, 24 Sep 2008, Kyle McMartin wrote:
>
>> I've been working on a patch to detect (using a timer and checking at
>> up/down) whether or not the flash has been corrupted, and, if it is
>> rewrite it with the saved good copy (which obviously only helps if
>> it's the same boot.)
>
> Thanks, looks interesting e1000e hack that might possibly be of some help.
>
> BUT! please have a look at
>
> http://lkml.org/lkml/2008/9/24/133
>
> Looks like this device got a lot of 0xff written somewhere in its config
> space, right? But it isn't Intel card at all.
>

That looks like the device disappeared completely.

-hpa

Jesse Barnes

Sep 25, 2008, 3:00:33 PM

On Thursday, September 25, 2008 10:24 am Jiri Kosina wrote:
> On Thu, 25 Sep 2008, Frans Pop wrote:
> > Extra datapoint. As far as I've seen this problem has not yet been
> > reported by any people running Debian. This could point to X.Org as
> > Debian currently has 7.3 while I think the reports so far have been with
> > 7.4.
>
> Yes, I think that xorg/xorg i915 driver/libdrm/GEM/whatever are the
> biggest suspect currently, according to the data that has been gathered so
> far.

We have confirmation that this isn't GEM related; according to the Novell bug
at https://bugzilla.novell.com/show_bug.cgi?id=425480 people have hit the
problem with kernels w/o GEM.

That doesn't rule out i915 (though I don't think any changes have gone in
since 2.6.26 that would have caused this) or xf86-video-intel. It's possible
that X is getting confused about BAR mappings somehow, resulting in a
clobbered e1000e NVRAM, but why would the kernel version matter in that case?
The only thing that comes to mind would be PAT...

Recent versions of the X drivers (using recent libpciaccess code) will try to
map the resourceN_wc file in sysfs. It's possible that the map size we end
up using is wrong, leading to the situation Dave described earlier where we
map too much MMIO space.
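
As a rough illustration of what a wrong map size would mean, the Python sketch below computes which physical addresses a mapping would cover beyond the end of a BAR once the length is rounded up to whole pages. The function name, the 4 KiB page size and the example addresses are assumptions for illustration; this is not how libpciaccess or the kernel computes anything.

```python
PAGE_SIZE = 4096  # assumed page size


def overrun_range(bar_start, bar_len, map_len, page=PAGE_SIZE):
    """Return (first, last) physical address mapped past the end of the BAR,
    or None if the page-rounded mapping fits inside it."""
    mapped = -(-map_len // page) * page  # round up to whole pages
    if mapped <= bar_len:
        return None
    return (bar_start + bar_len, bar_start + mapped - 1)


# Hypothetical 4 MiB video BAR at 0xfe400000.
print(overrun_range(0xFE400000, 0x400000, 0x400000))  # exact fit
spill = overrun_range(0xFE400000, 0x400000, 0x400001)  # one byte too many
print([hex(address) for address in spill])
```

If another device's BAR sits directly behind the video BAR, even a one-byte error in the length exposes a full page of that device's MMIO space to whoever holds the mapping.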

> Still, what confuses me a little bit -- the EEPROM of the card is set to
> all 0xff, once the corruption happens. Isn't that quite a coincidence,
> that bytes representing "nothing" in this context are used?

Presumably one has to write all ones to the EEPROM BAR of the e1000 device to
see that pattern? Or is there some way of configuring the EEPROM such that
it'll fail to respond to read cycles resulting in all ones for every read
back (i.e. target abort)?

Jesse