Linux 2.6.34-rc3

Linus Torvalds

unread,

Mar 30, 2010, 2:00:03 PM3/30/10

to

Ok, so -rc2 was messy, no question about it. I'm too much of a softie to
hold back some peoples work, so my hard-line -rc1 didn't work out the way
I wanted. But _next_ time! For sure this time.

Anyway, from a messy -rc2 we now have a -rc3 that should be in much better
shape. Regressions fixed, and the ShortLog is short enough to be worth
posting to lkml (-rc1 never is, and -rc2 seldom is. It's not like -rc2's
are generally wondeful, this time around wasn't _that_ much different).

One regression fix that is worth pointing out is the EXT3_STATE_NEW
handling in ext3, because the regression that one fixed was potentially
quite nasty. It wouldn't cause data corruption, but it _could_ cause
at least corrupt security labels.

So if you have SELinux enabled (either in permissive or enforcing mode)
_and_ you ran 2.6.32-rc[12] _and_ your filesystem is ext3, you should not
just update, you should make sure your extended attributes are fixed. The
easiest way to fix them is likely to just check the "Relabel on next boot"
checkmark in the SELinux config GUI ("system-config-selinux" if you don't
do that whole admin menu thing), and reboot into 2.6.34-rc3.

[ And you can use something like 'restorecon -rv /' or whatever after
booting into the fixed kernel. See your nearest SELinux manual for real
details. ]

You might even want to do the whole "touch /forcefsck" before rebooting to
make sure fsck runs (I don't think it matters, but it won't hurt - the
relabeling will be so slow that whatever time your fsck takes is totally
irrelevant, so do them both and get it over with).

Of course, I suspect most people who run experimental kernels are also the
kinds of people who have turned off SELinux in annoyance long ago, or tend
to be the kinds of people who long since upgraded to ext4 (which didn't
have the problem), but hey, what do I know.

In short - if you started seeing odd security messages after running early
2.6.34 -rc kernels, now you know what was going on.

Other than that? Random fixes and updates all over. Mostly drivers and
filesystems, and mostly fairly small things. If you had PCI resource
conflict problems with the early -rc's due to the _CRS window thing, for
example, that should hopefully be fixed. See the appended shortlog for
other details.

Linus

---
Abraham Arce (1):
KS8851: Avoid NULL pointer in set rx mode

Adel Gadllah (1):
iwlwifi: Silence tfds_in_queue message

Adrian Hunter (1):
mmc: fix incorrect interpretation of card type bits

Al Viro (1):
Restore LOOKUP_DIRECTORY hint handling in final lookup on open()

Alexander Duyck (3):
igb: only use vlan_gro_receive if vlans are registered
skbuff: remove unused dma_head & dma_maps fields
igb: use correct bits to identify if managability is enabled

Alexandra Kossovsky (1):
tcp: Fix OOB POLLIN avoidance.

Amerigo Wang (1):
netpoll: warn when there are spaces in parameters

Ameya Palande (1):
regulator: Get rid of lockdep warning

Amit Kumar Salecha (4):
netxen: fix bios version calculation
netxen: fix warning in ioaddr for NX3031 chip
netxen: added sanity check for pci map
netxen: update version to 4.0.73

Amit Shah (2):
virtio: console: Generate a kobject CHANGE event on adding 'name' attribute
virtio: console: Check if port is valid in resize_console

Andreas Bombe (1):
sh64: Remove long unused mid_sched macro

Andreas Herrmann (1):
x86, amd: Restrict usage of c1e_idle()

Andrei Emeltchenko (1):
Bluetooth: Fix kernel crash on L2CAP stress tests

Andrew Morton (2):
timer stats: Fix del_timer_sync() and try_to_del_timer_sync()
kernel/sched.c: Suppress unused var warning

Andy Gospodarek (1):
bonding: fix broken multicast with round-robin mode

Anton Blanchard (1):
ppc64 sys_ipc breakage in 2.6.34-rc2

Arnaldo Carvalho de Melo (2):
perf top: Improve the autosizing of column lenghts
perf top: Add missing initialization to zero

Axel Lin (2):
lp3971: Fix setting val for LDO2 and LDO4
lp3971: Fix BUCK_VOL_CHANGE_SHIFT logic

Ben Blum (1):
cgroups: net_cls as module

Ben Menchaca (1):
gianfar: fix undo of reserve()

Benjamin Li (1):
bnx2: Fix netpoll crash.

Bjorn Helgaas (11):
resources: add interfaces that return conflict information
PCI: for address space collisions, show conflicting resource
PCI: break out primary/secondary/subordinate for readability
PCI: make disabled window printk style match the enabled ones
PCI: print resources consistently with %pR
PCI: complain about devices that seem to be broken
PCI: don't say we claimed a resource if we failed
x86/PCI: remove redundant warnings
frv/PCI: remove redundant warnings
x86/PCI: for host bridge address space collisions, show conflicting resource
x86/PCI: truncate _CRS windows with _LEN > _MAX - _MIN + 1

Borislav Petkov (2):
edac, mce: Filter out invalid values
fs/binfmt_aout.c: fix pointer warnings

Brandon L Black (1):
net: Add MSG_WAITFORONE flag to recvmmsg

Carolyn Wyborny (1):
igb: Add support for 82576 ET2 Quad Port Server Adapter

Cheng Renquan (1):
ceph: some documentations fixes

Chris Leech (1):
ixgbe: filter FIP frames into the FCoE offload queues

Chris Wilson (1):
drm/i915: Avoid NULL deref in get_pages() unwind after error.

Christian Borntraeger (1):
[S390] system.h: Fix compile error for 1 and 2 byte cmpxchg

Christian Lamparter (2):
[ARM] Kirkwood: WPS button keycode mapping
[ARM] Orion5x: replace KEY_WLAN with KEY_WPS_BUTTON

Clemens Ladisch (4):
firewire: core: fw_iso_resource_manage: fix error handling
firewire: ohci: add cycle timer quirk for the TI TSB12LV22
ALSA: cmipci: work around invalid PCM pointer
PCI quirk: RS780/RS880: work around missing MSI initialization

Colin Ian King (1):
softlockup: Stop spurious softlockup messages due to overflow

Crane Cai (1):
i2c-scmi: Support IBM SMBus CMI devices

Daisuke Nishimura (1):
memcg: disable move charge in no mmu case

Dan Carpenter (11):
drm/i915: fix small leak on overlay error path
sunrpc: handle allocation errors from __rpc_lookup_create()
pxa168fb: fix incorrect resource calculation
AFS: Potential null dereference
regulator: handle kcalloc() failure
ceph: handle kmalloc() failure
af_key: return error if pfkey_xfrm_policy2msg_prep() fails
memcontrol: fix potential null deref
kcore: fix test for end of list
fscache: add missing unlock
hwmon: (w83793) Saving negative errors in unsigned

Daniel Chen (1):
ALSA: ac97: Add Toshiba P500 to ac97 jack sense blacklist

Daniel Mack (2):
ASoC: pxa-pcm-lib: initialize DMA channel to -1
[ARM] pxa/raumfeld: fix button name

Daniel T Chen (3):
ALSA: hda: Fix 0 dB offset for HP laptops using CX20551 (Waikiki)
ALSA: ac97: Add IBM ThinkPad R40e to Headphone/Line Jack Sense blacklist
ALSA: hda: Use LPIB for ga-ma770-ud3 board

Daniel Taylor (1):
fs/partitions/msdos: add support for large disks

Daniel Vetter (1):
drm/intel: fix up set_tiling for untiled->tiled transition

Darrick J. Wong (2):
acpi: Support IBM SMBus CMI devices
i2c-scmi: Provide module aliases for automatic loading

Dave Airlie (1):
slow-work: use get_ref wrapper instead of directly calling get_ref

David Howells (9):
nommu: fix an incorrect comment in the do_mmap_shared_file()
Document Linux's circular buffering capabilities
FDPIC: For-loop in elf_core_vma_data_size() is incorrect
do_sync_read/write() should set kiocb.ki_nbytes to be consistent
NOMMU: Revert 'nommu: get_user_pages(): pin last page on non-page-aligned start'
NOMMU: Fix __get_user_pages() to pin last page on offset buffers
SLOW_WORK: CONFIG_SLOW_WORK_PROC should be CONFIG_SLOW_WORK_DEBUG
frv/chris: fix lines with a missing semicolons
KEYS: Add MAINTAINERS record

David Härdeman (1):
kfifo: fix KFIFO_INIT in include/linux/kfifo.h

David S. Miller (7):
via-velocity: Fix FLOW_CNTL_TX_RX handling in set_mii_flow_control()
isdn: Add netdev to lists in MAINTAINERS entry.
Revert "r8169: enable 64-bit DMA by default for PCI Express devices (v2)"
Revert "via82cxxx: workaround h/w bugs"
tulip: Add missing parens.
Revert "ide: skip probe if there are no devices on the port (v2)"
sparc64: Properly truncate pt_regs framepointer in perf callback.

Dean Nelson (4):
PCI: fix return value from pcix_get_max_mmrbc()
PCI: fix access of PCI_X_CMD by pcix get and set mmrbc functions
PCI: cleanup error return for pcix get and set mmrbc functions
hwmon: (coretemp) Add missing newline to dev_warn() message

Derek Kelly (1):
ALSA: hda - Add support of Nvidia GT220 HDMI

Dmitry Torokhov (1):
Regulators: max8925-regulator - clean up driver data after removal

Dominik Brodowski (4):
pcmcia: do not use ioports < 0x100 on x86
pcmcia: allow for four multifunction subdevices (again)
power: support _noirq actions on device types and classes
pcmcia: use dev_pm_ops for class pcmcia_socket_class

Emil Tantilov (4):
igb: do not modify tx_queue_len on link speed change
igbvf: do not modify tx_queue_len on link speed change
e1000e: do not modify tx_queue_len on link speed change
e1000: do not modify tx_queue_len on link speed change

Eric Anholt (6):
drm/i915: Don't bother with the BKL for GEM ioctls.
drm/i915: Enable VS timer dispatch.
agp/intel: Respect the GTT size on Sandybridge for scratch page setup.
agp/intel: Don't do the chipset flush on Sandybridge.
drm/i915: Set up the documented clock gating on Sandybridge and Ironlake.
drm/i915: Stop trying to use ACPI lid status to determine LVDS connection.

Eric Dumazet (3):
net: Potential null skb->dev dereference
netfilter: xt_hashlimit: dl_seq_stop() fix
netfilter: xt_hashlimit: IPV6 bugfix

Eric Miao (3):
[ARM] mmp: fix for variables in uncompress.h being discarded
[ARM] pxa: remove unnecessary 'select FB_W100' from some platforms
[ARM] pxa/sharpsl: add dependency of max1111 driver to sharpsl_pm

Eric Sandeen (1):
ext4: Fixed inode allocator to correctly track a flex_bg's used_dirs

Eric W. Biederman (1):
netxen: The driver doesn't work on NX_P3_B1 so cause probe to fail.

FUJITA Tomonori (1):
Documentation: rename PCI/PCI-DMA-mapping.txt to DMA-API-HOWTO.txt

Felix Fietkau (1):
ath9k: fix BUG_ON triggered by PAE frames

Francois Romieu (1):
r8169: fix broken register writes

Grazvydas Ignotas (1):
wl1251: fix potential crash

Greg Rose (7):
ixgbevf: Fix VF Stats accounting after reset
ixgbevf: Shorten up delay timer for watchdog task
ixgbevf: Message formatting cleanups
ixgbevf: Fix signed/unsigned int error
ixgbe: In SR-IOV mode insert delay before bring the adapter up
ixgbe: Change where clear_to_send_flag is reset to zero.
ixgbe: Do not run all Diagnostic offline tests when VFs are active

Greg Thelen (1):
memcg: fix typo in memcg documentation

Guennadi Liakhovetski (3):
ASoC: SIU driver shall select FW_LOADER
SH: fix SCIFA SCASCR register bit definitions
SH: remove superfluous warning from the serial driver

Guenter Roeck (1):
ipv4: Don't drop redirected route cache entry unless PTMU actually expired

Guo-Fu Tseng (3):
jme: Fix VLAN memory leak
jme: Protect vlgrp structure by pause RX actions.
jme: Advance driver version number

H Hartley Sweeten (2):
[ARM] locomo: fix SPI register offset
[ARM] locomo: fix unpaired spin_lock_irqsave

Hans-Joachim Picht (1):
[S390] fix broken proc interface for sclp_async

Heiko Carstens (2):
[S390] smp: fix lowcore allocation
[S390] sclp: avoid 64 bit division

Henrik Kretzschmar (5):
genirq: Move two IRQ functions from .init.text to .text
isdn: Cleanup Sections in PCMCIA driver sedlbauer
isdn: Cleanup Sections in PCMCIA driver teles
isdn: Cleanup Sections in PCMCIA driver avma1
isdn: Cleanup Sections in PCMCIA driver elsa

Herbert Xu (1):
ipv6: Remove redundant dst NULL check in ip6_dst_check

Huang Weiyi (1):
[ARM] pxa/raumfeld: remove duplicated #include

Ian Campbell (1):
x86: Do not free zero sized per cpu areas

Jan Beulich (1):
x86: Fix placement of FIX_OHCI1394_BASE

Jan Kara (2):
ext4: Fix estimate of # of blocks needed to write indirect-mapped files
ext4: Don't use delayed allocation by default when used instead of ext3

Jani Nikula (1):
c2port: fix device_create() return value check

Jarkko Nikula (1):
ALSA: pcm_lib - fix xrun functionality

Jaswinder Singh Rajput (1):
hwmon: (asc7621) Add X58 entry in Kconfig

Jeff Dike (1):
vhost: fix error path in vhost_net_set_backend

Jeff Layton (1):
NFS: don't try to decode GETATTR if DELEGRETURN returned error

Jeff Mahoney (2):
reiserfs: fix oops while creating privroot with selinux enabled
reiserfs: properly honor read-only devices

Jens Rottmann (1):
ksz884x: fix return value of netdev_set_eeprom

Jiri Kosina (1):
x86: Remove excessive early_res debug output

Joe Perches (3):
drivers/gpu/drm/i915/intel_bios.c: fix continuation line formats
MAINTAINERS: use tab not spaces for delimiter
drivers/net: Fix continuation lines

Joern Engel (12):
Open segment file before using it
Limit max_pages for insane devices
Plug memory leak in writeseg_end_io
Prevent schedule while atomic in __logfs_readdir
Write out both superblocks on mismatch
Fix logfs_get_sb_final error path
Use deactivate_locked_super
Prevent data corruption in logfs_rewrite_block()
Simplify and fix pad_wbuf
[LogFS] Clear PagePrivate when moving journal
[LogFS] Move reserved segments with journal
[LogFS] Erase new journal segments

John Fastabend (1):
ixgbe: cleanup maximum number of tx queues

John Stultz (1):
time: Fix accumulation bug triggered by long delay.

Jon Maloy (1):
TIPC: Removed inactive maintainer

Jonathan Cameron (2):
[ARM] pxa: fix for variables in uncompress.h being discarded
[ARM] pxa: remove spi cs gpio direction to avoid clash with driver

Josep...@via.com.tw (2):
pata_via: Add VIA VX900 support
pata_via: fix VT6410/6415/6330 detection issue

Jozsef Kadlecsik (1):
netfilter: ip6table_raw: fix table priority

Julia Lawall (2):
sound/oss/vidc.c: change the field used with DMA_ACTIVE
arch/sparc/kernel: Use set_cpus_allowed_ptr

KOSAKI Motohiro (6):
sched: sched_getaffinity(): Allow less than NR_CPUS length
sched: Use proper type in sched_getaffinity()
tmpfs: mpol=bind:0 don't cause mount error.
tmpfs: handle MPOL_LOCAL mount option properly
tmpfs: cleanup mpol_parse_str()
doc: add the documentation for mpol=local

Ken Kawasaki (1):
pcnet_cs: add new id

Komuro (1):
pd6729: Coding Style fixes

Kunal Gangakhedkar (1):
ALSA: hda - Add PCI quirk for HP dv6-1110ax.

Kuninori Morimoto (3):
sh: mach-ecovec24: Add i2c_put_adapter on sh_eth_init
sh: ms7724: Add tiny-document for sound
sh: Add watch-dog register address for SH7722/SH7723/SH7724

Kyle McMartin (1):
tulip: Fix null dereference in uli526x_rx_packet()

Lai Jiangshan (2):
rcu: Fix tracepoints & lockdep false positive
rcu: Fix local_irq_disable() CONFIG_PROVE_RCU=y false positives

Lee Schermerhorn (1):
mempolicy: fix get_mempolicy() for relative and static nodes

Lennart Schulte (1):
tcp: Fix tcp_mark_head_lost() with packets == 0

Li Zefan (1):
cgroups: remove duplicate include

Linus Torvalds (3):
Fix up prototype for sys_ipc breakage
ext3: fix broken handling of EXT3_STATE_NEW
Linux 2.6.34-rc3

Magnus Damm (1):
serial: sh-sci: fix SH-Mobile SH breakage

Mallikarjuna R Chilakala (3):
ixgbe: Fix 82599 multispeed fiber link issues due to Tx laser flapping
ixgbe: Fix 82599 KX4 Wake on LAN issue after an improper system shutdown
ixgbe: Set IXGBE_RSC_CB(skb)->DMA field to zero after unmapping the address

Marcel Holtmann (2):
Bluetooth: Fix potential bad memory access with sysfs files
Bluetooth: Convert debug files to actually use debugfs instead of sysfs

Mark Brown (2):
ASoC: Bail out of wm_hubs DC servo if calibration fails
ASoC: Remove BROKEN from i.MX audio after dependencies merged

Mark Fasheh (3):
ocfs2: set i_mode on disk during acl operations
ocfs2: Always try for maximum bits with new local alloc windows
ocfs2: Clear undo bits when local alloc is freed

Martin Schwidefsky (1):
[S390] fix boot failures with compressed kernels

Masami Hiramatsu (4):
perf probe: Fix probe_point buffer overrun
perf probe: Fix need_dwarf flag if lazy matching is used
perf probe: Fix offset to allow signed value
perf probe: Use original address instead of CU-based address

Mathieu Desnoyers (1):
CRED: Fix memory leak in error handling

Matt Fleming (3):
sh: Flush ITLB too in PTEAEX's flush_tlb_page()
sh: Replace unsafe manipulation of MMUCR
sh: Fix build after dynamic PMB rework

Matthew Wilcox (1):
PCI quirk: Disable MSI on VIA K8T890 systems

Miao Xie (2):
cpuset: fix the problem that cpuset_mem_spread_node() returns an offline node
cpuset: alloc nodemask_t on the heap rather than the stack

Michael Chan (1):
bnx2: Use proper handler during netpoll.

Michael Grzeschik (1):
lxfb: set the H- and V-SYNC polarity of the flatpanel output

Michael Holzheu (1):
[S390] zcore: CPU registers are not saved under LPAR

Michael S. Tsirkin (3):
vhost: fix interrupt mitigation with raw sockets
vhost: fix error handling in vring ioctls
exit: fix oops in sync_mm_rss

Mike Frysinger (2):
can: bfin_can: switch to common Blackfin can header
blackfin: enable DEBUG_SECTION_MISMATCH

Mitch Williams (1):
igb: count Rx FIFO errors correctly

Neil Horman (1):
r8169: offical fix for CVE-2009-4537 (overlength frame DMAs)

Nick Bowler (1):
Staging: et131x: Properly disable FC in txmac.

Nicolas Dichtel (1):
net: ipmr/ip6mr: prevent out-of-bounds vif_table access

OGAWA Hirofumi (1):
fs/partition/msdos: fix unusable extended partition for > 512B sector

Owain G. Ainsworth (1):
drm/i915: remove an unnecessary wait_request()

Pablo Neira Ayuso (3):
netlink: fix unaligned access in nla_get_be64()
netlink: fix NETLINK_RECV_NO_ENOBUFS in netlink_set_err()
netfilter: ctnetlink: fix reliable event delivery if message building fails

Patrick McHardy (3):
net: ipmr/ip6mr: fix potential out-of-bounds vif_table access
netfilter: xt_recent: fix regression in rules using a zero hit_count
net: fix netlink address dumping in IPv4/IPv6

Paul E. McKenney (2):
rcu: Make rcu_read_lock_bh_held() allow for disabled BH
net: suppress lockdep-RCU false positive in FIB trie.

Paul Mackerras (1):
powerpc/perf_events: Fix call-graph recording, add perf_arch_fetch_caller_regs

Paul Mundt (3):
PCI: kill off pci_register_set_vga_state() symbol export.
sh: Tidy up a couple of section mismatches.
sh: Silence unintialized variable warnings in dwarf unwinder.

Paulius Zaleckas (1):
if_tunnel.h: add missing ams/byteorder.h include

Pavel Emelyanov (2):
ipv4: Cleanup struct net dereference in rt_intern_hash
ipv4: Restart rt_intern_hash after emergency rebuild (v2)

Peter Ujfalusi (2):
ASoC: tlv320dac33: Fix DSP modes
ASoC: tlv320dac33: Internal clocking changes

Prarit Bhargava (1):
hwmon: (coretemp) Fix cpu model output

Priit Laes (1):
drm/i915: Rename FBC_C3_IDLE to FBC_CTL_C3_IDLE to match other registers

Rafael J. Wysocki (1):
x86 / perf: Fix suspend to RAM on HP nx6325

Randy Dunlap (2):
scripts/kernel-doc: handle struct member __aligned
scripts/kernel-doc: fix fatal error on function prototype

Ravikiran G Thirumalai (1):
tmpfs: fix oops on mounts with mpol=default

Richard Röjfors (1):
drivers/gpio/max730x.c: add license macro

Rob Landley (1):
sparc: Fix use of uid16_t and gid16_t in asm/stat.h

Robert Love (2):
ixgbe: Don't allow user buffer count to exceed 256
ixgbe: Priority tag FIP frames

Robin Holt (1):
mm/ksm.c is doing an unneeded _notify in write_protect_page.

Russell King (3):
ARM: Fix IXP23xx build error in mach/memory.h
ARM: Update mach-types
Documentation/volatile-considered-harmful.txt: correct cpu_relax() documentation

Ryusuke Konishi (3):
nilfs2: fix duplicate call to nilfs_segctor_cancel_freev
nilfs2: fix hang-up of cleaner after log writer returned with error
nilfs2: fix imperfect completion wait in nilfs_wait_on_logs

Sachin Prabhu (1):
Skip check for mandatory locks when unlocking

Sage Weil (26):
ceph: implemented caps should always be superset of issued caps
ceph: add missing locking to protect i_snap_realm_item during split
ceph: fix inode removal from snap realm when racing with migration
ceph: fix authenticator timeout
ceph: fix authenticator buffer size calculation
ceph: release old ticket_blob buffer
ceph: clean up service ticket decoding
ceph: fix null pointer deref of r_osd in debug output
ceph: drop unnecessary WARN_ON in caps migration
ceph: fix session locking in handle_caps, ceph_check_caps
ceph: clean up handle_cap_grant, handle_caps wrt session mutex
ceph: only release unused caps with mds requests
ceph: fix mds sync() race with completing requests
ceph: fix pg pool decoding from incremental osdmap update
ceph: prevent dup stale messages to console for restarting mds
ceph: fix connection fault con_work reentrancy problem
ceph: rename r_sent_stamp r_stamp
ceph: avoid reopening osd connections when address hasn't changed
ceph: fix snap rebuild condition
ceph: make write_begin wait propagate ERESTARTSYS
ceph: propagate mds session allocation failures to caller
ceph: fix session check on mds reply
ceph: fix possible double-free of mds request reference
ceph: avoid loaded term 'OSD' in documention
ceph: fix use after free on mds __unregister_request
ceph: update discussion list address in MAINTAINERS

Srinivas Eeda (1):
ocfs2: Fix a race in o2dlm lockres mastery

Stanislaw Gruszka (1):
posix-cpu-timers: Reset expire cache when no timer is running

Stefan Haberland (1):
[S390] dasd: check tsb validity

Stefan Richter (2):
firewire: core: fix Model_ID in modalias
firewire: core: align driver match with modalias

Stefan Weinhuber (1):
[S390] dasd: fix alignment of transport mode recovery TCW

Steve Glendinning (1):
smsc95xx: Fix tx checksum offload for small packets

Steven J. Magnani (1):
NET_DMA: free skbs periodically

Steven Rostedt (1):
ring-buffer: Do 8 byte alignment for 64 bit that can not handle 4 byte align

Suresh Siddha (1):
x86: Handle legacy PIC interrupts on all the cpu's

Takashi Iwai (3):
ALSA: hda - Sort codec entry list of Nvidia HDMI
ALSA: hda - Fix access-after-free in patch_realtek.c
ALSA: hda - Don't set invalid connection index in Realtek initialiaiton

Tao Ma (4):
ocfs2: Change bg_chain check for ocfs2_validate_gd_parent.
ocfs2: Update i_blocks in reflink operations.
ocfs2: Fix the update of name_offset when removing xattrs
ocfs2: Init meta_ac properly in ocfs2_create_empty_xattr_block.

Tejun Heo (1):
libata-sff: fix spurious IRQ handling

Tetsuo Handa (2):
rxrpc: Check allocation failure.
rxrpc: Check allocation failure.

Theodore Ts'o (1):
ext4: Fix spelling of CONTIG_FS_EXT3 to CONFIG_FS_EXT3

Thomas Gleixner (3):
genirq: Prevent oneshot irq thread race
clockevents: Sanitize min_delta_ns adjustment and prevent overflows
genirq: Protect access to irq_desc->action in can_request_irq()

Thomas Weber (1):
OMAP: DSS2: VRAM: Fix early_param for vram

Tim Yamin (1):
PCI quirk: only apply CX700 PCI bus parking quirk if external VT6212L is present

Timo Teräs (2):
ipv4: check rt_genid in dst_check
ip_gre: include route header_len in max_headroom calculation

Tomi Valkeinen (2):
OMAP: DSS2: initialize dss clk sources properly
OMAP: DSS2: panel-generic: re-implement mode changing

Tristan Ye (2):
Ocfs2: Journaling i_flags and i_orphaned_slot when adding inode to orphan dir.
Ocfs2: Handle deletion of reflinked oprhan inodes correctly.

Trond Myklebust (4):
NFS: Prevent another deadlock in nfs_release_page()
SUNRPC: Fix a potential memory leak in auth_gss
SUNRPC: Fix a use after free bug with the NFSv4.1 backchannel
SUNRPC: Fix the return value of rpc_run_bc_task()

Uwe Kleine-König (1):
rtc/mc13783: fix use after free bug

Vasu Dev (3):
ixgbe: fix for real_num_tx_queues update issue
vlan: adds vlan_dev_select_queue
vlan: updates vlan real_num_tx_queues

Wolfram Sang (2):
regulator: fix dangling pointers
get_maintainer: repair STDIN usage

YOSHIFUJI Hideaki / 吉藤英明 (1):
ipv6: Don't drop cache route entry unless timer actually expired.

Yegor Yefremov (1):
KS8695: update ksp->next_rx_desc_read at the end of rx loop

Yinghai Lu (2):
x86: Make smp_locks end with page alignment
x86: Make sure free_init_pages() frees pages on page boundary

Zhenyu Wang (1):
drm/i915: Fix check with IS_GEN6

stephen hemminger (1):
TCP: check min TTL on received ICMP packets

wzt wzt (1):
benet: Fix compile warnnings in drivers/net/benet/be_ethtool.c

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majo...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

Larry Finger

unread,

Mar 30, 2010, 3:20:01 PM3/30/10

to

Remember our old friend? See also
http://lkml.indiana.edu/hypermail/linux/kernel/1003.1/02769.html and a long
related discussion at
http://lkml.indiana.edu/hypermail/linux/kernel/1003.1/02772.html.

BUG: key ffff8800374d2660 not in .data!
------------[ cut here ]------------
WARNING: at kernel/lockdep.c:2706 lockdep_init_map+0x236/0x5d0()
Hardware name: HP Pavilion dv2700 Notebook PC
Modules linked in: processor(+) ide_pci_generic amd74xx ide_core thermal
Pid: 612, comm: modprobe Not tainted 2.6.34-rc3-Linus #192
Call Trace:
[<ffffffff81046168>] warn_slowpath_common+0x78/0xb0
[<ffffffff810461af>] warn_slowpath_null+0xf/0x20
[<ffffffff810772c6>] lockdep_init_map+0x236/0x5d0
[<ffffffff811649aa>] sysfs_add_file_mode+0x6a/0xc0
[<ffffffff811acaa0>] ? sprintf+0x40/0x50
[<ffffffff81164a0c>] sysfs_add_file+0xc/0x10
[<ffffffff81164ae1>] sysfs_create_file+0x21/0x30
[<ffffffff81247cc4>] device_create_file+0x14/0x20
[<ffffffff8126d7d4>] thermal_zone_bind_cooling_device+0x1c4/0x310
-- dump truncated here --

I understand that you do not want a whole set of such "fix one such bug at a
time" patches, but the bottom line is that these lockdep warnings for known
problems interfere with "normal" debugging because further lockdep checking is
disabled. I do not feel qualified to detect other cases and fix them. Only when
a warning hits my logs am I qualified to fix and test.

Larry

Rafael J. Wysocki

unread,

Mar 30, 2010, 5:20:02 PM3/30/10

to

On Tuesday 30 March 2010, Linus Torvalds wrote:
...

> Other than that? Random fixes and updates all over. Mostly drivers and
> filesystems, and mostly fairly small things. If you had PCI resource
> conflict problems with the early -rc's due to the _CRS window thing, for
> example, that should hopefully be fixed. See the appended shortlog for
> other details.

...

> Clemens Ladisch (4):
> firewire: core: fw_iso_resource_manage: fix error handling
> firewire: ohci: add cycle timer quirk for the TI TSB12LV22
> ALSA: cmipci: work around invalid PCM pointer
> PCI quirk: RS780/RS880: work around missing MSI initialization

This one (commit a5ee4eb7541) broke OpenGL acceleration on my new test box
which happens to have a RS780.

The symptom is that every operation involving the GPU is _very_ slow, so the
window manager eventually disables compositing. Reverting this commit makes
things work flawlessly again.

So, please revert.

BTW, I don't think it's a -stable material.

Thanks,
Rafael

Wolfram Sang

unread,

Mar 30, 2010, 10:20:01 PM3/30/10

to

> Remember our old friend? See also
> http://lkml.indiana.edu/hypermail/linux/kernel/1003.1/02769.html and a long

There is a patch available:

https://patchwork.kernel.org/patch/87436/
(or https://bugzilla.kernel.org/show_bug.cgi?id=15548)

Seems it was jussed missed.

The firmware bug should be gone, but there is also

https://patchwork.kernel.org/patch/87359/

pointing out a few more potential issues. I will resend it without the
thermal_sys-part and hope I'll get some comments.

Kind regards,

Wolfram

--
Pengutronix e.K. | Wolfram Sang |
Industrial Linux Solutions | http://www.pengutronix.de/ |

signature.asc

Larry Finger

unread,

Mar 30, 2010, 10:50:02 PM3/30/10

to

On 03/30/2010 09:16 PM, Wolfram Sang wrote:
>
>> Remember our old friend? See also
>> http://lkml.indiana.edu/hypermail/linux/kernel/1003.1/02769.html and a long
>
> There is a patch available:
>
> https://patchwork.kernel.org/patch/87436/
> (or https://bugzilla.kernel.org/show_bug.cgi?id=15548)
>
> Seems it was jussed missed.
>
>> related discussion at
>> http://lkml.indiana.edu/hypermail/linux/kernel/1003.1/02772.html.
>
> The firmware bug should be gone, but there is also
>
> https://patchwork.kernel.org/patch/87359/
>
> pointing out a few more potential issues. I will resend it without the
> thermal_sys-part and hope I'll get some comments.

Glad to hear that these will soon be fixed.

Greg KH

unread,

Mar 31, 2010, 4:40:02 PM3/31/10

to

On Tue, Mar 30, 2010 at 11:16:45PM +0200, Rafael J. Wysocki wrote:
> On Tuesday 30 March 2010, Linus Torvalds wrote:
> ...
> > Other than that? Random fixes and updates all over. Mostly drivers and
> > filesystems, and mostly fairly small things. If you had PCI resource
> > conflict problems with the early -rc's due to the _CRS window thing, for
> > example, that should hopefully be fixed. See the appended shortlog for
> > other details.
>
> ...
>
> > Clemens Ladisch (4):
> > firewire: core: fw_iso_resource_manage: fix error handling
> > firewire: ohci: add cycle timer quirk for the TI TSB12LV22
> > ALSA: cmipci: work around invalid PCM pointer
> > PCI quirk: RS780/RS880: work around missing MSI initialization
>
> This one (commit a5ee4eb7541) broke OpenGL acceleration on my new test box
> which happens to have a RS780.
>
> The symptom is that every operation involving the GPU is _very_ slow, so the
> window manager eventually disables compositing. Reverting this commit makes
> things work flawlessly again.
>
> So, please revert.
>
> BTW, I don't think it's a -stable material.

Ok, I'll go drop it.

thanks,

greg k-h

Rafael J. Wysocki

unread,

Mar 31, 2010, 9:20:02 PM3/31/10

to

On Tuesday 30 March 2010, Rafael J. Wysocki wrote:
> On Tuesday 30 March 2010, Linus Torvalds wrote:
> ...
> > Other than that? Random fixes and updates all over. Mostly drivers and
> > filesystems, and mostly fairly small things. If you had PCI resource
> > conflict problems with the early -rc's due to the _CRS window thing, for
> > example, that should hopefully be fixed. See the appended shortlog for
> > other details.
>
> ...
>
> > Clemens Ladisch (4):
> > firewire: core: fw_iso_resource_manage: fix error handling
> > firewire: ohci: add cycle timer quirk for the TI TSB12LV22
> > ALSA: cmipci: work around invalid PCM pointer
> > PCI quirk: RS780/RS880: work around missing MSI initialization
>
> This one (commit a5ee4eb7541) broke OpenGL acceleration on my new test box
> which happens to have a RS780.
>
> The symptom is that every operation involving the GPU is _very_ slow, so the
> window manager eventually disables compositing. Reverting this commit makes
> things work flawlessly again.
>
> So, please revert.
>
> BTW, I don't think it's a -stable material.

OK, I've verified that partial revert (below) is sufficient.

Rafael

---
From: Rafael J. Wysocki <r...@sisk.pl>
Subject: DRM / radeon: Really do not try to enable MSI on RS780 and RS880

Commit a5ee4eb75413c145334c30e43f1af9875dad6fd7
(PCI quirk: RS780/RS880: work around missing MSI initialization)
removed a quirk to disable MSI on RS780 and RS880, which still is
necessary on my Acer Ferrari One, because pci_enable_msi() attempts
to enable the MSI and apparently succeeds despite the PCI quirk
added by that commit. Add the removed radeon quirk again.

Signed-off-by: Rafael J. Wysocki <r...@sisk.pl>
---
drivers/gpu/drm/radeon/radeon_irq_kms.c | 8 +++++++-
1 file changed, 7 insertions(+), 1 deletion(-)

Index: linux-2.6/drivers/gpu/drm/radeon/radeon_irq_kms.c
===================================================================
--- linux-2.6.orig/drivers/gpu/drm/radeon/radeon_irq_kms.c
+++ linux-2.6/drivers/gpu/drm/radeon/radeon_irq_kms.c
@@ -116,7 +116,13 @@ int radeon_irq_kms_init(struct radeon_de
}
/* enable msi */
rdev->msi_enabled = 0;
- if (rdev->family >= CHIP_RV380) {
+ /* MSIs don't seem to work on my rs780;
+ * not sure about rs880 or other rs780s.
+ * Needs more investigation.
+ */
+ if ((rdev->family >= CHIP_RV380) &&
+ (rdev->family != CHIP_RS780) &&
+ (rdev->family != CHIP_RS880)) {
int ret = pci_enable_msi(rdev->pdev);
if (!ret) {
rdev->msi_enabled = 1;

Alex Deucher

unread,

Mar 31, 2010, 10:20:02 PM3/31/10

to

I also have the attached patch queued in via Dave's tree to disable
MSI on all IGP chips for the time being.

Alex

0001-drm-radeon-kms-disable-MSI-on-IGP-chips.patch

Clemens Ladisch

unread,

Apr 1, 2010, 2:40:02 AM4/1/10

to

Alex Deucher wrote:
> On Wed, Mar 31, 2010 at 9:13 PM, Rafael J. Wysocki <r...@sisk.pl> wrote:
>> On Tuesday 30 March 2010, Rafael J. Wysocki wrote:
>>> > PCI quirk: RS780/RS880: work around missing MSI initialization
>>>
>>> This one (commit a5ee4eb7541) broke OpenGL acceleration on my new test box
>>> which happens to have a RS780.

So it's better to disable MSI unconditionally.

Rafael, can you check if MSI works for the HDMI audio device?
(I'd guess it doesn't.)

> I also have the attached patch queued in via Dave's tree to disable
> MSI on all IGP chips for the time being.

This disables MSI only for the graphics device. I'd prefer to have
the quirk on its bridge so that MSI gets disabled for the HDMI audio
device too, to avoid having to duplicate this quirk in the snd-hda-intel
driver.

==========

PCI quirk: RS780/RS880: disable MSI completely

The missing initialization of the nb_cntl.strap_msi_enable does not seem
to be the only problem that prevents MSI, so that quirk is not
sufficient to enable MSI on all machines. To be safe, unconditionally
disable MSI for the internal graphics and HDMI audio on these chipsets.

Signed-off-by: Clemens Ladisch <cle...@ladisch.de>

--- a/drivers/pci/quirks.c
+++ b/drivers/pci/quirks.c
@@ -2123,6 +2123,8 @@ static void __devinit quirk_disable_msi(struct pci_dev *dev)
}
}
DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_AMD, PCI_DEVICE_ID_AMD_8131_BRIDGE, quirk_disable_msi);
+DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_AMD, 0x9602, quirk_disable_msi);
+DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_ASUSTEK, 0x9602, quirk_disable_msi);
DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_VIA, 0xa238, quirk_disable_msi);

/* Go through the list of Hypertransport capabilities and
@@ -2495,39 +2497,6 @@ DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_ATI, 0x4374,
DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_ATI, 0x4375,
quirk_msi_intx_disable_bug);

-/*
- * MSI does not work with the AMD RS780/RS880 internal graphics and HDMI audio
- * devices unless the BIOS has initialized the nb_cntl.strap_msi_enable bit.
- */
-static void __init rs780_int_gfx_disable_msi(struct pci_dev *int_gfx_bridge)
-{
- u32 nb_cntl;
-
- if (!int_gfx_bridge->subordinate)
- return;
-
- pci_bus_write_config_dword(int_gfx_bridge->bus, PCI_DEVFN(0, 0),
- 0x60, 0);
- pci_bus_read_config_dword(int_gfx_bridge->bus, PCI_DEVFN(0, 0),
- 0x64, &nb_cntl);
-
- if (!(nb_cntl & BIT(10))) {
- dev_warn(&int_gfx_bridge->dev,
- FW_WARN "RS780: MSI for internal graphics disabled\n");
- int_gfx_bridge->subordinate->bus_flags |= PCI_BUS_FLAGS_NO_MSI;
- }
-}
-
-#define PCI_DEVICE_ID_AMD_RS780_P2P_INT_GFX 0x9602
-
-DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_AMD,
- PCI_DEVICE_ID_AMD_RS780_P2P_INT_GFX,
- rs780_int_gfx_disable_msi);
-/* wrong vendor ID on M4A785TD motherboard: */
-DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_ASUSTEK,
- PCI_DEVICE_ID_AMD_RS780_P2P_INT_GFX,
- rs780_int_gfx_disable_msi);
-
#endif /* CONFIG_PCI_MSI */

#ifdef CONFIG_PCI_IOV

Alex Deucher

unread,

Apr 1, 2010, 11:10:02 AM4/1/10

to

On Thu, Apr 1, 2010 at 2:36 AM, Clemens Ladisch <cle...@ladisch.de> wrote:
> Alex Deucher wrote:
>> On Wed, Mar 31, 2010 at 9:13 PM, Rafael J. Wysocki <r...@sisk.pl> wrote:
>>> On Tuesday 30 March 2010, Rafael J. Wysocki wrote:
>>>> > PCI quirk: RS780/RS880: work around missing MSI initialization
>>>>
>>>> This one (commit a5ee4eb7541) broke OpenGL acceleration on my new test box
>>>> which happens to have a RS780.
>
> So it's better to disable MSI unconditionally.
>
> Rafael, can you check if MSI works for the HDMI audio device?
> (I'd guess it doesn't.)
>
>> I also have the attached patch queued in via Dave's tree to disable
>> MSI on all IGP chips for the time being.
>
> This disables MSI only for the graphics device. I'd prefer to have
> the quirk on its bridge so that MSI gets disabled for the HDMI audio
> device too, to avoid having to duplicate this quirk in the snd-hda-intel
> driver.
>
> ==========
>
> PCI quirk: RS780/RS880: disable MSI completely
>
> The missing initialization of the nb_cntl.strap_msi_enable does not seem
> to be the only problem that prevents MSI, so that quirk is not
> sufficient to enable MSI on all machines. To be safe, unconditionally
> disable MSI for the internal graphics and HDMI audio on these chipsets.
>
> Signed-off-by: Clemens Ladisch <cle...@ladisch.de>

Works fine here.

Tested-by: Alex Deucher <alexd...@gmail.com>

Linus Torvalds

unread,

Apr 1, 2010, 12:40:02 PM4/1/10

to

On Thu, 1 Apr 2010, Rafael J. Wysocki wrote:
>
> OK, I've verified that partial revert (below) is sufficient.

Hmm. Through the DRM merge I just did, this area actually conflicted, and
the resolved version is now

if ((rdev->family >= CHIP_RV380) &&

(!(rdev->flags & RADEON_IS_IGP))) {

which presumably also fixes your issue?

[ Side note: somebody in the DRM tree seems to be way too used to LISP,
and thinks that adding parenthesis always improves the code ;-]

However, I do suspect that we should probably revert the quirk regardless
as being useless (ie it probably was related to those IGP chips that
apparently don't do MSI anyway).

So the patch that reverts the quirk by Clemens (to replace it with
disabling MSI entirely when the AMD NB doesn't accept them) seems to be a
good idea regardless, since it's apparently not just about gfx. Jesse?

Linus

Alex Deucher

unread,

Apr 1, 2010, 1:10:02 PM4/1/10

to

On Thu, Apr 1, 2010 at 12:29 PM, Linus Torvalds
<torv...@linux-foundation.org> wrote:
>
>
> On Thu, 1 Apr 2010, Rafael J. Wysocki wrote:
>>
>> OK, I've verified that partial revert (below) is sufficient.
>
> Hmm. Through the DRM merge I just did, this area actually conflicted, and
> the resolved version is now
>
> if ((rdev->family >= CHIP_RV380) &&
> (!(rdev->flags & RADEON_IS_IGP))) {
>
> which presumably also fixes your issue?
>
> [ Side note: somebody in the DRM tree seems to be way too used to LISP,
> and thinks that adding parenthesis always improves the code ;-]
>

heh, that's me. habit I guess, just to be sure.

> However, I do suspect that we should probably revert the quirk regardless
> as being useless (ie it probably was related to those IGP chips that
> apparently don't do MSI anyway).
>
> So the patch that reverts the quirk by Clemens (to replace it with
> disabling MSI entirely when the AMD NB doesn't accept them) seems to be a
> good idea regardless, since it's apparently not just about gfx. Jesse?

Clemems' "PCI quirk: RS780/RS880: disable MSI completely" patch is the
right approach I think. Note that it's only devices hung off the int
gfx pci to pci bridge that have broken MSI (gfx and audio). MSI works
fine on the PCIE slots. I have a similar patch for rs400 chips on bug
15626:
https://bugzilla.kernel.org/show_bug.cgi?id=15626

Alex

>
> Linus
>
> ------------------------------------------------------------------------------
> Download Intel® Parallel Studio Eval
> Try the new software tools for yourself. Speed compiling, find bugs
> proactively, and fine-tune applications for parallel performance.
> See why Intel Parallel Studio got high marks during beta.
> http://p.sf.net/sfu/intel-sw-dev
> --
> _______________________________________________
> Dri-devel mailing list
> Dri-...@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/dri-devel

Linus Torvalds

unread,

Apr 1, 2010, 1:30:01 PM4/1/10

to

On Thu, 1 Apr 2010, Alex Deucher wrote:
>
> Clemems' "PCI quirk: RS780/RS880: disable MSI completely" patch is the
> right approach I think. Note that it's only devices hung off the int
> gfx pci to pci bridge that have broken MSI (gfx and audio). MSI works
> fine on the PCIE slots. I have a similar patch for rs400 chips on bug
> 15626:
> https://bugzilla.kernel.org/show_bug.cgi?id=15626

Hmm. Does 'pci_msi_enable' only cover regular PCI devices? Or will that
pci_no_msi() quirk disable MSI for PCIE too? I think it will trigger for
PCIE drivers too.

Put another way: it sounds like the quirk now disables MSI for all
devices. Maybe there would some more targeted mode?

Linus

Alex Deucher

unread,

Apr 1, 2010, 2:00:02 PM4/1/10

to

On Thu, Apr 1, 2010 at 1:24 PM, Linus Torvalds
<torv...@linux-foundation.org> wrote:
>
>
> On Thu, 1 Apr 2010, Alex Deucher wrote:
>>
>> Clemems' "PCI quirk: RS780/RS880: disable MSI completely" patch is the
>> right approach I think. Note that it's only devices hung off the int
>> gfx pci to pci bridge that have broken MSI (gfx and audio). MSI works
>> fine on the PCIE slots. I have a similar patch for rs400 chips on bug
>> 15626:
>> https://bugzilla.kernel.org/show_bug.cgi?id=15626
>
> Hmm. Does 'pci_msi_enable' only cover regular PCI devices? Or will that
> pci_no_msi() quirk disable MSI for PCIE too? I think it will trigger for
> PCIE drivers too.
>
> Put another way: it sounds like the quirk now disables MSI for all
> devices. Maybe there would some more targeted mode?
>

What I meant to say was MSI works fine on bridges other than the
bridge the internal gfx lives on. quirk_disable_msi() just disables
MSI on the devices on that particular bridge as far as I understand
it, but I'm by no means an expert on the PCI code. E.g., on my RS780
board, MSIs are only problematic on the integrated gfx chip. MSIs
work fine on PCI/PCIE add-on cards and the integrated Ethernet.

Alex

Clemens Ladisch

unread,

Apr 1, 2010, 2:00:03 PM4/1/10

to

Linus Torvalds wrote:
> On Thu, 1 Apr 2010, Alex Deucher wrote:
> > Clemems' "PCI quirk: RS780/RS880: disable MSI completely" patch is the
> > right approach I think. Note that it's only devices hung off the int
> > gfx pci to pci bridge that have broken MSI (gfx and audio). MSI works
> > fine on the PCIE slots. I have a similar patch for rs400 chips on bug
> > 15626:
> > https://bugzilla.kernel.org/show_bug.cgi?id=15626
>
> Hmm. Does 'pci_msi_enable' only cover regular PCI devices? Or will that
> pci_no_msi() quirk disable MSI for PCIE too?

A quirk that used pci_no_msi() would disable all MSI for all devices.
However, these patches (and that in bug 15626) use PCI_BUS_FLAGS_NO_MSI
so that only the internal GPU devices are affected.

That "completely" in my patch title should better read "unconditionally".

Regards,
Clemens

Rafael J. Wysocki

unread,

Apr 1, 2010, 3:50:02 PM4/1/10

to

On Thursday 01 April 2010, Linus Torvalds wrote:
>
> On Thu, 1 Apr 2010, Rafael J. Wysocki wrote:
> >
> > OK, I've verified that partial revert (below) is sufficient.
>
> Hmm. Through the DRM merge I just did, this area actually conflicted, and
> the resolved version is now
>
> if ((rdev->family >= CHIP_RV380) &&
> (!(rdev->flags & RADEON_IS_IGP))) {
>
> which presumably also fixes your issue?

Yes, it does.

Rafael

Alex Deucher

unread,

Apr 1, 2010, 4:30:01 PM4/1/10

to

On Thu, Apr 1, 2010 at 4:17 PM, Linus Torvalds

<torv...@linux-foundation.org> wrote:
>
>
> On Thu, 1 Apr 2010, Alex Deucher wrote:
>>
>> What I meant to say was MSI works fine on bridges other than the
>> bridge the internal gfx lives on. �quirk_disable_msi() just disables
>> MSI on the devices on that particular bridge as far as I understand
>> it, but I'm by no means an expert on the PCI code.
>

> Yes, it disabled MSI only on devices under that bridge. But if it's the
> northbridge, that would be everything, no?
>
> But I don't know what devices those
>
> � � � �PCI_VENDOR_ID_AMD, 0x9602,
> � � � �PCI_VENDOR_ID_ASUSTEK, 0x9602,
>
> things are. If they are just a PCIE->PCI bridge rather than the root
> bridge, then everything looks fine to me.
>

Yup, those are just the pci to pci bridges used for the internal gfx.
Really there's only one, 0x9602, but some asus oem boards have the
vendor id wrong.

> � � � � � � � � � � � �Linus

Linus Torvalds

unread,

Apr 1, 2010, 4:30:02 PM4/1/10

to

On Thu, 1 Apr 2010, Alex Deucher wrote:
>
> What I meant to say was MSI works fine on bridges other than the
> bridge the internal gfx lives on. quirk_disable_msi() just disables
> MSI on the devices on that particular bridge as far as I understand
> it, but I'm by no means an expert on the PCI code.

Yes, it disabled MSI only on devices under that bridge. But if it's the

northbridge, that would be everything, no?

But I don't know what devices those

PCI_VENDOR_ID_AMD, 0x9602,
PCI_VENDOR_ID_ASUSTEK, 0x9602,

things are. If they are just a PCIE->PCI bridge rather than the root
bridge, then everything looks fine to me.

Linus

Rafael J. Wysocki

unread,

Apr 1, 2010, 4:30:02 PM4/1/10

to

Unfortunately it doesn't work for me without the

if ((rdev->family >= CHIP_RV380) &&
(!(rdev->flags & RADEON_IS_IGP)))

radeon quirk.

Thanks,
Rafael

Rafael J. Wysocki

unread,

Apr 1, 2010, 4:50:02 PM4/1/10

to

On Thursday 01 April 2010, Alex Deucher wrote:

> what are your pci ids?

1022:960b

I guess 1022 is AMD.

OK, I'll try to add that.

Alex Deucher

unread,

Apr 1, 2010, 4:50:02 PM4/1/10

to

what are your pci ids?

Alex

Alex Deucher

unread,

Apr 1, 2010, 5:10:02 PM4/1/10

to

0x960b won't affect the internal gfx. That bridge is for the pcie x16 gfx slot.

0x9600 Host bridge
0x9602 Internal GFX PCI-PCI bridge ID
0x9603 External GFX - port 0
0x960B External GFX - port 1
0x9604 PCI-PCI bridge - Port 0
0x9605 PCI-PCI bridge - Port 1
0x9606 PCI-PCI bridge - Port 2
0x9607 PCI-PCI bridge - Port 3
0x9608 PCI-PCI bridge - Port 4
0x9609 PCI-PCI bridge - Port 5
0x960A PCI-PCI bridge (SB)
0x960F HD Audio controller
0x791A HDMI Audio codec

Alex

Alex Deucher

unread,

Apr 1, 2010, 5:10:02 PM4/1/10

to

It's possible your oem has the wrong vendor id for the 0x9602 bridge.

Alex

Rafael J. Wysocki

unread,

Apr 1, 2010, 5:10:01 PM4/1/10

to

Yes, the patch below works.

Thanks,
Rafael

---
drivers/gpu/drm/radeon/radeon_irq_kms.c | 3 --
drivers/pci/quirks.c | 36 ++------------------------------
2 files changed, 4 insertions(+), 35 deletions(-)

Index: linux-2.6/drivers/pci/quirks.c
===================================================================
--- linux-2.6.orig/drivers/pci/quirks.c
+++ linux-2.6/drivers/pci/quirks.c
@@ -2123,6 +2123,9 @@ static void __devinit quirk_disable_msi(

}
}
DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_AMD, PCI_DEVICE_ID_AMD_8131_BRIDGE, quirk_disable_msi);
+DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_AMD, 0x9602, quirk_disable_msi);
+DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_ASUSTEK, 0x9602, quirk_disable_msi);

+DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_AI, 0x9602, quirk_disable_msi);

DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_VIA, 0xa238, quirk_disable_msi);

/* Go through the list of Hypertransport capabilities and

@@ -2495,39 +2498,6 @@ DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_AT

Index: linux-2.6/drivers/gpu/drm/radeon/radeon_irq_kms.c
===================================================================
--- linux-2.6.orig/drivers/gpu/drm/radeon/radeon_irq_kms.c
+++ linux-2.6/drivers/gpu/drm/radeon/radeon_irq_kms.c

@@ -117,8 +117,7 @@ int radeon_irq_kms_init(struct radeon_de
/* MSIs don't seem to work reliably on all IGP
* chips. Disable MSI on them for now.
*/
- if ((rdev->family >= CHIP_RV380) &&
- (!(rdev->flags & RADEON_IS_IGP))) {
+ if (rdev->family >= CHIP_RV380) {

int ret = pci_enable_msi(rdev->pdev);
if (!ret) {
rdev->msi_enabled = 1;

Alex Deucher

unread,

Apr 1, 2010, 5:20:02 PM4/1/10

to

Let's skip this second chunk for now as there are other non-RS780 IGP
chips that could be problematic, so I'd rather just leave MSIs
disabled for now.

Alex

Rafael J. Wysocki

unread,

Apr 1, 2010, 5:50:01 PM4/1/10

to

Works for me.

So do you want me to resubmit?

Rafael

Alex Deucher

unread,

Apr 1, 2010, 6:10:01 PM4/1/10

to

Please.

Thanks,

Alex

Jesse Barnes

unread,

Apr 1, 2010, 7:00:02 PM4/1/10

to

On Thu, 1 Apr 2010 09:29:23 -0700 (PDT)
Linus Torvalds <torv...@linux-foundation.org> wrote:

>
>
> On Thu, 1 Apr 2010, Rafael J. Wysocki wrote:
> >
> > OK, I've verified that partial revert (below) is sufficient.
>
> Hmm. Through the DRM merge I just did, this area actually conflicted, and
> the resolved version is now
>
> if ((rdev->family >= CHIP_RV380) &&
> (!(rdev->flags & RADEON_IS_IGP))) {
>
> which presumably also fixes your issue?
>
> [ Side note: somebody in the DRM tree seems to be way too used to LISP,
> and thinks that adding parenthesis always improves the code ;-]
>
> However, I do suspect that we should probably revert the quirk regardless
> as being useless (ie it probably was related to those IGP chips that
> apparently don't do MSI anyway).
>
> So the patch that reverts the quirk by Clemens (to replace it with
> disabling MSI entirely when the AMD NB doesn't accept them) seems to be a
> good idea regardless, since it's apparently not just about gfx. Jesse?

Yeah, that sounds fine. I can include it in my next pull req or you
can just pick it up directly.

--
Jesse Barnes, Intel Open Source Technology Center

Rafael J. Wysocki

unread,

Apr 1, 2010, 7:20:02 PM4/1/10

to

On Friday 02 April 2010, Alex Deucher wrote:
> On Thu, Apr 1, 2010 at 5:46 PM, Rafael J. Wysocki <r...@sisk.pl> wrote:
> > On Thursday 01 April 2010, Alex Deucher wrote:
> >> On Thu, Apr 1, 2010 at 5:08 PM, Rafael J. Wysocki <r...@sisk.pl> wrote:
> >> > On Thursday 01 April 2010, Alex Deucher wrote:
> >> >> On Thu, Apr 1, 2010 at 4:48 PM, Rafael J. Wysocki <r...@sisk.pl> wrote:
> >> >> > On Thursday 01 April 2010, Alex Deucher wrote:
> >> >> >> On Thu, Apr 1, 2010 at 4:28 PM, Rafael J. Wysocki <r...@sisk.pl> wrote:
> >> >> >> > On Thursday 01 April 2010, Alex Deucher wrote:
> >> >> >> >> On Thu, Apr 1, 2010 at 2:36 AM, Clemens Ladisch <cle...@ladisch.de> wrote:
> >> >> >> >> > Alex Deucher wrote:
> >> >> >> >> >> On Wed, Mar 31, 2010 at 9:13 PM, Rafael J. Wysocki <r...@sisk.pl> wrote:
> >> >> >> >> >>> On Tuesday 30 March 2010, Rafael J. Wysocki wrote:

...

> > So do you want me to resubmit?
> >
>
> Please.

Appended, with sign-offs and changelog.

Thanks,
Rafael

---
Subject: PCI quirk: RS780/RS880: disable MSI completely

The missing initialization of the nb_cntl.strap_msi_enable does not
seem to be the only problem that prevents MSI, so that quirk is not

sufficient to enable MSI on all machines. To be safe, disable MSI
unconditionally for the internal graphics and HDMI audio on these
chipsets.

[rjw: Added the PCI_VENDOR_ID_AI quirk.]

Signed-off-by: Clemens Ladisch <cle...@ladisch.de>

Signed-off-by: Rafael J. Wysocki <r...@sisk.pl>
---

drivers/pci/quirks.c | 36 +++---------------------------------
1 file changed, 3 insertions(+), 33 deletions(-)

Rafael J. Wysocki

unread,

Apr 1, 2010, 7:30:02 PM4/1/10

to

On Friday 02 April 2010, Jesse Barnes wrote:
> On Thu, 1 Apr 2010 09:29:23 -0700 (PDT)
> Linus Torvalds <torv...@linux-foundation.org> wrote:
>
> >
> >
> > On Thu, 1 Apr 2010, Rafael J. Wysocki wrote:
> > >
> > > OK, I've verified that partial revert (below) is sufficient.
> >
> > Hmm. Through the DRM merge I just did, this area actually conflicted, and
> > the resolved version is now
> >
> > if ((rdev->family >= CHIP_RV380) &&
> > (!(rdev->flags & RADEON_IS_IGP))) {
> >
> > which presumably also fixes your issue?
> >
> > [ Side note: somebody in the DRM tree seems to be way too used to LISP,
> > and thinks that adding parenthesis always improves the code ;-]
> >
> > However, I do suspect that we should probably revert the quirk regardless
> > as being useless (ie it probably was related to those IGP chips that
> > apparently don't do MSI anyway).
> >
> > So the patch that reverts the quirk by Clemens (to replace it with
> > disabling MSI entirely when the AMD NB doesn't accept them) seems to be a
> > good idea regardless, since it's apparently not just about gfx. Jesse?
>
> Yeah, that sounds fine. I can include it in my next pull req or you
> can just pick it up directly.

Not exactly that one, please, it's missing a quirk for the affected system.

I've just sent a corrected version, here:
https://patchwork.kernel.org/patch/90275/

Rafael

Linus Torvalds

unread,

Apr 1, 2010, 8:30:02 PM4/1/10

to

On Fri, 2 Apr 2010, Rafael J. Wysocki wrote:
>
> Appended, with sign-offs and changelog.
>

> ---
> Subject: PCI quirk: RS780/RS880: disable MSI completely

Hmm. Isn't this missing a

From: Clemens Ladisch <cle...@ladisch.de>

too? Or was the original patch yours?

Linus

Rafael J. Wysocki

unread,

Apr 2, 2010, 12:50:01 PM4/2/10

to

On Friday 02 April 2010, Linus Torvalds wrote:
>
> On Fri, 2 Apr 2010, Rafael J. Wysocki wrote:
> >
> > Appended, with sign-offs and changelog.
> >
> > ---
> > Subject: PCI quirk: RS780/RS880: disable MSI completely
>
> Hmm. Isn't this missing a
>
> From: Clemens Ladisch <cle...@ladisch.de>
>
> too?

Ouch, yes it is, sorry.

This one should be complete.

---
From: Clemens Ladisch <cle...@ladisch.de>

Subject: PCI quirk: RS780/RS880: disable MSI completely

The missing initialization of the nb_cntl.strap_msi_enable does not

Borislav Petkov

unread,

Apr 2, 2010, 2:10:01 PM4/2/10

to

Hi,

I've got the following oopsie two times now when hibernating - this
means, I don't get it everytime I hibernate but only sometimes, say once
in a blue moon.

And yeah, I couldn't catch it over serial console so I had to make ugly
pictures. By the way, the numbers in the filenames increment as I scroll
down the whole oops (yep, it hadn't completely frozen and I still could
do Shift->PgUp or Shift->PgDn on the console):

http://www.kernel.org/pub/linux/kernel/people/bp/

So, here's what I could decipher from the oopsie, someone else who's
more knowledgeable in mm, rmap and anon_vma's list traversal should be
able to tell what goes wrong there.

EIP is at page_referenced+0xee

which is

<disasm>
10c4: 41 01 c4 add %eax,%r12d
10c7: 83 7d cc 00 cmpl $0x0,-0x34(%rbp)
10cb: 74 19 je 10e6 <page_referenced+0xff>
10cd: 4d 8b 6d 20 mov 0x20(%r13),%r13
10d1: 49 83 ed 20 sub $0x20,%r13

10d5: 49 8b 45 20 mov 0x20(%r13),%rax <--------------

10d9: 0f 18 08 prefetcht0 (%rax)
10dc: 49 8d 45 20 lea 0x20(%r13),%rax
10e0: 48 39 45 80 cmp %rax,-0x80(%rbp)
</disasm>

Corresponding asm:

<asm>
.loc 1 496 0
movq 32(%r13), %r13 # <variable>.same_anon_vma.next, __mptr.451
.LVL295:
subq $32, %r13 #, avc
.LVL296:
.L184:
.LBE1278:
movq 32(%r13), %rax # <variable>.same_anon_vma.next, <variable>.same_anon_vma.next <----------------
prefetcht0 (%rax) # <variable>.same_anon_vma.next
leaq 32(%r13), %rax #, tmp97
cmpq %rax, -128(%rbp) # tmp97, %sfp
jne .L187 #,
.L186:
.loc 1 514 0
movq %r14, %rdi # anon_vma,
call page_unlock_anon_vma #
</asm>

and the NULL pointer in question is being written into %r13 and then 32
is subtracted from it (I'm guessing container_of()). This is consistent
with the register snapshot - %r13 contains 0xffffffffffffffe0 which is
-32 and with the code dump in the oops, in CIMG1640.JPG code points to
opcode 49 8b 45 20.

Which is the following piece of code in <mm/rmap.c:page_referenced_anon()>.

mapcount = page_mapcount(page);
list_for_each_entry(avc, &anon_vma->head, same_anon_vma) {
struct vm_area_struct *vma = avc->vma;
unsigned long address = vma_address(page, vma);
if (address == -EFAULT)
continue;

</source>

which tells us that same_anon_vma.next is NULL. Hmm...

--
Regards/Gruss,
Boris.

Linus Torvalds

unread,

Apr 2, 2010, 2:20:02 PM4/2/10

to

I think this is likely due to the new scalable anon_vma linking by Rik.
Nothing else I can imagine should have introduced anything like it.

Rik: the picures have the information, but you need to look at several to
see both the oops and the backtrace. Here's a condensed version:

shrink_all_memory ->
do_try_to_free_pages ->
shrink_zone ->
shrink_inactive_list ->
shrink_page_list ->
page_referenced

where page_referenced() oopses due page_referenced_anon() as per
Borislav's description below.

Added all the usual suspects to the Cc list. Left the full report appended
so that the new people don't have to search for it on lkml.

Linus

Andrew Morton

unread,

Apr 2, 2010, 2:30:02 PM4/2/10

to

On Fri, 2 Apr 2010 11:09:14 -0700 (PDT) Linus Torvalds <torv...@linux-foundation.org> wrote:

>
> I think this is likely due to the new scalable anon_vma linking by Rik.

Linus Torvalds

unread,

Apr 2, 2010, 2:50:02 PM4/2/10

to

On Fri, 2 Apr 2010, Andrew Morton wrote:

> On Fri, 2 Apr 2010 11:09:14 -0700 (PDT) Linus Torvalds <torv...@linux-foundation.org> wrote:
>
> >
> > I think this is likely due to the new scalable anon_vma linking by Rik.
>
> Similar to https://bugzilla.kernel.org/show_bug.cgi?id=15680

Yup, looks like the same thing, except that bugzilla entry was due to
swapping rather than hibernation and memory shrinking. But same end
result, just different reasons for why we were trying to shrink the page
lists.

Linus

Rik van Riel

unread,

Apr 2, 2010, 6:10:01 PM4/2/10

to

On 04/02/2010 02:37 PM, Linus Torvalds wrote:
> On Fri, 2 Apr 2010, Andrew Morton wrote:
>> On Fri, 2 Apr 2010 11:09:14 -0700 (PDT) Linus Torvalds<torv...@linux-foundation.org> wrote:
>>
>>>
>>> I think this is likely due to the new scalable anon_vma linking by Rik.
>>
>> Similar to https://bugzilla.kernel.org/show_bug.cgi?id=15680
>
> Yup, looks like the same thing, except that bugzilla entry was due to
> swapping rather than hibernation and memory shrinking. But same end
> result, just different reasons for why we were trying to shrink the page
> lists.

Interesting that it is a null pointer dereference, given
that we do not zero out the anon_vma_chain structs before
freeing them.

Page_referenced_anon() takes the anon_vma->lock before
walking the list. The three places where we modify the
anon_vma_chain->same_anon_vma list, we also hold the
lock.

No doubt something in mm/ is doing something silly, but
I have not found anything yet :(

If I had to guess, I'd say maybe we got one of the
mprotect & vma_adjust cases wrong. Maybe a page stayed
around in the LRU (and in a process?) after its anon_vma
already got freed?

There has to be a reason why a very heavy AIM7 workload
and some other stress tests did not trigger it, but a few
people are able to trigger it on their systems...

Linus Torvalds

unread,

Apr 2, 2010, 8:30:02 PM4/2/10

to

On Fri, 2 Apr 2010, Rik van Riel wrote:
>
> Interesting that it is a null pointer dereference, given
> that we do not zero out the anon_vma_chain structs before
> freeing them.
>
> Page_referenced_anon() takes the anon_vma->lock before
> walking the list. The three places where we modify the
> anon_vma_chain->same_anon_vma list, we also hold the
> lock.

So let's look at the individual anon_vma_chain entries instead.

What is the protection of the 'vma->anon_vma_chain' list? In
anon_vma_prepare(), the code implies that it is the page_table_lock, but
what about anon_vma_clone()? If I'm reading it correctly, it is some odd
mix of "mmap_sem held for writing" or "mmap_sem held for reading _and_
page_table_lock". And then we have the exit case that apparently has no
locking at all, but that should hopefully be single-threaded.

That thing is subtle. A few more comments about the locking would be good,
so that people like me wouldn't have to try to guess the rules from
reading the source.

> There has to be a reason why a very heavy AIM7 workload
> and some other stress tests did not trigger it, but a few
> people are able to trigger it on their systems...

I don't think AIM7 is at all a very interesting workload, and not likely
to stress anything at all. Did your AIM7 test actually cause heavy
swapping? I doubt it.

Page swapout is where a lot of the magic happens, since that happens
without mmap_sem held etc.

Linus

Clemens Ladisch

unread,

Apr 3, 2010, 2:10:02 PM4/3/10

to

Rafael J. Wysocki wrote:
> From: Clemens Ladisch <cle...@ladisch.de>
> Subject: PCI quirk: RS780/RS880: disable MSI completely
>
> The missing initialization of the nb_cntl.strap_msi_enable does not
> seem to be the only problem that prevents MSI, so that quirk is not
> sufficient to enable MSI on all machines. To be safe, disable MSI
> unconditionally for the internal graphics and HDMI audio on these
> chipsets.
>
> [rjw: Added the PCI_VENDOR_ID_AI quirk.]

> ...

> +DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_AMD, 0x9602, quirk_disable_msi);
> +DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_ASUSTEK, 0x9602, quirk_disable_msi);
> +DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_AI, 0x9602, quirk_disable_msi);

I fear I have to NACK this. The fact that two OEMs have changed the vendor
ID makes it likely that this is a bug in AMD's template BIOS code, and that
we will see the same problem on other systems using other vendor IDs.

So we should not use the vendor ID of device 0x9602 to declare the quirk, but
use some other device with an ID that is known to be correct. We already
access the configuration space of the host bridge, so we should use that.

Furthermore, the quirk in my first patch was never run at all on the ALi
system, so it is probable that the nb_cntl.strap_msi_enable detection
would actually work. Rafael, please test this patch; if it doesn't work
on your system, we can still remove the check for the strap_msi_enable bit.

==========

Subject: PCI quirk: RS780/RS880: work around wrong vendor IDs of RS780 bridge

On many RS780 systems, the vendor ID of the PCI/PCI bridge for the
internal graphics is set to that of the mainboard vendor, so the quirk
would not match and failed to notice the disabled MSI.

Since we do not know in advance all possible vendor IDs, we have to
declare the quirk on another device with an ID that is known to be
correct, and use that as a stepping stone to find the PCI/PCI bridge,
if present.

Signed-off-by: Clemens Ladisch <cle...@ladisch.de>
Cc: <sta...@kernel.org>

--- a/drivers/pci/quirks.c
+++ b/drivers/pci/quirks.c
@@ -2483,34 +2483,38 @@ DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_AT

* MSI does not work with the AMD RS780/RS880 internal graphics and HDMI audio

* devices unless the BIOS has initialized the nb_cntl.strap_msi_enable bit.

*/
-static void __init rs780_int_gfx_disable_msi(struct pci_dev *int_gfx_bridge)

+static void __init rs780_int_gfx_disable_msi(struct pci_dev *host_bridge)
{
+ struct pci_dev *int_gfx_bridge;
u32 nb_cntl;

- if (!int_gfx_bridge->subordinate)
+ /*
+ * Many OEMs change the vendor ID of the internal graphics PCI/PCI
+ * bridge, so we use the possible vendor/device IDs of the host bridge
+ * for the declared quirk, and search for the PCI/PCI bridge by slot
+ * number.
+ */
+ int_gfx_bridge = pci_get_slot(host_bridge->bus, PCI_DEVFN(1, 0));
+ if (!int_gfx_bridge)
return;
+ if (int_gfx_bridge->device != 0x9602 || !int_gfx_bridge->subordinate)
+ goto out;

- pci_bus_write_config_dword(int_gfx_bridge->bus, PCI_DEVFN(0, 0),
- 0x60, 0);
- pci_bus_read_config_dword(int_gfx_bridge->bus, PCI_DEVFN(0, 0),
- 0x64, &nb_cntl);

+ pci_write_config_dword(host_bridge, 0x60, 0);
+ pci_read_config_dword(host_bridge, 0x64, &nb_cntl);

if (!(nb_cntl & BIT(10))) {
dev_warn(&int_gfx_bridge->dev,

FW_WARN "RS780: MSI for internal graphics disabled\n");

int_gfx_bridge->subordinate->bus_flags |= PCI_BUS_FLAGS_NO_MSI;
}
-}

-#define PCI_DEVICE_ID_AMD_RS780_P2P_INT_GFX 0x9602
+out:
+ pci_dev_put(int_gfx_bridge);
+}

-DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_AMD,
- PCI_DEVICE_ID_AMD_RS780_P2P_INT_GFX,
- rs780_int_gfx_disable_msi);
-/* wrong vendor ID on M4A785TD motherboard: */
-DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_ASUSTEK,
- PCI_DEVICE_ID_AMD_RS780_P2P_INT_GFX,
- rs780_int_gfx_disable_msi);

+DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_AMD, 0x9600, rs780_int_gfx_disable_msi);
+DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_AMD, 0x9601, rs780_int_gfx_disable_msi);

#endif /* CONFIG_PCI_MSI */

Rafael J. Wysocki

unread,

Apr 3, 2010, 3:40:01 PM4/3/10

to

On Saturday 03 April 2010, Clemens Ladisch wrote:
> Rafael J. Wysocki wrote:
> > From: Clemens Ladisch <cle...@ladisch.de>
> > Subject: PCI quirk: RS780/RS880: disable MSI completely
> >
> > The missing initialization of the nb_cntl.strap_msi_enable does not
> > seem to be the only problem that prevents MSI, so that quirk is not
> > sufficient to enable MSI on all machines. To be safe, disable MSI
> > unconditionally for the internal graphics and HDMI audio on these
> > chipsets.
> >
> > [rjw: Added the PCI_VENDOR_ID_AI quirk.]
> > ...
> > +DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_AMD, 0x9602, quirk_disable_msi);
> > +DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_ASUSTEK, 0x9602, quirk_disable_msi);
> > +DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_AI, 0x9602, quirk_disable_msi);
>
> I fear I have to NACK this.

I'm afraid it's too late, the patch has been merged.

> The fact that two OEMs have changed the vendor
> ID makes it likely that this is a bug in AMD's template BIOS code, and that
> we will see the same problem on other systems using other vendor IDs.
>
> So we should not use the vendor ID of device 0x9602 to declare the quirk, but
> use some other device with an ID that is known to be correct. We already
> access the configuration space of the host bridge, so we should use that.
>
> Furthermore, the quirk in my first patch was never run at all on the ALi
> system, so it is probable that the nb_cntl.strap_msi_enable detection
> would actually work. Rafael, please test this patch; if it doesn't work
> on your system, we can still remove the check for the strap_msi_enable bit.
>
> ==========
>
> Subject: PCI quirk: RS780/RS880: work around wrong vendor IDs of RS780 bridge
>
> On many RS780 systems, the vendor ID of the PCI/PCI bridge for the
> internal graphics is set to that of the mainboard vendor, so the quirk
> would not match and failed to notice the disabled MSI.
>
> Since we do not know in advance all possible vendor IDs, we have to
> declare the quirk on another device with an ID that is known to be
> correct, and use that as a stepping stone to find the PCI/PCI bridge,
> if present.
>
> Signed-off-by: Clemens Ladisch <cle...@ladisch.de>
> Cc: <sta...@kernel.org>

Yes, this works (after reverting commit
5193d7a7f500cfbbfc0de221e808208199723521 and removing the
(rdev->flags & RADEON_IS_IGP) test from radeon_irq_kms_init()).

Thanks,
Rafael

Linus Torvalds

unread,

Apr 3, 2010, 11:50:01 PM4/3/10

to

On Tue, 30 Mar 2010, Larry Finger wrote:
>
> I understand that you do not want a whole set of such "fix one such bug at a
> time" patches, but the bottom line is that these lockdep warnings for known
> problems interfere with "normal" debugging because further lockdep checking is
> disabled. I do not feel qualified to detect other cases and fix them. Only when
> a warning hits my logs am I qualified to fix and test.

I'd actually like the owner for this to send me the patches - I don't go
out downloading things from patchwork, and quite frankly, I was expecting
Eric Biederman to follow up on these issues, since that's where the new
requirements came from. Maybe the patches are already waiting in Greg's
tree or something?

But if people are just waiting around for others to take them, if somebody
who has tracked this issue (hint hint, Larry) sends me the final patches,
I can take them that way too.

Linus

Wolfram Sang

unread,

Apr 4, 2010, 12:00:02 AM4/4/10

to

On Sat, Apr 03, 2010 at 08:39:10PM -0700, Linus Torvalds wrote:

> requirements came from. Maybe the patches are already waiting in Greg's
> tree or something?

Andrew picked them up meanwhile.

Regards,

Wolfram

--
Pengutronix e.K. | Wolfram Sang |
Industrial Linux Solutions | http://www.pengutronix.de/ |

signature.asc

Greg KH

unread,

Apr 4, 2010, 12:10:01 AM4/4/10

to

On Sun, Apr 04, 2010 at 05:54:38AM +0200, Wolfram Sang wrote:
> On Sat, Apr 03, 2010 at 08:39:10PM -0700, Linus Torvalds wrote:
>
> > requirements came from. Maybe the patches are already waiting in Greg's
> > tree or something?
>
> Andrew picked them up meanwhile.

And I have them as well, Linus will get them in a few days also, when I
catch up with my queue.

thanks,

greg k-h

Minchan Kim

unread,

Apr 4, 2010, 12:20:02 PM4/4/10

to

Hi, Rik.

On Fri, 2010-04-02 at 18:01 -0400, Rik van Riel wrote:
> On 04/02/2010 02:37 PM, Linus Torvalds wrote:
> > On Fri, 2 Apr 2010, Andrew Morton wrote:
> >> On Fri, 2 Apr 2010 11:09:14 -0700 (PDT) Linus Torvalds<torv...@linux-foundation.org> wrote:
> >>
> >>>
> >>> I think this is likely due to the new scalable anon_vma linking by Rik.
> >>
> >> Similar to https://bugzilla.kernel.org/show_bug.cgi?id=15680
> >
> > Yup, looks like the same thing, except that bugzilla entry was due to
> > swapping rather than hibernation and memory shrinking. But same end
> > result, just different reasons for why we were trying to shrink the page
> > lists.
>
> Interesting that it is a null pointer dereference, given
> that we do not zero out the anon_vma_chain structs before
> freeing them.
>
> Page_referenced_anon() takes the anon_vma->lock before
> walking the list. The three places where we modify the
> anon_vma_chain->same_anon_vma list, we also hold the
> lock.
>
> No doubt something in mm/ is doing something silly, but
> I have not found anything yet :(
>
> If I had to guess, I'd say maybe we got one of the
> mprotect & vma_adjust cases wrong. Maybe a page stayed
> around in the LRU (and in a process?) after its anon_vma
> already got freed?

While I review the code again due to this BUG, I found some strange
thing.

In anon_vma_fork, if anon_vma_clone is successful but anon_vma_alloc is
failed, what happens? Parent VMA's anon_vmas have anon_vma_chain which
has vma which is destroyed.
I couldn't find any clean routine to remove this garbage.
I am missing something?

But I think it isn't related to this bug because oops point is not
vma_address but anon_vma_chain.next.

--
Kind regards,
Minchan Kim

Rik van Riel

unread,

Apr 4, 2010, 1:30:03 PM4/4/10

to

On 04/04/2010 12:12 PM, Minchan Kim wrote:

> While I review the code again due to this BUG, I found some strange
> thing.
>
> In anon_vma_fork, if anon_vma_clone is successful but anon_vma_alloc is
> failed, what happens? Parent VMA's anon_vmas have anon_vma_chain which
> has vma which is destroyed.
> I couldn't find any clean routine to remove this garbage.
> I am missing something?

Good catch. The parent VMA's anon_vmas will get delinked
eventually, but we need to get rid of the newly allocated
child anon_vmas. You found a hopefully rare memory leak...

We need a call to unlink_anon_vmas(vma) at the error label
to do that.

> But I think it isn't related to this bug because oops point is not
> vma_address but anon_vma_chain.next.

Agreed, it's probably not it.

Rik van Riel

unread,

Apr 4, 2010, 7:20:01 PM4/4/10

to

Fix a memory leak in anon_vma_fork(), where we fail to tear down the
anon_vmas attached to the new VMA in case setting up the new anon_vma
fails.

Reported-by: Minchan Kim <minch...@gmail.com>
Signed-off-by: Rik van Riel <ri...@redhat.com>

diff --git a/mm/rmap.c b/mm/rmap.c
index fcd593c..fb7ce99 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -231,6 +231,7 @@ int anon_vma_fork(struct vm_area_struct *vma, struct vm_area_struct *pvma)

out_error_free_anon_vma:
anon_vma_free(anon_vma);
+ unlink_anon_vmas(vma);
out_error:
return -ENOMEM;

Minchan Kim

unread,

Apr 4, 2010, 8:00:01 PM4/4/10

to

On Mon, Apr 5, 2010 at 8:09 AM, Rik van Riel <ri...@redhat.com> wrote:
> Fix a memory leak in anon_vma_fork(), where we fail to tear down the
> anon_vmas attached to the new VMA in case setting up the new anon_vma
> fails.
>
> Reported-by: Minchan Kim <minch...@gmail.com>
> Signed-off-by: Rik van Riel <ri...@redhat.com>

Reviewed-by: Minchan Kim <minch...@gmail.com>

--
Kind regards,
Minchan Kim

Linus Torvalds

unread,

Apr 5, 2010, 11:50:02 AM4/5/10

to

On Sun, 4 Apr 2010, Rik van Riel wrote:
>
> Fix a memory leak in anon_vma_fork(), where we fail to tear down the
> anon_vmas attached to the new VMA in case setting up the new anon_vma
> fails.
>
> Reported-by: Minchan Kim <minch...@gmail.com>
> Signed-off-by: Rik van Riel <ri...@redhat.com>

> Reviewed-by: Minchan Kim <minch...@gmail.com>
> ---

>
> diff --git a/mm/rmap.c b/mm/rmap.c
> index fcd593c..fb7ce99 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -231,6 +231,7 @@ int anon_vma_fork(struct vm_area_struct *vma, struct vm_area_struct *pvma)
>
> out_error_free_anon_vma:
> anon_vma_free(anon_vma);
> + unlink_anon_vmas(vma);
> out_error:
> return -ENOMEM;
> }

This looks _very_ wrong to me.

Shouldn't the unlink_anon_vmas() be in the "out_error" case? IOW, we
should do it even if the "anon_vma_alloc()" failed, nbot just if the
"anon_vma_chain_alloc()" failed?

No?

What am I missing?

Linus

Minchan Kim

unread,

Apr 5, 2010, 12:00:03 PM4/5/10

to

Indeed. You're right.
I should have been reviewed more carefully.

--
Kind regards,
Minchan Kim

Rik van Riel

unread,

Apr 5, 2010, 12:10:02 PM4/5/10

to

On 04/05/2010 11:37 AM, Linus Torvalds wrote:

> This looks _very_ wrong to me.
>
> Shouldn't the unlink_anon_vmas() be in the "out_error" case?

Indeed it should. I've had my mind somewhere else this weekend :/

New patch in the next mail.

Rik van Riel

unread,

Apr 5, 2010, 12:20:02 PM4/5/10

to

Fix a memory leak in anon_vma_fork(), where we fail to tear down the
anon_vmas attached to the new VMA in case setting up the new anon_vma
fails.

This bug also has the potential to leave behind anon_vma_chain structs
with pointers to invalid memory.

Reported-by: Minchan Kim <minch...@gmail.com>
Signed-off-by: Rik van Riel <ri...@redhat.com>

diff --git a/mm/rmap.c b/mm/rmap.c
index fcd593c..eaa7a09 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -232,6 +232,7 @@ int anon_vma_fork(struct vm_area_struct *vma, struct vm_area_struct *pvma)
out_error_free_anon_vma:
anon_vma_free(anon_vma);
out_error:
+ unlink_anon_vmas(vma);
return -ENOMEM;

KOSAKI Motohiro

unread,

Apr 6, 2010, 5:00:02 AM4/6/10

to

>
> I think this is likely due to the new scalable anon_vma linking by Rik.
> Nothing else I can imagine should have introduced anything like it.
>
> Rik: the picures have the information, but you need to look at several to
> see both the oops and the backtrace. Here's a condensed version:
>
> shrink_all_memory ->
> do_try_to_free_pages ->
> shrink_zone ->
> shrink_inactive_list ->
> shrink_page_list ->
> page_referenced
>
> where page_referenced() oopses due page_referenced_anon() as per
> Borislav's description below.
>
> Added all the usual suspects to the Cc list. Left the full report appended
> so that the new people don't have to search for it on lkml.

Today, I've reviewed this patch carefully. but I haven't found any bug.

1) anon_vma->list is alwasys protected anon_vma->lock.
2) If anyone forget to take lock, list_add() and/or list_del() never
assign to NULL.

then, NULL mean either three possibility.

a) we see uninitialized data
b) we see after freed data
c) we see memory corruption by another bug

but (a) can't happen because

static inline void __list_add()
{
next->prev = new;
new->next = next;
new->prev = prev;
prev->next = new; (*)
}

If uninitialized var is linked to avc list, new->next was already !NULL.

(b) is also impossible. SLAB_DESTROY_BY_RCU delay the page for anon_vma
freeing until next rcu period. It mean rcu_read_lock()+page_mapped()
can see kfree()ed page. but it is safe. noone corrupt it.

now I doubt (c) ;-)

Also, I've runned stress workload with shrink_all_memory() today. but
I couldn't reproduce the issue. hmm.. (perhaps I'm no lucky guy.
I'm frequently fail to reproduce)

I'll continue to work.

KOSAKI Motohiro

unread,

Apr 6, 2010, 6:10:02 AM4/6/10

to

> (b) is also impossible. SLAB_DESTROY_BY_RCU delay the page for anon_vma
> freeing until next rcu period. It mean rcu_read_lock()+page_mapped()
> can see kfree()ed page. but it is safe. noone corrupt it.

by the way: I haven't understand why rik's per process anon_vma concept
works correctly with ksm. ksm increase anon_vma->ksm_refcount. but it seems
not guranteed vma->anon_vma and page->anon_vma are the same.

but I guess bug reporter doesn't use ksm, it's minor feature.

Rik van Riel

unread,

Apr 6, 2010, 10:40:01 AM4/6/10

to

On 04/06/2010 06:09 AM, KOSAKI Motohiro wrote:
>> (b) is also impossible. SLAB_DESTROY_BY_RCU delay the page for anon_vma
>> freeing until next rcu period. It mean rcu_read_lock()+page_mapped()
>> can see kfree()ed page. but it is safe. noone corrupt it.
>
> by the way: I haven't understand why rik's per process anon_vma concept
> works correctly with ksm. ksm increase anon_vma->ksm_refcount. but it seems
> not guranteed vma->anon_vma and page->anon_vma are the same.

KSM removes the page from its original anon_vma.

If the page gets reinstantiated (copy on write), it will be
created in the vma->anon_vma.

Am I overlooking something?

Rik van Riel

unread,

Apr 6, 2010, 10:40:02 AM4/6/10

to

On 04/06/2010 04:53 AM, KOSAKI Motohiro wrote:

> Today, I've reviewed this patch carefully. but I haven't found any bug.

> Also, I've runned stress workload with shrink_all_memory() today. but

> I couldn't reproduce the issue. hmm.. (perhaps I'm no lucky guy.
> I'm frequently fail to reproduce)
>
> I'll continue to work.

My status with this bug is the same - I have gone through
the code from all angles, but have not found any other bugs
yet (except for that leak - which could leave invalid
pointers behind).

This makes me wonder if perhaps the bug is a side effect
of something Borislav (and the other reproducers) have
in their kernel configuration, which we do not have.

Another (unlikely) thing is that the fix for the leak
makes the bug go away. Yes, very unlikely.

Borislav, could you please send us your .config ?

Also, if you have the time, could you try out the
patch (-v2) I mailed in a little up this thread
that fixes the memory leak in anon_vma_fork?

I suspect it should not change anything, but it
could be useful to rule out anyway.

Minchan Kim

unread,

Apr 6, 2010, 11:40:01 AM4/6/10

to

On Tue, 2010-04-06 at 10:38 -0400, Rik van Riel wrote:
> On 04/06/2010 04:53 AM, KOSAKI Motohiro wrote:
>
> > Today, I've reviewed this patch carefully. but I haven't found any bug.
>
> > Also, I've runned stress workload with shrink_all_memory() today. but
> > I couldn't reproduce the issue. hmm.. (perhaps I'm no lucky guy.
> > I'm frequently fail to reproduce)
> >
> > I'll continue to work.
>
> My status with this bug is the same - I have gone through
> the code from all angles, but have not found any other bugs
> yet (except for that leak - which could leave invalid
> pointers behind).

Let's see the unlink_anon_vmas.

1. list_for_each_entry_safe(avc,next, vma->anon_vma_chain, same_vma)
2. anon_vma_unlink
3. spin_lock(anon_vma->lock) <-- HERE LOCK.
4. list_del(anon_vma_chain->same_anon_vma);

What if anon_vma is destroyed and reuse by SLAB_XXX_RCU for another
anon_vma object between 2 and 3?
I mean how to make sure 3) does lock valid anon_vma?

I hope it is culprit.

--
Kind regards,
Minchan Kim

Rik van Riel

unread,

Apr 6, 2010, 11:50:02 AM4/6/10

to

On 04/06/2010 11:34 AM, Minchan Kim wrote:

> Let's see the unlink_anon_vmas.
>
> 1. list_for_each_entry_safe(avc,next, vma->anon_vma_chain, same_vma)
> 2. anon_vma_unlink
> 3. spin_lock(anon_vma->lock)<-- HERE LOCK.
> 4. list_del(anon_vma_chain->same_anon_vma);
>
> What if anon_vma is destroyed and reuse by SLAB_XXX_RCU for another
> anon_vma object between 2 and 3?
> I mean how to make sure 3) does lock valid anon_vma?
>
> I hope it is culprit.

How can the anon_vma get destroyed and reused, when this
anon_vma_chain still has a reference to it (and the
anon_vma has not been freed yet)?

What combination of circumstances is necessary for
your bug hypothetical to happen?

Minchan Kim

unread,

Apr 6, 2010, 12:00:02 PM4/6/10

to

On Tue, 2010-04-06 at 11:40 -0400, Rik van Riel wrote:
> On 04/06/2010 11:34 AM, Minchan Kim wrote:
>
> > Let's see the unlink_anon_vmas.
> >
> > 1. list_for_each_entry_safe(avc,next, vma->anon_vma_chain, same_vma)
> > 2. anon_vma_unlink
> > 3. spin_lock(anon_vma->lock)<-- HERE LOCK.
> > 4. list_del(anon_vma_chain->same_anon_vma);
> >
> > What if anon_vma is destroyed and reuse by SLAB_XXX_RCU for another
> > anon_vma object between 2 and 3?
> > I mean how to make sure 3) does lock valid anon_vma?
> >
> > I hope it is culprit.
>
> How can the anon_vma get destroyed and reused, when this
> anon_vma_chain still has a reference to it (and the

Doesn't anon_vma_chain have a ref counter on anon_vma?

> anon_vma has not been freed yet)?

AFAIK, anon_vma can be reused without free by SLAB_XXX_RCU.
So we always use it carefully by page_lock_anon_vma or manual check
with RCU and page_mapped.

What am I missing?

>
> What combination of circumstances is necessary for
> your bug hypothetical to happen?

CPU A CPU B

unlink_anon_vmas
list_for_each_entry

free_pgtable
anon_vma_unlink
<crazy stall> spin_lock(anon_vma);
list_del(same_anon_vma)
spin_unlock(anon_vma)
anon_vma_unlink
anon_vma_free
reuse for another anon_vma
spin_lock(another anon_vma)
list_del(another anon_vma)

If my assumption is wrong, please correct me.
Thanks, Rik.

--
Kind regards,
Minchan Kim

Linus Torvalds

unread,

Apr 6, 2010, 12:10:02 PM4/6/10

to

On Wed, 7 Apr 2010, Minchan Kim wrote:
>
> Let's see the unlink_anon_vmas.
>
> 1. list_for_each_entry_safe(avc,next, vma->anon_vma_chain, same_vma)
> 2. anon_vma_unlink
> 3. spin_lock(anon_vma->lock) <-- HERE LOCK.
> 4. list_del(anon_vma_chain->same_anon_vma);
>
> What if anon_vma is destroyed and reuse by SLAB_XXX_RCU for another
> anon_vma object between 2 and 3?
> I mean how to make sure 3) does lock valid anon_vma?
>
> I hope it is culprit.

I don't think so. That isn't the racy case. We're working with a
anon_vma_chain, so the anonvma is all there.

The racy case is when we look up an anonvma by the page, and the page gets
unmapped at the same time because somebody else is travelling over the LRU
list of the page itself, isn't it?

I do wonder if "page_lock_anon_vma()" should check the whole
"page_mapped()" case _after_ taking the anon_vma lock. Because if the race
happens, we're following a anon_vma list that has nothing to do with that
page (it's stilla _valid_ list, since we locked the anon_vma, but will it
be ok?)

IOW, what is it that really keeps the anon_vma list reliable _and_
relevant wrt the page? We know we may get a stale anon_vma, are we ok if
that anon_vma list doesn't actually have anything to do with the page any
more?

I think the first check in "page_address_in_vma()" protects us, but
whatever.

However, that made me look at the PAGE_MIGRATION case. That seems to be
just broken. It's doing that page_anon_vma() + spin_lock without holding
any RCU locks, so there is no guarantee that anon_vma there is at all
valid.

Is that function always called with rcu_read_lock()?

Linus

Minchan Kim

unread,

Apr 6, 2010, 12:30:03 PM4/6/10

to

Hi, Linus.

On Tue, 2010-04-06 at 08:55 -0700, Linus Torvalds wrote:
>
> On Wed, 7 Apr 2010, Minchan Kim wrote:
> >
> > Let's see the unlink_anon_vmas.
> >
> > 1. list_for_each_entry_safe(avc,next, vma->anon_vma_chain, same_vma)
> > 2. anon_vma_unlink
> > 3. spin_lock(anon_vma->lock) <-- HERE LOCK.
> > 4. list_del(anon_vma_chain->same_anon_vma);
> >
> > What if anon_vma is destroyed and reuse by SLAB_XXX_RCU for another
> > anon_vma object between 2 and 3?
> > I mean how to make sure 3) does lock valid anon_vma?
> >
> > I hope it is culprit.
>
> I don't think so. That isn't the racy case. We're working with a
> anon_vma_chain, so the anonvma is all there.
>

But the anon_vma is using for another anon_vma.
Nonetheless, anon_vma_unlink does list_del(anon_vma's same_anon_vma).
I doubt it.

> The racy case is when we look up an anonvma by the page, and the page gets
> unmapped at the same time because somebody else is travelling over the LRU
> list of the page itself, isn't it?

Yes. but I thought page might travel with anon_vmas which have
same_anon_vma deleted by race.

>
> I do wonder if "page_lock_anon_vma()" should check the whole
> "page_mapped()" case _after_ taking the anon_vma lock. Because if the race
> happens, we're following a anon_vma list that has nothing to do with that
> page (it's stilla _valid_ list, since we locked the anon_vma, but will it
> be ok?)

So we always use it with (vma_address and page_check_address) to make
sure validation of anon_vma.
But I think it's not good design. I want to hold lock ahead checking of
page_mapped but maybe performance issue? I am not sure.

>
> IOW, what is it that really keeps the anon_vma list reliable _and_
> relevant wrt the page? We know we may get a stale anon_vma, are we ok if
> that anon_vma list doesn't actually have anything to do with the page any
> more?
> I think the first check in "page_address_in_vma()" protects us, but
> whatever.
>
> However, that made me look at the PAGE_MIGRATION case. That seems to be
> just broken. It's doing that page_anon_vma() + spin_lock without holding
> any RCU locks, so there is no guarantee that anon_vma there is at all
> valid.

FYI, recently there is a patch about migration case.
http://lkml.org/lkml/2010/4/2/145

>
> Is that function always called with rcu_read_lock()?
>
> Linus

--
Kind regards,
Minchan Kim

Linus Torvalds

unread,

Apr 6, 2010, 12:40:02 PM4/6/10

to

On Wed, 7 Apr 2010, Minchan Kim wrote:
> >
> > I don't think so. That isn't the racy case. We're working with a
> > anon_vma_chain, so the anonvma is all there.
>
> But the anon_vma is using for another anon_vma.

No, that can only happen if somebody has done "anon_vma_free()" on it. And
nobody does that if the anonvma still has a non-empty'&anon_vma->head'.

So as long as the anon_vma has a anon_vma_chain entry associated with it
(or a ksm refcount, but that's a separate issue), it's not going to be
re-allocated for any other use, because it's not going to be free'd.

Linus

Linus Torvalds

unread,

Apr 6, 2010, 12:40:02 PM4/6/10

to

On Wed, 7 Apr 2010, Minchan Kim wrote:
> >
> > However, that made me look at the PAGE_MIGRATION case. That seems to be
> > just broken. It's doing that page_anon_vma() + spin_lock without holding
> > any RCU locks, so there is no guarantee that anon_vma there is at all
> > valid.
>
> FYI, recently there is a patch about migration case.
> http://lkml.org/lkml/2010/4/2/145

No, I'm talking about rmap_walk_anon():

anon_vma = page_anon_vma(page);
if (!anon_vma)
return ret;
spin_lock(&anon_vma->lock);

which seems to be simply buggy. The anon_vma may not exist any more,
because an RCU event might have really freed the page between looking it
up and locking it.

Linus

Minchan Kim

unread,

Apr 6, 2010, 12:50:01 PM4/6/10

to

On Tue, 2010-04-06 at 09:28 -0700, Linus Torvalds wrote:
>
> On Wed, 7 Apr 2010, Minchan Kim wrote:
> > >
> > > However, that made me look at the PAGE_MIGRATION case. That seems to be
> > > just broken. It's doing that page_anon_vma() + spin_lock without holding
> > > any RCU locks, so there is no guarantee that anon_vma there is at all
> > > valid.
> >
> > FYI, recently there is a patch about migration case.
> > http://lkml.org/lkml/2010/4/2/145
>
> No, I'm talking about rmap_walk_anon():
>
> anon_vma = page_anon_vma(page);
> if (!anon_vma)
> return ret;
> spin_lock(&anon_vma->lock);
>
> which seems to be simply buggy. The anon_vma may not exist any more,
> because an RCU event might have really freed the page between looking it
> up and locking it.
>
> Linus

unmap_and_move
remove_migration_ptes
rmap_walk
rmap_walk_anon

We always has rcu_read_lock about anon page in unmap_and_move.
So I think it's not buggy. What am I missing?

--
Kind regards,
Minchan Kim

Minchan Kim

unread,

Apr 6, 2010, 1:00:05 PM4/6/10

to

On Tue, 2010-04-06 at 09:32 -0700, Linus Torvalds wrote:
>
> On Wed, 7 Apr 2010, Minchan Kim wrote:
> > >
> > > I don't think so. That isn't the racy case. We're working with a
> > > anon_vma_chain, so the anonvma is all there.
> >
> > But the anon_vma is using for another anon_vma.
>
> No, that can only happen if somebody has done "anon_vma_free()" on it. And
> nobody does that if the anonvma still has a non-empty'&anon_vma->head'.
>
> So as long as the anon_vma has a anon_vma_chain entry associated with it
> (or a ksm refcount, but that's a separate issue), it's not going to be
> re-allocated for any other use, because it's not going to be free'd.
>

> Linus

That's what I am missing.
Thanks, Linus.

I will think over the problem. :)

--
Kind regards,
Minchan Kim

Linus Torvalds

unread,

Apr 6, 2010, 1:00:04 PM4/6/10

to

On Wed, 7 Apr 2010, Minchan Kim wrote:
>
> unmap_and_move
> remove_migration_ptes
> rmap_walk
> rmap_walk_anon
>
> We always has rcu_read_lock about anon page in unmap_and_move.
> So I think it's not buggy. What am I missing?

Ok, in that case it's fine.

However, it does bring back my comment about all those anonvma changes:
the locking is totally undocumented.

Why isn't there a thing _saying_ that it's ok because of this?

Why is there no comment about the locking of that 'same_vma' /
'vma->anon_vma_chain' except for the totally nonsensical one about
page_table_lock (which doesn't protect _any_ of the other cases)?

Linus

Rik van Riel

unread,

Apr 6, 2010, 1:10:02 PM4/6/10

to

On 04/06/2010 12:53 PM, Linus Torvalds wrote:
> On Wed, 7 Apr 2010, Minchan Kim wrote:
>>
>> unmap_and_move
>> remove_migration_ptes
>> rmap_walk
>> rmap_walk_anon
>>
>> We always has rcu_read_lock about anon page in unmap_and_move.
>> So I think it's not buggy. What am I missing?
>
> Ok, in that case it's fine.
>
> However, it does bring back my comment about all those anonvma changes:
> the locking is totally undocumented.
>
> Why isn't there a thing _saying_ that it's ok because of this?
>
> Why is there no comment about the locking of that 'same_vma' /
> 'vma->anon_vma_chain' except for the totally nonsensical one about
> page_table_lock (which doesn't protect _any_ of the other cases)?

Which other cases? When do we ever walk the "same_vma" list
not from the context of the process owning the vma?

This bug in page_referenced is walking the "same_anon_vma" list,
which is locked with the anon_vma->lock.

Borislav Petkov

unread,

Apr 6, 2010, 1:10:01 PM4/6/10

to

From: Rik van Riel <ri...@redhat.com>
Date: Tue, Apr 06, 2010 at 10:38:18AM -0400

> This makes me wonder if perhaps the bug is a side effect
> of something Borislav (and the other reproducers) have
> in their kernel configuration, which we do not have.
>
> Another (unlikely) thing is that the fix for the leak
> makes the bug go away. Yes, very unlikely.
>
> Borislav, could you please send us your .config ?

attached.

> Also, if you have the time, could you try out the
> patch (-v2) I mailed in a little up this thread
> that fixes the memory leak in anon_vma_fork?

Sure, building ontop of v2.6.34-rc3-288-gab195c5.

Will try to trigger it but let me remind you that it will take a while
since it doesn't happen everytime I suspend.

Any other printks or debug output which might be helpful to slap at the
site, page_referenced_anon() I mean?

> I suspect it should not change anything, but it
> could be useful to rule out anyway.
>

--
Regards/Gruss,
Boris.

config-2.6.34-rc3

Linus Torvalds

unread,

Apr 6, 2010, 2:40:02 PM4/6/10

to

On Tue, 6 Apr 2010, Rik van Riel wrote:
>
> Which other cases? When do we ever walk the "same_vma" list
> not from the context of the process owning the vma?

That's the point. What does 'owning the vma' mean? That's exactly what I'm
asking to be documented.

Quite frankly, the thing is a mess. There is _no_ comment on why it's ok
to modify the list or walk the list, except for the one totally misleading
one, since the page_table_lock has at most a _secondary_ meaning in the
whole ownership (ie it is used only when we do _not_ own the vma chain
exclusively).

So your very comment shows the whole confusion. No, we do not "own the
vma" in all cases. Sometimes we just have a read-lock on it.

> This bug in page_referenced is walking the "same_anon_vma" list,
> which is locked with the anon_vma->lock.

Umm. Wake the hell up, Rik!

It's walking a _corrupt_ same_anon_vma list. In other words, we _know_
that the 'anon_vma_chain' entry is crap. We know that exactly because it
contains "impossible" values with regard to the list.

And what's the easiest way to get such a corrupt list, considering that
the locking looks correct for that particular list?

That's right: by having something like anon_vma_clone() do something bad
when it walks the same avc entries using the 'same_vma' list and creates
copies of it.

You can't just say "but but but same_anon_vma list is always locked
properly". Because it doesn't matter if that list is locked properly if
walking _another_ list doesn't work right.

I really don't understand why you keep on harping on thatr same_anon_vma
list. The fact that that was the corrupt list IN ABSOLUTELY NO WAY implies
that that is the list that caused the corruption.

For example, let's say that the 'anon_vma_chain' list is corrupted. Never
mind how. So what could happen is that you'd have vma->anon_vma pointing
to one thing, and one or more entries on the 'vma->anon_vma_chain' list
pointing to _another_ anon_vma.

What happens then? I have no idea. Maybe nothing bad. But the point is, if
one avc list is corrupted and you may end up referencing those avc's in
unexpected cases, how can you trust the other list that is in the same
data structure?

For example, maybe some list corruption causes us to do that
"anon_vma_chain_link()" _twice_ on the same avc entry. So we do that
"list_add_tail(&avc->same_anon_vma, &anon_vma->head);" on an entry that
already had "same_anon_vma" on one list.

No, I really don't see how that could happen, but my argument is that a
corrupt list can do odd things. The same entry might end up pointing to
itself, so that you end up freeing it twice or something.

Just as an example of the kind of code that makes me worry:

void unlink_anon_vmas(struct vm_area_struct *vma)
{
struct anon_vma_chain *avc, *next;

/* Unlink each anon_vma chained to the VMA. */
list_for_each_entry_safe(avc, next, &vma->anon_vma_chain, same_vma) {
anon_vma_unlink(avc);
list_del(&avc->same_vma);
anon_vma_chain_free(avc);
}
}

Now, think about what happens for the *last* entry in that avc chain. It
will call that "anon_vma_unlink()" thing, which will delete perhaps the
last entry in the "same_anon_vma" one, and then it does

if (empty)
anon_vma_free(anon_vma);

*before* unlink_anon_vma's has actually does that

list_del(&avc->same_vma);

and what we essentially have is a stale anon_vma_chain entry that still
exists on that same_vma list, and points to an anon_vma that already got
deleted.

Does it matter? I really can't see that it does. But that's the kind of
thing that makes me nervous. It makes me _especially_ nervous when the
whole locking for that anon_vma_chain thing isn't entirely obvious.

Linus

Andrew Morton

unread,

Apr 6, 2010, 3:10:02 PM4/6/10

to

On Tue, 6 Apr 2010 11:28:52 -0700 (PDT)
Linus Torvalds <torv...@linux-foundation.org> wrote:

> For example, maybe some list corruption causes us to do that
> "anon_vma_chain_link()" _twice_ on the same avc entry. So we do that
> "list_add_tail(&avc->same_anon_vma, &anon_vma->head);" on an entry that
> already had "same_anon_vma" on one list.

The lib/list_debug.c stuff might detect such things. I wonder if
either Borislav or Steinar had CONFIG_DEBUG_LIST enabled?

Linus Torvalds

unread,

Apr 6, 2010, 3:20:01 PM4/6/10

to

On Tue, 6 Apr 2010, Andrew Morton wrote:

> On Tue, 6 Apr 2010 11:28:52 -0700 (PDT)
> Linus Torvalds <torv...@linux-foundation.org> wrote:
>
> > For example, maybe some list corruption causes us to do that
> > "anon_vma_chain_link()" _twice_ on the same avc entry. So we do that
> > "list_add_tail(&avc->same_anon_vma, &anon_vma->head);" on an entry that
> > already had "same_anon_vma" on one list.
>
> The lib/list_debug.c stuff might detect such things. I wonder if
> either Borislav or Steinar had CONFIG_DEBUG_LIST enabled?

Well, even without CONFIG_LIST_DEBUG we'd catch _some_ things, and
conversely, even with LIST_DEBUG on we don't catch everything.

For example, doing list_del() twice on the same entry will die with a
really nice pattern due to poisoning even without LIST_DEBUG.

But list_add() twice on the same entry will sadly silently succeed both
with and without list debugging (the list debugging will check the target
list head, but there is no way to check the "new->next/prev" entries).

Anyway, I've not actually found anything wrong in the same_vma locking.
And I'm not at all convinced there is any list corruption there. My point
was really only that
(a) the locking rules seem very unclear and certainly not documented and
(b) corruption of one list could easily be the cause of corruption of
another list of the same structure.
but I don't actually see anything wrong anywhere.

Linus

Steinar H. Gunderson

unread,

Apr 6, 2010, 3:40:02 PM4/6/10

to

On Tue, Apr 06, 2010 at 12:03:15PM -0700, Andrew Morton wrote:
>> For example, maybe some list corruption causes us to do that
>> "anon_vma_chain_link()" _twice_ on the same avc entry. So we do that
>> "list_add_tail(&avc->same_anon_vma, &anon_vma->head);" on an entry that
>> already had "same_anon_vma" on one list.
> The lib/list_debug.c stuff might detect such things. I wonder if
> either Borislav or Steinar had CONFIG_DEBUG_LIST enabled?

Not set on my kernel.

/* Steinar */
--
Homepage: http://www.sesse.net/

Linus Torvalds

unread,

Apr 6, 2010, 3:50:04 PM4/6/10

to

On Tue, 6 Apr 2010, Linus Torvalds wrote:
>
> Anyway, I've not actually found anything wrong in the same_vma locking.
> And I'm not at all convinced there is any list corruption there. My point
> was really only that
> (a) the locking rules seem very unclear and certainly not documented and
> (b) corruption of one list could easily be the cause of corruption of
> another list of the same structure.
> but I don't actually see anything wrong anywhere.

I _have_ found what looks like a few clues, though.

In particular, the disassembly in Steinar Gunderson's case looks much more
like the disassembly I get, and if I read that correctly, it's actually
the _first_ iteration of the for_each_entry() loop that crashes.

Why do I think so?

In Steinar's oops, we have "RAX: ffff880169111fc8", which is clearly a
kernel pointer. However, the code from Steinar's oops decodes to:

0: 3b 56 10 cmp 0x10(%rsi),%edx
3: 73 1e jae 0x23
5: 48 83 fa f2 cmp $0xfffffffffffffff2,%rdx
9: 74 18 je 0x23
b: 4d 89 f8 mov %r15,%r8
e: 48 8d 4d cc lea -0x34(%rbp),%rcx
12: 4c 89 e7 mov %r12,%rdi
15: e8 44 f2 ff ff callq 0xfffffffffffff25e
1a: 41 01 c5 add %eax,%r13d
1d: 83 7d cc 00 cmpl $0x0,-0x34(%rbp)
21: 74 19 je 0x3c
23: 48 8b 43 20 mov 0x20(%rbx),%rax
27: 48 8d 58 e0 lea -0x20(%rax),%rbx
2b:* 48 8b 43 20 mov 0x20(%rbx),%rax <-- trapping instruction
2f: 0f 18 08 prefetcht0 (%rax)
32: 48 8d 43 20 lea 0x20(%rbx),%rax
36: 48 39 45 88 cmp %rax,-0x78(%rbp)
3a: 75 a7 jne 0xffffffffffffffe3
3c: 41 fe 06 incb (%r14)
3f: e9 .byte 0xe9

which matches my code pretty well, and the point is, _if_ it went through
the loop, then %rbx should be %rax+20. And it's not.

IOW, the code you see above before the trapping instruction is the end of
the loop: it's the

referenced += page_referenced_one(page, vma, address,
&mapcount, vm_flags);
if (!mapcount)
break;
}

part (the "callq" and "add %eax" is that "referenced +=", and %r13d is
"referenced").

What you cannot see from the code decode is the loop setup and _entry_,
which looks like this for me:

movl 12(%rbx), %eax # <variable>.D.11299._mapcount.counter, D.33294
xorl %r12d, %r12d # referenced
incl %eax # tmp89
movl %eax, -52(%rbp) # tmp89, mapcount
leaq 48(%r14), %rax #,
movq 48(%r14), %r13 # <variable>.head.next, <variable>.head.next
movq %rax, -128(%rbp) #, %sfp
subq $32, %r13 #, avc
jmp .L167 #

where that "L167" is actually the oopsing instruction (ie the "while" loop
has been turned around, and we jump to the end of the loop that does the
loop end test).

In other words, what is NULL here is not an anon_vma_chain entry, but
actually the initial "anon_vma->head.next" pointer.

The whole _head_ of the list has never been initialized, in other words.

So we can entirely ignore the 'anon_vma_chain' issues. We need to look at
the initializations of the 'anon_vma's themselves.

Borislav Petkov

unread,

Apr 6, 2010, 3:50:04 PM4/6/10

to

From: Andrew Morton <ak...@linux-foundation.org>
Date: Tue, Apr 06, 2010 at 12:03:15PM -0700

> On Tue, 6 Apr 2010 11:28:52 -0700 (PDT)
> Linus Torvalds <torv...@linux-foundation.org> wrote:
>
> > For example, maybe some list corruption causes us to do that
> > "anon_vma_chain_link()" _twice_ on the same avc entry. So we do that
> > "list_add_tail(&avc->same_anon_vma, &anon_vma->head);" on an entry that
> > already had "same_anon_vma" on one list.
>
> The lib/list_debug.c stuff might detect such things. I wonder if
> either Borislav or Steinar had CONFIG_DEBUG_LIST enabled?

No, it is off in my .config. I'll turn it on and retest to see whether
it screams something. In the meantime, I've been testing current git
(v2.6.34-rc3-288-gab195c5), and especially Rik's mem leak fix which
Linus already committed (4946d54cb55e86a156216fcfeed5568514b0830f) and
tried to retrigger the bug by hibernating the machine several times.

Now, this machine has 8G of memory so I thought maybe if starting
several assorted guests on it would put some pressure on anon_vma lists
but no, the machine habernated happily by creating almost a 600Mb
hibernation image and having all three guests loaded.

Then, I said, well, let's have another last test run and started firefox
which went into reloading the last session. And I remember that firefox
still hadn't finished loading all pages when I hibernated and boom, it
oopsed.

So, it definitely is some anon_vma lists concurrency issue ... The good
thing is, I was able to catch the oops in its sheer magnificence over
netconsole this time:

[ 2995.478125] PM: Preallocating image memory...
[ 2995.713692] BUG: unable to handle kernel NULL pointer dereference at (null)
[ 2995.714001] IP: [<ffffffff810c194d>] page_referenced+0xee/0x1dc
[ 2995.714001] PGD 22d1b8067 PUD 22dd85067 PMD 0
[ 2995.714001] Oops: 0000 [#1] PREEMPT SMP
[ 2995.714001] last sysfs file: /sys/power/state
[ 2995.714001] CPU 0
[ 2995.714001] Modules linked in: tun powernow_k8 cpufreq_ondemand cpufreq_powersave cpufreq_userspace freq_table cpufreq_conservative binfmt_misc kvm_amd kvm ipv6 vfat fat dm_crypt dm_mod ohci_hcd pcspkr 8250_pnp 8250 k10temp edac_core serial_core
[ 2995.714001]
[ 2995.714001] Pid: 7440, comm: hib.sh Not tainted 2.6.34-rc3-00288-gab195c5 #1 M3A78 PRO/System Product Name
[ 2995.714001] RIP: 0010:[<ffffffff810c194d>] [<ffffffff810c194d>] page_referenced+0xee/0x1dc
[ 2995.714001] RSP: 0018:ffff88022fa038b8 EFLAGS: 00010283
[ 2995.714001] RAX: ffff88022d747098 RBX: ffffea00078efb70 RCX: 0000000000000000
[ 2995.714001] RDX: ffff88022fa03cf8 RSI: ffff88022d747070 RDI: ffff88022fb32520
[ 2995.714001] RBP: ffff88022fa03938 R08: 0000000000000002 R09: 0000000000000000
[ 2995.714001] R10: ffff88022fa038a8 R11: ffff88022d295d10 R12: 0000000000000000
[ 2995.714001] R13: ffffffffffffffe0 R14: ffff88022d747058 R15: ffff88022fa03a00
[ 2995.714001] FS: 00007f4da8b966f0(0000) GS:ffff88000a000000(0000) knlGS:0000000000000000
[ 2995.714001] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[ 2995.714001] CR2: 0000000000000000 CR3: 000000022d11e000 CR4: 00000000000006f0
[ 2995.714001] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 2995.714001] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[ 2995.714001] Process hib.sh (pid: 7440, threadinfo ffff88022fa02000, task ffff88022fb32520)
[ 2995.714001] Stack:
[ 2995.714001] ffff88022d747098 00000000813fd2ac ffffffff8165ee28 0000000000000416
[ 2995.714001] <0> ffff88022fa038f8 ffffffff810c6d40 ffffea00078fae60 ffffea00078fae60
[ 2995.714001] <0> ffff88022fa03938 00000002810abd98 ffffea00078ec530 ffffea00078efb98
[ 2995.714001] Call Trace:
[ 2995.714001] [<ffffffff810c6d40>] ? swapcache_free+0x37/0x3c
[ 2995.714001] [<ffffffff810ac31d>] shrink_page_list+0x171/0x4b1
[ 2995.714001] [<ffffffff813fd1e6>] ? _raw_spin_unlock_irq+0x30/0x58
[ 2995.714001] [<ffffffff810ac9b9>] shrink_inactive_list+0x35c/0x623
[ 2995.714001] [<ffffffff810acd94>] ? shrink_zone+0x114/0x3d4
[ 2995.714001] [<ffffffff81064f29>] ? print_lock_contention_bug+0x1b/0xe1
[ 2995.714001] [<ffffffff813fc790>] ? _raw_spin_lock_irq+0x19/0x79
[ 2995.714001] [<ffffffff810acf8a>] shrink_zone+0x30a/0x3d4
[ 2995.714001] [<ffffffff810ad19e>] ? shrink_slab+0x14a/0x15c
[ 2995.714001] [<ffffffff810adb65>] do_try_to_free_pages+0x176/0x27f
[ 2995.714001] [<ffffffff8103de67>] ? irq_exit+0x93/0x95
[ 2995.714001] [<ffffffff810add03>] shrink_all_memory+0x95/0xc4
[ 2995.714001] [<ffffffff810ab0f0>] ? isolate_pages_global+0x0/0x217
[ 2995.714001] [<ffffffff81077503>] ? count_data_pages+0x65/0x79
[ 2995.714001] [<ffffffff8107776a>] hibernate_preallocate_memory+0x1aa/0x2cb
[ 2995.714001] [<ffffffff813f95b5>] ? printk+0x41/0x44
[ 2995.714001] [<ffffffff810760b3>] hibernation_snapshot+0x36/0x1e1
[ 2995.714001] [<ffffffff8107632c>] hibernate+0xce/0x172
[ 2995.714001] [<ffffffff81075099>] state_store+0x5c/0xd3
[ 2995.714001] [<ffffffff8118728f>] kobj_attr_store+0x17/0x19
[ 2995.714001] [<ffffffff81127b69>] sysfs_write_file+0x108/0x144
[ 2995.714001] [<ffffffff810d66ff>] vfs_write+0xb2/0x153
[ 2995.714001] [<ffffffff810641a9>] ? trace_hardirqs_on_caller+0x1f/0x14b
[ 2995.714001] [<ffffffff810d6863>] sys_write+0x4a/0x71
[ 2995.714001] [<ffffffff810021db>] system_call_fastpath+0x16/0x1b
[ 2995.714001] Code: 3b 56 10 73 1e 48 83 fa f2 74 18 48 8d 4d cc 4d 89 f8 48 89 df e8 4d f2 ff ff 41 01 c4 83 7d cc 00 74 19 4d 8b 6d 20 49 83 ed 20 <49> 8b 45 20 0f 18 08 49 8d 45 20 48 39 45 80 75 aa 4c 89 f7 e8
[ 2995.714001] RIP [<ffffffff810c194d>] page_referenced+0xee/0x1dc
[ 2995.714001] RSP <ffff88022fa038b8>
[ 2995.714001] CR2: 0000000000000000
[ 2995.729717] ---[ end trace 92c25d74e4800968 ]---
[ 2995.729862] note: hib.sh[7440] exited with preempt_count 2
[ 2995.730022] BUG: scheduling while atomic: hib.sh/7440/0x10000003
[ 2995.730170] INFO: lockdep is turned off.
[ 2995.730319] Modules linked in: tun powernow_k8 cpufreq_ondemand cpufreq_powersave cpufreq_userspace freq_table cpufreq_conservative binfmt_misc kvm_amd kvm ipv6 vfat fat dm_crypt dm_mod ohci_hcd pcspkr 8250_pnp 8250 k10temp edac_core serial_core
[ 2995.731749] Pid: 7440, comm: hib.sh Tainted: G D 2.6.34-rc3-00288-gab195c5 #1
[ 2995.732003] Call Trace:
[ 2995.732158] [<ffffffff810636bf>] ? __debug_show_held_locks+0x1b/0x24
[ 2995.732305] [<ffffffff8102d499>] __schedule_bug+0x72/0x77
[ 2995.732454] [<ffffffff813f9a0a>] schedule+0xd9/0x730
[ 2995.732603] [<ffffffff81030301>] __cond_resched+0x18/0x24
[ 2995.732751] [<ffffffff813fa12e>] _cond_resched+0x2c/0x37
[ 2995.732900] [<ffffffff810b8a21>] unmap_vmas+0x6ce/0x893
[ 2995.733053] [<ffffffff810bd0f5>] exit_mmap+0xd7/0x182
[ 2995.733206] [<ffffffff81035b58>] mmput+0x43/0xea
[ 2995.733356] [<ffffffff81039e99>] exit_mm+0x110/0x11d
[ 2995.733505] [<ffffffff8103b8ed>] do_exit+0x1c5/0x6a2
[ 2995.733653] [<ffffffff81038f84>] ? kmsg_dump+0x13b/0x155
[ 2995.733802] [<ffffffff810060db>] ? oops_end+0x47/0x93
[ 2995.733950] [<ffffffff81006122>] oops_end+0x8e/0x93
[ 2995.734102] [<ffffffff8101ed99>] no_context+0x1fc/0x20b
[ 2995.734255] [<ffffffff8101ef34>] __bad_area_nosemaphore+0x18c/0x1af
[ 2995.734407] [<ffffffff8101f16f>] ? do_page_fault+0xa8/0x32d
[ 2995.734556] [<ffffffff8101ef6a>] bad_area_nosemaphore+0x13/0x15
[ 2995.734705] [<ffffffff8101f23a>] do_page_fault+0x173/0x32d
[ 2995.734854] [<ffffffff810802f9>] ? __call_rcu+0x11d/0x130
[ 2995.735008] [<ffffffff813fdaa3>] ? error_sti+0x5/0x6
[ 2995.735161] [<ffffffff81063167>] ? trace_hardirqs_off_caller+0x1f/0xa9
[ 2995.735313] [<ffffffff813fc48e>] ? trace_hardirqs_off_thunk+0x3a/0x3c
[ 2995.735463] [<ffffffff813fd8bf>] page_fault+0x1f/0x30
[ 2995.735612] [<ffffffff810c194d>] ? page_referenced+0xee/0x1dc
[ 2995.735761] [<ffffffff810c18df>] ? page_referenced+0x80/0x1dc
[ 2995.735910] [<ffffffff810c6d40>] ? swapcache_free+0x37/0x3c
[ 2995.736062] [<ffffffff810ac31d>] shrink_page_list+0x171/0x4b1
[ 2995.736216] [<ffffffff813fd1e6>] ? _raw_spin_unlock_irq+0x30/0x58
[ 2995.736368] [<ffffffff810ac9b9>] shrink_inactive_list+0x35c/0x623
[ 2995.736518] [<ffffffff810acd94>] ? shrink_zone+0x114/0x3d4
[ 2995.736666] [<ffffffff81064f29>] ? print_lock_contention_bug+0x1b/0xe1
[ 2995.736816] [<ffffffff813fc790>] ? _raw_spin_lock_irq+0x19/0x79
[ 2995.736965] [<ffffffff810acf8a>] shrink_zone+0x30a/0x3d4
[ 2995.737117] [<ffffffff810ad19e>] ? shrink_slab+0x14a/0x15c
[ 2995.737270] [<ffffffff810adb65>] do_try_to_free_pages+0x176/0x27f
[ 2995.737422] [<ffffffff8103de67>] ? irq_exit+0x93/0x95
[ 2995.737570] [<ffffffff810add03>] shrink_all_memory+0x95/0xc4
[ 2995.737719] [<ffffffff810ab0f0>] ? isolate_pages_global+0x0/0x217
[ 2995.737868] [<ffffffff81077503>] ? count_data_pages+0x65/0x79
[ 2995.738020] [<ffffffff8107776a>] hibernate_preallocate_memory+0x1aa/0x2cb
[ 2995.738175] [<ffffffff813f95b5>] ? printk+0x41/0x44
[ 2995.738326] [<ffffffff810760b3>] hibernation_snapshot+0x36/0x1e1
[ 2995.738475] [<ffffffff8107632c>] hibernate+0xce/0x172
[ 2995.738623] [<ffffffff81075099>] state_store+0x5c/0xd3
[ 2995.738772] [<ffffffff8118728f>] kobj_attr_store+0x17/0x19
[ 2995.738920] [<ffffffff81127b69>] sysfs_write_file+0x108/0x144
[ 2995.739073] [<ffffffff810d66ff>] vfs_write+0xb2/0x153
[ 2995.739226] [<ffffffff810641a9>] ? trace_hardirqs_on_caller+0x1f/0x14b
[ 2995.739378] [<ffffffff810d6863>] sys_write+0x4a/0x71
[ 2995.739526] [<ffffffff810021db>] system_call_fastpath+0x16/0x1b
[ 2995.739940] BUG: unable to handle kernel paging request at 00007faf064ff1f0
[ 2995.740220] IP: [<ffffffff8119c0d0>] do_raw_spin_trylock+0x4/0x3a
[ 2995.740441] PGD 0
[ 2995.740646] Oops: 0000 [#2] PREEMPT SMP
[ 2995.740685] last sysfs file: /sys/power/state
[ 2995.740685] CPU 1
[ 2995.740685] Modules linked in: tun powernow_k8 cpufreq_ondemand cpufreq_powersave cpufreq_userspace freq_table cpufreq_conservative binfmt_misc kvm_amd kvm ipv6 vfat fat dm_crypt dm_mod ohci_hcd pcspkr 8250_pnp 8250 k10temp edac_core serial_core
[ 2995.740685]
[ 2995.740685] Pid: 7440, comm: hib.sh Tainted: G D 2.6.34-rc3-00288-gab195c5 #1 M3A78 PRO/System Product Name
[ 2995.740685] RIP: 0010:[<ffffffff8119c0d0>] [<ffffffff8119c0d0>] do_raw_spin_trylock+0x4/0x3a
[ 2995.740685] RSP: 0018:ffff88022fa03438 EFLAGS: 00010292
[ 2995.740685] RAX: ffff88022fb32520 RBX: 00007faf064ff1f0 RCX: 0000000000000000
[ 2995.740685] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 00007faf064ff1f0
[ 2995.740685] RBP: ffff88022fa03438 R08: 0000000000000002 R09: 0000000000000000
[ 2995.740685] R10: dead000000100100 R11: ffffffff810d26f5 R12: 00007faf064ff208
[ 2995.740685] R13: fffffffffffffff0 R14: ffff88022d747068 R15: 00007f4da81fa000
[ 2995.740685] FS: 00007f4da8b966f0(0000) GS:ffff88000a200000(0000) knlGS:0000000000000000
[ 2995.740685] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[ 2995.740685] CR2: 00007faf064ff1f0 CR3: 0000000001646000 CR4: 00000000000006e0
[ 2995.740685] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 2995.740685] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[ 2995.740685] Process hib.sh (pid: 7440, threadinfo ffff88022fa02000, task ffff88022fb32520)
[ 2995.740685] Stack:
[ 2995.740685] ffff88022fa03468 ffffffff813fc6c3 ffffffff810c1ae3 ffff8801cbfde880
[ 2995.740685] <0> ffff88022fb32510 00007faf064ff1f0 ffff88022fa034a8 ffffffff810c1ae3
[ 2995.740685] <0> ffff88022fa034a8 ffff88022d747000 0000000000000000 0000000000000000
[ 2995.740685] Call Trace:
[ 2995.740685] [<ffffffff813fc6c3>] _raw_spin_lock+0x48/0x73
[ 2995.740685] [<ffffffff810c1ae3>] ? unlink_anon_vmas+0x40/0xe1
[ 2995.740685] [<ffffffff810c1ae3>] unlink_anon_vmas+0x40/0xe1
[ 2995.740685] [<ffffffff810bb562>] free_pgtables+0x68/0xce
[ 2995.740685] [<ffffffff810bd11e>] exit_mmap+0x100/0x182
[ 2995.740685] [<ffffffff81035b58>] mmput+0x43/0xea
[ 2995.740685] [<ffffffff81039e99>] exit_mm+0x110/0x11d
[ 2995.740685] [<ffffffff8103b8ed>] do_exit+0x1c5/0x6a2
[ 2995.740685] [<ffffffff81038f84>] ? kmsg_dump+0x13b/0x155
[ 2995.740685] [<ffffffff810060db>] ? oops_end+0x47/0x93
[ 2995.740685] [<ffffffff81006122>] oops_end+0x8e/0x93
[ 2995.740685] [<ffffffff8101ed99>] no_context+0x1fc/0x20b
[ 2995.740685] [<ffffffff8101ef34>] __bad_area_nosemaphore+0x18c/0x1af
[ 2995.740685] [<ffffffff8101f16f>] ? do_page_fault+0xa8/0x32d
[ 2995.740685] [<ffffffff8101ef6a>] bad_area_nosemaphore+0x13/0x15
[ 2995.740685] [<ffffffff8101f23a>] do_page_fault+0x173/0x32d
[ 2995.740685] [<ffffffff810802f9>] ? __call_rcu+0x11d/0x130
[ 2995.740685] [<ffffffff813fdaa3>] ? error_sti+0x5/0x6
[ 2995.740685] [<ffffffff81063167>] ? trace_hardirqs_off_caller+0x1f/0xa9
[ 2995.740685] [<ffffffff813fc48e>] ? trace_hardirqs_off_thunk+0x3a/0x3c
[ 2995.740685] [<ffffffff813fd8bf>] page_fault+0x1f/0x30
[ 2995.740685] [<ffffffff810c194d>] ? page_referenced+0xee/0x1dc
[ 2995.740685] [<ffffffff810c18df>] ? page_referenced+0x80/0x1dc
[ 2995.740685] [<ffffffff810c6d40>] ? swapcache_free+0x37/0x3c
[ 2995.740685] [<ffffffff810ac31d>] shrink_page_list+0x171/0x4b1
[ 2995.740685] [<ffffffff813fd1e6>] ? _raw_spin_unlock_irq+0x30/0x58
[ 2995.740685] [<ffffffff810ac9b9>] shrink_inactive_list+0x35c/0x623
[ 2995.740685] [<ffffffff810acd94>] ? shrink_zone+0x114/0x3d4
[ 2995.740685] [<ffffffff81064f29>] ? print_lock_contention_bug+0x1b/0xe1
[ 2995.740685] [<ffffffff813fc790>] ? _raw_spin_lock_irq+0x19/0x79
[ 2995.740685] [<ffffffff810acf8a>] shrink_zone+0x30a/0x3d4
[ 2995.740685] [<ffffffff810ad19e>] ? shrink_slab+0x14a/0x15c
[ 2995.740685] [<ffffffff810adb65>] do_try_to_free_pages+0x176/0x27f
[ 2995.740685] [<ffffffff8103de67>] ? irq_exit+0x93/0x95
[ 2995.740685] [<ffffffff810add03>] shrink_all_memory+0x95/0xc4
[ 2995.740685] [<ffffffff810ab0f0>] ? isolate_pages_global+0x0/0x217
[ 2995.740685] [<ffffffff81077503>] ? count_data_pages+0x65/0x79
[ 2995.740685] [<ffffffff8107776a>] hibernate_preallocate_memory+0x1aa/0x2cb
[ 2995.740685] [<ffffffff813f95b5>] ? printk+0x41/0x44
[ 2995.740685] [<ffffffff810760b3>] hibernation_snapshot+0x36/0x1e1
[ 2995.740685] [<ffffffff8107632c>] hibernate+0xce/0x172
[ 2995.740685] [<ffffffff81075099>] state_store+0x5c/0xd3
[ 2995.740685] [<ffffffff8118728f>] kobj_attr_store+0x17/0x19
[ 2995.740685] [<ffffffff81127b69>] sysfs_write_file+0x108/0x144
[ 2995.740685] [<ffffffff810d66ff>] vfs_write+0xb2/0x153
[ 2995.740685] [<ffffffff810641a9>] ? trace_hardirqs_on_caller+0x1f/0x14b
[ 2995.740685] [<ffffffff810d6863>] sys_write+0x4a/0x71
[ 2995.740685] [<ffffffff810021db>] system_call_fastpath+0x16/0x1b
[ 2995.740685] Code: c7 c7 90 16 67 81 e8 79 f1 25 00 48 c7 c7 90 16 67 81 e8 e1 e7 25 00 48 c7 c7 30 18 67 81 e8 d5 e7 25 00 c9 c3 90 90 55 48 89 e5 <0f> b7 07 38 e0 8d 90 00 01 00 00 75 05 f0 66 0f b1 17 0f 94 c2
[ 2995.740685] RIP [<ffffffff8119c0d0>] do_raw_spin_trylock+0x4/0x3a
[ 2995.740685] RSP <ffff88022fa03438>
[ 2995.740685] CR2: 00007faf064ff1f0
[ 2995.762521] ---[ end trace 92c25d74e4800969 ]---
[ 2995.762686] Fixing recursive fault but reboot is needed!
[ 2995.762855] BUG: scheduling while atomic: hib.sh/7440/0x00000005
[ 2995.763026] INFO: lockdep is turned off.
[ 2995.763203] Modules linked in: tun powernow_k8 cpufreq_ondemand cpufreq_powersave cpufreq_userspace freq_table cpufreq_conservative binfmt_misc kvm_amd kvm ipv6 vfat fat dm_crypt dm_mod ohci_hcd pcspkr 8250_pnp 8250 k10temp edac_core serial_core
[ 2995.764799] Pid: 7440, comm: hib.sh Tainted: G D 2.6.34-rc3-00288-gab195c5 #1
[ 2995.765080] Call Trace:
[ 2995.765256] [<ffffffff810636bf>] ? __debug_show_held_locks+0x1b/0x24
[ 2995.765429] [<ffffffff8102d499>] __schedule_bug+0x72/0x77
[ 2995.765600] [<ffffffff813f9a0a>] schedule+0xd9/0x730
[ 2995.765771] [<ffffffff8103b7f7>] do_exit+0xcf/0x6a2
[ 2995.765941] [<ffffffff81038f84>] ? kmsg_dump+0x13b/0x155
[ 2995.766115] [<ffffffff810060db>] ? oops_end+0x47/0x93
[ 2995.766295] [<ffffffff81006122>] oops_end+0x8e/0x93
[ 2995.766462] [<ffffffff8101ed99>] no_context+0x1fc/0x20b
[ 2995.766632] [<ffffffff810641a9>] ? trace_hardirqs_on_caller+0x1f/0x14b
[ 2995.766806] [<ffffffff8101ef34>] __bad_area_nosemaphore+0x18c/0x1af
[ 2995.766977] [<ffffffff8101f16f>] ? do_page_fault+0xa8/0x32d
[ 2995.767161] [<ffffffff8101ef6a>] bad_area_nosemaphore+0x13/0x15
[ 2995.767330] [<ffffffff8101f23a>] do_page_fault+0x173/0x32d
[ 2995.767501] [<ffffffff810a9eca>] ? release_pages+0x1ee/0x200
[ 2995.767673] [<ffffffff813fdaa3>] ? error_sti+0x5/0x6
[ 2995.767842] [<ffffffff81063167>] ? trace_hardirqs_off_caller+0x1f/0xa9
[ 2995.768017] [<ffffffff813fc48e>] ? trace_hardirqs_off_thunk+0x3a/0x3c
[ 2995.768202] [<ffffffff810d26f5>] ? kmem_cache_free+0x56/0x129
[ 2995.768373] [<ffffffff813fd8bf>] page_fault+0x1f/0x30
[ 2995.768544] [<ffffffff810d26f5>] ? kmem_cache_free+0x56/0x129
[ 2995.768716] [<ffffffff8119c0d0>] ? do_raw_spin_trylock+0x4/0x3a
[ 2995.768888] [<ffffffff813fc6c3>] _raw_spin_lock+0x48/0x73
[ 2995.769064] [<ffffffff810c1ae3>] ? unlink_anon_vmas+0x40/0xe1
[ 2995.769246] [<ffffffff810c1ae3>] unlink_anon_vmas+0x40/0xe1
[ 2995.769415] [<ffffffff810bb562>] free_pgtables+0x68/0xce
[ 2995.769586] [<ffffffff810bd11e>] exit_mmap+0x100/0x182
[ 2995.769756] [<ffffffff81035b58>] mmput+0x43/0xea
[ 2995.769925] [<ffffffff81039e99>] exit_mm+0x110/0x11d
[ 2995.770099] [<ffffffff8103b8ed>] do_exit+0x1c5/0x6a2
[ 2995.770279] [<ffffffff81038f84>] ? kmsg_dump+0x13b/0x155
[ 2995.770447] [<ffffffff810060db>] ? oops_end+0x47/0x93
[ 2995.770616] [<ffffffff81006122>] oops_end+0x8e/0x93
[ 2995.770785] [<ffffffff8101ed99>] no_context+0x1fc/0x20b
[ 2995.770955] [<ffffffff8101ef34>] __bad_area_nosemaphore+0x18c/0x1af
[ 2995.771141] [<ffffffff8101f16f>] ? do_page_fault+0xa8/0x32d
[ 2995.771311] [<ffffffff8101ef6a>] bad_area_nosemaphore+0x13/0x15
[ 2995.771482] [<ffffffff8101f23a>] do_page_fault+0x173/0x32d
[ 2995.771653] [<ffffffff810802f9>] ? __call_rcu+0x11d/0x130
[ 2995.771824] [<ffffffff813fdaa3>] ? error_sti+0x5/0x6
[ 2995.771994] [<ffffffff81063167>] ? trace_hardirqs_off_caller+0x1f/0xa9
[ 2995.772179] [<ffffffff813fc48e>] ? trace_hardirqs_off_thunk+0x3a/0x3c
[ 2995.772352] [<ffffffff813fd8bf>] page_fault+0x1f/0x30
[ 2995.772524] [<ffffffff810c194d>] ? page_referenced+0xee/0x1dc
[ 2995.772696] [<ffffffff810c18df>] ? page_referenced+0x80/0x1dc
[ 2995.772867] [<ffffffff810c6d40>] ? swapcache_free+0x37/0x3c
[ 2995.773043] [<ffffffff810ac31d>] shrink_page_list+0x171/0x4b1
[ 2995.773226] [<ffffffff813fd1e6>] ? _raw_spin_unlock_irq+0x30/0x58
[ 2995.773398] [<ffffffff810ac9b9>] shrink_inactive_list+0x35c/0x623
[ 2995.773572] [<ffffffff810acd94>] ? shrink_zone+0x114/0x3d4
[ 2995.773742] [<ffffffff81064f29>] ? print_lock_contention_bug+0x1b/0xe1
[ 2995.773916] [<ffffffff813fc790>] ? _raw_spin_lock_irq+0x19/0x79
[ 2995.774093] [<ffffffff810acf8a>] shrink_zone+0x30a/0x3d4
[ 2995.774274] [<ffffffff810ad19e>] ? shrink_slab+0x14a/0x15c
[ 2995.774444] [<ffffffff810adb65>] do_try_to_free_pages+0x176/0x27f
[ 2995.774617] [<ffffffff8103de67>] ? irq_exit+0x93/0x95
[ 2995.774786] [<ffffffff810add03>] shrink_all_memory+0x95/0xc4
[ 2995.774958] [<ffffffff810ab0f0>] ? isolate_pages_global+0x0/0x217
[ 2995.775144] [<ffffffff81077503>] ? count_data_pages+0x65/0x79
[ 2995.775314] [<ffffffff8107776a>] hibernate_preallocate_memory+0x1aa/0x2cb
[ 2995.775487] [<ffffffff813f95b5>] ? printk+0x41/0x44
[ 2995.775657] [<ffffffff810760b3>] hibernation_snapshot+0x36/0x1e1
[ 2995.775828] [<ffffffff8107632c>] hibernate+0xce/0x172
[ 2995.775998] [<ffffffff81075099>] state_store+0x5c/0xd3
[ 2995.776182] [<ffffffff8118728f>] kobj_attr_store+0x17/0x19
[ 2995.776350] [<ffffffff81127b69>] sysfs_write_file+0x108/0x144
[ 2995.776521] [<ffffffff810d66ff>] vfs_write+0xb2/0x153
[ 2995.776690] [<ffffffff810641a9>] ? trace_hardirqs_on_caller+0x1f/0x14b
[ 2995.776863] [<ffffffff810d6863>] sys_write+0x4a/0x71
[ 2995.777038] [<ffffffff810021db>] system_call_fastpath+0x16/0x1b

--
Regards/Gruss,
Boris.

Linus Torvalds

unread,

Apr 6, 2010, 4:10:02 PM4/6/10

to

So again, I can show that the code has never actually been through the
loop. The above code decodes to:

0: 3b 56 10 cmp 0x10(%rsi),%edx
3: 73 1e jae 0x23
5: 48 83 fa f2 cmp $0xfffffffffffffff2,%rdx
9: 74 18 je 0x23

b: 48 8d 4d cc lea -0x34(%rbp),%rcx
f: 4d 89 f8 mov %r15,%r8
12: 48 89 df mov %rbx,%rdi
15: e8 4d f2 ff ff callq 0xfffffffffffff267
1a: 41 01 c4 add %eax,%r12d

1d: 83 7d cc 00 cmpl $0x0,-0x34(%rbp)
21: 74 19 je 0x3c

23: 4d 8b 6d 20 mov 0x20(%r13),%r13
27: 49 83 ed 20 sub $0x20,%r13
2b:* 49 8b 45 20 mov 0x20(%r13),%rax <-- trapping instruction

2f: 0f 18 08 prefetcht0 (%rax)

32: 49 8d 45 20 lea 0x20(%r13),%rax
36: 48 39 45 80 cmp %rax,-0x80(%rbp)
3a: 75 aa jne 0xffffffffffffffe6
3c: 4c 89 f7 mov %r14,%rdi
3f: e8 .byte 0xe8

and in your case, if we had gone through the loop, then %rax would still
contain the return value from page_referenced_one().

But %rax is a kernel pointer, and %r12d is 0.

So again, it's actually anon_vma.head.next that is NULL, not any of the
entries on the list itself.

Now, I can see several cases for this:

- the obvious one: anon_vma just wasn't correctly initialized, and is
missing a INIT_LIST_HEAD(&anon_vma->head). That's either a slab bug (we
don't have a whole lot of coverage of constructors), or somebody
allocated an anon_vma without using the anon_vma_cachep.

- Related to the above: perhaps the RCU freeing isn't working, or
slub/slab/slob ends up reusing the allocations for something else than
anonvma's, so together with the race _and_ an unlucky re-use, you get
some odd crud.

I haven't looked at the kernel config files: do they perhaps share the
same (odd?) SLUB/SLAB/SLOB config?

- anon_vma isn't actually an anonvma at all. 'page->mapping' was crud
with the low bit set. That sounds unlikely, but who knows. The ksm code
sets mapping to "stable_node + PAGE_MAPPING_ANON | PAGE_MAPPING_KSM"

Did people have KSM enabled?

.. and probably other things I haven't even thought about.

Linus

Steinar H. Gunderson

unread,

Apr 6, 2010, 4:50:01 PM4/6/10

to

On Tue, Apr 06, 2010 at 01:02:35PM -0700, Linus Torvalds wrote:
> I haven't looked at the kernel config files: do they perhaps share the
> same (odd?) SLUB/SLAB/SLOB config?

http://storage.sesse.net/config-crashing-2.6.34-rc2

> Did people have KSM enabled?

No KSM for me.

/* Steinar */
--
Homepage: http://www.sesse.net/

Borislav Petkov

unread,

Apr 6, 2010, 5:00:05 PM4/6/10

to

From: Linus Torvalds <torv...@linux-foundation.org>
Date: Tue, Apr 06, 2010 at 01:02:35PM -0700

I've added code to verify this and am suspend/resuming now... Wait a
minute, Linus, you're good! :) :

[ 873.083074] PM: Preallocating image memory...
[ 873.254359] NULL anon_vma->head.next, page 2182681

This is the page_to_pfn number.

Now, how do we track back to the place which is missing anon_vma->head
init? Can we use the struct page *page arg to page_referenced_anon()
somehow?

[ 873.254654] Pid: 3642, comm: hib.sh Not tainted 2.6.34-rc3-00288-gab195c5-dirty #3
[ 873.254904] Call Trace:
[ 873.255063] [<ffffffff810c0c28>] page_referenced+0xd3/0x219
[ 873.255212] [<ffffffff810c5fb0>] ? swapcache_free+0x37/0x3c
[ 873.255364] [<ffffffff810ab782>] shrink_page_list+0x14a/0x477
[ 873.255512] [<ffffffff810aa6e0>] ? isolate_pages_global+0xc4/0x1f0
[ 873.255662] [<ffffffff813f8a76>] ? _raw_spin_unlock_irq+0x30/0x58
[ 873.255811] [<ffffffff810abe06>] shrink_inactive_list+0x357/0x5e5
[ 873.255960] [<ffffffff810ab626>] ? shrink_active_list+0x232/0x244
[ 873.256112] [<ffffffff810ac39e>] shrink_zone+0x30a/0x3d4
[ 873.256264] [<ffffffff810acf79>] do_try_to_free_pages+0x176/0x27f
[ 873.256416] [<ffffffff810ad117>] shrink_all_memory+0x95/0xc4
[ 873.256564] [<ffffffff810aa61c>] ? isolate_pages_global+0x0/0x1f0
[ 873.256713] [<ffffffff81076e4c>] ? count_data_pages+0x65/0x79
[ 873.256862] [<ffffffff810770b3>] hibernate_preallocate_memory+0x1aa/0x2cb
[ 873.257036] [<ffffffff813f4f75>] ? printk+0x41/0x44
[ 873.257186] [<ffffffff81075a53>] hibernation_snapshot+0x36/0x1e1
[ 873.257337] [<ffffffff81075ccc>] hibernate+0xce/0x172
[ 873.257485] [<ffffffff81074a39>] state_store+0x5c/0xd3
[ 873.257634] [<ffffffff81184eff>] kobj_attr_store+0x17/0x19
[ 873.257783] [<ffffffff81125d43>] sysfs_write_file+0x108/0x144
[ 873.257932] [<ffffffff810d560f>] vfs_write+0xb2/0x153
[ 873.258084] [<ffffffff81063bd9>] ? trace_hardirqs_on_caller+0x1f/0x14b
[ 873.258237] [<ffffffff810d5773>] sys_write+0x4a/0x71
[ 873.258388] [<ffffffff810021db>] system_call_fastpath+0x16/0x1b

> - Related to the above: perhaps the RCU freeing isn't working, or
> slub/slab/slob ends up reusing the allocations for something else than
> anonvma's, so together with the race _and_ an unlucky re-use, you get
> some odd crud.
>
> I haven't looked at the kernel config files: do they perhaps share the
> same (odd?) SLUB/SLAB/SLOB config?

what is an odd SL[AOU]B config?

> - anon_vma isn't actually an anonvma at all. 'page->mapping' was crud
> with the low bit set. That sounds unlikely, but who knows. The ksm code
> sets mapping to "stable_node + PAGE_MAPPING_ANON | PAGE_MAPPING_KSM"
>
> Did people have KSM enabled?

Nope, KSM is off here.

--
Regards/Gruss,
Boris.

Steinar H. Gunderson

unread,

Apr 6, 2010, 5:10:01 PM4/6/10

to

On Tue, Apr 06, 2010 at 01:56:19PM -0700, Linus Torvalds wrote:
>>> Did people have KSM enabled?
>> No KSM for me.

> Ok, not anything odd there either, and you're not using any odd RCU setup
> either. Nothing odd at all strikes me about your config, in fact. Lots and
> lots of modules, but I guess it comes from some distro default config..

I think it was originally some distro config, yes, but that «config fork» was
at 2.6.16 or something...

Linus Torvalds

unread,

Apr 6, 2010, 5:10:02 PM4/6/10

to

On Tue, 6 Apr 2010, Steinar H. Gunderson wrote:

> On Tue, Apr 06, 2010 at 01:02:35PM -0700, Linus Torvalds wrote:
> > I haven't looked at the kernel config files: do they perhaps share the
> > same (odd?) SLUB/SLAB/SLOB config?
>
> http://storage.sesse.net/config-crashing-2.6.34-rc2

Ok, CONFIG_SLUB, which is the common case. Not likely to be buggy.

> > Did people have KSM enabled?
>
> No KSM for me.

Ok, not anything odd there either, and you're not using any odd RCU setup

either. Nothing odd at all strikes me about your config, in fact. Lots and
lots of modules, but I guess it comes from some distro default config..

Linus

Linus Torvalds

unread,

Apr 6, 2010, 5:40:02 PM4/6/10

to

On Tue, 6 Apr 2010, Borislav Petkov wrote:

> > So again, it's actually anon_vma.head.next that is NULL, not any of the
> > entries on the list itself.
> >
> > Now, I can see several cases for this:
> >
> > - the obvious one: anon_vma just wasn't correctly initialized, and is
> > missing a INIT_LIST_HEAD(&anon_vma->head). That's either a slab bug (we
> > don't have a whole lot of coverage of constructors), or somebody
> > allocated an anon_vma without using the anon_vma_cachep.
>
> I've added code to verify this and am suspend/resuming now... Wait a
> minute, Linus, you're good! :) :
>
> [ 873.083074] PM: Preallocating image memory...
> [ 873.254359] NULL anon_vma->head.next, page 2182681

Yeah, I was pretty sure of that thing.

I still don't see _how_ it happens, though. That 'struct anon_vma' is very
simple, and contains literally just the lock and that list_head.

Now, 'head.next' is kind of magical, because it contains that magic
low-bit "have I been locked" thing (see "vm_lock_anon_vma()" in
mm/mmap.c). But I'm not seeing anything else touching it.

And if you allocate a anon_vma the proper way, the SLUB constructor should
have made sure that the head is initialized. And no normal list operation
ever sets any list pointer to zero, although a "list_del()" on the first
list entry could do it if that first list entry had a NULL next pointer.

> Now, how do we track back to the place which is missing anon_vma->head
> init? Can we use the struct page *page arg to page_referenced_anon()
> somehow?

You might enable SLUB debugging (both SLUB_DEBUG _and_ SLUB_DEBUG_ON), and
then make the "object_err()" function in mm/slub.c be non-static. You
could call it when you see the problem, perhaps.

Or you could just add tests to both alloc_anon_vma() and free_anon_vma()
to check that 'list_empty(&anon_vma->head)' is true. I dunno.

> > I haven't looked at the kernel config files: do they perhaps share the
> > same (odd?) SLUB/SLAB/SLOB config?
>
> what is an odd SL[AOU]B config?

Probably anything but the default SLUB these days. But Steinar already
said he had SLUB, so it's unlikely to be something odd.

> > - anon_vma isn't actually an anonvma at all. 'page->mapping' was crud
> > with the low bit set. That sounds unlikely, but who knows. The ksm code
> > sets mapping to "stable_node + PAGE_MAPPING_ANON | PAGE_MAPPING_KSM"
> >
> > Did people have KSM enabled?
>
> Nope, KSM is off here.

Yeah, wasn't for Steinar either. So it doesn't look like it's any odd
corner case that depends on some odd configuration.

Linus

Borislav Petkov

unread,

Apr 6, 2010, 7:00:03 PM4/6/10

to

From: Linus Torvalds <torv...@linux-foundation.org>
Date: Tue, Apr 06, 2010 at 02:27:37PM -0700

Ok, I tried doing all you suggested and here's what came out. Please,
take this with a grain of salt because I'm almost falling asleep - even
the coffee is not working anymore so it could be just as well that I've
made a mistake somewhere (the new OOPS is a #GP, by the way), just
watch:

Source changes locally:

--
diff --git a/include/linux/slab.h b/include/linux/slab.h
index 4884462..0c11dfb 100644
--- a/include/linux/slab.h
+++ b/include/linux/slab.h
@@ -108,6 +108,8 @@ unsigned int kmem_cache_size(struct kmem_cache *);
const char *kmem_cache_name(struct kmem_cache *);
int kmem_ptr_validate(struct kmem_cache *cachep, const void *ptr);

+void object_err(struct kmem_cache *s, struct page *page, u8 *object, char *reason);
+
/*
* Please use this macro to create slab caches. Simply specify the
* name of the structure and maybe some flags that are listed above.
diff --git a/mm/rmap.c b/mm/rmap.c
index eaa7a09..7b35b3f 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -66,11 +66,24 @@ static struct kmem_cache *anon_vma_chain_cachep;

static inline struct anon_vma *anon_vma_alloc(void)
{
- return kmem_cache_alloc(anon_vma_cachep, GFP_KERNEL);
+ struct anon_vma *ret;
+ ret = kmem_cache_alloc(anon_vma_cachep, GFP_KERNEL);
+
+ if (!ret->head.next) {
+ printk("%s NULL anon_vma->head.next\n", __func__);
+ dump_stack();
+ }
+
+ return ret;
}

void anon_vma_free(struct anon_vma *anon_vma)
{
+ if (!anon_vma->head.next) {
+ printk("%s NULL anon_vma->head.next\n", __func__);
+ dump_stack();
+ }
+
kmem_cache_free(anon_vma_cachep, anon_vma);
}

@@ -494,6 +507,18 @@ static int page_referenced_anon(struct page *page,
return referenced;

mapcount = page_mapcount(page);
+
+ if (!anon_vma->head.next) {
+ printk(KERN_ERR "NULL anon_vma->head.next, page %lu\n",
+ page_to_pfn(page));
+
+ object_err(anon_vma_cachep, page, (u8 *)anon_vma, "NULL next");
+
+ dump_stack();
+
+ return referenced;
+ }
+
list_for_each_entry(avc, &anon_vma->head, same_anon_vma) {
struct vm_area_struct *vma = avc->vma;
unsigned long address = vma_address(page, vma);
diff --git a/mm/slub.c b/mm/slub.c
index b364844..bcf5416 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -477,7 +477,7 @@ static void print_trailer(struct kmem_cache *s, struct page *page, u8 *p)
dump_stack();
}

-static void object_err(struct kmem_cache *s, struct page *page,
+void object_err(struct kmem_cache *s, struct page *page,
u8 *object, char *reason)
{
slab_bug(s, "%s", reason);

---

do the same exercise of starting several guests and then shutting them
down, and hibernating at the same time. After having shutdown the
guests, start firefox and let it load a big html page and hibernate
while doing so, boom!

[ 269.104940] Freezing user space processes ... (elapsed 0.03 seconds) done.
[ 269.141953] Freezing remaining freezable tasks ... (elapsed 0.01 seconds) done.
[ 269.155115] PM: Preallocating image memory...
[ 269.423811] general protection fault: 0000 [#1] PREEMPT SMP
[ 269.424003] last sysfs file: /sys/power/state
[ 269.424003] CPU 0
[ 269.424003] Modules linked in: powernow_k8 cpufreq_ondemand cpufreq_powersave cpufreq_userspace freq_table cpufreq_co
nservative binfmt_misc kvm_amd kvm ipv6 vfat fat dm_crypt dm_mod ohci_hcd pcspkr edac_core k10temp 8250_pnp 8250 serial_
core
[ 269.424003]
[ 269.424003] Pid: 2617, comm: hib.sh Tainted: G W 2.6.34-rc3-00288-gab195c5-dirty #4 M3A78 PRO/System Product
Name
[ 269.424003] RIP: 0010:[<ffffffff810c0cb4>] [<ffffffff810c0cb4>] page_referenced+0x147/0x232
[ 269.424003] RSP: 0018:ffff88022a1218b8 EFLAGS: 00010246
[ 269.424003] RAX: ffff8802126fa468 RBX: ffffea000700b210 RCX: 0000000000000000
[ 269.424003] RDX: ffff8802126fa429 RSI: ffff8802126fa440 RDI: ffff88022dc3cb80
[ 269.424003] RBP: ffff88022a121938 R08: 0000000000000002 R09: 0000000000000000
[ 269.424003] R10: 0000000000000246 R11: ffff88021a030478 R12: 0000000000000000
[ 269.424003] R13: 002e2e2e002e2e0e R14: ffff8802126fa428 R15: ffff88022a121a00
[ 269.424003] FS: 00007fe2799796f0(0000) GS:ffff88000a000000(0000) knlGS:0000000000000000
[ 269.424003] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[ 269.424003] CR2: 00007fffdefb3880 CR3: 00000002171c0000 CR4: 00000000000006f0
[ 269.424003] DR0: 0000000000000090 DR1: 00000000000000a4 DR2: 00000000000000ff
[ 269.424003] DR3: 000000000000000f DR6: 00000000ffff0ff0 DR7: 0000000000000400
[ 269.424003] Process hib.sh (pid: 2617, threadinfo ffff88022a120000, task ffff88022dc3cb80)
[ 269.424003] Stack:
[ 269.424003] ffff8802126fa468 00000000813f8cfc ffffffff8165ae28 00000000000042e7
[ 269.424003] <0> ffff88022a1218f8 ffffffff810c6051 ffffea0006f968c8 ffffea0006f968c8
[ 269.424003] <0> ffff88022a121938 00000002810ab275 0000000006f96890 ffffea000700b238
[ 269.424003] Call Trace:
[ 269.424003] [<ffffffff810c6051>] ? swapcache_free+0x37/0x3c
[ 269.424003] [<ffffffff810ab79a>] shrink_page_list+0x14a/0x477
[ 269.424003] [<ffffffff813f8c36>] ? _raw_spin_unlock_irq+0x30/0x58
[ 269.424003] [<ffffffff810abe1e>] shrink_inactive_list+0x357/0x5e5
[ 269.424003] [<ffffffff810ac3b6>] shrink_zone+0x30a/0x3d4
[ 269.424003] [<ffffffff810acf91>] do_try_to_free_pages+0x176/0x27f
[ 269.424003] [<ffffffff810ad12f>] shrink_all_memory+0x95/0xc4
[ 269.424003] [<ffffffff810aa634>] ? isolate_pages_global+0x0/0x1f0
[ 269.424003] [<ffffffff81076e64>] ? count_data_pages+0x65/0x79
[ 269.424003] [<ffffffff810770cb>] hibernate_preallocate_memory+0x1aa/0x2cb
[ 269.424003] [<ffffffff813f5135>] ? printk+0x41/0x44
[ 269.424003] [<ffffffff81075a6b>] hibernation_snapshot+0x36/0x1e1
[ 269.424003] [<ffffffff81075ce4>] hibernate+0xce/0x172
[ 269.424003] [<ffffffff81074a51>] state_store+0x5c/0xd3
[ 269.424003] [<ffffffff81185097>] kobj_attr_store+0x17/0x19
[ 269.424003] [<ffffffff81125edb>] sysfs_write_file+0x108/0x144
[ 269.424003] [<ffffffff810d57a7>] vfs_write+0xb2/0x153
[ 269.424003] [<ffffffff81063bf1>] ? trace_hardirqs_on_caller+0x1f/0x14b
[ 269.424003] [<ffffffff810d590b>] sys_write+0x4a/0x71
[ 269.424003] [<ffffffff810021db>] system_call_fastpath+0x16/0x1b
[ 269.424003] Code: 3b 56 10 73 1e 48 83 fa f2 74 18 48 8d 4d cc 4d 89 f8 48 89 df e8 1e f2 ff ff 41 01 c4 83 7d cc 00

74 19 4d 8b 6d 20 49 83 ed 20 <49> 8b 45 20 0f 18 08 49 8d 45 20 48 39 45 80 75 aa 4c 89 f7 e8

[ 269.424003] RIP [<ffffffff810c0cb4>] page_referenced+0x147/0x232
[ 269.424003] RSP <ffff88022a1218b8>
[ 269.438405] ---[ end trace ad5b4172ee94398e ]---
[ 269.438553] note: hib.sh[2617] exited with preempt_count 2
[ 269.438709] BUG: scheduling while atomic: hib.sh/2617/0x10000003
[ 269.438858] INFO: lockdep is turned off.
[ 269.439075] Modules linked in: powernow_k8 cpufreq_ondemand cpufreq_powersave cpufreq_userspace freq_table cpufreq_co
nservative binfmt_misc kvm_amd kvm ipv6 vfat fat dm_crypt dm_mod ohci_hcd pcspkr edac_core k10temp 8250_pnp 8250 serial_core
[ 269.440875] Pid: 2617, comm: hib.sh Tainted: G D W 2.6.34-rc3-00288-gab195c5-dirty #4
[ 269.441137] Call Trace:
[ 269.441288] [<ffffffff81063107>] ? __debug_show_held_locks+0x1b/0x24
[ 269.441440] [<ffffffff8102d3c0>] __schedule_bug+0x72/0x77
[ 269.441590] [<ffffffff813f553e>] schedule+0xd9/0x730
[ 269.441741] [<ffffffff8103022c>] __cond_resched+0x18/0x24
[ 269.441891] [<ffffffff813f5c62>] _cond_resched+0x2c/0x37
[ 269.442045] [<ffffffff810b7d7d>] unmap_vmas+0x6ce/0x893
[ 269.442205] [<ffffffff810bc42f>] exit_mmap+0xd7/0x182
[ 269.442352] [<ffffffff81035951>] mmput+0x48/0xb9
[ 269.442502] [<ffffffff81039c21>] exit_mm+0x110/0x11d
[ 269.442652] [<ffffffff8103b663>] do_exit+0x1c5/0x691
[ 269.442802] [<ffffffff81038d0d>] ? kmsg_dump+0x13b/0x155
[ 269.442953] [<ffffffff810060db>] ? oops_end+0x47/0x93
[ 269.443107] [<ffffffff81006122>] oops_end+0x8e/0x93
[ 269.443262] [<ffffffff81006313>] die+0x5a/0x63
[ 269.443414] [<ffffffff81003eaf>] do_general_protection+0x134/0x13c
[ 269.443566] [<ffffffff813f90f0>] ? irq_return+0x0/0x2
[ 269.443716] [<ffffffff813f92cf>] general_protection+0x1f/0x30
[ 269.443867] [<ffffffff810c0cb4>] ? page_referenced+0x147/0x232
[ 269.444021] [<ffffffff810c0bf0>] ? page_referenced+0x83/0x232
[ 269.444176] [<ffffffff810c6051>] ? swapcache_free+0x37/0x3c
[ 269.444328] [<ffffffff810ab79a>] shrink_page_list+0x14a/0x477
[ 269.444479] [<ffffffff813f8c36>] ? _raw_spin_unlock_irq+0x30/0x58
[ 269.444630] [<ffffffff810abe1e>] shrink_inactive_list+0x357/0x5e5
[ 269.444782] [<ffffffff810ac3b6>] shrink_zone+0x30a/0x3d4
[ 269.444933] [<ffffffff810acf91>] do_try_to_free_pages+0x176/0x27f
[ 269.445087] [<ffffffff810ad12f>] shrink_all_memory+0x95/0xc4
[ 269.445243] [<ffffffff810aa634>] ? isolate_pages_global+0x0/0x1f0
[ 269.445396] [<ffffffff81076e64>] ? count_data_pages+0x65/0x79
[ 269.445547] [<ffffffff810770cb>] hibernate_preallocate_memory+0x1aa/0x2cb
[ 269.445698] [<ffffffff813f5135>] ? printk+0x41/0x44
[ 269.445848] [<ffffffff81075a6b>] hibernation_snapshot+0x36/0x1e1
[ 269.445999] [<ffffffff81075ce4>] hibernate+0xce/0x172
[ 269.446160] [<ffffffff81074a51>] state_store+0x5c/0xd3
[ 269.446307] [<ffffffff81185097>] kobj_attr_store+0x17/0x19
[ 269.446457] [<ffffffff81125edb>] sysfs_write_file+0x108/0x144
[ 269.446607] [<ffffffff810d57a7>] vfs_write+0xb2/0x153
[ 269.446757] [<ffffffff81063bf1>] ? trace_hardirqs_on_caller+0x1f/0x14b
[ 269.446908] [<ffffffff810d590b>] sys_write+0x4a/0x71
[ 269.447063] [<ffffffff810021db>] system_call_fastpath+0x16/0x1b

This time we have

[ 269.424003] RIP: 0010:[<ffffffff810c0cb4>] [<ffffffff810c0cb4>] page_referenced+0x147/0x232

which is offset 0x1104.

which is

10eb: 48 89 df mov %rbx,%rdi
10ee: e8 00 00 00 00 callq 10f3 <page_referenced+0x136>
10f3: 41 01 c4 add %eax,%r12d
10f6: 83 7d cc 00 cmpl $0x0,-0x34(%rbp)
10fa: 74 19 je 1115 <page_referenced+0x158>
10fc: 4d 8b 6d 20 mov 0x20(%r13),%r13
1100: 49 83 ed 20 sub $0x20,%r13
1104: 49 8b 45 20 mov 0x20(%r13),%rax <-------------------------
1108: 0f 18 08 prefetcht0 (%rax)
110b: 49 8d 45 20 lea 0x20(%r13),%rax
110f: 48 39 45 80 cmp %rax,-0x80(%rbp)
1113: 75 aa jne 10bf <page_referenced+0x102>
1115: 4c 89 f7 mov %r14,%rdi

and asm is

.loc 1 522 0
movq 32(%r13), %r13 # <variable>.same_anon_vma.next, __mptr.454
.LVL295:

subq $32, %r13 #, avc

.LVL296:
.L186:
.LBE1224:
movq 32(%r13), %rax # <variable>.same_anon_vma.next, <variable>.same_anon_vma.next <--------------
prefetcht0 (%rax) # <variable>.same_anon_vma.next
leaq 32(%r13), %rax #, tmp104
cmpq %rax, -128(%rbp) # tmp104, %sfp
jne .L189 #,
.L188:
.loc 1 540 0
movq %r14, %rdi # anon_vma,
call page_unlock_anon_vma #

and %r13 contains some funny stuff, could be some mangled SLUB debug
poison or something: R13: 002e2e2e002e2e0e. Maybe this is the reason for
the #GP.

But yes, even if the oopsing instruction is

movq 32(%r13), %rax # <variable>.same_anon_vma.next, <variable>.same_anon_vma.next

this is not same_anon_vma.next because we've come to the above
instruction through the ".L186:" label, before which we have %r13
already loaded with anon_vma->head.next.

To be continued...

--
Regards/Gruss,
Boris.

Linus Torvalds

unread,

Apr 6, 2010, 7:40:01 PM4/6/10

to

On Wed, 7 Apr 2010, Borislav Petkov wrote:
>
> Ok, I tried doing all you suggested and here's what came out. Please,
> take this with a grain of salt because I'm almost falling asleep - even
> the coffee is not working anymore so it could be just as well that I've
> made a mistake somewhere (the new OOPS is a #GP, by the way), just
> watch:

Hey ho, yeah.

The reason it's a #GP fault is that it's not a NULL pointer dereference
any more, but a wild pointer that is not in the legal region of pointers
on x86-64. That is also why your debugging code didn't catch it: the
pointer isn't NULL, so you got the #GP fault on the same old instruction:

2b:* 49 8b 45 20 mov 0x20(%r13),%rax <-- trapping instruction

for all the same old reasons.

But now %r13 has a non-zero value: 0x002e2e2e002e2e0e, which I do _not_
recognize as any of the normal poison values.

> and %r13 contains some funny stuff, could be some mangled SLUB debug
> poison or something: R13: 002e2e2e002e2e0e. Maybe this is the reason for
> the #GP.

Correct. You don't get a page fault if the pointer was totally bogus

> But yes, even if the oopsing instruction is
>
> movq 32(%r13), %rax # <variable>.same_anon_vma.next, <variable>.same_anon_vma.next
>
> this is not same_anon_vma.next because we've come to the above
> instruction through the ".L186:" label, before which we have %r13
> already loaded with anon_vma->head.next.

No, you're mis-reading the asm. It's again the first iteration, and the
code above it is again the end of the loop. And %rax is once more a kernel
pointer, not the return value of 'page_referenced_one()'.

So it once more is 'anon_vma->head.next' that is crap, but now it's not
NULL, it's that very odd 0x002e2e2e002e2e2e pattern (the %r13 has had 0x20
subtracted from it, so that LSB of "0x0e" is actually _also_ a 0x2e).

What does '0x2e' mean? It's ASCII '.', but that doesn't really mean
anything either.

Linus

Linus Torvalds

unread,

Apr 6, 2010, 7:50:02 PM4/6/10

to

On Wed, 7 Apr 2010, Borislav Petkov wrote:
> +
> + if (!anon_vma->head.next) {
> + printk(KERN_ERR "NULL anon_vma->head.next, page %lu\n",
> + page_to_pfn(page));
> +
> + object_err(anon_vma_cachep, page, (u8 *)anon_vma, "NULL next");

Oh, and since the debugging code never triggered ('head.next' wasn't
actually NULL), you never got here, but the 'page' you passed in to
object_error() should be the page of the slab allocation, not the page
associated with the anon_vma.

So it should be something like "virt_to_head_page(anon_vma)" that you pass
in to object_err().

Not that it matters. I assume it is the fact that SLAB debugging is on
that actually turns the NULL into a non-NULL thing. Poisoning is not
active for SLUb's with constructors or RCU-freeing, but things like
redzoning still are. So enabling SLUB debugging will change the offsets
within the pages of all the SLUB allocations. I wonder if that's just
what caused it to now have that 0x002e2e2e002e2e2e instead of NULL.

Linus

Rik van Riel

unread,

Apr 6, 2010, 8:00:01 PM4/6/10

to

When a new VMA has a mergeable anon_vma with a neighboring VMA,
make sure all of the neighbor's old anon_vma structs are also
linked in.

This is necessary because at some point the VMAs could get merged,
and we want to ensure no anon_vma structs get freed prematurely,
while the system still has anonymous pages that belong to those
structs.

Reported-by: Borislav Petkov <b...@alien8.de>
Signed-off-by: Rik van Riel <ri...@redhat.com>

---
include/linux/mm.h | 2 +-
mm/mmap.c | 6 +++---
mm/rmap.c | 20 +++++++++++++-------
3 files changed, 17 insertions(+), 11 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index e70f21b..90ac50e 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1228,7 +1228,7 @@ extern struct vm_area_struct *vma_merge(struct mm_struct *,
struct vm_area_struct *prev, unsigned long addr, unsigned long end,
unsigned long vm_flags, struct anon_vma *, struct file *, pgoff_t,
struct mempolicy *);
-extern struct anon_vma *find_mergeable_anon_vma(struct vm_area_struct *);
+extern struct vm_area_struct *find_mergeable_anon_vma(struct vm_area_struct *);
extern int split_vma(struct mm_struct *,
struct vm_area_struct *, unsigned long addr, int new_below);
extern int insert_vm_struct(struct mm_struct *, struct vm_area_struct *);
diff --git a/mm/mmap.c b/mm/mmap.c
index 75557c6..bf0600c 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -832,7 +832,7 @@ struct vm_area_struct *vma_merge(struct mm_struct *mm,
* anon_vmas being allocated, preventing vma merge in subsequent
* mprotect.
*/
-struct anon_vma *find_mergeable_anon_vma(struct vm_area_struct *vma)
+struct vm_area_struct *find_mergeable_anon_vma(struct vm_area_struct *vma)
{
struct vm_area_struct *near;
unsigned long vm_flags;
@@ -855,7 +855,7 @@ struct anon_vma *find_mergeable_anon_vma(struct vm_area_struct *vma)
can_vma_merge_before(near, vm_flags,
NULL, vma->vm_file, vma->vm_pgoff +
((vma->vm_end - vma->vm_start) >> PAGE_SHIFT)))
- return near->anon_vma;
+ return near;
try_prev:
/*
* It is potentially slow to have to call find_vma_prev here.
@@ -875,7 +875,7 @@ try_prev:
mpol_equal(vma_policy(near), vma_policy(vma)) &&
can_vma_merge_after(near, vm_flags,
NULL, vma->vm_file, vma->vm_pgoff))
- return near->anon_vma;
+ return near;
none:
/*
* There's no absolute need to look only at touching neighbours:
diff --git a/mm/rmap.c b/mm/rmap.c
index eaa7a09..60616db 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -119,20 +119,26 @@ int anon_vma_prepare(struct vm_area_struct *vma)
might_sleep();
if (unlikely(!anon_vma)) {
struct mm_struct *mm = vma->vm_mm;
+ struct vm_area_struct *merge_vma;
struct anon_vma *allocated;

+ merge_vma = find_mergeable_anon_vma(vma);
+ if (merge_vma) {
+ if (anon_vma_clone(vma, merge_vma))
+ goto out_enomem;
+ return 0;
+ }
+
avc = anon_vma_chain_alloc();
if (!avc)
goto out_enomem;

- anon_vma = find_mergeable_anon_vma(vma);
allocated = NULL;
- if (!anon_vma) {
- anon_vma = anon_vma_alloc();
- if (unlikely(!anon_vma))
- goto out_enomem_free_avc;
- allocated = anon_vma;
- }
+ anon_vma = anon_vma_alloc();
+ if (unlikely(!anon_vma))
+ goto out_enomem_free_avc;
+ allocated = anon_vma;
+
spin_lock(&anon_vma->lock);

/* page_table_lock to protect against threads */

Rik van Riel

unread,

Apr 6, 2010, 8:00:01 PM4/6/10

to

On 04/06/2010 05:27 PM, Linus Torvalds wrote:

> I still don't see _how_ it happens, though. That 'struct anon_vma' is very
> simple, and contains literally just the lock and that list_head.

It gets more fun. It looks like the anon_vma is only
allocated through anon_vma_alloc() and only handled
by the functions in rmap.c

By themselves, all of those functions look alright.

However, I think I may have found a possible bug in
the interplay between anon_vma_prepare() and vma_adjust(),
across several mprotect invocations.

Let me explain what I think may be going on in small
steps, since it is quite subtle (assuming I am right).

1) a process forks, creating a second "layer" of
anon_vma objects for the VMAs that have anon pages

2) a new VMA is created adjacant to an existing one,
with different permissions

3) anon_vma_prepare is called on the new VMA, this
only links the "top" anon_vma to the new VMA, since
that is the anon_vma where all new pages get
instantiated anyway (this would be part of the bug)

4) mprotect changes the permission of one of the VMAs,
causing the old and the new VMAs to get merged

5) vma_adjust calls anon_vma_merge, causing the anon_vma
chain of one of the VMAs to get nuked - with bad luck,
this is the original one, leaving just the new anon_vma
attached to the VMA

6) if the parent process quits, the old anon_vma structs
get freed

7) meanwhile, we may still have some anonymous pages
stick around in memory that have their page->mapping
point to a freed anon_vma struct

Does this look like it could happen?

If so, I'll cook up a patch to change anon_vma_prepare
and find_mergeable_anon_vma to attach the whole chain
of anon_vmas to the new VMA, using anon_vma_clone().

Linus Torvalds

unread,

Apr 6, 2010, 8:20:02 PM4/6/10

to

On Tue, 6 Apr 2010, Rik van Riel wrote:
>
> It gets more fun. It looks like the anon_vma is only
> allocated through anon_vma_alloc() and only handled
> by the functions in rmap.c
>
> By themselves, all of those functions look alright.

Yes. Very trivially so, in fact.

> However, I think I may have found a possible bug in
> the interplay between anon_vma_prepare() and vma_adjust(),
> across several mprotect invocations.
>
> Let me explain what I think may be going on in small
> steps, since it is quite subtle (assuming I am right).

Sounds at least possible. Way more likely than any of the "trivially
obvious" code being buggy, or the SLUB layer suddenly having a serious bug
that only the new user could trigger.

That said, the code that _really_ confuses me is the stuff that uses
"anon_vma_clone()". Could you please also explain the code flow of
vma_adjust() to mere mortals, please?

I suspect Borislav is sleeping. But at least we have a patch for him to
test when he wakes up ;)

Linus

Rik van Riel

unread,

Apr 6, 2010, 9:30:02 PM4/6/10

to

On 04/06/2010 08:10 PM, Linus Torvalds wrote:

> That said, the code that _really_ confuses me is the stuff that uses
> "anon_vma_clone()". Could you please also explain the code flow of
> vma_adjust() to mere mortals, please?

That's easier said than done. I spent 3 days with pen and paper,
going over that code before I made the anon_vma changes, first
verifying that the code is indeed correct and then figuring out
how I could make the anon_vma changes safely.

I am not happy with the complexity of the code around vma_adjust,
but could not find a way to simplify it and still keep merging
VMAs the way we do.

My largest change to vma_adjust was moving some code closer to
the beginning of the function, so I could bail out if the
allocation failed, without making change to the vma...

> I suspect Borislav is sleeping. But at least we have a patch for him to
> test when he wakes up ;)

I am looking forward to the test results.

KOSAKI Motohiro

unread,

Apr 7, 2010, 3:10:02 AM4/7/10

to

> When a new VMA has a mergeable anon_vma with a neighboring VMA,
> make sure all of the neighbor's old anon_vma structs are also
> linked in.
>
> This is necessary because at some point the VMAs could get merged,
> and we want to ensure no anon_vma structs get freed prematurely,
> while the system still has anonymous pages that belong to those
> structs.

Ahhhh, I'm shame myself. sure, neighbor vma might have lots avc ;-)
few comments are blow.

Hmm.. probably I'm moron.
I'm also confusing this locking rule as same as linus said.

after this patch, new locking order are

down_read(mmap_sem)
anon_vma_clone(vma, merge_vma)
list_add(&avc->same_vma, &vma->anon_vma_chain);
spin_lock(&anon_vma->lock);
list_add_tail(&avc->same_anon_vma, &anon_vma->head);
spin_unlock(&anon_vma->lock);
spin_lock(&anon_vma->lock);
spin_lock(&mm->page_table_lock);

So, Why mmap_sem read lock can protect vma->anon_vma_chain?
An another threads seems to be able to change avc list concurrentlly and freely.

plus, Why don't we need "vma->anon_vma = merge_vma->anon_vma" assignment?
if vma->anon_vma keep NULL, I think anon_vma_prepare() call anon_vma_clone()
multiple times.

Borislav Petkov

unread,

Apr 7, 2010, 3:30:01 AM4/7/10

to

From: Linus Torvalds <torv...@linux-foundation.org>
Date: Tue, Apr 06, 2010 at 04:27:42PM -0700

> No, you're mis-reading the asm. It's again the first iteration, and the
> code above it is again the end of the loop. And %rax is once more a kernel
> pointer, not the return value of 'page_referenced_one()'.
>
> So it once more is 'anon_vma->head.next' that is crap, but now it's not
> NULL, it's that very odd 0x002e2e2e002e2e2e pattern (the %r13 has had 0x20
> subtracted from it, so that LSB of "0x0e" is actually _also_ a 0x2e).

No, maybe I expressed myself wrong (it was late an' all) - I was
basically trying to confirm your assessment that anon_vma->head.next
is crap but the code had changed since I had added the debugging 'if
(!anon_vma->head.next)' and that was the value that was already in %r13
before iterating over the list chain.

Yeah, just a minor nitpick and not that it matters. Nevermind though,
we're on the same page.

--
Regards/Gruss,
Boris.

Borislav Petkov

unread,

Apr 7, 2010, 3:30:02 AM4/7/10

to

From: Rik van Riel <ri...@redhat.com>

Date: Tue, Apr 06, 2010 at 09:18:28PM -0400

Hi Rik,

I think your patch needs a bit more baking, see below :)

> >I suspect Borislav is sleeping. But at least we have a patch for him to
> >test when he wakes up ;)
>
> I am looking forward to the test results.

This happens when starting X, I haven't even started hibernating.

[By the way, further testing will have to wait till tonight since I
have a job, you know :) ]

Also, mm/rmap.c:745 is

BUG_ON(!anon_vma);

in __page_set_anon_rmap().

---
[ 43.142371] ------------[ cut here ]------------
[ 43.142411] kernel BUG at mm/rmap.c:745!
[ 43.142436] invalid opcode: 0000 [#1] PREEMPT SMP
[ 43.142514] last sysfs file: /sys/devices/virtual/vtconsole/vtcon0/uevent
[ 43.142537] CPU 0
[ 43.142559] Modules linked in: powernow_k8 cpufreq_ondemand cpufreq_powersave cpufreq_userspace freq_table cpufreq_conservative binfmt_misc kvm_amd kvm ipv6 vfat fat dm_crypt dm_mod 8250_pnp 8250 edac_core serial_core k10temp ohci_hcd pcspkr
[ 43.142997]
[ 43.143012] Pid: 1940, comm: console-kit-dae Not tainted 2.6.34-rc3-00289-gae1ed76 #5 M3A78 PRO/System Product Name
[ 43.143012] RIP: 0010:[<ffffffff810c08e7>] [<ffffffff810c08e7>] page_add_new_anon_rmap+0x3b/0x89
[ 43.143012] RSP: 0000:ffff88022c019da8 EFLAGS: 00010246
[ 43.143012] RAX: 0000000000000000 RBX: ffffea000774ff78 RCX: 000000002ce900f4
[ 43.143012] RDX: ffff88000a1d5dc8 RSI: 0000000000000007 RDI: ffffffff816e8740
[ 43.143012] RBP: ffff88022c019dc8 R08: 00007f29e3cfd928 R09: 000000000062c318
[ 43.143012] R10: 0000000000000000 R11: 0000000000000002 R12: ffff88022bbad960
[ 43.143012] R13: 00007f29e3cfd928 R14: 00007f29e3cfd928 R15: 80000002216d9067
[ 43.143012] FS: 00007f29e3d0f790(0000) GS:ffff88000a000000(0000) knlGS:0000000000000000
[ 43.143012] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 43.143012] CR2: 00007f29e3cfd928 CR3: 000000022dfd3000 CR4: 00000000000006f0
[ 43.143012] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 43.143012] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[ 43.143012] Process console-kit-dae (pid: 1940, threadinfo ffff88022c018000, task ffff88022ce90000)
[ 43.143012] Stack:
[ 43.143012] ffffffff810b8802 ffff88022bbad960 ffff88022ea3c600 ffff88022bb6d7e8
[ 43.143012] <0> ffff88022c019e48 ffffffff810b8823 ffff88022ea3c6b8 0000000000000246
[ 43.143012] <0> ffffea000774ff78 0000000000000001 00000001e3cfd928 ffff88022fdb58f0
[ 43.143012] Call Trace:
[ 43.143012] [<ffffffff810b8802>] ? handle_mm_fault+0x2af/0x64e
[ 43.143012] [<ffffffff810b8823>] handle_mm_fault+0x2d0/0x64e
[ 43.143012] [<ffffffff8101f392>] do_page_fault+0x30b/0x32d
[ 43.143012] [<ffffffff810615ce>] ? put_lock_stats+0xe/0x27
[ 43.143012] [<ffffffff81062a55>] ? lock_release_holdtime+0x104/0x109
[ 43.143012] [<ffffffff813f93e3>] ? error_sti+0x5/0x6
[ 43.143012] [<ffffffff813f7de2>] ? trace_hardirqs_off_thunk+0x3a/0x3c
[ 43.143012] [<ffffffff813f91ff>] page_fault+0x1f/0x30
[ 43.143012] Code: 00 00 48 89 fb 49 89 f4 49 89 d5 f0 80 4f 02 10 be 07 00 00 00 c7 47 0c 00 00 00 00 e8 c5 30 ff ff 49 8b 44 24 78 48 85 c0 75 04 <0f> 0b eb fe 48 ff c0 4c 89 e6 48 89 df 48 89 43 18 4d 2b 6c 24
[ 43.143012] RIP [<ffffffff810c08e7>] page_add_new_anon_rmap+0x3b/0x89
[ 43.143012] RSP <ffff88022c019da8>
[ 43.145276] ---[ end trace d6305f6e826dbd53 ]---
[ 43.145314] note: console-kit-dae[1940] exited with preempt_count 1
[ 73.644201] ------------[ cut here ]------------
[ 73.644218] kernel BUG at mm/rmap.c:745!
[ 73.644226] invalid opcode: 0000 [#2] PREEMPT SMP
[ 73.644266] last sysfs file: /sys/devices/system/cpu/cpu3/cpufreq/scaling_cur_freq
[ 73.644278] CPU 0
[ 73.644287] Modules linked in: powernow_k8 cpufreq_ondemand cpufreq_powersave cpufreq_userspace freq_table cpufreq_conservative binfmt_misc kvm_amd kvm ipv6 vfat fat dm_crypt dm_mod 8250_pnp 8250 edac_core serial_core k10temp ohci_hcd pcspkr
[ 73.644509]
[ 73.644520] Pid: 2018, comm: iceowl-bin Tainted: G D 2.6.34-rc3-00289-gae1ed76 #5 M3A78 PRO/System Product Name
[ 73.644534] RIP: 0010:[<ffffffff810c08e7>] [<ffffffff810c08e7>] page_add_new_anon_rmap+0x3b/0x89
[ 73.644553] RSP: 0000:ffff88022cd37da8 EFLAGS: 00010246
[ 73.644562] RAX: 0000000000000000 RBX: ffffea000764dfa8 RCX: 0000000000000002
[ 73.644572] RDX: ffff88000a1d5dc8 RSI: 0000000000000007 RDI: ffffffff816e8740
[ 73.644589] RBP: ffff88022cd37dc8 R08: 00007f2ce0aab928 R09: 0000000000000000
[ 73.644603] R10: 0000000000000000 R11: 000000000011da32 R12: ffff88022d5894b0
[ 73.644615] R13: 00007f2ce0aab928 R14: 00007f2ce0aab928 R15: 800000021cd23067
[ 73.644628] FS: 00007f2cee88b7b0(0000) GS:ffff88000a000000(0000) knlGS:0000000000000000
[ 73.644639] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 73.644652] CR2: 00007f2ce0aab928 CR3: 000000022b1b5000 CR4: 00000000000006f0
[ 73.644664] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 73.644675] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[ 73.644690] Process iceowl-bin (pid: 2018, threadinfo ffff88022cd36000, task ffff88022a74a5c0)
[ 73.644701] Stack:
[ 73.644708] ffffffff810b8802 ffff88022d5894b0 ffff88022ce41e00 ffff88022d4b0558
[ 73.644745] <0> ffff88022cd37e48 ffffffff810b8823 ffff88022ce41eb8 0000000000000246
[ 73.644801] <0> ffffea000764dfa8 0000000000000001 00000001e0aab928 ffff88022c0a4828
[ 73.644862] Call Trace:
[ 73.644874] [<ffffffff810b8802>] ? handle_mm_fault+0x2af/0x64e
[ 73.644885] [<ffffffff810b8823>] handle_mm_fault+0x2d0/0x64e
[ 73.644895] [<ffffffff8101f392>] do_page_fault+0x30b/0x32d
[ 73.644909] [<ffffffff810be3c2>] ? do_mmap_pgoff+0x290/0x2f3
[ 73.644921] [<ffffffff813f93e3>] ? error_sti+0x5/0x6
[ 73.644932] [<ffffffff81062b97>] ? trace_hardirqs_off_caller+0x1f/0xa9
[ 73.644943] [<ffffffff813f7de2>] ? trace_hardirqs_off_thunk+0x3a/0x3c
[ 73.644952] [<ffffffff813f91ff>] page_fault+0x1f/0x30
[ 73.644963] Code: 00 00 48 89 fb 49 89 f4 49 89 d5 f0 80 4f 02 10 be 07 00 00 00 c7 47 0c 00 00 00 00 e8 c5 30 ff ff 49 8b 44 24 78 48 85 c0 75 04 <0f> 0b eb fe 48 ff c0 4c 89 e6 48 89 df 48 89 43 18 4d 2b 6c 24
[ 73.645001] RIP [<ffffffff810c08e7>] page_add_new_anon_rmap+0x3b/0x89
[ 73.645001] RSP <ffff88022cd37da8>
[ 73.645610] ---[ end trace d6305f6e826dbd54 ]---
[ 73.645621] note: iceowl-bin[2018] exited with preempt_count 1
[ 77.562222] SysRq : HELP : loglevel(0-9) reBoot Crash show-all-locks(D) terminate-all-tasks(E) memory-full-oom-kill(F) kill-all-tasks(I) thaw-filesystems(J) saK show-backtrace-all-active-cpus(L) show-memory-usage(M) nice-all-RT-tasks(N) powerOff show-registers(P) show-all-timers(Q) unRaw Sync show-task-states(T) Unmount show-blocked-tasks(W) dump-ftrace-buffer(Z)
[ 78.014120] SysRq : Emergency Sync
[ 78.016864] Emergency Sync complete
[ 78.585045] SysRq : Emergency Remount R/O
[ 78.663367] Emergency Remount complete
[ 79.098126] SysRq : Resetting

--
Regards/Gruss,
Boris.

Peter Zijlstra

unread,

Apr 7, 2010, 4:40:01 AM4/7/10

to

On Tue, 2010-04-06 at 11:28 -0700, Linus Torvalds wrote:
> Just as an example of the kind of code that makes me worry:
>
> void unlink_anon_vmas(struct vm_area_struct *vma)
> {
> struct anon_vma_chain *avc, *next;
>
> /* Unlink each anon_vma chained to the VMA. */
> list_for_each_entry_safe(avc, next, &vma->anon_vma_chain, same_vma) {
> anon_vma_unlink(avc);
> list_del(&avc->same_vma);
> anon_vma_chain_free(avc);
> }
> }
>
> Now, think about what happens for the *last* entry in that avc chain. It
> will call that "anon_vma_unlink()" thing, which will delete perhaps the
> last entry in the "same_anon_vma" one, and then it does
>
> if (empty)
> anon_vma_free(anon_vma);
>
> *before* unlink_anon_vma's has actually does that
>
> list_del(&avc->same_vma);
>
> and what we essentially have is a stale anon_vma_chain entry that still
> exists on that same_vma list, and points to an anon_vma that already got
> deleted.
>
> Does it matter? I really can't see that it does.

I think it does, the anon_vma thing has an RCU destroyed slab, but that
doesn't mean the anon_vma object itself is rcu delayed. The moment we
free it it can be re-used. So the above use after free is a bug.

Peter Zijlstra

unread,

Apr 7, 2010, 4:40:01 AM4/7/10

to

On Tue, 2010-04-06 at 08:55 -0700, Linus Torvalds wrote:
> I do wonder if "page_lock_anon_vma()" should check the whole
> "page_mapped()" case _after_ taking the anon_vma lock. Because if the race
> happens, we're following a anon_vma list that has nothing to do with that
> page (it's stilla _valid_ list, since we locked the anon_vma, but will it
> be ok?)
>
> IOW, what is it that really keeps the anon_vma list reliable _and_
> relevant wrt the page? We know we may get a stale anon_vma, are we ok if
> that anon_vma list doesn't actually have anything to do with the page any
> more?

When doing the whole make i_mmap_lock/anon_vma->lock a mutex thing last
week I ran into the same issue and its on my todo list to find out wth
is happening there.

So yes I think we should move that validation check inside
page_lock_anon_vma().

I'll cook up a patch once I'm done staring at the various funny arch
mmu_gather implementations.

Peter Zijlstra

unread,

Apr 7, 2010, 4:50:02 AM4/7/10

to

On Tue, 2010-04-06 at 13:02 -0700, Linus Torvalds wrote:
> - Related to the above: perhaps the RCU freeing isn't working, or
> slub/slab/slob ends up reusing the allocations for something else than
> anonvma's, so together with the race _and_ an unlucky re-use, you get
> some odd crud.
>
> I haven't looked at the kernel config files: do they perhaps share the
> same (odd?) SLUB/SLAB/SLOB config?

Right, so anon_vma uses SLAB_DESTROY_BY_RCU and as the huge comment in
rmap.c explains, that doesn't mean the objects themself get RCU grace
period delays in freeing, only the SLAB that backs these objects does.

So the moment you do kmem_cache_free() on the anon_vma it can be re-used
for another allocation. The only guarantee given by RCU is that the
backing storage doesn't go away and hence you can 'safely' deref
pointers, you still very much have to revalidate you got the object you
were looking for.

Johannes Weiner

unread,

Apr 7, 2010, 5:20:01 AM4/7/10

to

On Wed, Apr 07, 2010 at 10:36:43AM +0200, Peter Zijlstra wrote:
> On Tue, 2010-04-06 at 11:28 -0700, Linus Torvalds wrote:
> > Just as an example of the kind of code that makes me worry:
> >
> > void unlink_anon_vmas(struct vm_area_struct *vma)
> > {
> > struct anon_vma_chain *avc, *next;
> >
> > /* Unlink each anon_vma chained to the VMA. */
> > list_for_each_entry_safe(avc, next, &vma->anon_vma_chain, same_vma) {
> > anon_vma_unlink(avc);
> > list_del(&avc->same_vma);
> > anon_vma_chain_free(avc);
> > }
> > }
> >
> > Now, think about what happens for the *last* entry in that avc chain. It
> > will call that "anon_vma_unlink()" thing, which will delete perhaps the
> > last entry in the "same_anon_vma" one, and then it does
> >
> > if (empty)
> > anon_vma_free(anon_vma);
> >
> > *before* unlink_anon_vma's has actually does that
> >
> > list_del(&avc->same_vma);
> >
> > and what we essentially have is a stale anon_vma_chain entry that still
> > exists on that same_vma list, and points to an anon_vma that already got
> > deleted.
> >
> > Does it matter? I really can't see that it does.
>
> I think it does, the anon_vma thing has an RCU destroyed slab, but that
> doesn't mean the anon_vma object itself is rcu delayed. The moment we
> free it it can be re-used. So the above use after free is a bug.

It frees avc->anon_vma, not avc. So the sequence is

free(avc->anon_vma) in anon_vma_unlink()
list_del(&avc->same_vma) in unlink_anon_vmas()

It's not a use-after free. A problem would be if somebody should find the
avc through this list (it is the vma->anon_vma_chain list) when its anon_vma
pointer is invalid.

I don't think this can happen, however. Both the unlinking and the looking
at the list happen under vma->vm_mm's mmap_sem held for writing.

Peter Zijlstra

unread,

Apr 7, 2010, 5:40:01 AM4/7/10

to

Sure, freeing avc does not involve RCU in any way.

> So the sequence is
>
> free(avc->anon_vma) in anon_vma_unlink()
> list_del(&avc->same_vma) in unlink_anon_vmas()
>
> It's not a use-after free. A problem would be if somebody should find the
> avc through this list (it is the vma->anon_vma_chain list) when its anon_vma
> pointer is invalid.
>
> I don't think this can happen, however. Both the unlinking and the looking
> at the list happen under vma->vm_mm's mmap_sem held for writing.

What I was worried about was it freeing anon_vma and then still having
the avc on list. But I guess that cannot happen because it only frees if
its actually empty.

Pekka Enberg

unread,

Apr 7, 2010, 6:10:01 AM4/7/10

to

Hi Linus,

On Wed, Apr 7, 2010 at 3:10 AM, Linus Torvalds
<torv...@linux-foundation.org> wrote:
> Sounds at least possible. Way more likely than any of the "trivially
> obvious" code being buggy, or the SLUB layer suddenly having a serious bug
> that only the new user could trigger.

I haven't followed the discussion at all but if someone wants to
investigate that angle more, the most likely suspect are the recent
per-cpu changes. That said, I'd expect the problem to be more
widespread if SLUB is to blame here.

Pekka

KOSAKI Motohiro

unread,

Apr 7, 2010, 6:20:01 AM4/7/10

to

> Hi Linus,
>
> On Wed, Apr 7, 2010 at 3:10 AM, Linus Torvalds
> <torv...@linux-foundation.org> wrote:
> > Sounds at least possible. Way more likely than any of the "trivially
> > obvious" code being buggy, or the SLUB layer suddenly having a serious bug
> > that only the new user could trigger.
>
> I haven't followed the discussion at all but if someone wants to
> investigate that angle more, the most likely suspect are the recent
> per-cpu changes. That said, I'd expect the problem to be more
> widespread if SLUB is to blame here.

Nope. We don't doubt SLUB nor per-cpu anymore. Rik found the bug in his patch.

thanks.

Paulo Marques

unread,

Apr 7, 2010, 10:10:01 AM4/7/10

to

Linus Torvalds wrote:
> [...]

> So it once more is 'anon_vma->head.next' that is crap, but now it's not
> NULL, it's that very odd 0x002e2e2e002e2e2e pattern (the %r13 has had 0x20
> subtracted from it, so that LSB of "0x0e" is actually _also_ a 0x2e).
>
> What does '0x2e' mean? It's ASCII '.', but that doesn't really mean
> anything either.

Just a wild shot in the dark: it can be a couple of gray pixels with
intensity 0x2e at some 32 bits per pixel mode. I say this because of the
zero bytes there and someone mentioning seeing the problem when starting X.

--
Paulo Marques - www.grupopie.com

"Don't worry, you'll be fine; I saw it work in a cartoon once..."

Rik van Riel

unread,

Apr 7, 2010, 10:20:02 AM4/7/10

to

On 04/07/2010 04:36 AM, Peter Zijlstra wrote:
> On Tue, 2010-04-06 at 11:28 -0700, Linus Torvalds wrote:

>> if (empty)
>> anon_vma_free(anon_vma);
>>
>> *before* unlink_anon_vma's has actually does that
>>
>> list_del(&avc->same_vma);
>>
>> and what we essentially have is a stale anon_vma_chain entry that still
>> exists on that same_vma list, and points to an anon_vma that already got
>> deleted.
>>
>> Does it matter? I really can't see that it does.
>
> I think it does, the anon_vma thing has an RCU destroyed slab, but that
> doesn't mean the anon_vma object itself is rcu delayed. The moment we
> free it it can be re-used. So the above use after free is a bug.

Peter, the avc is an anon_vma_chain, which is a different
object than the anon_vma itself. There is no use after free
of an anon_vma object in unlink_anon_vmas + anon_vma_unlink.

Borislav Petkov

unread,

Apr 7, 2010, 10:20:02 AM4/7/10

to

From: Paulo Marques <pmar...@grupopie.com>
Date: Wed, Apr 07, 2010 at 03:05:50PM +0100

> Linus Torvalds wrote:
> > [...]
> > So it once more is 'anon_vma->head.next' that is crap, but now it's not
> > NULL, it's that very odd 0x002e2e2e002e2e2e pattern (the %r13 has had 0x20
> > subtracted from it, so that LSB of "0x0e" is actually _also_ a 0x2e).
> >
> > What does '0x2e' mean? It's ASCII '.', but that doesn't really mean
> > anything either.
>
> Just a wild shot in the dark: it can be a couple of gray pixels with
> intensity 0x2e at some 32 bits per pixel mode. I say this because of the
> zero bytes there and someone mentioning seeing the problem when starting X.

I don't think those are related: the problem when X was starting happens
with Rik' newest patch and the funny %r13 value happened after enabling
SLUB debugging last night.

Thanks.

--
Regards/Gruss,
Boris.

--
Advanced Micro Devices, Inc.
Operating Systems Research Center

Rik van Riel

unread,

Apr 7, 2010, 11:00:02 AM4/7/10

to

On 04/07/2010 03:00 AM, KOSAKI Motohiro wrote:

> Hmm.. probably I'm moron.

Someone might be, but it's not you :)

> I'm also confusing this locking rule as same as linus said.
>
> after this patch, new locking order are

> So, Why mmap_sem read lock can protect vma->anon_vma_chain?

> An another threads seems to be able to change avc list concurrentlly and freely.

You are right, the code needs to take the pagetable_lock
around the call to anon_vma_clone, so other threads
get locked out.

This means the locking order has now been inverted,
with the pagetable_lock on the outside and the
anon_vma locks on the inside.

I have checked all the other call sites to the
anon_vma code. The direct callers of anon_vma_clone
and anon_vma_fork already hold the mmap_sem for
write. The callers of anon_vma_prepare hold the
mmap_sem for read - so excluding other callers of
anon_vma_prepare with the page_table_lock is enough.

mm_take_all_locks has the mmap_sem for write.

There seem to be no other traversals of the same_vma
list, so changing the locking order to have the
page_table_lock on the outside of the anon_vma locks
works.

> plus, Why don't we need "vma->anon_vma = merge_vma->anon_vma" assignment?
> if vma->anon_vma keep NULL, I think anon_vma_prepare() call anon_vma_clone()
> multiple times.

Added in the new version. See the next email.