Linux 2.6.17-rc2

Linus Torvalds

Apr 18, 2006, 11:29:34 PM

to Linux Kernel Mailing List

Instead of the normal one-week release schedule, there were two weeks
between 2.6.17-rc1 and -rc2, partly because I was travelling for one of
those weeks, but partly because it was really quiet for a while. Likely a
lot of people are concentrating on 2.6.16 and vendor releases.

It picked up a bit in the last few days (it's also possible that the US
people were all just stressed out over tax season ;), and I cut a
2.6.17-rc2. I expect to be back to the weekly schedule now, even if it is
quiet (which I hope it will be).

Not a lot of hugely interesting stuff, with a large portion of the diff
being a late MIPS update (tssk tssk), and the huge diff from the
long-overdue removal of the Sangoma WAN drivers that have been marked
BROKEN for a long time. Same goes for the qlogicfc driver (which has been
supplanted by the qla2xxx driver).

As a result, the diff has just tons of deletions, even if most of the rest
of the changes aren't all that big. But there are netfilter fixes, some
more splice work, and just tons of random stuff: usb, scsi, knfsd, fuse,
infiniband..

Shortlog follows, for more details on it all..

Linus

---
adam radford:
[SCSI] 3ware 9000 disable local irqs during kmap_atomic

Adrian Bunk:
[NET]: Fix an off-by-21-or-49 error.
[TG3]: Fix a memory leak.
[IPV6]: Unexport secure_ipv6_port_ephemeral
CONFIGFS_FS must depend on SYSFS
arch/i386/mach-voyager/voyager_cat.c: named initializers
mm/migrate.c: don't export a static function
i386: move SMP option above subarch selection
arch/s390/Makefile: remove -finline-limit=10000
the scheduled unexport of panic_timeout
drivers/isdn/gigaset/common.c: small cleanups
isdn/gigaset/common.c: fix a memory leak
ISDN_DRV_GIGASET should select, not depend on CRC_CCITT
fs/nfsd/nfs4state.c: make a struct static
video/aty/atyfb_base.c: fix an off-by-one error
[WAN]: Remove broken and unmaintained Sangoma drivers.
[ALSA] sound/core/pcm.c: make snd_pcm_format_name() static
drivers/net/via-rhine.c: make a function static
remove drivers/net/hydra.h
USB: pci-quirks.c: proper prototypes
USB: input/: proper prototypes
USB: drivers/usb/core/: remove unused exports
remove kernel/power/pm.c:pm_unregister()
[IPV4]: Possible cleanups.
drivers/char/drm/drm_memory.c: possible cleanups
[CPUFREQ] drivers/cpufreq/cpufreq.c: static functions mustn't be exported
[CPUFREQ] powernow-k8.c: fix a check-after-use

Alan Stern:
USB: g_file_storage: Set short_not_ok for bulk-out transfers
USB: g_file_storage: add comment about buffer allocation
USB: g_file_storage: use module_param_array_named macro
USB: UHCI: don't track suspended ports
driver core: safely unbind drivers for devices not on a bus

Alessandro Zummo:
RTC subsystem: DS1672 cleanup
RTC subsystem: X1205 sysfs cleanup
RTC subsystem: whitespaces and error messages cleanup
RTC subsystem: fix proc output
RTC subsystem: RS5C372 sysfs fix
RTC subsystem: compact error messages
RTC subsystem: SA1100 cleanup
RTC subsystem: VR41XX cleanup

Alexey Dobriyan:
ver_linux: don't print reiser4progs version if none found

Alexey Kuznetsov:
IPC: access to unmapped vmalloc area in grow_ary()

Ananiev, Leonid I:
ext3: Fix missed mutex unlock
ext3: Fix missed mutex unlock

Andi Kleen:
x86_64: Update defconfig
x86_64: Clean up execve path
x86_64: Support memory hotadd without sparsemem
x86_64: Reserve SRAT hotadd memory on x86-64
x86_64: Handle empty PXMs that only contain hotplug memory
x86_64: Fix compilation with CONFIG_PCI=n / allnoconfig
x86_64: Don't sanity check Type 1 PCI bus access on newer systems
x86-64/i386: Don't process APICs/IO-APICs in ACPI when APIC is disabled.
x86_64: Clear APIC feature bit when local APIC is disabled
i386: Consolidate modern APIC handling
x86_64: Revert earlier powernow-k8 change
x86_64: Don't run NMI watchdog during machine checks
x86_64: When user could have changed RIP always force IRET
x86_64: Don't export strlen twice
x86_64: Don't return error for HPET initialization in initcall
i386/x86_64: Check if MCFG works for the first 16 busses
i386/x86-64: Return defined error value for bad PCI config space accesses
i386: Remove printk about reboot fixups at reboot
x86_64: Eliminate IA32_NR_syscalls define
x86_64: Update 32-bit system call table
[CPUFREQ] x86_64: Revert earlier powernow-k8 change
x86-64/i386: Don't process APICs/IO-APICs in ACPI when APIC is disabled.
x86_64: Remove check for canonical RIP
i386: Remove bogus special case code from AMD core parsing
i386/x86-64: Remove checks for value == NULL in PCI config space access
x86_64: Fix embarassing typo in mmconfig bus check
x86_64: Update defconfig
i386/x86-64: Fix ACPI disabled LAPIC handling mismerge
x86_64: Increase NUMA hash function nodemap
x86_64: Add tee and sync_file_range
i386: Move CONFIG_DOUBLEFAULT into arch/i386 where it belongs.

Andreas Gruenbacher:
kbuild: modules_install for external modules must not remove existing modules

Andreas Schwab:
Use pci_set_consistent_dma_mask in ixgb driver

Andrew Morton:
[NET]: More kzalloc conversions.
splice: warning fix
select() warning fixes
sync_file_range(): use unsigned for flags
timer initialisation fix
make tty_insert_flip_string_flags() a non gpl export
sys_kexec_load() naming fixups
hdaps: use ENODEV
3ware: kmap_atomic() fix
atyfb is bust on sparc32
sparc32 vga support
pm: print name of failed suspend function

Andy Whitcroft:
page flags: add commentry regarding field reservation

Anton Blanchard:
powerpc: Ensure runlatch is off in the idle loop
powerpc: Avoid __initcall warnings

Antonino A. Daplas:
vesafb: Fix incorrect logo colors in x86_64
fbdev: Use logo with depth of 4 or less for static pseudocolor

Arjan van de Ven:
x86_64: Rename e820_mapped to e820_any_mapped
x86_64: Introduce e820_all_mapped
i386/x86-64: Check that MCFG points to an e820 reserved area

Arnd Bergmann:
inotify: check for NULL inode in inotify_d_instantiate

Ashley Clark:
[ALSA] hda-codec - Adds HDA support for Intel D945Pvs board with subdevice id 0x0707

Ashok Raj:
swsusp: don't require bigsmp

Atsushi Nemoto:
kbuild: mips: fix sed regexp to generate asm-offset.h
[MIPS] Enable SCHED_NO_NO_OMIT_FRAME_POINTER for MIPS.
[MIPS] Fix tx49_blast_icache32_page_indexed.
[MIPS] Use __ffs() instead of ffs() for waybit calculation.

Ben Dooks:
[ARM] 3468/1: S3C2410: SMDK common include fix
[ARM] 3469/1: S3C24XX: clkout missing hclk selector
S3C24XX GPIO LED support
leds: fix IDE disk trigger name
leds: reorganise Kconfig
leds: re-layout include/linux/leds.h
[ARM] 3474/1: S3C2440: USB rate writes wrong var to CLKDIVN
[ARM] 3475/1: S3C2410: fix spelling mistake in SMDK partition table
USB: cleanups for ohci-s3c2410.c
USB: S3C2410: use clk_enable() to ensure 48MHz to OHCI core

Bjorn Helgaas:
[IA64] update HP CSR space discovery via ACPI
[IA64] always map VGA framebuffer UC, even if it supports WB
DMI: move dmi_scan.c from arch/i386 to drivers/firmware/

Brent Cook:
mv643xx_eth: Always free completed tx descs on tx interrupt

Brian Gerst:
kbuild: fix garbled text in modules.txt

Brian Haley:
[NETFILTER]: Fix build with CONFIG_NETFILTER=y/m on IA64

Brian King:
[SCSI] ipr: Disk remove path cleanup
[SCSI] ipr: Fixup device type check
[SCSI] ipr: Simplify status area dumping
[SCSI] ipr: printk macro cleanup/removal
[SCSI] ipr: Reset device cleanup
[SCSI] ipr: Bump version

Brian Uhrain:
alpha: SMP boot fixes

Carl-Daniel Hailfinger:
kbuild: fix unneeded rebuilds in drivers/media/video after moving source tree
kbuild: fix unneeded rebuilds in drivers/net/chelsio after moving source tree

Catalin Marinas:
[ARM] 3470/1: Clear the HWCAP bits for the disabled kernel features
[ARM] 3471/1: FTOSI functions should return 0 for NaN
[ARM] 3472/1: Use the D variants of FLDMIA/FSTMIA on ARMv6
[ARM] 3473/1: Use numbers 0-15 for the VFP double registers

Chen, Kenneth W:
[IA64] fix bug in ia64 __mutex_fastpath_trylock

Christoph Hellwig:
move ->eh_strategy_handler to the transport class
build kernel/irq/migration.c only if CONFIG_GENERIC_PENDING_IRQ is set
[SCSI] unify SCSI_IOCTL_SEND_COMMAND implementations

Christoph Lameter:
[IA64] Prefetch mmap_sem in ia64_do_page_fault()
Fix NULL pointer dereference in node_read_numastat()
Some page migration fixups

Corey Minyard:
ipmi: fix event queue limit

Cornelia Huck:
s390: wrong return codes in cio_ignore_proc_init()

Coywolf Qi Hunt:
page-writeback comment fixes
[ALSA] hda-codec - support HP Compaq Presario B2800 laptop with AD1986A codec

Dale Farnsworth:
mv643xx_eth: Fix tx_timeout to only conditionally wake tx queue

Dale Sedivec:
[ALSA] au88x0 - clean up __devinit/__devexit

Dan Aloni:
sata_mv: properly print HC registers

Daniel Ritz:
USB: usbtouchscreen: unified USB touchscreen driver
usb/input: remove Kconfig entries of old touchscreen drivers in favour of usbtouchscreen

Dave Airlie:
drm: Fix issue reported by Coverity in drivers/char/drm/via_irq.c
drm: drm_pci needs dma-mapping.h
drm: remove master setting from add/remove context
drm: deline a few large inlines in DRM code

Dave C Boutcher:
[SCSI] ibmvscsi: prevent scsi commands being sent in invalid state

Dave Hansen:
x86_64: extra NODES_SHIFT definition

Dave Jones:
[CPUFREQ] extra debugging in cpufreq_add_dev()
[CPUFREQ] trailing whitespace removal de-jour.
[CPUFREQ] Remove pointless check in conservative governor.
[SELINUX] Fix build after ipsec decap state changes.
splice: potential !page dereference
S390: fix implicit declaration of (un)likely.
Remove extraneous \n in doubletalk init printk.

David Brownell:
USB: otg hub support is optional
USB: fix gadget_is_musbhdrc()
USB: net2280 short rx status fix
USB: rndis_host whitespace/comment updates
USB: gadgetfs highspeed bugfix
USB: gadget zero poisons OUT buffers
USB: at91 usb driver supend/resume fixes
USB: usbtest: scatterlist OUT data pattern testing
USB: g_ether, highspeed conformance fix
dma doc updates
Fix AT91RM9200 build breakage

David Chinner:
[XFS] Fix inode reclaim scalability regression. When a filesystem has
[XFS] Fix an inode use-after-free durin an unpin. When reclaiming inodes

David Hollis:
USB: Rename ax8817x_func() to asix_func() and add utility functions to reduce bloat

David Howells:
[Security] Keys: Fix oops when adding key to non-keyring
Fix memory barrier docs wrt atomic ops
Improve data-dependency memory barrier example in documentation
Keys: Improve usage of memory barriers and remove IRQ disablement

David S. Miller:
[X25]: Restore skb->dev setting in x25_type_trans().
[IPV4] ip_fragment: Always compute hash with ipfrag_lock held.
[SPARC64]: Add dummy PTRACE_PEEKUSR for gdb.
[SPARC64]: Print out return PC in cheetah_log_errors().
[SPARC64]: Update defconfig.
[SPARC64]: Translate PTRACE_GETEVENTMSG for 32-bit tasks.
[SPARC64]: smp_call_function() fixups...
[SPARC64]: Set ARCH_SELECT_MEMORY_MODEL
[SPARC]: Hook up sys_tee() into syscall tables.
[SPARC64]: Export pcibios_resource_to_bus().

Davide Libenzi:
uniform POLLRDHUP handling between epoll and poll/select

Denis Vlasenko:
[IPV6]: Deinline few large functions in inet6 code

Dmitry Mishin:
unaligned access in sk_run_filter()

Douglas Gilbert:
[SCSI] sg: fix leak when dio setup fails

Eli Cohen:
IPoIB: Wait for join to finish before freeing mcast struct
IPoIB: Close race in ipoib_flush_paths()

Eric Sesterhenn:
[BLUETOOTH] sco: Possible double free.
Bogus NULL pointer check in fs/configfs/dir.c
kbuild: fix NULL dereference in scripts/mod/modpost.c
Wrong out of range check in drivers/char/applicom.c
Overrun in cdrom/aztcd.c
[DCCP]: Fix leak in net/dccp/ipv4.c
[ISDN]: Static overruns in drivers/isdn/i4l/isdn_ppp.c
[ALSA] Overrun in sound/pci/au88x0/au88x0_pcm.c

Eric Van Hensbergen:
9p: handle sget() failure

Eric W. Biederman:
de_thread: Don't confuse users do_each_thread.
do_SAK: Don't recursively take the tasklist_lock
de_thread: Don't change our parents and ptrace flags.
kill unushed __put_task_struct_cb

Erik Mouw:
[CPUFREQ] Update LART site URL

Folkert van Heusden:
USB: add support for Papouch TMU (USB thermometer)

Frank Gevaerts:
hdaps: add support for Thinkpad R52

FUJITA Tomonori:
[SCSI] ibmvscsi: convert the ibmvscsi driver to use include/scsi/srp.h
[SCSI] ibmvscsi: remove drivers/scsi/ibmvscsi/srp.h

Gary Zambrano:
b44: disable default tx pause
b44: increase version to 1.00

Geert Uytterhoeven:
Update contact info for Geert Uytterhoeven

Greg Kroah-Hartman:
USB: add driver for funsoft usb serial device

Grzegorz Janoszka:
arch/i386/pci/irq.c - new VIA chipsets (fwd)

Guennadi Liakhovetski:
USB: net2282 and net2280 software compatibility

H. Peter Anvin:
[efficeon-agp] Add missing memory mask

Hannes Reinecke:
[SCSI] aic79xx bus reset update
[SCSI] aic79xx: target hotplug fixes

Herbert Poetzl:
vfs: propagate mnt_flags into do_loopback/vfsmount

Herbert Xu:
[IPSEC]: Check x->encap before dereferencing it
[INET]: Move no-tunnel ICMP error to tunnel4/tunnel6
[INET]: Use port unreachable instead of proto for tunnels
[TCP]: Fix truesize underflow

Hideo AOKI:
overcommit: add calculate_totalreserve_pages()
overcommit: use totalreserve_pages
overcommit: use totalreserve_pages for nommu

Hirokazu Takata:
m32r: Fix cpu_possible_map and cpu_present_map initialization for SMP kernel
m32r: security fix of {get,put}_user macros
Remove unused prepare_to_switch macro
m32r: Remove symbols exported twice

Horst Hummel:
s390: dasd device offline messages
s390: dasd proc entries

Hugh Dickins:
shmat: stop mprotect from giving write permission to a readonly attachment (CVE-2006-1524)
Fix MADV_REMOVE protection checking

Hyok S. Choi:
frv: define MMU mode specific syscalls as 'cond_syscall' and clean up unneeded macros

Ian Abbott:
USB: ftdi_sio: add support for Eclo COM to 1-Wire USB adapter

Ingo Molnar:
introduce a "kernel-internal pipe object" abstraction
splice: add optional input and output offsets
get rid of the PIPE_*() macros
pipe.c/fifo.c code cleanups
splice: comment styles
another round of fs/pipe.c cleanups

J. Bruce Fields:
knfsd: svcrpc: WARN() instead of returning an error from svc_take_page

Jack Morgenstein:
IB: simplify static rate encoding
IB/mthca: Fix max_srq_sge returned by ib_query_device for Tavor devices

Jacob Shin:
x86_64: Proper null pointer check in powernow_k8_get

jacob...@amd.com:
[CPUFREQ] x86_64: Proper null pointer check in powernow_k8_get

Jamal Hadi Salim:
[PKT_SCHED] act_police: Rename methods.
[XFRM]: Fix aevent timer.
[XFRM]: Add documentation for async events.

James Bottomley:
[SCSI] remove qlogicfc
[SCSI] expose sas internal class for the domain transport
[SCSI] add SCSI_UNKNOWN and LUN transfer limit restrictions
[SCSI] scsi_transport_sas: don't scan a non-existent end device

James Courtier-Dutton:
[ALSA] emu10k1: Add some descriptive text.

James Smart:
[SCSI] FC transport: fixes for workq deadlocks

Jan-Benedict Glaw:
Silence a const vs non-const warning

Jayachandran C:
[BRIDGE] ebtables: fix allocation in net/bridge/netfilter/ebtables.c
driver core: fix unnecessary NULL check in drivers/base/class.c
drm: Fix further issues in drivers/char/drm/via_irq.c

Jean Delvare:
i2c: convert ds1374 to use a workqueue
w83792d: Be quiet on misdetection
PCI: Add PCI quirk for SMBus on the Asus A6VA notebook

Jean-Luc Léger:
[SPARC64]: Fix dependencies of HUGETLB_PAGE_SIZE_64K

Jeff Dike:
UML: TLS fixlets
Add GFP_NOWAIT
uml: memory hotplug cleanups
fuse: add O_ASYNC support to FUSE device
fuse: add O_NONBLOCK support to FUSE device

Jeff Garzik:
[libata] sata_mv: fix can_queue line accidentally removed in scsi-eh patch
[netdrvr b44] trim trailing whitespace

Jeffrey Vandenbroucke:
hid-core.c: fix "input irq status -32 received" for Silvercrest USB Keyboard

Jens Axboe:
splice: mark the io page as accessed
splice: only call wake_up_interruptible() when we really have to
splice: cleanup __generic_file_splice_read()
splice: optimize the splice buffer mapping
splice: be smarter about calling do_page_cache_readahead()
splice: add direct fd <-> fd splicing support
splice: speedup __generic_file_splice_read
splice: speedups and optimizations
splice: unlikely() optimizations
splice: add Ingo as addition copyright holder
splice: pass offset around for ->splice_read() and ->splice_write()
splice: add support for sys_tee()

Jesper Juhl:
[NET]: Remove redundant NULL checks before [kv]free

Jing Min Zhao:
[NETFILTER]: H.323 helper: move some function prototypes to ip_conntrack_h323.h
[NETFILTER]: H.323 helper: change EXPORT_SYMBOL to EXPORT_SYMBOL_GPL
[NETFILTER]: H.323 helper: make get_h245_addr() static
[NETFILTER]: H.323 helper: add parameter 'default_rrq_ttl'

Joe Korty:
add cpu_relax to hrtimer_cancel

Joern Engel:
Remove blkmtd

John Blackwood:
x86_64: Plug GS leak in arch_prctl()

John Rose:
PCI: rpaphp: remove init error condition

John W. Linville:
pci_ids.h: correct naming of 1022:7450 (AMD 8131 Bridge)

Jordan Crouse:
Enable TSC for AMD Geode GX/LX

Jordan Hargrave:
x86_64: Fix drift with HPET timer enabled

Jordi Caubet:
spufs: fix context-switch decrementer code

KAMEZAWA Hiroyuki:
[ARM] arm's arch_local_page_offset() fix against 2.6.17-rc1
[IA64] for_each_possible_cpu: ia64
for_each_possible_cpu: network codes
for_each_possible_cpu: sparc
for_each_possible_cpu: sparc64
[SCSI] for_each_possible_cpu: scsi

Kay Sievers:
BLOCK: delay all uevents until partition table is scanned

Keith Owens:
[IA64] Pass more data to the MCA/INIT notify_die hooks
[IA64] Failure to resume after INIT in user space
Reinstate const in next_thread()
[IA64] ia64_wait_for_slaves() incorrectly reports MCA

Komuro:
network: axnet_cs.c: add missing 'PRIV' in ei_rx_overrun

Kumar Gala:
RTC subsystem: DS1672 oscillator handling

Kyle McMartin:
No arch-specific strpbrk implementations
Clean up arch-overrides in linux/string.h

Lennert Buytenhek:
[ARM] 3459/1: ixp23xx: fix debug serial macros for big-endian operation

Linas Vepstas:
powerpc/pseries: bugfix: balance calls to pci_device_put

Linus Torvalds:
Move request_standard_resources() back to before PCI probing
x86: don't allow tail-calls in sys_ftruncate[64]()
x86: be careful about tailcall breakage for sys_open[at] too
Linux v2.6.17-rc2

Linus Walleij:
[IRDA]: smcinit merged into smsc-ircc driver
[IRDA]: smsc-ircc2, smcinit support for ALi ISA bridges

Luiz Fernando Capitulino:
USB serial: Converts port semaphore to mutexes.

Luke Yang:
nommu: use compound page in slab allocator

mao, bibo:
x86_64: inline function prefix with __always_inline in vsyscall

Mark A. Greer:
i2c: convert m41t00 to use a workqueue

Mark Bellon:
MPBL0010 driver sysfs permissions wide open

Mark Fasheh:
ocfs2: multi node truncate fix
ocfs2: remove an overly aggressive BUG() in dlmfs
ocfs2: catch an invalid ast case in dlmfs
ocfs2: Handle the DLM_CANCELGRANT case in user_unlock_ast()
ocfs2: test and set teardown flag early in user_dlm_destroy_lock()
ocfs2: Better I/O error handling in heartbeat

Mark Haverkamp:
[SCSI] aacraid: Use scmd_ functions
[SCSI] aacraid: Track command ownership in driver
[SCSI] aacraid: Add timeout for events
[SCSI] aacraid: Error path cleanup
[SCSI] aacraid: Fix error in max_channel field
[SCSI] aacraid: Fix extra unregister_chrdev
[SCSI] aacraid: General driver cleanup
[SCSI] aacraid: Re-start helper thread if it dies
[SCSI] aacraid: Show max channel and max id is sysfs
[SCSI] aacraid: Fix parenthesis placement error
[SCSI] aacraid: Driver version update

Mark M. Hoffman:
i2c-sis96x: Remove an init-time log message
i2c-parport: Make type parameter mandatory

Martin Michlmayr:
parport: remove duplicate entry for NETMOS_9835

Martin Schwidefsky:
s390: update default configuration

Matthew Wilcox:
[SCSI] Change Kconfig option from IOMAPPED to MMIO
[SCSI] Use pcibios_resource_to_bus()
[SCSI] Simplify error handling a bit
[SCSI] Mark div_10M array const
[SCSI] Disable sym2 driver queueing
[SCSI] Use SPI messages where possible
[SCSI] Allow nvram settings to determine bus mode
[SCSI] Simplify error handling
[SCSI] Enable clustering and large transfers
[SCSI] Version 2.2.3
[SCSI] sym2: Fix build when spinlock debugging is enabled

Matthias Urlichs:
Overrun in option-card USB driver

matthieu castet:
USB: UEAGLE : cosmetic
USB: UEAGLE : support geode
USB: UEAGLE : null pointer dereference fix
USB: UEAGLE : memory leack fix

Michael Chan:
[TG3]: Kill some less useful flags
[TG3]: Speed up SRAM access (2nd version)

Michael Downey:
USB: keyspan-remote bugfix

Michael Ellerman:
powerpc: Fix machine detection in prom_init.c

Michael S. Tsirkin:
IB/mad: fix oops in cancel_mads
IPoIB: Consolidate private neighbour data handling
IB/mthca: Disable tuning PCI read burst size
IB/cache: Use correct pointer to calculate size

Mike Anderson:
[SCSI] sas transport: ref count update

Mike Christie:
[SCSI] fix sg leak when scsi_execute_async fails

Mike Galbraith:
sched: fix interactive task starvation
sched: don't awaken RT tasks on expired array

Mike Miller:
cciss: bug fix for crash when running hpacucli

Miklos Szeredi:
fuse: fix oops in fuse_send_readpages()
fuse: fix fuse_dev_poll() return value
fuse: simplify locking
fuse: use a per-mount spinlock
fuse: consolidate device errors
fuse: clean up request accounting
fuse: account background requests
[fuse] fix deadlock between fuse_put_super() and request_end()
[fuse] Fix accounting the number of waiting requests
[fuse] Don't init request twice
[fuse] Direct I/O should not use fuse_reset_request

Mitchell Blank Jr:
select: don't overflow if (SELECT_STACK_ALLOC % sizeof(long) != 0)

Moore, Eric:
[SCSI] mptfusion - fix panic in mptsas_slave_configure

Nathan Scott:
[XFS] Fix superblock validation regression for the zero imaxpct case.
[XFS] Fix a writepage regression where we accidentally stopped honouring
[XFS] Fix utime(2) in the case that no times parameter was passed in.
[XFS] Fix a problem in aligning inode allocations to stripe unit

NeilBrown:
md: make sure 64bit fields in version-1 metadata are 64-bit aligned
knfsd: Correct reserved reply space for read requests.
knfsd: locks: flag NFSv4-owned locks
knfsd: nfsd4: Wrong error handling in nfs4acl
knfsd: nfsd4: better nfs4acl errors
knfsd: nfsd4: fix acl xattr length return
knfsd: nfsd: oops exporting nonexistent directory
knfsd: nfsd: nfsd_setuser doesn't really need to modify rqstp->rq_cred.
knfsd: nfsd4: remove nfsd_setuser from putrootfh
knfsd: nfsd4: fix corruption of returned data when using 64k pages
knfsd: nfsd4: fix corruption on readdir encoding with 64k pages
knfsd: svcrpc: gss: don't call svc_take_page unnecessarily
knfsd: nfsd4: fix laundromat shutdown race
knfsd: nfsd4: nfsd4_probe_callback cleanup
knfsd: nfsd4: add missing rpciod_down()
knfsd: nfsd4: limit number of delegations handed out.
knfsd: nfsd4: grant delegations more frequently
sysfs: Allow sysfs attribute files to be pollable

Nick Piggin:
Fix buddy list race that could lead to page lru list corruptions

Nicolas Pitre:
[ARM] 3477/1: ARM EABI: undefine removed syscalls

OGAWA Hirofumi:
Remove sys_ prefix of new syscalls from __NR_sys_*
[ALSA] pcm_oss: fix snd_pcm_oss_release() oops
[PATCH 1/2] iosched: fix typo and barrier()
[PATCH 2/2] cfq: fix cic's rbtree traversal
cfq: Further rbtree traversal and cfq_exit_queue() race fix

Olaf Hering:
powerpc32: Set cpu explicitly in kernel compiles

Oleg Nesterov:
__group_complete_signal: remove bogus BUG_ON

Paolo 'Blaisorblade' Giarrusso:
[NET] kzalloc: use in alloc_netdev
kbuild: fix mode of checkstack.pl and other files.
uml: make 64-bit COW files compatible with 32-bit ones
uml: safe migration path to the correct V3 COW format
uml: fix 2 harmless cast warnings for 64-bit
uml: request format warnings to GCC for appropriate functions
uml: fix format errors
uml: fix some double export warnings
uml: fix "extern-vs-static" proto conflict in TLS code
uml: fix critical typo for TT mode
uml: support sparse for userspace files
uml: move outside spinlock call not needing it
uml: fix hang on run_helper() failure on uml_net
uml: fix failure path after conversion
uml: fix big stack user
uml: local_irq_save, not local_save_flags
uml: fix parallel make early failure on clean tree
uml: avoid warnings for diffent names for an unsigned quadword
module support: record in vermagic ability to unload a module

Patrick McHardy:
[NETFILTER]: Fix fragmentation issues with bridge netfilter
[NETFILTER]: Add helper functions for mass hook registration/unregistration
[NETFILTER]: Clean up hook registration
[NETFILTER]: Fix section mismatch warnings
[NETFILTER]: Fix IP_NF_CONNTRACK_NETLINK dependency
[NETFILTER]: Introduce infrastructure for address family specific operations
[NETFILTER]: Add address family specific checksum helpers
[NETFILTER]: Convert conntrack/ipt_REJECT to new checksumming functions
[NETFILTER]: H.323 helper: remove changelog
[NETFILTER]: Fix DNAT in LOCAL_OUT

Paul Fulghum:
ptmx: fix duplicate idr_remove
tty release_dev(): remove dead code
USB: remove __init from usb_console_setup

Paul Mackerras:
powerpc: Fix CHRP booting - needs a define_machine call
powerpc: Use correct sequence for putting CPU into nap mode

Pekka J Enberg:
vfs: add splice_write and splice_read to documentation

Pete Zaitcev:
USB: linux/usb/net2280.h common definitions

Peter Oberparleiter:
s390: ebdic to ascii conversion tables
s390: invalid check after kzalloc()
s390: increase cio_trace debug event size
s390: fail-fast requests on quiesced devices
s390: minor tape fixes

Petko Manolov:
USB: pegasus driver bugfix

Ping Cheng:
USB: wacom tablet driver update
USB: add new wacom devices to usb hid-core list

Ralf Baechle:
[MIPS] Cleanup free_initmem the same way as i386 did.
[MIPS] Make set_vi_srs_handler static.
[MIPS] Remove redundant initialization of sr_allocated.
[MIPS] Fixup printk in mips_srs_init.
[MIPS] Some formatting fixes.
[MIPS] Provide access functions for c0_badvaddr.
[MIPS] Fix vectored interrupt support in TLB exception handler generator.
[MIPS] More SHT_* and SHF_* ELF definitions.
[MIPS] Wire splice syscall.
[MIPS] Wire up sync_file_range(2).
[MIPS] Sort out duplicate exports.
[MIPS] Fix breakage due to the grand makefile crapectomy.
[MIPS] Rewrite spurious_interrupt from assembler to C.
[MIPS] PNX8550 build fix.
[MIPS] Fix CONFIG_LIMITED_DMA build.
[MIPS] ITE8172: Fix build error due to missmatching prototypes.
[MIPS] Jaguar: Fix build errors after the recent move of Marvell headers.
[MIPS] MV6434x: The name of the CPP symbol is __mips__, not __MIPS__.
[MIPS] ITE: Glue build.
[MIPS] it8172: Fix build of serial driver.
[MIPS] MV6434x: Add prototype of interrupt dispatch function.
[MIPS] Ocelot 3: Fix build errors after the recent move of Marvell headers.
[MIPS] EV96100: Fix over two year old typo in variable name.
[MIPS] EV96100: ev96100_cpu_irq needs a struct pt_regs argument.
[MIPS] JMR3927 build fixes for the RTC code.
[MIPS] Replace redundant declarations of _end by <asm/sections.h>.
[MIPS] Fixup damage done by 22a9835c350782a5c3257343713932af3ac92ee0.
[MIPS] Fix the crime against humanity that mipsIRQ.S is.
[MIPS] Rewrite all the assembler interrupt handlers to C.
[MIPS] Use "R" constraint for cache_op.
[MIPS] R2: Implement shadow register allocation without spinlock.
[MIPS] Fix genrtc compilation.
[MIPS] R2: Instruction hazard barrier.
[MIPS] kpsd and other AP/SP improvements.
[MIPS] MT: Improved multithreading support.
[MIPS] FPU affinity for MT ASE.
[MIPS] kgdb: Let gcc compute the array size itself.
[MIPS] MIPS boards: Set HZ to 100.
[MIPS] Make mips_srs_init static.
[MIPS] Handle IDE PIO cache aliases on SMP.
[MIPS] Fix Makefile bugs for MIPS32/MIPS64 R1 and R2.
[MAINTAINERS] The ham radio code now has website at http://www.linux-ax25.org.

Ram Gupta:
mm: fix bug in brk()

Randy Dunlap:
[NET] netconsole: set .name in struct console
hugetlbfs doc. update
i386: print EIP/ESP last
menu: relocate DOUBLEFAULT option
mpparse: prevent table index out-of-bounds
mptspec: remove duplicate #include
docs: laptop-mode.txt source file build
Doc: fix mtrr userspace programs to build cleanly
kexec: update MAINTAINERS
net drivers: fix section attributes for gcc
isd200: limit to BLK_DEV_IDE

Ravikiran G Thirumalai:
x86_64: Fixup read_mostly section on internode cache line size for vSMP
slab: allocate node local memory for off-slab slabmanagement
slab: add statistics for alien cache overflows

Rene Herman:
[ALSA] continue on IS_ERR from platform device registration
[ALSA] unregister platform device again if probe was unsuccessful

Richard Purdie:
[ARM] 3478/1: SharpSL SCOOP: Fix potenial build failure
[ARM] 3479/1: Corgi SSP: Fix potential concurrent access problem

Robert Love:
hdaps: support new Lenovo machines

Robert Olsson:
[FIB_TRIE]: Fix leaf freeing.

Robin Holt:
[IA64] Make show_mem() skip holes in a pgdat

Roger Luethi:
via-rhine: execute bounce buffers code on Rhine-I only

Roland Dreier:
IPoIB: Always build debugging code unless CONFIG_EMBEDDED=y
IB/mthca: Always build debugging code unless CONFIG_EMBEDDED=y
IB/srp: Fix memory leak in options parsing
IPoIB: Use spin_lock_irq() instead of spin_lock_irqsave()
PCI: fix sparse warning about pci_bus_flags

Roland McGrath:
process accounting: take original leader's start_time in non-leader exec
fix non-leader exec under ptrace

Roman Zippel:
kconfig: fix default value for choice input
kconfig: revert conf behaviour change
kconfig: recenter menuconfig
kconfig: fix typo in change count initialization

Russell King:
[ARM] Remove unnecessary extra parens in include/asm-arm/memory.h
[ARM] Move FLUSH_BASE macros to asm/arch/memory.h
[ARM] Fix ebsa110 debug macros
[ARM] ebsa110: Fix incorrect serial port address
[ARM] Fix SA110/SA1100 cache flushing
[ARM] Allow decompressor to be built with -ffunction-sections
[SERIAL] Update serial driver documentation

Ryan Wilson:
driver core: driver_bind attribute returns incorrect value

Sam Ravnborg:
kbuild: use relative path to -I
kbuild: fix building single targets with make O=.. single-target
kbuild: fix make dir/
kbuild: properly pass options to hostcc when doing make O=..
x86_64: fix CONFIG_REORDER
kbuild: rebuild initramfs if content of initramfs changes
kbuild: fix false section mismatch warnings

Samuel Ortiz:
[IRDA]: Support for Sigmatel STIR421x chip
[IRDA]: irda-usb, unregister netdev when patch upload fails

Samuel Thibault:
Enhancing accessibility of lxdialog

Sergey Vlasov:
[NET]: Fix hotplug race during device registration.

Shaohua Li:
PCI: MSI(X) save/restore for suspend/resume

Shirley Ma:
IPoIB: Make send and receive queue sizes tunable

Siddha, Suresh B:
x86_64: fix sync before RDTSC on Intel cpus

Stephen Hemminger:
[BRIDGE]: receive link-local on disabled ports.
dlink pci cards using wrong driver
sky2: bad memory reference on dual port cards
[ATM]: clip causes unregister hang
[ATM]: Clip timer race.
[ATM] clip: run through Lindent
[ATM] clip: get rid of PROC_FS ifdef
[ATM] clip: notifier related cleanups
[ATM] clip: add module info
[IPV4]: ip_route_input panic fix

Stephen Rothwell:
powerpc: iSeries has only 256 IRQs
Fix block device symlink name

Takashi Iwai:
[ALSA] Fix Oops of PCM OSS emulation
[ALSA] hda-codec - Add another HP laptop with AD1981HD
[ALSA] via82xx - Add a dxs entry for ECS K8T890-A
[ALSA] hda-codec - Add support of ASUS U5A with AD1986A codec
[ALSA] ac97 - Add entry for VIA VT1618 codec

Tejun Heo:
[SCSI] SCSI: fix scsi_kill_request() busy count handling

Thomas Renninger:
[CPUFREQ] If max_freq got reduced (e.g. by _PPC) a write to sysfs scaling_governor let cpufreq core stuck at low max_freq for ever

Tilman Schmidt:
isdn4linux: Siemens Gigaset drivers: code cleanup
isdn4linux: Siemens Gigaset drivers: Kconfig correction
isdn4linux: Siemens Gigaset drivers: timer usage
isdn4linux: Siemens Gigaset drivers: logging usage
isdn4linux: Siemens Gigaset drivers: sysfs usage
isdn4linux: Siemens Gigaset drivers: remove IFNULL macros
isdn4linux: Siemens Gigaset drivers: uninline
isdn4linux: Siemens Gigaset drivers: eliminate from_user argument
isdn4linux: Siemens Gigaset drivers: mutex conversion
isdn4linux: Siemens Gigaset drivers: remove private version of __skb_put()
isdn4linux: Siemens Gigaset drivers: remove forward references
isdn4linux: Siemens Gigaset drivers: add README
isdn4linux: Siemens Gigaset drivers: make some variables non-atomic

Tobias Klauser:
Last DMA_xBIT_MASK cleanups
[CPUFREQ] Remove duplicate check in powernow-k8

Tomasz Kazmierczak:
USB: pl2303: added support for OTi's DKU-5 clone cable

Tony Lindgren:
[ARM] 3460/1: ARM: OMAP: Remove unnecessary nop_release()
[ARM] 3461/1: ARM: OMAP: Fix clk_get() when using id and name

Tony Luck:
[IA64] Wire up new syscall sync_file_range()
[IA64] 'msg' may be used uninitialized in xpc_initiate_allocate()
[IA64] Wire up new syscalls {set,get}_robust_list

Vitaly Bordug:
ppc32: Fix string comparing in platform_notify_map

Vivek Goyal:
kdump proc vmcore size oveflow fix
kdump: enable CONFIG_PROC_VMCORE by default
x86_64: x86_64 add crashdump trigger points

Yasunori Goto:
Configurable NODES_SHIFT

Yoichi Yuasa:
RTC subsystem: VR41XX driver
[MIPS] Added tb0287_defconfig back.
[MIPS] Fix VR41xx build errors.

YOSHIFUJI Hideaki:
[IPV6]: Ensure to have hop-by-hop options in our header of &sk_buff.
[IPV6] XFRM: Don't use old copy of pointer after pskb_may_pull().
[IPV6] XFRM: Fix decoding session with preceding extension header(s).
[IPV6]: Clean up hop-by-hop options handler.

Zach Brown:
[IPv6] reassembly: Always compute hash under the fragment lock.
ip_output: account for fraggap when checking to add trailer_len

Ingo Molnar

Apr 19, 2006, 4:34:36 AM

to Linus Torvalds, Linux Kernel Mailing List

dm: fix typo.

Signed-off-by: Ingo Molnar <mi...@elte.hu>
----

drivers/md/dm.c | 2 +-
1 files changed, 1 insertions(+), 1 deletions(-)

Index: linux/drivers/md/dm.c
===================================================================
--- linux.orig/drivers/md/dm.c
+++ linux/drivers/md/dm.c
@@ -1004,7 +1004,7 @@ int dm_create(struct mapped_device **res

int dm_create_with_minor(unsigned int minor, struct mapped_device **result)
{
- return create_aux(minor, 1, reqult);
+ return create_aux(minor, 1, result);
}

static struct mapped_device *dm_find_md(dev_t dev)

Ingo Molnar

Apr 19, 2006, 4:37:11 AM

to Linus Torvalds, Linux Kernel Mailing List

* Ingo Molnar <mi...@elte.hu> wrote:

> dm: fix typo.

disregard it - it seems i got a 1-bit error in that file ...

Ingo

Diego Calleja

Apr 19, 2006, 2:02:32 PM

to Linus Torvalds, linux-...@vger.kernel.org

Could someone give a long high-level description of what splice() and tee()
are? I need a description for wiki.kernelnewbies.org/Linux_2_6_17 (while
we're at it, it'd be nice if some people could review it in case it's missing
something ;) I've named it "generic zero-copy mechanism" but I bet
there's a better description. If it's as cool as people say, it'd be nice
to do some "advertising" of it (notifying people of new features is not
something Linux has done too well historically :)

What kind of apps available today could get performance benefits by using
this? Is there a new class of "processes" (or apps) that couldn't be done
before and can be done now using splice, or are there some kinds of apps that
are too complex internally today because they try to avoid extra copies of
data, and that could get much simpler by using splice? Why do people see it
as a "radical" improvement in some cases over the typical way of doing I/O in
Unix? Is this similar to, or can it be compared with, Ritchie's/SYSV STREAMS?

Hua Zhong

Apr 19, 2006, 2:06:14 PM

to Diego Calleja, Linus Torvalds, linux-...@vger.kernel.org

http://lwn.net/Articles/178199/

Linus Torvalds

Apr 19, 2006, 2:45:07 PM

to Diego Calleja, linux-...@vger.kernel.org

On Wed, 19 Apr 2006, Diego Calleja wrote:
>
> Could someone give a long high-level description of what splice() and tee()
> are?

The _really_ high-level concept is that there is now a notion of a "random
kernel buffer" that is exposed to user space.

In other words, splice() and tee() work on a kernel buffer that the user
has control over, where "splice()" moves data to/from the buffer from/to
an arbitrary file descriptor, while "tee()" copies the data in one buffer
to another.

So in a very real (but abstract) sense, "splice()" is nothing but
read()/write() to a kernel buffer, and "tee()" is a memcpy() from one
kernel buffer to another.

Now, to get slightly less abstract, there's two important practical
details:

- the "buffer" implementation is nothing but a regular old-fashioned UNIX
pipe.

This actually makes sense on so many levels, but mostly simply because
that is _exactly_ what a UNIX pipe has always been: it's a buffer in
kernel space. That's what a pipe has always been. So the splice usage
isn't conceptually anything new for pipes - it's just exposing that
old buffer in a new way.

Using a pipe for the in-kernel buffer means that we already have all
the infrastructure in place to create these things (the "pipe()" system
call), and refer to them (user space uses a regular file descriptor as
a "pointer" to the kernel buffer).

It also means that we already know how to fill (or read) the kernel
buffer from user space: the bog-standard pre-existing "read()" and
"write()" system calls to the pipe work the obvious ways: they read the
data from the kernel buffer into user space, and write user space data
into the kernel buffer.

- the second part of the deal is that the buffer is actually implemented
as a set of reference-counted pointers, which means that you can copy
them around without actually physically copying memory. So while "tee()"
from a _conceptual_ standpoint is exactly the same as a "memcpy()" on
the kernel buffer, from an implementation standpoint it really just
copies the pointers and increments the refcounts.
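
Editor's sketch: the "copy the pointers, not the data" behaviour of tee() can be observed from user space with two ordinary pipes. This is a minimal illustration, not code from the thread. The helper names `duplicate_pipe` and `tee_demo` are made up here; `tee()` and `SPLICE_F_NONBLOCK` are the Linux interfaces as exposed by glibc with `_GNU_SOURCE`. After the call the same bytes can be read from both pipes, because tee() does not consume its input.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <limits.h>
#include <string.h>
#include <unistd.h>

/*
 * Duplicate whatever is currently queued in the pipe read end `src_rd`
 * into the pipe write end `dst_wr`, without consuming it from the source.
 * Returns the number of bytes duplicated, or -1 on error.
 * SPLICE_F_NONBLOCK keeps the call from sleeping on an empty source.
 */
static ssize_t duplicate_pipe(int src_rd, int dst_wr)
{
    return tee(src_rd, dst_wr, INT_MAX, SPLICE_F_NONBLOCK);
}

/* Hypothetical self-check: returns 0 if both pipes deliver the message. */
int tee_demo(void)
{
    static const char msg[] = "hello, pipe";
    char a[sizeof msg], b[sizeof msg];
    int src[2], dst[2];
    int rc = -1;

    assert(sizeof msg <= PIPE_BUF);   /* small enough for one atomic write */
    if (pipe(src) < 0)
        return -1;
    if (pipe(dst) < 0) {
        close(src[0]);
        close(src[1]);
        return -1;
    }
    if (write(src[1], msg, sizeof msg) != (ssize_t)sizeof msg)
        goto out;
    if (duplicate_pipe(src[0], dst[1]) != (ssize_t)sizeof msg)
        goto out;
    /* The data is still in the source pipe and is now also in the other one. */
    if (read(src[0], a, sizeof a) != (ssize_t)sizeof a)
        goto out;
    if (read(dst[0], b, sizeof b) != (ssize_t)sizeof b)
        goto out;
    if (memcmp(a, msg, sizeof msg) == 0 && memcmp(b, msg, sizeof msg) == 0)
        rc = 0;
out:
    close(src[0]);
    close(src[1]);
    close(dst[0]);
    close(dst[1]);
    return rc;
}
```

Calling `tee_demo()` should return 0 on a kernel that has tee(); both ends are plain pipes, so nothing else is needed to try it.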

There are some other buffer management system calls that I haven't done
yet (and when I say "I haven't done yet", I obviously mean "that I hope
some other sucker will do for me, since I'm lazy"), but that are obvious
future extensions:

- an ioctl/fcntl to set the maximum size of the buffer. Right now it's
hardcoded to 16 "buffer entries" (which in turn are normally limited to
one page each, although there's nothing that _requires_ that a buffer
entry always be a page).

- vmsplice() system call to basically do a "write to the buffer", but
using the reference counting and VM traversal to actually fill the
buffer. This means that the user needs to be careful not to re-use the
user-space buffer it spliced into the kernel-space one (contrast this
to "write()", which copies the actual data, and you can thus re-use the
buffer immediately after a successful write), but that is often easy to
do.

Anyway, when would you actually _use_ a kernel buffer? Normally you'd use
it if you want to copy things from one source into another, and you don't
actually want to see the data you are copying, so using a kernel buffer
allows you to possibly do it more efficiently, and you can avoid
allocating user VM space for it (with all the overhead that implies: not
just the memcpy() to/from user space, but also simply the book-keeping).

It should be noted that splice() is very much _not_ the same as
sendfile(). The buffer is really the big difference, both conceptually,
and in how you actually end up using it.

A "sendfile()" call (which a lot of other OS's also implement) doesn't
actually _need_ a buffer at all, because it uses the file cache directly
as the buffer it works on. So sendfile() is really easy to use, and really
efficient, but fundamentally limited in what it can do.
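
Editor's sketch, for comparison: this is roughly what "really easy to use" looks like for sendfile(). It is an illustration under assumptions, not code from the thread. The names `send_whole_file` and `sendfile_demo` are hypothetical. At the time of this thread the destination had to be a socket; the self-check below writes to a regular file instead, which as far as I know needs Linux 2.6.33 or later.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <string.h>
#include <sys/sendfile.h>
#include <sys/stat.h>
#include <unistd.h>

/* Send all of in_fd (a regular file) to out_fd. Returns 0 on success, -1 on error. */
int send_whole_file(int out_fd, int in_fd)
{
    struct stat st;
    off_t offset = 0;

    if (fstat(in_fd, &st) < 0)
        return -1;
    while (offset < st.st_size) {
        /* No user-space buffer: the kernel feeds out_fd from the file cache. */
        ssize_t n = sendfile(out_fd, in_fd, &offset,
                             (size_t)(st.st_size - offset));
        if (n < 0 && errno == EINTR)
            continue;
        if (n <= 0)
            return -1;   /* error, or the file shrank underneath us */
    }
    return 0;
}

/* Hypothetical self-check using two temporary files. */
int sendfile_demo(void)
{
    static const char text[] = "served straight from the file cache\n";
    char in_name[] = "/tmp/sendfile-in-XXXXXX";
    char out_name[] = "/tmp/sendfile-out-XXXXXX";
    char back[sizeof text];
    int in_fd = mkstemp(in_name);
    int out_fd = mkstemp(out_name);
    int rc = -1;

    assert(in_fd >= 0 && out_fd >= 0);
    if (write(in_fd, text, sizeof text) == (ssize_t)sizeof text &&
        send_whole_file(out_fd, in_fd) == 0 &&
        pread(out_fd, back, sizeof back, 0) == (ssize_t)sizeof back &&
        memcmp(back, text, sizeof text) == 0)
        rc = 0;
    close(in_fd);
    close(out_fd);
    unlink(in_name);
    unlink(out_name);
    return rc;
}
```

Note that there is no buffer anywhere in the caller: that is both the convenience and the limitation described above.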

In contrast, the whole point of splice() very much is that buffer. It
means that in order to copy a file, you literally do it like you would
have done it traditionally in user space:

        for (;;) {
                int ret = read(input, buffer, BUFSIZE);
                char *p;
                if (!ret)
                        break;
                if (ret < 0) {
                        if (errno == EINTR)
                                continue;
                        .. exit with an input error ..
                }

                p = buffer;
                do {
                        int written = write(output, p, ret);
                        if (!written)
                                .. exit with filesystem full ..
                        if (written < 0) {
                                if (errno == EINTR)
                                        continue;
                                .. exit with an output error ..
                        }
                        p += written;
                        ret -= written;
                } while (ret);
        }

except you'd not have a buffer in user space, and the "read()" and
"write()" system calls would instead be "splice()" system calls to/from a
pipe you set up as your _kernel_ buffer. But the _construct_ would all be
identical - the only thing that changes is really where that "buffer"
exists.
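
Editor's sketch: the same two-level loop with the user-space buffer replaced by a pipe, written against the splice() call as glibc exposes it. This is one possible reading of the description above, not code from the thread. The names `splice_copy`, `splice_copy_demo` and the `CHUNK` size are assumptions made for the example.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define CHUNK (64 * 1024)   /* bytes requested per splice(); an arbitrary choice */

/* Copy in_fd to out_fd until EOF, staging the data in a pipe. Returns 0 or -1. */
int splice_copy(int in_fd, int out_fd)
{
    int pipefd[2];
    int rc = -1;

    if (pipe(pipefd) < 0)
        return -1;
    for (;;) {
        /* The "read()": source -> kernel buffer (write end of the pipe). */
        ssize_t ret = splice(in_fd, NULL, pipefd[1], NULL, CHUNK, SPLICE_F_MOVE);
        if (ret == 0) {
            rc = 0;          /* end of input */
            break;
        }
        if (ret < 0) {
            if (errno == EINTR)
                continue;
            break;           /* input error */
        }
        /* The "write()": kernel buffer -> destination, until drained. */
        while (ret > 0) {
            ssize_t written = splice(pipefd[0], NULL, out_fd, NULL,
                                     (size_t)ret, SPLICE_F_MOVE);
            if (written < 0 && errno == EINTR)
                continue;
            if (written <= 0)
                goto out;    /* output error */
            ret -= written;
        }
    }
out:
    close(pipefd[0]);
    close(pipefd[1]);
    return rc;
}

/* Hypothetical self-check: copy a temporary file and compare the result. */
int splice_copy_demo(void)
{
    static char src[200000], dst[200000];
    char in_name[] = "/tmp/splice-in-XXXXXX";
    char out_name[] = "/tmp/splice-out-XXXXXX";
    int in_fd = mkstemp(in_name);
    int out_fd = mkstemp(out_name);
    int rc = -1;

    assert(in_fd >= 0 && out_fd >= 0);
    for (size_t i = 0; i < sizeof src; i++)
        src[i] = (char)(i % 251);
    if (write(in_fd, src, sizeof src) == (ssize_t)sizeof src &&
        lseek(in_fd, 0, SEEK_SET) == 0 &&
        splice_copy(in_fd, out_fd) == 0 &&
        pread(out_fd, dst, sizeof dst, 0) == (ssize_t)sizeof dst &&
        memcmp(src, dst, sizeof src) == 0)
        rc = 0;
    close(in_fd);
    close(out_fd);
    unlink(in_name);
    unlink(out_name);
    return rc;
}
```

The pipe is drained completely before the next fill, so the fill side never blocks on a full pipe in this single-threaded form.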

Now, the advantage of splice()/tee() is that you can do zero-copy movement
of data, and unlike sendfile() you can do it on _arbitrary_ data (and, as
shown by "tee()", it's more than just sending the data to somebody else:
you can duplicate the data and choose to forward it to two or more
different users - for things like logging etc).

So while sendfile() can send files (surprise surprise), splice() really is
a general "read/write in user space" and then some, so you can forward
data from one socket to another, without ever copying it into user space.

Or, rather than just a boring socket->socket forwarding, you could, for
example, forward data that comes from a MPEG-4 hardware encoder, and tee()
it to duplicate the stream, and write one of the streams to disk, and the
other one to a socket for a real-time broadcast. Again, all without
actually physically copying it around in memory.

So splice() is strictly more powerful than sendfile(), even if it's a bit
more complex to use (the explicit buffer management in the middle). That
said, I think we're actually going to _remove_ sendfile() from the kernel
entirely, and just leave a compatibility system call that uses splice()
internally to keep legacy users happy.

Splice really is that much more powerful a concept, that having sendfile()
just doesn't make any sense except as some legacy compatibility layer
around the more powerful splice().

Linus

Grzegorz Kulewski

Apr 19, 2006, 3:21:27 PM

to Linus Torvalds, Diego Calleja, linux-...@vger.kernel.org

On Wed, 19 Apr 2006, Linus Torvalds wrote:
> On Wed, 19 Apr 2006, Diego Calleja wrote:
>>
>> Could someone give a long high-level description of what splice() and tee()
>> are?
>
> The _really_ high-level concept is that there is now a notion of a "random
> kernel buffer" that is exposed to user space.

Suppose I am implementing hi performance HTTP (not caching) proxy that
reads (part of?) HTTP header from A, decides where to send request from
it, connects to the right host (B), sends (part of) HTTP header it already
received and then wants to:

- make all further bytes from A be copied to B without using user space
but no more than n bytes (n = request size it knows from header) or to the
end of data (disconnect or something like that),

- make all bytes from B copied to A without using user space but no more
than m bytes (m = response size from response header),

- stop both operations as soon as they copy enough data (assuming both
sides are still connected) and then use sockets normally - to implement
for example multiple requests per connection (keepalive).

Could it be done with splice() or tee() or some other kernel
"accelerator"? Or should it be done in userspace by plain read and write?

And what if n or m is not known in advance but for example end of request
is represented by <CR><LF><CR><LF> or something like that (common in some
older protocols)?

Thanks in advance,

Grzegorz Kulewski

Jonathan Corbet

Apr 19, 2006, 3:41:07 PM

to linux-...@vger.kernel.org, Hua Zhong, Diego Calleja, Linus Torvalds

> http://lwn.net/Articles/178199/

Additionally, the article on tee() can be found at:

http://lwn.net/SubscriberLink/179492/14a99324520b744f/

jon

Linus Torvalds

Apr 19, 2006, 4:09:36 PM

to Grzegorz Kulewski, Diego Calleja, linux-...@vger.kernel.org

On Wed, 19 Apr 2006, Grzegorz Kulewski wrote:
>
> Suppose I am implementing hi performance HTTP (not caching) proxy that reads
> (part of?) HTTP header from A, decides where to send request from it, connects
> to the right host (B), sends (part of) HTTP header it already received and
> then wants to:
>
> - make all further bytes from A be copied to B without using user space but no
> more than n bytes (n = request size it knows from header) or to the end of
> data (disconnect or something like that),
>
> - make all bytes from B copied to A without using user space but no more than
> m bytes (m = response size from response header),
>
> - stop both operations as soon as they copy enough data (assuming both sides
> are still connected) and then use sockets normally - to implement for example
> multiple requests per connection (keepalive).
>
> Could it be done with splice() or tee() or some other kernel "accelerator"? Or
> should it be done in userspace by plain read and write?

You'd not use "tee()" here, because you never have any data that you want
to go to two different destinations, but yes, you could very well use
splice() for this.

(well, technically you have the header part that you want to duplicate,
and you _could_ use "tee()" for that, but it would be stupid - since you
want to see the header in user space _anyway_ to see where to forward
things, you just want to start out with a MSG_PEEK on the incoming socket
to see the header, and then use splice, to splice it to the destination
socket).
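
Editor's sketch of the "no more than n bytes" part of the question: a bounded forwarder that stops after `limit` bytes and leaves the rest of the input untouched, so the descriptor can be used normally afterwards. This is an illustration, not code from the thread. `forward_bounded` and its demo are hypothetical names, and the self-check uses regular files so it can run without a network. In the proxy case `in_fd` and `out_fd` would be the two sockets, and the header would first be inspected with recv(..., MSG_PEEK), which is omitted here.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Forward at most `limit` bytes from in_fd to out_fd through the
 * caller-owned pipe `pipefd`. Returns the number of bytes forwarded
 * (short only if in_fd reached EOF), or -1 on error.
 */
ssize_t forward_bounded(int in_fd, int out_fd, int pipefd[2], size_t limit)
{
    size_t done = 0;

    while (done < limit) {
        size_t want = limit - done;
        ssize_t got;

        if (want > 65536)
            want = 65536;    /* stay within a default-sized pipe */
        got = splice(in_fd, NULL, pipefd[1], NULL, want, SPLICE_F_MOVE);
        if (got < 0 && errno == EINTR)
            continue;
        if (got < 0)
            return -1;
        if (got == 0)
            break;           /* EOF or disconnect before `limit` */
        done += (size_t)got;
        while (got > 0) {    /* drain the pipe into the destination */
            ssize_t put = splice(pipefd[0], NULL, out_fd, NULL,
                                 (size_t)got, SPLICE_F_MOVE);
            if (put < 0 && errno == EINTR)
                continue;
            if (put <= 0)
                return -1;
            got -= put;
        }
    }
    return (ssize_t)done;
}

/* Hypothetical self-check: forward 6000 of 10000 bytes, then read on normally. */
int forward_bounded_demo(void)
{
    static char src[10000];
    char in_name[] = "/tmp/fwd-in-XXXXXX";
    char out_name[] = "/tmp/fwd-out-XXXXXX";
    int in_fd = mkstemp(in_name);
    int out_fd = mkstemp(out_name);
    int pipefd[2];
    struct stat st;
    char next = 0;
    int rc = -1;

    assert(in_fd >= 0 && out_fd >= 0);
    if (pipe(pipefd) < 0)
        return -1;
    for (size_t i = 0; i < sizeof src; i++)
        src[i] = (char)(i % 251);
    if (write(in_fd, src, sizeof src) == (ssize_t)sizeof src &&
        lseek(in_fd, 0, SEEK_SET) == 0 &&
        forward_bounded(in_fd, out_fd, pipefd, 6000) == 6000 &&
        fstat(out_fd, &st) == 0 && st.st_size == 6000 &&
        read(in_fd, &next, 1) == 1 && next == src[6000])
        rc = 0;      /* byte 6000 was not consumed by the forwarder */
    close(pipefd[0]);
    close(pipefd[1]);
    close(in_fd);
    close(out_fd);
    unlink(in_name);
    unlink(out_name);
    return rc;
}
```

Keeping the pipe in the caller means a proxy can allocate one per connection and reuse it across requests on a keepalive connection.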

> And what if n or m is not known in advance but for example end of request is
> represented by <CR><LF><CR><LF> or something like that (common in some older
> protocols)?

At that point, you need to actually watch the data in user space, and so
you need to do a real read() system call.

(Of course, the "kernel buffer" notion does allow for a notion of "kernel
filters" too, but then you get to shades of STREAMS, and that just scares
the crap out of me, so..)

Linus

Trond Myklebust

Apr 19, 2006, 5:26:02 PM

to Linus Torvalds, Diego Calleja, linux-...@vger.kernel.org

On Wed, 2006-04-19 at 11:44 -0700, Linus Torvalds wrote:
>
> On Wed, 19 Apr 2006, Diego Calleja wrote:
> >
> > Could someone give a long high-level description of what splice() and tee()
> > are?
>
> The _really_ high-level concept is that there is now a notion of a "random
> kernel buffer" that is exposed to user space.
>
> In other words, splice() and tee() work on a kernel buffer that the user
> has control over, where "splice()" moves data to/from the buffer from/to
> an arbitrary file descriptor, while "tee()" copies the data in one buffer
> to another.

Any chance this could be adapted to work with all those DMA (and RDMA)
engines that litter our motherboards? I'm thinking in particular of
stuff like the drm drivers, and userspace rdma.

Cheers,
Trond

Linus Torvalds

Apr 19, 2006, 5:50:24 PM

to Trond Myklebust, Diego Calleja, linux-...@vger.kernel.org

On Wed, 19 Apr 2006, Trond Myklebust wrote:
>
> Any chance this could be adapted to work with all those DMA (and RDMA)
> engines that litter our motherboards? I'm thinking in particular of
> stuff like the drm drivers, and userspace rdma.

Absolutely. Especially with "vmsplice()" (the not-yet-implemented "move
these user pages into a kernel buffer") it should be entirely possible to
set up an efficient zero-copy setup that does NOT have any of the problems
with aio and TLB shootdown etc.

Note that a driver would have to support the splice_in() and splice_out()
interfaces (which are basically just given the pipe buffers to do with as
they wish), and perhaps more importantly: note that you need specialized
apps that actually use splice() to do this.

That's the biggest downside by far, and is why I'm not 100% convinced
splice() usage will be all that wide-spread. If you look at sendfile(),
it's been available for a long time, and is actually even almost portable
across different OS's _and_ it is easy to use. But almost nobody actually
does. I suspect the only users are some apache mods, perhaps an ftp daemon
or two, and probably samba. And that's probably largely it.

There's a _huge_ downside to specialized interfaces. Admittedly, splice()
is a lot less specialized (ie it works in a much wider variety of loads),
but it's still very much a "corner-case" thing. You can always do the same
thing splice() does with a read/write pair instead, and be portable.

Also, the genericity of splice() does come at the cost of complexity. For
example, to do a zero-copy from a user space buffer to some RDMA network
interface, you'd have to basically keep track of _two_ buffers:

- keep track of how much of the user space buffer you have moved into
kernel space with "vmsplice()" (or, for that matter, with any other
source of data for the buffer - it might be a file, it might be another
socket, whatever. I say "vmsplice()", but that's just an example for
when you have the data in user space).

The kernel space buffer is - for obvious reasons - size limited in the
way a user-space buffer is not. People are used to doing megabytes of
buffers in user space. The splice buffer, in comparison, is maybe a few
hundred kB at most. For some apps, that's "infinity". For others, it's
just a few tens of pages of data.

- keep track of how much of the kernel space buffer you have moved to the
RDMA network interface with "splice()".

The splice buffer _is_ another buffer, and you have to feed the data
from that buffer to the RDMA device manually.

In many usage scenarios, this means that you end up having the normal
kind of poll/select loop. Now, that's nothing new: people are used to
them, but people still hate them, and it just means that very few
environments are going to spend the effort on another buffering setup.

So the upside of splice() is that it really can do some things very
efficiently, by "copying" data with just a simple reference counted
pointer. But the downside is that it makes for another level of buffering,
and behind an interface that is in kernel space (for obvious reasons),
which means that it's somewhat harder to wrap your hands and head around
than just a regular user-space buffer.

So I'd expect this to be most useful for perhaps things like some HPC
apps, where you can have specialized libraries for data communication. And
servers, of course (but they might just continue to use the old
"sendfile()" interface, without even knowing that it's not sendfile() any
more, but just a wrapper around splice()).

Linus

Peter Naulls

Apr 19, 2006, 6:19:13 PM

to linux-...@vger.kernel.org

Linus Torvalds wrote:
>
> On Wed, 19 Apr 2006, Trond Myklebust wrote:
>> Any chance this could be adapted to work with all those DMA (and RDMA)
>> engines that litter our motherboards? I'm thinking in particular of
>> stuff like the drm drivers, and userspace rdma.
>
> Absolutely. Especially with "vmsplice()" (the not-yet-implemented "move
> these user pages into a kernel buffer") it should be entirely possible to
> set up an efficient zero-copy setup that does NOT have any of the problems
> with aio and TLB shootdown etc.
>
> Note that a driver would have to support the splice_in() and splice_out()
> interfaces (which are basically just given the pipe buffers to do with as
> they wish), and perhaps more importantly: note that you need specialized
> apps that actually use splice() to do this.
>
> That's the biggest downside by far, and is why I'm not 100% convinced
> splice() usage will be all that wide-spread. If you look at sendfile(),
> it's been available for a long time, and is actually even almost portable
> across different OS's _and_ it is easy to use. But almost nobody actually
> does. I suspect the only users are some apache mods, perhaps an ftp daemon
> or two, and probably samba. And that's probably largely it.

I am. I'm developing a distributed file system responsible for
transferring GBs of files around a network. The biggest problem here
with the traditional send/recv/poll that was in use was heavy duty
CPU usage. Maxing out the gigabit network eats about 60% CPU. In
some simple experiments, sendfile reduced that to 10% or less
(depending, there's a lot of variation in stuff that goes on).

One big problem I had is that sendfile is not symmetric (for quite
understandable reasons), but that meant the overlying file system API
(it's a userspace library) had to undergo various changes to make
effective use of sendfile. Doing so in a sensible manner proved
tricky, but not impossible.

Anyway, CPU usage is still a big deal, which is why I'm interested
in these new zero-copy calls I've just caught up on the discussion
about. And if I decide to use them, that means moving a whole
load of machines to 2.6.17 - some of which will be running 2.6.12
for at least a little while longer. I guess I might be asking
for the opposite of this:

> So I'd expect this to be most useful for perhaps things like some HPC
> apps, where you can have specialized libraries for data communication. And
> servers, of course (but they might just continue to use the old
> "sendfile()" interface, without even knowing that it's not sendfile() any
> more, but just a wrapper around splice()).

i.e., a splice emulation that happens to use sendfile when it can.

I very much appreciate the conceptual improvements that splice has
over sendfile, but can anyone give some examples of significant CPU
savings that would not be possible using sendfile?

Diego Calleja

Apr 20, 2006, 9:22:14 AM

to Linus Torvalds, linux-...@vger.kernel.org

On Wed, 19 Apr 2006 11:44:25 -0700 (PDT),
Linus Torvalds <torv...@osdl.org> wrote:

> Anyway, when would you actually _use_ a kernel buffer? Normally you'd use
> it if you want to copy things from one source into another, and you don't

Thanks. I wonder if splice can be useful for more cases than just
high-bandwidth blind transfer of data? For example, in X.org as of today, I
think that pixmaps need to be copied from the client address space to the
server. Because X.org is network-oriented, the pixmaps must be sent even on
local machines (in order to save memory, when clients move a pixmap to the
server they must free it in their address space, because extra copies mean
high memory usage; at some point nautilus was keeping three copies of the
desktop background in memory).

There are shared memory extensions in commercial X servers which I think
fix this for local usage (there are rumors that Sun may port and
contribute their Xsun shared memory implementation to x.org in the
future), but I wonder if splice could be an alternative as well? Or
maybe splice is not a good option when you need several MB (if the buffer
size becomes tweakable in the future)?

Jens Axboe

Apr 20, 2006, 10:50:38 AM

to Linus Torvalds, Diego Calleja, linux-...@vger.kernel.org

On Wed, Apr 19 2006, Linus Torvalds wrote:
> There are some other buffer management system calls that I haven't done
> yet (and when I say "I haven't done yet", I obviously mean "that I hope
> some other sucker will do for me, since I'm lazy"), but that are obvious
> future extensions:

Well it's worked so far, hasn't it? :-)

> - an ioctl/fcntl to set the maximum size of the buffer. Right now it's
> hardcoded to 16 "buffer entries" (which in turn are normally limited to
> one page each, although there's nothing that _requires_ that a buffer
> entry always be a page).

This is on a TODO, but not very high up since I've yet to see a case
where the current 16 page limitation is an issue. I'm sure something
will come up eventually, but until then I'd rather not bother.

> - vmsplice() system call to basically do a "write to the buffer", but
> using the reference counting and VM traversal to actually fill the
> buffer. This means that the user needs to be careful not to re-use the
> user-space buffer it spliced into the kernel-space one (contrast this
> to "write()", which copies the actual data, and you can thus re-use the
> buffer immediately after a successful write), but that is often easy to
> do.

This I already did, it was pretty easy and straightforward. I'll post
it soonish.

--
Jens Axboe

Linus Torvalds

Apr 20, 2006, 11:32:34 AM

to Jens Axboe, Diego Calleja, linux-...@vger.kernel.org

On Thu, 20 Apr 2006, Jens Axboe wrote:
>
> > - an ioctl/fcntl to set the maximum size of the buffer. Right now it's
> > hardcoded to 16 "buffer entries" (which in turn are normally limited to
> > one page each, although there's nothing that _requires_ that a buffer
> > entry always be a page).
>
> This is on a TODO, but not very high up since I've yet to see a case
> where the current 16 page limitation is an issue. I'm sure something
> will come up eventually, but until then I'd rather not bother.

The real reason for limiting the number of buffer entries is not to make
the number _larger_ (although that can be a performance optimization), but
to make it _smaller_ or at least knowing/limiting how big it is.

It doesn't matter with the current interfaces which are mostly agnostic as
to how big the buffer is, but it _does_ matter with vmsplice().

Why?

Simple: for a vmsplice() user, it's very important to know when they can
start re-using the buffer(s) that they used vmsplice() on previously. And
while the user could just ask the kernel how many bytes are left in the
pipe buffer, that's pretty inefficient for many normal streaming cases.

The _efficient_ way is to make the user-space buffer that you use for
splicing information to another entity a circular buffer that is at least
as large as any of the splice pipes involved in the transfer (depending on
use. In many cases, you will probably want to make the user-space buffer
_twice_ as big as the kernel buffer, which makes the tracking even easier:
while half of the buffer is busy, you can write to the half that is
guaranteed to not be in the kernel buffer, so you effectively do "double
buffering").

So if you do that, then you can continue to write to the buffer without
ever worrying about re-use, because you know that by the time you wrap
around, the kernel buffer will have been flushed out, or the vmsplice()
would have blocked, waiting for the receiver. So now you no longer need to
worry about "how much has flushed" - you only need to worry about doing
the vmsplice() call at least twice per buffer traversal (assuming the
"user buffer is double the size of the kernel buffer" approach).

So you could do a very efficient "stdio-like" implementation for logging,
for example, since this allows you to re-use the same pages over and over
for splicing, without ever having any copying overhead, and without ever
having to play VM-related games (ie you don't need to do unmaps or
mprotects or anything expensive like that in order to get a new page or
something).

But in order to do that, you really do need to know (and preferably set)
the size of the splice buffer. Otherwise, if the in-kernel splice buffer
is larger than the circular buffer you use in user space, the kernel will
add the same page _twice_ to the buffer, and you'll overwrite the data
that you already spliced.
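
Editor's sketch of the double-buffering layout described above: a page-aligned user buffer twice the size of the pipe, with the writer alternating between the two halves. This is an illustration under assumptions, not Jens's patch. It relies on the vmsplice() call and the F_GETPIPE_SZ query as they exist in later mainline kernels, and the names `vmsplice_ring_demo` and `drain_and_check` are made up. To stay single-process and deterministic, the demo drains the pipe itself after each vmsplice(); in real use the reader would be another process and a blocking vmsplice() would provide the back-pressure.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <sys/uio.h>
#include <unistd.h>

/* Read exactly `len` bytes from fd and verify each equals `expect`. */
static int drain_and_check(int fd, size_t len, char expect)
{
    char buf[4096];

    while (len > 0) {
        ssize_t n = read(fd, buf, len < sizeof buf ? len : sizeof buf);
        if (n <= 0)
            return -1;
        for (ssize_t i = 0; i < n; i++)
            if (buf[i] != expect)
                return -1;
        len -= (size_t)n;
    }
    return 0;
}

/* Hypothetical self-check: four rounds over a ring of two pipe-sized halves. */
int vmsplice_ring_demo(void)
{
    int pipefd[2];
    void *mem = NULL;
    char *ring;
    int rc = -1;

    if (pipe(pipefd) < 0)
        return -1;
    int pipe_sz = fcntl(pipefd[1], F_GETPIPE_SZ);   /* kernel buffer size */
    long page = sysconf(_SC_PAGESIZE);
    assert(pipe_sz > 0 && page > 0);
    size_t half = (size_t)pipe_sz;                  /* user ring = 2 * pipe */
    if (posix_memalign(&mem, (size_t)page, 2 * half) != 0) {
        mem = NULL;
        goto out;
    }
    ring = mem;

    for (int round = 0; round < 4; round++) {
        char *cur = ring + (size_t)(round % 2) * half;  /* the idle half */
        char fill = (char)('A' + round);
        size_t off = 0;

        memset(cur, fill, half);    /* safe: this half was drained earlier */
        while (off < half) {
            struct iovec iov = { .iov_base = cur + off, .iov_len = half - off };
            ssize_t n = vmsplice(pipefd[1], &iov, 1, 0);
            if (n <= 0)
                goto out;
            if (drain_and_check(pipefd[0], (size_t)n, fill) < 0)
                goto out;
            off += (size_t)n;
        }
    }
    rc = 0;
out:
    free(mem);
    close(pipefd[0]);
    close(pipefd[1]);
    return rc;
}
```

The point of sizing the ring from the pipe's own capacity is the one made above: if the ring were smaller than the kernel buffer, the same page could be queued twice and overwritten before the first copy was consumed.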

(Now, you still need to be very careful with vmsplice() in general, since
it leaves the data page writable in the source VM and thus allows for all
kinds of confusion, but the theory here is "give them rope". Rope enough
to do clever things always ends up being rope enough to hang yourself too.
Tough.).

Linus

Ingo Oeser

Apr 20, 2006, 12:27:03 PM

to Linus Torvalds, Diego Calleja, linux-...@vger.kernel.org, Jens Axboe

Hi,

On Wednesday, 19. April 2006 20:44, Linus Torvalds wrote:
> Or, rather than just a boring socket->socket forwarding, you could, for
> example, forward data that comes from a MPEG-4 hardware encoder, and tee()
> it to duplicate the stream, and write one of the streams to disk, and the
> other one to a socket for a real-time broadcast. Again, all without
> actually physically copying it around in memory.

Yes! That's what I've been after for some time now.

Thanks everyone.

Regards

Ingo Oeser

Linh Dang

Apr 20, 2006, 2:42:55 PM

to linux-...@vger.kernel.org

Jens Axboe <ax...@suse.de> wrote:

> On Wed, Apr 19 2006, Linus Torvalds wrote:
>> There are some other buffer management system calls that I haven't
>> done yet (and when I say "I haven't done yet", I obviously mean
>> "that I hope some other sucker will do for me, since I'm lazy"),
>> but that are obvious future extensions:
>
> Well it's worked so far, hasn't it? :-)
>
>> - an ioctl/fcntl to set the maximum size of the buffer. Right now
>> it's
>> hardcoded to 16 "buffer entries" (which in turn are normally limited to
>> one page each, although there's nothing that _requires_ that a buffer
>> entry always be a page).
>
> This is on a TODO, but not very high up since I've yet to see a case
> where the current 16 page limitation is an issue. I'm sure something
> will come up eventually, but until then I'd rather not bother.

DVD burning! Splicing those huge VOB files into the DVD device would
be nice. And believe me, the current 16 entries of the pipe are nowhere
near enough to sustain burning at 8X avg speed or higher.

It's a special case but it'd benefit a LOT of ppl ;-)

--
Linh Dang

Jens Axboe

Apr 20, 2006, 3:19:46 PM

to Linus Torvalds, Diego Calleja, linux-...@vger.kernel.org

Good point, as you can tell I had other uses in mind for this. I'd
prefer using fcntl for this instead of an ioctl - how about a set of
matching F_SETPIPESZ/F_GETPIPESZ or something in that order? Right now
we can just -EINVAL stub the pipe size setting, but really implement the
pipe size getting.
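
Editor's sketch: the names F_SETPIPESZ/F_GETPIPESZ above are a proposal. Mainline Linux later exposed this pair as F_SETPIPE_SZ and F_GETPIPE_SZ (from 2.6.35, as far as I know), and the example below assumes those. The wrapper names `pipe_capacity`, `pipe_set_capacity` and `pipe_size_demo` are hypothetical.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

/* Current capacity of the pipe in bytes, or -1 on error. */
int pipe_capacity(int fd)
{
    return fcntl(fd, F_GETPIPE_SZ);
}

/* Ask for a capacity of at least `bytes`; returns the capacity granted, or -1. */
int pipe_set_capacity(int fd, int bytes)
{
    return fcntl(fd, F_SETPIPE_SZ, bytes);
}

/* Hypothetical self-check: query the size, try to double it, query again. */
int pipe_size_demo(void)
{
    int pipefd[2];
    int before, granted;
    int rc = -1;

    if (pipe(pipefd) < 0)
        return -1;
    before = pipe_capacity(pipefd[1]);
    assert(before > 0);
    granted = pipe_set_capacity(pipefd[1], 2 * before);
    /*
     * Growing may be refused (system limits, privileges). When it succeeds,
     * the kernel may round up, and the getter must report the granted size.
     */
    if (granted < 0 ||
        (granted >= 2 * before && pipe_capacity(pipefd[1]) == granted))
        rc = 0;
    close(pipefd[0]);
    close(pipefd[1]);
    return rc;
}
```

A caller should always use the returned capacity rather than the requested one when sizing a user-space ring around the pipe.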

Other suggestions?

The vmsplice addition itself is pretty slim:

splice.c | 163 +++++++++++++++++++++++++++++++++++++++++++++++++++------------
1 file changed, 132 insertions(+), 31 deletions(-)

> (Now, you still need to be very careful with vmsplice() in general, since
> it leaves the data page writable in the source VM and thus allows for all
> kinds of confusion, but the theory here is "give them rope". Rope enough
> to do clever things always ends up being rope enough to hang yourself too.
> Tough.).

Oh definitely :-)

--
Jens Axboe

David S. Miller

Apr 20, 2006, 3:27:22 PM

to ax...@suse.de, torv...@osdl.org, die...@gmail.com, linux-...@vger.kernel.org

From: Jens Axboe <ax...@suse.de>
Date: Thu, 20 Apr 2006 16:50:42 +0200

> On Wed, Apr 19 2006, Linus Torvalds wrote:
> > - vmsplice() system call to basically do a "write to the buffer", but
> > using the reference counting and VM traversal to actually fill the
> > buffer. This means that the user needs to be careful not to re-use the
> > user-space buffer it spliced into the kernel-space one (contrast this
> > to "write()", which copies the actual data, and you can thus re-use the
> > buffer immediately after a successful write), but that is often easy to
> > do.
>
> This I already did, it was pretty easy and straight forward. I'll post
> it soonish.

Do we plan to do vmsplice() to sockets? That's interesting, but
requires some serious cooperation from things like TCP so that
the indication of "buffer can be reused now, thanks" is explicit
and indicated as soon as ACK's come back for those parts of the
data stream.

Even UDP would need to wait until the card is done with transmit,
and we have DCCP and SCTP too.

People would want to be able to get event notifications of this,
or do we plan to just block? Blocking could be problematic,
performance wise.

Anyways, I'm just stabbing in the dark. It would be useful, because
there is no real clean way to use sendfile() for zero copy of anonymous
user data, and this vmsplice() thing seems like it could bridge that
gap if we do it right.

Jens Axboe

Apr 20, 2006, 3:34:37 PM

to David S. Miller, torv...@osdl.org, die...@gmail.com, linux-...@vger.kernel.org

On Thu, Apr 20 2006, David S. Miller wrote:
> From: Jens Axboe <ax...@suse.de>
> Date: Thu, 20 Apr 2006 16:50:42 +0200
>
> > On Wed, Apr 19 2006, Linus Torvalds wrote:
> > > - vmsplice() system call to basically do a "write to the buffer", but
> > > using the reference counting and VM traversal to actually fill the
> > > buffer. This means that the user needs to be careful not to re-use the
> > > user-space buffer it spliced into the kernel-space one (contrast this
> > > to "write()", which copies the actual data, and you can thus re-use the
> > > buffer immediately after a successful write), but that is often easy to
> > > do.
> >
> > This I already did, it was pretty easy and straightforward. I'll post
> > it soonish.
>
> Do we plan to do vmsplice() to sockets? That's interesting, but
> requires some serious cooperation from things like TCP so that
> the indication of "buffer can be reused now, thanks" is explicit
> and indicated as soon as ACK's come back for those parts of the
> data stream.

vmsplice() really just fills the pipe with the user data, at least that
is how I implemented it. Then you'd use splice to actually splice that
pipe to a socket, for instance.
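
As a rough illustration of the two-step flow described above (vmsplice() to fill a pipe, then splice() to move the pipe contents onward), here is a minimal user-space sketch. The helper name send_user_buffer() and its error convention are made up for illustration, and the calls use the vmsplice()/splice() prototypes from current glibc, which may not match the patches under discussion. In the case discussed here out_fd would be a connected socket; any descriptor splice() can write to will do.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE            /* exposes vmsplice() and splice() in <fcntl.h> */
#endif
#include <assert.h>
#include <fcntl.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/*
 * Hypothetical helper: send `len` bytes at `buf` to `out_fd` without
 * copying them through write(). Step 1 references the user pages from a
 * pipe with vmsplice(); step 2 moves the pipe contents to out_fd with
 * splice(). Returns 0 on success, -1 on error.
 */
static int send_user_buffer(int out_fd, const void *buf, size_t len)
{
    int pfd[2];
    const char *p = buf;
    int rc = -1;

    assert(len > 0);
    if (pipe(pfd) < 0)
        return -1;

    while (len > 0) {
        struct iovec iov = { .iov_base = (void *)p, .iov_len = len };

        /* Step 1: queue as much as currently fits in the pipe. */
        ssize_t in_pipe = vmsplice(pfd[1], &iov, 1, 0);
        if (in_pipe <= 0)
            goto out;
        p += in_pipe;
        len -= (size_t)in_pipe;

        /* Step 2: drain exactly what was queued into the destination. */
        while (in_pipe > 0) {
            ssize_t moved = splice(pfd[0], NULL, out_fd, NULL,
                                   (size_t)in_pipe, SPLICE_F_MOVE);
            if (moved <= 0)
                goto out;
            in_pipe -= moved;
        }
    }
    rc = 0;
out:
    close(pfd[0]);
    close(pfd[1]);
    return rc;
}
```

The sketch does nothing to tell the caller when the pages are no longer referenced, so the caller still must not reuse buf until it knows by some other means that the consumer is done with it.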

> Even UDP would need to wait until the card is done with transmit,
> and we have DCCP and SCTP too.
>
> People would want to be able to get event notifications of this,
> or do we plan to just block? Blocking could be problematic,
> performance wise.
>
> Anyways, I'm just stabbing in the dark. It would be useful, because
> there is no real clean way to use sendfile() for zero copy of anonymous
> user data, and this vmsplice() thing seems like it could bridge that
> gap if we do it right.

It should be able to, yes. Seems to me it should just work like regular
splicing, with the difference that you'd have to wait for the reference
count to drop before reusing. One way would be to do as Linus suggests
and make the vmsplice call block or just return -EAGAIN if we are not
ready yet. With that pollable, that should suffice?

--
Jens Axboe

David S. Miller

Apr 20, 2006, 3:40:24 PM

to ax...@suse.de, torv...@osdl.org, die...@gmail.com, linux-...@vger.kernel.org

From: Jens Axboe <ax...@suse.de>
Date: Thu, 20 Apr 2006 21:34:31 +0200

> It should be able to, yes. Seems to me it should just work like regular
> splicing, with the difference that you'd have to wait for the reference
> count to drop before reusing. One way would be to do as Linus suggests
> and make the vmsplice call block or just return -EAGAIN if we are not
> ready yet. With that pollable, that should suffice?

Yes.

We really can't block on this, but I guess we could consider allowing
that for really dumb applications.

It does indeed require some smarts in the application to field the
events, but by definition of using this splice stuff there is explicit
knowledge in the application of what's going on.

This is why I'm very hesitant to say "yeah, blocking on the socket is
OK", because to be honest it's not. As long as the socket buffer
limits haven't been reached, we really shouldn't block so the user can
go and do more work and create more transmit data in time to keep the
network pipe full.

Jens Axboe

Apr 20, 2006, 3:44:27 PM

to David S. Miller, torv...@osdl.org, die...@gmail.com, linux-...@vger.kernel.org

On Thu, Apr 20 2006, David S. Miller wrote:

> From: Jens Axboe <ax...@suse.de>
> Date: Thu, 20 Apr 2006 21:34:31 +0200
>
> > It should be able to, yes. Seems to me it should just work like regular
> > splicing, with the difference that you'd have to wait for the reference
> > count to drop before reusing. One way would be to do as Linus suggests
> > and make the vmsplice call block or just return -EAGAIN if we are not
> > ready yet. With that pollable, that should suffice?
>
> Yes.
>
> We really can't block on this, but I guess we could consider allowing
> that for really dumb applications.

It's up to the user; any non-dumb app would use SPLICE_F_NONBLOCK and
avoid blocking, of course.
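
A minimal sketch of what the non-blocking pattern could look like from user space, assuming the vmsplice() prototype and SPLICE_F_NONBLOCK flag as exposed by current glibc. The helper names try_vmsplice() and wait_for_pipe_space() are hypothetical. The idea is that a full pipe shows up as EAGAIN rather than a sleep, and the application can poll() the pipe for POLLOUT when it has nothing better to do.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE            /* exposes vmsplice() and SPLICE_F_NONBLOCK */
#endif
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <poll.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/*
 * Hypothetical helper: try to queue `len` bytes at `buf` into the pipe
 * without ever sleeping inside vmsplice(). Returns the number of bytes
 * queued (possibly fewer than `len`), 0 if the pipe is full right now,
 * or -1 on any other error.
 */
static ssize_t try_vmsplice(int pipe_wr, const void *buf, size_t len)
{
    struct iovec iov = { .iov_base = (void *)buf, .iov_len = len };
    ssize_t n;

    assert(len > 0);
    n = vmsplice(pipe_wr, &iov, 1, SPLICE_F_NONBLOCK);
    if (n < 0 && errno == EAGAIN)
        return 0;               /* pipe full: do other work, retry later */
    return n;
}

/*
 * Wait until the pipe has room again or `timeout_ms` elapses.
 * Returns 1 if writable, 0 on timeout, -1 on error.
 */
static int wait_for_pipe_space(int pipe_wr, int timeout_ms)
{
    struct pollfd pfd = { .fd = pipe_wr, .events = POLLOUT };
    int rc = poll(&pfd, 1, timeout_ms);

    if (rc < 0)
        return -1;
    return (rc > 0 && (pfd.revents & POLLOUT)) ? 1 : 0;
}
```

An event loop would call try_vmsplice() whenever it has data, and fall back to wait_for_pipe_space() (or add the pipe to its existing poll set) only when the call reports a full pipe.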

> It does indeed require some smarts in the application to field the
> events, but by definition of using this splice stuff there is explicit
> knowledge in the application of what's going on.

Exactly.

> This is why I'm very hesitant to say "yeah, blocking on the socket is
> OK", because to be honest it's not. As long as the socket buffer
> limits haven't been reached, we really shouldn't block so the user can
> go and do more work and create more transmit data in time to keep the
> network pipe full.

I'll post what I have tomorrow, let's take it from there.

--
Jens Axboe

Jens Axboe

Apr 20, 2006, 3:49:04 PM

to Linh Dang, linux-...@vger.kernel.org

On Thu, Apr 20 2006, Linh Dang wrote:
> Jens Axboe <ax...@suse.de> wrote:
>
> > On Wed, Apr 19 2006, Linus Torvalds wrote:
> >> There are some other buffer management system calls that I haven't
> >> done yet (and when I say "I haven't done yet", I obviously mean
> >> "that I hope some other sucker will do for me, since I'm lazy"),
> >> but that are obvious future extensions:
> >
> > Well it's worked so far, hasn't it? :-)
> >
> >> - an ioctl/fcntl to set the maximum size of the buffer. Right now
> >> it's
> >> hardcoded to 16 "buffer entries" (which in turn are normally limited to
> >> one page each, although there's nothing that _requires_ that a buffer
> >> entry always be a page).
> >
> > This is on a TODO, but not very high up since I've yet to see a case
> > where the current 16 page limitation is an issue. I'm sure something
> > will come up eventually, but until then I'd rather not bother.
>
> DVD burning! splicing those huge VOB files into the dvd device would
> be nice. And believe me, the current 16 entries of the pipe are nowhere
> near enough to sustain burning at 8X avg speed or higher.
>
> It's a special case but it'd benefit a LOT of ppl ;-)

(don't drop the cc list)

DVD burning probably isn't a good splice fit, since you need to do more
than actually just point the device at the data. SG_IO is already
zero-copy as it maps the user data into the kernel without copying, so
there's very little room for improvement there to begin with.

--
Jens Axboe

bjd

Apr 20, 2006, 3:53:24 PM

to linux-...@vger.kernel.org

Isn't it time this interesting thread got a more suitable
subject?

bjd

Linh Dang

Apr 20, 2006, 3:59:24 PM

to Jens Axboe, linux-...@vger.kernel.org

Jens Axboe <ax...@suse.de> wrote:

DVD burning on linux is mostly:

mkisofs .... | growisofs ....

Ideally, on mkisofs side, we'd be able to:

- write some data/padding into the pipe
- splice a HUGE file into the pipe
- write some data/padding into the pipe
- splice a HUGE file into the pipe
...

On growisofs side, we'd be able to:

- send some commands
- splice N MBs of data from the pipe to the driver
- send some commands
- splice M MBs of data from the pipe to the driver
...

What'd be nice is an ioctl to change the size of the pipe between
mkisofs and growisofs.
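
The producer side outlined above (some generated bytes, then a big file, repeated) could be sketched roughly as follows. This is not mkisofs code; emit_header_and_file() and write_all() are hypothetical helpers, and the splice() prototype is the one in current glibc. pipe_wr is assumed to be the write end of a pipe, for example stdout in a "producer | consumer" pipeline.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE            /* exposes splice() in <fcntl.h> */
#endif
#include <assert.h>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/* Write all of `buf` to `fd`, retrying on short writes. */
static int write_all(int fd, const void *buf, size_t len)
{
    const char *p = buf;

    while (len > 0) {
        ssize_t n = write(fd, p, len);
        if (n < 0)
            return -1;
        p += n;
        len -= (size_t)n;
    }
    return 0;
}

/*
 * Hypothetical producer step: emit a small generated header with write(),
 * then hand `file_len` bytes of `file_fd` to the pipe with splice() so the
 * file data is never copied into user space.
 * Returns 0 on success, -1 on error or unexpected end of file.
 */
static int emit_header_and_file(int pipe_wr, const void *hdr, size_t hdr_len,
                                int file_fd, size_t file_len)
{
    assert(hdr_len > 0 && file_len > 0);
    if (write_all(pipe_wr, hdr, hdr_len) < 0)
        return -1;

    while (file_len > 0) {
        /* NULL offset: read from, and advance, the file's own position. */
        ssize_t n = splice(file_fd, NULL, pipe_wr, NULL, file_len,
                           SPLICE_F_MOVE);
        if (n <= 0)
            return -1;
        file_len -= (size_t)n;
    }
    return 0;
}
```

With a blocking pipe, each splice() call simply sleeps until the consumer has drained enough of the pipe, so the pipe size bounds how far the producer can run ahead rather than how much data can be sent.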

--
Linh Dang

Jens Axboe

Apr 20, 2006, 4:08:41 PM

to Linh Dang, linux-...@vger.kernel.org

On the mkisofs side you have a good point, splice/vmsplice could be
really useful there! I was too narrowly thinking of burning already made
iso files which is easier of course. You'd need to invent a new way to
do SG_IO with a pipe buffer, but that's really implementation detail.
The mkisofs part is already doable with the current code, with the size
restriction naturally.

> What'd be nice is an ioctl to change the size of the pipe between
> mkisofs and growisofs.

Yes fully agree. It is something that will be done eventually, but since
it requires redoing pipe_inode_info bufs[] it is a little invasive on
fs/pipe.c. You could even allow it to grow dynamically, lots of
possibilities...
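
For readers on later kernels: no such knob exists in the code discussed in this thread, but Linux 2.6.35 and newer expose pipe resizing through fcntl() with F_SETPIPE_SZ and F_GETPIPE_SZ. A minimal sketch, with resize_pipe() as a hypothetical wrapper name:

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE            /* exposes F_SETPIPE_SZ / F_GETPIPE_SZ */
#endif
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * Hypothetical helper: ask the kernel to resize the pipe behind `pipe_fd`
 * to at least `want` bytes. The kernel may round the size up, and
 * unprivileged callers are capped by /proc/sys/fs/pipe-max-size.
 * Returns the capacity actually in effect, or -1 on error.
 */
static int resize_pipe(int pipe_fd, int want)
{
    assert(want > 0);
    if (fcntl(pipe_fd, F_SETPIPE_SZ, want) < 0)
        return -1;
    return fcntl(pipe_fd, F_GETPIPE_SZ);
}
```

Either end of the pipe can be passed in, so a consumer such as the growisofs side could resize its stdin without any cooperation from the producer.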

--
Jens Axboe

Piet Delaney

Apr 20, 2006, 5:37:45 PM

to Jens Axboe, Piet Delaney, David S. Miller, torv...@osdl.org, die...@gmail.com, linux-...@vger.kernel.org

On Thu, 2006-04-20 at 21:34 +0200, Jens Axboe wrote:

> > Anyways, I'm just stabbing in the dark. It would be useful, because
> > there is no real clean way to use sendfile() for zero copy of anonymous
> > user data, and this vmsplice() thing seems like it could bridge that
> > gap if we do it right.
>
> It should be able to, yes. Seems to me it should just work like regular
> splicing, with the difference that you'd have to wait for the reference
> count to drop before reusing. One way would be to do as Linus suggests
> and make the vmsplice call block or just return -EAGAIN if we are not
> ready yet. With that pollable, that should suffice?

What about marking the pages Read-Only while it's being used by the
kernel and if the user tries to write into them letting the VM dup
the page with the COW code? Often you can use a FILO memory allocator
in user space to minimize the odds of trying to reuse the page while
the kernel is using it.

FreeBSD folks developed a ZERO_COPY_SOCKET facility that uses COW;
code looked great.

-piet

--
---
pi...@bluelane.com

Linus Torvalds

Apr 20, 2006, 6:21:12 PM

to Piet Delaney, Jens Axboe, David S. Miller, die...@gmail.com, linux-...@vger.kernel.org

On Thu, 20 Apr 2006, Piet Delaney wrote:
>
> What about marking the pages Read-Only while it's being used by the
> kernel

NO!

That's a huge mistake, and anybody that does it that way (FreeBSD) is
totally incompetent.

Once you play games with page tables, you are generally better off copying
the data. The cost of doing page table updates and the associated TLB
invalidates is simply not worth it, both from a performance standpoint and
a complexity standpoint.

Basically, if you want the highest possible performance, you do not want
to do TLB invalidates. And if you _don't_ want the highest possible
performance, you should just use regular write(), which is actually good
enough for most uses, and is portable and easy.

The thing is, the cost of marking things COW is not just the cost of the
initial page table invalidate: it's also the cost of the fault eventually
when you _do_ write to the page, even if at that point you decide that the
page is no longer shared, and the fault can just mark the page writable
again.

That cost is _bigger_ than the cost of just copying the page in the first
place.

The COW approach does generate some really nice benchmark numbers, because
the way you benchmark this thing is that you never actually write to the
user page in the first place, so you end up having a nice benchmark loop
that has to do the TLB invalidate just the _first_ time, and never has to
do any work ever again later on.

But you do have to realize that that is _purely_ a benchmark load. It has
absolutely _zero_ relevance to any real life. Zero. Nada. None. In real
life, COW-faulting overhead is expensive. In real life, TLB invalidates
(with a threaded program, and all users of this had better be threaded, or
they are leaving more performance on the floor) are expensive.

I claim that Mach people (and apparently FreeBSD) are incompetent idiots.
Playing games with VM is bad. memory copies are _also_ bad, but quite
frankly, memory copies often have _less_ downside than VM games, and
bigger caches will only continue to drive that point home.

Linus

Piet Delaney

Apr 20, 2006, 7:40:06 PM

to Linus Torvalds, Piet Delaney, Jens Axboe, David S. Miller, die...@gmail.com, linux-...@vger.kernel.org

On Thu, 2006-04-20 at 15:20 -0700, Linus Torvalds wrote:
>
> On Thu, 20 Apr 2006, Piet Delaney wrote:
> >
> > What about marking the pages Read-Only while it's being used by the
> > kernel
>
> NO!
>
> That's a huge mistake, and anybody that does it that way (FreeBSD) is
> totally incompetent.

Yea, we're not using it either.

>
> Once you play games with page tables, you are generally better off copying
> the data. The cost of doing page table updates and the associated TLB
> invalidates is simply not worth it, both from a performance standpoint and
> a complexity standpoint.

I once wrote some code to find the PTE entries for user buffers;
and as I recall the code was only about 20 lines of code. I thought
only a small part of the TLB had to be invalidated. I never tested
or profiled it and didn't consider the multi-threading issues.

Instead of COW, I just returned information in recvmsg control
structure indicating that the buffer wasn't being used by the kernel
any longer.

I kept the list of pages involved in the zero copy in a structure
and when the kernel was done with the pages it decremented the page
count via a callback, similar to what yzy <y...@clusterfs.com> discussed
two weeks ago on the linux-net mailing list.

I thought this structure could have pointers to the PTE's and
mmu context to clear the PTE entries. Unfortunately it gets
messy if the zero copies overlap onto a shared page.

I didn't study the BSD implementation well enough to appreciate
how their COW implementation worked.

>
> Basically, if you want the highest possible performance, you do not want
> to do TLB invalidates. And if you _don't_ want the highest possible
> performance, you should just use regular write(), which is actually good
> enough for most uses, and is portable and easy.

We use a zero copy, and also don't mess with the TLB. In our application
99.99% of the data is looked at but not modified (we are looking through
TCP streams for security exploits).

>
> The thing is, the cost of marking things COW is not just the cost of the
> initial page table invalidate: it's also the cost of the fault eventually
> when you _do_ write to the page, even if at that point you decide that the
> page is no longer shared, and the fault can just mark the page writable
> again.

Right, it's difficult for the kernel code to change the involved PTEs
when it's done with a page. Then flushing the TLBs of the involved CPUs
adds to the problem.

>
> That cost is _bigger_ than the cost of just copying the page in the first
> place.
>
> The COW approach does generate some really nice benchmark numbers, because
> the way you benchmark this thing is that you never actually write to the
> user page in the first place, so you end up having a nice benchmark loop
> that has to do the TLB invalidate just the _first_ time, and never has to
> do any work ever again later on.
>
> But you do have to realize that that is _purely_ a benchmark load. It has
> absolutely _zero_ relevance to any real life. Zero. Nada. None. In real
> life, COW-faulting overhead is expensive. In real life, TLB invalidates
> (with a threaded program, and all users of this had better be threaded, or
> they are leaving more performance on the floor) are expensive.

Yeah, you're right, the multi-threading is a real problem:
you would have to send an interrupt with information about which part
of the TLB needs to be invalidated to each CPU.

>
> I claim that Mach people (and apparently FreeBSD) are incompetent idiots.
> Playing games with VM is bad. memory copies are _also_ bad, but quite
> frankly, memory copies often have _less_ downside than VM games, and
> bigger caches will only continue to drive that point home.

Yep, both of the zero copy implementations that I've worked on have
used non-VM techniques to synchronize socket buffer state between the
kernel and user space.

-piet

>
> Linus
--
---
pi...@bluelane.com

Linus Torvalds

Apr 20, 2006, 8:09:57 PM

to Piet Delaney, Jens Axboe, David S. Miller, die...@gmail.com, linux-...@vger.kernel.org

On Thu, 20 Apr 2006, Piet Delaney wrote:
>
> I once wrote some code to find the PTE entries for user buffers;
> and as I recall the code was only about 20 lines of code. I thought
> only a small part of the TLB had to be invalidated. I never tested
> or profiled it and didn't consider the multi-threading issues.

Looking up the page table entry is fairly quick, and is definitely worth
it. It's usually just a few memory loads, and it may even be cached. So
that part of the "VM tricks" is fine.

The cost comes when you modify it. Part of it is the initial TLB
invalidate cost, but that actually tends to be the smaller part (although
it can be pretty steep already, if you have to do a cross-CPU invalidate:
that alone may already have taken more time than it would to just do a
straightforward copy).

The bigger part tends to be that any COW approach will obviously have to
be undone later, usually when the user writes to the page. Even if (by the
time the fault is taken) the page is no longer shared, and undoing the COW
is just a matter of touching the page tables again, just the cost of
taking the fault is easily thousands of cycles.

At which point the optimization is very debatable indeed. If the COW
actually causes a real copy and a new page to be allocated, you've lost
everything, and you're solidly in "that sucks" territory.
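
To get a feel for the trade-off on a given machine, a crude user-space micro-benchmark can compare copying one page against flipping its protection with mprotect(). This is only a stand-in: mprotect() exercises the page table update and TLB invalidation but also carries system call and VMA bookkeeping overhead, and it does not reproduce a kernel COW path or the later write fault. The function name copy_vs_protect() is hypothetical, and no particular result is claimed.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE            /* exposes MAP_ANONYMOUS */
#endif
#include <assert.h>
#include <string.h>
#include <sys/mman.h>
#include <time.h>
#include <unistd.h>

static double now_ns(void)
{
    struct timespec ts;

    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (double)ts.tv_sec * 1e9 + (double)ts.tv_nsec;
}

/*
 * Hypothetical micro-benchmark. Stores in out[0] the average cost in ns of
 * copying one page, and in out[1] the average cost of write-protecting and
 * then un-protecting one page. Returns 0 on success, -1 on error.
 */
static int copy_vs_protect(int iters, double out[2])
{
    size_t page = (size_t)sysconf(_SC_PAGESIZE);
    char *src, *dst;
    double t0, t1, t2;
    int i;

    assert(iters > 0);
    src = mmap(NULL, 2 * page, PROT_READ | PROT_WRITE,
               MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (src == MAP_FAILED)
        return -1;
    dst = src + page;
    memset(src, 0xab, 2 * page);        /* fault both pages in up front */

    t0 = now_ns();
    for (i = 0; i < iters; i++) {
        src[0] = (char)i;               /* make every copy distinct */
        memcpy(dst, src, page);
        __asm__ volatile("" ::: "memory");  /* keep the copy in the loop */
    }
    t1 = now_ns();
    for (i = 0; i < iters; i++) {
        if (mprotect(src, page, PROT_READ) < 0 ||
            mprotect(src, page, PROT_READ | PROT_WRITE) < 0) {
            munmap(src, 2 * page);
            return -1;
        }
        src[0] = (char)i;               /* touch the page after each flip */
    }
    t2 = now_ns();

    out[0] = (t1 - t0) / iters;
    out[1] = (t2 - t1) / iters;
    munmap(src, 2 * page);
    return 0;
}
```

Running it single-threaded understates the protection cost on SMP, since no cross-CPU TLB invalidate is needed when only one thread uses the address space.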

> Instead of COW, I just returned information in recvmsg control
> structure indicating that the buffer wasn't being used by the kernel
> any longer.

That is very close to what I propose with vmsplice(), and yes, once you
avoid the COW, it's a clear win to just look up the page in the page
tables and increment a usage count.

So basically:

- just looking up the page is cheap, and that's what vmsplice() does
(if people want to actually play with it, Jens now has a vmsplice()
implementation in his "splice" branch in his git tree on
brick.kernel.dk).

It does mean that it's up to the _user_ to not write to the page again
until the page is no longer shared, and there are different approaches
to handling that. Sometimes the answer may even be that synchronization
is done at a much higher level (ie there's some much higher-level
protocol where the other end acknowledges the data).

The fact that it's up to the user obviously means that the user has to
be more careful, but the upside is that you really _do_ get very high
performance. If there are no good synchronization mechanisms, the
answer may well be "don't use vmsplice()", but the point is that if you
_can_ synchronize some other way, vmsplice() runs like a bat out of
hell.

- playing VM games where you actually modify the VM is almost always a
loss. It does have the advantage that the user doesn't have to be aware
of the VM games, but if it means that performance isn't really all that
much better than just a regular "write()" call, what's the point?

I'm of the opinion that we already have robust and user-friendly
interfaces (the regular read()/write()/recvmsg()/sendmsg() interfaces that
are "synchronous" wrt data copies, and that are obviously portable). We've
even optimized them as much as we can, so they actually perform pretty
well.

So there's no point in a half-assed "safe VM" trick with COW, which isn't
all that much faster. Playing tricks with zero-copy only makes sense if
they are a _lot_ faster, and that implies that you cannot do COW. You
really expose the fact that user-space gave a real reference to its own
pages away, and that if user space writes to it, it writes to a buffer
that is already in flight.

(Some users may even be able to take _advantage_ of the fact that the
buffer is "in flight" _and_ mapped into user space after it has been
submitted. You could imagine code that actually goes on modifying the
buffer even while it's being queued for sending. Under some strange
circumstances that may actually be useful, although with things like
checksums that get invalidated by you changing the data while it's queued
up, it may not be acceptable for everything, of course).
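
A small experiment makes the "in flight and still mapped" point concrete. The sketch below queues one page with vmsplice(), overwrites it while it sits in the pipe, and then reads the pipe back. The function name first_byte_after_overwrite() is hypothetical and the vmsplice() prototype is the one from current glibc. If the pipe holds a reference to the user page rather than a copy, the reader should see the overwrite, but that is the behaviour being probed here, not something the sketch guarantees.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE            /* exposes vmsplice() and MAP_ANONYMOUS */
#endif
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/*
 * Hypothetical demonstration: returns the first byte the pipe reader saw
 * ('A' = contents at submit time, 'B' = the later overwrite), or -1 on
 * error.
 */
static int first_byte_after_overwrite(void)
{
    size_t page = (size_t)sysconf(_SC_PAGESIZE);
    struct iovec iov;
    int pfd[2];
    int seen = -1;
    char first;
    char *buf;

    buf = mmap(NULL, page, PROT_READ | PROT_WRITE,
               MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (buf == MAP_FAILED)
        return -1;
    if (pipe(pfd) < 0) {
        munmap(buf, page);
        return -1;
    }

    memset(buf, 'A', page);                 /* contents at submit time */
    iov.iov_base = buf;
    iov.iov_len = page;
    if (vmsplice(pfd[1], &iov, 1, 0) == (ssize_t)page) {
        memset(buf, 'B', page);             /* scribble while queued */
        if (read(pfd[0], &first, 1) == 1)
            seen = (unsigned char)first;
    }

    close(pfd[0]);
    close(pfd[1]);
    munmap(buf, page);
    return seen;
}
```

An application that cannot tolerate the second outcome has to hold off on reusing the buffer until it knows the consumer is finished, which is exactly the synchronization burden described above.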

Linus

David S. Miller

Apr 20, 2006, 8:21:41 PM

to pi...@bluelane.com, ax...@suse.de, torv...@osdl.org, die...@gmail.com, linux-...@vger.kernel.org

From: Piet Delaney <pi...@bluelane.com>
Date: Thu, 20 Apr 2006 14:37:11 -0700

> What about marking the pages Read-Only while it's being used by the
> kernel and if the user tries to write into them letting the VM dup
> the page with the COW code?

That's historically how you kill performance.

David Lang

Apr 20, 2006, 8:27:41 PM

to Linus Torvalds, Piet Delaney, Jens Axboe, David S. Miller, die...@gmail.com, linux-...@vger.kernel.org

On Thu, 20 Apr 2006, Linus Torvalds wrote:

> (Some users may even be able to take _advantage_ of the fact that the
> buffer is "in flight" _and_ mapped into user space after it has been
> submitted. You could imagine code that actually goes on modifying the
> buffer even while it's being queued for sending. Under some strange
> circumstances that may actually be useful, although with things like
> checksums that get invalidated by you changing the data while it's queued
> up, it may not be acceptable for everything, of course).

I could see this in some sort of logging/monitoring situation where you
want the latest data you can possibly get at each write. With the
appropriate care in write ordering you could have one thread update the
buffer continuously and the buffer gets written out periodically; what gets
written is the latest possible info.

Definitely not a common case, but I could see its use in some cases.

David Lang

--
There are two ways of constructing a software design. One way is to make it so simple that there are obviously no deficiencies. And the other way is to make it so complicated that there are no obvious deficiencies.
-- C.A.R. Hoare

David S. Miller

Apr 20, 2006, 8:41:58 PM

to torv...@osdl.org, pi...@bluelane.com, ax...@suse.de, die...@gmail.com, linux-...@vger.kernel.org

From: Linus Torvalds <torv...@osdl.org>
Date: Thu, 20 Apr 2006 15:20:14 -0700 (PDT)

> I claim that Mach people (and apparently FreeBSD) are incompetent idiots.
> Playing games with VM is bad. memory copies are _also_ bad, but quite
> frankly, memory copies often have _less_ downside than VM games, and
> bigger caches will only continue to drive that point home.

Yep.

And as I've pointed out several times on this list already, this
and the research in that area are very well documented in standard
texts such as Network Algorithmics by George Varghese,
particularly in Chapter 5, "Copying Data".

TLB games are dumb, just say no.

David S. Miller

Apr 20, 2006, 8:50:10 PM

to dl...@digitalinsight.com, torv...@osdl.org, pi...@bluelane.com, ax...@suse.de, die...@gmail.com, linux-...@vger.kernel.org

From: David Lang <dl...@digitalinsight.com>
Date: Thu, 20 Apr 2006 16:26:54 -0700 (PDT)

> On Thu, 20 Apr 2006, Linus Torvalds wrote:
>
> > (Some users may even be able to take _advantage_ of the fact that the
> > buffer is "in flight" _and_ mapped into user space after it has been
> > submitted. You could imagine code that actually goes on modifying the
> > buffer even while it's being queued for sending. Under some strange
> > circumstances that may actually be useful, although with things like
> > checksums that get invalidated by you changing the data while it's queued
> > up, it may not be acceptable for everything, of course).
>
> I could see this in some sort of logging/monitoring situation where you
> want the latest data you can possibly get at each write. With the
> appropriate care in write ordering you could have one thread update the
> buffer continuously and the buffer gets written out periodically; what gets
> written is the latest possible info.
>
> Definitely not a common case, but I could see its use in some cases.

And the checksums don't get invalidated, the card computes them
on transmit.

If the card cannot compute them on transmit, we will copy into a
stable kernel buffer, always.

Andi Kleen

Apr 20, 2006, 10:12:34 PM

to pi...@bluelane.com, David S. Miller, torv...@osdl.org, die...@gmail.com, linux-...@vger.kernel.org

Piet Delaney <pi...@bluelane.com> writes:
>
> FreeBSD folks developed a ZERO_COPY_SOCKET facility that uses COW;
> code looked great.

Linux had patches many years ago (in 2.3.x), but it was never merged
because it is inherently unscalable on MP. Classical BSD sockets really
don't work well for zero copy - you need a new interface (like POSIX aio)
that allows the kernel/user to tell each other when use of data is
finished and buffers can be reused.

-Andi

Nick Piggin

Apr 21, 2006, 1:50:11 AM

to Linh Dang, Jens Axboe, linux-...@vger.kernel.org

Linh Dang wrote:
> Jens Axboe <ax...@suse.de> wrote:

>>DVD burning probably isn't a good splice fit, since you need to do
>>more than actually just point the device at the data. SG_IO is
>>already zero-copy as it maps the user data into the kernel without
>>copying, so there's very little room for improvement there to begin
>>with.
>
>
> DVD burning on linux is mostly:
>
> mkisofs .... | growisofs ....
>
> Ideally, on mkisofs side, we'd be able to:
>
> - write some data/padding into the pipe
> - splice a HUGE file into the pipe
> - write some data/padding into the pipe
> - splice a HUGE file into the pipe
> ...
>
> On growisofs side, we'd be able to:
>
> - send some commands
> - splice N MBs of data from the pipe to the driver
> - send some commands
> - splice M MBs of data from the pipe to the driver
> ...
>
> What'd be nice is an ioctl to change the size of the pipe between
> mkisofs and growisofs.

I don't see why the pipe buffers would be a problem though. It isn't
like you've lost any of the pagecache buffering (eg. from readahead)
or the application level buffering.

--
SUSE Labs, Novell Inc.

Nick Piggin

Apr 21, 2006, 2:05:45 AM

to Jens Axboe, David S. Miller, torv...@osdl.org, die...@gmail.com, linux-...@vger.kernel.org

Jens Axboe wrote:

> It's up to the user; any non-dumb app would use SPLICE_F_NONBLOCK and
> avoid blocking, of course.

BTW. How come you don't just set the pipe's fds non blocking
instead of using that flag? Any reason?

--
SUSE Labs, Novell Inc.

Piet Delaney

Apr 21, 2006, 2:47:43 AM

to Andi Kleen, Piet Delaney, David S. Miller, torv...@osdl.org, die...@gmail.com, linux-...@vger.kernel.org

On Fri, 2006-04-21 at 04:05 +0200, Andi Kleen wrote:
> Piet Delaney <pi...@bluelane.com> writes:
> >
> > FreeBSD folks developed a ZERO_COPY_SOCKET facility that uses COW;
> > code looked great.
>
> Linux had patches many years ago (in 2.3.x), but it was never merged
> because it is inherently unscalable on MP. Classical BSD sockets really
> don't work well for zero copy - you need a new interface (like POSIX aio)
> that allows the kernel/user to tell each other when use of data is
> finished and buffers can be reused.

Right, back when I was working on zero copy for 2.4 I noticed that
2.6 seemed to support aio in the socket code, passing the kiocb
pointer as I recall, and support in the socket code for sendpage
seemed enhanced. I was also wondering about using 2.6 and aio for zero
copy instead of tokens via sendmsg() and recvmsg() cmsghdr structures.

-piet

>
> -Andi

--
---
pi...@bluelane.com

Jens Axboe

Apr 21, 2006, 3:53:55 AM

to Nick Piggin, Linh Dang, linux-...@vger.kernel.org

Yes, hence the reason that a larger pipe / dynamic pipe wasn't even
attempted yet. In the tests I did, manually increasing the pipe size
yielded no noticeable benefits.

Conceptually it might be simpler for the mkisofs side to accept larger
in-kernel pipes, but on the performance side I doubt it would matter a
lot. The growisofs side sending data to the drive is not limited by the
64k pipe, typically the commands will be smaller than that anyways. So
"splice N MBs of data from the pipe to the driver" is a nice dream, but
that's not how you talk to the device anyways.

--
Jens Axboe

Alistair John Strachan

Apr 21, 2006, 6:26:35 AM

to Linus Torvalds, Linux Kernel Mailing List

On Wednesday 19 April 2006 04:27, Linus Torvalds wrote:
> Instead of the normal one-week release schedule, there was now two weeks
> between 2.6.17-rc1 and -rc2, partly because I was travelling for one of
> those weeks, but partly because it was really quiet for a while. Likely a
> lot of people are concentrating on 2.6.16 and vendor releases.
>
> It picked up a bit in the last few days (it's also possible that the US
> people were all just stressed out over tax season ;), and I cut a
> 2.6.17-rc2. I expect to be back to the weekly schedule now, even if it is
> quiet (which I hope it will be).
>
> Not a lot of hugely interesting stuff, with a large portion of the diff
> being a late MIPS update (tssk tssk), and the huge diff from the
> long over-due removal of the Sangoma wan drivers that have been marked
> BROKEN for a long time. Same goes for the qlogicfc driver (which has been
> supplanted by the qla2xxx driver).
>
> As a result, the diff has just tons of deletions, even if most of the rest
> of the changes aren't all that big. But there are netfilter fixes, some
> more splice work, and just tons of random stuff: usb, scsi, knfsd, fuse,
> infiniband..

Something in here (or -rc1, I didn't test that) broke WINE. x86-64 kernel,
32bit WINE, works fine on 2.6.16.7. I'll check whether -rc1 had the same
problem and work backwards, but just in case somebody has an idea..

[alistair] 11:17 [~/.wine/drive_c/Program Files/Warcraft III] wine war3.exe -opengl
wine: Unhandled page fault on write access to 0x00495000 at address 0x495000 (thread 0009), starting debugger...
WineDbg starting on pid 0x8
Unhandled exception: page fault on write access to 0x00495000 in 32-bit code (0x00495000).
Register dump:
CS:0023 SS:002b DS:002b ES:002b FS:006b GS:0063
EIP:00495000 ESP:7f9eff0c EBP:7f9effe8 EFLAGS:00010246( - 00 -RIZP1)
EAX:00000000 EBX:7fcb4710 ECX:00400000 EDX:00000000
ESI:7ffdf3a0 EDI:00495000
Stack dump:
0x7f9eff0c: 7fc794de 7ffdf3a0 00000000 00000000
0x7f9eff1c: 00000000 ffffffff 7fc35ff8 7fc4caf0
0x7f9eff2c: 7fcb4710 00400000 7fcaf784 7f9effe8
0x7f9eff3c: 16d2f22f 168b9967 00000001 00000000
0x7f9eff4c: 00000000 00000000 00000000 00000000
0x7f9eff5c: 00000000 00000000 00000000 00000000
Backtrace:
=>1 0x00495000 EntryPoint in war3 (0x00495000)
2 0xf7f763ab wine_switch_to_stack+0x17 in libwine.so.1 (0xf7f763ab)
0x00495000 EntryPoint in war3: pushl %eax

--
Cheers,
Alistair.

Third year Computer Science undergraduate.
1F2 55 South Clerk Street, Edinburgh, UK.

Herbert Poetzl

Apr 21, 2006, 7:02:21 AM

to Linus Torvalds, Linux Kernel Mailing List, Chandra Seetharaman, linu...@oss.sgi.com, xfs-m...@oss.sgi.com, ak...@osdl.org

On Tue, Apr 18, 2006 at 08:27:37PM -0700, Linus Torvalds wrote:
>
> Instead of the normal one-week release schedule, there was now two weeks
> between 2.6.17-rc1 and -rc2, partly because I was travelling for one of
> those weeks, but partly because it was really quiet for a while. Likely a
> lot of people are concentrating on 2.6.16 and vendor releases.

[rest zapped]

here is the 'updated' bug report on the xfs issue which
seems to have been introduced with 2.6.17-rc1

note: 2.6.16.8 does not have this issue

best,
Herbert

Linux (none) 2.6.17-rc2 #1 SMP Fri Apr 21 11:52:19 CEST 2006 i686 unknown

/ # mkfs.xfs -f /dev/hdc1
meta-data=/dev/hdc1              isize=256    agcount=8, agsize=8189 blks
data     =                       bsize=4096   blocks=65512, imaxpct=25
         =                       sunit=0      swidth=0 blks, unwritten=0
naming   =version 2              bsize=4096
log      =internal log           bsize=4096   blocks=1200
realtime =none                   extsz=65536  blocks=0, rtextents=0
/ # mount /dev/hdc1 /mnt/

[ 64.289157] BUG: unable to handle kernel paging request at virtual address c056a680
[ 64.290085] printing eip:
[ 64.290402] c0129290
[ 64.290686] *pde = 005bd027
[ 64.291037] *pte = 0056a000
[ 64.291504] Oops: 0000 [#1]
[ 64.291823] SMP DEBUG_PAGEALLOC
[ 64.292820] Modules linked in:
[ 64.293453] CPU: 0
[ 64.293485] EIP: 0060:[<c0129290>] Not tainted VLI
[ 64.293529] EFLAGS: 00000286 (2.6.17-rc2 #1)
[ 64.295055] EIP is at notifier_chain_register+0x20/0x50
[ 64.295648] eax: c056a678 ebx: cf5e23f8 ecx: 00000000 edx: c04bea9c
[ 64.296362] esi: cf5e23f8 edi: cffc5000 ebp: cf5e2800 esp: cffdad5c
[ 64.297140] ds: 007b es: 007b ss: 0068
[ 64.297613] Process mount (pid: 34, threadinfo=cffda000 task=cff7e570)
[ 64.298258] Stack: <0>c04bea80 c0129454 c04bea9c cf5e23f8 cf5e2000 cf5e2000 c01367f7 c04bea80
[ 64.299558] cf5e23f8 c02d4b26 cf5e23f8 00000404 cf5e2000 cfd1f520 cffc5000 c02d1f53
[ 64.300700] cf5e2000 00000001 c02e65ef 00000424 00000001 cffc5000 cfd1f520 c02f2880
[ 64.301841] Call Trace:
[ 64.302278] <c0129454> blocking_notifier_chain_register+0x54/0x90 <c01367f7> register_cpu_notifier+0x17/0x20
[ 64.303684] <c02d4b26> xfs_icsb_init_counters+0x46/0xb0 <c02d1f53> xfs_mount_init+0x23/0x160
[ 64.304844] <c02e65ef> kmem_zalloc+0x1f/0x50 <c02f2880> bhv_insert_all_vfsops+0x10/0x50
[ 64.305940] <c02f1f65> xfs_fs_fill_super+0x35/0x1f0 <c0313e97> snprintf+0x27/0x30
[ 64.307124] <c01a24f4> disk_name+0x64/0xc0 <c0168f1f> sb_set_blocksize+0x1f/0x50
[ 64.308140] <c0168869> get_sb_bdev+0x109/0x160 <c02f2150> xfs_fs_get_sb+0x30/0x40
[ 64.309129] <c02f1f30> xfs_fs_fill_super+0x0/0x1f0 <c0168b10> do_kern_mount+0xa0/0x160
[ 64.310156] <c0181187> do_new_mount+0x77/0xc0 <c018184f> do_mount+0x1bf/0x230
[ 64.311177] <c03f4a68> iret_exc+0x3d4/0x6ab <c0181633> copy_mount_options+0x63/0xc0
[ 64.312246] <c03f427f> lock_kernel+0x2f/0x50 <c0181c5f> sys_mount+0x9f/0xe0
[ 64.313237] <c0102b27> syscall_call+0x7/0xb
[ 64.313917] Code: 90 90 90 90 90 90 90 90 90 90 90 53 8b 54 24 08 8b 5c 24 0c 8b 02 85 c0 74 31 8b 4b 08 8d b4 26 00 00 00 00 8d bc 27 00 00 00 00 <3b> 48 08 7f 1b 8d 50 04 8b 40 04 85 c0 75 f1 31 c0 eb 0d 90 90
[ 64.318371] EIP: [<c0129290>] notifier_chain_register+0x20/0x50 SS:ESP 0068:cffdad5c

Linux (none) 2.6.16.8 #1 SMP Fri Apr 21 12:45:31 CEST 2006 i686 unknown
/ # mkfs.xfs -f /dev/hdc1
meta-data=/dev/hdc1              isize=256    agcount=8, agsize=8189 blks
data     =                       bsize=4096   blocks=65512, imaxpct=25
         =                       sunit=0      swidth=0 blks, unwritten=0
naming   =version 2              bsize=4096
log      =internal log           bsize=4096   blocks=1200
realtime =none                   extsz=65536  blocks=0, rtextents=0
/ # mount /dev/hdc1 /mnt/
[ 24.627530] XFS mounting filesystem hdc1

Linus Torvalds

Apr 21, 2006, 12:41:25 PM

to Alistair John Strachan, Andi Kleen, Linux Kernel Mailing List

On Fri, 21 Apr 2006, Alistair John Strachan wrote:
>
> Something in here (or -rc1, I didn't test that) broke WINE. x86-64 kernel,
> 32bit WINE, works fine on 2.6.16.7. I'll check whether -rc1 had the same
> problem and work backwards, but just in case somebody has an idea..

Nothing strikes me, but maybe Andi has a clue.

> [alistair] 11:17 [~/.wine/drive_c/Program Files/Warcraft III] wine
> war3.exe -opengl
> wine: Unhandled page fault on write access to 0x00495000 at address 0x495000

..

> Unhandled exception: page fault on write access to 0x00495000 in 32-bit code

That looks bogus. %eip is 0x00495000, and might well have taken a fault,
but it sure ain't a write access. According to the built-in wine debugger
it was

> 0x00495000 EntryPoint in war3: pushl %eax

which does do a write, but to %esp (which is 7f9eff0c according to the
dump, and which is unlikely to have taken a fault, since it's almost 256
bytes off the end of a page in the stack area).

Alistair, if you can do a "git bisect" on this one, that would help.
Unless Andi goes "Duh!".

Linus
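
As a quick check of the arithmetic in Linus's reply above: with 4 KiB pages (an assumption, though it is the usual page size on x86), the reported %esp of 7f9eff0c sits 244 bytes below the end of its page, and the 4-byte store done by pushl lands entirely inside that same page. The helper names below are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_PAGE_SIZE 4096u	/* assumption: 4 KiB pages */

/* Bytes from addr up to the end of the page that contains it. */
static uint32_t bytes_to_page_end(uint32_t addr)
{
	return DEMO_PAGE_SIZE - (addr & (DEMO_PAGE_SIZE - 1));
}

/*
 * Nonzero if a 4-byte push starting from the given %esp stays inside one
 * page. pushl first decrements %esp by 4, then stores to the new %esp,
 * so the bytes written are [esp - 4, esp - 1].
 */
static int push_stays_in_page(uint32_t esp)
{
	assert(esp >= 4);
	return (esp - 4) / DEMO_PAGE_SIZE == (esp - 1) / DEMO_PAGE_SIZE;
}
```

Calling bytes_to_page_end(0x7f9eff0c) gives 244, which matches the "almost 256 bytes" estimate, and push_stays_in_page(0x7f9eff0c) is true, so the store itself does not cross into another page.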

Stephen Rothwell

Apr 21, 2006, 1:21:58 PM

to Linus Torvalds, s034...@sms.ed.ac.uk, a...@suse.de, linux-...@vger.kernel.org

On Fri, 21 Apr 2006 09:40:26 -0700 (PDT) Linus Torvalds <torv...@osdl.org> wrote:
>
> On Fri, 21 Apr 2006, Alistair John Strachan wrote:
> >
> > Something in here (or -rc1, I didn't test that) broke WINE. x86-64 kernel,
> > 32bit WINE, works fine on 2.6.16.7. I'll check whether -rc1 had the same
> > problem and work backwards, but just in case somebody has an idea..
>
> Nothing strikes me, but maybe Andi has a clue.

Also (and this is probably already known) using a 2G/2G split on i386
kills wine. At least when attempting to run Lotus Notes under wine, wine
gets a signal 9. The normal 3G/1G split works fine.

--
Cheers,
Stephen Rothwell s...@canb.auug.org.au
http://www.canb.auug.org.au/~sfr/

Linus Torvalds

Apr 21, 2006, 1:59:48 PM

to Piet Delaney, Jens Axboe, David S. Miller, die...@gmail.com, Linux Kernel Mailing List

I got slashdotted! Yay!

On Thu, 20 Apr 2006, Linus Torvalds wrote:
>
> I claim that Mach people (and apparently FreeBSD) are incompetent idiots.

I also claim that Slashdot people usually are smelly and eat their
boogers, and have an IQ slightly lower than my daughter's pet hamster
(that's "hamster" without a "p", btw, for any slashdot posters out
there. Try to follow me, ok?).

Furthermore, I claim that anybody that hasn't noticed by now that I'm an
opinionated bastard, and that "impolite" is my middle name, is lacking a
few clues.

Finally, it's clear that I'm not only the smartest person around, I'm also
incredibly good-looking, and that my infallible charm is also second only
to my becoming modesty.

So there. Just to clarify.

Linus "bow down before me, you scum" Torvalds

Steven Rostedt

Apr 21, 2006, 2:16:00 PM

to Linus Torvalds, Piet Delaney, Jens Axboe, David S. Miller, die...@gmail.com, Linux Kernel Mailing List

On Fri, 2006-04-21 at 10:58 -0700, Linus Torvalds wrote:
> I got slashdotted! Yay!
>
> On Thu, 20 Apr 2006, Linus Torvalds wrote:
> >
> > I claim that Mach people (and apparently FreeBSD) are incompetent idiots.
>
> I also claim that Slashdot people usually are smelly and eat their
> boogers, and have an IQ slightly lower than my daughter's pet hamster
> (that's "hamster" without a "p", btw, for any slashdot posters out
> there. Try to follow me, ok?).
>
> Furthermore, I claim that anybody that hasn't noticed by now that I'm an
> opinionated bastard, and that "impolite" is my middle name, is lacking a
> few clues.
>
> Finally, it's clear that I'm not only the smartest person around, I'm also
> incredibly good-looking, and that my infallible charm is also second only
> to my becoming modesty.

I'll vouch for your handsomeness, as in this picture, I'm trying to be
your twin: http://www.kihontech.com/pics/torvalds.jpg Ha! I bet you
never knew we've met.

-- Steve

Steven Rostedt

Apr 21, 2006, 2:43:05 PM

to Linus Torvalds, Piet Delaney, Jens Axboe, David S. Miller, die...@gmail.com, Linux Kernel Mailing List

On Fri, 2006-04-21 at 14:15 -0400, Steven Rostedt wrote:
> On Fri, 2006-04-21 at 10:58 -0700, Linus Torvalds wrote:
> > I got slashdotted! Yay!
> >
> > On Thu, 20 Apr 2006, Linus Torvalds wrote:
> > >
> > > I claim that Mach people (and apparently FreeBSD) are incompetent idiots.
> >
> > I also claim that Slashdot people usually are smelly and eat their
> > boogers, and have an IQ slightly lower than my daughter's pet hamster
> > (that's "hamster" without a "p", btw, for any slashdot posters out
> > there. Try to follow me, ok?).
> >
> > Furthermore, I claim that anybody that hasn't noticed by now that I'm an
> > opinionated bastard, and that "impolite" is my middle name, is lacking a
> > few clues.
> >
> > Finally, it's clear that I'm not only the smartest person around, I'm also
> > incredibly good-looking, and that my infallible charm is also second only
> > to my becoming modesty.
>
> I'll vouch for your handsomeness, as in this picture, I'm trying to be
> your twin: http://www.kihontech.com/pics/torvalds.jpg Ha! I bet you
> never knew we've met.
>

Oh, and BTW, Linus, I never apologized for that shot. It was probably
the reason you don't walk around the Expos at LinuxWorld. I was the
first person to recognize you at the first LinuxWorldExpo in
San Francisco in 1998 and walked up to ask for the photograph. After the
shot was taken, it attracted the attention of others in the area, and
then within seconds you were mobbed. So I now apologize for causing you
that headache ;-)

And about that shot... Unfortunately (maybe?) the guy taking the picture
hit the button too quickly and neither of us was ready for it. You
then looked at me and said "Well that should be an interesting photo"
and proceeded to walk away (just before being mobbed).

Thanks,

-- Steve

Chandra Seetharaman

Apr 21, 2006, 5:33:20 PM

to Herbert Poetzl, Linus Torvalds, Linux Kernel Mailing List, linu...@oss.sgi.com, xfs-m...@oss.sgi.com, ak...@osdl.org, Alan Stern

Hi Herbert,

I am not able to reproduce the problem you are seeing. Need some help
from you in reproducing it.

Do you have any unique hardware/driver/kernel component ?

Did you try without QEMU (to see if it isolates the problem) ?

regards,

chandra

--

----------------------------------------------------------------------
Chandra Seetharaman | Be careful what you choose....
- sekh...@us.ibm.com | .......you may get it.
----------------------------------------------------------------------

Andi Kleen

Apr 21, 2006, 6:02:48 PM

to Linus Torvalds, Alistair John Strachan, Linux Kernel Mailing List

On Friday 21 April 2006 18:40, Linus Torvalds wrote:
> On Fri, 21 Apr 2006, Alistair John Strachan wrote:
> > Something in here (or -rc1, I didn't test that) broke WINE. x86-64
> > kernel, 32bit WINE, works fine on 2.6.16.7. I'll check whether -rc1 had
> > the same problem and work backwards, but just in case somebody has an
> > idea..
>
> Nothing strikes me, but maybe Andi has a clue.

NX for 32bit programs is enabled by default now. Does it
work with noexec32=off?

If it's that, then it won't work with PAE kernels on i386 and NX
capable machines either - I just changed the default to be
the same as 32bit, but unlike 32bit all x86-64 kernels use PAE
and many of the systems have NX.

If it's not that, I don't know what it could be. I actually even used a simple
wine program with a post rc2 kernel and it worked for me.

So it isn't anything fundamental. Maybe some bad interaction
with copy protection again, but I don't remember changing ptrace
at all this time.

> Alistair, if you can do a "git bisect" on this one, that would help.

If noexec32=off doesn't help please do.
If noexec32 helps then it's likely a wine bug for using the wrong
protections.

-Andi

Alistair John Strachan

Apr 21, 2006, 8:54:05 PM

to Andi Kleen, Linus Torvalds, Linux Kernel Mailing List

On Friday 21 April 2006 23:02, Andi Kleen wrote:
> On Friday 21 April 2006 18:40, Linus Torvalds wrote:
> > On Fri, 21 Apr 2006, Alistair John Strachan wrote:
> > > Something in here (or -rc1, I didn't test that) broke WINE. x86-64
> > > kernel, 32bit WINE, works fine on 2.6.16.7. I'll check whether -rc1 had
> > > the same problem and work backwards, but just in case somebody has an
> > > idea..
> >
> > Nothing strikes me, but maybe Andi has a clue.
>
> NX for 32bit programs is enabled by default now. Does it
> work with noexec32=off?
>
> If it's that, then it won't work with PAE kernels on i386 and NX
> capable machines either - I just changed the default to be
> the same as 32bit, but unlike 32bit all x86-64 kernels use PAE
> and many of the systems have NX.
>
> If it's not that, I don't know what it could be. I actually even used a
> simple wine program with a post rc2 kernel and it worked for me.
>
> So it isn't anything fundamental. Maybe some bad interaction
> with copy protection again, but I don't remember changing ptrace
> at all this time.
>
> > Alistair, if you can do a "git bisect" on this one, that would help.
>
> If noexec32=off doesn't help please do.
> If noexec32 helps then it's likely a wine bug for using the wrong
> protections.

[alistair] 01:52 [~] uname -rm
2.6.17-rc2 x86_64

[alistair] 01:52 [~] cat /proc/cmdline
vga=794 root=/dev/sda1 quiet noexec32=off

[alistair] 01:51 [~/.wine/drive_c/Program Files/Warcraft III] wine
war3.exe -opengl
err:ole:CoCreateInstance apartment not initialised
fixme:advapi:SetSecurityInfo stub

Aaand wine suddenly starts working again. Looks like a bug in WINE; is there
any additional information required before I can file a bug report on this
one? Thanks.

--
Cheers,
Alistair.

Third year Computer Science undergraduate.
1F2 55 South Clerk Street, Edinburgh, UK.

Herbert Poetzl

Apr 21, 2006, 8:59:31 PM

to Chandra Seetharaman, Linus Torvalds, Linux Kernel Mailing List, linu...@oss.sgi.com, xfs-m...@oss.sgi.com, ak...@osdl.org, Alan Stern

On Fri, Apr 21, 2006 at 02:31:37PM -0700, Chandra Seetharaman wrote:
> Hi Herbert,
>
> I am not able to reproduce the problem you are seeing.
> Need some help from you in reproducing it.

okay, no problem ...

> Do you have any unique hardware/driver/kernel component ?

hmm, you got my config, so that's it from the
kernel side (no modules, no special settings)

> Did you try without QEMU (to see if it isolates the problem) ?

nope, have no test system available to test it atm
but I'm using QEMU for a long time now to do kernel
debugging and it would really surprise me if that
was a QEMU bug (it's pretty stable on x86 nowadays,
but you never know)

here is how to reproduce it (with QEMU)

- get qemu 0.8 (or later)
- get TEST_32M_public2.img.bz2, TEST_256M_empty.img.bz2
- get the kernel QEMU_2.6.17-rc2.config

- build the kernel on x86 for x86 (gcc 3.3.5 here)
- unpack the disk images (bunzip2)
- run qemu like this:

qemu -nographic -m 256 -snapshot -hda TEST_32M_public2.img -hdc TEST_256M_empty.img -kernel /path/to/your/linux-2.6.17-rc2/arch/i386/boot/bzImage -append "rw root=/dev/hda1"

(on non x86 use qemu-system-i386 instead)

- then, on the prompt enter:

# mkfs.xfs -f /dev/hdc1
# mount /dev/hdc1 /mnt/

- *bang*

you can get all the aforementioned stuff including
a prebuilt kernel image from here:

http://vserver.13thfloor.at/Stuff/MAINLINE/linux-2.6.17_xfs/

HTH,
Herbert

Andi Kleen

Apr 21, 2006, 9:12:25 PM

to Alistair John Strachan, Linus Torvalds, Linux Kernel Mailing List, meis...@suse.de

On Saturday 22 April 2006 02:53, Alistair John Strachan wrote:

> > > Alistair, if you can do a "git bisect" on this one, that would help.
> >
> > If noexec32=off doesn't help please do.
> > If noexec32 helps then it's likely a wine bug for using the wrong
> > protections.
>
> [alistair] 01:52 [~] uname -rm
> 2.6.17-rc2 x86_64
>
> [alistair] 01:52 [~] cat /proc/cmdline
> vga=794 root=/dev/sda1 quiet noexec32=off
>
> [alistair] 01:51 [~/.wine/drive_c/Program Files/Warcraft III] wine
> war3.exe -opengl
> err:ole:CoCreateInstance apartment not initialised
> fixme:advapi:SetSecurityInfo stub
>
> Aaand wine suddenly starts working again.

Ok. There is a way to change this at runtime for individual
processes too (using personality), but most distros seem
to miss the user tools for that so far.

> Looks like a bug in WINE; is there
> any additional information required before I can file a bug report on this
> one? Thanks.

They probably forgot to set PROT_EXEC in either mprotect or mmap somewhere.
You can check in /proc/*/maps which mapping contains the address it is faulting
on and then try to find where it is allocated or mprotect'ed.

-Andi
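
A minimal sketch of the check Andi describes, assuming a Linux system with /proc mounted. mapping_perms() is a hypothetical helper that finds the /proc/self/maps entry covering an address and returns its permission string; alloc_then_make_executable() shows the kind of mmap/mprotect sequence in which PROT_EXEC has to be requested explicitly once NX is enforced. Neither function is taken from Wine.

```c
#define _DEFAULT_SOURCE		/* for MAP_ANONYMOUS under strict -std modes */
#include <assert.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

/*
 * Find the mapping in /proc/self/maps that contains addr and copy its
 * permission field (for example "rw-p") into perms.
 * Returns 0 on success, -1 if no mapping matches or the file is unreadable.
 */
static int mapping_perms(const void *addr, char perms[5])
{
	FILE *f = fopen("/proc/self/maps", "r");
	unsigned long start, end, target = (unsigned long)addr;
	char line[512], p[5];
	int found = -1;

	assert(perms != NULL);
	if (!f)
		return -1;
	while (fgets(line, sizeof(line), f)) {
		/* Each entry starts with "start-end perms ...", addresses in hex. */
		if (sscanf(line, "%lx-%lx %4s", &start, &end, p) != 3)
			continue;
		if (target >= start && target < end) {
			memcpy(perms, p, sizeof(p));
			found = 0;
			break;
		}
	}
	fclose(f);
	return found;
}

/*
 * Map anonymous memory read/write, then switch it to read/execute.
 * Without the PROT_EXEC request, jumping into the region faults on an
 * NX-enforcing kernel. Returns NULL on failure.
 */
static void *alloc_then_make_executable(size_t len)
{
	void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	if (p == MAP_FAILED)
		return NULL;
	/* A loader would copy its code into p at this point. */
	if (mprotect(p, len, PROT_READ | PROT_EXEC) != 0) {
		munmap(p, len);
		return NULL;
	}
	return p;
}
```

For another process the same parsing would apply to that process's maps file under /proc. An entry such as rw-p covering the faulting address would point at a missing PROT_EXEC.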

Keith Owens

Apr 22, 2006, 3:16:53 PM

to Herbert Poetzl, Linus Torvalds, Linux Kernel Mailing List, Chandra Seetharaman, linu...@oss.sgi.com, xfs-m...@oss.sgi.com, ak...@osdl.org

Apply this debugging patch. It uses KERN_EMERG instead of KERN_DEBUG
to ensure that the messages appear on the console. Capture the boot
log, extract all the notify register and unregister messages. Convert
the addresses of *nl and n to symbols[*] and mail the result to lkml.

[*] A quick way of converting text with possible addresses to symbols is
ksymoops -m System.map -A `cat log.extract` < /dev/null
ksymoops -A extracts anything that might be an address, looks it up
in the system map and prints the corresponding symbol.

Index: linux/kernel/sys.c
===================================================================
--- linux.orig/kernel/sys.c 2006-04-19 17:33:07.000000000 +1000
+++ linux/kernel/sys.c 2006-04-22 16:28:19.593794509 +1000
@@ -105,6 +105,8 @@ static BLOCKING_NOTIFIER_HEAD(reboot_not
 static int notifier_chain_register(struct notifier_block **nl,
 		struct notifier_block *n)
 {
+	printk(KERN_EMERG "%s start *nl=%p n=%p\n",
+	       __FUNCTION__, *nl, n);
 	while ((*nl) != NULL) {
 		if (n->priority > (*nl)->priority)
 			break;
@@ -112,19 +114,27 @@ static int notifier_chain_register(struc
 	}
 	n->next = *nl;
 	rcu_assign_pointer(*nl, n);
+	printk(KERN_EMERG "%s return *nl=%p n=%p\n",
+	       __FUNCTION__, *nl, n);
 	return 0;
 }

 static int notifier_chain_unregister(struct notifier_block **nl,
 		struct notifier_block *n)
 {
+	printk(KERN_EMERG "%s start *nl=%p n=%p\n",
+	       __FUNCTION__, *nl, n);
 	while ((*nl) != NULL) {
 		if ((*nl) == n) {
 			rcu_assign_pointer(*nl, n->next);
+			printk(KERN_EMERG "%s return 1 *nl=%p n=%p\n",
+			       __FUNCTION__, *nl, n);
 			return 0;
 		}
 		nl = &((*nl)->next);
 	}
+	printk(KERN_EMERG "%s return 2 *nl=%p n=%p\n",
+	       __FUNCTION__, *nl, n);
 	return -ENOENT;

Troy Benjegerdes

Apr 22, 2006, 3:32:47 PM

to Piet Delaney, Linus Torvalds, Jens Axboe, David S. Miller, die...@gmail.com, linux-...@vger.kernel.org

<snip>

Do any of these zero-copy implementations allow you to maintain the old
fashioned read/write semantics? I'm quite sure I can find a real-world
computational chemistry application that wants plain old read/write
for network and filesystem access (and network filesystem access), that
is going to hand you a write() of 256M that is never going to fit into
any CPU cache, and you will always be better off playing COW games
rather than saturating the memory bus with memory copies.. particularly
because the application already saturated the memory bus of the system
doing computations on 16GB of data.

I would like to see some real numbers of *when*, not if a COW scheme
actually starts to be worth the page table flushing. This is a
trade-off, not some absolute 'COW is always bad'.

Let's also not forget that sometimes what an application cares about is
latency, not absolute performance.. so taking a *possible*, but not
certain future hit on TLB invalidates and reloads can be a win if you
get back to executing user code sooner and don't have to synchronize
with the kernel to find out you can use a page again.

Alistair John Strachan

Apr 22, 2006, 3:52:14 PM

to Andi Kleen, Linus Torvalds, Linux Kernel Mailing List, meis...@suse.de

On Saturday 22 April 2006 02:07, Andi Kleen wrote:
[snip]

> They probably forgot to set PROT_EXEC in either mprotect or mmap somewhere.
> You can check in /proc/*/maps which mapping contains the address it is
> faulting on and then try to find where it is allocated or mprotect'ed.

Turned out this was exactly what the problem was. Wine attempts to match
Windows as far as read/write/execute mappings go, and war3.exe tried to
execute memory in a section with "MEM_EXECUTE" not set.

I'm surprised the program works on Windows with DEP/NX enabled, but apparently
it does. There's a patch floating around on the Wine mailing list which adds
a workaround for this problem:

http://www.winehq.org/pipermail/wine-devel/2006-April/046935.html

Many thanks to Marcus Meissner for debugging it.

--
Cheers,
Alistair.

Third year Computer Science undergraduate.
1F2 55 South Clerk Street, Edinburgh, UK.

Chandra Seetharaman

Apr 24, 2006, 5:27:15 PM

to Herbert Poetzl, Linus Torvalds, Linux Kernel Mailing List, linu...@oss.sgi.com, xfs-m...@oss.sgi.com, ak...@osdl.org, Alan Stern

On Sat, 2006-04-22 at 02:58 +0200, Herbert Poetzl wrote:

Herbert,

Thanks for the steps. With that i was able to reproduce the problem and
i found the bug.

While i go ahead and generate the patch, i wanted to hear if my
conclusion is correct.

The problem is due to the fact that most notifier registrations
incorrectly use __devinitdata to define the callback structure, as in:

static struct notifier_block __devinitdata hrtimers_nb = {
	.notifier_call = hrtimer_cpu_notify,
};

Data marked __devinitdata is not _expected to be available_ after
initialization (unless CONFIG_HOTPLUG is defined).

I do not know how it was working until now :), anybody has a theory that
can explain it (or my conclusion is wrong) ?

Thanks,

chandra
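
To see why a stale entry on the chain bites only when somebody registers later, here is a userspace model of the registration loop. It is a sketch under assumptions: the demo_ names are invented, the field order is assumed to match the kernel's struct notifier_block of that era, and rcu_assign_pointer() is replaced by a plain store. The loop walks every entry already on the chain and reads its priority, so if an earlier entry lived in init memory that has since been freed, the walk touches freed memory. In Herbert's oops the faulting address c056a680 is eax (c056a678) plus 8, which would be consistent with reading the priority field of such an entry on 32-bit.

```c
#include <assert.h>
#include <stddef.h>

/* Userspace stand-in for the kernel's struct notifier_block. */
struct demo_notifier_block {
	int (*notifier_call)(struct demo_notifier_block *nb,
			     unsigned long action, void *data);
	struct demo_notifier_block *next;
	int priority;
};

/*
 * Insert n into the list headed by *nl, keeping it sorted by descending
 * priority. Same loop shape as notifier_chain_register() in the patch
 * context quoted in this thread.
 */
static int demo_chain_register(struct demo_notifier_block **nl,
			       struct demo_notifier_block *n)
{
	assert(nl != NULL && n != NULL);
	while (*nl != NULL) {
		/* Dereferences an entry that was registered earlier. */
		if (n->priority > (*nl)->priority)
			break;
		nl = &(*nl)->next;
	}
	n->next = *nl;
	*nl = n;
	return 0;
}

/* Number of entries currently on the chain. */
static size_t demo_chain_length(const struct demo_notifier_block *head)
{
	size_t len = 0;

	for (; head != NULL; head = head->next)
		len++;
	return len;
}
```

Registering blocks with priorities 0, 10 and 5 in that order leaves the chain ordered 10, 5, 0. The third registration has to read the priority of both earlier entries to find its place.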

Andrew Morton

Apr 24, 2006, 6:01:32 PM

to sekh...@us.ibm.com, her...@13thfloor.at, torv...@osdl.org, linux-...@vger.kernel.org, linu...@oss.sgi.com, xfs-m...@oss.sgi.com, st...@rowland.harvard.edu

Chandra Seetharaman <sekh...@us.ibm.com> wrote:
>
> Thanks for the steps. With that i was able to reproduce the problem and
> i found the bug.
>
> While i go ahead and generate the patch, i wanted to hear if my
> conclusion is correct.
>
> The problem is due to the fact that most notifier registrations
> incorrectly use __devinitdata to define the callback structure, as in:
>
> static struct notifier_block __devinitdata hrtimers_nb = {
> .notifier_call = hrtimer_cpu_notify,
> };
>
> devinitdata'd data is not _expected to be available_ after the
> initialization(unless CONFIG_HOTPLUG is defined).
>
> I do not know how it was working until now :), anybody has a theory that
> can explain it (or my conclusion is wrong) ?

That sounds right. There are several __devinitdata notifier_blocks in the
tree - please be sure to check them all.

btw, it'd be pretty trivial to add runtime checking for this sort of thing:

int addr_in_init_section(void *addr)
{
	return addr >= __init_begin && addr < __init_end;
}

(x86-specific)
(need to add __init_end to vmlinux.lds.S)

then we could use that to check various things in various places...

Chandra Seetharaman

Apr 24, 2006, 7:02:25 PM

to Andrew Morton, her...@13thfloor.at, torv...@osdl.org, linux-...@vger.kernel.org, linu...@oss.sgi.com, xfs-m...@oss.sgi.com, Alan Stern

On Mon, 2006-04-24 at 15:03 -0700, Andrew Morton wrote:
> Chandra Seetharaman <sekh...@us.ibm.com> wrote:
> >
> > Thanks for the steps. With that i was able to reproduce the problem and
> > i found the bug.
> >
> > While i go ahead and generate the patch, i wanted to hear if my
> > conclusion is correct.
> >
> > The problem is due to the fact that most notifier registrations
> > incorrectly use __devinitdata to define the callback structure, as in:
> >
> > static struct notifier_block __devinitdata hrtimers_nb = {
> > .notifier_call = hrtimer_cpu_notify,
> > };
> >
> > devinitdata'd data is not _expected to be available_ after the
> > initialization(unless CONFIG_HOTPLUG is defined).
> >
> > I do not know how it was working until now :), anybody has a theory that
> > can explain it (or my conclusion is wrong) ?
>
> That sounds right. There are several __devinitdata notifier_blocks in the
> tree - please be sure to check them all.

Yes, I am covering all notifier blocks.

Another issue... many of the notifier callback functions are marked as
init calls (__cpuinit, __devinit etc.,) as in:

static int __cpuinit pageset_cpuup_callback(struct notifier_block *nfb,
unsigned long action,
void *hcpu)

I am generating a separate patch to take care of those too.

>
> btw, it'd be pretty trivial to add runtime checking for this sort of thing:
>
> int addr_in_init_section(void *addr)
> {
> return addr >= __init_begin && addr < __init_end;
> }

I will add this to kernel/sys.c, and put a BUG_ON to check for both the
notifier block and the callback function.

BTW, which header file do you want me to export this through ?

>
> (x86-specific)
> (need to add __init_end to vmlinux.lds.S)

I see __init_end in arch/i386/kernel/vmlinux.lds.S.

>
> then we could use that to check various things in various places...

--

----------------------------------------------------------------------
Chandra Seetharaman | Be careful what you choose....
- sekh...@us.ibm.com | .......you may get it.
----------------------------------------------------------------------

Andrew Morton

Apr 24, 2006, 7:26:23 PM

to sekh...@us.ibm.com, her...@13thfloor.at, torv...@osdl.org, linux-...@vger.kernel.org, linu...@oss.sgi.com, xfs-m...@oss.sgi.com, st...@rowland.harvard.edu, Ashok Raj

Chandra Seetharaman <sekh...@us.ibm.com> wrote:
>
> On Mon, 2006-04-24 at 15:03 -0700, Andrew Morton wrote:
> > Chandra Seetharaman <sekh...@us.ibm.com> wrote:
> > >
> > > Thanks for the steps. With that i was able to reproduce the problem and
> > > i found the bug.
> > >
> > > While i go ahead and generate the patch, i wanted to hear if my
> > > conclusion is correct.
> > >
> > > The problem is due to the fact that most notifier registrations
> > > incorrectly use __devinitdata to define the callback structure, as in:
> > >
> > > static struct notifier_block __devinitdata hrtimers_nb = {
> > > .notifier_call = hrtimer_cpu_notify,
> > > };
> > >
> > > devinitdata'd data is not _expected to be available_ after the
> > > initialization(unless CONFIG_HOTPLUG is defined).
> > >
> > > I do not know how it was working until now :), anybody has a theory that
> > > can explain it (or my conclusion is wrong) ?
> >
> > That sounds right. There are several __devinitdata notifier_blocks in the
> > tree - please be sure to check them all.
>
> Yes, I am covering all notifier blocks.
>
> Another issue... many of the notifier callback functions are marked as
> init calls (__cpuinit, __devinit etc.,) as in:
>
> static int __cpuinit pageset_cpuup_callback(struct notifier_block *nfb,
> unsigned long action,
> void *hcpu)

hm. This needs some care and thought. We _should_ be oopsing all over the
place because of this. So why aren't we?

iirc, the cpu notifier chain is never used after bootup if
!CONFIG_HOTPLUG_CPU, so there's a good chance that we have things on that
list which have been unloaded, but which never get accessed.

It could be similar with the __devinit things - they're on the list,
they're unloaded, but nothing ever happens in a !CONFIG_HOTPLUG kernel to
cause them to be dereferenced.

Really, these notifier chains just shouldn't exist at all if they're not
going to be used. We're a bit sloppy here. Ashok and I spent some time
working on making lots of code and data structures go away if
!CONFIG_HOTPLUG_CPU, but it's a bit tricky due to the way we do SMP
bringup.

I guess for now, bringing those things into .text and .data when there's
doubt is a reasonable thing to do.

> I am generating a separate patch to take care of those too.
> >
> > btw, it'd be pretty trivial to add runtime checking for this sort of thing:
> >
> > int addr_in_init_section(void *addr)
> > {
> > return addr >= __init_begin && addr < __init_end;
> > }
>
> I will add this to kernel/sys.c, and put a BUG_ON to check for both the
> notifier block and the callback function.

It's x86-only I think. If all architectures use the same symbols then I
guess we could do an arch-neutral version, but one should check.

If it won't work on all architectures then kernel/sys.c isn't the right
place for it.

Maybe it's not so useful. If we're actually accessing these things then
someone should report oopses. So this debugging infrastructure will only
detect things which a) are in __init, b) shouldn't be in __init and c) are
never actually accessed.

So I'd be inclined to not bother about this for now.

Chandra Seetharaman

Apr 24, 2006, 8:20:42 PM

to Andrew Morton, her...@13thfloor.at, torv...@osdl.org, linux-...@vger.kernel.org, linu...@oss.sgi.com, xfs-m...@oss.sgi.com, Alan Stern, Ashok Raj, mi...@gnu.org

On Mon, 2006-04-24 at 16:28 -0700, Andrew Morton wrote:
> Chandra Seetharaman <sekh...@us.ibm.com> wrote:

<snip>

> > Another issue... many of the notifier callback functions are marked as
> > init calls (__cpuinit, __devinit etc.,) as in:
> >
> > static int __cpuinit pageset_cpuup_callback(struct notifier_block *nfb,
> > unsigned long action,
> > void *hcpu)
>
> hm. This needs some care and thought. We _should_ be oopsing all over the
> place because of this. So why aren't we?

for that matter we should have been oopsing w.r.t __initdata also,
right ?

>
> iirc, the cpu notifier chain is never used after bootup if
> !CONFIG_HOTPLUG_CPU, so there's a good chance that we have things on that
> list which have been unloaded, but which never get accessed.
>
> It could be similar with the __devinit things - they're on the list,
> they're unloaded, but nothing ever happens in a !CONFIG_HOTPLUG kernel to
> cause them to be dereferenced.
>
> Really, these notifier chains just shouldn't exist at all if they're not
> going to be used. We're a bit sloppy here. Ashok and I spent some time
> working on making lots of code and data structures go away if
> !CONFIG_HOTPLUG_CPU, but it's a bit tricky due to the way we do SMP
> bringup.
>
> I guess for now, bringing those things into .text and .data when there's
> doubt is a reasonable thing to do.

Will do.

>
>
> > I am generating a separate patch to take care of those too.
> > >
> > > btw, it'd be pretty trivial to add runtime checking for this sort of thing:
> > >
> > > int addr_in_init_section(void *addr)
> > > {
> > > return addr >= __init_begin && addr < __init_end;
> > > }
> >
> > I will add this to kernel/sys.c, and put a BUG_ON to check for both the
> > notifier block and the callback function.
>
> It's x86-only I think. If all architectures use the same symbols then I
> guess we could do an arch-neutral version, but one should check.

I checked all the architectures, only v850 doesn't seem to have
__init_begin (instead it has __init_start and it is the only arch that
defines __init_start). But, it does have __init_end.

CC'd the author of the file.

>
> If it won't work on all architectures then kernel/sys.c isn't the right
> place for it.
>
> Maybe it's not so useful. If we're actually accessing these things then
> someone should report oopses. So this debugging infrastructure will only
> detect things which a) are in __init, b) shouldn't be in __init and c) are
> never actually accessed.

We do not know how the __initdata initializations were _not_ oopsing
till 2.6.16, yet they fail consistently in 2.6.17-rc2. We spent some time
debugging the problem and got to this point.

If the __init functions also start failing for whatever reason, then we
have to go through this debug cycle again.

On the other hand, if we add a panic or BUG_ON in
notifier_chain_register, then the bug will be apparent.

> So I'd be inclined to not bother about this for now.

I'd agree with this in regards to exporting the function.
>
--

----------------------------------------------------------------------
Chandra Seetharaman | Be careful what you choose....
- sekh...@us.ibm.com | .......you may get it.
----------------------------------------------------------------------

Alan Stern

Apr 26, 2006, 11:50:13 AM

to Andrew Morton, sekh...@us.ibm.com, her...@13thfloor.at, torv...@osdl.org, linux-...@vger.kernel.org, linu...@oss.sgi.com, xfs-m...@oss.sgi.com, Ashok Raj

It seems clear that this particular oops was caused by the xfs driver
trying to register a cpu_notifier at a time when that notifier chain was
expected to be completely idle.

Instead of moving all this code and data out of the init sections,
wouldn't it be better to fix the individual drivers (like xfs) so they
won't try to use inaccessible notifier chains?

For that matter, if lots of entries on the cpu_notifier chain are marked
with __cpuinit, then shouldn't the chain header itself plus
register_cpu_notifier and unregister_cpu_notifier be marked the same way?

Alan Stern

Chandra Seetharaman

Apr 26, 2006, 2:19:33 PM

to Alan Stern, Andrew Morton, her...@13thfloor.at, torv...@osdl.org, linux-...@vger.kernel.org, linu...@oss.sgi.com, xfs-m...@oss.sgi.com, Ashok Raj

On Wed, 2006-04-26 at 11:49 -0400, Alan Stern wrote:
> On Mon, 24 Apr 2006, Andrew Morton wrote:

<snip>

> > I guess for now, bringing those things into .text and .data when there's
> > doubt is a reasonable thing to do.
>
> It seems clear that this particular oops was caused by the xfs driver
> trying to register a cpu_notifier at a time when that notifier chain was
> expected to be completely idle.
>
> Instead of moving all this code and data out of the init sections,
> wouldn't it be better to fix the individual drivers (like xfs) so they
> won't try to use inaccessible notifier chains?
>
> For that matter, if lots of entries on the cpu_notifier chain are marked
> with __cpuinit, then shouldn't the chain header itself plus
> register_cpu_notifier and unregister_cpu_notifier be marked the same way?

Your suggestion is very valid, since the cpu_notifiers are called only
at init time, unless CONFIG_HOTPLUG_CPU is turned ON. The definitions of
__cpuinit and __cpuinitdata take care of the HOTPLUG config option.

XFS wants to register only for the HOTPLUG_CPU case, and it does so by putting
the callback, register and unregister inside #ifdef CONFIG_HOTPLUG_CPU.
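
For readers unfamiliar with the annotations, here is a minimal userspace sketch of how __cpuinit and __cpuinitdata depend on the hotplug option. It approximates, from memory, the definitions in include/linux/init.h; the section attributes are stand-ins for the kernel's __init and __initdata, and example_cpu_callback and example_calls are hypothetical names used only for illustration.

```c
/* Userspace model of the kernel's init-section annotations. */

#if defined(__GNUC__) && defined(__ELF__)
/* Stand-ins for the kernel's __init/__initdata: place code and data in
 * dedicated sections (the kernel frees those sections after boot). */
#define __init     __attribute__((__section__(".init.text")))
#define __initdata __attribute__((__section__(".init.data")))
#else
#define __init
#define __initdata
#endif

#ifdef CONFIG_HOTPLUG_CPU
/* CPUs may come and go at runtime, so callbacks must stay resident. */
#define __cpuinit
#define __cpuinitdata
#else
/* No hotplug: callbacks run only during boot and can be discarded. */
#define __cpuinit     __init
#define __cpuinitdata __initdata
#endif

/* Hypothetical counter and callback, annotated the way a driver would. */
static int __cpuinitdata example_calls = 0;

static int __cpuinit example_cpu_callback(unsigned long action)
{
	example_calls++;
	return (int)action;
}
```

With CONFIG_HOTPLUG_CPU defined the annotations expand to nothing and the callback stays in ordinary text; without it, the callback lands in an init section, which the kernel (unlike this userspace model) discards after boot.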

Note: I made the changes and tested, it works.

Andrew, Linus, Any comments ?

> Alan Stern
>
--

----------------------------------------------------------------------
Chandra Seetharaman | Be careful what you choose....
- sekh...@us.ibm.com | .......you may get it.
----------------------------------------------------------------------

Andrew Morton

Apr 26, 2006, 2:41:51 PM

to sekh...@us.ibm.com, st...@rowland.harvard.edu, her...@13thfloor.at, torv...@osdl.org, linux-...@vger.kernel.org, linu...@oss.sgi.com, xfs-m...@oss.sgi.com, asho...@intel.com

Chandra Seetharaman <sekh...@us.ibm.com> wrote:
>
> On Wed, 2006-04-26 at 11:49 -0400, Alan Stern wrote:
> > On Mon, 24 Apr 2006, Andrew Morton wrote:
> <snip>
>
> > > I guess for now, bringing those things into .text and .data when there's
> > > doubt is a reasonable thing to do.
> >
> > It seems clear that this particular oops was caused by the xfs driver
> > trying to register a cpu_notifier at a time when that notifier chain was
> > expected to be completely idle.
> >
> > Instead of moving all this code and data out of the init sections,
> > wouldn't it be better to fix the individual drivers (like xfs) so they
> > won't try to use inaccessible notifier chains?
> >
> > For that matter, if lots of entries on the cpu_notifier chain are marked
> > with __cpuinit, then shouldn't the chain header itself plus
> > register_cpu_notifier and unregister_cpu_notifier be marked the same way?
>
> Your suggestion is very valid, since the cpu_notifiers are called only
> at init time, unless CONFIG_HOTPLUG_CPU is turned ON. The definitions of
> __cpuinit and __cpuinitdata take care of the HOTPLUG config option.
>
> XFS wants to register only for the HOTPLUG_CPU case, and it does so by putting
> the callback, register and unregister inside #ifdef CONFIG_HOTPLUG_CPU.
>
> Note: I made the changes and tested, it works.
>
> Andrew, Linus, Any comments ?

Ashok's the one who has spent most time with this. Basically _everything_
to do with register_cpu_notifier() and all the things which call it should
be __cpuinit and should be tossed away during boot on non-cpu-hotplug
kernels.

But there are a few nasty problems with that which made us give up.

Ashok Raj

Apr 26, 2006, 3:38:47 PM

to Andrew Morton, sekh...@us.ibm.com, st...@rowland.harvard.edu, her...@13thfloor.at, torv...@osdl.org, linux-...@vger.kernel.org, linu...@oss.sgi.com, xfs-m...@oss.sgi.com, asho...@intel.com

On Wed, Apr 26, 2006 at 11:43:48AM -0700, Andrew Morton wrote:
> Chandra Seetharaman <sekh...@us.ibm.com> wrote:

> Ashok's the one who has spent most time with this. Basically _everything_
> to do with register_cpu_notifier() and all the things which call it should
> be __cpuinit and should be tossed away during boot on non-cpu-hotplug
> kernels.
>
> But there are a few nasty problems with that which made us give up.

I think we got to a reasonable start, until I got busy with other things
and didn't complete it all the way to be ready to submit. Many files
were affected, so we thought maybe we could take smaller steps.

For the above xfs case, if you want to avoid the ifdef CONFIG_HOTPLUG_CPU
you could choose to use hotcpu_notifier(), which is a null macro when
CONFIG_HOTPLUG_CPU=n.
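
As a rough illustration of the "null macro" behaviour, the following userspace sketch models hotcpu_notifier(). The macro body approximates the one in include/linux/cpu.h from memory; the simplified notifier_block, the chain head cpu_chain, and the helpers my_cpu_event, chain_length and demo_register are assumptions made for this example, not kernel code.

```c
#include <stddef.h>

/* Simplified model of the kernel's notifier_block. */
struct notifier_block {
	int (*notifier_call)(struct notifier_block *nb,
			     unsigned long action, void *data);
	struct notifier_block *next;
	int priority;
};

static struct notifier_block *cpu_chain;	/* head of the modelled chain */

/* Insert nb so the chain stays sorted by descending priority. */
int register_cpu_notifier(struct notifier_block *nb)
{
	struct notifier_block **p = &cpu_chain;

	while (*p && (*p)->priority >= nb->priority)
		p = &(*p)->next;
	nb->next = *p;
	*p = nb;
	return 0;
}

#ifdef CONFIG_HOTPLUG_CPU
/* Declares a hidden static notifier_block and registers it. */
#define hotcpu_notifier(fn, pri) do {				\
	static struct notifier_block fn##_nb =			\
		{ .notifier_call = fn, .priority = pri };	\
	register_cpu_notifier(&fn##_nb);			\
} while (0)
#else
/* Null macro: nothing is registered. The (void) cast is an addition in
 * this model that only silences unused-function warnings. */
#define hotcpu_notifier(fn, pri) do { (void)(fn); } while (0)
#endif

/* Hypothetical callback that ignores every event. */
static int my_cpu_event(struct notifier_block *nb,
			unsigned long action, void *data)
{
	(void)nb; (void)action; (void)data;
	return 0;
}

/* Count the entries currently on the modelled chain. */
static int chain_length(void)
{
	int n = 0;
	struct notifier_block *p;

	for (p = cpu_chain; p; p = p->next)
		n++;
	return n;
}

/* Register through the macro and report how many entries exist. */
static int demo_register(void)
{
	hotcpu_notifier(my_cpu_event, 0);
	return chain_length();
}
```

Built without -DCONFIG_HOTPLUG_CPU, demo_register() leaves the chain empty; built with it, one entry is registered. Note that the notifier_block is hidden inside the macro, so the caller gets no handle on it.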

The problem we ran into was some of the startup code depends on the notifier
call chain for smp bringup, hence we couldn't nuke it similar to
hotcpu_notifier().

So we ended up naming that function, for the early risers,
early_register_cpu_notifier(), and marking all functions/data with __cpuinit etc. to
overcome that issue.

I will try to pursue it again when I get a chance.
--
Cheers,
Ashok Raj
- Open Source Technology Center

Chandra Seetharaman

Apr 26, 2006, 4:22:07 PM

to Ashok Raj, Andrew Morton, Alan Stern, her...@13thfloor.at, torv...@osdl.org, linux-...@vger.kernel.org, linu...@oss.sgi.com, xfs-m...@oss.sgi.com

On Wed, 2006-04-26 at 12:29 -0700, Ashok Raj wrote:
> On Wed, Apr 26, 2006 at 11:43:48AM -0700, Andrew Morton wrote:
> > Chandra Seetharaman <sekh...@us.ibm.com> wrote:
> > Ashok's the one who has spent most time with this. Basically _everything_
> > to do with register_cpu_notifier() and all the things which call it should
> > be __cpuinit and should be tossed away during boot on non-cpu-hotplug
> > kernels.
> >
> > But there are a few nasty problems with that which made us give up.
>
> I think we got to a reasonable start, until I got busy with other things
> and didn't complete it all the way to be ready to submit. Many files
> were affected, so we thought maybe we could take smaller steps.
>
> For the above xfs case, if you want to avoid the ifdef CONFIG_HOTPLUG_CPU
> you could choose to use hotcpu_notifier(), which is a null macro when
> CONFIG_HOTPLUG_CPU=n.

No, they can't use hotcpu_notifier, because they want to hold on to
their notifier block (as they attach it to each mount data structure).

But it is not a major issue. The changes to xfs needed to adhere to the model
being discussed are very small.

>
> The problem we ran into was some of the startup code depends on the notifier
> call chain for smp bringup, hence we couldn't nuke it similar to
> hotcpu_notifier().

I do not understand the problem. If everybody that uses
register_cpu_notifier() starts using __cpuinit and __cpuinitdata (or the
devinit siblings), then the notifier mechanism will not be any different
from what it is now, right? (both in the hotplug cpu and non-hotplug cpu
cases) Or am I missing something?

>
> So we ended up naming that function, for the early risers,
> early_register_cpu_notifier(), and marking all functions/data with __cpuinit etc. to
> overcome that issue.
>
> I will try to pursue it again when I get a chance.

I made patches that remove all init stuff from all the usages of
notifier_blocks, and I _think_ they are on their way to 2.6.17. The question
is, should they go in or not?

Maybe the right answer is that they should not, and xfs should fix its
register_cpu_notifier() usage.

chandra
--

----------------------------------------------------------------------
Chandra Seetharaman | Be careful what you choose....
- sekh...@us.ibm.com | .......you may get it.
----------------------------------------------------------------------

Ashok Raj

Apr 26, 2006, 4:30:25 PM

to Chandra Seetharaman, Ashok Raj, Andrew Morton, Alan Stern, her...@13thfloor.at, torv...@osdl.org, linux-...@vger.kernel.org, linu...@oss.sgi.com, xfs-m...@oss.sgi.com

On Wed, Apr 26, 2006 at 01:21:33PM -0700, Chandra Seetharaman wrote:
> >
> > The problem we ran into was some of the startup code depends on the notifier
> > call chain for smp bringup, hence we couldn't nuke it similar to
> > hotcpu_notifier().
>
> I do not understand the problem. If everybody that uses
> register_cpu_notifier() starts using __cpuinit and __cpuinitdata (or the
> devinit siblings), then the notifier mechanism will not be any different
> from what it is now, right? (both in the hotplug cpu and non-hotplug cpu
> cases) Or am I missing something?

Well, register_cpu_notifier() is an exported function. Several
modules, like cpufreq, use it today, which disqualifies it from being
an init-style function.

Either that function should be devinit and be present permanently, or
it should be mapped to a null macro for correctness.

Otherwise module loaders will start to oops when they call into
register.

--
Cheers,
Ashok Raj
- Open Source Technology Center

Chandra Seetharaman

Apr 28, 2006, 7:12:26 PM

to Andrew Morton, torv...@osdl.org, Ashok Raj, Alan Stern, her...@13thfloor.at, linux-...@vger.kernel.org, linu...@oss.sgi.com, xfs-m...@oss.sgi.com

On Wed, 2006-04-26 at 13:26 -0700, Ashok Raj wrote:

Hi All,

Looks like the patches I provided are a step backward from where Ashok &
Andrew were taking the register_cpu_notifier stuff.

After some discussions with Ashok we both think the following would be
the right direction:
1. revert the changes I pushed recently
2. make all usages of register_cpu_notifier __init and
   __initdata (if hotplug cpu is defined these are removed)
3. export the symbols register_cpu_notifier and
   unregister_cpu_notifier only if CONFIG_HOTPLUG_CPU is defined
4. move the hotplug-cpu-based usages of register_cpu_notifier
   inside #ifdef CONFIG_HOTPLUG_CPU (like xfs's usage).
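
Steps 3 and 4 above can be sketched roughly as follows. This is a userspace model under stated assumptions, not the actual patch: struct mount_model, mount_cpu_event and mount_model_init are hypothetical names standing in for a filesystem that keeps one notifier block per mount, and the registration function here skips the priority ordering the real chain maintains.

```c
#include <stddef.h>

/* Simplified model of the kernel's notifier_block. */
struct notifier_block {
	int (*notifier_call)(struct notifier_block *nb,
			     unsigned long action, void *data);
	struct notifier_block *next;
	int priority;
};

#ifdef CONFIG_HOTPLUG_CPU
static struct notifier_block *cpu_chain;

/* Step 3: only built (and, in the kernel, only exported) with hotplug. */
int register_cpu_notifier(struct notifier_block *nb)
{
	nb->next = cpu_chain;	/* priority ordering omitted for brevity */
	cpu_chain = nb;
	return 0;
}

/* Hypothetical per-mount callback. */
static int mount_cpu_event(struct notifier_block *nb,
			   unsigned long action, void *data)
{
	(void)nb; (void)action; (void)data;
	return 0;
}
#endif

/* Hypothetical per-mount structure that owns its notifier block. */
struct mount_model {
#ifdef CONFIG_HOTPLUG_CPU
	struct notifier_block cpu_nb;
#endif
	int registered;
};

/* Step 4: the hotplug-only registration lives entirely inside the #ifdef,
 * so a non-hotplug build never references register_cpu_notifier(). */
int mount_model_init(struct mount_model *m)
{
	m->registered = 0;
#ifdef CONFIG_HOTPLUG_CPU
	m->cpu_nb.notifier_call = mount_cpu_event;
	m->cpu_nb.priority = 0;
	register_cpu_notifier(&m->cpu_nb);
	m->registered = 1;
#endif
	return m->registered;
}
```

In a build without CONFIG_HOTPLUG_CPU the symbol does not exist at all, so any caller outside the #ifdef fails at build or module-load time instead of oopsing at runtime.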

I have a few questions:
- any problems with the above direction (mainly 3) ?
- Should we proceed in this direction ?
- is it too late for 2.6.17 ? if not late how much time do we have ?

Many thanks to Alan for bringing up the issue.

regards,

chandra

> On Wed, Apr 26, 2006 at 01:21:33PM -0700, Chandra Seetharaman wrote:
> > >
> > > The problem we ran into was some of the startup code depends on the notifier
> > > call chain for smp bringup, hence we couldn't nuke it similar to
> > > hotcpu_notifier().
> >
> > I do not understand the problem. If everybody that uses
> > register_cpu_notifier() starts using __cpuinit and __cpuinitdata (or the
> > devinit siblings), then the notifier mechanism will not be any different
> > from what it is now, right? (both in the hotplug cpu and non-hotplug cpu
> > cases) Or am I missing something?
>
> Well, register_cpu_notifier() is an exported function. Several
> modules, like cpufreq, use it today, which disqualifies it from being
> an init-style function.
>
> Either that function should be devinit and be present permanently, or
> it should be mapped to a null macro for correctness.
>
> Otherwise module loaders will start to oops when they call into
> register.
>
--

----------------------------------------------------------------------

Chandra Seetharaman | Be careful what you choose....
- sekh...@us.ibm.com | .......you may get it.
----------------------------------------------------------------------

Andrew Morton

Apr 28, 2006, 7:23:10 PM

to sekh...@us.ibm.com, torv...@osdl.org, asho...@intel.com, st...@rowland.harvard.edu, her...@13thfloor.at, linux-...@vger.kernel.org, linu...@oss.sgi.com, xfs-m...@oss.sgi.com

Chandra Seetharaman <sekh...@us.ibm.com> wrote:
>
> Looks like the patches I provided are a step backward from where Ashok &
> Andrew were taking the register_cpu_notifier stuff.
>
> After some discussions with Ashok we both think the following would be
> the right direction:
> 1. revert the changes I pushed recently
> 2. make all usages of register_cpu_notifier __init and
>    __initdata (if hotplug cpu is defined these are removed)
> 3. export the symbols register_cpu_notifier and
>    unregister_cpu_notifier only if CONFIG_HOTPLUG_CPU is defined
> 4. move the hotplug-cpu-based usages of register_cpu_notifier
>    inside #ifdef CONFIG_HOTPLUG_CPU (like xfs's usage).
>
> I have a few questions:
> - any problems with the above direction (mainly 3) ?
> - Should we proceed in this direction ?
> - is it too late for 2.6.17 ? if not late how much time do we have ?

hm. I'm leaning more towards doing something expedient and obvious for
2.6.17. It's pretty late in the cycle, and the only downside is the loss
of a kbyte or two. Plus I'll be at linuxtag next week and won't be around to
help out.

So if it's OK, can we do something minimal, revisit it after 2.6.17?

Linus Torvalds

Apr 28, 2006, 7:36:09 PM

to Andrew Morton, sekh...@us.ibm.com, asho...@intel.com, st...@rowland.harvard.edu, her...@13thfloor.at, linux-...@vger.kernel.org, linu...@oss.sgi.com, xfs-m...@oss.sgi.com

On Fri, 28 Apr 2006, Andrew Morton wrote:
>
> hm. I'm leaning more towards doing something expedient and obvious for
> 2.6.17. It's pretty late in the cycle, and the only downside is the loss
> of a kbyte or two. Plus I'll be at linuxtag next week and won't be around to
> help out.
>
> So if it's OK, can we do something minimal, revisit it after 2.6.17?

I'm personally fine with the current state: it should be stable and work.
If there are any remaining _bugs_ that people know about, please send
fixes to me, but I think we can definitely leave the "free the unnecessary
memory" stuff to after 2.6.17.

Linus

Chandra Seetharaman

Apr 28, 2006, 7:43:53 PM

to Andrew Morton, torv...@osdl.org, asho...@intel.com, Alan Stern, her...@13thfloor.at, linux-...@vger.kernel.org, linu...@oss.sgi.com, xfs-m...@oss.sgi.com

On Fri, 2006-04-28 at 16:23 -0700, Andrew Morton wrote:
> Chandra Seetharaman <sekh...@us.ibm.com> wrote:
> >
> > Looks like the patches I provided are a step backward from where Ashok &
> > Andrew were taking the register_cpu_notifier stuff.
> >
> > After some discussions with Ashok we both think the following would be
> > the right direction:
> > 1. revert the changes I pushed recently
> > 2. make all usages of register_cpu_notifier __init and
> >    __initdata (if hotplug cpu is defined these are removed)
> > 3. export the symbols register_cpu_notifier and
> >    unregister_cpu_notifier only if CONFIG_HOTPLUG_CPU is defined
> > 4. move the hotplug-cpu-based usages of register_cpu_notifier
> >    inside #ifdef CONFIG_HOTPLUG_CPU (like xfs's usage).
> >
> > I have a few questions:
> > - any problems with the above direction (mainly 3) ?
> > - Should we proceed in this direction ?
> > - is it too late for 2.6.17 ? if not late how much time do we have ?
>
> hm. I'm leaning more towards doing something expedient and obvious for
> 2.6.17. It's pretty late in the cycle, and the only downside is the loss
> of a kbyte or two. Plus I'll be at linuxtag next week and won't be around to
> help out.
>
> So if it's OK, can we do something minimal, revisit it after 2.6.17?

- if we are ok with a loss of a kbyte or two, 2.6.17 is fine as is
(with my incorrect patches in).
- if we want to save that memory, we can revert the two patches and fix
xfs to make the register calls only when hotplug cpu is defined. This
change is also minimal. It is a step in the right direction.

The only downside I can see in reverting my patch is that if there are any
other modules doing the same as what xfs was doing, we might
trip over a similar oops.

chandra
--

----------------------------------------------------------------------
Chandra Seetharaman | Be careful what you choose....
- sekh...@us.ibm.com | .......you may get it.
----------------------------------------------------------------------

Chandra Seetharaman

Apr 28, 2006, 7:48:29 PM

to Linus Torvalds, Andrew Morton, asho...@intel.com, Alan Stern, her...@13thfloor.at, linux-...@vger.kernel.org, linu...@oss.sgi.com, xfs-m...@oss.sgi.com

On Fri, 2006-04-28 at 16:33 -0700, Linus Torvalds wrote:
>
> On Fri, 28 Apr 2006, Andrew Morton wrote:
> >
> > hm. I'm leaning more towards doing something expedient and obvious for
> > 2.6.17. It's pretty late in the cycle, and the only downside is the loss
> > of a kbyte or two. Plus I'll be at linuxtag next week and won't be around to
> > help out.
> >
> > So if it's OK, can we do something minimal, revisit it after 2.6.17?
>
> I'm personally fine with the current state: it should be stable and work.

in that case _no_ change would be the best option.

> If there are any remaining _bugs_ that people know about, please send
> fixes to me, but I think we can definitely leave the "free the unnecessary

No bugs that i know of.

> memory" stuff to after 2.6.17.

sounds good.
>
> Linus
--

----------------------------------------------------------------------
Chandra Seetharaman | Be careful what you choose....
- sekh...@us.ibm.com | .......you may get it.
----------------------------------------------------------------------

Alan Stern

Apr 29, 2006, 11:31:30 AM

to Chandra Seetharaman, Andrew Morton, torv...@osdl.org, asho...@intel.com, her...@13thfloor.at, linux-...@vger.kernel.org, linu...@oss.sgi.com, xfs-m...@oss.sgi.com

On Fri, 28 Apr 2006, Chandra Seetharaman wrote:

> - if we are ok with a loss of a kbyte or two, 2.6.17 is fine as is
> (with my incorrect patches in).
> - if we want to save that memory, we can revert the two patches and fix
> xfs to make the register calls only when hotplug cpu is defined. This
> change is also minimal. It is a step in the right direction.
>
> The only downside I can see in reverting my patch is that if there are any
> other modules doing the same as what xfs was doing, we might
> trip over a similar oops.

Once register_cpu_notifier is placed in an init section, everything should
be okay. If some other module does _exactly_ what xfs did, it won't oops
-- instead the module will get an unresolved symbol error whenever someone
tries to insmod it, because the register_cpu_notifier symbol won't be
defined. I think this is an appropriate kind of failure mode.

However, it wouldn't hurt to add some comments to the definition and
declaration of register_cpu_notifier, explaining the circumstances in
which it should be used.

Alan Stern