
Linux 2.6.29


Linus Torvalds

Mar 23, 2009, 7:31:09 PM
to Linux Kernel Mailing List

It's out there now, or at least in the process of getting mirrored out.

The most obvious change is the (temporary) change of logo to Tuz, the
Tasmanian Devil. But there's a number of driver updates and some m68k
header updates (fixing headers_install after the merge of non-MMU/MMU)
that end up being pretty noticeable in the diffs.

The shortlog (from -rc8, obviously - the full logs from 2.6.28 are too big
to even contemplate attaching here) is appended, and most of the non-logo
changes really shouldn't be all that noticeable to most people. Nothing
really exciting, although I admit to fleetingly considering another -rc
series just because the changes are bigger than I would have wished for
this late in the game. But there was little point in holding off the real
release any longer, I feel.

This obviously starts the merge window for 2.6.30, although as usual, I'll
probably wait a day or two before I start actively merging. I do that in
the hope that people will test the final plain 2.6.29 a bit more before
all the crazy changes start up again.

Linus

---
Aaro Koskinen (2):
ARM: OMAP: sched_clock() corrected
ARM: OMAP: Allow I2C bus driver to be compiled as a module

Abhijeet Joglekar (2):
[SCSI] libfc: Pass lport in exch_mgr_reset
[SCSI] libfc: when rport goes away (re-plogi), clean up exchanges to/from rport

Achilleas Kotsis (1):
USB: Add device id for Option GTM380 to option driver

Al Viro (1):
net: fix sctp breakage

Alan Stern (2):
USB: usbfs: keep async URBs until the device file is closed
USB: EHCI: expedite unlinks when the root hub is suspended

Albert Pauw (1):
USB: option.c: add ZTE 622 modem device

Alexander Duyck (1):
igb: remove ASPM L0s workaround

Andrew Vasquez (4):
[SCSI] qla2xxx: Correct address range checking for option-rom updates.
[SCSI] qla2xxx: Correct truncation in return-code status checking.
[SCSI] qla2xxx: Correct overwrite of pre-assigned init-control-block structure size.
[SCSI] qla2xxx: Update version number to 8.03.00-k4.

Andy Whitcroft (1):
suspend: switch the Asus Pundit P1-AH2 to old ACPI sleep ordering

Anirban Chakraborty (1):
[SCSI] qla2xxx: Correct vport delete bug.

Anton Vorontsov (1):
ucc_geth: Fix oops when using fixed-link support

Antti Palosaari (1):
V4L/DVB (10972): zl10353: i2c_gate_ctrl bug fix

Axel Wachtler (1):
USB: serial: add FTDI USB/Serial converter devices

Ben Dooks (6):
[ARM] S3C64XX: Set GPIO pin when select IRQ_EINT type
[ARM] S3C64XX: Rename IRQ_UHOST to IRQ_USBH
[ARM] S3C64XX: Fix name of USB host clock.
[ARM] S3C64XX: Fix USB host clock mux list
[ARM] S3C64XX: sparse warnings in arch/arm/plat-s3c64xx/s3c6400-clock.c
[ARM] S3C64XX: sparse warnings in arch/arm/plat-s3c64xx/irq.c

Benjamin Herrenschmidt (2):
emac: Fix clock control for 405EX and 405EXr chips
radeonfb: Whack the PCI PM register until it sticks

Benny Halevy (1):
NFSD: provide encode routine for OP_OPENATTR

Bjørn Mork (1):
ipv6: fix display of local and remote sit endpoints

Borislav Petkov (1):
ide-floppy: do not map dataless cmds to an sg

Carlos Corbacho (2):
acpi-wmi: Unmark as 'experimental'
acer-wmi: Unmark as 'experimental'

Chris Leech (3):
[SCSI] libfc: rport retry on LS_RJT from certain ELS
[SCSI] fcoe: fix handling of pending queue, prevent out of order frames (v3)
ixgbe: fix multiple unicast address support

Chris Mason (2):
Btrfs: Fix locking around adding new space_info
Btrfs: Clear space_info full when adding new devices

Christoph Paasch (2):
netfilter: conntrack: fix dropping packet after l4proto->packet()
netfilter: conntrack: check for NEXTHDR_NONE before header sanity checking

Chuck Lever (2):
NLM: Shrink the IPv4-only version of nlm_cmp_addr()
NLM: Fix GRANT callback address comparison when IPv6 is enabled

Corentin Chary (4):
asus-laptop: restore acpi_generate_proc_event()
eeepc-laptop: restore acpi_generate_proc_event()
asus-laptop: use select instead of depends on
platform/x86: depends instead of select for laptop platform drivers

Cyrill Gorcunov (1):
acpi: check for pxm_to_node_map overflow

Daisuke Nishimura (1):
vmscan: pgmoved should be cleared after updating recent_rotated

Dan Carpenter (1):
acer-wmi: double free in acer_rfkill_exit()

Dan Williams (1):
USB: Option: let cdc-acm handle Sony Ericsson F3507g / Dell 5530

Darius Augulis (1):
MX1 fix include

Dave Jones (1):
via-velocity: Fix DMA mapping length errors on transmit.

David Brownell (2):
ARM: OMAP: Fix compile error if pm.h is included
dm9000: locking bugfix

David S. Miller (3):
dnet: Fix warnings on 64-bit.
xfrm: Fix xfrm_state_find() wrt. wildcard source address.
sparc64: Reschedule KGDB capture to a software interrupt.

Davide Libenzi (1):
eventfd: remove fput() call from possible IRQ context

Dhananjay Phadke (1):
netxen: remove old flash check.

Dirk Hohndel (1):
USB: Add Vendor/Product ID for new CDMA U727 to option driver

Eilon Greenstein (3):
bnx2x: Adding restriction on sge_buf_size
bnx2x: Casting page alignment
bnx2x: Using DMAE to initialize the chip

Enrik Berkhan (1):
nommu: ramfs: pages allocated to an inode's pagecache may get wrongly discarded

Eric Sandeen (3):
ext4: fix header check in ext4_ext_search_right() for deep extent trees.
ext4: fix bogus BUG_ONs in in mballoc code
ext4: fix bb_prealloc_list corruption due to wrong group locking

FUJITA Tomonori (1):
ide: save the returned value of dma_map_sg

Geert Uytterhoeven (1):
ps3/block: Replace mtd/ps3vram by block/ps3vram

Geoff Levand (1):
powerpc/ps3: ps3_defconfig updates

Gerald Schaefer (1):
[S390] Dont check for pfn_valid() in uaccess_pt.c

Gertjan van Wingerde (1):
Update my email address

Grant Grundler (2):
parisc: fix wrong assumption about bus->self
parisc: update MAINTAINERS

Grant Likely (1):
Fix Xilinx SystemACE driver to handle empty CF slot

Greg Kroah-Hartman (3):
USB: usbtmc: fix stupid bug in open()
USB: usbtmc: add protocol 1 support
Staging: benet: remove driver now that it is merged in drivers/net/

Greg Ungerer (8):
m68k: merge the non-MMU and MMU versions of param.h
m68k: merge the non-MMU and MMU versions of swab.h
m68k: merge the non-MMU and MMU versions of sigcontext.h
m68k: use MMU version of setup.h for both MMU and non-MMU
m68k: merge the non-MMU and MMU versions of ptrace.h
m68k: merge the non-MMU and MMU versions of signal.h
m68k: use the MMU version of unistd.h for all m68k platforms
m68k: merge the non-MMU and MMU versions of siginfo.h

Gregory Lardiere (1):
V4L/DVB (10789): m5602-s5k4aa: Split up the initial sensor probe in chunks.

Hans Werner (1):
V4L/DVB (10977): STB6100 init fix, the call to stb6100_set_bandwidth needs an argument

Hartley Sweeten (1):
[ARM] 5419/1: ep93xx: fix build warnings about struct i2c_board_info

Heiko Carstens (2):
[S390] topology: define SD_MC_INIT to fix performance regression
[S390] ftrace/mcount: fix kernel stack backchain

Helge Deller (7):
parisc: BUG_ON() cleanup
parisc: fix section mismatch warnings
parisc: fix `struct pt_regs' declared inside parameter list warning
parisc: remove unused local out_putf label
parisc: fix dev_printk() compile warnings for accessing a device struct
parisc: add braces around arguments in assembler macros
parisc: fix 64bit build

Herbert Xu (1):
gro: Fix legacy path napi_complete crash

Huang Ying (1):
dm crypt: fix kcryptd_async_done parameter

Ian Dall (1):
Bug 11061, NFS mounts dropped

Igor M. Liplianin (1):
V4L/DVB (10976): Bug fix: For legacy applications stv0899 performs search only first time after insmod.

Ilya Yanok (3):
dnet: Dave DNET ethernet controller driver (updated)
dnet: replace obsolete *netif_rx_* functions with *napi_*
dnet: DNET should depend on HAS_IOMEM

Ingo Molnar (1):
kconfig: improve seed in randconfig

J. Bruce Fields (1):
nfsd: nfsd should drop CAP_MKNOD for non-root

James Bottomley (1):
parisc: remove klist iterators

Jan Dumon (1):
USB: unusual_devs: Add support for GI 0431 SD-Card interface

Jay Vosburgh (1):
bonding: Fix updating of speed/duplex changes

Jeff Moyer (1):
aio: lookup_ioctx can return the wrong value when looking up a bogus context

Jiri Slaby (8):
ACPI: remove doubled status checking
USB: atm/cxacru, fix lock imbalance
USB: image/mdc800, fix lock imbalance
USB: misc/adutux, fix lock imbalance
USB: misc/vstusb, fix lock imbalance
USB: wusbcore/wa-xfer, fix lock imbalance
ALSA: pcm_oss, fix locking typo
ALSA: mixart, fix lock imbalance

Jody McIntyre (1):
trivial: fix orphan dates in ext2 documentation

Johannes Weiner (3):
HID: fix incorrect free in hiddev
HID: fix waitqueue usage in hiddev
nommu: ramfs: don't leak pages when adding to page cache fails

John Dykstra (1):
ipv6: Fix BUG when disabled ipv6 module is unloaded

John W. Linville (1):
lib80211: silence excessive crypto debugging messages

Jorge Boncompte [DTI2] (1):
netns: oops in ip[6]_frag_reasm incrementing stats

Jouni Malinen (3):
mac80211: Fix panic on fragmentation with power saving
zd1211rw: Do not panic on device eject when associated
nl80211: Check that function pointer != NULL before using it

Karsten Wiese (1):
USB: EHCI: Fix isochronous URB leak

Kay Sievers (1):
parisc: dino: struct device - replace bus_id with dev_name(), dev_set_name()

Koen Kooi (1):
ARM: OMAP: board-omap3beagle: set i2c-3 to 100kHz

Krzysztof Helt (1):
ALSA: opl3sa2 - Fix NULL dereference when suspending snd_opl3sa2

Kumar Gala (2):
powerpc/mm: Respect _PAGE_COHERENT on classic ppc32 SW
powerpc/mm: Fix Respect _PAGE_COHERENT on classic ppc32 SW TLB load machines

Kyle McMartin (8):
parisc: fix use of new cpumask api in irq.c
parisc: convert (read|write)bwlq to inlines
parisc: convert cpu_check_affinity to new cpumask api
parisc: define x->x mmio accessors
parisc: update defconfigs
parisc: sba_iommu: fix build bug when CONFIG_PARISC_AGP=y
tulip: fix crash on iface up with shirq debug
Build with -fno-dwarf2-cfi-asm

Lalit Chandivade (1):
[SCSI] qla2xxx: Use correct value for max vport in LOOP topology.

Len Brown (1):
Revert "ACPI: make some IO ports off-limits to AML"

Lennert Buytenhek (1):
mv643xx_eth: fix unicast address filter corruption on mtu change

Li Zefan (1):
block: fix memory leak in bio_clone()

Linus Torvalds (7):
Fix potential fast PIT TSC calibration startup glitch
Fast TSC calibration: calculate proper frequency error bounds
Avoid 64-bit "switch()" statements on 32-bit architectures
Add '-fwrapv' to gcc CFLAGS
Fix race in create_empty_buffers() vs __set_page_dirty_buffers()
Move cc-option to below arch-specific setup
Linux 2.6.29

Luis R. Rodriguez (2):
ath9k: implement IO serialization
ath9k: AR9280 PCI devices must serialize IO as well

Maciej Sosnowski (1):
dca: add missing copyright/license headers

Manu Abraham (1):
V4L/DVB (10975): Bug: Use signed types, Offsets and range can be negative

Mark Brown (5):
[ARM] S3C64XX: Fix section mismatch for s3c64xx_register_clocks()
[ARM] SMDK6410: Correct I2C device name for WM8580
[ARM] SMDK6410: Declare iodesc table static
[ARM] S3C64XX: Staticise s3c64xx_init_irq_eint()
[ARM] S3C64XX: Do gpiolib configuration earlier

Mark Lord (1):
sata_mv: fix MSI irq race condition

Martin Schwidefsky (3):
[S390] __div64_31 broken for CONFIG_MARCH_G5
[S390] make page table walking more robust
[S390] make page table upgrade work again

Masami Hiramatsu (2):
prevent boosting kprobes on exception address
module: fix refptr allocation and release order

Mathieu Chouquet-Stringer (1):
thinkpad-acpi: fix module autoloading for older models

Matthew Wilcox (1):
[SCSI] sd: Don't try to spin up drives that are connected to an inactive port

Matthias Schwarzzot (1):
V4L/DVB (10978): Report tuning algorith correctly

Mauro Carvalho Chehab (1):
V4L/DVB (10834): zoran: auto-select bt866 for AverMedia 6 Eyes

Michael Chan (1):
bnx2: Fix problem of using wrong IRQ handler.

Michael Hennerich (1):
USB: serial: ftdi: enable UART detection on gnICE JTAG adaptors blacklist interface0

Mike Travis (1):
parisc: update parisc for new irq_desc

Miklos Szeredi (1):
fix ptrace slowness

Mikulas Patocka (3):
dm table: rework reference counting fix
dm io: respect BIO_MAX_PAGES limit
sparc64: Fix crash with /proc/iomem

Milan Broz (2):
dm ioctl: validate name length when renaming
dm crypt: wait for endio to complete before destruction

Moritz Muehlenhoff (1):
USB: Updated unusual-devs entry for USB mass storage on Nokia 6233

Nobuhiro Iwamatsu (2):
sh_eth: Change handling of IRQ
sh_eth: Fix mistake of the address of SH7763

Pablo Neira Ayuso (2):
netfilter: conntrack: don't deliver events for racy packets
netfilter: ctnetlink: fix crash during expectation creation

Pantelis Koukousoulas (1):
virtio_net: Make virtio_net support carrier detection

Piotr Ziecik (1):
powerpc/5200: Enable CPU_FTR_NEED_COHERENT for MPC52xx

Ralf Baechle (1):
MIPS: Mark Eins: Fix configuration.

Robert Love (11):
[SCSI] libfc: Don't violate transport template for rogue port creation
[SCSI] libfc: correct RPORT_TO_PRIV usage
[SCSI] libfc: rename rp to rdata in fc_disc_new_target()
[SCSI] libfc: check for err when recv and state is incorrect
[SCSI] libfc: Cleanup libfc_function_template comments
[SCSI] libfc, fcoe: Fix kerneldoc comments
[SCSI] libfc, fcoe: Cleanup function formatting and minor typos
[SCSI] libfc, fcoe: Remove unnecessary cast by removing inline wrapper
[SCSI] fcoe: Use setup_timer() and mod_timer()
[SCSI] fcoe: Correct fcoe_transports initialization vs. registration
[SCSI] fcoe: Change fcoe receive thread nice value from 19 (lowest priority) to -20

Robert M. Kenney (1):
USB: serial: new cp2101 device id

Roel Kluin (3):
[SCSI] fcoe: fix kfree(skb)
acpi-wmi: unsigned cannot be less than 0
net: kfree(napi->skb) => kfree_skb

Ron Mercer (4):
qlge: bugfix: Increase filter on inbound csum.
qlge: bugfix: Tell hw to strip vlan header.
qlge: bugfix: Move netif_napi_del() to common call point.
qlge: bugfix: Pad outbound frames smaller than 60 bytes.

Russell King (2):
[ARM] update mach-types
[ARM] Fix virtual to physical translation macro corner cases

Rusty Russell (1):
linux.conf.au 2009: Tuz

Saeed Bishara (1):
[ARM] orion5x: pass dram mbus data to xor driver

Sam Ravnborg (1):
kconfig: fix randconfig for choice blocks

Sathya Perla (3):
net: Add be2net driver.
be2net: replenish when posting to rx-queue is starved in out of mem conditions
be2net: fix to restore vlan ids into BE2 during a IF DOWN->UP cycle

Scott James Remnant (1):
sbus: Auto-load openprom module when device opened.

Sigmund Augdal (1):
V4L/DVB (10974): Use Diseqc 3/3 mode to send data

Stanislaw Gruszka (1):
net: Document /proc/sys/net/core/netdev_budget

Stephen Hemminger (1):
sungem: missing net_device_ops

Stephen Rothwell (1):
net: update dnet.c for bus_id removal

Steve Glendinning (1):
smsc911x: reset last known duplex and carrier on open

Steve Ma (1):
[SCSI] libfc: exch mgr is freed while lport still retrying sequences

Stuart MENEFY (1):
libata: Keep shadow last_ctl up to date during resets

Suresh Jayaraman (1):
NFS: Handle -ESTALE error in access()

Takashi Iwai (3):
ALSA: hda - Fix DMA mask for ATI controllers
ALSA: hda - Workaround for buggy DMA position on ATI controllers
ALSA: Fix vunmap and free order in snd_free_sgbuf_pages()

Tao Ma (2):
ocfs2: Fix a bug found by sparse check.
ocfs2: Use xs->bucket to set xattr value outside

Tejun Heo (1):
ata_piix: add workaround for Samsung DB-P70

Theodore Ts'o (1):
ext4: Print the find_group_flex() warning only once

Thomas Bartosik (1):
USB: storage: Unusual USB device Prolific 2507 variation added

Tiger Yang (2):
ocfs2: reserve xattr block for new directory with inline data
ocfs2: tweak to get the maximum inline data size with xattr

Tilman Schmidt (1):
bas_gigaset: correctly allocate USB interrupt transfer buffer

Trond Myklebust (6):
SUNRPC: Tighten up the task locking rules in __rpc_execute()
NFS: Fix misparsing of nfsv4 fs_locations attribute (take 2)
NFSv3: Fix posix ACL code
SUNRPC: Fix an Oops due to socket not set up yet...
SUNRPC: xprt_connect() don't abort the task if the transport isn't bound
NFS: Fix the fix to Bugzilla #11061, when IPv6 isn't defined...

Tyler Hicks (3):
eCryptfs: don't encrypt file key with filename key
eCryptfs: Allocate a variable number of pages for file headers
eCryptfs: NULL crypt_stat dereference during lookup

Uwe Kleine-König (2):
[ARM] 5418/1: restore lr before leaving mcount
[ARM] 5421/1: ftrace: fix crash due to tracing of __naked functions

Vasu Dev (5):
[SCSI] libfc: handle RRQ exch timeout
[SCSI] libfc: fixed a soft lockup issue in fc_exch_recv_abts
[SCSI] libfc, fcoe: fixed locking issues with lport->lp_mutex around lport->link_status
[SCSI] libfc: fixed a read IO data integrity issue when a IO data frame lost
[SCSI] fcoe: Out of order tx frames was causing several check condition SCSI status

Viral Mehta (1):
ALSA: oss-mixer - Fixes recording gain control

Vitaly Wool (1):
V4L/DVB (10832): tvaudio: Avoid breakage with tda9874a

Werner Almesberger (1):
[ARM] S3C64XX: Fix s3c64xx_setrate_clksrc

Yi Zou (2):
[SCSI] libfc: do not change the fh_rx_id of a recevied frame
[SCSI] fcoe: ETH_P_8021Q is already in if_ether and fcoe is not using it anyway

Zhang Le (2):
MIPS: Fix TIF_32BIT undefined problem when seccomp is disabled
filp->f_pos not correctly updated in proc_task_readdir

Zhang Rui (1):
ACPI suspend: Blacklist Toshiba Satellite L300 that requires to set SCI_EN directly on resume

françois romieu (2):
r8169: use hardware auto-padding.
r8169: revert "r8169: read MAC address from EEPROM on init (2nd attempt)"

Jun'ichi Nomura (1):
block: Add gfp_mask parameter to bio_integrity_clone()
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majo...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

Jesper Krogh

Mar 24, 2009, 2:20:31 AM
to Linus Torvalds, Linux Kernel Mailing List
Linus Torvalds wrote:
> This obviously starts the merge window for 2.6.30, although as usual, I'll
> probably wait a day or two before I start actively merging. I do that in
> the hope that people will test the final plain 2.6.29 a bit more before
> all the crazy changes start up again.

I know this has been discussed before:

[129401.996244] INFO: task updatedb.mlocat:31092 blocked for more than
480 seconds.
[129402.084667] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables this message.
[129402.179331] updatedb.mloc D 0000000000000000 0 31092 31091
[129402.179335] ffff8805ffa1d900 0000000000000082 ffff8803ff5688a8
0000000000001000
[129402.179338] ffffffff806cc000 ffffffff806cc000 ffffffff806d3e80
ffffffff806d3e80
[129402.179341] ffffffff806cfe40 ffffffff806d3e80 ffff8801fb9f87e0
000000000000ffff
[129402.179343] Call Trace:
[129402.179353] [<ffffffff802d3ff0>] sync_buffer+0x0/0x50
[129402.179358] [<ffffffff80493a50>] io_schedule+0x20/0x30
[129402.179360] [<ffffffff802d402b>] sync_buffer+0x3b/0x50
[129402.179362] [<ffffffff80493d2f>] __wait_on_bit+0x4f/0x80
[129402.179364] [<ffffffff802d3ff0>] sync_buffer+0x0/0x50
[129402.179366] [<ffffffff80493dda>] out_of_line_wait_on_bit+0x7a/0xa0
[129402.179369] [<ffffffff80252730>] wake_bit_function+0x0/0x30
[129402.179396] [<ffffffffa0264346>] ext3_find_entry+0xf6/0x610 [ext3]
[129402.179399] [<ffffffff802d3453>] __find_get_block+0x83/0x170
[129402.179403] [<ffffffff802c4a90>] ifind_fast+0x50/0xa0
[129402.179405] [<ffffffff802c5874>] iget_locked+0x44/0x180
[129402.179412] [<ffffffffa0266435>] ext3_lookup+0x55/0x100 [ext3]
[129402.179415] [<ffffffff802c32a7>] d_alloc+0x127/0x1c0
[129402.179417] [<ffffffff802ba2a7>] do_lookup+0x1b7/0x250
[129402.179419] [<ffffffff802bc51d>] __link_path_walk+0x76d/0xd60
[129402.179421] [<ffffffff802ba17f>] do_lookup+0x8f/0x250
[129402.179424] [<ffffffff802c8b37>] mntput_no_expire+0x27/0x150
[129402.179426] [<ffffffff802bcb64>] path_walk+0x54/0xb0
[129402.179428] [<ffffffff802bfd10>] filldir+0x0/0xf0
[129402.179430] [<ffffffff802bcc8a>] do_path_lookup+0x7a/0x150
[129402.179432] [<ffffffff802bbb55>] getname+0xe5/0x1f0
[129402.179434] [<ffffffff802bd8d4>] user_path_at+0x44/0x80
[129402.179437] [<ffffffff802b53b5>] cp_new_stat+0xe5/0x100
[129402.179440] [<ffffffff802b56d0>] vfs_lstat_fd+0x20/0x60
[129402.179442] [<ffffffff802b5737>] sys_newlstat+0x27/0x50
[129402.179445] [<ffffffff8020c35b>] system_call_fastpath+0x16/0x1b

Consensus seems to be something with large memory machines, lots of
dirty pages and a long writeout time due to ext3.

At the moment this is the largest "usability" issue in the server setup I'm
working with. Can something be done to "autotune" it .. or perhaps
even fix it? .. or is it just to shift to xfs or wait for ext4?

Jesper
--
Jesper

David Rees

Mar 24, 2009, 2:47:20 AM
to Jesper Krogh, Linus Torvalds, Linux Kernel Mailing List
On Mon, Mar 23, 2009 at 11:19 PM, Jesper Krogh <jes...@krogh.cc> wrote:
> I know this has been discussed before:
>
> [129401.996244] INFO: task updatedb.mlocat:31092 blocked for more than 480
> seconds.

Ouch - 480 seconds, how much memory is in that machine, and how slow
are the disks? What's your vm.dirty_background_ratio and
vm.dirty_ratio set to?

> Consensus seems to be something with large memory machines, lots of dirty
> pages and a long writeout time due to ext3.

All filesystems seem to suffer from this issue to some degree. I
posted to the list earlier trying to see if there was anything that
could be done to help my specific case. I've got a system where if
someone starts writing out a large file, it kills client NFS writes.
Makes the system unusable:
http://marc.info/?l=linux-kernel&m=123732127919368&w=2

Only workaround I've found is to reduce dirty_background_ratio and
dirty_ratio to tiny levels. Or throw good SSDs and/or a fast RAID
array at it so that large writes complete faster. Have you tried the
new vm_dirty_bytes in 2.6.29?
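
(For reference, the actual 2.6.29 knobs are /proc/sys/vm/dirty_bytes and
/proc/sys/vm/dirty_background_bytes. Something like this is a starting
point - the numbers here are purely illustrative, pick whatever your
array can flush in a second or two:

echo 268435456 > /proc/sys/vm/dirty_background_bytes   # 256MB
echo 536870912 > /proc/sys/vm/dirty_bytes              # 512MB

Note the *_bytes and *_ratio knobs are mutually exclusive - writing one
zeroes its ratio counterpart.)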

> At the moment this is the largest "usability" issue in the server setup I'm
> working with. Can something be done to "autotune" it .. or perhaps
> even fix it? .. or is it just to shift to xfs or wait for ext4?

Everyone seems to agree that "autotuning" it is the way to go. But no
one seems willing to step up and try to do it. Probably because it's
hard to get right!

-Dave

Jesper Krogh

Mar 24, 2009, 3:32:50 AM
to David Rees, Linus Torvalds, Linux Kernel Mailing List
David Rees wrote:
> On Mon, Mar 23, 2009 at 11:19 PM, Jesper Krogh <jes...@krogh.cc> wrote:
>> I know this has been discussed before:
>>
>> [129401.996244] INFO: task updatedb.mlocat:31092 blocked for more than 480
>> seconds.
>
> Ouch - 480 seconds, how much memory is in that machine, and how slow
> are the disks?

The 480 seconds is not the "wait time" but the time that passes before the
message is printed. The kernel default was earlier 120 seconds, but that
was changed by Ingo Molnar back in September. I do get a lot less noise,
but it really doesn't tell anything about the nature of the problem.

The system's spec:
32GB of memory. The disks are a Nexsan SataBeast with 42 SATA drives in
RAID10, connected using 4Gbit fibre channel. I'll leave it up to you to
decide whether that's fast or slow.

The strange thing is actually that the above process (updatedb.mlocate)
is writing to /, which is a device without any activity at all. All
activity is on the Fibre Channel device above, but processes writing
outside that seem to be affected as well.

> What's your vm.dirty_background_ratio and
> vm.dirty_ratio set to?

2.6.29-rc8 defaults:
jk@hest:/proc/sys/vm$ cat dirty_background_ratio
5
jk@hest:/proc/sys/vm$ cat dirty_ratio
10

>> Consensus seems to be something with large memory machines, lots of dirty
>> pages and a long writeout time due to ext3.
>
> All filesystems seem to suffer from this issue to some degree. I
> posted to the list earlier trying to see if there was anything that
> could be done to help my specific case. I've got a system where if
> someone starts writing out a large file, it kills client NFS writes.
> Makes the system unusable:
> http://marc.info/?l=linux-kernel&m=123732127919368&w=2

Yes, I've hit 120s+ penalties just by saving a file in vim.

> Only workaround I've found is to reduce dirty_background_ratio and
> dirty_ratio to tiny levels. Or throw good SSDs and/or a fast RAID
> array at it so that large writes complete faster. Have you tried the
> new vm_dirty_bytes in 2.6.29?

No.. What would you suggest to be a reasonable setting for that?

> Everyone seems to agree that "autotuning" it is the way to go. But no
> one seems willing to step up and try to do it. Probably because it's
> hard to get right!

I can test patches.. but I'm not a kernel-developer.. unfortunately.

Jesper

--
Jesper

Ingo Molnar

Mar 24, 2009, 4:17:37 AM
to Jesper Krogh, David Rees, Linus Torvalds, Linux Kernel Mailing List

* Jesper Krogh <jes...@krogh.cc> wrote:

> David Rees wrote:
>> On Mon, Mar 23, 2009 at 11:19 PM, Jesper Krogh <jes...@krogh.cc> wrote:
>>> I know this has been discussed before:
>>>
>>> [129401.996244] INFO: task updatedb.mlocat:31092 blocked for more than 480
>>> seconds.
>>
>> Ouch - 480 seconds, how much memory is in that machine, and how slow
>> are the disks?
>
> The 480 seconds is not the "wait time" but the time that passes
> before the message is printed. The kernel default was earlier 120
> seconds, but that was changed by Ingo Molnar back in September. I do
> get a lot less noise, but it really doesn't tell anything about
> the nature of the problem.

That's true - the detector is really simple and only tries to flag
suspiciously long uninterruptible waits. It prints out the context
it finds, but otherwise does not dig deeper into exactly why
that delay happened.

Would you agree that the message is correct, and that there is some
sort of "tasks wait way too long" problem on your system?

Considering:

> The system's spec:
> 32GB of memory. The disks are a Nexsan SataBeast with 42 SATA drives in
> RAID10, connected using 4Gbit fibre channel. I'll leave it up to you to
> decide whether that's fast or slow.

[...]


> Yes, I've hit 120s+ penalties just by saving a file in vim.

i think it's fair to say that an almost 10-minute uninterruptible
sleep sucks for the user, by any reasonable standard. It is the year
2009, not 1959.

The delay might be difficult to fix, but it's still reality - and
that's the purpose of this particular debug helper: to rub reality
under our noses, whether we like it or not.

( _My_ personal pain threshold for waiting for the computer is
around 1 _second_. If any command does something that i cannot
Ctrl-C or Ctrl-Z my way out of i get annoyed. So the historic
limit for the hung tasks check was 10 seconds, then 60 seconds.
But people argued that it's too low so it was raised to 120 then
480 seconds. If almost 10 minutes of uninterruptible wait is still
acceptable then the watchdog can be turned off (because it's
basically pointless to run it in that case - no amount of delay
will be 'bad'). )

Ingo

Alan Cox

Mar 24, 2009, 5:15:52 AM
to David Rees, Jesper Krogh, Linus Torvalds, Linux Kernel Mailing List
> posted to the list earlier trying to see if there was anything that
> could be done to help my specific case. I've got a system where if
> someone starts writing out a large file, it kills client NFS writes.
> Makes the system unusable:
> http://marc.info/?l=linux-kernel&m=123732127919368&w=2

I have not had this problem since I applied Arjan's (for some reason
repeatedly rejected) patch to change the ioprio of the various writeback
daemons. Under some loads changing to the noop I/O scheduler also seems
to help (as do most of the non default ones)
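
(The scheduler is runtime-switchable per disk if anyone wants to
compare - sda here being just an example device:

cat /sys/block/sda/queue/scheduler
echo noop > /sys/block/sda/queue/scheduler
)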

> Everyone seems to agree that "autotuning" it is the way to go. But no
> one seems willing to step up and try to do it. Probably because it's
> hard to get right!

If this is a VM problem, why does fixing the I/O priority of the various
daemons seem to cure at least some of it?

Alan

Ingo Molnar

Mar 24, 2009, 5:33:24 AM
to Alan Cox, David Rees, Jesper Krogh, Linus Torvalds, Linux Kernel Mailing List

* Alan Cox <al...@lxorguk.ukuu.org.uk> wrote:

> > posted to the list earlier trying to see if there was anything that
> > could be done to help my specific case. I've got a system where if
> > someone starts writing out a large file, it kills client NFS writes.
> > Makes the system unusable:
> > http://marc.info/?l=linux-kernel&m=123732127919368&w=2
>
> I have not had this problem since I applied Arjan's (for some reason
> repeatedly rejected) patch to change the ioprio of the various writeback
> daemons. Under some loads changing to the noop I/O scheduler also seems
> to help (as do most of the non default ones)

(link would be useful)

Ingo

Alan Cox

Mar 24, 2009, 6:10:30 AM
to Ingo Molnar, David Rees, Jesper Krogh, Linus Torvalds, Linux Kernel Mailing List
> > I have not had this problem since I applied Arjan's (for some reason
> > repeatedly rejected) patch to change the ioprio of the various writeback
> > daemons. Under some loads changing to the noop I/O scheduler also seems
> > to help (as do most of the non default ones)
>
> (link would be useful)


"Give kjournald a IOPRIO_CLASS_RT io priority"

October 2007 (yes, it's that old)

And do the same as per discussion to the writeback tasks.


Which isn't to say there are not also vm problems - look at the I/O
patterns with any kernel after about 2.6.18/19 and there seems to be a
serious problem with writeback from the mm and fs writes falling over
each other and turning the smooth writeout into thrashing back and forth
as both try to write out different bits of the same stuff.

<Rant>
Really someone needs to sit down and actually build a proper model of the
VM behaviour in a tool like NetLogo, rather than continually adding
ever more complex and thus unpredictable hacks to it. That way we might
better understand what is occurring and why.
</Rant>

Alan

Ingo Molnar

Mar 24, 2009, 6:32:25 AM
to Alan Cox, Arjan van de Ven, Andrew Morton, Peter Zijlstra, Nick Piggin, Theodore Tso, Jens Axboe, David Rees, Jesper Krogh, Linus Torvalds, Linux Kernel Mailing List

* Alan Cox <al...@lxorguk.ukuu.org.uk> wrote:

> > > I have not had this problem since I applied Arjan's (for some reason
> > > repeatedly rejected) patch to change the ioprio of the various writeback
> > > daemons. Under some loads changing to the noop I/O scheduler also seems
> > > to help (as do most of the non default ones)
> >
> > (link would be useful)
>
>
> "Give kjournald a IOPRIO_CLASS_RT io priority"
>
> October 2007 (yes, it's that old)

thx. A more recent submission from Arjan would be:

http://lkml.org/lkml/2008/10/1/405

Resolution was that Tytso indicated it went into some sort of ext4
patch queue:

| I've ported the patch to the ext4 filesystem, and dropped it into
| the unstable portion of the ext4 patch queue.
|
| ext4: akpm's locking hack to fix locking delays

but 6 months down the line i can find no trace of this upstream
anywhere.

<let-me-rant-too>

The thing is ... this is a _bad_ ext3 design bug that has affected ext3
users for the last decade or so of ext3's existence. Why is this issue
not handled with the utmost priority, and why wasn't it fixed 5
years ago already? :-)

It does not matter whether we have extents or htrees when there are
_trivially reproducible_ basic usability problems with ext3.

Ingo

Jesper Krogh

Mar 24, 2009, 7:10:59 AM
to Ingo Molnar, David Rees, Linus Torvalds, Linux Kernel Mailing List
Ingo Molnar wrote:
> * Jesper Krogh <jes...@krogh.cc> wrote:
>
>> David Rees wrote:
>>> On Mon, Mar 23, 2009 at 11:19 PM, Jesper Krogh <jes...@krogh.cc> wrote:
>>>> I know this has been discussed before:
>>>>
>>>> [129401.996244] INFO: task updatedb.mlocat:31092 blocked for more than 480
>>>> seconds.
>>> Ouch - 480 seconds, how much memory is in that machine, and how slow
>>> are the disks?
>> The 480 seconds is not the "wait time" but the time that passes
>> before the message is printed. The kernel default was earlier 120
>> seconds, but that was changed by Ingo Molnar back in September. I do
>> get a lot less noise, but it really doesn't tell anything about
>> the nature of the problem.
>
> That's true - the detector is really simple and only tries to flag
> suspiciously long uninterruptible waits. It prints out the context
> it finds, but otherwise does not dig deeper into exactly why
> that delay happened.
>
> Would you agree that the message is correct, and that there is some
> sort of "tasks wait way too long" problem on your system?

The message is absolutely correct (it was even at 120s).. that's too long
for what I consider good.

> Considering:
>
>> The system's spec:
>> 32GB of memory. The disks are a Nexsan SataBeast with 42 SATA drives in
>> RAID10, connected using 4Gbit fibre channel. I'll leave it up to you to
>> decide whether that's fast or slow.
> [...]
>> Yes, I've hit 120s+ penalties just by saving a file in vim.
>
> i think it's fair to say that an almost 10-minute uninterruptible
> sleep sucks for the user, by any reasonable standard. It is the year
> 2009, not 1959.
>
> The delay might be difficult to fix, but it's still reality - and
> that's the purpose of this particular debug helper: to rub reality
> under our noses, whether we like it or not.
>
> ( _My_ personal pain threshold for waiting for the computer is
> around 1 _second_. If any command does something that i cannot
> Ctrl-C or Ctrl-Z my way out of i get annoyed. So the historic
> limit for the hung tasks check was 10 seconds, then 60 seconds.
> But people argued that it's too low so it was raised to 120 then
> 480 seconds. If almost 10 minutes of uninterruptible wait is still
> acceptable then the watchdog can be turned off (because it's
> basically pointless to run it in that case - no amount of delay
> will be 'bad'). )

That's about the same definition for me. But I can accept that if I
happen to be doing something really crazy.. but this is merely about
reading some files in and generating indexes out of them. None of the
files are "huge".. < 15GB for the top 3, average < 100MB.

--
Jesper

Andrew Morton

Mar 24, 2009, 7:25:23 AM
to Ingo Molnar, Alan Cox, Arjan van de Ven, Peter Zijlstra, Nick Piggin, Theodore Tso, Jens Axboe, David Rees, Jesper Krogh, Linus Torvalds, Linux Kernel Mailing List

It's all there in that Oct 2008 thread.

The proposed tweak to kjournald is a bad fix - partly because it will
elevate the priority of vast amounts of IO whose priority we don't _want_
elevated.

But mainly because the problem lies elsewhere - in an area of contention
between the committing and running transactions which we knowingly and
reluctantly added to fix a bug in

commit 773fc4c63442fbd8237b4805627f6906143204a8
Author: akpm <akpm>
AuthorDate: Sun May 19 23:23:01 2002 +0000
Commit: akpm <akpm>
CommitDate: Sun May 19 23:23:01 2002 +0000

[PATCH] fix ext3 buffer-stealing

Patch from sct fixes a long-standing (I did it!) and rather complex
problem with ext3.

The problem is to do with buffers which are continually being dirtied
by an external agent. I had code in there (for easily-triggerable
livelock avoidance) which steals the buffer from checkpoint mode and
reattaches it to the running transaction. This violates ext3 ordering
requirements - it can permit journal space to be reclaimed before the
relevant data has really been written out.

Also, we do have to reliably get a lock on the buffer when moving it
between lists and inspecting its internal state. Otherwise a competing
read from the underlying block device can trigger an assertion failure,
and a competing write to the underlying block device can confuse ext3
journalling state completely.

Now this:

> Resolution was that Tytso indicated it went into some sort of ext4
> patch queue:

was not a fix at all. It was a known-buggy hack which I proposed simply to
remove that contention point to let us find out if we're on the right
track. IIRC Ric was going to ask someone to do some performance testing of
that hack, but we never heard back.

The bottom line is that someone needs to do some serious rooting through
the very heart of JBD transaction logic and nobody has yet put their hand
up. If we do that, and it turns out to be just too hard to fix then yes,
perhaps that's the time to start looking at palliative bandaids.

The number of people who can be looked at to do serious ext3/JBD work is
pretty small now. Ted, Stephen and I got old and died. Jan does good work
but is spread thinly.

Alan Cox

Mar 24, 2009, 8:24:37 AM
to Andrew Morton, Ingo Molnar, Arjan van de Ven, Peter Zijlstra, Nick Piggin, Theodore Tso, Jens Axboe, David Rees, Jesper Krogh, Linus Torvalds, Linux Kernel Mailing List
> The proposed tweak to kjournald is a bad fix - partly because it will
> elevate the priority of vast amounts of IO whose priority we don't _want_
> elevated.

It's a huge improvement in practice because it both fixes the stupid
stalls and smooths out the rest of the I/O traffic. I spend a lot of my
time looking at what the disk driver is getting fed and it's not a good
mix. Even more revealing is the noop scheduler and the fact that this
frequently outperforms all the fancy I/O scheduling we do, even on
relatively dumb hardware (as well as showing how mixed up our I/O
patterns currently are).

> But mainly because the problem lies elsewhere - in an area of contention
> between the committing and running transactions which we knowingly and
> reluctantly added to fix a bug in

The problem emerges around 2007, not 2002, so it's not that simple.

> The number of people who can be looked at to do serious ext3/JBD work is
> pretty small now. Ted, Stephen and I got old and died. Jan does good work
> but is spread thinly.

Which is all the more reason to use a temporary fix in the meantime so
the OS is usable. I think it's pretty poor that for over a year those in
the know who need a good performing system have had to apply trivial
out-of-tree patches rejected on the basis of "eventually like maybe
whenever perhaps we'll possibly some day you know consider fixing this,
but don't hold your breath"

There is a second reason to do this: if ext4 is the future then it is far
better to fix this stuff properly in ext4 and leave ext3 clear of
extremely invasive, high-risk fixes when a quick bandaid will do just fine
for the remaining lifetime of fs/jbd.

Also note kjournald is only one of the afflicted threads - the same is
true of the crypto threads, and of the VM writeback. Also note the other
point about the disk scheduler defaults being terrible for some streaming
I/O patterns; the patch for that is also stuck in bugzilla.

If picking "noop" speeds up my generic x86 box with random onboard SATA,
we are doing something very non-optimal.

Andi Kleen

Mar 24, 2009, 8:28:16 AM
to Alan Cox, Ingo Molnar, David Rees, Jesper Krogh, Linus Torvalds, Linux Kernel Mailing List
Alan Cox <al...@lxorguk.ukuu.org.uk> writes:

>> > I have not had this problem since I applied Arjan's (for some reason
>> > repeatedly rejected) patch to change the ioprio of the various writeback
>> > daemons. Under some loads changing to the noop I/O scheduler also seems
>> > to help (as do most of the non default ones)
>>
>> (link would be useful)
>
>
> "Give kjournald a IOPRIO_CLASS_RT io priority"
>
> October 2007 (yes, it's that old)

One issue discussed back then (also for a similar XFS patch) was
that having the kernel use the RT priorities by default makes
them useless as a user override.

The proposal was to have a new priority level between normal and RT
for this, but no one implemented it.
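
(For reference, the user-level override being talked about is the
ioprio_set() syscall. A minimal sketch - the constants are copied from
include/linux/ioprio.h, and since glibc ships no wrapper it goes
through syscall():

#include <stdio.h>
#include <stdlib.h>
#include <sys/syscall.h>
#include <unistd.h>

#define IOPRIO_CLASS_SHIFT		13
#define IOPRIO_PRIO_VALUE(class, data)	(((class) << IOPRIO_CLASS_SHIFT) | (data))
#define IOPRIO_WHO_PROCESS		1
#define IOPRIO_CLASS_RT			1

int main(int argc, char **argv)
{
	/* pid 0 means the calling task, as with ionice */
	int pid = argc > 1 ? atoi(argv[1]) : 0;

	/* move the task into the RT I/O class, level 0 (highest) */
	if (syscall(SYS_ioprio_set, IOPRIO_WHO_PROCESS, pid,
		    IOPRIO_PRIO_VALUE(IOPRIO_CLASS_RT, 0)) < 0) {
		perror("ioprio_set");
		exit(1);
	}
	return 0;
}

This is what "ionice -c1 -p <pid>" does; if kjournald and the writeback
threads already sat in the RT class by default, boosting a task this way
would no longer put it ahead of them - which is the objection.)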

-Andi

--
a...@linux.intel.com -- Speaking for myself only.

Ingo Molnar

Mar 24, 2009, 9:03:02 AM
to Linus Torvalds, Herbert Xu, Frank Blaschka, David S. Miller, Thomas Gleixner, Peter Zijlstra, Linux Kernel Mailing List

Yesterday about half of my testboxes (3 out of 7) started getting
weird networking failures: their network interface just got stuck
completely - no rx and no tx at all. Restarting the interface did
not help.

The failures were highly sporadic and not reproducible - they
triggered in distcc workloads, and on random kernels and seemingly
random .config's.

After spending most of today trying to find a good reproducer (my
regular tests weren't specific enough to catch it in any bisectable
manner), i settled on 4 parallel instances of TCP traffic:

nohup ssh testbox yes &
nohup ssh testbox yes &
nohup ssh testbox yes &
nohup ssh testbox yes &

[ over gigabit, forcedeth driver. ]

If the box hung within 15 minutes, the kernel was deemed bad. Using
that method i arrived at this upstream networking fix, which was
merged yesterday:

| 303c6a0251852ecbdc5c15e466dcaff5971f7517 is first bad commit
| commit 303c6a0251852ecbdc5c15e466dcaff5971f7517
| Author: Herbert Xu <her...@gondor.apana.org.au>
| Date: Tue Mar 17 13:11:29 2009 -0700
|
| gro: Fix legacy path napi_complete crash

Applying the straight revert below cured the problem - i now have 10
million packets and 30 minutes of uptime and the box is still fine.

bisection log:

[ 10 iterations ] good: 73bc6e1: Merge branch 'linus'
[ 3 iterations ] bad: 4eac7d0: Merge branch 'irq/threaded'
[ 6.0m packets ] good: e17bbdb: Merge branch 'tracing/core'
[ 0.1m packets ] bad: 8e0ee43: Linux 2.6.29
[ 0.1m packets ] bad: e2fc4d1: dca: add missing copyright/license headers
[ 0.2m packets ] bad: 4783256: virtio_net: Make virtio_net support carrier detection
[ 0.4m packets ] bad: 4ada810: Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/kaber/nf
[ 7.0m packets ] good: ec8d540: netfilter: conntrack: fix dropping packet after l4proto->packet()
[ 4.0m packets ] good: d1238d5: netfilter: conntrack: check for NEXTHDR_NONE before header sanity checking
[ 0.1m packets ] bad: 303c6a0: gro: Fix legacy path napi_complete crash

(the first column is millions of packets tested.)

Looking at this commit also explains the asymmetric test pattern i
found amongst boxes: all boxes with a new-style NAPI driver (e1000e)
work - the others (forcedeth, 3c59x/vortex) have stuck interfaces.

I've attached the reproducer (non-SMP) .config. The system has:

00:0a.0 Bridge: nVidia Corporation CK804 Ethernet Controller (rev a3)

[ 34.722154] forcedeth: Reverse Engineered nForce ethernet driver. Version 0.62.
[ 34.729406] forcedeth 0000:00:0a.0: setting latency timer to 64
[ 34.735320] nv_probe: set workaround bit for reversed mac addr
[ 35.265783] PM: Adding info for No Bus:eth0
[ 35.270877] forcedeth 0000:00:0a.0: ifname eth0, PHY OUI 0x5043 @ 1, addr 00:13:d4:dc:41:12
[ 35.279086] forcedeth 0000:00:0a.0: highdma csum timirq gbit lnktim desc-v3
[ 35.286273] initcall init_nic+0x0/0x16 returned 0 after 550966 usecs

( but the bug does not seem to be driver specific - old-style NAPI
seems to be enough to trigger it. )

Please let me know if you need more info or if i can help with
testing a different patch. Bisecting it was hard, but testing
whether a fix patch does the trick will be a lot easier, as all
the testboxes are back in working order now.

Thanks,

Ingo

Signed-off-by: Ingo Molnar <mi...@elte.hu>
---
net/core/dev.c | 5 ++---
1 file changed, 2 insertions(+), 3 deletions(-)

Index: linux2/net/core/dev.c
===================================================================
--- linux2.orig/net/core/dev.c
+++ linux2/net/core/dev.c
@@ -2588,9 +2588,9 @@ static int process_backlog(struct napi_s
 		local_irq_disable();
 		skb = __skb_dequeue(&queue->input_pkt_queue);
 		if (!skb) {
+			__napi_complete(napi);
 			local_irq_enable();
-			napi_complete(napi);
-			goto out;
+			break;
 		}
 		local_irq_enable();
@@ -2599,7 +2599,6 @@ static int process_backlog(struct napi_s
 
 	napi_gro_flush(napi);
 
-out:
 	return work;
 }

[attachment: config]

Ingo Molnar

Mar 24, 2009, 9:13:23 AM
to Linus Torvalds, Herbert Xu, Frank Blaschka, David S. Miller, Thomas Gleixner, Peter Zijlstra, Linux Kernel Mailing List, net...@vger.kernel.org

(netdev Cc:-ed)

* Ingo Molnar <mi...@elte.hu> wrote:

> Yesterday about half of my testboxes (3 out of 7) started getting
> weird networking failures: their network interface just got stuck
> completely - no rx and no tx at all. Restarting the interface did
> not help.

> I've attached the reproducer (non-SMP) .config. The system has:

Note, the .config is randconfig-derived. There was a stage of the
tests when about every 5-10th randconfig was failing, so i don't
think it's a rare config combo that triggers this. (But there were
other stages where 30 randconfigs in a row went fine, so it's hard to
tell.)

In the worst case the hang needed 2 million packets to trigger -
that's why i set the limit in the tests to 6 million packets.

Ingo

Theodore Tso

Mar 24, 2009, 9:21:33 AM
to Ingo Molnar, Alan Cox, Arjan van de Ven, Andrew Morton, Peter Zijlstra, Nick Piggin, Jens Axboe, David Rees, Jesper Krogh, Linus Torvalds, Linux Kernel Mailing List
On Tue, Mar 24, 2009 at 11:31:11AM +0100, Ingo Molnar wrote:
> >
> > "Give kjournald a IOPRIO_CLASS_RT io priority"
> >
> > October 2007 (yes its that old)
>
> thx. A more recent submission from Arjan would be:
>
> http://lkml.org/lkml/2008/10/1/405
>
> Resolution was that Tytso indicated it went into some sort of ext4
> patch queue:
>
> | I've ported the patch to the ext4 filesystem, and dropped it into
> | the unstable portion of the ext4 patch queue.
> |
> | ext4: akpm's locking hack to fix locking delays
>
> but 6 months down the line i can find no trace of this upstream
> anywhere.

Andrew really didn't like Arjan's patch because it forces
non-synchronous writes to have a real-time I/O priority. He suggested
an alternative approach which I coded up as "akpm's locking hack to
fix locking delays"; unfortunately, it doesn't work.

In ext4, I quietly put in a mount option, journal_ioprio, and set the
default to be slightly higher than the default I/O priority (but not a
real-time class priority) to prevent the write starvation problem.
This definitely helps for some workloads (when some task is reading
enough to starve out the writes).
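
(Concretely, that's e.g. "mount -t ext4 -o journal_ioprio=3 /dev/sdXN
/mnt", with /dev/sdXN standing in for your device - the value is a level
within the best-effort class, 0 highest and 7 lowest, and the default of
3 sits one step above the normal per-task default of 4.)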

More recently (as in this past weekend), I went back to the ext3
problem, and found a better solution, here:

http://lkml.org/lkml/2009/3/21/304
http://lkml.org/lkml/2009/3/21/302
http://lkml.org/lkml/2009/3/21/303

These patches cause the synchronous writes caused by an fsync() to be
submitted using WRITE_SYNC, instead of WRITE, which definitely helps
in the case where there is a heavy read workload in the background.
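
(Mechanically the change is essentially s/WRITE/WRITE_SYNC/ on the
synchronous submission paths - e.g. submit_bh(WRITE_SYNC, bh) instead of
submit_bh(WRITE, bh) - which tags those requests as synchronous so the
I/O scheduler treats them more like reads instead of parking them behind
the async writeback queue.)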

They don't solve the problem where there is a *huge* amount of writes
going on, though --- if something is dirtying pages at a rate far
greater than the local disk can write it out, say, either "dd
if=/dev/zero of=/mnt/make-lots-of-writes" or a massive distcc cluster
driving a huge amount of data towards a single system or a wget over a
local 100 megabit ethernet from a massive NFS server where everything
is in cache, then you can have a major delay with the fsync().

However, what I've found is that if you're just doing a local copy
from one hard drive to another, or downloading a huge iso file from
an ftp server over a wide area network, the fsync() delays really
don't get *that* bad, even with ext3. At least, I haven't found a
workload that doesn't involve either dd if=/dev/zero or a massive
amount of data coming in over the network that will cause fsync()
delays in the > 1-2 second category. Ext3 has been around for a
long time, and it's only been in the last couple of years that
people have really complained about this; my theory is that it was
the rise of > 10 megabit ethernets and the use of systems like
distcc that really made this problem become visible. The only
realistic workload I've found that triggers this requires a fast
network dumping data to a local filesystem.

(I'm sure someone will be ingenious enough to find something else
though, and if they're interested, I've attached an fsync latency
tester to this note. If you find something, let me know; I'd be
interested.)

> <let-me-rant-too>
>
> The thing is ... this is a _bad_ ext3 design bug that has affected ext3
> users for the last decade or so of ext3's existence. Why is this issue
> not handled with the utmost priority, and why wasn't it fixed 5
> years ago already? :-)

OK, so there are a couple of solutions to this problem. One is to use
ext4 and delayed allocation. This solves the problem by simply not
allocating the blocks in the first place, so we don't have to force
them out to solve the security problem that data=ordered was trying to
solve. Simply mounting an ext3 filesystem using ext4, without making
any change to the filesystem format, should solve the problem.

Another is to use the mount option data=writeback. The whole reason
for forcing the writes out to disk was simply to prevent a security
problem that occurs if your system crashes before the data blocks get
forced out to disk. This could expose previously written data, which
could belong to another user, and might be his e-mail or p0rn.
Historically, this was always a problem with the BSD Fast Filesystem;
it sync'ed out data every 30 seconds, and metadata every 5 seconds.
(This is where the default ext3 commit interval of 5 seconds, and the
default /proc/sys/vm/dirty_expire_centisecs came from.) After a
system crash, it was possible for files written just before the crash
to point to blocks that had not yet been written, and which contain
some other users' data files. This was the reason for Stephen Tweedie
implementing the data=ordered mode, and making it the default.
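
(Both are trivial to try, with no format change: mounting an existing
ext3 filesystem with the ext4 code is just "mount -t ext4 /dev/sdXN
/mnt", and data=writeback is an ordinary mount/fstab option, e.g.

/dev/sdXN  /data  ext3  defaults,data=writeback  0 2

with the caveat that ext3 refuses to switch data journaling modes on a
remount, so for the root filesystem you have to pass
rootflags=data=writeback on the kernel command line instead.)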

However, these days, nearly all Linux boxes are single user machines,
so the security concern is much less of a problem. So maybe the best
solution for now is to make data=writeback the default. This solves
the problem too. The only problem with this is that there are a lot
of sloppy application writers out there, and they've gotten lazy about
using fsync() where it's necessary; combine that with Ubuntu shipping
massively unstable video drivers that crash if you breathe on the
system wrong (or exit World of Goo :-), and you've got the problem
which was recently slashdotted, and which I wrote about here:

http://thunk.org/tytso/blog/2009/03/12/delayed-allocation-and-the-zero-length-file-problem/

> It does not matter whether we have extents or htrees when there are
> _trivially reproducible_ basic usability problems with ext3.

Try ext4, I think you'll like it. :-)

Failing that, data=writeback for single-user machines is probably your
best bet.

- Ted

/*
* fsync-tester.c
*
* Written by Theodore Ts'o, 3/21/09.
*
* This file may be redistributed under the terms of the GNU Public
* License, version 2.
*/

#include <unistd.h>
#include <stdlib.h>
#include <stdio.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <sys/time.h>	/* gettimeofday() */
#include <time.h>
#include <fcntl.h>
#include <string.h>

#define SIZE (32768*32)

static float timeval_subtract(struct timeval *tv1, struct timeval *tv2)
{
	return ((tv1->tv_sec - tv2->tv_sec) +
		((float) (tv1->tv_usec - tv2->tv_usec)) / 1000000);
}

int main(int argc, char **argv)
{
	int fd;
	struct timeval tv, tv2;
	char buf[SIZE];

	/* O_CREAT requires an explicit mode argument */
	fd = open("fsync-tester.tst-file", O_WRONLY|O_CREAT, 0644);
	if (fd < 0) {
		perror("open");
		exit(1);
	}
	memset(buf, 'a', SIZE);
	while (1) {
		pwrite(fd, buf, SIZE, 0);
		gettimeofday(&tv, NULL);
		fsync(fd);
		gettimeofday(&tv2, NULL);
		printf("fsync time: %5.4f\n", timeval_subtract(&tv2, &tv));
		sleep(1);
	}
	return 0;
}
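
(To run it: "cc -o fsync-tester fsync-tester.c", start it on the
filesystem you care about, and then kick off a big streaming write - a
dd or a distcc run - in another window; it prints the measured fsync()
latency once a second.)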

Ingo Molnar

Mar 24, 2009, 9:31:00 AM
to Theodore Tso, Alan Cox, Arjan van de Ven, Andrew Morton, Peter Zijlstra, Nick Piggin, Jens Axboe, David Rees, Jesper Krogh, Linus Torvalds, Linux Kernel Mailing List

* Theodore Tso <ty...@mit.edu> wrote:

> More recently (as in this past weekend), I went back to the ext3
> problem, and found a better solution, here:
>
> http://lkml.org/lkml/2009/3/21/304
> http://lkml.org/lkml/2009/3/21/302
> http://lkml.org/lkml/2009/3/21/303
>
> These patches cause the synchronous writes caused by an fsync() to
> be submitted using WRITE_SYNC, instead of WRITE, which definitely
> helps in the case where there is a heavy read workload in the
> background.
>
> They don't solve the problem where there is a *huge* amount of
> writes going on, though --- if something is dirtying pages at a
> rate far greater than the local disk can write it out, say, either
> "dd if=/dev/zero of=/mnt/make-lots-of-writes" or a massive distcc
> cluster driving a huge amount of data towards a single system or a
> wget over a local 100 megabit ethernet from a massive NFS server
> where everything is in cache, then you can have a major delay with
> the fsync().

Nice, thanks for the update! The situation isn't nearly as bleak as i
feared :)

> However, what I've found is that if you're just doing a local copy
> from one hard drive to another, or downloading a huge iso file from
> an ftp server over a wide area network, the fsync() delays really
> don't get *that* bad, even with ext3. At least, I haven't found a
> workload that doesn't involve either dd if=/dev/zero or a massive
> amount of data coming in over the network that will cause fsync()
> delays in the > 1-2 second category. Ext3 has been around for a
> long time, and it's only been in the last couple of years that
> people have really complained about this; my theory is that it was
> the rise of > 10 megabit ethernets and the use of systems like
> distcc that really made this problem become visible. The only
> realistic workload I've found that triggers this requires a fast
> network dumping data to a local filesystem.

i think the problem became visible via the rise in memory size,
combined with the non-improvement of the performance of rotational
disks.

The disk speed versus RAM size ratio has become dramatically worse -
and our "5% of RAM" dirty ratio on a 32 GB box is 1.6 GB - which
takes an eternity to write out if you happen to sync on that. When
we had 1 GB of RAM 5% meant 51 MB - one or two seconds to flush out
- and worse than that, chances are that it's spread out widely on
the disk, the whole thing becoming seek-limited as well.
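
( To put numbers on it: 1.6 GB at a sequential ~80 MB/s is already ~20
seconds of pure writeout - and once the writeback goes seek-bound the
effective rate can easily drop by an order of magnitude, which is how
you end up in the minutes range. )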

That's where the main difference in perception of this problem comes
from i believe. The problem was always there, but only in the last
1-2 years did 4G/8G systems become really common for people to
notice.

SSDs will save us eventually, but they will take up to a decade to
trickle through for us to forget about this problem altogether.

Ingo

Herbert Xu

Mar 24, 2009, 9:36:40 AM
to Ingo Molnar, Linus Torvalds, Frank Blaschka, David S. Miller, Thomas Gleixner, Peter Zijlstra, Linux Kernel Mailing List
On Tue, Mar 24, 2009 at 02:02:02PM +0100, Ingo Molnar wrote:
>
> Yesterday about half of my testboxes (3 out of 7) started getting
> weird networking failures: their network interface just got stuck
> completely - no rx and no tx at all. Restarting the interface did
> not help.

Darn, does this patch help?

net: Fix netpoll lockup in legacy receive path

When I fixed the GRO crash in the legacy receive path I used
napi_complete to replace __napi_complete. Unfortunately they're
not the same when NETPOLL is enabled, which may result in us
not calling __napi_complete at all.

While this is fishy in itself, let's make the obvious fix right
now of reverting to the previous state where we always called
__napi_complete.

Signed-off-by: Herbert Xu <her...@gondor.apana.org.au>

diff --git a/net/core/dev.c b/net/core/dev.c
index e3fe5c7..523f53e 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -2580,24 +2580,26 @@ static int process_backlog(struct napi_struct *napi, int quota)
 	int work = 0;
 	struct softnet_data *queue = &__get_cpu_var(softnet_data);
 	unsigned long start_time = jiffies;
+	struct sk_buff *skb;
 
 	napi->weight = weight_p;
 	do {
-		struct sk_buff *skb;
-
 		local_irq_disable();
 		skb = __skb_dequeue(&queue->input_pkt_queue);
-		if (!skb) {
-			local_irq_enable();
-			napi_complete(napi);
-			goto out;
-		}
 		local_irq_enable();
+		if (!skb)
+			break;
 
 		napi_gro_receive(napi, skb);
 	} while (++work < quota && jiffies == start_time);
 
 	napi_gro_flush(napi);
+	if (skb)
+		goto out;
+
+	local_irq_disable();
+	__napi_complete(napi);
+	local_irq_enable();
 
 out:
 	return work;

Thanks,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <her...@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

Theodore Tso

Mar 24, 2009, 9:37:46 AM
to Andrew Morton, Ingo Molnar, Alan Cox, Arjan van de Ven, Peter Zijlstra, Nick Piggin, Jens Axboe, David Rees, Jesper Krogh, Linus Torvalds, Linux Kernel Mailing List
On Tue, Mar 24, 2009 at 04:12:49AM -0700, Andrew Morton wrote:
> But mainly because the problem lies elsewhere - in an area of contention
> between the committing and running transactions which we knowingly and
> reluctantly added to fix a bug in "[PATCH] fix ext3 buffer-stealing"

Well, let's be clear here. The contention between committing and
running transaction is an issue, even if we solved this problem, it
wouldn't solve the issue of fsync() taking a long time in ext3's
data=ordered mode in the case of massive write starvation caused by a
read-heavy workload, or a vast number of dirty buffers associated with
an inode which is about to be committed, and a process triggers an
fsync(). So fixing this issue wouldn't have solved the problem which
Ingo complained about (which was an editor calling fsync() leading to
long delay when saving a file during or right after a
distcc-accelerated kernel compile) or the infamous Firefox 3.0 bug.

Fixing this contention *would* fix the problem where a normal process
which is doing normal file I/O could end up getting stalled
unnecessarily, but that's not what most people are complaining about
--- and shortening the amount of time that it takes do a commit
(either with ext4's delayed allocation or ext3's data=writeback mount
option) would also address this problem. That doesn't mean that it's
not worth it to fix this particular contention, but there are multiple
issues going on here.

(Basically we're here:

http://www.kernel.org/pub/linux/kernel/people/paulmck/Confessions/FOSSElephant.html

.. in Paul Mckenney's version of parable of the blind men and the elephant:

http://www.kernel.org/pub/linux/kernel/people/paulmck/Confessions/

:-)

> Now this:
>
> > Resolution was that Tytso indicated it went into some sort of ext4
> > patch queue:
>
> was not a fix at all. It was a known-buggy hack which I proposed simply to
> remove that contention point to let us find out if we're on the right
> track. IIRC Ric was going to ask someone to do some performance testing of
> that hack, but we never heard back.

Ric did do some preliminary performance testing, and it wasn't
encouraging. It's still in the unstable portion of the ext4 patch
queue, and it's in my "wish I had more time to look at it; I don't get
to work on ext3/4 full-time" queue.

> The bottom line is that someone needs to do some serious rooting through
> the very heart of JBD transaction logic and nobody has yet put their hand
> up. If we do that, and it turns out to be just too hard to fix then yes,
> perhaps that's the time to start looking at palliative bandaids.

I disagree that they are _just_ palliative bandaids, because you need
these in order to make sure fsync() completes in a reasonable time, so
that people like Ingo don't get cranky. :-) Fixing the contention
between the running and committing transaction is a good thing, and I
hope someone puts up their hand or I magically get the time I need to
really dive into the jbd layer, but it won't help the Firefox 3.0
problem or Ingo's problem with saving files during a distcc run.

- Ted

Theodore Tso

unread,
Mar 24, 2009, 9:51:51 AM3/24/09
to Ingo Molnar, Alan Cox, Arjan van de Ven, Andrew Morton, Peter Zijlstra, Nick Piggin, Jens Axboe, David Rees, Jesper Krogh, Linus Torvalds, Linux Kernel Mailing List
On Tue, Mar 24, 2009 at 02:30:11PM +0100, Ingo Molnar wrote:
> i think the problem became visible via the rise in memory size,
> combined with the non-improvement of the performance of rotational
> disks.
>
> The disk speed versus RAM size ratio has become dramatically worse -
> and our "5% of RAM" dirty ratio on a 32 GB box is 1.6 GB - which
> takes an eternity to write out if you happen to sync on that. When
> we had 1 GB of RAM 5% meant 51 MB - one or two seconds to flush out
> - and worse than that, chances are that it's spread out widely on
> the disk, the whole thing becoming seek-limited as well.

That's definitely a problem too, but keep in mind that by default the
journal gets committed every 5 seconds, so the data gets flushed out
that often. So the question is how quickly can you *dirty* 1.6GB of
memory?

"dd if=/dev/zero of=/u1/dirty-me-harder" will certainly do it, but
normally we're doing something useful, and so you're either copying
data from local disk, at which point you're limited by the read speed
of your local disk (I suppose it could be in cache, but how common of
a case is that?), *or*, you're copying from the network, and to copy
in 1.6GB of data in 5 seconds, that means you're moving 320
megabytes/second, which if we're copying in the data from the network,
requires a 10 gigabit ethernet.

Hence my statement that this probably became much more visible with
fast ethernets --- but you're right, the huge increase in memory sizes
was also a key factor; otherwise, write throttling would have kicked
in and the VM would have started pushing the dirty pages to disk much
sooner.

- Ted
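
The rate arithmetic works out as follows (a sketch, not part of the original
mail; 5 seconds is the default commit interval mentioned above):

/* How fast must data be dirtied to accumulate 1.6 GB between commits? */
#include <stdio.h>

int main(void)
{
        double dirty_mb = 1600.0;       /* ~5% of 32 GB */
        double commit_interval_s = 5.0; /* default ext3 commit interval */

        /* 320 MB/s - 10 gigabit ethernet territory */
        printf("required ingest rate: %.0f MB/s\n",
               dirty_mb / commit_interval_s);
        return 0;
}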

Alan Cox

unread,
Mar 24, 2009, 9:54:46 AM3/24/09
to Theodore Tso, Ingo Molnar, Arjan van de Ven, Andrew Morton, Peter Zijlstra, Nick Piggin, Jens Axboe, David Rees, Jesper Krogh, Linus Torvalds, Linux Kernel Mailing List
> They don't solve the problem where there is a *huge* amount of writes
> going on, though --- if something is dirtying pages at a rate far

At very high rates other things seem to go pear shaped. I've not traced
it back far enough to be sure but what I suspect occurs from the I/O at
disk level is that two people are writing stuff out at once - presumably
the vm paging pressure and the file system - as I see two streams of I/O
that are each reasonably ordered but are interleaved.

> don't get *that* bad, even with ext3. At least, I haven't found a
> workload that doesn't involve either dd if=/dev/zero or a massive
> amount of data coming in over the network that will cause fsync()
> delays in the > 1-2 second category. Ext3 has been around for a long

I see it with a desktop when it pages hard and also when doing heavy
desktop I/O (in my case the repeatable every time case is saving large
images in the gimp - A4 at 600-1200dpi).

The other one (#8636) seems to be a bug in the I/O schedulers as it goes
away if you use a different I/O sched.

> solve. Simply mounting an ext3 filesystem using ext4, without making
> any change to the filesystem format, should solve the problem.

I will try this experiment but not with production data just yet 8)

> some other users' data files. This was the reason for Stephen Tweedie
> implementing the data=ordered mode, and making it the default.

Yes and in the server environment or for typical enterprise customers
this is a *big issue*, especially the risk of it being undetected that
they just inadvertently did something like put your medical data into the
end of something public during a crash.

> Try ext4, I think you'll like it. :-)

I need to, so that I can double check none of the open jbd locking bugs
are there and close more bugzilla entries (#8147)

Thanks for the reply - I hadn't realised a lot of this was getting fixed
but in ext4 and quietly

Alan

Ingo Molnar

unread,
Mar 24, 2009, 10:16:40 AM3/24/09
to Herbert Xu, Linus Torvalds, Frank Blaschka, David S. Miller, Thomas Gleixner, Peter Zijlstra, Linux Kernel Mailing List

* Herbert Xu <her...@gondor.apana.org.au> wrote:

> On Tue, Mar 24, 2009 at 02:02:02PM +0100, Ingo Molnar wrote:
> >
> > Yesterday about half of my testboxes (3 out of 7) started getting
> > weird networking failures: their network interface just got stuck
> > completely - no rx and no tx at all. Restarting the interface did
> > not help.
>
> Darn, does this patch help?
>
> net: Fix netpoll lockup in legacy receive path

thanks, added it to the test mix. Should know the result in 1-2
hours of test time.

Ingo

Theodore Tso

unread,
Mar 24, 2009, 10:29:41 AM3/24/09
to Alan Cox, Ingo Molnar, Arjan van de Ven, Andrew Morton, Peter Zijlstra, Nick Piggin, Jens Axboe, David Rees, Jesper Krogh, Linus Torvalds, Linux Kernel Mailing List
On Tue, Mar 24, 2009 at 01:52:49PM +0000, Alan Cox wrote:
>
> At very high rates other things seem to go pear shaped. I've not traced
> it back far enough to be sure but what I suspect occurs from the I/O at
> disk level is that two people are writing stuff out at once - presumably
> the vm paging pressure and the file system - as I see two streams of I/O
> that are each reasonably ordered but are interleaved.

Surely the elevator should have reordered the writes reasonably? (Or
is that what you meant by "the other one -- #8636 (I assume this is a
kernel Bugzilla #?) seems to be a bug in the I/O schedulers as it goes
away if you use a different I/O sched.?")

> > don't get *that* bad, even with ext3. At least, I haven't found a
> > workload that doesn't involve either dd if=/dev/zero or a massive
> > amount of data coming in over the network that will cause fsync()
> > delays in the > 1-2 second category. Ext3 has been around for a long
>
> I see it with a desktop when it pages hard and also when doing heavy
> desktop I/O (in my case the repeatable every time case is saving large
> images in the gimp - A4 at 600-1200dpi).

Yeah, I could see that doing it. How big is the image, and out of
curiosity, can you run the fsync-tester.c program I posted while
saving the gimp image, and tell me how much of a delay you end up
seeing?

> > solve. Simply mounting an ext3 filesystem using ext4, without making
> > any change to the filesystem format, should solve the problem.
>
> I will try this experiment but not with production data just yet 8)

Where's your bravery, man? :-)

I've been using it on my laptop since July, and haven't lost
significant amounts of data yet. (The only thing I did lose was bits
of a git repository fairly early on, and I was able to repair by
replacing the missing objects.)

> > some other users' data files. This was the reason for Stephen Tweedie
> > implementing the data=ordered mode, and making it the default.
>
> Yes and in the server environment or for typical enterprise customers
> this is a *big issue*, especially the risk of it being undetected that
> they just inadvertently did something like put your medical data into the
> end of something public during a crash.

True enough; changing the defaults to be data=writeback for the server
environment is probably not a good idea. (Then again, in the server
environment most of the workloads generally don't end up hitting the
nasty data=ordered failure modes; they tend to be
transaction-oriented, and use fsync().)

> > Try ext4, I think you'll like it. :-)
>
> I need to, so that I can double check none of the open jbd locking bugs
> are there and close more bugzilla entries (#8147)

More testing would be appreciated --- and yeah, we need to groom the
bugzilla. For a long time no one in ext3 land was paying attention to
bugzilla, and more recently I've been trying to keep up with the
ext4-related bugs, but I don't get to do ext4 work full-time, and
occasionally Stacey gets annoyed at me when I work late into the night...

> Thanks for the reply - I hadn't realised a lot of this was getting fixed
> but in ext4 and quietly

Yeah, there are a bunch of things, like the barrier=1 default, which
akpm has rejected for ext3, but which we've fixed in ext4. More help
in shaking down the bugs would definitely be appreciated.

- Ted

Robert Schwebel

unread,
Mar 24, 2009, 10:35:22 AM3/24/09
to Ingo Molnar, Linus Torvalds, Herbert Xu, Frank Blaschka, David S. Miller, Thomas Gleixner, Peter Zijlstra, Linux Kernel Mailing List, ker...@pengutronix.de
On Tue, Mar 24, 2009 at 02:02:02PM +0100, Ingo Molnar wrote:
> If the box hung within 15 minutes, the kernel was deemed bad. Using
> that method i arrived to this upstream networking fix which was
> merged yesterday:
>
> | 303c6a0251852ecbdc5c15e466dcaff5971f7517 is first bad commit
> | commit 303c6a0251852ecbdc5c15e466dcaff5971f7517
> | Author: Herbert Xu <her...@gondor.apana.org.au>
> | Date: Tue Mar 17 13:11:29 2009 -0700
> |
> | gro: Fix legacy path napi_complete crash

This commit breaks nfsroot booting on i.MX27 and other ARM boxes with
different network cards here in a reproducible way.

rsc
--
Pengutronix e.K. | |
Industrial Linux Solutions | http://www.pengutronix.de/ |
Peiner Str. 6-8, 31137 Hildesheim, Germany | Phone: +49-5121-206917-0 |
Amtsgericht Hildesheim, HRA 2686 | Fax: +49-5121-206917-5555 |

Ingo Molnar

unread,
Mar 24, 2009, 10:40:28 AM3/24/09
to Robert Schwebel, Linus Torvalds, Herbert Xu, Frank Blaschka, David S. Miller, Thomas Gleixner, Peter Zijlstra, Linux Kernel Mailing List, ker...@pengutronix.de

* Robert Schwebel <r.sch...@pengutronix.de> wrote:

> On Tue, Mar 24, 2009 at 02:02:02PM +0100, Ingo Molnar wrote:
> > If the box hung within 15 minutes, the kernel was deemed bad. Using
> > that method i arrived to this upstream networking fix which was
> > merged yesterday:
> >
> > | 303c6a0251852ecbdc5c15e466dcaff5971f7517 is first bad commit
> > | commit 303c6a0251852ecbdc5c15e466dcaff5971f7517
> > | Author: Herbert Xu <her...@gondor.apana.org.au>
> > | Date: Tue Mar 17 13:11:29 2009 -0700
> > |
> > | gro: Fix legacy path napi_complete crash
>
> This commit breaks nfsroot booting on i.MX27 and other ARM boxes
> with different network cards here in a reproducible way.

Can you confirm that Herbert's fix (see it below) solves the
problem?

Ingo

--------------->
From b8b66ac07cab1b45aac93e4f406833a1e0d7677e Mon Sep 17 00:00:00 2001
From: Herbert Xu <her...@gondor.apana.org.au>
Date: Tue, 24 Mar 2009 21:35:42 +0800
Subject: [PATCH] net: Fix netpoll lockup in legacy receive path

When I fixed the GRO crash in the legacy receive path I used
napi_complete to replace __napi_complete. Unfortunately they're
not the same when NETPOLL is enabled, which may result in us
not calling __napi_complete at all.

While this is fishy in itself, let's make the obvious fix right
now of reverting to the previous state where we always called
__napi_complete.

Signed-off-by: Herbert Xu <her...@gondor.apana.org.au>
Cc: Linus Torvalds <torv...@linux-foundation.org>
Cc: Frank Blaschka <blas...@linux.vnet.ibm.com>
Cc: "David S. Miller" <da...@davemloft.net>
Cc: Peter Zijlstra <a.p.zi...@chello.nl>
LKML-Reference: <20090324133...@gondor.apana.org.au>


Signed-off-by: Ingo Molnar <mi...@elte.hu>
---

net/core/dev.c | 16 +++++++++-------
1 files changed, 9 insertions(+), 7 deletions(-)

diff --git a/net/core/dev.c b/net/core/dev.c
index e3fe5c7..523f53e 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -2580,24 +2580,26 @@ static int process_backlog(struct napi_struct *napi, int quota)
 	int work = 0;
 	struct softnet_data *queue = &__get_cpu_var(softnet_data);
 	unsigned long start_time = jiffies;
+	struct sk_buff *skb;
 
 	napi->weight = weight_p;
 	do {
-		struct sk_buff *skb;
-
 		local_irq_disable();
 		skb = __skb_dequeue(&queue->input_pkt_queue);
-		if (!skb) {
-			local_irq_enable();
-			napi_complete(napi);
-			goto out;
-		}
 		local_irq_enable();
+		if (!skb)
+			break;
 
 		napi_gro_receive(napi, skb);
 	} while (++work < quota && jiffies == start_time);
 
 	napi_gro_flush(napi);
+	if (skb)
+		goto out;
+
+	local_irq_disable();
+	__napi_complete(napi);
+	local_irq_enable();
 
 out:
	return work;

Herbert Xu

unread,
Mar 24, 2009, 11:10:33 AM3/24/09
to Ingo Molnar, Robert Schwebel, Linus Torvalds, Frank Blaschka, David S. Miller, Thomas Gleixner, Peter Zijlstra, Linux Kernel Mailing List, ker...@pengutronix.de
On Tue, Mar 24, 2009 at 03:39:42PM +0100, Ingo Molnar wrote:
>
> Subject: [PATCH] net: Fix netpoll lockup in legacy receive path

Actually, this patch is still racy. If some interrupt comes in
and we suddenly get the maximum amount of backlog we can still
hang when we call __napi_complete incorrectly. It's unlikely
but we certainly shouldn't allow that. Here's a better version.

net: Fix netpoll lockup in legacy receive path

When I fixed the GRO crash in the legacy receive path I used
napi_complete to replace __napi_complete. Unfortunately they're
not the same when NETPOLL is enabled, which may result in us
not calling __napi_complete at all.

What's more, we really do need to keep the __napi_complete call
within the IRQ-off section since in theory an IRQ can occur in
between and fill up the backlog to the maximum, causing us to
lock up.

This patch fixes this by essentially open-coding __napi_complete.

Note we no longer need the memory barrier because this function
is per-cpu.

Signed-off-by: Herbert Xu <her...@gondor.apana.org.au>

diff --git a/net/core/dev.c b/net/core/dev.c
index e3fe5c7..2a7f6b3 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -2588,9 +2588,10 @@ static int process_backlog(struct napi_struct *napi, int quota)
 		local_irq_disable();
 		skb = __skb_dequeue(&queue->input_pkt_queue);
 		if (!skb) {
+			list_del(&napi->poll_list);
+			clear_bit(NAPI_STATE_SCHED, &napi->state);
 			local_irq_enable();
-			napi_complete(napi);
-			goto out;
+			break;
 		}
 		local_irq_enable();
 
@@ -2599,7 +2600,6 @@ static int process_backlog(struct napi_struct *napi, int quota)
 
 	napi_gro_flush(napi);
 
-out:
	return work;
 }
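
The window Herbert is closing can be sketched as an interleaving (an
illustration of the failure mode as described, not code from the thread):

/*
 * Hypothetical interleaving on one CPU's legacy backlog queue, showing
 * why the completion must happen with IRQs still disabled:
 *
 *   process_backlog()              device IRQ -> netif_rx()
 *   -----------------              ------------------------
 *   __skb_dequeue() == NULL
 *   local_irq_enable();
 *                                  packets arrive; the queue goes
 *                                  non-empty, but NAPI_STATE_SCHED is
 *                                  still set, so netif_rx does not
 *                                  reschedule the backlog NAPI
 *   napi_complete();  <- clears SCHED, unlinks from the poll list
 *
 * Now the queue holds packets but nothing will ever poll it again,
 * since netif_rx only schedules the backlog NAPI when it finds the
 * queue empty. Doing the list_del() + clear_bit() before
 * local_irq_enable(), as in the patch above, closes the window.
 */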

Alan Cox

unread,
Mar 24, 2009, 11:19:54 AM3/24/09
to Theodore Tso, Ingo Molnar, Arjan van de Ven, Andrew Morton, Peter Zijlstra, Nick Piggin, Jens Axboe, David Rees, Jesper Krogh, Linus Torvalds, Linux Kernel Mailing List
> Surely the elevator should have reordered the writes reasonably? (Or
> is that what you meant by "the other one -- #8636 (I assume this is a
> kernel Bugzilla #?) seems to be a bug in the I/O schedulers as it goes
> away if you use a different I/O sched.?")

There are two cases there. One is a bug #8636 (kernel bugzilla) which is
where things like dump show awful performance with certain I/O scheduler
settings. That seems to be totally not connected to the fs but it is a
problem (and has a patch)

The second one the elevator is clearly trying to sort out, but it's
behaving as if someone is writing the file starting at say 0 and someone
else is trying to write it back starting some large distance further down
the file. The elevator can only do so much then.

> Yeah, I could see that doing it. How big is the image, and out of
> curiosity, can you run the fsync-tester.c program I posted while

150MB+ for the pnm files from gimp used as temporaries by Eve (Etch
Validation Engine), more like 10MB for xcf/tif files.

> saving the gimp image, and tell me how much of a delay you end up
> seeing?

Added to the TODO list once I can set up a suitable test box (my new dev
box is somewhere between Dell and my desk right now)

> More testing would be appreciated --- and yeah, we need to groom the
> bugzilla.

I'm currently doing this on a large scale (closed about 300 so far this
run). Bug 8147 might be worth a look as it's a case where the jbd locking
and the jbd comments seem to disagree (the comments say you must hold a
lock but we don't seem to)

Sascha Hauer

unread,
Mar 24, 2009, 11:23:40 AM3/24/09
to Ingo Molnar, Robert Schwebel, Linus Torvalds, Herbert Xu, Frank Blaschka, David S. Miller, Thomas Gleixner, Peter Zijlstra, Linux Kernel Mailing List, ker...@pengutronix.de
Hi Ingo,

On Tue, Mar 24, 2009 at 03:39:42PM +0100, Ingo Molnar wrote:
>

> * Robert Schwebel <r.sch...@pengutronix.de> wrote:
>
> > On Tue, Mar 24, 2009 at 02:02:02PM +0100, Ingo Molnar wrote:
> > > If the box hung within 15 minutes, the kernel was deemed bad. Using
> > > that method i arrived to this upstream networking fix which was
> > > merged yesterday:
> > >
> > > | 303c6a0251852ecbdc5c15e466dcaff5971f7517 is first bad commit
> > > | commit 303c6a0251852ecbdc5c15e466dcaff5971f7517
> > > | Author: Herbert Xu <her...@gondor.apana.org.au>
> > > | Date: Tue Mar 17 13:11:29 2009 -0700
> > > |
> > > | gro: Fix legacy path napi_complete crash
> >
> > This commit breaks nfsroot booting on i.MX27 and other ARM boxes
> > with different network cards here in a reproducible way.
>
> Can you confirm that Herbert's fix (see it below) solves the
> problem?

No, still doesn't work.

It seems to have something to do with enabling interrupts between
__skb_dequeue() and __napi_complete().

I reverted 303c6a0251852ecbdc5c15e466dcaff5971f7517 and added a

local_irq_enable(); local_irq_disable();

right before __napi_complete() and this already breaks networking.


Sascha

--

Pengutronix e.K. | |
Industrial Linux Solutions | http://www.pengutronix.de/ |
Peiner Str. 6-8, 31137 Hildesheim, Germany | Phone: +49-5121-206917-0 |
Amtsgericht Hildesheim, HRA 2686 | Fax: +49-5121-206917-5555 |
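
As a sketch, the experiment Sascha describes amounts to something like this
in the pre-303c6a0 loop (reconstructed for illustration, not a verbatim
patch from the thread):

        local_irq_disable();
        skb = __skb_dequeue(&queue->input_pkt_queue);
        if (!skb) {
                local_irq_enable();     /* experiment: open an IRQ window */
                local_irq_disable();    /* ...where the old code had none */
                __napi_complete(napi);
                local_irq_enable();
                break;
        }
        local_irq_enable();

That the added window alone is enough to wedge the interface points at the
same dequeue-vs-complete race discussed elsewhere in this thread.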

Sascha Hauer

unread,
Mar 24, 2009, 11:31:07 AM3/24/09
to Herbert Xu, Ingo Molnar, Robert Schwebel, Linus Torvalds, Frank Blaschka, David S. Miller, Thomas Gleixner, Peter Zijlstra, Linux Kernel Mailing List, ker...@pengutronix.de
On Tue, Mar 24, 2009 at 11:09:28PM +0800, Herbert Xu wrote:
> On Tue, Mar 24, 2009 at 03:39:42PM +0100, Ingo Molnar wrote:
> >
> > Subject: [PATCH] net: Fix netpoll lockup in legacy receive path
>
> Actually, this patch is still racy. If some interrupt comes in
> and we suddenly get the maximum amount of backlog we can still
> hang when we call __napi_complete incorrectly. It's unlikely
> but we certainly shouldn't allow that. Here's a better version.

Yes, this one works. I always wanted to give a

Tested-by: Sascha Hauer <s.h...@pengutronix.de>

Thanks
Sascha

--

Pengutronix e.K. | |
Industrial Linux Solutions | http://www.pengutronix.de/ |
Peiner Str. 6-8, 31137 Hildesheim, Germany | Phone: +49-5121-206917-0 |
Amtsgericht Hildesheim, HRA 2686 | Fax: +49-5121-206917-5555 |

Ingo Molnar

unread,
Mar 24, 2009, 11:36:59 AM3/24/09
to Herbert Xu, Robert Schwebel, Linus Torvalds, Frank Blaschka, David S. Miller, Thomas Gleixner, Peter Zijlstra, Linux Kernel Mailing List, ker...@pengutronix.de

* Herbert Xu <her...@gondor.apana.org.au> wrote:

> On Tue, Mar 24, 2009 at 03:39:42PM +0100, Ingo Molnar wrote:
> >
> > Subject: [PATCH] net: Fix netpoll lockup in legacy receive path
>
> Actually, this patch is still racy. If some interrupt comes in
> and we suddenly get the maximum amount of backlog we can still
> hang when we call __napi_complete incorrectly. It's unlikely
> but we certainly shouldn't allow that. Here's a better version.
>
> net: Fix netpoll lockup in legacy receive path

ok - i'm testing with this now.

Ingo

Ingo Molnar

unread,
Mar 24, 2009, 11:47:54 AM3/24/09
to Herbert Xu, Robert Schwebel, Linus Torvalds, Frank Blaschka, David S. Miller, Thomas Gleixner, Peter Zijlstra, Linux Kernel Mailing List, ker...@pengutronix.de

* Ingo Molnar <mi...@elte.hu> wrote:

>
> * Herbert Xu <her...@gondor.apana.org.au> wrote:
>
> > On Tue, Mar 24, 2009 at 03:39:42PM +0100, Ingo Molnar wrote:
> > >
> > > Subject: [PATCH] net: Fix netpoll lockup in legacy receive path
> >
> > Actually, this patch is still racy. If some interrupt comes in
> > and we suddenly get the maximum amount of backlog we can still
> > hang when we call __napi_complete incorrectly. It's unlikely
> > but we certainly shouldn't allow that. Here's a better version.
> >
> > net: Fix netpoll lockup in legacy receive path
>
> ok - i'm testing with this now.

test failure on one of the boxes, interface got stuck after ~100K
packets:

eth1 Link encap:Ethernet HWaddr 00:13:D4:DC:41:12
inet addr:10.0.1.13 Bcast:10.0.1.255 Mask:255.255.255.0
inet6 addr: fe80::213:d4ff:fedc:4112/64 Scope:Link
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
RX packets:22555 errors:0 dropped:0 overruns:0 frame:0
TX packets:1897 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:2435071 (2.3 MiB) TX bytes:503790 (491.9 KiB)
Interrupt:11 Base address:0x4000

i'm going back to your previous version for now - it might still be
racy but it worked well for about 1.5 hours of test-time.

Herbert Xu

unread,
Mar 24, 2009, 12:00:00 PM3/24/09
to Ingo Molnar, Robert Schwebel, Linus Torvalds, Frank Blaschka, David S. Miller, Thomas Gleixner, Peter Zijlstra, Linux Kernel Mailing List, ker...@pengutronix.de
On Tue, Mar 24, 2009 at 04:47:17PM +0100, Ingo Molnar wrote:
>
> test failure on one of the boxes, interface got stuck after ~100K
> packets:
>
> eth1 Link encap:Ethernet HWaddr 00:13:D4:DC:41:12
> inet addr:10.0.1.13 Bcast:10.0.1.255 Mask:255.255.255.0
> inet6 addr: fe80::213:d4ff:fedc:4112/64 Scope:Link
> UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
> RX packets:22555 errors:0 dropped:0 overruns:0 frame:0
> TX packets:1897 errors:0 dropped:0 overruns:0 carrier:0
> collisions:0 txqueuelen:1000
> RX bytes:2435071 (2.3 MiB) TX bytes:503790 (491.9 KiB)
> Interrupt:11 Base address:0x4000

What's the NIC and config on this one? If it's still using the
legacy/netif_rx path, where GRO is off by default, this patch
should make it exactly the same as with my original patch reverted.

Cheers,

Ingo Molnar

unread,
Mar 24, 2009, 12:05:51 PM3/24/09
to Herbert Xu, Robert Schwebel, Linus Torvalds, Frank Blaschka, David S. Miller, Thomas Gleixner, Peter Zijlstra, Linux Kernel Mailing List, ker...@pengutronix.de

* Herbert Xu <her...@gondor.apana.org.au> wrote:

> On Tue, Mar 24, 2009 at 04:47:17PM +0100, Ingo Molnar wrote:
> >
> > test failure on one of the boxes, interface got stuck after ~100K
> > packets:
> >
> > eth1 Link encap:Ethernet HWaddr 00:13:D4:DC:41:12
> > inet addr:10.0.1.13 Bcast:10.0.1.255 Mask:255.255.255.0
> > inet6 addr: fe80::213:d4ff:fedc:4112/64 Scope:Link
> > UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
> > RX packets:22555 errors:0 dropped:0 overruns:0 frame:0
> > TX packets:1897 errors:0 dropped:0 overruns:0 carrier:0
> > collisions:0 txqueuelen:1000
> > RX bytes:2435071 (2.3 MiB) TX bytes:503790 (491.9 KiB)
> > Interrupt:11 Base address:0x4000
>
> What's the NIC and config on this one? If it's still using the
> legacy/netif_rx path, where GRO is off by default, this patch
> should make it exactly the same as with my original patch
> reverted.

Same forcedeth box i reported before. Config below. (note: if you
want to use it you need to run it through 'make oldconfig', with all
defaults accepted)

Ingo

config-Tue_Mar_24_15_24_33_CET_2009.bad

Jesper Krogh

unread,
Mar 24, 2009, 12:35:32 PM3/24/09
to Theodore Tso, Ingo Molnar, Alan Cox, Arjan van de Ven, Andrew Morton, Peter Zijlstra, Nick Piggin, Jens Axboe, David Rees, Jesper Krogh, Linus Torvalds, Linux Kernel Mailing List
Theodore Tso wrote:
> On Tue, Mar 24, 2009 at 02:30:11PM +0100, Ingo Molnar wrote:
>> i think the problem became visible via the rise in memory size,
>> combined with the non-improvement of the performance of rotational
>> disks.
>>
>> The disk speed versus RAM size ratio has become dramatically worse -
>> and our "5% of RAM" dirty ratio on a 32 GB box is 1.6 GB - which
>> takes an eternity to write out if you happen to sync on that. When
>> we had 1 GB of RAM 5% meant 51 MB - one or two seconds to flush out
>> - and worse than that, chances are that it's spread out widely on
>> the disk, the whole thing becoming seek-limited as well.
>
> That's definitely a problem too, but keep in mind that by default the
> journal gets committed every 5 seconds, so the data gets flushed out
> that often. So the question is how quickly can you *dirty* 1.6GB of
> memory?

Say it's a file that you already have in memory cache read in.. there
is plenty of space in 16GB for that.. then you can dirty it at
memory-speed.. that's about ½ sec. (correct me if I'm wrong).

Ok, this is probably unrealistic, but memory grows - the largest we have
at the moment is 32GB and it's steadily growing with the core-counts.
Then the available memory is used to cache the "active" portion of the
filesystems. I would even say that on the NFS-servers I depend on it to
do this efficiently. (2.6.29-rc8 delivered 1050MB/s over a 10GbitE
using nfsd - send speed to multiple clients).

The current workload is based on an active dataset of 600GB where
index'es are being generated and written back to the same disk. So
there is a fairly high read/write load on the machine (as you said was
required). The majority (perhaps 550GB) is only read once; the rest of
the time it is stuff in the last 50GB being rewritten.

> "dd if=/dev/zero of=/u1/dirty-me-harder" will certainly do it, but
> normally we're doing something useful, and so you're either copying
> data from local disk, at which point you're limited by the read speed
> of your local disk (I suppose it could be in cache, but how common of
> a case is that?),

Increasingly the case as memory sizes grow.

> *or*, you're copying from the network, and to copy
> in 1.6GB of data in 5 seconds, that means you're moving 320
> megabytes/second, which if we're copying in the data from the network,
> requires a 10 gigabit ethernet.

or it's just data being processed on the 16-32 cores on the system.


Jesper
--
Jesper

Linus Torvalds

unread,
Mar 24, 2009, 1:40:02 PM3/24/09
to Jesper Krogh, Theodore Tso, Ingo Molnar, Alan Cox, Arjan van de Ven, Andrew Morton, Peter Zijlstra, Nick Piggin, Jens Axboe, David Rees, Linux Kernel Mailing List

On Tue, 24 Mar 2009, Jesper Krogh wrote:


>
> Theodore Tso wrote:
> > That's definitely a problem too, but keep in mind that by default the
> > journal gets committed every 5 seconds, so the data gets flushed out
> > that often. So the question is how quickly can you *dirty* 1.6GB of
> > memory?

Doesn't at least ext4 default to the _insane_ model of "data is less
important than meta-data, and it doesn't get journalled"?

And ext3 with "data=writeback" does the same, no?

Both of which are - as far as I can tell - total braindamage. At least
with ext3 it's not the _default_ mode.

I never understood how anybody doing filesystems (especially ones that
claim to be crash-resistant due to journalling) would _ever_ accept the
"writeback" behavior of having "clean fsck, but data loss".

> Say it's a file that you already have in memory cache read in.. there
> is plenty of space in 16GB for that.. then you can dirty it at memory-speed..
> that's about ½ sec. (correct me if I'm wrong).

No, you'll still have to get per-page locks etc. If you use mmap(), you'll
page-fault on each page, if you use write() you'll do all the page lookups
etc. But yes, it can be pretty quick - the biggest cost probably _will_ be
the speed of memory itself (doing one-byte writes at each block would
change that, and the bottle-neck would become the system call and page
lookup/locking path, but it's probably in the same rough cost range as
the cost of writing out one page).

That said, this is all why we now have 'dirty_*bytes' limits too.

The problem is that the dirty_[background_]bytes value really should be
scaled up by the speed of IO. And we currently have no way to do that.
Some machines can write a gigabyte in a second with some fancy RAID
setups. Others will take minutes (or hours) to do that (crappy SSD's that
get 25kB/s throughput on random writes).

The "dirty_[background_]ratio" percentage doesn't scale up by the speed of
IO either, of course, but at least historically there was generally a
pretty good correlation between amount of memory and speed of IO. The
machines that had gigs and gigs of RAM tended to always have fast IO too.
So scaling up dirty limits by memory size made sense both in the "we have
tons of memory, so allow tons of it to be dirty" sense _and_ in the "we
likely have a fast disk, so allow more pending dirty data".

Linus
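
One way to read "scaled up by the speed of IO" is a userspace tuning sketch
along these lines (entirely hypothetical - the measurement method and the
2-second target are assumptions, not a proposal from the thread;
vm.dirty_bytes is the 2.6.29 knob mentioned elsewhere in this discussion):

/* Derive a dirty_bytes value from measured write throughput and a
 * target worst-case flush latency. Illustrative only. */
#include <stdio.h>

int main(void)
{
        double measured_mb_s = 60.0;    /* e.g. from a timed streaming write */
        double target_flush_s = 2.0;    /* tolerable sync stall */
        unsigned long long dirty_bytes =
                (unsigned long long)(measured_mb_s * target_flush_s
                                     * 1024 * 1024);

        /* value would go into /proc/sys/vm/dirty_bytes */
        printf("vm.dirty_bytes = %llu\n", dirty_bytes);
        return 0;
}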

Jan Kara

unread,
Mar 24, 2009, 1:58:26 PM3/24/09
to Alan Cox, Theodore Tso, Ingo Molnar, Arjan van de Ven, Andrew Morton, Peter Zijlstra, Nick Piggin, Jens Axboe, David Rees, Jesper Krogh, Linus Torvalds, Linux Kernel Mailing List
> > They don't solve the problem where there is a *huge* amount of writes
> > going on, though --- if something is dirtying pages at a rate far
>
> At very high rates other things seem to go pear shaped. I've not traced
> it back far enough to be sure but what I suspect occurs from the I/O at
> disk level is that two people are writing stuff out at once - presumably
> the vm paging pressure and the file system - as I see two streams of I/O
> that are each reasonably ordered but are interleaved.
There are different problems leading to this:
1) JBD commit code writes ordered data on each transaction commit. This
is done in dirtied-time order which is not necessarily optimal in case
of random access IO. IO scheduler helps here though because we submit a
lot of IO at once. ext4 has at least the randomness part of this problem
"fixed" because it submits ordered data via writepages(). Doing this
change requires non-trivial changes to the journaling layer so I wasn't
brave enough to do it with ext3 and JBD as well (although porting the
patch is trivial).

2) When we do dirty throttling, there are going to be several threads
writing out on the filesystem (if you have more pdflush threads which
translates to having more than one CPU). Jens' per-BDI writeback
threads could help here (but I haven't yet got to reading his patches in
detail to be sure).

These two problems together result in a non-optimal IO pattern. At least
that's where I got to when I was looking into why Berkeley DB is so
slow. I was trying to somehow serialize more pdflush threads on the
filesystem but a stupid solution does not really help much - either I
was starving some throttled thread by other threads doing writeback or
I didn't quite keep the disk busy. So something like Jens' approach
is probably the way to go in the end.

> > don't get *that* bad, even with ext3. At least, I haven't found a
> > workload that doesn't involve either dd if=/dev/zero or a massive
> > amount of data coming in over the network that will cause fsync()
> > delays in the > 1-2 second category. Ext3 has been around for a long
>
> I see it with a desktop when it pages hard and also when doing heavy
> desktop I/O (in my case the repeatable every time case is saving large
> images in the gimp - A4 at 600-1200dpi).
>
> The other one (#8636) seems to be a bug in the I/O schedulers as it goes
> away if you use a different I/O sched.
>
> > solve. Simply mounting an ext3 filesystem using ext4, without making
> > any change to the filesystem format, should solve the problem.
>
> I will try this experiment but not with production data just yet 8)
>
> > some other users' data files. This was the reason for Stephen Tweedie
> > implementing the data=ordered mode, and making it the default.
>
> Yes and in the server environment or for typical enterprise customers
> this is a *big issue*, especially the risk of it being undetected that
> they just inadvertently did something like put your medical data into the
> end of something public during a crash.
>
> > Try ext4, I think you'll like it. :-)
>
> I need to, so that I can double check none of the open jbd locking bugs
> are there and close more bugzilla entries (#8147)

This one is still there. I'll have a look at it tomorrow and hopefully
will be able to answer...

Honza

--
Jan Kara <ja...@suse.cz>
SuSE CR Labs

Linus Torvalds

unread,
Mar 24, 2009, 2:03:18 PM3/24/09
to Theodore Tso, Ingo Molnar, Alan Cox, Arjan van de Ven, Andrew Morton, Peter Zijlstra, Nick Piggin, Jens Axboe, David Rees, Jesper Krogh, Linux Kernel Mailing List

On Tue, 24 Mar 2009, Theodore Tso wrote:
>
> Try ext4, I think you'll like it. :-)
>
> Failing that, data=writeback for single-user machines is probably your
> best bet.

Isn't that the same fix? ext4 just defaults to the crappy "writeback"
behavior, which is insane.

Sure, it makes things _much_ smoother, since now the actual data is no
longer in the critical path for any journal writes, but anybody who thinks
that's a solution is just incompetent.

We might as well go back to ext2 then. If your data gets written out long
after the metadata hit the disk, you are going to hit all kinds of bad
issues if the machine ever goes down.

Linus

Mark Lord

unread,
Mar 24, 2009, 2:21:31 PM3/24/09
to Theodore Tso, Ingo Molnar, Alan Cox, Arjan van de Ven, Andrew Morton, Peter Zijlstra, Nick Piggin, Jens Axboe, David Rees, Jesper Krogh, Linus Torvalds, Linux Kernel Mailing List
Theodore Tso wrote:
> So the question is how quickly can you *dirty* 1.6GB of memory?
.

MythTV: rm /some/really/huge/video/file ; sync
## disk light stays on for several minutes..

Not quite the same thing, I suppose, but it does break
the shutdown scripts of every major Linux distribution.

Simple solution for MythTV is what people already do: use xfs instead.

Eric Sandeen

unread,
Mar 24, 2009, 2:44:33 PM3/24/09
to Mark Lord, Theodore Tso, Ingo Molnar, Alan Cox, Arjan van de Ven, Andrew Morton, Peter Zijlstra, Nick Piggin, Jens Axboe, David Rees, Jesper Krogh, Linus Torvalds, Linux Kernel Mailing List
Mark Lord wrote:
> Theodore Tso wrote:
>> So the question is how quickly can you *dirty* 1.6GB of memory?
> ..

>
> MythTV: rm /some/really/huge/video/file ; sync
> ## disk light stays on for several minutes..
>
> Not quite the same thing, I suppose, but it does break
> the shutdown scripts of every major Linux distribution.

It is indeed a different issue. ext3 does a fair bit of IO on a delete
(here a 60G file):

http://people.redhat.com/~esandeen/rm_test/ext3_rm.png

ext4 is much better:

http://people.redhat.com/~esandeen/rm_test/ext4_rm.png

> Simple solution for MythTV is what people already do: use xfs instead.

and yes, xfs does it very quickly:

http://people.redhat.com/~esandeen/rm_test/xfs_rm.png

-Eric

Theodore Tso

unread,
Mar 24, 2009, 2:46:55 PM3/24/09
to Linus Torvalds, Ingo Molnar, Alan Cox, Arjan van de Ven, Andrew Morton, Peter Zijlstra, Nick Piggin, Jens Axboe, David Rees, Jesper Krogh, Linux Kernel Mailing List
On Tue, Mar 24, 2009 at 10:55:40AM -0700, Linus Torvalds wrote:
>
>
> On Tue, 24 Mar 2009, Theodore Tso wrote:
> >
> > Try ext4, I think you'll like it. :-)
> >
> > Failing that, data=writeback for single-user machines is probably your
> > best bet.
>
> Isn't that the same fix? ext4 just defaults to the crappy "writeback"
> behavior, which is insane.

Technically, it's not data=writeback. It's more like XFS's delayed
allocation; I've added workarounds so that files which are
replaced via truncate or rename get pushed out right away, which
should solve most of the problems involved with files becoming
zero-length after a system crash.

> Sure, it makes things _much_ smoother, since now the actual data is no
> longer in the critical path for any journal writes, but anybody who thinks
> that's a solution is just incompetent.
>
> We might as well go back to ext2 then. If your data gets written out long
> after the metadata hit the disk, you are going to hit all kinds of bad
> issues if the machine ever goes down.

With ext2 after a system crash you need to run fsck. With ext4, fsck
isn't an issue, but if the application doesn't use fsync(), yes,
there's no guarantee (other than the workarounds for
replace-via-truncate and replace-via-rename), but there's plenty of
prior history that says that applications that care about data hitting
the disk should use fsync(). Otherwise, it will get spread out over a
few minutes; and for some files, that really won't make a difference.

For precious files, applications that use fsync() will be safe ---
otherwise, even with ext3, you can end up losing the contents of the
file if you crash right before the 5-second commit window. At least back
in the days when people were proud of their Linux systems having 2-3
year uptimes, and where jiffies could actually wrap from time to time,
the difference between 5 seconds and 3 minutes really wasn't that big
of a deal. People who really care about this can turn off delayed
allocation with the nodelalloc mount option. Of course then they will
have the ext3 slower fsync() problem.

You are right that data=writeback and delayed allocation do both mean
that data can get pushed out much later than the metadata. But that's
allowed by POSIX, and it does give some very nice performance
benefits.

With either data=writeback or delayed allocation, we can also adjust
the default commit interval and the writeback timer settings; if we
say, change the default commit interval to be 30 seconds, and change
the writeback expire interval to be 15 seconds, it will also smooth
out the writes significantly. So that's yet another solution, with a
different set of tradeoffs.

Depending on the set of applications someone is running on their
system, running and the reliability of their hardware/power/system in
general, different tradeoffs will be more or less appropriate for the
system administrator in question.

- Ted

Kyle Moffett

unread,
Mar 24, 2009, 2:49:25 PM3/24/09
to Linus Torvalds, Theodore Tso, Ingo Molnar, Alan Cox, Arjan van de Ven, Andrew Morton, Peter Zijlstra, Nick Piggin, Jens Axboe, David Rees, Jesper Krogh, Linux Kernel Mailing List
On Tue, Mar 24, 2009 at 1:55 PM, Linus Torvalds
<torv...@linux-foundation.org> wrote:
> On Tue, 24 Mar 2009, Theodore Tso wrote:
>> Try ext4, I think you'll like it.  :-)
>>
>> Failing that, data=writeback for single-user machines is probably your
>> best bet.
>
> Isn't that the same fix? ext4 just defaults to the crappy "writeback"
> behavior, which is insane.
>
> Sure, it makes things _much_ smoother, since now the actual data is no
> longer in the critical path for any journal writes, but anybody who thinks
> that's a solution is just incompetent.
>
> We might as well go back to ext2 then. If your data gets written out long
> after the metadata hit the disk, you are going to hit all kinds of bad
> issues if the machine ever goes down.

Not really...

Regardless of any journalling, a power-fail or a crash is almost
certainly going to cause "data loss" of some variety. We simply
didn't get to sync everything we needed to (otherwise we'd all be
shutting down our computers with the SCRAM switches just for kicks).
The difference is, with ext3/4 (in any journal mode) we guarantee our
metadata is consistent. This means that we won't double-allocate or
leak inodes or blocks, which means that we can safely *write* to the
filesystem as soon as we replay the journal. With ext2 you *CAN'T* do
that at all, as somebody may have allocated an inode but not yet
marked it as in use. The only way to safely figure all that out
without journalling is an fsck run.

That difference between ext4 and ext3-in-writeback-mode is this: If
you get a crash in the narrow window *after* writing initial metadata
and before writing the data, ext4 will give you a zero length file,
whereas ext3-in-writeback-mode will give you a proper-length file
filled with whatever used to be on disk (might be the contents of a
previous /etc/shadow, or maybe somebody's finance files).

In that same situation, ext3 in data-ordered or data-journal mode will
"close" the window by preventing anybody else from making forward
progress until the data and the metadata are both updated. The thing
is, even on ext3 I can get exactly the same kind of behavior with an
appropriately timed "kill -STOP $dumb_program", followed by a power
failure 60 seconds later. It's a relatively obvious race condition...

When you create a file, you can't guarantee that all of that file's
data and metadata has hit disk until after an fsync() call returns.
The only *possible* exceptions are in cases like the
previously-mentioned (and now patched)
open(A)+write(A)+close(A)+rename(A,B), where the
rename-over-existing-file should act as an implicit filesystem
barrier. It should ensure that all writes to the file get flushed
before it is renamed on top of an existing file, simply because so
much UNIX software expects it to act that way.

When you're dealing with programs that simply
open()+ftruncate()+write()+close(), however... there's always going to
be a window in-between the ftruncate and the write where the file *is*
an empty file, and in that case no amount of operating-system-level
cleverness can deal with application-level bugs.

Cheers,
Kyle Moffett
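
The two patterns Kyle contrasts look roughly like this in code (a minimal
sketch with most error handling trimmed; the helper names are made up):

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Safe pattern: write a new file, fsync it, then rename over the old
 * one. After a crash you see either the old or the new contents. */
static int replace_via_rename(const char *path, const char *buf, size_t len)
{
        char tmp[4096];
        int fd;

        snprintf(tmp, sizeof(tmp), "%s.tmp", path);
        fd = open(tmp, O_WRONLY | O_CREAT | O_TRUNC, 0644);
        if (fd < 0)
                return -1;
        if (write(fd, buf, len) != (ssize_t)len ||
            fsync(fd) != 0 ||           /* data is on disk before the rename */
            close(fd) != 0 ||
            rename(tmp, path) != 0)
                return -1;
        return 0;
}

/* Racy pattern: truncate in place, then write. Between the truncate
 * and the write the file genuinely is empty - no filesystem can close
 * an application-level window like this. */
static int replace_in_place(const char *path, const char *buf, size_t len)
{
        int fd = open(path, O_WRONLY | O_TRUNC);

        if (fd < 0)
                return -1;
        if (write(fd, buf, len) != (ssize_t)len)
                return -1;
        return close(fd);
}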

David Rees

unread,
Mar 24, 2009, 3:00:57 PM3/24/09
to Jesper Krogh, Linus Torvalds, Linux Kernel Mailing List
On Tue, Mar 24, 2009 at 12:32 AM, Jesper Krogh <jes...@krogh.cc> wrote:
> David Rees wrote:
> The 480 seconds is not the "wait time" but the time gone before the
> message is printed. It's the kernel default; it was earlier 120 seconds but
> that was changed by Ingo Molnar back in September. I do get a lot less
> noise but it really doesn't tell anything about the nature of the problem.
>
> The systes spec:
> 32GB of memory. The disks are a Nexsan SataBeast with 42 SATA drives in
> Raid10 connected using 4Gbit fibre-channel. I'll leave it up to you to decide
> if that's fast or slow?

The drives should be fast enough to saturate 4Gbit FC in streaming
writes. How fast is the array in practice?

> The strange thing is actually that the above process (updatedb.mlocate) is
> writing to / which is a device without any activity at all. All activity is
> on the Fibre Channel device above, but processes writing outside that seem to
> be affected as well.

Ah. Sounds like your setup would benefit immensely from the per-bdi
patches from Jens Axboe. I'm sure he would appreciate some feedback
from users like you on them.

>> What's your vm.dirty_background_ratio and
>>
>> vm.dirty_ratio set to?
>
> 2.6.29-rc8 defaults:
> jk@hest:/proc/sys/vm$ cat dirty_background_ratio
> 5
> jk@hest:/proc/sys/vm$ cat dirty_ratio
> 10

On a 32GB system that's 1.6GB of dirty data, but your array should be
able to write that out fairly quickly (in a couple seconds) as long as
it's not too random. If it's spread all over the disk, write
throughput will drop significantly - how fast is data being written to
disk when your system suffers from large write latency?

>>> Consensus seems to be something with large memory machines, lots of dirty
>>> pages and a long writeout time due to ext3.
>>
>> All filesystems seem to suffer from this issue to some degree.  I
>> posted to the list earlier trying to see if there was anything that
>> could be done to help my specific case.  I've got a system where if
>> someone starts writing out a large file, it kills client NFS writes.
>> Makes the system unusable:
>> http://marc.info/?l=linux-kernel&m=123732127919368&w=2
>
> Yes, I've hit 120s+ penalties just by saving a file in vim.

Yeah, your disks aren't keeping up and/or data isn't being written out
efficiently.

>> Only workaround I've found is to reduce dirty_background_ratio and
>> dirty_ratio to tiny levels.  Or throw good SSDs and/or a fast RAID
>> array at it so that large writes complete faster.  Have you tried the
>> new vm_dirty_bytes in 2.6.29?
>
> No.. What would you suggest to be a reasonable setting for that?

Look at whatever is there by default and try cutting them in half to start.

>> Everyone seems to agree that "autotuning" it is the way to go.  But no
>> one seems willing to step up and try to do it.  Probably because it's
>> hard to get right!
>
> I can test patches.. but I'm not a kernel-developer.. unfortunately.

Me either - but luckily there have been plenty chiming in on this thread now.

-Dave

Ingo Molnar

unread,
Mar 24, 2009, 3:19:46 PM3/24/09
to Herbert Xu, Robert Schwebel, Linus Torvalds, Frank Blaschka, David S. Miller, Thomas Gleixner, Peter Zijlstra, Linux Kernel Mailing List, ker...@pengutronix.de

* Ingo Molnar <mi...@elte.hu> wrote:

Hm, i just had a test failure (hung interface) with this too.

I'll go back to the original straight revert of "303c6a0: gro: Fix
legacy path napi_complete crash", and will test it overnight - to
establish a baseline of stability again. (to make sure there are no
other bugs interacting)

Ingo

Linus Torvalds

unread,
Mar 24, 2009, 3:27:24 PM3/24/09
to Kyle Moffett, Theodore Tso, Ingo Molnar, Alan Cox, Arjan van de Ven, Andrew Morton, Peter Zijlstra, Nick Piggin, Jens Axboe, David Rees, Jesper Krogh, Linux Kernel Mailing List

On Tue, 24 Mar 2009, Kyle Moffett wrote:
>
> Regardless of any journalling, a power-fail or a crash is almost
> certainly going to cause "data loss" of some variety.

The point is, if you write your metadata earlier (say, every 5 sec) and
the real data later (say, every 30 sec), you're actually MORE LIKELY to
see corrupt files than if you try to write them together.

And if you write your data _first_, you're never going to see corruption
at all.

This is why I absolutely _detest_ the idiotic ext3 writeback behavior. It
literally does everything the wrong way around - writing data later than
the metadata that points to it. Whoever came up with that solution was a
moron. No ifs, buts, or maybes about it.

Linus

Train05

unread,
Mar 24, 2009, 3:27:57 PM3/24/09
to linux-...@vger.kernel.org
Hello Ingo,

I experienced identical issues with the network stack hanging after a very short period with 2.6.29 and forcedeth (transferring a few megabytes was normally enough). I applied your patch and it is now working correctly.

Many thanks

Ross

Ross Alexander
SAP Basis
NEC Europe Ltd
Corporate IT Centre
Tel: +44 20 8752 3394

Linus Torvalds

unread,
Mar 24, 2009, 3:28:31 PM3/24/09
to Theodore Tso, Ingo Molnar, Alan Cox, Arjan van de Ven, Andrew Morton, Peter Zijlstra, Nick Piggin, Jens Axboe, David Rees, Jesper Krogh, Linux Kernel Mailing List

On Tue, 24 Mar 2009, Theodore Tso wrote:
>
> With ext2 after a system crash you need to run fsck. With ext4, fsck
> isn't an issue,

Bah. A corrupt filesystem is a corrupt filesystem. Whether you have to
fsck it or not should be a secondary concern.

I personally find silent corruption to be _worse_ than the non-silent one.
At least if there's some program that says "oops, your inode so-and-so
seems to be scrogged" that's better than just silently having bad data in
it.

Of course, never having bad data _nor_ needing fsck is clearly optimal.
data=ordered gets pretty close (and data=journal is unacceptable for
performance reasons).

But I really don't understand filesystem people who think that "fsck" is
the important part, regardless of whether the data is valid or not. That's
just stupid and _obviously_ bogus.

Linus

Ric Wheeler

unread,
Mar 24, 2009, 3:44:46 PM3/24/09
to Linus Torvalds, Theodore Tso, Ingo Molnar, Alan Cox, Arjan van de Ven, Andrew Morton, Peter Zijlstra, Nick Piggin, Jens Axboe, David Rees, Jesper Krogh, Linux Kernel Mailing List
Linus Torvalds wrote:
> On Tue, 24 Mar 2009, Theodore Tso wrote:
>
>> With ext2 after a system crash you need to run fsck. With ext4, fsck
>> isn't an issue,
>>
>
> Bah. A corrupt filesystem is a corrupt filesystem. Whether you have to
> fsck it or not should be a secondary concern.
>
> I personally find silent corruption to be _worse_ than the non-silent one.
> At least if there's some program that says "oops, your inode so-and-so
> seems to be scrogged" that's better than just silently having bad data in
> it.
>
> Of course, never having bad data _nor_ needing fsck is clearly optimal.
> data=ordered gets pretty close (and data=journal is unacceptable for
> performance reasons).
>
> But I really don't understand filesystem people who think that "fsck" is
> the important part, regardless of whether the data is valid or not. That's
> just stupid and _obviously_ bogus.
>
> Linus
>
It is always interesting to try to explain to users that just because
fsck ran cleanly does not mean anything that they care about is actually
safely on disk. The speed that fsck can run at is important when you are
trying to recover data from a really hosed file system, but that is
thankfully relatively rare for most people.

Having been involved in many calls with customers after crashes, what
they really want to know is pretty routine - do you have all of the data
I wrote? can you prove that it is the same data that I wrote? if not,
what data is missing and needs to be restored?

We can help answer those questions with checksums or digital hashes
to validate the actual user data of files (open questions are when to
compute them, where to store them, and whether the SCSI T10 DIF/DIX
stuff would be sufficient), by putting in place some background
scrubbers to detect corruptions (which can happen even without an IO
error), etc.

Being able to pin point what was impacted is actually enormously useful
- for example, being able to map a bad sector back into some meaningful
object like a user file, meta-data (translation, run fsck) or so on.

Ric

Jeff Garzik

unread,
Mar 24, 2009, 3:57:43 PM3/24/09
to Linus Torvalds, Theodore Tso, Ingo Molnar, Alan Cox, Arjan van de Ven, Andrew Morton, Peter Zijlstra, Nick Piggin, Jens Axboe, David Rees, Jesper Krogh, Linux Kernel Mailing List
Linus Torvalds wrote:
> But I really don't understand filesystem people who think that "fsck" is
> the important part, regardless of whether the data is valid or not. That's
> just stupid and _obviously_ bogus.

I think I can understand that point of view, at least:

More customers complain about hours-long fsck times than they do about
silent data corruption of non-fsync'd files.


> The point is, if you write your metadata earlier (say, every 5 sec) and
> the real data later (say, every 30 sec), you're actually MORE LIKELY to
> see corrupt files than if you try to write them together.
>
> And if you write your data _first_, you're never going to see corruption
> at all.

Amen.

And, personal filesystem pet peeve: please encourage proper FLUSH CACHE
use to give users the data guarantees they deserve. Linux's sync(2) and
fsync(2) (and fdatasync, etc.) should poke the block layer to guarantee
a media write.

Jeff


P.S. Overall, I am thrilled that this ext3/ext4 transition and
associated slashdotting has spurred debate over filesystem data
guarantees. This is the kind of discussion that has needed to happen
for years, IMO.

David Rees

unread,
Mar 24, 2009, 4:24:58 PM3/24/09
to Theodore Tso, Ingo Molnar, Alan Cox, Arjan van de Ven, Andrew Morton, Peter Zijlstra, Nick Piggin, Jens Axboe, David Rees, Jesper Krogh, Linus Torvalds, Linux Kernel Mailing List
On Tue, Mar 24, 2009 at 6:20 AM, Theodore Tso <ty...@mit.edu> wrote:
> However, what I've found, though, is that if you're just doing a local
> copy from one hard drive to another, or downloading a huge iso file
> from an ftp server over a wide area network, the fsync() delays really

> don't get *that* bad, even with ext3.  At least, I haven't found a
> workload that doesn't involve either dd if=/dev/zero or a massive
> amount of data coming in over the network that will cause fsync()
> delays in the > 1-2 second category.  Ext3 has been around for a long
> time, and it's only been the last couple of years that people have
> really complained about this; my theory is that it was the rise of >
> 10 megabit ethernets and the use of systems like distcc that really
> made this problem really become visible.  The only realistic workload
> I've found that triggers this requires a fast network dumping data to
> a local filesystem.

It's pretty easy to reproduce it these days. Here's my setup, and
it's not even that fancy: Dual core Xeon, 8GB RAM, SATA RAID1 array,
GigE network. All it takes is a single client writing a large file
using Samba or NFS to introduce huge latencies.

Looking at the raw throughput, the server's disks can sustain
30-60MB/s writes (older disks), but the network can handle up to
~100MB/s. Throw in some other random seeky IO on the server, a bunch
of fragmentation and its sustained write throughput in reality for
these writes is more like 10-25MB/s, far slower than the rate at which
a client can throw data at it.

5% dirty_ratio * 8GB is 400MB. Let's say in reality the system is
flushing 20MB/s to disk, this is a delay of up to 20 seconds. Let's
say you have a user application which needs to fsync a number of small
files (and unfortunately they are done serially) and now I've got
applications (like Firefox) which basically remain unresponsive the
entire time the write is being done.

> (I'm sure someone will be ingenious enough to find something else
> though, and if they're interested, I've attached an fsync latency
> tester to this note.  If you find something; let me know, I'd be
> interested.)

Thanks - I'll give the program a shot later with my test case and see
what it reports. My simple test case[1] for reproducing this has
reported 6-45 seconds depending on the system. I'll try it with the
previously mentioned workload as well.

-Dave

[1] http://bugzilla.kernel.org/show_bug.cgi?id=12309#c249
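
Ted's actual fsync-tester.c is not reproduced in this archive, but the idea
it implements can be sketched in a few lines (the file name, buffer size and
interval here are arbitrary choices, not his):

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/time.h>
#include <unistd.h>

int main(void)
{
        char buf[4096];
        struct timeval t0, t1;
        int fd = open("fsync-probe.tmp", O_WRONLY | O_CREAT | O_TRUNC, 0644);

        if (fd < 0)
                return 1;
        memset(buf, 'a', sizeof(buf));
        for (;;) {
                /* dirty a little data, then see how long fsync() stalls */
                if (write(fd, buf, sizeof(buf)) < 0)
                        return 1;
                gettimeofday(&t0, NULL);
                fsync(fd);
                gettimeofday(&t1, NULL);
                printf("fsync: %.3f s\n",
                       (t1.tv_sec - t0.tv_sec) +
                       (t1.tv_usec - t0.tv_usec) / 1e6);
                sleep(1);
        }
}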

Ingo Molnar

unread,
Mar 24, 2009, 4:55:47 PM3/24/09
to Herbert Xu, Robert Schwebel, Linus Torvalds, Frank Blaschka, David S. Miller, Thomas Gleixner, Peter Zijlstra, Linux Kernel Mailing List, ker...@pengutronix.de

* Ingo Molnar <mi...@elte.hu> wrote:

> > Same forcedeth box i reported before. Config below. (note: if
> > you want to use it you need to run it through 'make oldconfig',
> > with all defaults accepted)
>

> Hm, i just had a test failure (hung interface) with this too.


>
> I'll go back to the original straight revert of "303c6a0: gro: Fix
> legacy path napi_complete crash", and will test it overnight - to
> establish a baseline of stability again. (to make sure there are
> no other bugs interacting)

FYI, this plain revert is holding up fine in my tests so far - 50
random iterations - the previous one failed after 5 iterations.

David Miller

unread,
Mar 24, 2009, 5:19:11 PM3/24/09
to mi...@elte.hu, her...@gondor.apana.org.au, r.sch...@pengutronix.de, torv...@linux-foundation.org, blas...@linux.vnet.ibm.com, tg...@linutronix.de, a.p.zi...@chello.nl, linux-...@vger.kernel.org, ker...@pengutronix.de
From: Ingo Molnar <mi...@elte.hu>
Date: Tue, 24 Mar 2009 21:54:44 +0100

> * Ingo Molnar <mi...@elte.hu> wrote:
>
> > > Same forcedeth box i reported before. Config below. (note: if
> > > you want to use it you need to run it through 'make oldconfig',
> > > with all defaults accepted)
> >
> > Hm, i just had a test failure (hung interface) with this too.
> >
> > I'll go back to the original straight revert of "303c6a0: gro: Fix
> > legacy path napi_complete crash", and will test it overnight - to
> > establish a baseline of stability again. (to make sure there are
> > no other bugs interacting)
>
> FYI, this plain revert is holding up fine in my tests so far - 50
> random iterations - the previous one failed after 5 iterations.

Something must be up with respect to letting interrupts in during
certain windows of time, or similar.

I'll take a look at this and hopefully Herbert or myself will be
able to figure it out.

David Miller

unread,
Mar 24, 2009, 5:37:09 PM3/24/09
to her...@gondor.apana.org.au, mi...@elte.hu, r.sch...@pengutronix.de, torv...@linux-foundation.org, blas...@linux.vnet.ibm.com, tg...@linutronix.de, a.p.zi...@chello.nl, linux-...@vger.kernel.org, ker...@pengutronix.de
From: Herbert Xu <her...@gondor.apana.org.au>
Date: Tue, 24 Mar 2009 23:09:28 +0800

> On Tue, Mar 24, 2009 at 03:39:42PM +0100, Ingo Molnar wrote:
> >
> > Subject: [PATCH] net: Fix netpoll lockup in legacy receive path
>
> Actually, this patch is still racy. If some interrupt comes in
> and we suddenly get the maximum amount of backlog we can still
> hang when we call __napi_complete incorrectly. It's unlikely
> but we certainly shouldn't allow that. Here's a better version.
>
> net: Fix netpoll lockup in legacy receive path

Hmmm...

> @@ -2588,9 +2588,10 @@ static int process_backlog(struct napi_struct *napi, int quota)
>  		local_irq_disable();
>  		skb = __skb_dequeue(&queue->input_pkt_queue);
>  		if (!skb) {
> +			list_del(&napi->poll_list);
> +			clear_bit(NAPI_STATE_SCHED, &napi->state);
>  			local_irq_enable();
> -			napi_complete(napi);
> -			goto out;
> +			break;
>  		}
>  		local_irq_enable();

I think the problem is that we need to do the GRO flush before the
list delete and clearing the NAPI_STATE_SCHED bit.

You can't disown the NAPI context until you've squared away the GRO
state, I think.

Ingo's case stresses TCP a lot so I think he's hitting these GRO
cases a lot as well as hitting the backlog maximum.

So this mis-ordering of completion operations could explain why
he still sees problems.

Ingo Molnar

unread,
Mar 24, 2009, 6:02:25 PM3/24/09
to David Miller, her...@gondor.apana.org.au, r.sch...@pengutronix.de, torv...@linux-foundation.org, blas...@linux.vnet.ibm.com, tg...@linutronix.de, a.p.zi...@chello.nl, linux-...@vger.kernel.org, ker...@pengutronix.de

* David Miller <da...@davemloft.net> wrote:

> From: Ingo Molnar <mi...@elte.hu>
> Date: Tue, 24 Mar 2009 21:54:44 +0100
>
> > * Ingo Molnar <mi...@elte.hu> wrote:
> >
> > > > Same forcedeth box i reported before. Config below. (note: if
> > > > you want to use it you need to run it through 'make oldconfig',
> > > > with all defaults accepted)
> > >
> > > Hm, i just had a test failure (hung interface) with this too.
> > >
> > > I'll go back to the original straight revert of "303c6a0: gro: Fix
> > > legacy path napi_complete crash", and will test it overnight - to
> > > establish a baseline of stability again. (to make sure there are
> > > no other bugs interacting)
> >
> > FYI, this plain revert is holding up fine in my tests so far - 50
> > random iterations - the previous one failed after 5 iterations.
>
> Something must be up with respect to letting interrupts in during
> certain windows of time, or similar.
>
> I'll take a look at this and hopefully Herbert or myself will be
> able to figure it out.

It definitely did not show the usual patterns of bug behavior - i'd have
found it yesterday morning if it did.

I spent most of the time trying to find a reliable reproducer
config and system. Sometimes the bug went away with a minor change
in the .config. Until today i didnt even suspect a mainline change
causing this.

Also, note that i have reduced the probability of UP kernels in my
randconfigs artificially to about 12.5% (it is 50% upstream). Still,
despite that measure, the 'best' .config i found was an UP config -
i dont think that's an accident. Also, i had to fully saturate the
target CPU over gigabit to hit the bug best.

Which suggests to me (empirically) that it's indeed a race and that
it needs a saturated system with lots of IRQs to trigger, and
perhaps that it needs saturated/overloaded network device queues and
complex userspace/softirq/hardirq interactions.

Ingo

David Miller

unread,
Mar 24, 2009, 6:48:16 PM3/24/09
to her...@gondor.apana.org.au, mi...@elte.hu, r.sch...@pengutronix.de, torv...@linux-foundation.org, blas...@linux.vnet.ibm.com, tg...@linutronix.de, a.p.zi...@chello.nl, linux-...@vger.kernel.org, ker...@pengutronix.de
From: David Miller <da...@davemloft.net>
Date: Tue, 24 Mar 2009 14:36:22 -0700 (PDT)

> From: Herbert Xu <her...@gondor.apana.org.au>
> Date: Tue, 24 Mar 2009 23:09:28 +0800
>

> > @@ -2588,9 +2588,10 @@ static int process_backlog(struct napi_struct *napi, int quota)
> >  		local_irq_disable();
> >  		skb = __skb_dequeue(&queue->input_pkt_queue);
> >  		if (!skb) {
> > +			list_del(&napi->poll_list);
> > +			clear_bit(NAPI_STATE_SCHED, &napi->state);
> >  			local_irq_enable();
> > -			napi_complete(napi);
> > -			goto out;
> > +			break;
> >  		}
> >  		local_irq_enable();
>
> I think the problem is that we need to do the GRO flush before the
> list delete and clearing the NAPI_STATE_SCHED bit.

Ok Herbert, I'm even more sure of this because in your original commit
log message you mention:

This simply doesn't work since we need to flush the held
GRO packets first.

We are certainly in a pickle here, actually.

We can't run the GRO flush until we re-enable interrupts. But if we
re-enable interrupts, more packets get queued to the input_pkt_queue
and we end up back where we started.
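Spelled out against the hunk quoted above, the bind looks roughly like
this (a sketch of the argument, not a proposed patch -- a fragment of
2.6.29's process_backlog(), not standalone code):

	local_irq_disable();
	skb = __skb_dequeue(&queue->input_pkt_queue);
	if (!skb) {
		/* Option A (the quoted patch): complete now, flush GRO
		 * later. But this disowns the NAPI context while GRO
		 * packets may still be held, which is what cannot be
		 * allowed. */
		list_del(&napi->poll_list);
		clear_bit(NAPI_STATE_SCHED, &napi->state);
		local_irq_enable();

		/* Option B: napi_gro_flush(napi) before completing. But
		 * the flush can't run until IRQs are re-enabled, and with
		 * IRQs on, netif_rx() can refill input_pkt_queue -- so
		 * the "queue is empty" decision above is already stale. */
		break;
	}
	local_irq_enable();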

Jesse Barnes

unread,
Mar 24, 2009, 7:04:21 PM3/24/09
to Theodore Tso, Ingo Molnar, Alan Cox, Arjan van de Ven, Andrew Morton, Peter Zijlstra, Nick Piggin, Jens Axboe, David Rees, Jesper Krogh, Linus Torvalds, Linux Kernel Mailing List
On Tue, 24 Mar 2009 09:20:32 -0400
Theodore Tso <ty...@mit.edu> wrote:
> They don't solve the problem where there is a *huge* amount of writes
> going on, though --- if something is dirtying pages at a rate far
> greater than the local disk can write it out, say, either "dd
> if=/dev/zero of=/mnt/make-lots-of-writes" or a massive distcc cluster
> driving a huge amount of data towards a single system or a wget over a
> local 100 megabit ethernet from a massive NFS server where everything
> is in cache, then you can have a major delay with the fsync().

You make it sound like this is hard to do... I was running into this
problem *every day* until I moved to XFS recently. I'm running a
fairly beefy desktop (VMware running a crappy Windows install w/AV junk
on it, builds, icecream and large mailboxes) and have a lot of RAM, but
it became unusable for minutes at a time, which was just totally
unacceptable, thus the switch. Things have been better since, but are
still a little choppy.

I remember early in the 2.6.x days there was a lot of focus on making
interactive performance good, and for a long time it was. But this I/O
problem has been around for a *long* time now... What happened? Do not
many people run into this daily? Do all the filesystem hackers run
with special mount options to mitigate the problem?

--
Jesse Barnes, Intel Open Source Technology Center

Arjan van de Ven

unread,
Mar 24, 2009, 8:05:33 PM3/24/09
to Jesse Barnes, Theodore Tso, Ingo Molnar, Alan Cox, Andrew Morton, Peter Zijlstra, Nick Piggin, Jens Axboe, David Rees, Jesper Krogh, Linus Torvalds, Linux Kernel Mailing List
On Tue, 24 Mar 2009 16:03:53 -0700
Jesse Barnes <jba...@virtuousgeek.org> wrote:

>
> I remember early in the 2.6.x days there was a lot of focus on making
> interactive performance good, and for a long time it was. But this
> I/O problem has been around for a *long* time now... What happened?
> Do not many people run into this daily? Do all the filesystem
> hackers run with special mount options to mitigate the problem?
>

the people that care use my kernel patch on ext3 ;-)
(or the userland equivalent tweak in /etc/rc.local)
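The rc.local line itself isn't shown in the thread; going by the
kjournald/ioprio discussion elsewhere in this thread, the userland
equivalent is presumably to put kjournald into the real-time I/O
scheduling class, i.e. what ionice -c1 does. A minimal standalone
version (an assumption about the tweak's content, not Arjan's literal
script; needs root, and glibc has no wrapper for the syscall):

/* usage: ./ioprio <pid-of-kjournald> */
#include <stdio.h>
#include <stdlib.h>
#include <sys/syscall.h>
#include <unistd.h>

#define IOPRIO_CLASS_RT		1	/* real-time I/O class */
#define IOPRIO_WHO_PROCESS	1
#define IOPRIO_CLASS_SHIFT	13

int main(int argc, char **argv)
{
	pid_t pid;
	int prio;

	if (argc != 2) {
		fprintf(stderr, "usage: %s <pid>\n", argv[0]);
		return 1;
	}
	pid = atoi(argv[1]);
	/* RT class, highest level (0), as ionice -c1 -n0 would set it */
	prio = (IOPRIO_CLASS_RT << IOPRIO_CLASS_SHIFT) | 0;
	if (syscall(SYS_ioprio_set, IOPRIO_WHO_PROCESS, pid, prio) < 0) {
		perror("ioprio_set");
		return 1;
	}
	return 0;
}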

--
Arjan van de Ven Intel Open Source Technology Centre
For development, discussion and tips for power savings,
visit http://www.lesswatts.org

Herbert Xu

unread,
Mar 24, 2009, 8:24:08 PM3/24/09
to David Miller, mi...@elte.hu, r.sch...@pengutronix.de, torv...@linux-foundation.org, blas...@linux.vnet.ibm.com, tg...@linutronix.de, a.p.zi...@chello.nl, linux-...@vger.kernel.org, ker...@pengutronix.de
On Tue, Mar 24, 2009 at 02:36:22PM -0700, David Miller wrote:
>
> I think the problem is that we need to do the GRO flush before the
> list delete and clearing the NAPI_STATE_SCHED bit.

Well first of all GRO shouldn't even be on in Ingo's case, unless
he enabled it by hand with ethtool. Secondly the only thing that
touches the GRO state for the legacy path is process_backlog, and
since this is per-cpu, I can't see how another instance can run
while the first is still going.

Herbert Xu

unread,
Mar 24, 2009, 8:26:31 PM3/24/09
to David Miller, mi...@elte.hu, r.sch...@pengutronix.de, torv...@linux-foundation.org, blas...@linux.vnet.ibm.com, tg...@linutronix.de, a.p.zi...@chello.nl, linux-...@vger.kernel.org, ker...@pengutronix.de
On Tue, Mar 24, 2009 at 03:47:43PM -0700, David Miller wrote:
>
> > I think the problem is that we need to do the GRO flush before the
> > list delete and clearing the NAPI_STATE_SCHED bit.
>
> Ok Herbert, I'm even more sure of this because in your original commit
> log message you mention:
>
> This simply doesn't work since we need to flush the held
> GRO packets first.

That's only because I was calling __napi_complete, which is used
by drivers in general so I added the check to ensure that GRO
packets have been flushed. Now that we're open-coding it this is
no longer a requirement.

But what's more GRO should be off on Ingo's test machines because
we haven't added anything to turn it on by default for non-NAPI
drivers.

Herbert Xu

unread,
Mar 24, 2009, 8:53:34 PM3/24/09
to Ingo Molnar, Robert Schwebel, Linus Torvalds, Frank Blaschka, David S. Miller, Thomas Gleixner, Peter Zijlstra, Linux Kernel Mailing List, ker...@pengutronix.de
On Tue, Mar 24, 2009 at 05:02:41PM +0100, Ingo Molnar wrote:
>
> * Herbert Xu <her...@gondor.apana.org.au> wrote:
>
> > What's the NIC and config on this one? If it's still using the
> > legacy/netif_rx path, where GRO is off by default, this patch
> > should make it exactly the same as with my original patch
> > reverted.
>
> Same forcedeth box i reported before. Config below. (note: if you
> want to use it you need to run it through 'make oldconfig', with all
> defaults accepted)
>
> CONFIG_FORCEDETH=y
> CONFIG_FORCEDETH_NAPI=y

This means that we shouldn't even invoke netif_rx/process_backlog,
so something else is going on.

Herbert Xu

unread,
Mar 24, 2009, 8:53:48 PM3/24/09
to Ingo Molnar, Robert Schwebel, Linus Torvalds, Frank Blaschka, David S. Miller, Thomas Gleixner, Peter Zijlstra, Linux Kernel Mailing List, ker...@pengutronix.de
On Tue, Mar 24, 2009 at 08:19:00PM +0100, Ingo Molnar wrote:
>
> Hm, i just had a test failure (hung interface) with this too.

Was this with NAPI on or off?

Thanks,

David Miller

unread,
Mar 24, 2009, 10:09:44 PM3/24/09
to her...@gondor.apana.org.au, mi...@elte.hu, r.sch...@pengutronix.de, torv...@linux-foundation.org, blas...@linux.vnet.ibm.com, tg...@linutronix.de, a.p.zi...@chello.nl, linux-...@vger.kernel.org, ker...@pengutronix.de
From: Herbert Xu <her...@gondor.apana.org.au>
Date: Wed, 25 Mar 2009 08:32:35 +0800

> On Tue, Mar 24, 2009 at 05:02:41PM +0100, Ingo Molnar wrote:
> >
> > * Herbert Xu <her...@gondor.apana.org.au> wrote:
> >
> > > What's the NIC and config on this one? If it's still using the
> > > legacy/netif_rx path, where GRO is off by default, this patch
> > > should make it exactly the same as with my original patch
> > > reverted.
> >
> > Same forcedeth box i reported before. Config below. (note: if you
> > want to use it you need to run it through 'make oldconfig', with all
> > defaults accepted)
> >
> > CONFIG_FORCEDETH=y
> > CONFIG_FORCEDETH_NAPI=y
>
> This means that we shouldn't even invoke netif_rx/process_backlog,
> so something else is going on.

There is always loopback which does netif_rx().

Combine that with the straight NAPI receive that forcedeth
is doing here and I'm sure there are all kinds of race
scenarios possible :-)

You're right about GRO not being relevant here. To be honest
I wouldn't be disappointed if GRO was simply on by default
even for the legacy paths.

Theodore Tso

unread,
Mar 24, 2009, 10:10:06 PM3/24/09
to Jesse Barnes, Ingo Molnar, Alan Cox, Arjan van de Ven, Andrew Morton, Peter Zijlstra, Nick Piggin, Jens Axboe, David Rees, Jesper Krogh, Linus Torvalds, Linux Kernel Mailing List
On Tue, Mar 24, 2009 at 04:03:53PM -0700, Jesse Barnes wrote:
>
> You make it sound like this is hard to do... I was running into this
> problem *every day* until I moved to XFS recently. I'm running a
> fairly beefy desktop (VMware running a crappy Windows install w/AV junk
> on it, builds, icecream and large mailboxes) and have a lot of RAM, but
> it became unusable for minutes at a time, which was just totally
> unacceptable, thus the switch. Things have been better since, but are
> still a little choppy.
>

I have 4 gigs of memory on my laptop, and I've never seen these
sorts of issues. So maybe filesystem hackers don't have enough
memory; or we don't use the right workloads? It would help if I
understood how to trigger these disaster cases. I've had to work
*really* hard (as in dd if=/dev/zero of=/mnt/dirty-me-harder) in order
to get even a 30 second fsync() delay. So understanding what sort of
things you do that cause that many files' data blocks to be dirtied,
and/or what is causing a major read workload, would be useful.

It may be that we just need to tune the VM to be much more aggressive
about pushing dirty pages to the disk sooner. Understanding how the
dynamics are working would be the first step.

> I remember early in the 2.6.x days there was a lot of focus on making
> interactive performance good, and for a long time it was. But this I/O
> problem has been around for a *long* time now... What happened? Do not
> many people run into this daily? Do all the filesystem hackers run
> with special mount options to mitigate the problem?

All I can tell you is that *I* don't run into them, even when I was
using ext3 and before I got an SSD in my laptop. I don't understand
why; maybe because I don't get really nice toys like systems with
32G's of memory. Or maybe it's because I don't use icecream (whatever
that is). Whatever it is, it would be useful to get some solid
reproduction information, with details about hardware configuration,
and information collected using sar and scripts that gather
/proc/meminfo every 5 seconds, and what the applications were doing at
the time.

It might also be useful for someone to try reducing the amount of
memory the system is using by using mem= on the boot line, and see if
that changes things, and to try simplifying the application workload,
and/or using iotop to determine what is most contributing to the
problem. (And of course, this needs to be done with someone using
ext3, since both ext4 and XFS use delayed allocation, which will
largely make this problem go away.)

- Ted
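(The /proc/meminfo gathering Ted describes needs nothing fancier than
sar or a shell loop; an equivalent minimal C sampler -- a sketch, not
an existing tool -- would be:)

/* Print the Dirty/Writeback lines of /proc/meminfo every 5 seconds. */
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
	char line[128];

	for (;;) {
		FILE *f = fopen("/proc/meminfo", "r");

		if (!f) {
			perror("/proc/meminfo");
			return 1;
		}
		while (fgets(line, sizeof(line), f))
			if (!strncmp(line, "Dirty:", 6) ||
			    !strncmp(line, "Writeback:", 10))
				fputs(line, stdout);
		fclose(f);
		fflush(stdout);
		sleep(5);
	}
}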

David Miller

unread,
Mar 24, 2009, 10:12:00 PM3/24/09
to her...@gondor.apana.org.au, mi...@elte.hu, r.sch...@pengutronix.de, torv...@linux-foundation.org, blas...@linux.vnet.ibm.com, tg...@linutronix.de, a.p.zi...@chello.nl, linux-...@vger.kernel.org, ker...@pengutronix.de
From: Herbert Xu <her...@gondor.apana.org.au>
Date: Wed, 25 Mar 2009 08:23:03 +0800

> On Tue, Mar 24, 2009 at 02:36:22PM -0700, David Miller wrote:
> >
> > I think the problem is that we need to do the GRO flush before the
> > list delete and clearing the NAPI_STATE_SCHED bit.
>
> Well first of all GRO shouldn't even be on in Ingo's case, unless
> he enabled it by hand with ethtool. Secondly the only thing that
> touches the GRO state for the legacy path is process_backlog, and
> since this is per-cpu, I can't see how another instance can run
> while the first is still going.

Right.

I think the conditions Ingo is running under are that both
loopback (using legacy paths) and his NAPI based device
(forcedeth) are processing a lot of packets at the same
time.

Another thing that seems to be critical is that he can only trigger this on
UP, which means that we don't have the damn APIC potentially moving
the cpu target of the forcedeth interrupts around. And this means
also that all the processing will be on one cpu's backlog queue only.

Jesse Barnes

unread,
Mar 24, 2009, 11:57:44 PM3/24/09
to Theodore Tso, Ingo Molnar, Alan Cox, Arjan van de Ven, Andrew Morton, Peter Zijlstra, Nick Piggin, Jens Axboe, David Rees, Jesper Krogh, Linus Torvalds, Linux Kernel Mailing List
On Tue, 24 Mar 2009 22:09:15 -0400
Theodore Tso <ty...@mit.edu> wrote:

> On Tue, Mar 24, 2009 at 04:03:53PM -0700, Jesse Barnes wrote:
> >
> > You make it sound like this is hard to do... I was running into
> > this problem *every day* until I moved to XFS recently. I'm
> > running a fairly beefy desktop (VMware running a crappy Windows
> > install w/AV junk on it, builds, icecream and large mailboxes) and
> > have a lot of RAM, but it became unusable for minutes at a time,
> > which was just totally unacceptable, thus the switch. Things have
> > been better since, but are still a little choppy.
> >
>
> I have 4 gigs of memory on my laptop, and I've never seen these
> sorts of issues. So maybe filesystem hackers don't have enough
> memory; or we don't use the right workloads? It would help if I
> understood how to trigger these disaster cases. I've had to work
> *really* hard (as in dd if=/dev/zero of=/mnt/dirty-me-harder) in order
> to get even a 30 second fsync() delay. So understanding what sort of
> things you do that cause that many files' data blocks to be dirtied,
> and/or what is causing a major read workload, would be useful.
>
> It may be that we just need to tune the VM to be much more aggressive
> about pushing dirty pages to the disk sooner. Understanding how the
> dynamics are working would be the first step.

Well I think that's part of the problem; this is bigger than just
filesystems; I've been using ext3 since before I started seeing this,
so it seems like a bad VM/fs interaction may be to blame.

> > I remember early in the 2.6.x days there was a lot of focus on
> > making interactive performance good, and for a long time it was.
> > But this I/O problem has been around for a *long* time now... What
> > happened? Do not many people run into this daily? Do all the
> > filesystem hackers run with special mount options to mitigate the
> > problem?
>
> All I can tell you is that *I* don't run into them, even when I was
> using ext3 and before I got an SSD in my laptop. I don't understand
> why; maybe because I don't get really nice toys like systems with
> 32G's of memory. Or maybe it's because I don't use icecream (whatever
> that is). Whatever it is, it would be useful to get some solid
> reproduction information, with details about hardware configuration,
> and information collected using sar and scripts that gather
> /proc/meminfo every 5 seconds, and what the applications were doing at
> the time.

icecream is a distributed compiler system. Like distcc but a bit more
cross-compile & heterogeneous compiler friendly.

> It might also be useful for someone to try reducing the amount of
> memory the system is using by using mem= on the boot line, and see if
> that changes things, and to try simplifying the application workload,
> and/or using iotop to determine what is most contributing to the
> problem. (And of course, this needs to be done with someone using
> ext3, since both ext4 and XFS use delayed allocation, which will
> largely make this problem go away.)

Yep, and that's where my blame comes in. I whined about this to a few
people, like Arjan, who provided workarounds, but never got beyond
that. Some real debugging would be needed to find & fix the root
cause(s).

--
Jesse Barnes, Intel Open Source Technology Center

David Rees

unread,
Mar 25, 2009, 3:30:49 AM3/25/09
to Theodore Tso, Ingo Molnar, Alan Cox, Arjan van de Ven, Andrew Morton, Peter Zijlstra, Nick Piggin, Jens Axboe, David Rees, Jesper Krogh, Linus Torvalds, Linux Kernel Mailing List
On Tue, Mar 24, 2009 at 1:24 PM, David Rees <dre...@gmail.com> wrote:
> On Tue, Mar 24, 2009 at 6:20 AM, Theodore Tso <ty...@mit.edu> wrote:
>> The only realistic workload
>> I've found that triggers this requires a fast network dumping data to
>> a local filesystem.
>
> It's pretty easy to reproduce it these days.  Here's my setup, and
> it's not even that fancy:  Dual core Xeon, 8GB RAM, SATA RAID1 array,
> GigE network.  All it takes is a single client writing a large file
> using Samba or NFS to introduce huge latencies.
>
> Looking at the raw throughput, the server's disks can sustain
> 30-60MB/s writes (older disks), but the network can handle up to
> ~100MB/s.  Throw in some other random seeky IO on the server, a bunch
> of fragmentation and it's sustained write throughput in reality for
> these writes is more like 10-25MB/s, far slower than the rate at which
> a client can throw data at it.
>
>> (I'm sure someone will be ingenious enough to find something else
>> though, and if they're interested, I've attached an fsync latency
>> tester to this note.  If you find something, let me know, I'd be
>> interested.)

OK, two simple tests on this system produce latencies well over 1-2s
using your fsync-tester.

The network client writing to disk scenario (~1GB file) resulted in this:
fsync time: 6.5272
fsync time: 35.6803
fsync time: 15.6488
fsync time: 0.3570

One thing to note - writing to this particular array seems to have
higher than expected latency without the big write, on the order of
0.2 seconds or so. I think this is because the system is not idle and
has a good number of programs on it doing logging and other small bits
of IO. vmstat 5 shows the system writing out about 300-1000 under the
bo column.

Copying that file to a separate disk was not as bad, but there were
still some big spikes:

fsync time: 6.8808
fsync time: 18.4634
fsync time: 9.6852
fsync time: 10.6146
fsync time: 8.5015
fsync time: 5.2160

The destination disk did not have any significant IO on it at the time.

The system is running Fedora 10 2.6.27.19-78.2.30.fc9.x86_64 and has
two RAID1 arrays attached to an aacraid controller. ext3 filesystems
mounted with noatime.

-Dave

Ingo Molnar

unread,
Mar 25, 2009, 3:34:30 AM3/25/09
to David Miller, her...@gondor.apana.org.au, r.sch...@pengutronix.de, torv...@linux-foundation.org, blas...@linux.vnet.ibm.com, tg...@linutronix.de, a.p.zi...@chello.nl, linux-...@vger.kernel.org, ker...@pengutronix.de

* David Miller <da...@davemloft.net> wrote:

> From: Herbert Xu <her...@gondor.apana.org.au>
> Date: Wed, 25 Mar 2009 08:23:03 +0800
>
> > On Tue, Mar 24, 2009 at 02:36:22PM -0700, David Miller wrote:
> > >
> > > I think the problem is that we need to do the GRO flush before the
> > > list delete and clearing the NAPI_STATE_SCHED bit.
> >
> > Well first of all GRO shouldn't even be on in Ingo's case, unless
> > he enabled it by hand with ethtool. Secondly the only thing that
> > touches the GRO state for the legacy path is process_backlog, and
> > since this is per-cpu, I can't see how another instance can run
> > while the first is still going.
>
> Right.
>
> I think the conditions Ingo is running under are that both loopback
> (using legacy paths) and his NAPI based device (forcedeth) are
> processing a lot of packets at the same time.
>
> Another thing that seems to be critical is that he can only trigger
> this on UP, which means that we don't have the damn APIC
> potentially moving the cpu target of the forcedeth interrupts
> around. And this means also that all the processing will be on
> one cpu's backlog queue only.

I tested the plain revert i sent in the original report overnight
(with about 12 hours of combined testing time), and all systems held
up fine. The system that would reproduce the bug within 10-20
iterations did 210 successful iterations. Other systems held up fine
too.

So if there's no definitive resolution for the real cause of the
bug, the plain revert looks like an acceptable interim choice for
.29.1 - at least as far as my systems go.

Ingo

David Miller

unread,
Mar 25, 2009, 4:04:58 AM3/25/09
to mi...@elte.hu, her...@gondor.apana.org.au, r.sch...@pengutronix.de, torv...@linux-foundation.org, blas...@linux.vnet.ibm.com, tg...@linutronix.de, a.p.zi...@chello.nl, linux-...@vger.kernel.org, ker...@pengutronix.de
From: Ingo Molnar <mi...@elte.hu>
Date: Wed, 25 Mar 2009 08:33:49 +0100

> So if there's no definitive resolution for the real cause of the
> bug, the plain revert looks like an acceptable interim choice for
> .29.1 - at least as far as my systems go.

Then we get back the instant OOPS that patch fixes :-)

I'm sure Herbert will look into fixing this properly.

Benny Halevy

unread,
Mar 25, 2009, 5:35:42 AM3/25/09
to Jeff Garzik, Linus Torvalds, Theodore Tso, Ingo Molnar, Alan Cox, Arjan van de Ven, Andrew Morton, Peter Zijlstra, Nick Piggin, Jens Axboe, David Rees, Jesper Krogh, Linux Kernel Mailing List
On Mar. 24, 2009, 21:55 +0200, Jeff Garzik <je...@garzik.org> wrote:
> Linus Torvalds wrote:
>> But I really don't understand filesystem people who think that "fsck" is
>> the important part, regardless of whether the data is valid or not. That's
>> just stupid and _obviously_ bogus.
>
> I think I can understand that point of view, at least:
>
> More customers complain about hours-long fsck times than they do about
> silent data corruption of non-fsync'd files.
>
>
>> The point is, if you write your metadata earlier (say, every 5 sec) and
>> the real data later (say, every 30 sec), you're actually MORE LIKELY to
>> see corrupt files than if you try to write them together.
>>
>> And if you write your data _first_, you're never going to see corruption
>> at all.
>
> Amen.
>
> And, personal filesystem pet peeve: please encourage proper FLUSH CACHE
> use to give users the data guarantees they deserve. Linux's sync(2) and
> fsync(2) (and fdatasync, etc.) should poke the block layer to guarantee
> a media write.

I completely agree. This also applies to nfsd_sync, by the way.
What's the right place to implement that?
How about sync_blockdev?

Benny

>
> Jeff
>
>
> P.S. Overall, I am thrilled that this ext3/ext4 transition and
> associated slashdotting has spurred debate over filesystem data
> guarantees. This is the kind of discussion that has needed to happen
> for years, IMO.
>


--
Benny Halevy
Software Architect
Panasas, Inc.
bha...@panasas.com
Tel/Fax: +972-3-647-8340
Mobile: +972-54-802-8340

Panasas: The Leader in Parallel Storage
www.panasas.com

Jens Axboe

unread,
Mar 25, 2009, 5:39:30 AM3/25/09
to Jeff Garzik, Linus Torvalds, Theodore Tso, Ingo Molnar, Alan Cox, Arjan van de Ven, Andrew Morton, Peter Zijlstra, Nick Piggin, David Rees, Jesper Krogh, Linux Kernel Mailing List
On Tue, Mar 24 2009, Jeff Garzik wrote:
> Linus Torvalds wrote:
>> But I really don't understand filesystem people who think that "fsck"
>> is the important part, regardless of whether the data is valid or not.
>> That's just stupid and _obviously_ bogus.
>
> I think I can understand that point of view, at least:
>
> More customers complain about hours-long fsck times than they do about
> silent data corruption of non-fsync'd files.
>
>
>> The point is, if you write your metadata earlier (say, every 5 sec) and
>> the real data later (say, every 30 sec), you're actually MORE LIKELY to
>> see corrupt files than if you try to write them together.
>>
>> And if you write your data _first_, you're never going to see
>> corruption at all.
>
> Amen.
>
> And, personal filesystem pet peeve: please encourage proper FLUSH CACHE
> use to give users the data guarantees they deserve. Linux's sync(2) and
> fsync(2) (and fdatasync, etc.) should poke the block layer to guarantee
> a media write.

fsync already does that, at least if you have barriers enabled on your
drive.

--
Jens Axboe

Herbert Xu

unread,
Mar 25, 2009, 8:10:21 AM3/25/09
to Ingo Molnar, David Miller, r.sch...@pengutronix.de, torv...@linux-foundation.org, blas...@linux.vnet.ibm.com, tg...@linutronix.de, a.p.zi...@chello.nl, linux-...@vger.kernel.org, ker...@pengutronix.de
On Wed, Mar 25, 2009 at 08:33:49AM +0100, Ingo Molnar wrote:
>
> So if there's no definitive resolution for the real cause of the
> bug, the plain revert looks like an acceptable interim choice for
> .29.1 - at least as far as my systems go.

OK, let's just do the revert and disable GRO for the legacy path.
This should be the safest option for 2.6.29.

GRO: Disable GRO on legacy netif_rx path

When I fixed the GRO crash in the legacy receive path I used
napi_complete to replace __napi_complete. Unfortunately they're
not the same when NETPOLL is enabled, which may result in us
not calling __napi_complete at all.

What's more, we really do need to keep the __napi_complete call
within the IRQ-off section since in theory an IRQ can occur in
between and fill up the backlog to the maximum, causing us to
lock up.

Since we can't seem to find a fix that works properly right now,
this patch reverts all the GRO support from the netif_rx path.

Signed-off-by: Herbert Xu <her...@gondor.apana.org.au>

diff --git a/net/core/dev.c b/net/core/dev.c
index e3fe5c7..e438f54 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -2588,18 +2588,15 @@ static int process_backlog(struct napi_struct *napi, int quota)
 		local_irq_disable();
 		skb = __skb_dequeue(&queue->input_pkt_queue);
 		if (!skb) {
+			__napi_complete(napi);
 			local_irq_enable();
-			napi_complete(napi);
-			goto out;
+			break;
 		}
 		local_irq_enable();
 
-		napi_gro_receive(napi, skb);
+		netif_receive_skb(skb);
 	} while (++work < quota && jiffies == start_time);
 
-	napi_gro_flush(napi);
-
-out:
 	return work;
 }

Thanks,

Ingo Molnar

unread,
Mar 25, 2009, 8:21:24 AM3/25/09
to Herbert Xu, David Miller, r.sch...@pengutronix.de, torv...@linux-foundation.org, blas...@linux.vnet.ibm.com, tg...@linutronix.de, a.p.zi...@chello.nl, linux-...@vger.kernel.org, ker...@pengutronix.de

* Herbert Xu <her...@gondor.apana.org.au> wrote:

> On Wed, Mar 25, 2009 at 08:33:49AM +0100, Ingo Molnar wrote:
> >
> > So if there's no definitive resolution for the real cause of the
> > bug, the plain revert looks like an acceptable interim choice for
> > .29.1 - at least as far as my systems go.
>
> OK, let's just do the revert and disable GRO for the legacy path.
> This should be the safest option for 2.6.29.

ok - i have started testing the delta below, on top of the plain
revert.

Ingo

diff --git a/net/core/dev.c b/net/core/dev.c
index c1e9dc0..e438f54 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -2594,11 +2594,9 @@ static int process_backlog(struct napi_struct *napi, int quota)
 		}
 		local_irq_enable();
 
-		napi_gro_receive(napi, skb);
+		netif_receive_skb(skb);
 	} while (++work < quota && jiffies == start_time);
 
-	napi_gro_flush(napi);
-
 	return work;

Herbert Xu

unread,
Mar 25, 2009, 8:27:25 AM3/25/09
to Ingo Molnar, David Miller, r.sch...@pengutronix.de, torv...@linux-foundation.org, blas...@linux.vnet.ibm.com, tg...@linutronix.de, a.p.zi...@chello.nl, linux-...@vger.kernel.org, ker...@pengutronix.de
On Wed, Mar 25, 2009 at 01:20:46PM +0100, Ingo Molnar wrote:
>
> ok - i have started testing the delta below, on top of the plain
> revert.

Thanks! BTW Ingo, any chance you could help us identify the problem
with the previous patch? I don't have a forcedeth machine here
and the hang you had with my patch that open-coded __napi_complete
appears intimately connected to forcedeth (with NAPI enabled).

The simplest thing to try would be to build forcedeth.c with DEBUG
and see what it prints out after it locks up.

Cheers,

Jan Kara

unread,
Mar 25, 2009, 8:38:01 AM3/25/09
to Andrew Morton, Ingo Molnar, Alan Cox, Arjan van de Ven, Peter Zijlstra, Nick Piggin, Theodore Tso, Jens Axboe, David Rees, Jesper Krogh, Linus Torvalds, Linux Kernel Mailing List
On Tue 24-03-09 04:12:49, Andrew Morton wrote:
> On Tue, 24 Mar 2009 11:31:11 +0100 Ingo Molnar <mi...@elte.hu> wrote:
> > The thing is ... this is a _bad_ ext3 design bug affecting ext3
> > users in the last decade or so of ext3 existence. Why is this issue
> > not handled with the utmost high priority and why wasnt it fixed 5
> > years ago already? :-)
> >
> > It does not matter whether we have extents or htrees when there are
> > _trivially reproducible_ basic usability problems with ext3.
> >
>
> It's all there in that Oct 2008 thread.
>
> The proposed tweak to kjournald is a bad fix - partly because it will
> elevate the priority of vast amounts of IO whose priority we don't _want_
> elevated.
>
> But mainly because the problem lies elsewhere - in an area of contention
> between the committing and running transactions which we knowingly and
> reluctantly added to fix a bug in
>
> commit 773fc4c63442fbd8237b4805627f6906143204a8
> Author: akpm <akpm>
> AuthorDate: Sun May 19 23:23:01 2002 +0000
> Commit: akpm <akpm>
> CommitDate: Sun May 19 23:23:01 2002 +0000
>
> [PATCH] fix ext3 buffer-stealing
>
> Patch from sct fixes a long-standing (I did it!) and rather complex
> problem with ext3.
>
> The problem is to do with buffers which are continually being dirtied
> by an external agent. I had code in there (for easily-triggerable
> livelock avoidance) which steals the buffer from checkpoint mode and
> reattaches it to the running transaction. This violates ext3 ordering
> requirements - it can permit journal space to be reclaimed before the
> relevant data has really been written out.
>
> Also, we do have to reliably get a lock on the buffer when moving it
> between lists and inspecting its internal state. Otherwise a competing
> read from the underlying block device can trigger an assertion failure,
> and a competing write to the underlying block device can confuse ext3
> journalling state completely.
I've looked at this a bit. I suppose you mean the contention arising from
us taking the buffer lock in do_get_write_access()? But it's not obvious
to me why we'd be contending there... We call this function only for
metadata buffers (unless in data=journal mode), so there isn't a huge amount
of these blocks. Such a buffer should be locked for a longer time only when
we do writeout for checkpoint (hmm, maybe you meant this one?). In
particular, note that we don't take the buffer lock when committing this
block to journal - we lock only the BJ_IO buffer. But in this case we wait
when the buffer is on BJ_Shadow list later so there is some contention in
this case.
Also, when I emailed with a few people about these sync problems, they
wrote that switching to data=writeback mode helps considerably, so this
would indicate that handling of ordered-mode data buffers is causing most
of the slowdown...

Honza
--
Jan Kara <ja...@suse.cz>
SUSE Labs, CR

Theodore Tso

unread,
Mar 25, 2009, 11:01:55 AM3/25/09
to Jan Kara, Andrew Morton, Ingo Molnar, Alan Cox, Arjan van de Ven, Peter Zijlstra, Nick Piggin, Jens Axboe, David Rees, Jesper Krogh, Linus Torvalds, Linux Kernel Mailing List
On Wed, Mar 25, 2009 at 01:37:44PM +0100, Jan Kara wrote:
> > Also, we do have to reliably get a lock on the buffer when moving it
> > between lists and inspecting its internal state. Otherwise a competing
> > read from the underlying block device can trigger an assertion failure,
> > and a competing write to the underlying block device can confuse ext3
> > journalling state completely.
>
> I've looked at this a bit. I suppose you mean the contention arising from
> us taking the buffer lock in do_get_write_access()? But it's not obvious
> to me why we'd be contending there... We call this function only for
> metadata buffers (unless in data=journal mode), so there isn't a huge amount
> of these blocks.

There isn't a huge number of those blocks, but if inode #1220 was
modified in the previous transaction which is now being committed, and
we then need to modify and write out inode #1221 in the current
transaction, and they share the same inode table block, that would
cause the contention. That probably doesn't happen that often in a
synchronous code path, but it probably happens more often that you're
thinking. I still think the fsync() problem is the much bigger deal,
and solving the contention problem isn't going to solve the fsync()
latency problem with ext3 data=ordered mode.

> Also, when I emailed with a few people about these sync problems, they
> wrote that switching to data=writeback mode helps considerably, so this
> would indicate that handling of ordered-mode data buffers is causing most
> of the slowdown...

Yes, but we need to be clear whether this was an fsync() problem or
some other random delay problem. If it's the fsync() problem,
obviously data=writeback will solve the fsync() latency delay problem.
(As will using delayed allocation in ext4 or XFS.)

- Ted

Bodo Eggert

unread,
Mar 25, 2009, 11:20:17 AM3/25/09
to Theodore Tso, Ingo Molnar, Alan Cox, Arjan van de Ven, Andrew Morton, Peter Zijlstra, Nick Piggin, Jens Axboe, David Rees, Jesper Krogh, Linus Torvalds, Linux Kernel Mailing List
Theodore Tso <ty...@mit.edu> wrote:

> OK, so there are a couple of solutions to this problem. One is to use
> ext4 and delayed allocation. This solves the problem by simply not
> allocating the blocks in the first place, so we don't have to force
> them out to solve the security problem that data=ordered was trying to
> solve. Simply mounting an ext3 filesystem using ext4, without making
> any change to the filesystem format, should solve the problem.

[...]

> However, these days, nearly all Linux boxes are single user machines,
> so the security concern is much less of a problem. So maybe the best
> solution for now is to make data=writeback the default. This solves
> the problem too. The only problem with this is that there are a lot
> of sloppy application writers out there, and they've gotten lazy about
> using fsync() where it's necessary;

The problem is not having accidental data loss because the inode /happened/
to be written before the data, but having /guaranteed/ data loss in a
60-second window. This is about as acceptable as having a filesystem
replace _any_ data with "deadbeef" on each crash unless fsync was called.

Besides that: If the problem is due to crappy VM writeout (is it?), reducing
security to DOS level is not the answer. You'd want your fs to be usable on
servers, wouldn't you?
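(The fsync() discipline being argued over is the usual write-to-temp,
fsync, rename sequence. A sketch, with made-up file names:)

/* On a crash, either the old or the complete new contents survive --
 * never a zero-length or half-written file. */
#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>

static int save_file(const char *path, const char *tmp,
		     const void *data, size_t len)
{
	int fd = open(tmp, O_WRONLY | O_CREAT | O_TRUNC, 0644);

	if (fd < 0)
		return -1;
	if (write(fd, data, len) != (ssize_t)len || fsync(fd) != 0) {
		close(fd);
		unlink(tmp);
		return -1;
	}
	if (close(fd) != 0) {
		unlink(tmp);
		return -1;
	}
	return rename(tmp, path);	/* atomically replaces the old file */
}

int main(void)
{
	static const char msg[] = "some settings\n";

	return save_file("config", "config.tmp", msg, sizeof(msg) - 1) ? 1 : 0;
}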

Linus Torvalds

unread,
Mar 25, 2009, 1:37:39 PM3/25/09
to Theodore Tso, Jan Kara, Andrew Morton, Ingo Molnar, Alan Cox, Arjan van de Ven, Peter Zijlstra, Nick Piggin, Jens Axboe, David Rees, Jesper Krogh, Linux Kernel Mailing List

On Wed, 25 Mar 2009, Theodore Tso wrote:
>
> I still think the fsync() problem is the much bigger deal, and solving
> the contention problem isn't going to solve the fsync() latency problem
> with ext3 data=ordered mode.

The fsync() problem is really annoying, but what is doubly annoying is
that sometimes one process doing fsync() (or sync) seems to cause other
processes to hiccup too.

Now, I personally solved that problem by moving to (good) SSD's on my
desktop, and I think that's indeed the long-term solution. But it would be
good to try to figure out a solution in the short term for people who
don't have new hardware thrown at them from random companies too.

I suspect it's a combination of filesystem transaction locking, together
with the VM wanting to write out some unrelated blocks or inodes due to
the system just being close to the dirty limits, which is why the
system-wide hiccups then happen especially when writing big files.

The VM _tries_ to do writes in the background, but if the writepage() path
hits a filesystem-level blocking lock, that background write suddenly
becomes largely synchronous.

I suspect there is also some possibility of confusion with inter-file
(false) metadata dependencies. If a filesystem were to think that the file
size is metadata that should be journaled (in a single journal), and the
journaling code then decides that it needs to do those meta-data updates
in the correct order (ie the big file write _before_ the file write that
wants to be fsync'ed), then the fsync() will be delayed by a totally
irrelevant large file having to have its data written out (due to
data=ordered or whatever).

I'd like to think that no filesystem designer would ever be that silly,
but I'm too scared to try to actually go and check. Because I could well
imagine that somebody really thought that "size" is metadata.

Linus

Jesper Krogh

unread,
Mar 25, 2009, 1:43:24 PM3/25/09
to David Rees, Linus Torvalds, Linux Kernel Mailing List
David Rees wrote:
> On Tue, Mar 24, 2009 at 12:32 AM, Jesper Krogh <jes...@krogh.cc> wrote:
>> David Rees wrote:
>> The 480 seconds is not the "wait time" but the time gone before the
>> message is printed. The kernel default was earlier 120 seconds, but
>> that was changed by Ingo Molnar back in September. I do get a lot less
>> noise, but it really doesn't tell anything about the nature of the problem.
>>
>> The system's spec:
>> 32GB of memory. The disks are a Nexsan SataBeast with 42 SATA drives in
>> Raid10 connected using 4Gbit fibre-channel. I'll leave it up to you to decide
>> if that's fast or slow.
>
> The drives should be fast enough to saturate 4Gbit FC in streaming
> writes. How fast is the array in practice?

That's always a good question. This is by far not the only user
of the array at the time of testing (there are 4 FC channels connected
to a switch). Creating a fresh slice and just dd'ing onto it from
/dev/zero gives:
jk@hest:~$ sudo dd if=/dev/zero of=/dev/sdh bs=1M count=10000
10000+0 records in
10000+0 records out
10485760000 bytes (10 GB) copied, 78.0557 s, 134 MB/s
jk@hest:~$ sudo dd if=/dev/zero of=/dev/sdh bs=1M count=1000
1000+0 records in
1000+0 records out
1048576000 bytes (1.0 GB) copied, 8.11019 s, 129 MB/s

Watching using dstat while dd'ing it peaks at 220M/s

If I watch the numbers in "dstat" output in production, it peaks at
around the same (130MB/s), but the average is in the 90-100 MB/s range.

It has 2GB of battery backed cache. I'm fairly sure that when it was new
(and I only had connected one host) I could get it up at around 350MB/s.

>> The strange thing is actually that the above process (updatedb.mlocate) is
>> writing to /, which is a device without any activity at all. All activity is
>> on the Fibre Channel device above, but processes writing outside that seem to
>> be affected as well.
>
> Ah. Sounds like your setup would benefit immensely from the per-bdi
> patches from Jens Axobe. I'm sure he would appreciate some feedback
> from users like you on them.
>
>>> What's your vm.dirty_background_ratio and
>>>
>>> vm.dirty_ratio set to?
>> 2.6.29-rc8 defaults:
>> jk@hest:/proc/sys/vm$ cat dirty_background_ratio
>> 5
>> jk@hest:/proc/sys/vm$ cat dirty_ratio
>> 10
>
> On a 32GB system that's 1.6GB of dirty data, but your array should be
> able to write that out fairly quickly (in a couple seconds) as long as
> it's not too random. If it's spread all over the disk, write
> throughput will drop significantly - how fast is data being written to
> disk when your system suffers from large write latency?

That's another thing. I haven't been debugging while hitting it (yet), but
if I go in and do a sync on the system manually, it doesn't get
above 50MB/s in writeout (measured using dstat). But even that doesn't
sum up to 8 minutes .. 1.6GB at 50MB/s => 32 s.

--
Jesper

Alan Cox

unread,
Mar 25, 2009, 1:59:22 PM3/25/09
to Linus Torvalds, Theodore Tso, Jan Kara, Andrew Morton, Ingo Molnar, Arjan van de Ven, Peter Zijlstra, Nick Piggin, Jens Axboe, David Rees, Jesper Krogh, Linux Kernel Mailing List
> The fsync() problem is really annoying, but what is doubly annoying is
> that sometimes one process doing fsync() (or sync) seems to cause other
> processes to hickup too.

Bug #5942 (interaction with anticipatory io scheduler)
Bug #9546 (with reproducer & logs)
Bug #9911 including a rather natty tester (albeit in java)
Bug #7372 (some info and figures on certain revs it seemed to get worse)
Bug #12309 (more info, including kjournald hack fix using ioprio)

General consensus seems to be 2.6.18 is where the manure intersected with
the air impeller

David Rees

unread,
Mar 25, 2009, 2:00:41 PM3/25/09
to Arjan van de Ven, Jesse Barnes, Theodore Tso, Ingo Molnar, Alan Cox, Andrew Morton, Peter Zijlstra, Nick Piggin, Jens Axboe, Jesper Krogh, Linus Torvalds, Linux Kernel Mailing List
On Tue, Mar 24, 2009 at 5:05 PM, Arjan van de Ven <ar...@infradead.org> wrote:
> On Tue, 24 Mar 2009 16:03:53 -0700
> Jesse Barnes <jba...@virtuousgeek.org> wrote:
>> I remember early in the 2.6.x days there was a lot of focus on making
>> interactive performance good, and for a long time it was.  But this
>> I/O problem has been around for a *long* time now... What happened?
>> Do not many people run into this daily?  Do all the filesystem
>> hackers run with special mount options to mitigate the problem?
>
> the people that care use my kernel patch on ext3 ;-)
> (or the userland equivalent tweak in /etc/rc.local)

There's a couple of comments in bug 12309 [1] which confirm that
increasing the priority of kjournald reduces latency significantly
since I posted your tweak there yesterday. I hope to do some testing
today on my systems to see if it helps on them, too.

-Dave

[1] http://bugzilla.kernel.org/show_bug.cgi?id=12309

David Rees

unread,
Mar 25, 2009, 2:10:12 PM3/25/09
to Linus Torvalds, Theodore Tso, Jan Kara, Andrew Morton, Ingo Molnar, Alan Cox, Arjan van de Ven, Peter Zijlstra, Nick Piggin, Jens Axboe, Jesper Krogh, Linux Kernel Mailing List
On Wed, Mar 25, 2009 at 10:29 AM, Linus Torvalds
<torv...@linux-foundation.org> wrote:
> On Wed, 25 Mar 2009, Theodore Tso wrote:
>> I still think the fsync() problem is the much bigger deal, and solving
>> the contention problem isn't going to solve the fsync() latency problem
>> with ext3 data=ordered mode.
>
> The fsync() problem is really annoying, but what is doubly annoying is
> that sometimes one process doing fsync() (or sync) seems to cause other
> processes to hiccup too.
>
> Now, I personally solved that problem by moving to (good) SSD's on my
> desktop, and I think that's indeed the long-term solution. But it would be
> good to try to figure out a solution in the short term for people who
> don't have new hardware thrown at them from random companies too.

Throwing SSDs at it only raises the limit at which it becomes
an issue. They hide the underlying problem and are only a workaround.
Create enough dirty data and you'll get the same latencies, it's just
that that limit is now a lot higher. Your Intel SSD will write
streaming data 2-4 times faster than your typical disk - and can be an
order of magnitude faster when it comes to small, random writes.

> I suspect it's a combination of filesystem transaction locking, together
> with the VM wanting to write out some unrelated blocks or inodes due to
> the system just being close to the dirty limits, which is why the
> system-wide hiccups then happen especially when writing big files.
>
> The VM _tries_ to do writes in the background, but if the writepage() path
> hits a filesystem-level blocking lock, that background write suddenly
> becomes largely synchronous.
>
> I suspect there is also some possibility of confusion with inter-file
> (false) metadata dependencies. If a filesystem were to think that the file
> size is metadata that should be journaled (in a single journal), and the
> journaling code then decides that it needs to do those meta-data updates
> in the correct order (ie the big file write _before_ the file write that
> wants to be fsync'ed), then the fsync() will be delayed by a totally
> irrelevant large file having to have its data written out (due to
> data=ordered or whatever).

It certainly "feels" like that is the case from the workloads I have
that generate high latencies.

-Dave

David Rees

unread,
Mar 25, 2009, 2:17:02 PM3/25/09
to Jesper Krogh, Linus Torvalds, Linux Kernel Mailing List

Hmm, not as fast as I expected.

> It has 2GB of battery backed cache. I'm fairly sure that when it was new
> (and I only had connected one host) I could get it up at around 350MB/s.

With 2GB of BBC, I'm surprised you are seeing as much latency as you
are. It should be able to suck down writes as fast as you can throw
at it. Is the array configured in writeback mode?

>> On a 32GB system that's 1.6GB of dirty data, but your array should be
>> able to write that out fairly quickly (in a couple seconds) as long as
>> it's not too random.  If it's spread all over the disk, write
>> throughput will drop significantly - how fast is data being written to
>> disk when your system suffers from large write latency?
>
> That's another thing. I haven't been debugging while hitting it (yet), but if I
> go in and do a sync on the system manually, it doesn't get above
> 50MB/s in writeout (measured using dstat). But even that doesn't sum up to 8
> minutes .. 1.6GB at 50MB/s => 32 s.

Have you also tried increasing the IO priority of the kjournald
processes as a workaround as Arjan van de Ven suggests?

You must have a significant amount of activity going to that FC array
from other clients - it certainly doesn't seem to be performing as
well as it could/should be.

-Dave

Linus Torvalds

unread,
Mar 25, 2009, 2:30:38 PM3/25/09
to David Rees, Theodore Tso, Jan Kara, Andrew Morton, Ingo Molnar, Alan Cox, Arjan van de Ven, Peter Zijlstra, Nick Piggin, Jens Axboe, Jesper Krogh, Linux Kernel Mailing List

On Wed, 25 Mar 2009, David Rees wrote:
>
> Your Intel SSD will write streaming data 2-4 times faster than your
> typical disk

Don't even bother with streaming data. The problem is _never_ streaming
data.

Even a suck-ass laptop drive can write streaming data fast enough that
people don't care. The problem is invariably that writes from different
sources (much of it being metadata) interact and cause seeking.

> and can be an order of magnitude faster when it comes to small, random
> writes.

Umm. More like two orders of magnitude or more.

Random writes on a disk (even a fast one) tends to be in the hundreds of
kilobytes per second. Have you worked with an Intel SSD? It does tens of
MB/s on pure random writes.

The problem really is gone with an SSD.

And please realize that the problem for me was never 30-second stalls. For
me, a 3-second stall is unacceptable. It's just very annoying.

Linus

Theodore Tso

unread,
Mar 25, 2009, 2:30:51 PM3/25/09
to David Rees, Jesper Krogh, Linus Torvalds, Linux Kernel Mailing List
On Tue, Mar 24, 2009 at 12:00:41PM -0700, David Rees wrote:
> >>> Consensus seems to be something with large memory machines, lots of dirty
> >>> pages and a long writeout time due to ext3.
> >>
> >> All filesystems seem to suffer from this issue to some degree.  I
> >> posted to the list earlier trying to see if there was anything that
> >> could be done to help my specific case.  I've got a system where if
> >> someone starts writing out a large file, it kills client NFS writes.
> >> Makes the system unusable:
> >> http://marc.info/?l=linux-kernel&m=123732127919368&w=2
> >
> > Yes, I've hit 120s+ penalties just by saving a file in vim.
>
> Yeah, your disks aren't keeping up and/or data isn't being written out
> efficiently.

Agreed; we probably will need to get some blktrace outputs to see what
is going on.

> >> Only workaround I've found is to reduce dirty_background_ratio and
> >> dirty_ratio to tiny levels.  Or throw good SSDs and/or a fast RAID
> >> array at it so that large writes complete faster.  Have you tried the
> >> new vm_dirty_bytes in 2.6.29?
> >
> > No.. What would you suggest to be a reasonable setting for that?
>
> Look at whatever is there by default and try cutting them in half to start.

I'm beginning to think that using a "ratio" may be the wrong way to
go. We probably need to add an optional dirty_max_megabytes field
where we start pushing dirty blocks out when the number of dirty
blocks exceeds either the dirty_ratio or the dirty_max_megabytes,
whichever comes first. The problem is that 5% might make sense for a
small machine with only 1G of memory, but it might not make so much
sense if you have 32G of memory.

But the other problem is whether we are issuing the writes in an
efficient way, and that means we need to see what is going on at the
blktrace level as a starting point, and maybe we'll need some
custom-designed trace outputs to see what is going on at the
inode/logical block level, not just at the physical block level.

- Ted

Linus Torvalds

unread,
Mar 25, 2009, 2:34:23 PM3/25/09
to David Rees, Theodore Tso, Jan Kara, Andrew Morton, Ingo Molnar, Alan Cox, Arjan van de Ven, Peter Zijlstra, Nick Piggin, Jens Axboe, Jesper Krogh, Linux Kernel Mailing List

On Wed, 25 Mar 2009, Linus Torvalds wrote:
>
> Even a suck-ass laptop drive can write streaming data fast enough that
> people don't care. The problem is invariably that writes from different
> sources (much of it being metadata) interact and cause seeking.

Actually, not just writes.

The IO priority thing is almost certainly that _reads_ (which get higher
priority by default due to being synchronous) get interspersed with the
writes, and then even if you _could_ be having streaming writes, what you
actually end up with is lots of seeking.

Again, good SSD's don't care. Disks do. It doesn't matter if you have a FC
disk array that can eat 300MB/s when streaming - once you start seeking,
that 300MB/s goes down like a rock. Battery-protected write caches will
help - but not a whole lot when streaming more data than they have RAM.
Basic queuing theory.

Stephen Clark

unread,
Mar 25, 2009, 2:41:51 PM3/25/09
to Arjan van de Ven, Jesse Barnes, Theodore Tso, Ingo Molnar, Alan Cox, Andrew Morton, Peter Zijlstra, Nick Piggin, Jens Axboe, David Rees, Jesper Krogh, Linus Torvalds, Linux Kernel Mailing List
Arjan van de Ven wrote:
> On Tue, 24 Mar 2009 16:03:53 -0700
> Jesse Barnes <jba...@virtuousgeek.org> wrote:
>
>> I remember early in the 2.6.x days there was a lot of focus on making
>> interactive performance good, and for a long time it was. But this
>> I/O problem has been around for a *long* time now... What happened?
>> Do not many people run into this daily? Do all the filesystem
>> hackers run with special mount options to mitigate the problem?
>>
>
> the people that care use my kernel patch on ext3 ;-)
> (or the userland equivalent tweak in /etc/rc.local)
>
>
>
Ok, I'll bite: what is the userland tweak?

--

"They that give up essential liberty to obtain temporary safety,
deserve neither liberty nor safety." (Ben Franklin)

"The course of history shows that as a government grows, liberty
decreases." (Thomas Jefferson)

Linus Torvalds

unread,
Mar 25, 2009, 2:43:34 PM3/25/09
to Theodore Tso, David Rees, Jesper Krogh, Linux Kernel Mailing List

On Wed, 25 Mar 2009, Theodore Tso wrote:
>
> I'm beginning to think that using a "ratio" may be the wrong way to
> go. We probably need to add an optional dirty_max_megabytes field
> where we start pushing dirty blocks out when the number of dirty
> blocks exceeds either the dirty_ratio or the dirty_max_megabytes,
> whichever comes first.

We have that. Except it's called "dirty_bytes" and
"dirty_background_bytes", and it defaults to zero (off).

The problem being that unlike the ratio, there's no sane default value
that you can at least argue is not _entirely_ pointless.

Linus
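
[For reference, a minimal sketch of those knobs; the byte values are
arbitrary examples, not recommendations:

  # Non-zero values here override dirty_ratio/dirty_background_ratio.
  echo $((500*1024*1024)) > /proc/sys/vm/dirty_bytes             # hard limit
  echo $((100*1024*1024)) > /proc/sys/vm/dirty_background_bytes  # background writeout threshold
]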

Jesper Krogh

unread,
Mar 25, 2009, 2:46:38 PM3/25/09
to David Rees, Linus Torvalds, Linux Kernel Mailing List
David Rees wrote:
>>> writes. How fast is the array in practice?
>> That's always a good question... This is by far not the only user
>> of the array at the time of testing (there are 4 FC channels connected to a
>> switch). Creating a fresh slice and just dd'ing onto it from /dev/zero
>> gives:
>> jk@hest:~$ sudo dd if=/dev/zero of=/dev/sdh bs=1M count=10000
>> 10000+0 records in
>> 10000+0 records out
>> 10485760000 bytes (10 GB) copied, 78.0557 s, 134 MB/s
>> jk@hest:~$ sudo dd if=/dev/zero of=/dev/sdh bs=1M count=1000
>> 1000+0 records in
>> 1000+0 records out
>> 1048576000 bytes (1.0 GB) copied, 8.11019 s, 129 MB/s
>>
>> Watching with dstat while dd'ing, it peaks at 220MB/s
>
> Hmm, not as fast as I expected.

Me neither, but I always get disappointed.

>> It has 2GB of battery backed cache. I'm fairly sure that when it was new
>> (and I only had connected one host) I could get it up at around 350MB/s.
>
> With 2GB of BBC, I'm surprised you are seeing as much latency as you
> are. It should be able to suck down writes as fast as you can throw
> at it. Is the array configured in writeback mode?

Yes, but I triple-checked... the memory upgrade hadn't been installed, so
it's actually only 512MB.

>
>>> On a 32GB system that's 1.6GB of dirty data, but your array should be
>>> able to write that out fairly quickly (in a couple seconds) as long as
>>> it's not too random. If it's spread all over the disk, write
>>> throughput will drop significantly - how fast is data being written to
>>> disk when your system suffers from large write latency?
>> That's another thing. I haven't been debugging while hitting it (yet), but if I
>> go in and do a sync on the system manually, it doesn't get above
>> 50MB/s in writeout (measured using dstat). But even that doesn't add up to 8
>> minutes... 1.6GB at 50MB/s => 32 s.
>
> Have you also tried increasing the IO priority of the kjournald
> processes as a workaround as Arjan van de Ven suggests?

No. I'll try to slip that one in.

--
Jesper
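
[The workaround being discussed is presumably along these lines; the
exact I/O class and level are a guess at Arjan's /etc/rc.local tweak:

  # Put the ext3 journal threads into the realtime I/O class (root only),
  # so journal commits are not starved by competing reads.
  for pid in $(pidof kjournald); do
      ionice -c1 -n0 -p $pid
  done
]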

Ric Wheeler

unread,
Mar 25, 2009, 2:51:30 PM3/25/09
to Linus Torvalds, David Rees, Theodore Tso, Jan Kara, Andrew Morton, Ingo Molnar, Alan Cox, Arjan van de Ven, Peter Zijlstra, Nick Piggin, Jens Axboe, Jesper Krogh, Linux Kernel Mailing List
Linus Torvalds wrote:
> On Wed, 25 Mar 2009, Linus Torvalds wrote:
>
>> Even a suck-ass laptop drive can write streaming data fast enough that
>> people don't care. The problem is invariably that writes from different
>> sources (much of it being metadata) interact and cause seeking.
>>
>
> Actually, not just writes.
>
> The IO priority thing is almost certainly that _reads_ (which get higher
> priority by default due to being synchronous) get interspersed with the
> writes, and then even if you _could_ be having streaming writes, what you
> actually end up with is lots of seeking.
>
> Again, good SSD's don't care. Disks do. It doesn't matter if you have a FC
> disk array that can eat 300MB/s when streaming - once you start seeking,
> that 300MB/s goes down like a rock. Battery-protected write caches will
> help - but not a whole lot when streaming more data than they have RAM.
> Basic queuing theory.
>
> Linus
>

This is actually not really true - random writes to an enterprise disk
array will make your Intel SSD look slow. Effectively, they are
extremely large, battery-backed banks of DRAM; some of the bigger ones
have several hundred GB of DRAM and dozens of Fibre Channel ports to
feed them.

Of course, if your random writes exceed the cache capacity and you fall
back to their internal disks (SSD or traditional), your random write
speed will drop.

Ric

Ric Wheeler

unread,
Mar 25, 2009, 2:58:03 PM3/25/09
to Alan Cox, Linus Torvalds, David Rees, Theodore Tso, Jan Kara, Andrew Morton, Ingo Molnar, Arjan van de Ven, Peter Zijlstra, Nick Piggin, Jens Axboe, Jesper Krogh, Linux Kernel Mailing List
Alan Cox wrote:
>> Again, good SSD's don't care. Disks do. It doesn't matter if you have a FC
>> disk array that can eat 300MB/s when streaming - once you start seeking,
>> that 300MB/s goes down like a rock. Battery-protected write caches will
>> help - but not a whole lot when streaming more data than they have RAM.
>> Basic queuing theory.
>>
>
> Subtly more complex than that. If your mashed-up I/O streams fit into the
> 2GB or so of cache (minus one stream to disk), you win. You also win
> because you take a lot of fragmented OS I/O and turn it into bigger,
> better-scheduled chunks of writing. The latter win arguably shouldn't
> happen, but it does occur (I guess in part that says we suck), and it
> occurs big time when you've got multiple accessors to a shared storage
> system (where the host OSes can't help).
>
> Alan
>

The other thing that can impact random writes on arrays is their
internal "track" size - if the random write is of a partial track, it
forces a read-modify-write with a back end disk read. Some arrays have
large internal tracks, others have smaller ones.

Again, not unlike what you see with some SSD's and their erase block
size - give them even multiples of that and they are quite happy.

Ric
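
[One way to observe this from userspace, assuming GNU dd with O_DIRECT
support; the device name is a placeholder and the commands overwrite it:

  # Same total volume, large vs. small direct writes; O_DIRECT bypasses
  # the page cache so the device sees the I/O sizes as issued.
  dd if=/dev/zero of=/dev/sdX bs=1M count=1000   oflag=direct
  dd if=/dev/zero of=/dev/sdX bs=4k count=256000 oflag=direct
]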

Theodore Tso

unread,
Mar 25, 2009, 2:59:36 PM3/25/09
to Linus Torvalds, Jan Kara, Andrew Morton, Ingo Molnar, Alan Cox, Arjan van de Ven, Peter Zijlstra, Nick Piggin, Jens Axboe, David Rees, Jesper Krogh, Linux Kernel Mailing List
On Wed, Mar 25, 2009 at 10:29:48AM -0700, Linus Torvalds wrote:
> I suspect there is also some possibility of confusion with inter-file
> (false) metadata dependencies. If a filesystem were to think that the file
> size is metadata that should be journaled (in a single journal), and the
> journaling code then decides that it needs to do those meta-data updates
> in the correct order (ie the big file write _before_ the file write that
> wants to be fsync'ed), then the fsync() will be delayed by a totally
> irrelevant large file having to have its data written out (due to
> data=ordered or whatever).

It's not just the file size; it's the block allocation decisions.
Ext3 doesn't have delayed allocation, so as soon as you issue the
write, we have to allocate the block, which means grabbing blocks and
making changes to the block bitmap, and then updating the inode with
those block allocation decisions. It's a lot more than just i_size.
And the problem is that if we do this for the big file write, and the
small file write happens to also touch the same inode table block
and/or block allocation bitmap, then when we fsync() the small file,
we end up pushing out the metadata updates associated with the big
file write, and thus we need to flush out the data blocks
associated with the big file write as well.

Now, there are three ways of solving this problem. One is to use
delayed allocation, where we don't make the block allocation decisions
until the very last minute. This is what ext4 and XFS do. The
problem with this arises when we have unrelated filesystem operations
that end up causing zero-length files, either before the file write (i.e.,
replace-via-truncate, where the application does open/truncate/write/
close) or after the file write (i.e., replace-via-rename, where
the application does open/write/close/rename), and the application
omits the fsync(). So with ext4 we have workarounds that start pushing
out the data blocks in the replace-via-rename and
replace-via-truncate cases, while XFS will do an implied fsync for
replace-via-truncate only, and btrfs will do an implied fsync for
replace-via-rename only.

The second solution is we could add a huge amount of machinery to try
track these logical dependencies, and then be able to "back out" the
changes to the inode table or block allocation bitmap for the big file
write when we want to fsync out the small file. This is roughly what
the BSD Soft Updates mechanisms does, and it works, but at the cost of
a *huge* amount of complexity. The amount of accounting data you have
to track so that you can partially back out various filesystem
operations, and then the state tables that make use of this accounting
data is not trivial. One of the downsides of this mechanism is that
it makes it extremely difficult to add new features/functionality such
as extended attributes or ACL's, since very few people understand the
complexities needed to support it. As a result Linux had ACL and
xattr support long before Kirk McKusick got around to adding those
features in UFS2.

The third potential solution is to make some tuning
adjustments to the VM so that we start pushing these data blocks
out to the disk much more aggressively. If we assume that many
applications aren't going to be using fsync, and we need to worry
about all sorts of implied dependencies where a small file gets pushed
out to disk, but a large file does not, you can have endless amounts
of fun in terms of "application level file corruption", which is
simply caused by the fact that a small file has been pushed out to
disk, and a large file hasn't been pushed out to disk yet. If it's
going to be considered fair game that application programmers aren't
going to be required to use fsync() when they need to depend on
something being on stable storage after a crash, then we need to tune
the VM to much more aggressively clean dirty pages. Even if we remove
the false dependencies at the filesystem level (i.e., fsck-detectable
consistency problems), there is no way for the filesystem to be able
to guess about implied dependencies between different files at the
application level.

Traditionally, the way applications told us about such dependencies
was fsync(). But if application programmers are demanding that
fsync() is no longer required for correct operation after a filesystem
crash, all we can do is push things out to disk much more
aggressively.

- Ted
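
[The pattern Ted is referring to is open/write/fsync/close/rename; a
shell sketch of the same idea, using GNU dd's conv=fsync and example
filenames:

  # Write the new contents to a temp file, force them to stable storage,
  # then atomically rename over the original (same filesystem).
  dd if=new-contents of=file.tmp conv=fsync
  mv file.tmp file
]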

Jeff Garzik

unread,
Mar 25, 2009, 3:33:33 PM3/25/09
to Jens Axboe, Linus Torvalds, Theodore Tso, Ingo Molnar, Alan Cox, Arjan van de Ven, Andrew Morton, Peter Zijlstra, Nick Piggin, David Rees, Jesper Krogh, Linux Kernel Mailing List
Jens Axboe wrote:
> On Tue, Mar 24 2009, Jeff Garzik wrote:
>> Linus Torvalds wrote:
>>> But I really don't understand filesystem people who think that "fsck"
>>> is the important part, regardless of whether the data is valid or not.
>>> That's just stupid and _obviously_ bogus.
>> I think I can understand that point of view, at least:
>>
>> More customers complain about hours-long fsck times than they do about
>> silent data corruption of non-fsync'd files.
>>
>>
>>> The point is, if you write your metadata earlier (say, every 5 sec) and
>>> the real data later (say, every 30 sec), you're actually MORE LIKELY to
>>> see corrupt files than if you try to write them together.
>>>
>>> And if you write your data _first_, you're never going to see
>>> corruption at all.
>> Amen.
>>
>> And, personal filesystem pet peeve: please encourage proper FLUSH CACHE
>> use to give users the data guarantees they deserve. Linux's sync(2) and
>> fsync(2) (and fdatasync, etc.) should poke the block layer to guarantee
>> a media write.
>
> fsync already does that, at least if you have barriers enabled on your
> drive.

Erm, no, you don't enable barriers on your drive, they are not a
hardware feature. You enable barriers via your filesystem.

Stating "fsync already does that" borders on false, because that assumes
(a) the user has a fs that supports barriers
(b) the user is actually aware of a 'barriers' mount option and what it
means
(c) the user has turned on an option normally defaulted to off.

Or in other words, it pretty much never happens.

Furthermore, a blatantly obvious place to flush data to media --
fsync(2), fdatasync(2) and sync_file_range(2) -- should cause the block
layer to issue a FLUSH CACHE for __any__ filesystem. But that doesn't
happen either.

So, no, for 95% of Linux users, fsync does _not_ already do that. If
you are lucky enough to use XFS or ext4, you're covered. That's it.

Jeff
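
[Concretely, the per-filesystem knobs in question, as of kernels of this
era; defaults vary by distribution:

  mount -o remount,barrier=1 /          # ext3: barriers off by default in mainline
  mount -o remount,barrier=flush /mnt   # reiserfs
  # ext4 defaults to barrier=1; XFS defaults to barriers on (nobarrier disables)
]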

Jens Axboe

unread,
Mar 25, 2009, 3:44:03 PM3/25/09
to Jeff Garzik, Linus Torvalds, Theodore Tso, Ingo Molnar, Alan Cox, Arjan van de Ven, Andrew Morton, Peter Zijlstra, Nick Piggin, David Rees, Jesper Krogh, Linux Kernel Mailing List
On Wed, Mar 25 2009, Jeff Garzik wrote:
> Jens Axboe wrote:
>> On Tue, Mar 24 2009, Jeff Garzik wrote:
>>> Linus Torvalds wrote:
>>>> But I really don't understand filesystem people who think that
>>>> "fsck" is the important part, regardless of whether the data is
>>>> valid or not. That's just stupid and _obviously_ bogus.
>>> I think I can understand that point of view, at least:
>>>
>>> More customers complain about hours-long fsck times than they do
>>> about silent data corruption of non-fsync'd files.
>>>
>>>
>>>> The point is, if you write your metadata earlier (say, every 5 sec)
>>>> and the real data later (say, every 30 sec), you're actually MORE
>>>> LIKELY to see corrupt files than if you try to write them together.
>>>>
>>>> And if you write your data _first_, you're never going to see
>>>> corruption at all.
>>> Amen.
>>>
>>> And, personal filesystem pet peeve: please encourage proper FLUSH
>>> CACHE use to give users the data guarantees they deserve. Linux's
>>> sync(2) and fsync(2) (and fdatasync, etc.) should poke the block
>>> layer to guarantee a media write.
>>
>> fsync already does that, at least if you have barriers enabled on your
>> drive.
>
> Erm, no, you don't enable barriers on your drive, they are not a
> hardware feature. You enable barriers via your filesystem.

Thanks for the lesson, Jeff; I'm obviously not aware of how that stuff
works...

> Stating "fsync already does that" borders on false, because that assumes
> (a) the user has a fs that supports barriers
> (b) the user is actually aware of a 'barriers' mount option and what it
> means
> (c) the user has turned on an option normally defaulted to off.
>
> Or in other words, it pretty much never happens.

That is true, except if you use xfs/ext4. And this discussion is fine,
as was the one a few months back that got ext4 to enable barriers by
default. If I had submitted patches to do that back in 2001/2 when the
barrier stuff was written, I would have been shot for introducing such a
slowdown. After people found out that it wasn't just something silly,
you then had a way to enable it.

I'd still wager that most people would rather have a 'good enough
fsync' on their desktops than incur the penalty of barriers or
write-through caching. I know I do.

> Furthermore, a blatantly obvious place to flush data to media --
> fsync(2), fdatasync(2) and sync_file_range(2) -- should cause the block
> layer to issue a FLUSH CACHE for __any__ filesystem. But that doesn't
> happen either.
>
> So, no, for 95% of Linux users, fsync does _not_ already do that. If
> you are lucky enough to use XFS or ext4, you're covered. That's it.

The point is that you need to expose this choice somewhere, and that
'somewhere' isn't manually editing fstab and enabling barriers or
fsync-for-real. And it should be easier.

Another problem is that FLUSH_CACHE sucks. Really. And not just on
ext3/ordered, generally. Write a 50-byte file, fsync, flush the cache, and
wait for the world to finish. Pretty hard to teach people to use a nicer
fdatasync(), when the majority of the cost now becomes flushing the
cache of that 1TB drive you happen to have 8 partitions on. Good luck
with that.

--
Jens Axboe
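
[The pathology Jens describes is easy to reproduce with GNU dd; the
filename is an example:

  # 50 bytes written, then fsync: with barriers on, the commit forces a
  # full drive cache flush, so the tiny write waits for everything else.
  time dd if=/dev/zero of=tiny.dat bs=50 count=1 conv=fsync
]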

Christoph Hellwig

unread,
Mar 25, 2009, 3:44:41 PM3/25/09
to Jeff Garzik, Jens Axboe, Linus Torvalds, Theodore Tso, Ingo Molnar, Alan Cox, Arjan van de Ven, Andrew Morton, Peter Zijlstra, Nick Piggin, David Rees, Jesper Krogh, Linux Kernel Mailing List
On Wed, Mar 25, 2009 at 03:32:13PM -0400, Jeff Garzik wrote:
> So, no, for 95% of Linux users, fsync does _not_ already do that. If
> you are lucky enough to use XFS or ext4, you're covered. That's it.

reiserfs also does the correct thing, as does ext3 on SUSE kernels.

Christoph Hellwig

unread,
Mar 25, 2009, 3:50:00 PM3/25/09
to Theodore Tso, Linus Torvalds, Jan Kara, Andrew Morton, Ingo Molnar, Alan Cox, Arjan van de Ven, Peter Zijlstra, Nick Piggin, Jens Axboe, David Rees, Jesper Krogh, Linux Kernel Mailing List
On Wed, Mar 25, 2009 at 02:58:24PM -0400, Theodore Tso wrote:
> omits the fsync(). So with ext4 we have workarounds that start pushing
> out the data blocks in the replace-via-rename and
> replace-via-truncate cases, while XFS will do an implied fsync for
> replace-via-truncate only, and btrfs will do an implied fsync for
> replace-via-rename only.

The XFS one and the ext4 one that I saw only start an _asynchronous_
writeout, which is not an implied fsync but snake oil to make the
most common complaints go away without providing hard guarantees.

IFF we want to go down this route, we had better provide strongly
guaranteed semantics and document the property. And of course implement
it consistently on all native filesystems.

> Traditionally, the way applications told us about such dependencies
> was fsync(). But if application programmers are demanding that
> fsync() is no longer required for correct operation after a filesystem
> crash, all we can do is push things out to disk much more
> aggressively.

Note that the rename-for-atomic-commits trick originated in mail servers,
which always did the proper fsync. When the word spread into the
desktop world, it looks like this wisdom got lost.

Ric Wheeler

unread,
Mar 25, 2009, 3:53:39 PM3/25/09
to Jens Axboe, Jeff Garzik, Linus Torvalds, Theodore Tso, Ingo Molnar, Alan Cox, Arjan van de Ven, Andrew Morton, Peter Zijlstra, Nick Piggin, David Rees, Jesper Krogh, Linux Kernel Mailing List
And, as I am sure that you do know, to add insult to injury, FLUSH_CACHE
is per device (not file system).

When you issue an fsync() on a disk with multiple partitions, you will
flush the data for all of its partitions from the write cache....

ric

Jens Axboe

unread,
Mar 25, 2009, 3:58:08 PM3/25/09
to Ric Wheeler, Jeff Garzik, Linus Torvalds, Theodore Tso, Ingo Molnar, Alan Cox, Arjan van de Ven, Andrew Morton, Peter Zijlstra, Nick Piggin, David Rees, Jesper Krogh, Linux Kernel Mailing List

Exactly, that's what my (vague) 8 partition reference was for :-)
A range flush would be so much more palatable.

--
Jens Axboe

Jeff Garzik

unread,
Mar 25, 2009, 4:17:36 PM3/25/09
to Ric Wheeler, Jens Axboe, Linus Torvalds, Theodore Tso, Ingo Molnar, Alan Cox, Arjan van de Ven, Andrew Morton, Peter Zijlstra, Nick Piggin, David Rees, Jesper Krogh, Linux Kernel Mailing List
Ric Wheeler wrote:
> And, as I am sure that you do know, to add insult to injury, FLUSH_CACHE
> is per device (not file system).
>
> When you issue an fsync() on a disk with multiple partitions, you will
> flush the data for all of its partitions from the write cache....

SCSI's SYNCHRONIZE CACHE command already accepts an (LBA, length) pair.
We could make use of that.

And I bet we could convince T13 to add FLUSH CACHE RANGE, if we could
demonstrate clear benefit.

Jeff
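
[For experimenting, sg3_utils can already issue the ranged form from
userspace; the LBA, block count, and device are placeholders:

  # SYNCHRONIZE CACHE (10) over an explicit LBA range via SG_IO.
  sg_sync --lba=0 --count=2048 /dev/sdb
]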

Jeff Garzik

unread,
Mar 25, 2009, 4:26:41 PM3/25/09
to Linus Torvalds, Theodore Tso, Ingo Molnar, Alan Cox, Arjan van de Ven, Andrew Morton, Peter Zijlstra, Nick Piggin, David Rees, Jesper Krogh, Linux Kernel Mailing List
Jens Axboe wrote:
> On Wed, Mar 25 2009, Jeff Garzik wrote:
>> Stating "fsync already does that" borders on false, because that assumes
>> (a) the user has a fs that supports barriers
>> (b) the user is actually aware of a 'barriers' mount option and what it
>> means
>> (c) the user has turned on an option normally defaulted to off.
>>
>> Or in other words, it pretty much never happens.
>
> That is true, except if you use xfs/ext4. And this discussion is fine,
> as was the one a few months back that got ext4 to enable barriers by
> default. If I had submitted patches to do that back in 2001/2 when the
> barrier stuff was written, I would have been shot for introducing such a
> slow down. After people found out that it just wasn't something silly,
> then you have a way to enable it.
>
> I'd still wager that most people would rather have a 'good enough
> fsync' on their desktops than incur the penalty of barriers or
> write-through caching. I know I do.

That's a strawman argument: The choice is not between "good enough
fsync" and full use of barriers / write-through caching, at all.

It is clearly possible to implement an fsync(2) that causes FLUSH CACHE
to be issued, without adding full barrier support to a filesystem. It
is likely doable to avoid touching per-filesystem code at all, if we
issue the flush from a generic fsync(2) code path in the kernel.

Thus, you have a "third way": fsync(2) gives the guarantee it is
supposed to, but you do not take the full performance hit of
barriers-all-the-time.

Remember, fsync(2) means that the user _expects_ a performance hit.

And they took the extra step to call fsync(2) because they want a
guarantee, not a lie.

Jeff

Ric Wheeler

unread,
Mar 25, 2009, 4:28:27 PM3/25/09
to Jeff Garzik, James Bottomley, Ric Wheeler, Jens Axboe, Linus Torvalds, Theodore Tso, Ingo Molnar, Alan Cox, Arjan van de Ven, Andrew Morton, Peter Zijlstra, Nick Piggin, David Rees, Jesper Krogh, Linux Kernel Mailing List
Jeff Garzik wrote:
> Ric Wheeler wrote:
>> And, as I am sure that you do know, to add insult to injury, FLUSH_CACHE
>> is per device (not file system).
>>
>> When you issue an fsync() on a disk with multiple partitions, you
>> will flush the data for all of its partitions from the write cache....
>
> SCSI'S SYNCHRONIZE CACHE command already accepts an (LBA, length)
> pair. We could make use of that.
>
> And I bet we could convince T13 to add FLUSH CACHE RANGE, if we could
> demonstrate clear benefit.
>
> Jeff

How well supported is this in SCSI? Can we try it out with a commodity
SAS drive?

Ric

Hugh Dickins

unread,
Mar 25, 2009, 4:43:57 PM3/25/09
to Jens Axboe, Ric Wheeler, Jeff Garzik, Linus Torvalds, Theodore Tso, Ingo Molnar, Alan Cox, Arjan van de Ven, Andrew Morton, Peter Zijlstra, Nick Piggin, David Rees, Jesper Krogh, Linux Kernel Mailing List

Tangential question, but am I right in thinking that BIO_RW_BARRIER
similarly bars across all partitions, whereas its WRITE_BARRIER and
DISCARD_BARRIER users would actually prefer it to apply to just one?

Hugh
