Poor interactive performance with I/O loads with fsync()ing

Ben Gamari

unread,

Mar 16, 2010, 11:40:02 AM3/16/10

to

Hey all,

Recently I started using the Xapian-based notmuch mail client for everyday
use. One of the things I was quite surprised by after the switch was the
incredible hit in interactive performance that is observed during database
updates. Things are particularly bad during runs of 'notmuch new,' which scans
the file system looking for new messages and adds them to the database.
Specifically, the worst of the performance hit appears to occur when the
database is being updated.

During these periods, even small chunks of I/O can become minute-long ordeals.
It is common for latencytop to show 30 second long latencies for page faults
and writing pages. Interactive performance is absolutely abysmal, with other
unrelated processes feeling horrible latencies, causing media players,
editors, and even terminals to grind to a halt.

Despite the system being clearly I/O bound, iostat shows pitiful disk
throughput (700kByte/second read, 300 kByte/second write). Certainly this poor
performance can, at least to some degree, be attributable to the fact that
Xapian uses fdatasync() to ensure data consistency. That being said, it seems
like Xapian's page usage causes horrible thrashing, hence the performance hit
on unrelated processes. Moreover, the hit on unrelated processes is so bad
that I would almost suspect that swap I/O is being serialized by fsync() as
well, despite being on a separate swap partition beyond the control of the
filesystem.

Xapian, however, is far from the first time I have seen this sort of
performance cliff. Rsync, which also uses fsync(), can also trigger this sort
of thrashing during system backups, as can rdiff. slocate's updatedb
absolutely kills interactive performance as well.

Issues similar to this have been widely reported[1-5] in the past, and despite
many attempts[5-8] within both I/O and memory managements subsystems to fix
it, the problem certainly remains. I have tried reducing swappiness from 60 to
40, with some small improvement and it has been reported[20] that these sorts
of symptoms can be negated through use of memory control groups to prevent
interactive process pages from being evicted.

I would really like to see this issue finally fixed. I have tried
several[2][3] times to organize the known data about this bug, although in all
cases discussion has stopped with claims of insufficient data (which is fair,
admittedly, it's a very difficult issue to tackle). However, I do think that
_something_ has to be done to alleviate the thrashing and poor interactive
performance that these work-loads cause.

Thanks,

- Ben

[1] http://bugzilla.kernel.org/show_bug.cgi?id=5900
[2] http://bugzilla.kernel.org/show_bug.cgi?id=7372
[3] http://bugzilla.kernel.org/show_bug.cgi?id=12309
[4] http://lkml.org/lkml/2009/4/28/24
[5] http://lkml.org/lkml/2009/3/26/72
[6] http://notmuchmail.org/pipermail/notmuch/2010/001868.html

[10] http://lkml.org/lkml/2009/5/16/225
[11] http://lkml.org/lkml/2007/7/21/219
[12] http://lwn.net/Articles/328363/
[13] http://lkml.org/lkml/2009/4/6/114

[20] http://lkml.org/lkml/2009/4/28/68

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majo...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

ty...@mit.edu

unread,

Mar 16, 2010, 9:30:02 PM3/16/10

to

On Tue, Mar 16, 2010 at 08:31:12AM -0700, Ben Gamari wrote:
> Hey all,
>
> Recently I started using the Xapian-based notmuch mail client for everyday
> use. One of the things I was quite surprised by after the switch was the
> incredible hit in interactive performance that is observed during database
> updates. Things are particularly bad during runs of 'notmuch new,' which scans
> the file system looking for new messages and adds them to the database.
> Specifically, the worst of the performance hit appears to occur when the
> database is being updated.

What kernel version are you using; what distribution and what version
of that distro are you running; what file system are you using and
what if any mount options are you using? And what kind of hard drives
do you have?

I'm going to assume you're running into the standard ext3
"data=ordered" entagled writes problem. There are solutions, such as
switching to using ext4, mounting with data=writeback mode, but they
have various shortcomings.

A number of improvements have been made in ext3 and ext4 since some of
the discussions you quoted, but since you didn't tell us what
distribution version and/or what kernel version you are using, we
can't tell you are using those newer improvements yet.

- Ted

Ben Gamari

unread,

Mar 16, 2010, 11:20:01 PM3/16/10

to

Sorry about the lack of any useful information in my initial email.
I clearly didn't read it before sending.

On Tue, 16 Mar 2010 21:24:39 -0400, ty...@mit.edu wrote:
> What kernel version are you using; what distribution and what version
> of that distro are you running; what file system are you using and
> what if any mount options are you using? And what kind of hard drives
> do you have?

While this problem has been around for some time, my current configuration
is the following:

Kernel 2.6.32 (although also reproducible with kernels at least as early as 2.6.28)
Filesystem: Now Btrfs (was ext4 less than a week ago), default mount options
Hard drive: Seagate Momentus 7200.4 (ST9500420AS)
Distribution: Ubuntu 9.10 (Karmic)

>
> I'm going to assume you're running into the standard ext3
> "data=ordered" entagled writes problem. There are solutions, such as
> switching to using ext4, mounting with data=writeback mode, but they
> have various shortcomings.
>

Unfortunately several people have continued to encounter unacceptable
latency, even with ext4 and data=writeback.

> A number of improvements have been made in ext3 and ext4 since some of
> the discussions you quoted, but since you didn't tell us what
> distribution version and/or what kernel version you are using, we
> can't tell you are using those newer improvements yet.
>

Sorry about that. I should know better by now.

- Ben

ty...@mit.edu

unread,

Mar 16, 2010, 11:40:02 PM3/16/10

to

On Tue, Mar 16, 2010 at 08:18:09PM -0700, Ben Gamari wrote:
> Sorry about the lack of any useful information in my initial email.
> I clearly didn't read it before sending.
>
> On Tue, 16 Mar 2010 21:24:39 -0400, ty...@mit.edu wrote:
> > What kernel version are you using; what distribution and what version
> > of that distro are you running; what file system are you using and
> > what if any mount options are you using? And what kind of hard drives
> > do you have?
>
> While this problem has been around for some time, my current configuration
> is the following:
>
> Kernel 2.6.32 (although also reproducible with kernels at least as early as 2.6.28)
> Filesystem: Now Btrfs (was ext4 less than a week ago), default mount options
> Hard drive: Seagate Momentus 7200.4 (ST9500420AS)
> Distribution: Ubuntu 9.10 (Karmic)

.... so did switching to Btrfs solve your latency issues, or are you
still having problems?

- Ted

Ben Gamari

unread,

Mar 17, 2010, 12:40:01 AM3/17/10

to

On Tue, 16 Mar 2010 23:30:10 -0400, ty...@mit.edu wrote:
> .... so did switching to Btrfs solve your latency issues, or are you
> still having problems?

Still having troubles although I'm now running 2.6.34-rc1 and things seem
mildly better. I'll try doing a backup tonight and report back.

- Ben

Nick Piggin

unread,

Mar 17, 2010, 1:00:02 AM3/17/10

to

Hi,

On Tue, Mar 16, 2010 at 08:31:12AM -0700, Ben Gamari wrote:

> Hey all,
>
> Recently I started using the Xapian-based notmuch mail client for everyday
> use. One of the things I was quite surprised by after the switch was the
> incredible hit in interactive performance that is observed during database
> updates. Things are particularly bad during runs of 'notmuch new,' which scans
> the file system looking for new messages and adds them to the database.
> Specifically, the worst of the performance hit appears to occur when the
> database is being updated.
>
> During these periods, even small chunks of I/O can become minute-long ordeals.
> It is common for latencytop to show 30 second long latencies for page faults
> and writing pages. Interactive performance is absolutely abysmal, with other
> unrelated processes feeling horrible latencies, causing media players,
> editors, and even terminals to grind to a halt.
>
> Despite the system being clearly I/O bound, iostat shows pitiful disk
> throughput (700kByte/second read, 300 kByte/second write). Certainly this poor
> performance can, at least to some degree, be attributable to the fact that
> Xapian uses fdatasync() to ensure data consistency. That being said, it seems
> like Xapian's page usage causes horrible thrashing, hence the performance hit
> on unrelated processes.

Where are the unrelated processes waiting? Can you get a sample of
several backtraces? (/proc/<pid>/stack should do it)

> Moreover, the hit on unrelated processes is so bad
> that I would almost suspect that swap I/O is being serialized by fsync() as
> well, despite being on a separate swap partition beyond the control of the
> filesystem.

It shouldn't be, until it reaches the bio layer. If it is on the same
block device, it will still fight for access. It could also be blocking
on dirty data thresholds, or page reclaim though -- writeback and
reclaim could easily be getting slowed down by the fsync activity.

Swapping tends to cause fairly nasty disk access patterns, combined with
fsync it could be pretty unavoidable.

>
> Xapian, however, is far from the first time I have seen this sort of
> performance cliff. Rsync, which also uses fsync(), can also trigger this sort
> of thrashing during system backups, as can rdiff. slocate's updatedb
> absolutely kills interactive performance as well.
>
> Issues similar to this have been widely reported[1-5] in the past, and despite
> many attempts[5-8] within both I/O and memory managements subsystems to fix
> it, the problem certainly remains. I have tried reducing swappiness from 60 to
> 40, with some small improvement and it has been reported[20] that these sorts
> of symptoms can be negated through use of memory control groups to prevent
> interactive process pages from being evicted.

So the workload is causing quite a lot of swapping as well? How much
pagecache do you have? It could be that you have too much pagecache and
it is pushing out anonymous memory too easily, or you might have too
little pagecache causing suboptimal writeout patterns (possibly writeout
from page reclaim rather than asynchronous dirty page cleaner threads,
which can really hurt).

Thanks,
Nick

Ingo Molnar

unread,

Mar 17, 2010, 5:40:04 AM3/17/10

to

* Nick Piggin <npi...@suse.de> wrote:

> Hi,
>
> On Tue, Mar 16, 2010 at 08:31:12AM -0700, Ben Gamari wrote:
> > Hey all,
> >
> > Recently I started using the Xapian-based notmuch mail client for everyday
> > use. One of the things I was quite surprised by after the switch was the
> > incredible hit in interactive performance that is observed during database
> > updates. Things are particularly bad during runs of 'notmuch new,' which scans
> > the file system looking for new messages and adds them to the database.
> > Specifically, the worst of the performance hit appears to occur when the
> > database is being updated.
> >
> > During these periods, even small chunks of I/O can become minute-long ordeals.
> > It is common for latencytop to show 30 second long latencies for page faults
> > and writing pages. Interactive performance is absolutely abysmal, with other
> > unrelated processes feeling horrible latencies, causing media players,
> > editors, and even terminals to grind to a halt.
> >
> > Despite the system being clearly I/O bound, iostat shows pitiful disk
> > throughput (700kByte/second read, 300 kByte/second write). Certainly this poor
> > performance can, at least to some degree, be attributable to the fact that
> > Xapian uses fdatasync() to ensure data consistency. That being said, it seems
> > like Xapian's page usage causes horrible thrashing, hence the performance hit
> > on unrelated processes.
>
> Where are the unrelated processes waiting? Can you get a sample of several
> backtraces? (/proc/<pid>/stack should do it)

A call-graph profile will show the precise reason for IO latencies, and their
relatively likelihood.

It's really simple to do it with a recent kernel. Firstly, enable
CONFIG_BLK_DEV_IO_TRACE=y, CONFIG_EVENT_PROFILE=y:

Kernel performance events and counters (PERF_EVENTS) [Y/?] y
Tracepoint profiling sources (EVENT_PROFILE) [Y/n/?] y
Support for tracing block IO actions (BLK_DEV_IO_TRACE) [N/y/?] y

(boot into this kernel)

Then build perf via:

cd tools/perf/
make -j install

and then capture 10 seconds of the DB workload:

perf record -f -g -a -e block:block_rq_issue -c 1 sleep 10

[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.251 MB perf.data (~10977 samples) ]

and look at the call-graph output:

perf report

# Samples: 5
#
# Overhead Command Shared Object Symbol
# ........ ............... ................. ......
#
80.00% kjournald [kernel.kallsyms] [k] perf_trace_block_rq_issue
|
--- perf_trace_block_rq_issue
scsi_request_fn
|
|--50.00%-- __blk_run_queue
| cfq_insert_request
| elv_insert
| __elv_add_request
| __make_request
| generic_make_request
| submit_bio
| submit_bh
| sync_dirty_buffer
| journal_commit_transaction
| kjournald
| kthread
| kernel_thread_helper
|
--50.00%-- __generic_unplug_device
generic_unplug_device
blk_unplug
blk_backing_dev_unplug
sync_buffer
__wait_on_bit
out_of_line_wait_on_bit
__wait_on_buffer
wait_on_buffer
journal_commit_transaction
kjournald
kthread
kernel_thread_helper

20.00% as [kernel.kallsyms] [k] perf_trace_block_rq_issue
|
--- perf_trace_block_rq_issue
scsi_request_fn
__generic_unplug_device
generic_unplug_device
blk_unplug
blk_backing_dev_unplug
page_cache_async_readahead
generic_file_aio_read
do_sync_read
vfs_read
sys_read
system_call_fastpath
0x39f8ad4930

This (very simple) example had 80% of the IO in kjournald and 20% of it in
'as'. The precise call-paths of IO issues are visible.

For general scheduler context-switch events you can use:

perf record -f -g -a -e context-switches -c 1 sleep 10

see 'perf list' for all events.

Thanks,

Ingo

Pawel S

unread,

Mar 23, 2010, 7:30:01 AM3/23/10

to

Hello

I am experiencing very similar issue. My system is a regular desktop
PC and it suffers from very high I/O latencies (sometimes desktop
"hangs" for eight seconds or more) when copying large files. I tried
kernels up to 2.6.34-rc2, but without luck. This issue was raised at
Phoronix forums and Arjan (from Intel) noticed it can be VM related:

http://www.phoronix.com/forums/showpost.php?p=114975&postcount=51

Here is my perf timechart where you can notice I/O "steals" CPU from
the other tasks:

http://hotfile.com/dl/30596827/ebe566b/output.svg.gz.html

Regards!

P.S. if there is some way I can help more just let me know please.

Jens Axboe

unread,

Mar 23, 2010, 9:30:02 AM3/23/10

to

On Tue, Mar 23 2010, Pawel S wrote:
> Hello
>
> I am experiencing very similar issue. My system is a regular desktop
> PC and it suffers from very high I/O latencies (sometimes desktop
> "hangs" for eight seconds or more) when copying large files. I tried
> kernels up to 2.6.34-rc2, but without luck. This issue was raised at
> Phoronix forums and Arjan (from Intel) noticed it can be VM related:
>
> http://www.phoronix.com/forums/showpost.php?p=114975&postcount=51
>
> Here is my perf timechart where you can notice I/O "steals" CPU from
> the other tasks:
>
> http://hotfile.com/dl/30596827/ebe566b/output.svg.gz.html

It's also been my sneaking suspicion that swap is involved. I had lots
of RAM in anything I use, even the laptop and workstation. I'll try and
run some tests with lower memory and force it into swap, I've seen nasty
hangs that way.

--
Jens Axboe

Jesper Krogh

unread,

Mar 23, 2010, 4:10:02 PM3/23/10

to

Ben Gamari wrote:
> Hey all,
>
> Recently I started using the Xapian-based notmuch mail client for everyday
> use. One of the things I was quite surprised by after the switch was the
> incredible hit in interactive performance that is observed during database
> updates. Things are particularly bad during runs of 'notmuch new,' which scans
> the file system looking for new messages and adds them to the database.
> Specifically, the worst of the performance hit appears to occur when the
> database is being updated.

I would suggest that you include a 2.6.31 kernel in your testing. I have
seen something that seems like "huge" stalls in 2.6.32 but I havent been
able to "dig into it" to find more.

In 2.6.32 I have seen IO-wait numbers around 80% on a 16 core machine
with 128GB of memory and load-numbers over 120 under workloads that
didn't make 2.6.31 sweat at all.

Filesystems are a mixture of ext3 and ext4 (so it could be the barriers)?

--
Jesper

Ben Gamari

unread,

Mar 25, 2010, 11:20:01 PM3/25/10

to

On Tue, 16 Mar 2010 21:31:03 -0700 (PDT), Ben Gamari <bgamar...@gmail.com> wrote:
> On Tue, 16 Mar 2010 23:30:10 -0400, ty...@mit.edu wrote:
> > .... so did switching to Btrfs solve your latency issues, or are you
> > still having problems?
>
> Still having troubles although I'm now running 2.6.34-rc1 and things seem
> mildly better. I'll try doing a backup tonight and report back.
>

I stand by my assertion that 2.6.34 does seem better in some regards. While
there certainly are still latency issues, it's now less often that heavy I/O
spills over into over processes' interactive performance. That being said,
earlier this evening Tracker and notmuch were both indexing and I saw several
events of tens of seconds of latency.

Ben Gamari

unread,

Mar 25, 2010, 11:20:02 PM3/25/10

to

On Tue, 16 Mar 2010 08:31:12 -0700 (PDT), Ben Gamari <bgamar...@gmail.com> wrote:
> Hey all,

I apologize for my extreme tardiness in replying to your responses. I was
hoping to have more time during Spring break in dealing with this issue than I
did (as always). Nevertheless, I'll hopefully be able to keep up with things at
this point. Specific replies will follow.

- Ben

Ben Gamari

unread,

Mar 25, 2010, 11:30:02 PM3/25/10

to

On Wed, 17 Mar 2010 15:53:50 +1100, Nick Piggin <npi...@suse.de> wrote:
> Where are the unrelated processes waiting? Can you get a sample of
> several backtraces? (/proc/<pid>/stack should do it)
>

I wish. One of the incredibly frustrating characteristics of this issue is the
difficulty in measuring it. By the time processes begin blocking, it's already
far too late to open a terminal and cat to a file. By the time the terminal has
opened, tens of seconds have passed and things have started to return to normal.

>
> > Moreover, the hit on unrelated processes is so bad
> > that I would almost suspect that swap I/O is being serialized by fsync() as
> > well, despite being on a separate swap partition beyond the control of the
> > filesystem.
>
> It shouldn't be, until it reaches the bio layer. If it is on the same
> block device, it will still fight for access. It could also be blocking
> on dirty data thresholds, or page reclaim though -- writeback and
> reclaim could easily be getting slowed down by the fsync activity.
>

Hmm, this sounds interesting. Is there a way to monitor writeback throughput.

> Swapping tends to cause fairly nasty disk access patterns, combined with
> fsync it could be pretty unavoidable.
>

This is definitely a possibility. However, it seems to me like swapping should
be at least mildly favored over other I/O by the I/O scheduler. That being
said, I can certainly see how it would be difficult to implement such a
heuristic in a fair way so as not to block out standard filesystem access
during a thrashing spree.

> >
> > Xapian, however, is far from the first time I have seen this sort of
> > performance cliff. Rsync, which also uses fsync(), can also trigger this sort
> > of thrashing during system backups, as can rdiff. slocate's updatedb
> > absolutely kills interactive performance as well.
> >
> > Issues similar to this have been widely reported[1-5] in the past, and despite
> > many attempts[5-8] within both I/O and memory managements subsystems to fix
> > it, the problem certainly remains. I have tried reducing swappiness from 60 to
> > 40, with some small improvement and it has been reported[20] that these sorts
> > of symptoms can be negated through use of memory control groups to prevent
> > interactive process pages from being evicted.
>
> So the workload is causing quite a lot of swapping as well? How much
> pagecache do you have? It could be that you have too much pagecache and
> it is pushing out anonymous memory too easily, or you might have too
> little pagecache causing suboptimal writeout patterns (possibly writeout
> from page reclaim rather than asynchronous dirty page cleaner threads,
> which can really hurt).
>

As far as I can tell, the workload should fit in memory without a problem. This
machine has 4 gigabytes of memory, of which currently 2.8GB is page cache.
Seems high perhaps? I've included meminfo below. I can completely see how
overly-aggressive page-cache would result in this sort of behavior.

- Ben

MemTotal: 4048068 kB
MemFree: 47232 kB
Buffers: 48 kB
Cached: 2774648 kB
SwapCached: 1148 kB
Active: 2353572 kB
Inactive: 1355980 kB
Active(anon): 1343176 kB
Inactive(anon): 342644 kB
Active(file): 1010396 kB
Inactive(file): 1013336 kB
Unevictable: 0 kB
Mlocked: 0 kB
SwapTotal: 4883756 kB
SwapFree: 4882532 kB
Dirty: 24736 kB
Writeback: 0 kB
AnonPages: 933820 kB
Mapped: 88840 kB
Shmem: 750948 kB
Slab: 150752 kB
SReclaimable: 121404 kB
SUnreclaim: 29348 kB
KernelStack: 2672 kB
PageTables: 31312 kB
NFS_Unstable: 0 kB
Bounce: 0 kB
WritebackTmp: 0 kB
CommitLimit: 6907788 kB
Committed_AS: 2773672 kB
VmallocTotal: 34359738367 kB
VmallocUsed: 364080 kB
VmallocChunk: 34359299100 kB
HugePages_Total: 0
HugePages_Free: 0
HugePages_Rsvd: 0
HugePages_Surp: 0
Hugepagesize: 2048 kB
DirectMap4k: 8552 kB
DirectMap2M: 4175872 kB

Ben Gamari

unread,

Mar 25, 2010, 11:40:02 PM3/25/10

to

On Tue, 23 Mar 2010 14:27:19 +0100, Jens Axboe <jens....@oracle.com> wrote:
> It's also been my sneaking suspicion that swap is involved. I had lots
> of RAM in anything I use, even the laptop and workstation. I'll try and
> run some tests with lower memory and force it into swap, I've seen nasty
> hangs that way.
>

Find out anything useful?

- Ben

Ben Gamari

unread,

Mar 25, 2010, 11:40:02 PM3/25/10

to

On Wed, 17 Mar 2010 10:37:04 +0100, Ingo Molnar <mi...@elte.hu> wrote:
>
> A call-graph profile will show the precise reason for IO latencies, and their
> relatively likelihood.
>

Once I get home I'll try to reproduce the issue and get a call graph. Thanks!

- Ben

Ben Gamari

unread,

Mar 27, 2010, 9:30:02 PM3/27/10

to

Hey all,

I have posted another profile[1] from an incident yesterday. As you can see,
both swapper and init (strange?) show up prominently in the profile. Moreover,
most processes seem to be in blk_peek_request a disturbingly large percentage
of the time. Both of these profiles were taken with 2.6.34-rc kernels.

Anyone have any ideas on how to proceed? Is more profile data necessary? Are
the existing profiles at all useful? Thanks,

- Ben

Ben Gamari

unread,

Mar 27, 2010, 9:30:01 PM3/27/10

to

On Sat, 27 Mar 2010 18:20:37 -0700 (PDT), Ben Gamari <bgamar...@gmail.com> wrote:
> Hey all,
>
> I have posted another profile[1] from an incident yesterday. As you can see,
> both swapper and init (strange?) show up prominently in the profile. Moreover,
> most processes seem to be in blk_peek_request a disturbingly large percentage
> of the time. Both of these profiles were taken with 2.6.34-rc kernels.
>

Apparently my initial email announcing my first set of profiles never made it
out. Sorry for the confusion. I've included it below.

From: Ben Gamari <bgamar...@gmail.com>
Subject: Re: Poor interactive performance with I/O loads with fsync()ing
To: Ingo Molnar <mi...@elte.hu>, Nick Piggin <npi...@suse.de>
Cc: ty...@mit.edu, linux-...@vger.kernel.org, Olly Betts
<ol...@survex.com>, martin f krafft <mad...@madduck.net>
Bcc: bga...@gmail.com
In-Reply-To: <20100317093...@elte.hu>
References: <4b9fa440.12135e...@mx.google.com>
<20100317045350.GA2869@laptop> <20100317093...@elte.hu>
On Wed, 17 Mar 2010 10:37:04 +0100, Ingo Molnar <mi...@elte.hu> wrote:
> A call-graph profile will show the precise reason for IO latencies, and their
> relatively likelihood.

Well, here is something for now. I'm not sure how valid the reproduction
workload is (git pull, rsync, and 'notmuch new' all running at once), but I
certainly did produce a few stalls and swapper is highest on the profile.
This was on 2.6.34-rc2. I've included part of the profile below, although more
complete set of data is available at [1].

Thanks,

- Ben

[1] http://mw0.mooo.com/~ben/latency-2010-03-25-a/

# Samples: 25295

#
# Overhead Command Shared Object Symbol
# ........ ............... ................. ......
#

5.85% chromium-browse [kernel.kallsyms] [k] blk_peek_request
|
--- blk_peek_request
scsi_request_fn
__blk_run_queue
blk_run_queue
scsi_run_queue
scsi_next_command
scsi_io_completion
scsi_finish_command
scsi_softirq_done
blk_done_softirq
__do_softirq
call_softirq
do_softirq
irq_exit
do_IRQ
ret_from_intr
|
|--50.00%-- check_match.8653
|
--50.00%-- unlink_anon_vmas
free_pgtables
exit_mmap
mmput
exit_mm
do_exit
do_group_exit
sys_exit_group
system_call
...

Arjan van de Ven

unread,

Mar 27, 2010, 11:50:01 PM3/27/10

to

On Sat, 27 Mar 2010 18:20:37 -0700 (PDT)

Ben Gamari <bgamar...@gmail.com> wrote:

> Hey all,
>
> I have posted another profile[1] from an incident yesterday. As you
> can see, both swapper and init (strange?) show up prominently in the
> profile. Moreover, most processes seem to be in blk_peek_request a
> disturbingly large percentage of the time. Both of these profiles
> were taken with 2.6.34-rc kernels.
>
> Anyone have any ideas on how to proceed? Is more profile data
> necessary? Are the existing profiles at all useful? Thanks,

profiles tend to be about cpu usage... and are rather poor to deal with
anything IO related.

latencytop might get closer in giving useful information....

(btw some general suggestion.. make sure you're using noatime or
relatime as mount option)
--
Arjan van de Ven Intel Open Source Technology Centre
For development, discussion and tips for power savings,
visit http://www.lesswatts.org

Ben Gamari

unread,

Mar 28, 2010, 10:10:01 AM3/28/10

to

On Sat, 27 Mar 2010 20:42:33 -0700, Arjan van de Ven <ar...@infradead.org> wrote:
> On Sat, 27 Mar 2010 18:20:37 -0700 (PDT)
> Ben Gamari <bgamar...@gmail.com> wrote:
>
> > Hey all,
> >
> > I have posted another profile[1] from an incident yesterday. As you
> > can see, both swapper and init (strange?) show up prominently in the
> > profile. Moreover, most processes seem to be in blk_peek_request a
> > disturbingly large percentage of the time.
> >

I suppose this statement was a tad misleading. The provided profiles were taken
with,

perf record -f -g -a -e block:block_rq_issue -c 1

Which I believe measures block requests issued, not CPU usage (correct me if
I'm wrong).

> profiles tend to be about cpu usage... and are rather poor to deal with
> anything IO related.
>

See above.

> latencytop might get closer in giving useful information....
>

Latencytop generally shows a large amount of time handling page faults.

> (btw some general suggestion.. make sure you're using noatime or
> relatime as mount option)

Thanks for the suggestion. I had actually forgotten relatime in my fstab, so
we'll see if there's any improvement now. That being said, I/O loads over small
numbers of files (e.g. xapian) are just as bad as loads over large numbers of
files. To me that weakly suggests perhaps atime updates aren't the issue (I
could be horribly wrong though).

- Ben

Andi Kleen

unread,

Mar 28, 2010, 6:10:02 PM3/28/10

to

Ben Gamari <bgamar...@gmail.com> writes:

You don't say which file system you use, but ext3 and the file systems
with similar journal design (like reiserfs) all have known fsync starvation
issues. The problem is that any fsync has to wait for all transactions
to commit, and this might take a long time depending on how busy
the disk is.

ext4/XFS/JFS/btrfs should be better in this regard

-Andi
--
a...@linux.intel.com -- Speaking for myself only.

Pawel S

unread,

Mar 30, 2010, 6:50:02 AM3/30/10

to

2010/3/23 Jens Axboe <jens....@oracle.com>:

> It's also been my sneaking suspicion that swap is involved. I had lots
> of RAM in anything I use, even the laptop and workstation. I'll try and
> run some tests with lower memory and force it into swap, I've seen nasty
> hangs that way.

I am not sure if the memory swapping is the case. According to KDE
System monitor swap is not even touched when copying files. I also
noticed similar responsiveness problems when extracting some rar files
- twelve parts , each about 100MB and the system becomes unresponsive
for some time since the first seconds of operation, then it behaves
normally and then problem is back. I have 2GB of RAM. I am using ext4
file system and I am using noatime mount option.

P.S. It seems it is a little better when I have AHCI mode set in BIOS
(at least when extracting archives).

P.S.2 I would be glad to provide some useful data. I created perf
chart, but if this is not enough just instruct me what should I do
next, please.

Regards

Pawel

Ben Gamari

unread,

Apr 9, 2010, 11:00:01 AM4/9/10

to

On Mon, 29 Mar 2010 00:08:58 +0200, Andi Kleen <an...@firstfloor.org> wrote:
> Ben Gamari <bgamar...@gmail.com> writes:
> ext4/XFS/JFS/btrfs should be better in this regard
>

I am using btrfs, so yes, I was expecting things to be better. Unfortunately,
the improvement seems to be non-existent under high IO/fsync load.

- Ben

Ben Gamari

unread,

Apr 9, 2010, 11:30:03 AM4/9/10

to

On Wed, 17 Mar 2010 10:37:04 +0100, Ingo Molnar <mi...@elte.hu> wrote:

Are these backtraces at all useful? I've received no feedback thusfar, so I can
only that either,
a) there is insufficient data to draw any conclusions and there is no
interest in pursuing this further, or
b) nobody has looked at the backtraces

As I've said in the past, I am very interested in seeing this problem looked at and
would love to contribute whatever I can to that effort. However, without knowing what
information is necessary, I can be of only very limited use in my own debugging
efforts. Thanks,

- Ben

Avi Kivity

unread,

Apr 11, 2010, 11:10:04 AM4/11/10

to

On 04/09/2010 05:56 PM, Ben Gamari wrote:
> On Mon, 29 Mar 2010 00:08:58 +0200, Andi Kleen<an...@firstfloor.org> wrote:
>
>> Ben Gamari<bgamar...@gmail.com> writes:
>> ext4/XFS/JFS/btrfs should be better in this regard
>>
>>
> I am using btrfs, so yes, I was expecting things to be better. Unfortunately,
> the improvement seems to be non-existent under high IO/fsync load.
>
>

btrfs is known to perform poorly under fsync.

--
error compiling committee.c: too many arguments to function

Ben Gamari

unread,

Apr 11, 2010, 12:40:02 PM4/11/10

to

On Sun, 11 Apr 2010 18:03:00 +0300, Avi Kivity <a...@redhat.com> wrote:
> On 04/09/2010 05:56 PM, Ben Gamari wrote:
> > On Mon, 29 Mar 2010 00:08:58 +0200, Andi Kleen<an...@firstfloor.org> wrote:
> >
> >> Ben Gamari<bgamar...@gmail.com> writes:
> >> ext4/XFS/JFS/btrfs should be better in this regard
> >>
> >>
> > I am using btrfs, so yes, I was expecting things to be better. Unfortunately,
> > the improvement seems to be non-existent under high IO/fsync load.
> >
>
> btrfs is known to perform poorly under fsync.
>

Has the reason for this been identified? Judging from the nature of metadata
loads, it would seem that it should be substantially easier to implement
fsync() efficiently.

- Ben

Andi Kleen

unread,

Apr 11, 2010, 1:30:02 PM4/11/10

to

> Has the reason for this been identified? Judging from the nature of metadata
> loads, it would seem that it should be substantially easier to implement
> fsync() efficiently.

By design a copy on write tree fs would need to flush a whole
tree hierarchy on a sync. btrfs avoids this by using a special
log for fsync, but that causes more overhead if you have that
log on the same disk. So IO subsystem will do more work.

It's a bit like JBD data journaling.

However it should not have the stalls inherent in ext3's journaling.

-Andi
--
a...@linux.intel.com -- Speaking for myself only.

Thomas Gleixner

unread,

Apr 11, 2010, 2:20:02 PM4/11/10

to

On Sun, 11 Apr 2010, Avi Kivity wrote:

> On 04/09/2010 05:56 PM, Ben Gamari wrote:
> > On Mon, 29 Mar 2010 00:08:58 +0200, Andi Kleen<an...@firstfloor.org> wrote:
> >
> > > Ben Gamari<bgamar...@gmail.com> writes:
> > > ext4/XFS/JFS/btrfs should be better in this regard
> > >
> > >
> > I am using btrfs, so yes, I was expecting things to be better.
> > Unfortunately,
> > the improvement seems to be non-existent under high IO/fsync load.
> >
> >
>
> btrfs is known to perform poorly under fsync.

XFS does not do much better. Just moved my VM images back to ext for
that reason.

Thanks,

tglx

Andi Kleen

unread,

Apr 11, 2010, 2:50:02 PM4/11/10

to

> XFS does not do much better. Just moved my VM images back to ext for
> that reason.

Did you move from XFS to ext3? ext3 defaults to barriers off, XFS on,
which can make a big difference depending on the disk. You can
disable them on XFS too of course, with the known drawbacks.

XFS also typically needs some tuning to get reasonable log sizes.

My point was merely (before people chime in with counter examples)
that XFS/btrfs/jfs don't suffer from the "need to sync all transactions for
every fsync" issue. There can (and will be) still other issues.

-Andi

--
a...@linux.intel.com -- Speaking for myself only.

Thomas Gleixner

unread,

Apr 11, 2010, 6:00:02 PM4/11/10

to

On Sun, 11 Apr 2010, Andi Kleen wrote:

> > XFS does not do much better. Just moved my VM images back to ext for
> > that reason.
>
> Did you move from XFS to ext3? ext3 defaults to barriers off, XFS on,
> which can make a big difference depending on the disk. You can
> disable them on XFS too of course, with the known drawbacks.
>
> XFS also typically needs some tuning to get reasonable log sizes.
>
> My point was merely (before people chime in with counter examples)
> that XFS/btrfs/jfs don't suffer from the "need to sync all transactions for
> every fsync" issue. There can (and will be) still other issues.

Yes, I moved them back from XFS to ext3 simply because moving them
from ext3 to XFS turned out to be a completely unusable disaster.

I know that I can tweak knobs on XFS (or any other file system), but I
would not have expected that it sucks that much for KVM with the
default settings which are perfectly fine for the other use cases
which made us move to XFS.

Thanks,

tglx

Hans-Peter Jansen

unread,

Apr 11, 2010, 7:50:01 PM4/11/10

to

On Sunday 11 April 2010, 23:54:34 Thomas Gleixner wrote:
> On Sun, 11 Apr 2010, Andi Kleen wrote:
> > > XFS does not do much better. Just moved my VM images back to ext for
> > > that reason.
> >
> > Did you move from XFS to ext3? ext3 defaults to barriers off, XFS on,
> > which can make a big difference depending on the disk. You can
> > disable them on XFS too of course, with the known drawbacks.
> >
> > XFS also typically needs some tuning to get reasonable log sizes.
> >
> > My point was merely (before people chime in with counter examples)
> > that XFS/btrfs/jfs don't suffer from the "need to sync all transactions
> > for every fsync" issue. There can (and will be) still other issues.
>
> Yes, I moved them back from XFS to ext3 simply because moving them
> from ext3 to XFS turned out to be a completely unusable disaster.
>
> I know that I can tweak knobs on XFS (or any other file system), but I
> would not have expected that it sucks that much for KVM with the
> default settings which are perfectly fine for the other use cases
> which made us move to XFS.

Thomas, what Andi was merely turning out, is that xfs has a really
concerning different default: barriers, that hurts with fsync().

In order to make a fair comparison of the two, you may want to mount xfs
with nobarrier or ext3 with barrier option set, and _then_ check which one
is sucking less.

I guess, that outcome will be interesting for quite a bunch of people in the
audience (including me¹).

Pete

¹) while in transition of getting rid of even suckier technology junk like
VMware-Server - but digging out a current², but _stable_ kernel release
seems harder then ever nowadays.
²) with operational VT-d support for kvm

Dave Chinner

unread,

Apr 12, 2010, 9:30:01 PM4/12/10

to

On Sun, Apr 11, 2010 at 08:16:09PM +0200, Thomas Gleixner wrote:
> On Sun, 11 Apr 2010, Avi Kivity wrote:
> > On 04/09/2010 05:56 PM, Ben Gamari wrote:
> > > On Mon, 29 Mar 2010 00:08:58 +0200, Andi Kleen<an...@firstfloor.org> wrote:
> > > > Ben Gamari<bgamar...@gmail.com> writes:
> > > > ext4/XFS/JFS/btrfs should be better in this regard
> > > >
> > > I am using btrfs, so yes, I was expecting things to be better.
> > > Unfortunately,
> > > the improvement seems to be non-existent under high IO/fsync load.
> >
> > btrfs is known to perform poorly under fsync.
>
> XFS does not do much better. Just moved my VM images back to ext for
> that reason.

Numbers? Workload description? Mount options? I hate it when all I
hear is "XFS sucked, so I went back to extN" reports without any
more details - it's hard to improve anything without any details
of the problems.

Also worth remembering is that XFS defaults to slow-but-safe
options, but ext3 defaults to fast-and-I-don't-give-a-damn-about-
data-safety, so there's a world of difference between the
filesystem defaults....

And FWIW, I run all my VMs on XFS using default mkfs and mount options,
and I can't say that I've noticed any performance problems at all
despite hammering the IO subsystems all the time. The only thing
I've ever done is occasionally run xfs_fsr across permanent qcow2
VM images to defrag them as the grow slowly over time...

Cheers,

Dave.
--
Dave Chinner
da...@fromorbit.com

Ric Wheeler

unread,

Apr 14, 2010, 2:50:02 PM4/14/10

to

And if you are asking for details, the type of storage you use is also
quite interesting.

Thanks!

Ric