[RFC PATCH] Fix Readahead stalling by plugged device queues

Christian Ehrhardt

Mar 10, 2010, 7:40:02 AM

A few days ago, while checking some blktrace logs from SLES10- and
SLES11-based systems, I realized that readahead might get stalled in
newer kernels. "Newer" here includes upstream git kernels as well.

The following RFC patch applies cleanly to everything I have tested so
far between 2.6.32 and git head.

I don't know whether unplugging on every readahead is too aggressive,
but the patch was intended to verify a theory in the first place.
Check out the improvements described below - I think it is
definitely worth a discussion or two :-)

--- patch ---

Subject: [PATCH] readahead: unplug backing device to lower latencies

From: Christian Ehrhardt <ehrh...@linux.vnet.ibm.com>

This unplugs the backing device we just submitted a readahead to.

It should be safe: in lightly utilized environments it is a huge win,
because it avoids latency in making the readahead available early, and
on highly loaded systems the queue is already being drained and filled
concurrently, so unplugging is almost a no-op.

On the win side we have huge throughput increases, especially in
sequential read loads with <4 processes (4 = unplug threshold).

Without this patch these scenarios get stalled by plugging; here is
some blktrace data.

Old pattern:
8,208 3 25 0.028152940 29226 Q R 173880 + 1024 [iozone]
8,208 3 26 0.028153378 29226 G R 173880 + 1024 [iozone]
8,208 3 27 0.028155690 29226 P N [iozone]
8,208 3 28 0.028155909 29226 I R 173880 + 1024 ( 2531) [iozone]
8,208 3 30 0.028621723 29226 Q R 174904 + 1024 [iozone]
8,208 3 31 0.028623941 29226 M R 174904 + 1024 [iozone]
8,208 3 32 0.028624535 29226 U N [iozone] 1
8,208 3 33 0.028625035 29226 D R 173880 + 2048 ( 469126) [iozone]
8,208 1 26 0.032984442 0 C R 173880 + 2048 ( 4359407) [0]

New pattern:
8,209 2 63 0.014241032 18361 Q R 152360 + 1024 [iozone]
8,209 2 64 0.014241657 18361 G R 152360 + 1024 [iozone]
8,209 2 65 0.014243750 18361 P N [iozone]
8,209 2 66 0.014243844 18361 I R 152360 + 1024 ( 2187) [iozone]
8,209 2 67 0.014244438 18361 U N [iozone] 2
8,209 2 68 0.014244844 18361 D R 152360 + 1024 ( 1000) [iozone]
8,209 1 1 0.016682532 0 C R 151336 + 1024 ( 3111375) [0]
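
The delays in the two traces can be recomputed from the timestamps. The
following is only a minimal sketch in plain C; the helper names
ts_to_ns() and elapsed_ns() are made up for illustration, and the parser
assumes timestamps with exactly nine fractional digits, as printed in
the traces above.

```c
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/*
 * Parse a blktrace timestamp such as "0.028155909" into nanoseconds.
 * Assumption: the fractional part always has exactly nine digits.
 * Returns -1 if the string does not match that shape.
 */
static int64_t ts_to_ns(const char *ts)
{
	int64_t sec, nsec;

	if (sscanf(ts, "%" SCNd64 ".%9" SCNd64, &sec, &nsec) != 2)
		return -1;
	return sec * INT64_C(1000000000) + nsec;
}

/* Elapsed nanoseconds between two trace events (hypothetical helper). */
static int64_t elapsed_ns(const char *from, const char *to)
{
	return ts_to_ns(to) - ts_to_ns(from);
}
```

For the old pattern, elapsed_ns("0.028155909", "0.028625035") gives
469126 ns between insert (I) and dispatch (D), which matches the value
in parentheses on the D line. For the new pattern,
elapsed_ns("0.014243844", "0.014244844") gives 1000 ns. Measured from
queueing (Q) to dispatch (D), the old pattern takes 472095 ns and the
new one 3812 ns.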

We already had such a good pattern in the past, e.g. in 2.6.27-based
kernels, but I didn't find any explicit piece of code that was
removed - maybe it was not intentional, just a side effect in those
older kernels.

As the effectiveness of readahead is directly related to its
latency (that is, whether the data is available by the time the
application wants to read it), the effect on application throughput
is quite impressive.
Here are some numbers from parallel iozone sequential reads with
one disk per process.

#Processes    Throughput improvement
 1            68.8%
 2            58.4%
 4            51.9%
 8            37.3%
16            16.2%
32            -0.1%
64             0.3%

This is a low-memory (256 MB) environment, so in the highly
parallel cases the readahead scales down properly.
I expect that with more memory available the benefit of this patch
would be visible in loads with >16 threads too
(measurements ongoing).

Signed-off-by: Christian Ehrhardt <ehrh...@linux.vnet.ibm.com>
---

[diffstat]
readahead.c | 5 ++++-
1 file changed, 4 insertions(+), 1 deletion(-)

[diff]
Index: linux/mm/readahead.c
===================================================================
--- linux.orig/mm/readahead.c
+++ linux/mm/readahead.c
@@ -188,8 +188,11 @@ __do_page_cache_readahead(struct address
 	 * uptodate then the caller will launch readpage again, and
 	 * will then handle the error.
 	 */
-	if (ret)
+	if (ret) {
 		read_pages(mapping, filp, &page_pool, ret);
+		/* unplug backing dev to avoid latencies */
+		blk_run_address_space(mapping);
+	}
 	BUG_ON(!list_empty(&page_pool));
 out:
 	return ret;

--

Grüsse / regards, Christian Ehrhardt
IBM Linux Technology Center, System z Linux Performance

Wu Fengguang

Mar 10, 2010, 8:20:04 AM

> --- linux.orig/mm/readahead.c
> +++ linux/mm/readahead.c
> @@ -188,8 +188,11 @@ __do_page_cache_readahead(struct address
>  	 * uptodate then the caller will launch readpage again, and
>  	 * will then handle the error.
>  	 */
> -	if (ret)
> +	if (ret) {
>  		read_pages(mapping, filp, &page_pool, ret);
> +		/* unplug backing dev to avoid latencies */
> +		blk_run_address_space(mapping);
> +	}

Christian, did you notice this commit for 2.6.33?

commit 65a80b4c61f5b5f6eb0f5669c8fb120893bfb388
Author: Hisashi Hifumi <hifumi....@oss.ntt.co.jp>
Date: Thu Dec 17 15:27:26 2009 -0800

readahead: add blk_run_backing_dev

--- a/mm/readahead.c
+++ b/mm/readahead.c
@@ -547,5 +547,17 @@ page_cache_async_readahead(struct address_space *mapping,

 	/* do read-ahead */
 	ondemand_readahead(mapping, ra, filp, true, offset, req_size);
+
+#ifdef CONFIG_BLOCK
+	/*
+	 * Normally the current page is !uptodate and lock_page() will be
+	 * immediately called to implicitly unplug the device. However this
+	 * is not always true for RAID configurations, where data arrives
+	 * not strictly in their submission order. In this case we need to
+	 * explicitly kick off the IO.
+	 */
+	if (PageUptodate(page))
+		blk_run_backing_dev(mapping->backing_dev_info, NULL);
+#endif
 }

It should at least improve performance between .32 and .33, because
once two readahead requests are merged into one single IO request,
PageUptodate() will be true at the next readahead, and hence
blk_run_backing_dev() gets called to break out of the suboptimal
situation.
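
To illustrate the difference between no explicit unplug, the conditional
unplug and an unconditional one, here is a toy model in plain C. It is
only a sketch under loud assumptions: toy_queue, toy_unplug(),
toy_async_readahead() and the kick_policy values are hypothetical names,
not kernel code, and the model ignores the unplug timer, the unplug
threshold and the I/O scheduler.

```c
#include <stdbool.h>

/* Toy model only: none of these names exist in the kernel. */
enum kick_policy {
	KICK_NEVER,		/* no explicit unplug after readahead */
	KICK_IF_UPTODATE,	/* conditional unplug, like commit 65a80b4c61 */
	KICK_ALWAYS		/* unconditional unplug, like the RFC patch */
};

struct toy_queue {
	bool plugged;		/* true while requests are held back */
	int pending;		/* readahead windows waiting behind the plug */
	int dispatched;		/* readahead windows handed to the device */
};

/* Hand everything that is waiting to the device and drop the plug. */
static void toy_unplug(struct toy_queue *q)
{
	q->dispatched += q->pending;
	q->pending = 0;
	q->plugged = false;
}

/*
 * Submit one readahead window and return how many windows are still
 * stuck behind the plug when the reader continues.
 */
static int toy_async_readahead(struct toy_queue *q, bool page_uptodate,
			       enum kick_policy policy)
{
	q->plugged = true;
	q->pending++;

	if (!page_uptodate) {
		/* the reader blocks in lock_page(): implicit unplug */
		toy_unplug(q);
	} else if (policy != KICK_NEVER) {
		/* nothing blocks on this page, so kick the IO explicitly */
		toy_unplug(q);
	}
	return q->pending;
}
```

In this model, an already uptodate page with KICK_NEVER leaves the
window stuck behind the plug, while both kick policies dispatch it at
once. With a !uptodate page all three policies dispatch, because the
implicit unplug from lock_page() does the work. The model says nothing
about timing or about how the I/O scheduler reacts.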

Your patch does reduce the possible readahead submit latency to 0.

Is your workload a simple dd on a single disk? If so, it sounds like
something illogical is hidden in the block layer.

Thanks,
Fengguang

Christian Ehrhardt

Mar 10, 2010, 9:40:02 AM

Wu Fengguang wrote:
[...]

> Christian, did you notice this commit for 2.6.33?
>
> commit 65a80b4c61f5b5f6eb0f5669c8fb120893bfb388

[...]

I didn't see that particular one, because whatever the result is, it
needs to work on .32.

Anyway, I'll test it tomorrow, and if that already-accepted patch fixes
my issue as well, I'll recommend that distros on kernels older than
2.6.33 pick it up in their on-top patches.

>
> It should at least improve performance between .32 and .33, because
> once two readahead requests are merged into one single IO request,
> PageUptodate() will be true at the next readahead, and hence
> blk_run_backing_dev() gets called to break out of the suboptimal
> situation.

As you saw from my blktrace, that's already the case without that patch.
Once the second readahead comes in and is merged, the queue gets
unplugged in 2.6.32 too - but that is still bad behavior, as it denies
me things like a 68% throughput improvement :-).

>
> Your patch does reduce the possible readahead submit latency to 0.

Yeah, and I think/hope that is fine because, as I stated:
- lightly utilized disk -> not an issue
- highly utilized disk -> unplug is a no-op

Personally, I consider merging a readahead window with anything except
its own sibling to be very rare - and therefore it is fair to unplug
after an RA is submitted.

> Is your workload a simple dd on a single disk? If so, it sounds like
> something illogical is hidden in the block layer.

Something illogical might still be hidden there, as e.g. 2.6.27
unplugged after the first readahead as well :-)
But no, my load is iozone running with different numbers of processes,
with one disk per process.

That neatly resembles e.g. nightly backup jobs, which tend to take longer
and longer in ever-growing customer scenarios. Such an improvement
might banish the backups back to the night where they belong :-)

> Thanks,
> Fengguang

--

Grüsse / regards, Christian Ehrhardt
IBM Linux Technology Center, System z Linux Performance

Wu Fengguang

Mar 10, 2010, 8:50:01 PM

On Wed, Mar 10, 2010 at 10:31:46PM +0800, Christian Ehrhardt wrote:
>
>
> Wu Fengguang wrote:
> [...]
> > Christian, did you notice this commit for 2.6.33?
> >
> > commit 65a80b4c61f5b5f6eb0f5669c8fb120893bfb388
> [...]
>
> I didn't see that particular one, because whatever the result is, it
> needs to work on .32.
>
> Anyway, I'll test it tomorrow, and if that already-accepted patch fixes
> my issue as well, I'll recommend that distros on kernels older than
> 2.6.33 pick it up in their on-top patches.

OK, thanks!

> >
> > It should at least improve performance between .32 and .33, because
> > once two readahead requests are merged into one single IO request,
> > PageUptodate() will be true at the next readahead, and hence
> > blk_run_backing_dev() gets called to break out of the suboptimal
> > situation.
>
> As you saw from my blktrace, that's already the case without that patch.
> Once the second readahead comes in and is merged, the queue gets
> unplugged in 2.6.32 too - but that is still bad behavior, as it denies
> me things like a 68% throughput improvement :-).

I mean, when readahead windows A and B are submitted in one IO --
let's call it AB -- commit 65a80b4c61 will explicitly unplug when doing
readahead C, while in your trace the unplug appears on AB.

The 68% improvement is very impressive. I wonder whether commit 65a80b4c61
(the _conditional_ unplug) can achieve the same level of improvement :)

> >
> > Your patch does reduce the possible readahead submit latency to 0.
>
> Yeah, and I think/hope that is fine because, as I stated:
> - lightly utilized disk -> not an issue
> - highly utilized disk -> unplug is a no-op
>
> Personally, I consider merging a readahead window with anything except
> its own sibling to be very rare - and therefore it is fair to unplug
> after an RA is submitted.

They are reasonable assumptions. However, I'm not sure whether this
unconditional unplug will defeat CFQ's anticipatory logic -- if there
is any. You know, commit 65a80b4c61 is more of a *defensive*
protection against the rare case that two readahead windows get
merged.

> > Is your workload a simple dd on a single disk? If so, it sounds like
> > something illogical is hidden in the block layer.
>
> Something illogical might still be hidden there, as e.g. 2.6.27
> unplugged after the first readahead as well :-)
> But no, my load is iozone running with different numbers of processes,
> with one disk per process.
> That neatly resembles e.g. nightly backup jobs, which tend to take longer
> and longer in ever-growing customer scenarios. Such an improvement
> might banish the backups back to the night where they belong :-)

Exactly one process per disk? Are they doing sequential reads or more
complicated access patterns?

Thanks,
Fengguang

Christian Ehrhardt

Mar 11, 2010, 5:00:03 AM

Wu Fengguang wrote:
> On Wed, Mar 10, 2010 at 10:31:46PM +0800, Christian Ehrhardt wrote:
>>
>> Wu Fengguang wrote:
>> [...]
>>> Christian, did you notice this commit for 2.6.33?
>>>
>>> commit 65a80b4c61f5b5f6eb0f5669c8fb120893bfb388
>> [...]
>>
>> I didn't see that particular one, because whatever the result is, it
>> needs to work on .32.
>>
>> Anyway, I'll test it tomorrow, and if that already-accepted patch fixes
>> my issue as well, I'll recommend that distros on kernels older than
>> 2.6.33 pick it up in their on-top patches.
>
> OK, thanks!

That patch fixes my issue completely and is, as we discussed, less
aggressive, which is fine. Thanks for pointing it out - now I have
something already accepted upstream to fix the issue, that's much better!

>>> It should at least improve performance between .32 and .33, because
>>> once two readahead requests are merged into one single IO request,
>>> PageUptodate() will be true at the next readahead, and hence
>>> blk_run_backing_dev() gets called to break out of the suboptimal
>>> situation.
>> As you saw from my blktrace, that's already the case without that patch.
>> Once the second readahead comes in and is merged, the queue gets
>> unplugged in 2.6.32 too - but that is still bad behavior, as it denies
>> me things like a 68% throughput improvement :-).
>
> I mean, when readahead windows A and B are submitted in one IO --
> let's call it AB -- commit 65a80b4c61 will explicitly unplug when doing
> readahead C, while in your trace the unplug appears on AB.
>
> The 68% improvement is very impressive. I wonder whether commit 65a80b4c61
> (the _conditional_ unplug) can achieve the same level of improvement :)

Yep, it can!
We can update the patch description afterwards with bigger numbers :-)

>>> Your patch does reduce the possible readahead submit latency to 0.
>> Yeah, and I think/hope that is fine because, as I stated:
>> - lightly utilized disk -> not an issue
>> - highly utilized disk -> unplug is a no-op
>>
>> Personally, I consider merging a readahead window with anything except
>> its own sibling to be very rare - and therefore it is fair to unplug
>> after an RA is submitted.
>
> They are reasonable assumptions. However, I'm not sure whether this
> unconditional unplug will defeat CFQ's anticipatory logic -- if there
> is any. You know, commit 65a80b4c61 is more of a *defensive*
> protection against the rare case that two readahead windows get
> merged.
>
>>> Is your workload a simple dd on a single disk? If so, it sounds like
>>> something illogical is hidden in the block layer.
>> Something illogical might still be hidden there, as e.g. 2.6.27
>> unplugged after the first readahead as well :-)
>> But no, my load is iozone running with different numbers of processes,
>> with one disk per process.
>> That neatly resembles e.g. nightly backup jobs, which tend to take longer
>> and longer in ever-growing customer scenarios. Such an improvement
>> might banish the backups back to the night where they belong :-)
>
> Exactly one process per disk? Are they doing sequential reads or more
> complicated access patterns?

The win shows up only in sequential read, but I also ran sequential
write, random read/write, and some mixed workloads like dbench.
The patch improved sequential read and did not impact the others, which
is fine.

Thank you for your quick replies!

> Thanks,
> Fengguang

--

Grüsse / regards, Christian Ehrhardt

IBM Linux Technology Center, System z Linux Performance

Wu Fengguang

Mar 11, 2010, 8:30:02 AM

On Thu, Mar 11, 2010 at 05:58:08PM +0800, Christian Ehrhardt wrote:
> Wu Fengguang wrote:
> > On Wed, Mar 10, 2010 at 10:31:46PM +0800, Christian Ehrhardt wrote:
> >>
> >> Wu Fengguang wrote:
> >> [...]
> >>> Christian, did you notice this commit for 2.6.33?
> >>>
> >>> commit 65a80b4c61f5b5f6eb0f5669c8fb120893bfb388
> >> [...]
> >>
> >> I didn't see that particular one, because whatever the result is, it
> >> needs to work on .32.
> >>
> >> Anyway, I'll test it tomorrow, and if that already-accepted patch fixes
> >> my issue as well, I'll recommend that distros on kernels older than
> >> 2.6.33 pick it up in their on-top patches.
> >
> > OK, thanks!
>
> That patch fixes my issue completely and is, as we discussed, less
> aggressive, which is fine. Thanks for pointing it out - now I have
> something already accepted upstream to fix the issue, that's much better!

That's great news, it works beyond my expectations :)

> >>> It should at least improve performance between .32 and .33, because
> >>> once two readahead requests are merged into one single IO request,
> >>> PageUptodate() will be true at the next readahead, and hence
> >>> blk_run_backing_dev() gets called to break out of the suboptimal
> >>> situation.
> >> As you saw from my blktrace, that's already the case without that patch.
> >> Once the second readahead comes in and is merged, the queue gets
> >> unplugged in 2.6.32 too - but that is still bad behavior, as it denies
> >> me things like a 68% throughput improvement :-).
> >
> > I mean, when readahead windows A and B are submitted in one IO --
> > let's call it AB -- commit 65a80b4c61 will explicitly unplug when doing
> > readahead C, while in your trace the unplug appears on AB.
> >
> > The 68% improvement is very impressive. I wonder whether commit 65a80b4c61
> > (the _conditional_ unplug) can achieve the same level of improvement :)
>
> Yep, it can!
> We can update the patch description afterwards with bigger numbers :-)

Andrew/Greg, shall we push the patch to .32 stable?

That would give us an opportunity to change the patch description ;)

Ah OK.

> Thank you for your quick replies!

You are welcome~

Thanks,
Fengguang

Greg KH

Mar 18, 2010, 8:40:01 PM

I've now queued it up.

thanks,

greg k-h