This is to lift the default readahead size to 512KB, which I believe yields
more I/O throughput without noticeably increasing I/O latency for today's HDDs.
For example, for an HDD with 100MB/s bandwidth and 8ms access time:
io_size(KB)  access_time(ms)  transfer_time(ms)  io_latency(ms)   util%  throughput(KB/s)    IOPS
          4                8               0.04            8.04   0.49%            497.57  124.39
          8                8               0.08            8.08   0.97%            990.33  123.79
         16                8               0.16            8.16   1.92%           1961.69  122.61
         32                8               0.31            8.31   3.76%           3849.62  120.30
         64                8               0.62            8.62   7.25%           7420.29  115.94
        128                8               1.25            9.25  13.51%          13837.84  108.11
        256                8               2.50           10.50  23.81%          24380.95   95.24
        512                8               5.00           13.00  38.46%          39384.62   76.92
       1024                8              10.00           18.00  55.56%          56888.89   55.56
       2048                8              20.00           28.00  71.43%          73142.86   35.71
       4096                8              40.00           48.00  83.33%          85333.33   20.83
Going from a 128KB to a 512KB readahead size boosts I/O throughput from
~13MB/s to ~39MB/s, while only increasing I/O latency from 9.25ms to 13.00ms.
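The numbers above follow a simple model: io_latency = access_time +
io_size/bandwidth, and throughput = io_size/io_latency. A minimal standalone
sketch that reproduces the table (assuming 100MB/s bandwidth and 8ms access
time; not part of the patchset):

#include <stdio.h>

int main(void)
{
	const double bw_kb_per_ms = 100 * 1024 / 1000.0;	/* 100MB/s */
	const double access_ms = 8.0;				/* seek + rotation */

	for (int kb = 4; kb <= 4096; kb *= 2) {
		double transfer_ms = kb / bw_kb_per_ms;
		double latency_ms = access_ms + transfer_ms;

		printf("%4d %8.2f %8.2f %8.2f %6.2f%% %10.2f %8.2f\n",
		       kb, access_ms, transfer_ms, latency_ms,
		       100 * transfer_ms / latency_ms,	/* util% */
		       kb / latency_ms * 1000,		/* throughput KB/s */
		       1000 / latency_ms);		/* IOPS */
	}
	return 0;
}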
As for SSDs, I find that the Intel X25-M SSD benefits from a large readahead
size even for sequential reads (the first patch has benchmark details):
rasize first run time/throughput second run time/throughput
------------------------------------------------------------------
4k 3.40038 s, 123 MB/s 3.42842 s, 122 MB/s
8k 2.7362 s, 153 MB/s 2.74528 s, 153 MB/s
16k 2.59808 s, 161 MB/s 2.58728 s, 162 MB/s
32k 2.50488 s, 167 MB/s 2.49138 s, 168 MB/s
64k 2.12861 s, 197 MB/s 2.13055 s, 197 MB/s
128k 1.92905 s, 217 MB/s 1.93176 s, 217 MB/s
256k 1.75896 s, 238 MB/s 1.78963 s, 234 MB/s
512k 1.67357 s, 251 MB/s 1.69112 s, 248 MB/s
1M 1.62115 s, 259 MB/s 1.63206 s, 257 MB/s
2M 1.56204 s, 269 MB/s 1.58854 s, 264 MB/s
4M 1.57949 s, 266 MB/s 1.57426 s, 266 MB/s
At the same time, as suggested by Linus, the default readahead size is limited for small devices:
[PATCH 01/11] readahead: limit readahead size for small devices
[PATCH 02/11] readahead: bump up the default readahead size
[PATCH 03/11] readahead: introduce {MAX|MIN}_READAHEAD_PAGES macros for ease of use
The two other impacts of an enlarged readahead size are
- memory footprint (caused by readahead miss)
Sequential readahead hit ratio is pretty high regardless of max
readahead size; the extra memory footprint is mainly caused by
enlarged mmap read-around.
I measured my desktop:
- under Xwindow:
128KB readahead cache hit ratio = 143MB/230MB = 62%
512KB readahead cache hit ratio = 138MB/248MB = 55%
- under console: (seems more stable than the Xwindow data)
128KB readahead cache hit ratio = 30MB/56MB = 53%
1MB readahead cache hit ratio = 30MB/59MB = 51%
So the impact on memory footprint looks acceptable.
- readahead thrashing
It will now cost a 1MB readahead buffer per stream. Memory-tight systems
typically do not run multiple streams; but if they do, it should still help
I/O performance as long as we can avoid thrashing, which can be achieved
with the following patches.
[PATCH 04/11] readahead: replace ra->mmap_miss with ra->ra_flags
[PATCH 05/11] readahead: retain inactive lru pages to be accessed soon
[PATCH 06/11] readahead: thrashing safe context readahead
This is a major rewrite of the readahead algorithm, so I did careful tests with
the following tracing/stats patches:
[PATCH 07/11] readahead: record readahead patterns
[PATCH 08/11] readahead: add tracing event
[PATCH 09/11] readahead: add /debug/readahead/stats
I verified the new readahead behavior on various access patterns, and
stress-tested the thrashing safety by running 300 streams with mem=128M.
Only 2031/61325 = 3.3% of the readahead windows were thrashed (due to
workload variation):
# cat /debug/readahead/stats
pattern     readahead  eof_hit  cache_hit     io  sync_io  mmap_io  size  async_size  io_size
initial            20        9          4     20       20       12    73          37       35
subsequent          3        3          0      1        0        1     8           8        1
context         61325        1       5479  61325     6788        5    14           2       13
thrash           2031        0       1222   2031     2031        0     9           0        6
around            235       90        142    235      235      235    60           0       19
fadvise             0        0          0      0        0        0     0           0        0
random            223      133          0     91       91        1     1           0        1
all             63837      236       6847  63703     9165        0    14           2       13
And the readahead inside a single stream is working as expected:
# grep streams-3162 /debug/tracing/trace
streams-3162 [000] 8602.455953: readahead: readahead-context(dev=0:2, ino=0, req=287352+1, ra=287354+10-2, async=1) = 10
streams-3162 [000] 8602.907873: readahead: readahead-context(dev=0:2, ino=0, req=287362+1, ra=287364+20-3, async=1) = 20
streams-3162 [000] 8604.027879: readahead: readahead-context(dev=0:2, ino=0, req=287381+1, ra=287384+14-2, async=1) = 14
streams-3162 [000] 8604.754722: readahead: readahead-context(dev=0:2, ino=0, req=287396+1, ra=287398+10-2, async=1) = 10
streams-3162 [000] 8605.191228: readahead: readahead-context(dev=0:2, ino=0, req=287406+1, ra=287408+18-3, async=1) = 18
streams-3162 [000] 8606.831895: readahead: readahead-context(dev=0:2, ino=0, req=287423+1, ra=287426+12-2, async=1) = 12
streams-3162 [000] 8606.919614: readahead: readahead-thrash(dev=0:2, ino=0, req=287425+1, ra=287425+8-0, async=0) = 1
streams-3162 [000] 8607.545016: readahead: readahead-context(dev=0:2, ino=0, req=287436+1, ra=287438+9-2, async=1) = 9
streams-3162 [000] 8607.960039: readahead: readahead-context(dev=0:2, ino=0, req=287445+1, ra=287447+18-3, async=1) = 18
streams-3162 [000] 8608.790973: readahead: readahead-context(dev=0:2, ino=0, req=287462+1, ra=287465+21-3, async=1) = 21
streams-3162 [000] 8609.763138: readahead: readahead-context(dev=0:2, ino=0, req=287483+1, ra=287486+15-2, async=1) = 15
streams-3162 [000] 8611.467401: readahead: readahead-context(dev=0:2, ino=0, req=287499+1, ra=287501+11-2, async=1) = 11
streams-3162 [000] 8642.512413: readahead: readahead-context(dev=0:2, ino=0, req=288053+1, ra=288056+10-2, async=1) = 10
streams-3162 [000] 8643.246618: readahead: readahead-context(dev=0:2, ino=0, req=288064+1, ra=288066+22-3, async=1) = 22
streams-3162 [000] 8644.278613: readahead: readahead-context(dev=0:2, ino=0, req=288085+1, ra=288088+16-3, async=1) = 16
streams-3162 [000] 8644.395782: readahead: readahead-context(dev=0:2, ino=0, req=288087+1, ra=288087+21-3, async=0) = 5
streams-3162 [000] 8645.109918: readahead: readahead-context(dev=0:2, ino=0, req=288101+1, ra=288108+8-1, async=1) = 8
streams-3162 [000] 8645.285078: readahead: readahead-context(dev=0:2, ino=0, req=288105+1, ra=288116+8-1, async=1) = 8
streams-3162 [000] 8645.731794: readahead: readahead-context(dev=0:2, ino=0, req=288115+1, ra=288122+14-2, async=1) = 13
streams-3162 [000] 8646.114250: readahead: readahead-context(dev=0:2, ino=0, req=288123+1, ra=288136+8-1, async=1) = 8
streams-3162 [000] 8646.626320: readahead: readahead-context(dev=0:2, ino=0, req=288134+1, ra=288144+16-3, async=1) = 16
streams-3162 [000] 8647.035721: readahead: readahead-context(dev=0:2, ino=0, req=288143+1, ra=288160+10-2, async=1) = 10
streams-3162 [000] 8647.693082: readahead: readahead-context(dev=0:2, ino=0, req=288157+1, ra=288165+12-2, async=1) = 8
streams-3162 [000] 8648.221368: readahead: readahead-context(dev=0:2, ino=0, req=288168+1, ra=288177+15-2, async=1) = 15
streams-3162 [000] 8649.280800: readahead: readahead-context(dev=0:2, ino=0, req=288190+1, ra=288192+23-3, async=1) = 23
[...]
Btw, Linus suggested disabling start-of-file readahead if lseek() has been called:
[PATCH 10/11] readahead: dont do start-of-file readahead after lseek()
Finally, the updated context readahead will do more radix tree scans, so
radix_tree_prev_hole() needs to be optimized:
[PATCH 11/11] radixtree: speed up next/prev hole search
On average, it reduces 8*64 level-0 slot searches to 32 level-0 slot searches
plus 8 level-1 node searches.
Thanks,
Fengguang
Hi Fengguang,
I was doing a quick test with the patches. I was using fio to run some
sequential reader threads. I have access to one LUN from an HP EVA.
In my case it looks like throughput has come down with the patches.
Following are the results.
Kernel=2.6.33-rc5 Workload=bsr iosched=cfq Filesz=1G bs=32K
AVERAGE
-------
job Set NR ReadBW(KB/s) MaxClat(us) WriteBW(KB/s) MaxClat(us)
--- --- -- ------------ ----------- ------------- -----------
bsr 3 1 141768 130965 0 0
bsr 3 2 131979 135402 0 0
bsr 3 4 132351 420733 0 0
bsr 3 8 133152 455434 0 0
bsr 3 16 130316 674499 0 0
Kernel=2.6.33-rc5-readahead Workload=bsr iosched=cfq Filesz=1G bs=32K
AVERAGE
-------
job Set NR ReadBW(KB/s) MaxClat(us) WriteBW(KB/s) MaxClat(us)
--- --- -- ------------ ----------- ------------- -----------
bsr 3 1 84749.3 53213 0 0
bsr 3 2 83189.7 157473 0 0
bsr 3 4 77583.3 330030 0 0
bsr 3 8 88545.7 378201 0 0
bsr 3 16 95331.7 482657 0 0
I ran an increasing number of sequential readers. The file system is ext3 and
the file size is 1G.
I have run the tests 3 times (3 sets) and taken the average.
Thanks
Vivek
I ran the same test on a different piece of hardware. There are a few SATA
disks (5-6) in a striped configuration behind a hardware RAID controller. Here
I do see an improvement in sequential reader performance with the patches.
Kernel=2.6.33-rc5 Workload=bsr iosched=cfq Filesz=1G bs=32K
=========================================================================
AVERAGE
-------
job Set NR ReadBW(KB/s) MaxClat(us) WriteBW(KB/s) MaxClat(us)
--- --- -- ------------ ----------- ------------- -----------
bsr 3 1 147569 14369.7 0 0
bsr 3 2 124716 243932 0 0
bsr 3 4 123451 327665 0 0
bsr 3 8 122486 455102 0 0
bsr 3 16 117645 1.03957e+06 0 0
Kernel=2.6.33-rc5-readahead Workload=bsr iosched=cfq Filesz=1G bs=32K
=========================================================================
AVERAGE
-------
job Set NR ReadBW(KB/s) MaxClat(us) WriteBW(KB/s) MaxClat(us)
--- --- -- ------------ ----------- ------------- -----------
bsr 3 1 160191 22752 0 0
bsr 3 2 149343 184698 0 0
bsr 3 4 147183 430875 0 0
bsr 3 8 144568 484045 0 0
bsr 3 16 137485 1.06257e+06 0 0
On Wed, Feb 03, 2010 at 06:38:03AM +0800, Vivek Goyal wrote:
> On Tue, Feb 02, 2010 at 11:28:35PM +0800, Wu Fengguang wrote:
> > Andrew,
> >
> > This is to lift default readahead size to 512KB, which I believe yields
> > more I/O throughput without noticeably increasing I/O latency for today's HDD.
> >
>
> Hi Fengguang,
>
> I was doing a quick test with the patches. I was using fio to run some
> sequential reader threads. I have got one access to one Lun from an HP
> EVA. In my case it looks like with the patches throughput has come down.
Thank you for the quick testing!
This patchset does 3 things:
1) 512K readahead size
2) new readahead algorithms
3) new readahead tracing/stats interfaces
(1) will impact performance, while (2) _might_ impact performance if there
are bugs.
Would you kindly retest the patchset with readahead size manually set
to 128KB? That would help identify the root cause of the performance
drop:
DEV=sda
echo 128 > /sys/block/$DEV/queue/read_ahead_kb
The readahead stats provided by the patchset are very useful for
analyzing the problem:
mount -t debugfs none /debug
# for each benchmark:
echo > /debug/readahead/stats # reset counters
# do benchmark
cat /debug/readahead/stats # check counters
I have two paths to the HP EVA and a multipath device set up (dm-3). I
noticed that with the vanilla kernel read_ahead_kb=128 after boot, but with
your patches applied it is set to 4. So it looks like something went wrong
with the device size/capacity detection, hence the wrong defaults. Manually
setting read_ahead_kb=512 got me better performance as compared to the
vanilla kernel.
AVERAGE[bsr]
-------
job Set NR ReadBW(KB/s) MaxClat(us) WriteBW(KB/s) MaxClat(us)
--- --- -- ------------ ----------- ------------- -----------
bsr 3 1 190302 97937.3 0 0
bsr 3 2 185636 223286 0 0
bsr 3 4 185986 363658 0 0
bsr 3 8 184352 428478 0 0
bsr 3 16 185646 594311 0 0
Thanks
Vivek
I put a printk in add_disk() and noticed that for the multipath device get_capacity() is returning 0, and that's why ra_pages is being set to 1.
Thanks
Vivek
> > I have two paths to the HP EVA and a multipath device set up (dm-3). I
> > noticed that with the vanilla kernel read_ahead_kb=128 after boot, but with
> > your patches applied it is set to 4. So it looks like something went wrong
> > with the device size/capacity detection, hence the wrong defaults. Manually
> > setting read_ahead_kb=512 got me better performance as compared to the
> > vanilla kernel.
> >
>
> I put a printk in add_disk() and noticed that for the multipath device
> get_capacity() is returning 0, and that's why ra_pages is being set
> to 1.
Good catch, thanks!
It makes no sense to limit the readahead size for multipath or other
compound devices. So we can just ignore the get_capacity() == 0 case,
as in the following updated patch.
Thanks,
Fengguang
---
readahead: limit readahead size for small devices
Linus reports a _really_ small & slow (505kB, 15kB/s) USB device,
on which blkid runs unpleasantly slowly. He managed to optimize the blkid
reads down to 1kB+16kB, but kernel readahead still turns that into 48kB.
lseek 0, read 1024 => readahead 4 pages (start of file)
lseek 1536, read 16384 => readahead 8 pages (page contiguous)
The readahead heuristics involved here are reasonable ones in general.
So it's good to fix blkid with fadvise(RANDOM), as Linus already did.
For the kernel part, Linus suggests:
So maybe we could be less aggressive about read-ahead when the size of
the device is small? Turning a 16kB read into a 64kB one is a big deal,
when it's about 15% of the whole device!
This looks reasonable: smaller devices tend to be slower (USB sticks as
well as micro/mobile/old hard disks).
Given that the non-rotational attribute is not always reported, we can
take disk size as a max readahead size hint. This patch uses a formula
that generates the following concrete limits:
disk size readahead size
(scale by 4) (scale by 2)
1M 8k
4M 16k
16M 32k
64M 64k
256M 128k
1G 256k
--------------------------- (*)
4G 512k
16G 1024k
64G 2048k
256G 4096k
(*) Since the default readahead size is 512k, this limit only takes
effect for devices whose size is less than 4G.
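As a worked check of the limit formula (a standalone userspace sketch, not
the kernel code; it open-codes ilog2() and assumes 512-byte sectors and 4k
pages):

#include <stdio.h>

/* Sketch of the limit computed by the add_disk() hunk below */
static unsigned long ra_limit_pages(unsigned long long bytes)
{
	unsigned long size = (unsigned long)((bytes >> 9) >> 9); /* get_capacity() >> 9 */
	unsigned int order = 0;

	while (size >>= 1)	/* open-coded ilog2() */
		order++;
	return 1UL << (order / 2);
}

int main(void)
{
	/* 1G disk: 2097152 sectors, size 4096, 1 << (12/2) = 64 pages = 256k */
	printf("%lu pages\n", ra_limit_pages(1ULL << 30));
	return 0;
}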
The formula was determined from the following data, collected by this script:
#!/bin/sh
# please make sure BDEV is not mounted or opened by others
BDEV=sdb
for rasize in 4 16 32 64 128 256 512 1024 2048 4096 8192
do
echo $rasize > /sys/block/$BDEV/queue/read_ahead_kb
time dd if=/dev/$BDEV of=/dev/null bs=4k count=102400
done
The principle is that the formula shall not limit the readahead size to such
a degree that it hurts any device's sequential read performance.
The Intel SSD is special in that its throughput increases steadily with
larger readahead sizes. However, it may take years for Linux to raise its
default readahead size to 2MB, so the formula does not try to accommodate it.
SSD 80G Intel x25-M SSDSA2M080 (reported by Li Shaohua)
rasize 1st run 2nd run
----------------------------------
4k 123 MB/s 122 MB/s
16k 153 MB/s 153 MB/s
32k 161 MB/s 162 MB/s
64k 167 MB/s 168 MB/s
128k 197 MB/s 197 MB/s
256k 217 MB/s 217 MB/s
512k 238 MB/s 234 MB/s
1M 251 MB/s 248 MB/s
2M 259 MB/s 257 MB/s
==> 4M 269 MB/s 264 MB/s
8M 266 MB/s 266 MB/s
Note that ==> points to the readahead size that yields plateau throughput.
SSD 22G MARVELL SD88SA02 MP1F (reported by Jens Axboe)
rasize 1st 2nd
--------------------------------
4k 41 MB/s 41 MB/s
16k 85 MB/s 81 MB/s
32k 102 MB/s 109 MB/s
64k 125 MB/s 144 MB/s
128k 183 MB/s 185 MB/s
256k 216 MB/s 216 MB/s
512k 216 MB/s 236 MB/s
1024k 251 MB/s 252 MB/s
2M 258 MB/s 258 MB/s
==> 4M 266 MB/s 266 MB/s
8M 266 MB/s 266 MB/s
SSD 30G SanDisk SATA 5000
4k 29.6 MB/s 29.6 MB/s 29.6 MB/s
16k 52.1 MB/s 52.1 MB/s 52.1 MB/s
32k 61.5 MB/s 61.5 MB/s 61.5 MB/s
64k 67.2 MB/s 67.2 MB/s 67.1 MB/s
128k 71.4 MB/s 71.3 MB/s 71.4 MB/s
256k 73.4 MB/s 73.4 MB/s 73.3 MB/s
==> 512k 74.6 MB/s 74.6 MB/s 74.6 MB/s
1M 74.7 MB/s 74.6 MB/s 74.7 MB/s
2M 76.1 MB/s 74.6 MB/s 74.6 MB/s
USB stick 32G Teclast CoolFlash idVendor=1307, idProduct=0165
4k 7.9 MB/s 7.9 MB/s 7.9 MB/s
16k 17.9 MB/s 17.9 MB/s 17.9 MB/s
32k 24.5 MB/s 24.5 MB/s 24.5 MB/s
64k 28.7 MB/s 28.7 MB/s 28.7 MB/s
128k 28.8 MB/s 28.9 MB/s 28.9 MB/s
==> 256k 30.5 MB/s 30.5 MB/s 30.5 MB/s
512k 30.9 MB/s 31.0 MB/s 30.9 MB/s
1M 31.0 MB/s 30.9 MB/s 30.9 MB/s
2M 30.9 MB/s 30.9 MB/s 30.9 MB/s
USB stick 4G SanDisk Cruzer idVendor=0781, idProduct=5151
4k 6.4 MB/s 6.4 MB/s 6.4 MB/s
16k 13.4 MB/s 13.4 MB/s 13.2 MB/s
32k 17.8 MB/s 17.9 MB/s 17.8 MB/s
64k 21.3 MB/s 21.3 MB/s 21.2 MB/s
128k 21.4 MB/s 21.4 MB/s 21.4 MB/s
==> 256k 23.3 MB/s 23.2 MB/s 23.2 MB/s
512k 23.3 MB/s 23.8 MB/s 23.4 MB/s
1M 23.8 MB/s 23.4 MB/s 23.3 MB/s
2M 23.4 MB/s 23.2 MB/s 23.4 MB/s
USB stick 2G idVendor=0204, idProduct=6025 SerialNumber: 08082005000113
4k 6.7 MB/s 6.9 MB/s 6.7 MB/s
16k 11.7 MB/s 11.7 MB/s 11.7 MB/s
32k 12.4 MB/s 12.4 MB/s 12.4 MB/s
64k 13.4 MB/s 13.4 MB/s 13.4 MB/s
128k 13.4 MB/s 13.4 MB/s 13.4 MB/s
==> 256k 13.6 MB/s 13.6 MB/s 13.6 MB/s
512k 13.7 MB/s 13.7 MB/s 13.7 MB/s
1M 13.7 MB/s 13.7 MB/s 13.7 MB/s
2M 13.7 MB/s 13.7 MB/s 13.7 MB/s
64 MB, USB full speed (collected by Clemens Ladisch)
Bus 003 Device 003: ID 08ec:0011 M-Systems Flash Disk Pioneers DiskOnKey
4KB: 139.339 s, 376 kB/s
16KB: 81.0427 s, 647 kB/s
32KB: 71.8513 s, 730 kB/s
==> 64KB: 67.3872 s, 778 kB/s
128KB: 67.5434 s, 776 kB/s
256KB: 65.9019 s, 796 kB/s
512KB: 66.2282 s, 792 kB/s
1024KB: 67.4632 s, 777 kB/s
2048KB: 69.9759 s, 749 kB/s
CC: Li Shaohua <shaoh...@intel.com>
CC: Clemens Ladisch <cle...@ladisch.de>
Acked-by: Jens Axboe <jens....@oracle.com>
Tested-by: Vivek Goyal <vgo...@redhat.com>
Tested-by: Linus Torvalds <torv...@linux-foundation.org>
Signed-off-by: Wu Fengguang <fenggu...@intel.com>
---
block/genhd.c | 24 ++++++++++++++++++++++++
1 file changed, 24 insertions(+)
--- linux.orig/block/genhd.c 2010-02-03 20:40:37.000000000 +0800
+++ linux/block/genhd.c 2010-02-04 21:19:07.000000000 +0800
@@ -518,6 +518,7 @@ void add_disk(struct gendisk *disk)
struct backing_dev_info *bdi;
dev_t devt;
int retval;
+ unsigned long size;
/* minors == 0 indicates to use ext devt from part0 and should
* be accompanied with EXT_DEVT flag. Make sure all
@@ -551,6 +552,29 @@ void add_disk(struct gendisk *disk)
retval = sysfs_create_link(&disk_to_dev(disk)->kobj, &bdi->dev->kobj,
"bdi");
WARN_ON(retval);
+
+ /*
+ * Limit default readahead size for small devices.
+ * disk size readahead size
+ * 1M 8k
+ * 4M 16k
+ * 16M 32k
+ * 64M 64k
+ * 256M 128k
+ * 1G 256k
+ * ---------------------------
+ * 4G 512k
+ * 16G 1024k
+ * 64G 2048k
+ * 256G 4096k
+ * Since the default readahead size is 512k, this limit
+ * only takes effect for devices whose size is less than 4G.
+ */
+ if (get_capacity(disk)) {
+ size = get_capacity(disk) >> 9;
+ size = 1UL << (ilog2(size) / 2);
+ bdi->ra_pages = min(bdi->ra_pages, size);
+ }
}
EXPORT_SYMBOL(add_disk);
> I have two paths to the HP EVA and a multipath device set up (dm-3). I
> noticed that with the vanilla kernel read_ahead_kb=128 after boot, but with
> your patches applied it is set to 4. So it looks like something went wrong
> with the device size/capacity detection, hence the wrong defaults. Manually
> setting read_ahead_kb=512 got me better performance as compared to the
> vanilla kernel.
>
> AVERAGE[bsr]
> -------
> job Set NR ReadBW(KB/s) MaxClat(us) WriteBW(KB/s) MaxClat(us)
> --- --- -- ------------ ----------- ------------- -----------
> bsr 3 1 190302 97937.3 0 0
> bsr 3 2 185636 223286 0 0
> bsr 3 4 185986 363658 0 0
> bsr 3 8 184352 428478 0 0
> bsr 3 16 185646 594311 0 0
This looks good, thank you for the data! I added them to the changelog :)
Thanks,
Fengguang
---
readahead: bump up the default readahead size
Use a 512kb max readahead size and a 32kb min readahead size.
The former helps I/O performance for common workloads.
The latter will be used in the thrashing-safe context readahead.
-- Rationale for the 512kb size --
I believe it yields more I/O throughput without noticeably increasing
I/O latency for today's HDD.
For example, for a 100MB/s and 8ms access time HDD:
io_size(KB)  access_time(ms)  transfer_time(ms)  io_latency(ms)   util%  throughput(KB/s)
          4                8               0.04            8.04   0.49%            497.57
          8                8               0.08            8.08   0.97%            990.33
         16                8               0.16            8.16   1.92%           1961.69
         32                8               0.31            8.31   3.76%           3849.62
         64                8               0.62            8.62   7.25%           7420.29
        128                8               1.25            9.25  13.51%          13837.84
        256                8               2.50           10.50  23.81%          24380.95
        512                8               5.00           13.00  38.46%          39384.62
       1024                8              10.00           18.00  55.56%          56888.89
       2048                8              20.00           28.00  71.43%          73142.86
       4096                8              40.00           48.00  83.33%          85333.33
Going from a 128KB to a 512KB readahead size boosts I/O throughput from
~13MB/s to ~39MB/s, while only increasing the (minimal) I/O latency from
9.25ms to 13ms.
As for SSDs, I find that the Intel X25-M SSD benefits from a large readahead
size even for sequential reads:
rasize 1st run 2nd run
----------------------------------
4k 123 MB/s 122 MB/s
16k 153 MB/s 153 MB/s
32k 161 MB/s 162 MB/s
64k 167 MB/s 168 MB/s
128k 197 MB/s 197 MB/s
256k 217 MB/s 217 MB/s
512k 238 MB/s 234 MB/s
1M 251 MB/s 248 MB/s
2M 259 MB/s 257 MB/s
4M 269 MB/s 264 MB/s
8M 266 MB/s 266 MB/s
The two other impacts of an enlarged readahead size are
- memory footprint (caused by readahead miss)
Sequential readahead hit ratio is pretty high regardless of max
readahead size; the extra memory footprint is mainly caused by
enlarged mmap read-around.
I measured my desktop:
- under Xwindow:
128KB readahead hit ratio = 143MB/230MB = 62%
512KB readahead hit ratio = 138MB/248MB = 55%
1MB readahead hit ratio = 130MB/253MB = 51%
- under console: (seems more stable than the Xwindow data)
128KB readahead hit ratio = 30MB/56MB = 53%
1MB readahead hit ratio = 30MB/59MB = 51%
So the impact on memory footprint looks acceptable.
- readahead thrashing
It will now cost 1MB readahead buffer per stream. Memory tight
systems typically do not run multiple streams; but if they do
so, it should help I/O performance as long as we can avoid
thrashing, which can be achieved with the following patches.
-- Benchmarks by Vivek Goyal --
I have two paths to the HP EVA and a multipath device set up (dm-3).
I ran an increasing number of sequential readers. The file system is ext3 and
the file size is 1G.
I have run the tests 3 times (3 sets) and taken the average.
Workload=bsr iosched=cfq Filesz=1G bs=32K
======================================================================
2.6.33-rc5 2.6.33-rc5-readahead
job Set NR ReadBW(KB/s) MaxClat(us) ReadBW(KB/s) MaxClat(us)
--- --- -- ------------ ----------- ------------ -----------
bsr 3 1 141768 130965 190302 97937.3
bsr 3 2 131979 135402 185636 223286
bsr 3 4 132351 420733 185986 363658
bsr 3 8 133152 455434 184352 428478
bsr 3 16 130316 674499 185646 594311
I ran same test on a different piece of hardware. There are few SATA disks
(5-6) in striped configuration behind a hardware RAID controller.
Workload=bsr iosched=cfq Filesz=1G bs=32K
======================================================================
2.6.33-rc5 2.6.33-rc5-readahead
job Set NR ReadBW(KB/s) MaxClat(us) ReadBW(KB/s) MaxClat(us)
--- --- -- ------------ ----------- ------------ -----------
bsr 3 1 147569 14369.7 160191 22752
bsr 3 2 124716 243932 149343 184698
bsr 3 4 123451 327665 147183 430875
bsr 3 8 122486 455102 144568 484045
bsr 3 16 117645 1.03957e+06 137485 1.06257e+06
Tested-by: Vivek Goyal <vgo...@redhat.com>
CC: Jens Axboe <jens....@oracle.com>
CC: Peter Zijlstra <a.p.zi...@chello.nl>
CC: Martin Schwidefsky <schwi...@de.ibm.com>
CC: Christian Ehrhardt <ehrh...@linux.vnet.ibm.com>
Signed-off-by: Wu Fengguang <fenggu...@intel.com>
---
include/linux/mm.h | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
--- linux.orig/include/linux/mm.h 2010-01-30 17:38:49.000000000 +0800
+++ linux/include/linux/mm.h 2010-01-30 18:09:58.000000000 +0800
@@ -1184,8 +1184,8 @@ int write_one_page(struct page *page, in
void task_dirty_inc(struct task_struct *tsk);
/* readahead.c */
-#define VM_MAX_READAHEAD 128 /* kbytes */
-#define VM_MIN_READAHEAD 16 /* kbytes (includes current page) */
+#define VM_MAX_READAHEAD 512 /* kbytes */
+#define VM_MIN_READAHEAD 32 /* kbytes (includes current page) */
int force_page_cache_readahead(struct address_space *mapping, struct file *filp,
pgoff_t offset, unsigned long nr_to_read);
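For context (not part of the diff, and written from memory, so treat it as an
approximation): these macros reach bdi->ra_pages via code roughly like

	/* e.g. block/blk-settings.c / mm/backing-dev.c set, approximately: */
	bdi->ra_pages = VM_MAX_READAHEAD * 1024 / PAGE_CACHE_SIZE;
	/* with 4k pages: 512 * 1024 / 4096 = 128 pages */

so with this patch applied the new default shows up as read_ahead_kb=512
after boot.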
Thanks. This patch fixes the issue of read_ahead_kb being set to 4kb on device
mapper targets.
Thanks
Vivek