I know this is very last minute but I believe we should consider disabling
the "low_latency" tunable for block devices by default for 2.6.32. There was
evidence that low_latency was a problem last week for page allocation failure
reports but the reproduction-case was unusual and involved high-order atomic
allocations in low-memory conditions. It took another few days to accurately
show the problem for more normal workloads and it's a bit more wide-spread
than just allocation failures.
Basically, low_latency looks great as long as you have plenty of memory
but in low memory situations, it appears to cause problems that manifest
as reduced performance, desktop stalls and in some cases, page allocation
failures. I think most kernel developers are not seeing the problem as they
tend to test on beefier machines and without hitting swap or low-memory
situations for the most part. When they are hitting low-memory situations,
it tends to be for stress tests where stalls and low performance are expected.
To show the problem, I used an x86-64 machine booted with 512MB of
memory. This is a small amount of RAM but the bug reports related to page
allocation failures were on smallish machines and the disks in the system
are not very high-performance.
I used three tests. The first was sysbench on postgres running an IO-heavy
test against a large database with 10,000,000 rows. The second was IOZone
running most of the automatic tests with a record length of 4KB and the
last was a simulated launching of gitk with a music player running in the
background to act as a desktop-like scenario. The final test was similar
to the test described here http://lwn.net/Articles/362184/ except that
dm-crypt was not used as it has its own problems.
Sysbench results look as follows
sysbench-with sysbench-without
low-latency low-latency
1 1266.02 ( 0.00%) 1278.55 ( 0.98%)
2 1182.58 ( 0.00%) 1379.25 (14.26%)
3 1257.08 ( 0.00%) 1580.08 (20.44%)
4 1212.11 ( 0.00%) 1534.17 (20.99%)
5 1046.77 ( 0.00%) 1552.48 (32.57%)
6 1187.14 ( 0.00%) 1661.19 (28.54%)
7 1179.37 ( 0.00%) 790.26 (-49.24%)
8 1164.62 ( 0.00%) 854.10 (-36.36%)
9 1125.04 ( 0.00%) 1655.04 (32.02%)
10 1147.52 ( 0.00%) 1653.89 (30.62%)
11 823.38 ( 0.00%) 1627.45 (49.41%)
12 813.73 ( 0.00%) 1494.63 (45.56%)
13 898.22 ( 0.00%) 1521.64 (40.97%)
14 873.50 ( 0.00%) 1311.09 (33.38%)
15 808.32 ( 0.00%) 1009.70 (19.94%)
16 758.17 ( 0.00%) 725.17 (-4.55%)
The first column is threads. Disabling low_latency performs much better
for the most part. I should point out that with plenty of memory, sysbench
tends to perform better *with* low_latency but as we're seeing page allocation
failure reports in low memory situations and desktop stalls, the lower memory
situation is also important.
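A note on reading the percentage columns: for throughput figures such as these, where higher is better, the bracketed values appear to be the difference expressed as a fraction of the new result, i.e. (result - baseline) / result. That formula is inferred from the numbers in the table, not taken from the reporting script, and the function names below are made up for illustration. A minimal sketch under that assumption:

```c
#include <assert.h>
#include <stdio.h>

/*
 * ASSUMPTION: formula inferred from the published rows, not from the
 * original reporting script. Gain of `result` over `baseline`, as a
 * percentage of `result`. Positive means `result` is higher.
 */
double relative_gain_pct(double baseline, double result)
{
	assert(result > 0.0);
	return (result - baseline) / result * 100.0;
}

/* Print one row in roughly the layout the tables use. */
void print_row(int threads, double baseline, double result)
{
	printf("%2d %9.2f ( 0.00%%) %9.2f (%5.2f%%)\n",
	       threads, baseline, result,
	       relative_gain_pct(baseline, result));
}
```

Calling print_row(2, 1182.58, 1379.25) should give a percentage that rounds to the 14.26% shown for two threads.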
The IOZone results are long I'm afraid.
iozone-with iozone-without
low-latency low-latency
write-64 151212 ( 0.00%) 159856 ( 5.41%)
write-128 189357 ( 0.00%) 206233 ( 8.18%)
write-256 219883 ( 0.00%) 223174 ( 1.47%)
write-512 224932 ( 0.00%) 220227 (-2.14%)
write-1024 227738 ( 0.00%) 226155 (-0.70%)
write-2048 227564 ( 0.00%) 224848 (-1.21%)
write-4096 208556 ( 0.00%) 223430 ( 6.66%)
write-8192 219484 ( 0.00%) 219389 (-0.04%)
write-16384 206670 ( 0.00%) 206295 (-0.18%)
write-32768 203023 ( 0.00%) 201852 (-0.58%)
write-65536 162134 ( 0.00%) 189173 (14.29%)
write-131072 68534 ( 0.00%) 67417 (-1.66%)
write-262144 32936 ( 0.00%) 27750 (-18.69%)
write-524288 24044 ( 0.00%) 23759 (-1.20%)
rewrite-64 755681 ( 0.00%) 755681 ( 0.00%)
rewrite-128 581518 ( 0.00%) 799840 (27.30%)
rewrite-256 639427 ( 0.00%) 659861 ( 3.10%)
rewrite-512 669577 ( 0.00%) 684954 ( 2.24%)
rewrite-1024 680960 ( 0.00%) 686182 ( 0.76%)
rewrite-2048 685263 ( 0.00%) 692780 ( 1.09%)
rewrite-4096 631352 ( 0.00%) 643266 ( 1.85%)
rewrite-8192 442146 ( 0.00%) 442624 ( 0.11%)
rewrite-16384 428641 ( 0.00%) 432613 ( 0.92%)
rewrite-32768 425361 ( 0.00%) 430568 ( 1.21%)
rewrite-65536 405183 ( 0.00%) 389242 (-4.10%)
rewrite-131072 66110 ( 0.00%) 58472 (-13.06%)
rewrite-262144 29254 ( 0.00%) 29306 ( 0.18%)
rewrite-524288 23812 ( 0.00%) 24543 ( 2.98%)
read-64 934589 ( 0.00%) 840903 (-11.14%)
read-128 1601534 ( 0.00%) 1280633 (-25.06%)
read-256 1255511 ( 0.00%) 1310683 ( 4.21%)
read-512 1291158 ( 0.00%) 1319723 ( 2.16%)
read-1024 1319408 ( 0.00%) 1347557 ( 2.09%)
read-2048 1316016 ( 0.00%) 1347393 ( 2.33%)
read-4096 1253710 ( 0.00%) 1251882 (-0.15%)
read-8192 995149 ( 0.00%) 1011794 ( 1.65%)
read-16384 883156 ( 0.00%) 897458 ( 1.59%)
read-32768 844368 ( 0.00%) 856364 ( 1.40%)
read-65536 816099 ( 0.00%) 826473 ( 1.26%)
read-131072 818055 ( 0.00%) 824351 ( 0.76%)
read-262144 827225 ( 0.00%) 835693 ( 1.01%)
read-524288 24653 ( 0.00%) 22519 (-9.48%)
reread-64 2329708 ( 0.00%) 1985134 (-17.36%)
reread-128 1446222 ( 0.00%) 2137031 (32.33%)
reread-256 1828508 ( 0.00%) 1879725 ( 2.72%)
reread-512 1521718 ( 0.00%) 1579934 ( 3.68%)
reread-1024 1347557 ( 0.00%) 1375171 ( 2.01%)
reread-2048 1340664 ( 0.00%) 1350783 ( 0.75%)
reread-4096 1259592 ( 0.00%) 1284839 ( 1.96%)
reread-8192 1007285 ( 0.00%) 1011317 ( 0.40%)
reread-16384 891404 ( 0.00%) 905022 ( 1.50%)
reread-32768 850492 ( 0.00%) 862772 ( 1.42%)
reread-65536 836565 ( 0.00%) 847020 ( 1.23%)
reread-131072 844516 ( 0.00%) 853155 ( 1.01%)
reread-262144 851524 ( 0.00%) 860653 ( 1.06%)
reread-524288 24927 ( 0.00%) 22487 (-10.85%)
randread-64 1605256 ( 0.00%) 1775099 ( 9.57%)
randread-128 1179358 ( 0.00%) 1528576 (22.85%)
randread-256 1421755 ( 0.00%) 1310683 (-8.47%)
randread-512 1306873 ( 0.00%) 1281909 (-1.95%)
randread-1024 1201314 ( 0.00%) 1231629 ( 2.46%)
randread-2048 1179413 ( 0.00%) 1190529 ( 0.93%)
randread-4096 1107005 ( 0.00%) 1116792 ( 0.88%)
randread-8192 894337 ( 0.00%) 899487 ( 0.57%)
randread-16384 783760 ( 0.00%) 791341 ( 0.96%)
randread-32768 740498 ( 0.00%) 743511 ( 0.41%)
randread-65536 721640 ( 0.00%) 728139 ( 0.89%)
randread-131072 715284 ( 0.00%) 720825 ( 0.77%)
randread-262144 709855 ( 0.00%) 714943 ( 0.71%)
randread-524288 394 ( 0.00%) 431 ( 8.58%)
randwrite-64 730988 ( 0.00%) 730988 ( 0.00%)
randwrite-128 746459 ( 0.00%) 742331 (-0.56%)
randwrite-256 695778 ( 0.00%) 727850 ( 4.41%)
randwrite-512 666253 ( 0.00%) 691126 ( 3.60%)
randwrite-1024 651223 ( 0.00%) 659625 ( 1.27%)
randwrite-2048 655558 ( 0.00%) 664073 ( 1.28%)
randwrite-4096 635556 ( 0.00%) 642400 ( 1.07%)
randwrite-8192 467357 ( 0.00%) 469734 ( 0.51%)
randwrite-16384 413188 ( 0.00%) 417282 ( 0.98%)
randwrite-32768 404161 ( 0.00%) 407580 ( 0.84%)
randwrite-65536 379372 ( 0.00%) 381273 ( 0.50%)
randwrite-131072 21780 ( 0.00%) 19758 (-10.23%)
randwrite-262144 6249 ( 0.00%) 6316 ( 1.06%)
randwrite-524288 2915 ( 0.00%) 2859 (-1.96%)
bkwdread-64 1141196 ( 0.00%) 1141196 ( 0.00%)
bkwdread-128 1066865 ( 0.00%) 1101900 ( 3.18%)
bkwdread-256 877797 ( 0.00%) 1105556 (20.60%)
bkwdread-512 1133103 ( 0.00%) 1162547 ( 2.53%)
bkwdread-1024 1163562 ( 0.00%) 1195962 ( 2.71%)
bkwdread-2048 1163439 ( 0.00%) 1204552 ( 3.41%)
bkwdread-4096 1116792 ( 0.00%) 1150600 ( 2.94%)
bkwdread-8192 912288 ( 0.00%) 934724 ( 2.40%)
bkwdread-16384 817707 ( 0.00%) 829152 ( 1.38%)
bkwdread-32768 775898 ( 0.00%) 787691 ( 1.50%)
bkwdread-65536 759643 ( 0.00%) 772174 ( 1.62%)
bkwdread-131072 763215 ( 0.00%) 773816 ( 1.37%)
bkwdread-262144 765491 ( 0.00%) 780021 ( 1.86%)
bkwdread-524288 3688 ( 0.00%) 3724 ( 0.97%)
The first column is "operation-sizeInKB". The other figures are measured
in operations (-O in iozone). It's a little less clear-cut but disabling
low_latency wins more often than not although many of the gains are small and
in the 1-3% range (or is that considered lots in iozone land?) There were
big gains and losses for some tests but the really big differences were
around 128 bytes so it might be a CPU caching effect.
Running a simulation of multiple instances of gitk and a music player results
in the following
gitk-with gitk-without
low-latency low-latency
min 954.46 ( 0.00%) 640.65 (32.88%)
mean 964.79 ( 0.00%) 655.57 (32.05%)
stddev 10.01 ( 0.00%) 13.33 (-33.18%)
max 981.23 ( 0.00%) 675.65 (31.14%)
The measure is the time taken for the fake-gitk program to complete its job.
Disabling low_latency completes the test far faster. On previous tests,
I had busted networking to do high-order atomic allocations to simulate
wireless cards which are high-order happy. In those tests, disabling
low_latency performed better, produced more stable results, stalled less
(which I think would look like a desktop stall in a normal environment)
and critically, it didn't fail high-order page allocations. i.e. Enabling
low_latency hurts reclaim in some unspecified fashion.
On my laptop (2GB RAM), I find the desktop stalls less when I disable
low_latency in the situation where something kicks off a lot of IO. For
example, if I do a large git operation and switch to a browser while that
is doing its thing, I notice that the desktop sometimes stalls for almost a
second. I do not see this with low_latency disabled but I cannot quantify
this better and it's tricky to reproduce. I also might be fooling myself
because I expect to see problems with low_latency enabled.
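For anyone wanting to repeat the comparison without rebuilding the kernel, the tunable can also be flipped at runtime. A minimal sketch, assuming CFQ is the active scheduler for the device and exposes the knob at /sys/block/<dev>/queue/iosched/low_latency. The helper names are hypothetical, and the write needs root:

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Build the sysfs path of the CFQ low_latency tunable for a block
 * device such as "sda". The path layout is an assumption about the
 * running kernel. Returns 0 on success, -1 on a bad device name or if
 * the buffer is too small.
 */
int low_latency_path(char *buf, size_t len, const char *dev)
{
	int n;

	assert(buf && dev);
	if (strchr(dev, '/'))	/* refuse anything that is not a bare name */
		return -1;
	n = snprintf(buf, len, "/sys/block/%s/queue/iosched/low_latency", dev);
	return (n < 0 || (size_t)n >= len) ? -1 : 0;
}

/*
 * Write 0 or 1 to the tunable at `path`. Returns 0 on success, -1 if
 * the file cannot be opened or written (for example, not root, or CFQ
 * is not the active scheduler).
 */
int set_low_latency(const char *path, int enable)
{
	FILE *f = fopen(path, "w");

	if (!f)
		return -1;
	if (fprintf(f, "%d\n", enable ? 1 : 0) < 0) {
		fclose(f);
		return -1;
	}
	return fclose(f) == 0 ? 0 : -1;
}
```

Typical use would be to build the path for the disk under test, call set_low_latency(path, 0), rerun the workload, and then restore the previous value.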
I regret that I do not have an explanation as to why low_latency causes
problems other than a hunch that low_latency is preventing page writeback
happening fast enough and that causes stalls later. Theories and patches
welcome but if it cannot be resolved, should the following be applied?
Signed-off-by: Mel Gorman <m...@csn.ul.ie>
---
block/cfq-iosched.c | 2 +-
1 files changed, 1 insertions(+), 1 deletions(-)
diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index aa1e953..dc33045 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -2543,7 +2543,7 @@ static void *cfq_init_queue(struct request_queue *q)
 	cfqd->cfq_slice[1] = cfq_slice_sync;
 	cfqd->cfq_slice_async_rq = cfq_slice_async_rq;
 	cfqd->cfq_slice_idle = cfq_slice_idle;
-	cfqd->cfq_latency = 1;
+	cfqd->cfq_latency = 0;
 	cfqd->hw_tag = 1;
 	cfqd->last_end_sync_rq = jiffies;
 	return cfqd;
Ouch. It was bad desktop stalls under heavy write that kicked the whole
thing off.
-Mike
The problem is that 'desktop' means different things for different people
(for some kernel developers 'desktop' is more like 'a workstation' and for
others it is more like 'an embedded device').
--
Bartlomiej Zolnierkiewicz
The stalls I'm talking about were reported for garden variety desktop
PC. I reproduced them on my supermarket special Q6600 desktop PC. That
problem has been with us roughly forever, but I'd hoped it had been
cured. Guess not.
As an idle speculation, I wonder if the sync vs async slice ratios may
not have been knocked out of kilter a bit by giving more to sync.
-Mike
The low latency tunable controls various policies inside cfq.
The one that could affect memory reclaim is:
	/*
	 * Async queues must wait a bit before being allowed dispatch.
	 * We also ramp up the dispatch depth gradually for async IO,
	 * based on the last sync IO we serviced
	 */
	if (!cfq_cfqq_sync(cfqq) && cfqd->cfq_latency) {
		unsigned long last_sync = jiffies - cfqd->last_end_sync_rq;
		unsigned int depth;

		depth = last_sync / cfqd->cfq_slice[1];
		if (!depth && !cfqq->dispatched)
			depth = 1;
		if (depth < max_dispatch)
			max_dispatch = depth;
	}
Here the async queue's maximum depth is limited to 1 for up to 200 ms
after a sync I/O completes.
Note: dirty page writeback goes through an async queue, so it is
penalized by this.
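The ramp-up can be tried out in user space. The sketch below mirrors the arithmetic of the snippet quoted above, with milliseconds standing in for jiffies. The function name, the 100 ms sync slice and the cap of 4 used in the examples are assumptions for illustration, not values read from a running kernel:

```c
#include <assert.h>

/*
 * User-space model of the async throttle. `slice_sync_ms` stands in
 * for cfqd->cfq_slice[1] and `max_dispatch` for the normal dispatch
 * limit; both are supplied by the caller.
 * Returns the number of requests the async queue may have in flight.
 */
unsigned int async_max_dispatch(unsigned long ms_since_sync,
				unsigned long slice_sync_ms,
				unsigned int dispatched,
				unsigned int max_dispatch)
{
	unsigned int depth;

	assert(slice_sync_ms > 0);
	depth = ms_since_sync / slice_sync_ms;
	/* Let one request through if the queue has nothing in flight. */
	if (!depth && !dispatched)
		depth = 1;
	return depth < max_dispatch ? depth : max_dispatch;
}
```

With a 100 ms slice and a cap of 4, the model allows 1 request at 50 ms and at 150 ms after the last sync completion, 2 at 250 ms, and only reaches the full 4 from 400 ms onwards. If a request is already in flight at 50 ms, it allows none.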
This can affect both low- and high-end hardware. My non-NCQ SATA disk
can handle a depth of 2 when writing. NCQ SATA disks can handle a
depth of up to 31, so limiting the depth to 1 can cause a drop in write
performance. This in turn will slow down dirty page reclaim and cause
allocation failures.
It would be good to re-test the OOM conditions with that code commented out.
>
> To show the problem, I used an x86-64 machine booting booted with 512MB of
> memory. This is a small amount of RAM but the bug reports related to page
> allocation failures were on smallish machines and the disks in the system
> are not very high-performance.
>
> I used three tests. The first was sysbench on postgres running an IO-heavy
> test against a large database with 10,000,000 rows. The second was IOZone
> running most of the automatic tests with a record length of 4KB and the
> last was a simulated launching of gitk with a music player running in the
> background to act as a desktop-like scenario. The final test was similar
> to the test described here http://lwn.net/Articles/362184/ except that
> dm-crypt was not used as it has its own problems.
low_latency was tested on other scenarios:
http://lkml.indiana.edu/hypermail/linux/kernel/0910.0/01410.html
http://linux.derkeiler.com/Mailing-Lists/Kernel/2009-11/msg04855.html
where it improved actual and perceived performance, so disabling it
completely may not be good.
Thanks,
Corrado
Will concede that - the term "desktop" is fuzzy at best. The
characteristics of note are a mid-range machine running workloads that
are not steady, have abrupt phase changes and are not very well sized to
the available memory. "Desktops" fall into this category but it's also
possible that badly-or-borderline-provisioned servers would also fall
into it.
>
> The stalls I'm talking about were reported for garden variety desktop
> PC.
The stalls I'm seeing on the laptop are tiny but there. It's perfectly
possible a whole host of stalls for people have been resolved but there
is one corner case.
> I reproduced them on my supermarket special Q6600 desktop PC. That
> problem has been with us roughly forever, but I'd hoped it had been
> cured. Guess not.
>
It's possible the corner case causing stalls is specific to low-memory rather
than writes. Conceivably, what is going wrong is that writes need to complete
for pages to be clean so pages can be reclaimed. The cleaning of pages is
getting pre-empted by sync IO until such point as pages cannot be reclaimed
and they stall allowing writes to complete. I'll prototype something to
disable low_latency if kswapd is awake. If it makes a difference, this
might be plausible.
As Jens would say though, this is "mostly hand-wavy nonsense".
> As an idle speculation, I wonder if the sync vs async slice ratios may
> not have been knocked out of kilter a bit by giving more to sync.
>
I don't know enough to speculate.
--
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab
All of it or just the cfq_latency part?
As it turns out, the disk in the test machine does report NCQ (depth 31/32),
and it's the same on the laptop, so slowing down dirty page cleaning
could be impacting reclaim.
> >
> > To show the problem, I used an x86-64 machine booting booted with 512MB of
> > memory. This is a small amount of RAM but the bug reports related to page
> > allocation failures were on smallish machines and the disks in the system
> > are not very high-performance.
> >
> > I used three tests. The first was sysbench on postgres running an IO-heavy
> > test against a large database with 10,000,000 rows. The second was IOZone
> > running most of the automatic tests with a record length of 4KB and the
> > last was a simulated launching of gitk with a music player running in the
> > background to act as a desktop-like scenario. The final test was similar
> > to the test described here http://lwn.net/Articles/362184/ except that
> > dm-crypt was not used as it has its own problems.
>
> low_latency was tested on other scenarios:
> http://lkml.indiana.edu/hypermail/linux/kernel/0910.0/01410.html
> http://linux.derkeiler.com/Mailing-Lists/Kernel/2009-11/msg04855.html
> where it improved actual and perceived performance, so disabling it
> completely may not be good.
>
It may not indeed.
In case you mean a partial disabling of cfq_latency, I'm trying the
following patch. The intention is to disable the low_latency logic if
kswapd is at work and presumably needs clean pages. Alternative
suggestions welcome.
======
cfq: Do not limit the async queue depth while kswapd is awake
diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index aa1e953..dcab74e 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -1308,7 +1308,7 @@ static bool cfq_may_dispatch(struct cfq_data *cfqd, struct cfq_queue *cfqq)
 	 * We also ramp up the dispatch depth gradually for async IO,
 	 * based on the last sync IO we serviced
 	 */
-	if (!cfq_cfqq_sync(cfqq) && cfqd->cfq_latency) {
+	if (!cfq_cfqq_sync(cfqq) && cfqd->cfq_latency && !kswapd_awake()) {
 		unsigned long last_sync = jiffies - cfqd->last_end_sync_rq;
 		unsigned int depth;
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 6f75617..b593aff 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -655,6 +655,7 @@ typedef struct pglist_data {
 void get_zone_counts(unsigned long *active, unsigned long *inactive,
 			unsigned long *free);
 void build_all_zonelists(void);
+int kswapd_awake(void);
 void wakeup_kswapd(struct zone *zone, int order);
 int zone_watermark_ok(struct zone *z, int order, unsigned long mark,
 		int classzone_idx, int alloc_flags);
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 777af57..75cdd9a 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2201,6 +2201,15 @@ static int kswapd(void *p)
 	return 0;
 }

+int kswapd_awake(void)
+{
+	pg_data_t *pgdat;
+	for_each_online_pgdat(pgdat)
+		if (!waitqueue_active(&pgdat->kswapd_wait))
+			return 1;
+	return 0;
+}
+
 /*
  * A zone is low on free memory, so wake its kswapd task to service it.
  */
>
> As it turns out the test machine does report for the disk NCQ (depth 31/32)
> and it's the same on the laptop so slowing down dirty page cleaning
> could be impacting reclaim.
Yes, I think so.
>
>> >
>> > To show the problem, I used an x86-64 machine booting booted with 512MB of
>> > memory. This is a small amount of RAM but the bug reports related to page
>> > allocation failures were on smallish machines and the disks in the system
>> > are not very high-performance.
>> >
>> > I used three tests. The first was sysbench on postgres running an IO-heavy
>> > test against a large database with 10,000,000 rows. The second was IOZone
>> > running most of the automatic tests with a record length of 4KB and the
>> > last was a simulated launching of gitk with a music player running in the
>> > background to act as a desktop-like scenario. The final test was similar
>> > to the test described here http://lwn.net/Articles/362184/ except that
>> > dm-crypt was not used as it has its own problems.
>>
>> low_latency was tested on other scenarios:
>> http://lkml.indiana.edu/hypermail/linux/kernel/0910.0/01410.html
>> http://linux.derkeiler.com/Mailing-Lists/Kernel/2009-11/msg04855.html
>> where it improved actual and perceived performance, so disabling it
>> completely may not be good.
>>
>
> It may not indeed.
>
> In case you mean a partial disabling of cfq_latency, I'm try the
> following patch. The intention is to disable the low_latency logic if
> kswapd is at work and presumably needs clean pages. Alternative
> suggestions welcome.
Yes, I meant exactly to disable that part, and doing it when kswapd is
active is probably a good choice.
I have a different idea for 2.6.33, though.
If you have a reliable reproducer of the issue, can you test it on
git://git.kernel.dk/linux-2.6-block.git branch for-2.6.33?
It may already be unaffected, since we had various performance
improvements there, but I think a better way to boost writeback is
possible.
Thanks,
Corrado
Great. We can probably re-enable this feature in 2.6.33, but there isn't any
reason to take any risk in 2.6.32, i.e. this simple disabling is best. I like this.
Reviewed-by: KOSAKI Motohiro <kosaki....@jp.fujitsu.com>
I'd like to treat vmscan writeout as special, because:
- vmscan runs in various process contexts, but it doesn't write out the
  calling process's own pages. IOW, it doesn't match cfq's I/O fairness
  logic well.
- Plus, the above means vmscan writeout doesn't need good I/O latency.
- vmscan maintains its LRU lists at page granularity, which means it
  generates awfully seeky I/O. It assumes the block layer buffers many
  I/O requests.
- Plus, the above means vmscan writeout needs good I/O throughput;
  otherwise the system might hang.
However, I don't think kswapd_awake is a good choice, because:
- Zone reclaim runs before kswapd is woken up. IOW, this patch doesn't solve
  the problem on HPC machines. BTW, some Core i7 boxes (at least Intel's
  reference box) also use zone reclaim.
- On a large machine (many memory nodes), at least one of the many kswapd
  threads is always running.
Would PF_MEMALLOC be a better idea instead?
Subject: [PATCH] cfq: Do not limit the async queue depth while memory reclaim
Not-Signed-off-by: KOSAKI Motohiro <kosaki....@jp.fujitsu.com> (I haven't tested this)
---
block/cfq-iosched.c | 3 ++-
1 files changed, 2 insertions(+), 1 deletions(-)
diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index aa1e953..9546f64 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -1308,7 +1308,8 @@ static bool cfq_may_dispatch(struct cfq_data *cfqd, struct cfq_queue *cfqq)
 	 * We also ramp up the dispatch depth gradually for async IO,
 	 * based on the last sync IO we serviced
 	 */
-	if (!cfq_cfqq_sync(cfqq) && cfqd->cfq_latency) {
+	if (!cfq_cfqq_sync(cfqq) && cfqd->cfq_latency &&
+	    !(current->flags & PF_MEMALLOC)) {
 		unsigned long last_sync = jiffies - cfqd->last_end_sync_rq;
 		unsigned int depth;
--
1.6.5.2
This patch was obviously wrong. Please forget it. I'm sorry.
As it turned out, that patch sucked so I aborted the test and I need to
think about it a lot more.
> Yes, I meant exactly to disable that part, and doing it when kswapd is
> active is probably a good choice.
> I have a different idea for 2.6.33, though.
> If you have a reliable reproducer of the issue, can you test it on
> git://git.kernel.dk/linux-2.6-block.git branch for-2.6.33?
> It may already be unaffected, since we had various performance
> improvements there, but I think a better way to boost writeback is
> possible.
>
I haven't tested the high-order allocation scenario yet but the results
as things stand are below. There are four kernels being compared:
1. with-low-latency is 2.6.32-rc8 vanilla
2. with-low-latency-block-2.6.33 is with the for-2.6.33 from linux-block applied
3. with-low-latency-async-rampup is with "[RFC,PATCH] cfq-iosched: improve async queue ramp up formula"
4. without-low-latency is with low_latency disabled
SYSBENCH
sysbench-with low-latency low-latency sysbench-without
low-latency block-2.6.33 async-rampup low-latency
1 1266.02 ( 0.00%) 824.08 (-53.63%) 1265.15 (-0.07%) 1278.55 ( 0.98%)
2 1182.58 ( 0.00%) 1226.42 ( 3.57%) 1223.03 ( 3.31%) 1379.25 (14.26%)
3 1218.64 ( 0.00%) 1271.38 ( 4.15%) 1246.42 ( 2.23%) 1580.08 (22.87%)
4 1212.11 ( 0.00%) 1257.84 ( 3.64%) 1325.17 ( 8.53%) 1534.17 (20.99%)
5 1046.77 ( 0.00%) 981.71 (-6.63%) 1008.44 (-3.80%) 1552.48 (32.57%)
6 1187.14 ( 0.00%) 1132.89 (-4.79%) 1147.18 (-3.48%) 1661.19 (28.54%)
7 1179.37 ( 0.00%) 1183.61 ( 0.36%) 1202.49 ( 1.92%) 790.26 (-49.24%)
8 1164.62 ( 0.00%) 1143.54 (-1.84%) 1184.56 ( 1.68%) 854.10 (-36.36%)
9 1095.22 ( 0.00%) 1178.72 ( 7.08%) 1002.42 (-9.26%) 1655.04 (33.83%)
10 1147.52 ( 0.00%) 1153.46 ( 0.52%) 1151.73 ( 0.37%) 1653.89 (30.62%)
11 823.38 ( 0.00%) 820.64 (-0.33%) 754.15 (-9.18%) 1627.45 (49.41%)
12 813.73 ( 0.00%) 791.44 (-2.82%) 848.32 ( 4.08%) 1494.63 (45.56%)
13 898.22 ( 0.00%) 789.63 (-13.75%) 931.47 ( 3.57%) 1521.64 (40.97%)
14 873.50 ( 0.00%) 938.90 ( 6.97%) 875.75 ( 0.26%) 1311.09 (33.38%)
15 808.32 ( 0.00%) 979.88 (17.51%) 877.87 ( 7.92%) 1009.70 (19.94%)
16 758.17 ( 0.00%) 1096.81 (30.87%) 881.23 (13.96%) 725.17 (-4.55%)
sysbench is helped by both block-2.6.33 and async-rampup to some
extent. For many of the results, plain old disabling low_latency still
helps the most.
desktop-net-gitk
gitk-with low-latency low-latency gitk-without
low-latency block-2.6.33 async-rampup low-latency
min 954.46 ( 0.00%) 570.06 (40.27%) 796.22 (16.58%) 640.65 (32.88%)
mean 964.79 ( 0.00%) 573.96 (40.51%) 798.01 (17.29%) 655.57 (32.05%)
stddev 10.01 ( 0.00%) 2.65 (73.55%) 1.91 (80.95%) 13.33 (-33.18%)
max 981.23 ( 0.00%) 577.21 (41.17%) 800.91 (18.38%) 675.65 (31.14%)
The changes for block in 2.6.33 make a massive difference here, notably
beating the disabling of low_latency.
IOZone
iozone-with low-latency low-latency iozone-without
low-latency block-2.6.33 async-rampup low-latency
write-64 151212 ( 0.00%) 163359 ( 7.44%) 163359 ( 7.44%) 159856 ( 5.41%)
write-128 189357 ( 0.00%) 184922 (-2.40%) 202805 ( 6.63%) 206233 ( 8.18%)
write-256 219883 ( 0.00%) 211232 (-4.10%) 189867 (-15.81%) 223174 ( 1.47%)
write-512 224932 ( 0.00%) 222601 (-1.05%) 204459 (-10.01%) 220227 (-2.14%)
write-1024 227738 ( 0.00%) 226728 (-0.45%) 216009 (-5.43%) 226155 (-0.70%)
write-2048 227564 ( 0.00%) 224167 (-1.52%) 229387 ( 0.79%) 224848 (-1.21%)
write-4096 208556 ( 0.00%) 227707 ( 8.41%) 216908 ( 3.85%) 223430 ( 6.66%)
write-8192 219484 ( 0.00%) 222365 ( 1.30%) 217737 (-0.80%) 219389 (-0.04%)
write-16384 206670 ( 0.00%) 209355 ( 1.28%) 204146 (-1.24%) 206295 (-0.18%)
write-32768 203023 ( 0.00%) 205097 ( 1.01%) 199766 (-1.63%) 201852 (-0.58%)
write-65536 162134 ( 0.00%) 196670 (17.56%) 189975 (14.66%) 189173 (14.29%)
write-131072 68534 ( 0.00%) 69145 ( 0.88%) 64519 (-6.22%) 67417 (-1.66%)
write-262144 32936 ( 0.00%) 28587 (-15.21%) 31470 (-4.66%) 27750 (-18.69%)
write-524288 24044 ( 0.00%) 23560 (-2.05%) 23116 (-4.01%) 23759 (-1.20%)
rewrite-64 755681 ( 0.00%) 800767 ( 5.63%) 469931 (-60.81%) 755681 ( 0.00%)
rewrite-128 581518 ( 0.00%) 639723 ( 9.10%) 591774 ( 1.73%) 799840 (27.30%)
rewrite-256 639427 ( 0.00%) 710511 (10.00%) 666414 ( 4.05%) 659861 ( 3.10%)
rewrite-512 669577 ( 0.00%) 743788 ( 9.98%) 692017 ( 3.24%) 684954 ( 2.24%)
rewrite-1024 680960 ( 0.00%) 755195 ( 9.83%) 701422 ( 2.92%) 686182 ( 0.76%)
rewrite-2048 685263 ( 0.00%) 743123 ( 7.79%) 703445 ( 2.58%) 692780 ( 1.09%)
rewrite-4096 631352 ( 0.00%) 686776 ( 8.07%) 640007 ( 1.35%) 643266 ( 1.85%)
rewrite-8192 442146 ( 0.00%) 474089 ( 6.74%) 457768 ( 3.41%) 442624 ( 0.11%)
rewrite-16384 428641 ( 0.00%) 454857 ( 5.76%) 442896 ( 3.22%) 432613 ( 0.92%)
rewrite-32768 425361 ( 0.00%) 444206 ( 4.24%) 434472 ( 2.10%) 430568 ( 1.21%)
rewrite-65536 405183 ( 0.00%) 433898 ( 6.62%) 419843 ( 3.49%) 389242 (-4.10%)
rewrite-131072 66110 ( 0.00%) 58370 (-13.26%) 54342 (-21.66%) 58472 (-13.06%)
rewrite-262144 29254 ( 0.00%) 24665 (-18.61%) 25710 (-13.78%) 29306 ( 0.18%)
rewrite-524288 23812 ( 0.00%) 20742 (-14.80%) 22490 (-5.88%) 24543 ( 2.98%)
read-64 934589 ( 0.00%) 1160938 (19.50%) 1004538 ( 6.96%) 840903 (-11.14%)
read-128 1601534 ( 0.00%) 1869179 (14.32%) 1681806 ( 4.77%) 1280633 (-25.06%)
read-256 1255511 ( 0.00%) 1526887 (17.77%) 1304314 ( 3.74%) 1310683 ( 4.21%)
read-512 1291158 ( 0.00%) 1377278 ( 6.25%) 1336145 ( 3.37%) 1319723 ( 2.16%)
read-1024 1319408 ( 0.00%) 1306564 (-0.98%) 1368162 ( 3.56%) 1347557 ( 2.09%)
read-2048 1316016 ( 0.00%) 1394645 ( 5.64%) 1339827 ( 1.78%) 1347393 ( 2.33%)
read-4096 1253710 ( 0.00%) 1307525 ( 4.12%) 1247519 (-0.50%) 1251882 (-0.15%)
read-8192 995149 ( 0.00%) 1033337 ( 3.70%) 1016944 ( 2.14%) 1011794 ( 1.65%)
read-16384 883156 ( 0.00%) 905213 ( 2.44%) 905213 ( 2.44%) 897458 ( 1.59%)
read-32768 844368 ( 0.00%) 855213 ( 1.27%) 849609 ( 0.62%) 856364 ( 1.40%)
read-65536 816099 ( 0.00%) 839262 ( 2.76%) 835019 ( 2.27%) 826473 ( 1.26%)
read-131072 818055 ( 0.00%) 837369 ( 2.31%) 828230 ( 1.23%) 824351 ( 0.76%)
read-262144 827225 ( 0.00%) 839635 ( 1.48%) 840538 ( 1.58%) 835693 ( 1.01%)
read-524288 24653 ( 0.00%) 21387 (-15.27%) 20602 (-19.66%) 22519 (-9.48%)
reread-64 2329708 ( 0.00%) 2251544 (-3.47%) 1985134 (-17.36%) 1985134 (-17.36%)
reread-128 1446222 ( 0.00%) 1979446 (26.94%) 2009076 (28.02%) 2137031 (32.33%)
reread-256 1828508 ( 0.00%) 2006158 ( 8.86%) 1892980 ( 3.41%) 1879725 ( 2.72%)
reread-512 1521718 ( 0.00%) 1642783 ( 7.37%) 1508887 (-0.85%) 1579934 ( 3.68%)
reread-1024 1347557 ( 0.00%) 1422540 ( 5.27%) 1384034 ( 2.64%) 1375171 ( 2.01%)
reread-2048 1340664 ( 0.00%) 1413929 ( 5.18%) 1372364 ( 2.31%) 1350783 ( 0.75%)
reread-4096 1259592 ( 0.00%) 1324868 ( 4.93%) 1273788 ( 1.11%) 1284839 ( 1.96%)
reread-8192 1007285 ( 0.00%) 1033710 ( 2.56%) 1027159 ( 1.93%) 1011317 ( 0.40%)
reread-16384 891404 ( 0.00%) 910828 ( 2.13%) 916562 ( 2.74%) 905022 ( 1.50%)
reread-32768 850492 ( 0.00%) 859341 ( 1.03%) 856385 ( 0.69%) 862772 ( 1.42%)
reread-65536 836565 ( 0.00%) 852664 ( 1.89%) 852315 ( 1.85%) 847020 ( 1.23%)
reread-131072 844516 ( 0.00%) 862590 ( 2.10%) 854067 ( 1.12%) 853155 ( 1.01%)
reread-262144 851524 ( 0.00%) 860559 ( 1.05%) 864921 ( 1.55%) 860653 ( 1.06%)
reread-524288 24927 ( 0.00%) 21300 (-17.03%) 19748 (-26.23%) 22487 (-10.85%)
randread-64 1605256 ( 0.00%) 1605256 ( 0.00%) 1605256 ( 0.00%) 1775099 ( 9.57%)
randread-128 1179358 ( 0.00%) 1582649 (25.48%) 1511363 (21.97%) 1528576 (22.85%)
randread-256 1421755 ( 0.00%) 1599680 (11.12%) 1460430 ( 2.65%) 1310683 (-8.47%)
randread-512 1306873 ( 0.00%) 1278855 (-2.19%) 1243315 (-5.11%) 1281909 (-1.95%)
randread-1024 1201314 ( 0.00%) 1254656 ( 4.25%) 1190657 (-0.90%) 1231629 ( 2.46%)
randread-2048 1179413 ( 0.00%) 1227971 ( 3.95%) 1185272 ( 0.49%) 1190529 ( 0.93%)
randread-4096 1107005 ( 0.00%) 1160862 ( 4.64%) 1110727 ( 0.34%) 1116792 ( 0.88%)
randread-8192 894337 ( 0.00%) 924264 ( 3.24%) 912676 ( 2.01%) 899487 ( 0.57%)
randread-16384 783760 ( 0.00%) 800299 ( 2.07%) 793351 ( 1.21%) 791341 ( 0.96%)
randread-32768 740498 ( 0.00%) 743720 ( 0.43%) 741233 ( 0.10%) 743511 ( 0.41%)
randread-65536 721640 ( 0.00%) 727692 ( 0.83%) 726984 ( 0.74%) 728139 ( 0.89%)
randread-131072 715284 ( 0.00%) 722094 ( 0.94%) 717746 ( 0.34%) 720825 ( 0.77%)
randread-262144 709855 ( 0.00%) 706770 (-0.44%) 709133 (-0.10%) 714943 ( 0.71%)
randread-524288 394 ( 0.00%) 421 ( 6.41%) 418 ( 5.74%) 431 ( 8.58%)
randwrite-64 730988 ( 0.00%) 764288 ( 4.36%) 723111 (-1.09%) 730988 ( 0.00%)
randwrite-128 746459 ( 0.00%) 799840 ( 6.67%) 746459 ( 0.00%) 742331 (-0.56%)
randwrite-256 695778 ( 0.00%) 752329 ( 7.52%) 720041 ( 3.37%) 727850 ( 4.41%)
randwrite-512 666253 ( 0.00%) 722760 ( 7.82%) 667081 ( 0.12%) 691126 ( 3.60%)
randwrite-1024 651223 ( 0.00%) 697776 ( 6.67%) 663292 ( 1.82%) 659625 ( 1.27%)
randwrite-2048 655558 ( 0.00%) 691887 ( 5.25%) 665720 ( 1.53%) 664073 ( 1.28%)
randwrite-4096 635556 ( 0.00%) 662721 ( 4.10%) 643170 ( 1.18%) 642400 ( 1.07%)
randwrite-8192 467357 ( 0.00%) 491364 ( 4.89%) 476720 ( 1.96%) 469734 ( 0.51%)
randwrite-16384 413188 ( 0.00%) 427521 ( 3.35%) 417353 ( 1.00%) 417282 ( 0.98%)
randwrite-32768 404161 ( 0.00%) 411721 ( 1.84%) 404942 ( 0.19%) 407580 ( 0.84%)
randwrite-65536 379372 ( 0.00%) 397312 ( 4.52%) 386853 ( 1.93%) 381273 ( 0.50%)
randwrite-131072 21780 ( 0.00%) 16924 (-28.69%) 21177 (-2.85%) 19758 (-10.23%)
randwrite-262144 6249 ( 0.00%) 5548 (-12.64%) 6370 ( 1.90%) 6316 ( 1.06%)
randwrite-524288 2915 ( 0.00%) 2582 (-12.90%) 2871 (-1.53%) 2859 (-1.96%)
bkwdread-64 1141196 ( 0.00%) 1141196 ( 0.00%) 1004538 (-13.60%) 1141196 ( 0.00%)
bkwdread-128 1066865 ( 0.00%) 1386465 (23.05%) 1400936 (23.85%) 1101900 ( 3.18%)
bkwdread-256 877797 ( 0.00%) 1105556 (20.60%) 1105556 (20.60%) 1105556 (20.60%)
bkwdread-512 1133103 ( 0.00%) 1162547 ( 2.53%) 1175271 ( 3.59%) 1162547 ( 2.53%)
bkwdread-1024 1163562 ( 0.00%) 1206714 ( 3.58%) 1213534 ( 4.12%) 1195962 ( 2.71%)
bkwdread-2048 1163439 ( 0.00%) 1218910 ( 4.55%) 1204552 ( 3.41%) 1204552 ( 3.41%)
bkwdread-4096 1116792 ( 0.00%) 1175477 ( 4.99%) 1159922 ( 3.72%) 1150600 ( 2.94%)
bkwdread-8192 912288 ( 0.00%) 935233 ( 2.45%) 944695 ( 3.43%) 934724 ( 2.40%)
bkwdread-16384 817707 ( 0.00%) 824140 ( 0.78%) 832527 ( 1.78%) 829152 ( 1.38%)
bkwdread-32768 775898 ( 0.00%) 773714 (-0.28%) 785494 ( 1.22%) 787691 ( 1.50%)
bkwdread-65536 759643 ( 0.00%) 769924 ( 1.34%) 778780 ( 2.46%) 772174 ( 1.62%)
bkwdread-131072 763215 ( 0.00%) 769634 ( 0.83%) 773707 ( 1.36%) 773816 ( 1.37%)
bkwdread-262144 765491 ( 0.00%) 768992 ( 0.46%) 780876 ( 1.97%) 780021 ( 1.86%)
bkwdread-524288 3688 ( 0.00%) 3595 (-2.59%) 3577 (-3.10%) 3724 ( 0.97%)
The upcoming changes for 2.6.33 also help iozone in many cases, often by more
than just disabling low_latency. It has the occasional massive gain or loss
for the larger file sizes. I don't know why this is but as the big losses
appear to be mostly in the write-tests, I would guess that it's differences
in heavy-writer-throttling.
The only downside with block-2.6.33 is that there are a lot of patches in
there and it doesn't help with the 2.6.32 release as such. I could do a reverse
bisect to see what helps the most in there but under ideal conditions, it'll
take 3 days to complete and I wouldn't be able to start until Monday as I'm
out of the country for the weekend. That's a bit late.
p.s. As a consequence of being out of the country, I also won't be able to
respond to mail over the weekend.
--
Mel Gorman
While it might not need good latency as such, it does need pages to be
clean because direct reclaim has trouble cleaning pages on its own
behalf.
> - vmscan maintains page-granularity LRU lists. That means vmscan issues awfully
> seeky I/O. It assumes the block layer buffers many I/O requests.
> - plus, the above means vmscan writeout needs good I/O throughput. Otherwise
> the system might hang.
>
> However, I don't think kswapd_awake is a good choice, because
> - zone reclaim runs before kswapd wakeup. IOW, this patch doesn't solve the problem on HPC machines.
> btw, some Core i7 boxes (at least Intel's reference box) also use zone reclaim.
Good point.
> - On a large (many memory node) machine, one of the many kswapd threads is always running.
>
Also true.
>
> Instead, would PF_MEMALLOC be a good idea?
>
It doesn't work out either because a process with PF_MEMALLOC is in
direct reclaim and like kswapd, it may not be able to clean the pages at
all, let alone in a small period of time.
>
> Subject: [PATCH] cfq: Do not limit the async queue depth while memory reclaim
>
> Not-Signed-off-by: KOSAKI Motohiro <kosaki....@jp.fujitsu.com> (I haven't tested this)
> ---
> block/cfq-iosched.c | 3 ++-
> 1 files changed, 2 insertions(+), 1 deletions(-)
>
> diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
> index aa1e953..9546f64 100644
> --- a/block/cfq-iosched.c
> +++ b/block/cfq-iosched.c
> @@ -1308,7 +1308,8 @@ static bool cfq_may_dispatch(struct cfq_data *cfqd, struct cfq_queue *cfqq)
> * We also ramp up the dispatch depth gradually for async IO,
> * based on the last sync IO we serviced
> */
> - if (!cfq_cfqq_sync(cfqq) && cfqd->cfq_latency) {
> + if (!cfq_cfqq_sync(cfqq) && cfqd->cfq_latency &&
> + !(current->flags & PF_MEMALLOC)) {
> unsigned long last_sync = jiffies - cfqd->last_end_sync_rq;
> unsigned int depth;
>
> --
> 1.6.5.2
>
--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab
How would one go about selecting the proper ratio at which to disable
the low_latency logic?
> >> Yes, I meant exactly to disable that part, and doing it when kswapd is
> >> active is probably a good choice.
> >> I have a different idea for 2.6.33, though.
> >> If you have a reliable reproducer of the issue, can you test it on
> >> git://git.kernel.dk/linux-2.6-block.git branch for-2.6.33?
> >> It may already be unaffected, since we had various performance
> >> improvements there, but I think a better way to boost writeback is
> >> possible.
> >>
> >
> > I haven't tested the high-order allocation scenario yet but the results
> > as thing stands are below. There are four kernels being compared
> >
> > 1. with-low-latency is 2.6.32-rc8 vanilla
> > 2. with-low-latency-block-2.6.33 is with the for-2.6.33 from linux-block applied
> > 3. with-low-latency-async-rampup is with "[RFC,PATCH] cfq-iosched: improve async queue ramp up formula"
> > 4. without-low-latency is with low_latency disabled
> >
> > SYSBENCH
> > sysbench-with low-latency low-latency sysbench-without
> > low-latency block-2.6.33 async-rampup low-latency
> > 1 1266.02 ( 0.00%) 824.08 (-53.63%) 1265.15 (-0.07%) 1278.55 ( 0.98%)
> > 2 1182.58 ( 0.00%) 1226.42 ( 3.57%) 1223.03 ( 3.31%) 1379.25 (14.26%)
> > 3 1218.64 ( 0.00%) 1271.38 ( 4.15%) 1246.42 ( 2.23%) 1580.08 (22.87%)
> > 4 1212.11 ( 0.00%) 1257.84 ( 3.64%) 1325.17 ( 8.53%) 1534.17 (20.99%)
> > 5 1046.77 ( 0.00%) 981.71 (-6.63%) 1008.44 (-3.80%) 1552.48 (32.57%)
> > 6 1187.14 ( 0.00%) 1132.89 (-4.79%) 1147.18 (-3.48%) 1661.19 (28.54%)
> > 7 1179.37 ( 0.00%) 1183.61 ( 0.36%) 1202.49 ( 1.92%) 790.26 (-49.24%)
> > 8 1164.62 ( 0.00%) 1143.54 (-1.84%) 1184.56 ( 1.68%) 854.10 (-36.36%)
> > 9 1095.22 ( 0.00%) 1178.72 ( 7.08%) 1002.42 (-9.26%) 1655.04 (33.83%)
> > 10 1147.52 ( 0.00%) 1153.46 ( 0.52%) 1151.73 ( 0.37%) 1653.89 (30.62%)
> > 11 823.38 ( 0.00%) 820.64 (-0.33%) 754.15 (-9.18%) 1627.45 (49.41%)
> > 12 813.73 ( 0.00%) 791.44 (-2.82%) 848.32 ( 4.08%) 1494.63 (45.56%)
> > 13 898.22 ( 0.00%) 789.63 (-13.75%) 931.47 ( 3.57%) 1521.64 (40.97%)
> > 14 873.50 ( 0.00%) 938.90 ( 6.97%) 875.75 ( 0.26%) 1311.09 (33.38%)
> > 15 808.32 ( 0.00%) 979.88 (17.51%) 877.87 ( 7.92%) 1009.70 (19.94%)
> > 16 758.17 ( 0.00%) 1096.81 (30.87%) 881.23 (13.96%) 725.17 (-4.55%)
> >
> > sysbench is helped by both block-2.6.33 and async-rampup to some
> > extent. For many of the results, plain old disabling low_latency still
> > helps the most.
> >
> > desktop-net-gitk
> > gitk-with low-latency low-latency gitk-without
> > low-latency block-2.6.33 async-rampup low-latency
> > min 954.46 ( 0.00%) 570.06 (40.27%) 796.22 (16.58%) 640.65 (32.88%)
> > mean 964.79 ( 0.00%) 573.96 (40.51%) 798.01 (17.29%) 655.57 (32.05%)
> > stddev 10.01 ( 0.00%) 2.65 (73.55%) 1.91 (80.95%) 13.33 (-33.18%)
> > max 981.23 ( 0.00%) 577.21 (41.17%) 800.91 (18.38%) 675.65 (31.14%)
> >
> > The changes for block in 2.6.33 make a massive difference here, notably
> > beating the disabling of low_latency.
>
> Yes. These are reads of lots of small files, so the improvements for
> seeky workloads we introduced in 2.6.33 help a lot here.
Ok, good to know
> >
> > IOZone
> > iozone-with low-latency low-latency iozone-without
> > low-latency block-2.6.33 async-rampup low-latency
> > write-64 151212 ( 0.00%) 163359 ( 7.44%) 163359 ( 7.44%) 159856 ( 5.41%)
> > write-128 189357 ( 0.00%) 184922 (-2.40%) 202805 ( 6.63%) 206233 ( 8.18%)
> > write-256 219883 ( 0.00%) 211232 (-4.10%) 189867 (-15.81%) 223174 ( 1.47%)
> > write-512 224932 ( 0.00%) 222601 (-1.05%) 204459 (-10.01%) 220227 (-2.14%)
> > write-1024 227738 ( 0.00%) 226728 (-0.45%) 216009 (-5.43%) 226155 (-0.70%)
> > write-2048 227564 ( 0.00%) 224167 (-1.52%) 229387 ( 0.79%) 224848 (-1.21%)
> > write-4096 208556 ( 0.00%) 227707 ( 8.41%) 216908 ( 3.85%) 223430 ( 6.66%)
> > write-8192 219484 ( 0.00%) 222365 ( 1.30%) 217737 (-0.80%) 219389 (-0.04%)
> > write-16384 206670 ( 0.00%) 209355 ( 1.28%) 204146 (-1.24%) 206295 (-0.18%)
> > write-32768 203023 ( 0.00%) 205097 ( 1.01%) 199766 (-1.63%) 201852 (-0.58%)
> > write-65536 162134 ( 0.00%) 196670 (17.56%) 189975 (14.66%) 189173 (14.29%)
> > write-131072 68534 ( 0.00%) 69145 ( 0.88%) 64519 (-6.22%) 67417 (-1.66%)
> > write-262144 32936 ( 0.00%) 28587 (-15.21%) 31470 (-4.66%) 27750 (-18.69%)
> > write-524288 24044 ( 0.00%) 23560 (-2.05%) 23116 (-4.01%) 23759 (-1.20%)
> > rewrite-64 755681 ( 0.00%) 800767 ( 5.63%) 469931 (-60.81%) 755681 ( 0.00%)
> > rewrite-128 581518 ( 0.00%) 639723 ( 9.10%) 591774 ( 1.73%) 799840 (27.30%)
> > rewrite-256 639427 ( 0.00%) 710511 (10.00%) 666414 ( 4.05%) 659861 ( 3.10%)
> > rewrite-512 669577 ( 0.00%) 743788 ( 9.98%) 692017 ( 3.24%) 684954 ( 2.24%)
> > rewrite-1024 680960 ( 0.00%) 755195 ( 9.83%) 701422 ( 2.92%) 686182 ( 0.76%)
> > rewrite-2048 685263 ( 0.00%) 743123 ( 7.79%) 703445 ( 2.58%) 692780 ( 1.09%)
> > rewrite-4096 631352 ( 0.00%) 686776 ( 8.07%) 640007 ( 1.35%) 643266 ( 1.85%)
> > rewrite-8192 442146 ( 0.00%) 474089 ( 6.74%) 457768 ( 3.41%) 442624 ( 0.11%)
> > rewrite-16384 428641 ( 0.00%) 454857 ( 5.76%) 442896 ( 3.22%) 432613 ( 0.92%)
> > rewrite-32768 425361 ( 0.00%) 444206 ( 4.24%) 434472 ( 2.10%) 430568 ( 1.21%)
> > rewrite-65536 405183 ( 0.00%) 433898 ( 6.62%) 419843 ( 3.49%) 389242 (-4.10%)
> > rewrite-131072 66110 ( 0.00%) 58370 (-13.26%) 54342 (-21.66%) 58472 (-13.06%)
> > rewrite-262144 29254 ( 0.00%) 24665 (-18.61%) 25710 (-13.78%) 29306 ( 0.18%)
> > rewrite-524288 23812 ( 0.00%) 20742 (-14.80%) 22490 (-5.88%) 24543 ( 2.98%)
> > read-64 934589 ( 0.00%) 1160938 (19.50%) 1004538 ( 6.96%) 840903 (-11.14%)
> > read-128 1601534 ( 0.00%) 1869179 (14.32%) 1681806 ( 4.77%) 1280633 (-25.06%)
> > read-256 1255511 ( 0.00%) 1526887 (17.77%) 1304314 ( 3.74%) 1310683 ( 4.21%)
> > read-512 1291158 ( 0.00%) 1377278 ( 6.25%) 1336145 ( 3.37%) 1319723 ( 2.16%)
> > read-1024 1319408 ( 0.00%) 1306564 (-0.98%) 1368162 ( 3.56%) 1347557 ( 2.09%)
> > read-2048 1316016 ( 0.00%) 1394645 ( 5.64%) 1339827 ( 1.78%) 1347393 ( 2.33%)
> > read-4096 1253710 ( 0.00%) 1307525 ( 4.12%) 1247519 (-0.50%) 1251882 (-0.15%)
> > read-8192 995149 ( 0.00%) 1033337 ( 3.70%) 1016944 ( 2.14%) 1011794 ( 1.65%)
> > read-16384 883156 ( 0.00%) 905213 ( 2.44%) 905213 ( 2.44%) 897458 ( 1.59%)
> > read-32768 844368 ( 0.00%) 855213 ( 1.27%) 849609 ( 0.62%) 856364 ( 1.40%)
> > read-65536 816099 ( 0.00%) 839262 ( 2.76%) 835019 ( 2.27%) 826473 ( 1.26%)
> > read-131072 818055 ( 0.00%) 837369 ( 2.31%) 828230 ( 1.23%) 824351 ( 0.76%)
> > read-262144 827225 ( 0.00%) 839635 ( 1.48%) 840538 ( 1.58%) 835693 ( 1.01%)
> > read-524288 24653 ( 0.00%) 21387 (-15.27%) 20602 (-19.66%) 22519 (-9.48%)
> > reread-64 2329708 ( 0.00%) 2251544 (-3.47%) 1985134 (-17.36%) 1985134 (-17.36%)
> > reread-128 1446222 ( 0.00%) 1979446 (26.94%) 2009076 (28.02%) 2137031 (32.33%)
> > reread-256 1828508 ( 0.00%) 2006158 ( 8.86%) 1892980 ( 3.41%) 1879725 ( 2.72%)
> > reread-512 1521718 ( 0.00%) 1642783 ( 7.37%) 1508887 (-0.85%) 1579934 ( 3.68%)
> > reread-1024 1347557 ( 0.00%) 1422540 ( 5.27%) 1384034 ( 2.64%) 1375171 ( 2.01%)
> > reread-2048 1340664 ( 0.00%) 1413929 ( 5.18%) 1372364 ( 2.31%) 1350783 ( 0.75%)
> > reread-4096 1259592 ( 0.00%) 1324868 ( 4.93%) 1273788 ( 1.11%) 1284839 ( 1.96%)
> > reread-8192 1007285 ( 0.00%) 1033710 ( 2.56%) 1027159 ( 1.93%) 1011317 ( 0.40%)
> > reread-16384 891404 ( 0.00%) 910828 ( 2.13%) 916562 ( 2.74%) 905022 ( 1.50%)
> > reread-32768 850492 ( 0.00%) 859341 ( 1.03%) 856385 ( 0.69%) 862772 ( 1.42%)
> > reread-65536 836565 ( 0.00%) 852664 ( 1.89%) 852315 ( 1.85%) 847020 ( 1.23%)
> > reread-131072 844516 ( 0.00%) 862590 ( 2.10%) 854067 ( 1.12%) 853155 ( 1.01%)
> > reread-262144 851524 ( 0.00%) 860559 ( 1.05%) 864921 ( 1.55%) 860653 ( 1.06%)
> > reread-524288 24927 ( 0.00%) 21300 (-17.03%) 19748 (-26.23%) 22487 (-10.85%)
> > randread-64 1605256 ( 0.00%) 1605256 ( 0.00%) 1605256 ( 0.00%) 1775099 ( 9.57%)
> > randread-128 1179358 ( 0.00%) 1582649 (25.48%) 1511363 (21.97%) 1528576 (22.85%)
> > randread-256 1421755 ( 0.00%) 1599680 (11.12%) 1460430 ( 2.65%) 1310683 (-8.47%)
> > randread-512 1306873 ( 0.00%) 1278855 (-2.19%) 1243315 (-5.11%) 1281909 (-1.95%)
> > randread-1024 1201314 ( 0.00%) 1254656 ( 4.25%) 1190657 (-0.90%) 1231629 ( 2.46%)
> > randread-2048 1179413 ( 0.00%) 1227971 ( 3.95%) 1185272 ( 0.49%) 1190529 ( 0.93%)
> > randread-4096 1107005 ( 0.00%) 1160862 ( 4.64%) 1110727 ( 0.34%) 1116792 ( 0.88%)
> > randread-8192 894337 ( 0.00%) 924264 ( 3.24%) 912676 ( 2.01%) 899487 ( 0.57%)
> > randread-16384 783760 ( 0.00%) 800299 ( 2.07%) 793351 ( 1.21%) 791341 ( 0.96%)
> > randread-32768 740498 ( 0.00%) 743720 ( 0.43%) 741233 ( 0.10%) 743511 ( 0.41%)
> > randread-65536 721640 ( 0.00%) 727692 ( 0.83%) 726984 ( 0.74%) 728139 ( 0.89%)
> > randread-131072 715284 ( 0.00%) 722094 ( 0.94%) 717746 ( 0.34%) 720825 ( 0.77%)
> > randread-262144 709855 ( 0.00%) 706770 (-0.44%) 709133 (-0.10%) 714943 ( 0.71%)
> > randread-524288 394 ( 0.00%) 421 ( 6.41%) 418 ( 5.74%) 431 ( 8.58%)
> > randwrite-64 730988 ( 0.00%) 764288 ( 4.36%) 723111 (-1.09%) 730988 ( 0.00%)
> > randwrite-128 746459 ( 0.00%) 799840 ( 6.67%) 746459 ( 0.00%) 742331 (-0.56%)
> > randwrite-256 695778 ( 0.00%) 752329 ( 7.52%) 720041 ( 3.37%) 727850 ( 4.41%)
> > randwrite-512 666253 ( 0.00%) 722760 ( 7.82%) 667081 ( 0.12%) 691126 ( 3.60%)
> > randwrite-1024 651223 ( 0.00%) 697776 ( 6.67%) 663292 ( 1.82%) 659625 ( 1.27%)
> > randwrite-2048 655558 ( 0.00%) 691887 ( 5.25%) 665720 ( 1.53%) 664073 ( 1.28%)
> > randwrite-4096 635556 ( 0.00%) 662721 ( 4.10%) 643170 ( 1.18%) 642400 ( 1.07%)
> > randwrite-8192 467357 ( 0.00%) 491364 ( 4.89%) 476720 ( 1.96%) 469734 ( 0.51%)
> > randwrite-16384 413188 ( 0.00%) 427521 ( 3.35%) 417353 ( 1.00%) 417282 ( 0.98%)
> > randwrite-32768 404161 ( 0.00%) 411721 ( 1.84%) 404942 ( 0.19%) 407580 ( 0.84%)
> > randwrite-65536 379372 ( 0.00%) 397312 ( 4.52%) 386853 ( 1.93%) 381273 ( 0.50%)
> > randwrite-131072 21780 ( 0.00%) 16924 (-28.69%) 21177 (-2.85%) 19758 (-10.23%)
> > randwrite-262144 6249 ( 0.00%) 5548 (-12.64%) 6370 ( 1.90%) 6316 ( 1.06%)
> > randwrite-524288 2915 ( 0.00%) 2582 (-12.90%) 2871 (-1.53%) 2859 (-1.96%)
> > bkwdread-64 1141196 ( 0.00%) 1141196 ( 0.00%) 1004538 (-13.60%) 1141196 ( 0.00%)
> > bkwdread-128 1066865 ( 0.00%) 1386465 (23.05%) 1400936 (23.85%) 1101900 ( 3.18%)
> > bkwdread-256 877797 ( 0.00%) 1105556 (20.60%) 1105556 (20.60%) 1105556 (20.60%)
> > bkwdread-512 1133103 ( 0.00%) 1162547 ( 2.53%) 1175271 ( 3.59%) 1162547 ( 2.53%)
> > bkwdread-1024 1163562 ( 0.00%) 1206714 ( 3.58%) 1213534 ( 4.12%) 1195962 ( 2.71%)
> > bkwdread-2048 1163439 ( 0.00%) 1218910 ( 4.55%) 1204552 ( 3.41%) 1204552 ( 3.41%)
> > bkwdread-4096 1116792 ( 0.00%) 1175477 ( 4.99%) 1159922 ( 3.72%) 1150600 ( 2.94%)
> > bkwdread-8192 912288 ( 0.00%) 935233 ( 2.45%) 944695 ( 3.43%) 934724 ( 2.40%)
> > bkwdread-16384 817707 ( 0.00%) 824140 ( 0.78%) 832527 ( 1.78%) 829152 ( 1.38%)
> > bkwdread-32768 775898 ( 0.00%) 773714 (-0.28%) 785494 ( 1.22%) 787691 ( 1.50%)
> > bkwdread-65536 759643 ( 0.00%) 769924 ( 1.34%) 778780 ( 2.46%) 772174 ( 1.62%)
> > bkwdread-131072 763215 ( 0.00%) 769634 ( 0.83%) 773707 ( 1.36%) 773816 ( 1.37%)
> > bkwdread-262144 765491 ( 0.00%) 768992 ( 0.46%) 780876 ( 1.97%) 780021 ( 1.86%)
> > bkwdread-524288 3688 ( 0.00%) 3595 (-2.59%) 3577 (-3.10%) 3724 ( 0.97%)
> >
> > The upcoming changes for 2.6.33 also help iozone in many cases, often by more
> > than just disabling low_latency. It has the occasional massive gain or loss
> > for the larger file sizes. I don't know why this is but as the big losses
> > appear to be mostly in the write-tests, I would guess that it's differences
> > in heavy-writer-throttling.
>
> I wonder if 2.6.33 + my async rampup patch will improve still further,
> maybe reaching the low_latency=0 performance also for writing tests.
It might. I haven't tested it yet as the machine is tied up. However, even
if it does, it will not help 2.6.32 if it depends on the patches being
considered for 2.6.33.
> >
> > The only downside with block-2.6.33 is that there are a lot of patches in
> > there and doesn't help with the 2.6.32 release as such. I could do a reverse
> > bisect to see what helps the most in there but under ideal conditions, it'll
> > take 3 days to complete and I wouldn't be able to start until Monday as I'm
> > out of the country for the weekend. That's a bit late.
>
> Bisect will likely not help, since we have several patch series with
> heavy internal dependencies in that tree.
> If one of the patch series is found to bring the improvement, you have
> to backport the entire series, that is not advisable for a rc8 or for
> stable.
Scratch that then.
I did a quick test for when high-order-atomic-allocations-for-network
are happening but the results are not great. By quick test, I mean I
only did the gitk tests as there wasn't time to do the sysbench and
iozone tests as well before I'd go offline.
desktop-net-gitk
high-with low-latency low-latency high-without
low-latency block-2.6.33 async-rampup low-latency
min 861.03 ( 0.00%) 467.83 (45.67%) 1185.51 (-37.69%) 303.43 (64.76%)
mean 866.60 ( 0.00%) 616.28 (28.89%) 1201.82 (-38.68%) 459.69 (46.96%)
stddev 4.39 ( 0.00%) 86.90 (-1877.46%) 23.63 (-437.75%) 92.75 (-2010.76%)
max 872.56 ( 0.00%) 679.36 (22.14%) 1242.63 (-42.41%) 537.31 (38.42%)
pgalloc-fail 25 ( 0.00%) 10 (50.00%) 39 (-95.00%) 20 ( 0.00%)
The patches for 2.6.33 help a little all right but the async-rampup
patches both make the performance worse and cause more page allocation
failures to occur. In other words, on most machines it'll appear fine
but people with wireless cards doing high-order allocations may run into
trouble.
Disabling low_latency again helps performance significantly in this
scenario. There were still page allocation failures because not all the
patches related to that problem made it to mainline.
I was somewhat aggravated by the page allocation failures until I remembered
that there are three patches in -mm that I failed to convince either Jens or
Andrew of them being suitable for mainline. When they are added to the mix,
the results are as follows;
desktop-net-gitk
atomics-with low-latency low-latency atomics-without
low-latency block-2.6.33 async-rampup low-latency
min 641.12 ( 0.00%) 627.91 ( 2.06%) 1254.75 (-95.71%) 375.05 (41.50%)
mean 743.61 ( 0.00%) 631.20 (15.12%) 1272.70 (-71.15%) 389.71 (47.59%)
stddev 60.30 ( 0.00%) 2.53 (95.80%) 10.64 (82.35%) 22.38 (62.89%)
max 793.85 ( 0.00%) 633.76 (20.17%) 1281.65 (-61.45%) 428.41 (46.03%)
pgalloc-fail 3 ( 0.00%) 2 ( 0.00%) 23 ( 0.00%) 0 ( 0.00%)
Again, plain old disabling low_latency both performs the best and fails page
allocations the least. The three patches for page allocation failures that
are in -mm but not mainline are:
[PATCH 3/5] page allocator: Wait on both sync and async congestion after direct reclaim
[PATCH 4/5] vmscan: Have kswapd sleep for a short interval and double check it should be asleep
[PATCH 5/5] vmscan: Take order into consideration when deciding if kswapd is in trouble
It still seems to be that the route of least damage is to disable low_latency
by default for 2.6.32. It's very unfortunate that I wasn't able to fully
justify the 3 patches for page allocation failures in time but all that
can be done there is consider them for -stable I suppose.
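For anyone who wants to repeat the comparison on an affected machine without rebuilding, a minimal sketch of flipping the tunable at runtime follows. It assumes the device is using CFQ and that the tunable is exposed at /sys/block/<dev>/queue/iosched/low_latency. The helper names and the `sysfs_root` parameter are made up for illustration and are not part of any of the patches discussed here.

```python
from pathlib import Path


def low_latency_path(device, sysfs_root="/sys"):
    """Return the assumed sysfs location of CFQ's low_latency tunable."""
    return Path(sysfs_root) / "block" / device / "queue" / "iosched" / "low_latency"


def set_low_latency(device, enabled, sysfs_root="/sys"):
    """Write 1 or 0 to the tunable and return the value read back.

    Raises FileNotFoundError if the tunable is missing, which is what
    you would see if the device is not using CFQ.
    """
    path = low_latency_path(device, sysfs_root)
    if not path.exists():
        raise FileNotFoundError(f"{path} not found; is {device} using CFQ?")
    path.write_text("1\n" if enabled else "0\n")
    return path.read_text().strip()
```

Calling `set_low_latency("sda", False)` as root would correspond to the "without-low-latency" configuration; the `sysfs_root` argument only exists so the helper can be exercised against a scratch directory.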
--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab
> How would one go about selecting the proper ratio at which to disable
> the low_latency logic?
Can we measure the dirty ratio when the allocation failures start to happen?
>> >
>> > I haven't tested the high-order allocation scenario yet but the results
>> > as thing stands are below. There are four kernels being compared
>> >
>> > 1. with-low-latency is 2.6.32-rc8 vanilla
>> > 2. with-low-latency-block-2.6.33 is with the for-2.6.33 from linux-block applied
>> > 3. with-low-latency-async-rampup is with "[RFC,PATCH] cfq-iosched: improve async queue ramp up formula"
>> > 4. without-low-latency is with low_latency disabled
>> >
>> > desktop-net-gitk
>> > gitk-with low-latency low-latency gitk-without
>> > low-latency block-2.6.33 async-rampup low-latency
>> > min 954.46 ( 0.00%) 570.06 (40.27%) 796.22 (16.58%) 640.65 (32.88%)
>> > mean 964.79 ( 0.00%) 573.96 (40.51%) 798.01 (17.29%) 655.57 (32.05%)
>> > stddev 10.01 ( 0.00%) 2.65 (73.55%) 1.91 (80.95%) 13.33 (-33.18%)
>> > max 981.23 ( 0.00%) 577.21 (41.17%) 800.91 (18.38%) 675.65 (31.14%)
>> >
>> > The changes for block in 2.6.33 make a massive difference here, notably
>> > beating the disabling of low_latency.
>>
> I did a quick test for when high-order-atomic-allocations-for-network
> are happening but the results are not great. By quick test, I mean I
> only did the gitk tests as there wasn't time to do the sysbench and
> iozone tests as well before I'd go offline.
>
> desktop-net-gitk
> high-with low-latency low-latency high-without
> low-latency block-2.6.33 async-rampup low-latency
> min 861.03 ( 0.00%) 467.83 (45.67%) 1185.51 (-37.69%) 303.43 (64.76%)
> mean 866.60 ( 0.00%) 616.28 (28.89%) 1201.82 (-38.68%) 459.69 (46.96%)
> stddev 4.39 ( 0.00%) 86.90 (-1877.46%) 23.63 (-437.75%) 92.75 (-2010.76%)
> max 872.56 ( 0.00%) 679.36 (22.14%) 1242.63 (-42.41%) 537.31 (38.42%)
> pgalloc-fail 25 ( 0.00%) 10 (50.00%) 39 (-95.00%) 20 ( 0.00%)
>
> The patches for 2.6.33 help a little all right but the async-rampup
> patches both make the performance worse and cause more page allocation
> failures to occur. In other words, on most machines it'll appear fine
> but people with wireless cards doing high-order allocations may run into
> trouble.
>
> Disabling low_latency again helps performance significantly in this
> scenario. There were still page allocation failures because not all the
> patches related to that problem made it to mainline.
I'm puzzled how almost all kernels, excluding the async rampup,
perform better when high order allocations are enabled, than in
previous test.
> I was somewhat aggravated by the page allocation failures until I remembered
> that there are three patches in -mm that I failed to convince either Jens or
> Andrew of them being suitable for mainline. When they are added to the mix,
> the results are as follows;
>
> desktop-net-gitk
> atomics-with low-latency low-latency atomics-without
> low-latency block-2.6.33 async-rampup low-latency
> min 641.12 ( 0.00%) 627.91 ( 2.06%) 1254.75 (-95.71%) 375.05 (41.50%)
> mean 743.61 ( 0.00%) 631.20 (15.12%) 1272.70 (-71.15%) 389.71 (47.59%)
> stddev 60.30 ( 0.00%) 2.53 (95.80%) 10.64 (82.35%) 22.38 (62.89%)
> max 793.85 ( 0.00%) 633.76 (20.17%) 1281.65 (-61.45%) 428.41 (46.03%)
> pgalloc-fail 3 ( 0.00%) 2 ( 0.00%) 23 ( 0.00%) 0 ( 0.00%)
>
Those patches penalize block-2.6.33, that was the one with lowest
number of failures in previous test.
I think the heuristics were tailored to 2.6.32. They need to be
re-tuned for 2.6.33.
> Again, plain old disabling low_latency both performs the best and fails page
> allocations the least. The three patches for page allocation failures that
> are in -mm but not mainline are:
>
> [PATCH 3/5] page allocator: Wait on both sync and async congestion after direct reclaim
> [PATCH 4/5] vmscan: Have kswapd sleep for a short interval and double check it should be asleep
> [PATCH 5/5] vmscan: Take order into consideration when deciding if kswapd is in trouble
>
> It still seems to be that the route of least damage is to disable low_latency
> by default for 2.6.32. It's very unfortunate that I wasn't able to fully
> justify the 3 patches for page allocation failures in time but all that
> can be done there is consider them for -stable I suppose.
Just disabling low_latency will not solve the allocation issues (20
instead of 25).
Moreover, it will improve some workloads, but penalize others.
Your 3 patches, though, seem to improve the situation also for
low_latency enabled, both for performance and allocation failures (25
to 3). Having those 3 patches with low_latency enabled seems better,
since it won't penalize the workloads that are benefited by
low_latency (if you add a sequential read to your test, you should see
a big difference).
Thanks,
Corrado
Would the number of dirty pages in the page allocation failure message to
kern.log be enough? You won't get them all because of printk rate suppression
but it's something. Alternatively, tell me exactly what stats from /proc you
want and I'll stick a monitor on there. Assuming you want nr_dirty vs total
number of pages though, the monitor tends to execute too late to be useful.
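As a rough idea of what such a monitor could compute, here is a minimal sketch. It assumes the ratio of interest is nr_dirty from /proc/vmstat divided by the total number of pages derived from MemTotal in /proc/meminfo; the function names are hypothetical and this is not the monitor script I actually use.

```python
import os


def parse_vmstat(text):
    """Parse "name value" lines in the format used by /proc/vmstat."""
    stats = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[1].isdigit():
            stats[parts[0]] = int(parts[1])
    return stats


def total_pages_from_meminfo(text, page_size):
    """Convert the MemTotal line (reported in kB) into a page count."""
    for line in text.splitlines():
        if line.startswith("MemTotal:"):
            kilobytes = int(line.split()[1])
            return kilobytes * 1024 // page_size
    raise ValueError("MemTotal not found")


def dirty_ratio(vmstat_text, total_pages):
    """Fraction of all pages that are currently dirty."""
    return parse_vmstat(vmstat_text).get("nr_dirty", 0) / total_pages


def sample():
    """Take one sample from the running system (Linux only)."""
    with open("/proc/vmstat") as f:
        vmstat = f.read()
    with open("/proc/meminfo") as f:
        meminfo = f.read()
    pages = total_pages_from_meminfo(meminfo, os.sysconf("SC_PAGE_SIZE"))
    return dirty_ratio(vmstat, pages)
```

Polling `sample()` from user space has the problem already mentioned: under memory pressure the monitor itself may be scheduled too late to catch the state at the moment of the failure.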
> >> >
> >> > I haven't tested the high-order allocation scenario yet but the results
> >> > as thing stands are below. There are four kernels being compared
> >> >
> >> > 1. with-low-latency is 2.6.32-rc8 vanilla
> >> > 2. with-low-latency-block-2.6.33 is with the for-2.6.33 from linux-block applied
> >> > 3. with-low-latency-async-rampup is with "[RFC,PATCH] cfq-iosched: improve async queue ramp up formula"
> >> > 4. without-low-latency is with low_latency disabled
> >> >
> >> > desktop-net-gitk
> >> > gitk-with low-latency low-latency gitk-without
> >> > low-latency block-2.6.33 async-rampup low-latency
> >> > min 954.46 ( 0.00%) 570.06 (40.27%) 796.22 (16.58%) 640.65 (32.88%)
> >> > mean 964.79 ( 0.00%) 573.96 (40.51%) 798.01 (17.29%) 655.57 (32.05%)
> >> > stddev 10.01 ( 0.00%) 2.65 (73.55%) 1.91 (80.95%) 13.33 (-33.18%)
> >> > max 981.23 ( 0.00%) 577.21 (41.17%) 800.91 (18.38%) 675.65 (31.14%)
> >> >
> >> > The changes for block in 2.6.33 make a massive difference here, notably
> >> > beating the disabling of low_latency.
> >>
> > I did a quick test for when high-order-atomic-allocations-for-network
> > are happening but the results are not great. By quick test, I mean I
> > only did the gitk tests as there wasn't time to do the sysbench and
> > iozone tests as well before I'd go offline.
> >
> > desktop-net-gitk
> > high-with low-latency low-latency high-without
> > low-latency block-2.6.33 async-rampup low-latency
> > min 861.03 ( 0.00%) 467.83 (45.67%) 1185.51 (-37.69%) 303.43 (64.76%)
> > mean 866.60 ( 0.00%) 616.28 (28.89%) 1201.82 (-38.68%) 459.69 (46.96%)
> > stddev 4.39 ( 0.00%) 86.90 (-1877.46%) 23.63 (-437.75%) 92.75 (-2010.76%)
> > max 872.56 ( 0.00%) 679.36 (22.14%) 1242.63 (-42.41%) 537.31 (38.42%)
> > pgalloc-fail 25 ( 0.00%) 10 (50.00%) 39 (-95.00%) 20 ( 0.00%)
> >
> > The patches for 2.6.33 help a little all right but the async-rampup
> > patches both make the performance worse and cause more page allocation
> > failures to occur. In other words, on most machines it'll appear fine
> > but people with wireless cards doing high-order allocations may run into
> > trouble.
> >
> > Disabling low_latency again helps performance significantly in this
> > scenario. There were still page allocation failures because not all the
> > patches related to that problem made it to mainline.
>
> I'm puzzled how almost all kernels, excluding the async rampup,
> perform better when high order allocations are enabled, than in
> previous test.
>
Two major differences. First, the previous non-high-order tests had also
run sysbench and iozone so the starting conditions are different. I had
disabled those tests to get some of the high-order figures before I went
offline. Second, and probably more important than the starting conditions,
kswapd is working to free order-2 pages and staying awake until watermarks
are reached. kswapd working harder is probably making a big difference.
> > I was somewhat aggravated by the page allocation failures until I remembered
> > that there are three patches in -mm that I failed to convince either Jens or
> > Andrew of them being suitable for mainline. When they are added to the mix,
> > the results are as follows;
> >
> > desktop-net-gitk
> > atomics-with low-latency low-latency atomics-without
> > low-latency block-2.6.33 async-rampup low-latency
> > min 641.12 ( 0.00%) 627.91 ( 2.06%) 1254.75 (-95.71%) 375.05 (41.50%)
> > mean 743.61 ( 0.00%) 631.20 (15.12%) 1272.70 (-71.15%) 389.71 (47.59%)
> > stddev 60.30 ( 0.00%) 2.53 (95.80%) 10.64 (82.35%) 22.38 (62.89%)
> > max 793.85 ( 0.00%) 633.76 (20.17%) 1281.65 (-61.45%) 428.41 (46.03%)
> > pgalloc-fail 3 ( 0.00%) 2 ( 0.00%) 23 ( 0.00%) 0 ( 0.00%)
> >
>
> Those patches penalize block-2.6.33, that was the one with lowest
> number of failures in previous test.
> I think the heuristics were tailored to 2.6.32. They need to be
> re-tuned for 2.6.33.
>
I made a mistake in the script that was generating the summary. I neglected to
take into account printk rate suppressions. When they are taken into account,
the first round of figures look like
desktop-net-gitk
high-with low-latency low-latency high-without
low-latency block-2.6.33 async-rampup low-latency
min 861.03 ( 0.00%) 467.83 (45.67%) 1185.51 (-37.69%) 303.43 (64.76%)
mean 866.60 ( 0.00%) 616.28 (28.89%) 1201.82 (-38.68%) 459.69 (46.96%)
stddev 4.39 ( 0.00%) 86.90 (-1877.46%) 23.63 (-437.75%) 92.75 (-2010.76%)
max 872.56 ( 0.00%) 679.36 (22.14%) 1242.63 (-42.41%) 537.31 (38.42%)
pgalloc-fail 65 ( 0.00%) 10 (84.62%) 293 (-350.77%) 20 (69.23%)
So the async-rampup is getting smacked very hard with allocation failures
in the high-order case. With the three additional patches applied for
allocation failures, the figures look like
desktop-net-gitk
                   atomics-with        low-latency        low-latency    atomics-without
                    low-latency       block-2.6.33       async-rampup        low-latency
min             641.12 ( 0.00%)    627.91 ( 2.06%)  1254.75 (-95.71%)    375.05 (41.50%)
mean            743.61 ( 0.00%)    631.20 (15.12%)  1272.70 (-71.15%)    389.71 (47.59%)
stddev           60.30 ( 0.00%)      2.53 (95.80%)     10.64 (82.35%)     22.38 (62.89%)
max             793.85 ( 0.00%)    633.76 (20.17%)  1281.65 (-61.45%)    428.41 (46.03%)
pgalloc-fail         3 ( 0.00%)         2 ( 0.00%)        27 ( 0.00%)         0 ( 0.00%)
So again, async-rampup is getting smacked in terms of allocation failures
although the three additional patches help a lot. This is a real pity
because it looked nice in the tests involving no high-order allocations for
the network.
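For anyone reading the bracketed percentages in the tables above: they appear
to be the relative change against the first column, with positive meaning
"lower than the baseline". The summary script itself is not shown in this
thread, so the helper below (gain_percent is a made-up name) is only a sketch
of that assumed calculation.

```c
#include <assert.h>

/*
 * Relative improvement of `value` over `baseline`, in percent, for a
 * lower-is-better metric. Positive means `value` is smaller (better).
 * Assumption: this mirrors how the bracketed columns were derived.
 */
static double gain_percent(double baseline, double value)
{
	assert(baseline != 0.0);
	return (baseline - value) / baseline * 100.0;
}
```

For example, gain_percent(65, 10) reproduces the 84.62% shown for
pgalloc-fail in the block-2.6.33 column of the first table.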
> > Again, plain old disabling low_latency both performs the best and fails page
> > allocations the least. The three patches for page allocation failures are
> > in -mm but not mainline are;
> >
> > [PATCH 3/5] page allocator: Wait on both sync and async congestion after direct reclaim
> > [PATCH 4/5] vmscan: Have kswapd sleep for a short interval and double check it should be asleep
> > [PATCH 5/5] vmscan: Take order into consideration when deciding if kswapd is in trouble
> >
> > It still seems to be that the route of least damage is to disable low_latency
> > by default for 2.6.32. It's very unfortunate that I wasn't able to fully
> > justify the 3 patches for page allocation failures in time but all that
> > can be done there is consider them for -stable I suppose.
>
> Just disabling low_latency will not solve the allocation issues (20
> instead of 25).
20 instead of 65 and I know it doesn't fully help the problem with
high-order allocations. The patches that do help that problem aren't in
mainline but they do exist.
> Moreover, it will improve some workloads, but penalize others.
>
It really does appear to hurt a lot when the machine is kinda low on
memory though. That is a fairly common situation with a desktop loaded
up with random apps. Well..... by common, I mean I hit that situation a
lot on my laptop. I don't hit it on server workloads because I make sure
the machines are not overloaded.
> Your 3 patches, though, seem to improve the situation also for
> low_latency enabled, both for performance and allocation failures (25
> to 3). Having those 3 patches with low_latency enabled seems better,
> since it won't penalize the workloads that are benefited by
> low_latency (if you add a sequential read to your test, you should see
> a big difference).
>
This is true and I would like to see them merged. However, this close to
release, with Jens' unhappiness with the explanation of why the
congestion_wait() changes made a difference and Andrew feeling there
wasn't enough cause to merge them, I'm doubtful it'll happen. Will see
Monday what the story is.
--
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab
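# Print, once per second, the uptime followed by one fragmentation value
# per zone, computed from /proc/buddyinfo as free blocks / free pages.
use strict;
use warnings;

select(STDOUT);
$| = 1;    # unbuffered output, so the samples can be piped or tailed

while (1) {
    open(my $bf, '<', '/proc/buddyinfo') or die "buddyinfo: $!";
    open(my $up, '<', '/proc/uptime') or die "uptime: $!";
    my $now = <$up>;
    chomp $now;
    print $now;
    while (<$bf>) {
        # \w+ so that zone names containing digits (e.g. DMA32) match
        next unless /Node (\d+), zone\s+(\w+)\s+(.+)$/;
        my ($frag, $tot, $val) = (0, 0, 1);
        # column k holds the number of free blocks of order k (2^k pages)
        for my $n ($3 =~ /\d+/g) {
            $frag += $n;
            $tot  += $val * $n;
            $val <<= 1;
        }
        # a zone with no free pages would otherwise divide by zero
        print "\t", ($tot ? $frag / $tot : 0);
    }
    print "\n";
    sleep 1;
}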
My definition of fragmentation is just the number of free fragments / the number of free pages:
* It is 1 only when all free pages are of order 0
* It is 2/3 on a random marking of used pages (each page has probability 0.5 of being used)
* To be sure that an order-k allocation succeeds, the fragmentation should be <= 2^-k
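The same measure can be written as a small C helper, which may be easier to
reason about than the one-liner in the script. This is only a sketch:
zone_fragmentation and its parameters are illustrative names, and counts[] is
assumed to hold one row of /proc/buddyinfo (counts[k] = free blocks of
order k).

```c
#include <assert.h>
#include <stddef.h>

/*
 * Fragmentation as defined above: free blocks / free pages for one zone.
 * counts[k] is the number of free blocks of order k (2^k pages each).
 * Returns 0.0 for a zone with no free pages.
 */
static double zone_fragmentation(const unsigned long *counts, size_t norders)
{
	unsigned long blocks = 0, pages = 0;
	size_t k;

	assert(norders < 8 * sizeof(unsigned long));
	for (k = 0; k < norders; k++) {
		blocks += counts[k];
		pages += counts[k] << k;
	}
	return pages ? (double)blocks / (double)pages : 0.0;
}
```

The 2^-k bound follows directly: if blocks / pages <= 2^-k, the average free
block holds at least 2^k pages, so at least one block of order k or higher
must exist.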
I observed the mainline kernel during normal usage, and found that:
* the fragmentation is very low after boot (< 1%).
* it tends to increase when memory is freed, and to decrease when memory is allocated (since the kernel usually performs order 0 allocations).
* high memory fragmentation increases first, and only when all high memory is used, normal memory starts to fragment.
* when the page cache is big enough (so memory pressure is high for the allocator), the fragmentation starts to fluctuate a lot, sometimes exceeding 2/3 (up to 0.8).
* the only way to make the fragmentation return to sane values after it starts fluctuating is to do a sync & drop caches. Even then, it settles around 14%, which is still quite high.
>
> Two major differences. 1, the previous non-high-order tests had also
> run sysbench and iozone so the starting conditions are different. I had
> disabled those tests to get some of the high-order figures before I went
> offline. However, the starting conditions are probably not as important as
> the fact that kswapd is working to free order-2 pages and staying awake
> until watermarks are reached. kswapd working harder is probably making a
> big difference.
>
From my observation, having run a program that fills the page cache before a test has a lot of impact on the fragmentation.
We (block layer guys) tend to do a sync & drop caches before starting any test, so this can explain why our optimizations work best when the machine has plenty of free memory.
On the other hand, machines with plenty of memory should be the norm now, even for desktops.
Ok. Forget that patch for now. Maybe we can test it with 2.6.33 to see if it fits.
On the other hand, I saw that the problems with high order allocations started
around 2.6.31, where we didn't have any low_latency patch. So I don't think the
solution to the problem is in the block layer. A slightly slower or faster writeback
shouldn't cause a DoS-like situation such as the one encountered with your network driver.
> > Moreover, it will improve some workloads, but penalize others.
>
> It really does appear to hurt a lot when the machine is kinda low on
> memory though. That is a fairly common situation with a desktop loaded
> up with random apps. Well..... by common, I mean I hit that situation a
> lot on my laptop. I don't hit it on server workloads because I make sure
> the machines are not overloaded.
This is why we have it as a tunable. If your workload is negatively affected,
you can switch it off. But make sure to test it thoroughly, because even if
you found a 2x slowdown in a particular circumstance, it can give a 10x
speedup (see http://lkml.indiana.edu/hypermail/linux/kernel/0911.1/01848.html)
in others.
>
> > Your 3 patches, though, seem to improve the situation also for
> > low_latency enabled, both for performance and allocation failures (25
> > to 3). Having those 3 patches with low_latency enabled seems better,
> > since it won't penalize the workloads that are benefited by
> > low_latency (if you add a sequential read to your test, you should see
> > a big difference).
>
> This is true and I would like to see them merged. However, this close to
> release, with Jens unhappiness with the explanation of why
> congestion_wait() changes made a difference and Andrew feeling there
> wasn't enough cause to merge them, I'm doubtful it'll happen. Will see
> Monday what the story is.
After a one-day study of the VM, I found another way to improve the fragmentation.
With the patch below, the fragmentation stays below 2/3 even when memory pressure is high,
and decreases over time if the system is lightly used, even without dropping caches.
Moreover, the precious zones (Normal, DMA) are kept at a lower fragmentation, since high order
allocations are usually serviced by the other zones (more likely than with the mainline allocator).
The idea is to have 2 freelists for each zone.
The free_list_0 has the pages that are less likely to cause a higher-order merge, since the buddy of their compound is not free.
The free_list_1 contains the other ones.
When expanding, we put pages into free_list_1. When freeing, we put them in the proper one by checking the buddy of the compound.
And when extracting, we always extract from free_list_0 first, and fall back on the other if the first is empty.
In this way, the pages that are more likely to cause a big merge stay free for longer.
Consequently we tend to aggregate the long-living allocations on a subset of the compounds, reducing the fragmentation.
It can, though, slow down allocation and reclaim, so someone more knowledgeable than me should have a look.
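To make the "buddy of the compound" check concrete, here is a userspace
sketch of the index arithmetic involved. It is a model, not the patch: the
helper names, MODEL_MAX_ORDER and the free_order[] array are hypothetical
stand-ins for what __find_combined_index(), __page_find_buddy() and
page_is_buddy() do on real struct page data.

```c
#include <assert.h>

#define MODEL_MAX_ORDER 11	/* assumption: the common default for MAX_ORDER */

/* Index of the buddy of the order-`order` block starting at idx. */
static unsigned long buddy_index(unsigned long idx, unsigned int order)
{
	return idx ^ (1UL << order);
}

/* Index of the block that idx and its buddy would form when merged. */
static unsigned long combined_index(unsigned long idx, unsigned int order)
{
	return idx & ~(1UL << order);
}

/*
 * Decide which list a freshly freed block goes on.
 * free_order[i] is the order of a free block starting at page i, or -1
 * if no free block starts there (hypothetical bookkeeping for the model).
 * Returns 1 (free_list_1) when the buddy of the combined block is already
 * free at order + 1, i.e. one more free could cascade into a larger merge;
 * returns 0 (free_list_0) otherwise.
 */
static int pick_free_list(const signed char *free_order, unsigned long npages,
			  unsigned long idx, unsigned int order)
{
	unsigned long combined, combined_buddy;

	assert(order < MODEL_MAX_ORDER);
	if (order >= MODEL_MAX_ORDER - 1)
		return 0;
	combined = combined_index(idx, order);
	combined_buddy = buddy_index(combined, order + 1);
	if (combined_buddy >= npages)
		return 0;
	return free_order[combined_buddy] == (int)(order + 1);
}
```

With pages 2-3 free as one order-1 block, freeing page 0 selects list 1
(pages 0-1 would combine and then merge with 2-3), while freeing page 4
selects list 0.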
Signed-off-by: Corrado Zoccolo <czoc...@gmail.com>
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 6f75617..6427361 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -55,7 +55,8 @@ static inline int get_pageblock_migratetype(struct page *page)
}
struct free_area {
- struct list_head free_list[MIGRATE_TYPES];
+ struct list_head free_list_0[MIGRATE_TYPES];
+ struct list_head free_list_1[MIGRATE_TYPES];
unsigned long nr_free;
};
diff --git a/kernel/kexec.c b/kernel/kexec.c
index f336e21..aee5ef5 100644
--- a/kernel/kexec.c
+++ b/kernel/kexec.c
@@ -1404,13 +1404,15 @@ static int __init crash_save_vmcoreinfo_init(void)
VMCOREINFO_OFFSET(zone, free_area);
VMCOREINFO_OFFSET(zone, vm_stat);
VMCOREINFO_OFFSET(zone, spanned_pages);
- VMCOREINFO_OFFSET(free_area, free_list);
+ VMCOREINFO_OFFSET(free_area, free_list_0);
+ VMCOREINFO_OFFSET(free_area, free_list_1);
VMCOREINFO_OFFSET(list_head, next);
VMCOREINFO_OFFSET(list_head, prev);
VMCOREINFO_OFFSET(vm_struct, addr);
VMCOREINFO_LENGTH(zone.free_area, MAX_ORDER);
log_buf_kexec_setup();
- VMCOREINFO_LENGTH(free_area.free_list, MIGRATE_TYPES);
+ VMCOREINFO_LENGTH(free_area.free_list_0, MIGRATE_TYPES);
+ VMCOREINFO_LENGTH(free_area.free_list_1, MIGRATE_TYPES);
VMCOREINFO_NUMBER(NR_FREE_PAGES);
VMCOREINFO_NUMBER(PG_lru);
VMCOREINFO_NUMBER(PG_private);
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index cdcedf6..5f488d8 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -451,6 +451,8 @@ static inline void __free_one_page(struct page *page,
int migratetype)
{
unsigned long page_idx;
+ unsigned long combined_idx;
+ bool high_order_free = false;
if (unlikely(PageCompound(page)))
if (unlikely(destroy_compound_page(page, order)))
@@ -464,7 +466,6 @@ static inline void __free_one_page(struct page *page,
VM_BUG_ON(bad_range(zone, page));
while (order < MAX_ORDER-1) {
- unsigned long combined_idx;
struct page *buddy;
buddy = __page_find_buddy(page, page_idx, order);
@@ -481,8 +482,21 @@ static inline void __free_one_page(struct page *page,
order++;
}
set_page_order(page, order);
- list_add(&page->lru,
- &zone->free_area[order].free_list[migratetype]);
+
+ if (order < MAX_ORDER-1) {
+ struct page *parent_page, *ppage_buddy;
+ combined_idx = __find_combined_index(page_idx, order);
+ parent_page = page + combined_idx - page_idx;
+ ppage_buddy = __page_find_buddy(parent_page, combined_idx, order + 1);
+ high_order_free = page_is_buddy(parent_page, ppage_buddy, order + 1);
+ }
+
+ if (high_order_free)
+ list_add(&page->lru,
+ &zone->free_area[order].free_list_1[migratetype]);
+ else
+ list_add(&page->lru,
+ &zone->free_area[order].free_list_0[migratetype]);
zone->free_area[order].nr_free++;
}
@@ -663,7 +677,7 @@ static inline void expand(struct zone *zone, struct page *page,
high--;
size >>= 1;
VM_BUG_ON(bad_range(zone, &page[size]));
- list_add(&page[size].lru, &area->free_list[migratetype]);
+ list_add(&page[size].lru, &area->free_list_1[migratetype]);
area->nr_free++;
set_page_order(&page[size], high);
}
@@ -723,12 +737,19 @@ struct page *__rmqueue_smallest(struct zone *zone, unsigned int order,
/* Find a page of the appropriate size in the preferred list */
for (current_order = order; current_order < MAX_ORDER; ++current_order) {
+ bool fl0, fl1;
area = &(zone->free_area[current_order]);
- if (list_empty(&area->free_list[migratetype]))
+ fl0 = list_empty(&area->free_list_0[migratetype]);
+ fl1 = list_empty(&area->free_list_1[migratetype]);
+ if (fl0 && fl1)
continue;
- page = list_entry(area->free_list[migratetype].next,
- struct page, lru);
+ if (fl0)
+ page = list_entry(area->free_list_1[migratetype].next,
+ struct page, lru);
+ else
+ page = list_entry(area->free_list_0[migratetype].next,
+ struct page, lru);
list_del(&page->lru);
rmv_page_order(page);
area->nr_free--;
@@ -792,7 +813,7 @@ static int move_freepages(struct zone *zone,
order = page_order(page);
list_del(&page->lru);
list_add(&page->lru,
- &zone->free_area[order].free_list[migratetype]);
+ &zone->free_area[order].free_list_0[migratetype]);
page += 1 << order;
pages_moved += 1 << order;
}
@@ -845,6 +866,7 @@ __rmqueue_fallback(struct zone *zone, int order, int start_migratetype)
for (current_order = MAX_ORDER-1; current_order >= order;
--current_order) {
for (i = 0; i < MIGRATE_TYPES - 1; i++) {
+ bool fl0, fl1;
migratetype = fallbacks[start_migratetype][i];
/* MIGRATE_RESERVE handled later if necessary */
@@ -852,11 +874,20 @@ __rmqueue_fallback(struct zone *zone, int order, int start_migratetype)
continue;
area = &(zone->free_area[current_order]);
- if (list_empty(&area->free_list[migratetype]))
+
+
+ fl0 = list_empty(&area->free_list_0[migratetype]);
+ fl1 = list_empty(&area->free_list_1[migratetype]);
+
+ if (fl0 && fl1)
continue;
- page = list_entry(area->free_list[migratetype].next,
- struct page, lru);
+ if (fl0)
+ page = list_entry(area->free_list_1[migratetype].next,
+ struct page, lru);
+ else
+ page = list_entry(area->free_list_0[migratetype].next,
+ struct page, lru);
area->nr_free--;
/*
@@ -1061,7 +1092,14 @@ void mark_free_pages(struct zone *zone)
}
for_each_migratetype_order(order, t) {
- list_for_each(curr, &zone->free_area[order].free_list[t]) {
+ list_for_each(curr, &zone->free_area[order].free_list_0[t]) {
+ unsigned long i;
+
+ pfn = page_to_pfn(list_entry(curr, struct page, lru));
+ for (i = 0; i < (1UL << order); i++)
+ swsusp_set_page_free(pfn_to_page(pfn + i));
+ }
+ list_for_each(curr, &zone->free_area[order].free_list_1[t]) {
unsigned long i;
pfn = page_to_pfn(list_entry(curr, struct page, lru));
@@ -2993,7 +3031,8 @@ static void __meminit zone_init_free_lists(struct zone *zone)
{
int order, t;
for_each_migratetype_order(order, t) {
- INIT_LIST_HEAD(&zone->free_area[order].free_list[t]);
+ INIT_LIST_HEAD(&zone->free_area[order].free_list_0[t]);
+ INIT_LIST_HEAD(&zone->free_area[order].free_list_1[t]);
zone->free_area[order].nr_free = 0;
}
}
diff --git a/mm/vmstat.c b/mm/vmstat.c
index c81321f..613ef1e 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -468,7 +468,9 @@ static void pagetypeinfo_showfree_print(struct seq_file *m,
area = &(zone->free_area[order]);
- list_for_each(curr, &area->free_list[mtype])
+ list_for_each(curr, &area->free_list_0[mtype])
+ freecount++;
+ list_for_each(curr, &area->free_list_1[mtype])
freecount++;
seq_printf(m, "%6lu ", freecount);
Well, if direct reclaim needs lumpy reclaim, you are right.
In the non-lumpy case, vmscan typically starts pageout and moves the page to the tail of the list.
The cleaned page will then be used by another task.
---------------------------------------------------------------------------------------
static unsigned long shrink_page_list(struct list_head *page_list,
                                      struct list_head *freed_pages_list,
                                      struct scan_control *sc,
                                      enum pageout_io sync_writeback)
{
        (snip)
                switch (pageout(page, mapping, sync_writeback)) {
                case PAGE_KEEP:
                        goto keep_locked;
                case PAGE_ACTIVATE:
                        goto activate_locked;
                case PAGE_SUCCESS:
                        if (PageWriteback(page) || PageDirty(page))
                                goto keep;      /////// HERE
---------------------------------------------------------------------------------------
> > - vmscan maintain page granularity lru list. It mean vmscan makes awful
> > seekful I/O. it assume block-layer buffered much i/o request.
> > - plus, the above mena vmscan. writeout need good io throughput. otherwise
> > system might cause hangup.
> >
> > However, I don't think kswapd_awake is good choice. because
> > - zone reclaim run before kswapd wakeup. iow, this patch doesn't solve hpc machine.
> > btw, some Core i7 box (at least, Intel's reference box) also use zone reclaim.
>
> Good point.
>
> > - On large (many memory node) machine, one of much kswapd always run.
> >
>
> Also true.
>
> >
> > Instead, PF_MEMALLOC is good idea?
>
> It doesn't work out either because a process with PF_MEMALLOC is in
> direct reclaim and like kswapd, it may not be able to clean the pages at
> all, let alone in a small period of time.
please forget this idea ;)
In practice, the ordering of page allocations and frees is not random
but it's ok for the purposes here.
Also when considering fragmentation, I'd take into account the order of the
desired allocation as fragments at or over that size are not contributing
to fragmentation in a negative way. I'd usually express it in terms of free
pages instead of total pages as well to avoid large fluctuations when reclaim
is working. We can work with this measure for the moment though to avoid
getting side-tracked on what fragmentation is.
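As an illustration of what an order-aware measure expressed in free pages
could look like, the sketch below computes the fraction of free pages that
sit in blocks too small for the desired order. It is one possible reading of
the paragraph above, not a measure defined in this thread; the function name
and the counts[] layout (counts[k] = free blocks of order k) are assumptions.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Fraction of free pages held in blocks smaller than 2^order pages,
 * which therefore cannot directly satisfy an allocation of that order.
 * 0.0: every free page is usable for the request; 1.0: none is.
 * A zone with no free pages is reported as 1.0 (nothing usable).
 */
static double unusable_free_fraction(const unsigned long *counts,
				     size_t norders, unsigned int order)
{
	unsigned long free_pages = 0, usable = 0;
	size_t k;

	assert(norders < 8 * sizeof(unsigned long));
	for (k = 0; k < norders; k++) {
		unsigned long pages = counts[k] << k;

		free_pages += pages;
		if (k >= order)
			usable += pages;
	}
	if (!free_pages)
		return 1.0;
	return (double)(free_pages - usable) / (double)free_pages;
}
```

Because the denominator is free pages rather than total pages, the value
does not swing merely because reclaim changes how much memory is free;
only the shape of the free lists matters.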
> I observed the mainline kernel during normal usage, and found that:
> * the fragmentation is very low after boot (< 1%).
> * it tends to increase when memory is freed, and to decrease when memory is allocated (since the kernel usually performs order 0 allocations).
> * high memory fragmentation increases first, and only when all high memory is used, normal memory starts to fragment.
All three of these observations are expected.
> * when the page cache is big enough (so memory pressure is high for the allocator), the fragmentation starts to fluctuate a lot, sometimes exceeding 2/3 (up to 0.8).
Again, this is expected. Page cache pages stay resident until
reclaimed. If they are clean, they are not really contributing to
fragmentation in any way that matters as they should be quickly found
and discarded in most cases. In the networking case, it's depending on
kswapd to find and reclaim the pages fast enough.
> * the only way to make the fragmentation return to sane values after it enters fluctuation is to do a sync & drop caches. Even in this case, it will go around 14%, that is still quite high.
> >
> > Two major differences. 1, the previous non-high-order tests had also
> > run sysbench and iozone so the starting conditions are different. I had
> > disabled those tests to get some of the high-order figures before I went
> > offline. However, the starting conditions are probably not as important as
> > the fact that kswapd is working to free order-2 pages and staying awake
> > until watermarks are reached. kswapd working harder is probably making a
> > big difference.
> >
>
> From my observation, having run a program that fills page cache before a test has a lot of impact to the fragmentation.
While this is true, during the course of the test, the old page cache
should be discarded quickly. It's not as abrupt as dropping the page
cache but the end result should be similar in the majority of cases -
the exception being when atomic allocations are a major factor.
> We (block layer guys) tend to do a sync & drop cache before starting any test, so this can explain why our optimizations work best when machine has plenty of free memory.
> On the other hand, machines with plenty of memory should be the norm now, even for desktops.
>
Even large memory machines will eventually use the bulk of their memory
on old page cache. There is no problem with this as such.
Sounds reasonable.
> On the other hand, I saw that the problems with high order allocations started
> around 2.6.31, where we didn't have any low_latency patch.
While this is true, there appear to be many sources of the high order
allocation failures. While low_latency is not the original source, it
does not appear to have helped either. Even without high-order
allocations being involved, disabling low_latency performs much better
in low-memory situations.
> So I don't think the
> solution to the problem is in the block layer. A slightly slower or faster writeback
> shouldn't cause a DoS like situation as the one encountered with your network driver.
>
> > > Moreover, it will improve some workloads, but penalize others.
> >
> > It really does appear to hurt a lot when the machine is kinda low on
> > memory though. That is a fairly common situation with a desktop loaded
> > up with random apps. Well..... by common, I mean I hit that situation a
> > lot on my laptop. I don't hit it on server workloads because I make sure
> > the machines are not overloaded.
>
> This is why we have it as a tunable. If your workload is negatively affected,
> you can switch it off.
True, although it's hard to spot.
> But make sure to test it thoroughly, because even if
> you found a 2x slowdown in a particular circumstance, it can gain 10x
> speedup (see http://lkml.indiana.edu/hypermail/linux/kernel/0911.1/01848.html)
> in others.
>
Ok.
> >
> > > Your 3 patches, though, seem to improve the situation also for
> > > low_latency enabled, both for performance and allocation failures (25
> > > to 3). Having those 3 patches with low_latency enabled seems better,
> > > since it won't penalize the workloads that are benefited by
> > > low_latency (if you add a sequential read to your test, you should see
> > > a big difference).
> >
> > This is true and I would like to see them merged. However, this close to
> > release, with Jens unhappiness with the explanation of why
> > congestion_wait() changes made a difference and Andrew feeling there
> > wasn't enough cause to merge them, I'm doubtful it'll happen. Will see
> > Monday what the story is.
>
> After a 1day study of the VM, I found an other way to improve the fragmentation.
> With the patch below, the fragmentation stays below 2/3 even when memory pressure is high,
> and decreases overtime, if the system is lightly used, even without dropping caches.
> Moreover, the precious zones (Normal, DMA) are kept at a lower fragmentation, since high order
> allocations are usually serviced by the other zones (more likely than with mainline allocator).
>
> The idea is to have 2 freelists for each zone.
> The free_list_0 has the pages that are less likely to cause an higher-order merge, since the buddy of their compound is not free.
> The free_list_1 contains the other ones.
> When expanding, we put pages into free_list_1.When freeing, we put them in the proper one by checking the buddy of the compound.
> And when extracting, we always extract from free_list_0 first,
This is subtle, but as well as increased overhead in the page allocator, I'd
expect this to break the page-ordering when a caller is allocating a large number
of order-0 pages. Some IO controllers get a boost by the pages coming back
in physically contiguous order which happens if a high-order page is being
split towards the beginning of the stream of requests. Previous attempts at
altering how coalescing and splitting work to reduce fragmentation with methods
similar to yours have fallen foul of this.
parent_page is a bad name here. It's not the parent of anything. What I
think you're looking for is the lowest page of the pair of buddies that
was last considered for merging.
> + ppage_buddy = __page_find_buddy(parent_page, combined_idx, order + 1);
> + high_order_free = page_is_buddy(parent_page, ppage_buddy, order + 1);
> + }
And you are checking whether, when the other buddy of this pair is freed,
the result will then be merged with the next-highest order. If so, you want
to delay reusing that page for allocation.
> +
> + if (high_order_free)
> + list_add(&page->lru,
> + &zone->free_area[order].free_list_1[migratetype]);
> + else
> + list_add(&page->lru,
> + &zone->free_area[order].free_list_0[migratetype]);
You could have avoided the extra list to some extent by altering whether
it was the head or tail of the list the page was added to. It would have
had a similar effect of the page not being used for longer with slightly
less overhead.
> zone->free_area[order].nr_free++;
> }
>
> @@ -663,7 +677,7 @@ static inline void expand(struct zone *zone, struct page *page,
> high--;
> size >>= 1;
> VM_BUG_ON(bad_range(zone, &page[size]));
> - list_add(&page[size].lru, &area->free_list[migratetype]);
> + list_add(&page[size].lru, &area->free_list_1[migratetype]);
I think this here will damage the contiguous ordering of pages being
returned to callers.
> area->nr_free++;
> set_page_order(&page[size], high);
> }
> @@ -723,12 +737,19 @@ struct page *__rmqueue_smallest(struct zone *zone, unsigned int order,
>
> /* Find a page of the appropriate size in the preferred list */
> for (current_order = order; current_order < MAX_ORDER; ++current_order) {
> + bool fl0, fl1;
> area = &(zone->free_area[current_order]);
> - if (list_empty(&area->free_list[migratetype]))
> + fl0 = list_empty(&area->free_list_0[migratetype]);
> + fl1 = list_empty(&area->free_list_1[migratetype]);
> + if (fl0 && fl1)
> continue;
>
> - page = list_entry(area->free_list[migratetype].next,
> - struct page, lru);
> + if (fl0)
> + page = list_entry(area->free_list_1[migratetype].next,
> + struct page, lru);
> + else
> + page = list_entry(area->free_list_0[migratetype].next,
> + struct page, lru);
By altering whether it's the head or tail free pages are added to, you
can achieve a similar effect.
Much like the low_latency switch, I think this will help some
workloads in terms of fragmentation but hurt others that depend on the
ordering of pages being returned. There is a fair amount of overhead
introduced here as well with branches and a lot of extra lists although
I believe that could be mitigated.
What are the results if you just alter whether it's the head or tail of
the list that is used in __free_one_page()?
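To spell out the single-list alternative being asked about, here is a
userspace model. It is a sketch only: struct node stands in for the kernel's
struct list_head, add_head/add_tail mimic list_add()/list_add_tail(), and
free_page_model/alloc_page_model and the merge_likely flag are hypothetical
names for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Minimal stand-in for a circular doubly-linked free list. */
struct node {
	struct node *prev, *next;
	int pfn;		/* illustrative payload */
};

static void list_init(struct node *head)
{
	head->prev = head->next = head;
}

static void insert_between(struct node *n, struct node *prev, struct node *next)
{
	next->prev = n;
	n->next = next;
	n->prev = prev;
	prev->next = n;
}

/* Hot end: allocation takes from the head, so this page is reused soon. */
static void add_head(struct node *n, struct node *head)
{
	insert_between(n, head, head->next);
}

/* Cold end: this page is only reused once everything before it is gone. */
static void add_tail(struct node *n, struct node *head)
{
	insert_between(n, head->prev, head);
}

/* Free a page, deferring its reuse when a higher-order merge looks likely. */
static void free_page_model(struct node *n, struct node *head, bool merge_likely)
{
	if (merge_likely)
		add_tail(n, head);
	else
		add_head(n, head);
}

/* Allocate: always take the first entry, or NULL if the list is empty. */
static struct node *alloc_page_model(struct node *head)
{
	struct node *n = head->next;

	if (n == head)
		return NULL;
	n->prev->next = n->next;
	n->next->prev = n->prev;
	n->prev = n->next = n;
	return n;
}
```

One property worth noting in this model: pages added at the head come back
in LIFO order, while pages added at the tail come back in FIFO order, so the
two groups are replayed in opposite relative orders.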
--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab
If you need an order-5 page, how would kswapd work?
Will it randomly free some order-0 pages until a merge magically happens?
Unless the dirty ratio is really high, there should already be plenty
of contiguous non-dirty pages in the page cache that could be freed,
but if you use an LRU policy to evict, you can go through a lot of
freeing before a merge will happen.
>> * the only way to make the fragmentation return to sane values after it enters fluctuation is to do a sync & drop caches. Even in this case, it will go around 14%, that is still quite high.
>> >
>> > Two major differences. 1, the previous non-high-order tests had also
>> > run sysbench and iozone so the starting conditions are different. I had
>> > disabled those tests to get some of the high-order figures before I went
>> > offline. However, the starting conditions are probably not as important as
>> > the fact that kswapd is working to free order-2 pages and staying awake
>> > until watermarks are reached. kswapd working harder is probably making a
>> > big difference.
>> >
>>
>> From my observation, having run a program that fills page cache before a test has a lot of impact to the fragmentation.
>
> While this is true, during the course of the test, the old page cache
> should be discarded quickly. It's not as abrupt as dropping the page
> cache but the end result should be similar in the majority of cases -
> the exception being when atomic allocations are a major factor.
For my I/O scheduler tests I use an external disk, to be able to
monitor exactly what is happening.
If I don't do a sync & drop cache before starting a test, I usually
see writeback happening on the main disk, even if the only activity on
the machine is writing a sequential file to my external disk. If that
writeback is done in the context of my test process, this will alter
the result.
And with high order allocations, depending on how you free page
cache, it can be even worse than that.
>
>> On the other hand, I saw that the problems with high order allocations started
>> around 2.6.31, where we didn't have any low_latency patch.
>
> While this is true, there appear to be many sources of the high order
> allocation failures. While low_latency is not the original source, it
> does not appear to have helped either. Even without high-order
> allocations being involved, disabling low_latency performs much better
> in low-memory situations.
Can you try reproducing:
http://lkml.indiana.edu/hypermail/linux/kernel/0911.1/01848.html
in a low memory scenario, to substantiate your claim?
>> After a 1day study of the VM, I found an other way to improve the fragmentation.
>> With the patch below, the fragmentation stays below 2/3 even when memory pressure is high,
>> and decreases overtime, if the system is lightly used, even without dropping caches.
>> Moreover, the precious zones (Normal, DMA) are kept at a lower fragmentation, since high order
>> allocations are usually serviced by the other zones (more likely than with mainline allocator).
>>
>> The idea is to have 2 freelists for each zone.
>> The free_list_0 has the pages that are less likely to cause an higher-order merge, since the buddy of their compound is not free.
>> The free_list_1 contains the other ones.
>> When expanding, we put pages into free_list_1.When freeing, we put them in the proper one by checking the buddy of the compound.
>> And when extracting, we always extract from free_list_0 first,
>
> This is subtle, but as well as increased overhead in the page allocator, I'd
> expect this to break the page-ordering when a caller is allocation many numbers
> of order-0 pages. Some IO controllers get a boost by the pages coming back
> in physically contiguous order which happens if a high-order page is being
> split towards the beginning of the stream of requests. Previous attempts at
> altering how coalescing and splitting to reduce fragmentation with methods
> similar to yours have fallen foul of this.
I took extreme care not to disrupt the page ordering. In fact, I
also considered a single-list solution, but it could cause page
reordering (since I would have used add_tail to add to the other
list).
Right, this should be the combined page, to keep naming consistent
with combined_idx.
>
>> + ppage_buddy = __page_find_buddy(parent_page, combined_idx, order + 1);
>> + high_order_free = page_is_buddy(parent_page, ppage_buddy, order + 1);
>> + }
>
> And you are checking if when one buddy of this pair frees, will it then
> be merged with the next-highest order. If so, you want to delay reusing
> that page for allocation.
Exactly.
If you have two streams of allocations, with different average
lifetimes (and with the long-lifetime allocations having a slower
rate), this makes it very probable that the long-lifetime allocations
span a smaller set of compounds.
>
>> +
>> + if (high_order_free)
>> + list_add(&page->lru,
>> + &zone->free_area[order].free_list_1[migratetype]);
>> + else
>> + list_add(&page->lru,
>> + &zone->free_area[order].free_list_0[migratetype]);
>
> You could have avoided the extra list to some extent by altering whether
> it was the head or tail of the list the page was added to. It would have
> had a similar effect of the page not being used for longer with slightly
> less overhead.
Right, but the order of insertions at the tail would be reversed.
>> zone->free_area[order].nr_free++;
>> }
>>
>> @@ -663,7 +677,7 @@ static inline void expand(struct zone *zone, struct page *page,
>> high--;
>> size >>= 1;
>> VM_BUG_ON(bad_range(zone, &page[size]));
>> - list_add(&page[size].lru, &area->free_list[migratetype]);
>> + list_add(&page[size].lru, &area->free_list_1[migratetype]);
>
> I think this here will damage the contiguous ordering of pages being
> returned to callers.
This shouldn't damage the order. In fact, expand always inserts in the
free_list_1, in the same order as the original code inserted in the
free_list. And if we hit expand, then the free_list_0 is empty, so all
allocations will be serviced from free_list_1 in the same order as the
original code.
Hopefully not, if my considerations above are correct.
> There is a fair amount of overhead
> introduced here as well with branches and a lot of extra lists although
> I believe that could be mitigated.
>
> What are the results if you just alter whether it's the head or tail of
> the list that is used in __free_one_page()?
In that case, it would alter the ordering, but not the one of the
pages returned by expand.
In fact, only the order of the pages returned by free will be
affected, and in that case maybe it is already quite disordered.
If that order does not need to be kept, I can prepare a new version
with a single list.
BTW, if we only guarantee that pages returned by expand are well
ordered, this patch will increase the ordered-ness of the stream of
allocated pages, since it will increase the probability that
allocations go into expand (since frees will more likely create high
order combined pages). So it will also improve the workloads that
prefer ordered allocations.
>
> --
> Mel Gorman
> Part-time Phd Student Linux Technology Center
> University of Limerick IBM Dublin Software Lab
>
--
__________________________________________________________________________
dott. Corrado Zoccolo mailto:czoc...@gmail.com
PhD - Department of Computer Science - University of Pisa, Italy
--------------------------------------------------------------------------
No, it won't. There is contiguity-aware reclaim logic called "lumpy reclaim"
which is used for high-order pages. The next LRU page for reclaiming is
a cursor page and the naturally-aligned block of pages around it are also
considered for reclaim so that a high-order page gets freed.
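To make "naturally-aligned block around the cursor page" concrete, here is a minimal user-space sketch of the arithmetic. It is only an illustration, not the kernel's reclaim code: `struct pfn_range` and `lumpy_block()` are hypothetical names invented for this example.

```c
#include <assert.h>

/* Hypothetical type: a half-open range of page frame numbers [start, end). */
struct pfn_range {
	unsigned long start;	/* first pfn of the aligned block */
	unsigned long end;	/* one past the last pfn of the block */
};

/*
 * Hypothetical helper: given the pfn of the cursor page taken from the LRU
 * and the order being reclaimed, return the naturally-aligned block of
 * 2^order pages that contains the cursor.
 */
static struct pfn_range lumpy_block(unsigned long cursor_pfn, unsigned int order)
{
	struct pfn_range r;
	unsigned long nr_pages;

	assert(order < 8 * sizeof(unsigned long));
	nr_pages = 1UL << order;

	r.start = cursor_pfn & ~(nr_pages - 1);	/* round down to block boundary */
	r.end = r.start + nr_pages;
	return r;
}
```

For an order-5 request with the cursor at pfn 1000, the candidate block would be pfns 992 to 1023, so every page in that range is considered for reclaim alongside the cursor.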
> Unless the dirty ratio is really high, there should already be plenty
> of contiguous non-dirty pages in the page cache that could be freed,
> but if you use an LRU policy to evict, you can go through a lot of
> freeing before a merge will happen.
>
Indeed. There is no need to go into details, but if only order-0 pages were
being reclaimed, an extremely large percentage of memory would have to be
freed to get an order-5 page.
> >> * the only way to make the fragmentation return to sane values after it enters fluctuation is to do a sync & drop caches. Even in this case, it will go around 14%, that is still quite high.
> >> >
> >> > Two major differences. 1, the previous non-high-order tests had also
> >> > run sysbench and iozone so the starting conditions are different. I had
> >> > disabled those tests to get some of the high-order figures before I went
> >> > offline. However, the starting conditions are probably not as important as
> >> > the fact that kswapd is working to free order-2 pages and staying awake
> >> > until watermarks are reached. kswapd working harder is probably making a
> >> > big difference.
> >> >
> >>
> >> From my observation, having run a program that fills page cache before a test has a lot of impact to the fragmentation.
> >
> > While this is true, during the course of the test, the old page cache
> > should be discarded quickly. It's not as abrupt as dropping the page
> > cache but the end result should be similar in the majority of cases -
> > the exception being when atomic allocations are a major factor.
>
> For my I/O scheduler tests I use an external disk, to be able to
> monitor exactly what is happening.
> If I don't do a sync & drop cache before starting a test, I usually
> see writeback happening on the main disk, even if the only activity on
> the machine is writing a sequential file to my external disk. If that
> writeback is done in the context of my test process, this will alter
> the result.
Why does the writeback kick in late? I thought pages were meant to be written
back after a configurable interval of time had passed.
> And with high order allocations, depending on how do you free page
> cache, it can be even worse than that.
>
> >
> >> On the other hand, I saw that the problems with high order allocations started
> >> around 2.6.31, where we didn't have any low_latency patch.
> >
> > While this is true, there appear to be many sources of the high order
> > allocation failures. While low_latency is not the original source, it
> > does not appear to have helped either. Even without high-order
> > allocations being involved, disabling low_latency performs much better
> > in low-memory situations.
>
> Can you try reproducing:
> http://lkml.indiana.edu/hypermail/linux/kernel/0911.1/01848.html
> in a low memory scenario, to substantiate your claim?
>
I can try but it'll take a few days to get around to it. I'm still trying
to identify other sources of the problems from between 2.6.30 and
2.6.32-rc8. It'll be tricky to test what you ask because it might not just
be low-memory that is the problem but low memory + enough pressure that
processes are stalling waiting on reclaim.
> >> After a one-day study of the VM, I found another way to reduce the fragmentation.
> >> With the patch below, the fragmentation stays below 2/3 even when memory pressure is high,
> >> and decreases over time, if the system is lightly used, even without dropping caches.
> >> Moreover, the precious zones (Normal, DMA) are kept at a lower fragmentation, since high order
> >> allocations are usually serviced by the other zones (more likely than with mainline allocator).
> >>
> >> The idea is to have 2 freelists for each zone.
> >> The free_list_0 has the pages that are less likely to cause a higher-order merge, since the buddy of their compound is not free.
> >> The free_list_1 contains the other ones.
> >> When expanding, we put pages into free_list_1. When freeing, we put them in the proper one by checking the buddy of the compound.
> >> And when extracting, we always extract from free_list_0 first,
> >
> > This is subtle, but as well as increased overhead in the page allocator, I'd
> > expect this to break the page-ordering when a caller is allocating a large
> > number of order-0 pages. Some IO controllers get a boost by the pages coming back
> > in physically contiguous order which happens if a high-order page is being
> > split towards the beginning of the stream of requests. Previous attempts at
> > altering how coalescing and splitting work to reduce fragmentation with methods
> > similar to yours have fallen foul of this.
>
> I took extreme care not to disrupt the page ordering. In fact, I
> also considered a single-list solution, but it could cause page
> reordering (since I would have used add_tail to add to the other
> list).
>
You're right. This way does preserve the page ordering.
> >
> >> and fall back on the other if the first is empty.
> >> In this way, we keep free longer the pages that are more likely to cause a big merge.
> >> Consequently we tend to aggregate the long-living allocations on a subset of the compounds, reducing the fragmentation.
> >>
> >> It can, though, slow down allocation and reclaim, so someone more knowledgeable than me should have a look.
> >>
> >> Signed-off-by: Corrado Zoccolo <czoc...@gmail.com>
> >>
> >> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> >> index 6f75617..6427361 100644
> >> --- a/include/linux/mmzone.h
> >> +++ b/include/linux/mmzone.h
> >> @@ -55,7 +55,8 @@ static inline int get_pageblock_migratetype(struct page *page)
> >>  }
> >>
> >>  struct free_area {
> >> -	struct list_head	free_list[MIGRATE_TYPES];
> >> +	struct list_head	free_list_0[MIGRATE_TYPES];
> >> +	struct list_head	free_list_1[MIGRATE_TYPES];
> >>  	unsigned long		nr_free;
> >>  };
> >>
> >> diff --git a/kernel/kexec.c b/kernel/kexec.c
> >> index f336e21..aee5ef5 100644
> >> --- a/kernel/kexec.c
> >> +++ b/kernel/kexec.c
> >> @@ -1404,13 +1404,15 @@ static int __init crash_save_vmcoreinfo_init(void)
> >>  	VMCOREINFO_OFFSET(zone, free_area);
> >>  	VMCOREINFO_OFFSET(zone, vm_stat);
> >>  	VMCOREINFO_OFFSET(zone, spanned_pages);
> >> -	VMCOREINFO_OFFSET(free_area, free_list);
> >> +	VMCOREINFO_OFFSET(free_area, free_list_0);
> >> +	VMCOREINFO_OFFSET(free_area, free_list_1);
> >>  	VMCOREINFO_OFFSET(list_head, next);
> >>  	VMCOREINFO_OFFSET(list_head, prev);
> >>  	VMCOREINFO_OFFSET(vm_struct, addr);
> >>  	VMCOREINFO_LENGTH(zone.free_area, MAX_ORDER);
> >>  	log_buf_kexec_setup();
> >> -	VMCOREINFO_LENGTH(free_area.free_list, MIGRATE_TYPES);
> >> +	VMCOREINFO_LENGTH(free_area.free_list_0, MIGRATE_TYPES);
> >> +	VMCOREINFO_LENGTH(free_area.free_list_1, MIGRATE_TYPES);
> >>  	VMCOREINFO_NUMBER(NR_FREE_PAGES);
> >>  	VMCOREINFO_NUMBER(PG_lru);
> >>  	VMCOREINFO_NUMBER(PG_private);
> >> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> >> index cdcedf6..5f488d8 100644
> >> --- a/mm/page_alloc.c
> >> +++ b/mm/page_alloc.c
> >> @@ -451,6 +451,8 @@ static inline void __free_one_page(struct page *page,
> >>  		int migratetype)
> >>  {
> >>  	unsigned long page_idx;
> >> +	unsigned long combined_idx;
> >> +	bool high_order_free = false;
> >>
> >>  	if (unlikely(PageCompound(page)))
> >>  		if (unlikely(destroy_compound_page(page, order)))
> >> @@ -464,7 +466,6 @@ static inline void __free_one_page(struct page *page,
> >>  	VM_BUG_ON(bad_range(zone, page));
> >>
> >>  	while (order < MAX_ORDER-1) {
> >> -		unsigned long combined_idx;
> >>  		struct page *buddy;
> >>
> >>  		buddy = __page_find_buddy(page, page_idx, order);
> >> @@ -481,8 +482,21 @@ static inline void __free_one_page(struct page *page,
> >>  		order++;
> >>  	}
> >>  	set_page_order(page, order);
> >> -	list_add(&page->lru,
> >> -		&zone->free_area[order].free_list[migratetype]);
> >> +
> >> +	if (order < MAX_ORDER-1) {
> >> +		struct page *parent_page, *ppage_buddy;
> >> +		combined_idx = __find_combined_index(page_idx, order);
> >> +		parent_page = page + combined_idx - page_idx;
> >
> > parent_page is a bad name here. It's not the parent of anything. What I
> > think you're looking for is the lowest page of the pair of buddies that
> > was last considered for merging.
>
> Right, this should be the combined page, to keep naming consistent
> with combined_idx.
>
> >
> >> +		ppage_buddy = __page_find_buddy(parent_page, combined_idx, order + 1);
> >> +		high_order_free = page_is_buddy(parent_page, ppage_buddy, order + 1);
> >> +	}
> >
> > And you are checking if when one buddy of this pair frees, will it then
> > be merged with the next-highest order. If so, you want to delay reusing
> > that page for allocation.
>
> Exactly.
> If you have two streams of allocations, with different average
> lifetime (and with the long lifetime allocations having a slower
> rate), this makes it very probable that the long lifetime allocations
> span a smaller set of compounds.
I see the logic.
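For anyone following the index arithmetic, the check can be sketched in user space roughly as below. This is a simplified model under stated assumptions, not the kernel implementation: `free_order[]`, `sketch_reset()`, `buddy_index()`, `combined_index()` and `combined_buddy_is_free()` are hypothetical names, `SKETCH_MAX_ORDER` is an assumed value, and the model ignores zones and page flags that the real `page_is_buddy()` also checks.

```c
#include <assert.h>
#include <stdbool.h>

#define SKETCH_MAX_ORDER 11			/* assumed value for the sketch */
#define SKETCH_NR_IDX    (1UL << SKETCH_MAX_ORDER)

/* free_order[i]: order of the free block starting at index i, or -1 if none */
static int free_order[SKETCH_NR_IDX];

static void sketch_reset(void)
{
	unsigned long i;

	for (i = 0; i < SKETCH_NR_IDX; i++)
		free_order[i] = -1;
}

/* Index of the buddy of the 2^order block starting at idx. */
static unsigned long buddy_index(unsigned long idx, unsigned int order)
{
	return idx ^ (1UL << order);
}

/* Index of the 2^(order+1) block formed when idx merges with its buddy. */
static unsigned long combined_index(unsigned long idx, unsigned int order)
{
	return idx & ~(1UL << order);
}

/*
 * True if the buddy of the would-be combined block is already free at
 * order + 1, i.e. freeing this page's own buddy later would cascade into
 * a merge at order + 2.
 */
static bool combined_buddy_is_free(unsigned long page_idx, unsigned int order)
{
	unsigned long combined_idx, higher_buddy_idx;

	assert(page_idx < SKETCH_NR_IDX);
	if (order >= SKETCH_MAX_ORDER - 1)
		return false;

	combined_idx = combined_index(page_idx, order);
	higher_buddy_idx = buddy_index(combined_idx, order + 1);
	return free_order[higher_buddy_idx] == (int)(order + 1);
}
```

In this model, freeing order-0 page 4 while its buddy 5 is still allocated looks at block 6 of order 1: if that block is free, page 4 is a candidate for being kept free longer.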
> >
> >> +
> >> +	if (high_order_free)
> >> +		list_add(&page->lru,
> >> +			&zone->free_area[order].free_list_1[migratetype]);
> >> +	else
> >> +		list_add(&page->lru,
> >> +			&zone->free_area[order].free_list_0[migratetype]);
> >
> > You could have avoided the extra list to some extent by altering whether
> > it was the head or tail of the list the page was added to. It would have
> > had a similar effect of the page not being used for longer with slightly
> > less overhead.
>
> Right, but the order of insertions at the tail would be reversed.
>
True but maybe it doesn't matter. What's important is the order in which
pages are returned during allocation and after a high-order page is
split.
> >>  	zone->free_area[order].nr_free++;
> >>  }
> >>
> >> @@ -663,7 +677,7 @@ static inline void expand(struct zone *zone, struct page *page,
> >>  		high--;
> >>  		size >>= 1;
> >>  		VM_BUG_ON(bad_range(zone, &page[size]));
> >> -		list_add(&page[size].lru, &area->free_list[migratetype]);
> >> +		list_add(&page[size].lru, &area->free_list_1[migratetype]);
> >
> > I think this here will damage the contiguous ordering of pages being
> > returned to callers.
>
> This shouldn't damage the order. In fact, expand always inserts in the
> free_list_1, in the same order as the original code inserted in the
> free_list. And if we hit expand, then the free_list_0 is empty, so all
> allocations will be serviced from free_list_1 in the same order as the
> original code.
>
> >
> >>  		area->nr_free++;
> >>  		set_page_order(&page[size], high);
> >>  	}
> >> @@ -723,12 +737,19 @@ struct page *__rmqueue_smallest(struct zone *zone, unsigned int order,
> >>
> >>  	/* Find a page of the appropriate size in the preferred list */
> >>  	for (current_order = order; current_order < MAX_ORDER; ++current_order) {
> >> +		bool fl0, fl1;
> >>  		area = &(zone->free_area[current_order]);
> >> -		if (list_empty(&area->free_list[migratetype]))
> >> +		fl0 = list_empty(&area->free_list_0[migratetype]);
> >> +		fl1 = list_empty(&area->free_list_1[migratetype]);
> >> +		if (fl0 && fl1)
> >>  			continue;
> >>
> >> -		page = list_entry(area->free_list[migratetype].next,
> >> -							struct page, lru);
> >> +		if (fl0)
> >> +			page = list_entry(area->free_list_1[migratetype].next,
> >> +					struct page, lru);
> >> +		else
> >> +			page = list_entry(area->free_list_0[migratetype].next,
> >> +					struct page, lru);
> >
> > By altering whether it's the head or tail free pages are added to, you
> > can achieve a similar effect.
> >
> >>  		list_del(&page->lru);
> >>  		rmv_page_order(page);
> >>  		area->nr_free--;
> >> @@ -792,7 +813,7 @@ static int move_freepages(struct zone *zone,
> >>  		order = page_order(page);
> >>  		list_del(&page->lru);
> >>  		list_add(&page->lru,
> >> -			&zone->free_area[order].free_list[migratetype]);
> >> +			&zone->free_area[order].free_list_0[migratetype]);
> >>  		page += 1 << order;
> >>  		pages_moved += 1 << order;
> >>  	}
> >> @@ -845,6 +866,7 @@ __rmqueue_fallback(struct zone *zone, int order, int start_migratetype)
> >>  	for (current_order = MAX_ORDER-1; current_order >= order;
> >>  						--current_order) {
> >>  		for (i = 0; i < MIGRATE_TYPES - 1; i++) {
> >> +			bool fl0, fl1;
> >>  			migratetype = fallbacks[start_migratetype][i];
> >>
> >>  			/* MIGRATE_RESERVE handled later if necessary */
> >> @@ -852,11 +874,20 @@ __rmqueue_fallback(struct zone *zone, int order, int start_migratetype)
> >>  				continue;
> >>
> >>  			area = &(zone->free_area[current_order]);
> >> -			if (list_empty(&area->free_list[migratetype]))
> >> +
> >> +
> >> +			fl0 = list_empty(&area->free_list_0[migratetype]);
> >> +			fl1 = list_empty(&area->free_list_1[migratetype]);
> >> +
> >> +			if (fl0 && fl1)
> >>  				continue;
> >>
> >> -			page = list_entry(area->free_list[migratetype].next,
> >> -					struct page, lru);
> >> +			if (fl0)
> >> +				page = list_entry(area->free_list_1[migratetype].next,
> >> +						struct page, lru);
> >> +			else
> >> +				page = list_entry(area->free_list_0[migratetype].next,
> >> +						struct page, lru);
> >>  			area->nr_free--;
> >>
> >>  			/*
> >> @@ -1061,7 +1092,14 @@ void mark_free_pages(struct zone *zone)
> >>  		}
> >>
> >>  	for_each_migratetype_order(order, t) {
> >> -		list_for_each(curr, &zone->free_area[order].free_list[t]) {
> >> +		list_for_each(curr, &zone->free_area[order].free_list_0[t]) {
> >> +			unsigned long i;
> >> +
> >> +			pfn = page_to_pfn(list_entry(curr, struct page, lru));
> >> +			for (i = 0; i < (1UL << order); i++)
> >> +				swsusp_set_page_free(pfn_to_page(pfn + i));
> >> +		}
> >> +		list_for_each(curr, &zone->free_area[order].free_list_1[t]) {
> >>  			unsigned long i;
> >>
> >>  			pfn = page_to_pfn(list_entry(curr, struct page, lru));
> >> @@ -2993,7 +3031,8 @@ static void __meminit zone_init_free_lists(struct zone *zone)
> >>  {
> >>  	int order, t;
> >>  	for_each_migratetype_order(order, t) {
> >> -		INIT_LIST_HEAD(&zone->free_area[order].free_list[t]);
> >> +		INIT_LIST_HEAD(&zone->free_area[order].free_list_0[t]);
> >> +		INIT_LIST_HEAD(&zone->free_area[order].free_list_1[t]);
> >>  		zone->free_area[order].nr_free = 0;
> >>  	}
> >>  }
> >> diff --git a/mm/vmstat.c b/mm/vmstat.c
> >> index c81321f..613ef1e 100644
> >> --- a/mm/vmstat.c
> >> +++ b/mm/vmstat.c
> >> @@ -468,7 +468,9 @@ static void pagetypeinfo_showfree_print(struct seq_file *m,
> >>
> >>  			area = &(zone->free_area[order]);
> >>
> >> -			list_for_each(curr, &area->free_list[mtype])
> >> +			list_for_each(curr, &area->free_list_0[mtype])
> >> +				freecount++;
> >> +			list_for_each(curr, &area->free_list_1[mtype])
> >>  				freecount++;
> >>  			seq_printf(m, "%6lu ", freecount);
> >>  		}
> >
> > No more than the low_latency switch, I think this will help some
> > workloads in terms of fragmentation but hurt others that depend on the
> > ordering of pages being returned.
>
> Hopefully not, if my considerations above are correct.
Right, it doesn't affect the ordering of pages returned. The impact is
additional branches and a lot more lists but it's still very interesting.
> > There is a fair amount of overhead
> > introduced here as well with branches and a lot of extra lists although
> > I believe that could be mitigated.
> >
> > What are the results if you just alter whether it's the head or tail of
> > the list that is used in __free_one_page()?
>
> In that case, it would alter the ordering, but not the one of the
> pages returned by expand.
> In fact, only the order of the pages returned by free will be
> affected, and in that case maybe it is already quite disordered.
> If that order is not needed to be kept, I can prepare a new version
> with a single list.
>
The ordering of free does not need to be preserved. The important
property is that if a high-order page is split by expand() that
subsequent allocations use the contiguous pages.
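That property can be illustrated with a small user-space simulation. This is only a sketch under simplifying assumptions (one migratetype, a tiny zone, array-backed LIFO lists standing in for head insertion and head removal); all `sim_*` names are hypothetical and none of this is kernel code.

```c
#include <assert.h>

#define SIM_MAX_ORDER  4	/* orders 0..3, so one order-3 block = 8 pages */
#define SIM_MAX_BLOCKS 16

/* A per-order free list; the top of the stack plays the role of the head. */
struct sim_list {
	unsigned long pfn[SIM_MAX_BLOCKS];
	int count;
};

static struct sim_list sim_free[SIM_MAX_ORDER];

static void sim_push_head(struct sim_list *l, unsigned long pfn)
{
	assert(l->count < SIM_MAX_BLOCKS);
	l->pfn[l->count++] = pfn;
}

static unsigned long sim_pop_head(struct sim_list *l)
{
	assert(l->count > 0);
	return l->pfn[--l->count];
}

/* Start with a single free order-3 block at pfn 0. */
static void sim_reset(void)
{
	int order;

	for (order = 0; order < SIM_MAX_ORDER; order++)
		sim_free[order].count = 0;
	sim_push_head(&sim_free[SIM_MAX_ORDER - 1], 0);
}

/*
 * Allocate one order-0 page: take the smallest available block and split
 * it the way expand() does, putting each upper half back on the list of
 * the next lower order. Returns the pfn, or -1 if nothing is free.
 */
static long sim_alloc_order0(void)
{
	unsigned int order;

	for (order = 0; order < SIM_MAX_ORDER; order++) {
		unsigned long pfn, size;

		if (sim_free[order].count == 0)
			continue;

		pfn = sim_pop_head(&sim_free[order]);
		size = 1UL << order;
		while (order > 0) {
			order--;
			size >>= 1;
			sim_push_head(&sim_free[order], pfn + size);
		}
		return (long)pfn;
	}
	return -1;
}
```

Calling `sim_reset()` followed by eight calls to `sim_alloc_order0()` hands back pfns 0 through 7 in ascending order, which is the contiguous stream that some IO controllers benefit from.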
> BTW, if we only guarantee that pages returned by expand are well
> ordered, this patch will increase the ordered-ness of the stream of
> allocated pages, since it will increase the probability that
> allocations go into expand (since frees will more likely create high
> order combined pages). So it will also improve the workloads that
> prefer ordered allocations.
>
That's a distinct possibility.
--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab
--
> > For my I/O scheduler tests I use an external disk, to be able to
> > monitor exactly what is happening.
> > If I don't do a sync & drop cache before starting a test, I usually
> > see writeback happening on the main disk, even if the only activity on
> > the machine is writing a sequential file to my external disk. If that
> > writeback is done in the context of my test process, this will alter
> > the result.
>
> Why does the writeback kick in late? I thought pages were meant to be
> written back after a configurable interval of time had passed.
That is a good question. Maybe when the dirty ratio gets high, something is
being written to swap?
>
> I can try but it'll take a few days to get around to. I'm still trying
> to identify other sources of the problems from between 2.6.30 and
> 2.6.32-rc8. It'll be tricky to test what you ask because it might not just
> be low-memory that is the problem but low memory + enough pressure that
> processes are stalling waiting on reclaim.
Ok.
>
> > Right, but the order of insertions at the tail would be reversed.
>
> True but maybe it doesn't matter. What's important is the order in which
> pages are returned during allocation and after a high-order page is
> split.
>
> > > There is a fair amount of overhead
> > > introduced here as well with branches and a lot of extra lists although
> > > I believe that could be mitigated.
> > >
> > > What are the results if you just alter whether it's the head or tail of
> > > the list that is used in __free_one_page()?
> >
> > In that case, it would alter the ordering, but not the one of the
> > pages returned by expand.
> > In fact, only the order of the pages returned by free will be
> > affected, and in that case maybe it is already quite disordered.
> > If that order is not needed to be kept, I can prepare a new version
> > with a single list.
>
> The ordering of free does not need to be preserved. The important
> property is that if a high-order page is split by expand() that
> subsequent allocations use the contiguous pages.
Then, a solution with a single list is possible. It removes the overhead
of the branches when allocating, and also the additional lists.
What about:
From b792ce5afff2e7a28ec3db41baaf93c3200ee5fc Mon Sep 17 00:00:00 2001
From: Corrado Zoccolo <czoc...@gmail.com>
Date: Mon, 30 Nov 2009 17:42:05 +0100
Subject: [PATCH] page allocator: heuristic to reduce fragmentation in buddy
allocator
In order to reduce fragmentation, we classify freed pages in two
groups, according to their probability of being part of a high
order merge.
Pages belonging to a compound whose buddy is free are more likely
to be part of a high order merge, so they will be added at the tail
of the freelist. The remaining pages will, instead, be put at the
front of the freelist.
In this way, the pages that are more likely to cause a big merge are
kept free longer. Consequently we tend to aggregate the long-living
allocations on a subset of the compounds, reducing the fragmentation.
Signed-off-by: Corrado Zoccolo <czoc...@gmail.com>
---
mm/page_alloc.c | 20 +++++++++++++++++---
1 files changed, 17 insertions(+), 3 deletions(-)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 2bc2ac6..0f273af 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -451,6 +451,8 @@ static inline void __free_one_page(struct page *page,
 		int migratetype)
 {
 	unsigned long page_idx;
+	unsigned long combined_idx;
+	bool combined_free = false;

 	if (unlikely(PageCompound(page)))
 		if (unlikely(destroy_compound_page(page, order)))
@@ -464,7 +466,6 @@ static inline void __free_one_page(struct page *page,
 	VM_BUG_ON(bad_range(zone, page));

 	while (order < MAX_ORDER-1) {
-		unsigned long combined_idx;
 		struct page *buddy;

 		buddy = __page_find_buddy(page, page_idx, order);
@@ -481,8 +482,21 @@ static inline void __free_one_page(struct page *page,
 		order++;
 	}
 	set_page_order(page, order);
-	list_add(&page->lru,
-		&zone->free_area[order].free_list[migratetype]);
+
+	if (order < MAX_ORDER-1) {
+		struct page *combined_page, *combined_buddy;
+		combined_idx = __find_combined_index(page_idx, order);
+		combined_page = page + combined_idx - page_idx;
+		combined_buddy = __page_find_buddy(combined_page, combined_idx, order + 1);
+		combined_free = page_is_buddy(combined_page, combined_buddy, order + 1);
+	}
+
+	if (combined_free)
+		list_add_tail(&page->lru,
+			&zone->free_area[order].free_list[migratetype]);
+	else
+		list_add(&page->lru,
+			&zone->free_area[order].free_list[migratetype]);
 	zone->free_area[order].nr_free++;
 }
--
1.6.2.5
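As a rough user-space illustration of the head/tail policy in the patch above, the sketch below models one free list as an array where index 0 is the head. It is an assumption-laden toy, not kernel code: `struct freelist`, `freelist_init()`, `freelist_add()` and `freelist_take_head()` are hypothetical names, and the `combined_free` flag is passed in by the caller instead of being computed from buddy state.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define FL_CAPACITY 32

/* Toy free list: pfn[0] is the head, pfn[count - 1] is the tail. */
struct freelist {
	unsigned long pfn[FL_CAPACITY];
	int count;
};

static void freelist_init(struct freelist *fl)
{
	fl->count = 0;
}

/*
 * Pages likely to take part in a high-order merge (combined_free) go to
 * the tail so they stay free longer; all other pages go to the head.
 */
static void freelist_add(struct freelist *fl, unsigned long pfn,
			 bool combined_free)
{
	assert(fl->count < FL_CAPACITY);
	if (combined_free) {
		fl->pfn[fl->count] = pfn;
	} else {
		memmove(&fl->pfn[1], &fl->pfn[0],
			(size_t)fl->count * sizeof(fl->pfn[0]));
		fl->pfn[0] = pfn;
	}
	fl->count++;
}

/* Allocation always takes from the head; returns -1 if the list is empty. */
static long freelist_take_head(struct freelist *fl)
{
	unsigned long pfn;

	if (fl->count == 0)
		return -1;
	pfn = fl->pfn[0];
	fl->count--;
	memmove(&fl->pfn[0], &fl->pfn[1],
		(size_t)fl->count * sizeof(fl->pfn[0]));
	return (long)pfn;
}
```

Freeing pages 10 (head), 20 (tail), 30 (head) and 40 (tail) in that order and then allocating four times yields 30, 10, 20, 40: the tail-inserted pages are reused last, and among them the insertion order is preserved.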