[PATCH 00/10] mm: balance LRU lists based on relative thrashing

Johannes Weiner

Jun 6, 2016, 4:00:07 PM
Hi everybody,

this series re-implements the LRU balancing between page cache and
anonymous pages to work better with fast random IO swap devices.

The LRU balancing code evolved under slow rotational disks with high
seek overhead, and it had to extrapolate the cost of reclaiming a list
based on in-memory reference patterns alone, which is error prone and,
in combination with the high IO cost of mistakes, risky. As a result,
the balancing code is now at a point where it mostly goes for page
cache and avoids the random IO of swapping altogether until the VM is
under significant memory pressure.

With the proliferation of fast random IO devices such as SSDs and
persistent memory, though, swap becomes interesting again, not just as
a last-resort overflow, but as an extension of memory that can be used
to optimize the in-memory balance between the page cache and the
anonymous workingset even during moderate load. Our current reclaim
choices don't exploit the potential of this hardware. This series sets
out to address this.

Having exact tracking of refault IO - the ultimate cost of reclaiming
the wrong pages - allows us to use an IO cost based balancing model
that is more aggressive about swapping on fast backing devices while
holding back on existing setups that still use rotational storage.

These patches base the LRU balancing on the rate of refaults on each
list, times the relative IO cost between swap device and filesystem
(swappiness), in order to optimize reclaim for least IO cost incurred.
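
To make the model concrete, here is a minimal userspace sketch of the
balancing rule (names and details are simplified and illustrative; the
kernel version in the patches below works on per-lruvec cost counters
and assumes the existing anon_prio = swappiness / file_prio = 200 -
swappiness convention from get_scan_count()):

#include <stdio.h>

/*
 * Illustrative sketch only - not the kernel code. Scan pressure is
 * distributed inversely to the observed reclaim cost of each LRU,
 * where "cost" is the refault/rotation events noted against that
 * list, weighted by swappiness (anon weight = swappiness, file
 * weight = 200 - swappiness).
 */
static void scan_balance(unsigned long anon_cost, unsigned long file_cost,
                         int swappiness, double *anon_frac, double *file_frac)
{
        double denom = anon_cost + file_cost;
        double ap = swappiness * (denom + 1) / (anon_cost + 1);
        double fp = (200 - swappiness) * (denom + 1) / (file_cost + 1);

        *anon_frac = ap / (ap + fp);
        *file_frac = fp / (ap + fp);
}

int main(void)
{
        double af, ff;

        /* page cache refaulting heavily, anon barely touched */
        scan_balance(100, 10000, 115, &af, &ff);
        printf("anon scan share %.2f, file scan share %.2f\n", af, ff);
        return 0;
}

With the file list thrashing, virtually all of the scan pressure lands
on the anon lists; with a quiet file list, the split falls back toward
what swappiness alone dictates.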

---

The following postgres benchmark demonstrates the benefits of this new
model. The machine has 7G of memory, the database is 5.6G with 1G of
shared buffers, and the system has a little over 1G worth of anonymous pages
from mostly idle processes and tmpfs files. The filesystem is on
spinning rust, the swap partition is on an SSD; swappiness is set to
115 to ballpark the relative IO cost between them. The test run is
preceded by 30 minutes of warmup using the same workload:

transaction type: TPC-B (sort of)
scaling factor: 420
query mode: simple
number of clients: 8
number of threads: 4
duration: 3600 s

vanilla:
number of transactions actually processed: 290360
latency average: 99.187 ms
latency stddev: 261.171 ms
tps = 80.654848 (including connections establishing)
tps = 80.654878 (excluding connections establishing)

patched:
number of transactions actually processed: 377960
latency average: 76.198 ms
latency stddev: 229.411 ms
tps = 104.987704 (including connections establishing)
tps = 104.987743 (excluding connections establishing)

The patched kernel shows a 30% increase in throughput, and a 23%
decrease in average latency. Latency variance is reduced as well.

The reclaim statistics explain the difference in behavior:

PGBENCH5.6G-vanilla PGBENCH5.6G-lrucost
Real time 3600.49 ( +0.00%) 3600.26 ( -0.01%)
User time 17.85 ( +0.00%) 18.80 ( +5.05%)
System time 17.52 ( +0.00%) 17.02 ( -2.72%)
Allocation stalls 3.00 ( +0.00%) 0.00 ( -75.00%)
Anon scanned 6579.00 ( +0.00%) 201845.00 (+2967.57%)
Anon reclaimed 3426.00 ( +0.00%) 86924.00 (+2436.48%)
Anon reclaim efficiency 52.07 ( +0.00%) 43.06 ( -16.98%)
File scanned 364444.00 ( +0.00%) 27706.00 ( -92.40%)
File reclaimed 363136.00 ( +0.00%) 27366.00 ( -92.46%)
File reclaim efficiency 99.64 ( +0.00%) 98.77 ( -0.86%)
Swap out 3149.00 ( +0.00%) 86932.00 (+2659.78%)
Swap in 313.00 ( +0.00%) 503.00 ( +60.51%)
File refault 222486.00 ( +0.00%) 101041.00 ( -54.59%)
Total refaults 222799.00 ( +0.00%) 101544.00 ( -54.42%)

The patched kernel works much harder to find idle anonymous pages in
order to alleviate the thrashing of the page cache. And it pays off:
overall, refault IO is cut in half, more time is spent in userspace,
and less time is spent in the kernel.

---

The parallelio test from the mmtests package shows the backward
compatibility of the new model. It runs a memcache workload while
copying large files in parallel. The page cache isn't thrashing, so
the VM shouldn't swap except to relieve immediate memory pressure.
Swappiness is reset to the default setting of 60 as well.

parallelio Transactions
vanilla lrucost
60 60
Min memcachetest-0M 83736.00 ( 0.00%) 84376.00 ( 0.76%)
Min memcachetest-769M 83708.00 ( 0.00%) 85038.00 ( 1.59%)
Min memcachetest-2565M 85419.00 ( 0.00%) 85740.00 ( 0.38%)
Min memcachetest-4361M 85979.00 ( 0.00%) 86746.00 ( 0.89%)
Hmean memcachetest-0M 84805.85 ( 0.00%) 84852.31 ( 0.05%)
Hmean memcachetest-769M 84273.56 ( 0.00%) 85160.52 ( 1.05%)
Hmean memcachetest-2565M 85792.43 ( 0.00%) 85967.59 ( 0.20%)
Hmean memcachetest-4361M 86212.90 ( 0.00%) 86891.87 ( 0.79%)
Stddev memcachetest-0M 959.16 ( 0.00%) 339.07 ( 64.65%)
Stddev memcachetest-769M 421.00 ( 0.00%) 110.07 ( 73.85%)
Stddev memcachetest-2565M 277.86 ( 0.00%) 252.33 ( 9.19%)
Stddev memcachetest-4361M 193.55 ( 0.00%) 106.30 ( 45.08%)
CoeffVar memcachetest-0M 1.13 ( 0.00%) 0.40 ( 64.66%)
CoeffVar memcachetest-769M 0.50 ( 0.00%) 0.13 ( 74.13%)
CoeffVar memcachetest-2565M 0.32 ( 0.00%) 0.29 ( 9.37%)
CoeffVar memcachetest-4361M 0.22 ( 0.00%) 0.12 ( 45.51%)
Max memcachetest-0M 86067.00 ( 0.00%) 85129.00 ( -1.09%)
Max memcachetest-769M 84715.00 ( 0.00%) 85305.00 ( 0.70%)
Max memcachetest-2565M 86084.00 ( 0.00%) 86320.00 ( 0.27%)
Max memcachetest-4361M 86453.00 ( 0.00%) 86996.00 ( 0.63%)

parallelio Background IO
vanilla lrucost
60 60
Min io-duration-0M 0.00 ( 0.00%) 0.00 ( 0.00%)
Min io-duration-769M 6.00 ( 0.00%) 6.00 ( 0.00%)
Min io-duration-2565M 21.00 ( 0.00%) 21.00 ( 0.00%)
Min io-duration-4361M 36.00 ( 0.00%) 37.00 ( -2.78%)
Amean io-duration-0M 0.00 ( 0.00%) 0.00 ( 0.00%)
Amean io-duration-769M 6.67 ( 0.00%) 6.67 ( 0.00%)
Amean io-duration-2565M 21.67 ( 0.00%) 21.67 ( 0.00%)
Amean io-duration-4361M 36.33 ( 0.00%) 37.00 ( -1.83%)
Stddev io-duration-0M 0.00 ( 0.00%) 0.00 ( 0.00%)
Stddev io-duration-769M 0.47 ( 0.00%) 0.47 ( 0.00%)
Stddev io-duration-2565M 0.47 ( 0.00%) 0.47 ( 0.00%)
Stddev io-duration-4361M 0.47 ( 0.00%) 0.00 (100.00%)
CoeffVar io-duration-0M 0.00 ( 0.00%) 0.00 ( 0.00%)
CoeffVar io-duration-769M 7.07 ( 0.00%) 7.07 ( 0.00%)
CoeffVar io-duration-2565M 2.18 ( 0.00%) 2.18 ( 0.00%)
CoeffVar io-duration-4361M 1.30 ( 0.00%) 0.00 (100.00%)
Max io-duration-0M 0.00 ( 0.00%) 0.00 ( 0.00%)
Max io-duration-769M 7.00 ( 0.00%) 7.00 ( 0.00%)
Max io-duration-2565M 22.00 ( 0.00%) 22.00 ( 0.00%)
Max io-duration-4361M 37.00 ( 0.00%) 37.00 ( 0.00%)

parallelio Swap totals
vanilla lrucost
60 60
Min swapin-0M 244169.00 ( 0.00%) 281418.00 (-15.26%)
Min swapin-769M 269973.00 ( 0.00%) 231669.00 ( 14.19%)
Min swapin-2565M 204356.00 ( 0.00%) 188934.00 ( 7.55%)
Min swapin-4361M 178044.00 ( 0.00%) 147799.00 ( 16.99%)
Min swaptotal-0M 810441.00 ( 0.00%) 832580.00 ( -2.73%)
Min swaptotal-769M 827282.00 ( 0.00%) 705879.00 ( 14.67%)
Min swaptotal-2565M 690422.00 ( 0.00%) 656948.00 ( 4.85%)
Min swaptotal-4361M 660507.00 ( 0.00%) 582026.00 ( 11.88%)
Min minorfaults-0M 2677904.00 ( 0.00%) 2706086.00 ( -1.05%)
Min minorfaults-769M 2731412.00 ( 0.00%) 2606587.00 ( 4.57%)
Min minorfaults-2565M 2599647.00 ( 0.00%) 2572429.00 ( 1.05%)
Min minorfaults-4361M 2573117.00 ( 0.00%) 2514047.00 ( 2.30%)
Min majorfaults-0M 82864.00 ( 0.00%) 98005.00 (-18.27%)
Min majorfaults-769M 95047.00 ( 0.00%) 78789.00 ( 17.11%)
Min majorfaults-2565M 69486.00 ( 0.00%) 65934.00 ( 5.11%)
Min majorfaults-4361M 60009.00 ( 0.00%) 50955.00 ( 15.09%)
Amean swapin-0M 291429.67 ( 0.00%) 290184.67 ( 0.43%)
Amean swapin-769M 294641.33 ( 0.00%) 247553.33 ( 15.98%)
Amean swapin-2565M 224398.67 ( 0.00%) 199541.33 ( 11.08%)
Amean swapin-4361M 188710.67 ( 0.00%) 155103.67 ( 17.81%)
Amean swaptotal-0M 877847.33 ( 0.00%) 842476.33 ( 4.03%)
Amean swaptotal-769M 860593.67 ( 0.00%) 765749.00 ( 11.02%)
Amean swaptotal-2565M 724284.33 ( 0.00%) 674759.67 ( 6.84%)
Amean swaptotal-4361M 669080.67 ( 0.00%) 594949.33 ( 11.08%)
Amean minorfaults-0M 2743339.00 ( 0.00%) 2707815.33 ( 1.29%)
Amean minorfaults-769M 2740174.33 ( 0.00%) 2656168.33 ( 3.07%)
Amean minorfaults-2565M 2624234.00 ( 0.00%) 2579847.00 ( 1.69%)
Amean minorfaults-4361M 2582434.67 ( 0.00%) 2525946.33 ( 2.19%)
Amean majorfaults-0M 99845.67 ( 0.00%) 101007.33 ( -1.16%)
Amean majorfaults-769M 101037.67 ( 0.00%) 87706.00 ( 13.19%)
Amean majorfaults-2565M 74771.67 ( 0.00%) 68243.67 ( 8.73%)
Amean majorfaults-4361M 62557.33 ( 0.00%) 52668.33 ( 15.81%)
Stddev swapin-0M 33554.61 ( 0.00%) 6370.43 ( 81.01%)
Stddev swapin-769M 18283.19 ( 0.00%) 11586.05 ( 36.63%)
Stddev swapin-2565M 14314.16 ( 0.00%) 9023.96 ( 36.96%)
Stddev swapin-4361M 11000.92 ( 0.00%) 6770.47 ( 38.46%)
Stddev swaptotal-0M 47680.16 ( 0.00%) 8319.84 ( 82.55%)
Stddev swaptotal-769M 23632.76 ( 0.00%) 42426.42 (-79.52%)
Stddev swaptotal-2565M 24761.63 ( 0.00%) 14504.40 ( 41.42%)
Stddev swaptotal-4361M 8173.20 ( 0.00%) 9177.32 (-12.29%)
Stddev minorfaults-0M 49578.82 ( 0.00%) 1928.88 ( 96.11%)
Stddev minorfaults-769M 7305.53 ( 0.00%) 35084.61 (-380.25%)
Stddev minorfaults-2565M 17393.80 ( 0.00%) 5259.94 ( 69.76%)
Stddev minorfaults-4361M 7780.48 ( 0.00%) 10048.60 (-29.15%)
Stddev majorfaults-0M 12102.64 ( 0.00%) 2178.49 ( 82.00%)
Stddev majorfaults-769M 4839.82 ( 0.00%) 6313.49 (-30.45%)
Stddev majorfaults-2565M 3748.79 ( 0.00%) 2707.31 ( 27.78%)
Stddev majorfaults-4361M 3292.87 ( 0.00%) 1466.92 ( 55.45%)
CoeffVar swapin-0M 11.51 ( 0.00%) 2.20 ( 80.93%)
CoeffVar swapin-769M 6.21 ( 0.00%) 4.68 ( 24.58%)
CoeffVar swapin-2565M 6.38 ( 0.00%) 4.52 ( 29.10%)
CoeffVar swapin-4361M 5.83 ( 0.00%) 4.37 ( 25.12%)
CoeffVar swaptotal-0M 5.43 ( 0.00%) 0.99 ( 81.82%)
CoeffVar swaptotal-769M 2.75 ( 0.00%) 5.54 (-101.76%)
CoeffVar swaptotal-2565M 3.42 ( 0.00%) 2.15 ( 37.12%)
CoeffVar swaptotal-4361M 1.22 ( 0.00%) 1.54 (-26.28%)
CoeffVar minorfaults-0M 1.81 ( 0.00%) 0.07 ( 96.06%)
CoeffVar minorfaults-769M 0.27 ( 0.00%) 1.32 (-395.44%)
CoeffVar minorfaults-2565M 0.66 ( 0.00%) 0.20 ( 69.24%)
CoeffVar minorfaults-4361M 0.30 ( 0.00%) 0.40 (-32.04%)
CoeffVar majorfaults-0M 12.12 ( 0.00%) 2.16 ( 82.21%)
CoeffVar majorfaults-769M 4.79 ( 0.00%) 7.20 (-50.28%)
CoeffVar majorfaults-2565M 5.01 ( 0.00%) 3.97 ( 20.87%)
CoeffVar majorfaults-4361M 5.26 ( 0.00%) 2.79 ( 47.09%)
Max swapin-0M 318760.00 ( 0.00%) 296366.00 ( 7.03%)
Max swapin-769M 313685.00 ( 0.00%) 258977.00 ( 17.44%)
Max swapin-2565M 236882.00 ( 0.00%) 210990.00 ( 10.93%)
Max swapin-4361M 203852.00 ( 0.00%) 164117.00 ( 19.49%)
Max swaptotal-0M 913095.00 ( 0.00%) 852936.00 ( 6.59%)
Max swaptotal-769M 879597.00 ( 0.00%) 799103.00 ( 9.15%)
Max swaptotal-2565M 748943.00 ( 0.00%) 692476.00 ( 7.54%)
Max swaptotal-4361M 680081.00 ( 0.00%) 602448.00 ( 11.42%)
Max minorfaults-0M 2797869.00 ( 0.00%) 2710507.00 ( 3.12%)
Max minorfaults-769M 2749296.00 ( 0.00%) 2682591.00 ( 2.43%)
Max minorfaults-2565M 2637180.00 ( 0.00%) 2584036.00 ( 2.02%)
Max minorfaults-4361M 2592162.00 ( 0.00%) 2538624.00 ( 2.07%)
Max majorfaults-0M 110188.00 ( 0.00%) 103107.00 ( 6.43%)
Max majorfaults-769M 106900.00 ( 0.00%) 92559.00 ( 13.42%)
Max majorfaults-2565M 77770.00 ( 0.00%) 72043.00 ( 7.36%)
Max majorfaults-4361M 67207.00 ( 0.00%) 54538.00 ( 18.85%)

vanilla lrucost
60 60
User 1108.24 1122.37
System 4636.57 4650.63
Elapsed 6046.97 6047.82

vanilla lrucost
60 60
Minor Faults 34022711 33360104
Major Faults 1014895 929273
Swap Ins 2997968 2677588
Swap Outs 6397877 5956707
Allocation stalls 27 31
DMA allocs 0 0
DMA32 allocs 15080196 14356136
Normal allocs 26177871 26662120
Movable allocs 0 0
Direct pages scanned 31625 27194
Kswapd pages scanned 33103442 27727713
Kswapd pages reclaimed 11817394 11598677
Direct pages reclaimed 21146 24043
Kswapd efficiency 35% 41%
Kswapd velocity 5474.385 4584.745
Direct efficiency 66% 88%
Direct velocity 5.230 4.496
Percentage direct scans 0% 0%
Zone normal velocity 3786.073 3908.266
Zone dma32 velocity 1693.542 680.975
Zone dma velocity 0.000 0.000
Page writes by reclaim 6398557.000 5962129.000
Page writes file 680 5422
Page writes anon 6397877 5956707
Page reclaim immediate 3750 12647
Sector Reads 12608512 11624860
Sector Writes 49304260 47539216
Page rescued immediate 0 0
Slabs scanned 148322 164263
Direct inode steals 0 0
Kswapd inode steals 0 22
Kswapd skipped wait 0 0
THP fault alloc 6 3
THP collapse alloc 3490 3567
THP splits 0 0
THP fault fallback 0 0
THP collapse fail 13 17
Compaction stalls 431 446
Compaction success 405 416
Compaction failures 26 30
Page migrate success 199708 211181
Page migrate failure 71 121
Compaction pages isolated 425244 452352
Compaction migrate scanned 209471 226018
Compaction free scanned 20950979 23257076
Compaction cost 216 229
NUMA alloc hit 38459351 38177612
NUMA alloc miss 0 0
NUMA interleave hit 0 0
NUMA alloc local 38455861 38174045
NUMA base PTE updates 0 0
NUMA huge PMD updates 0 0
NUMA page range updates 0 0
NUMA hint faults 0 0
NUMA hint local faults 0 0
NUMA hint local percent 100 100
NUMA pages migrated 0 0
AutoNUMA cost 0% 0%

Both the memcache transactions and the background IO throughput are
unchanged.

Overall reclaim activity actually went down in the patched kernel,
since the VM is now deterred by the swapins, whereas previously a
successful swapout followed by a swapin would actually make the anon
LRU more attractive (a swapout counts as a scanned but not rotated
page, and a swapin puts pages on the inactive list, which used to
count as a scan event too).

The changes are fairly straightforward, but they do require a page
flag to tell inactive cache refaults (cache transition) from active
ones (existing cache needs more space). On x86-32 PAE, that bumps us
to 22 core flags + 7 section bits + 2 zone bits = 31 bits; with the
configurable hwpoison flag that makes 32, using up the last page flag.
However, this is core VM functionality, and we can make new features
64-bit-only, like we did with the page idle tracking.
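
For reference, the flag budget on that configuration works out roughly
as follows (assuming SPARSEMEM without vmemmap for the section bits and
a hwpoison-enabled config; this only restates the arithmetic above):

    22 core page flags (including the new PG_workingset)
  +  7 section bits
  +  2 zone bits
  = 31 bits
  +  1 hwpoison flag (when configured in)
  = 32 bits -> page->flags is full on 32-bit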

Thanks

Documentation/sysctl/vm.txt | 16 +++--
fs/cifs/file.c | 10 +--
fs/fuse/dev.c | 2 +-
include/linux/mmzone.h | 29 ++++----
include/linux/page-flags.h | 2 +
include/linux/pagevec.h | 2 +-
include/linux/swap.h | 11 ++-
include/trace/events/mmflags.h | 1 +
kernel/sysctl.c | 3 +-
mm/filemap.c | 9 +--
mm/migrate.c | 4 ++
mm/mlock.c | 2 +-
mm/shmem.c | 4 +-
mm/swap.c | 124 +++++++++++++++++++---------------
mm/swap_state.c | 3 +-
mm/vmscan.c | 48 ++++++-------
mm/vmstat.c | 6 +-
mm/workingset.c | 142 +++++++++++++++++++++++++++++----------
18 files changed, 258 insertions(+), 160 deletions(-)

Johannes Weiner

Jun 6, 2016, 4:00:09 PM
With the advent of fast random IO devices (SSDs, PMEM) and in-memory
swap devices such as zswap, it's possible for swap to be much faster
than filesystems, and for swapping to be preferable over thrashing
filesystem caches.

Allow setting swappiness - which defines the relative IO cost of cache
misses between page cache and swap-backed pages - to reflect such
situations by making the swap-preferred range configurable.
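
(Illustration only, not part of the patch: the 0..200 value maps onto
relative reclaim weights roughly as sketched below, mirroring the
anon_prio/file_prio split in get_scan_count(); the helper name is made
up for illustration.)

static inline void swappiness_to_weights(int swappiness,
                                         int *anon_weight, int *file_weight)
{
        /*
         * swappiness   0: swap IO considered far more expensive than
         *                 filesystem IO - don't swap until free plus
         *                 file pages fall below the high watermark
         * swappiness 100: swap IO and filesystem IO cost the same
         * swappiness 200: filesystem IO considered far more expensive
         *                 than swap IO - prefer swapping
         */
        *anon_weight = swappiness;
        *file_weight = 200 - swappiness;
}
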

Signed-off-by: Johannes Weiner <han...@cmpxchg.org>
---
Documentation/sysctl/vm.txt | 16 +++++++++++-----
kernel/sysctl.c | 3 ++-
mm/vmscan.c | 2 +-
3 files changed, 14 insertions(+), 7 deletions(-)

diff --git a/Documentation/sysctl/vm.txt b/Documentation/sysctl/vm.txt
index 720355cbdf45..54030750cd31 100644
--- a/Documentation/sysctl/vm.txt
+++ b/Documentation/sysctl/vm.txt
@@ -771,14 +771,20 @@ with no ill effects: errors and warnings on these stats are suppressed.)

swappiness

-This control is used to define how aggressive the kernel will swap
-memory pages. Higher values will increase agressiveness, lower values
-decrease the amount of swap. A value of 0 instructs the kernel not to
-initiate swap until the amount of free and file-backed pages is less
-than the high water mark in a zone.
+This control is used to define the relative IO cost of cache misses
+between the swap device and the filesystem as a value between 0 and
+200. At 100, the VM assumes equal IO cost and will thus apply memory
+pressure to the page cache and swap-backed pages equally. At 0, the
+kernel will not initiate swap until the amount of free and file-backed
+pages is less than the high watermark in a zone.

The default value is 60.

+On non-rotational swap devices, a value of 100 (or higher, depending
+on what's backing the filesystem) is recommended.
+
+For in-memory swap, like zswap, values closer to 200 are recommended.
+
==============================================================

- user_reserve_kbytes
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 2effd84d83e3..56a9243eb171 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -126,6 +126,7 @@ static int __maybe_unused two = 2;
static int __maybe_unused four = 4;
static unsigned long one_ul = 1;
static int one_hundred = 100;
+static int two_hundred = 200;
static int one_thousand = 1000;
#ifdef CONFIG_PRINTK
static int ten_thousand = 10000;
@@ -1323,7 +1324,7 @@ static struct ctl_table vm_table[] = {
.mode = 0644,
.proc_handler = proc_dointvec_minmax,
.extra1 = &zero,
- .extra2 = &one_hundred,
+ .extra2 = &two_hundred,
},
#ifdef CONFIG_HUGETLB_PAGE
{
diff --git a/mm/vmscan.c b/mm/vmscan.c
index c4a2f4512fca..f79010bbcdd4 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -136,7 +136,7 @@ struct scan_control {
#endif

/*
- * From 0 .. 100. Higher means more swappy.
+ * From 0 .. 200. Higher means more swappy.
*/
int vm_swappiness = 60;
/*
--
2.8.3

Johannes Weiner

Jun 6, 2016, 4:00:09 PM
Isolating an existing LRU page and subsequently putting it back on the
list currently influences the balance between the anon and file LRUs.
For example, heavy page migration or compaction could skew that balance
and make one page type look more attractive simply because more pages
of that type are isolated and put back than of the other. That doesn't
make sense.

Add a dedicated LRU cache for putback, so that we can tell new LRU
pages from existing ones at the time of linking them to the lists.

Signed-off-by: Johannes Weiner <han...@cmpxchg.org>
---
include/linux/pagevec.h | 2 +-
include/linux/swap.h | 1 +
mm/mlock.c | 2 +-
mm/swap.c | 34 ++++++++++++++++++++++++++++------
mm/vmscan.c | 2 +-
5 files changed, 32 insertions(+), 9 deletions(-)

diff --git a/include/linux/pagevec.h b/include/linux/pagevec.h
index b45d391b4540..3f8a2a01131c 100644
--- a/include/linux/pagevec.h
+++ b/include/linux/pagevec.h
@@ -21,7 +21,7 @@ struct pagevec {
};

void __pagevec_release(struct pagevec *pvec);
-void __pagevec_lru_add(struct pagevec *pvec);
+void __pagevec_lru_add(struct pagevec *pvec, bool new);
unsigned pagevec_lookup_entries(struct pagevec *pvec,
struct address_space *mapping,
pgoff_t start, unsigned nr_entries,
diff --git a/include/linux/swap.h b/include/linux/swap.h
index 38fe1e91ba55..178f084365c2 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -296,6 +296,7 @@ extern unsigned long nr_free_pagecache_pages(void);

/* linux/mm/swap.c */
extern void lru_cache_add(struct page *);
+extern void lru_cache_putback(struct page *page);
extern void lru_add_page_tail(struct page *page, struct page *page_tail,
struct lruvec *lruvec, struct list_head *head);
extern void activate_page(struct page *);
diff --git a/mm/mlock.c b/mm/mlock.c
index 96f001041928..449c291a286d 100644
--- a/mm/mlock.c
+++ b/mm/mlock.c
@@ -264,7 +264,7 @@ static void __putback_lru_fast(struct pagevec *pvec, int pgrescued)
*__pagevec_lru_add() calls release_pages() so we don't call
* put_page() explicitly
*/
- __pagevec_lru_add(pvec);
+ __pagevec_lru_add(pvec, false);
count_vm_events(UNEVICTABLE_PGRESCUED, pgrescued);
}

diff --git a/mm/swap.c b/mm/swap.c
index c6936507abb5..576c721f210b 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -44,6 +44,7 @@
int page_cluster;

static DEFINE_PER_CPU(struct pagevec, lru_add_pvec);
+static DEFINE_PER_CPU(struct pagevec, lru_putback_pvec);
static DEFINE_PER_CPU(struct pagevec, lru_rotate_pvecs);
static DEFINE_PER_CPU(struct pagevec, lru_deactivate_file_pvecs);
static DEFINE_PER_CPU(struct pagevec, lru_deactivate_pvecs);
@@ -405,12 +406,23 @@ void lru_cache_add(struct page *page)

get_page(page);
if (!pagevec_space(pvec))
- __pagevec_lru_add(pvec);
+ __pagevec_lru_add(pvec, true);
pagevec_add(pvec, page);
put_cpu_var(lru_add_pvec);
}
EXPORT_SYMBOL(lru_cache_add);

+void lru_cache_putback(struct page *page)
+{
+ struct pagevec *pvec = &get_cpu_var(lru_putback_pvec);
+
+ get_page(page);
+ if (!pagevec_space(pvec))
+ __pagevec_lru_add(pvec, false);
+ pagevec_add(pvec, page);
+ put_cpu_var(lru_putback_pvec);
+}
+
/**
* add_page_to_unevictable_list - add a page to the unevictable list
* @page: the page to be added to the unevictable list
@@ -561,10 +573,15 @@ static void lru_deactivate_fn(struct page *page, struct lruvec *lruvec,
*/
void lru_add_drain_cpu(int cpu)
{
- struct pagevec *pvec = &per_cpu(lru_add_pvec, cpu);
+ struct pagevec *pvec;
+
+ pvec = &per_cpu(lru_add_pvec, cpu);
+ if (pagevec_count(pvec))
+ __pagevec_lru_add(pvec, true);

+ pvec = &per_cpu(lru_putback_pvec, cpu);
if (pagevec_count(pvec))
- __pagevec_lru_add(pvec);
+ __pagevec_lru_add(pvec, false);

pvec = &per_cpu(lru_rotate_pvecs, cpu);
if (pagevec_count(pvec)) {
@@ -819,12 +836,17 @@ static void __pagevec_lru_add_fn(struct page *page, struct lruvec *lruvec,
int file = page_is_file_cache(page);
int active = PageActive(page);
enum lru_list lru = page_lru(page);
+ bool new = (bool)arg;

VM_BUG_ON_PAGE(PageLRU(page), page);

SetPageLRU(page);
add_page_to_lru_list(page, lruvec, lru);
- update_page_reclaim_stat(lruvec, file, active, hpage_nr_pages(page));
+
+ if (new)
+ update_page_reclaim_stat(lruvec, file, active,
+ hpage_nr_pages(page));
+
trace_mm_lru_insertion(page, lru);
}

@@ -832,9 +854,9 @@ static void __pagevec_lru_add_fn(struct page *page, struct lruvec *lruvec,
* Add the passed pages to the LRU, then drop the caller's refcount
* on them. Reinitialises the caller's pagevec.
*/
-void __pagevec_lru_add(struct pagevec *pvec)
+void __pagevec_lru_add(struct pagevec *pvec, bool new)
{
- pagevec_lru_move_fn(pvec, __pagevec_lru_add_fn, NULL);
+ pagevec_lru_move_fn(pvec, __pagevec_lru_add_fn, (void *)new);
}

/**
diff --git a/mm/vmscan.c b/mm/vmscan.c
index f79010bbcdd4..8503713bb60e 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -737,7 +737,7 @@ redo:
* We know how to handle that.
*/
is_unevictable = false;
- lru_cache_add(page);
+ lru_cache_putback(page);
} else {
/*
* Put unevictable pages directly on zone's unevictable
--
2.8.3

Johannes Weiner

Jun 6, 2016, 4:00:10 PM
When the splitlru patches divided page cache and swap-backed pages
into separate LRU lists, the pressure balance between the lists was
biased to account for the fact that streaming IO can cause memory
pressure with a flood of pages that are used only once. New page cache
additions would tip the balance toward the file LRU, and repeat access
would neutralize that bias again. This ensured that page reclaim would
always go for used-once cache first.

Since e9868505987a ("mm,vmscan: only evict file pages when we have
plenty"), page reclaim generally skips over swap-backed memory
entirely as long as there is used-once cache present, and will apply
the LRU balancing when only repeatedly accessed cache pages are left -
at which point the previous use-once bias will have been neutralized.

This makes the use-once cache balancing bias unnecessary. Remove it.

Signed-off-by: Johannes Weiner <han...@cmpxchg.org>
---
mm/swap.c | 11 -----------
1 file changed, 11 deletions(-)

diff --git a/mm/swap.c b/mm/swap.c
index 576c721f210b..814e3a2e54b4 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -264,7 +264,6 @@ static void __activate_page(struct page *page, struct lruvec *lruvec,
void *arg)
{
if (PageLRU(page) && !PageActive(page) && !PageUnevictable(page)) {
- int file = page_is_file_cache(page);
int lru = page_lru_base_type(page);

del_page_from_lru_list(page, lruvec, lru);
@@ -274,7 +273,6 @@ static void __activate_page(struct page *page, struct lruvec *lruvec,
trace_mm_lru_activate(page);

__count_vm_event(PGACTIVATE);
- update_page_reclaim_stat(lruvec, file, 1, hpage_nr_pages(page));
}
}

@@ -797,8 +795,6 @@ EXPORT_SYMBOL(__pagevec_release);
void lru_add_page_tail(struct page *page, struct page *page_tail,
struct lruvec *lruvec, struct list_head *list)
{
- const int file = 0;
-
VM_BUG_ON_PAGE(!PageHead(page), page);
VM_BUG_ON_PAGE(PageCompound(page_tail), page);
VM_BUG_ON_PAGE(PageLRU(page_tail), page);
@@ -833,20 +829,13 @@ void lru_add_page_tail(struct page *page, struct page *page_tail,
static void __pagevec_lru_add_fn(struct page *page, struct lruvec *lruvec,
void *arg)
{
- int file = page_is_file_cache(page);
- int active = PageActive(page);
enum lru_list lru = page_lru(page);
- bool new = (bool)arg;

VM_BUG_ON_PAGE(PageLRU(page), page);

SetPageLRU(page);
add_page_to_lru_list(page, lruvec, lru);

- if (new)
- update_page_reclaim_stat(lruvec, file, active,
- hpage_nr_pages(page));
-
trace_mm_lru_insertion(page, lru);
}

--
2.8.3

Johannes Weiner

Jun 6, 2016, 4:00:11 PM
Since the LRUs were split into anon and file lists, the VM has been
balancing between page cache and anonymous pages based on per-list
ratios of scanned vs. rotated pages. In most cases that tips page
reclaim towards the list that is easier to reclaim and has the fewest
actively used pages, but there are a few problems with it:

1. Refaults and in-memory rotations are weighted the same way, even
though one costs IO and the other costs CPU. When the balance is
off, the page cache can be thrashing while anonymous pages are aged
comparatively more slowly and thus have more time to get even their
coldest pages referenced. The VM would consider this a fair
equilibrium.

2. The page cache usually has a share of use-once pages that will
further dilute its scanned/rotated ratio in the above-mentioned
scenario. This can bring scanning of the anonymous list to a near
standstill - again while the page cache is thrashing and IO-bound.

Historically, swap has been an emergency overflow for high memory
pressure, and we avoided using it as long as new page allocations
could be served from recycling page cache. However, when recycling
page cache incurs a higher cost in IO than swapping out a few unused
anonymous pages would, it makes sense to increase swap pressure.

In order to accomplish this, we can extend the thrash detection code
that currently detects workingset changes within the page cache: when
inactive cache pages are thrashing, the VM raises LRU pressure on the
otherwise protected active file list to increase competition. However,
when active pages begin refaulting as well, it means that the page
cache is thrashing as a whole and the LRU balance should tip toward
anonymous. This is what this patch implements.

To tell inactive from active refaults, a page flag is introduced that
marks pages that have been on the active list in their lifetime. This
flag is remembered in the shadow page entry on reclaim, and restored
when the page refaults. It is also set on anonymous pages during
swapin. When a page with that flag set is added to the LRU, the LRU
balance is adjusted for the IO cost of reclaiming the thrashing list.

Rotations continue to influence the LRU balance as well, but with a
different weight factor. That factor is statically chosen such that
refaults are considered more costly than rotations at this point. We
might want to revisit this for ultra-fast swap or secondary memory
devices, where rotating referenced pages might be more costly than
swapping or relocating them directly and have some of them refault.
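
Condensed, the refault path added by this patch behaves as follows (a
summary sketch of the workingset_refault() changes in the diff below,
not additional code):

        if (refault_distance > active_file + available_anon)
                return;                  /* would not have fit in memory anyway */

        SetPageActive(page);             /* challenge the current workingset */

        if (workingset) {                /* PG_workingset was set at eviction */
                SetPageWorkingset(page); /* whole cache is thrashing; the LRU
                                            insertion will charge the refault's
                                            IO cost against the file list */
        }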

Signed-off-by: Johannes Weiner <han...@cmpxchg.org>
---
include/linux/mmzone.h | 6 +-
include/linux/page-flags.h | 2 +
include/linux/swap.h | 10 ++-
include/trace/events/mmflags.h | 1 +
mm/filemap.c | 9 +--
mm/migrate.c | 4 ++
mm/swap.c | 38 ++++++++++-
mm/swap_state.c | 1 +
mm/vmscan.c | 5 +-
mm/vmstat.c | 6 +-
mm/workingset.c | 142 +++++++++++++++++++++++++++++++----------
11 files changed, 172 insertions(+), 52 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 4d257d00fbf5..d7aaee25b536 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -148,9 +148,9 @@ enum zone_stat_item {
NUMA_LOCAL, /* allocation from local node */
NUMA_OTHER, /* allocation from other node */
#endif
- WORKINGSET_REFAULT,
- WORKINGSET_ACTIVATE,
- WORKINGSET_NODERECLAIM,
+ REFAULT_INACTIVE_FILE,
+ REFAULT_ACTIVE_FILE,
+ REFAULT_NODERECLAIM,
NR_ANON_TRANSPARENT_HUGEPAGES,
NR_FREE_CMA_PAGES,
NR_VM_ZONE_STAT_ITEMS };
diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index e5a32445f930..a1b9d7dddd68 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -79,6 +79,7 @@ enum pageflags {
PG_dirty,
PG_lru,
PG_active,
+ PG_workingset,
PG_slab,
PG_owner_priv_1, /* Owner use. If pagecache, fs may use*/
PG_arch_1,
@@ -259,6 +260,7 @@ PAGEFLAG(Dirty, dirty, PF_HEAD) TESTSCFLAG(Dirty, dirty, PF_HEAD)
PAGEFLAG(LRU, lru, PF_HEAD) __CLEARPAGEFLAG(LRU, lru, PF_HEAD)
PAGEFLAG(Active, active, PF_HEAD) __CLEARPAGEFLAG(Active, active, PF_HEAD)
TESTCLEARFLAG(Active, active, PF_HEAD)
+PAGEFLAG(Workingset, workingset, PF_HEAD)
__PAGEFLAG(Slab, slab, PF_NO_TAIL)
__PAGEFLAG(SlobFree, slob_free, PF_NO_TAIL)
PAGEFLAG(Checked, checked, PF_NO_COMPOUND) /* Used by some filesystems */
diff --git a/include/linux/swap.h b/include/linux/swap.h
index c461ce0533da..9923b51ee8e9 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -250,7 +250,7 @@ struct swap_info_struct {

/* linux/mm/workingset.c */
void *workingset_eviction(struct address_space *mapping, struct page *page);
-bool workingset_refault(void *shadow);
+void workingset_refault(struct page *page, void *shadow);
void workingset_activation(struct page *page);
extern struct list_lru workingset_shadow_nodes;

@@ -295,8 +295,12 @@ extern unsigned long nr_free_pagecache_pages(void);


/* linux/mm/swap.c */
-extern void lru_note_cost(struct lruvec *lruvec, bool file,
- unsigned int nr_pages);
+enum lru_cost_type {
+ COST_CPU,
+ COST_IO,
+};
+extern void lru_note_cost(struct lruvec *lruvec, enum lru_cost_type cost,
+ bool file, unsigned int nr_pages);
extern void lru_cache_add(struct page *);
extern void lru_cache_putback(struct page *page);
extern void lru_add_page_tail(struct page *page, struct page *page_tail,
diff --git a/include/trace/events/mmflags.h b/include/trace/events/mmflags.h
index 43cedbf0c759..bc05e0ac1b8c 100644
--- a/include/trace/events/mmflags.h
+++ b/include/trace/events/mmflags.h
@@ -86,6 +86,7 @@
{1UL << PG_dirty, "dirty" }, \
{1UL << PG_lru, "lru" }, \
{1UL << PG_active, "active" }, \
+ {1UL << PG_workingset, "workingset" }, \
{1UL << PG_slab, "slab" }, \
{1UL << PG_owner_priv_1, "owner_priv_1" }, \
{1UL << PG_arch_1, "arch_1" }, \
diff --git a/mm/filemap.c b/mm/filemap.c
index 9665b1d4f318..1b356b47381b 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -700,12 +700,9 @@ int add_to_page_cache_lru(struct page *page, struct address_space *mapping,
* data from the working set, only to cache data that will
* get overwritten with something else, is a waste of memory.
*/
- if (!(gfp_mask & __GFP_WRITE) &&
- shadow && workingset_refault(shadow)) {
- SetPageActive(page);
- workingset_activation(page);
- } else
- ClearPageActive(page);
+ WARN_ON_ONCE(PageActive(page));
+ if (!(gfp_mask & __GFP_WRITE) && shadow)
+ workingset_refault(page, shadow);
lru_cache_add(page);
}
return ret;
diff --git a/mm/migrate.c b/mm/migrate.c
index 9baf41c877ff..115d49441c6c 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -544,6 +544,8 @@ void migrate_page_copy(struct page *newpage, struct page *page)
SetPageActive(newpage);
} else if (TestClearPageUnevictable(page))
SetPageUnevictable(newpage);
+ if (PageWorkingset(page))
+ SetPageWorkingset(newpage);
if (PageChecked(page))
SetPageChecked(newpage);
if (PageMappedToDisk(page))
@@ -1809,6 +1811,8 @@ fail_putback:
mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);

/* Reverse changes made by migrate_page_copy() */
+ if (TestClearPageWorkingset(new_page))
+ ClearPageWorkingset(page);
if (TestClearPageActive(new_page))
SetPageActive(page);
if (TestClearPageUnevictable(new_page))
diff --git a/mm/swap.c b/mm/swap.c
index ae07b469ddca..cb6773e1424e 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -249,8 +249,28 @@ void rotate_reclaimable_page(struct page *page)
}
}

-void lru_note_cost(struct lruvec *lruvec, bool file, unsigned int nr_pages)
+void lru_note_cost(struct lruvec *lruvec, enum lru_cost_type cost,
+ bool file, unsigned int nr_pages)
{
+ if (cost == COST_IO) {
+ /*
+ * Reflect the relative reclaim cost between incurring
+ * IO from refaults on one hand, and incurring CPU
+ * cost from rotating scanned pages on the other.
+ *
+ * XXX: For now, the relative cost factor for IO is
+ * set statically to outweigh the cost of rotating
+ * referenced pages. This might change with ultra-fast
+ * IO devices, or with secondary memory devices that
+ * allow users continued access of swapped out pages.
+ *
+ * Until then, the value is chosen simply such that we
+ * balance for IO cost first and optimize for CPU only
+ * once the thrashing subsides.
+ */
+ nr_pages *= SWAP_CLUSTER_MAX;
+ }
+
lruvec->balance.numer[file] += nr_pages;
lruvec->balance.denom += nr_pages;
}
@@ -262,6 +282,7 @@ static void __activate_page(struct page *page, struct lruvec *lruvec,
int lru = page_lru_base_type(page);

del_page_from_lru_list(page, lruvec, lru);
+ SetPageWorkingset(page);
SetPageActive(page);
lru += LRU_ACTIVE;
add_page_to_lru_list(page, lruvec, lru);
@@ -821,13 +842,28 @@ void lru_add_page_tail(struct page *page, struct page *page_tail,
static void __pagevec_lru_add_fn(struct page *page, struct lruvec *lruvec,
void *arg)
{
+ unsigned int nr_pages = hpage_nr_pages(page);
enum lru_list lru = page_lru(page);
+ bool active = is_active_lru(lru);
+ bool file = is_file_lru(lru);
+ bool new = (bool)arg;

VM_BUG_ON_PAGE(PageLRU(page), page);

SetPageLRU(page);
add_page_to_lru_list(page, lruvec, lru);

+ if (new) {
+ /*
+ * If the workingset is thrashing, note the IO cost of
+ * reclaiming that list and steer reclaim away from it.
+ */
+ if (PageWorkingset(page))
+ lru_note_cost(lruvec, COST_IO, file, nr_pages);
+ else if (active)
+ SetPageWorkingset(page);
+ }
+
trace_mm_lru_insertion(page, lru);
}

diff --git a/mm/swap_state.c b/mm/swap_state.c
index 5400f814ae12..43561a56ba5d 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -365,6 +365,7 @@ struct page *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
/*
* Initiate read into locked page and return.
*/
+ SetPageWorkingset(new_page);
lru_cache_add(new_page);
*new_page_allocated = true;
return new_page;
diff --git a/mm/vmscan.c b/mm/vmscan.c
index acbd212eab6e..b2cb4f4f9d31 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1216,6 +1216,7 @@ activate_locked:
if (PageSwapCache(page) && mem_cgroup_swap_full(page))
try_to_free_swap(page);
VM_BUG_ON_PAGE(PageActive(page), page);
+ SetPageWorkingset(page);
SetPageActive(page);
pgactivate++;
keep_locked:
@@ -1524,7 +1525,7 @@ putback_inactive_pages(struct lruvec *lruvec, struct list_head *page_list)
* Rotating pages costs CPU without actually
* progressing toward the reclaim goal.
*/
- lru_note_cost(lruvec, file, numpages);
+ lru_note_cost(lruvec, COST_CPU, file, numpages);
}

if (put_page_testzero(page)) {
@@ -1849,7 +1850,7 @@ static void shrink_active_list(unsigned long nr_to_scan,
* Rotating pages costs CPU without actually
* progressing toward the reclaim goal.
*/
- lru_note_cost(lruvec, file, nr_rotated);
+ lru_note_cost(lruvec, COST_CPU, file, nr_rotated);

move_active_pages_to_lru(lruvec, &l_active, &l_hold, lru);
move_active_pages_to_lru(lruvec, &l_inactive, &l_hold, lru - LRU_ACTIVE);
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 77e42ef388c2..6c8d658f5b7f 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -727,9 +727,9 @@ const char * const vmstat_text[] = {
"numa_local",
"numa_other",
#endif
- "workingset_refault",
- "workingset_activate",
- "workingset_nodereclaim",
+ "refault_inactive_file",
+ "refault_active_file",
+ "refault_nodereclaim",
"nr_anon_transparent_hugepages",
"nr_free_cma",

diff --git a/mm/workingset.c b/mm/workingset.c
index 8a75f8d2916a..261cf583fb62 100644
--- a/mm/workingset.c
+++ b/mm/workingset.c
@@ -118,7 +118,7 @@
* the only thing eating into inactive list space is active pages.
*
*
- * Activating refaulting pages
+ * Refaulting inactive pages
*
* All that is known about the active list is that the pages have been
* accessed more than once in the past. This means that at any given
@@ -131,6 +131,10 @@
* used less frequently than the refaulting page - or even not used at
* all anymore.
*
+ * That means, if inactive cache is refaulting with a suitable refault
+ * distance, we assume the cache workingset is transitioning and put
+ * pressure on the existing cache pages on the active list.
+ *
* If this is wrong and demotion kicks in, the pages which are truly
* used more frequently will be reactivated while the less frequently
* used once will be evicted from memory.
@@ -139,6 +143,30 @@
* and the used pages get to stay in cache.
*
*
+ * Refaulting active pages
+ *
+ * If, on the other hand, the refaulting pages have been recently
+ * deactivated, it means that the active list is no longer protecting
+ * actively used cache from reclaim: the cache is not transitioning to
+ * a different workingset, the existing workingset is thrashing in the
+ * space allocated to the page cache.
+ *
+ * When that is the case, mere activation of the refaulting pages is
+ * not enough. The page reclaim code needs to be informed of the high
+ * IO cost associated with the continued reclaim of page cache, so
+ * that it can steer pressure to the anonymous list.
+ *
+ * Just as when refaulting inactive pages, it's possible that there
+ * are cold(er) anonymous pages that can be swapped and forgotten in
+ * order to increase the space available to the page cache as a whole.
+ *
+ * If anonymous pages start thrashing as well, the reclaim scanner
+ * will aim for the list that imposes the lowest cost on the system,
+ * where cost is defined as:
+ *
+ * refault rate * relative IO cost (as determined by swappiness)
+ *
+ *
* Implementation
*
* For each zone's file LRU lists, a counter for inactive evictions
@@ -150,10 +178,25 @@
*
* On cache misses for which there are shadow entries, an eligible
* refault distance will immediately activate the refaulting page.
+ *
+ * On activation, cache pages are marked PageWorkingset, which is not
+ * cleared until the page is freed. Shadow entries will remember that
+ * flag to be able to tell inactive from active refaults. Refaults of
+ * previous workingset pages will restore that page flag and inform
+ * page reclaim of the IO cost.
+ *
+ * XXX: Since we don't track anonymous references, every swap-in event
+ * is considered a workingset refault - regardless of distance. Swapin
+ * floods will thus always raise the assumed IO cost of reclaiming the
+ * anonymous LRU lists, even if the pages haven't been used recently.
+ * Temporary events don't matter that much other than they might delay
+ * the stabilization a bit. But during continuous thrashing, anonymous
+ * pages can have a leg-up against page cache. This might need fixing
+ * for ultra-fast IO devices or secondary memory types.
*/

-#define EVICTION_SHIFT (RADIX_TREE_EXCEPTIONAL_ENTRY + \
- ZONES_SHIFT + NODES_SHIFT + \
+#define EVICTION_SHIFT (RADIX_TREE_EXCEPTIONAL_ENTRY + \
+ 1 + ZONES_SHIFT + NODES_SHIFT + \
MEM_CGROUP_ID_SHIFT)
#define EVICTION_MASK (~0UL >> EVICTION_SHIFT)

@@ -167,24 +210,29 @@
*/
static unsigned int bucket_order __read_mostly;

-static void *pack_shadow(int memcgid, struct zone *zone, unsigned long eviction)
+static void *pack_shadow(int memcgid, struct zone *zone, unsigned long eviction,
+ bool workingset)
{
eviction >>= bucket_order;
eviction = (eviction << MEM_CGROUP_ID_SHIFT) | memcgid;
eviction = (eviction << NODES_SHIFT) | zone_to_nid(zone);
eviction = (eviction << ZONES_SHIFT) | zone_idx(zone);
+ eviction = (eviction << 1) | workingset;
eviction = (eviction << RADIX_TREE_EXCEPTIONAL_SHIFT);

return (void *)(eviction | RADIX_TREE_EXCEPTIONAL_ENTRY);
}

static void unpack_shadow(void *shadow, int *memcgidp, struct zone **zonep,
- unsigned long *evictionp)
+ unsigned long *evictionp, bool *workingsetp)
{
unsigned long entry = (unsigned long)shadow;
int memcgid, nid, zid;
+ bool workingset;

entry >>= RADIX_TREE_EXCEPTIONAL_SHIFT;
+ workingset = entry & 1;
+ entry >>= 1;
zid = entry & ((1UL << ZONES_SHIFT) - 1);
entry >>= ZONES_SHIFT;
nid = entry & ((1UL << NODES_SHIFT) - 1);
@@ -195,6 +243,7 @@ static void unpack_shadow(void *shadow, int *memcgidp, struct zone **zonep,
*memcgidp = memcgid;
*zonep = NODE_DATA(nid)->node_zones + zid;
*evictionp = entry << bucket_order;
+ *workingsetp = workingset;
}

/**
@@ -220,19 +269,18 @@ void *workingset_eviction(struct address_space *mapping, struct page *page)

lruvec = mem_cgroup_zone_lruvec(zone, memcg);
eviction = atomic_long_inc_return(&lruvec->inactive_age);
- return pack_shadow(memcgid, zone, eviction);
+ return pack_shadow(memcgid, zone, eviction, PageWorkingset(page));
}

/**
* workingset_refault - evaluate the refault of a previously evicted page
+ * @page: the freshly allocated replacement page
* @shadow: shadow entry of the evicted page
*
* Calculates and evaluates the refault distance of the previously
* evicted page in the context of the zone it was allocated in.
- *
- * Returns %true if the page should be activated, %false otherwise.
*/
-bool workingset_refault(void *shadow)
+void workingset_refault(struct page *page, void *shadow)
{
unsigned long refault_distance;
unsigned long active_file;
@@ -240,10 +288,12 @@ bool workingset_refault(void *shadow)
unsigned long eviction;
struct lruvec *lruvec;
unsigned long refault;
+ unsigned long anon;
struct zone *zone;
+ bool workingset;
int memcgid;

- unpack_shadow(shadow, &memcgid, &zone, &eviction);
+ unpack_shadow(shadow, &memcgid, &zone, &eviction, &workingset);

rcu_read_lock();
/*
@@ -263,40 +313,64 @@ bool workingset_refault(void *shadow)
* configurations instead.
*/
memcg = mem_cgroup_from_id(memcgid);
- if (!mem_cgroup_disabled() && !memcg) {
- rcu_read_unlock();
- return false;
- }
+ if (!mem_cgroup_disabled() && !memcg)
+ goto out;
lruvec = mem_cgroup_zone_lruvec(zone, memcg);
refault = atomic_long_read(&lruvec->inactive_age);
active_file = lruvec_lru_size(lruvec, LRU_ACTIVE_FILE);
- rcu_read_unlock();
+ if (mem_cgroup_get_nr_swap_pages(memcg) > 0)
+ anon = lruvec_lru_size(lruvec, LRU_ACTIVE_ANON) +
+ lruvec_lru_size(lruvec, LRU_INACTIVE_ANON);
+ else
+ anon = 0;

/*
- * The unsigned subtraction here gives an accurate distance
- * across inactive_age overflows in most cases.
+ * Calculate the refault distance.
*
- * There is a special case: usually, shadow entries have a
- * short lifetime and are either refaulted or reclaimed along
- * with the inode before they get too old. But it is not
- * impossible for the inactive_age to lap a shadow entry in
- * the field, which can then can result in a false small
- * refault distance, leading to a false activation should this
- * old entry actually refault again. However, earlier kernels
- * used to deactivate unconditionally with *every* reclaim
- * invocation for the longest time, so the occasional
- * inappropriate activation leading to pressure on the active
- * list is not a problem.
+ * The unsigned subtraction here gives an accurate distance
+ * across inactive_age overflows in most cases. There is a
+ * special case: usually, shadow entries have a short lifetime
+ * and are either refaulted or reclaimed along with the inode
+ * before they get too old. But it is not impossible for the
+ * inactive_age to lap a shadow entry in the field, which can
+ * then can result in a false small refault distance, leading
+ * to a false activation should this old entry actually
+ * refault again. However, earlier kernels used to deactivate
+ * unconditionally with *every* reclaim invocation for the
+ * longest time, so the occasional inappropriate activation
+ * leading to pressure on the active list is not a problem.
*/
refault_distance = (refault - eviction) & EVICTION_MASK;

- inc_zone_state(zone, WORKINGSET_REFAULT);
+ /*
+ * Compare the distance with the existing workingset. We don't
+ * act on pages that couldn't stay resident even with all the
+ * memory available to the page cache.
+ */
+ if (refault_distance > active_file + anon)
+ goto out;

- if (refault_distance <= active_file) {
- inc_zone_state(zone, WORKINGSET_ACTIVATE);
- return true;
+ /*
+ * If inactive cache is refaulting, activate the page to
+ * challenge the current cache workingset. The existing cache
+ * might be stale, or at least colder than the contender.
+ *
+ * If active cache is refaulting (PageWorkingset set at time
+ * of eviction), it means that the page cache as a whole is
+ * thrashing. Restore PageWorkingset to inform the LRU code
+ * about the additional cost of reclaiming more page cache.
+ */
+ SetPageActive(page);
+ atomic_long_inc(&lruvec->inactive_age);
+
+ if (workingset) {
+ SetPageWorkingset(page);
+ inc_zone_state(zone, REFAULT_ACTIVE_FILE);
+ } else {
+ inc_zone_state(zone, REFAULT_INACTIVE_FILE);
}
- return false;
+out:
+ rcu_read_unlock();
}

/**
@@ -433,7 +507,7 @@ static enum lru_status shadow_lru_isolate(struct list_head *item,
}
}
BUG_ON(node->count);
- inc_zone_state(page_zone(virt_to_page(node)), WORKINGSET_NODERECLAIM);
+ inc_zone_state(page_zone(virt_to_page(node)), REFAULT_NODERECLAIM);
if (!__radix_tree_delete_node(&mapping->page_tree, node))
BUG();

--
2.8.3

Johannes Weiner

Jun 6, 2016, 4:00:13 PM
Currently, scan pressure between the anon and file LRU lists is
balanced based on a mixture of reclaim efficiency and a somewhat vague
notion of "value" of having certain pages in memory over others. That
concept of value is problematic, because it has caused us to count any
event that remotely makes one LRU list more or less preferable for
reclaim, even when these events are not directly comparable to each
other and impose very different costs on the system - such as a
referenced file page that we still deactivate and a referenced
anonymous page that we actually rotate back to the head of the list.

There is also conceptual overlap with the LRU algorithm itself. By
rotating recently used pages instead of reclaiming them, the algorithm
already biases the applied scan pressure based on page value. Thus,
when rebalancing scan pressure due to rotations, we should think of
reclaim cost, and leave assessing the page value to the LRU algorithm.

Lastly, considering both value-increasing as well as value-decreasing
events can sometimes cause the same type of event to be counted twice,
i.e. how rotating a page increases the LRU value, while reclaiming it
successfully decreases the value. In itself this will balance out fine,
but it quietly skews the impact of events that are only recorded once.

The abstract metric of "value", the murky relationship with the LRU
algorithm, and accounting both negative and positive events make the
current pressure balancing model hard to reason about and modify.

In preparation for thrashing-based LRU balancing, this patch switches
to a balancing model of accounting the concrete, actually observed
cost of reclaiming one LRU over another. For now, that cost includes
pages that are scanned but rotated back to the list head. Subsequent
patches will add consideration for IO caused by refaulting recently
evicted pages. The idea is to primarily scan the LRU that thrashes the
least, and secondarily scan the LRU that needs the least amount of
work to free memory.

Rename struct zone_reclaim_stat to struct lru_cost, and move from two
separate value ratios for the LRU lists to a relative LRU cost metric
with a shared denominator. Then make everything that affects the cost
go through a new lru_note_cost() function.
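
As a worked example of the new ratio (using the formula from the
get_scan_count() hunk below, and assuming the existing anon_prio =
swappiness, file_prio = 200 - swappiness convention):

/*
 * swappiness = 60, so anon_prio = 60 and file_prio = 140.
 * Suppose the recorded costs are:
 *
 *     balance.numer[0] (anon rotations) =  1000
 *     balance.numer[1] (file rotations) =  9000
 *     balance.denom                     = 10000
 *
 *     ap = 60  * (10000 + 1) / (1000 + 1)  ~= 599
 *     fp = 140 * (10000 + 1) / (9000 + 1)  ~= 156
 *
 * ap / (ap + fp) ~= 0.79: roughly 79% of the scan pressure goes to
 * the anon lists, whose reclaim has been observed to be cheaper.
 */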

Signed-off-by: Johannes Weiner <han...@cmpxchg.org>
---
include/linux/mmzone.h | 23 +++++++++++------------
include/linux/swap.h | 2 ++
mm/swap.c | 15 +++++----------
mm/vmscan.c | 35 +++++++++++++++--------------------
4 files changed, 33 insertions(+), 42 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 02069c23486d..4d257d00fbf5 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -191,22 +191,21 @@ static inline int is_active_lru(enum lru_list lru)
return (lru == LRU_ACTIVE_ANON || lru == LRU_ACTIVE_FILE);
}

-struct zone_reclaim_stat {
- /*
- * The pageout code in vmscan.c keeps track of how many of the
- * mem/swap backed and file backed pages are referenced.
- * The higher the rotated/scanned ratio, the more valuable
- * that cache is.
- *
- * The anon LRU stats live in [0], file LRU stats in [1]
- */
- unsigned long recent_rotated[2];
- unsigned long recent_scanned[2];
+/*
+ * This tracks cost of reclaiming one LRU type - file or anon - over
+ * the other. As the observed cost of pressure on one type increases,
+ * the scan balance in vmscan.c tips toward the other type.
+ *
+ * The recorded cost for anon is in numer[0], file in numer[1].
+ */
+struct lru_cost {
+ unsigned long numer[2];
+ unsigned long denom;
};

struct lruvec {
struct list_head lists[NR_LRU_LISTS];
- struct zone_reclaim_stat reclaim_stat;
+ struct lru_cost balance;
/* Evictions & activations on the inactive file list */
atomic_long_t inactive_age;
#ifdef CONFIG_MEMCG
diff --git a/include/linux/swap.h b/include/linux/swap.h
index 178f084365c2..c461ce0533da 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -295,6 +295,8 @@ extern unsigned long nr_free_pagecache_pages(void);


/* linux/mm/swap.c */
+extern void lru_note_cost(struct lruvec *lruvec, bool file,
+ unsigned int nr_pages);
extern void lru_cache_add(struct page *);
extern void lru_cache_putback(struct page *page);
extern void lru_add_page_tail(struct page *page, struct page *page_tail,
diff --git a/mm/swap.c b/mm/swap.c
index 814e3a2e54b4..645d21242324 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -249,15 +249,10 @@ void rotate_reclaimable_page(struct page *page)
}
}

-static void update_page_reclaim_stat(struct lruvec *lruvec,
- int file, int rotated,
- unsigned int nr_pages)
+void lru_note_cost(struct lruvec *lruvec, bool file, unsigned int nr_pages)
{
- struct zone_reclaim_stat *reclaim_stat = &lruvec->reclaim_stat;
-
- reclaim_stat->recent_scanned[file] += nr_pages;
- if (rotated)
- reclaim_stat->recent_rotated[file] += nr_pages;
+ lruvec->balance.numer[file] += nr_pages;
+ lruvec->balance.denom += nr_pages;
}

static void __activate_page(struct page *page, struct lruvec *lruvec,
@@ -543,7 +538,7 @@ static void lru_deactivate_file_fn(struct page *page, struct lruvec *lruvec,

if (active)
__count_vm_event(PGDEACTIVATE);
- update_page_reclaim_stat(lruvec, file, 0, hpage_nr_pages(page));
+ lru_note_cost(lruvec, !file, hpage_nr_pages(page));
}


@@ -560,7 +555,7 @@ static void lru_deactivate_fn(struct page *page, struct lruvec *lruvec,
add_page_to_lru_list(page, lruvec, lru);

__count_vm_event(PGDEACTIVATE);
- update_page_reclaim_stat(lruvec, file, 0, hpage_nr_pages(page));
+ lru_note_cost(lruvec, !file, hpage_nr_pages(page));
}
}

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 8503713bb60e..06e381e1004c 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1492,7 +1492,6 @@ static int too_many_isolated(struct zone *zone, int file,
static noinline_for_stack void
putback_inactive_pages(struct lruvec *lruvec, struct list_head *page_list)
{
- struct zone_reclaim_stat *reclaim_stat = &lruvec->reclaim_stat;
struct zone *zone = lruvec_zone(lruvec);
LIST_HEAD(pages_to_free);

@@ -1521,8 +1520,13 @@ putback_inactive_pages(struct lruvec *lruvec, struct list_head *page_list)
if (is_active_lru(lru)) {
int file = is_file_lru(lru);
int numpages = hpage_nr_pages(page);
- reclaim_stat->recent_rotated[file] += numpages;
+ /*
+ * Rotating pages costs CPU without actually
+ * progressing toward the reclaim goal.
+ */
+ lru_note_cost(lruvec, file, numpages);
}
+
if (put_page_testzero(page)) {
__ClearPageLRU(page);
__ClearPageActive(page);
@@ -1577,7 +1581,6 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
isolate_mode_t isolate_mode = 0;
int file = is_file_lru(lru);
struct zone *zone = lruvec_zone(lruvec);
- struct zone_reclaim_stat *reclaim_stat = &lruvec->reclaim_stat;

while (unlikely(too_many_isolated(zone, file, sc))) {
congestion_wait(BLK_RW_ASYNC, HZ/10);
@@ -1601,7 +1604,6 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,

update_lru_size(lruvec, lru, -nr_taken);
__mod_zone_page_state(zone, NR_ISOLATED_ANON + file, nr_taken);
- reclaim_stat->recent_scanned[file] += nr_taken;

if (global_reclaim(sc)) {
__mod_zone_page_state(zone, NR_PAGES_SCANNED, nr_scanned);
@@ -1773,7 +1775,6 @@ static void shrink_active_list(unsigned long nr_to_scan,
LIST_HEAD(l_active);
LIST_HEAD(l_inactive);
struct page *page;
- struct zone_reclaim_stat *reclaim_stat = &lruvec->reclaim_stat;
unsigned long nr_rotated = 0;
isolate_mode_t isolate_mode = 0;
int file = is_file_lru(lru);
@@ -1793,7 +1794,6 @@ static void shrink_active_list(unsigned long nr_to_scan,

update_lru_size(lruvec, lru, -nr_taken);
__mod_zone_page_state(zone, NR_ISOLATED_ANON + file, nr_taken);
- reclaim_stat->recent_scanned[file] += nr_taken;

if (global_reclaim(sc))
__mod_zone_page_state(zone, NR_PAGES_SCANNED, nr_scanned);
@@ -1851,7 +1851,7 @@ static void shrink_active_list(unsigned long nr_to_scan,
* helps balance scan pressure between file and anonymous pages in
* get_scan_count.
*/
- reclaim_stat->recent_rotated[file] += nr_rotated;
+ lru_note_cost(lruvec, file, nr_rotated);

move_active_pages_to_lru(lruvec, &l_active, &l_hold, lru);
move_active_pages_to_lru(lruvec, &l_inactive, &l_hold, lru - LRU_ACTIVE);
@@ -1947,7 +1947,6 @@ static void get_scan_count(struct lruvec *lruvec, struct mem_cgroup *memcg,
unsigned long *lru_pages)
{
int swappiness = mem_cgroup_swappiness(memcg);
- struct zone_reclaim_stat *reclaim_stat = &lruvec->reclaim_stat;
u64 fraction[2];
u64 denominator = 0; /* gcc */
struct zone *zone = lruvec_zone(lruvec);
@@ -2072,14 +2071,10 @@ static void get_scan_count(struct lruvec *lruvec, struct mem_cgroup *memcg,
lruvec_lru_size(lruvec, LRU_INACTIVE_FILE);

spin_lock_irq(&zone->lru_lock);
- if (unlikely(reclaim_stat->recent_scanned[0] > anon / 4)) {
- reclaim_stat->recent_scanned[0] /= 2;
- reclaim_stat->recent_rotated[0] /= 2;
- }
-
- if (unlikely(reclaim_stat->recent_scanned[1] > file / 4)) {
- reclaim_stat->recent_scanned[1] /= 2;
- reclaim_stat->recent_rotated[1] /= 2;
+ if (unlikely(lruvec->balance.denom > (anon + file) / 8)) {
+ lruvec->balance.numer[0] /= 2;
+ lruvec->balance.numer[1] /= 2;
+ lruvec->balance.denom /= 2;
}

/*
@@ -2087,11 +2082,11 @@ static void get_scan_count(struct lruvec *lruvec, struct mem_cgroup *memcg,
* proportional to the fraction of recently scanned pages on
* each list that were recently referenced and in active use.
*/
- ap = anon_prio * (reclaim_stat->recent_scanned[0] + 1);
- ap /= reclaim_stat->recent_rotated[0] + 1;
+ ap = anon_prio * (lruvec->balance.denom + 1);
+ ap /= lruvec->balance.numer[0] + 1;

- fp = file_prio * (reclaim_stat->recent_scanned[1] + 1);
- fp /= reclaim_stat->recent_rotated[1] + 1;
+ fp = file_prio * (lruvec->balance.denom + 1);
+ fp /= lruvec->balance.numer[1] + 1;
spin_unlock_irq(&zone->lru_lock);

fraction[0] = ap;
--
2.8.3

Rik van Riel

Jun 6, 2016, 6:00:07 PM
On Mon, 2016-06-06 at 15:48 -0400, Johannes Weiner wrote:
> 
> +void lru_cache_putback(struct page *page)
> +{
> + struct pagevec *pvec = &get_cpu_var(lru_putback_pvec);
> +
> + get_page(page);
> + if (!pagevec_space(pvec))
> + __pagevec_lru_add(pvec, false);
> + pagevec_add(pvec, page);
> + put_cpu_var(lru_putback_pvec);
> +}
>

Wait a moment.

So now we have a putback_lru_page, which does adjust
the statistics, and an lru_cache_putback which does
not?

This function could use a name that is not as similar
to its counterpart :)

--
All Rights Reversed.

Johannes Weiner

Jun 6, 2016, 6:20:06 PM
lru_cache_add() and lru_cache_putback() are the two sibling functions,
where the first influences the LRU balance and the second one doesn't.

The last hunk in the patch (obscured by showing the label instead of
the function name as context) updates putback_lru_page() from using
lru_cache_add() to using lru_cache_putback().

Does that make sense?

Tim Chen

Jun 6, 2016, 8:00:08 PM
On Mon, 2016-06-06 at 15:48 -0400, Johannes Weiner wrote:
Johannes,

It seems like you are saying that the shadow entry is also present
for anonymous pages that are swapped out. But once a page is swapped
out, its entry is removed from the radix tree, and we won't be able
to store a shadow entry the way we do for file-mapped pages in
__remove_mapping(). Or are you thinking of modifying the current code
to keep the radix tree entry? I may be missing something, so I would
appreciate it if you could clarify.

Thanks.

Tim

Minchan Kim

unread,
Jun 6, 2016, 8:30:06 PM6/6/16
to
Hi Johannes,

Thanks for the nice work. I haven't read the whole patchset yet, but the design
makes sense to me, so it should work better for zram-based workloads
compared to the current code.
Generally, I agree that extending the swappiness value is good, but I'm
not sure 200 is enough to represent the speed gap between file and swap
storage in every case. - Just a nitpick.

Some years ago, I extended it to 200 like your patch and experimented with
it based on zram in our platform workload. At that time, it was terribly
slow in the app-switching workload if swappiness was higher than 150.
Although it was highly dependent on the workload, it's dangerous to
recommend it before fixing the balancing between file and anon, I think.
IOW, I think this patch should be the last one in this patchset.

>
> The default value is 60.
>
> +On non-rotational swap devices, a value of 100 (or higher, depending
> +on what's backing the filesystem) is recommended.
> +
> +For in-memory swap, like zswap, values closer to 200 are recommended.

maybe, like zram

I'm not sure it would be a good suggestion for zswap, because it ends up
writing cached pages to the swap device once it reaches its threshold.
Then the cost is compression + decompression + write I/O, which is
heavier than a normal swap device (i.e., write I/O). OTOH, zram has no
(writeback I/O + decompression) cost.

Rik van Riel

unread,
Jun 6, 2016, 9:20:06 PM6/6/16
to
That means the page reclaim does not update the
"rotated" statistics. That seems undesirable,
no? Am I overlooking something?


--
All Rights Reversed.


Rik van Riel

unread,
Jun 6, 2016, 10:30:06 PM6/6/16
to
On Mon, 2016-06-06 at 15:48 -0400, Johannes Weiner wrote:
> When the splitlru patches divided page cache and swap-backed pages
> into separate LRU lists, the pressure balance between the lists was
> biased to account for the fact that streaming IO can cause memory
> pressure with a flood of pages that are used only once. New page cache
> additions would tip the balance toward the file LRU, and repeat access
> would neutralize that bias again. This ensured that page reclaim would
> always go for used-once cache first.
>
> Since e9868505987a ("mm,vmscan: only evict file pages when we have
> plenty"), page reclaim generally skips over swap-backed memory
> entirely as long as there is used-once cache present, and will apply
> the LRU balancing when only repeatedly accessed cache pages are left -
> at which point the previous use-once bias will have been neutralized.
>
> This makes the use-once cache balancing bias unnecessary. Remove it.
>

The code in get_scan_count() still seems to use the statistics
that you just stopped updating.

What am I overlooking?

--
All Rights Reversed.


Rik van Riel

unread,
Jun 6, 2016, 10:40:06 PM6/6/16
to
On Mon, 2016-06-06 at 15:48 -0400, Johannes Weiner wrote:
> Currently, scan pressure between the anon and file LRU lists is
> balanced based on a mixture of reclaim efficiency and a somewhat vague
> notion of "value" of having certain pages in memory over others. That
> concept of value is problematic, because it has caused us to count any
> event that remotely makes one LRU list more or less preferable for
> reclaim, even when these events are not directly comparable to each
> other and impose very different costs on the system - such as a
> referenced file page that we still deactivate and a referenced
> anonymous page that we actually rotate back to the head of the list.
>

Well, patches 7-10 answered my question on patch 6 :)

I like this design.

--
All Rights Reversed.


Michal Hocko

unread,
Jun 7, 2016, 5:30:06 AM6/7/16
to
On Mon 06-06-16 18:15:50, Johannes Weiner wrote:
[...]
> The last hunk in the patch (obscured by showing the label instead of
> the function name as context)

JFYI my ~/.gitconfig has the following to work around this:

[diff "default"]
	xfuncname = "^[[:alpha:]$_].*[^:]$"

--
Michal Hocko
SUSE Labs

Michal Hocko

unread,
Jun 7, 2016, 6:00:06 AM6/7/16
to
On Mon 06-06-16 15:48:31, Johannes Weiner wrote:
> Isolating an existing LRU page and subsequently putting it back on the
> list currently influences the balance between the anon and file LRUs.
> For example, heavy page migration or compaction could influence the
> balance between the LRUs and make one type more attractive when that
> type of page is affected more than the other. That doesn't make sense.
>
> Add a dedicated LRU cache for putback, so that we can tell new LRU
> pages from existing ones at the time of linking them to the lists.

It is far from trivial to review this one (there are quite a few callers),
but it makes sense to me from the semantic point of view.

> Signed-off-by: Johannes Weiner <han...@cmpxchg.org>

Acked-by: Michal Hocko <mho...@suse.com>

Michal Hocko

unread,
Jun 7, 2016, 6:00:07 AM6/7/16
to
On Mon 06-06-16 15:48:26, Johannes Weiner wrote:
> Hi everybody,
>
> this series re-implements the LRU balancing between page cache and
> anonymous pages to work better with fast random IO swap devices.

I didn't get to review the full series properly but initial patches
(2-5) seem good to go even without the rest. I will try to get to the
rest ASAP.

Thanks!

Johannes Weiner

unread,
Jun 7, 2016, 10:00:07 AM6/7/16
to
Oh, reclaim doesn't use putback_lru_page(), except for the stray
unevictable corner case. It does open-coded putback in batch, and
those functions continue to update the reclaim statistics. See the
recent_scanned/recent_rotated manipulations in putback_inactive_pages(),
shrink_inactive_list(), and shrink_active_list().

putback_lru_page() is mainly used by page migration, cgroup migration,
mlock etc. - all operations which muck with the LRU for purposes other
than reclaim or aging, and so shouldn't affect the anon/file balance.

This patch only changes those LRU users, not page reclaim.

Johannes Weiner

unread,
Jun 7, 2016, 10:10:06 AM6/7/16
to
Thanks, that's useful. I added it to my ~/.gitconfig, so this should
be a little less confusing in v2.

Johannes Weiner

unread,
Jun 7, 2016, 10:20:06 AM6/7/16
to
On Tue, Jun 07, 2016 at 09:25:50AM +0900, Minchan Kim wrote:
> Hi Johannes,
>
> Thanks for the nice work. I haven't read the whole patchset yet, but the design
> makes sense to me, so it should work better for zram-based workloads
> compared to the current code.

Thanks!

> On Mon, Jun 06, 2016 at 03:48:27PM -0400, Johannes Weiner wrote:
> > --- a/Documentation/sysctl/vm.txt
> > +++ b/Documentation/sysctl/vm.txt
> > @@ -771,14 +771,20 @@ with no ill effects: errors and warnings on these stats are suppressed.)
> >
> > swappiness
> >
> > -This control is used to define how aggressive the kernel will swap
> > -memory pages. Higher values will increase agressiveness, lower values
> > -decrease the amount of swap. A value of 0 instructs the kernel not to
> > -initiate swap until the amount of free and file-backed pages is less
> > -than the high water mark in a zone.
> > +This control is used to define the relative IO cost of cache misses
> > +between the swap device and the filesystem as a value between 0 and
> > +200. At 100, the VM assumes equal IO cost and will thus apply memory
> > +pressure to the page cache and swap-backed pages equally. At 0, the
> > +kernel will not initiate swap until the amount of free and file-backed
> > +pages is less than the high watermark in a zone.
>
> Generally, I agree that extending the swappiness value is good, but I'm
> not sure 200 is enough to represent the speed gap between file and swap
> storage in every case. - Just a nitpick.

How so? You can't give swap more weight than 100%. 200 is the maximum
possible value.

> Some years ago, I extended it to 200 like your patch and experimented with
> it based on zram in our platform workload. At that time, it was terribly
> slow in the app-switching workload if swappiness was higher than 150.
> Although it was highly dependent on the workload, it's dangerous to
> recommend it before fixing the balancing between file and anon, I think.
> IOW, I think this patch should be the last one in this patchset.

Good point. I'll tone down the recommendations. But OTOH it's a fairly
trivial patch, so I wouldn't want it to close after the current 10/10.

> > The default value is 60.
> >
> > +On non-rotational swap devices, a value of 100 (or higher, depending
> > +on what's backing the filesystem) is recommended.
> > +
> > +For in-memory swap, like zswap, values closer to 200 are recommended.
>
> maybe, like zram
>
> I'm not sure it would be a good suggestion for zswap, because it ends up
> writing cached pages to the swap device once it reaches its threshold.
> Then the cost is compression + decompression + write I/O, which is
> heavier than a normal swap device (i.e., write I/O). OTOH, zram has no
> (writeback I/O + decompression) cost.

Oh, good catch. Yeah, I'll change that for v2.

Thanks for your input, Minchan

Johannes Weiner

unread,
Jun 7, 2016, 10:20:06 AM6/7/16
to
As I mentioned in 5/10, page reclaim still does updates for each
scanned page and rotated page at this point in the series.

This merely removes the pre-reclaim bias for cache.

Johannes Weiner

unread,
Jun 7, 2016, 10:20:06 AM6/7/16
to
Great! Thanks for reviewing.

Johannes Weiner

unread,
Jun 7, 2016, 12:30:10 PM6/7/16
to
Hi Tim,
Sorry if this was ambiguously phrased.

You are correct, there are no shadow entries for anonymous evictions,
only page cache evictions. All swap-ins are treated as "eligible"
refaults and push back against cache, whereas cache only pushes
against anon if the cache workingset is determined to fit into memory.

That implies a fixed hierarchy where the VM always tries to fit the
anonymous workingset into memory first and the page cache second. If
the anonymous set is bigger than memory, the algorithm won't stop
counting IO cost from anonymous refaults and pressuring page cache.

[ Although you can set the effective cost of these refaults to 0
(swappiness = 200) and reduce effective cache to a minimum -
possibly to a level where LRU rotations consume most of it.
But yeah. ]

So the current code works well when we assume that cache workingsets
might exceed memory, but anonymous workingsets don't.

For SSDs and non-DIMM pmem devices this assumption is fine, because
nobody wants half their frequent anonymous memory accesses to be major
faults. Anonymous workingsets will continue to target RAM size there.

Secondary memory types, which userspace can continue to map directly
after "swap out", are a different story. That might need workingset
estimation for anonymous pages. But it would have to build on top of
this series here. These patches are about eliminating or mitigating IO
by swapping idle or colder anon pages when the cache is thrashing.

Tim Chen

unread,
Jun 7, 2016, 4:00:06 PM6/7/16
to
On Tue, 2016-06-07 at 12:23 -0400, Johannes Weiner wrote:
> Hi Tim,
>
> On Mon, Jun 06, 2016 at 04:50:23PM -0700, Tim Chen wrote:
> >
> > On Mon, 2016-06-06 at 15:48 -0400, Johannes Weiner wrote:
> > >
> > > To tell inactive from active refaults, a page flag is introduced that
> > > marks pages that have been on the active list in their lifetime. This
> > > flag is remembered in the shadow page entry on reclaim, and restored
> > > when the page refaults. It is also set on anonymous pages during
> > > swapin. When a page with that flag set is added to the LRU, the LRU
> > > balance is adjusted for the IO cost of reclaiming the thrashing list.
> > Johannes,
> >
> > It seems like you are saying that the shadow entry is also present
> > for anonymous pages that are swapped out.  But once a page is swapped
> > out, its entry is removed from the radix tree and we won't be able
> > to store the shadow page entry as we do for file-mapped pages
> > in __remove_mapping.  Or are you thinking of modifying
> > the current code to keep the radix tree entry? I may be missing something,
> > so I would appreciate it if you could clarify.
> Sorry if this was ambiguously phrased.
>
> You are correct, there are no shadow entries for anonymous evictions,
> only page cache evictions. All swap-ins are treated as "eligible"
> refaults and push back against cache, whereas cache only pushes
> against anon if the cache workingset is determined to fit into memory.

Thanks. That makes sense.  I wasn't sure before whether you intended
to have a refault distance to determine if a
faulted-in anonymous page is in the working set.  I see now that
you always consider it to be in the working set.

>
> That implies a fixed hierarchy where the VM always tries to fit the
> anonymous workingset into memory first and the page cache second. If
> the anonymous set is bigger than memory, the algorithm won't stop
> counting IO cost from anonymous refaults and pressuring page cache.
>
> [ Although you can set the effective cost of these refaults to 0
>   (swappiness = 200) and reduce effective cache to a minimum -
>   possibly to a level where LRU rotations consume most of it.
>   But yeah. ]
>
> So the current code works well when we assume that cache workingsets
> might exceed memory, but anonymous workingsets don't.
>
> For SSDs and non-DIMM pmem devices this assumption is fine, because
> nobody wants half their frequent anonymous memory accesses to be major
> faults. Anonymous workingsets will continue to target RAM size there.
>
> Secondary memory types, which userspace can continue to map directly
> after "swap out", are a different story. That might need workingset
> estimation for anonymous pages.

The direct-mapped swap case is trickier, as we need a method to gauge how often
a page was accessed in place in swap, to decide if we need to
bring it back to RAM.  The accessed bit in the pte only tells
us whether it has been accessed, not how frequently.

If we simply try to mitigate IO cost, we may just have pages migrated and
accessed within the swap space, but never bring the hot ones back to RAM.

That said, this series is a very nice optimization of the balance between
anonymous and file-backed page reclaim.

Thanks.

Tim

Minchan Kim

unread,
Jun 7, 2016, 8:10:06 PM6/7/16
to
In the old scheme, swappiness is how aggressively we reclaim anonymous pages
in favour of page cache. But when I read your description and the changes
about swappiness in vm.txt, esp. *relative IO cost*, I feel you changed the
definition of swappiness to represent the relative IO cost between swap
storage and file storage. Then, with that, we could balance the anonymous
and file LRUs with that weight.

For example, let's assume that in-memory swap storage is 10x faster
than a slow thumb drive. In that case, the IO cost of 5 anonymous pages
swapping in/out is equal to that of 1 file-backed page discard/read.

I thought that makes sense, because measuring the speed gap between
those storages is easier than selecting a vague swappiness tendency.

With such an approach, I thought 200 is not enough to show the gap,
because the gap only starts from 100.
Isn't that your intention? If so, to me, the description was rather
misleading. :(

Minchan Kim

unread,
Jun 8, 2016, 3:40:06 AM6/8/16
to
Just trivial:

It's not clear in this context what the 'new' argument means, so it's
worth a comment, IMO, but no strong opinion.

Other than that,

Acked-by: Minchan Kim <min...@kernel.org>

Minchan Kim

unread,
Jun 8, 2016, 4:11:18 AM6/8/16
to
On Mon, Jun 06, 2016 at 03:48:32PM -0400, Johannes Weiner wrote:
> When the splitlru patches divided page cache and swap-backed pages
> into separate LRU lists, the pressure balance between the lists was
> biased to account for the fact that streaming IO can cause memory
> pressure with a flood of pages that are used only once. New page cache
> additions would tip the balance toward the file LRU, and repeat access
> would neutralize that bias again. This ensured that page reclaim would
> always go for used-once cache first.
>
> Since e9868505987a ("mm,vmscan: only evict file pages when we have
> plenty"), page reclaim generally skips over swap-backed memory
> entirely as long as there is used-once cache present, and will apply
> the LRU balancing when only repeatedly accessed cache pages are left -
> at which point the previous use-once bias will have been neutralized.
>
> This makes the use-once cache balancing bias unnecessary. Remove it.
>
> Signed-off-by: Johannes Weiner <han...@cmpxchg.org>
Acked-by: Minchan Kim <min...@kernel.org>

Minchan Kim

unread,
Jun 8, 2016, 4:20:07 AM6/8/16
to
balance.numer[0] + balance.numer[1] = balance.denom,
so can we just remove denom, then?

Michal Hocko

unread,
Jun 8, 2016, 8:40:07 AM6/8/16
to
On Mon 06-06-16 15:48:32, Johannes Weiner wrote:
> When the splitlru patches divided page cache and swap-backed pages
> into separate LRU lists, the pressure balance between the lists was
> biased to account for the fact that streaming IO can cause memory
> pressure with a flood of pages that are used only once. New page cache
> additions would tip the balance toward the file LRU, and repeat access
> would neutralize that bias again. This ensured that page reclaim would
> always go for used-once cache first.
>
> Since e9868505987a ("mm,vmscan: only evict file pages when we have
> plenty"), page reclaim generally skips over swap-backed memory
> entirely as long as there is used-once cache present, and will apply
> the LRU balancing when only repeatedly accessed cache pages are left -
> at which point the previous use-once bias will have been neutralized.
>
> This makes the use-once cache balancing bias unnecessary. Remove it.
>
> Signed-off-by: Johannes Weiner <han...@cmpxchg.org>

Acked-by: Michal Hocko <mho...@suse.com>

Michal Hocko

unread,
Jun 8, 2016, 9:00:06 AM6/8/16
to
This makes a lot of sense to me

> Subsequent
> patches will add consideration for IO caused by refaulting recently
> evicted pages. The idea is to primarily scan the LRU that thrashes the
> least, and secondarily scan the LRU that needs the least amount of
> work to free memory.
>
> Rename struct zone_reclaim_stat to struct lru_cost, and move from two
> separate value ratios for the LRU lists to a relative LRU cost metric
> with a shared denominator.

I just do not like the too generic `number'. I guess cost or price would
fit better and look better in the code as well. Up to you though...

> Then make everything that affects the cost go through a new
> lru_note_cost() function.

Just curious, have you tried to measure just the effect of this change
without the rest of the series? I do not expect it would show large
differences because we are not doing SCAN_FRACT most of the time...

> Signed-off-by: Johannes Weiner <han...@cmpxchg.org>

Acked-by: Michal Hocko <mho...@suse.com>

Thanks!

Michal Hocko

unread,
Jun 8, 2016, 10:00:06 AM6/8/16
to
The approach seems sensible to me. The additional page flag is far from
nice to say the least. Maybe we can override some existing one which
doesn't make any other sense for LRU pages. E.g. PG_slab, although we
might have some explicit VM_BUG_ONs etc., so this could get tricky.

I have to think about this more.

Johannes Weiner

unread,
Jun 8, 2016, 12:00:14 PM6/8/16
to
The way swappiness works never actually changed.

The only thing that changed is that we used to look at referenced
pages (recent_rotated) and *assumed* they would likely cause IO when
reclaimed, whereas with my patches we actually know whether they are.
But swappiness has always been about relative IO cost of the LRUs.

Swappiness defines relative IO cost between file and swap on a scale
from 0 to 200, where 100 is the point of equality. The scale factors
are calculated in get_scan_count() like this:

anon_prio = swappiness
file_prio = 200 - swappiness

and those are applied to the recorded cost/value ratios like this:

ap = anon_prio * scanned / rotated
fp = file_prio * scanned / rotated

That means if your swap device is 10 times faster than your filesystem
device, and you thus want anon to receive 10x the refaults when the
anon and file pages are used equally, you do this:

x + 10x = 200
x = 18 (ish)

So your file priority is ~18 and your swap priority is the remainder
of the range, 200 - 18. You set swappiness to 182.

Now fill in the numbers while assuming all pages on both lists have
been referenced before and will likely refault (or in the new model,
all pages are refaulting):

fraction[anon] = ap = 182 * 1 / 1 = 182
fraction[file] = fp = 18 * 1 / 1 = 18
denominator = ap + fp = 182 + 18 = 200

and then calculate the scan target like this:

scan[type] = (lru_size() >> priority) * fraction[type] / denominator

This will scan and reclaim 9% of the file pages and 91% of the anon
pages. On refault, 9% of the IO will be from the filesystem and 91%
from the swap device.
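
The same arithmetic as a small stand-alone user-space program, for
anyone who wants to plug in their own numbers. This is plain C, not
kernel code, and the LRU size and priority below are made-up inputs:

#include <stdio.h>

int main(void)
{
	unsigned long swappiness = 182;		/* anon:file IO cost ~10:1 */
	unsigned long anon_prio = swappiness;
	unsigned long file_prio = 200 - swappiness;

	/* assume all pages on both lists are refaulting: scanned == rotated */
	unsigned long ap = anon_prio * 1 / 1;
	unsigned long fp = file_prio * 1 / 1;
	unsigned long denominator = ap + fp;	/* 200 */

	unsigned long lru_size = 1UL << 20;	/* made-up list size */
	unsigned int priority = 12;		/* DEF_PRIORITY */

	unsigned long scan_anon = (lru_size >> priority) * ap / denominator;
	unsigned long scan_file = (lru_size >> priority) * fp / denominator;

	printf("anon: scan %lu pages (%lu%%)\n", scan_anon, ap * 100 / denominator);
	printf("file: scan %lu pages (%lu%%)\n", scan_file, fp * 100 / denominator);
	return 0;
}

With swappiness 182 this prints a 91%/9% split, matching the example.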

Johannes Weiner

unread,
Jun 8, 2016, 12:10:06 PM6/8/16
to
True, it's a little mysterious. I'll document it.

> Other than that,
>
> Acked-by: Minchan Kim <min...@kernel.org>

Thanks!

Johannes Weiner

unread,
Jun 8, 2016, 12:10:07 PM6/8/16
to
You're right, it doesn't make sense to keep that around anymore. I'll
remove it.

Thanks!

Johannes Weiner

unread,
Jun 8, 2016, 12:20:08 PM6/8/16
to
On Wed, Jun 08, 2016 at 02:51:37PM +0200, Michal Hocko wrote:
> On Mon 06-06-16 15:48:33, Johannes Weiner wrote:
> > Rename struct zone_reclaim_stat to struct lru_cost, and move from two
> > separate value ratios for the LRU lists to a relative LRU cost metric
> > with a shared denominator.
>
> I just do not like the too generic `number'. I guess cost or price would
> fit better and look better in the code as well. Up to you though...

Yeah, I picked it as a pair, numerator and denominator. But as Minchan
points out, denom is superfluous in the final version of the patch, so
I'm going to remove it and give the numerators better names.

anon_cost and file_cost?

> > Then make everything that affects the cost go through a new
> > lru_note_cost() function.
>
> Just curious, have you tried to measure just the effect of this change
> without the rest of the series? I do not expect it would show large
> differences because we are not doing SCAN_FRACT most of the time...

Yes, we default to use-once cache and do fractional scanning when that
runs out and we have to go after workingset, which might potentially
cause refault IO. So you need a workload that has little streaming IO.

I haven't tested this patch in isolation, but it shouldn't make much
of a difference, since we continue to balance based on the same input.

Minchan Kim

unread,
Jun 8, 2016, 9:10:06 PM6/8/16
to
Thanks for the detailed example. Then, let's change the example a little bit.

A system has big HDD storage and SSD swap.

HDD: 200 IOPS
SSD: 100000 IOPS
From https://en.wikipedia.org/wiki/IOPS

So, the speed gap is 500x:
x + 500x = 200
If we use a PCIe SSD, the gap will be even larger.
That's why I said 200 is not enough to represent the speed gap.
Or is such a system configuration already nonsense, so that it is okay to
ignore such use cases?
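
To spell out the arithmetic: the most lopsided finite setting the
existing 0-200 scale allows is swappiness = 199, i.e.

	anon_prio = swappiness       = 199
	file_prio = 200 - swappiness =   1

which expresses a relative IO cost of roughly 199:1, well short of the
500:1 gap above.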

Michal Hocko

unread,
Jun 9, 2016, 8:20:06 AM6/9/16
to
On Wed 08-06-16 12:16:05, Johannes Weiner wrote:
> On Wed, Jun 08, 2016 at 02:51:37PM +0200, Michal Hocko wrote:
> > On Mon 06-06-16 15:48:33, Johannes Weiner wrote:
> > > Rename struct zone_reclaim_stat to struct lru_cost, and move from two
> > > separate value ratios for the LRU lists to a relative LRU cost metric
> > > with a shared denominator.
> >
> > I just do not like the too generic `number'. I guess cost or price would
> > fit better and look better in the code as well. Up to you though...
>
> Yeah, I picked it as a pair, numerator and denominator. But as Minchan
> points out, denom is superfluous in the final version of the patch, so
> I'm going to remove it and give the numerators better names.
>
> anon_cost and file_cost?

Yes, that is much more descriptive and easier to grep for. I didn't
propose that because I thought you would want to preserve the array
definition to keep the code that updates them simpler.

Johannes Weiner

unread,
Jun 9, 2016, 9:40:07 AM6/9/16
to
It'll be slightly more verbose, but that's probably a good thing.
Especially for readability in get_scan_count().

Johannes Weiner

unread,
Jun 9, 2016, 9:40:17 AM6/9/16
to
On Thu, Jun 09, 2016 at 10:01:07AM +0900, Minchan Kim wrote:
> A system has big HDD storage and SSD swap.
>
> HDD: 200 IOPS
> SSD: 100000 IOPS
> From https://en.wikipedia.org/wiki/IOPS
>
> So, the speed gap is 500x:
> x + 500x = 200
> If we use a PCIe SSD, the gap will be even larger.
> That's why I said 200 is not enough to represent the speed gap.

Ah, I see what you're saying.

Yeah, that's unfortunately a limitation in the current ABI. Extending
the range to previously unavailable settings is doable; changing the
meaning of existing values is not. We'd have to add another interface.

> Or is such a system configuration already nonsense, so that it is okay to
> ignore such use cases?

I'm not sure we have to be proactive about it, but we can always add a
more fine-grained knob to override swappiness when somebody wants to
use such a setup in practice.

Minchan Kim

unread,
Jun 9, 2016, 10:20:05 PM6/9/16
to
Hi Hannes,
I think PG_workingset might be a good flag in the future; core MM might
utilize it to optimize something, so I hope it is supported on 32-bit, too.

An old use case for PG_workingset was cleancache. A few years ago,
Dan tried to cache only activated pages from the page cache in cleancache,
IIRC. Also, many systems using zram (i.e., fast swap) are still 32-bit
architectures.

Just an idea: we might be able to move a less important flag (i.e., one only
enabled in a specific configuration, for example PG_hwpoison or PG_uncached)
on 32-bit to page_ext to avoid allocating extra memory space, and use that
bit for PG_workingset. :)

Another concern about PG_workingset is naming. For file-backed pages, it's
good, because file-backed pages start from the inactive list's head and are
promoted to the active LRU after two touches, so they are likely workingset.
However, an anonymous page starts on the active list, so every anonymous page
has PG_workingset, while mlocked pages never get a chance to have it.
It wouldn't matter from a reclaim POV, but if we were to use PG_workingset as
an indicator to identify real workingset pages, it might be confusing.
Maybe we could mark mlocked pages as workingset unconditionally.
When I see this, the thought that pops up is how we handle PG_workingset
when splitting/collapsing THP, and I can't find any logic for that. :(
Every anonymous page is PG_workingset by birth, so did you ignore it
intentionally?
> +
> lruvec->balance.numer[file] += nr_pages;
> lruvec->balance.denom += nr_pages;

So, lru_cost_type is binary: COST_IO and COST_CPU. 'bool' would be enough
to represent it if you don't have further plans to expand it.
But if you did it to make it readable, I'm not against that. Just trivial.

> }
> @@ -262,6 +282,7 @@ static void __activate_page(struct page *page, struct lruvec *lruvec,
> int lru = page_lru_base_type(page);
>
> del_page_from_lru_list(page, lruvec, lru);
> + SetPageWorkingset(page);
> SetPageActive(page);
> lru += LRU_ACTIVE;
> add_page_to_lru_list(page, lruvec, lru);
> @@ -821,13 +842,28 @@ void lru_add_page_tail(struct page *page, struct page *page_tail,
> static void __pagevec_lru_add_fn(struct page *page, struct lruvec *lruvec,
How about putting the explanation you gave to Tim in here?

"
There are no shadow entries for anonymous evictions, only page cache
evictions. All swap-ins are treated as "eligible" refaults and push back
against cache, whereas cache only pushes against anon if the cache
workingset is determined to fit into memory.
That implies a fixed hierarchy where the VM always tries to fit the
anonymous workingset into memory first and the page cache second.
If the anonymous set is bigger than memory, the algorithm won't stop
counting IO cost from anonymous refaults and pressuring page cache.
"
Or put it in workingset.c. I see you wrote up a little bit about
anonymous refaults in there, but I think adding the above paragraph would
be very helpful.

Johannes Weiner

unread,
Jun 13, 2016, 12:00:12 PM6/13/16
to
On Fri, Jun 10, 2016 at 11:19:35AM +0900, Minchan Kim wrote:
> On Mon, Jun 06, 2016 at 03:48:36PM -0400, Johannes Weiner wrote:
> > @@ -79,6 +79,7 @@ enum pageflags {
> > PG_dirty,
> > PG_lru,
> > PG_active,
> > + PG_workingset,
>
> I think PG_workingset might be a good flag in the future; core MM might
> utilize it to optimize something, so I hope it is supported on 32-bit, too.
>
> An old use case for PG_workingset was cleancache. A few years ago,
> Dan tried to cache only activated pages from the page cache in cleancache,
> IIRC. Also, many systems using zram (i.e., fast swap) are still 32-bit
> architectures.
>
> Just an idea: we might be able to move a less important flag (i.e., one only
> enabled in a specific configuration, for example PG_hwpoison or PG_uncached)
> on 32-bit to page_ext to avoid allocating extra memory space, and use that
> bit for PG_workingset. :)

Yeah, I do think it should be a core flag. We have the space for it.

> Another concern about PG_workingset is naming. For file-backed pages, it's
> good, because file-backed pages start from the inactive list's head and are
> promoted to the active LRU after two touches, so they are likely workingset.
> However, an anonymous page starts on the active list, so every anonymous page
> has PG_workingset, while mlocked pages never get a chance to have it.
> It wouldn't matter from a reclaim POV, but if we were to use PG_workingset as
> an indicator to identify real workingset pages, it might be confusing.
> Maybe we could mark mlocked pages as workingset unconditionally.

Hm I'm not sure it matters. Technically we don't have to set it on
anon, but since it's otherwise unused anyway, it's nice to set it to
reinforce the notion that anon is currently always workingset.

> > @@ -544,6 +544,8 @@ void migrate_page_copy(struct page *newpage, struct page *page)
> > SetPageActive(newpage);
> > } else if (TestClearPageUnevictable(page))
> > SetPageUnevictable(newpage);
> > + if (PageWorkingset(page))
> > + SetPageWorkingset(newpage);
>
> When I see this, the thought that pops up is how we handle PG_workingset
> when splitting/collapsing THP, and I can't find any logic for that. :(
> Every anonymous page is PG_workingset by birth, so did you ignore it
> intentionally?

Good catch. __split_huge_page_tail() should copy it over, will fix that.
Yeah, it's meant for readability. "true" and "false" make for fairly
cryptic arguments when they are a static property of the callsite:

lru_note_cost(lruvec, false, page_is_file_cache(page), hpage_nr_pages(page))

???

So I'd rather name these things and leave bool for things that are
based on predicate functions.
Agreed, that would probably be helpful. I'll put that in.

Thanks Minchan!

Minchan Kim

unread,
Jun 14, 2016, 10:30:11 PM6/14/16
to
When I read your description at first, I thought the flag for anon pages
is set only on swapin, but now I feel you want to set it for all
anonymous pages, yet it has several holes like mlocked pages, shmem pages
and THP, and you want to fix it in the THP case only.
Hm, what's the rule?
It's not consistent, and that's confusing to me. :(

I think it would be better for the PageWorkingset function to return
true when PG_swapbacked is set, if we want to consider all pages on the
anonymous LRU as PG_workingset; that is clearer and less error-prone, IMHO.

Another question:

Do we want to retain [1]?

This patch is motivated by swap IO potentially being much faster
than file IO, so wouldn't it be natural to rely on refault feedback
rather than forcing the eviction of file cache?

[1] e9868505987a, mm,vmscan: only evict file pages when we have plenty?

Johannes Weiner

unread,
Jun 16, 2016, 11:20:09 AM6/16/16
to
On Wed, Jun 15, 2016 at 11:23:41AM +0900, Minchan Kim wrote:
> On Mon, Jun 13, 2016 at 11:52:31AM -0400, Johannes Weiner wrote:
> > On Fri, Jun 10, 2016 at 11:19:35AM +0900, Minchan Kim wrote:
> > > Another concern about PG_workingset is naming. For file-backed pages, it's
> > > good, because file-backed pages start from the inactive list's head and are
> > > promoted to the active LRU after two touches, so they are likely workingset.
> > > However, an anonymous page starts on the active list, so every anonymous page
> > > has PG_workingset, while mlocked pages never get a chance to have it.
> > > It wouldn't matter from a reclaim POV, but if we were to use PG_workingset as
> > > an indicator to identify real workingset pages, it might be confusing.
> > > Maybe we could mark mlocked pages as workingset unconditionally.
> >
> > Hm I'm not sure it matters. Technically we don't have to set it on
> > anon, but since it's otherwise unused anyway, it's nice to set it to
> > reinforce the notion that anon is currently always workingset.
>
> When I read your description at first, I thought the flag for anon pages
> is set only on swapin, but now I feel you want to set it for all
> anonymous pages, yet it has several holes like mlocked pages, shmem pages
> and THP, and you want to fix it in the THP case only.
> Hm, what's the rule?
> It's not consistent, and that's confusing to me. :(

I think you might be overthinking this a bit ;)

The current LRU code has a notion of workingset pages, which are anon
pages and multi-referenced file pages. shmem is considered file for
this purpose. That's why anon starts out active and file/shmem pages do
not. This patch adds refaulting pages to the mix.

PG_workingset keeps track of pages that were recently workingset, so
we set it when the page enters the workingset (activations and
refaults, and new anon from the start). The only thing we need out of
this flag is to tell us whether reclaim is going after the workingset
because the LRUs have become too small to hold it.

mlocked pages are not really interesting because not only are they not
evictable, they are entirely exempt from aging. Without aging, we can
not say whether they are workingset or not. We'll just leave the flags
alone, like the active flag right now.
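
Roughly, the rules above boil down to something like this illustrative
helper - the function and its 'refaulting' argument are made up for the
sake of the summary, not the literal call sites in the series:

/*
 * Illustrative summary of when PG_workingset is meant to be set, per
 * the description above; not the literal patch code.
 */
static inline void note_workingset_entry(struct page *page, bool refaulting)
{
	if (PageAnon(page))
		/* anon is treated as workingset from the start */
		SetPageWorkingset(page);
	else if (refaulting || PageActive(page))
		/* cache becomes workingset on refault or activation */
		SetPageWorkingset(page);
	/* mlocked pages are exempt from aging; their flag is left alone */
}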

> I think it would be better for the PageWorkingset function to return
> true when PG_swapbacked is set, if we want to consider all pages on the
> anonymous LRU as PG_workingset; that is clearer and less error-prone, IMHO.

I'm not sure I see the upside, it would be more branches and code.

> Another question:
>
> Do we want to retain [1]?
>
> This patch is motivated by swap IO potentially being much faster
> than file IO, so wouldn't it be natural to rely on refault feedback
> rather than forcing the eviction of file cache?
>
> [1] e9868505987a, mm,vmscan: only evict file pages when we have plenty?

Yes! We don't want to go after the workingset, whether it be cache or
anonymous, while there is single-use page cache lying around that we
can reclaim for free, with no IO and little risk of future IO. Anon
memory doesn't have this equivalent. Only cache is lazy-reclaimed.

Once the cache refaults, we activate it to reflect the fact that it's
workingset. Only when we run out of single-use cache do we want to
reclaim multi-use pages, and *then* we balance workingsets based on
cost of refetching each side from secondary storage.

Minchan Kim

unread,
Jun 17, 2016, 3:50:06 AM6/17/16
to
Understood.

The divergence comes from here. It seems you designed the page flag only for
the aging/balancing logic to work well, while I was thinking of leveraging
the flag to identify the real workingset. I mean, an anonymous page would be
cold if it holds just cold data for the application, data that would be
swapped out after a short time and never swapped in until the process exits.
However, we start it on the active list, so it has PG_workingset even though
it's a cold page.

Yes, we cannot use the flag for such a purpose in this SEQ replacement, so
I will not insist on it.

>
> mlocked pages are not really interesting because not only are they not
> evictable, they are entirely exempt from aging. Without aging, we can
> not say whether they are workingset or not. We'll just leave the flags
> alone, like the active flag right now.
>
> > I think it would be better for the PageWorkingset function to return
> > true when PG_swapbacked is set, if we want to consider all pages on the
> > anonymous LRU as PG_workingset; that is clearer and less error-prone, IMHO.
>
> I'm not sure I see the upside, it would be more branches and code.
>
> > Another question:
> >
> > Do we want to retain [1]?
> >
> > This patch is motivated by swap IO potentially being much faster
> > than file IO, so wouldn't it be natural to rely on refault feedback
> > rather than forcing the eviction of file cache?
> >
> > [1] e9868505987a, mm,vmscan: only evict file pages when we have plenty?
>
> Yes! We don't want to go after the workingset, whether it be cache or
> anonymous, while there is single-use page cache lying around that we
> can reclaim for free, with no IO and little risk of future IO. Anon
> memory doesn't have this equivalent. Only cache is lazy-reclaimed.
>
> Once the cache refaults, we activate it to reflect the fact that it's
> workingset. Only when we run out of single-use cache do we want to
> reclaim multi-use pages, and *then* we balance workingsets based on
> cost of refetching each side from secondary storage.

If the pages on the inactive file LRU are really single-use page cache, I agree.

However, how can the logic work like that?
If reclaimed file pages were part of the workingset (i.e., refaults happen),
we apply the pressure to the anonymous LRU, but get_scan_count still forces
reclaiming the file LRU until the inactive file LRU size is low enough.

With that, couldn't too much of the file workingset be evicted even
though anon swap is cheaper on fast swap storage?

IOW, the refault mechanism works once the inactive file LRU size is small
enough, but a small inactive file LRU doesn't guarantee it has only
multiple-use pages. Hm, isn't that a problem?

Johannes Weiner

unread,
Jun 17, 2016, 1:10:04 PM6/17/16
to
Well, I'm designing the flag so that it's useful for the case I am
introducing it for :)

I have no problem with changing its semantics later on if you want to
build on top of it, rename it, anything - so far as the LRU balancing
is unaffected of course.

But I don't think it makes sense to provision it for potential future
cases that may or may not materialize.

> > > Do we want to retain [1]?
> > >
> > > This patch is motivated by swap IO potentially being much faster
> > > than file IO, so wouldn't it be natural to rely on refault feedback
> > > rather than forcing the eviction of file cache?
> > >
> > > [1] e9868505987a, mm,vmscan: only evict file pages when we have plenty?
> >
> > Yes! We don't want to go after the workingset, whether it be cache or
> > anonymous, while there is single-use page cache lying around that we
> > can reclaim for free, with no IO and little risk of future IO. Anon
> > memory doesn't have this equivalent. Only cache is lazy-reclaimed.
> >
> > Once the cache refaults, we activate it to reflect the fact that it's
> > workingset. Only when we run out of single-use cache do we want to
> > reclaim multi-use pages, and *then* we balance workingsets based on
> > cost of refetching each side from secondary storage.
>
> If the pages on the inactive file LRU are really single-use page cache, I agree.
>
> However, how can the logic work like that?
> If reclaimed file pages were part of the workingset (i.e., refaults happen),
> we apply the pressure to the anonymous LRU, but get_scan_count still forces
> reclaiming the file LRU until the inactive file LRU size is low enough.
>
> With that, couldn't too much of the file workingset be evicted even
> though anon swap is cheaper on fast swap storage?
>
> IOW, the refault mechanism works once the inactive file LRU size is small
> enough, but a small inactive file LRU doesn't guarantee it has only
> multiple-use pages. Hm, isn't that a problem?

It's a trade-off between the cost of detecting a new workingset from a
stream of use-once pages, and the cost use-once pages impose on the
established workingset.

That's a pretty easy choice, if you ask me. I'd rather ask cache pages
to prove they are multi-use than have use-once pages put pressure on
the workingset.

Sure, a spike like you describe is certainly possible, where a good
portion of the inactive file pages will be re-used in the near future,
yet we evict all of them in a burst of memory pressure when we should
have swapped. That's a worst case scenario for the use-once policy in
a workingset transition.

However, that's much better than use-once pages, which cost no
additional IO to reclaim and do not benefit from being cached at all,
causing the workingset to be trashed or swapped out.

In your scenario, the real multi-use pages will quickly refault and
get activated and the algorithm will adapt to the new circumstances.

Minchan Kim

unread,
Jun 20, 2016, 4:20:05 AM6/20/16
to
I admit I strayed quite far from the topic. Sorry, Johannes. :)

The reason, I guess, is the naming of the flag. When you introduced the flag,
I had a vague idea to utilize it in the future if it represented the real
workingset, but as I reviewed the code, I realized it's not what I want, just
a way to detect pages that were activated before being reclaimed. So to me,
it looks like PG_activated rather than PG_workingset. ;-)
Makes sense.

>
> Sure, a spike like you describe is certainly possible, where a good
> portion of the inactive file pages will be re-used in the near future,
> yet we evict all of them in a burst of memory pressure when we should
> have swapped. That's a worst case scenario for the use-once policy in
> a workingset transition.

So, the point is how frequently such a case happens. A scenario I can
think of: if we use one cgroup per app, many file pages would sit on the
inactive LRU while the active LRU is almost empty until reclaim kicks in.
Normally, parallel reclaim work while launching a new app makes the app's
startup time really slow; that's why mobile platforms use notifiers to get
free memory in advance via killing/reclaiming. Anyway, once we have that
amount of free memory and launch a new app in a new cgroup, pages will stay
on the LRU list they were born on (i.e., anon: active, file: inactive)
without aging.

Then, the activity manager can set memory.high of a less important app
cgroup to reclaim it with a high swappiness value, because the swap device
is much faster on that system and there are far more anonymous pages than
file-backed pages. Surely, the activity manager will expect lots of
anonymous pages to be swapped out, but contrary to that expectation, he
will easily see such a spike, with file-backed pages being reclaimed a lot
and refaulting until the inactive file LRU is small enough.

I think that's a quite possible scenario on a small system with one cgroup
per app.

>
> However, that's much better than use-once pages, which cost no
> additional IO to reclaim and do not benefit from being cached at all,
> causing the workingset to be trashed or swapped out.

I agree removing e9868505987a entirely is dangerous, but I think
we need something to prevent such a spike. Checking sc->priority might
be helpful. Anyway, I think it's worth discussing.

diff --git a/mm/vmscan.c b/mm/vmscan.c
index bbfae9a92819..5d5e8e634a06 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2043,6 +2043,7 @@ static void get_scan_count(struct lruvec *lruvec, struct mem_cgroup *memcg,
 	 * system is under heavy pressure.
 	 */
 	if (!inactive_list_is_low(lruvec, true) &&
+	    sc->priority >= DEF_PRIORITY - 2 &&
 	    lruvec_lru_size(lruvec, LRU_INACTIVE_FILE) >> sc->priority) {
 		scan_balance = SCAN_FILE;
 		goto out;

Johannes Weiner

unread,
Jun 22, 2016, 6:00:05 PM6/22/16
to
That's the workingset transition I was talking about. The algorithm is
designed to settle towards stable memory patterns. We can't possibly
remove one of the key components of this - the use-once policy - to
speed up a few seconds of workingset transition when it comes at the
risk of potentially thrashing the workingset for *hours*.

The fact that swap IO can be faster than filesystem IO doesn't change
this at all. The point is that the reclaim and refetch IO cost of
use-once cache is ZERO. Causing swap IO to make room for more and more
unused cache pages doesn't make any sense, no matter the swap speed.

I really don't see the relevance of this discussion to this patch set.

Minchan Kim

unread,
Jun 24, 2016, 2:30:05 AM6/24/16
to
I agree with your overall point about reclaiming use-once pages first, and
as I said in the previous mail, I didn't want to remove e9868505987a
entirely.

My concern was that unconditionally scanning only the file LRU until the
inactive list is low enough, by a magic ratio (3:1 or 1:1), is too heuristic
a way to reclaim use-once pages first, so it could evict too many file-backed
pages that are not use-once.

Also, let's think about MADV_FREEd pages on the anonymous LRU list.
They might be more attractive candidates for reclaim.
Userspace already paid for the madvise syscall to express that preference,
but the VM unconditionally keeps them until the inactive file LRU is small
enough, under the assumption that we should sweep use-once file pages
first, and that the unfortunate reclaim of multi-use pages is the trade-off
for detecting workingset transitions, so the user should take care of it
even though he wanted to prefer anonymous reclaim via vm_swappiness.

I don't think that makes sense. vm_swappiness is a user preference
knob. The user can know his system's workload better than the kernel does.
For example, a user might accept degrading overall system performance by
swapping out more anonymous memory, but want to keep file pages around to
reduce the latency spike of accessing those file pages when some event
happens suddenly. But the kernel ignores that until the inactive LRU is
small enough.

An idea in my mind is as follows (a rough sketch follows below).
You nicely abstracted the cost model in this patchset, so if the scanning
cost of one LRU gets much higher than the paging-in/out cost (e.g., 32 * 2 *
SWAP_CLUSTER_MAX) of the other LRU, we could break out of the unconditional
scanning and turn to the other LRU to prove whether it is valuable
workingset, temporarily. And repeat that cycle rather than sweeping only
the inactive file LRU.
I think it could mitigate the workload-transition spike while handling
cold/freeable pages in the anonymous LRU list fairly.
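
Something along these lines, as a purely illustrative sketch - neither
the scan_feedback bookkeeping nor the threshold below exist in the
series, they are just made up to show the shape of the idea:

/* Hypothetical: per-lruvec bookkeeping for this sketch only */
struct scan_feedback {
	unsigned long scanned_since_refault[2];	/* [0] anon, [1] file */
};

/* Arbitrary cap on how much scan work one list may absorb */
#define SCAN_COST_LIMIT	(32 * 2 * SWAP_CLUSTER_MAX)

/*
 * Return true when one list has been scanned far beyond the assumed
 * paging cost without refault evidence on the other side, suggesting
 * we should temporarily probe the other list instead of sweeping
 * this one unconditionally.
 */
static bool scan_cost_excessive(struct scan_feedback *fb, int file)
{
	return fb->scanned_since_refault[file] > SCAN_COST_LIMIT;
}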

>
> I really don't see the relevance of this discussion to this patch set.

Hm, yes, the thing I was concerned about is *not* newly introduced by your
patch; it has been there for a long time. But your patchset's goal is to
stop the balancing code from mostly favoring page cache, and to exploit the
potential of fast swap devices, as you described in the cover letter.
However, e9868505987a might be one point of conflict with that approach.
That was why I raised the issue.

If you think it's a separate issue, I don't want to get your nice work
stuck or waste your time. It could be revisited afterward.

Thanks.