
[PATCH v4 0/3] idle memory tracking


Vladimir Davydov

May 7, 2015, 10:10:09 AM5/7/15
to Andrew Morton, Minchan Kim, Johannes Weiner, Michal Hocko, Greg Thelen, Michel Lespinasse, David Rientjes, Pavel Emelyanov, Cyrill Gorcunov, Jonathan Corbet, linu...@vger.kernel.org, linu...@vger.kernel.org, linu...@kvack.org, cgr...@vger.kernel.org, linux-...@vger.kernel.org
Hi,

This patch set introduces a new user API for tracking user memory pages
that have not been used for a given period of time. Its purpose is to
provide userspace with a means of tracking a workload's working set,
i.e. the set of pages actively used by the workload. Knowing the
working set size can be useful for partitioning the system more
efficiently, e.g. by tuning memory cgroup limits appropriately, or for
job placement within a compute cluster.

---- USE CASES ----

The unified cgroup hierarchy has memory.low and memory.high knobs, which
are defined as the low and high boundaries for the workload working set
size. However, the working set size of a workload may be unknown or
change in time. With this patch set, one can periodically estimate the
amount of memory unused by each cgroup and tune its memory.low and
memory.high parameters accordingly, thereby optimizing the overall
memory utilization.

Another use case is balancing workloads within a compute cluster.
Knowing how much memory is not really used by a workload unit may help
take a more optimal decision when considering migrating the unit to
another node within the cluster.

Also, as noted by Minchan, this would be useful for per-process reclaim
(https://lwn.net/Articles/545668/). With idle tracking, a smart
userspace memory manager could choose to reclaim only idle pages.

---- USER API ----

The user API consists of two new proc files:

* /proc/kpageidle. This file implements a bitmap where each bit corresponds
to a page, indexed by PFN. When the bit is set, the corresponding page is
idle. A page is considered idle if it has not been accessed since it was
marked idle. To mark a page idle one should set the bit corresponding to the
page by writing to the file. A value written to the file is OR-ed with the
current bitmap value. Only user memory pages can be marked idle; for other
page types, input is silently ignored. Writing to this file beyond max PFN
results in the ENXIO error. Only available when CONFIG_IDLE_PAGE_TRACKING is
set.

This file can be used to estimate the number of pages that are not
used by a particular workload as follows:

1. mark all pages of interest idle by setting corresponding bits in the
/proc/kpageidle bitmap
2. wait until the workload accesses its working set
3. read /proc/kpageidle and count the number of bits set
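For illustration, the bitmap layout and the bit counting in step 3 can be sketched in a few lines of Python (the helper names are ours, not part of the patch set; only the file format, an array of little-endian u64 words with one bit per PFN, comes from the description above):

```python
import struct

def kpageidle_offset(pfn):
    # Byte offset of the u64 word holding this PFN's bit, plus the bit
    # index inside that word (64 PFNs per 8-byte word).
    return (pfn // 64) * 8, pfn % 64

def count_idle_pages(buf):
    # Step 3: count the set bits in a buffer read from /proc/kpageidle.
    nr_idle = 0
    for i in range(0, len(buf), 8):
        word = struct.unpack("<Q", buf[i:i + 8])[0]
        while word:
            word &= word - 1  # clear the lowest set bit
            nr_idle += 1
    return nr_idle
```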

* /proc/kpagecgroup. This file contains a 64-bit inode number of the
memory cgroup each page is charged to, indexed by PFN. Only available when
CONFIG_MEMCG is set.

This file can be used to find all pages (including unmapped file
pages) accounted to a particular cgroup. Using /proc/kpageidle, one
can then estimate the cgroup working set size.
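To sketch how the two files combine (a hypothetical helper, assuming the idle bits and cgroup inos have already been decoded into per-PFN sequences):

```python
def idle_pages_per_cgroup(idle_flags, cgroup_inos):
    # idle_flags: one bool per PFN, decoded from /proc/kpageidle;
    # cgroup_inos: one inode number per PFN, from /proc/kpagecgroup.
    nr_idle = {}
    for idle, ino in zip(idle_flags, cgroup_inos):
        if idle:
            nr_idle[ino] = nr_idle.get(ino, 0) + 1
    return nr_idle
```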

For an example of using these files to estimate the number of idle
memory pages in each memory cgroup, please see the script attached
below.

---- REASONING ----

The reason to introduce the new user API instead of using
/proc/PID/{clear_refs,smaps} is that the latter has two serious
drawbacks:

- it does not count unmapped file pages
- it affects the reclaimer logic

The new API attempts to overcome them both. For more details on how it
is achieved, please see the comment to patch 3.

---- CHANGE LOG ----

Changes in v4:

This iteration primarily addresses Minchan's comments to v3:

- Implement /proc/kpageidle as a bitmap instead of using a u64 per page,
because there do not seem to be any future uses for the other 63 bits.
- Do not double-increase pra->referenced in page_referenced_one() if the page
was young and referenced recently.
- Remove the pointless (page_count == 0) check from kpageidle_get_page().
- Rename kpageidle_clear_refs() to kpageidle_clear_pte_refs().
- Improve comments to kpageidle-related functions.
- Rebase on top of 4.1-rc2.

Note that it does not address Minchan's concern about a possible
__page_set_anon_rmap vs page_referenced race (see
https://lkml.org/lkml/2015/5/3/220), since it is still unclear if this
race can really happen (see https://lkml.org/lkml/2015/5/4/160).

Changes in v3:

- Enable CONFIG_IDLE_PAGE_TRACKING for 32 bit. Since this feature
requires two extra page flags and there is no space for them on 32
bit, page ext is used (thanks to Minchan Kim).
- Minor code cleanups and comment improvements.
- Rebase on top of 4.1-rc1.

Changes in v2:

- The main difference from v1 is the API change. In v1 the user can
only set the idle flag for all pages at once, and for clearing the
Idle flag on pages accessed via page tables /proc/PID/clear_refs
should be used.
The main drawback of the v1 approach, as noted by Minchan, is that on
big machines setting the idle flag for every page can result in CPU
bursts, which would be especially frustrating if the user only wanted
to estimate the number of idle pages for a particular process or VMA.
With the new API a more fine-grained approach is possible: one can
read a process's /proc/PID/pagemap and set/check the Idle flag only
for those pages of the process's address space he or she is
interested in.
Another good point about the v2 API is that it is possible to limit
/proc/kpage* scanning rate when the user wants to estimate the total
number of idle pages, which is unachievable with the v1 approach.
- Make /proc/kpagecgroup return the ino of the closest online ancestor
in case the cgroup a page is charged to is offline.
- Fix /proc/PID/clear_refs not clearing Young page flag.
- Rebase on top of v4.0-rc6-mmotm-2015-04-01-14-54

v3: https://lkml.org/lkml/2015/4/28/224
v2: https://lkml.org/lkml/2015/4/7/260
v1: https://lkml.org/lkml/2015/3/18/794

---- PATCH SET STRUCTURE ----

The patch set is organized as follows:

- patch 1 adds page_cgroup_ino() helper for the sake of
/proc/kpagecgroup
- patch 2 adds /proc/kpagecgroup, which reports cgroup ino each page is
charged to
- patch 3 implements the idle page tracking feature, including the
userspace API, /proc/kpageidle

---- SIMILAR WORKS ----

Originally, the patch for tracking idle memory was proposed back in 2011
by Michel Lespinasse (see http://lwn.net/Articles/459269/). The main
difference between Michel's patch and this one is that Michel
implemented a kernel-space daemon for estimating idle memory size per
cgroup, while this patch only provides userspace with a minimal API
for doing the job, leaving the rest up to userspace. However, they
both share the same idea of Idle/Young page flags to avoid affecting the
reclaimer logic.

---- SCRIPT FOR COUNTING IDLE PAGES PER CGROUP ----
#! /usr/bin/python
#

import os
import stat
import errno
import struct

CGROUP_MOUNT = "/sys/fs/cgroup/memory"


def set_idle():
    f = open("/proc/kpageidle", "wb")
    while True:
        try:
            f.write(struct.pack("Q", pow(2, 64) - 1))
        except IOError as err:
            if err.errno == errno.ENXIO:
                break
            raise
    f.close()


def count_idle():
    f_idle = open("/proc/kpageidle", "rb")
    f_flags = open("/proc/kpageflags", "rb")
    f_cgroup = open("/proc/kpagecgroup", "rb")

    pfn = 0
    nr_idle = {}
    while True:
        if not pfn % 64:
            s = f_idle.read(8)
            if not s:
                break
            idle_bitmap = struct.unpack('Q', s)[0]

        idle = idle_bitmap & 1
        idle_bitmap >>= 1
        pfn += 1

        flags = struct.unpack('Q', f_flags.read(8))[0]
        cgino = struct.unpack('Q', f_cgroup.read(8))[0]

        unevictable = flags >> 18 & 1
        huge = flags >> 22 & 1

        if idle and not unevictable:
            nr_idle[cgino] = nr_idle.get(cgino, 0) + (512 if huge else 1)

    f_flags.close()
    f_cgroup.close()
    f_idle.close()
    return nr_idle


print "Setting the idle flag for each page..."
set_idle()

raw_input("Wait until the workload accesses its working set, then press Enter")

print "Counting idle pages..."
nr_idle = count_idle()

for dir, subdirs, files in os.walk(CGROUP_MOUNT):
    ino = os.stat(dir)[stat.ST_INO]
    print dir + ": " + str(nr_idle.get(ino, 0) * 4) + " KB"
---- END SCRIPT ----

Comments are more than welcome.

Thanks,

Vladimir Davydov (3):
memcg: add page_cgroup_ino helper
proc: add kpagecgroup file
proc: add kpageidle file

Documentation/vm/pagemap.txt | 16 ++-
fs/proc/Kconfig | 5 +-
fs/proc/page.c | 224 ++++++++++++++++++++++++++++++++++++++++++
fs/proc/task_mmu.c | 4 +-
include/linux/memcontrol.h | 8 +-
include/linux/mm.h | 88 +++++++++++++++++
include/linux/page-flags.h | 9 ++
include/linux/page_ext.h | 4 +
mm/Kconfig | 12 +++
mm/debug.c | 4 +
mm/hwpoison-inject.c | 5 +-
mm/memcontrol.c | 73 +++++++-------
mm/memory-failure.c | 16 +--
mm/page_ext.c | 3 +
mm/rmap.c | 8 ++
mm/swap.c | 2 +
16 files changed, 417 insertions(+), 64 deletions(-)

--
1.7.10.4

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majo...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

Vladimir Davydov

May 7, 2015, 10:10:13 AM5/7/15
to Andrew Morton, Minchan Kim, Johannes Weiner, Michal Hocko, Greg Thelen, Michel Lespinasse, David Rientjes, Pavel Emelyanov, Cyrill Gorcunov, Jonathan Corbet, linu...@vger.kernel.org, linu...@vger.kernel.org, linu...@kvack.org, cgr...@vger.kernel.org, linux-...@vger.kernel.org
Hwpoison allows filtering pages by memory cgroup inode number. To
achieve that, it calls try_get_mem_cgroup_from_page(), then
mem_cgroup_css(), and finally cgroup_ino() on the cgroup returned. This
looks bulky. Since the next patch will also need the ino of the memory
cgroup a page is charged to, this patch introduces the
page_cgroup_ino() helper.

Note that page_cgroup_ino() only considers pages that are charged to
mem_cgroup->memory (i.e. page->mem_cgroup != NULL), and for others it
returns 0, while try_get_mem_cgroup_from_page(), used by hwpoison
before, may extract the cgroup from a swapcache readahead page too.
Ignoring swapcache readahead pages allows calling page_cgroup_ino() on
unlocked pages, which is nice. Hwpoison users will hardly see any
difference.

Another difference between try_get_mem_cgroup_from_page() and
page_cgroup_ino() is that the latter works on pages charged to offline
memory cgroups, returning the inode number of the closest online
ancestor in this case, while the former does not. This behavior is
crucial for the next patch.
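The closest-online-ancestor behavior can be modeled in userspace as follows (a toy Python sketch; MemcgNode and its fields are illustrative stand-ins, not kernel structures):

```python
class MemcgNode(object):
    # Minimal stand-in for a memory cgroup: just what the walk needs.
    def __init__(self, ino, online, parent=None):
        self.ino = ino
        self.online = online
        self.parent = parent

def closest_online_ino(memcg):
    # Mirrors the loop in page_cgroup_ino(): walk up until an online
    # ancestor is found; an uncharged page (memcg is None) yields 0.
    while memcg is not None and not memcg.online:
        memcg = memcg.parent
    return memcg.ino if memcg is not None else 0
```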

Since try_get_mem_cgroup_from_page() is no longer used by anyone else,
this patch removes it. Also, it makes the hwpoison memcg filter depend
on CONFIG_MEMCG instead of CONFIG_MEMCG_SWAP (I have no idea why it was
made dependent on CONFIG_MEMCG_SWAP initially).

Signed-off-by: Vladimir Davydov <vdav...@parallels.com>
---
include/linux/memcontrol.h | 8 ++---
mm/hwpoison-inject.c | 5 +--
mm/memcontrol.c | 73 ++++++++++++++++++++++----------------------
mm/memory-failure.c | 16 ++--------
4 files changed, 42 insertions(+), 60 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 72dff5fb0d0c..9262a8407af7 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -91,7 +91,6 @@ bool mem_cgroup_is_descendant(struct mem_cgroup *memcg,
struct mem_cgroup *root);
bool task_in_mem_cgroup(struct task_struct *task, struct mem_cgroup *memcg);

-extern struct mem_cgroup *try_get_mem_cgroup_from_page(struct page *page);
extern struct mem_cgroup *mem_cgroup_from_task(struct task_struct *p);

extern struct mem_cgroup *parent_mem_cgroup(struct mem_cgroup *memcg);
@@ -192,6 +191,8 @@ static inline void mem_cgroup_count_vm_event(struct mm_struct *mm,
void mem_cgroup_split_huge_fixup(struct page *head);
#endif

+unsigned long page_cgroup_ino(struct page *page);
+
#else /* CONFIG_MEMCG */
struct mem_cgroup;

@@ -252,11 +253,6 @@ static inline struct lruvec *mem_cgroup_page_lruvec(struct page *page,
return &zone->lruvec;
}

-static inline struct mem_cgroup *try_get_mem_cgroup_from_page(struct page *page)
-{
-	return NULL;
-}
-
static inline bool mm_match_cgroup(struct mm_struct *mm,
struct mem_cgroup *memcg)
{
diff --git a/mm/hwpoison-inject.c b/mm/hwpoison-inject.c
index 329caf56df22..df63c3133d70 100644
--- a/mm/hwpoison-inject.c
+++ b/mm/hwpoison-inject.c
@@ -45,12 +45,9 @@ static int hwpoison_inject(void *data, u64 val)
 	/*
 	 * do a racy check with elevated page count, to make sure PG_hwpoison
 	 * will only be set for the targeted owner (or on a free page).
-	 * We temporarily take page lock for try_get_mem_cgroup_from_page().
 	 * memory_failure() will redo the check reliably inside page lock.
 	 */
-	lock_page(hpage);
 	err = hwpoison_filter(hpage);
-	unlock_page(hpage);
 	if (err)
 		return 0;

@@ -123,7 +120,7 @@ static int pfn_inject_init(void)
if (!dentry)
goto fail;

-#ifdef CONFIG_MEMCG_SWAP
+#ifdef CONFIG_MEMCG
 	dentry = debugfs_create_u64("corrupt-filter-memcg", 0600,
 				    hwpoison_dir, &hwpoison_filter_memcg);
 	if (!dentry)
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 14c2f2017e37..87c7f852d45b 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2349,40 +2349,6 @@ static void cancel_charge(struct mem_cgroup *memcg, unsigned int nr_pages)
css_put_many(&memcg->css, nr_pages);
}

-/*
- * try_get_mem_cgroup_from_page - look up page's memcg association
- * @page: the page
- *
- * Look up, get a css reference, and return the memcg that owns @page.
- *
- * The page must be locked to prevent racing with swap-in and page
- * cache charges. If coming from an unlocked page table, the caller
- * must ensure the page is on the LRU or this can race with charging.
- */
-struct mem_cgroup *try_get_mem_cgroup_from_page(struct page *page)
-{
-	struct mem_cgroup *memcg;
-	unsigned short id;
-	swp_entry_t ent;
-
-	VM_BUG_ON_PAGE(!PageLocked(page), page);
-
-	memcg = page->mem_cgroup;
-	if (memcg) {
-		if (!css_tryget_online(&memcg->css))
-			memcg = NULL;
-	} else if (PageSwapCache(page)) {
-		ent.val = page_private(page);
-		id = lookup_swap_cgroup_id(ent);
-		rcu_read_lock();
-		memcg = mem_cgroup_from_id(id);
-		if (memcg && !css_tryget_online(&memcg->css))
-			memcg = NULL;
-		rcu_read_unlock();
-	}
-	return memcg;
-}
-
static void lock_page_lru(struct page *page, int *isolated)
{
struct zone *zone = page_zone(page);
@@ -2774,6 +2740,31 @@ void mem_cgroup_split_huge_fixup(struct page *head)
}
#endif /* CONFIG_TRANSPARENT_HUGEPAGE */

+/**
+ * page_cgroup_ino - return inode number of page's memcg
+ * @page: the page
+ *
+ * Look up the closest online ancestor of the memory cgroup @page is charged to
+ * and return its inode number. It is safe to call this function without taking
+ * a reference to the page.
+ */
+unsigned long page_cgroup_ino(struct page *page)
+{
+	struct mem_cgroup *memcg;
+	unsigned long ino = 0;
+
+	rcu_read_lock();
+	memcg = READ_ONCE(page->mem_cgroup);
+	while (memcg && !css_tryget_online(&memcg->css))
+		memcg = parent_mem_cgroup(memcg);
+	rcu_read_unlock();
+	if (memcg) {
+		ino = cgroup_ino(memcg->css.cgroup);
+		css_put(&memcg->css);
+	}
+	return ino;
+}
+
#ifdef CONFIG_MEMCG_SWAP
static void mem_cgroup_swap_statistics(struct mem_cgroup *memcg,
bool charge)
@@ -5482,8 +5473,18 @@ int mem_cgroup_try_charge(struct page *page, struct mm_struct *mm,
 		VM_BUG_ON_PAGE(!PageTransHuge(page), page);
 	}
 
-	if (do_swap_account && PageSwapCache(page))
-		memcg = try_get_mem_cgroup_from_page(page);
+	if (do_swap_account && PageSwapCache(page)) {
+		swp_entry_t ent = { .val = page_private(page), };
+		unsigned short id = lookup_swap_cgroup_id(ent);
+
+		VM_BUG_ON_PAGE(!PageLocked(page), page);
+
+		rcu_read_lock();
+		memcg = mem_cgroup_from_id(id);
+		if (memcg && !css_tryget_online(&memcg->css))
+			memcg = NULL;
+		rcu_read_unlock();
+	}
 	if (!memcg)
 		memcg = get_mem_cgroup_from_mm(mm);

diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index d9359b770cd9..64cd565fd4f8 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -128,27 +128,15 @@ static int hwpoison_filter_flags(struct page *p)
  * can only guarantee that the page either belongs to the memcg tasks, or is
  * a freed page.
  */
-#ifdef CONFIG_MEMCG_SWAP
+#ifdef CONFIG_MEMCG
u64 hwpoison_filter_memcg;
EXPORT_SYMBOL_GPL(hwpoison_filter_memcg);
 static int hwpoison_filter_task(struct page *p)
 {
-	struct mem_cgroup *mem;
-	struct cgroup_subsys_state *css;
-	unsigned long ino;
-
 	if (!hwpoison_filter_memcg)
 		return 0;
 
-	mem = try_get_mem_cgroup_from_page(p);
-	if (!mem)
-		return -EINVAL;
-
-	css = mem_cgroup_css(mem);
-	ino = cgroup_ino(css->cgroup);
-	css_put(css);
-
-	if (ino != hwpoison_filter_memcg)
+	if (page_cgroup_ino(p) != hwpoison_filter_memcg)
 		return -EINVAL;
 
 	return 0;

Vladimir Davydov

May 7, 2015, 10:10:24 AM5/7/15
to Andrew Morton, Minchan Kim, Johannes Weiner, Michal Hocko, Greg Thelen, Michel Lespinasse, David Rientjes, Pavel Emelyanov, Cyrill Gorcunov, Jonathan Corbet, linu...@vger.kernel.org, linu...@vger.kernel.org, linu...@kvack.org, cgr...@vger.kernel.org, linux-...@vger.kernel.org
Knowing the portion of memory that is not used by a certain application
or memory cgroup (idle memory) can be useful for partitioning the system
efficiently, e.g. by setting memory cgroup limits appropriately.
Currently, the only means the kernel provides for estimating the amount
of idle memory is /proc/PID/{clear_refs,smaps}: the user can clear the
access bit for all pages mapped by a particular process by writing 1 to
clear_refs, wait for some time, and then count smaps:Referenced.
However, this method has two serious shortcomings:

- it does not count unmapped file pages
- it affects the reclaimer logic

To overcome these drawbacks, this patch introduces two new page flags,
Idle and Young, and a new proc file, /proc/kpageidle. A page's Idle flag
can only be set from userspace, by setting the bit in /proc/kpageidle at the
offset corresponding to the page, and it is cleared whenever the page is
accessed either through page tables (it is cleared in page_referenced()
in this case) or using the read(2) system call (mark_page_accessed()).
Thus by setting the Idle flag for pages of a particular workload, which
can be found e.g. by reading /proc/PID/pagemap, waiting for some time to
let the workload access its working set, and then reading the kpageidle
file, one can estimate the number of pages that are not used by the
workload.

The Young page flag is used to avoid interference with the memory
reclaimer. A page's Young flag is set whenever the Access bit of a page
table entry pointing to the page is cleared by writing to kpageidle. If
page_referenced() is called on a Young page, it will add 1 to its return
value, therefore concealing the fact that the Access bit was cleared.
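The interplay of the two flags can be sketched as a toy Python model (the function and field names are illustrative; the real accounting lives in page_referenced_one()):

```python
def referenced_delta(page, pte_young):
    # page: dict carrying the Idle and Young flags; pte_young: whether
    # the hardware Access bit in a pte mapping the page was found set.
    referenced = 1 if pte_young else 0
    if referenced and page['idle']:
        page['idle'] = False  # a real access clears the Idle flag
    if page['young']:
        page['young'] = False
        referenced += 1       # conceal the Access bit cleared via kpageidle
    return referenced
```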

Note, since there is no room for extra page flags on 32 bit, this
feature uses extended page flags when compiled on 32 bit.

Signed-off-by: Vladimir Davydov <vdav...@parallels.com>
---
Documentation/vm/pagemap.txt | 12 ++-
fs/proc/page.c | 171 ++++++++++++++++++++++++++++++++++++++++++
fs/proc/task_mmu.c | 4 +-
include/linux/mm.h | 88 ++++++++++++++++++++++
include/linux/page-flags.h | 9 +++
include/linux/page_ext.h | 4 +
mm/Kconfig | 12 +++
mm/debug.c | 4 +
mm/page_ext.c | 3 +
mm/rmap.c | 8 ++
mm/swap.c | 2 +
11 files changed, 315 insertions(+), 2 deletions(-)

diff --git a/Documentation/vm/pagemap.txt b/Documentation/vm/pagemap.txt
index a9b7afc8fbc6..c9266340852c 100644
--- a/Documentation/vm/pagemap.txt
+++ b/Documentation/vm/pagemap.txt
@@ -5,7 +5,7 @@ pagemap is a new (as of 2.6.25) set of interfaces in the kernel that allow
userspace programs to examine the page tables and related information by
reading files in /proc.

-There are four components to pagemap:
+There are five components to pagemap:

* /proc/pid/pagemap. This file lets a userspace process find out which
physical frame each virtual page is mapped to. It contains one 64-bit
@@ -69,6 +69,16 @@ There are four components to pagemap:
memory cgroup each page is charged to, indexed by PFN. Only available when
CONFIG_MEMCG is set.

+ * /proc/kpageidle. This file implements a bitmap where each bit corresponds
+   to a page, indexed by PFN. When the bit is set, the corresponding page is
+   idle. A page is considered idle if it has not been accessed since it was
+   marked idle. To mark a page idle one should set the bit corresponding to the
+   page by writing to the file. A value written to the file is OR-ed with the
+   current bitmap value. Only user memory pages can be marked idle, for other
+   page types input is silently ignored. Writing to this file beyond max PFN
+   results in the ENXIO error. Only available when CONFIG_IDLE_PAGE_TRACKING is
+   set.
+
Short descriptions to the page flags:

0. LOCKED
diff --git a/fs/proc/page.c b/fs/proc/page.c
index 70d23245dd43..e5f30539ca2d 100644
--- a/fs/proc/page.c
+++ b/fs/proc/page.c
@@ -275,6 +275,173 @@ static const struct file_operations proc_kpagecgroup_operations = {
};
#endif /* CONFIG_MEMCG */

+#ifdef CONFIG_IDLE_PAGE_TRACKING
+/*
+ * Idle page tracking only considers user memory pages, for other types of
+ * pages the idle flag is always unset and an attempt to set it is silently
+ * ignored.
+ *
+ * We treat a page as a user memory page if it has the LRU flag set, because it
+ * is always safe to pass such a page to page_referenced(), which is essential
+ * for idle page tracking. With such an indicator of user pages we can skip
+ * isolated pages, but since there are not usually many of them, it should not
+ * affect the overall result.
+ *
+ * This function tries to get a user memory page by pfn as described above.
+ */
+static struct page *kpageidle_get_page(unsigned long pfn)
+{
+	struct page *page;
+
+	if (!pfn_valid(pfn))
+		return NULL;
+	page = pfn_to_page(pfn);
+	if (!page || !PageLRU(page))
+		return NULL;
+	if (!get_page_unless_zero(page))
+		return NULL;
+	if (unlikely(!PageLRU(page))) {
+		put_page(page);
+		return NULL;
+	}
+	return page;
+}
+
+/*
+ * This function calls page_referenced() to clear the referenced bit for all
+ * mappings to a page. Since the latter also clears the page idle flag if the
+ * page was referenced, it can be used to update the idle flag of a page.
+ */
+static void kpageidle_clear_pte_refs(struct page *page)
+{
+	unsigned long dummy;
+
+	if (page_referenced(page, 0, NULL, &dummy))
+		/*
+		 * We cleared the referenced bit in a mapping to this page. To
+		 * avoid interference with the reclaimer, mark it young so that
+		 * the next call to page_referenced() will also return > 0 (see
+		 * page_referenced_one())
+		 */
+		set_page_young(page);
+}
+
+static ssize_t kpageidle_read(struct file *file, char __user *buf,
+			      size_t count, loff_t *ppos)
+{
+	u64 __user *out = (u64 __user *)buf;
+	struct page *page;
+	unsigned long pfn, end_pfn;
+	ssize_t ret = 0;
+	u64 idle_bitmap = 0;
+	int bit;
+
+	if (*ppos & KPMMASK || count & KPMMASK)
+		return -EINVAL;
+
+	pfn = *ppos * 8;
+	if (pfn >= max_pfn)
+		return 0;
+
+	count = min_t(unsigned long, count,
+		      DIV_ROUND_UP(max_pfn - pfn, KPMSIZE * 8));
+	end_pfn = pfn + count * 8;
+
+	for (; pfn < end_pfn; pfn++) {
+		bit = pfn % 64;
+		page = kpageidle_get_page(pfn);
+		if (page) {
+			if (page_is_idle(page)) {
+				/*
+				 * The page might have been referenced via a
+				 * pte, in which case it is not idle. Clear
+				 * refs and recheck.
+				 */
+				kpageidle_clear_pte_refs(page);
+				if (page_is_idle(page))
+					idle_bitmap |= 1ULL << bit;
+			}
+			put_page(page);
+		}
+		if (bit == 63) {
+			if (put_user(idle_bitmap, out)) {
+				ret = -EFAULT;
+				break;
+			}
+			idle_bitmap = 0;
+			out++;
+		}
+	}
+
+	*ppos += (char __user *)out - buf;
+	if (!ret)
+		ret = (char __user *)out - buf;
+	return ret;
+}
+
+static ssize_t kpageidle_write(struct file *file, const char __user *buf,
+			       size_t count, loff_t *ppos)
+{
+	const u64 __user *in = (const u64 __user *)buf;
+	struct page *page;
+	unsigned long pfn, end_pfn;
+	ssize_t ret = 0;
+	u64 idle_bitmap = 0;
+	int bit;
+
+	if (*ppos & KPMMASK || count & KPMMASK)
+		return -EINVAL;
+
+	pfn = *ppos * 8;
+	if (pfn >= max_pfn)
+		return -ENXIO;
+
+	count = min_t(unsigned long, count,
+		      DIV_ROUND_UP(max_pfn - pfn, KPMSIZE * 8));
+	end_pfn = pfn + count * 8;
+
+	for (; pfn < end_pfn; pfn++) {
+		bit = pfn % 64;
+		if (bit == 0) {
+			if (get_user(idle_bitmap, in)) {
+				ret = -EFAULT;
+				break;
+			}
+			in++;
+		}
+		if (idle_bitmap >> bit & 1) {
+			page = kpageidle_get_page(pfn);
+			if (page) {
+				kpageidle_clear_pte_refs(page);
+				set_page_idle(page);
+				put_page(page);
+			}
+		}
+	}
+
+	*ppos += (const char __user *)in - buf;
+	if (!ret)
+		ret = (const char __user *)in - buf;
+	return ret;
+}
+
+static const struct file_operations proc_kpageidle_operations = {
+	.llseek = mem_lseek,
+	.read = kpageidle_read,
+	.write = kpageidle_write,
+};
+
+#ifndef CONFIG_64BIT
+static bool need_page_idle(void)
+{
+	return true;
+}
+struct page_ext_operations page_idle_ops = {
+	.need = need_page_idle,
+};
+#endif
+#endif /* CONFIG_IDLE_PAGE_TRACKING */
+
static int __init proc_page_init(void)
{
proc_create("kpagecount", S_IRUSR, NULL, &proc_kpagecount_operations);
@@ -282,6 +449,10 @@ static int __init proc_page_init(void)
#ifdef CONFIG_MEMCG
proc_create("kpagecgroup", S_IRUSR, NULL, &proc_kpagecgroup_operations);
#endif
+#ifdef CONFIG_IDLE_PAGE_TRACKING
+	proc_create("kpageidle", S_IRUSR | S_IWUSR, NULL,
+		    &proc_kpageidle_operations);
+#endif
return 0;
}
fs_initcall(proc_page_init);
diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 6dee68d013ff..ab04846f7dd5 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -458,7 +458,7 @@ static void smaps_account(struct mem_size_stats *mss, struct page *page,

 	mss->resident += size;
 	/* Accumulate the size in pages that have been accessed. */
-	if (young || PageReferenced(page))
+	if (young || page_is_young(page) || PageReferenced(page))
 		mss->referenced += size;
 	mapcount = page_mapcount(page);
 	if (mapcount >= 2) {
@@ -808,6 +808,7 @@ static int clear_refs_pte_range(pmd_t *pmd, unsigned long addr,

 		/* Clear accessed and referenced bits. */
 		pmdp_test_and_clear_young(vma, addr, pmd);
+		clear_page_young(page);
 		ClearPageReferenced(page);
 out:
 		spin_unlock(ptl);
@@ -835,6 +836,7 @@ out:

 		/* Clear accessed and referenced bits. */
 		ptep_test_and_clear_young(vma, addr, pte);
+		clear_page_young(page);
 		ClearPageReferenced(page);
 	}
 	pte_unmap_unlock(pte - 1, ptl);
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 0755b9fd03a7..794d29aa2317 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2200,5 +2200,93 @@ void __init setup_nr_node_ids(void);
static inline void setup_nr_node_ids(void) {}
#endif

+#ifdef CONFIG_IDLE_PAGE_TRACKING
+#ifdef CONFIG_64BIT
+static inline bool page_is_young(struct page *page)
+{
+	return PageYoung(page);
+}
+
+static inline void set_page_young(struct page *page)
+{
+	SetPageYoung(page);
+}
+
+static inline void clear_page_young(struct page *page)
+{
+	ClearPageYoung(page);
+}
+
+static inline bool page_is_idle(struct page *page)
+{
+	return PageIdle(page);
+}
+
+static inline void set_page_idle(struct page *page)
+{
+	SetPageIdle(page);
+}
+
+static inline void clear_page_idle(struct page *page)
+{
+	ClearPageIdle(page);
+}
+#else /* !CONFIG_64BIT */
+/*
+ * If there is not enough space to store Idle and Young bits in page flags, use
+ * page ext flags instead.
+ */
+extern struct page_ext_operations page_idle_ops;
+
+static inline bool page_is_young(struct page *page)
+{
+	return test_bit(PAGE_EXT_YOUNG, &lookup_page_ext(page)->flags);
+}
+
+static inline void set_page_young(struct page *page)
+{
+	set_bit(PAGE_EXT_YOUNG, &lookup_page_ext(page)->flags);
+}
+
+static inline void clear_page_young(struct page *page)
+{
+	clear_bit(PAGE_EXT_YOUNG, &lookup_page_ext(page)->flags);
+}
+
+static inline bool page_is_idle(struct page *page)
+{
+	return test_bit(PAGE_EXT_IDLE, &lookup_page_ext(page)->flags);
+}
+
+static inline void set_page_idle(struct page *page)
+{
+	set_bit(PAGE_EXT_IDLE, &lookup_page_ext(page)->flags);
+}
+
+static inline void clear_page_idle(struct page *page)
+{
+	clear_bit(PAGE_EXT_IDLE, &lookup_page_ext(page)->flags);
+}
+#endif /* CONFIG_64BIT */
+#else /* !CONFIG_IDLE_PAGE_TRACKING */
+static inline bool page_is_young(struct page *page)
+{
+	return false;
+}
+
+static inline void clear_page_young(struct page *page)
+{
+}
+
+static inline bool page_is_idle(struct page *page)
+{
+	return false;
+}
+
+static inline void clear_page_idle(struct page *page)
+{
+}
+#endif /* CONFIG_IDLE_PAGE_TRACKING */
+
#endif /* __KERNEL__ */
#endif /* _LINUX_MM_H */
diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index f34e040b34e9..5e7c4f50a644 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -109,6 +109,10 @@ enum pageflags {
#ifdef CONFIG_TRANSPARENT_HUGEPAGE
PG_compound_lock,
#endif
+#if defined(CONFIG_IDLE_PAGE_TRACKING) && defined(CONFIG_64BIT)
+	PG_young,
+	PG_idle,
+#endif
__NR_PAGEFLAGS,

/* Filesystems */
@@ -289,6 +293,11 @@ PAGEFLAG_FALSE(HWPoison)
#define __PG_HWPOISON 0
#endif

+#if defined(CONFIG_IDLE_PAGE_TRACKING) && defined(CONFIG_64BIT)
+PAGEFLAG(Young, young)
+PAGEFLAG(Idle, idle)
+#endif
+
/*
* On an anonymous page mapped into a user virtual memory area,
* page->mapping points to its anon_vma, not to a struct address_space;
diff --git a/include/linux/page_ext.h b/include/linux/page_ext.h
index c42981cd99aa..17f118a82854 100644
--- a/include/linux/page_ext.h
+++ b/include/linux/page_ext.h
@@ -26,6 +26,10 @@ enum page_ext_flags {
PAGE_EXT_DEBUG_POISON, /* Page is poisoned */
PAGE_EXT_DEBUG_GUARD,
PAGE_EXT_OWNER,
+#if defined(CONFIG_IDLE_PAGE_TRACKING) && !defined(CONFIG_64BIT)
+	PAGE_EXT_YOUNG,
+	PAGE_EXT_IDLE,
+#endif
};

/*
diff --git a/mm/Kconfig b/mm/Kconfig
index 390214da4546..3600eace4774 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -635,3 +635,15 @@ config MAX_STACK_SIZE_MB
changed to a smaller value in which case that is used.

A sane initial value is 80 MB.
+
+config IDLE_PAGE_TRACKING
+	bool "Enable idle page tracking"
+	select PROC_PAGE_MONITOR
+	select PAGE_EXTENSION if !64BIT
+	help
+	  This feature allows to estimate the amount of user pages that have
+	  not been touched during a given period of time. This information can
+	  be useful to tune memory cgroup limits and/or for job placement
+	  within a compute cluster.
+
+	  See Documentation/vm/pagemap.txt for more details.
diff --git a/mm/debug.c b/mm/debug.c
index 3eb3ac2fcee7..bb66f9ccec03 100644
--- a/mm/debug.c
+++ b/mm/debug.c
@@ -48,6 +48,10 @@ static const struct trace_print_flags pageflag_names[] = {
#ifdef CONFIG_TRANSPARENT_HUGEPAGE
{1UL << PG_compound_lock, "compound_lock" },
#endif
+#if defined(CONFIG_IDLE_PAGE_TRACKING) && defined(CONFIG_64BIT)
+	{1UL << PG_young,	"young"		},
+	{1UL << PG_idle,	"idle"		},
+#endif
};

static void dump_flags(unsigned long flags,
diff --git a/mm/page_ext.c b/mm/page_ext.c
index d86fd2f5353f..e4b3af054bf2 100644
--- a/mm/page_ext.c
+++ b/mm/page_ext.c
@@ -59,6 +59,9 @@ static struct page_ext_operations *page_ext_ops[] = {
#ifdef CONFIG_PAGE_OWNER
&page_owner_ops,
#endif
+#if defined(CONFIG_IDLE_PAGE_TRACKING) && !defined(CONFIG_64BIT)
+	&page_idle_ops,
+#endif
};

static unsigned long total_usage;
diff --git a/mm/rmap.c b/mm/rmap.c
index 24dd3f9fee27..eca7416f55d7 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -781,6 +781,14 @@ static int page_referenced_one(struct page *page, struct vm_area_struct *vma,
 		pte_unmap_unlock(pte, ptl);
 	}
 
+	if (referenced && page_is_idle(page))
+		clear_page_idle(page);
+
+	if (page_is_young(page)) {
+		clear_page_young(page);
+		referenced++;
+	}
+
 	if (referenced) {
 		pra->referenced++;
 		pra->vm_flags |= vma->vm_flags;
diff --git a/mm/swap.c b/mm/swap.c
index a7251a8ed532..6bf6f293a9ea 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -623,6 +623,8 @@ void mark_page_accessed(struct page *page)
 	} else if (!PageReferenced(page)) {
 		SetPageReferenced(page);
 	}
+	if (page_is_idle(page))
+		clear_page_idle(page);
 }
EXPORT_SYMBOL(mark_page_accessed);

Vladimir Davydov

May 7, 2015, 10:10:52 AM5/7/15
to Andrew Morton, Minchan Kim, Johannes Weiner, Michal Hocko, Greg Thelen, Michel Lespinasse, David Rientjes, Pavel Emelyanov, Cyrill Gorcunov, Jonathan Corbet, linu...@vger.kernel.org, linu...@vger.kernel.org, linu...@kvack.org, cgr...@vger.kernel.org, linux-...@vger.kernel.org
/proc/kpagecgroup contains a 64-bit inode number of the memory cgroup
each page is charged to, indexed by PFN. Having this information is
useful for estimating a cgroup working set size.

The file is present if CONFIG_PROC_PAGE_MONITOR && CONFIG_MEMCG.

Signed-off-by: Vladimir Davydov <vdav...@parallels.com>
---
Documentation/vm/pagemap.txt | 6 ++++-
fs/proc/Kconfig | 5 ++--
fs/proc/page.c | 53 ++++++++++++++++++++++++++++++++++++++++++
3 files changed, 61 insertions(+), 3 deletions(-)

diff --git a/Documentation/vm/pagemap.txt b/Documentation/vm/pagemap.txt
index 6bfbc172cdb9..a9b7afc8fbc6 100644
--- a/Documentation/vm/pagemap.txt
+++ b/Documentation/vm/pagemap.txt
@@ -5,7 +5,7 @@ pagemap is a new (as of 2.6.25) set of interfaces in the kernel that allow
userspace programs to examine the page tables and related information by
reading files in /proc.

-There are three components to pagemap:
+There are four components to pagemap:

* /proc/pid/pagemap. This file lets a userspace process find out which
physical frame each virtual page is mapped to. It contains one 64-bit
@@ -65,6 +65,10 @@ There are three components to pagemap:
23. BALLOON
24. ZERO_PAGE

+ * /proc/kpagecgroup. This file contains a 64-bit inode number of the
+ memory cgroup each page is charged to, indexed by PFN. Only available when
+ CONFIG_MEMCG is set.
+
Short descriptions to the page flags:

0. LOCKED
diff --git a/fs/proc/Kconfig b/fs/proc/Kconfig
index 2183fcf41d59..5021a2935bb9 100644
--- a/fs/proc/Kconfig
+++ b/fs/proc/Kconfig
@@ -69,5 +69,6 @@ config PROC_PAGE_MONITOR
help
Various /proc files exist to monitor process memory utilization:
/proc/pid/smaps, /proc/pid/clear_refs, /proc/pid/pagemap,
- /proc/kpagecount, and /proc/kpageflags. Disabling these
- interfaces will reduce the size of the kernel by approximately 4kb.
+ /proc/kpagecount, /proc/kpageflags, and /proc/kpagecgroup.
+ Disabling these interfaces will reduce the size of the kernel
+ by approximately 4kb.
diff --git a/fs/proc/page.c b/fs/proc/page.c
index 7eee2d8b97d9..70d23245dd43 100644
--- a/fs/proc/page.c
+++ b/fs/proc/page.c
@@ -9,6 +9,7 @@
#include <linux/proc_fs.h>
#include <linux/seq_file.h>
#include <linux/hugetlb.h>
+#include <linux/memcontrol.h>
#include <linux/kernel-page-flags.h>
#include <asm/uaccess.h>
#include "internal.h"
@@ -225,10 +226,62 @@ static const struct file_operations proc_kpageflags_operations = {
.read = kpageflags_read,
};

+#ifdef CONFIG_MEMCG
+static ssize_t kpagecgroup_read(struct file *file, char __user *buf,
+ size_t count, loff_t *ppos)
+{
+ u64 __user *out = (u64 __user *)buf;
+ struct page *ppage;
+ unsigned long src = *ppos;
+ unsigned long pfn;
+ ssize_t ret = 0;
+ u64 ino;
+
+ pfn = src / KPMSIZE;
+ count = min_t(unsigned long, count, (max_pfn * KPMSIZE) - src);
+ if (src & KPMMASK || count & KPMMASK)
+ return -EINVAL;
+
+ while (count > 0) {
+ if (pfn_valid(pfn))
+ ppage = pfn_to_page(pfn);
+ else
+ ppage = NULL;
+
+ if (ppage)
+ ino = page_cgroup_ino(ppage);
+ else
+ ino = 0;
+
+ if (put_user(ino, out)) {
+ ret = -EFAULT;
+ break;
+ }
+
+ pfn++;
+ out++;
+ count -= KPMSIZE;
+ }
+
+ *ppos += (char __user *)out - buf;
+ if (!ret)
+ ret = (char __user *)out - buf;
+ return ret;
+}
+
+static const struct file_operations proc_kpagecgroup_operations = {
+ .llseek = mem_lseek,
+ .read = kpagecgroup_read,
+};
+#endif /* CONFIG_MEMCG */
+
static int __init proc_page_init(void)
{
proc_create("kpagecount", S_IRUSR, NULL, &proc_kpagecount_operations);
proc_create("kpageflags", S_IRUSR, NULL, &proc_kpageflags_operations);
+#ifdef CONFIG_MEMCG
+ proc_create("kpagecgroup", S_IRUSR, NULL, &proc_kpagecgroup_operations);
+#endif
return 0;
}
fs_initcall(proc_page_init);

Vladimir Davydov

May 8, 2015, 9:21:02 AM5/8/15
to Andrew Morton, Minchan Kim, Johannes Weiner, Michal Hocko, Greg Thelen, Michel Lespinasse, David Rientjes, Pavel Emelyanov, Cyrill Gorcunov, Jonathan Corbet, linu...@vger.kernel.org, linu...@vger.kernel.org, linu...@kvack.org, cgr...@vger.kernel.org, linux-...@vger.kernel.org
Oops, this patch is stale, the correct one is here:
---
From: Vladimir Davydov <vdav...@parallels.com>
Subject: [PATCH] proc: add kpageidle file
index 70d23245dd43..5c055a7eee54 100644
--- a/fs/proc/page.c
+++ b/fs/proc/page.c
@@ -16,6 +16,7 @@

#define KPMSIZE sizeof(u64)
#define KPMMASK (KPMSIZE - 1)
+#define KPMBITS (KPMSIZE * BITS_PER_BYTE)

/* /proc/kpagecount - an array exposing page counts
*
@@ -275,6 +276,173 @@ static const struct file_operations proc_kpagecgroup_operations = {
+ pfn = *ppos * BITS_PER_BYTE;
+ if (pfn >= max_pfn)
+ return 0;
+
+ end_pfn = pfn + count * BITS_PER_BYTE;
+ if (end_pfn > max_pfn)
+ end_pfn = ALIGN(max_pfn, KPMBITS);
+
+ for (; pfn < end_pfn; pfn++) {
+ bit = pfn % KPMBITS;
+ page = kpageidle_get_page(pfn);
+ if (page) {
+ if (page_is_idle(page)) {
+ /*
+ * The page might have been referenced via a
+ * pte, in which case it is not idle. Clear
+ * refs and recheck.
+ */
+ kpageidle_clear_pte_refs(page);
+ if (page_is_idle(page))
+ idle_bitmap |= 1ULL << bit;
+ }
+ put_page(page);
+ }
+ if (bit == KPMBITS - 1) {
+ pfn = *ppos * BITS_PER_BYTE;
+ if (pfn >= max_pfn)
+ return -ENXIO;
+
+ end_pfn = pfn + count * BITS_PER_BYTE;
+ if (end_pfn > max_pfn)
+ end_pfn = ALIGN(max_pfn, KPMBITS);
+
+ for (; pfn < end_pfn; pfn++) {
+ bit = pfn % KPMBITS;
@@ -282,6 +450,10 @@ static int __init proc_page_init(void)

Vladimir Davydov

May 8, 2015, 9:23:01 AM5/8/15
to Andrew Morton, Minchan Kim, Johannes Weiner, Michal Hocko, Greg Thelen, Michel Lespinasse, David Rientjes, Pavel Emelyanov, Cyrill Gorcunov, Jonathan Corbet, linu...@vger.kernel.org, linu...@vger.kernel.org, linu...@kvack.org, cgr...@vger.kernel.org, linux-...@vger.kernel.org
On Thu, May 07, 2015 at 05:09:39PM +0300, Vladimir Davydov wrote:
> ---- SCRIPT FOR COUNTING IDLE PAGES PER CGROUP ----

Oops, this script is stale. The correct one is here:
---
#! /usr/bin/python
#

import os
import stat
import errno
import struct

CGROUP_MOUNT = "/sys/fs/cgroup/memory"
BUFSIZE = 8 * 1024  # must be multiple of 8


def set_idle():
    f = open("/proc/kpageidle", "wb", BUFSIZE)
    while True:
        try:
            f.write(struct.pack("Q", pow(2, 64) - 1))
        except IOError as err:
            if err.errno == errno.ENXIO:
                break
            raise
    f.close()


def count_idle():
    f_flags = open("/proc/kpageflags", "rb", BUFSIZE)
    f_cgroup = open("/proc/kpagecgroup", "rb", BUFSIZE)
    f_idle = open("/proc/kpageidle", "rb", BUFSIZE)

    pfn = 0
    nr_idle = {}
    while True:
        s = f_flags.read(8)
        if not s:
            break

        flags, = struct.unpack('Q', s)
        cgino, = struct.unpack('Q', f_cgroup.read(8))

        bit = pfn % 64
        if not bit:
            idle_bitmap, = struct.unpack('Q', f_idle.read(8))

        idle = idle_bitmap >> bit & 1
        pfn += 1

        unevictable = flags >> 18 & 1
        huge = flags >> 22 & 1

        if idle and not unevictable:
            nr_idle[cgino] = nr_idle.get(cgino, 0) + (512 if huge else 1)

    f_flags.close()
    f_cgroup.close()
    f_idle.close()
    return nr_idle


print "Setting the idle flag for each page..."
set_idle()

raw_input("Wait until the workload accesses its working set, then press Enter")

print "Counting idle pages..."
nr_idle = count_idle()

for dir, subdirs, files in os.walk(CGROUP_MOUNT):
    ino = os.stat(dir)[stat.ST_INO]
    print dir + ": " + str(nr_idle.get(ino, 0) * 4) + " KB"

Vladimir Davydov

May 12, 2015, 9:34:44 AM5/12/15
to Andrew Morton, Minchan Kim, Johannes Weiner, Michal Hocko, Greg Thelen, Michel Lespinasse, David Rientjes, Pavel Emelyanov, Cyrill Gorcunov, Jonathan Corbet, linu...@vger.kernel.org, linu...@vger.kernel.org, linu...@kvack.org, cgr...@vger.kernel.org, linux-...@vger.kernel.org
* /proc/kpageidle. This file implements a bitmap where each bit corresponds
to a page, indexed by PFN. When the bit is set, the corresponding page is
idle. A page is considered idle if it has not been accessed since it was
marked idle. To mark a page idle one should set the bit corresponding to the
page by writing to the file. A value written to the file is OR-ed with the
current bitmap value. Only user memory pages can be marked idle; for other
page types the input is silently ignored. Writing to this file beyond the
max PFN results in an ENXIO error. Only available when
CONFIG_IDLE_PAGE_TRACKING is set.

This file can be used to estimate the amount of pages that are not
used by a particular workload as follows:

1. mark all pages of interest idle by setting corresponding bits in the
/proc/kpageidle bitmap
2. wait until the workload accesses its working set
3. read /proc/kpageidle and count the number of bits set
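
Step 3 above amounts to a population count over the bitmap. A minimal
sketch (Python 3, operating on a buffer already read from /proc/kpageidle;
the helper name is mine):

```python
import struct

def count_idle_bits(chunk):
    """Count set bits in a buffer of packed little-endian u64 words,
    as read from /proc/kpageidle (one bit per PFN)."""
    total = 0
    for (word,) in struct.iter_unpack("<Q", chunk):
        total += bin(word).count("1")
    return total

# Synthetic buffer: two words with 3 and 1 bits set, respectively.
buf = struct.pack("<QQ", 0b1011, 1 << 63)
print(count_idle_bits(buf))  # -> 4
```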

* /proc/kpagecgroup. This file contains a 64-bit inode number of the
memory cgroup each page is charged to, indexed by PFN. Only available when
CONFIG_MEMCG is set.

This file can be used to find all pages (including unmapped file
pages) accounted to a particular cgroup. Using /proc/kpageidle, one
can then estimate the cgroup working set size.
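
Joining the two files per PFN can be sketched as follows (Python 3; the
helper name and the synthetic sample buffers are mine — in practice the
two buffers come from reads of /proc/kpagecgroup and /proc/kpageidle at
matching offsets):

```python
import struct

def idle_pages_per_cgroup(cgroup_recs, idle_recs):
    # /proc/kpagecgroup: one little-endian u64 inode number per PFN;
    # /proc/kpageidle: one bit per PFN, packed into u64 words.
    inos = [i for (i,) in struct.iter_unpack("<Q", cgroup_recs)]
    words = [w for (w,) in struct.iter_unpack("<Q", idle_recs)]
    counts = {}
    for pfn, ino in enumerate(inos):
        if words[pfn // 64] >> (pfn % 64) & 1:
            counts[ino] = counts.get(ino, 0) + 1
    return counts

# Three PFNs: the first two charged to cgroup inode 100, the third to
# inode 200; PFNs 0 and 2 are idle.
cg = struct.pack("<QQQ", 100, 100, 200)
idle = struct.pack("<Q", 0b101)
print(idle_pages_per_cgroup(cg, idle))  # -> {100: 1, 200: 1}
```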

For an example of using these files for estimating the amount of unused
memory pages per each memory cgroup, please see the script attached
below.

---- REASONING ----

The reason to introduce the new user API instead of using
/proc/PID/{clear_refs,smaps} is that the latter has two serious
drawbacks:

- it does not count unmapped file pages
- it affects the reclaimer logic

The new API attempts to overcome them both. For more details on how it
is achieved, please see the comment to patch 3.

---- CHANGE LOG ----

Changes in v5:

- Fix possible race between kpageidle_clear_pte_refs() and
__page_set_anon_rmap() by checking that a page is on an LRU list
under zone->lru_lock (Minchan).
- Export idle flag via /proc/kpageflags (Minchan).
- Rebase on top of 4.1-rc3.
v4: https://lkml.org/lkml/2015/5/7/580
v3: https://lkml.org/lkml/2015/4/28/224
v2: https://lkml.org/lkml/2015/4/7/260
v1: https://lkml.org/lkml/2015/3/18/794

---- PATCH SET STRUCTURE ----

The patch set is organized as follows:

- patch 1 adds page_cgroup_ino() helper for the sake of
/proc/kpagecgroup
- patch 2 adds /proc/kpagecgroup, which reports cgroup ino each page is
charged to
- patch 3 implements the idle page tracking feature, including the
userspace API, /proc/kpageidle
- patch 4 exports idle flag via /proc/kpageflags

---- SIMILAR WORKS ----

Originally, the patch for tracking idle memory was proposed back in 2011
by Michel Lespinasse (see http://lwn.net/Articles/459269/). The main
difference between Michel's patch and this one is that Michel
implemented a kernel space daemon for estimating idle memory size per
cgroup while this patch only provides the userspace with the minimal API
for doing the job, leaving the rest up to the userspace. However, they
both share the same idea of Idle/Young page flags to avoid affecting the
reclaimer logic.

---- SCRIPT FOR COUNTING IDLE PAGES PER CGROUP ----
---- END SCRIPT ----

Comments are more than welcome.

Thanks,

Vladimir Davydov (4):
memcg: add page_cgroup_ino helper
proc: add kpagecgroup file
proc: add kpageidle file
proc: export idle flag via kpageflags

Documentation/vm/pagemap.txt | 22 ++-
fs/proc/Kconfig | 5 +-
fs/proc/page.c | 234 ++++++++++++++++++++++++++++++++
fs/proc/task_mmu.c | 4 +-
include/linux/memcontrol.h | 8 +-
include/linux/mm.h | 88 ++++++++++++
include/linux/page-flags.h | 9 ++
include/linux/page_ext.h | 4 +
include/uapi/linux/kernel-page-flags.h | 1 +
mm/Kconfig | 12 ++
mm/debug.c | 4 +
mm/hwpoison-inject.c | 5 +-
mm/memcontrol.c | 73 +++++-----
mm/memory-failure.c | 16 +--
mm/page_ext.c | 3 +
mm/rmap.c | 8 ++
mm/swap.c | 2 +
17 files changed, 434 insertions(+), 64 deletions(-)

--
1.7.10.4

Vladimir Davydov

May 12, 2015, 9:34:50 AM5/12/15
to Andrew Morton, Minchan Kim, Johannes Weiner, Michal Hocko, Greg Thelen, Michel Lespinasse, David Rientjes, Pavel Emelyanov, Cyrill Gorcunov, Jonathan Corbet, linu...@vger.kernel.org, linu...@vger.kernel.org, linu...@kvack.org, cgr...@vger.kernel.org, linux-...@vger.kernel.org
Hwpoison allows filtering pages by memory cgroup inode. To achieve that,
it calls try_get_mem_cgroup_from_page(), then mem_cgroup_css(), and
finally cgroup_ino() on the returned cgroup. This looks bulky. Since the
next patch also needs the inode number of the memory cgroup a page is
charged to, this patch introduces the page_cgroup_ino() helper.

Note that page_cgroup_ino() only considers those pages that are charged
to mem_cgroup->memory (i.e. page->mem_cgroup != NULL), and for others it
returns 0, while try_get_mem_cgroup_from_page(), used by hwpoison before,
may extract the cgroup from a swapcache readahead page too. Ignoring
swapcache readahead pages allows calling page_cgroup_ino() on unlocked
pages, which is nice. Hwpoison users will hardly see any difference.

Another difference between try_get_mem_cgroup_from_page() and
page_cgroup_ino() is that the latter works on pages charged to offline
memory cgroups, returning the inode number of the closest online
ancestor in this case, while the former does not, which is crucial for
the next patch.

Since try_get_mem_cgroup_from_page() is not used by anyone else, this
patch removes the function. Also, it makes the hwpoison memcg filter
depend on CONFIG_MEMCG instead of CONFIG_MEMCG_SWAP (I have no idea why
it was made dependent on CONFIG_MEMCG_SWAP initially).

Signed-off-by: Vladimir Davydov <vdav...@parallels.com>
---
index 4ca5fe0042e1..d2facac0b01f 100644
--- a/mm/hwpoison-inject.c
+++ b/mm/hwpoison-inject.c
@@ -45,12 +45,9 @@ static int hwpoison_inject(void *data, u64 val)
/*
* do a racy check with elevated page count, to make sure PG_hwpoison
* will only be set for the targeted owner (or on a free page).
- * We temporarily take page lock for try_get_mem_cgroup_from_page().
* memory_failure() will redo the check reliably inside page lock.
*/
- lock_page(hpage);
err = hwpoison_filter(hpage);
- unlock_page(hpage);
if (err)
goto put_out;

@@ -126,7 +123,7 @@ static int pfn_inject_init(void)
+ unsigned long ino = 0;
+
index 501820c815b3..7166ad81b222 100644

Vladimir Davydov

May 12, 2015, 9:35:02 AM5/12/15
to Andrew Morton, Minchan Kim, Johannes Weiner, Michal Hocko, Greg Thelen, Michel Lespinasse, David Rientjes, Pavel Emelyanov, Cyrill Gorcunov, Jonathan Corbet, linu...@vger.kernel.org, linu...@vger.kernel.org, linu...@kvack.org, cgr...@vger.kernel.org, linux-...@vger.kernel.org
Knowing the portion of memory that is not used by a certain application
or memory cgroup (idle memory) can be useful for partitioning the system
efficiently, e.g. by setting memory cgroup limits appropriately.
Currently, the only means to estimate the amount of idle memory provided
by the kernel is /proc/PID/{clear_refs,smaps}: the user can clear the
access bit for all pages mapped to a particular process by writing 1 to
clear_refs, wait for some time, and then count smaps:Referenced.
However, this method has two serious shortcomings:

- it does not count unmapped file pages
- it affects the reclaimer logic
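
The clear_refs/smaps procedure described above can be sketched as follows
(Python 3; the parsing helper and the sample text are mine, and the
clearing step itself is just a write to the proc file, shown as comments):

```python
import re

def referenced_kb(smaps_text):
    """Sum the Referenced: fields (in kB) of a /proc/<pid>/smaps dump."""
    return sum(int(kb) for kb in
               re.findall(r"^Referenced:\s+(\d+) kB", smaps_text, re.M))

# The procedure itself:
#   echo 1 > /proc/<pid>/clear_refs    # clear the referenced bits
#   ... wait while the workload runs ...
#   then sum Referenced: over /proc/<pid>/smaps:
sample = ("Referenced:           12 kB\n"
          "Swap:                  0 kB\n"
          "Referenced:            4 kB\n")
print(referenced_kb(sample))  # -> 16
```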

To overcome these drawbacks, this patch introduces two new page flags,
Idle and Young, and a new proc file, /proc/kpageidle. A page's Idle flag
can only be set from userspace by setting bit in /proc/kpageidle at the
offset corresponding to the page, and it is cleared whenever the page is
accessed either through page tables (it is cleared in page_referenced()
in this case) or using the read(2) system call (mark_page_accessed()).
Thus by setting the Idle flag for pages of a particular workload, which
can be found e.g. by reading /proc/PID/pagemap, waiting for some time to
let the workload access its working set, and then reading the kpageidle
file, one can estimate the amount of pages that are not used by the
workload.
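
Finding the PFNs of a workload's pages via /proc/PID/pagemap, as
suggested above, means decoding its 64-bit entries: bit 63 says the page
is present in RAM, and bits 0-54 then hold the PFN. A small sketch
(Python 3; the helper name is mine):

```python
def pagemap_pfn(entry):
    """Decode one 64-bit /proc/<pid>/pagemap entry: bit 63 means the
    page is present in RAM, bits 0-54 then hold the PFN."""
    if not entry >> 63 & 1:
        return None  # not present (may be swapped out or unmapped)
    return entry & ((1 << 55) - 1)

# A present page at PFN 0x1234:
print(hex(pagemap_pfn((1 << 63) | 0x1234)))  # -> 0x1234
```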

The Young page flag is used to avoid interference with the memory
reclaimer. A page's Young flag is set whenever the Access bit of a page
table entry pointing to the page is cleared by writing to kpageidle. If
page_referenced() is called on a Young page, it will add 1 to its return
value, therefore concealing the fact that the Access bit was cleared.

Note, since there is no room for extra page flags on 32 bit, this
feature uses extended page flags when compiled on 32 bit.

Signed-off-by: Vladimir Davydov <vdav...@parallels.com>
---
Documentation/vm/pagemap.txt | 12 ++-
fs/proc/page.c | 178 ++++++++++++++++++++++++++++++++++++++++++
fs/proc/task_mmu.c | 4 +-
include/linux/mm.h | 88 +++++++++++++++++++++
include/linux/page-flags.h | 9 +++
include/linux/page_ext.h | 4 +
mm/Kconfig | 12 +++
mm/debug.c | 4 +
mm/page_ext.c | 3 +
mm/rmap.c | 8 ++
mm/swap.c | 2 +
11 files changed, 322 insertions(+), 2 deletions(-)

diff --git a/Documentation/vm/pagemap.txt b/Documentation/vm/pagemap.txt
index a9b7afc8fbc6..c9266340852c 100644
--- a/Documentation/vm/pagemap.txt
+++ b/Documentation/vm/pagemap.txt
@@ -5,7 +5,7 @@ pagemap is a new (as of 2.6.25) set of interfaces in the kernel that allow
userspace programs to examine the page tables and related information by
reading files in /proc.

-There are four components to pagemap:
+There are five components to pagemap:

* /proc/pid/pagemap. This file lets a userspace process find out which
physical frame each virtual page is mapped to. It contains one 64-bit
@@ -69,6 +69,16 @@ There are four components to pagemap:
memory cgroup each page is charged to, indexed by PFN. Only available when
CONFIG_MEMCG is set.

+ * /proc/kpageidle. This file implements a bitmap where each bit corresponds
+ to a page, indexed by PFN. When the bit is set, the corresponding page is
+ idle. A page is considered idle if it has not been accessed since it was
+ marked idle. To mark a page idle one should set the bit corresponding to the
+ page by writing to the file. A value written to the file is OR-ed with the
+ current bitmap value. Only user memory pages can be marked idle, for other
+ page types input is silently ignored. Writing to this file beyond max PFN
+ results in the ENXIO error. Only available when CONFIG_IDLE_PAGE_TRACKING is
+ set.
+
Short descriptions to the page flags:

0. LOCKED
diff --git a/fs/proc/page.c b/fs/proc/page.c
index 70d23245dd43..f42ead08d346 100644
--- a/fs/proc/page.c
+++ b/fs/proc/page.c
@@ -16,6 +16,7 @@

#define KPMSIZE sizeof(u64)
#define KPMMASK (KPMSIZE - 1)
+#define KPMBITS (KPMSIZE * BITS_PER_BYTE)

/* /proc/kpagecount - an array exposing page counts
*
@@ -275,6 +276,179 @@ static const struct file_operations proc_kpagecgroup_operations = {
};
#endif /* CONFIG_MEMCG */

+#ifdef CONFIG_IDLE_PAGE_TRACKING
+/*
+ * Idle page tracking only considers user memory pages, for other types of
+ * pages the idle flag is always unset and an attempt to set it is silently
+ * ignored.
+ *
+ * We treat a page as a user memory page if it is on an LRU list, because it is
+ * always safe to pass such a page to page_referenced(), which is essential for
+ * idle page tracking. With such an indicator of user pages we can skip
+ * isolated pages, but since there are not usually many of them, it will hardly
+ * affect the overall result.
+ *
+ * This function tries to get a user memory page by pfn as described above.
+ */
+static struct page *kpageidle_get_page(unsigned long pfn)
+{
+ struct page *page;
+ struct zone *zone;
+
+ if (!pfn_valid(pfn))
+ return NULL;
+
+ page = pfn_to_page(pfn);
+ if (!page || !PageLRU(page))
+ return NULL;
+ if (!get_page_unless_zero(page))
+ return NULL;
+
+ zone = page_zone(page);
+ spin_lock_irq(&zone->lru_lock);
+ if (unlikely(!PageLRU(page))) {
+ put_page(page);
+ page = NULL;
+ }
+ spin_unlock_irq(&zone->lru_lock);
@@ -282,6 +456,10 @@ static int __init proc_page_init(void)
index 8b18fd4227d1..3650793eaeab 100644

Vladimir Davydov

May 12, 2015, 9:35:14 AM5/12/15
to Andrew Morton, Minchan Kim, Johannes Weiner, Michal Hocko, Greg Thelen, Michel Lespinasse, David Rientjes, Pavel Emelyanov, Cyrill Gorcunov, Jonathan Corbet, linu...@vger.kernel.org, linu...@vger.kernel.org, linu...@kvack.org, cgr...@vger.kernel.org, linux-...@vger.kernel.org
As noted by Minchan, a benefit of reading the idle flag from
/proc/kpageflags is that one can easily filter out dirty and/or
unevictable pages while estimating the size of unused memory.

Note that the idle flag read from /proc/kpageflags may be stale if the
page was accessed via a PTE, because it would be too costly to iterate
over all page mappings on each /proc/kpageflags read to provide an
up-to-date value. To make sure the flag is up to date, one has to read
/proc/kpageidle first.
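
The filtering Minchan suggests can then be done entirely on
/proc/kpageflags words, e.g. keeping pages with IDLE set and UNEVICTABLE
clear. A sketch (Python 3; the bit numbers are from
Documentation/vm/pagemap.txt plus the KPF_IDLE bit this patch adds; the
helper name is mine):

```python
KPF_UNEVICTABLE = 18
KPF_IDLE = 25  # the bit this patch adds

def idle_and_evictable(flags):
    """True if a /proc/kpageflags word has IDLE set and UNEVICTABLE
    clear. Read /proc/kpageidle first, since KPF_IDLE may be stale."""
    return bool(flags >> KPF_IDLE & 1) and not (flags >> KPF_UNEVICTABLE & 1)

print(idle_and_evictable(1 << KPF_IDLE))                         # -> True
print(idle_and_evictable(1 << KPF_IDLE | 1 << KPF_UNEVICTABLE))  # -> False
```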

Signed-off-by: Vladimir Davydov <vdav...@parallels.com>
---
Documentation/vm/pagemap.txt | 6 ++++++
fs/proc/page.c | 3 +++
include/uapi/linux/kernel-page-flags.h | 1 +
3 files changed, 10 insertions(+)

diff --git a/Documentation/vm/pagemap.txt b/Documentation/vm/pagemap.txt
index c9266340852c..5896b7d7fd74 100644
--- a/Documentation/vm/pagemap.txt
+++ b/Documentation/vm/pagemap.txt
@@ -64,6 +64,7 @@ There are five components to pagemap:
22. THP
23. BALLOON
24. ZERO_PAGE
+ 25. IDLE

* /proc/kpagecgroup. This file contains a 64-bit inode number of the
memory cgroup each page is charged to, indexed by PFN. Only available when
@@ -124,6 +125,11 @@ Short descriptions to the page flags:
24. ZERO_PAGE
zero page for pfn_zero or huge_zero page

+25. IDLE
+ page has not been accessed since it was marked idle (see /proc/kpageidle)
+ Note that this flag may be stale in case the page was accessed via a PTE.
+ To make sure the flag is up-to-date one has to read /proc/kpageidle first.
+
[IO related page flags]
1. ERROR IO error occurred
3. UPTODATE page has up-to-date data
diff --git a/fs/proc/page.c b/fs/proc/page.c
index f42ead08d346..24748be3dd65 100644
--- a/fs/proc/page.c
+++ b/fs/proc/page.c
@@ -148,6 +148,9 @@ u64 stable_page_flags(struct page *page)
if (PageBalloon(page))
u |= 1 << KPF_BALLOON;

+ if (page_is_idle(page))
+ u |= 1 << KPF_IDLE;
+
u |= kpf_copy_bit(k, KPF_LOCKED, PG_locked);

u |= kpf_copy_bit(k, KPF_SLAB, PG_slab);
diff --git a/include/uapi/linux/kernel-page-flags.h b/include/uapi/linux/kernel-page-flags.h
index a6c4962e5d46..5da5f8751ce7 100644
--- a/include/uapi/linux/kernel-page-flags.h
+++ b/include/uapi/linux/kernel-page-flags.h
@@ -33,6 +33,7 @@
#define KPF_THP 22
#define KPF_BALLOON 23
#define KPF_ZERO_PAGE 24
+#define KPF_IDLE 25


#endif /* _UAPILINUX_KERNEL_PAGE_FLAGS_H */

Vladimir Davydov

May 12, 2015, 9:36:08 AM5/12/15
to Andrew Morton, Minchan Kim, Johannes Weiner, Michal Hocko, Greg Thelen, Michel Lespinasse, David Rientjes, Pavel Emelyanov, Cyrill Gorcunov, Jonathan Corbet, linu...@vger.kernel.org, linu...@vger.kernel.org, linu...@kvack.org, cgr...@vger.kernel.org, linux-...@vger.kernel.org
/proc/kpagecgroup contains a 64-bit inode number of the memory cgroup
each page is charged to, indexed by PFN. Having this information is
useful for estimating a cgroup working set size.

The file is present if CONFIG_PROC_PAGE_MONITOR && CONFIG_MEMCG.

Signed-off-by: Vladimir Davydov <vdav...@parallels.com>
---
Documentation/vm/pagemap.txt | 6 ++++-
fs/proc/Kconfig | 5 ++--
fs/proc/page.c | 53 ++++++++++++++++++++++++++++++++++++++++++
3 files changed, 61 insertions(+), 3 deletions(-)

diff --git a/Documentation/vm/pagemap.txt b/Documentation/vm/pagemap.txt
index 6bfbc172cdb9..a9b7afc8fbc6 100644
--- a/Documentation/vm/pagemap.txt
+++ b/Documentation/vm/pagemap.txt
@@ -5,7 +5,7 @@ pagemap is a new (as of 2.6.25) set of interfaces in the kernel that allow
userspace programs to examine the page tables and related information by
reading files in /proc.

-There are three components to pagemap:
+There are four components to pagemap:

* /proc/pid/pagemap. This file lets a userspace process find out which
physical frame each virtual page is mapped to. It contains one 64-bit
@@ -65,6 +65,10 @@ There are three components to pagemap:
23. BALLOON
24. ZERO_PAGE

+ * /proc/kpagecgroup. This file contains a 64-bit inode number of the
+ memory cgroup each page is charged to, indexed by PFN. Only available when
+ CONFIG_MEMCG is set.
+
Short descriptions to the page flags:

0. LOCKED
diff --git a/fs/proc/Kconfig b/fs/proc/Kconfig
index 2183fcf41d59..5021a2935bb9 100644
--- a/fs/proc/Kconfig
+++ b/fs/proc/Kconfig
@@ -69,5 +69,6 @@ config PROC_PAGE_MONITOR
help
Various /proc files exist to monitor process memory utilization:
/proc/pid/smaps, /proc/pid/clear_refs, /proc/pid/pagemap,
- /proc/kpagecount, and /proc/kpageflags. Disabling these
- interfaces will reduce the size of the kernel by approximately 4kb.
+ /proc/kpagecount, /proc/kpageflags, and /proc/kpagecgroup.
+ Disabling these interfaces will reduce the size of the kernel
+ by approximately 4kb.
diff --git a/fs/proc/page.c b/fs/proc/page.c
index 7eee2d8b97d9..70d23245dd43 100644
--- a/fs/proc/page.c
+++ b/fs/proc/page.c
@@ -9,6 +9,7 @@
#include <linux/proc_fs.h>
#include <linux/seq_file.h>
#include <linux/hugetlb.h>
+#include <linux/memcontrol.h>
#include <linux/kernel-page-flags.h>
#include <asm/uaccess.h>
#include "internal.h"
@@ -225,10 +226,62 @@ static const struct file_operations proc_kpageflags_operations = {
.read = kpageflags_read,
};

+#ifdef CONFIG_MEMCG
+static ssize_t kpagecgroup_read(struct file *file, char __user *buf,
+ size_t count, loff_t *ppos)
+{
+ u64 __user *out = (u64 __user *)buf;
+ struct page *ppage;
+ unsigned long src = *ppos;
+ unsigned long pfn;
+ ssize_t ret = 0;
+ u64 ino;
+
+ pfn = src / KPMSIZE;
+ count = min_t(unsigned long, count, (max_pfn * KPMSIZE) - src);
+ if (src & KPMMASK || count & KPMMASK)
+ return -EINVAL;
+
+ while (count > 0) {
+ if (pfn_valid(pfn))
+ ppage = pfn_to_page(pfn);
+ else
+ ppage = NULL;
+
+ if (ppage)
+ ino = page_cgroup_ino(ppage);
+ else
+ ino = 0;
+
+ if (put_user(ino, out)) {
+ ret = -EFAULT;
+ break;
+ }
+
+ pfn++;
+ out++;
+ count -= KPMSIZE;
+ }
+
+ *ppos += (char __user *)out - buf;
+ if (!ret)
+ ret = (char __user *)out - buf;
+ return ret;
+}
+
+static const struct file_operations proc_kpagecgroup_operations = {
+ .llseek = mem_lseek,
+ .read = kpagecgroup_read,
+};
+#endif /* CONFIG_MEMCG */
+
static int __init proc_page_init(void)
{
proc_create("kpagecount", S_IRUSR, NULL, &proc_kpagecount_operations);
proc_create("kpageflags", S_IRUSR, NULL, &proc_kpageflags_operations);
+#ifdef CONFIG_MEMCG
+ proc_create("kpagecgroup", S_IRUSR, NULL, &proc_kpagecgroup_operations);
+#endif
return 0;
}
fs_initcall(proc_page_init);

Raghavendra KT

Jun 7, 2015, 2:11:34 AM6/7/15
to Vladimir Davydov, Andrew Morton, Minchan Kim, Johannes Weiner, Michal Hocko, Greg Thelen, Michel Lespinasse, David Rientjes, Pavel Emelyanov, Cyrill Gorcunov, Jonathan Corbet, linu...@vger.kernel.org, linu...@vger.kernel.org, linu...@kvack.org, cgr...@vger.kernel.org, Linux Kernel Mailing List, Raghavendra KT
On Tue, May 12, 2015 at 7:04 PM, Vladimir Davydov
<vdav...@parallels.com> wrote:
> Hi,
>
> This patch set introduces a new user API for tracking user memory pages
> that have not been used for a given period of time. The purpose of this
> is to provide the userspace with the means of tracking a workload's
> working set, i.e. the set of pages that are actively used by the
> workload. Knowing the working set size can be useful for partitioning
> the system more efficiently, e.g. by tuning memory cgroup limits
> appropriately, or for job placement within a compute cluster.
>
> ---- USE CASES ----
>
> The unified cgroup hierarchy has memory.low and memory.high knobs, which
> are defined as the low and high boundaries for the workload working set
> size. However, the working set size of a workload may be unknown or
> change in time. With this patch set, one can periodically estimate the
> amount of memory unused by each cgroup and tune their memory.low and
> memory.high parameters accordingly, therefore optimizing the overall
> memory utilization.
>

Hi Vladimir,

Thanks for the patches. I was able to test how the series helps determine
a docker container's workingset / idlemem with these patches (tested on
ppc64le after porting to a distro kernel).

Vladimir Davydov

Jun 7, 2015, 5:12:10 AM6/7/15
to Raghavendra KT, Andrew Morton, Minchan Kim, Johannes Weiner, Michal Hocko, Greg Thelen, Michel Lespinasse, David Rientjes, Pavel Emelyanov, Cyrill Gorcunov, Jonathan Corbet, linu...@vger.kernel.org, linu...@vger.kernel.org, linu...@kvack.org, cgr...@vger.kernel.org, Linux Kernel Mailing List
On Sun, Jun 07, 2015 at 11:41:15AM +0530, Raghavendra KT wrote:
> Thanks for the patches. I was able to test how the series helps determine
> a docker container's workingset / idlemem with these patches (tested on
> ppc64le after porting to a distro kernel).

Hi,

Thank you for using and testing it! I've been busy for a while with my
internal tasks, but I am almost done with them and will get back to this
patch set and resubmit it soon (during the next week hopefully).

Thanks,
Vladimir

Andrew Morton

Jun 8, 2015, 3:35:52 PM6/8/15
to Raghavendra KT, Vladimir Davydov, Minchan Kim, Johannes Weiner, Michal Hocko, Greg Thelen, Michel Lespinasse, David Rientjes, Pavel Emelyanov, Cyrill Gorcunov, Jonathan Corbet, linu...@vger.kernel.org, linu...@vger.kernel.org, linu...@kvack.org, cgr...@vger.kernel.org, Linux Kernel Mailing List
On Sun, 7 Jun 2015 11:41:15 +0530 Raghavendra KT <raghave...@linux.vnet.ibm.com> wrote:

> On Tue, May 12, 2015 at 7:04 PM, Vladimir Davydov
> <vdav...@parallels.com> wrote:
> > Hi,
> >
> > This patch set introduces a new user API for tracking user memory pages
> > that have not been used for a given period of time. The purpose of this
> > is to provide the userspace with the means of tracking a workload's
> > working set, i.e. the set of pages that are actively used by the
> > workload. Knowing the working set size can be useful for partitioning
> > the system more efficiently, e.g. by tuning memory cgroup limits
> > appropriately, or for job placement within a compute cluster.
> >
> > ---- USE CASES ----
> >
> > The unified cgroup hierarchy has memory.low and memory.high knobs, which
> > are defined as the low and high boundaries for the workload working set
> > size. However, the working set size of a workload may be unknown or
> > change in time. With this patch set, one can periodically estimate the
> > amount of memory unused by each cgroup and tune their memory.low and
> > memory.high parameters accordingly, therefore optimizing the overall
> > memory utilization.
> >
>
> Hi Vladimir,
>
> Thanks for the patches. I was able to test how the series helps determine
> a docker container's workingset / idlemem with these patches (tested on
> ppc64le after porting to a distro kernel).

And what were the results of your testing? The more details the
better, please.

Raghavendra K T

Jun 9, 2015, 4:27:06 AM6/9/15
to Andrew Morton, Vladimir Davydov, Minchan Kim, Johannes Weiner, Michal Hocko, Greg Thelen, Michel Lespinasse, David Rientjes, Pavel Emelyanov, Cyrill Gorcunov, Jonathan Corbet, linu...@vger.kernel.org, linu...@vger.kernel.org, linu...@kvack.org, cgr...@vger.kernel.org, Linux Kernel Mailing List
Hi Andrew,
This is what I had done in my experiment (theoretical):
1) created a docker container
2) ran the python script (example in the first patch) provided by Vladimir
to get idle memory in the docker container. This would further help
in analyzing what rss the docker container would ideally use, and hence
we could set the memory limit for the container and know how much we
should ideally scale without degrading the performance of other
containers.

# ~/raghu/idlemmetrack/idlememtrack.py
Setting the idle flag for each page...
Wait until the workload accesses its working set, then press Enter
Counting idle pages..
/sys/fs/cgroup/memory: 9764 KB
[...]
/sys/fs/cgroup/memory/system.slice/docker-[...].scope: 224 K
...

I understand that you probably want to know how the scaling experiment
with memory limit tuning went after that, but I have not got that data
yet.. :(

Vladimir Davydov

Jun 12, 2015, 5:52:58 AM
to Andrew Morton, Minchan Kim, Raghavendra K T, Johannes Weiner, Michal Hocko, Greg Thelen, Michel Lespinasse, David Rientjes, Pavel Emelyanov, Cyrill Gorcunov, Jonathan Corbet, linu...@vger.kernel.org, linu...@vger.kernel.org, linu...@kvack.org, cgr...@vger.kernel.org, linux-...@vger.kernel.org
Hi,

This patch set introduces a new user API for tracking user memory pages
that have not been used for a given period of time. The purpose of this
is to provide the userspace with the means of tracking a workload's
working set, i.e. the set of pages that are actively used by the
workload. Knowing the working set size can be useful for partitioning
the system more efficiently, e.g. by tuning memory cgroup limits
appropriately, or for job placement within a compute cluster.

---- USE CASES ----

The unified cgroup hierarchy has memory.low and memory.high knobs, which
are defined as the low and high boundaries for the workload working set
size. However, the working set size of a workload may be unknown or
change in time. With this patch set, one can periodically estimate the
amount of memory unused by each cgroup and tune their memory.low and
memory.high parameters accordingly, therefore optimizing the overall
memory utilization.

Another use case is balancing workloads within a compute cluster.
Knowing how much memory is not really used by a workload unit may help
take a more optimal decision when considering migrating the unit to
another node within the cluster.

Also, as noted by Minchan, this would be useful for per-process reclaim
(https://lwn.net/Articles/545668/). With idle tracking, a smart userspace
memory manager could reclaim only idle pages.

---- USER API ----

The user API consists of two new proc files:

* /proc/kpageidle. This file implements a bitmap where each bit corresponds
to a page, indexed by PFN. When the bit is set, the corresponding page is
idle. A page is considered idle if it has not been accessed since it was
marked idle. To mark a page idle one should set the bit corresponding to the
page by writing to the file. A value written to the file is OR-ed with the
current bitmap value. Only user memory pages can be marked idle, for other
page types input is silently ignored. Writing to this file beyond max PFN
results in the ENXIO error. Only available when CONFIG_IDLE_PAGE_TRACKING is
set.

This file can be used to estimate the amount of pages that are not
used by a particular workload as follows:

1. mark all pages of interest idle by setting corresponding bits in the
/proc/kpageidle bitmap
2. wait until the workload accesses its working set
3. read /proc/kpageidle and count the number of bits set
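
For illustration, the three steps above can be sketched in userspace. This is a hedged sketch, not part of the series: the helper names are made up, and it assumes only the format described above (the /proc/kpageidle bitmap is addressed in 8-byte units, each covering 64 PFNs, and written ones are OR-ed in). The two wrappers touch /proc/kpageidle and need root on a kernel with CONFIG_IDLE_PAGE_TRACKING; the pure helpers just do the bit accounting.

```python
import struct

def idle_words(nr_pages):
    """All-ones 64-bit words covering nr_pages pages. The kernel OR-s
    written values into the bitmap, so ones mark the pages idle."""
    return struct.pack("<Q", (1 << 64) - 1) * ((nr_pages + 63) // 64)

def count_set_bits(buf):
    """Number of set (still idle) bits in a chunk of /proc/kpageidle."""
    return sum(bin(word).count("1")
               for (word,) in struct.iter_unpack("<Q", buf))

def mark_idle(start_pfn, nr_pages, path="/proc/kpageidle"):
    """Step 1: mark a PFN range idle. start_pfn must be a multiple of
    64, since the file is addressed in 64-PFN words. Needs root."""
    with open(path, "r+b") as f:
        f.seek(start_pfn // 8)      # byte offset = PFN / 8
        f.write(idle_words(nr_pages))

def count_idle(start_pfn, nr_pages, path="/proc/kpageidle"):
    """Step 3: after the workload has run, count bits still set."""
    with open(path, "rb") as f:
        f.seek(start_pfn // 8)
        return count_set_bits(f.read(len(idle_words(nr_pages))))
```

Step 2 is simply sleeping (or waiting on user input, as the script in the first patch does) between the two calls.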

* /proc/kpagecgroup. This file contains a 64-bit inode number of the
memory cgroup each page is charged to, indexed by PFN. Only available when
CONFIG_MEMCG is set.

This file can be used to find all pages (including unmapped file
pages) accounted to a particular cgroup. Using /proc/kpageidle, one
can then estimate the cgroup working set size.
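
A sketch of how the two files combine (hypothetical helper, not from the series; it assumes only the formats stated above: one 64-bit inode number per PFN in /proc/kpagecgroup, one bit per PFN in /proc/kpageidle, both buffers read starting at the same 64-aligned PFN):

```python
import struct

def idle_pages_per_cgroup(cg_buf, idle_buf):
    """Pair each PFN's cgroup ino (one 64-bit record per PFN, from
    /proc/kpagecgroup) with its idle bit (one bit per PFN, from
    /proc/kpageidle) and count idle pages per cgroup ino. Pages not
    charged to any cgroup report ino 0 and are skipped."""
    counts = {}
    for pfn, (ino,) in enumerate(struct.iter_unpack("<Q", cg_buf)):
        word, bit = divmod(pfn, 64)
        (bits,) = struct.unpack_from("<Q", idle_buf, word * 8)
        if ino and (bits >> bit) & 1:
            counts[ino] = counts.get(ino, 0) + 1
    return counts
```

Mapping an ino back to a cgroup path is then a matter of statting the directories under the memory cgroup mount.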

For an example of using these files to estimate the amount of unused
memory pages for each memory cgroup, please see the script attached
below.

---- REASONING ----

The reason to introduce the new user API instead of using
/proc/PID/{clear_refs,smaps} is that the latter has two serious
drawbacks:

- it does not count unmapped file pages
- it affects the reclaimer logic

The new API attempts to overcome them both. For more details on how it
is achieved, please see the comment to patch 5.

---- CHANGE LOG ----

Changes in v6:

 - Split the patch introducing the page_cgroup_ino helper to ease review.
 - Rebase on top of v4.1-rc7-mmotm-2015-06-09-16-55

The series is organized as follows:

 - patch 1 adds the page_cgroup_ino helper required for
   /proc/kpagecgroup, and patches 2-3 do related cleanup
 - patch 4 adds /proc/kpagecgroup, which reports the cgroup ino each
   page is charged to
 - patch 5 implements the idle page tracking feature, including the
   userspace API, /proc/kpageidle
 - patch 6 exports the idle flag via /proc/kpageflags

Vladimir Davydov (6):
memcg: add page_cgroup_ino helper
hwpoison: use page_cgroup_ino for filtering by memcg
memcg: zap try_get_mem_cgroup_from_page
proc: add kpagecgroup file
proc: add kpageidle file
proc: export idle flag via kpageflags

Documentation/vm/pagemap.txt | 22 +++-
fs/proc/page.c | 234 +++++++++++++++++++++++++++++++++
fs/proc/task_mmu.c | 4 +-
include/linux/memcontrol.h | 7 +-
include/linux/mm.h | 88 +++++++++++++
include/linux/page-flags.h | 9 ++
include/linux/page_ext.h | 4 +
include/uapi/linux/kernel-page-flags.h | 1 +
mm/Kconfig | 12 ++
mm/debug.c | 4 +
mm/hwpoison-inject.c | 5 +-
mm/memcontrol.c | 71 +++++-----
mm/memory-failure.c | 16 +--
mm/page_ext.c | 3 +
mm/rmap.c | 8 ++
mm/swap.c | 2 +
16 files changed, 428 insertions(+), 62 deletions(-)

--
2.1.4

Vladimir Davydov

Jun 12, 2015, 5:53:06 AM
to Andrew Morton, Minchan Kim, Raghavendra K T, Johannes Weiner, Michal Hocko, Greg Thelen, Michel Lespinasse, David Rientjes, Pavel Emelyanov, Cyrill Gorcunov, Jonathan Corbet, linu...@vger.kernel.org, linu...@vger.kernel.org, linu...@kvack.org, cgr...@vger.kernel.org, linux-...@vger.kernel.org
This function returns the inode number of the closest online ancestor of
the memory cgroup a page is charged to. It is required for exporting to
userspace information about which cgroup each page is charged to, which
a following patch will introduce.

Signed-off-by: Vladimir Davydov <vdav...@parallels.com>
---
include/linux/memcontrol.h | 1 +
mm/memcontrol.c | 23 +++++++++++++++++++++++
2 files changed, 24 insertions(+)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 73b02b0a8f60..50069abebc3c 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -116,6 +116,7 @@ static inline bool mm_match_cgroup(struct mm_struct *mm,

extern struct cgroup_subsys_state *mem_cgroup_css(struct mem_cgroup *memcg);
extern struct cgroup_subsys_state *mem_cgroup_css_from_page(struct page *page);
+extern unsigned long page_cgroup_ino(struct page *page);

struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *,
struct mem_cgroup *,
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index acb93c554f6e..894dc2169979 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -631,6 +631,29 @@ struct cgroup_subsys_state *mem_cgroup_css_from_page(struct page *page)
return &memcg->css;
}

+/**
+ * page_cgroup_ino - return inode number of the memcg a page is charged to
+ * @page: the page
+ *
+ * Look up the closest online ancestor of the memory cgroup @page is charged to
+ * and return its inode number or 0 if @page is not charged to any cgroup. It
+ * is safe to call this function without holding a reference to @page.
+ */
+unsigned long page_cgroup_ino(struct page *page)
+{
+ struct mem_cgroup *memcg;
+ unsigned long ino = 0;
+
+ rcu_read_lock();
+ memcg = READ_ONCE(page->mem_cgroup);
+ while (memcg && !(memcg->css.flags & CSS_ONLINE))
+ memcg = parent_mem_cgroup(memcg);
+ if (memcg)
+ ino = cgroup_ino(memcg->css.cgroup);
+ rcu_read_unlock();
+ return ino;
+}
+
static struct mem_cgroup_per_zone *
mem_cgroup_page_zoneinfo(struct mem_cgroup *memcg, struct page *page)
{

Vladimir Davydov

Jun 12, 2015, 5:53:12 AM
to Andrew Morton, Minchan Kim, Raghavendra K T, Johannes Weiner, Michal Hocko, Greg Thelen, Michel Lespinasse, David Rientjes, Pavel Emelyanov, Cyrill Gorcunov, Jonathan Corbet, linu...@vger.kernel.org, linu...@vger.kernel.org, linu...@kvack.org, cgr...@vger.kernel.org, linux-...@vger.kernel.org
try_get_mem_cgroup_from_page is only used in mem_cgroup_try_charge, so fold it in and zap it.

Signed-off-by: Vladimir Davydov <vdav...@parallels.com>
---
include/linux/memcontrol.h | 6 ------
mm/memcontrol.c | 48 ++++++++++++----------------------------------
2 files changed, 12 insertions(+), 42 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 50069abebc3c..635edfe06bac 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -94,7 +94,6 @@ bool mem_cgroup_is_descendant(struct mem_cgroup *memcg,
struct mem_cgroup *root);
bool task_in_mem_cgroup(struct task_struct *task, struct mem_cgroup *memcg);

-extern struct mem_cgroup *try_get_mem_cgroup_from_page(struct page *page);
extern struct mem_cgroup *mem_cgroup_from_task(struct task_struct *p);

extern struct mem_cgroup *parent_mem_cgroup(struct mem_cgroup *memcg);
@@ -259,11 +258,6 @@ static inline struct lruvec *mem_cgroup_page_lruvec(struct page *page,
return &zone->lruvec;
}

-static inline struct mem_cgroup *try_get_mem_cgroup_from_page(struct page *page)
-{
- return NULL;
-}
-
static inline bool mm_match_cgroup(struct mm_struct *mm,
struct mem_cgroup *memcg)
{
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 894dc2169979..fa1447fcba33 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2378,40 +2378,6 @@ static void cancel_charge(struct mem_cgroup *memcg, unsigned int nr_pages)
@@ -5628,8 +5594,20 @@ int mem_cgroup_try_charge(struct page *page, struct mm_struct *mm,
* the page lock, which serializes swap cache removal, which
* in turn serializes uncharging.
*/
+ VM_BUG_ON_PAGE(!PageLocked(page), page);
if (page->mem_cgroup)
goto out;
+
+ if (do_swap_account) {
+ swp_entry_t ent = { .val = page_private(page), };
+ unsigned short id = lookup_swap_cgroup_id(ent);
+
+ rcu_read_lock();
+ memcg = mem_cgroup_from_id(id);
+ if (memcg && !css_tryget_online(&memcg->css))
+ memcg = NULL;
+ rcu_read_unlock();
+ }
}

if (PageTransHuge(page)) {
@@ -5637,8 +5615,6 @@ int mem_cgroup_try_charge(struct page *page, struct mm_struct *mm,
VM_BUG_ON_PAGE(!PageTransHuge(page), page);
}

- if (do_swap_account && PageSwapCache(page))
- memcg = try_get_mem_cgroup_from_page(page);
if (!memcg)
memcg = get_mem_cgroup_from_mm(mm);

Vladimir Davydov

Jun 12, 2015, 5:53:23 AM
to Andrew Morton, Minchan Kim, Raghavendra K T, Johannes Weiner, Michal Hocko, Greg Thelen, Michel Lespinasse, David Rientjes, Pavel Emelyanov, Cyrill Gorcunov, Jonathan Corbet, linu...@vger.kernel.org, linu...@vger.kernel.org, linu...@kvack.org, cgr...@vger.kernel.org, linux-...@vger.kernel.org
/proc/kpagecgroup contains a 64-bit inode number of the memory cgroup
each page is charged to, indexed by PFN. Having this information is
useful for estimating a cgroup working set size.

The file is present if CONFIG_PROC_PAGE_MONITOR && CONFIG_MEMCG.

Signed-off-by: Vladimir Davydov <vdav...@parallels.com>
---
Documentation/vm/pagemap.txt | 6 ++++-
fs/proc/page.c | 53 ++++++++++++++++++++++++++++++++++++++++++++
2 files changed, 58 insertions(+), 1 deletion(-)

diff --git a/Documentation/vm/pagemap.txt b/Documentation/vm/pagemap.txt
index 6bfbc172cdb9..a9b7afc8fbc6 100644
--- a/Documentation/vm/pagemap.txt
+++ b/Documentation/vm/pagemap.txt
@@ -5,7 +5,7 @@ pagemap is a new (as of 2.6.25) set of interfaces in the kernel that allow
userspace programs to examine the page tables and related information by
reading files in /proc.

-There are three components to pagemap:
+There are four components to pagemap:

* /proc/pid/pagemap. This file lets a userspace process find out which
physical frame each virtual page is mapped to. It contains one 64-bit
@@ -65,6 +65,10 @@ There are three components to pagemap:
23. BALLOON
24. ZERO_PAGE

+ * /proc/kpagecgroup. This file contains a 64-bit inode number of the
+ memory cgroup each page is charged to, indexed by PFN. Only available when
+ CONFIG_MEMCG is set.
+
Short descriptions to the page flags:

0. LOCKED

Vladimir Davydov

Jun 12, 2015, 5:53:31 AM
to Andrew Morton, Minchan Kim, Raghavendra K T, Johannes Weiner, Michal Hocko, Greg Thelen, Michel Lespinasse, David Rientjes, Pavel Emelyanov, Cyrill Gorcunov, Jonathan Corbet, linu...@vger.kernel.org, linu...@vger.kernel.org, linu...@kvack.org, cgr...@vger.kernel.org, linux-...@vger.kernel.org
Knowing the portion of memory that is not used by a certain application
or memory cgroup (idle memory) can be useful for partitioning the system
efficiently, e.g. by setting memory cgroup limits appropriately.
Currently, the only means the kernel provides to estimate the amount of
idle memory is /proc/PID/{clear_refs,smaps}: the user can clear the
accessed bit for all pages mapped to a particular process by writing 1 to
clear_refs, wait for some time, and then count smaps:Referenced.
However, this method has two serious shortcomings:

- it does not count unmapped file pages
- it affects the reclaimer logic

To overcome these drawbacks, this patch introduces two new page flags,
Idle and Young, and a new proc file, /proc/kpageidle. A page's Idle flag
can only be set from userspace, by setting the bit in /proc/kpageidle at the
offset corresponding to the page, and it is cleared whenever the page is
accessed either through page tables (it is cleared in page_referenced()
in this case) or using the read(2) system call (mark_page_accessed()).
Thus by setting the Idle flag for pages of a particular workload, which
can be found e.g. by reading /proc/PID/pagemap, waiting for some time to
let the workload access its working set, and then reading the kpageidle
file, one can estimate the amount of pages that are not used by the
workload.
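
From userspace, the workflow described in this paragraph could look roughly like the sketch below. It is not part of the patch: the helper names are invented, and the /proc/PID/pagemap record format (one 64-bit record per virtual page, bit 63 = present, bits 0-54 = PFN) is taken from Documentation/vm/pagemap.txt. Needs root.

```python
import struct

PM_PRESENT = 1 << 63          # pagemap: bit 63 = page present
PM_PFN_MASK = (1 << 55) - 1   # pagemap: bits 0-54 = page frame number

def present_pfns(pagemap_buf):
    """Decode a chunk of /proc/PID/pagemap into the PFNs of the
    present pages, dropping swapped/unmapped entries."""
    return [entry & PM_PFN_MASK
            for (entry,) in struct.iter_unpack("<Q", pagemap_buf)
            if entry & PM_PRESENT]

def mark_process_range_idle(pid, vaddr, nr_pages, page_size=4096):
    """Find the PFNs backing a virtual range of the workload, then set
    their idle bits. /proc/kpageidle OR-s what is written, so one
    single-bit 64-bit word per page is enough."""
    with open("/proc/%d/pagemap" % pid, "rb") as f:
        f.seek(vaddr // page_size * 8)
        pfns = present_pfns(f.read(nr_pages * 8))
    with open("/proc/kpageidle", "r+b") as f:
        for pfn in pfns:
            f.seek(pfn // 64 * 8)
            f.write(struct.pack("<Q", 1 << (pfn % 64)))
```

After waiting for the workload to run, reading the same kpageidle words back and counting set bits gives the unused portion of the range.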

The Young page flag is used to avoid interference with the memory
reclaimer. A page's Young flag is set whenever the Accessed bit of a page
table entry pointing to the page is cleared by writing to kpageidle. If
page_referenced() is called on a Young page, it will add 1 to its return
value, therefore concealing the fact that the Accessed bit was cleared.

Note, since there is no room for extra page flags on 32 bit, this
feature uses extended page flags when compiled on 32 bit.

Signed-off-by: Vladimir Davydov <vdav...@parallels.com>
---
Documentation/vm/pagemap.txt | 12 ++-
fs/proc/page.c | 178 +++++++++++++++++++++++++++++++++++++++++++
fs/proc/task_mmu.c | 4 +-
include/linux/mm.h | 88 +++++++++++++++++++++
include/linux/page-flags.h | 9 +++
include/linux/page_ext.h | 4 +
mm/Kconfig | 12 +++
mm/debug.c | 4 +
mm/page_ext.c | 3 +
mm/rmap.c | 8 ++
mm/swap.c | 2 +
11 files changed, 322 insertions(+), 2 deletions(-)

diff --git a/Documentation/vm/pagemap.txt b/Documentation/vm/pagemap.txt
index a9b7afc8fbc6..c9266340852c 100644
--- a/Documentation/vm/pagemap.txt
+++ b/Documentation/vm/pagemap.txt
@@ -5,7 +5,7 @@ pagemap is a new (as of 2.6.25) set of interfaces in the kernel that allow
userspace programs to examine the page tables and related information by
reading files in /proc.

-There are four components to pagemap:
+There are five components to pagemap:

* /proc/pid/pagemap. This file lets a userspace process find out which
physical frame each virtual page is mapped to. It contains one 64-bit
@@ -69,6 +69,16 @@ There are four components to pagemap:
memory cgroup each page is charged to, indexed by PFN. Only available when
CONFIG_MEMCG is set.

+ * /proc/kpageidle. This file implements a bitmap where each bit corresponds
+ to a page, indexed by PFN. When the bit is set, the corresponding page is
+ idle. A page is considered idle if it has not been accessed since it was
+ marked idle. To mark a page idle one should set the bit corresponding to the
+ page by writing to the file. A value written to the file is OR-ed with the
+ current bitmap value. Only user memory pages can be marked idle, for other
+ page types input is silently ignored. Writing to this file beyond max PFN
+ results in the ENXIO error. Only available when CONFIG_IDLE_PAGE_TRACKING is
+ set.
+
Short descriptions to the page flags:

0. LOCKED
diff --git a/fs/proc/page.c b/fs/proc/page.c
index 70d23245dd43..1e342270b9c0 100644
--- a/fs/proc/page.c
+++ b/fs/proc/page.c
+static void kpageidle_clear_pte_refs(struct page *page)
+{
+ unsigned long dummy;
+
+ if (page_referenced(page, 0, NULL, &dummy, NULL))
+ /*
+ * We cleared the referenced bit in a mapping to this page. To
+ * avoid interference with the reclaimer, mark it young so that
+ * the next call to page_referenced() will also return > 0 (see
+ * page_referenced_one())
+ */
+ set_page_young(page);
+}
+
+static ssize_t kpageidle_read(struct file *file, char __user *buf,
+ size_t count, loff_t *ppos)
+{
+ u64 __user *out = (u64 __user *)buf;
+ struct page *page;
+ unsigned long pfn, end_pfn;
+ ssize_t ret = 0;
+ u64 idle_bitmap = 0;
+ int bit;
+
+ if (*ppos & KPMMASK || count & KPMMASK)
+ return -EINVAL;
+
+ pfn = *ppos * BITS_PER_BYTE;
+ if (pfn >= max_pfn)
+ return -ENXIO;
+
+ end_pfn = pfn + count * BITS_PER_BYTE;
+ if (end_pfn > max_pfn)
+ end_pfn = ALIGN(max_pfn, KPMBITS);
+
+ for (; pfn < end_pfn; pfn++) {
+ bit = pfn % KPMBITS;
+ page = kpageidle_get_page(pfn);
+ if (page) {
+ if (page_is_idle(page)) {
+ /*
+ * The page might have been referenced via a
+ * pte, in which case it is not idle. Clear
+ * refs and recheck.
+ */
+ kpageidle_clear_pte_refs(page);
+ if (page_is_idle(page))
+ idle_bitmap |= 1ULL << bit;
+ }
+ put_page(page);
+ }
+ if (bit == KPMBITS - 1) {
+ if (put_user(idle_bitmap, out)) {
+ ret = -EFAULT;
+ break;
+ }
+ idle_bitmap = 0;
+ out++;
+ }
+ }
+
+ *ppos += (char __user *)out - buf;
+ if (!ret)
+ ret = (char __user *)out - buf;
+ return ret;
+}
+
+static ssize_t kpageidle_write(struct file *file, const char __user *buf,
+ size_t count, loff_t *ppos)
+{
+ const u64 __user *in = (const u64 __user *)buf;
+ struct page *page;
+ unsigned long pfn, end_pfn;
+ ssize_t ret = 0;
+ u64 idle_bitmap = 0;
+ int bit;
+
+ if (*ppos & KPMMASK || count & KPMMASK)
+ return -EINVAL;
+
+ pfn = *ppos * BITS_PER_BYTE;
+ if (pfn >= max_pfn)
+ return -ENXIO;
+
+ end_pfn = pfn + count * BITS_PER_BYTE;
+ if (end_pfn > max_pfn)
+ end_pfn = ALIGN(max_pfn, KPMBITS);
+
+ for (; pfn < end_pfn; pfn++) {
+ bit = pfn % KPMBITS;
+ if (bit == 0) {
+ if (get_user(idle_bitmap, in)) {
+ ret = -EFAULT;
+ break;
+ }
+ in++;
+ }
+ if (idle_bitmap >> bit & 1) {
+ page = kpageidle_get_page(pfn);
+ if (page) {
+ kpageidle_clear_pte_refs(page);
+ set_page_idle(page);
+ put_page(page);
+ }
+ }
+ }
+
+ *ppos += (const char __user *)in - buf;
+ if (!ret)
+ ret = (const char __user *)in - buf;
+ return ret;
+}
+
+static const struct file_operations proc_kpageidle_operations = {
+ .llseek = mem_lseek,
+ .read = kpageidle_read,
+ .write = kpageidle_write,
+};
+
+#ifndef CONFIG_64BIT
+static bool need_page_idle(void)
+{
+ return true;
+}
+struct page_ext_operations page_idle_ops = {
+ .need = need_page_idle,
+};
+#endif
+#endif /* CONFIG_IDLE_PAGE_TRACKING */
+
static int __init proc_page_init(void)
{
proc_create("kpagecount", S_IRUSR, NULL, &proc_kpagecount_operations);
@@ -282,6 +456,10 @@ static int __init proc_page_init(void)
#ifdef CONFIG_MEMCG
proc_create("kpagecgroup", S_IRUSR, NULL, &proc_kpagecgroup_operations);
#endif
+#ifdef CONFIG_IDLE_PAGE_TRACKING
+ proc_create("kpageidle", S_IRUSR | S_IWUSR, NULL,
+ &proc_kpageidle_operations);
+#endif
return 0;
}
fs_initcall(proc_page_init);
diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 58be92e11939..fcec9ccb8f7e 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -458,7 +458,7 @@ static void smaps_account(struct mem_size_stats *mss, struct page *page,

mss->resident += size;
/* Accumulate the size in pages that have been accessed. */
- if (young || PageReferenced(page))
+ if (young || page_is_young(page) || PageReferenced(page))
mss->referenced += size;
mapcount = page_mapcount(page);
if (mapcount >= 2) {
@@ -810,6 +810,7 @@ static int clear_refs_pte_range(pmd_t *pmd, unsigned long addr,

/* Clear accessed and referenced bits. */
pmdp_test_and_clear_young(vma, addr, pmd);
+ clear_page_young(page);
ClearPageReferenced(page);
out:
spin_unlock(ptl);
@@ -837,6 +838,7 @@ out:

/* Clear accessed and referenced bits. */
ptep_test_and_clear_young(vma, addr, pte);
+ clear_page_young(page);
ClearPageReferenced(page);
}
pte_unmap_unlock(pte - 1, ptl);
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 7f471789781a..4545ac6e27eb 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2205,5 +2205,93 @@ void __init setup_nr_node_ids(void);
[...]
diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index 91b7f9b2b774..14c5d774ad70 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -109,6 +109,10 @@ enum pageflags {
#ifdef CONFIG_TRANSPARENT_HUGEPAGE
PG_compound_lock,
#endif
+#if defined(CONFIG_IDLE_PAGE_TRACKING) && defined(CONFIG_64BIT)
+ PG_young,
+ PG_idle,
+#endif
__NR_PAGEFLAGS,

/* Filesystems */
@@ -363,6 +367,11 @@ PAGEFLAG_FALSE(HWPoison)
#define __PG_HWPOISON 0
#endif

+#if defined(CONFIG_IDLE_PAGE_TRACKING) && defined(CONFIG_64BIT)
+PAGEFLAG(Young, young, PF_ANY)
+PAGEFLAG(Idle, idle, PF_ANY)
+#endif
+
/*
* On an anonymous page mapped into a user virtual memory area,
* page->mapping points to its anon_vma, not to a struct address_space;
diff --git a/include/linux/page_ext.h b/include/linux/page_ext.h
index c42981cd99aa..17f118a82854 100644
--- a/include/linux/page_ext.h
+++ b/include/linux/page_ext.h
@@ -26,6 +26,10 @@ enum page_ext_flags {
PAGE_EXT_DEBUG_POISON, /* Page is poisoned */
PAGE_EXT_DEBUG_GUARD,
PAGE_EXT_OWNER,
+#if defined(CONFIG_IDLE_PAGE_TRACKING) && !defined(CONFIG_64BIT)
+ PAGE_EXT_YOUNG,
+ PAGE_EXT_IDLE,
+#endif
};

/*
diff --git a/mm/Kconfig b/mm/Kconfig
index e79de2bd12cd..db817e2c2ec8 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -654,3 +654,15 @@ config DEFERRED_STRUCT_PAGE_INIT
when kswapd starts. This has a potential performance impact on
processes running early in the lifetime of the system until kswapd
finishes the initialisation.
+
+config IDLE_PAGE_TRACKING
+ bool "Enable idle page tracking"
+ select PROC_PAGE_MONITOR
+ select PAGE_EXTENSION if !64BIT
+ help
+ This feature allows estimating the amount of user pages that have
+ not been touched during a given period of time. This information can
+ be useful to tune memory cgroup limits and/or for job placement
+ within a compute cluster.
+
+ See Documentation/vm/pagemap.txt for more details.
diff --git a/mm/debug.c b/mm/debug.c
index 76089ddf99ea..6c1b3ea61bfd 100644
--- a/mm/debug.c
+++ b/mm/debug.c
@@ -48,6 +48,10 @@ static const struct trace_print_flags pageflag_names[] = {
#ifdef CONFIG_TRANSPARENT_HUGEPAGE
{1UL << PG_compound_lock, "compound_lock" },
#endif
+#if defined(CONFIG_IDLE_PAGE_TRACKING) && defined(CONFIG_64BIT)
+ {1UL << PG_young, "young" },
+ {1UL << PG_idle, "idle" },
+#endif
};

static void dump_flags(unsigned long flags,
diff --git a/mm/page_ext.c b/mm/page_ext.c
index d86fd2f5353f..e4b3af054bf2 100644
--- a/mm/page_ext.c
+++ b/mm/page_ext.c
@@ -59,6 +59,9 @@ static struct page_ext_operations *page_ext_ops[] = {
#ifdef CONFIG_PAGE_OWNER
&page_owner_ops,
#endif
+#if defined(CONFIG_IDLE_PAGE_TRACKING) && !defined(CONFIG_64BIT)
+ &page_idle_ops,
+#endif
};

static unsigned long total_usage;
diff --git a/mm/rmap.c b/mm/rmap.c
index 49b244b1f18c..8db3a6fc0c91 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -798,6 +798,14 @@ static int page_referenced_one(struct page *page, struct vm_area_struct *vma,
pte_unmap_unlock(pte, ptl);
}

+ if (referenced && page_is_idle(page))
+ clear_page_idle(page);
+
+ if (page_is_young(page)) {
+ clear_page_young(page);
+ referenced++;
+ }
+
if (referenced) {
pra->referenced++;
pra->vm_flags |= vma->vm_flags;
diff --git a/mm/swap.c b/mm/swap.c
index ab7c338eda87..db43c9b4891d 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -623,6 +623,8 @@ void mark_page_accessed(struct page *page)
} else if (!PageReferenced(page)) {
SetPageReferenced(page);
}
+ if (page_is_idle(page))
+ clear_page_idle(page);
}
EXPORT_SYMBOL(mark_page_accessed);

Vladimir Davydov

Jun 12, 2015, 5:53:38 AM
to Andrew Morton, Minchan Kim, Raghavendra K T, Johannes Weiner, Michal Hocko, Greg Thelen, Michel Lespinasse, David Rientjes, Pavel Emelyanov, Cyrill Gorcunov, Jonathan Corbet, linu...@vger.kernel.org, linu...@vger.kernel.org, linu...@kvack.org, cgr...@vger.kernel.org, linux-...@vger.kernel.org
As noted by Minchan, a benefit of reading the idle flag from
/proc/kpageflags is that one can easily filter out dirty and/or
unevictable pages while estimating the size of unused memory.

Note that the idle flag read from /proc/kpageflags may be stale in case the
page was accessed via a PTE, because it would be too costly to iterate
over all page mappings on each /proc/kpageflags read to provide an
up-to-date value. To make sure the flag is up-to-date one has to read
/proc/kpageidle first.
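
The recommended read order could be sketched from userspace as follows (hypothetical helper, not from the patch; KPF_IDLE is bit 25 per this patch, /proc/kpageflags holds one 64-bit record per PFN, and /proc/kpageidle one bit per PFN; needs root):

```python
import struct

KPF_IDLE = 25  # bit index added by this patch

def is_idle(kpageflags_entry):
    """True if a 64-bit /proc/kpageflags record has the IDLE bit set."""
    return bool((kpageflags_entry >> KPF_IDLE) & 1)

def read_idle_flags(start_pfn, nr_pages):
    """Read up-to-date idle flags: read the corresponding chunk of
    /proc/kpageidle first, which folds in references made via PTEs,
    then decode /proc/kpageflags. start_pfn should be a multiple of
    64 so the kpageidle read is word-aligned."""
    with open("/proc/kpageidle", "rb") as f:
        f.seek(start_pfn // 8)
        f.read((nr_pages + 63) // 64 * 8)   # refresh pass, result unused
    with open("/proc/kpageflags", "rb") as f:
        f.seek(start_pfn * 8)
        return [is_idle(entry) for (entry,) in
                struct.iter_unpack("<Q", f.read(nr_pages * 8))]
```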

Signed-off-by: Vladimir Davydov <vdav...@parallels.com>
---
Documentation/vm/pagemap.txt | 6 ++++++
fs/proc/page.c | 3 +++
include/uapi/linux/kernel-page-flags.h | 1 +
3 files changed, 10 insertions(+)

diff --git a/Documentation/vm/pagemap.txt b/Documentation/vm/pagemap.txt
index c9266340852c..5896b7d7fd74 100644
--- a/Documentation/vm/pagemap.txt
+++ b/Documentation/vm/pagemap.txt
@@ -64,6 +64,7 @@ There are five components to pagemap:
22. THP
23. BALLOON
24. ZERO_PAGE
+ 25. IDLE

* /proc/kpagecgroup. This file contains a 64-bit inode number of the
memory cgroup each page is charged to, indexed by PFN. Only available when
@@ -124,6 +125,11 @@ Short descriptions to the page flags:
24. ZERO_PAGE
zero page for pfn_zero or huge_zero page

+25. IDLE
+ page has not been accessed since it was marked idle (see /proc/kpageidle)
+ Note that this flag may be stale in case the page was accessed via a PTE.
+ To make sure the flag is up-to-date one has to read /proc/kpageidle first.
+
[IO related page flags]
1. ERROR IO error occurred
3. UPTODATE page has up-to-date data
diff --git a/fs/proc/page.c b/fs/proc/page.c
index 1e342270b9c0..ec6d1cd65698 100644
--- a/fs/proc/page.c
+++ b/fs/proc/page.c
@@ -148,6 +148,9 @@ u64 stable_page_flags(struct page *page)
if (PageBalloon(page))
u |= 1 << KPF_BALLOON;

+ if (page_is_idle(page))
+ u |= 1 << KPF_IDLE;
+
u |= kpf_copy_bit(k, KPF_LOCKED, PG_locked);

u |= kpf_copy_bit(k, KPF_SLAB, PG_slab);
diff --git a/include/uapi/linux/kernel-page-flags.h b/include/uapi/linux/kernel-page-flags.h
index a6c4962e5d46..5da5f8751ce7 100644
--- a/include/uapi/linux/kernel-page-flags.h
+++ b/include/uapi/linux/kernel-page-flags.h
@@ -33,6 +33,7 @@
#define KPF_THP 22
#define KPF_BALLOON 23
#define KPF_ZERO_PAGE 24
+#define KPF_IDLE 25


#endif /* _UAPILINUX_KERNEL_PAGE_FLAGS_H */

Vladimir Davydov

Jun 12, 2015, 5:54:42 AM
to Andrew Morton, Minchan Kim, Raghavendra K T, Johannes Weiner, Michal Hocko, Greg Thelen, Michel Lespinasse, David Rientjes, Pavel Emelyanov, Cyrill Gorcunov, Jonathan Corbet, linu...@vger.kernel.org, linu...@vger.kernel.org, linu...@kvack.org, cgr...@vger.kernel.org, linux-...@vger.kernel.org
Hwpoison allows filtering pages by memory cgroup inode number. Currently,
it calls try_get_mem_cgroup_from_page to obtain the cgroup from a page and
then gets its ino using cgroup_ino, but now we have a more suitable method
for that, page_cgroup_ino, so use it instead.

Signed-off-by: Vladimir Davydov <vdav...@parallels.com>
---
mm/hwpoison-inject.c | 5 +----
mm/memory-failure.c | 16 ++--------------
2 files changed, 3 insertions(+), 18 deletions(-)

diff --git a/mm/hwpoison-inject.c b/mm/hwpoison-inject.c
index bf73ac17dad4..5015679014c1 100644
--- a/mm/hwpoison-inject.c
+++ b/mm/hwpoison-inject.c
@@ -45,12 +45,9 @@ static int hwpoison_inject(void *data, u64 val)
/*
* do a racy check with elevated page count, to make sure PG_hwpoison
* will only be set for the targeted owner (or on a free page).
- * We temporarily take page lock for try_get_mem_cgroup_from_page().
* memory_failure() will redo the check reliably inside page lock.
*/
- lock_page(hpage);
err = hwpoison_filter(hpage);
- unlock_page(hpage);
if (err)
goto put_out;

@@ -126,7 +123,7 @@ static int pfn_inject_init(void)
if (!dentry)
goto fail;

-#ifdef CONFIG_MEMCG_SWAP
+#ifdef CONFIG_MEMCG
dentry = debugfs_create_u64("corrupt-filter-memcg", 0600,
hwpoison_dir, &hwpoison_filter_memcg);
if (!dentry)
diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index 1cf7f2988422..97005396a507 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -130,27 +130,15 @@ static int hwpoison_filter_flags(struct page *p)

Vladimir Davydov

Jul 8, 2015, 1:48:09 PM
to Andrew Morton, Minchan Kim, Raghavendra K T, Johannes Weiner, Michal Hocko, Greg Thelen, Michel Lespinasse, David Rientjes, Pavel Emelyanov, Cyrill Gorcunov, Jonathan Corbet, linu...@vger.kernel.org, linu...@vger.kernel.org, linu...@kvack.org, cgr...@vger.kernel.org, linux-...@vger.kernel.org
Hi,

Any comments, thoughts, proposals regarding this patch? Any chance for
it to get merged?

Thanks,
Vladimir

Andres Lagar-Cavilla

Jul 8, 2015, 7:01:36 PM
to Vladimir Davydov, Andrew Morton, Minchan Kim, Raghavendra K T, Johannes Weiner, Michal Hocko, Greg Thelen, Michel Lespinasse, David Rientjes, Pavel Emelyanov, Cyrill Gorcunov, Jonathan Corbet, linu...@vger.kernel.org, linu...@vger.kernel.org, linu...@kvack.org, cgr...@vger.kernel.org, linux-...@vger.kernel.org
Vladimir,
I've reviewed the other five patches on your series and they're
eminently reasonable, so I'll focus my comments here, inline below.
Comments apply to both this specific patch and more broadly to the
approach you present. If I think of more I will post again. Hope that
helps!

Andres
Isolation can race in while you're processing the page, after these
checks. This is ok, but worth a small comment.

> + return NULL;
> + if (!get_page_unless_zero(page))
> + return NULL;
> +
> + zone = page_zone(page);
> + spin_lock_irq(&zone->lru_lock);
> + if (unlikely(!PageLRU(page))) {
> + put_page(page);
> + page = NULL;
> + }
> + spin_unlock_irq(&zone->lru_lock);
> + return page;
> +}
> +
> +/*
> + * This function calls page_referenced() to clear the referenced bit for all
> + * mappings to a page. Since the latter also clears the page idle flag if the
> + * page was referenced, it can be used to update the idle flag of a page.
> + */
> +static void kpageidle_clear_pte_refs(struct page *page)
> +{
> + unsigned long dummy;
> +
> + if (page_referenced(page, 0, NULL, &dummy, NULL))

Because of pte/pmd_clear_flush_young* called in the guts of
page_referenced_one, an N byte write or read to /proc/kpageidle will
cause N * 64 TLB flushes.

Additionally, because of the _notify connection to mmu notifiers, this
will also cause N * 64 EPT TLB flushes (in the KVM Intel case, similar
for other notifier flavors, you get the point).

The solution is relatively straightforward: augment
page_referenced_one with a mode marker or boolean that determines
whether tlb flushing is required.

For an access pattern tracker such as the one you propose, flushing is
not strictly necessary: the next context switch will take care. Too
bad if you missed a few accesses because the pte/pmd was loaded in the
TLB. Not so easy for MMU notifiers, because each secondary MMU has its
own semantics. You could arguably throw the towel in there, or try to
provide a framework (i.e. propagate the flushing flag) and let each
implementation fill the gaps.
Your reasoning for a host wide /proc/kpageidle is well argued, but I'm
still hesitant.

mincore() shows how to (relatively simply) resolve unmapped file pages
to their backing page cache destination. You could recycle that code
and then you'd have per process idle/idling interfaces. With the
advantage of a clear TLB flush demarcation.

> + size_t count, loff_t *ppos)
> +{
> + const u64 __user *in = (const u64 __user *)buf;
> + struct page *page;
> + unsigned long pfn, end_pfn;
> + ssize_t ret = 0;
> + u64 idle_bitmap = 0;
> + int bit;
> +
> + if (*ppos & KPMMASK || count & KPMMASK)
> + return -EINVAL;
> +
> + pfn = *ppos * BITS_PER_BYTE;
> + if (pfn >= max_pfn)
> + return -ENXIO;
> +
> + end_pfn = pfn + count * BITS_PER_BYTE;
> + if (end_pfn > max_pfn)
> + end_pfn = ALIGN(max_pfn, KPMBITS);
> +
> + for (; pfn < end_pfn; pfn++) {

Relatively straight forward to teleport forward 512 (or more
correctly: 1 << compound_order(page)) pages for THP pages, once done
with a THP head, and avoid 511 fruitless trips down rmap.c for each
tail.

> + bit = pfn % KPMBITS;
> + if (bit == 0) {
> + if (get_user(idle_bitmap, in)) {
> + ret = -EFAULT;
> + break;
> + }
> + in++;
> + }
> + if (idle_bitmap >> bit & 1) {
> + page = kpageidle_get_page(pfn);
> + if (page) {
> + kpageidle_clear_pte_refs(page);
> + set_page_idle(page);

In the common case this will make a page both young and idle. This is
fine. We will come back to it below.
Below I will comment more on the value of test_and_clear_page_young. I
think you should strive to support that; it's trivial in the common
case of 64 bits (and requires some syntactic sugar and relaxed
guarantees for the page_ext case, which is fine).
This is not in your patch, but further up in page_referenced_one there
is the pmd case.

So what happens on THP split? That was a leading question: you should
propagate the young and idle flags to the split-up tail pages.

> }
>
> + if (referenced && page_is_idle(page))
> + clear_page_idle(page);

Is it so expensive to just call clear without the test .. ?

> +
> + if (page_is_young(page)) {
> + clear_page_young(page);

referenced += test_and_clear_page_young(page) .. ?

> + referenced++;
> + }
> +

Invert the order. A page can be both young and idle -- we noted that
closer to the top of the patch.

So young bumps referenced up, and then the final referenced value is
used to clear idle.

> if (referenced) {

At this point, if you follow my suggestion of augmenting
page_referenced_one with a mode indicator (for TLB flushing), you can
set page young here. There is the added benefit of holding the
mmap_mutex lock or vma_lock, which prevents reclaim, try_to_unmap,
migration, from exploiting a small window where page young is not set
but should.

> pra->referenced++;
> pra->vm_flags |= vma->vm_flags;
> diff --git a/mm/swap.c b/mm/swap.c
> index ab7c338eda87..db43c9b4891d 100644
> --- a/mm/swap.c
> +++ b/mm/swap.c
> @@ -623,6 +623,8 @@ void mark_page_accessed(struct page *page)
> } else if (!PageReferenced(page)) {
> SetPageReferenced(page);
> }
> + if (page_is_idle(page))
> + clear_page_idle(page);
> }
> EXPORT_SYMBOL(mark_page_accessed);
>
> --
> 2.1.4
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majo...@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/



--
Andres Lagar-Cavilla | Google Kernel Team | andr...@google.com

Vladimir Davydov

Jul 9, 2015, 9:19:52 AM7/9/15
to Andres Lagar-Cavilla, Andrew Morton, Minchan Kim, Raghavendra K T, Johannes Weiner, Michal Hocko, Greg Thelen, Michel Lespinasse, David Rientjes, Pavel Emelyanov, Cyrill Gorcunov, Jonathan Corbet, linu...@vger.kernel.org, linu...@vger.kernel.org, linu...@kvack.org, cgr...@vger.kernel.org, linux-...@vger.kernel.org
Hi Andres,

On Wed, Jul 08, 2015 at 04:01:13PM -0700, Andres Lagar-Cavilla wrote:
> On Fri, Jun 12, 2015 at 2:52 AM, Vladimir Davydov
[...]
Agree, will add one.
Frankly, I don't think that tlb flushes are such a big deal in the scope
of this feature, because one is not supposed to write to kpageidle that
often. However, I agree we'd better avoid overhead we can easily avoid,
so I'll add a new flag to differentiate between kpageidle and reclaim
path in page_referenced, as you suggested.
Hmm, I still don't see how we could handle page cache that does not
belong to any process in the scope of sys_mincore.

Besides, it'd be awkward to reuse sys_mincore for idle page tracking,
because we need two operations, set idle and check idle, while the
sys_mincore semantic implies only getting information from the kernel,
not vice versa.

Of course, we could introduce a separate syscall, say sys_idlecore, but
IMO it is not a good idea to add a syscall for such a specific feature,
which can be compiled out. I think a proc file suits the purpose
better, especially considering that we already have a bunch of similar
files (pagemap, kpageflags, kpagecount).

Anyway, I'm open for suggestions. If you have a different user API
design in mind, which in your opinion would fit better, please share.
Right, will fix.
[...]
> > @@ -798,6 +798,14 @@ static int page_referenced_one(struct page *page, struct vm_area_struct *vma,
> > pte_unmap_unlock(pte, ptl);
>
> This is not in your patch, but further up in page_referenced_one there
> is the pmd case.
>
> So what happens on THP split? That was a leading question: you should
> propagate the young and idle flags to the split-up tail pages.

Good catch! I completely forgot about THP split. Will fix in the next
iteration.

>
> > }
> >
> > + if (referenced && page_is_idle(page))
> > + clear_page_idle(page);
>
> Is it so expensive to just call clear without the test .. ?

This function is normally called from a relatively cold path - memory
reclaim, where we modify page->flags anyway, so I think it won't make
any difference if we drop this check.

>
> > +
> > + if (page_is_young(page)) {
> > + clear_page_young(page);
>
> referenced += test_and_clear_page_young(page) .. ?

Yeah, that does look better.

>
> > + referenced++;
> > + }
> > +
>
> Invert the order. A page can be both young and idle -- we noted that
> closer to the top of the patch.
>
> So young bumps referenced up, and then the final referenced value is
> used to clear idle.

I don't think it'd work. Look, kpageidle_write clears pte references and
sets the idle flag. If the page was referenced it also sets the young
flag in order not to interfere with the reclaimer. When kpageidle_read
is called afterwards, it must see the idle flag set iff the page has not
been referenced since kpageidle_write set it. However, if
page_referenced was not called on the page from the reclaim path, it
will still be young regardless of whether it has been referenced, and
will therefore always be identified as not idle, which is incorrect.
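The invariant described here can be captured in a few lines (a toy Python model with hypothetical names, not kernel code): kpageidle_write stashes a cleared access bit in Young, and page_referenced counts Young exactly once while clearing Idle only on a real reference.

```python
class Page(object):
    def __init__(self):
        self.pte_accessed = False  # hardware access bit in the pte
        self.young = False         # stashed access, hidden from userspace
        self.idle = False          # userspace-visible idle flag

def kpageidle_write(page):
    # Clear the hardware access bit; remember it in Young so the
    # reclaimer still sees the page as referenced.
    if page.pte_accessed:
        page.pte_accessed = False
        page.young = True
    page.idle = True

def page_referenced(page):
    referenced = 0
    if page.pte_accessed:
        page.pte_accessed = False
        referenced += 1
    if referenced:
        page.idle = False  # only a real access since marking clears Idle
    if page.young:
        page.young = False  # stashed access bumps the count exactly once
        referenced += 1
    return referenced
```

In this model a page marked idle and never touched again stays idle even after the reclaimer calls page_referenced, while any real access clears the flag, which is the behavior the paragraph above argues for.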

>
> > if (referenced) {
>
> At this point, if you follow my suggestion of augmenting
> page_referenced_one with a mode indicator (for TLB flushing), you can
> set page young here. There is the added benefit of holding the
> mmap_mutex lock or vma_lock, which prevents reclaim, try_to_unmap,
> migration, from exploiting a small window where page young is not set
> but should.

Yeah, if we go with the page_referenced mode switcher you suggested
above, it's definitely worth moving set_page_young here.

Thank you for the review!

Vladimir

>
> > pra->referenced++;
> > pra->vm_flags |= vma->vm_flags;
> > diff --git a/mm/swap.c b/mm/swap.c
> > index ab7c338eda87..db43c9b4891d 100644
> > --- a/mm/swap.c
> > +++ b/mm/swap.c
> > @@ -623,6 +623,8 @@ void mark_page_accessed(struct page *page)
> > } else if (!PageReferenced(page)) {
> > SetPageReferenced(page);
> > }
> > + if (page_is_idle(page))
> > + clear_page_idle(page);
> > }
> > EXPORT_SYMBOL(mark_page_accessed);
> >
--

Andres Lagar-Cavilla

Jul 10, 2015, 3:11:11 PM7/10/15
to Vladimir Davydov, Andrew Morton, Minchan Kim, Raghavendra K T, Johannes Weiner, Michal Hocko, Greg Thelen, Michel Lespinasse, David Rientjes, Pavel Emelyanov, Cyrill Gorcunov, Jonathan Corbet, linu...@vger.kernel.org, linu...@vger.kernel.org, linu...@kvack.org, cgr...@vger.kernel.org, linux-...@vger.kernel.org
Hi Vladimir,
Yes, it's a performance optimization, but a fairly critical one. Once
you open up a user-space interface, it will take off. What prevents
people from writing a daemon that scans the entire host's memory every
N seconds (N=10? 60? 90? 120?)? That means tens or hundreds of millions
of individual TLB flushes, which will hurt performance.

The KVM issue is not minor.
You're correct.

I wasn't asking to use mincore, just pointing out an extant code
pattern that could get you beyond the concerns re unmapping (and which
can be implemented as a proc file).

My view is that the key pieces of infrastructure your patchset brings
along (the flags, the interactions with page_referenced_one,
clear_refs, mark_page_accessed) can then be reused in many ways. Michel
Lespinasse's kernel thread can reuse them, or proc/smaps can be
augmented (or a new proc entry added) to get per-process idle maps.

So /proc/kpageidle is fine with me, but not crazy appealing.
You're right. Thanks!
Andres
>
>>
>> > if (referenced) {
>>
>> At this point, if you follow my suggestion of augmenting
>> page_referenced_one with a mode indicator (for TLB flushing), you can
>> set page young here. There is the added benefit of holding the
>> mmap_mutex lock or vma_lock, which prevents reclaim, try_to_unmap,
>> migration, from exploiting a small window where page young is not set
>> but should.
>
> Yeah, if we go with the page_referenced mode switcher you suggested
> above, it's definitely worth moving set_page_young here.
>
> Thank you for the review!
>
> Vladimir
>
>>
>> > pra->referenced++;
>> > pra->vm_flags |= vma->vm_flags;
>> > diff --git a/mm/swap.c b/mm/swap.c
>> > index ab7c338eda87..db43c9b4891d 100644
>> > --- a/mm/swap.c
>> > +++ b/mm/swap.c
>> > @@ -623,6 +623,8 @@ void mark_page_accessed(struct page *page)
>> > } else if (!PageReferenced(page)) {
>> > SetPageReferenced(page);
>> > }
>> > + if (page_is_idle(page))
>> > + clear_page_idle(page);
>> > }
>> > EXPORT_SYMBOL(mark_page_accessed);
>> >



--
Andres Lagar-Cavilla | Google Kernel Team | andr...@google.com

Vladimir Davydov

Jul 11, 2015, 10:48:44 AM7/11/15
to Andrew Morton, Andres Lagar-Cavilla, Minchan Kim, Raghavendra K T, Johannes Weiner, Michal Hocko, Greg Thelen, Michel Lespinasse, David Rientjes, Pavel Emelyanov, Cyrill Gorcunov, Jonathan Corbet, linu...@vger.kernel.org, linu...@vger.kernel.org, linu...@kvack.org, cgr...@vger.kernel.org, linux-...@vger.kernel.org
Hi,

This patch set introduces a new user API for tracking user memory pages
that have not been used for a given period of time. The purpose of this
is to provide the userspace with the means of tracking a workload's
working set, i.e. the set of pages that are actively used by the
workload. Knowing the working set size can be useful for partitioning
the system more efficiently, e.g. by tuning memory cgroup limits
appropriately, or for job placement within a compute cluster.

It is based on top of v4.2-rc1-mmotm-2015-07-06-16-25

---- USE CASES ----

The unified cgroup hierarchy has memory.low and memory.high knobs, which
are defined as the low and high boundaries for the workload working set
size. However, the working set size of a workload may be unknown or
change in time. With this patch set, one can periodically estimate the
amount of memory unused by each cgroup and tune their memory.low and
memory.high parameters accordingly, therefore optimizing the overall
memory utilization.

Another use case is balancing workloads within a compute cluster.
Knowing how much memory is not really used by a workload unit may help
take a more optimal decision when considering migrating the unit to
another node within the cluster.

Also, as noted by Minchan, this would be useful for per-process reclaim
(https://lwn.net/Articles/545668/). With idle tracking, a smart
userspace memory manager could reclaim only the idle pages.

---- USER API ----

The user API consists of two new proc files:

* /proc/kpageidle. This file implements a bitmap where each bit corresponds
to a page, indexed by PFN. When the bit is set, the corresponding page is
idle. A page is considered idle if it has not been accessed since it was
marked idle. To mark a page idle one should set the bit corresponding to the
page by writing to the file. A value written to the file is OR-ed with the
current bitmap value. Only user memory pages can be marked idle; for other
page types the input is silently ignored. Writing to this file beyond max PFN
results in the ENXIO error. Only available when CONFIG_IDLE_PAGE_TRACKING is
set.

This file can be used to estimate the amount of pages that are not
used by a particular workload as follows:

1. mark all pages of interest idle by setting corresponding bits in the
/proc/kpageidle bitmap
2. wait until the workload accesses its working set
3. read /proc/kpageidle and count the number of bits set
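The three steps boil down to simple offset arithmetic on the bitmap: PFN n lives at bit n % 64 of the little-endian u64 word at byte offset 8 * (n // 64). A sketch of that bookkeeping (pure arithmetic, no /proc access; the helper names are made up):

```python
def kpageidle_word_offset(pfn):
    """Byte offset of the u64 word in /proc/kpageidle covering this PFN."""
    return (pfn // 64) * 8

def kpageidle_bit(pfn):
    """Bit index of the PFN inside that word."""
    return pfn % 64

def mark_idle_mask(pfns):
    """Build a {byte_offset: u64 mask} map to write for a set of PFNs."""
    words = {}
    for pfn in pfns:
        off = kpageidle_word_offset(pfn)
        words[off] = words.get(off, 0) | (1 << kpageidle_bit(pfn))
    return words
```

To actually mark pages, one would seek to each offset in /proc/kpageidle and write the word with struct.pack("Q", mask); the kernel ORs it into the current bitmap.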

* /proc/kpagecgroup. This file contains a 64-bit inode number of the
memory cgroup each page is charged to, indexed by PFN. Only available when
CONFIG_MEMCG is set.

This file can be used to find all pages (including unmapped file
pages) accounted to a particular cgroup. Using /proc/kpageidle, one
can then estimate the cgroup working set size.

For an example of using these files to estimate the amount of unused
memory pages for each memory cgroup, please see the script attached
below.

---- REASONING ----

The reason to introduce the new user API instead of using
/proc/PID/{clear_refs,smaps} is that the latter has two serious
drawbacks:

- it does not count unmapped file pages
- it affects the reclaimer logic

The new API attempts to overcome them both. For more details on how it
is achieved, please see the comment to patch 5.

---- CHANGE LOG ----

Changes in v7:

This iteration addresses Andres's comments to v6:

- do not reuse page_referenced for clearing idle flag, introduce a
separate function instead; this way we won't issue expensive TLB
flushes on /proc/kpageidle read/write
- propagate young/idle flags from head to tail pages on thp split
- skip compound tail pages while reading/writing /proc/kpageidle
- cleanup page_referenced_one
v6: https://lkml.org/lkml/2015/6/12/301
v5: https://lkml.org/lkml/2015/5/12/449
---- SCRIPT ----
#! /usr/bin/python
#

import os
import stat
import errno
import struct

CGROUP_MOUNT = "/sys/fs/cgroup/memory"
BUFSIZE = 8 * 1024  # must be a multiple of 8


def get_hugepage_size():
    with open("/proc/meminfo", "r") as f:
        for s in f:
            k, v = s.split(":")
            if k == "Hugepagesize":
                return int(v.split()[0]) * 1024

PAGE_SIZE = os.sysconf("SC_PAGE_SIZE")
HUGEPAGE_SIZE = get_hugepage_size()


def set_idle():
    f = open("/proc/kpageidle", "wb", BUFSIZE)
    while True:
        try:
            f.write(struct.pack("Q", pow(2, 64) - 1))
        except IOError as err:
            if err.errno == errno.ENXIO:
                break
            raise
    f.close()


def count_idle():
    f_flags = open("/proc/kpageflags", "rb", BUFSIZE)
    f_cgroup = open("/proc/kpagecgroup", "rb", BUFSIZE)

    with open("/proc/kpageidle", "rb", BUFSIZE) as f:
        while f.read(BUFSIZE): pass  # update idle flag

    idlememsz = {}
    while True:
        s1, s2 = f_flags.read(8), f_cgroup.read(8)
        if not s1 or not s2:
            break

        flags, = struct.unpack('Q', s1)
        cgino, = struct.unpack('Q', s2)

        unevictable = (flags >> 18) & 1
        huge = (flags >> 22) & 1
        idle = (flags >> 25) & 1

        if idle and not unevictable:
            idlememsz[cgino] = idlememsz.get(cgino, 0) + \
                (HUGEPAGE_SIZE if huge else PAGE_SIZE)

    f_flags.close()
    f_cgroup.close()
    return idlememsz


if __name__ == "__main__":
    print "Setting the idle flag for each page..."
    set_idle()

    raw_input("Wait until the workload accesses its working set, "
              "then press Enter")

    print "Counting idle pages..."
    idlememsz = count_idle()

    for dir, subdirs, files in os.walk(CGROUP_MOUNT):
        ino = os.stat(dir)[stat.ST_INO]
        print dir + ": " + str(idlememsz.get(ino, 0) / 1024) + " kB"
---- END SCRIPT ----
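The shifts in count_idle() above are /proc/kpageflags bit positions from include/uapi/linux/kernel-page-flags.h (18 = UNEVICTABLE, 22 = THP, 25 = IDLE, the flag this series exports). A small decoder makes them explicit:

```python
KPF_UNEVICTABLE = 18  # bit positions in a /proc/kpageflags u64 word
KPF_THP = 22
KPF_IDLE = 25         # exported by this patch set

def decode_kpageflags(flags):
    """Extract the bits the idle-estimation script cares about."""
    return {
        "unevictable": bool(flags >> KPF_UNEVICTABLE & 1),
        "huge": bool(flags >> KPF_THP & 1),
        "idle": bool(flags >> KPF_IDLE & 1),
    }
```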

Comments are more than welcome.

Thanks,

Vladimir Davydov (6):
memcg: add page_cgroup_ino helper
hwpoison: use page_cgroup_ino for filtering by memcg
memcg: zap try_get_mem_cgroup_from_page
proc: add kpagecgroup file
proc: add kpageidle file
proc: export idle flag via kpageflags

Documentation/vm/pagemap.txt | 22 ++-
fs/proc/page.c | 278 +++++++++++++++++++++++++++++++++
fs/proc/task_mmu.c | 4 +-
include/linux/memcontrol.h | 7 +-
include/linux/mm.h | 98 ++++++++++++
include/linux/page-flags.h | 11 ++
include/linux/page_ext.h | 4 +
include/uapi/linux/kernel-page-flags.h | 1 +
mm/Kconfig | 12 ++
mm/debug.c | 4 +
mm/huge_memory.c | 5 +
mm/hwpoison-inject.c | 5 +-
mm/memcontrol.c | 71 +++++----
mm/memory-failure.c | 16 +-
mm/page_ext.c | 3 +
mm/rmap.c | 5 +
mm/swap.c | 2 +
17 files changed, 486 insertions(+), 62 deletions(-)

--
2.1.4

Vladimir Davydov

Jul 11, 2015, 10:48:52 AM7/11/15
to Andrew Morton, Andres Lagar-Cavilla, Minchan Kim, Raghavendra K T, Johannes Weiner, Michal Hocko, Greg Thelen, Michel Lespinasse, David Rientjes, Pavel Emelyanov, Cyrill Gorcunov, Jonathan Corbet, linu...@vger.kernel.org, linu...@vger.kernel.org, linu...@kvack.org, cgr...@vger.kernel.org, linux-...@vger.kernel.org
It is only used in mem_cgroup_try_charge, so fold it in and zap it.

Signed-off-by: Vladimir Davydov <vdav...@parallels.com>
---

Vladimir Davydov

Jul 11, 2015, 10:48:59 AM7/11/15
to Andrew Morton, Andres Lagar-Cavilla, Minchan Kim, Raghavendra K T, Johannes Weiner, Michal Hocko, Greg Thelen, Michel Lespinasse, David Rientjes, Pavel Emelyanov, Cyrill Gorcunov, Jonathan Corbet, linu...@vger.kernel.org, linu...@vger.kernel.org, linu...@kvack.org, cgr...@vger.kernel.org, linux-...@vger.kernel.org
/proc/kpagecgroup contains a 64-bit inode number of the memory cgroup
each page is charged to, indexed by PFN. Having this information is
useful for estimating a cgroup working set size.

The file is present if CONFIG_PROC_PAGE_MONITOR && CONFIG_MEMCG.

Signed-off-by: Vladimir Davydov <vdav...@parallels.com>
---
Documentation/vm/pagemap.txt | 6 ++++-
fs/proc/page.c | 53 ++++++++++++++++++++++++++++++++++++++++++++
2 files changed, 58 insertions(+), 1 deletion(-)

diff --git a/Documentation/vm/pagemap.txt b/Documentation/vm/pagemap.txt
index 6bfbc172cdb9..a9b7afc8fbc6 100644
--- a/Documentation/vm/pagemap.txt
+++ b/Documentation/vm/pagemap.txt
@@ -5,7 +5,7 @@ pagemap is a new (as of 2.6.25) set of interfaces in the kernel that allow
userspace programs to examine the page tables and related information by
reading files in /proc.

-There are three components to pagemap:
+There are four components to pagemap:

* /proc/pid/pagemap. This file lets a userspace process find out which
physical frame each virtual page is mapped to. It contains one 64-bit
@@ -65,6 +65,10 @@ There are three components to pagemap:
23. BALLOON
24. ZERO_PAGE

+ * /proc/kpagecgroup. This file contains a 64-bit inode number of the
+ memory cgroup each page is charged to, indexed by PFN. Only available when
+ CONFIG_MEMCG is set.
+
Short descriptions to the page flags:

0. LOCKED
diff --git a/fs/proc/page.c b/fs/proc/page.c
index 7eee2d8b97d9..70d23245dd43 100644
--- a/fs/proc/page.c
+++ b/fs/proc/page.c
@@ -9,6 +9,7 @@
#include <linux/proc_fs.h>
#include <linux/seq_file.h>
#include <linux/hugetlb.h>
+#include <linux/memcontrol.h>
#include <linux/kernel-page-flags.h>
#include <asm/uaccess.h>
#include "internal.h"
@@ -225,10 +226,62 @@ static const struct file_operations proc_kpageflags_operations = {
.read = kpageflags_read,
};

+#ifdef CONFIG_MEMCG
+static ssize_t kpagecgroup_read(struct file *file, char __user *buf,
+ size_t count, loff_t *ppos)
+{
+ u64 __user *out = (u64 __user *)buf;
+ struct page *ppage;
+ unsigned long src = *ppos;
+ unsigned long pfn;
+ ssize_t ret = 0;
+ u64 ino;
+
+ pfn = src / KPMSIZE;
+ count = min_t(unsigned long, count, (max_pfn * KPMSIZE) - src);
+ if (src & KPMMASK || count & KPMMASK)
+ return -EINVAL;
+
+ while (count > 0) {
+ if (pfn_valid(pfn))
+ ppage = pfn_to_page(pfn);
+ else
+ ppage = NULL;
+
+ if (ppage)
+ ino = page_cgroup_ino(ppage);
+ else
+ ino = 0;
+
+ if (put_user(ino, out)) {
+ ret = -EFAULT;
+ break;
+ }
+
+ pfn++;
+ out++;
+ count -= KPMSIZE;
+ }
+
+ *ppos += (char __user *)out - buf;
+ if (!ret)
+ ret = (char __user *)out - buf;
+ return ret;
+}
+
+static const struct file_operations proc_kpagecgroup_operations = {
+ .llseek = mem_lseek,
+ .read = kpagecgroup_read,
+};
+#endif /* CONFIG_MEMCG */
+
static int __init proc_page_init(void)
{
proc_create("kpagecount", S_IRUSR, NULL, &proc_kpagecount_operations);
proc_create("kpageflags", S_IRUSR, NULL, &proc_kpageflags_operations);
+#ifdef CONFIG_MEMCG
+ proc_create("kpagecgroup", S_IRUSR, NULL, &proc_kpagecgroup_operations);
+#endif
return 0;
}
fs_initcall(proc_page_init);

Vladimir Davydov

Jul 11, 2015, 10:49:11 AM7/11/15
to Andrew Morton, Andres Lagar-Cavilla, Minchan Kim, Raghavendra K T, Johannes Weiner, Michal Hocko, Greg Thelen, Michel Lespinasse, David Rientjes, Pavel Emelyanov, Cyrill Gorcunov, Jonathan Corbet, linu...@vger.kernel.org, linu...@vger.kernel.org, linu...@kvack.org, cgr...@vger.kernel.org, linux-...@vger.kernel.org
Knowing the portion of memory that is not used by a certain application
or memory cgroup (idle memory) can be useful for partitioning the system
efficiently, e.g. by setting memory cgroup limits appropriately.
Currently, the only means to estimate the amount of idle memory provided
by the kernel is /proc/PID/{clear_refs,smaps}: the user can clear the
access bit for all pages mapped to a particular process by writing 1 to
clear_refs, wait for some time, and then count smaps:Referenced.
However, this method has two serious shortcomings:

- it does not count unmapped file pages
- it affects the reclaimer logic

To overcome these drawbacks, this patch introduces two new page flags,
Idle and Young, and a new proc file, /proc/kpageidle. A page's Idle flag
can only be set from userspace, by setting the bit in /proc/kpageidle at the
offset corresponding to the page, and it is cleared whenever the page is
accessed either through page tables (it is cleared in page_referenced()
in this case) or using the read(2) system call (mark_page_accessed()).
Thus by setting the Idle flag for pages of a particular workload, which
can be found e.g. by reading /proc/PID/pagemap, waiting for some time to
let the workload access its working set, and then reading the kpageidle
file, one can estimate the amount of pages that are not used by the
workload.

The Young page flag is used to avoid interference with the memory
reclaimer. A page's Young flag is set whenever the Access bit of a page
table entry pointing to the page is cleared by writing to kpageidle. If
page_referenced() is called on a Young page, it will add 1 to its return
value, therefore concealing the fact that the Access bit was cleared.

Note, since there is no room for extra page flags on 32 bit, this
feature uses extended page flags when compiled on 32 bit.

Signed-off-by: Vladimir Davydov <vdav...@parallels.com>
---
Documentation/vm/pagemap.txt | 12 ++-
fs/proc/page.c | 222 +++++++++++++++++++++++++++++++++++++++++++
fs/proc/task_mmu.c | 4 +-
include/linux/mm.h | 98 +++++++++++++++++++
include/linux/page-flags.h | 11 +++
include/linux/page_ext.h | 4 +
mm/Kconfig | 12 +++
mm/debug.c | 4 +
mm/huge_memory.c | 5 +
mm/page_ext.c | 3 +
mm/rmap.c | 5 +
mm/swap.c | 2 +
12 files changed, 380 insertions(+), 2 deletions(-)

diff --git a/Documentation/vm/pagemap.txt b/Documentation/vm/pagemap.txt
index a9b7afc8fbc6..c9266340852c 100644
--- a/Documentation/vm/pagemap.txt
+++ b/Documentation/vm/pagemap.txt
@@ -5,7 +5,7 @@ pagemap is a new (as of 2.6.25) set of interfaces in the kernel that allow
userspace programs to examine the page tables and related information by
reading files in /proc.

-There are four components to pagemap:
+There are five components to pagemap:

* /proc/pid/pagemap. This file lets a userspace process find out which
physical frame each virtual page is mapped to. It contains one 64-bit
@@ -69,6 +69,16 @@ There are four components to pagemap:
memory cgroup each page is charged to, indexed by PFN. Only available when
CONFIG_MEMCG is set.

+ * /proc/kpageidle. This file implements a bitmap where each bit corresponds
+ to a page, indexed by PFN. When the bit is set, the corresponding page is
+ idle. A page is considered idle if it has not been accessed since it was
+ marked idle. To mark a page idle one should set the bit corresponding to the
+ page by writing to the file. A value written to the file is OR-ed with the
+ current bitmap value. Only user memory pages can be marked idle, for other
+ page types input is silently ignored. Writing to this file beyond max PFN
+ results in the ENXIO error. Only available when CONFIG_IDLE_PAGE_TRACKING is
+ set.
+
Short descriptions to the page flags:

0. LOCKED
diff --git a/fs/proc/page.c b/fs/proc/page.c
index 70d23245dd43..e51690c5f173 100644
--- a/fs/proc/page.c
+++ b/fs/proc/page.c
@@ -5,6 +5,7 @@
#include <linux/ksm.h>
#include <linux/mm.h>
#include <linux/mmzone.h>
+#include <linux/rmap.h>
#include <linux/huge_mm.h>
#include <linux/proc_fs.h>
#include <linux/seq_file.h>
@@ -16,6 +17,7 @@

#define KPMSIZE sizeof(u64)
#define KPMMASK (KPMSIZE - 1)
+#define KPMBITS (KPMSIZE * BITS_PER_BYTE)

/* /proc/kpagecount - an array exposing page counts
*
@@ -275,6 +277,222 @@ static const struct file_operations proc_kpagecgroup_operations = {
};
#endif /* CONFIG_MEMCG */

+#ifdef CONFIG_IDLE_PAGE_TRACKING
+/*
+ * Idle page tracking only considers user memory pages, for other types of
+ * pages the idle flag is always unset and an attempt to set it is silently
+ * ignored.
+ *
+ * We treat a page as a user memory page if it is on an LRU list, because it is
+ * always safe to pass such a page to rmap_walk(), which is essential for idle
+ * page tracking. With such an indicator of user pages we can skip isolated
+ * pages, but since there are not usually many of them, it will hardly affect
+ * the overall result.
+ *
+ * This function tries to get a user memory page by pfn as described above.
+ */
+static struct page *kpageidle_get_page(unsigned long pfn)
+{
+ struct page *page;
+ struct zone *zone;
+
+ if (!pfn_valid(pfn))
+ return NULL;
+
+ page = pfn_to_page(pfn);
+ if (!page || PageTail(page) || !PageLRU(page) ||
+ !get_page_unless_zero(page))
+ return NULL;
+
+ if (unlikely(PageTail(page))) {
+ put_page(page);
+ return NULL;
+ }
+
+ zone = page_zone(page);
+ spin_lock_irq(&zone->lru_lock);
+ if (unlikely(!PageLRU(page))) {
+ put_page(page);
+ page = NULL;
+ }
+ spin_unlock_irq(&zone->lru_lock);
+ return page;
+}
+
+static int kpageidle_clear_pte_refs_one(struct page *page,
+ struct vm_area_struct *vma,
+ unsigned long addr, void *arg)
+{
+ struct mm_struct *mm = vma->vm_mm;
+ spinlock_t *ptl;
+ pmd_t *pmd;
+ pte_t *pte;
+ bool referenced = false;
+
+ if (unlikely(PageTransHuge(page))) {
+ pmd = page_check_address_pmd(page, mm, addr,
+ PAGE_CHECK_ADDRESS_PMD_FLAG, &ptl);
+ if (pmd) {
+ referenced = pmdp_test_and_clear_young(vma, addr, pmd);
+ spin_unlock(ptl);
+ }
+ } else {
+ pte = page_check_address(page, mm, addr, &ptl, 0);
+ if (pte) {
+ referenced = ptep_test_and_clear_young(vma, addr, pte);
+ pte_unmap_unlock(pte, ptl);
+ }
+ }
+ if (referenced) {
+ clear_page_idle(page);
+ /*
+ * We cleared the referenced bit in a mapping to this page. To
+ * avoid interference with page reclaim, mark it young so that
+ * page_referenced() will return > 0.
+ */
+ set_page_young(page);
+ }
+ return SWAP_AGAIN;
+}
+
+static void kpageidle_clear_pte_refs(struct page *page)
+{
+ struct rmap_walk_control rwc = {
+ .rmap_one = kpageidle_clear_pte_refs_one,
+ .anon_lock = page_lock_anon_vma_read,
+ };
+ bool need_lock;
+
+ if (!page_mapped(page) ||
+ !page_rmapping(page))
+ return;
+
+ need_lock = !PageAnon(page) || PageKsm(page);
+ if (need_lock && !trylock_page(page))
+ return;
+
+ rmap_walk(page, &rwc);
+
+ if (need_lock)
+ unlock_page(page);
+}
+
+static ssize_t kpageidle_read(struct file *file, char __user *buf,
+ size_t count, loff_t *ppos)
+{
+ u64 __user *out = (u64 __user *)buf;
+ struct page *page;
+ unsigned long pfn, end_pfn;
+ ssize_t ret = 0;
+ u64 idle_bitmap = 0;
+ int bit;
+
+ if (*ppos & KPMMASK || count & KPMMASK)
+ return -EINVAL;
+
+ pfn = *ppos * BITS_PER_BYTE;
+ if (pfn >= max_pfn)
+ return -ENXIO;
+
+ end_pfn = pfn + count * BITS_PER_BYTE;
+ if (end_pfn > max_pfn)
+ end_pfn = ALIGN(max_pfn, KPMBITS);
+
+ for (; pfn < end_pfn; pfn++) {
+ bit = pfn % KPMBITS;
+ page = kpageidle_get_page(pfn);
+ if (page) {
+ if (page_is_idle(page)) {
+ /*
+ * The page might have been referenced via a pte, in
+ * which case it is not idle. Clear refs and recheck.
+ */
+ kpageidle_clear_pte_refs(page);
+ if (page_is_idle(page))
+ idle_bitmap |= 1ULL << bit;
+ }
+ put_page(page);
+ }
+ if (bit == KPMBITS - 1) {
+ if (put_user(idle_bitmap, out)) {
+ ret = -EFAULT;
+ break;
+ }
+ idle_bitmap = 0;
+ out++;
+ }
+ }
+
+ *ppos += (char __user *)out - buf;
+ if (!ret)
+ ret = (char __user *)out - buf;
+ return ret;
+}
+
+static ssize_t kpageidle_write(struct file *file, const char __user *buf,
+ size_t count, loff_t *ppos)
+{
+ const u64 __user *in = (const u64 __user *)buf;
+ struct page *page;
+ unsigned long pfn, end_pfn;
+ ssize_t ret = 0;
+ u64 idle_bitmap = 0;
+ int bit;
+
+ if (*ppos & KPMMASK || count & KPMMASK)
+ return -EINVAL;
+
+ pfn = *ppos * BITS_PER_BYTE;
+ if (pfn >= max_pfn)
+ return -ENXIO;
+
+ end_pfn = pfn + count * BITS_PER_BYTE;
+ if (end_pfn > max_pfn)
+ end_pfn = ALIGN(max_pfn, KPMBITS);
+
+ for (; pfn < end_pfn; pfn++) {
+ bit = pfn % KPMBITS;
+ if (bit == 0) {
+ if (get_user(idle_bitmap, in)) {
+ ret = -EFAULT;
+ break;
+ }
+ in++;
+ }
+ if (idle_bitmap >> bit & 1) {
+ page = kpageidle_get_page(pfn);
+ if (page) {
+ kpageidle_clear_pte_refs(page);
+ set_page_idle(page);
+ put_page(page);
+ }
+ }
+ }
+
+ *ppos += (const char __user *)in - buf;
+ if (!ret)
+ ret = (const char __user *)in - buf;
+ return ret;
+}
+
+static const struct file_operations proc_kpageidle_operations = {
+ .llseek = mem_lseek,
+ .read = kpageidle_read,
+ .write = kpageidle_write,
+};
+
+#ifndef CONFIG_64BIT
+static bool need_page_idle(void)
+{
+ return true;
+}
+struct page_ext_operations page_idle_ops = {
+ .need = need_page_idle,
+};
+#endif
+#endif /* CONFIG_IDLE_PAGE_TRACKING */
+
static int __init proc_page_init(void)
{
proc_create("kpagecount", S_IRUSR, NULL, &proc_kpagecount_operations);
@@ -282,6 +500,10 @@ static int __init proc_page_init(void)
#ifdef CONFIG_MEMCG
proc_create("kpagecgroup", S_IRUSR, NULL, &proc_kpagecgroup_operations);
#endif
+#ifdef CONFIG_IDLE_PAGE_TRACKING
+ proc_create("kpageidle", S_IRUSR | S_IWUSR, NULL,
+ &proc_kpageidle_operations);
+#endif
return 0;
}
fs_initcall(proc_page_init);
diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 3b4d8255e806..3efd7f641f92 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -458,7 +458,7 @@ static void smaps_account(struct mem_size_stats *mss, struct page *page,

mss->resident += size;
/* Accumulate the size in pages that have been accessed. */
- if (young || PageReferenced(page))
+ if (young || page_is_young(page) || PageReferenced(page))
mss->referenced += size;
mapcount = page_mapcount(page);
if (mapcount >= 2) {
@@ -810,6 +810,7 @@ static int clear_refs_pte_range(pmd_t *pmd, unsigned long addr,

/* Clear accessed and referenced bits. */
pmdp_test_and_clear_young(vma, addr, pmd);
+ test_and_clear_page_young(page);
ClearPageReferenced(page);
out:
spin_unlock(ptl);
@@ -837,6 +838,7 @@ out:

/* Clear accessed and referenced bits. */
ptep_test_and_clear_young(vma, addr, pte);
+ test_and_clear_page_young(page);
ClearPageReferenced(page);
}
pte_unmap_unlock(pte - 1, ptl);
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 7f471789781a..de450c1191b9 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2205,5 +2205,103 @@ void __init setup_nr_node_ids(void);
static inline void setup_nr_node_ids(void) {}
#endif

+#ifdef CONFIG_IDLE_PAGE_TRACKING
+#ifdef CONFIG_64BIT
+static inline bool page_is_young(struct page *page)
+{
+ return PageYoung(page);
+}
+
+static inline void set_page_young(struct page *page)
+{
+ SetPageYoung(page);
+}
+
+static inline bool test_and_clear_page_young(struct page *page)
+{
+ return TestClearPageYoung(page);
+}
+
+static inline bool page_is_idle(struct page *page)
+{
+ return PageIdle(page);
+}
+
+static inline void set_page_idle(struct page *page)
+{
+ SetPageIdle(page);
+}
+
+static inline void clear_page_idle(struct page *page)
+{
+ ClearPageIdle(page);
+}
+#else /* !CONFIG_64BIT */
+/*
+ * If there is not enough space to store the Idle and Young bits in the page
+ * flags, use the page ext flags instead.
+ */
+extern struct page_ext_operations page_idle_ops;
+
+static inline bool page_is_young(struct page *page)
+{
+ return test_bit(PAGE_EXT_YOUNG, &lookup_page_ext(page)->flags);
+}
+
+static inline void set_page_young(struct page *page)
+{
+ set_bit(PAGE_EXT_YOUNG, &lookup_page_ext(page)->flags);
+}
+
+static inline bool test_and_clear_page_young(struct page *page)
+{
+ return test_and_clear_bit(PAGE_EXT_YOUNG,
+ &lookup_page_ext(page)->flags);
+}
+
+static inline bool page_is_idle(struct page *page)
+{
+ return test_bit(PAGE_EXT_IDLE, &lookup_page_ext(page)->flags);
+}
+
+static inline void set_page_idle(struct page *page)
+{
+ set_bit(PAGE_EXT_IDLE, &lookup_page_ext(page)->flags);
+}
+
+static inline void clear_page_idle(struct page *page)
+{
+ clear_bit(PAGE_EXT_IDLE, &lookup_page_ext(page)->flags);
+}
+#endif /* CONFIG_64BIT */
+#else /* !CONFIG_IDLE_PAGE_TRACKING */
+static inline bool page_is_young(struct page *page)
+{
+ return false;
+}
+
+static inline void set_page_young(struct page *page)
+{
+}
+
+static inline bool test_and_clear_page_young(struct page *page)
+{
+ return false;
+}
+
+static inline bool page_is_idle(struct page *page)
+{
+ return false;
+}
+
+static inline void set_page_idle(struct page *page)
+{
+}
+
+static inline void clear_page_idle(struct page *page)
+{
+}
+#endif /* CONFIG_IDLE_PAGE_TRACKING */
+
#endif /* __KERNEL__ */
#endif /* _LINUX_MM_H */
diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index 91b7f9b2b774..478f2241f284 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -109,6 +109,10 @@ enum pageflags {
#ifdef CONFIG_TRANSPARENT_HUGEPAGE
PG_compound_lock,
#endif
+#if defined(CONFIG_IDLE_PAGE_TRACKING) && defined(CONFIG_64BIT)
+ PG_young,
+ PG_idle,
+#endif
__NR_PAGEFLAGS,

/* Filesystems */
@@ -363,6 +367,13 @@ PAGEFLAG_FALSE(HWPoison)
#define __PG_HWPOISON 0
#endif

+#if defined(CONFIG_IDLE_PAGE_TRACKING) && defined(CONFIG_64BIT)
+TESTPAGEFLAG(Young, young, PF_ANY)
+SETPAGEFLAG(Young, young, PF_ANY)
+TESTCLEARFLAG(Young, young, PF_ANY)
+PAGEFLAG(Idle, idle, PF_ANY)
+#endif
+
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 9671f51e954d..db404966faf4 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1754,6 +1754,11 @@ static void __split_huge_page_refcount(struct page *page,
/* clear PageTail before overwriting first_page */
smp_wmb();

+ if (page_is_young(page))
+ set_page_young(page_tail);
+ if (page_is_idle(page))
+ set_page_idle(page_tail);
+
/*
* __split_huge_page_splitting() already set the
* splitting bit in all pmd that could map this
diff --git a/mm/page_ext.c b/mm/page_ext.c
index d86fd2f5353f..e4b3af054bf2 100644
--- a/mm/page_ext.c
+++ b/mm/page_ext.c
@@ -59,6 +59,9 @@ static struct page_ext_operations *page_ext_ops[] = {
#ifdef CONFIG_PAGE_OWNER
&page_owner_ops,
#endif
+#if defined(CONFIG_IDLE_PAGE_TRACKING) && !defined(CONFIG_64BIT)
+ &page_idle_ops,
+#endif
};

static unsigned long total_usage;
diff --git a/mm/rmap.c b/mm/rmap.c
index 49b244b1f18c..c96677ade3d1 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -798,6 +798,11 @@ static int page_referenced_one(struct page *page, struct vm_area_struct *vma,
pte_unmap_unlock(pte, ptl);
}

+ if (referenced)
+ clear_page_idle(page);
+ if (test_and_clear_page_young(page))
+ referenced++;
+
if (referenced) {
pra->referenced++;
pra->vm_flags |= vma->vm_flags;
diff --git a/mm/swap.c b/mm/swap.c
index ab7c338eda87..db43c9b4891d 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -623,6 +623,8 @@ void mark_page_accessed(struct page *page)
} else if (!PageReferenced(page)) {
SetPageReferenced(page);
}
+ if (page_is_idle(page))
+ clear_page_idle(page);
}
EXPORT_SYMBOL(mark_page_accessed);

Vladimir Davydov

Jul 11, 2015, 10:49:19 AM7/11/15
to Andrew Morton, Andres Lagar-Cavilla, Minchan Kim, Raghavendra K T, Johannes Weiner, Michal Hocko, Greg Thelen, Michel Lespinasse, David Rientjes, Pavel Emelyanov, Cyrill Gorcunov, Jonathan Corbet, linu...@vger.kernel.org, linu...@vger.kernel.org, linu...@kvack.org, cgr...@vger.kernel.org, linux-...@vger.kernel.org
As noted by Minchan, a benefit of reading the idle flag from
/proc/kpageflags is that one can easily filter out dirty and/or
unevictable pages while estimating the size of unused memory.

Note that idle flag read from /proc/kpageflags may be stale in case the
page was accessed via a PTE, because it would be too costly to iterate
over all page mappings on each /proc/kpageflags read to provide an
up-to-date value. To make sure the flag is up-to-date one has to read
/proc/kpageidle first.

Signed-off-by: Vladimir Davydov <vdav...@parallels.com>
---
Documentation/vm/pagemap.txt | 6 ++++++
fs/proc/page.c | 3 +++
include/uapi/linux/kernel-page-flags.h | 1 +
3 files changed, 10 insertions(+)

diff --git a/Documentation/vm/pagemap.txt b/Documentation/vm/pagemap.txt
index c9266340852c..5896b7d7fd74 100644
--- a/Documentation/vm/pagemap.txt
+++ b/Documentation/vm/pagemap.txt
@@ -64,6 +64,7 @@ There are five components to pagemap:
22. THP
23. BALLOON
24. ZERO_PAGE
+ 25. IDLE

* /proc/kpagecgroup. This file contains a 64-bit inode number of the
memory cgroup each page is charged to, indexed by PFN. Only available when
@@ -124,6 +125,11 @@ Short descriptions to the page flags:
24. ZERO_PAGE
zero page for pfn_zero or huge_zero page

+25. IDLE
+ page has not been accessed since it was marked idle (see /proc/kpageidle)
+ Note that this flag may be stale in case the page was accessed via a PTE.
+ To make sure the flag is up-to-date one has to read /proc/kpageidle first.
+
[IO related page flags]
1. ERROR IO error occurred
3. UPTODATE page has up-to-date data
diff --git a/fs/proc/page.c b/fs/proc/page.c
index e51690c5f173..4fa4eadfd30e 100644
--- a/fs/proc/page.c
+++ b/fs/proc/page.c
@@ -149,6 +149,9 @@ u64 stable_page_flags(struct page *page)
if (PageBalloon(page))
u |= 1 << KPF_BALLOON;

+ if (page_is_idle(page))
+ u |= 1 << KPF_IDLE;
+
u |= kpf_copy_bit(k, KPF_LOCKED, PG_locked);

u |= kpf_copy_bit(k, KPF_SLAB, PG_slab);
diff --git a/include/uapi/linux/kernel-page-flags.h b/include/uapi/linux/kernel-page-flags.h
index a6c4962e5d46..5da5f8751ce7 100644
--- a/include/uapi/linux/kernel-page-flags.h
+++ b/include/uapi/linux/kernel-page-flags.h
@@ -33,6 +33,7 @@
#define KPF_THP 22
#define KPF_BALLOON 23
#define KPF_ZERO_PAGE 24
+#define KPF_IDLE 25


#endif /* _UAPILINUX_KERNEL_PAGE_FLAGS_H */

Vladimir Davydov

Jul 11, 2015, 10:50:00 AM7/11/15
to Andrew Morton, Andres Lagar-Cavilla, Minchan Kim, Raghavendra K T, Johannes Weiner, Michal Hocko, Greg Thelen, Michel Lespinasse, David Rientjes, Pavel Emelyanov, Cyrill Gorcunov, Jonathan Corbet, linu...@vger.kernel.org, linu...@vger.kernel.org, linu...@kvack.org, cgr...@vger.kernel.org, linux-...@vger.kernel.org
Hwpoison allows filtering pages by memory cgroup ino. Currently, it
calls try_get_mem_cgroup_from_page to obtain the cgroup from a page and
then gets its ino using cgroup_ino, but now we have a more apt helper
for that, page_cgroup_ino, so use it instead.

Signed-off-by: Vladimir Davydov <vdav...@parallels.com>
---

Vladimir Davydov

Jul 11, 2015, 10:50:25 AM7/11/15
to Andrew Morton, Andres Lagar-Cavilla, Minchan Kim, Raghavendra K T, Johannes Weiner, Michal Hocko, Greg Thelen, Michel Lespinasse, David Rientjes, Pavel Emelyanov, Cyrill Gorcunov, Jonathan Corbet, linu...@vger.kernel.org, linu...@vger.kernel.org, linu...@kvack.org, cgr...@vger.kernel.org, linux-...@vger.kernel.org
This function returns the inode number of the closest online ancestor of
the memory cgroup a page is charged to. It is required for exporting
information about which page is charged to which cgroup to userspace,
which will be introduced by a following patch.

Signed-off-by: Vladimir Davydov <vdav...@parallels.com>
---
include/linux/memcontrol.h | 1 +
mm/memcontrol.c | 23 +++++++++++++++++++++++
2 files changed, 24 insertions(+)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 73b02b0a8f60..50069abebc3c 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -116,6 +116,7 @@ static inline bool mm_match_cgroup(struct mm_struct *mm,

extern struct cgroup_subsys_state *mem_cgroup_css(struct mem_cgroup *memcg);
extern struct cgroup_subsys_state *mem_cgroup_css_from_page(struct page *page);
+extern unsigned long page_cgroup_ino(struct page *page);

struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *,
struct mem_cgroup *,
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index acb93c554f6e..894dc2169979 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -631,6 +631,29 @@ struct cgroup_subsys_state *mem_cgroup_css_from_page(struct page *page)
return &memcg->css;
}

+/**
+ * page_cgroup_ino - return inode number of the memcg a page is charged to
+ * @page: the page
+ *
+ * Look up the closest online ancestor of the memory cgroup @page is charged to
+ * and return its inode number or 0 if @page is not charged to any cgroup. It
+ * is safe to call this function without holding a reference to @page.
+ */
+unsigned long page_cgroup_ino(struct page *page)
+{
+ struct mem_cgroup *memcg;
+ unsigned long ino = 0;
+
+ rcu_read_lock();
+ memcg = READ_ONCE(page->mem_cgroup);
+ while (memcg && !(memcg->css.flags & CSS_ONLINE))
+ memcg = parent_mem_cgroup(memcg);
+ if (memcg)
+ ino = cgroup_ino(memcg->css.cgroup);
+ rcu_read_unlock();
+ return ino;
+}
+
static struct mem_cgroup_per_zone *
mem_cgroup_page_zoneinfo(struct mem_cgroup *memcg, struct page *page)
{

Vladimir Davydov

Jul 11, 2015, 10:53:58 AM7/11/15
to Andres Lagar-Cavilla, Andrew Morton, Minchan Kim, Raghavendra K T, Johannes Weiner, Michal Hocko, Greg Thelen, Michel Lespinasse, David Rientjes, Pavel Emelyanov, Cyrill Gorcunov, Jonathan Corbet, linu...@vger.kernel.org, linu...@vger.kernel.org, linu...@kvack.org, cgr...@vger.kernel.org, linux-...@vger.kernel.org
On Thu, Jul 09, 2015 at 04:19:00PM +0300, Vladimir Davydov wrote:
> On Wed, Jul 08, 2015 at 04:01:13PM -0700, Andres Lagar-Cavilla wrote:
> > On Fri, Jun 12, 2015 at 2:52 AM, Vladimir Davydov
Oh, the comment is already present - it's in the description of this
function. Minchan asked me to add it a long time ago, and so I did.
Completely forgot about it.

Thanks,
Vladimir

Andres Lagar-Cavilla

Jul 13, 2015, 3:03:11 PM7/13/15
to Vladimir Davydov, Andrew Morton, Minchan Kim, Raghavendra K T, Johannes Weiner, Michal Hocko, Greg Thelen, Michel Lespinasse, David Rientjes, Pavel Emelyanov, Cyrill Gorcunov, Jonathan Corbet, linu...@vger.kernel.org, linu...@vger.kernel.org, linu...@kvack.org, cgr...@vger.kernel.org, linux-...@vger.kernel.org
Thanks for the updates, addressing THP and TLB flushing, very
elegantly. Some quick early reaction. I may come back for more :)
get_page_unless_zero does not succeed for Tail pages.
VM_BUG_ON(!PageHead)?

> + pmd = page_check_address_pmd(page, mm, addr,
> + PAGE_CHECK_ADDRESS_PMD_FLAG, &ptl);
> + if (pmd) {
> + referenced = pmdp_test_and_clear_young(vma, addr, pmd);

For any workload using MMU notifiers, this will lose significant
information by not querying the secondary PTE. The most
straightforward case is KVM. Once mappings are setup, all access
activity is recorded through shadow PTEs. This interface will say
"idle" even though the VM is blasting memory.

Andres
Andres Lagar-Cavilla | Google Kernel Team | andr...@google.com

Vladimir Davydov

Jul 14, 2015, 7:05:53 AM7/14/15
to Andres Lagar-Cavilla, Andrew Morton, Minchan Kim, Raghavendra K T, Johannes Weiner, Michal Hocko, Greg Thelen, Michel Lespinasse, David Rientjes, Pavel Emelyanov, Cyrill Gorcunov, Jonathan Corbet, linu...@vger.kernel.org, linu...@vger.kernel.org, linu...@kvack.org, cgr...@vger.kernel.org, linux-...@vger.kernel.org
On Mon, Jul 13, 2015 at 12:02:57PM -0700, Andres Lagar-Cavilla wrote:
> On Sat, Jul 11, 2015 at 7:48 AM, Vladimir Davydov
> <vdav...@parallels.com> wrote:
[...]
> > +static struct page *kpageidle_get_page(unsigned long pfn)
> > +{
> > + struct page *page;
> > + struct zone *zone;
> > +
> > + if (!pfn_valid(pfn))
> > + return NULL;
> > +
> > + page = pfn_to_page(pfn);
> > + if (!page || PageTail(page) || !PageLRU(page) ||
> > + !get_page_unless_zero(page))
>
> get_page_unless_zero does not succeed for Tail pages.

True. So we don't seem to need the PageTail checks here at all, because
if kpageidle_get_page succeeds, the page must be a head, so that we
won't dive into expensive rmap_walk for tail pages. Will remove it then.
Don't think it's necessary, because PageTransHuge already does this sort
of check:

: static inline int PageTransHuge(struct page *page)
: {
: VM_BUG_ON_PAGE(PageTail(page), page);
: return PageHead(page);
: }

>
> > + pmd = page_check_address_pmd(page, mm, addr,
> > + PAGE_CHECK_ADDRESS_PMD_FLAG, &ptl);
> > + if (pmd) {
> > + referenced = pmdp_test_and_clear_young(vma, addr, pmd);
>
> For any workload using MMU notifiers, this will lose significant
> information by not querying the secondary PTE. The most
> straightforward case is KVM. Once mappings are setup, all access
> activity is recorded through shadow PTEs. This interface will say
> "idle" even though the VM is blasting memory.

Hmm, interesting. It seems we have to introduce
mmu_notifier_ops.clear_young then, which, in contrast to
clear_flush_young, won't flush TLB. Looking back at your comment to v6,
now I see that you already mentioned it, but I missed your point :-(
OK, will do it in the next iteration.

Thanks a lot for the review!

Vladimir

Andres Lagar-Cavilla

Jul 14, 2015, 4:27:33 PM7/14/15
to Vladimir Davydov, Andrew Morton, Minchan Kim, Raghavendra K T, Johannes Weiner, Michal Hocko, Greg Thelen, Michel Lespinasse, David Rientjes, Pavel Emelyanov, Cyrill Gorcunov, Jonathan Corbet, linu...@vger.kernel.org, linu...@vger.kernel.org, linu...@kvack.org, cgr...@vger.kernel.org, linux-...@vger.kernel.org
On Tue, Jul 14, 2015 at 4:05 AM, Vladimir Davydov
There's clearly value in fixing things for KVM, but I don't have
knowledge of the other MMU notifiers. I like clear_young, maybe other
mmu notifiers will turn this into a no-op().

mmmmhh. What about TLB flushing in the mmu notifier? I guess that can
be internal to each implementation.

Andres
>
> Thanks a lot for the review!
>
> Vladimir



--
Andres Lagar-Cavilla | Google Kernel Team | andr...@google.com

Vladimir Davydov

Jul 15, 2015, 9:54:35 AM7/15/15
to Andrew Morton, Andres Lagar-Cavilla, Minchan Kim, Raghavendra K T, Johannes Weiner, Michal Hocko, Greg Thelen, Michel Lespinasse, David Rientjes, Pavel Emelyanov, Cyrill Gorcunov, Jonathan Corbet, linu...@vger.kernel.org, linu...@vger.kernel.org, linu...@kvack.org, cgr...@vger.kernel.org, linux-...@vger.kernel.org
* /proc/kpageidle. This file implements a bitmap where each bit corresponds
to a page, indexed by PFN. When the bit is set, the corresponding page is
idle. A page is considered idle if it has not been accessed since it was
marked idle. To mark a page idle one should set the bit corresponding to the
page by writing to the file. A value written to the file is OR-ed with the
current bitmap value. Only user memory pages can be marked idle, for other
page types input is silently ignored. Writing to this file beyond max PFN
results in the ENXIO error. Only available when CONFIG_IDLE_PAGE_TRACKING is
set.

This file can be used to estimate the amount of pages that are not
used by a particular workload as follows:

1. mark all pages of interest idle by setting corresponding bits in the
/proc/kpageidle bitmap
2. wait until the workload accesses its working set
3. read /proc/kpageidle and count the number of bits set

* /proc/kpagecgroup. This file contains a 64-bit inode number of the
memory cgroup each page is charged to, indexed by PFN. Only available when
CONFIG_MEMCG is set.

This file can be used to find all pages (including unmapped file
pages) accounted to a particular cgroup. Using /proc/kpageidle, one
can then estimate the cgroup working set size.

For an example of using these files for estimating the amount of unused
memory pages per each memory cgroup, please see the script attached
below.

---- REASONING ----

The reason to introduce the new user API instead of using
/proc/PID/{clear_refs,smaps} is that the latter has two serious
drawbacks:

- it does not count unmapped file pages
- it affects the reclaimer logic

The new API attempts to overcome them both. For more details on how it
is achieved, please see the comment to patch 5.

---- CHANGE LOG ----

Changes in v8:

- clear the referenced/accessed bit in secondary ptes while accessing
/proc/kpageidle; this is required to estimate the working set size of
KVM VMs (Andres)
- check the young flag when collapsing a huge page
- copy idle/young flags on page migration
v7: https://lkml.org/lkml/2015/7/11/119
v6: https://lkml.org/lkml/2015/6/12/301
v5: https://lkml.org/lkml/2015/5/12/449
v4: https://lkml.org/lkml/2015/5/7/580
v3: https://lkml.org/lkml/2015/4/28/224
v2: https://lkml.org/lkml/2015/4/7/260
v1: https://lkml.org/lkml/2015/3/18/794

---- PATCH SET STRUCTURE ----

The patch set is organized as follows:

- patch 1 adds page_cgroup_ino() helper for the sake of
/proc/kpagecgroup and patches 2-3 do related cleanup
- patch 4 adds /proc/kpagecgroup, which reports cgroup ino each page is
charged to
- patch 5 introduces a new mmu notifier callback, clear_young, which is
a lightweight version of clear_flush_young; it is used in patch 6
- patch 6 implements the idle page tracking feature, including the
userspace API, /proc/kpageidle
- patch 7 exports idle flag via /proc/kpageflags
Vladimir Davydov (7):
memcg: add page_cgroup_ino helper
hwpoison: use page_cgroup_ino for filtering by memcg
memcg: zap try_get_mem_cgroup_from_page
proc: add kpagecgroup file
mmu-notifier: add clear_young callback
proc: add kpageidle file
proc: export idle flag via kpageflags

Documentation/vm/pagemap.txt | 22 ++-
fs/proc/page.c | 274 +++++++++++++++++++++++++++++++++
fs/proc/task_mmu.c | 4 +-
include/linux/memcontrol.h | 7 +-
include/linux/mm.h | 98 ++++++++++++
include/linux/mmu_notifier.h | 44 ++++++
include/linux/page-flags.h | 11 ++
include/linux/page_ext.h | 4 +
include/uapi/linux/kernel-page-flags.h | 1 +
mm/Kconfig | 12 ++
mm/debug.c | 4 +
mm/huge_memory.c | 11 +-
mm/hwpoison-inject.c | 5 +-
mm/memcontrol.c | 71 +++++----
mm/memory-failure.c | 16 +-
mm/migrate.c | 5 +
mm/mmu_notifier.c | 17 ++
mm/page_ext.c | 3 +
mm/rmap.c | 5 +
mm/swap.c | 2 +
virt/kvm/kvm_main.c | 18 +++
21 files changed, 570 insertions(+), 64 deletions(-)

--
2.1.4

Vladimir Davydov

Jul 15, 2015, 9:54:40 AM7/15/15
to Andrew Morton, Andres Lagar-Cavilla, Minchan Kim, Raghavendra K T, Johannes Weiner, Michal Hocko, Greg Thelen, Michel Lespinasse, David Rientjes, Pavel Emelyanov, Cyrill Gorcunov, Jonathan Corbet, linu...@vger.kernel.org, linu...@vger.kernel.org, linu...@kvack.org, cgr...@vger.kernel.org, linux-...@vger.kernel.org
+unsigned long page_cgroup_ino(struct page *page)
+{
+ struct mem_cgroup *memcg;
+ unsigned long ino = 0;
+
+ rcu_read_lock();
+ memcg = READ_ONCE(page->mem_cgroup);
+ while (memcg && !(memcg->css.flags & CSS_ONLINE))
+ memcg = parent_mem_cgroup(memcg);
+ if (memcg)
+ ino = cgroup_ino(memcg->css.cgroup);
+ rcu_read_unlock();
+ return ino;
+}
+
static struct mem_cgroup_per_zone *
mem_cgroup_page_zoneinfo(struct mem_cgroup *memcg, struct page *page)
{

Vladimir Davydov

Jul 15, 2015, 9:54:50 AM7/15/15
to Andrew Morton, Andres Lagar-Cavilla, Minchan Kim, Raghavendra K T, Johannes Weiner, Michal Hocko, Greg Thelen, Michel Lespinasse, David Rientjes, Pavel Emelyanov, Cyrill Gorcunov, Jonathan Corbet, linu...@vger.kernel.org, linu...@vger.kernel.org, linu...@kvack.org, cgr...@vger.kernel.org, linux-...@vger.kernel.org
Hwpoison allows filtering pages by memory cgroup ino. Currently, it
calls try_get_mem_cgroup_from_page to obtain the cgroup from a page and
then gets its ino using cgroup_ino, but now we have a more apt helper
for that, page_cgroup_ino, so use it instead.

Signed-off-by: Vladimir Davydov <vdav...@parallels.com>
---

Vladimir Davydov

Jul 15, 2015, 9:54:59 AM7/15/15
to Andrew Morton, Andres Lagar-Cavilla, Minchan Kim, Raghavendra K T, Johannes Weiner, Michal Hocko, Greg Thelen, Michel Lespinasse, David Rientjes, Pavel Emelyanov, Cyrill Gorcunov, Jonathan Corbet, linu...@vger.kernel.org, linu...@vger.kernel.org, linu...@kvack.org, cgr...@vger.kernel.org, linux-...@vger.kernel.org
It is only used in mem_cgroup_try_charge, so fold it in and zap it.

Signed-off-by: Vladimir Davydov <vdav...@parallels.com>
---
include/linux/memcontrol.h | 6 ------
mm/memcontrol.c | 48 ++++++++++++----------------------------------
2 files changed, 12 insertions(+), 42 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 50069abebc3c..635edfe06bac 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -94,7 +94,6 @@ bool mem_cgroup_is_descendant(struct mem_cgroup *memcg,
struct mem_cgroup *root);
bool task_in_mem_cgroup(struct task_struct *task, struct mem_cgroup *memcg);

-extern struct mem_cgroup *try_get_mem_cgroup_from_page(struct page *page);
extern struct mem_cgroup *mem_cgroup_from_task(struct task_struct *p);

extern struct mem_cgroup *parent_mem_cgroup(struct mem_cgroup *memcg);
@@ -259,11 +258,6 @@ static inline struct lruvec *mem_cgroup_page_lruvec(struct page *page,
return &zone->lruvec;
}

-static inline struct mem_cgroup *try_get_mem_cgroup_from_page(struct page *page)
-{
- return NULL;
-}
-
static inline bool mm_match_cgroup(struct mm_struct *mm,
struct mem_cgroup *memcg)
{
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 894dc2169979..fa1447fcba33 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
+
+ rcu_read_lock();
+ memcg = mem_cgroup_from_id(id);
+ if (memcg && !css_tryget_online(&memcg->css))
+ memcg = NULL;
+ rcu_read_unlock();
+ }
}

if (PageTransHuge(page)) {
@@ -5637,8 +5615,6 @@ int mem_cgroup_try_charge(struct page *page, struct mm_struct *mm,
VM_BUG_ON_PAGE(!PageTransHuge(page), page);
}

- if (do_swap_account && PageSwapCache(page))
- memcg = try_get_mem_cgroup_from_page(page);
if (!memcg)
memcg = get_mem_cgroup_from_mm(mm);

Vladimir Davydov

Jul 15, 2015, 9:55:14 AM7/15/15
to Andrew Morton, Andres Lagar-Cavilla, Minchan Kim, Raghavendra K T, Johannes Weiner, Michal Hocko, Greg Thelen, Michel Lespinasse, David Rientjes, Pavel Emelyanov, Cyrill Gorcunov, Jonathan Corbet, linu...@vger.kernel.org, linu...@vger.kernel.org, linu...@kvack.org, cgr...@vger.kernel.org, linux-...@vger.kernel.org
/proc/kpagecgroup contains a 64-bit inode number of the memory cgroup
each page is charged to, indexed by PFN. Having this information is
useful for estimating a cgroup's working set size.

The file is present if CONFIG_PROC_PAGE_MONITOR && CONFIG_MEMCG.

Signed-off-by: Vladimir Davydov <vdav...@parallels.com>
---
Documentation/vm/pagemap.txt | 6 ++++-
fs/proc/page.c | 53 ++++++++++++++++++++++++++++++++++++++++++++
2 files changed, 58 insertions(+), 1 deletion(-)

diff --git a/Documentation/vm/pagemap.txt b/Documentation/vm/pagemap.txt
index 6bfbc172cdb9..a9b7afc8fbc6 100644
--- a/Documentation/vm/pagemap.txt
+++ b/Documentation/vm/pagemap.txt
@@ -5,7 +5,7 @@ pagemap is a new (as of 2.6.25) set of interfaces in the kernel that allow
userspace programs to examine the page tables and related information by
reading files in /proc.

-There are three components to pagemap:
+There are four components to pagemap:

* /proc/pid/pagemap. This file lets a userspace process find out which
physical frame each virtual page is mapped to. It contains one 64-bit
@@ -65,6 +65,10 @@ There are three components to pagemap:
23. BALLOON
24. ZERO_PAGE

+ * /proc/kpagecgroup. This file contains a 64-bit inode number of the
+ memory cgroup each page is charged to, indexed by PFN. Only available when
+ CONFIG_MEMCG is set.
+
Short descriptions to the page flags:

0. LOCKED
diff --git a/fs/proc/page.c b/fs/proc/page.c
index 7eee2d8b97d9..70d23245dd43 100644
--- a/fs/proc/page.c
+++ b/fs/proc/page.c
@@ -9,6 +9,7 @@
#include <linux/proc_fs.h>
#include <linux/seq_file.h>
#include <linux/hugetlb.h>
+#include <linux/memcontrol.h>
#include <linux/kernel-page-flags.h>
#include <asm/uaccess.h>
#include "internal.h"
@@ -225,10 +226,62 @@ static const struct file_operations proc_kpageflags_operations = {
.read = kpageflags_read,
};

+#ifdef CONFIG_MEMCG
+static ssize_t kpagecgroup_read(struct file *file, char __user *buf,
+ size_t count, loff_t *ppos)
+{
+ u64 __user *out = (u64 __user *)buf;
+ struct page *ppage;
+ unsigned long src = *ppos;
+ unsigned long pfn;
+ ssize_t ret = 0;
+ u64 ino;
+
+ pfn = src / KPMSIZE;
+ count = min_t(unsigned long, count, (max_pfn * KPMSIZE) - src);
+ if (src & KPMMASK || count & KPMMASK)
+ return -EINVAL;
+
+ while (count > 0) {
+ if (pfn_valid(pfn))
+ ppage = pfn_to_page(pfn);
+ else
+ ppage = NULL;
+
+ if (ppage)
+ ino = page_cgroup_ino(ppage);
+ else
+ ino = 0;
+
+ if (put_user(ino, out)) {
+ ret = -EFAULT;
+ break;
+ }
+
+ pfn++;
+ out++;
+ count -= KPMSIZE;
+ }
+
+ *ppos += (char __user *)out - buf;
+ if (!ret)
+ ret = (char __user *)out - buf;
+ return ret;
+}
+
+static const struct file_operations proc_kpagecgroup_operations = {
+ .llseek = mem_lseek,
+ .read = kpagecgroup_read,
+};
+#endif /* CONFIG_MEMCG */
+
static int __init proc_page_init(void)
{
proc_create("kpagecount", S_IRUSR, NULL, &proc_kpagecount_operations);
proc_create("kpageflags", S_IRUSR, NULL, &proc_kpageflags_operations);
+#ifdef CONFIG_MEMCG
+ proc_create("kpagecgroup", S_IRUSR, NULL, &proc_kpagecgroup_operations);
+#endif
return 0;
}
fs_initcall(proc_page_init);

Vladimir Davydov

Jul 15, 2015, 9:55:26 AM7/15/15
to Andrew Morton, Andres Lagar-Cavilla, Minchan Kim, Raghavendra K T, Johannes Weiner, Michal Hocko, Greg Thelen, Michel Lespinasse, David Rientjes, Pavel Emelyanov, Cyrill Gorcunov, Jonathan Corbet, linu...@vger.kernel.org, linu...@vger.kernel.org, linu...@kvack.org, cgr...@vger.kernel.org, linux-...@vger.kernel.org
In the scope of the idle memory tracking feature, which is introduced by
the following patch, we need to clear the referenced/accessed bit not
only in primary, but also in secondary ptes. The latter is required in
order to estimate the working set size of KVM VMs. At the same time we
want to avoid flushing the TLB, because it is quite expensive and it
won't really affect the final result.

Currently, there is no function for clearing the pte young bit that
would meet our requirements, so this patch introduces one. To achieve
that we have to add a new mmu-notifier callback, clear_young, since
there is no method for testing-and-clearing a secondary pte without
flushing the TLB. The new method is not mandatory and is currently
implemented only by KVM.

Signed-off-by: Vladimir Davydov <vdav...@parallels.com>
---
include/linux/mmu_notifier.h | 44 ++++++++++++++++++++++++++++++++++++++++++++
mm/mmu_notifier.c | 17 +++++++++++++++++
virt/kvm/kvm_main.c | 18 ++++++++++++++++++
3 files changed, 79 insertions(+)

diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
index 61cd67f4d788..a5b17137c683 100644
--- a/include/linux/mmu_notifier.h
+++ b/include/linux/mmu_notifier.h
@@ -66,6 +66,16 @@ struct mmu_notifier_ops {
unsigned long end);

/*
+ * clear_young is a lightweight version of clear_flush_young. Like the
+ * latter, it is supposed to test-and-clear the young/accessed bitflag
+ * in the secondary pte, but it may omit flushing the secondary tlb.
+ */
+ int (*clear_young)(struct mmu_notifier *mn,
+ struct mm_struct *mm,
+ unsigned long start,
+ unsigned long end);
+
+ /*
* test_young is called to check the young/accessed bitflag in
* the secondary pte. This is used to know if the page is
* frequently used without actually clearing the flag or tearing
@@ -203,6 +213,9 @@ extern void __mmu_notifier_release(struct mm_struct *mm);
extern int __mmu_notifier_clear_flush_young(struct mm_struct *mm,
unsigned long start,
unsigned long end);
+extern int __mmu_notifier_clear_young(struct mm_struct *mm,
+ unsigned long start,
+ unsigned long end);
extern int __mmu_notifier_test_young(struct mm_struct *mm,
unsigned long address);
extern void __mmu_notifier_change_pte(struct mm_struct *mm,
@@ -231,6 +244,15 @@ static inline int mmu_notifier_clear_flush_young(struct mm_struct *mm,
return 0;
}

+static inline int mmu_notifier_clear_young(struct mm_struct *mm,
+ unsigned long start,
+ unsigned long end)
+{
+ if (mm_has_notifiers(mm))
+ return __mmu_notifier_clear_young(mm, start, end);
+ return 0;
+}
+
static inline int mmu_notifier_test_young(struct mm_struct *mm,
unsigned long address)
{
@@ -311,6 +333,28 @@ static inline void mmu_notifier_mm_destroy(struct mm_struct *mm)
__young; \
})

+#define ptep_clear_young_notify(__vma, __address, __ptep) \
+({ \
+ int __young; \
+ struct vm_area_struct *___vma = __vma; \
+ unsigned long ___address = __address; \
+ __young = ptep_test_and_clear_young(___vma, ___address, __ptep);\
+ __young |= mmu_notifier_clear_young(___vma->vm_mm, ___address, \
+ ___address + PAGE_SIZE); \
+ __young; \
+})
+
+#define pmdp_clear_young_notify(__vma, __address, __pmdp) \
+({ \
+ int __young; \
+ struct vm_area_struct *___vma = __vma; \
+ unsigned long ___address = __address; \
+ __young = pmdp_test_and_clear_young(___vma, ___address, __pmdp);\
+ __young |= mmu_notifier_clear_young(___vma->vm_mm, ___address, \
+ ___address + PMD_SIZE); \
+ __young; \
+})
+
#define ptep_clear_flush_notify(__vma, __address, __ptep) \
({ \
unsigned long ___addr = __address & PAGE_MASK; \
diff --git a/mm/mmu_notifier.c b/mm/mmu_notifier.c
index 3b9b3d0741b2..5fbdd367bbed 100644
--- a/mm/mmu_notifier.c
+++ b/mm/mmu_notifier.c
@@ -123,6 +123,23 @@ int __mmu_notifier_clear_flush_young(struct mm_struct *mm,
return young;
}

+int __mmu_notifier_clear_young(struct mm_struct *mm,
+ unsigned long start,
+ unsigned long end)
+{
+ struct mmu_notifier *mn;
+ int young = 0, id;
+
+ id = srcu_read_lock(&srcu);
+ hlist_for_each_entry_rcu(mn, &mm->mmu_notifier_mm->list, hlist) {
+ if (mn->ops->clear_young)
+ young |= mn->ops->clear_young(mn, mm, start, end);
+ }
+ srcu_read_unlock(&srcu, id);
+
+ return young;
+}
+
int __mmu_notifier_test_young(struct mm_struct *mm,
unsigned long address)
{
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 05148a43ef9c..61500cb028a3 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -388,6 +388,23 @@ static int kvm_mmu_notifier_clear_flush_young(struct mmu_notifier *mn,
return young;
}

+static int kvm_mmu_notifier_clear_young(struct mmu_notifier *mn,
+ struct mm_struct *mm,
+ unsigned long start,
+ unsigned long end)
+{
+ struct kvm *kvm = mmu_notifier_to_kvm(mn);
+ int young, idx;
+
+ idx = srcu_read_lock(&kvm->srcu);
+ spin_lock(&kvm->mmu_lock);
+ young = kvm_age_hva(kvm, start, end);
+ spin_unlock(&kvm->mmu_lock);
+ srcu_read_unlock(&kvm->srcu, idx);
+
+ return young;
+}
+
static int kvm_mmu_notifier_test_young(struct mmu_notifier *mn,
struct mm_struct *mm,
unsigned long address)
@@ -420,6 +437,7 @@ static const struct mmu_notifier_ops kvm_mmu_notifier_ops = {
.invalidate_range_start = kvm_mmu_notifier_invalidate_range_start,
.invalidate_range_end = kvm_mmu_notifier_invalidate_range_end,
.clear_flush_young = kvm_mmu_notifier_clear_flush_young,
+ .clear_young = kvm_mmu_notifier_clear_young,
.test_young = kvm_mmu_notifier_test_young,
.change_pte = kvm_mmu_notifier_change_pte,
.release = kvm_mmu_notifier_release,

Vladimir Davydov

Jul 15, 2015, 9:55:34 AM7/15/15
to Andrew Morton, Andres Lagar-Cavilla, Minchan Kim, Raghavendra K T, Johannes Weiner, Michal Hocko, Greg Thelen, Michel Lespinasse, David Rientjes, Pavel Emelyanov, Cyrill Gorcunov, Jonathan Corbet, linu...@vger.kernel.org, linu...@vger.kernel.org, linu...@kvack.org, cgr...@vger.kernel.org, linux-...@vger.kernel.org
Knowing the portion of memory that is not used by a certain application
or memory cgroup (idle memory) can be useful for partitioning the system
efficiently, e.g. by setting memory cgroup limits appropriately.
Currently, the only means to estimate the amount of idle memory provided
by the kernel is /proc/PID/{clear_refs,smaps}: the user can clear the
access bit for all pages mapped to a particular process by writing 1 to
clear_refs, wait for some time, and then count smaps:Referenced.
However, this method has two serious shortcomings:

- it does not count unmapped file pages
- it affects the reclaimer logic

To overcome these drawbacks, this patch introduces two new page flags,
Idle and Young, and a new proc file, /proc/kpageidle. A page's Idle flag
can only be set from userspace, by setting the bit in /proc/kpageidle at
the offset corresponding to the page, and it is cleared whenever the page
is accessed either through the page tables (it is cleared in
page_referenced() in this case) or via the read(2) system call
(mark_page_accessed()). Thus, by setting the Idle flag for the pages of a
particular workload, which can be found e.g. by reading
/proc/PID/pagemap, waiting for some time to let the workload access its
working set, and then reading the kpageidle file, one can estimate the
number of pages that are not used by the workload.

The Young page flag is used to avoid interference with the memory
reclaimer. A page's Young flag is set whenever the Access bit of a page
table entry pointing to the page is cleared by writing to kpageidle. If
page_referenced() is called on a Young page, it will add 1 to its return
value, therefore concealing the fact that the Access bit was cleared.

Note that, since there is no room for extra page flags on 32-bit, this
feature uses extended page flags when compiled on 32-bit.

Signed-off-by: Vladimir Davydov <vdav...@parallels.com>
---
Documentation/vm/pagemap.txt | 12 ++-
fs/proc/page.c | 218 +++++++++++++++++++++++++++++++++++++++++++
fs/proc/task_mmu.c | 4 +-
include/linux/mm.h | 98 +++++++++++++++++++
include/linux/page-flags.h | 11 +++
include/linux/page_ext.h | 4 +
mm/Kconfig | 12 +++
mm/debug.c | 4 +
mm/huge_memory.c | 11 ++-
mm/migrate.c | 5 +
mm/page_ext.c | 3 +
mm/rmap.c | 5 +
mm/swap.c | 2 +
13 files changed, 385 insertions(+), 4 deletions(-)

diff --git a/Documentation/vm/pagemap.txt b/Documentation/vm/pagemap.txt
index a9b7afc8fbc6..c9266340852c 100644
--- a/Documentation/vm/pagemap.txt
+++ b/Documentation/vm/pagemap.txt
@@ -5,7 +5,7 @@ pagemap is a new (as of 2.6.25) set of interfaces in the kernel that allow
userspace programs to examine the page tables and related information by
reading files in /proc.

-There are four components to pagemap:
+There are five components to pagemap:

* /proc/pid/pagemap. This file lets a userspace process find out which
physical frame each virtual page is mapped to. It contains one 64-bit
@@ -69,6 +69,16 @@ There are four components to pagemap:
memory cgroup each page is charged to, indexed by PFN. Only available when
CONFIG_MEMCG is set.

+ * /proc/kpageidle. This file implements a bitmap where each bit corresponds
+ to a page, indexed by PFN. When the bit is set, the corresponding page is
+ idle. A page is considered idle if it has not been accessed since it was
+ marked idle. To mark a page idle one should set the bit corresponding to the
+ page by writing to the file. A value written to the file is OR-ed with the
+ current bitmap value. Only user memory pages can be marked idle, for other
+ page types input is silently ignored. Writing to this file beyond max PFN
+ results in the ENXIO error. Only available when CONFIG_IDLE_PAGE_TRACKING is
+ set.
+
Short descriptions to the page flags:

0. LOCKED
diff --git a/fs/proc/page.c b/fs/proc/page.c
index 70d23245dd43..273537885ab4 100644
--- a/fs/proc/page.c
+++ b/fs/proc/page.c
@@ -5,6 +5,8 @@
#include <linux/ksm.h>
#include <linux/mm.h>
#include <linux/mmzone.h>
+#include <linux/rmap.h>
+#include <linux/mmu_notifier.h>
#include <linux/huge_mm.h>
#include <linux/proc_fs.h>
#include <linux/seq_file.h>
@@ -16,6 +18,7 @@

#define KPMSIZE sizeof(u64)
#define KPMMASK (KPMSIZE - 1)
+#define KPMBITS (KPMSIZE * BITS_PER_BYTE)

/* /proc/kpagecount - an array exposing page counts
*
@@ -275,6 +278,217 @@ static const struct file_operations proc_kpagecgroup_operations = {
};
#endif /* CONFIG_MEMCG */

+#ifdef CONFIG_IDLE_PAGE_TRACKING
+/*
+ * Idle page tracking only considers user memory pages, for other types of
+ * pages the idle flag is always unset and an attempt to set it is silently
+ * ignored.
+ *
+ * We treat a page as a user memory page if it is on an LRU list, because it is
+ * always safe to pass such a page to rmap_walk(), which is essential for idle
+ * page tracking. With such an indicator of user pages we can skip isolated
+ * pages, but since there are not usually many of them, it will hardly affect
+ * the overall result.
+ *
+ * This function tries to get a user memory page by pfn as described above.
+ */
+static struct page *kpageidle_get_page(unsigned long pfn)
+{
+ struct page *page;
+ struct zone *zone;
+
+ if (!pfn_valid(pfn))
+ return NULL;
+
+ page = pfn_to_page(pfn);
+ if (!page || !PageLRU(page) ||
+ !get_page_unless_zero(page))
+ return NULL;
+
+ zone = page_zone(page);
+ spin_lock_irq(&zone->lru_lock);
+ if (unlikely(!PageLRU(page))) {
+ put_page(page);
+ page = NULL;
+ }
+ spin_unlock_irq(&zone->lru_lock);
+ return page;
+}
+
+static int kpageidle_clear_pte_refs_one(struct page *page,
+ struct vm_area_struct *vma,
+ unsigned long addr, void *arg)
+{
+ struct mm_struct *mm = vma->vm_mm;
+ spinlock_t *ptl;
+ pmd_t *pmd;
+ pte_t *pte;
+ bool referenced = false;
+
+ if (unlikely(PageTransHuge(page))) {
+ pmd = page_check_address_pmd(page, mm, addr,
+ PAGE_CHECK_ADDRESS_PMD_FLAG, &ptl);
+ if (pmd) {
+ referenced = pmdp_clear_young_notify(vma, addr, pmd);
+ spin_unlock(ptl);
+ }
+ } else {
+ pte = page_check_address(page, mm, addr, &ptl, 0);
+ if (pte) {
+ referenced = ptep_clear_young_notify(vma, addr, pte);
+ pte_unmap_unlock(pte, ptl);
+ }
+ }
+ if (referenced) {
+ clear_page_idle(page);
+ /*
+ * We cleared the referenced bit in a mapping to this page. To
+ * avoid interference with page reclaim, mark it young so that
+ * page_referenced() will return > 0.
+ */
+ set_page_young(page);
+ }
+ return SWAP_AGAIN;
+}
+
+static void kpageidle_clear_pte_refs(struct page *page)
+{
+ struct rmap_walk_control rwc = {
+ .rmap_one = kpageidle_clear_pte_refs_one,
+ .anon_lock = page_lock_anon_vma_read,
+ };
+ bool need_lock;
+
+ if (!page_mapped(page) ||
+ !page_rmapping(page))
+ return;
+
+ need_lock = !PageAnon(page) || PageKsm(page);
+ if (need_lock && !trylock_page(page))
+ return;
+
+ rmap_walk(page, &rwc);
+
+ if (need_lock)
+ unlock_page(page);
+}
+
+static ssize_t kpageidle_read(struct file *file, char __user *buf,
+ size_t count, loff_t *ppos)
+{
+ u64 __user *out = (u64 __user *)buf;
+ struct page *page;
+ unsigned long pfn, end_pfn;
+ ssize_t ret = 0;
+ u64 idle_bitmap = 0;
+ int bit;
+
+ if (*ppos & KPMMASK || count & KPMMASK)
+ return -EINVAL;
+
+ pfn = *ppos * BITS_PER_BYTE;
+ if (pfn >= max_pfn)
+ return 0;
+
+ end_pfn = pfn + count * BITS_PER_BYTE;
+ if (end_pfn > max_pfn)
+ end_pfn = ALIGN(max_pfn, KPMBITS);
+
+ for (; pfn < end_pfn; pfn++) {
+ bit = pfn % KPMBITS;
+ page = kpageidle_get_page(pfn);
+ if (page) {
+ if (page_is_idle(page)) {
+ /*
+ * The page might have been referenced via a
+ * pte, in which case it is not idle. Clear
+ * refs and recheck.
+ */
+ kpageidle_clear_pte_refs(page);
+ if (page_is_idle(page))
+ idle_bitmap |= 1ULL << bit;
+ }
+ put_page(page);
+ }
+ if (bit == KPMBITS - 1) {
+ if (put_user(idle_bitmap, out)) {
+ ret = -EFAULT;
+ break;
+ }
+ idle_bitmap = 0;
+ out++;
+ }
+ }
+
+ *ppos += (char __user *)out - buf;
+ if (!ret)
+ ret = (char __user *)out - buf;
+ return ret;
+}
+
+static ssize_t kpageidle_write(struct file *file, const char __user *buf,
+ size_t count, loff_t *ppos)
+{
+ const u64 __user *in = (const u64 __user *)buf;
+ struct page *page;
+ unsigned long pfn, end_pfn;
+ ssize_t ret = 0;
+ u64 idle_bitmap = 0;
+ int bit;
+
+ if (*ppos & KPMMASK || count & KPMMASK)
+ return -EINVAL;
+
+ pfn = *ppos * BITS_PER_BYTE;
+ if (pfn >= max_pfn)
+ return -ENXIO;
+
+ end_pfn = pfn + count * BITS_PER_BYTE;
+ if (end_pfn > max_pfn)
+ end_pfn = ALIGN(max_pfn, KPMBITS);
+
+ for (; pfn < end_pfn; pfn++) {
+ bit = pfn % KPMBITS;
+ if (bit == 0) {
+ if (get_user(idle_bitmap, in)) {
+ ret = -EFAULT;
+ break;
+ }
+ in++;
+ }
+ if (idle_bitmap >> bit & 1) {
+ page = kpageidle_get_page(pfn);
+ if (page) {
+ kpageidle_clear_pte_refs(page);
+ set_page_idle(page);
+ put_page(page);
+ }
+ }
+ }
+
+ *ppos += (const char __user *)in - buf;
+ if (!ret)
+ ret = (const char __user *)in - buf;
+ return ret;
+}
+
+static const struct file_operations proc_kpageidle_operations = {
+ .llseek = mem_lseek,
+ .read = kpageidle_read,
+ .write = kpageidle_write,
+};
+
+#ifndef CONFIG_64BIT
+static bool need_page_idle(void)
+{
+ return true;
+}
+struct page_ext_operations page_idle_ops = {
+ .need = need_page_idle,
+};
+#endif
+#endif /* CONFIG_IDLE_PAGE_TRACKING */
+
static int __init proc_page_init(void)
{
proc_create("kpagecount", S_IRUSR, NULL, &proc_kpagecount_operations);
@@ -282,6 +496,10 @@ static int __init proc_page_init(void)
#ifdef CONFIG_MEMCG
proc_create("kpagecgroup", S_IRUSR, NULL, &proc_kpagecgroup_operations);
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 9671f51e954d..bb6d2ec1f268 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1754,6 +1754,11 @@ static void __split_huge_page_refcount(struct page *page,
/* clear PageTail before overwriting first_page */
smp_wmb();

+ if (page_is_young(page))
+ set_page_young(page_tail);
+ if (page_is_idle(page))
+ set_page_idle(page_tail);
+
/*
* __split_huge_page_splitting() already set the
* splitting bit in all pmd that could map this
@@ -2259,7 +2264,8 @@ static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
VM_BUG_ON_PAGE(PageLRU(page), page);

/* If there is no mapped pte young don't collapse the page */
- if (pte_young(pteval) || PageReferenced(page) ||
+ if (pte_young(pteval) ||
+ page_is_young(page) || PageReferenced(page) ||
mmu_notifier_test_young(vma->vm_mm, address))
referenced = true;
}
@@ -2686,7 +2692,8 @@ static int khugepaged_scan_pmd(struct mm_struct *mm,
*/
if (page_count(page) != 1 + !!PageSwapCache(page))
goto out_unmap;
- if (pte_young(pteval) || PageReferenced(page) ||
+ if (pte_young(pteval) ||
+ page_is_young(page) || PageReferenced(page) ||
mmu_notifier_test_young(vma->vm_mm, address))
referenced = true;
}
diff --git a/mm/migrate.c b/mm/migrate.c
index 236ee25e79d9..3e7bb4f2b51c 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -524,6 +524,11 @@ void migrate_page_copy(struct page *newpage, struct page *page)
__set_page_dirty_nobuffers(newpage);
}

+ if (page_is_young(page))
+ set_page_young(newpage);
+ if (page_is_idle(page))
+ set_page_idle(newpage);
+
/*
* Copy NUMA information to the new page, to prevent over-eager
* future migrations of this same page.

Vladimir Davydov

Jul 15, 2015, 9:55:44 AM
to Andrew Morton, Andres Lagar-Cavilla, Minchan Kim, Raghavendra K T, Johannes Weiner, Michal Hocko, Greg Thelen, Michel Lespinasse, David Rientjes, Pavel Emelyanov, Cyrill Gorcunov, Jonathan Corbet, linu...@vger.kernel.org, linu...@vger.kernel.org, linu...@kvack.org, cgr...@vger.kernel.org, linux-...@vger.kernel.org
As noted by Minchan, a benefit of reading the idle flag from
/proc/kpageflags is that one can easily filter dirty and/or unevictable
pages while estimating the size of unused memory.

Note that the idle flag read from /proc/kpageflags may be stale if the
page was accessed via a PTE, because it would be too costly to iterate
over all page mappings on each /proc/kpageflags read to provide an
up-to-date value. To make sure the flag is up-to-date one has to read
/proc/kpageidle first.
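The filtering Minchan suggested can be sketched as bit tests on a kpageflags word. This is a hypothetical userspace helper, not part of the patch; the bit numbers come from include/uapi/linux/kernel-page-flags.h (KPF_DIRTY is 4, KPF_UNEVICTABLE is 18, and KPF_IDLE is 25 as added here):

```c
#include <assert.h>
#include <stdint.h>

#define KPF_DIRTY       4
#define KPF_UNEVICTABLE 18
#define KPF_IDLE        25	/* added by this patch */

static int kpf_test(uint64_t flags, unsigned int bit)
{
	return (int)((flags >> bit) & 1);
}

/*
 * Count a page toward "unused" memory only if it is idle and neither
 * dirty nor unevictable -- after refreshing the idle flag by reading
 * /proc/kpageidle first, as noted above.
 */
static int page_looks_unused(uint64_t flags)
{
	return kpf_test(flags, KPF_IDLE) &&
	       !kpf_test(flags, KPF_DIRTY) &&
	       !kpf_test(flags, KPF_UNEVICTABLE);
}
```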

Signed-off-by: Vladimir Davydov <vdav...@parallels.com>
---
Documentation/vm/pagemap.txt | 6 ++++++
fs/proc/page.c | 3 +++
include/uapi/linux/kernel-page-flags.h | 1 +
3 files changed, 10 insertions(+)

diff --git a/Documentation/vm/pagemap.txt b/Documentation/vm/pagemap.txt
index c9266340852c..5896b7d7fd74 100644
--- a/Documentation/vm/pagemap.txt
+++ b/Documentation/vm/pagemap.txt
@@ -64,6 +64,7 @@ There are five components to pagemap:
22. THP
23. BALLOON
24. ZERO_PAGE
+ 25. IDLE

* /proc/kpagecgroup. This file contains a 64-bit inode number of the
memory cgroup each page is charged to, indexed by PFN. Only available when
@@ -124,6 +125,11 @@ Short descriptions to the page flags:
24. ZERO_PAGE
zero page for pfn_zero or huge_zero page

+25. IDLE
+ page has not been accessed since it was marked idle (see /proc/kpageidle)
+ Note that this flag may be stale in case the page was accessed via a PTE.
+ To make sure the flag is up-to-date one has to read /proc/kpageidle first.
+
[IO related page flags]
1. ERROR IO error occurred
3. UPTODATE page has up-to-date data
diff --git a/fs/proc/page.c b/fs/proc/page.c
index 273537885ab4..13dcb823fe4e 100644
--- a/fs/proc/page.c
+++ b/fs/proc/page.c
@@ -150,6 +150,9 @@ u64 stable_page_flags(struct page *page)
if (PageBalloon(page))
u |= 1 << KPF_BALLOON;

+ if (page_is_idle(page))
+ u |= 1 << KPF_IDLE;
+
u |= kpf_copy_bit(k, KPF_LOCKED, PG_locked);

u |= kpf_copy_bit(k, KPF_SLAB, PG_slab);
diff --git a/include/uapi/linux/kernel-page-flags.h b/include/uapi/linux/kernel-page-flags.h
index a6c4962e5d46..5da5f8751ce7 100644
--- a/include/uapi/linux/kernel-page-flags.h
+++ b/include/uapi/linux/kernel-page-flags.h
@@ -33,6 +33,7 @@
#define KPF_THP 22
#define KPF_BALLOON 23
#define KPF_ZERO_PAGE 24
+#define KPF_IDLE 25


#endif /* _UAPILINUX_KERNEL_PAGE_FLAGS_H */

Andres Lagar-Cavilla

Jul 15, 2015, 2:59:30 PM
to Vladimir Davydov, Andrew Morton, Minchan Kim, Raghavendra K T, Johannes Weiner, Michal Hocko, Greg Thelen, Michel Lespinasse, David Rientjes, Pavel Emelyanov, Cyrill Gorcunov, Jonathan Corbet, linu...@vger.kernel.org, linu...@vger.kernel.org, linu...@kvack.org, cgr...@vger.kernel.org, linux-...@vger.kernel.org
On Wed, Jul 15, 2015 at 6:54 AM, Vladimir Davydov
<vdav...@parallels.com> wrote:
> This function returns the inode number of the closest online ancestor of
> the memory cgroup a page is charged to. It is required for exporting
> information about which page is charged to which cgroup to userspace,
> which will be introduced by a following patch.
>
> Signed-off-by: Vladimir Davydov <vdav...@parallels.com>

Reviewed-by: Andres Lagar-Cavilla <andr...@google.com>
Andres Lagar-Cavilla | Google Kernel Team | andr...@google.com

Andres Lagar-Cavilla

Jul 15, 2015, 3:00:25 PM
to Vladimir Davydov, Andrew Morton, Minchan Kim, Raghavendra K T, Johannes Weiner, Michal Hocko, Greg Thelen, Michel Lespinasse, David Rientjes, Pavel Emelyanov, Cyrill Gorcunov, Jonathan Corbet, linu...@vger.kernel.org, linu...@vger.kernel.org, linu...@kvack.org, cgr...@vger.kernel.org, linux-...@vger.kernel.org
On Wed, Jul 15, 2015 at 6:54 AM, Vladimir Davydov
<vdav...@parallels.com> wrote:
> Hwpoison allows filtering pages by memory cgroup ino. Currently, it
> calls try_get_mem_cgroup_from_page to obtain the cgroup from a page and
> then its ino using cgroup_ino, but now we have a more suitable method for that,
> page_cgroup_ino, so use it instead.
>
> Signed-off-by: Vladimir Davydov <vdav...@parallels.com>

Reviewed-by: Andres Lagar-Cavilla <andr...@google.com>
Andres Lagar-Cavilla | Google Kernel Team | andr...@google.com

Andres Lagar-Cavilla

Jul 15, 2015, 3:03:31 PM
to Vladimir Davydov, Andrew Morton, Minchan Kim, Raghavendra K T, Johannes Weiner, Michal Hocko, Greg Thelen, Michel Lespinasse, David Rientjes, Pavel Emelyanov, Cyrill Gorcunov, Jonathan Corbet, linu...@vger.kernel.org, linu...@vger.kernel.org, linu...@kvack.org, cgr...@vger.kernel.org, linux-...@vger.kernel.org
On Wed, Jul 15, 2015 at 6:54 AM, Vladimir Davydov
<vdav...@parallels.com> wrote:
For both /proc/kpage* interfaces you add (and more critically for the
rmap-causing one, kpageidle):

It's a good idea to do cond_resched(). Whether after each pfn, each Nth
pfn, each put_user, I leave to you, but a reasonable cadence is
needed, because user-space can call this on the entire physical
address space, and that's a lot of work to do without re-scheduling.

Andres
Andres Lagar-Cavilla | Google Kernel Team | andr...@google.com

Andres Lagar-Cavilla

Jul 15, 2015, 3:16:44 PM
to Vladimir Davydov, Paolo Bonzini, k...@vger.kernel.org, Eric Northup, Andrew Morton, Minchan Kim, Raghavendra K T, Johannes Weiner, Michal Hocko, Greg Thelen, Michel Lespinasse, David Rientjes, Pavel Emelyanov, Cyrill Gorcunov, Jonathan Corbet, linu...@vger.kernel.org, linu...@vger.kernel.org, linu...@kvack.org, cgr...@vger.kernel.org, linux-...@vger.kernel.org
On Wed, Jul 15, 2015 at 6:54 AM, Vladimir Davydov
<vdav...@parallels.com> wrote:
> In the scope of the idle memory tracking feature, which is introduced by
> the following patch, we need to clear the referenced/accessed bit not
> only in primary, but also in secondary ptes. The latter is required in
> order to estimate wss of KVM VMs. At the same time we want to avoid
> flushing tlb, because it is quite expensive and it won't really affect
> the final result.
>
> Currently, there is no function for clearing pte young bit that would
> meet our requirements, so this patch introduces one. To achieve that we
> have to add a new mmu-notifier callback, clear_young, since there is no
> method for testing-and-clearing a secondary pte w/o flushing tlb. The
> new method is not mandatory and currently only implemented by KVM.
>
> Signed-off-by: Vladimir Davydov <vdav...@parallels.com>

Reviewed-by: Andres Lagar-Cavilla <andr...@google.com>

Added Paolo Bonzini, kvm list, Eric Northup.
For reclaim, the clear_flush_young notifier may blow up the secondary
pte to estimate the access pattern, depending on hardware support (EPT
access bits available in Haswell onwards, not sure about AMD, PPC,
etc).

This is ok, because it's reclaim, we need to know the access pattern,
chances are the page is a goner anyway.

However, not so sure about that cost in this context. Depending on
user-space, this will periodically tear down all EPT tables in the
system. That's tricky.

So please add a note to that effect, so in the fullness of time kvm
may be able to refuse enacting this notifier based on performance/VM
priority/foo concerns.

> +
> + idx = srcu_read_lock(&kvm->srcu);
> + spin_lock(&kvm->mmu_lock);
> + young = kvm_age_hva(kvm, start, end);

Also please add a comment along the lines of no one really knowing
when and if to flush the secondary tlb.

We might come up with a heuristic later, or leave up to the regular
system cadence. We just don't know at the moment.

Andres

> + spin_unlock(&kvm->mmu_lock);
> + srcu_read_unlock(&kvm->srcu, idx);
> +
> + return young;
> +}
> +
> static int kvm_mmu_notifier_test_young(struct mmu_notifier *mn,
> struct mm_struct *mm,
> unsigned long address)
> @@ -420,6 +437,7 @@ static const struct mmu_notifier_ops kvm_mmu_notifier_ops = {
> .invalidate_range_start = kvm_mmu_notifier_invalidate_range_start,
> .invalidate_range_end = kvm_mmu_notifier_invalidate_range_end,
> .clear_flush_young = kvm_mmu_notifier_clear_flush_young,
> + .clear_young = kvm_mmu_notifier_clear_young,
> .test_young = kvm_mmu_notifier_test_young,
> .change_pte = kvm_mmu_notifier_change_pte,
> .release = kvm_mmu_notifier_release,
> --
> 2.1.4
>



--
Andres Lagar-Cavilla | Google Kernel Team | andr...@google.com

Andres Lagar-Cavilla

Jul 15, 2015, 3:17:13 PM
to Vladimir Davydov, Andrew Morton, Minchan Kim, Raghavendra K T, Johannes Weiner, Michal Hocko, Greg Thelen, Michel Lespinasse, David Rientjes, Pavel Emelyanov, Cyrill Gorcunov, Jonathan Corbet, linu...@vger.kernel.org, linu...@vger.kernel.org, linu...@kvack.org, cgr...@vger.kernel.org, linux-...@vger.kernel.org
On Wed, Jul 15, 2015 at 6:54 AM, Vladimir Davydov
<vdav...@parallels.com> wrote:
> As noted by Minchan, a benefit of reading idle flag from
> /proc/kpageflags is that one can easily filter dirty and/or unevictable
> pages while estimating the size of unused memory.
>
> Note that idle flag read from /proc/kpageflags may be stale in case the
> page was accessed via a PTE, because it would be too costly to iterate
> over all page mappings on each /proc/kpageflags read to provide an
> up-to-date value. To make sure the flag is up-to-date one has to read
> /proc/kpageidle first.
>
> Signed-off-by: Vladimir Davydov <vdav...@parallels.com>

Reviewed-by: Andres Lagar-Cavilla <andr...@google.com>
Andres Lagar-Cavilla | Google Kernel Team | andr...@google.com

Andres Lagar-Cavilla

Jul 15, 2015, 3:42:40 PM
to Vladimir Davydov, Andrew Morton, Minchan Kim, Raghavendra K T, Johannes Weiner, Michal Hocko, Greg Thelen, Michel Lespinasse, David Rientjes, Pavel Emelyanov, Cyrill Gorcunov, Jonathan Corbet, linu...@vger.kernel.org, linu...@vger.kernel.org, linu...@kvack.org, cgr...@vger.kernel.org, linux-...@vger.kernel.org
On Wed, Jul 15, 2015 at 6:54 AM, Vladimir Davydov
<vdav...@parallels.com> wrote:
Question: what about mlocked pages? Is there any point in calculating
their idleness?

> + !page_rmapping(page))

Not sure, does this skip SwapCache pages? Is there any point in
calculating their idleness?
Reminder to add cond_resched() or similar at some regular cadence.
Same...
Why not in the block above?

page_tail->flags |= (page->flags &
..
#ifdef CONFIG_WHATEVER_IT_WAS
1 << PG_idle
1 << PG_young
#endif


> /*
> * __split_huge_page_splitting() already set the
> * splitting bit in all pmd that could map this
> @@ -2259,7 +2264,8 @@ static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
> VM_BUG_ON_PAGE(PageLRU(page), page);
>
> /* If there is no mapped pte young don't collapse the page */
> - if (pte_young(pteval) || PageReferenced(page) ||
> + if (pte_young(pteval) ||
> + page_is_young(page) || PageReferenced(page) ||
> mmu_notifier_test_young(vma->vm_mm, address))
> referenced = true;
> }
> @@ -2686,7 +2692,8 @@ static int khugepaged_scan_pmd(struct mm_struct *mm,
> */
> if (page_count(page) != 1 + !!PageSwapCache(page))
> goto out_unmap;
> - if (pte_young(pteval) || PageReferenced(page) ||
> + if (pte_young(pteval) ||
> + page_is_young(page) || PageReferenced(page) ||
> mmu_notifier_test_young(vma->vm_mm, address))
> referenced = true;
> }

Cool finds, thanks for the thoroughness

Andres
Andres Lagar-Cavilla | Google Kernel Team | andr...@google.com

Andres Lagar-Cavilla

Jul 15, 2015, 4:47:26 PM
to Vladimir Davydov, Andrew Morton, Minchan Kim, Raghavendra K T, Johannes Weiner, Michal Hocko, Greg Thelen, Michel Lespinasse, David Rientjes, Pavel Emelyanov, Cyrill Gorcunov, Jonathan Corbet, linu...@vger.kernel.org, linu...@vger.kernel.org, linu...@kvack.org, cgr...@vger.kernel.org, linux-...@vger.kernel.org
On Wed, Jul 15, 2015 at 6:54 AM, Vladimir Davydov
<vdav...@parallels.com> wrote:
Both good catches, thanks!

I think the remaining question here is performance.

Have you conducted any studies where
- there is a workload
- a daemon is poking kpageidle every N seconds/minutes
- what is the daemon cpu consumption?
- what is the workload degradation if any?

N candidates include 30 seconds, 1 minute, 2 minutes, 5 minutes....

Workload candidates include TPC, spec int memory intensive things like
429.mcf, stream (http://www.cs.virginia.edu/stream/ "sustainable
memory bandwidth" vs floating point performance)

I'm not asking for a research paper, but if, say, a 2 minute-period
daemon introduces no degradation and adds up to a minute of cpu per
hour, then we're golden.

Andres
Andres Lagar-Cavilla | Google Kernel Team | andr...@google.com

Vladimir Davydov

Jul 16, 2015, 5:29:06 AM
to Andres Lagar-Cavilla, Andrew Morton, Minchan Kim, Raghavendra K T, Johannes Weiner, Michal Hocko, Greg Thelen, Michel Lespinasse, David Rientjes, Pavel Emelyanov, Cyrill Gorcunov, Jonathan Corbet, linu...@vger.kernel.org, linu...@vger.kernel.org, linu...@kvack.org, cgr...@vger.kernel.org, linux-...@vger.kernel.org
On Wed, Jul 15, 2015 at 12:03:18PM -0700, Andres Lagar-Cavilla wrote:
> For both /proc/kpage* interfaces you add (and more critically for the
> rmap-causing one, kpageidle):
>
> It's a good idea to do cond_resched(). Whether after each pfn, each Nth
> pfn, each put_user, I leave to you, but a reasonable cadence is
> needed, because user-space can call this on the entire physical
> address space, and that's a lot of work to do without re-scheduling.

I really don't think it's necessary. These files can only be
read/written by the root, who has plenty of ways to kill the system anyway.
The program that is allowed to read/write these files must be conscious
and do it in batches of reasonable size. AFAICS the same reasoning
already lies behind /proc/kpagecount and /proc/kpageflags, which also do
not thrust the "right" batch size on their readers.

Thanks,
Vladimir

Vladimir Davydov

Jul 16, 2015, 5:55:01 AM
to Andres Lagar-Cavilla, Andrew Morton, Minchan Kim, Raghavendra K T, Johannes Weiner, Michal Hocko, Greg Thelen, Michel Lespinasse, David Rientjes, Pavel Emelyanov, Cyrill Gorcunov, Jonathan Corbet, linu...@vger.kernel.org, linu...@vger.kernel.org, linu...@kvack.org, cgr...@vger.kernel.org, linux-...@vger.kernel.org
On Wed, Jul 15, 2015 at 12:42:28PM -0700, Andres Lagar-Cavilla wrote:
> On Wed, Jul 15, 2015 at 6:54 AM, Vladimir Davydov
> <vdav...@parallels.com> wrote:
[...]
> > +static void kpageidle_clear_pte_refs(struct page *page)
> > +{
> > + struct rmap_walk_control rwc = {
> > + .rmap_one = kpageidle_clear_pte_refs_one,
> > + .anon_lock = page_lock_anon_vma_read,
> > + };
> > + bool need_lock;
> > +
> > + if (!page_mapped(page) ||
>
> Question: what about mlocked pages? Is there any point in calculating
> their idleness?

Those can be filtered out with the aid of /proc/kpageflags (this is what
the script attached to patch #0 of the series actually does). We have to
read the latter anyway in order to get information about THP. That said,
I prefer not to introduce any artificial checks for locked memory. Who
knows, may be one day somebody will use this API to track access pattern
to an mlocked area.

>
> > + !page_rmapping(page))
>
> Not sure, does this skip SwapCache pages? Is there any point in
> calculating their idleness?

A SwapCache page may be mapped, and if it is we should not skip it. If
it is unmapped, we have nothing to do.

Regarding idleness of SwapCache pages, I think we shouldn't
differentiate them from other user pages here, because a shmem/anon page
can migrate to-and-fro the swap cache occasionally during a
memory-active workload, and we don't want to lose its idle status then.

>
> > + return;
> > +
> > + need_lock = !PageAnon(page) || PageKsm(page);
> > + if (need_lock && !trylock_page(page))
> > + return;
> > +
> > + rmap_walk(page, &rwc);
> > +
> > + if (need_lock)
> > + unlock_page(page);
> > +}
[...]
> > @@ -1754,6 +1754,11 @@ static void __split_huge_page_refcount(struct page *page,
> > /* clear PageTail before overwriting first_page */
> > smp_wmb();
> >
> > + if (page_is_young(page))
> > + set_page_young(page_tail);
> > + if (page_is_idle(page))
> > + set_page_idle(page_tail);
> > +
>
> Why not in the block above?
>
> page_tail->flags |= (page->flags &
> ...
> #ifdef CONFIG_WHATEVER_IT_WAS
> 1 << PG_idle
> 1 << PG_young
> #endif

Too many ifdef's :-/ Note, the flags can be in page_ext, which mean we
would have to add something like this

#if defined(CONFIG_WHATEVER_IT_WAS) && defined(CONFIG_64BIT)
1 << PG_idle
1 << PG_young
#endif
<...>
#ifndef CONFIG_64BIT
if (page_is_young(page))
set_page_young(page_tail);
if (page_is_idle(page))
set_page_idle(page_tail);
#endif

which IMO looks less readable than what we have now.

Thanks,
Vladimir

Vladimir Davydov

Jul 16, 2015, 6:02:30 AM
to Andres Lagar-Cavilla, Andrew Morton, Minchan Kim, Raghavendra K T, Johannes Weiner, Michal Hocko, Greg Thelen, Michel Lespinasse, David Rientjes, Pavel Emelyanov, Cyrill Gorcunov, Jonathan Corbet, linu...@vger.kernel.org, linu...@vger.kernel.org, linu...@kvack.org, cgr...@vger.kernel.org, linux-...@vger.kernel.org
On Wed, Jul 15, 2015 at 01:47:15PM -0700, Andres Lagar-Cavilla wrote:
> I think the remaining question here is performance.
>
> Have you conducted any studies where
> - there is a workload
> - a daemon is poking kpageidle every N seconds/minutes
> - what is the daemon cpu consumption?
> - what is the workload degradation if any?
>
> N candidates include 30 seconds, 1 minute, 2 minutes, 5 minutes....
>
> Workload candidates include TPC, spec int memory intensive things like
> 429.mcf, stream (http://www.cs.virginia.edu/stream/ "sustainable
> memory bandwidth" vs floating point performance)
>
> I'm not asking for a research paper, but if, say, a 2 minute-period
> daemon introduces no degradation and adds up to a minute of cpu per
> hour, then we're golden.

Fair enough. Will do that soon and report back.

Thanks a lot for the review, it was really helpful!

Vladimir

Paolo Bonzini

Jul 16, 2015, 7:35:26 AM
to Andres Lagar-Cavilla, Vladimir Davydov, k...@vger.kernel.org, Eric Northup, Andrew Morton, Minchan Kim, Raghavendra K T, Johannes Weiner, Michal Hocko, Greg Thelen, Michel Lespinasse, David Rientjes, Pavel Emelyanov, Cyrill Gorcunov, Jonathan Corbet, linu...@vger.kernel.org, linu...@vger.kernel.org, linu...@kvack.org, cgr...@vger.kernel.org, linux-...@vger.kernel.org


On 15/07/2015 21:16, Andres Lagar-Cavilla wrote:
>> > +static int kvm_mmu_notifier_clear_young(struct mmu_notifier *mn,
>> > + struct mm_struct *mm,
>> > + unsigned long start,
>> > + unsigned long end)
>> > +{
>> > + struct kvm *kvm = mmu_notifier_to_kvm(mn);
>> > + int young, idx;
> For reclaim, the clear_flush_young notifier may blow up the secondary
> pte to estimate the access pattern, depending on hardware support (EPT
> access bits available in Haswell onwards, not sure about AMD, PPC,
> etc).

It seems like this problem is limited to pre-Haswell EPT.

I'm okay with the patch. If we find problems later we can always add a
parameter to kvm_age_hva so that it effectively doesn't do anything on
clear_young.

Acked-by: Paolo Bonzini <pbon...@redhat.com>

Paolo

Vladimir Davydov

Jul 17, 2015, 5:28:14 AM
to Andres Lagar-Cavilla, Andrew Morton, Minchan Kim, Raghavendra K T, Johannes Weiner, Michal Hocko, Greg Thelen, Michel Lespinasse, David Rientjes, Pavel Emelyanov, Cyrill Gorcunov, Jonathan Corbet, linu...@vger.kernel.org, linu...@vger.kernel.org, linu...@kvack.org, cgr...@vger.kernel.org, linux-...@vger.kernel.org
On Thu, Jul 16, 2015 at 12:04:59PM -0700, Andres Lagar-Cavilla wrote:
> On Thu, Jul 16, 2015 at 2:28 AM, Vladimir Davydov <vdav...@parallels.com>
> wrote:
>
> > On Wed, Jul 15, 2015 at 12:03:18PM -0700, Andres Lagar-Cavilla wrote:
> > > For both /proc/kpage* interfaces you add (and more critically for the
> > > rmap-causing one, kpageidle):
> > >
> > > It's a good idea to do cond_resched(). Whether after each pfn, each Nth
> > > pfn, each put_user, I leave to you, but a reasonable cadence is
> > > needed, because user-space can call this on the entire physical
> > > address space, and that's a lot of work to do without re-scheduling.
> >
> > I really don't think it's necessary. These files can only be
> > read/written by the root, who has plenty of ways to kill the system anyway.
> > The program that is allowed to read/write these files must be conscious
> > and do it in batches of reasonable size. AFAICS the same reasoning
> > already lies behind /proc/kpagecount and /proc/kpageflags, which also do
> > not thrust the "right" batch size on their readers.
> >
>
> Beg to disagree. You're conflating intended use with system health. A
> cond_resched() is a one-liner.

I would still prefer not to clutter the code with cond_resched's, but I
don't think it's a matter worth arguing upon, so I'll prepare a patch
that makes all /proc/kpage* files issue cond_resched() periodically and
leave it up to Andrew to decide if it should be applied or not.

Vladimir Davydov

Jul 19, 2015, 8:31:47 AM
to Andrew Morton, Andres Lagar-Cavilla, Minchan Kim, Raghavendra K T, Johannes Weiner, Michal Hocko, Greg Thelen, Michel Lespinasse, David Rientjes, Pavel Emelyanov, Cyrill Gorcunov, Jonathan Corbet, linu...@vger.kernel.org, linu...@vger.kernel.org, linu...@kvack.org, cgr...@vger.kernel.org, linux-...@vger.kernel.org
Hi,

This patch set introduces a new user API for tracking user memory pages
that have not been used for a given period of time. The purpose of this
is to provide the userspace with the means of tracking a workload's
working set, i.e. the set of pages that are actively used by the
workload. Knowing the working set size can be useful for partitioning
the system more efficiently, e.g. by tuning memory cgroup limits
appropriately, or for job placement within a compute cluster.

It is based on top of v4.2-rc2-mmotm-2015-07-15-16-46
It applies without conflicts to v4.2-rc2-mmotm-2015-07-17-16-04 as well.

---- CHANGE LOG ----

Changes in v9:

- add cond_resched to /proc/kpage* read/write loop (Andres)
- rebase on top of v4.2-rc2-mmotm-2015-07-15-16-46

Changes in v8:

- clear referenced/accessed bit in secondary ptes while accessing
/proc/kpageidle; this is required to estimate wss of KVM VMs (Andres)
- check the young flag when collapsing a huge page
- copy idle/young flags on page migration

v8: https://lkml.org/lkml/2015/7/15/587
---- PERFORMANCE EVALUATION ----

SPECjvm2008 (https://www.spec.org/jvm2008/) was used to evaluate the
performance impact introduced by this patch set. Three runs were carried
out:

- base: kernel without the patch
- patched: patched kernel, the feature is not used
- patched-active: patched kernel, 1 minute-period daemon is used for
tracking idle memory

For tracking idle memory, idlememstat utility was used:
https://github.com/locker/idlememstat

testcase        base              patched           patched-active

compiler        537.40 ( 0.00)%   532.26 (-0.96)%   538.31 ( 0.17)%
compress        305.47 ( 0.00)%   301.08 (-1.44)%   300.71 (-1.56)%
crypto          284.32 ( 0.00)%   282.21 (-0.74)%   284.87 ( 0.19)%
derby           411.05 ( 0.00)%   413.44 ( 0.58)%   412.07 ( 0.25)%
mpegaudio       189.96 ( 0.00)%   190.87 ( 0.48)%   189.42 (-0.28)%
scimark.large    46.85 ( 0.00)%    46.41 (-0.94)%    47.83 ( 2.09)%
scimark.small   412.91 ( 0.00)%   415.41 ( 0.61)%   421.17 ( 2.00)%
serial          204.23 ( 0.00)%   213.46 ( 4.52)%   203.17 (-0.52)%
startup          36.76 ( 0.00)%    35.49 (-3.45)%    35.64 (-3.05)%
sunflow         115.34 ( 0.00)%   115.08 (-0.23)%   117.37 ( 1.76)%
xml             620.55 ( 0.00)%   619.95 (-0.10)%   620.39 (-0.03)%

composite       211.50 ( 0.00)%   211.15 (-0.17)%   211.67 ( 0.08)%

time idlememstat:

17.20user 65.16system 2:15:23elapsed 1%CPU (0avgtext+0avgdata 8476maxresident)k
448inputs+40outputs (1major+36052minor)pagefaults 0swaps

Vladimir Davydov (8):
memcg: add page_cgroup_ino helper
hwpoison: use page_cgroup_ino for filtering by memcg
memcg: zap try_get_mem_cgroup_from_page
proc: add kpagecgroup file
mmu-notifier: add clear_young callback
proc: add kpageidle file
proc: export idle flag via kpageflags
proc: add cond_resched to /proc/kpage* read/write loop

 Documentation/vm/pagemap.txt           |  22 ++-
 fs/proc/page.c                         | 282 +++++++++++++++++++++++++++++++++
 fs/proc/task_mmu.c                     |   4 +-
 include/linux/memcontrol.h             |  10 +-
 include/linux/mm.h                     |  98 ++++++++++++
 include/linux/mmu_notifier.h           |  44 +++++
 include/linux/page-flags.h             |  11 ++
 include/linux/page_ext.h               |   4 +
 include/uapi/linux/kernel-page-flags.h |   1 +
 mm/Kconfig                             |  12 ++
 mm/debug.c                             |   4 +
 mm/huge_memory.c                       |  11 +-
 mm/hwpoison-inject.c                   |   5 +-
 mm/memcontrol.c                        |  71 ++++-----
 mm/memory-failure.c                    |  16 +-
 mm/migrate.c                           |   5 +
 mm/mmu_notifier.c                      |  17 ++
 mm/page_ext.c                          |   3 +
 mm/rmap.c                              |   5 +
 mm/swap.c                              |   2 +
 virt/kvm/kvm_main.c                    |  18 +++
 21 files changed, 579 insertions(+), 66 deletions(-)

--
2.1.4

Vladimir Davydov
Jul 19, 2015, 8:31:53 AM
This function returns the inode number of the closest online ancestor of
the memory cgroup a page is charged to. It is required for exporting
information about which page is charged to which cgroup to userspace,
which will be introduced by a following patch.

Signed-off-by: Vladimir Davydov <vdav...@parallels.com>
Reviewed-by: Andres Lagar-Cavilla <andr...@google.com>
---
 include/linux/memcontrol.h |  1 +
 mm/memcontrol.c            | 23 +++++++++++++++++++++++
 2 files changed, 24 insertions(+)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index d92b80b63c5c..99b0e43cac45 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -345,6 +345,7 @@ static inline bool mm_match_cgroup(struct mm_struct *mm,
}

struct cgroup_subsys_state *mem_cgroup_css_from_page(struct page *page);
+unsigned long page_cgroup_ino(struct page *page);

static inline bool mem_cgroup_disabled(void)
{
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 1def8810880a..a91bc1ee964c 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -441,6 +441,29 @@ struct cgroup_subsys_state *mem_cgroup_css_from_page(struct page *page)
return &memcg->css;
}

+/**
+ * page_cgroup_ino - return inode number of the memcg a page is charged to
+ * @page: the page
+ *
+ * Look up the closest online ancestor of the memory cgroup @page is charged to
+ * and return its inode number or 0 if @page is not charged to any cgroup. It
+ * is safe to call this function without holding a reference to @page.
+ */
+unsigned long page_cgroup_ino(struct page *page)
+{
+ struct mem_cgroup *memcg;
+ unsigned long ino = 0;
+
+ rcu_read_lock();
+ memcg = READ_ONCE(page->mem_cgroup);
+ while (memcg && !(memcg->css.flags & CSS_ONLINE))
+ memcg = parent_mem_cgroup(memcg);
+ if (memcg)
+ ino = cgroup_ino(memcg->css.cgroup);
+ rcu_read_unlock();
+ return ino;
+}
+
static struct mem_cgroup_per_zone *
mem_cgroup_page_zoneinfo(struct mem_cgroup *memcg, struct page *page)
{

Vladimir Davydov
Jul 19, 2015, 8:32:00 AM
Hwpoison allows filtering pages by memory cgroup inode number.
Currently, it calls try_get_mem_cgroup_from_page to obtain the cgroup
from a page and then its ino using cgroup_ino, but now we have a more
suitable method for that, page_cgroup_ino, so use it instead.

Signed-off-by: Vladimir Davydov <vdav...@parallels.com>
Reviewed-by: Andres Lagar-Cavilla <andr...@google.com>
---
index ef33ccf37224..97005396a507 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -130,27 +130,15 @@ static int hwpoison_filter_flags(struct page *p)
* can only guarantee that the page either belongs to the memcg tasks, or is
* a freed page.
*/
-#ifdef CONFIG_MEMCG_SWAP
+#ifdef CONFIG_MEMCG
u64 hwpoison_filter_memcg;
EXPORT_SYMBOL_GPL(hwpoison_filter_memcg);
static int hwpoison_filter_task(struct page *p)
{
- struct mem_cgroup *mem;
- struct cgroup_subsys_state *css;
- unsigned long ino;
-
if (!hwpoison_filter_memcg)
return 0;

- mem = try_get_mem_cgroup_from_page(p);
- if (!mem)
- return -EINVAL;
-
- css = &mem->css;
- ino = cgroup_ino(css->cgroup);
- css_put(css);
-
- if (ino != hwpoison_filter_memcg)
+ if (page_cgroup_ino(p) != hwpoison_filter_memcg)
return -EINVAL;

return 0;

Vladimir Davydov
Jul 19, 2015, 8:32:11 AM
It is only used in mem_cgroup_try_charge, so fold it in and zap it.

Signed-off-by: Vladimir Davydov <vdav...@parallels.com>
---
 include/linux/memcontrol.h |  9 +--------
 mm/memcontrol.c            | 48 ++++++++++++----------------------------------
 2 files changed, 13 insertions(+), 44 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 99b0e43cac45..d644aadfdd0d 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -305,11 +305,9 @@ struct lruvec *mem_cgroup_zone_lruvec(struct zone *, struct mem_cgroup *);
struct lruvec *mem_cgroup_page_lruvec(struct page *, struct zone *);

bool task_in_mem_cgroup(struct task_struct *task, struct mem_cgroup *memcg);
-
-struct mem_cgroup *try_get_mem_cgroup_from_page(struct page *page);
struct mem_cgroup *mem_cgroup_from_task(struct task_struct *p);
-
struct mem_cgroup *parent_mem_cgroup(struct mem_cgroup *memcg);
+
static inline
struct mem_cgroup *mem_cgroup_from_css(struct cgroup_subsys_state *css){
return css ? container_of(css, struct mem_cgroup, css) : NULL;
@@ -556,11 +554,6 @@ static inline struct lruvec *mem_cgroup_page_lruvec(struct page *page,
return &zone->lruvec;
}

-static inline struct mem_cgroup *try_get_mem_cgroup_from_page(struct page *page)
-{
- return NULL;
-}
-
static inline bool mm_match_cgroup(struct mm_struct *mm,
struct mem_cgroup *memcg)
{
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index a91bc1ee964c..b9c76a0906f9 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2094,40 +2094,6 @@ static void cancel_charge(struct mem_cgroup *memcg, unsigned int nr_pages)
@@ -5327,8 +5293,20 @@ int mem_cgroup_try_charge(struct page *page, struct mm_struct *mm,
* the page lock, which serializes swap cache removal, which
* in turn serializes uncharging.
*/
+ VM_BUG_ON_PAGE(!PageLocked(page), page);
if (page->mem_cgroup)
goto out;
+
+ if (do_swap_account) {
+ swp_entry_t ent = { .val = page_private(page), };
+ unsigned short id = lookup_swap_cgroup_id(ent);
+
+ rcu_read_lock();
+ memcg = mem_cgroup_from_id(id);
+ if (memcg && !css_tryget_online(&memcg->css))
+ memcg = NULL;
+ rcu_read_unlock();
+ }
}

if (PageTransHuge(page)) {
@@ -5336,8 +5314,6 @@ int mem_cgroup_try_charge(struct page *page, struct mm_struct *mm,
VM_BUG_ON_PAGE(!PageTransHuge(page), page);
}

- if (do_swap_account && PageSwapCache(page))
- memcg = try_get_mem_cgroup_from_page(page);
if (!memcg)
memcg = get_mem_cgroup_from_mm(mm);

Vladimir Davydov
Jul 19, 2015, 8:32:20 AM
/proc/kpagecgroup contains a 64-bit inode number of the memory cgroup
each page is charged to, indexed by PFN. Having this information is
useful for estimating a cgroup working set size.

The file is present if CONFIG_PROC_PAGE_MONITOR && CONFIG_MEMCG.

Signed-off-by: Vladimir Davydov <vdav...@parallels.com>
---
 Documentation/vm/pagemap.txt |  6 ++++-
 fs/proc/page.c               | 53 ++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 58 insertions(+), 1 deletion(-)

diff --git a/Documentation/vm/pagemap.txt b/Documentation/vm/pagemap.txt
index 56faec0f73f7..3a37ed184258 100644
--- a/Documentation/vm/pagemap.txt
+++ b/Documentation/vm/pagemap.txt
@@ -5,7 +5,7 @@ pagemap is a new (as of 2.6.25) set of interfaces in the kernel that allow
userspace programs to examine the page tables and related information by
reading files in /proc.

-There are three components to pagemap:
+There are four components to pagemap:

* /proc/pid/pagemap. This file lets a userspace process find out which
physical frame each virtual page is mapped to. It contains one 64-bit
@@ -66,6 +66,10 @@ There are three components to pagemap:
23. BALLOON
24. ZERO_PAGE

+ * /proc/kpagecgroup. This file contains a 64-bit inode number of the
+ memory cgroup each page is charged to, indexed by PFN. Only available when
+ CONFIG_MEMCG is set.
+
Short descriptions to the page flags:

0. LOCKED
diff --git a/fs/proc/page.c b/fs/proc/page.c
index 7eee2d8b97d9..70d23245dd43 100644
--- a/fs/proc/page.c
+++ b/fs/proc/page.c
@@ -9,6 +9,7 @@
#include <linux/proc_fs.h>
#include <linux/seq_file.h>
#include <linux/hugetlb.h>
+#include <linux/memcontrol.h>
#include <linux/kernel-page-flags.h>
#include <asm/uaccess.h>
#include "internal.h"
@@ -225,10 +226,62 @@ static const struct file_operations proc_kpageflags_operations = {
.read = kpageflags_read,
};

+#ifdef CONFIG_MEMCG
+static ssize_t kpagecgroup_read(struct file *file, char __user *buf,
+ size_t count, loff_t *ppos)
+{
+ u64 __user *out = (u64 __user *)buf;
+ struct page *ppage;
+ unsigned long src = *ppos;
+ unsigned long pfn;
+ ssize_t ret = 0;
+ u64 ino;
+
+ pfn = src / KPMSIZE;
+ count = min_t(unsigned long, count, (max_pfn * KPMSIZE) - src);
+ if (src & KPMMASK || count & KPMMASK)
+ return -EINVAL;
+
+ while (count > 0) {
+ if (pfn_valid(pfn))
+ ppage = pfn_to_page(pfn);
+ else
+ ppage = NULL;
+
+ if (ppage)
+ ino = page_cgroup_ino(ppage);
+ else
+ ino = 0;
+
+ if (put_user(ino, out)) {
+ ret = -EFAULT;
+ break;
+ }
+
+ pfn++;
+ out++;
+ count -= KPMSIZE;
+ }
+
+ *ppos += (char __user *)out - buf;
+ if (!ret)
+ ret = (char __user *)out - buf;
+ return ret;
+}
+
+static const struct file_operations proc_kpagecgroup_operations = {
+ .llseek = mem_lseek,
+ .read = kpagecgroup_read,
+};
+#endif /* CONFIG_MEMCG */
+
static int __init proc_page_init(void)
{
proc_create("kpagecount", S_IRUSR, NULL, &proc_kpagecount_operations);
proc_create("kpageflags", S_IRUSR, NULL, &proc_kpageflags_operations);
+#ifdef CONFIG_MEMCG
+ proc_create("kpagecgroup", S_IRUSR, NULL, &proc_kpagecgroup_operations);
+#endif
return 0;
}
fs_initcall(proc_page_init);

Vladimir Davydov
Jul 19, 2015, 8:32:29 AM
In the scope of the idle memory tracking feature, which is introduced by
the following patch, we need to clear the referenced/accessed bit not
only in primary, but also in secondary ptes. The latter is required in
order to estimate the working set size (wss) of KVM VMs. At the same
time we want to avoid flushing the TLB, because it is quite expensive
and would not really affect the final result.

Currently, there is no function for clearing the pte young bit that
would meet our requirements, so this patch introduces one. To achieve
that we have to add a new mmu-notifier callback, clear_young, since
there is no method for testing-and-clearing a secondary pte without
flushing the TLB. The new method is not mandatory and currently only
implemented by KVM.

Signed-off-by: Vladimir Davydov <vdav...@parallels.com>
Reviewed-by: Andres Lagar-Cavilla <andr...@google.com>
Acked-by: Paolo Bonzini <pbon...@redhat.com>
---
 include/linux/mmu_notifier.h | 44 ++++++++++++++++++++++++++++++++++++++++++++
 mm/mmu_notifier.c            | 17 +++++++++++++++++
 virt/kvm/kvm_main.c          | 18 ++++++++++++++++++
 3 files changed, 79 insertions(+)

diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
index 61cd67f4d788..a5b17137c683 100644
--- a/include/linux/mmu_notifier.h
+++ b/include/linux/mmu_notifier.h
@@ -66,6 +66,16 @@ struct mmu_notifier_ops {
unsigned long end);

/*
+ * clear_young is a lightweight version of clear_flush_young. Like the
+ * latter, it is supposed to test-and-clear the young/accessed bitflag
+ * in the secondary pte, but it may omit flushing the secondary tlb.
+ */
+ int (*clear_young)(struct mmu_notifier *mn,
+ struct mm_struct *mm,
+ unsigned long start,
+ unsigned long end);
+
+ /*
* test_young is called to check the young/accessed bitflag in
* the secondary pte. This is used to know if the page is
* frequently used without actually clearing the flag or tearing
@@ -203,6 +213,9 @@ extern void __mmu_notifier_release(struct mm_struct *mm);
extern int __mmu_notifier_clear_flush_young(struct mm_struct *mm,
unsigned long start,
unsigned long end);
+extern int __mmu_notifier_clear_young(struct mm_struct *mm,
+ unsigned long start,
+ unsigned long end);
extern int __mmu_notifier_test_young(struct mm_struct *mm,
unsigned long address);
extern void __mmu_notifier_change_pte(struct mm_struct *mm,
@@ -231,6 +244,15 @@ static inline int mmu_notifier_clear_flush_young(struct mm_struct *mm,
return 0;
}

+static inline int mmu_notifier_clear_young(struct mm_struct *mm,
+ unsigned long start,
+ unsigned long end)
+{
+ if (mm_has_notifiers(mm))
+ return __mmu_notifier_clear_young(mm, start, end);
+ return 0;
+}
+
+int __mmu_notifier_clear_young(struct mm_struct *mm,
+ unsigned long start,
+ unsigned long end)
+{
+ struct mmu_notifier *mn;
+ int young = 0, id;
+
+ id = srcu_read_lock(&srcu);
+ hlist_for_each_entry_rcu(mn, &mm->mmu_notifier_mm->list, hlist) {
+ if (mn->ops->clear_young)
+ young |= mn->ops->clear_young(mn, mm, start, end);
+ }
+ srcu_read_unlock(&srcu, id);
+
+ return young;
+}
+
int __mmu_notifier_test_young(struct mm_struct *mm,
unsigned long address)
{
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 8b8a44453670..ff4173ce6924 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -387,6 +387,23 @@ static int kvm_mmu_notifier_clear_flush_young(struct mmu_notifier *mn,
return young;
}

+static int kvm_mmu_notifier_clear_young(struct mmu_notifier *mn,
+ struct mm_struct *mm,
+ unsigned long start,
+ unsigned long end)
+{
+ struct kvm *kvm = mmu_notifier_to_kvm(mn);
+ int young, idx;
+
+ idx = srcu_read_lock(&kvm->srcu);
+ spin_lock(&kvm->mmu_lock);
+ young = kvm_age_hva(kvm, start, end);
+ spin_unlock(&kvm->mmu_lock);
+ srcu_read_unlock(&kvm->srcu, idx);
+
+ return young;
+}
+
static int kvm_mmu_notifier_test_young(struct mmu_notifier *mn,
struct mm_struct *mm,
unsigned long address)
@@ -419,6 +436,7 @@ static const struct mmu_notifier_ops kvm_mmu_notifier_ops = {
.invalidate_range_start = kvm_mmu_notifier_invalidate_range_start,
.invalidate_range_end = kvm_mmu_notifier_invalidate_range_end,
.clear_flush_young = kvm_mmu_notifier_clear_flush_young,
+ .clear_young = kvm_mmu_notifier_clear_young,
.test_young = kvm_mmu_notifier_test_young,
.change_pte = kvm_mmu_notifier_change_pte,
.release = kvm_mmu_notifier_release,

Vladimir Davydov
Jul 19, 2015, 8:32:40 AM
Knowing the portion of memory that is not used by a certain application
or memory cgroup (idle memory) can be useful for partitioning the system
efficiently, e.g. by setting memory cgroup limits appropriately.
Currently, the only means to estimate the amount of idle memory provided
by the kernel is /proc/PID/{clear_refs,smaps}: the user can clear the
access bit for all pages mapped to a particular process by writing 1 to
clear_refs, wait for some time, and then count smaps:Referenced.
However, this method has two serious shortcomings:

- it does not count unmapped file pages
- it affects the reclaimer logic

To overcome these drawbacks, this patch introduces two new page flags,
Idle and Young, and a new proc file, /proc/kpageidle. A page's Idle flag
can only be set from userspace, by setting the bit in /proc/kpageidle at
the offset corresponding to the page, and it is cleared whenever the page is
accessed either through page tables (it is cleared in page_referenced()
in this case) or using the read(2) system call (mark_page_accessed()).
Thus by setting the Idle flag for pages of a particular workload, which
can be found e.g. by reading /proc/PID/pagemap, waiting for some time to
let the workload access its working set, and then reading the kpageidle
file, one can estimate the number of pages that are not used by the
workload.

The Young page flag is used to avoid interference with the memory
reclaimer. A page's Young flag is set whenever the Access bit of a page
table entry pointing to the page is cleared by writing to kpageidle. If
page_referenced() is called on a Young page, it will add 1 to its return
value, therefore concealing the fact that the Access bit was cleared.

Note, since there is no room for extra page flags on 32 bit, this
feature uses extended page flags when compiled on 32 bit.

Signed-off-by: Vladimir Davydov <vdav...@parallels.com>
---
 Documentation/vm/pagemap.txt |  12 ++-
 fs/proc/page.c               | 218 +++++++++++++++++++++++++++++++++++++++
 fs/proc/task_mmu.c           |   4 +-
 include/linux/mm.h           |  98 +++++++++++++++++++
 include/linux/page-flags.h   |  11 +++
 include/linux/page_ext.h     |   4 +
 mm/Kconfig                   |  12 +++
 mm/debug.c                   |   4 +
 mm/huge_memory.c             |  11 ++-
 mm/migrate.c                 |   5 +
 mm/page_ext.c                |   3 +
 mm/rmap.c                    |   5 +
 mm/swap.c                    |   2 +
 13 files changed, 385 insertions(+), 4 deletions(-)

diff --git a/Documentation/vm/pagemap.txt b/Documentation/vm/pagemap.txt
index 3a37ed184258..34fe828c3007 100644
--- a/Documentation/vm/pagemap.txt
+++ b/Documentation/vm/pagemap.txt
@@ -5,7 +5,7 @@ pagemap is a new (as of 2.6.25) set of interfaces in the kernel that allow
userspace programs to examine the page tables and related information by
reading files in /proc.

-There are four components to pagemap:
+There are five components to pagemap:

* /proc/pid/pagemap. This file lets a userspace process find out which
physical frame each virtual page is mapped to. It contains one 64-bit
@@ -70,6 +70,16 @@ There are four components to pagemap:
memory cgroup each page is charged to, indexed by PFN. Only available when
CONFIG_MEMCG is set.

+ * /proc/kpageidle. This file implements a bitmap where each bit corresponds
+ to a page, indexed by PFN. When the bit is set, the corresponding page is
+ idle. A page is considered idle if it has not been accessed since it was
+ marked idle. To mark a page idle one should set the bit corresponding to the
+ page by writing to the file. A value written to the file is OR-ed with the
+ current bitmap value. Only user memory pages can be marked idle, for other
+ page types input is silently ignored. Writing to this file beyond max PFN
+ results in the ENXIO error. Only available when CONFIG_IDLE_PAGE_TRACKING is
+ set.
+
Short descriptions to the page flags:

0. LOCKED
diff --git a/fs/proc/page.c b/fs/proc/page.c
index 70d23245dd43..273537885ab4 100644
--- a/fs/proc/page.c
+++ b/fs/proc/page.c
+static void kpageidle_clear_pte_refs(struct page *page)
+{
+ struct rmap_walk_control rwc = {
+ .rmap_one = kpageidle_clear_pte_refs_one,
+ .anon_lock = page_lock_anon_vma_read,
+ };
+ bool need_lock;
+
+ if (!page_mapped(page) ||
+ !page_rmapping(page))
+ return;
+
+ need_lock = !PageAnon(page) || PageKsm(page);
+ if (need_lock && !trylock_page(page))
+ return;
+
+ rmap_walk(page, &rwc);
+
+ if (need_lock)
+ unlock_page(page);
+}
+
+static ssize_t kpageidle_read(struct file *file, char __user *buf,
+ size_t count, loff_t *ppos)
+{
+ u64 __user *out = (u64 __user *)buf;
+ struct page *page;
+ unsigned long pfn, end_pfn;
+ ssize_t ret = 0;
+ u64 idle_bitmap = 0;
+ int bit;
+
+ if (*ppos & KPMMASK || count & KPMMASK)
+ return -EINVAL;
+
+ pfn = *ppos * BITS_PER_BYTE;
+ if (pfn >= max_pfn)
+ return 0;
+
+ end_pfn = pfn + count * BITS_PER_BYTE;
+ if (end_pfn > max_pfn)
+ end_pfn = ALIGN(max_pfn, KPMBITS);
+
+ for (; pfn < end_pfn; pfn++) {
+ bit = pfn % KPMBITS;
+ page = kpageidle_get_page(pfn);
+ if (page) {
+ if (page_is_idle(page)) {
+ /*
+ * The page might have been referenced via a
+ * pte, in which case it is not idle. Clear
+ * refs and recheck.
+ */
+ kpageidle_clear_pte_refs(page);
+ if (page_is_idle(page))
+ idle_bitmap |= 1ULL << bit;
+ }
+ put_page(page);
+ }
+ if (bit == KPMBITS - 1) {
+ if (put_user(idle_bitmap, out)) {
+ ret = -EFAULT;
+ break;
+ }
+ idle_bitmap = 0;
+ out++;
+ }
+ }
+
+ *ppos += (char __user *)out - buf;
+ if (!ret)
+ ret = (char __user *)out - buf;
+ return ret;
+}
+
+static ssize_t kpageidle_write(struct file *file, const char __user *buf,
+ size_t count, loff_t *ppos)
+{
+ const u64 __user *in = (const u64 __user *)buf;
+ struct page *page;
+ unsigned long pfn, end_pfn;
+ ssize_t ret = 0;
+ u64 idle_bitmap = 0;
+ int bit;
+
+ if (*ppos & KPMMASK || count & KPMMASK)
+ return -EINVAL;
+
+ pfn = *ppos * BITS_PER_BYTE;
+ if (pfn >= max_pfn)
+ return -ENXIO;
+
+ end_pfn = pfn + count * BITS_PER_BYTE;
+ if (end_pfn > max_pfn)
+ end_pfn = ALIGN(max_pfn, KPMBITS);
+
+ for (; pfn < end_pfn; pfn++) {
+ bit = pfn % KPMBITS;
+ if (bit == 0) {
+ if (get_user(idle_bitmap, in)) {
+ ret = -EFAULT;
+ break;
+ }
+ in++;
+ }
+ if (idle_bitmap >> bit & 1) {
+ page = kpageidle_get_page(pfn);
+ if (page) {
+ kpageidle_clear_pte_refs(page);
+ set_page_idle(page);
+ put_page(page);
+ }
+ }
+ }
+
+ *ppos += (const char __user *)in - buf;
+ if (!ret)
+ ret = (const char __user *)in - buf;
+ return ret;
+}
+
+static const struct file_operations proc_kpageidle_operations = {
+ .llseek = mem_lseek,
+ .read = kpageidle_read,
+ .write = kpageidle_write,
+};
+
+#ifndef CONFIG_64BIT
+static bool need_page_idle(void)
+{
+ return true;
+}
+struct page_ext_operations page_idle_ops = {
+ .need = need_page_idle,
+};
+#endif
+#endif /* CONFIG_IDLE_PAGE_TRACKING */
+
static int __init proc_page_init(void)
{
proc_create("kpagecount", S_IRUSR, NULL, &proc_kpagecount_operations);
@@ -282,6 +496,10 @@ static int __init proc_page_init(void)
#ifdef CONFIG_MEMCG
proc_create("kpagecgroup", S_IRUSR, NULL, &proc_kpagecgroup_operations);
#endif
+#ifdef CONFIG_IDLE_PAGE_TRACKING
+ proc_create("kpageidle", S_IRUSR | S_IWUSR, NULL,
+ &proc_kpageidle_operations);
+#endif
return 0;
}
fs_initcall(proc_page_init);
diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 860bb0f30f14..7c9a17414106 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -459,7 +459,7 @@ static void smaps_account(struct mem_size_stats *mss, struct page *page,

mss->resident += size;
/* Accumulate the size in pages that have been accessed. */
- if (young || PageReferenced(page))
+ if (young || page_is_young(page) || PageReferenced(page))
mss->referenced += size;
mapcount = page_mapcount(page);
if (mapcount >= 2) {
@@ -808,6 +808,7 @@ static int clear_refs_pte_range(pmd_t *pmd, unsigned long addr,

/* Clear accessed and referenced bits. */
pmdp_test_and_clear_young(vma, addr, pmd);
+ test_and_clear_page_young(page);
ClearPageReferenced(page);
out:
spin_unlock(ptl);
@@ -835,6 +836,7 @@ out:

/* Clear accessed and referenced bits. */
ptep_test_and_clear_young(vma, addr, pte);
+ test_and_clear_page_young(page);
ClearPageReferenced(page);
}
pte_unmap_unlock(pte - 1, ptl);
diff --git a/include/linux/mm.h b/include/linux/mm.h
index c3a2b37365f6..0e62be7d5138 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2202,5 +2202,103 @@ void __init setup_nr_node_ids(void);
index 8f9a334a6c66..5ab46adca104 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1806,6 +1806,11 @@ static void __split_huge_page_refcount(struct page *page,
/* clear PageTail before overwriting first_page */
smp_wmb();

+ if (page_is_young(page))
+ set_page_young(page_tail);
+ if (page_is_idle(page))
+ set_page_idle(page_tail);
+
/*
* __split_huge_page_splitting() already set the
* splitting bit in all pmd that could map this
@@ -2311,7 +2316,8 @@ static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
VM_BUG_ON_PAGE(PageLRU(page), page);

/* If there is no mapped pte young don't collapse the page */
- if (pte_young(pteval) || PageReferenced(page) ||
+ if (pte_young(pteval) ||
+ page_is_young(page) || PageReferenced(page) ||
mmu_notifier_test_young(vma->vm_mm, address))
referenced = true;
}
@@ -2738,7 +2744,8 @@ static int khugepaged_scan_pmd(struct mm_struct *mm,
*/
if (page_count(page) != 1 + !!PageSwapCache(page))
goto out_unmap;
- if (pte_young(pteval) || PageReferenced(page) ||
+ if (pte_young(pteval) ||
+ page_is_young(page) || PageReferenced(page) ||
mmu_notifier_test_young(vma->vm_mm, address))
referenced = true;
}
diff --git a/mm/migrate.c b/mm/migrate.c
index d3529d620a5b..d86cec005aa6 100644
index 30812e9042ae..9e411aa03176 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -900,6 +900,11 @@ static int page_referenced_one(struct page *page, struct vm_area_struct *vma,
pte_unmap_unlock(pte, ptl);
}

+ if (referenced)
+ clear_page_idle(page);
+ if (test_and_clear_page_young(page))
+ referenced++;
+
if (referenced) {
pra->referenced++;
pra->vm_flags |= vma->vm_flags;
diff --git a/mm/swap.c b/mm/swap.c
index d398860badd1..04b6ce51bcf0 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -623,6 +623,8 @@ void mark_page_accessed(struct page *page)
} else if (!PageReferenced(page)) {
SetPageReferenced(page);
}
+ if (page_is_idle(page))
+ clear_page_idle(page);
}
EXPORT_SYMBOL(mark_page_accessed);

Vladimir Davydov
Jul 19, 2015, 8:32:48 AM
As noted by Minchan, a benefit of exporting the idle flag via
/proc/kpageflags is that one can easily filter out dirty and/or
unevictable pages while estimating the size of unused memory.

Note that idle flag read from /proc/kpageflags may be stale in case the
page was accessed via a PTE, because it would be too costly to iterate
over all page mappings on each /proc/kpageflags read to provide an
up-to-date value. To make sure the flag is up-to-date one has to read
/proc/kpageidle first.

Signed-off-by: Vladimir Davydov <vdav...@parallels.com>
Reviewed-by: Andres Lagar-Cavilla <andr...@google.com>
---
 Documentation/vm/pagemap.txt           | 6 ++++++
 fs/proc/page.c                         | 3 +++
 include/uapi/linux/kernel-page-flags.h | 1 +
 3 files changed, 10 insertions(+)

diff --git a/Documentation/vm/pagemap.txt b/Documentation/vm/pagemap.txt
index 34fe828c3007..538735465693 100644
--- a/Documentation/vm/pagemap.txt
+++ b/Documentation/vm/pagemap.txt
@@ -65,6 +65,7 @@ There are five components to pagemap:
22. THP
23. BALLOON
24. ZERO_PAGE
+ 25. IDLE

* /proc/kpagecgroup. This file contains a 64-bit inode number of the
memory cgroup each page is charged to, indexed by PFN. Only available when
@@ -125,6 +126,11 @@ Short descriptions to the page flags:
24. ZERO_PAGE
zero page for pfn_zero or huge_zero page

+25. IDLE
+ page has not been accessed since it was marked idle (see /proc/kpageidle)
+ Note that this flag may be stale in case the page was accessed via a PTE.
+ To make sure the flag is up-to-date one has to read /proc/kpageidle first.
+
[IO related page flags]
1. ERROR IO error occurred
3. UPTODATE page has up-to-date data
diff --git a/fs/proc/page.c b/fs/proc/page.c
index 273537885ab4..13dcb823fe4e 100644
--- a/fs/proc/page.c
+++ b/fs/proc/page.c
@@ -150,6 +150,9 @@ u64 stable_page_flags(struct page *page)
if (PageBalloon(page))
u |= 1 << KPF_BALLOON;

+ if (page_is_idle(page))
+ u |= 1 << KPF_IDLE;
+
u |= kpf_copy_bit(k, KPF_LOCKED, PG_locked);

u |= kpf_copy_bit(k, KPF_SLAB, PG_slab);
diff --git a/include/uapi/linux/kernel-page-flags.h b/include/uapi/linux/kernel-page-flags.h
index a6c4962e5d46..5da5f8751ce7 100644
--- a/include/uapi/linux/kernel-page-flags.h
+++ b/include/uapi/linux/kernel-page-flags.h
@@ -33,6 +33,7 @@
#define KPF_THP 22
#define KPF_BALLOON 23
#define KPF_ZERO_PAGE 24
+#define KPF_IDLE 25


#endif /* _UAPILINUX_KERNEL_PAGE_FLAGS_H */

Vladimir Davydov
Jul 19, 2015, 8:32:57 AM
Reading/writing a /proc/kpage* file may take a long time on machines
with a lot of RAM installed.

Suggested-by: Andres Lagar-Cavilla <andr...@google.com>
Signed-off-by: Vladimir Davydov <vdav...@parallels.com>
---
fs/proc/page.c | 8 ++++++++
1 file changed, 8 insertions(+)

diff --git a/fs/proc/page.c b/fs/proc/page.c
index 13dcb823fe4e..7ff7cba8617b 100644
--- a/fs/proc/page.c
+++ b/fs/proc/page.c
@@ -58,6 +58,8 @@ static ssize_t kpagecount_read(struct file *file, char __user *buf,
pfn++;
out++;
count -= KPMSIZE;
+
+ cond_resched();
}

*ppos += (char __user *)out - buf;
@@ -219,6 +221,8 @@ static ssize_t kpageflags_read(struct file *file, char __user *buf,
pfn++;
out++;
count -= KPMSIZE;
+
+ cond_resched();
}

*ppos += (char __user *)out - buf;
@@ -267,6 +271,8 @@ static ssize_t kpagecgroup_read(struct file *file, char __user *buf,
pfn++;
out++;
count -= KPMSIZE;
+
+ cond_resched();
}

*ppos += (char __user *)out - buf;
@@ -421,6 +427,7 @@ static ssize_t kpageidle_read(struct file *file, char __user *buf,
idle_bitmap = 0;
out++;
}
+ cond_resched();
}

*ppos += (char __user *)out - buf;
@@ -467,6 +474,7 @@ static ssize_t kpageidle_write(struct file *file, const char __user *buf,
put_page(page);
}
}
+ cond_resched();
}

*ppos += (const char __user *)in - buf;

Vladimir Davydov
Jul 19, 2015, 8:37:29 AM
FWIW here are idle memory stats obtained during the SPECjvm2008 run:

time   total     idle  idle%  testcase

 1 m   179 MB     0 MB   0%
 2 m  1770 MB    48 MB   2%
 3 m  1777 MB   173 MB   9%  compiler.compiler warmup
 4 m  1750 MB   152 MB   8%  compiler.compiler warmup
 5 m  1751 MB   202 MB  11%  compiler.compiler
 6 m  1754 MB   252 MB  14%  compiler.compiler
 7 m  1754 MB   225 MB  12%  compiler.compiler
 8 m  1748 MB   126 MB   7%  compiler.compiler
 9 m  1752 MB   175 MB  10%  compiler.sunflow warmup
10 m  1760 MB   168 MB   9%  compiler.sunflow warmup
11 m  1759 MB   210 MB  11%  compiler.sunflow
12 m  1762 MB   232 MB  13%  compiler.sunflow
13 m  1761 MB   207 MB  11%  compiler.sunflow
14 m  1775 MB   139 MB   7%  compiler.sunflow
15 m  1775 MB   370 MB  20%  compress warmup
16 m  1773 MB   515 MB  29%  compress warmup
17 m  1770 MB   514 MB  29%  compress
18 m  1761 MB   465 MB  26%  compress
19 m  1750 MB   433 MB  24%  compress
20 m  1772 MB   339 MB  19%  compress
21 m  1794 MB   307 MB  17%  crypto.aes warmup
22 m  1796 MB   325 MB  18%  crypto.aes warmup
23 m  1798 MB   341 MB  19%  crypto.aes
24 m  1798 MB   333 MB  18%  crypto.aes
25 m  1797 MB   332 MB  18%  crypto.aes
26 m  1798 MB   328 MB  18%  crypto.aes
27 m  1798 MB   370 MB  20%  crypto.rsa warmup
28 m  1793 MB   377 MB  21%  crypto.rsa warmup
29 m  1786 MB   363 MB  20%  crypto.rsa
30 m  1782 MB   360 MB  20%  crypto.rsa
31 m  1781 MB   344 MB  19%  crypto.rsa
32 m  1799 MB   328 MB  18%  crypto.rsa
33 m  1799 MB   326 MB  18%  crypto.signverify warmup
34 m  1799 MB   327 MB  18%  crypto.signverify warmup
35 m  1799 MB   334 MB  18%  crypto.signverify
36 m  1800 MB   339 MB  18%  crypto.signverify
37 m  1800 MB   339 MB  18%  crypto.signverify
38 m  1843 MB   323 MB  17%  crypto.signverify
39 m  1903 MB   223 MB  11%
40 m  1951 MB   225 MB  11%
41 m  2498 MB   253 MB  10%
42 m  2561 MB   494 MB  19%  derby warmup
43 m  2565 MB   527 MB  20%  derby warmup
44 m  2577 MB   574 MB  22%  derby
45 m  2621 MB   580 MB  22%  derby
46 m  2641 MB   536 MB  20%  derby
47 m  2256 MB   316 MB  14%  derby
48 m  2244 MB   427 MB  19%  mpegaudio warmup
49 m  2225 MB   781 MB  35%  mpegaudio warmup
50 m  2179 MB  1143 MB  52%  mpegaudio
51 m  2067 MB  1297 MB  62%  mpegaudio
52 m  1976 MB  1186 MB  60%  mpegaudio
53 m  2756 MB  1118 MB  40%  mpegaudio
54 m  3810 MB  1831 MB  48%  scimark.fft.large warmup
55 m  3252 MB  1108 MB  34%  scimark.fft.large warmup
56 m  2550 MB  1271 MB  49%  scimark.fft.large
57 m  3835 MB  1643 MB  42%  scimark.fft.large
58 m  3067 MB  1138 MB  37%  scimark.fft.large
58 m 3067 MB 1138 MB 37% scimark.fft.large
59 m 2072 MB 1103 MB 53% scimark.fft.large
60 m 2183 MB 799 MB 36% scimark.fft.large
61 m 2159 MB 568 MB 26% scimark.lu.large warmup
62 m 2333 MB 320 MB 13% scimark.lu.large warmup
63 m 2411 MB 447 MB 18% scimark.lu.large warmup
64 m 2646 MB 345 MB 13% scimark.lu.large
65 m 2687 MB 499 MB 18% scimark.lu.large
66 m 2691 MB 459 MB 17% scimark.lu.large
67 m 2703 MB 641 MB 23% scimark.lu.large
68 m 2735 MB 1077 MB 39% scimark.lu.large
69 m 2735 MB 2310 MB 84% scimark.sor.large warmup
70 m 2735 MB 1704 MB 62% scimark.sor.large warmup
71 m 2735 MB 2034 MB 74% scimark.sor.large
72 m 2735 MB 2390 MB 87% scimark.sor.large
73 m 2735 MB 2417 MB 88% scimark.sor.large
74 m 2735 MB 1366 MB 49% scimark.sor.large
75 m 2735 MB 985 MB 36% scimark.sparse.large warmup
76 m 2759 MB 925 MB 33% scimark.sparse.large warmup
77 m 2759 MB 1192 MB 43% scimark.sparse.large
78 m 2703 MB 1120 MB 41% scimark.sparse.large
79 m 2679 MB 1035 MB 38% scimark.sparse.large
80 m 2679 MB 1069 MB 39% scimark.sparse.large
81 m 2162 MB 863 MB 39% scimark.sparse.large
82 m 2109 MB 677 MB 32% scimark.fft.small warmup
83 m 2172 MB 637 MB 29% scimark.fft.small warmup
84 m 2220 MB 655 MB 29% scimark.fft.small
85 m 2264 MB 658 MB 29% scimark.fft.small
86 m 2316 MB 656 MB 28% scimark.fft.small
87 m 2529 MB 630 MB 24% scimark.fft.small
88 m 2840 MB 645 MB 22% scimark.lu.small warmup
89 m 2983 MB 652 MB 21% scimark.lu.small warmup
90 m 2983 MB 652 MB 21% scimark.lu.small
91 m 2983 MB 651 MB 21% scimark.lu.small
92 m 2984 MB 651 MB 21% scimark.lu.small
93 m 2984 MB 652 MB 21% scimark.lu.small
94 m 2984 MB 2114 MB 70% scimark.sor.small warmup
95 m 2984 MB 2796 MB 93% scimark.sor.small warmup
96 m 2984 MB 2823 MB 94% scimark.sor.small
97 m 2984 MB 2848 MB 95% scimark.sor.small
98 m 2984 MB 2817 MB 94% scimark.sor.small
99 m 2984 MB 1366 MB 45% scimark.sor.small
100 m 2984 MB 664 MB 22% scimark.sparse.small warmup
101 m 2984 MB 654 MB 21% scimark.sparse.small warmup
102 m 2983 MB 663 MB 22% scimark.sparse.small
103 m 2983 MB 652 MB 21% scimark.sparse.small
104 m 2982 MB 651 MB 21% scimark.sparse.small
105 m 2981 MB 640 MB 21% scimark.sparse.small
106 m 2981 MB 2113 MB 70% scimark.monte_carlo warmup
107 m 2981 MB 2831 MB 94% scimark.monte_carlo warmup
108 m 2981 MB 2835 MB 95% scimark.monte_carlo
109 m 2981 MB 2863 MB 96% scimark.monte_carlo
110 m 2981 MB 2872 MB 96% scimark.monte_carlo
111 m 2881 MB 1179 MB 40% scimark.monte_carlo
112 m 2880 MB 777 MB 26% serial warmup
113 m 2882 MB 1063 MB 36% serial warmup
114 m 2880 MB 1066 MB 37% serial
115 m 2880 MB 1064 MB 36% serial
116 m 2882 MB 1064 MB 36% serial
117 m 2887 MB 1042 MB 36% serial
118 m 2886 MB 1118 MB 38% sunflow warmup
119 m 2887 MB 1161 MB 40% sunflow warmup
120 m 2887 MB 1166 MB 40% sunflow
121 m 2887 MB 1170 MB 40% sunflow
122 m 2886 MB 1172 MB 40% sunflow
123 m 2896 MB 1159 MB 40% sunflow
124 m 2906 MB 1132 MB 38% xml.transform warmup
125 m 2907 MB 1136 MB 39% xml.transform warmup
126 m 2907 MB 1137 MB 39% xml.transform
127 m 2907 MB 1137 MB 39% xml.transform
128 m 2907 MB 1134 MB 39% xml.transform
129 m 2907 MB 1120 MB 38% xml.transform
130 m 2895 MB 917 MB 31% xml.validation warmup
131 m 2894 MB 706 MB 24% xml.validation warmup
132 m 2903 MB 529 MB 18% xml.validation
133 m 2907 MB 883 MB 30% xml.validation
134 m 2894 MB 1013 MB 35% xml.validation
135 m 2907 MB 853 MB 29% xml.validation

Vladimir Davydov

Jul 21, 2015, 4:51:40 AM
to Andres Lagar-Cavilla, Andrew Morton, Minchan Kim, Raghavendra K T, Johannes Weiner, Michal Hocko, Greg Thelen, Michel Lespinasse, David Rientjes, Pavel Emelyanov, Cyrill Gorcunov, Jonathan Corbet, linu...@vger.kernel.org, linu...@vger.kernel.org, linu...@kvack.org, cgr...@vger.kernel.org, linux-...@vger.kernel.org
On Mon, Jul 20, 2015 at 11:34:21AM -0700, Andres Lagar-Cavilla wrote:
> On Sun, Jul 19, 2015 at 5:31 AM, Vladimir Davydov <vdav...@parallels.com>
[...]
> > +static int kvm_mmu_notifier_clear_young(struct mmu_notifier *mn,
> > + struct mm_struct *mm,
> > + unsigned long start,
> > + unsigned long end)
> > +{
> > + struct kvm *kvm = mmu_notifier_to_kvm(mn);
> > + int young, idx;
> > +
> >
> If you need to cut out another version please add comments as to the two
> issues raised:
> - This doesn't proactively flush TLBs -- not obvious if it should.
> - This adversely affects performance in Pre_haswell Intel EPT.

Oops, I stopped reading your e-mail in reply to the previous version of
this patch as soon as I saw the Reviewed-by tag, so I missed your
request for the comment, sorry about that.

Here it goes (incremental):
---
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index ff4173ce6924..e69a5cb99571 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -397,6 +397,19 @@ static int kvm_mmu_notifier_clear_young(struct mmu_notifier *mn,

idx = srcu_read_lock(&kvm->srcu);
spin_lock(&kvm->mmu_lock);
+ /*
+ * Even though we do not flush TLB, this will still adversely
+ * affect performance on pre-Haswell Intel EPT, where there is
+ * no EPT Access Bit to clear so that we have to tear down EPT
+ * tables instead. If we find this unacceptable, we can always
+ * add a parameter to kvm_age_hva so that it effectively doesn't
+ * do anything on clear_young.
+ *
+ * Also note that currently we never issue secondary TLB flushes
+ * from clear_young, leaving this job up to the regular system
+ * cadence. If we find this inaccurate, we might come up with a
+ * more sophisticated heuristic later.
+ */
young = kvm_age_hva(kvm, start, end);
spin_unlock(&kvm->mmu_lock);
srcu_read_unlock(&kvm->srcu, idx);

Andrew Morton

Jul 21, 2015, 7:34:19 PM
to Vladimir Davydov, Andres Lagar-Cavilla, Minchan Kim, Raghavendra K T, Johannes Weiner, Michal Hocko, Greg Thelen, Michel Lespinasse, David Rientjes, Pavel Emelyanov, Cyrill Gorcunov, Jonathan Corbet, linu...@vger.kernel.org, linu...@vger.kernel.org, linu...@kvack.org, cgr...@vger.kernel.org, linux-...@vger.kernel.org, Kees Cook
What are the bit mappings? If I read the first byte of /proc/kpageidle
I get PFN #0 in bit zero of that byte? And the second byte of
/proc/kpageidle contains PFN #8 in its LSB, etc?

Maybe this is covered in the documentation file.

> When the bit is set, the corresponding page is
> idle. A page is considered idle if it has not been accessed since it was
> marked idle.

Perhaps we can spell out in some detail what "accessed" means? I see
you've hooked into mark_page_accessed(), so a read from disk is an
access. What about a write to disk? And what about a page being
accessed from some random device (could hook into get_user_pages()?) Is
getting written to swap an access? When a dirty pagecache page is
written out by kswapd or direct reclaim?

This also should be in the permanent documentation.

> To mark a page idle one should set the bit corresponding to the
> page by writing to the file. A value written to the file is OR-ed with the
> current bitmap value. Only user memory pages can be marked idle, for other
> page types input is silently ignored. Writing to this file beyond max PFN
> results in the ENXIO error. Only available when CONFIG_IDLE_PAGE_TRACKING is
> set.
>
> This file can be used to estimate the amount of pages that are not
> used by a particular workload as follows:
>
> 1. mark all pages of interest idle by setting corresponding bits in the
> /proc/kpageidle bitmap
> 2. wait until the workload accesses its working set
> 3. read /proc/kpageidle and count the number of bits set

Security implications. This interface could be used to learn about a
sensitive application by poking data at it and then observing its
memory access patterns. Perhaps this is why the proc files are
root-only (which I assume is sufficient). Some words here about the
security side of things and the reasoning behind the chosen permissions
would be good to have.

> * /proc/kpagecgroup. This file contains a 64-bit inode number of the
> memory cgroup each page is charged to, indexed by PFN.

Actually "closest online ancestor". This also should be in the
interface documentation.

> Only available when CONFIG_MEMCG is set.

CONFIG_MEMCG and CONFIG_IDLE_PAGE_TRACKING I assume?

>
> This file can be used to find all pages (including unmapped file
> pages) accounted to a particular cgroup. Using /proc/kpageidle, one
> can then estimate the cgroup working set size.
>
> For an example of using these files for estimating the amount of unused
> memory pages per each memory cgroup, please see the script attached
> below.

Why were these put in /proc anyway? Rather than under /sys/fs/cgroup
somewhere? Presumably because /proc/kpageidle is useful in non-memcg
setups.

> ---- PERFORMANCE EVALUATION ----

"^___" means "end of changelog". Perhaps that should have been
"^---\n" - unclear.

> Documentation/vm/pagemap.txt | 22 ++-

I think we'll need quite a lot more than this to fully describe the
interface?

Andrew Morton

Jul 21, 2015, 7:34:27 PM
to Vladimir Davydov, Andres Lagar-Cavilla, Minchan Kim, Raghavendra K T, Johannes Weiner, Michal Hocko, Greg Thelen, Michel Lespinasse, David Rientjes, Pavel Emelyanov, Cyrill Gorcunov, Jonathan Corbet, linu...@vger.kernel.org, linu...@vger.kernel.org, linu...@kvack.org, cgr...@vger.kernel.org, linux-...@vger.kernel.org
On Sun, 19 Jul 2015 15:31:10 +0300 Vladimir Davydov <vdav...@parallels.com> wrote:

> This function returns the inode number of the closest online ancestor of
> the memory cgroup a page is charged to. It is required for exporting
> information about which page is charged to which cgroup to userspace,
> which will be introduced by a following patch.
>
> ...
>

> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -441,6 +441,29 @@ struct cgroup_subsys_state *mem_cgroup_css_from_page(struct page *page)
> return &memcg->css;
> }
>
> +/**
> + * page_cgroup_ino - return inode number of the memcg a page is charged to
> + * @page: the page
> + *
> + * Look up the closest online ancestor of the memory cgroup @page is charged to
> + * and return its inode number or 0 if @page is not charged to any cgroup. It
> + * is safe to call this function without holding a reference to @page.
> + */
> +unsigned long page_cgroup_ino(struct page *page)

Shouldn't it return an ino_t?

> +{
> + struct mem_cgroup *memcg;
> + unsigned long ino = 0;
> +
> + rcu_read_lock();
> + memcg = READ_ONCE(page->mem_cgroup);
> + while (memcg && !(memcg->css.flags & CSS_ONLINE))
> + memcg = parent_mem_cgroup(memcg);
> + if (memcg)
> + ino = cgroup_ino(memcg->css.cgroup);
> + rcu_read_unlock();
> + return ino;
> +}

The function is racy, isn't it? There's nothing to prevent this inode
from getting torn down and potentially reallocated one nanosecond after
page_cgroup_ino() returns? If so, it is only safely usable by things
which don't care (such as procfs interfaces) and this should be
documented in some fashion.

Andrew Morton

Jul 21, 2015, 7:34:39 PM
to Vladimir Davydov, Andres Lagar-Cavilla, Minchan Kim, Raghavendra K T, Johannes Weiner, Michal Hocko, Greg Thelen, Michel Lespinasse, David Rientjes, Pavel Emelyanov, Cyrill Gorcunov, Jonathan Corbet, linu...@vger.kernel.org, linu...@vger.kernel.org, linu...@kvack.org, cgr...@vger.kernel.org, linux-...@vger.kernel.org
On Sun, 19 Jul 2015 15:31:11 +0300 Vladimir Davydov <vdav...@parallels.com> wrote:

> Hwpoison allows to filter pages by memory cgroup ino. Currently, it
> calls try_get_mem_cgroup_from_page to obtain the cgroup from a page and
> then its ino using cgroup_ino, but now we have an apter method for that,
> page_cgroup_ino, so use it instead.

I assume "an apter" was supposed to be "a helper"?

> --- a/mm/hwpoison-inject.c
> +++ b/mm/hwpoison-inject.c
> @@ -45,12 +45,9 @@ static int hwpoison_inject(void *data, u64 val)
> /*
> * do a racy check with elevated page count, to make sure PG_hwpoison
> * will only be set for the targeted owner (or on a free page).
> - * We temporarily take page lock for try_get_mem_cgroup_from_page().
> * memory_failure() will redo the check reliably inside page lock.
> */
> - lock_page(hpage);
> err = hwpoison_filter(hpage);
> - unlock_page(hpage);
> if (err)
> goto put_out;
>
> @@ -126,7 +123,7 @@ static int pfn_inject_init(void)
> if (!dentry)
> goto fail;
>
> -#ifdef CONFIG_MEMCG_SWAP
> +#ifdef CONFIG_MEMCG
> dentry = debugfs_create_u64("corrupt-filter-memcg", 0600,
> hwpoison_dir, &hwpoison_filter_memcg);
> if (!dentry)

Confused. We're changing the conditions under which this debugfs file
is created. Is this a typo or some unchangelogged thing or what?

Andrew Morton

Jul 21, 2015, 7:34:45 PM
to Vladimir Davydov, Andres Lagar-Cavilla, Minchan Kim, Raghavendra K T, Johannes Weiner, Michal Hocko, Greg Thelen, Michel Lespinasse, David Rientjes, Pavel Emelyanov, Cyrill Gorcunov, Jonathan Corbet, linu...@vger.kernel.org, linu...@vger.kernel.org, linu...@kvack.org, cgr...@vger.kernel.org, linux-...@vger.kernel.org
On Sun, 19 Jul 2015 15:31:13 +0300 Vladimir Davydov <vdav...@parallels.com> wrote:

> /proc/kpagecgroup contains a 64-bit inode number of the memory cgroup
> each page is charged to, indexed by PFN. Having this information is
> useful for estimating a cgroup working set size.
>
> The file is present if CONFIG_PROC_PAGE_MONITOR && CONFIG_MEMCG.
>
> ...
>
> @@ -225,10 +226,62 @@ static const struct file_operations proc_kpageflags_operations = {
> .read = kpageflags_read,
> };
>
> +#ifdef CONFIG_MEMCG
> +static ssize_t kpagecgroup_read(struct file *file, char __user *buf,
> + size_t count, loff_t *ppos)
> +{
> + u64 __user *out = (u64 __user *)buf;
> + struct page *ppage;
> + unsigned long src = *ppos;
> + unsigned long pfn;
> + ssize_t ret = 0;
> + u64 ino;
> +
> + pfn = src / KPMSIZE;
> + count = min_t(unsigned long, count, (max_pfn * KPMSIZE) - src);
> + if (src & KPMMASK || count & KPMMASK)
> + return -EINVAL;

The user-facing documentation should explain that reads must be
performed in multiple-of-8 sizes.

> + while (count > 0) {
> + if (pfn_valid(pfn))
> + ppage = pfn_to_page(pfn);
> + else
> + ppage = NULL;
> +
> + if (ppage)
> + ino = page_cgroup_ino(ppage);
> + else
> + ino = 0;
> +
> + if (put_user(ino, out)) {
> + ret = -EFAULT;

Here we do the usual procfs violation of read() behaviour. read()
normally only returns an error if it read nothing. This code will
transfer a megabyte then return -EFAULT so userspace doesn't know that
it got that megabyte.

That's easy to fix, but procfs files do this all over the place anyway :(
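A userspace consumer can at least distinguish "got some data" from "got
nothing": treat any positive return as progress and only give up when a
read transfers nothing at all. A hedged sketch of such a loop (generic
POSIX, not tied to procfs; the helper name is invented):

```c
#include <errno.h>
#include <stddef.h>
#include <unistd.h>

/* Read up to `len` bytes, retrying short reads; returns bytes read,
 * or -1 only if an error occurred before anything was transferred. */
static ssize_t read_full(int fd, void *buf, size_t len)
{
	size_t done = 0;

	while (done < len) {
		ssize_t n = read(fd, (char *)buf + done, len - done);

		if (n > 0)
			done += (size_t)n;
		else if (n == 0)
			break;			/* EOF */
		else if (errno == EINTR)
			continue;		/* interrupted, retry */
		else
			return done ? (ssize_t)done : -1;
	}
	return (ssize_t)done;
}
```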

> + break;
> + }
> +
> + pfn++;
> + out++;
> + count -= KPMSIZE;
> + }
> +
> + *ppos += (char __user *)out - buf;
> + if (!ret)
> + ret = (char __user *)out - buf;
> + return ret;
> +}
> +

Andrew Morton

Jul 21, 2015, 7:35:13 PM
to Vladimir Davydov, Andres Lagar-Cavilla, Minchan Kim, Raghavendra K T, Johannes Weiner, Michal Hocko, Greg Thelen, Michel Lespinasse, David Rientjes, Pavel Emelyanov, Cyrill Gorcunov, Jonathan Corbet, linu...@vger.kernel.org, linu...@vger.kernel.org, linu...@kvack.org, cgr...@vger.kernel.org, linux-...@vger.kernel.org
On Sun, 19 Jul 2015 15:31:16 +0300 Vladimir Davydov <vdav...@parallels.com> wrote:

> As noted by Minchan, a benefit of reading idle flag from
> /proc/kpageflags is that one can easily filter dirty and/or unevictable
> pages while estimating the size of unused memory.
>
> Note that idle flag read from /proc/kpageflags may be stale in case the
> page was accessed via a PTE, because it would be too costly to iterate
> over all page mappings on each /proc/kpageflags read to provide an
> up-to-date value. To make sure the flag is up-to-date one has to read
> /proc/kpageidle first.

Is there any value in teaching the regular old page scanner to update
these flags? If it's doing an rmap scan anyway...

Andrew Morton

Jul 21, 2015, 7:35:46 PM
to Vladimir Davydov, Andres Lagar-Cavilla, Minchan Kim, Raghavendra K T, Johannes Weiner, Michal Hocko, Greg Thelen, Michel Lespinasse, David Rientjes, Pavel Emelyanov, Cyrill Gorcunov, Jonathan Corbet, linu...@vger.kernel.org, linu...@vger.kernel.org, linu...@kvack.org, cgr...@vger.kernel.org, linux-...@vger.kernel.org
> ...
>
>
> ...
>
> +static void kpageidle_clear_pte_refs(struct page *page)
> +{
> + struct rmap_walk_control rwc = {
> + .rmap_one = kpageidle_clear_pte_refs_one,
> + .anon_lock = page_lock_anon_vma_read,
> + };

I think this can be static const, since `arg' is unused? That would
save some cycles and stack.

> + bool need_lock;
> +
> + if (!page_mapped(page) ||
> + !page_rmapping(page))
> + return;
> +
> + need_lock = !PageAnon(page) || PageKsm(page);
> + if (need_lock && !trylock_page(page))

Oh. So the feature is a bit unreliable.

I'm not immediately seeing anything which would prevent us from using
plain old lock_page() here. What's going on?

> + return;
> +
> + rmap_walk(page, &rwc);
> +
> + if (need_lock)
> + unlock_page(page);
> +}
> +
> +static ssize_t kpageidle_read(struct file *file, char __user *buf,
> + size_t count, loff_t *ppos)
> +{
> + u64 __user *out = (u64 __user *)buf;
> + struct page *page;
> + unsigned long pfn, end_pfn;
> + ssize_t ret = 0;
> + u64 idle_bitmap = 0;
> + int bit;
> +
> + if (*ppos & KPMMASK || count & KPMMASK)
> + return -EINVAL;

Interface requires 8-byte aligned offset and size.

> + pfn = *ppos * BITS_PER_BYTE;
> + if (pfn >= max_pfn)
> + return 0;
> +
> + end_pfn = pfn + count * BITS_PER_BYTE;
> + if (end_pfn > max_pfn)
> + end_pfn = ALIGN(max_pfn, KPMBITS);

So we lose up to 63 pages. Presumably max_pfn is well enough aligned
for this to not matter, dunno.

> + for (; pfn < end_pfn; pfn++) {
> + bit = pfn % KPMBITS;
> + page = kpageidle_get_page(pfn);
> + if (page) {
> + if (page_is_idle(page)) {
> + /*
> + * The page might have been referenced via a
> + * pte, in which case it is not idle. Clear
> + * refs and recheck.
> + */
> + kpageidle_clear_pte_refs(page);
> + if (page_is_idle(page))
> + idle_bitmap |= 1ULL << bit;

I don't understand what's going on here. More details, please?
Hate it when I have to go look up a C precedence table. This

	if (idle_bitmap >> bit & 1) {

is clearer as

	if ((idle_bitmap >> bit) & 1) {

> + page = kpageidle_get_page(pfn);
> + if (page) {
> + kpageidle_clear_pte_refs(page);
> + set_page_idle(page);
> + put_page(page);
> + }
> + }
> + }
> +
> + *ppos += (const char __user *)in - buf;
> + if (!ret)
> + ret = (const char __user *)in - buf;
> + return ret;
> +}
> +
>
> ...

Vladimir Davydov

Jul 22, 2015, 5:21:36 AM
to Andrew Morton, Andres Lagar-Cavilla, Minchan Kim, Raghavendra K T, Johannes Weiner, Michal Hocko, Greg Thelen, Michel Lespinasse, David Rientjes, Pavel Emelyanov, Cyrill Gorcunov, Jonathan Corbet, linu...@vger.kernel.org, linu...@vger.kernel.org, linu...@kvack.org, cgr...@vger.kernel.org, linux-...@vger.kernel.org
On Tue, Jul 21, 2015 at 04:34:07PM -0700, Andrew Morton wrote:
> On Sun, 19 Jul 2015 15:31:10 +0300 Vladimir Davydov <vdav...@parallels.com> wrote:
>
> > This function returns the inode number of the closest online ancestor of
> > the memory cgroup a page is charged to. It is required for exporting
> > information about which page is charged to which cgroup to userspace,
> > which will be introduced by a following patch.
> >
> > ...
> >
>
> > --- a/mm/memcontrol.c
> > +++ b/mm/memcontrol.c
> > @@ -441,6 +441,29 @@ struct cgroup_subsys_state *mem_cgroup_css_from_page(struct page *page)
> > return &memcg->css;
> > }
> >
> > +/**
> > + * page_cgroup_ino - return inode number of the memcg a page is charged to
> > + * @page: the page
> > + *
> > + * Look up the closest online ancestor of the memory cgroup @page is charged to
> > + * and return its inode number or 0 if @page is not charged to any cgroup. It
> > + * is safe to call this function without holding a reference to @page.
> > + */
> > +unsigned long page_cgroup_ino(struct page *page)
>
> Shouldn't it return an ino_t?

Yep, thanks.

>
> > +{
> > + struct mem_cgroup *memcg;
> > + unsigned long ino = 0;
> > +
> > + rcu_read_lock();
> > + memcg = READ_ONCE(page->mem_cgroup);
> > + while (memcg && !(memcg->css.flags & CSS_ONLINE))
> > + memcg = parent_mem_cgroup(memcg);
> > + if (memcg)
> > + ino = cgroup_ino(memcg->css.cgroup);
> > + rcu_read_unlock();
> > + return ino;
> > +}
>
> The function is racy, isn't it? There's nothing to prevent this inode
> from getting torn down and potentially reallocated one nanosecond after
> page_cgroup_ino() returns? If so, it is only safely usable by things
> which don't care (such as procfs interfaces) and this should be
> documented in some fashion.

Agree. Here goes the incremental patch:
---
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index d644aadfdd0d..ad800e62cb7a 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -343,7 +343,7 @@ static inline bool mm_match_cgroup(struct mm_struct *mm,
}

struct cgroup_subsys_state *mem_cgroup_css_from_page(struct page *page);
-unsigned long page_cgroup_ino(struct page *page);
+ino_t page_cgroup_ino(struct page *page);

static inline bool mem_cgroup_disabled(void)
{
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index b9c76a0906f9..bd30638c2a95 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -448,8 +448,13 @@ struct cgroup_subsys_state *mem_cgroup_css_from_page(struct page *page)
* Look up the closest online ancestor of the memory cgroup @page is charged to
* and return its inode number or 0 if @page is not charged to any cgroup. It
* is safe to call this function without holding a reference to @page.
+ *
+ * Note, this function is inherently racy, because there is nothing to prevent
+ * the cgroup inode from getting torn down and potentially reallocated a moment
+ * after page_cgroup_ino() returns, so it only should be used by callers that
+ * do not care (such as procfs interfaces).
*/
-unsigned long page_cgroup_ino(struct page *page)
+ino_t page_cgroup_ino(struct page *page)
{
struct mem_cgroup *memcg;
unsigned long ino = 0;

Vladimir Davydov

Jul 22, 2015, 5:45:26 AM
to Andrew Morton, Andres Lagar-Cavilla, Minchan Kim, Raghavendra K T, Johannes Weiner, Michal Hocko, Greg Thelen, Michel Lespinasse, David Rientjes, Pavel Emelyanov, Cyrill Gorcunov, Jonathan Corbet, linu...@vger.kernel.org, linu...@vger.kernel.org, linu...@kvack.org, cgr...@vger.kernel.org, linux-...@vger.kernel.org
On Tue, Jul 21, 2015 at 04:34:12PM -0700, Andrew Morton wrote:
> On Sun, 19 Jul 2015 15:31:11 +0300 Vladimir Davydov <vdav...@parallels.com> wrote:
>
> > Hwpoison allows to filter pages by memory cgroup ino. Currently, it
> > calls try_get_mem_cgroup_from_page to obtain the cgroup from a page and
> > then its ino using cgroup_ino, but now we have an apter method for that,
> > page_cgroup_ino, so use it instead.
>
> I assume "an apter" was supposed to be "a helper"?

Yes, sounds better :-)

>
> > --- a/mm/hwpoison-inject.c
> > +++ b/mm/hwpoison-inject.c
> > @@ -45,12 +45,9 @@ static int hwpoison_inject(void *data, u64 val)
> > /*
> > * do a racy check with elevated page count, to make sure PG_hwpoison
> > * will only be set for the targeted owner (or on a free page).
> > - * We temporarily take page lock for try_get_mem_cgroup_from_page().
> > * memory_failure() will redo the check reliably inside page lock.
> > */
> > - lock_page(hpage);
> > err = hwpoison_filter(hpage);
> > - unlock_page(hpage);
> > if (err)
> > goto put_out;
> >
> > @@ -126,7 +123,7 @@ static int pfn_inject_init(void)
> > if (!dentry)
> > goto fail;
> >
> > -#ifdef CONFIG_MEMCG_SWAP
> > +#ifdef CONFIG_MEMCG
> > dentry = debugfs_create_u64("corrupt-filter-memcg", 0600,
> > hwpoison_dir, &hwpoison_filter_memcg);
> > if (!dentry)
>
> Confused. We're changing the conditions under which this debugfs file
> is created. Is this a typo or some unchangelogged thing or what?

This is an unchangelogged cleanup. In fact, there had been a comment
regarding it before v6, but then it got lost. Sorry about that. The
commit message should look like this:

"""
Hwpoison allows to filter pages by memory cgroup ino. Currently, it
calls try_get_mem_cgroup_from_page to obtain the cgroup from a page and
then its ino using cgroup_ino, but now we have a helper method for that,
page_cgroup_ino, so use it instead.

This patch also loosens the hwpoison memcg filter dependency rules - it
makes it depend on CONFIG_MEMCG instead of CONFIG_MEMCG_SWAP, because
hwpoison memcg filter does not require anything (nor it used to) from
CONFIG_MEMCG_SWAP side.
"""

Or we can simply revert this cleanup if you don't like it:
---
diff --git a/mm/hwpoison-inject.c b/mm/hwpoison-inject.c
index 5015679014c1..1cd105ee5a7b 100644
--- a/mm/hwpoison-inject.c
+++ b/mm/hwpoison-inject.c
@@ -123,7 +123,7 @@ static int pfn_inject_init(void)
if (!dentry)
goto fail;

-#ifdef CONFIG_MEMCG
+#ifdef CONFIG_MEMCG_SWAP
dentry = debugfs_create_u64("corrupt-filter-memcg", 0600,
hwpoison_dir, &hwpoison_filter_memcg);
if (!dentry)
diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index 97005396a507..5ea7d8c760fa 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -130,7 +130,7 @@ static int hwpoison_filter_flags(struct page *p)
* can only guarantee that the page either belongs to the memcg tasks, or is
* a freed page.
*/
-#ifdef CONFIG_MEMCG
+#ifdef CONFIG_MEMCG_SWAP
u64 hwpoison_filter_memcg;
EXPORT_SYMBOL_GPL(hwpoison_filter_memcg);
static int hwpoison_filter_task(struct page *p)

Vladimir Davydov

Jul 22, 2015, 6:33:42 AM
to Andrew Morton, Andres Lagar-Cavilla, Minchan Kim, Raghavendra K T, Johannes Weiner, Michal Hocko, Greg Thelen, Michel Lespinasse, David Rientjes, Pavel Emelyanov, Cyrill Gorcunov, Jonathan Corbet, linu...@vger.kernel.org, linu...@vger.kernel.org, linu...@kvack.org, cgr...@vger.kernel.org, linux-...@vger.kernel.org
> The user-facing documentation should explain that reads must be
> performed in multiple-of-8 sizes.

It does. It's at the end of Documentation/vm/pagemap.txt:

: Other notes:
:
: Reading from any of the files will return -EINVAL if you are not starting
: the read on an 8-byte boundary (e.g., if you sought an odd number of bytes
: into the file), or if the size of the read is not a multiple of 8 bytes.

>
> > + while (count > 0) {
> > + if (pfn_valid(pfn))
> > + ppage = pfn_to_page(pfn);
> > + else
> > + ppage = NULL;
> > +
> > + if (ppage)
> > + ino = page_cgroup_ino(ppage);
> > + else
> > + ino = 0;
> > +
> > + if (put_user(ino, out)) {
> > + ret = -EFAULT;
>
> Here we do the usual procfs violation of read() behaviour. read()
> normally only returns an error if it read nothing. This code will
> transfer a megabyte then return -EFAULT so userspace doesn't know that
> it got that megabyte.

Yeah, that's how it works. I did it that way so that /proc/kpagecgroup
works exactly like /proc/kpageflags and /proc/kpagecount.

FWIW, the man page I have on my system already warns about this
peculiarity of read(2):

: On error, -1 is returned, and errno is set appropriately. In this
: case, it is left unspecified whether the file position (if any)
: changes.

Vladimir Davydov

Jul 22, 2015, 11:21:04 AM
to Andrew Morton, Andres Lagar-Cavilla, Minchan Kim, Raghavendra K T, Johannes Weiner, Michal Hocko, Greg Thelen, Michel Lespinasse, David Rientjes, Pavel Emelyanov, Cyrill Gorcunov, Jonathan Corbet, linu...@vger.kernel.org, linu...@vger.kernel.org, linu...@kvack.org, cgr...@vger.kernel.org, linux-...@vger.kernel.org
> I think this can be static const, since `arg' is unused? That would
> save some cycles and stack.

Good catch, thanks.

>
> > + bool need_lock;
> > +
> > + if (!page_mapped(page) ||
> > + !page_rmapping(page))
> > + return;
> > +
> > + need_lock = !PageAnon(page) || PageKsm(page);
> > + if (need_lock && !trylock_page(page))
>
> Oh. So the feature is a bit unreliable.
>
> I'm not immediately seeing anything which would prevent us from using
> plain old lock_page() here. What's going on?

A page may be locked for quite a long period of time, e.g.
truncate_inode_pages_range() may wait until a page writeback finishes
under the page lock. Instead of stalling kpageidle scan, we'd better
move on to the next page. Of course, the result won't be 100% accurate.
In fact, it isn't accurate anyway, because we skip isolated pages;
nor can it possibly be 100% accurate, because the scan itself is not
instant so that while we are performing it the system usage pattern
might change. This new API is only supposed to give a good estimate of
memory usage pattern, which could be used as a hint for adjusting the
system configuration to improve performance.
> So we lose up to 63 pages. Presumably max_pfn is well enough aligned
> for this to not matter, dunno.

ALIGN(x, a) resolves to ((x + a - 1) & ~(a - 1)), which is >= x, so we
shouldn't lose anything.
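For what it's worth, the arithmetic can be checked in isolation. A
sketch mirroring the read path's pfn-range computation (KPMSIZE/KPMBITS
follow the definitions in fs/proc/page.c; the helper name is made up):

```c
#include <stdint.h>

#define BITS_PER_BYTE	8
#define KPMSIZE		sizeof(uint64_t)
#define KPMBITS		(KPMSIZE * BITS_PER_BYTE)
#define ALIGN(x, a)	(((x) + (a) - 1) & ~((uint64_t)(a) - 1))

/* pfn range covered by a read of `count` bytes at file offset `ppos`,
 * clamped to max_pfn rounded up to a whole 64-bit word. */
static void kpageidle_range(uint64_t ppos, uint64_t count, uint64_t max_pfn,
			    uint64_t *pfn, uint64_t *end_pfn)
{
	*pfn = ppos * BITS_PER_BYTE;
	*end_pfn = *pfn + count * BITS_PER_BYTE;
	if (*end_pfn > max_pfn)
		*end_pfn = ALIGN(max_pfn, KPMBITS);
}
```

E.g. an 8-byte read at offset 0 covers pfns [0, 64); with max_pfn = 100,
a 16-byte read at offset 8 is clamped from pfn 192 down to 128, the next
64-pfn boundary above max_pfn.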

>
> > + for (; pfn < end_pfn; pfn++) {
> > + bit = pfn % KPMBITS;
> > + page = kpageidle_get_page(pfn);
> > + if (page) {
> > + if (page_is_idle(page)) {
> > + /*
> > + * The page might have been referenced via a
> > + * pte, in which case it is not idle. Clear
> > + * refs and recheck.
> > + */
> > + kpageidle_clear_pte_refs(page);
> > + if (page_is_idle(page))
> > + idle_bitmap |= 1ULL << bit;
>
> I don't understand what's going on here. More details, please?

The output is a bitmap, which is stored as an array of 8-byte elements,
where byte order within each word is native, i.e. if page at pfn #i is
idle we need to set bit #i%64 of element #i/64 of the array. I'll
reflect this in the documentation.
> Hate it when I have to go look up a C precedence table.

Fixed.

Here goes the incremental patch with all the fixes:
---
diff --git a/fs/proc/page.c b/fs/proc/page.c
index 7ff7cba8617b..9daa6e92450f 100644
--- a/fs/proc/page.c
+++ b/fs/proc/page.c
@@ -362,7 +362,11 @@ static int kpageidle_clear_pte_refs_one(struct page *page,

static void kpageidle_clear_pte_refs(struct page *page)
{
- struct rmap_walk_control rwc = {
+ /*
+ * Since rwc.arg is unused, rwc is effectively immutable, so we
+ * can make it static const to save some cycles and stack.
+ */
+ static const struct rmap_walk_control rwc = {
.rmap_one = kpageidle_clear_pte_refs_one,
.anon_lock = page_lock_anon_vma_read,
};
@@ -376,7 +380,7 @@ static void kpageidle_clear_pte_refs(struct page *page)
if (need_lock && !trylock_page(page))
return;

- rmap_walk(page, &rwc);
+ rmap_walk(page, (struct rmap_walk_control *)&rwc);

if (need_lock)
unlock_page(page);
@@ -466,7 +470,7 @@ static ssize_t kpageidle_write(struct file *file, const char __user *buf,
}
in++;
}
- if (idle_bitmap >> bit & 1) {
+ if ((idle_bitmap >> bit) & 1) {
page = kpageidle_get_page(pfn);
if (page) {
kpageidle_clear_pte_refs(page);

Vladimir Davydov

unread,
Jul 22, 2015, 12:24:30 PM7/22/15
to Andrew Morton, Andres Lagar-Cavilla, Minchan Kim, Raghavendra K T, Johannes Weiner, Michal Hocko, Greg Thelen, Michel Lespinasse, David Rientjes, Pavel Emelyanov, Cyrill Gorcunov, Jonathan Corbet, linu...@vger.kernel.org, linu...@vger.kernel.org, linu...@kvack.org, cgr...@vger.kernel.org, linux-...@vger.kernel.org, Kees Cook
The bit mapping is an array of u64 elements. Page at pfn #i corresponds
to bit #i%64 of element #i/64. Byte order is native.

Will add this to docs.
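Given that layout, counting idle pages in a buffer read from
/proc/kpageidle reduces to a population count over the u64 array. A
stand-alone sketch (the file I/O itself is omitted here):

```c
#include <stddef.h>
#include <stdint.h>

/* Count set (idle) bits in nelems u64 bitmap elements, as read from
 * /proc/kpageidle. Each set bit is one idle page. */
static size_t count_idle(const uint64_t *bitmap, size_t nelems)
{
	size_t n = 0;

	for (size_t i = 0; i < nelems; i++) {
		uint64_t w = bitmap[i];

		while (w) {
			w &= w - 1; /* clear the lowest set bit */
			n++;
		}
	}
	return n;
}
```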

>
> Maybe this is covered in the documentation file.
>
> > When the bit is set, the corresponding page is
> > idle. A page is considered idle if it has not been accessed since it was
> > marked idle.
>
> Perhaps we can spell out in some detail what "accessed" means? I see
> you've hooked into mark_page_accessed(), so a read from disk is an
> access. What about a write to disk? And what about a page being
> accessed from some random device (could hook into get_user_pages()?) Is
> getting written to swap an access? When a dirty pagecache page is
> written out by kswapd or direct reclaim?
>
> This also should be in the permanent documentation.

OK, will add.

>
> > To mark a page idle one should set the bit corresponding to the
> > page by writing to the file. A value written to the file is OR-ed with the
> > current bitmap value. Only user memory pages can be marked idle, for other
> > page types input is silently ignored. Writing to this file beyond max PFN
> > results in the ENXIO error. Only available when CONFIG_IDLE_PAGE_TRACKING is
> > set.
> >
> > This file can be used to estimate the amount of pages that are not
> > used by a particular workload as follows:
> >
> > 1. mark all pages of interest idle by setting corresponding bits in the
> > /proc/kpageidle bitmap
> > 2. wait until the workload accesses its working set
> > 3. read /proc/kpageidle and count the number of bits set
>
> Security implications. This interface could be used to learn about a
> sensitive application by poking data at it and then observing its
> memory access patterns. Perhaps this is why the proc files are
> root-only (which I assume is sufficient).

That's one point. Another point is that if we allow unprivileged users
to access it, they may interfere with the system-wide daemon doing the
regular scan and estimating the system wss.

> Some words here about the security side of things and the reasoning
> behind the chosen permissions would be good to have.
>
> > * /proc/kpagecgroup. This file contains a 64-bit inode number of the
> > memory cgroup each page is charged to, indexed by PFN.
>
> Actually "closest online ancestor". This also should be in the
> interface documentation.

Actually, the userspace knows nothing about online/offline cgroups,
because all cgroups used to be online and charge re-parenting was used
to forcibly empty a memcg on deletion. Anyway, I'll add a note.

>
> > Only available when CONFIG_MEMCG is set.
>
> CONFIG_MEMCG and CONFIG_IDLE_PAGE_TRACKING I assume?

No, it's present iff CONFIG_PROC_PAGE_MONITOR && CONFIG_MEMCG, because
it might be useful even w/o CONFIG_IDLE_PAGE_TRACKING, e.g. in order to
find out which memcg pages of a particular process are accounted to.

>
> >
> > This file can be used to find all pages (including unmapped file
> > pages) accounted to a particular cgroup. Using /proc/kpageidle, one
> > can then estimate the cgroup working set size.
> >
> > For an example of using these files for estimating the amount of unused
> > memory pages per each memory cgroup, please see the script attached
> > below.
>
> Why were these put in /proc anyway? Rather than under /sys/fs/cgroup
> somewhere? Presumably because /proc/kpageidle is useful in non-memcg
> setups.

Yes, one might use it for estimating active wss of a single process or
the whole system.

>
> > ---- PERFORMANCE EVALUATION ----
>
> "^___" means "end of changelog". Perhaps that should have been
> "^---\n" - unclear.

Sorry :-/

>
> > Documentation/vm/pagemap.txt | 22 ++-
>
> I think we'll need quite a lot more than this to fully describe the
> interface?

Agree, the documentation sucks :-( Will try to forge something more
thorough.

Thanks,
Vladimir

Vladimir Davydov

unread,
Jul 22, 2015, 12:25:58 PM7/22/15
to Andrew Morton, Andres Lagar-Cavilla, Minchan Kim, Raghavendra K T, Johannes Weiner, Michal Hocko, Greg Thelen, Michel Lespinasse, David Rientjes, Pavel Emelyanov, Cyrill Gorcunov, Jonathan Corbet, linu...@vger.kernel.org, linu...@vger.kernel.org, linu...@kvack.org, cgr...@vger.kernel.org, linux-...@vger.kernel.org
On Tue, Jul 21, 2015 at 04:35:00PM -0700, Andrew Morton wrote:
> On Sun, 19 Jul 2015 15:31:16 +0300 Vladimir Davydov <vdav...@parallels.com> wrote:
>
> > As noted by Minchan, a benefit of reading idle flag from
> > /proc/kpageflags is that one can easily filter dirty and/or unevictable
> > pages while estimating the size of unused memory.
> >
> > Note that idle flag read from /proc/kpageflags may be stale in case the
> > page was accessed via a PTE, because it would be too costly to iterate
> > over all page mappings on each /proc/kpageflags read to provide an
> > up-to-date value. To make sure the flag is up-to-date one has to read
> > /proc/kpageidle first.
>
> Is there any value in teaching the regular old page scanner to update
> these flags? If it's doing an rmap scan anyway...

I don't understand what you mean by "regular old page scanner". Could
you please elaborate?

Thanks,
Vladimir

Vladimir Davydov

unread,
Jul 22, 2015, 12:33:49 PM7/22/15
to Andrew Morton, Andres Lagar-Cavilla, Minchan Kim, Raghavendra K T, Johannes Weiner, Michal Hocko, Greg Thelen, Michel Lespinasse, David Rientjes, Pavel Emelyanov, Cyrill Gorcunov, Jonathan Corbet, linu...@vger.kernel.org, linu...@vger.kernel.org, linu...@kvack.org, cgr...@vger.kernel.org, linux-...@vger.kernel.org
Hi Andrew,

Would you mind merging this incremental patch to the original one? Or
should I better resubmit the whole series with all the fixes?

Andrew Morton

unread,
Jul 22, 2015, 3:44:30 PM7/22/15
to Vladimir Davydov, Andres Lagar-Cavilla, Minchan Kim, Raghavendra K T, Johannes Weiner, Michal Hocko, Greg Thelen, Michel Lespinasse, David Rientjes, Pavel Emelyanov, Cyrill Gorcunov, Jonathan Corbet, linu...@vger.kernel.org, linu...@vger.kernel.org, linu...@kvack.org, cgr...@vger.kernel.org, linux-...@vger.kernel.org
On Wed, 22 Jul 2015 19:25:28 +0300 Vladimir Davydov <vdav...@parallels.com> wrote:

> On Tue, Jul 21, 2015 at 04:35:00PM -0700, Andrew Morton wrote:
> > On Sun, 19 Jul 2015 15:31:16 +0300 Vladimir Davydov <vdav...@parallels.com> wrote:
> >
> > > As noted by Minchan, a benefit of reading idle flag from
> > > /proc/kpageflags is that one can easily filter dirty and/or unevictable
> > > pages while estimating the size of unused memory.
> > >
> > > Note that idle flag read from /proc/kpageflags may be stale in case the
> > > page was accessed via a PTE, because it would be too costly to iterate
> > > over all page mappings on each /proc/kpageflags read to provide an
> > > up-to-date value. To make sure the flag is up-to-date one has to read
> > > /proc/kpageidle first.
> >
> > Is there any value in teaching the regular old page scanner to update
> > these flags? If it's doing an rmap scan anyway...
>
> I don't understand what you mean by "regular old page scanner". Could
> you please elaborate?

Whenever kswapd or direct reclaim perform an rmap scan, take that as an
opportunity to also update PageIdle().

Vladimir Davydov

unread,
Jul 23, 2015, 3:58:23 AM7/23/15
to Andrew Morton, Andres Lagar-Cavilla, Minchan Kim, Raghavendra K T, Johannes Weiner, Michal Hocko, Greg Thelen, Michel Lespinasse, David Rientjes, Pavel Emelyanov, Cyrill Gorcunov, Jonathan Corbet, linu...@vger.kernel.org, linu...@vger.kernel.org, linu...@kvack.org, cgr...@vger.kernel.org, linux-...@vger.kernel.org
On Wed, Jul 22, 2015 at 01:46:21PM -0700, Andres Lagar-Cavilla wrote:
> In page_referenced_one:
>
> + if (referenced)
> + clear_page_idle(page);
>

Yep, that's it. Thanks, Andres.

Paul Gortmaker

unread,
Jul 24, 2015, 10:09:11 AM7/24/15
to Vladimir Davydov, Andrew Morton, Andres Lagar-Cavilla, Minchan Kim, Raghavendra K T, Johannes Weiner, Michal Hocko, Greg Thelen, Michel Lespinasse, David Rientjes, Pavel Emelyanov, Cyrill Gorcunov, Jonathan Corbet, linu...@vger.kernel.org, LKML doc, linu...@kvack.org, cgr...@vger.kernel.org, LKML, linux...@vger.kernel.org
On Sun, Jul 19, 2015 at 8:31 AM, Vladimir Davydov
<vdav...@parallels.com> wrote:
> Knowing the portion of memory that is not used by a certain application
> or memory cgroup (idle memory) can be useful for partitioning the system
> efficiently, e.g. by setting memory cgroup limits appropriately.

The version of this commit currently in linux-next breaks cris and m68k
(and maybe others). Fails with:

fs/proc/page.c:341:4: error: implicit declaration of function
'pmdp_clear_young_notify' [-Werror=implicit-function-declaration]
fs/proc/page.c:347:4: error: implicit declaration of function
'ptep_clear_young_notify' [-Werror=implicit-function-declaration]
cc1: some warnings being treated as errors
make[3]: *** [fs/proc/page.o] Error 1
make[2]: *** [fs/proc] Error 2

http://kisskb.ellerman.id.au/kisskb/buildresult/12470364/

Bisect says:

65525488fa86cda44fb6870f29e9859c974700cd is the first bad commit
commit 65525488fa86cda44fb6870f29e9859c974700cd
Author: Vladimir Davydov <vdav...@parallels.com>
Date: Fri Jul 24 09:11:32 2015 +1000

proc: add kpageidle file

Paul.
--

Vladimir Davydov

unread,
Jul 24, 2015, 10:17:59 AM7/24/15
to Paul Gortmaker, Andrew Morton, Andres Lagar-Cavilla, Minchan Kim, Raghavendra K T, Johannes Weiner, Michal Hocko, Greg Thelen, Michel Lespinasse, David Rientjes, Pavel Emelyanov, Cyrill Gorcunov, Jonathan Corbet, linu...@vger.kernel.org, LKML doc, linu...@kvack.org, cgr...@vger.kernel.org, LKML, linux...@vger.kernel.org
On Fri, Jul 24, 2015 at 10:08:25AM -0400, Paul Gortmaker wrote:

> fs/proc/page.c:341:4: error: implicit declaration of function
> 'pmdp_clear_young_notify' [-Werror=implicit-function-declaration]
> fs/proc/page.c:347:4: error: implicit declaration of function
> 'ptep_clear_young_notify' [-Werror=implicit-function-declaration]
> cc1: some warnings being treated as errors
> make[3]: *** [fs/proc/page.o] Error 1
> make[2]: *** [fs/proc] Error 2

My bad, sorry.

It's already been reported by the kbuild-test-robot, see

[linux-next:master 3983/4215] fs/proc/page.c:332:4: error: implicit declaration of function 'pmdp_clear_young_notify'

The fix is:

From: Vladimir Davydov <vdav...@parallels.com>
Subject: [PATCH] mmu_notifier: add missing stubs for clear_young

This is a compilation fix for !CONFIG_MMU_NOTIFIER.

Fixes: mmu-notifier-add-clear_young-callback
Signed-off-by: Vladimir Davydov <vdav...@parallels.com>

diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
index a5b17137c683..a1a210d59961 100644
--- a/include/linux/mmu_notifier.h
+++ b/include/linux/mmu_notifier.h
@@ -471,6 +471,8 @@ static inline void mmu_notifier_mm_destroy(struct mm_struct *mm)

#define ptep_clear_flush_young_notify ptep_clear_flush_young
#define pmdp_clear_flush_young_notify pmdp_clear_flush_young
+#define ptep_clear_young_notify ptep_test_and_clear_young
+#define pmdp_clear_young_notify pmdp_test_and_clear_young
#define ptep_clear_flush_notify ptep_clear_flush
#define pmdp_huge_clear_flush_notify pmdp_huge_clear_flush
#define pmdp_huge_get_and_clear_notify pmdp_huge_get_and_clear

Vladimir Davydov

unread,
Jul 25, 2015, 12:25:58 PM7/25/15
to Andrew Morton, Andres Lagar-Cavilla, Minchan Kim, Raghavendra K T, Johannes Weiner, Michal Hocko, Greg Thelen, Michel Lespinasse, David Rientjes, Pavel Emelyanov, Cyrill Gorcunov, Jonathan Corbet, linu...@vger.kernel.org, linu...@vger.kernel.org, linu...@kvack.org, cgr...@vger.kernel.org, linux-...@vger.kernel.org, Kees Cook
The incremental patch is attached. Could you please merge it into
proc-add-kpageidle-file?
---
From: Vladimir Davydov <vdav...@parallels.com>
Subject: [PATCH] Documentation: Add idle page tracking description

Signed-off-by: Vladimir Davydov <vdav...@parallels.com>

diff --git a/Documentation/vm/00-INDEX b/Documentation/vm/00-INDEX
index 081c49777abb..6a5e2a102a45 100644
--- a/Documentation/vm/00-INDEX
+++ b/Documentation/vm/00-INDEX
@@ -14,6 +14,8 @@ hugetlbpage.txt
- a brief summary of hugetlbpage support in the Linux kernel.
hwpoison.txt
- explains what hwpoison is
+idle_page_tracking.txt
+ - description of the idle page tracking feature.
ksm.txt
- how to use the Kernel Samepage Merging feature.
numa
diff --git a/Documentation/vm/idle_page_tracking.txt b/Documentation/vm/idle_page_tracking.txt
new file mode 100644
index 000000000000..d0f332d544c4
--- /dev/null
+++ b/Documentation/vm/idle_page_tracking.txt
@@ -0,0 +1,94 @@
+MOTIVATION
+
+The idle page tracking feature allows one to track which memory pages are being
+accessed by a workload and which are idle. This information can be useful for
+estimating the workload's working set size, which, in turn, can be taken into
+account when configuring the workload parameters, setting memory cgroup limits,
+or deciding where to place the workload within a compute cluster.
+
+USER API
+
+If CONFIG_IDLE_PAGE_TRACKING was enabled at compile time, a new read-write file
+is present on the proc filesystem, /proc/kpageidle.
+
+The file implements a bitmap where each bit corresponds to a memory page. The
+bitmap is represented by an array of 8-byte integers, and the page at PFN #i is
+mapped to bit #i%64 of array element #i/64, byte order is native. When a bit is
+set, the corresponding page is idle.
+
+A page is considered idle if it has not been accessed since it was marked idle
+(for more details on what "accessed" actually means see the IMPLEMENTATION
+DETAILS section). To mark a page idle one has to set the bit corresponding to
+the page by writing to the file. A value written to the file is OR-ed with the
+current bitmap value.
+
+Only accesses to user memory pages are tracked. These are pages mapped to a
+process address space, page cache and buffer pages, swap cache pages. For other
+page types (e.g. SLAB pages) an attempt to mark a page idle is silently ignored,
+and hence such pages are never reported idle.
+
+For huge pages the idle flag is set only on the head page, so one has to read
+/proc/kpageflags in order to correctly count idle huge pages.
+
+Reading from or writing to /proc/kpageidle will return -EINVAL if you are not
+starting the read/write on an 8-byte boundary, or if the size of the read/write
+is not a multiple of 8 bytes. Writing to this file beyond max PFN will return
+-ENXIO.
+
+That said, in order to estimate the amount of pages that are not used by a
+workload one should:
+
+ 1. Mark all the workload's pages as idle by setting corresponding bits in the
+ /proc/kpageidle bitmap. The pages can be found by reading /proc/pid/pagemap
+ if the workload is represented by a process, or by filtering out alien pages
+ using /proc/kpagecgroup in case the workload is placed in a memory cgroup.
+
+ 2. Wait until the workload accesses its working set.
+
+ 3. Read /proc/kpageidle and count the number of bits set. If one wants to
+ ignore certain types of pages, e.g. mlocked pages since they are not
+ reclaimable, he or she can filter them out using /proc/kpageflags.
+
+See Documentation/vm/pagemap.txt for more information about /proc/pid/pagemap,
+/proc/kpageflags, and /proc/kpagecgroup.
+
+IMPLEMENTATION DETAILS
+
+The kernel internally keeps track of accesses to user memory pages in order to
+reclaim unreferenced pages first on memory shortage conditions. A page is
+considered referenced if it has been recently accessed via a process address
+space, in which case one or more PTEs it is mapped to will have the Accessed bit
+set, or marked accessed explicitly by the kernel (see mark_page_accessed()). The
+latter happens when:
+
+ - a userspace process reads or writes a page using a system call (e.g. read(2)
+ or write(2))
+
+ - a page that is used for storing filesystem buffers is read or written,
+ because a process needs filesystem metadata stored in it (e.g. lists a
+ directory tree)
+
+ - a page is accessed by a device driver using get_user_pages()
+
+When a dirty page is written to swap or disk as a result of memory reclaim or
+exceeding the dirty memory limit, it is not marked referenced.
+
+The idle memory tracking feature adds a new page flag, the Idle flag. This flag
+is set manually, by writing to /proc/kpageidle (see the USER API section), and
+cleared automatically whenever a page is referenced as defined above.
+
+When a page is marked idle, the Accessed bit must be cleared in all PTEs it is
+mapped to, otherwise we will not be able to detect accesses to the page coming
+from a process address space. To avoid interference with the reclaimer, which,
+as noted above, uses the Accessed bit to promote actively referenced pages, one
+more page flag is introduced, the Young flag. When the PTE Accessed bit is
+cleared as a result of setting or updating a page's Idle flag, the Young flag
+is set on the page. The reclaimer treats the Young flag as an extra PTE
+Accessed bit and therefore will consider such a page as referenced.
+
+Since the idle memory tracking feature is based on the memory reclaimer logic,
+it only works with pages that are on an LRU list, other pages are silently
+ignored. That means it will ignore a user memory page if it is isolated, but
+since there are usually not many of them, it should not affect the overall
+result noticeably. In order not to stall scanning of /proc/kpageidle, locked
+pages may be skipped too.
diff --git a/Documentation/vm/pagemap.txt b/Documentation/vm/pagemap.txt
index 538735465693..cff513e28a13 100644
--- a/Documentation/vm/pagemap.txt
+++ b/Documentation/vm/pagemap.txt
@@ -71,15 +71,8 @@ There are five components to pagemap:
memory cgroup each page is charged to, indexed by PFN. Only available when
CONFIG_MEMCG is set.

- * /proc/kpageidle. This file implements a bitmap where each bit corresponds
- to a page, indexed by PFN. When the bit is set, the corresponding page is
- idle. A page is considered idle if it has not been accessed since it was
- marked idle. To mark a page idle one should set the bit corresponding to the
- page by writing to the file. A value written to the file is OR-ed with the
- current bitmap value. Only user memory pages can be marked idle, for other
- page types input is silently ignored. Writing to this file beyond max PFN
- results in the ENXIO error. Only available when CONFIG_IDLE_PAGE_TRACKING is
- set.
+ * /proc/kpageidle. This file comprises the API of the idle page tracking feature.
+ See Documentation/vm/idle_page_tracking.txt for more details.

Short descriptions to the page flags:

diff --git a/mm/Kconfig b/mm/Kconfig
index a1de09926171..90fa89175102 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -666,4 +666,4 @@ config IDLE_PAGE_TRACKING
be useful to tune memory cgroup limits and/or for job placement
within a compute cluster.

- See Documentation/vm/pagemap.txt for more details.
+ See Documentation/vm/idle_page_tracking.txt for more details.

Kees Cook

unread,
Jul 27, 2015, 3:19:08 PM7/27/15
to Andrew Morton, Vladimir Davydov, Andres Lagar-Cavilla, Minchan Kim, Raghavendra K T, Johannes Weiner, Michal Hocko, Greg Thelen, Michel Lespinasse, David Rientjes, Pavel Emelyanov, Cyrill Gorcunov, Jonathan Corbet, Linux API, linu...@vger.kernel.org, Linux-MM, Cgroups, LKML
On Tue, Jul 21, 2015 at 4:34 PM, Andrew Morton
<ak...@linux-foundation.org> wrote:
> On Sun, 19 Jul 2015 15:31:09 +0300 Vladimir Davydov <vdav...@parallels.com> wrote:
>> To mark a page idle one should set the bit corresponding to the
>> page by writing to the file. A value written to the file is OR-ed with the
>> current bitmap value. Only user memory pages can be marked idle, for other
>> page types input is silently ignored. Writing to this file beyond max PFN
>> results in the ENXIO error. Only available when CONFIG_IDLE_PAGE_TRACKING is
>> set.
>>
>> This file can be used to estimate the amount of pages that are not
>> used by a particular workload as follows:
>>
>> 1. mark all pages of interest idle by setting corresponding bits in the
>> /proc/kpageidle bitmap
>> 2. wait until the workload accesses its working set
>> 3. read /proc/kpageidle and count the number of bits set
>
> Security implications. This interface could be used to learn about a
> sensitive application by poking data at it and then observing its
> memory access patterns. Perhaps this is why the proc files are
> root-only (which I assume is sufficient). Some words here about the
> security side of things and the reasoning behind the chosen permissions
> would be good to have.

As long as this stays true-root-only, I think it should be safe enough.

>> * /proc/kpagecgroup. This file contains a 64-bit inode number of the
>> memory cgroup each page is charged to, indexed by PFN.
>
> Actually "closest online ancestor". This also should be in the
> interface documentation.
>
>> Only available when CONFIG_MEMCG is set.
>
> CONFIG_MEMCG and CONFIG_IDLE_PAGE_TRACKING I assume?
>
>>
>> This file can be used to find all pages (including unmapped file
>> pages) accounted to a particular cgroup. Using /proc/kpageidle, one
>> can then estimate the cgroup working set size.
>>
>> For an example of using these files for estimating the amount of unused
>> memory pages per each memory cgroup, please see the script attached
>> below.
>
> Why were these put in /proc anyway? Rather than under /sys/fs/cgroup
> somewhere? Presumably because /proc/kpageidle is useful in non-memcg
> setups.

Do we need a /proc/vm/ for holding these kinds of things? We're
collecting a lot there. Or invent some way for this to be sensible in
/sys?

-Kees

--
Kees Cook
Chrome OS Security

Andrew Morton

unread,
Jul 27, 2015, 3:25:36 PM7/27/15
to Kees Cook, Vladimir Davydov, Andres Lagar-Cavilla, Minchan Kim, Raghavendra K T, Johannes Weiner, Michal Hocko, Greg Thelen, Michel Lespinasse, David Rientjes, Pavel Emelyanov, Cyrill Gorcunov, Jonathan Corbet, Linux API, linu...@vger.kernel.org, Linux-MM, Cgroups, LKML
On Mon, 27 Jul 2015 12:18:57 -0700 Kees Cook <kees...@chromium.org> wrote:

> > Why were these put in /proc anyway? Rather than under /sys/fs/cgroup
> > somewhere? Presumably because /proc/kpageidle is useful in non-memcg
> > setups.
>
> Do we need a /proc/vm/ for holding these kinds of things? We're
> collecting a lot there. Or invent some way for this to be sensible in
> /sys?

/proc is the traditional place for such things (/proc/kpagecount,
/proc/kpageflags, /proc/pagetypeinfo). But that was probably a
mistake.

/proc/sys/vm is rather a dumping ground of random tunables and
statuses, but yes, I do think that moving the kpageidle stuff into there
would be better.

Michal Hocko

unread,
Jul 29, 2015, 8:36:41 AM7/29/15
to Vladimir Davydov, Andrew Morton, Andres Lagar-Cavilla, Minchan Kim, Raghavendra K T, Johannes Weiner, Greg Thelen, Michel Lespinasse, David Rientjes, Pavel Emelyanov, Cyrill Gorcunov, Jonathan Corbet, linu...@vger.kernel.org, linu...@vger.kernel.org, linu...@kvack.org, cgr...@vger.kernel.org, linux-...@vger.kernel.org
On Sun 19-07-15 15:31:09, Vladimir Davydov wrote:
[...]
> ---- USER API ----
>
> The user API consists of two new proc files:

I was thinking about this for a while. I dislike the interface. It is
quite awkward to use - e.g. you have to read the full memory to check a
single memcg idleness. This might turn out being a problem especially on
large machines. It also provides a very low level information (per-pfn
idleness) which is inherently racy. Does anybody really require this
level of detail?

I would assume that most users are interested only in a single number
which tells the idleness of the system/memcg. Well, you have mentioned
a per-process reclaim but I am quite skeptical about this.

I guess the primary reason to rely on the pfn rather than the LRU walk,
which would be more targeted (especially for memcg cases), is that we
cannot hold lru lock for the whole LRU walk and we cannot continue
walking after the lock is dropped. Maybe we can try to address that
instead? I do not think this is easy to achieve but have you considered
that as an option?

--
Michal Hocko
SUSE Labs

Vladimir Davydov

unread,
Jul 29, 2015, 9:59:33 AM7/29/15
to Michal Hocko, Andrew Morton, Andres Lagar-Cavilla, Minchan Kim, Raghavendra K T, Johannes Weiner, Greg Thelen, Michel Lespinasse, David Rientjes, Pavel Emelyanov, Cyrill Gorcunov, Jonathan Corbet, linu...@vger.kernel.org, linu...@vger.kernel.org, linu...@kvack.org, cgr...@vger.kernel.org, linux-...@vger.kernel.org
On Wed, Jul 29, 2015 at 02:36:30PM +0200, Michal Hocko wrote:
> On Sun 19-07-15 15:31:09, Vladimir Davydov wrote:
> [...]
> > ---- USER API ----
> >
> > The user API consists of two new proc files:
>
> I was thinking about this for a while. I dislike the interface. It is
> quite awkward to use - e.g. you have to read the full memory to check a
> single memcg idleness. This might turn out being a problem especially on
> large machines.

Yes, with this API estimating the wss of a single memory cgroup will
cost almost as much as doing this for the whole system.

Come to think of it, does anyone really need to estimate idleness of one
particular cgroup? If we are doing this for finding an optimal memcg
limits configuration or while considering a load move within a cluster
(which I think are the primary use cases for the feature), we must do it
system-wide to see the whole picture.

> It also provides a very low level information (per-pfn idleness) which
> is inherently racy. Does anybody really require this level of detail?

Well, one might want to do it per-process, obtaining PFNs from
/proc/pid/pagemap.

>
> I would assume that most users are interested only in a single number
> which tells the idleness of the system/memcg.

Yes, that's what I need it for - estimating containers' wss for setting
their limits accordingly.

> Well, you have mentioned a per-process reclaim but I am quite
> skeptical about this.

This is what Minchan mentioned initially. Personally, I'm not going to
use it per-process, but I wouldn't rule out this use case either.

>
> I guess the primary reason to rely on the pfn rather than the LRU walk,
> which would be more targeted (especially for memcg cases), is that we
> cannot hold lru lock for the whole LRU walk and we cannot continue
> walking after the lock is dropped. Maybe we can try to address that
> instead? I do not think this is easy to achieve but have you considered
> that as an option?

Yes, I have, and I've come to a conclusion it's not doable, because LRU
lists can be constantly rotating at an arbitrary rate. If you have an
idea in mind how this could be done, please share.

Speaking of LRU-vs-PFN walk, iterating over PFNs has its own advantages:
- You can distribute a walk in time to avoid CPU bursts.
- You are free to parallelize the scanner as you wish to decrease the
scan time.
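To illustrate the first point, a scanner can split the pfn space into
batches and sleep between them (or hand batches to worker threads for
the second point). A minimal sketch with a hypothetical scan_batch()
callback standing in for the actual /proc/kpageidle read:

```c
#include <stdint.h>

/* Split the pfn range [0, max_pfn) into batches of at most batch pfns
 * and invoke scan_batch() (a hypothetical callback that would read and
 * process the corresponding slice of /proc/kpageidle) on each of them.
 * Returns the number of batches processed. */
static uint64_t for_each_batch(uint64_t max_pfn, uint64_t batch,
			       void (*scan_batch)(uint64_t start, uint64_t end))
{
	uint64_t nbatches = 0;

	for (uint64_t start = 0; start < max_pfn; start += batch) {
		uint64_t end = start + batch < max_pfn ? start + batch : max_pfn;

		if (scan_batch)
			scan_batch(start, end);
		nbatches++;
	}
	return nbatches;
}
```

Between batches the scanner is free to sleep, throttling its CPU usage,
or to dispatch batches to several threads in parallel.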

Thanks,
Vladimir

Michel Lespinasse

unread,
Jul 29, 2015, 10:13:57 AM7/29/15
to Vladimir Davydov, Michal Hocko, Andrew Morton, Andres Lagar-Cavilla, Minchan Kim, Raghavendra K T, Johannes Weiner, Greg Thelen, David Rientjes, Pavel Emelyanov, Cyrill Gorcunov, Jonathan Corbet, linu...@vger.kernel.org, linu...@vger.kernel.org, linu...@kvack.org, cgr...@vger.kernel.org, linux-...@vger.kernel.org
(resending as text, sorry for previous post which didn't make it to the ML)

On Wed, Jul 29, 2015 at 7:12 AM, Michel Lespinasse <wal...@google.com> wrote:
>
> On Wed, Jul 29, 2015 at 6:59 AM, Vladimir Davydov <vdav...@parallels.com> wrote:
> >> I guess the primary reason to rely on the pfn rather than the LRU walk,
> >> which would be more targeted (especially for memcg cases), is that we
> >> cannot hold lru lock for the whole LRU walk and we cannot continue
> >> walking after the lock is dropped. Maybe we can try to address that
> >> instead? I do not think this is easy to achieve but have you considered
> >> that as an option?
> >
> > Yes, I have, and I've come to a conclusion it's not doable, because LRU
> > lists can be constantly rotating at an arbitrary rate. If you have an
> > idea in mind how this could be done, please share.
> >
> > Speaking of LRU-vs-PFN walk, iterating over PFNs has its own advantages:
> > - You can distribute a walk in time to avoid CPU bursts.
> > - You are free to parallelize the scanner as you wish to decrease the
> > scan time.
>
> There is a third way: one could go through every MM in the system and scan their page tables. Doing things that way turns out to be generally faster than scanning by physical address, because you don't have to go through RMAP for every page. But, you end up needing to take the mmap_sem lock of every MM (in turn) while scanning them, and that degrades quickly under memory load, which is exactly when you most need this feature. So, scan by address is still what we use here.
>
> My only concern about the interface is that it exposes the fact that the scan is done by address - if the interface only showed per-memcg totals, it would make it possible to change the implementation underneath if we somehow figure out how to work around the mmap_sem issue in the future. I don't think that is necessarily a blocker but this is something to keep in mind IMO.

--
Michel "Walken" Lespinasse
A program is never fully debugged until the last user dies.

Michal Hocko

unread,
Jul 29, 2015, 10:26:30 AM7/29/15
to Vladimir Davydov, Andrew Morton, Andres Lagar-Cavilla, Minchan Kim, Raghavendra K T, Johannes Weiner, Greg Thelen, Michel Lespinasse, David Rientjes, Pavel Emelyanov, Cyrill Gorcunov, Jonathan Corbet, linu...@vger.kernel.org, linu...@vger.kernel.org, linu...@kvack.org, cgr...@vger.kernel.org, linux-...@vger.kernel.org
On Wed 29-07-15 16:59:07, Vladimir Davydov wrote:
> On Wed, Jul 29, 2015 at 02:36:30PM +0200, Michal Hocko wrote:
> > On Sun 19-07-15 15:31:09, Vladimir Davydov wrote:
> > [...]
> > > ---- USER API ----
> > >
> > > The user API consists of two new proc files:
> >
> > I was thinking about this for a while. I dislike the interface. It is
> > quite awkward to use - e.g. you have to read the full memory to check a
> > single memcg idleness. This might turn out being a problem especially on
> > large machines.
>
> Yes, with this API estimating the wss of a single memory cgroup will
> cost almost as much as doing this for the whole system.
>
> Come to think of it, does anyone really need to estimate idleness of one
> particular cgroup?

It is certainly interesting for setting the low limit.

> If we are doing this for finding an optimal memcg
> limits configuration or while considering a load move within a cluster
> (which I think are the primary use cases for the feature), we must do it
> system-wide to see the whole picture.
>
> > It also provides a very low level information (per-pfn idleness) which
> > is inherently racy. Does anybody really require this level of detail?
>
> Well, one might want to do it per-process, obtaining PFNs from
> /proc/pid/pagemap.

Sure once the interface is exported you can do whatever ;) But my
question is whether any real usecase _requires_ it.

> > I would assume that most users are interested only in a single number
> > which tells the idleness of the system/memcg.
>
> Yes, that's what I need it for - estimating containers' wss for setting
> their limits accordingly.

So why don't we export single per-memcg and global knobs then?
This would have few advantages. First of all it would be much easier to
use, you wouldn't have to export memcg ids and finally the implementation
could be changed without any user visible changes (e.g. lru vs. pfn walks),
potential caching and who knows what. In other words. Michel had a
single number interface AFAIR, what was the primary reason to move away
from that API?

> > Well, you have mentioned a per-process reclaim but I am quite
> > skeptical about this.
>
> This is what Minchan mentioned initially. Personally, I'm not going to
> use it per-process, but I wouldn't rule out this use case either.

Considering how many times we have been bitten by too broad interfaces I
would rather be conservative.

> > I guess the primary reason to rely on the pfn rather than the LRU walk,
> > which would be more targeted (especially for memcg cases), is that we
> > cannot hold lru lock for the whole LRU walk and we cannot continue
> > walking after the lock is dropped. Maybe we can try to address that
> > instead? I do not think this is easy to achieve but have you considered
> > that as an option?
>
> Yes, I have, and I've come to a conclusion it's not doable, because LRU
> lists can be constantly rotating at an arbitrary rate. If you have an
> idea in mind how this could be done, please share.

Yes, this is really tricky with the current LRU implementation. I
was playing with some ideas (do some checkpoints on the way), but
none of them was really working out on busy systems. But the LRU
implementation might change in the future. I didn't mean this as a hard
requirement; it just sounds like the current implementation's
restrictions shape the user-visible API, which is a good sign to think
twice about it.

> Speaking of LRU-vs-PFN walk, iterating over PFNs has its own advantages:
> - You can distribute a walk in time to avoid CPU bursts.

This would make the information even more volatile. I am not sure how
helpful it would be in the end.

> - You are free to parallelize the scanner as you wish to decrease the
> scan time.

This is true, but you could argue similarly with per-node/lru threads if
this were implemented in the kernel and really needed. I am not sure it
would really be needed, though; I would expect this to be a low-priority
thing.
--
Michal Hocko
SUSE Labs

Vladimir Davydov

Jul 29, 2015, 10:46:04 AM
to Michel Lespinasse, Michal Hocko, Andrew Morton, Andres Lagar-Cavilla, Minchan Kim, Raghavendra K T, Johannes Weiner, Greg Thelen, David Rientjes, Pavel Emelyanov, Cyrill Gorcunov, Jonathan Corbet, linu...@vger.kernel.org, linu...@vger.kernel.org, linu...@kvack.org, cgr...@vger.kernel.org, linux-...@vger.kernel.org
On Wed, Jul 29, 2015 at 07:12:13AM -0700, Michel Lespinasse wrote:
> On Wed, Jul 29, 2015 at 6:59 AM, Vladimir Davydov <vdav...@parallels.com>
> wrote:
> >> I guess the primary reason to rely on the pfn rather than the LRU walk,
> >> which would be more targeted (especially for memcg cases), is that we
> >> cannot hold lru lock for the whole LRU walk and we cannot continue
> >> walking after the lock is dropped. Maybe we can try to address that
> >> instead? I do not think this is easy to achieve but have you considered
> >> that as an option?
> >
> > Yes, I have, and I've come to a conclusion it's not doable, because LRU
> > lists can be constantly rotating at an arbitrary rate. If you have an
> > idea in mind how this could be done, please share.
> >
> > Speaking of LRU-vs-PFN walk, iterating over PFNs has its own advantages:
> > - You can distribute a walk in time to avoid CPU bursts.
> > - You are free to parallelize the scanner as you wish to decrease the
> > scan time.
>
> There is a third way: one could go through every MM in the system and scan
> their page tables. Doing things that way turns out to be generally faster
> than scanning by physical address, because you don't have to go through
> RMAP for every page. But, you end up needing to take the mmap_sem lock of
> every MM (in turn) while scanning them, and that degrades quickly under
> memory load, which is exactly when you most need this feature. So, scan by
> address is still what we use here.

The page table scan approach has an inherent problem: it ignores unmapped
page cache. If a workload does a lot of read/write or map-access-unmap
operations, we won't be able to even roughly estimate its wss.
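(This is the case a PFN walk handles naturally: pairing the idle bitmap with /proc/kpageflags lets userspace count idle LRU pages whether or not they are mapped. A sketch under assumed layouts, one little-endian u64 of flags per PFN in kpageflags with bit 5 = KPF_LRU, and a one-bit-per-PFN idle bitmap packed into u64 words.)

```python
import struct

KPF_LRU = 1 << 5   # /proc/kpageflags bit 5: page is on an LRU list

def idle_lru_pages(kpageflags, kpageidle, npages):
    """Count idle pages on the LRU lists, mapped or not.

    kpageflags: one little-endian u64 of flags per PFN.
    kpageidle:  one bit per PFN, packed into little-endian u64 words.
    """
    count = 0
    for pfn in range(npages):
        (flags,) = struct.unpack_from("<Q", kpageflags, pfn * 8)
        if not flags & KPF_LRU:
            continue          # skip free/slab/etc. pages
        (word,) = struct.unpack_from("<Q", kpageidle, (pfn // 64) * 8)
        if word & (1 << (pfn % 64)):
            count += 1
    return count
```

Unmapped page cache pages sit on the LRU lists too, so a walk like this sees them, while a page-table scan would not.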

Michal Hocko

Jul 29, 2015, 11:09:08 AM
to Vladimir Davydov, Michel Lespinasse, Andrew Morton, Andres Lagar-Cavilla, Minchan Kim, Raghavendra K T, Johannes Weiner, Greg Thelen, David Rientjes, Pavel Emelyanov, Cyrill Gorcunov, Jonathan Corbet, linu...@vger.kernel.org, linu...@vger.kernel.org, linu...@kvack.org, cgr...@vger.kernel.org, linux-...@vger.kernel.org
That page cache is trivially reclaimable if it is clean. If it needs
writeback then it is non-idle only until the next writeback. So why does
it matter for the estimation?

--
Michal Hocko
SUSE Labs

Vladimir Davydov

Jul 29, 2015, 11:28:48 AM
to Michal Hocko, Andrew Morton, Andres Lagar-Cavilla, Minchan Kim, Raghavendra K T, Johannes Weiner, Greg Thelen, Michel Lespinasse, David Rientjes, Pavel Emelyanov, Cyrill Gorcunov, Jonathan Corbet, linu...@vger.kernel.org, linu...@vger.kernel.org, linu...@kvack.org, cgr...@vger.kernel.org, linux-...@vger.kernel.org
On Wed, Jul 29, 2015 at 04:26:19PM +0200, Michal Hocko wrote:
> On Wed 29-07-15 16:59:07, Vladimir Davydov wrote:
> > On Wed, Jul 29, 2015 at 02:36:30PM +0200, Michal Hocko wrote:
> > > On Sun 19-07-15 15:31:09, Vladimir Davydov wrote:
> > > [...]
> > > > ---- USER API ----
> > > >
> > > > The user API consists of two new proc files:
> > >
> > > I was thinking about this for a while. I dislike the interface. It is
> > > quite awkward to use - e.g. you have to read the full memory to check a
> > > single memcg idleness. This might turn out being a problem especially on
> > > large machines.
> >
> > Yes, with this API estimating the wss of a single memory cgroup will
> > cost almost as much as doing this for the whole system.
> >
> > Come to think of it, does anyone really need to estimate idleness of one
> > particular cgroup?
>
> It is certainly interesting for setting the low limit.

Yes, but IMO there is no point in setting the low limit for one
particular cgroup w/o considering what's going on with the rest of the
system.

>
> > If we are doing this for finding an optimal memcg
> > limits configuration or while considering a load move within a cluster
> > (which I think are the primary use cases for the feature), we must do it
> > system-wide to see the whole picture.
> >
> > > It also provides a very low level information (per-pfn idleness) which
> > > is inherently racy. Does anybody really require this level of detail?
> >
> > Well, one might want to do it per-process, obtaining PFNs from
> > /proc/pid/pagemap.
>
> Sure once the interface is exported you can do whatever ;) But my
> question is whether any real usecase _requires_ it.

I only know/care about my use case, which is memcg configuration, but I
want to make the API as reusable as possible.

>
> > > I would assume that most users are interested only in a single number
> > > which tells the idleness of the system/memcg.
> >
> > Yes, that's what I need it for - estimating containers' wss for setting
> > their limits accordingly.
>
> So why don't we export the single per memcg and global knobs then?
> This would have few advantages. First of all it would be much easier to
> use, you wouldn't have to export memcg ids and finally the implementation
> could be changed without any user visible changes (e.g. lru vs. pfn walks),
> potential caching and who knows what. In other words. Michel had a
> single number interface AFAIR, what was the primary reason to move away
> from that API?

Because there is too much to be taken care of in the kernel with such an
approach, and chances are high that it won't satisfy everyone. What
should the scan period be equal to? Knob. How many kthreads do we want?
Knob. I want to keep history for the last N intervals (this was a part of
Michel's implementation), so what should N be equal to? Knob. I want to be
able to choose between an instant scan and a scan distributed in time.
Knob. I want to see stats for anon/locked/file/dirty memory separately,
please add them to the API. You see the scale of the problem with doing
it in the kernel?

The API this patch set introduces is simple and fair. It only defines
what the "idle" flag means and gives you a way to flip it. That's it. You
wanna history? DIY. You wanna periodic scans? DIY. Etc.
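(Sketching that DIY loop in userspace: write all-ones to mark every page idle, wait an interval, then count the bits the kernel cleared on access. The little-endian u64 word layout is an assumption taken from the patch description, and the demo runs against an in-memory stand-in rather than the real /proc/kpageidle.)

```python
import io
import struct
import time

def mark_all_idle(f, npages):
    """Write all-ones into a kpageidle-style bitmap, marking every page idle."""
    f.seek(0)
    f.write(b"\xff" * (((npages + 63) // 64) * 8))

def count_idle(f, npages):
    """Count pages whose idle bit is still set; accessed pages had it cleared."""
    f.seek(0)
    nwords = (npages + 63) // 64
    idle = 0
    for i, (word,) in enumerate(struct.iter_unpack("<Q", f.read(nwords * 8))):
        if i == nwords - 1 and npages % 64:
            word &= (1 << (npages % 64)) - 1   # ignore padding bits past npages
        idle += bin(word).count("1")
    return idle

def estimate_wss(f, npages, interval=60):
    """One scan cycle: mark everything idle, wait, count touched pages."""
    mark_all_idle(f, npages)
    time.sleep(interval)
    return npages - count_idle(f, npages)

# Demo against an in-memory stand-in for the proc file:
bm = io.BytesIO()
mark_all_idle(bm, 100)
bm.seek(0)
bm.write(struct.pack("<Q", ((1 << 64) - 1) & ~0b111))  # pretend pages 0-2 were accessed
print(100 - count_idle(bm, 100))  # -> 3 pages in the (simulated) working set
```

Against the real interface one would open the bitmap file in "r+b" mode and run estimate_wss() periodically; the history, scan period, and distribution in time all stay in the program.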

>
> > > Well, you have mentioned a per-process reclaim but I am quite
> > > skeptical about this.
> >
> > This is what Minchan mentioned initially. Personally, I'm not going to
> > use it per-process, but I wouldn't rule out this use case either.
>
> Considering how many times we have been bitten by too broad interfaces I
> would rather be conservative.

I consider an API "broad" when it tries to do a lot of different things.
sys_prctl is a good example of a broad API.

/proc/kpageidle is not broad, because it does just one thing (I hope it
does it well :). If we attempted to implement the scanner in the kernel
with all those tunables I mentioned above, then we would get a broad API
IMO.

>
> > > I guess the primary reason to rely on the pfn rather than the LRU walk,
> > > which would be more targeted (especially for memcg cases), is that we
> > > cannot hold lru lock for the whole LRU walk and we cannot continue
> > > walking after the lock is dropped. Maybe we can try to address that
> > > instead? I do not think this is easy to achieve but have you considered
> > > that as an option?
> >
> > Yes, I have, and I've come to a conclusion it's not doable, because LRU
> > lists can be constantly rotating at an arbitrary rate. If you have an
> > idea in mind how this could be done, please share.
>
> Yes this is really tricky with the current LRU implementation. I
> was playing with some ideas (do some checkpoints on the way) but
> none of them was really working out on a busy systems. But the LRU
> implementation might change in the future.

It might. Then we could come up with a new /proc or /sys file which
would do the same as /proc/kpageidle, but on a per-LRU^w whatever-it-is
basis, and give people a choice which one to use.

> I didn't mean this as a hard requirement it just sounds that the
> current implementation restrictions shape the user visible API which
> is a good sign to think twice about it.

Agree. That's why we are discussing it now :-)

>
> > Speaking of LRU-vs-PFN walk, iterating over PFNs has its own advantages:
> > - You can distribute a walk in time to avoid CPU bursts.
>
> This would make the information even more volatile. I am not sure how
> helpful it would be in the end.

If you do it periodically, it is quite accurate.

>
> > - You are free to parallelize the scanner as you wish to decrease the
> > scan time.
>
> This is true but you could argue similar with per-node/lru threads if this
> was implemented in the kernel and really needed. I am not sure it would
> be really needed though. I would expect this would be a low priority
> thing.

But if you needed it one day, you'd have to extend the kernel API. With
/proc/kpageidle, you just go and fix your program.

Thanks,
Vladimir

Vladimir Davydov

Jul 29, 2015, 11:32:01 AM
to Michel Lespinasse, Michal Hocko, Andrew Morton, Andres Lagar-Cavilla, Minchan Kim, Raghavendra K T, Johannes Weiner, Greg Thelen, David Rientjes, Pavel Emelyanov, Cyrill Gorcunov, Jonathan Corbet, linu...@vger.kernel.org, linu...@vger.kernel.org, linu...@kvack.org, cgr...@vger.kernel.org, linux-...@vger.kernel.org
On Wed, Jul 29, 2015 at 08:08:22AM -0700, Michel Lespinasse wrote:
> On Wed, Jul 29, 2015 at 7:45 AM, Vladimir Davydov <vdav...@parallels.com>
> wrote:
> > Page table scan approach has the inherent problem - it ignores unmapped
> > page cache. If a workload does a lot of read/write or map-access-unmap
> > operations, we won't be able to even roughly estimate its wss.
>
> You can catch that in mark_page_accessed on those paths, though.

Actually, the problem here is how to find an unmapped page cache page
*to mark it idle*, not to mark it accessed.

Vladimir Davydov

Jul 29, 2015, 11:37:07 AM
to Michal Hocko, Michel Lespinasse, Andrew Morton, Andres Lagar-Cavilla, Minchan Kim, Raghavendra K T, Johannes Weiner, Greg Thelen, David Rientjes, Pavel Emelyanov, Cyrill Gorcunov, Jonathan Corbet, linu...@vger.kernel.org, linu...@vger.kernel.org, linu...@kvack.org, cgr...@vger.kernel.org, linux-...@vger.kernel.org
Because it might be a part of a workload's working set, in which case
evicting it will make the workload lag.

Thanks,
Vladimir

Michal Hocko

Jul 29, 2015, 11:47:30 AM
to Vladimir Davydov, Andrew Morton, Andres Lagar-Cavilla, Minchan Kim, Raghavendra K T, Johannes Weiner, Greg Thelen, Michel Lespinasse, David Rientjes, Pavel Emelyanov, Cyrill Gorcunov, Jonathan Corbet, linu...@vger.kernel.org, linu...@vger.kernel.org, linu...@kvack.org, cgr...@vger.kernel.org, linux-...@vger.kernel.org
On Wed 29-07-15 18:28:17, Vladimir Davydov wrote:
> On Wed, Jul 29, 2015 at 04:26:19PM +0200, Michal Hocko wrote:
> > On Wed 29-07-15 16:59:07, Vladimir Davydov wrote:
> > > On Wed, Jul 29, 2015 at 02:36:30PM +0200, Michal Hocko wrote:
> > > > On Sun 19-07-15 15:31:09, Vladimir Davydov wrote:
> > > > [...]
> > > > > ---- USER API ----
> > > > >
> > > > > The user API consists of two new proc files:
> > > >
> > > > I was thinking about this for a while. I dislike the interface. It is
> > > > quite awkward to use - e.g. you have to read the full memory to check a
> > > > single memcg idleness. This might turn out being a problem especially on
> > > > large machines.
> > >
> > > Yes, with this API estimating the wss of a single memory cgroup will
> > > cost almost as much as doing this for the whole system.
> > >
> > > Come to think of it, does anyone really need to estimate idleness of one
> > > particular cgroup?
> >
> > It is certainly interesting for setting the low limit.
>
> Yes, but IMO there is no point in setting the low limit for one
> particular cgroup w/o considering what's going on with the rest of the
> system.

If you use the low limit for isolating an important load, then you do not
have to care about the others that much. All you care about is setting a
reasonable protection level and letting the others compete for the rest.

[...]
> > > > I would assume that most users are interested only in a single number
> > > > which tells the idleness of the system/memcg.
> > >
> > > Yes, that's what I need it for - estimating containers' wss for setting
> > > their limits accordingly.
> >
> > So why don't we export the single per memcg and global knobs then?
> > This would have few advantages. First of all it would be much easier to
> > use, you wouldn't have to export memcg ids and finally the implementation
> > could be changed without any user visible changes (e.g. lru vs. pfn walks),
> > potential caching and who knows what. In other words. Michel had a
> > single number interface AFAIR, what was the primary reason to move away
> > from that API?
>
> Because there is too much to be taken care of in the kernel with such an
> approach and chances are high that it won't satisfy everyone. What
> should the scan period be equal too?

No, just gather the data on the read request and let userspace decide
when/how often, etc. If we are clever enough, we can cache the numbers
and avoid the walk. Write to the file and do the mark_idle stuff.

> Knob. How many kthreads do we want?
> Knob. I want to keep history for last N intervals (this was a part of
> Michel's implementation), what should N be equal to? Knob.

This all relates to a kernel thread implementation, which I wasn't
suggesting. I was referring to Michel's work, which might imply that;
I merely meant a single-number output. Sorry about the confusion.

> I want to be
> able to choose between an instant scan and a scan distributed in time.
> Knob. I want to see stats for anon/locked/file/dirty memory separately,

Why is this useful for setting memcg limits or for wss estimation? I
can imagine that a further breakdown of the numbers might be interesting
from the debugging POV, but I fail to see what kind of decisions
userspace would make based on them.

[...]
> > Yes this is really tricky with the current LRU implementation. I
> > was playing with some ideas (do some checkpoints on the way) but
> > none of them was really working out on a busy systems. But the LRU
> > implementation might change in the future.
>
> It might. Then we could come up with a new /proc or /sys file which
> would do the same as /proc/kpageidle, but on per LRU^w whatever-it-is
> basis, and give people a choice which one to use.

This just leads to the proc file count explosion we are seeing
already... Proc ended up as a dumping ground for different things which
didn't fit elsewhere, and I am not very happy about it, to be honest.

[...]
--
Michal Hocko
SUSE Labs

Andres Lagar-Cavilla

Jul 29, 2015, 11:55:15 AM
to Vladimir Davydov, Michal Hocko, Andrew Morton, Minchan Kim, Raghavendra K T, Johannes Weiner, Greg Thelen, Michel Lespinasse, David Rientjes, Pavel Emelyanov, Cyrill Gorcunov, Jonathan Corbet, linu...@vger.kernel.org, linu...@vger.kernel.org, linu...@kvack.org, cgr...@vger.kernel.org, linux-...@vger.kernel.org
On Wed, Jul 29, 2015 at 8:28 AM, Vladimir Davydov
<vdav...@parallels.com> wrote:
> On Wed, Jul 29, 2015 at 04:26:19PM +0200, Michal Hocko wrote:
>> On Wed 29-07-15 16:59:07, Vladimir Davydov wrote:
>> > On Wed, Jul 29, 2015 at 02:36:30PM +0200, Michal Hocko wrote:
>> > > On Sun 19-07-15 15:31:09, Vladimir Davydov wrote:
>> > > [...]
>> > > > ---- USER API ----
>> > > >
>> > > > The user API consists of two new proc files:
>> > >
>> > > I was thinking about this for a while. I dislike the interface. It is
>> > > quite awkward to use - e.g. you have to read the full memory to check a
>> > > single memcg idleness. This might turn out being a problem especially on
>> > > large machines.
>> >
>> > Yes, with this API estimating the wss of a single memory cgroup will
>> > cost almost as much as doing this for the whole system.
>> >
>> > Come to think of it, does anyone really need to estimate idleness of one
>> > particular cgroup?

You can always adorn memcg with a boolean, trivially configurable from
user-space, and have all the idle computation paths skip the code if
memcg->dont_care_about_idle

>>
>> It is certainly interesting for setting the low limit.
>

Valuable, IMHO

> Yes, but IMO there is no point in setting the low limit for one
> particular cgroup w/o considering what's going on with the rest of the
> system.
>

Probably worth more fleshing out. Why not? Because global reclaim can
execute in any given context, so a noisy neighbor hurts all?

>>
>> > If we are doing this for finding an optimal memcg
>> > limits configuration or while considering a load move within a cluster
>> > (which I think are the primary use cases for the feature), we must do it
>> > system-wide to see the whole picture.
>> >
>> > > It also provides a very low level information (per-pfn idleness) which
>> > > is inherently racy. Does anybody really require this level of detail?
>> >

It's inherently racy for antagonistic workloads, but a lot of workloads
are very stable.
FTR I'm happy that the subtle internals are built with this patchset,
and the DIY approach is very appealing.

Andres
Andres Lagar-Cavilla | Google Kernel Team | andr...@google.com

Michal Hocko

Jul 29, 2015, 11:58:25 AM
to Vladimir Davydov, Michel Lespinasse, Andrew Morton, Andres Lagar-Cavilla, Minchan Kim, Raghavendra K T, Johannes Weiner, Greg Thelen, David Rientjes, Pavel Emelyanov, Cyrill Gorcunov, Jonathan Corbet, linu...@vger.kernel.org, linu...@vger.kernel.org, linu...@kvack.org, cgr...@vger.kernel.org, linux-...@vger.kernel.org
On Wed 29-07-15 18:36:40, Vladimir Davydov wrote:
> On Wed, Jul 29, 2015 at 05:08:55PM +0200, Michal Hocko wrote:
> > On Wed 29-07-15 17:45:39, Vladimir Davydov wrote:
[...]
> > > Page table scan approach has the inherent problem - it ignores unmapped
> > > page cache. If a workload does a lot of read/write or map-access-unmap
> > > operations, we won't be able to even roughly estimate its wss.
> >
> > That page cache is trivially reclaimable if it is clean. If it needs
> > writeback then it is non-idle only until the next writeback. So why does
> > it matter for the estimation?
>
> Because it might be a part of a workload's working set, in which case
> evicting it will make the workload lag.

My point was that no sane application will rely on the unmapped page
cache being part of the working set. But you are right that you might
have a more complex load consisting of many applications, each doing
buffered IO on the same set of files, which might get evicted due to
other memory pressure in the meantime and see higher latencies. This is
where a low limit covering this memory as well might be helpful.

--
Michal Hocko
SUSE Labs