
[PATCH v2 3/3] Documentation/filesystems: Added /proc/PID/totmaps documentation


rober...@collabora.com

Aug 12, 2016, 6:10:05 PM
From: Robert Foss <rober...@collabora.com>

Added documentation covering /proc/PID/totmaps.

Signed-off-by: Robert Foss <rober...@collabora.com>
---
Documentation/filesystems/proc.txt | 21 +++++++++++++++++++++
1 file changed, 21 insertions(+)

diff --git a/Documentation/filesystems/proc.txt b/Documentation/filesystems/proc.txt
index 7d001be..c06ff33 100644
--- a/Documentation/filesystems/proc.txt
+++ b/Documentation/filesystems/proc.txt
@@ -11,6 +11,7 @@ Version 1.3 Kernel version 2.2.12
Kernel version 2.4.0-test11-pre4
------------------------------------------------------------------------------
fixes/update part 1.1 Stefani Seibold <ste...@seibold.net> June 9 2009
+add totmaps Robert Foss <rober...@collabora.com> August 12 2016

Table of Contents
-----------------
@@ -147,6 +148,8 @@ Table 1-1: Process specific entries in /proc
stack Report full stack trace, enable via CONFIG_STACKTRACE
smaps an extension based on maps, showing the memory consumption of
each mapping and flags associated with it
+ totmaps an extenssion based on maps, showing the total memory
+ consumption of all mappings
numa_maps an extension based on maps, showing the memory locality and
binding policy as well as mem usage (in pages) of each mapping.
..............................................................................
@@ -512,6 +515,24 @@ be vanished or the reverse -- new added.
This file is only present if the CONFIG_MMU kernel configuration option is
enabled.

+The /proc/PID/totmaps is an extension based on maps, showing the memory
+consumption totals for all of the process's mappings. It lists the sums of the
+same statistics as /proc/PID/smaps.
+
+The process' mappings will be summarized as a series of lines like the
+following:
+
+Rss: 4256 kB
+Pss: 1170 kB
+Shared_Clean: 2720 kB
+Shared_Dirty: 1136 kB
+Private_Clean: 0 kB
+Private_Dirty: 400 kB
+Referenced: 4256 kB
+Anonymous: 1536 kB
+AnonHugePages: 0 kB
+Swap: 0 kB
+
The /proc/PID/clear_refs is used to reset the PG_Referenced and ACCESSED/YOUNG
bits on both physical and virtual pages associated with a process, and the
soft-dirty bit on pte (see Documentation/vm/soft-dirty.txt for details).
--
2.7.4
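
For reference, a minimal user-space reader of the format documented above
might look like the following sketch. The helper name is hypothetical; the
only assumption is the "Name:   value kB" layout shown in the example output.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: return the Pss total (in kB) reported by
 * /proc/<pid>/totmaps, or -1 on error. */
static long totmaps_pss_kb(int pid)
{
	char path[64], key[32];
	long val, pss = -1;
	FILE *f;

	snprintf(path, sizeof(path), "/proc/%d/totmaps", pid);
	f = fopen(path, "r");
	if (!f)
		return -1;
	/* Each line is "Name:   value kB", so scan key/value pairs. */
	while (fscanf(f, "%31s %ld kB", key, &val) == 2) {
		if (!strcmp(key, "Pss:"))
			pss = val;
	}
	fclose(f);
	return pss;
}

int main(int argc, char **argv)
{
	if (argc < 2)
		return 1;
	printf("Pss: %ld kB\n", totmaps_pss_kb(atoi(argv[1])));
	return 0;
}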

rober...@collabora.com

Aug 12, 2016, 6:10:05 PM
From: Robert Foss <rober...@collabora.com>

This is based on earlier work by Thiago Goncales. It implements a new
per process proc file which summarizes the contents of the smaps file
but doesn't display any addresses. It gives more detailed information
than statm like the PSS (proportional set size). It differs from the
original implementation in that it doesn't use the full blown set of
seq operations, uses a different termination condition, and doesn't
display "Locked" as that was broken in the original implementation.

This new proc file provides information faster than parsing the potentially
huge smaps file.

Tested-by: Robert Foss <rober...@collabora.com>
Signed-off-by: Robert Foss <rober...@collabora.com>

Signed-off-by: Sonny Rao <sonn...@chromium.org>
---
fs/proc/base.c | 1 +
fs/proc/internal.h | 3 ++
fs/proc/task_mmu.c | 134 +++++++++++++++++++++++++++++++++++++++++++++++++++++
3 files changed, 138 insertions(+)

diff --git a/fs/proc/base.c b/fs/proc/base.c
index a11eb71..de3acdf 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -2855,6 +2855,7 @@ static const struct pid_entry tgid_base_stuff[] = {
REG("clear_refs", S_IWUSR, proc_clear_refs_operations),
REG("smaps", S_IRUGO, proc_pid_smaps_operations),
REG("pagemap", S_IRUSR, proc_pagemap_operations),
+ REG("totmaps", S_IRUGO, proc_totmaps_operations),
#endif
#ifdef CONFIG_SECURITY
DIR("attr", S_IRUGO|S_IXUGO, proc_attr_dir_inode_operations, proc_attr_dir_operations),
diff --git a/fs/proc/internal.h b/fs/proc/internal.h
index aa27810..c55e1fe 100644
--- a/fs/proc/internal.h
+++ b/fs/proc/internal.h
@@ -281,6 +281,7 @@ struct proc_maps_private {
struct mm_struct *mm;
#ifdef CONFIG_MMU
struct vm_area_struct *tail_vma;
+ struct mem_size_stats *mss;
#endif
#ifdef CONFIG_NUMA
struct mempolicy *task_mempolicy;
@@ -297,6 +298,8 @@ extern const struct file_operations proc_pid_smaps_operations;
extern const struct file_operations proc_tid_smaps_operations;
extern const struct file_operations proc_clear_refs_operations;
extern const struct file_operations proc_pagemap_operations;
+extern const struct file_operations proc_totmaps_operations;
+

extern unsigned long task_vsize(struct mm_struct *);
extern unsigned long task_statm(struct mm_struct *,
diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 4648c7f..b7612e9 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -246,6 +246,9 @@ static int proc_map_release(struct inode *inode, struct file *file)
struct seq_file *seq = file->private_data;
struct proc_maps_private *priv = seq->private;

+ if (!priv)
+ return 0;
+
if (priv->mm)
mmdrop(priv->mm);

@@ -802,6 +805,75 @@ static int show_smap(struct seq_file *m, void *v, int is_pid)
return 0;
}

+static void add_smaps_sum(struct mem_size_stats *mss,
+ struct mem_size_stats *mss_sum)
+{
+ mss_sum->resident += mss->resident;
+ mss_sum->pss += mss->pss;
+ mss_sum->shared_clean += mss->shared_clean;
+ mss_sum->shared_dirty += mss->shared_dirty;
+ mss_sum->private_clean += mss->private_clean;
+ mss_sum->private_dirty += mss->private_dirty;
+ mss_sum->referenced += mss->referenced;
+ mss_sum->anonymous += mss->anonymous;
+ mss_sum->anonymous_thp += mss->anonymous_thp;
+ mss_sum->swap += mss->swap;
+}
+
+static int totmaps_proc_show(struct seq_file *m, void *data)
+{
+ struct proc_maps_private *priv = m->private;
+ struct mm_struct *mm = priv->mm;
+ struct vm_area_struct *vma;
+ struct mem_size_stats mss_sum;
+
+ memset(&mss_sum, 0, sizeof(mss_sum));
+ down_read(&mm->mmap_sem);
+ hold_task_mempolicy(priv);
+
+ for (vma = mm->mmap; vma != priv->tail_vma; vma = vma->vm_next) {
+ struct mem_size_stats mss;
+ struct mm_walk smaps_walk = {
+ .pmd_entry = smaps_pte_range,
+ .mm = vma->vm_mm,
+ .private = &mss,
+ };
+
+ if (vma->vm_mm && !is_vm_hugetlb_page(vma)) {
+ memset(&mss, 0, sizeof(mss));
+ walk_page_vma(vma, &smaps_walk);
+ add_smaps_sum(&mss, &mss_sum);
+ }
+ }
+
+ seq_printf(m,
+ "Rss: %8lu kB\n"
+ "Pss: %8lu kB\n"
+ "Shared_Clean: %8lu kB\n"
+ "Shared_Dirty: %8lu kB\n"
+ "Private_Clean: %8lu kB\n"
+ "Private_Dirty: %8lu kB\n"
+ "Referenced: %8lu kB\n"
+ "Anonymous: %8lu kB\n"
+ "AnonHugePages: %8lu kB\n"
+ "Swap: %8lu kB\n",
+ mss_sum.resident >> 10,
+ (unsigned long)(mss_sum.pss >> (10 + PSS_SHIFT)),
+ mss_sum.shared_clean >> 10,
+ mss_sum.shared_dirty >> 10,
+ mss_sum.private_clean >> 10,
+ mss_sum.private_dirty >> 10,
+ mss_sum.referenced >> 10,
+ mss_sum.anonymous >> 10,
+ mss_sum.anonymous_thp >> 10,
+ mss_sum.swap >> 10);
+
+ release_task_mempolicy(priv);
+ up_read(&mm->mmap_sem);
+
+ return 0;
+}
+
static int show_pid_smap(struct seq_file *m, void *v)
{
return show_smap(m, v, 1);
@@ -812,6 +884,28 @@ static int show_tid_smap(struct seq_file *m, void *v)
return show_smap(m, v, 0);
}

+static void *m_totmaps_start(struct seq_file *p, loff_t *pos)
+{
+ return NULL + (*pos == 0);
+}
+
+static void *m_totmaps_next(struct seq_file *p, void *v, loff_t *pos)
+{
+ ++*pos;
+ return NULL;
+}
+
+static void m_totmaps_stop(struct seq_file *p, void *v)
+{
+}
+
+static const struct seq_operations proc_totmaps_op = {
+ .start = m_totmaps_start,
+ .next = m_totmaps_next,
+ .stop = m_totmaps_stop,
+ .show = totmaps_proc_show
+};
+
static const struct seq_operations proc_pid_smaps_op = {
.start = m_start,
.next = m_next,
@@ -836,6 +930,39 @@ static int tid_smaps_open(struct inode *inode, struct file *file)
return do_maps_open(inode, file, &proc_tid_smaps_op);
}

+static int totmaps_open(struct inode *inode, struct file *file)
+{
+ struct proc_maps_private *priv = NULL;
+ struct seq_file *seq;
+ int ret;
+
+ ret = do_maps_open(inode, file, &proc_totmaps_op);
+ if (ret)
+ goto error;
+
+ seq = file->private_data;
+ priv = seq->private;
+
+ /*
+ * We need to grab references to the task_struct
+ * at open time, because there's a potential information
+ * leak where the totmaps file is opened and held open
+ * while the underlying pid to task mapping changes
+ * underneath it
+ */
+ priv->task = get_proc_task(inode);
+ if (!priv->task) {
+ ret = -ESRCH;
+ goto error;
+ }
+
+ return 0;
+
+error:
+ proc_map_release(inode, file);
+ return ret;
+}
+
const struct file_operations proc_pid_smaps_operations = {
.open = pid_smaps_open,
.read = seq_read,
@@ -850,6 +977,13 @@ const struct file_operations proc_tid_smaps_operations = {
.release = proc_map_release,
};

+const struct file_operations proc_totmaps_operations = {
+ .open = totmaps_open,
+ .read = seq_read,
+ .llseek = seq_lseek,
+ .release = proc_map_release,
+};
+
enum clear_refs_types {
CLEAR_REFS_ALL = 1,
CLEAR_REFS_ANON,
--
2.7.4

rober...@collabora.com

Aug 12, 2016, 6:10:05 PM
From: Robert Foss <rober...@collabora.com>

This series implements /proc/PID/totmaps, a tool for retrieving summarized
information about the mappings of a process.

Changes since v1:
- Removed IS_ERR check from get_task_mm() function
- Changed comment format
- Moved proc_totmaps_operations declaration inside internal.h
- Switched to using do_maps_open() in totmaps_open() function,
which provides privilege checking
- Error handling reworked for totmaps_open() function
- Switched to stack allocated struct mem_size_stats mss_sum in
totmaps_proc_show() function
- Removed get_task_mm() in totmaps_proc_show() since priv->mm
already is available
- Added support to proc_map_release() for priv==NULL, to allow
function to be used for all failure cases
- Added proc_totmaps_op and helper functions for it
- Added documentation in separate patch
- Removed totmaps_release() since it was just a wrapper for
proc_map_release()


Robert Foss (3):
mm, proc: Implement /proc/<pid>/totmaps
Documentation/filesystems: Fixed typo
Documentation/filesystems: Added /proc/PID/totmaps documentation

Documentation/filesystems/proc.txt | 23 ++++++-
fs/proc/base.c | 1 +
fs/proc/internal.h | 3 +
fs/proc/task_mmu.c | 134 +++++++++++++++++++++++++++++++++++++
4 files changed, 160 insertions(+), 1 deletion(-)

--
2.7.4

Jann Horn

Aug 13, 2016, 10:50:05 AM
On Fri, Aug 12, 2016 at 06:04:20PM -0400, rober...@collabora.com wrote:
> diff --git a/fs/proc/internal.h b/fs/proc/internal.h
> index aa27810..c55e1fe 100644
> --- a/fs/proc/internal.h
> +++ b/fs/proc/internal.h
> @@ -281,6 +281,7 @@ struct proc_maps_private {
> struct mm_struct *mm;
> #ifdef CONFIG_MMU
> struct vm_area_struct *tail_vma;
> + struct mem_size_stats *mss;

This is unused now, right?


> diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
> index 4648c7f..b7612e9 100644
> --- a/fs/proc/task_mmu.c
> +++ b/fs/proc/task_mmu.c
> @@ -246,6 +246,9 @@ static int proc_map_release(struct inode *inode, struct file *file)
> struct seq_file *seq = file->private_data;
> struct proc_maps_private *priv = seq->private;
>
> + if (!priv)
> + return 0;
> +

You might want to get rid of this, see below.


> +static int totmaps_open(struct inode *inode, struct file *file)
> +{
> + struct proc_maps_private *priv = NULL;
> + struct seq_file *seq;
> + int ret;
> +
> + ret = do_maps_open(inode, file, &proc_totmaps_op);
> + if (ret)
> + goto error;
[...]
> +error:
> + proc_map_release(inode, file);
> + return ret;

I don't think this is correct. Have a look at the other callers of
do_maps_open() - none of them do any cleanup steps on error, they
just return. I think the "goto error" here should be a return
instead.

Have a look at the error cases that can cause do_maps_open() to
fail: do_maps_open() just calls proc_maps_open(). If the
__seq_open_private() call fails because of memory pressure,
file->private_data is still NULL, and your newly added NULL check
in proc_map_release() causes proc_map_release() to be a no-op
there. But if proc_maps_open() fails later on, things get nasty:
If, for example, proc_mem_open() fails because of a ptrace
permission denial, __seq_open_file -> seq_open has already set
file->private_data to a struct seq_file *, and then
proc_maps_open(), prior to passing on the error code, calls
seq_release_private -> seq_release, which frees that
struct seq_file * without NULLing the private_data pointer.
As far as I can tell, proc_map_release() would then run into
a use-after-free scenario.


> + priv->task = get_proc_task(inode);
> + if (!priv->task) {
> + ret = -ESRCH;
> + goto error;
> + }

You're not actually using ->task anywhere in the current version,
right? Can this be deleted?


> +const struct file_operations proc_totmaps_operations = {
[...]
> + .release = proc_map_release,

This won't release priv->task, causing a memory leak (exploitable
through a reference counter overflow of the task_struct usage
counter).

Michal Hocko

Aug 14, 2016, 5:10:06 AM
On Fri 12-08-16 18:04:19, rober...@collabora.com wrote:
> From: Robert Foss <rober...@collabora.com>
>
> This series implements /proc/PID/totmaps, a tool for retrieving summarized
> information about the mappings of a process.

The changelog is absolutely missing the usecase. Why do we need this?
Why existing interfaces are not sufficient?
--
Michal Hocko
SUSE Labs

Robert Foss

Aug 15, 2016, 9:10:06 AM


On 2016-08-14 05:04 AM, Michal Hocko wrote:
> On Fri 12-08-16 18:04:19, rober...@collabora.com wrote:
>> From: Robert Foss <rober...@collabora.com>
>>
>> This series implements /proc/PID/totmaps, a tool for retrieving summarized
>> information about the mappings of a process.
>
> The changelog is absolutely missing the usecase. Why do we need this?
> Why existing interfaces are not sufficient?

You are absolutely right, more information is in 1/3.
But the gist of it is that it provides a faster and more convenient way
of accessing the information in /proc/PID/smaps.

Michal Hocko

Aug 15, 2016, 9:50:05 AM
On Mon 15-08-16 09:00:04, Robert Foss wrote:
>
>
> On 2016-08-14 05:04 AM, Michal Hocko wrote:
> > On Fri 12-08-16 18:04:19, rober...@collabora.com wrote:
> > > From: Robert Foss <rober...@collabora.com>
> > >
> > > This series implements /proc/PID/totmaps, a tool for retrieving summarized
> > > information about the mappings of a process.
> >
> > The changelog is absolutely missing the usecase. Why do we need this?
> > Why existing interfaces are not sufficient?
>
> You are absolutely right, more information is in 1/3.

Patch 1 is silent about the use case as well. It is usually recommended
to describe the motivation for the change in the cover letter.

> But the gist of it is that it provides a faster and more convenient way of
> accessing the information in /proc/PID/smaps.

I am sorry to insist but this is far from a description I was hoping
for. Why do we need a more convenient API? Please note that this is a
userspace API which we will have to maintain for ever. We have made many
mistakes in the past where exporting some information made sense at the
time while it turned out being a mistake only later on. So let's make
sure we will not fall into the same trap again.

So please make sure you describe the use case, why the current API is
insufficient and why it cannot be tweaked to provide the information you
are looking for.

Robert Foss

Aug 15, 2016, 10:20:05 AM


On 2016-08-13 10:39 AM, Jann Horn wrote:
> On Fri, Aug 12, 2016 at 06:04:20PM -0400, rober...@collabora.com wrote:
>> diff --git a/fs/proc/internal.h b/fs/proc/internal.h
>> index aa27810..c55e1fe 100644
>> --- a/fs/proc/internal.h
>> +++ b/fs/proc/internal.h
>> @@ -281,6 +281,7 @@ struct proc_maps_private {
>> struct mm_struct *mm;
>> #ifdef CONFIG_MMU
>> struct vm_area_struct *tail_vma;
>> + struct mem_size_stats *mss;
>
> This is unused now, right?

Fixing it in v3.

>
>
>> diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
>> index 4648c7f..b7612e9 100644
>> --- a/fs/proc/task_mmu.c
>> +++ b/fs/proc/task_mmu.c
>> @@ -246,6 +246,9 @@ static int proc_map_release(struct inode *inode, struct file *file)
>> struct seq_file *seq = file->private_data;
>> struct proc_maps_private *priv = seq->private;
>>
>> + if (!priv)
>> + return 0;
>> +
>
> You might want to get rid of this, see below.

Fixing it in v3.
Thanks for the thorough walkthrough, it is much appreciated.

priv->task does not appear to be used any more, and can be removed.
When "priv->task = get_proc_task(inode)" is removed, totmaps_open()
starts to look just like the other XXX_open functions.

I'll send out v3 as soon as testing has been done.

Robert Foss

Aug 15, 2016, 12:30:06 PM


On 2016-08-15 09:42 AM, Michal Hocko wrote:
> On Mon 15-08-16 09:00:04, Robert Foss wrote:
>>
>>
>> On 2016-08-14 05:04 AM, Michal Hocko wrote:
>>> On Fri 12-08-16 18:04:19, rober...@collabora.com wrote:
>>>> From: Robert Foss <rober...@collabora.com>
>>>>
>>>> This series implements /proc/PID/totmaps, a tool for retrieving summarized
>>>> information about the mappings of a process.
>>>
>>> The changelog is absolutely missing the usecase. Why do we need this?
>>> Why existing interfaces are not sufficient?
>>
>> You are absolutely right, more information is in 1/3.
>
> Patch 1 is silent about the use case as well. It is usually recommended
> to describe the motivation for the change in the cover letter.

I'll change it for v3.

>
>> But the gist of it is that it provides a faster and more convenient way of
>> accessing the information in /proc/PID/smaps.
>
> I am sorry to insist but this is far from a description I was hoping
> for. Why do we need a more convenient API? Please note that this is a
> userspace API which we will have to maintain for ever. We have made many
> mistakes in the past where exporting some information made sense at the
> time while it turned out being a mistake only later on. So let's make
> sure we will not fall into the same trap again.
>
> So please make sure you describe the use case, why the current API is
> insufficient and why it cannot be tweaked to provide the information you
> are looking for.
>

I'll add a more elaborate description to the v3 cover letter.
In v1, there was a discussion which I think presented the practical
applications rather well:

https://lkml.org/lkml/2016/8/9/628

or the quote from Sonny Rao pasted below:

> The use case is to speed up monitoring of
> memory consumption in environments where RSS isn't precise.
>
> For example Chrome tends to many processes which have hundreds of VMAs
> with a substantial amount of shared memory, and the error of using
> RSS rather than PSS tends to be very large when looking at overall
> memory consumption. PSS isn't kept as a single number that's exported
> like RSS, so to calculate PSS means having to parse a very large smaps
> file.
>
> This process is slow and has to be repeated for many processes, and we
> found that just the act of doing the parsing was taking up a
> significant amount of CPU time, so this patch is an attempt to make
> that process cheaper.

If a reformatted version of this still isn't adequate or desirable for
the cover-letter, please give me another heads up.

Thanks!

Robert Foss

Aug 15, 2016, 4:20:06 PM
Actually, priv->task is used by hold_task_mempolicy() in
totmaps_proc_show(), which as far as I understand it is needed due to
the "vma = mm->mmap" looping we do.

Michal Hocko

Aug 16, 2016, 3:20:05 AM
On Mon 15-08-16 12:25:10, Robert Foss wrote:
>
>
> On 2016-08-15 09:42 AM, Michal Hocko wrote:
[...]
> > The use case is to speed up monitoring of
> > memory consumption in environments where RSS isn't precise.
> >
> > For example Chrome tends to have many processes which have hundreds of VMAs
> > with a substantial amount of shared memory, and the error of using
> > RSS rather than PSS tends to be very large when looking at overall
> > memory consumption. PSS isn't kept as a single number that's exported
> > like RSS, so to calculate PSS means having to parse a very large smaps
> > file.
> >
> > This process is slow and has to be repeated for many processes, and we
> > found that just the act of doing the parsing was taking up a
> > significant amount of CPU time, so this patch is an attempt to make
> > that process cheaper.

Well, this is slow because it requires the pte walk, otherwise you cannot
know how many ptes map the particular shared page. Your patch
(totmaps_proc_show) does the very same page table walk because in fact
it is unavoidable. So what exactly is the difference except for the
userspace parsing, which is quite trivial? E.g. my currently running Firefox
has
$ awk '/^[0-9a-f]/{print}' /proc/4950/smaps | wc -l
984

quite some VMAs, yet parsing it spends basically all the time in the kernel...

$ /usr/bin/time -v awk '/^Rss/{rss+=$2} /^Pss/{pss+=$2} END {printf "rss:%d pss:%d\n", rss, pss}' /proc/4950/smaps
rss:1112288 pss:1096435
Command being timed: "awk /^Rss/{rss+=$2} /^Pss/{pss+=$2} END {printf "rss:%d pss:%d\n", rss, pss} /proc/4950/smaps"
User time (seconds): 0.00
System time (seconds): 0.02
Percent of CPU this job got: 91%
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:00.02

So I am not really sure I see the performance benefit.

Robert Foss

Aug 16, 2016, 12:50:05 PM
I did some performance measurements of my own, and it would seem like
there is about a 2x performance gain to be had. To me that is
substantial, and a larger gain than commonly seen.

There is naturally also the benefit that this is a lot easier to interact
with programmatically.

$ ps aux | grep firefox
robertfoss 5025 24.3 13.7 3562820 2219616 ? Rl Aug15 277:44
/usr/lib/firefox/firefox https://allg.one/xpb
$ awk '/^[0-9a-f]/{print}' /proc/5025/smaps | wc -l
1503


$ /usr/bin/time -v -p zsh -c "(repeat 25 {cat /proc/5025/totmaps})"
[...]
Command being timed: "zsh -c (repeat 25 {cat /proc/5025/totmaps})"
User time (seconds): 0.00
System time (seconds): 0.40
Percent of CPU this job got: 90%
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:00.45


$ /usr/bin/time -v -p zsh -c "repeat 25 { awk '/^Rss/{rss+=\$2}
/^Pss/{pss+=\$2} END {printf \"rss:%d pss:%d\n\", rss, pss}\'
/proc/5025/smaps }"
[...]
Command being timed: "zsh -c repeat 25 { awk '/^Rss/{rss+=$2}
/^Pss/{pss+=$2} END {printf "rss:%d pss:%d\n", rss, pss}\'
/proc/5025/smaps }"
User time (seconds): 0.37
System time (seconds): 0.45
Percent of CPU this job got: 92%
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:00.89

rober...@collabora.com

Aug 16, 2016, 1:40:06 PM
From: Robert Foss <rober...@collabora.com>

This series provides the /proc/PID/totmaps feature, which
summarizes the information provided by /proc/PID/smaps for
improved performance and usability reasons.

A use case is to speed up monitoring of memory consumption in
environments where RSS isn't precise.

For example Chrome tends to have many processes which have hundreds of VMAs
with a substantial amount of shared memory, and the error of using
RSS rather than PSS tends to be very large when looking at overall
memory consumption. PSS isn't kept as a single number that's exported
like RSS, so to calculate PSS means having to parse a very large smaps
file.

This process is slow and has to be repeated for many processes, and we
found that just the act of doing the parsing was taking up a
significant amount of CPU time, so this patch is an attempt to make
that process cheaper.

/proc/PID/totmaps provides roughly a 2x speedup compared to parsing
/proc/PID/smaps with awk.

$ /usr/bin/time -v -p zsh -c "(repeat 25 {cat /proc/5025/totmaps})"
[...]
Command being timed: "zsh -c (repeat 25 {cat /proc/5025/totmaps})"
User time (seconds): 0.00
System time (seconds): 0.40
Percent of CPU this job got: 90%
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:00.45


$ /usr/bin/time -v -p zsh -c "repeat 25 { awk '/^Rss/{rss+=\$2} /^Pss/{pss+=\$2} END {printf \"rss:%d pss:%d\n\", rss, pss}\' /proc/5025/smaps}"
[...]
Command being timed: "zsh -c repeat 25 { awk '/^Rss/{rss+=$2}
/^Pss/{pss+=$2} END {printf "rss:%d pss:%d\n", rss, pss}\' /proc/5025/smaps }"
User time (seconds): 0.37
System time (seconds): 0.45
Percent of CPU this job got: 92%
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:00.89

Robert Foss (3):
mm, proc: Implement /proc/<pid>/totmaps
Documentation/filesystems: Fixed typo
Documentation/filesystems: Added /proc/PID/totmaps documentation

Documentation/filesystems/proc.txt | 23 ++++++-
fs/proc/base.c | 1 +
fs/proc/internal.h | 2 +
fs/proc/task_mmu.c | 129 +++++++++++++++++++++++++++++++++++++
4 files changed, 154 insertions(+), 1 deletion(-)

--
2.7.4

rober...@collabora.com

Aug 16, 2016, 1:40:06 PM

rober...@collabora.com

Aug 16, 2016, 1:40:08 PM
From: Robert Foss <rober...@collabora.com>

This is based on earlier work by Thiago Goncales. It implements a new
per process proc file which summarizes the contents of the smaps file
but doesn't display any addresses. It gives more detailed information
than statm like the PSS (proportional set size). It differs from the
original implementation in that it doesn't use the full blown set of
seq operations, uses a different termination condition, and doesn't
display "Locked" as that was broken in the original implementation.

This new proc file provides information faster than parsing the potentially
huge smaps file.

Tested-by: Robert Foss <rober...@collabora.com>
Signed-off-by: Robert Foss <rober...@collabora.com>

Signed-off-by: Sonny Rao <sonn...@chromium.org>
---
fs/proc/base.c | 1 +
fs/proc/internal.h | 2 +
fs/proc/task_mmu.c | 129 +++++++++++++++++++++++++++++++++++++++++++++++++++++
3 files changed, 132 insertions(+)

diff --git a/fs/proc/base.c b/fs/proc/base.c
index a11eb71..de3acdf 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -2855,6 +2855,7 @@ static const struct pid_entry tgid_base_stuff[] = {
REG("clear_refs", S_IWUSR, proc_clear_refs_operations),
REG("smaps", S_IRUGO, proc_pid_smaps_operations),
REG("pagemap", S_IRUSR, proc_pagemap_operations),
+ REG("totmaps", S_IRUGO, proc_totmaps_operations),
#endif
#ifdef CONFIG_SECURITY
DIR("attr", S_IRUGO|S_IXUGO, proc_attr_dir_inode_operations, proc_attr_dir_operations),
diff --git a/fs/proc/internal.h b/fs/proc/internal.h
index aa27810..99f97d7 100644
--- a/fs/proc/internal.h
+++ b/fs/proc/internal.h
@@ -297,6 +297,8 @@ extern const struct file_operations proc_pid_smaps_operations;
extern const struct file_operations proc_tid_smaps_operations;
extern const struct file_operations proc_clear_refs_operations;
extern const struct file_operations proc_pagemap_operations;
+extern const struct file_operations proc_totmaps_operations;
+

extern unsigned long task_vsize(struct mm_struct *);
extern unsigned long task_statm(struct mm_struct *,
diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 4648c7f..fe692cb 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -802,6 +802,75 @@ static int show_smap(struct seq_file *m, void *v, int is_pid)
return 0;
}

+static void add_smaps_sum(struct mem_size_stats *mss,
+ struct mem_size_stats *mss_sum)
+{
+ mss_sum->resident += mss->resident;
+ mss_sum->pss += mss->pss;
+ mss_sum->shared_clean += mss->shared_clean;
+ mss_sum->shared_dirty += mss->shared_dirty;
+ mss_sum->private_clean += mss->private_clean;
+ mss_sum->private_dirty += mss->private_dirty;
+ mss_sum->referenced += mss->referenced;
+ mss_sum->anonymous += mss->anonymous;
+ mss_sum->anonymous_thp += mss->anonymous_thp;
+ mss_sum->swap += mss->swap;
+}
+
+static int totmaps_proc_show(struct seq_file *m, void *data)
+{
+ struct proc_maps_private *priv = m->private;
+ struct mm_struct *mm = priv->mm;
+ struct vm_area_struct *vma;
+ struct mem_size_stats mss_sum;
+
+ memset(&mss_sum, 0, sizeof(mss_sum));
+ down_read(&mm->mmap_sem);
+ hold_task_mempolicy(priv);
+
+ for (vma = mm->mmap; vma != priv->tail_vma; vma = vma->vm_next) {
+ struct mem_size_stats mss;
+ struct mm_walk smaps_walk = {
+ .pmd_entry = smaps_pte_range,
+ .mm = vma->vm_mm,
+ .private = &mss,
+ };
+
+ if (vma->vm_mm && !is_vm_hugetlb_page(vma)) {
+ memset(&mss, 0, sizeof(mss));
+ walk_page_vma(vma, &smaps_walk);
+ add_smaps_sum(&mss, &mss_sum);
+ }
+ }
+
@@ -812,6 +881,28 @@ static int show_tid_smap(struct seq_file *m, void *v)
@@ -836,6 +927,37 @@ static int tid_smaps_open(struct inode *inode, struct file *file)
return do_maps_open(inode, file, &proc_tid_smaps_op);
}

+static int totmaps_open(struct inode *inode, struct file *file)
+{
+ struct proc_maps_private *priv = NULL;
+ struct seq_file *seq;
+ int ret;
+
+ ret = do_maps_open(inode, file, &proc_totmaps_op);
+ if (ret)
+ goto error;
+
+ /*
+ * We need to grab references to the task_struct
+ * at open time, because there's a potential information
+ * leak where the totmaps file is opened and held open
+ * while the underlying pid to task mapping changes
+ * underneath it
+ */
+ seq = file->private_data;
+ priv = seq->private;
+ priv->task = get_proc_task(inode);
+ if (!priv->task) {
+ ret = -ESRCH;
+ goto error;
+ }
+
+ return 0;
+
+error:
+ return ret;
+}
+
const struct file_operations proc_pid_smaps_operations = {
.open = pid_smaps_open,
.read = seq_read,
@@ -850,6 +972,13 @@ const struct file_operations proc_tid_smaps_operations = {
.release = proc_map_release,
};

+const struct file_operations proc_totmaps_operations = {

Jann Horn

Aug 16, 2016, 2:10:06 PM
On Tue, Aug 16, 2016 at 01:34:16PM -0400, rober...@collabora.com wrote:
> + totmaps an extenssion based on maps, showing the total memory
> + consumption of all mappings

nit: s/extenssion/extension/

Jann Horn

Aug 16, 2016, 2:24:31 PM
On Tue, Aug 16, 2016 at 01:34:14PM -0400, rober...@collabora.com wrote:
> From: Robert Foss <rober...@collabora.com>
>
> This is based on earlier work by Thiago Goncales. It implements a new
> per process proc file which summarizes the contents of the smaps file
> but doesn't display any addresses. It gives more detailed information
> than statm like the PSS (proportional set size). It differs from the
> original implementation in that it doesn't use the full blown set of
> seq operations, uses a different termination condition, and doesn't
> display "Locked" as that was broken in the original implementation.
>
> This new proc file provides information faster than parsing the potentially
> huge smaps file.
>
> Tested-by: Robert Foss <rober...@collabora.com>
> Signed-off-by: Robert Foss <rober...@collabora.com>
>
> Signed-off-by: Sonny Rao <sonn...@chromium.org>
> ---
[...]
> +static int totmaps_open(struct inode *inode, struct file *file)
> +{
> + struct proc_maps_private *priv = NULL;
> + struct seq_file *seq;
> + int ret;
> +
> + ret = do_maps_open(inode, file, &proc_totmaps_op);
> + if (ret)
> + goto error;
> +
> + /*
> + * We need to grab references to the task_struct
> + * at open time, because there's a potential information
> + * leak where the totmaps file is opened and held open
> + * while the underlying pid to task mapping changes
> + * underneath it
> + */
> + seq = file->private_data;
> + priv = seq->private;
> + priv->task = get_proc_task(inode);
> + if (!priv->task) {
> + ret = -ESRCH;
> + goto error;

I see that you removed the proc_map_release() call for the upper
error case as I recommended. However, for the second error case,
you do have to call it because do_maps_open() succeeded.

You could fix this by turning the first "goto error;" into
"return;" and adding the proc_map_release() call back in after
the "error:" label. This would be fine - if an error branch just
needs to return an error code, it's okay to do so directly
without jumping to an error label.

Alternatively, you could add a second label
in front of the existing "error:" label, jump to the new label
for the second error case, and call proc_map_release() between
the new label and the old one.
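
Concretely, the first variant might look like this sketch (illustrative
only, not an actual posted fix):

static int totmaps_open(struct inode *inode, struct file *file)
{
	struct proc_maps_private *priv;
	struct seq_file *seq;
	int ret;

	ret = do_maps_open(inode, file, &proc_totmaps_op);
	if (ret)
		return ret;	/* nothing set up yet, nothing to release */

	seq = file->private_data;
	priv = seq->private;
	priv->task = get_proc_task(inode);
	if (!priv->task) {
		ret = -ESRCH;
		goto error;	/* do_maps_open() succeeded, so undo it */
	}

	return 0;

error:
	proc_map_release(inode, file);
	return ret;
}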


> + }
> +
> + return 0;
> +
> +error:
> + return ret;
> +}
> +
[...]
> +const struct file_operations proc_totmaps_operations = {
> + .open = totmaps_open,
> + .read = seq_read,
> + .llseek = seq_lseek,
> + .release = proc_map_release,
> +};

As I said regarding v2 already:
This won't release priv->task, causing a memory leak (exploitable
through a reference counter overflow of the task_struct usage
counter).

Robert Foss

Aug 16, 2016, 2:30:05 PM


On 2016-08-16 02:01 PM, Jann Horn wrote:
> nit: s/extenssion/extension/

Thanks :)

Robert Foss

Aug 16, 2016, 2:40:05 PM
Ah, naturally. Thanks for the patience and advice!

>
>
>> + }
>> +
>> + return 0;
>> +
>> +error:
>> + return ret;
>> +}
>> +
> [...]
>> +const struct file_operations proc_totmaps_operations = {
>> + .open = totmaps_open,
>> + .read = seq_read,
>> + .llseek = seq_lseek,
>> + .release = proc_map_release,
>> +};
>
> As I said regarding v2 already:
> This won't release priv->task, causing a memory leak (exploitable
> through a reference counter overflow of the task_struct usage
> counter).
>

Sorry about dropping the ball on that one, what's the correct way to release
priv->task?

Jann Horn

Aug 16, 2016, 3:00:06 PM
get_proc_task() does get_pid_task(), which does get_task_struct(), which
increments the ->usage field of the task. You want the inverse
operation - something that decrements ->usage and checks for zero. This is
done via put_task_struct(), which is defined a few lines below
get_task_struct().
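
In code, that pairing amounts to a small release helper along these lines
(a sketch of what such a ->release hook could do):

static int totmaps_release(struct inode *inode, struct file *file)
{
	struct seq_file *seq = file->private_data;
	struct proc_maps_private *priv = seq->private;

	/* Drop the reference taken via get_proc_task() in totmaps_open(). */
	put_task_struct(priv->task);

	return proc_map_release(inode, file);
}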

rober...@collabora.com

Aug 16, 2016, 6:40:08 PM
From: Robert Foss <rober...@collabora.com>

This is based on earlier work by Thiago Goncales. It implements a new
per process proc file which summarizes the contents of the smaps file
but doesn't display any addresses. It gives more detailed information
than statm like the PSS (proportional set size). It differs from the
original implementation in that it doesn't use the full blown set of
seq operations, uses a different termination condition, and doesn't
display "Locked" as that was broken in the original implementation.

This new proc file provides information faster than parsing the potentially
huge smaps file.

Tested-by: Robert Foss <rober...@collabora.com>
Signed-off-by: Robert Foss <rober...@collabora.com>

Signed-off-by: Sonny Rao <sonn...@chromium.org>
---
fs/proc/base.c | 1 +
fs/proc/internal.h | 2 +
fs/proc/task_mmu.c | 141 +++++++++++++++++++++++++++++++++++++++++++++++++++++
3 files changed, 144 insertions(+)
index 4648c7f..fd8fd7f 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -802,6 +802,75 @@ static int show_smap(struct seq_file *m, void *v, int is_pid)
return 0;
}

+static void add_smaps_sum(struct mem_size_stats *mss,
+ struct mem_size_stats *mss_sum)
+{
+ mss_sum->resident += mss->resident;
+ mss_sum->pss += mss->pss;
+ mss_sum->shared_clean += mss->shared_clean;
+ mss_sum->shared_dirty += mss->shared_dirty;
+ mss_sum->private_clean += mss->private_clean;
+ mss_sum->private_dirty += mss->private_dirty;
+ mss_sum->referenced += mss->referenced;
+ mss_sum->anonymous += mss->anonymous;
+ mss_sum->anonymous_thp += mss->anonymous_thp;
+ mss_sum->swap += mss->swap;
+}
+
+static int totmaps_proc_show(struct seq_file *m, void *data)
+{
+
+ return 0;
+}
@@ -836,6 +927,49 @@ static int tid_smaps_open(struct inode *inode, struct file *file)
return do_maps_open(inode, file, &proc_tid_smaps_op);
}

+static int totmaps_open(struct inode *inode, struct file *file)
+{
+ struct proc_maps_private *priv = NULL;
+ struct seq_file *seq;
+ int ret;
+
+ ret = do_maps_open(inode, file, &proc_totmaps_op);
+ if (ret)
+ goto error;
+
+ /*
+ * We need to grab references to the task_struct
+ * at open time, because there's a potential information
+ * leak where the totmaps file is opened and held open
+ * while the underlying pid to task mapping changes
+ * underneath it
+ */
+ seq = file->private_data;
+ priv = seq->private;
+ priv->task = get_proc_task(inode);
+ if (!priv->task) {
+ ret = -ESRCH;
+ goto error_free;
+ }
+
+ return 0;
+
+error_free:
+ proc_map_release(inode, file);
+error:
+ return ret;
+}
+
+static int totmaps_release(struct inode *inode, struct file *file)
+{
+ struct seq_file *seq = file->private_data;
+ struct proc_maps_private *priv = seq->private;
+
+ put_task_struct(priv->task);
+
+ return proc_map_release(inode, file);
+}
+
const struct file_operations proc_pid_smaps_operations = {
.open = pid_smaps_open,
.read = seq_read,
@@ -850,6 +984,13 @@ const struct file_operations proc_tid_smaps_operations = {
.release = proc_map_release,
};

+const struct file_operations proc_totmaps_operations = {
+ .open = totmaps_open,
+ .read = seq_read,
+ .llseek = seq_lseek,
+ .release = totmaps_release,

rober...@collabora.com

Aug 16, 2016, 6:40:09 PM
From: Robert Foss <rober...@collabora.com>

Added documentation covering /proc/PID/totmaps.

Signed-off-by: Robert Foss <rober...@collabora.com>
---
Documentation/filesystems/proc.txt | 21 +++++++++++++++++++++
1 file changed, 21 insertions(+)

diff --git a/Documentation/filesystems/proc.txt b/Documentation/filesystems/proc.txt
index 7d001be..4cb97df 100644
--- a/Documentation/filesystems/proc.txt
+++ b/Documentation/filesystems/proc.txt
@@ -11,6 +11,7 @@ Version 1.3 Kernel version 2.2.12
Kernel version 2.4.0-test11-pre4
------------------------------------------------------------------------------
fixes/update part 1.1 Stefani Seibold <ste...@seibold.net> June 9 2009
+add totmaps Robert Foss <rober...@collabora.com> August 12 2016

Table of Contents
-----------------
@@ -147,6 +148,8 @@ Table 1-1: Process specific entries in /proc
stack Report full stack trace, enable via CONFIG_STACKTRACE
smaps an extension based on maps, showing the memory consumption of
each mapping and flags associated with it
+ totmaps an extension based on maps, showing the total memory
+ consumption of all mappings

rober...@collabora.com

Aug 16, 2016, 6:40:12 PM
From: Robert Foss <rober...@collabora.com>


This series provides the /proc/PID/totmaps feature, which
summarizes the information provided by /proc/PID/smaps for
improved performance and usability reasons.

A use case is to speed up monitoring of memory consumption in
environments where RSS isn't precise.

For example Chrome tends to have many processes which have hundreds of VMAs
with a substantial amount of shared memory, and the error of using
RSS rather than PSS tends to be very large when looking at overall
memory consumption. PSS isn't kept as a single number that's exported
like RSS, so to calculate PSS means having to parse a very large smaps
file.

This process is slow and has to be repeated for many processes, and we
found that just the act of doing the parsing was taking up a
significant amount of CPU time, so this patch is an attempt to make
that process cheaper.

/proc/PID/totmaps provides roughly a 2x speedup compared to parsing
/proc/PID/smaps with awk.

$ ps aux | grep firefox
robertfoss 5025 24.3 13.7 3562820 2219616 ? Rl Aug15 277:44 /usr/lib/firefox/firefox https://allg.one/xpb
$ awk '/^[0-9a-f]/{print}' /proc/5025/smaps | wc -l
1503
$ /usr/bin/time -v -p zsh -c "(repeat 25 {cat /proc/5025/totmaps})"
[...]
Command being timed: "zsh -c (repeat 25 {cat /proc/5025/totmaps})"
User time (seconds): 0.00
System time (seconds): 0.40
Percent of CPU this job got: 90%
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:00.45

$ /usr/bin/time -v -p zsh -c "repeat 25 { awk '/^Rss/{rss+=\$2} /^Pss/{pss+=\$2} END {printf \"rss:%d pss:%d\n\", rss, pss}\' /proc/5025/smaps }"
[...]
Command being timed: "zsh -c repeat 25 { awk '/^Rss/{rss+=$2} /^Pss/{pss+=$2} END {printf "rss:%d pss:%d\n", rss, pss}\' /proc/5025/smaps }"
User time (seconds): 0.37
System time (seconds): 0.45
Percent of CPU this job got: 92%
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:00.89


Changes since v1:
- Removed IS_ERR check from get_task_mm() function
- Changed comment format
- Moved proc_totmaps_operations declaration inside internal.h
- Switched to using do_maps_open() in totmaps_open() function,
which provides privilege checking
- Error handling reworked for totmaps_open() function
- Switched to stack allocated struct mem_size_stats mss_sum in
totmaps_proc_show() function
- Removed get_task_mm() in totmaps_proc_show() since priv->mm
already is available
- Added support to proc_map_release() for priv==NULL, to allow
function to be used for all failure cases
- Added proc_totmaps_op and helper functions for it
- Added documentation in separate patch
- Removed totmaps_release() since it was just a wrapper for proc_map_release()

Changes since v2:
- Removed struct mem_size_stats *mss from struct proc_maps_private
- Removed priv->task assignment in totmaps_open() call
- Moved some assignments in totmaps_open() around to increase code
clarity
- Moved some function calls to unlock data structures before printing

Changes since v3:
- Fixed typo in totmaps documentation
- Fixed issue where proc_map_release wasn't called on error
- Fixed put_task_struct not being called during .release()

Robert Foss (3):
mm, proc: Implement /proc/<pid>/totmaps
Documentation/filesystems: Fixed typo
Documentation/filesystems: Added /proc/PID/totmaps documentation

Documentation/filesystems/proc.txt | 23 +++++-
fs/proc/base.c | 1 +
fs/proc/internal.h | 2 +
fs/proc/task_mmu.c | 141 +++++++++++++++++++++++++++++++++++++
4 files changed, 166 insertions(+), 1 deletion(-)

--
2.7.4

Michal Hocko

Aug 17, 2016, 4:30:07 AM
On Tue 16-08-16 12:46:51, Robert Foss wrote:
[...]
> $ /usr/bin/time -v -p zsh -c "repeat 25 { awk '/^Rss/{rss+=\$2}
> /^Pss/{pss+=\$2} END {printf \"rss:%d pss:%d\n\", rss, pss}\'
> /proc/5025/smaps }"
> [...]
> Command being timed: "zsh -c repeat 25 { awk '/^Rss/{rss+=$2}
> /^Pss/{pss+=$2} END {printf "rss:%d pss:%d\n", rss, pss}\' /proc/5025/smaps
> }"
> User time (seconds): 0.37
> System time (seconds): 0.45
> Percent of CPU this job got: 92%
> Elapsed (wall clock) time (h:mm:ss or m:ss): 0:00.89

This is really unexpected. Where is the user time spent? Anyway, rather
than measuring some random processes I've tried to measure something
resembling the worst case. So I've created a simple program to mmap as
much as possible:

#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>
#include <stdio.h>
int main()
{
while (mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_ANON|MAP_SHARED|MAP_POPULATE, -1, 0) != MAP_FAILED)
;

printf("pid:%d\n", getpid());
pause();
return 0;
}

so depending on /proc/sys/vm/max_map_count you will get the maximum
possible mmaps. I am using a default so 65k mappings. Then I have
retried your 25x file parsing:
$ cat s.sh
#!/bin/sh

pid=$1
for i in $(seq 25)
do
awk '/^Rss/{rss+=$2} /^Pss/{pss+=$2} END {printf "rss:%d pss:%d\n", rss, pss}' /proc/$pid/smaps
done

But I am getting different results from you:
$ awk '/^[0-9a-f]/{print}' /proc/14808/smaps | wc -l
65532
[...]
Command being timed: "sh s.sh 14808"
User time (seconds): 0.00
System time (seconds): 20.10
Percent of CPU this job got: 99%
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:20.20

The results are stable when I try multiple times, in fact there
shouldn't be any reason for them not to be. Then I went on to increase
max_map_count to 250k and that behaves consistently:
$ awk '/^[0-9a-f]/{print}' /proc/16093/smaps | wc -l
250002
[...]
Command being timed: "sh s.sh 16093"
User time (seconds): 0.00
System time (seconds): 77.93
Percent of CPU this job got: 98%
Elapsed (wall clock) time (h:mm:ss or m:ss): 1:19.09

So with a reasonable user space the parsing is really not all that time
consuming wrt. smaps handling. That being said I am still very skeptical
about a dedicated proc file which accomplishes what userspace can do
in a trivial way.

Jann Horn

Aug 17, 2016, 5:40:06 AM
Ah, nice, that's a reasonable test program. :)


> So with a reasonable user space the parsing is really not all that time
> consuming wrt. smaps handling. That being said I am still very skeptical
> about a dedicated proc file which accomplishes what userspace can do
> in a trivial way.

Now, since your numbers showed that all the time is spent in the kernel,
also create this test program to just read that file over and over again:

$ cat justreadloop.c
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <sched.h>
#include <unistd.h>
#include <err.h>
#include <stdio.h>

char buf[1000000];

int main(int argc, char **argv) {
printf("pid:%d\n", getpid());
while (1) {
int fd = open(argv[1], O_RDONLY);
if (fd < 0) continue;
if (read(fd, buf, sizeof(buf)) < 0)
err(1, "read");
close(fd);
}
}
$ gcc -Wall -o justreadloop justreadloop.c
$

Now launch your test:

$ ./mapstuff
pid:29397

point justreadloop at it:

$ ./justreadloop /proc/29397/smaps
pid:32567

... and then check the performance stats of justreadloop:

# perf top -p 32567

This is what I see:

Samples: 232K of event 'cycles:ppp', Event count (approx.): 60448424325
Overhead Shared Object Symbol
30,43% [kernel] [k] format_decode
9,12% [kernel] [k] number
7,66% [kernel] [k] vsnprintf
7,06% [kernel] [k] __lock_acquire
3,23% [kernel] [k] lock_release
2,85% [kernel] [k] debug_lockdep_rcu_enabled
2,25% [kernel] [k] skip_atoi
2,13% [kernel] [k] lock_acquire
2,05% [kernel] [k] show_smap

That's at least 30.43% + 9.12% + 7.66% = 47.21% of the task's kernel
time spent on evaluating format strings. The new interface
wouldn't have to spend that much time on format strings because there
isn't so much text to format. (My kernel is built with a
bunch of debug options - the results might look very different on
distro kernels or so, so please try this yourself.)

I guess it could be argued that this is not just a problem with
smaps, but also a problem with format strings (or text-based interfaces
in general) just being slow.

(Here is a totally random and crazy thought: Can we put something into
the kernel build process that replaces printf calls that use simple
format strings with equivalent non-printf calls? Move the cost of
evaluating the format string to compile time?)

Michal Hocko

Aug 17, 2016, 9:10:06 AM
This is a lot! I would expect the rmap walk to consume more but it doesn't
even show up in the top consumers.

> That's at least 30.43% + 9.12% + 7.66% = 47.21% of the task's kernel
> time spent on evaluating format strings. The new interface
> wouldn't have to spend that much time on format strings because there
> isn't so much text to format.

well, this is true of course but I would much rather try to reduce the
overhead of the smaps file than add a new file. The following should help
already. I've measured ~7% systime cut down. I guess there is still some
room for improvements but I have to say I'm far from being convinced about
a new proc file just because we suck at dumping information to
userspace. If this was something like /proc/<pid>/stat which is
essentially read all the time then it would be a different question, but
are rss and pss going to be read all that often? If yes, why? These are the
questions which should be answered before we even start considering the
implementation.
---
From 2a6883a7278ff8979808cb8e2dbcefe5ea3bf672 Mon Sep 17 00:00:00 2001
From: Michal Hocko <mho...@suse.com>
Date: Wed, 17 Aug 2016 14:00:13 +0200
Subject: [PATCH] proc, smaps: reduce printing overhead

seq_printf (used by show_smap) can be pretty expensive when dumping a
lot of numbers. Say we would like to get Rss and Pss from a particular
process. In order to measure a pathological case let's generate as many
mappings as possible:

$ cat max_mmap.c
int main()
{
while (mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_ANON|MAP_SHARED|MAP_POPULATE, -1, 0) != MAP_FAILED)
;

printf("pid:%d\n", getpid());
pause();
return 0;
}

$ awk '/^Rss/{rss+=$2} /^Pss/{pss+=$2} END {printf "rss:%d pss:%d\n", rss, pss}' /proc/$pid/smaps

would do the trick. The whole runtime is in the kernel space which is not
that unexpected because smaps is not the cheapest one (we have to
do the rmap walk etc.).

Command being timed: "awk /^Rss/{rss+=$2} /^Pss/{pss+=$2} END {printf "rss:%d pss:%d\n", rss, pss} /proc/3050/smaps"
User time (seconds): 0.01
System time (seconds): 0.44
Percent of CPU this job got: 99%
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:00.47

But the perf says:
22.55% awk [kernel.kallsyms] [k] format_decode
14.65% awk [kernel.kallsyms] [k] vsnprintf
6.40% awk [kernel.kallsyms] [k] number
2.53% awk [kernel.kallsyms] [k] shmem_mapping
2.53% awk [kernel.kallsyms] [k] show_smap
1.81% awk [kernel.kallsyms] [k] lock_acquire

we are spending most of the time actually generating the output which is
quite lame. Let's replace seq_printf by seq_puts and seq_put_decimal_ull.
This will give us:
Command being timed: "awk /^Rss/{rss+=$2} /^Pss/{pss+=$2} END {printf "rss:%d pss:%d\n", rss, pss} /proc/3067/smaps"
User time (seconds): 0.00
System time (seconds): 0.41
Percent of CPU this job got: 99%
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:00.42

which will give us ~7% improvement. Perf says:
28.87% awk [kernel.kallsyms] [k] seq_puts
5.30% awk [kernel.kallsyms] [k] vsnprintf
4.54% awk [kernel.kallsyms] [k] format_decode
3.73% awk [kernel.kallsyms] [k] show_smap
2.56% awk [kernel.kallsyms] [k] shmem_mapping
1.92% awk [kernel.kallsyms] [k] number
1.80% awk [kernel.kallsyms] [k] lock_acquire
1.75% awk [kernel.kallsyms] [k] print_name_value_kb

Reported-by: Jann Horn <ja...@thejh.net>
Signed-off-by: Michal Hocko <mho...@suse.com>
---
fs/proc/task_mmu.c | 63 ++++++++++++++++++++++--------------------------------
1 file changed, 25 insertions(+), 38 deletions(-)

diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 187d84ef9de9..41c24c0811da 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -721,6 +721,13 @@ void __weak arch_show_smap(struct seq_file *m, struct vm_area_struct *vma)
{
}

+static void print_name_value_kb(struct seq_file *m, const char *name, unsigned long val)
+{
+ seq_puts(m, name);
+ seq_put_decimal_ull(m, 0, val);
+ seq_puts(m, " kB\n");
+}
+
static int show_smap(struct seq_file *m, void *v, int is_pid)
{
struct vm_area_struct *vma = v;
@@ -765,45 +772,25 @@ static int show_smap(struct seq_file *m, void *v, int is_pid)

show_map_vma(m, vma, is_pid);

- seq_printf(m,
- "Size: %8lu kB\n"
- "Rss: %8lu kB\n"
- "Pss: %8lu kB\n"
- "Shared_Clean: %8lu kB\n"
- "Shared_Dirty: %8lu kB\n"
- "Private_Clean: %8lu kB\n"
- "Private_Dirty: %8lu kB\n"
- "Referenced: %8lu kB\n"
- "Anonymous: %8lu kB\n"
- "AnonHugePages: %8lu kB\n"
- "ShmemPmdMapped: %8lu kB\n"
- "Shared_Hugetlb: %8lu kB\n"
- "Private_Hugetlb: %7lu kB\n"
- "Swap: %8lu kB\n"
- "SwapPss: %8lu kB\n"
- "KernelPageSize: %8lu kB\n"
- "MMUPageSize: %8lu kB\n"
- "Locked: %8lu kB\n",
- (vma->vm_end - vma->vm_start) >> 10,
- mss.resident >> 10,
- (unsigned long)(mss.pss >> (10 + PSS_SHIFT)),
- mss.shared_clean >> 10,
- mss.shared_dirty >> 10,
- mss.private_clean >> 10,
- mss.private_dirty >> 10,
- mss.referenced >> 10,
- mss.anonymous >> 10,
- mss.anonymous_thp >> 10,
- mss.shmem_thp >> 10,
- mss.shared_hugetlb >> 10,
- mss.private_hugetlb >> 10,
- mss.swap >> 10,
- (unsigned long)(mss.swap_pss >> (10 + PSS_SHIFT)),
- vma_kernel_pagesize(vma) >> 10,
- vma_mmu_pagesize(vma) >> 10,
- (vma->vm_flags & VM_LOCKED) ?
+ print_name_value_kb(m, "Size: ", (vma->vm_end - vma->vm_start) >> 10);
+ print_name_value_kb(m, "Rss: ", mss.resident >> 10);
+ print_name_value_kb(m, "Pss: ", (unsigned long)(mss.pss >> (10 + PSS_SHIFT)));
+ print_name_value_kb(m, "Shared_Clean: ", mss.shared_clean >> 10);
+ print_name_value_kb(m, "Shared_Dirty: ", mss.shared_dirty >> 10);
+ print_name_value_kb(m, "Private_Clean: ", mss.private_clean >> 10);
+ print_name_value_kb(m, "Private_Dirty: ", mss.private_dirty >> 10);
+ print_name_value_kb(m, "Referenced: ", mss.referenced >> 10);
+ print_name_value_kb(m, "Anonymous: ", mss.anonymous >> 10);
+ print_name_value_kb(m, "AnonHugePages: ", mss.anonymous_thp >> 10);
+ print_name_value_kb(m, "ShmemPmdMapped: ", mss.shmem_thp >> 10);
+ print_name_value_kb(m, "Shared_Hugetlb: ", mss.shared_hugetlb >> 10);
+ print_name_value_kb(m, "Private_Hugetlb: ", mss.private_hugetlb >> 10);
+ print_name_value_kb(m, "Swap: ", mss.swap >> 10);
+ print_name_value_kb(m, "SwapPss: ", (unsigned long)(mss.swap_pss >> (10 + PSS_SHIFT)));
+ print_name_value_kb(m, "KernelPageSize: ", vma_kernel_pagesize(vma) >> 10);
+ print_name_value_kb(m, "MMUPageSize: ", vma_mmu_pagesize(vma) >> 10);
+ print_name_value_kb(m, "Locked: ", (vma->vm_flags & VM_LOCKED) ?
(unsigned long)(mss.pss >> (10 + PSS_SHIFT)) : 0);
-
arch_show_smap(m, vma);
show_smap_vma_flags(m, vma);
m_cache_vma(m, vma);
--
2.8.1

Robert Foss

Aug 17, 2016, 12:50:05 PM
@Sonny Rao: Maybe you can comment on how often, for how many processes
this information is needed and for which reasons this information is useful.

Sonny Rao

Aug 17, 2016, 3:10:05 PM
If the question is why do we need to read RSS, PSS, Private_*, Swap
and the other fields so often?

I have two use cases so far involving monitoring per-process memory
usage, and we usually need to read stats for about 25 processes.

Here's a timing example on a fairly recent ARM system, a 4-core RK3288
running at 1.8GHz

localhost ~ # time cat /proc/25946/smaps > /dev/null

real 0m0.036s
user 0m0.020s
sys 0m0.020s

localhost ~ # time cat /proc/25946/totmaps > /dev/null

real 0m0.027s
user 0m0.010s
sys 0m0.010s
localhost ~ #

I'll ignore the user time for now, and we see about 20 ms of system
time with smaps and 10 ms with totmaps. With 20 similar processes it
would be 400 milliseconds of cpu time for the kernel to get this
information from smaps vs 200 milliseconds with totmaps. Even totmaps
is still pretty slow, but much better than smaps.

Use cases:
1) Basic task monitoring -- like "top" that shows memory consumption
including PSS, Private, Swap
1 second update means about 40% of one CPU is spent in the kernel
gathering the data with smaps

2) User space OOM handling -- we'd rather do a more graceful shutdown
than let the kernel's OOM killer activate and need to gather this
information
and we'd like to be able to get this information to make the
decision much faster than 400ms

Michal Hocko

Aug 18, 2016, 3:50:06 AM
On Wed 17-08-16 11:57:56, Sonny Rao wrote:
> On Wed, Aug 17, 2016 at 6:03 AM, Michal Hocko <mho...@kernel.org> wrote:
> > On Wed 17-08-16 11:31:25, Jann Horn wrote:
[...]
I would argue that even 20% is way too much for such monitoring. What
is the value of doing it so often that 20 vs 40ms really matters?

> 2) User space OOM handling -- we'd rather do a more graceful shutdown
> than let the kernel's OOM killer activate and need to gather this
> information and we'd like to be able to get this information to make
> the decision much faster than 400ms

Global OOM handling in userspace is really dubious if you ask me. I
understand you want something better than SIGKILL and in fact this is
already possible with memory cgroup controller (btw. memcg will give
you a cheap access to rss, amount of shared, swapped out memory as
well). Anyway if you are getting close to the OOM your system will most
probably be really busy and chances are that also reading your new file
will take much more time. I am also not quite sure how is pss useful for
oom decisions.

Don't take me wrong, /proc/<pid>/totmaps might be suitable for your
specific usecase but so far I haven't heard any sound argument for it to
be generally usable. It is true that smaps is unnecessarily costly but
at least I can see some room for improvements. A simple patch I've
posted cut the formatting overhead by 7%. Maybe we can do more.

Michal Hocko

Aug 18, 2016, 9:10:07 PM
On Thu 18-08-16 10:47:57, Sonny Rao wrote:
> On Thu, Aug 18, 2016 at 12:44 AM, Michal Hocko <mho...@kernel.org> wrote:
> > On Wed 17-08-16 11:57:56, Sonny Rao wrote:
[...]
> >> 2) User space OOM handling -- we'd rather do a more graceful shutdown
> >> than let the kernel's OOM killer activate and need to gather this
> >> information and we'd like to be able to get this information to make
> >> the decision much faster than 400ms
> >
> > Global OOM handling in userspace is really dubious if you ask me. I
> > understand you want something better than SIGKILL and in fact this is
> > already possible with memory cgroup controller (btw. memcg will give
> > you a cheap access to rss, amount of shared, swapped out memory as
> > well). Anyway if you are getting close to the OOM your system will most
> > probably be really busy and chances are that also reading your new file
> > will take much more time. I am also not quite sure how is pss useful for
> > oom decisions.
>
> I mentioned it before, but based on experience RSS just isn't good
> enough -- there's too much sharing going on in our use case to make
> the correct decision based on RSS. If RSS were good enough, simply
> put, this patch wouldn't exist.

But that doesn't answer my question, I am afraid. So how exactly do you
use pss for oom decisions?

> So even with memcg I think we'd have the same problem?

memcg will give you instant anon, shared counters for all processes in
the memcg.

> > Don't take me wrong, /proc/<pid>/totmaps might be suitable for your
> > specific usecase but so far I haven't heard any sound argument for it to
> > be generally usable. It is true that smaps is unnecessarily costly but
> > at least I can see some room for improvements. A simple patch I've
> > posted cut the formatting overhead by 7%. Maybe we can do more.
>
> It seems like a general problem that if you want these values the
> existing kernel interface can be very expensive, so it would be
> generally usable by any application which wants a per process PSS,
> private data, dirty data or swap value.

yes this is really unfortunate. And if at all possible we should address
that. Precise values require the expensive rmap walk. We can introduce
some caching to help that. But so far it seems the biggest overhead is
to simply format the output and that should be addressed before any new
proc file is added.

> I mentioned two use cases, but I guess I don't understand the comment
> about why it's not usable by other use cases.

I might be wrong here but a use of pss is quite limited and I do not
remember anybody asking for large optimizations in that area. I still do
not understand your use cases properly so I am quite skeptical about a
general usefulness of a new file.

Robert Foss

unread,
Aug 18, 2016, 10:00:07 PM8/18/16
to
Is it technically feasible to add instant pss support to memcg?

@Sonny Rao: Would using cgroups be acceptable for chromiumos?

Minchan Kim

unread,
Aug 18, 2016, 10:30:05 PM8/18/16
to
Hi Michal,

On Thu, Aug 18, 2016 at 08:01:04PM +0200, Michal Hocko wrote:
> On Thu 18-08-16 10:47:57, Sonny Rao wrote:
> > On Thu, Aug 18, 2016 at 12:44 AM, Michal Hocko <mho...@kernel.org> wrote:
> > > On Wed 17-08-16 11:57:56, Sonny Rao wrote:
> [...]
> > >> 2) User space OOM handling -- we'd rather do a more graceful shutdown
> > >> than let the kernel's OOM killer activate and need to gather this
> > >> information and we'd like to be able to get this information to make
> > >> the decision much faster than 400ms
> > >
> > > Global OOM handling in userspace is really dubious if you ask me. I
> > > understand you want something better than SIGKILL and in fact this is
> > > already possible with memory cgroup controller (btw. memcg will give
> > > you a cheap access to rss, amount of shared, swapped out memory as
> > > well). Anyway if you are getting close to the OOM your system will most
> > > probably be really busy and chances are that also reading your new file
> > > will take much more time. I am also not quite sure how is pss useful for
> > > oom decisions.
> >
> > I mentioned it before, but based on experience RSS just isn't good
> > enough -- there's too much sharing going on in our use case to make
> > the correct decision based on RSS. If RSS were good enough, simply
> > put, this patch wouldn't exist.
>
> But that doesn't answer my question, I am afraid. So how exactly do you
> use pss for oom decisions?

My case is not about OOM decisions, but I agree it would be great if we could
get *fast* smaps summary information.

PSS is a really great tool for figuring out how processes consume memory,
more precisely than RSS. We have been using it for per-process memory
monitoring. Although it is not used for OOM decisions, it would be great
if it were sped up, because we don't want to spend much CPU time on just
monitoring.

For our use case, we don't need AnonHugePages, ShmemPmdMapped, Shared_Hugetlb,
Private_Hugetlb, KernelPageSize, MMUPageSize because we never enable THP and
hugetlb. Additionally, Locked can be known via vma flags so we don't need it,
either. We don't even need the address ranges for plain monitoring, when we
aren't investigating in detail.

Although they are not a severe overhead, why emit information that is useless
to us? It even bloats day by day. :( With that, userspace tools have to spend
more time parsing, which is pointless.

Having said that, I'm not a fan of creating a new stat knob for that, either.
How about appending the summary information at the end of smaps?
Then monitoring users can just open the file, lseek to the (end - 1), and
read only the summary.

Thanks.

Sonny Rao

unread,
Aug 19, 2016, 1:20:06 AM8/19/16
to
Yeah it is too much (I believe I said that) but it's significantly better.

>> 2) User space OOM handling -- we'd rather do a more graceful shutdown
>> than let the kernel's OOM killer activate and need to gather this
>> information and we'd like to be able to get this information to make
>> the decision much faster than 400ms
>
> Global OOM handling in userspace is really dubious if you ask me. I
> understand you want something better than SIGKILL and in fact this is
> already possible with memory cgroup controller (btw. memcg will give
> you a cheap access to rss, amount of shared, swapped out memory as
> well). Anyway if you are getting close to the OOM your system will most
> probably be really busy and chances are that also reading your new file
> will take much more time. I am also not quite sure how is pss useful for
> oom decisions.

I mentioned it before, but based on experience RSS just isn't good
enough -- there's too much sharing going on in our use case to make
the correct decision based on RSS. If RSS were good enough, simply
put, this patch wouldn't exist. So even with memcg I think we'd have
the same problem?

>
> Don't take me wrong, /proc/<pid>/totmaps might be suitable for your
> specific usecase but so far I haven't heard any sound argument for it to
> be generally usable. It is true that smaps is unnecessarily costly but
> at least I can see some room for improvements. A simple patch I've
> posted cut the formatting overhead by 7%. Maybe we can do more.

It seems like a general problem that if you want these values the
existing kernel interface can be very expensive, so it would be
generally usable by any application which wants a per process PSS,
private data, dirty data or swap value. I mentioned two use cases,
but I guess I don't understand the comment about why it's not usable
by other use cases.

Sonny Rao

unread,
Aug 19, 2016, 2:30:06 AM8/19/16
to
It's possible, though I think we'd end up putting each renderer in
its own cgroup to get the PSS stat, so it seems a bit like overkill.
I think memcg also has some overhead that we'd need to quantify but I
could be mistaken about this.

Sonny Rao

unread,
Aug 19, 2016, 2:50:04 AM8/19/16
to
That would work fine for us as long as it's fast -- i.e. we don't
still have to do all the expensive per-VMA format conversion in the
kernel.

> Thanks.

Sonny Rao

unread,
Aug 19, 2016, 2:50:05 AM8/19/16
to
On Thu, Aug 18, 2016 at 11:01 AM, Michal Hocko <mho...@kernel.org> wrote:
> On Thu 18-08-16 10:47:57, Sonny Rao wrote:
>> On Thu, Aug 18, 2016 at 12:44 AM, Michal Hocko <mho...@kernel.org> wrote:
>> > On Wed 17-08-16 11:57:56, Sonny Rao wrote:
> [...]
>> >> 2) User space OOM handling -- we'd rather do a more graceful shutdown
>> >> than let the kernel's OOM killer activate and need to gather this
>> >> information and we'd like to be able to get this information to make
>> >> the decision much faster than 400ms
>> >
>> > Global OOM handling in userspace is really dubious if you ask me. I
>> > understand you want something better than SIGKILL and in fact this is
>> > already possible with memory cgroup controller (btw. memcg will give
>> > you a cheap access to rss, amount of shared, swapped out memory as
>> > well). Anyway if you are getting close to the OOM your system will most
>> > probably be really busy and chances are that also reading your new file
>> > will take much more time. I am also not quite sure how is pss useful for
>> > oom decisions.
>>
>> I mentioned it before, but based on experience RSS just isn't good
>> enough -- there's too much sharing going on in our use case to make
>> the correct decision based on RSS. If RSS were good enough, simply
>> put, this patch wouldn't exist.
>
> But that doesn't answer my question, I am afraid. So how exactly do you
> use pss for oom decisions?

We use PSS to calculate the memory used by a process among all the
processes in the system, in the case of Chrome this tells us how much
each renderer process (which is roughly tied to a particular "tab" in
Chrome) is using and how much it has swapped out, so we know what the
worst offenders are -- I'm not sure what's unclear about that?

Chrome tends to use a lot of shared memory so we found PSS to be
better than RSS, and I can give you examples of the RSS and PSS on
real systems to illustrate the magnitude of the difference between
those two numbers if that would be useful.

>
>> So even with memcg I think we'd have the same problem?
>
> memcg will give you instant anon, shared counters for all processes in
> the memcg.
>

We want to be able to get per-process granularity quickly. I'm not
sure if memcg provides that exactly?

>> > Don't take me wrong, /proc/<pid>/totmaps might be suitable for your
>> > specific usecase but so far I haven't heard any sound argument for it to
>> > be generally usable. It is true that smaps is unnecessarily costly but
>> > at least I can see some room for improvements. A simple patch I've
>> > posted cut the formatting overhead by 7%. Maybe we can do more.
>>
>> It seems like a general problem that if you want these values the
>> existing kernel interface can be very expensive, so it would be
>> generally usable by any application which wants a per process PSS,
>> private data, dirty data or swap value.
>
> yes this is really unfortunate. And if at all possible we should address
> that. Precise values require the expensive rmap walk. We can introduce
> some caching to help that. But so far it seems the biggest overhead is
> to simply format the output and that should be addressed before any new
> proc file is added.
>
>> I mentioned two use cases, but I guess I don't understand the comment
>> about why it's not usable by other use cases.
>
> I might be wrong here but a use of pss is quite limited and I do not
> remember anybody asking for large optimizations in that area. I still do
> not understand your use cases properly so I am quite skeptical about a
> general usefulness of a new file.

How do you know that usage of PSS is quite limited? I can only say
that we've been using it on Chromium OS for at least four years and
have found it very valuable, and I think I've explained the use cases
in this thread. If you have more specific questions then I can try to
clarify.

Michal Hocko

unread,
Aug 19, 2016, 4:00:10 AM8/19/16
to
So let me ask more specifically. How can you make any decision based on
the pss when you do not know _what_ the shared resource is? In other
words, if you select a task to terminate based on the pss, then you have to
kill the others who share the same resource, otherwise you do not release
that shared resource. Not to mention that such a shared resource might
be on tmpfs/shmem and it won't get released even after all processes
which map it are gone.

I am sorry for being dense but it is still not clear to me how the
single pss number can be used for oom or, in general, any serious
decisions. The counter might be useful of course for debugging purposes
or to have a general overview but then arguing about 40 vs 20ms sounds a
bit strange to me.

> Chrome tends to use a lot of shared memory so we found PSS to be
> better than RSS, and I can give you examples of the RSS and PSS on
> real systems to illustrate the magnitude of the difference between
> those two numbers if that would be useful.
>
> >
> >> So even with memcg I think we'd have the same problem?
> >
> > memcg will give you instant anon, shared counters for all processes in
> > the memcg.
> >
>
> We want to be able to get per-process granularity quickly. I'm not
> sure if memcg provides that exactly?

It will give you that information if you do process-per-memcg, but that
doesn't sound ideal. I thought those 20-something processes you were
talking about are treated together but it seems I misunderstood.

Michal Hocko

unread,
Aug 19, 2016, 5:10:05 AM8/19/16
to
So far it doesn't really seem that the parsing is the biggest problem.
The major cycles killer is the output formatting and that doesn't sound
like a problem we are not able to address. And I would even argue that
we want to address it in a generic way as much as possible.

> Having said that, I'm not fan of creating new stat knob for that, either.
> How about appending summary information in the end of smap?
> So, monitoring users can just open the file and lseek to the (end - 1) and
> read the summary only.

That might confuse existing parsers. Besides that, we already have
/proc/<pid>/statm, which gives cumulative numbers. I am not sure
how often it is used, or whether the pte walk would be too expensive for
existing users, but that should be explored and evaluated before a new
file is created.
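
For reference, statm is already about as cheap as it gets to consume from
userspace; a minimal sketch (field order per proc(5): size, resident, shared,
text, lib, data, dt, all in pages; the little helper itself is only an
illustration):

/* Read the cumulative per-process counters that statm already exports.
 * Values are in pages; multiply by the page size to get bytes. */
#include <stdio.h>
#include <unistd.h>

int main(int argc, char **argv)
{
	char path[64];
	unsigned long size, resident, shared, text, lib, data, dt;
	FILE *f;

	snprintf(path, sizeof(path), "/proc/%s/statm",
		 argc > 1 ? argv[1] : "self");
	f = fopen(path, "r");
	if (!f || fscanf(f, "%lu %lu %lu %lu %lu %lu %lu",
			 &size, &resident, &shared, &text, &lib, &data, &dt) != 7)
		return 1;
	fclose(f);

	printf("resident: %lu kB, shared: %lu kB\n",
	       resident * sysconf(_SC_PAGESIZE) / 1024,
	       shared * sysconf(_SC_PAGESIZE) / 1024);
	return 0;
}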

The /proc became a dump of everything people found interesting just
because we were too easy in allowing those additions. Do not repeat those
mistakes, please!

Sonny Rao

unread,
Aug 19, 2016, 2:00:05 PM8/19/16
to
Ok, I see why you're confused now, sorry.

In our case we do know, in general, what is being shared, because the
sharing is mostly between those processes that we're looking at and not
other random processes or tmpfs. So PSS gives us useful data in the
context of these processes which are sharing the data, especially for
monitoring across the set of these renderer processes.

We also use the private clean, private dirty, and swap fields to
build a few metrics for the processes and charge each process for its
private, shared, and swap data. Private clean and dirty are used for
estimating a lower bound on how much memory would be freed. Swap and
PSS also give us some indication of additional memory which might get
freed up.
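
To make that concrete, here is a minimal userspace sketch of the lower-bound
metric computed from the proposed totmaps format (field names as in the file;
the program name and the "self" fallback are just illustrative):

/* Parse /proc/<pid>/totmaps and compute a rough "freeable" estimate:
 * lower bound = Private_Clean + Private_Dirty, with Swap reported
 * separately. Field names match the proposed totmaps output. */
#include <stdio.h>
#include <string.h>

int main(int argc, char **argv)
{
	char path[64], key[32];
	long val, priv_clean = 0, priv_dirty = 0, swap = 0;
	FILE *f;

	snprintf(path, sizeof(path), "/proc/%s/totmaps",
		 argc > 1 ? argv[1] : "self");
	f = fopen(path, "r");
	if (!f)
		return 1;
	while (fscanf(f, "%31[^:]: %ld kB\n", key, &val) == 2) {
		if (!strcmp(key, "Private_Clean"))
			priv_clean = val;
		else if (!strcmp(key, "Private_Dirty"))
			priv_dirty = val;
		else if (!strcmp(key, "Swap"))
			swap = val;
	}
	fclose(f);
	printf("min freeable: %ld kB (plus up to %ld kB swap)\n",
	       priv_clean + priv_dirty, swap);
	return 0;
}

Run it as ./totmaps_metric <pid>; on a kernel without the patch it simply
exits with status 1 because the file does not exist.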

>
> I am sorry for being dense but it is still not clear to me how the
> single pss number can be used for oom or, in general, any serious
> decisions. The counter might be useful of course for debugging purposes
> or to have a general overview but then arguing about 40 vs 20ms sounds a
> bit strange to me.

Yeah, so it's more than just the single PSS number: PSS,
Private_Clean, Private_Dirty, and Swap are all interesting numbers for
making these decisions.

Sonny Rao

unread,
Aug 19, 2016, 2:30:06 PM8/19/16
to
Another thing I noticed was that we lock down smaps on Chromium OS. I
think this is to avoid exposing more information than necessary via
proc. The totmaps file gives us just the information we need and
nothing else. I certainly don't think we need a proc file for this
use case -- do you think a new system call is better or something
else?

Minchan Kim

unread,
Aug 21, 2016, 8:10:04 PM8/21/16
to
I cannot understand how kernel space is more expensive.
Hmm. I tested your test program on my machine.


#!/bin/sh
./smap_test &
pid=$!

for i in $(seq 25)
do
cat /proc/$pid/smaps > /dev/null
done
kill $pid

root@bbox:/home/barrios/test/smap# time ./s_v.sh
pid:21925
real 0m3.365s
user 0m0.031s
sys 0m3.046s


vs.

#!/bin/sh
./smap_test &
pid=$!

for i in $(seq 25)
do
awk '/^Rss/{rss+=$2} /^Pss/{pss+=$2} END {}' \
/proc/$pid/smaps
done
kill $pid

root@bbox:/home/barrios/test/smap# time ./s.sh
pid:21973

real 0m17.812s
user 0m12.612s
sys 0m5.187s

perf report says

39.56% awk gawk [.] dfaexec
7.61% awk [kernel.kallsyms] [k] format_decode
6.37% awk gawk [.] avoid_dfa
5.85% awk gawk [.] interpret
5.69% awk [kernel.kallsyms] [k] __memcpy
4.37% awk [kernel.kallsyms] [k] vsnprintf
2.69% awk [kernel.kallsyms] [k] number.isra.13
2.10% awk gawk [.] research
1.91% awk gawk [.] 0x00000000000351d0
1.49% awk gawk [.] free_wstr
1.27% awk gawk [.] unref
1.19% awk gawk [.] reset_record
0.95% awk gawk [.] set_record
0.95% awk gawk [.] get_field
0.94% awk [kernel.kallsyms] [k] show_smap

Parsing is much more expensive than the kernel side.
Could you retest your test program?

> like a problem we are not able to address. And I would even argue that
> we want to address it in a generic way as much as possible.

Sure. What solution do you have in mind as the generic way?

Michal Hocko

unread,
Aug 22, 2016, 3:50:06 AM8/22/16
to
On Mon 22-08-16 09:07:45, Minchan Kim wrote:
[...]
> #!/bin/sh
> ./smap_test &
> pid=$!
>
> for i in $(seq 25)
> do
> awk '/^Rss/{rss+=$2} /^Pss/{pss+=$2} END {}' \
> /proc/$pid/smaps
> done
> kill $pid
>
> root@bbox:/home/barrios/test/smap# time ./s.sh
> pid:21973
>
> real 0m17.812s
> user 0m12.612s
> sys 0m5.187s

retested on the bare metal (x86_64 - 2CPUs)
Command being timed: "sh s.sh"
User time (seconds): 0.00
System time (seconds): 18.08
Percent of CPU this job got: 98%
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:18.29

multiple runs are quite consistent in those numbers. I am running with
$ awk --version
GNU Awk 4.1.3, API: 1.1 (GNU MPFR 3.1.4, GNU MP 6.1.0)

> > like a problem we are not able to address. And I would even argue that
> > we want to address it in a generic way as much as possible.
>
> Sure. What solution do you think as generic way?

either optimize seq_printf or replace it with something faster.

Michal Hocko

unread,
Aug 22, 2016, 4:00:06 AM8/22/16
to
OK, I see and agree that pss might be useful when you _know_ what is
shared. But this sounds quite specific to a particular workload. How
many users are in a similar situation? In other words, if we present
a single number without the context, how useful will it be in
general? Is it possible that presenting such a number could be even
misleading for somebody who doesn't have an idea which resources are
shared? These are all questions which should be answered before we
actually add this number (be it a new/existing proc file or a syscall).
I still believe that the number without wider context is just not all
that useful.

> We also use the private clean and private dirty and swap fields to
> make a few metrics for the processes and charge each process for it's
> private, shared, and swap data. Private clean and dirty are used for
> estimating a lower bound on how much memory would be freed.

I can imagine that this kind of information might be useful and
presented in /proc/<pid>/statm. The question is whether some of the
existing consumers would see a performance impact due to the page table
walk. Anyway even these counters might get quite tricky because even
shareable resources are considered private if the process is the only
one to map them (so again this might be a file on tmpfs...).

> Swap and
> PSS also give us some indication of additional memory which might get
> freed up.

Minchan Kim

unread,
Aug 22, 2016, 10:20:07 AM8/22/16
to
If it's the real culprit, I agree. However, I tested your test program on
my 2 x86 machines and my friend's machine.

Ubuntu, Fedora, Arch

They have awk 4.0.1 and 4.1.3.

The results are the same: userspace spends more time, as I mentioned.

[root@blaptop smap_test]# time awk '/^Rss/{rss+=$2} /^Pss/{pss+=$2} END {printf "rss:%d pss:%d\n", rss, pss}' /proc/3552/smaps
rss:263484 pss:262188

real 0m0.770s
user 0m0.574s
sys 0m0.197s

I will attach my test program source.
I hope you guys test it and repost the results, because it's key for the
direction of the patchset.

Thanks.
smap_test.c

Robert Foss

unread,
Aug 22, 2016, 10:40:05 AM8/22/16
to
$ ./smap_test &
pid:19658 nr_vma:65514

$ time awk '/^Rss/{rss+=$2} /^Pss/{pss+=$2} END {printf "rss:%d
pss:%d\n", rss, pss}' /proc/19658/smaps
rss:263452 pss:262151

real 0m0.625s
user 0m0.404s
sys 0m0.216s

Michal Hocko

unread,
Aug 22, 2016, 12:50:11 PM8/22/16
to
Hmm, this is really interesting. I have checked a different machine and
it shows different results. Same code, slightly different version of awk
(4.1.0) and the results are different
Command being timed: "awk /^Rss/{rss+=$2} /^Pss/{pss+=$2} END {printf "rss:%d pss:%d\n", rss, pss} /proc/48925/smaps"
User time (seconds): 0.43
System time (seconds): 0.27

I have no idea why those numbers are so different on my laptop
yet. It surely looks suspicious. I will try to debug this further
tomorrow. Anyway, the performance is just one side of the problem. I
have tried to express my concerns about a single exported pss value in
the other email. Please try to step back and think about how useful this
information is without knowing which resource we are talking about.

Michal Hocko

unread,
Aug 22, 2016, 1:30:07 PM8/22/16
to
On Mon 22-08-16 18:45:54, Michal Hocko wrote:
[...]
> I have no idea why those numbers are so different on my laptop
> yet. It surely looks suspicious. I will try to debug this further
> tomorrow.

Hmm, so I've tried to use my version of awk on the other machine and vice
versa and it didn't make any difference. So it seems this is independent of
the awk version. So I've tried to strace /usr/bin/time and got:
wait4(-1, [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], 0, {ru_utime={0, 0}, ru_stime={0, 688438}, ...}) = 9128

so the kernel indeed reports 0 user time for some reason. Note I
was testing with 4.7 and right now with 4.8.0-rc3 kernel (no local
modifications). The other machine which reports non-0 utime is 3.12
SLES kernel. Maybe I am hitting some accounting bug. At first I was
suspecting CONFIG_NO_HZ_FULL because that is the main difference between
my and the other machine but then I've noticed that the tests I was
doing in kvm have this disabled too.. so it must be something else.

Weird...

Michal Hocko

unread,
Aug 22, 2016, 1:50:07 PM8/22/16
to
4.5 reports non-0 while 4.6 reports zero utime. The NO_HZ configuration is the same
in both kernels.

Sonny Rao

unread,
Aug 22, 2016, 6:50:06 PM8/22/16
to
I see the specific point about PSS -- because you need to know what
is being shared or otherwise use it in a whole system context, but I
still think the whole system context is a valid and generally useful
thing. But what about the private_clean and private_dirty? Surely
those are more generally useful for calculating a lower bound on
process memory usage without additional knowledge?

At the end of the day all of these metrics are approximations, and it
comes down to how far off the various approximations are and what
trade offs we are willing to make.
RSS is the cheapest but the most coarse.

PSS (with the correct context) and Private data plus swap are much
better but also more expensive due to the PT walk.
As far as I know, to get anything but RSS we have to go through smaps
or use memcg. Swap seems to be available in /proc/<pid>/status.

I looked at the "shared" value in /proc/<pid>/statm but it doesn't
seem to correlate well with the shared value in smaps -- not sure why?

It might be useful to show the magnitude of difference of using RSS vs
PSS/Private in the case of the Chrome renderer processes. On the
system I was looking at there were about 40 of these processes, but I
picked a few to give an idea:

localhost ~ # cat /proc/21550/totmaps
Rss: 98972 kB
Pss: 54717 kB
Shared_Clean: 19020 kB
Shared_Dirty: 26352 kB
Private_Clean: 0 kB
Private_Dirty: 53600 kB
Referenced: 92184 kB
Anonymous: 46524 kB
AnonHugePages: 24576 kB
Swap: 13148 kB


RSS is 80% higher than PSS and 84% higher than private data

localhost ~ # cat /proc/21470/totmaps
Rss: 118420 kB
Pss: 70938 kB
Shared_Clean: 22212 kB
Shared_Dirty: 26520 kB
Private_Clean: 0 kB
Private_Dirty: 69688 kB
Referenced: 111500 kB
Anonymous: 79928 kB
AnonHugePages: 24576 kB
Swap: 12964 kB

RSS is 66% higher than PSS and 69% higher than private data

localhost ~ # cat /proc/21435/totmaps
Rss: 97156 kB
Pss: 50044 kB
Shared_Clean: 21920 kB
Shared_Dirty: 26400 kB
Private_Clean: 0 kB
Private_Dirty: 48836 kB
Referenced: 90012 kB
Anonymous: 75228 kB
AnonHugePages: 24576 kB
Swap: 13064 kB

RSS is 94% higher than PSS and 98% higher than private data.

It looks like there's a set of about 40MB of shared pages which cause
the difference in this case.
Swap was roughly even on these but I don't think it's always going to be true.

Michal Hocko

unread,
Aug 23, 2016, 4:30:07 AM8/23/16
to
and one more thing. It is not like utime accounting would be completely
broken and always report 0. Other commands report non-0 values even on
4.6 kernels. I will try to bisect this down later today.

Rik van Riel

unread,
Aug 23, 2016, 5:50:07 PM8/23/16
to
On Tue, 2016-08-23 at 16:33 +0200, Michal Hocko wrote:
> OK, so it seems I found it. I was quite lucky because
> account_user_time
> is not all that popular function and there were basically no changes
> besides Riks ff9a9b4c4334 ("sched, time: Switch
> VIRT_CPU_ACCOUNTING_GEN
> to jiffy granularity") and that seems to cause the regression.
> Reverting
> the commit on top of the current mmotm seems to fix the issue for me.
>
> And just to give Rik more context. While debugging overhead of the
> /proc/<pid>/smaps I am getting a misleading output from /usr/bin/time
> -v
> (source for ./max_mmap is [1])
>
> root@test1:~# uname -r
> 4.5.0-rc6-bisect1-00025-gff9a9b4c4334
> root@test1:~# ./max_map 
> pid:2990 maps:65515
> root@test1:~# /usr/bin/time -v awk '/^Rss/{rss+=$2} /^Pss/{pss+=$2}
> END {printf "rss:%d pss:%d\n", rss, pss}' /proc/2990/smaps
> rss:263368 pss:262203
>         Command being timed: "awk /^Rss/{rss+=$2} /^Pss/{pss+=$2} END
> {printf "rss:%d pss:%d\n", rss, pss} /proc/2990/smaps"
>         User time (seconds): 0.00
>         System time (seconds): 0.45
>         Percent of CPU this job got: 98%
>

> root@test1:~# /usr/bin/time -v awk '/^Rss/{rss+=$2} /^Pss/{pss+=$2}
> END {printf "rss:%d pss:%d\n", rss, pss}' /proc/3015/smaps
> rss:263316 pss:262199
>         Command being timed: "awk /^Rss/{rss+=$2} /^Pss/{pss+=$2} END
> {printf "rss:%d pss:%d\n", rss, pss} /proc/3015/smaps"
>         User time (seconds): 0.18
>         System time (seconds): 0.29
>         Percent of CPU this job got: 97%

The patch in question makes user and system
time accounting essentially tick-based. If
jiffies changes while the task is in user
mode, time gets accounted as user time, if
jiffies changes while the task is in system
mode, time gets accounted as system time.

If you get "unlucky", with a job like the
above, it is possible all time gets accounted
to system time.

This would be true both with the system running
with a periodic timer tick (before and after my
patch is applied), and in nohz_idle mode (after
my patch).

However, it does seem quite unlikely that you
get zero user time, since you have 125 timer
ticks in half a second. Furthermore, you do not
even have NO_HZ_FULL enabled...

Does the workload consistently get zero user
time?

If so, we need to dig further to see under
what precise circumstances that happens.

On my laptop, with kernel 4.6.3-300.fc24.x86_64
I get this:

$ /usr/bin/time -v awk '/^Rss/{rss+=$2} /^Pss/{pss+=$2} END {printf
"rss:%d pss:%d\n", rss, pss}' /proc/19825/smaps
rss:263368 pss:262145
Command being timed: "awk /^Rss/{rss+=$2} /^Pss/{pss+=$2} END
{printf "rss:%d pss:%d\n", rss, pss} /proc/19825/smaps"
User time (seconds): 0.64
System time (seconds): 0.19
Percent of CPU this job got: 99%
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:00.83

The main difference between your and my
NO_HZ config seems to be that NO_HZ_FULL
is set here. However, it is not enabled
at run time, so both of our systems
should only really get NO_HZ_IDLE
effectively.

Running tasks should get sampled with the
regular timer tick, while they are running.

In other words, vtime accounting should be
disabled in both of our tests, for everything
except the idle task.

Do I need to do anything special to reproduce
your bug, besides running the max mmap program
and the awk script?

Marcin Jabrzyk

unread,
Aug 24, 2016, 6:20:06 AM8/24/16
to
Sorry to hijack the thread, but I've found it recently
and I guess it's the best place to present our point.
We are working on our custom OS based on Linux, and we have also suffered a
lot from the /proc/<pid>/smaps file. As in Chrome, we tried to improve our
internal application memory management policies (Low Memory Killer) using
data provided by smaps, but we failed due to the very long time needed to
read and properly parse the file.

We've also observed that the RSS measurement is often far above PSS, which
seems closer to the real memory usage of a process. Using smaps we would
be able to calculate USS and know the exact minimum amount of memory
that would be freed after terminating some process. Those are very
important sources of information, as they give us the possibility to
provide the best possible app life-cycle.

We have also tried to use smaps in an application for OS developers
as a source of detailed information about the system's memory usage.
To check possible ways of improvement we tried totmaps from an earlier
version of this series. In a sample case for our app, the CPU usage as
presented by 'top' decreased from ~60% to ~4.5% just by changing the data
source from smaps to totmaps.

So we are also very interested in using an interface such as totmaps, as it
gives detailed and complete memory usage information to user-space; in our
case much of the information provided by smaps is not useful to us at all.

We are also using, or have tried using, other interfaces like status, statm,
cgroups.memory, etc., but totmaps/smaps is still the best interface for
getting all of this per-process information in a single place.

>>
>>> We also use the private clean and private dirty and swap fields to
>>> make a few metrics for the processes and charge each process for it's
>>> private, shared, and swap data. Private clean and dirty are used for
>>> estimating a lower bound on how much memory would be freed.
>>
>> I can imagine that this kind of information might be useful and
>> presented in /proc/<pid>/statm. The question is whether some of the
>> existing consumers would see the performance impact due to he page table
>> walk. Anyway even these counters might get quite tricky because even
>> shareable resources are considered private if the process is the only
>> one to map them (so again this might be a file on tmpfs...).
>>
>>> Swap and
>>> PSS also give us some indication of additional memory which might get
>>> freed up.
>> --
>> Michal Hocko
>> SUSE Labs
>
>

--
Marcin Jabrzyk
Samsung R&D Institute Poland
Samsung Electronics

Michal Hocko

unread,
Aug 29, 2016, 10:40:05 AM8/29/16
to
[Sorry for a late reply, I was busy with other stuff]

On Mon 22-08-16 15:44:53, Sonny Rao wrote:
> On Mon, Aug 22, 2016 at 12:54 AM, Michal Hocko <mho...@kernel.org> wrote:
[...]
> But what about the private_clean and private_dirty? Surely
> those are more generally useful for calculating a lower bound on
> process memory usage without additional knowledge?

I guess private_clean can be used as a reasonable estimate.
private_dirty less so, because it may refer to e.g. tmpfs which is not
mapped by any other process, so no memory would be freed after unmapping
without removing the file.

> At the end of the day all of these metrics are approximations, and it
> comes down to how far off the various approximations are and what
> trade offs we are willing to make.
> RSS is the cheapest but the most coarse.

I agree on this part definitely. I also understand that what we provide
currently is quite confusing and not really helpful. But I am afraid
that the accounting is far from trivial to make right for all the
possible cases.

> PSS (with the correct context) and Private data plus swap are much
> better but also more expensive due to the PT walk.

Maybe we can be more clever and do some form of caching. I haven't
thought that through to see how hard it would be. I mean, we could
cache some data per mm_struct and invalidate it only once the cached
value gets too far out of sync.
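
A very rough sketch of the kind of per-mm caching I mean (completely
untested; the struct, field names, the max_age policy and the
totmaps_walk() placeholder are all made up, and locking of the cache is
ignored):

/* Hypothetical per-mm cache of the smaps totals. The expensive page
 * table walk is redone only when the cached copy is older than max_age.
 * This is an illustration of the idea, not a proposed implementation. */
struct mss_cache {
	struct mem_size_stats stats;	/* totals from the last full walk */
	unsigned long stamp;		/* jiffies when they were computed */
	bool valid;
};

static void totmaps_get_cached(struct mm_struct *mm, struct mss_cache *cache,
			       struct mem_size_stats *out,
			       unsigned long max_age)
{
	if (cache->valid && time_before(jiffies, cache->stamp + max_age)) {
		*out = cache->stats;		/* cheap path: reuse totals */
		return;
	}

	/* slow path: the existing mmap_sem + smaps_pte_range walk */
	totmaps_walk(mm, &cache->stats);	/* placeholder for that walk */
	cache->stamp = jiffies;
	cache->valid = true;
	*out = cache->stats;
}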

> As far as I know, to get anything but RSS we have to go through smaps
> or use memcg. Swap seems to be available in /proc/<pid>/status.
>
> I looked at the "shared" value in /proc/<pid>/statm but it doesn't
> seem to correlate well with the shared value in smaps -- not sure why?

task_statm() only approximates it as get_mm_counter(mm, MM_FILEPAGES) +
get_mm_counter(mm, MM_SHMEMPAGES), i.e. all the file and shmem pages
accounted to the mm. If those pages are not shared by anybody else they
would be considered private by smaps.
OK, I see that those processes differ in the way they are using
memory, but I am not really sure what kind of conclusion you can draw
from that.

Michal Hocko

unread,
Aug 30, 2016, 4:20:05 AM8/30/16
to
On Mon 29-08-16 16:37:04, Michal Hocko wrote:
> [Sorry for a late reply, I was busy with other stuff]
>
> On Mon 22-08-16 15:44:53, Sonny Rao wrote:
> > On Mon, Aug 22, 2016 at 12:54 AM, Michal Hocko <mho...@kernel.org> wrote:
> [...]
> > But what about the private_clean and private_dirty? Surely
> > those are more generally useful for calculating a lower bound on
> > process memory usage without additional knowledge?
>
> I guess private_clean can be used as a reasonable estimate.

I was thinking about this more and I think I am wrong here. Even truly
MAP_PRIVATE|MAP_ANON memory will show up in private_dirty. So private_clean
becomes not all that interesting, and it is just as misleading as its _dirty
variant (an mmaped file after [m]sync should become _clean), and that
doesn't mean the memory will get freed after the process which maps it
terminates. Take shmem as an example again.
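
To make the shmem point concrete, a small userspace demo (assuming /dev/shm
is tmpfs; the file name is made up): the mapping shows up under Private_* in
smaps because nobody else maps it, yet none of that memory is released when
the process goes away, only when the file itself is removed.

/* Demo: a tmpfs-backed MAP_SHARED mapping with a single mapper is
 * accounted as Private_* in smaps, but the memory is only released
 * when the file is removed, not when this process goes away. */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
	const size_t len = 16 << 20;			/* 16 MB */
	int fd = open("/dev/shm/totmaps-demo", O_RDWR | O_CREAT, 0600);
	char *p;

	if (fd < 0 || ftruncate(fd, len) < 0)
		return 1;
	p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	if (p == MAP_FAILED)
		return 1;
	memset(p, 0x5a, len);				/* fault all pages in */

	printf("pid %d: check Private_* in /proc/%d/smaps, then kill me\n",
	       getpid(), getpid());
	pause();
	return 0;
}

After killing it, the pages are still held by tmpfs until
/dev/shm/totmaps-demo is unlinked.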

> private_dirty less so because it may refer to e.g. tmpfs which is not
> mapped by other process and so no memory would be freed after unmap
> without removing the file.

Michal Hocko

unread,
Aug 30, 2016, 4:30:05 AM8/30/16
to
On Wed 24-08-16 12:14:06, Marcin Jabrzyk wrote:
[...]
> Sorry to hijack the thread, but I've found it recently
> and I guess it's the best place to present our point.
> We are working at our custom OS based on Linux and we also suffered much
> by /proc/<pid>/smaps file. As in Chrome we tried to improve our internal
> application memory management polices (Low Memory Killer) using data
> provided by smaps but we failed due to very long time needed for reading
> and parsing properly the file.

I was already questioning Pss and also Private_* for any memory killer
purpose earlier in the thread, because cumulative numbers for all
mappings can be really meaningless, especially when you do not know
which resource is shared and by whom. Maybe you can describe how
you are using those cumulative numbers for your decisions and prove me
wrong, but I simply haven't heard any sound arguments so far. Everything
was just "we know what we are doing in our environment so we know those
resources and therefore those numbers make sense to us". But with all due
respect this is not a reason to add a user visible API into the kernel.

Jacek Anaszewski

unread,
Aug 31, 2016, 5:50:11 AM8/31/16
to
Hi Robert,

On 08/17/2016 12:33 AM, rober...@collabora.com wrote:
> From: Robert Foss <rober...@collabora.com>
>
> This is based on earlier work by Thiago Goncales. It implements a new
> per process proc file which summarizes the contents of the smaps file
> but doesn't display any addresses. It gives more detailed information
> than statm like the PSS (proprotional set size). It differs from the
> original implementation in that it doesn't use the full blown set of
> seq operations, uses a different termination condition, and doesn't
> displayed "Locked" as that was broken on the original implemenation.
>
> This new proc file provides information faster than parsing the potentially
> huge smaps file.
>
> Tested-by: Robert Foss <rober...@collabora.com>
> Signed-off-by: Robert Foss <rober...@collabora.com>
>
> Signed-off-by: Sonny Rao <sonn...@chromium.org>
> ---
> fs/proc/base.c | 1 +
> fs/proc/internal.h | 2 +
> fs/proc/task_mmu.c | 141 +++++++++++++++++++++++++++++++++++++++++++++++++++++
> 3 files changed, 144 insertions(+)
>
> diff --git a/fs/proc/base.c b/fs/proc/base.c
> index a11eb71..de3acdf 100644
> --- a/fs/proc/base.c
> +++ b/fs/proc/base.c
> @@ -2855,6 +2855,7 @@ static const struct pid_entry tgid_base_stuff[] = {
> REG("clear_refs", S_IWUSR, proc_clear_refs_operations),
> REG("smaps", S_IRUGO, proc_pid_smaps_operations),
> REG("pagemap", S_IRUSR, proc_pagemap_operations),
> + REG("totmaps", S_IRUGO, proc_totmaps_operations),
> #endif
> #ifdef CONFIG_SECURITY
> DIR("attr", S_IRUGO|S_IXUGO, proc_attr_dir_inode_operations, proc_attr_dir_operations),
> diff --git a/fs/proc/internal.h b/fs/proc/internal.h
> index aa27810..99f97d7 100644
> --- a/fs/proc/internal.h
> +++ b/fs/proc/internal.h
> @@ -297,6 +297,8 @@ extern const struct file_operations proc_pid_smaps_operations;
> extern const struct file_operations proc_tid_smaps_operations;
> extern const struct file_operations proc_clear_refs_operations;
> extern const struct file_operations proc_pagemap_operations;
> +extern const struct file_operations proc_totmaps_operations;
> +
>
> extern unsigned long task_vsize(struct mm_struct *);
> extern unsigned long task_statm(struct mm_struct *,
> diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
> index 4648c7f..fd8fd7f 100644
> --- a/fs/proc/task_mmu.c
> +++ b/fs/proc/task_mmu.c
> @@ -802,6 +802,75 @@ static int show_smap(struct seq_file *m, void *v, int is_pid)
> return 0;
> }
>
> +static void add_smaps_sum(struct mem_size_stats *mss,
> + struct mem_size_stats *mss_sum)
> +{
> + mss_sum->resident += mss->resident;
> + mss_sum->pss += mss->pss;
> + mss_sum->shared_clean += mss->shared_clean;
> + mss_sum->shared_dirty += mss->shared_dirty;
> + mss_sum->private_clean += mss->private_clean;
> + mss_sum->private_dirty += mss->private_dirty;
> + mss_sum->referenced += mss->referenced;
> + mss_sum->anonymous += mss->anonymous;
> + mss_sum->anonymous_thp += mss->anonymous_thp;
> + mss_sum->swap += mss->swap;
> +}
> +
> +static int totmaps_proc_show(struct seq_file *m, void *data)
> +{
> + struct proc_maps_private *priv = m->private;
> + struct mm_struct *mm = priv->mm;
> + struct vm_area_struct *vma;
> + struct mem_size_stats mss_sum;
> +
> + memset(&mss_sum, 0, sizeof(mss_sum));
> + down_read(&mm->mmap_sem);
> + hold_task_mempolicy(priv);
> +
> + for (vma = mm->mmap; vma != priv->tail_vma; vma = vma->vm_next) {
> + struct mem_size_stats mss;
> + struct mm_walk smaps_walk = {
> + .pmd_entry = smaps_pte_range,
> + .mm = vma->vm_mm,
> + .private = &mss,
> + };
> +
> + if (vma->vm_mm && !is_vm_hugetlb_page(vma)) {
> + memset(&mss, 0, sizeof(mss));
> + walk_page_vma(vma, &smaps_walk);
> + add_smaps_sum(&mss, &mss_sum);
> + }
> + }
> +
> + release_task_mempolicy(priv);
> + up_read(&mm->mmap_sem);
> +
> + seq_printf(m,
> + "Rss: %8lu kB\n"
> + "Pss: %8lu kB\n"
> + "Shared_Clean: %8lu kB\n"
> + "Shared_Dirty: %8lu kB\n"
> + "Private_Clean: %8lu kB\n"
> + "Private_Dirty: %8lu kB\n"
> + "Referenced: %8lu kB\n"
> + "Anonymous: %8lu kB\n"
> + "AnonHugePages: %8lu kB\n"
> + "Swap: %8lu kB\n",
> + mss_sum.resident >> 10,
> + (unsigned long)(mss_sum.pss >> (10 + PSS_SHIFT)),
> + mss_sum.shared_clean >> 10,
> + mss_sum.shared_dirty >> 10,
> + mss_sum.private_clean >> 10,
> + mss_sum.private_dirty >> 10,
> + mss_sum.referenced >> 10,
> + mss_sum.anonymous >> 10,
> + mss_sum.anonymous_thp >> 10,
> + mss_sum.swap >> 10);
> +
> + return 0;
> +}
> +
> static int show_pid_smap(struct seq_file *m, void *v)
> {
> return show_smap(m, v, 1);
> @@ -812,6 +881,28 @@ static int show_tid_smap(struct seq_file *m, void *v)
> return show_smap(m, v, 0);
> }
>
> +static void *m_totmaps_start(struct seq_file *p, loff_t *pos)
> +{
> + return NULL + (*pos == 0);
> +}
> +
> +static void *m_totmaps_next(struct seq_file *p, void *v, loff_t *pos)
> +{
> + ++*pos;
> + return NULL;
> +}
> +
> +static void m_totmaps_stop(struct seq_file *p, void *v)
> +{
> +}
> +
> +static const struct seq_operations proc_totmaps_op = {
> + .start = m_totmaps_start,
> + .next = m_totmaps_next,
> + .stop = m_totmaps_stop,
> + .show = totmaps_proc_show
> +};
> +
> static const struct seq_operations proc_pid_smaps_op = {
> .start = m_start,
> .next = m_next,
> @@ -836,6 +927,49 @@ static int tid_smaps_open(struct inode *inode, struct file *file)
> return do_maps_open(inode, file, &proc_tid_smaps_op);
> }
>
> +static int totmaps_open(struct inode *inode, struct file *file)
> +{
> + struct proc_maps_private *priv = NULL;
> + struct seq_file *seq;
> + int ret;
> +
> + ret = do_maps_open(inode, file, &proc_totmaps_op);
> + if (ret)
> + goto error;
> +
> + /*
> + * We need to grab references to the task_struct
> + * at open time, because there's a potential information
> + * leak where the totmaps file is opened and held open
> + * while the underlying pid to task mapping changes
> + * underneath it
> + */
> + seq = file->private_data;
> + priv = seq->private;
> + priv->task = get_proc_task(inode);
> + if (!priv->task) {
> + ret = -ESRCH;
> + goto error_free;
> + }
> +
> + return 0;
> +
> +error_free:
> + proc_map_release(inode, file);
> +error:
> + return ret;
> +}
> +
> +static int totmaps_release(struct inode *inode, struct file *file)
> +{
> + struct seq_file *seq = file->private_data;
> + struct proc_maps_private *priv = seq->private;
> +
> + put_task_struct(priv->task);
> +
> + return proc_map_release(inode, file);
> +}
> +
> const struct file_operations proc_pid_smaps_operations = {
> .open = pid_smaps_open,
> .read = seq_read,
> @@ -850,6 +984,13 @@ const struct file_operations proc_tid_smaps_operations = {
> .release = proc_map_release,
> };
>
> +const struct file_operations proc_totmaps_operations = {
> + .open = totmaps_open,
> + .read = seq_read,
> + .llseek = seq_lseek,
> + .release = totmaps_release,
> +};
> +
> enum clear_refs_types {
> CLEAR_REFS_ALL = 1,
> CLEAR_REFS_ANON,
>

When reading totmaps of kernel processes the following NULL pointer
dereference occurs:

Unable to handle kernel NULL pointer dereference at virtual address 00000044
pgd = ee6e0000
[00000044] *pgd=7b83a831
Internal error: Oops: 17 [#6] PREEMPT SMP ARM
Modules linked in:
CPU: 2 PID: 1495 Comm: cat Tainted: G D W 4.8.0-rc2-00010-g22fe2db-dirty #159
Hardware name: SAMSUNG EXYNOS (Flattened Device Tree)
task: ee596e00 task.stack: ee470000
PC is at down_read+0xc/0x48
LR is at totmaps_proc_show+0x2c/0x1e8
pc : [<c06c01f4>] lr : [<c022a154>] psr: 40000013
sp : ee471db8 ip : 00000000 fp : 00000000
r10: edfe1080 r9 : 00000001 r8 : 00000044
r7 : ee4abd00 r6 : edfe1080 r5 : edde0b80 r4 : 00000044
r3 : 00000000 r2 : 00000000 r1 : ffffffc8 r0 : 00000044
Flags: nZcv IRQs on FIQs on Mode SVC_32 ISA ARM Segment none
Control: 10c5387d Table: 6e6e004a DAC: 00000051
Process cat (pid: 1495, stack limit = 0xee470210)
Stack: (0xee471db8 to 0xee472000)
1da0: 00000000 c022a154
1dc0: ee596e00 024200ca 00000000 024200ca 00000000 00000081 c0b02594 024200ca
1de0: 00000055 ee5b7e44 00000800 c019cad0 00000000 c06c1af0 00000001 c032aa90
1e00: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
1e20: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
1e40: 00000000 00000000 c0a69764 c0a69758 0000000b c01afd60 eff4d000 eff4d000
1e60: edc55f20 00000000 edfe10b0 0001c000 20000013 c06bfc9c 7ab80c7f c01bc060
1e80: 00000002 ef001b80 c0a695d0 024000c0 00008000 ee471ec0 00008000 edfe1080
1ea0: ee4abd00 00000001 00000001 ee471f80 00000000 c01fe564 0001c000 edfe10b0
1ec0: 00000000 00000000 00024e84 ee5b7e00 ee5b7e44 c0705348 0001c000 ee4abd00
1ee0: ee471f80 00008000 ee470000 0001c000 00000000 c01dc850 c0b06aac ee471fb0
1f00: b6fbf220 b6fbf7c4 000001ff c0101308 386d6a0e 32e4d737 386d6a0e 32e4d737
1f20: 00002838 00000000 ee4abd00 bec0eba0 00000000 bec0ed84 ee596e00 00000000
1f40: ee4abd00 00008000 0001c000 00000000 ee471f80 c01ddca0 00000004 ee478124
1f60: 00000001 00000000 00000000 ee4abd00 ee4abd00 00008000 0001c000 c01ddd64
1f80: 00000000 00000000 00000000 00008000 0001c000 00000003 00000003 c0107ac4
1fa0: 00000000 c0107900 00008000 0001c000 00000003 0001c000 00008000 0001c000
1fc0: 00008000 0001c000 00000003 00000003 00008000 00000000 0000005e 00000000
1fe0: 00000000 bec0eb0c 0000c694 b6f4248c 60000010 00000003 fdfffffb ffffffff
[<c06c01f4>] (down_read) from [<c022a154>] (totmaps_proc_show+0x2c/0x1e8)
[<c022a154>] (totmaps_proc_show) from [<c01fe564>] (seq_read+0x1c8/0x4b8)
[<c01fe564>] (seq_read) from [<c01dc850>] (__vfs_read+0x2c/0x110)
[<c01dc850>] (__vfs_read) from [<c01ddca0>] (vfs_read+0x8c/0x110)
[<c01ddca0>] (vfs_read) from [<c01ddd64>] (SyS_read+0x40/0x8c)
[<c01ddd64>] (SyS_read) from [<c0107900>] (ret_fast_syscall+0x0/0x3c)

It seems that some protection is needed for such processes, so that
totmaps would return empty string then, like in case of smaps.

--
Best regards,
Jacek Anaszewski

Robert Foss

unread,
Aug 31, 2016, 12:50:06 PM8/31/16
to
Thanks for the testing Jacek!

I had a look around the corresponding smaps code, but I'm not seeing any
checks, do you know where that check actually is made?


Rob.

Mateusz Guzik

unread,
Aug 31, 2016, 1:10:40 PM8/31/16
to
On Wed, Aug 31, 2016 at 12:36:26PM -0400, Robert Foss wrote:
> On 2016-08-31 05:45 AM, Jacek Anaszewski wrote:
> > > +static void *m_totmaps_start(struct seq_file *p, loff_t *pos)
> > > +{
> > > + return NULL + (*pos == 0);
> > > +}
> > > +
> > > +static void *m_totmaps_next(struct seq_file *p, void *v, loff_t *pos)
> > > +{
> > > + ++*pos;
> > > + return NULL;
> > > +}
> > > +
> >
> > When reading totmaps of kernel processes the following NULL pointer
> > dereference occurs:
> >
> > Unable to handle kernel NULL pointer dereference at virtual address
> > 00000044
> > [<c06c01f4>] (down_read) from [<c022a154>] (totmaps_proc_show+0x2c/0x1e8)
> > [<c022a154>] (totmaps_proc_show) from [<c01fe564>] (seq_read+0x1c8/0x4b8)
> > [<c01fe564>] (seq_read) from [<c01dc850>] (__vfs_read+0x2c/0x110)
> > [<c01dc850>] (__vfs_read) from [<c01ddca0>] (vfs_read+0x8c/0x110)
> > [<c01ddca0>] (vfs_read) from [<c01ddd64>] (SyS_read+0x40/0x8c)
> > [<c01ddd64>] (SyS_read) from [<c0107900>] (ret_fast_syscall+0x0/0x3c)
> >
> > It seems that some protection is needed for such processes, so that
> > totmaps would return empty string then, like in case of smaps.
> >
>
> Thanks for the testing Jacek!
>
> I had a look around the corresponding smaps code, but I'm not seeing any
> checks, do you know where that check actually is made?
>

See m_start in fs/proc/task_mmu.c. It not only checks for a non-null mm,
but also tries to bump ->mm_users and only then proceeds to walk the mm.

--
Mateusz Guzik

Robert Foss

unread,
Sep 1, 2016, 7:50:31 PM9/1/16
to
So an m_totmaps_start that looks something like the one below would be
enough? And if so, would mm->mm_users need to be decremented inside of
m_totmaps_start?

static void *m_totmaps_start(struct seq_file *p, loff_t *pos)
{
	struct proc_maps_private *priv = p->private;
	struct mm_struct *mm;

	mm = priv->mm;
	if (!mm || !atomic_inc_not_zero(&mm->mm_users))
		return NULL;

	return NULL + (*pos == 0);
}

rober...@collabora.com

unread,
Sep 5, 2016, 4:20:06 PM9/5/16
to
From: Robert Foss <rober...@collabora.com>

This series provides the /proc/PID/totmaps feature, which
summarizes the information provided by /proc/PID/smaps for
improved performance and usability reasons.

A use case is to speed up monitoring of memory consumption in
environments where RSS isn't precise.

For example Chrome tends to have many processes which have hundreds of VMAs
with a substantial amount of shared memory, and the error of using
RSS rather than PSS tends to be very large when looking at overall
memory consumption. PSS isn't kept as a single number that's exported
like RSS, so to calculate PSS means having to parse a very large smaps
file.

This process is slow and has to be repeated for many processes, and we
found that just the act of doing the parsing was taking up a
significant amount of CPU time, so this patch is an attempt to make
that process cheaper.

/proc/PID/totmaps provides roughly a 2x speedup compared to parsing
/proc/PID/smaps with awk.

$ ps aux | grep firefox
robertfoss 5025 24.3 13.7 3562820 2219616 ? Rl Aug15 277:44 /usr/lib/firefox/firefox https://allg.one/xpb
$ awk '/^[0-9a-f]/{print}' /proc/5025/smaps | wc -l
1503
$ /usr/bin/time -v -p zsh -c "(repeat 25 {cat /proc/5025/totmaps})"
[...]
Command being timed: "zsh -c (repeat 25 {cat /proc/5025/totmaps})"
User time (seconds): 0.00
System time (seconds): 0.40
Percent of CPU this job got: 90%
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:00.45

$ /usr/bin/time -v -p zsh -c "repeat 25 { awk '/^Rss/{rss+=\$2} /^Pss/{pss+=\$2} END {printf \"rss:%d pss:%d\n\", rss, pss}\' /proc/5025/smaps }"
[...]
Command being timed: "zsh -c repeat 25 { awk '/^Rss/{rss+=$2} /^Pss/{pss+=$2} END {printf "rss:%d pss:%d\n", rss, pss}\' /proc/5025/smaps }"
User time (seconds): 0.37
System time (seconds): 0.45
Percent of CPU this job got: 92%
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:00.89


Changes since v1:
- Removed IS_ERR check from get_task_mm() function
- Changed comment format
- Moved proc_totmaps_operations declaration inside internal.h
- Switched to using do_maps_open() in totmaps_open() function,
which provides privilege checking
- Error handling reworked for totmaps_open() function
- Switched to stack allocated struct mem_size_stats mss_sum in
totmaps_proc_show() function
- Removed get_task_mm() in totmaps_proc_show() since priv->mm
already is available
- Added support to proc_map_release() for priv==NULL, to allow the
function to be used for all failure cases
- Added proc_totmaps_op and helper functions for it
- Added documentation in a separate patch
- Removed totmaps_release() since it was just a wrapper for proc_map_release()

Changes since v2:
- Removed struct mem_size_stats *mss from struct proc_maps_private
- Removed priv->task assignment in totmaps_open() call
- Moved some assignments in totmaps_open() around to increase code
clarity
- Moved some function calls to unlock data structures before printing

Changes since v3:
- Fixed typo in totmaps documentation
- Fixed issue where proc_map_release wasn't called on error
- Fixed put_task_struct not being called during .release()

Changes since v4:
- Prevent access to invalid processes

Robert Foss (3):
mm, proc: Implement /proc/<pid>/totmaps
Documentation/filesystems: Fixed typo
Documentation/filesystems: Added /proc/PID/totmaps documentation

Documentation/filesystems/proc.txt | 23 +++++-
fs/proc/base.c | 1 +
fs/proc/internal.h | 2 +
fs/proc/task_mmu.c | 148 +++++++++++++++++++++++++++++++++++++
4 files changed, 173 insertions(+), 1 deletion(-)

--
2.7.4

rober...@collabora.com

unread,
Sep 5, 2016, 4:20:06 PM9/5/16
to
From: Robert Foss <rober...@collabora.com>

This is based on earlier work by Thiago Goncales. It implements a new
per process proc file which summarizes the contents of the smaps file
but doesn't display any addresses. It gives more detailed information
than statm, like the PSS (proportional set size). It differs from the
original implementation in that it doesn't use the full blown set of
seq operations, uses a different termination condition, and doesn't
display "Locked" as that was broken in the original implementation.

This new proc file provides information faster than parsing the potentially
huge smaps file.

Tested-by: Robert Foss <rober...@collabora.com>
Signed-off-by: Robert Foss <rober...@collabora.com>

Signed-off-by: Sonny Rao <sonn...@chromium.org>
---
fs/proc/base.c | 1 +
fs/proc/internal.h | 2 +
fs/proc/task_mmu.c | 148 +++++++++++++++++++++++++++++++++++++++++++++++++++++
3 files changed, 151 insertions(+)

diff --git a/fs/proc/base.c b/fs/proc/base.c
index ac0df4d..dc7e81b7 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -2854,6 +2854,7 @@ static const struct pid_entry tgid_base_stuff[] = {
REG("clear_refs", S_IWUSR, proc_clear_refs_operations),
REG("smaps", S_IRUGO, proc_pid_smaps_operations),
REG("pagemap", S_IRUSR, proc_pagemap_operations),
+ REG("totmaps", S_IRUGO, proc_totmaps_operations),
#endif
#ifdef CONFIG_SECURITY
DIR("attr", S_IRUGO|S_IXUGO, proc_attr_dir_inode_operations, proc_attr_dir_operations),
diff --git a/fs/proc/internal.h b/fs/proc/internal.h
index 7931c55..3bdafe8 100644
--- a/fs/proc/internal.h
+++ b/fs/proc/internal.h
@@ -298,6 +298,8 @@ extern const struct file_operations proc_pid_smaps_operations;
extern const struct file_operations proc_tid_smaps_operations;
extern const struct file_operations proc_clear_refs_operations;
extern const struct file_operations proc_pagemap_operations;
+extern const struct file_operations proc_totmaps_operations;
+

extern unsigned long task_vsize(struct mm_struct *);
extern unsigned long task_statm(struct mm_struct *,
diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 187d84e..f0f4fee 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -810,6 +810,75 @@ static int show_smap(struct seq_file *m, void *v, int is_pid)
@@ -820,6 +889,35 @@ static int show_tid_smap(struct seq_file *m, void *v)
return show_smap(m, v, 0);
}

+static void *m_totmaps_start(struct seq_file *m, loff_t *ppos)
+{
+ struct proc_maps_private *priv = m->private;
+ struct mm_struct *mm;
+
+ mm = priv->mm;
+ if (!mm || !atomic_inc_not_zero(&mm->mm_users))
+ return NULL;
+
+ return NULL + (*ppos == 0);
+}
+
+static void *m_totmaps_next(struct seq_file *p, void *v, loff_t *pos)
+{
+ ++*pos;
+ return NULL;
+}
+
+static void m_totmaps_stop(struct seq_file *p, void *v)
+{
+}
+
+static const struct seq_operations proc_totmaps_op = {
+ .start = m_totmaps_start,
+ .next = m_totmaps_next,
+ .stop = m_totmaps_stop,
+ .show = totmaps_proc_show
+};
+
static const struct seq_operations proc_pid_smaps_op = {
.start = m_start,
.next = m_next,
@@ -844,6 +942,49 @@ static int tid_smaps_open(struct inode *inode, struct file *file)
@@ -858,6 +999,13 @@ const struct file_operations proc_tid_smaps_operations = {
.release = proc_map_release,
};

+const struct file_operations proc_totmaps_operations = {
+ .open = totmaps_open,
+ .read = seq_read,
+ .llseek = seq_lseek,
+ .release = totmaps_release,
+};
+
enum clear_refs_types {
CLEAR_REFS_ALL = 1,
CLEAR_REFS_ANON,
--
2.7.4

rober...@collabora.com

unread,
Sep 5, 2016, 4:20:06 PM9/5/16
to
From: Robert Foss <rober...@collabora.com>

Fixed a -> an typo.

Signed-off-by: Robert Foss <rober...@collabora.com>
---
Documentation/filesystems/proc.txt | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/Documentation/filesystems/proc.txt b/Documentation/filesystems/proc.txt
index 68080ad..fcc1ac0 100644
--- a/Documentation/filesystems/proc.txt
+++ b/Documentation/filesystems/proc.txt
@@ -145,7 +145,7 @@ Table 1-1: Process specific entries in /proc
symbol the task is blocked in - or "0" if not blocked.
pagemap Page table
stack Report full stack trace, enable via CONFIG_STACKTRACE
- smaps a extension based on maps, showing the memory consumption of
+ smaps an extension based on maps, showing the memory consumption of
each mapping and flags associated with it
numa_maps an extension based on maps, showing the memory locality and
binding policy as well as mem usage (in pages) of each mapping.
--
2.7.4

Oleg Nesterov

unread,
Sep 7, 2016, 9:00:06 AM9/7/16
to
On 09/05, rober...@collabora.com wrote:
>
> @@ -2854,6 +2854,7 @@ static const struct pid_entry tgid_base_stuff[] = {
> REG("clear_refs", S_IWUSR, proc_clear_refs_operations),
> REG("smaps", S_IRUGO, proc_pid_smaps_operations),
> REG("pagemap", S_IRUSR, proc_pagemap_operations),
> + REG("totmaps", S_IRUGO, proc_totmaps_operations),

I must have missed something, but I fail to understand why this patch
is so complicated.

Just use ONE("totmaps", S_IRUGO, proc_totmaps_operations) ?

> +static int totmaps_proc_show(struct seq_file *m, void *data)
> +{
> + struct proc_maps_private *priv = m->private;
> + struct mm_struct *mm = priv->mm;
> + struct vm_area_struct *vma;
> + struct mem_size_stats mss_sum;
> +
> + memset(&mss_sum, 0, sizeof(mss_sum));
> + down_read(&mm->mmap_sem);
> + hold_task_mempolicy(priv);
^^^^^^^^^^^^^^^^^^^^^^^^^
why?

> + for (vma = mm->mmap; vma != priv->tail_vma; vma = vma->vm_next) {

Hmm. the usage of ->tail_vma looks just wrong. I guess the code should
work because it is NULL but still.

> + struct mem_size_stats mss;
> + struct mm_walk smaps_walk = {
> + .pmd_entry = smaps_pte_range,
> + .mm = vma->vm_mm,
> + .private = &mss,
> + };
> +
> + if (vma->vm_mm && !is_vm_hugetlb_page(vma)) {
> + memset(&mss, 0, sizeof(mss));
> + walk_page_vma(vma, &smaps_walk);
> + add_smaps_sum(&mss, &mss_sum);
> + }
> + }

Why? I mean, why not walk_page_range()? You do not need this for-each-vma
loop at all. At least if you change this patch to use the ONE() helper,
everything else looks unneeded in this case.
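
Roughly what I have in mind, as a completely untested sketch (reusing the
names from the existing smaps code; details like hugetlb handling are left
unverified):

/* Untested sketch of the ONE() + walk_page_range() variant: register the
 * entry as ONE("totmaps", S_IRUGO, proc_totmaps_show) and sum everything
 * in a single walk instead of the hand-rolled per-VMA loop. */
static int proc_totmaps_show(struct seq_file *m, struct pid_namespace *ns,
			     struct pid *pid, struct task_struct *task)
{
	struct mem_size_stats mss;
	struct mm_struct *mm = get_task_mm(task);
	struct mm_walk smaps_walk = {
		.pmd_entry = smaps_pte_range,
		.private = &mss,
	};

	if (!mm)
		return 0;

	memset(&mss, 0, sizeof(mss));
	smaps_walk.mm = mm;

	down_read(&mm->mmap_sem);
	walk_page_range(0, ~0UL, &smaps_walk);	/* one pass over the whole mm */
	up_read(&mm->mmap_sem);
	mmput(mm);

	seq_printf(m, "Rss: %8lu kB\nPss: %8lu kB\n",
		   mss.resident >> 10,
		   (unsigned long)(mss.pss >> (10 + PSS_SHIFT)));
	/* ... the remaining fields exactly as in the current patch ... */
	return 0;
}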

Oleg.

Kees Cook

unread,
Sep 7, 2016, 7:30:05 PM9/7/16
to
On Mon, Sep 5, 2016 at 1:14 PM, <rober...@collabora.com> wrote:
> From: Robert Foss <rober...@collabora.com>
>
> Fixed a -> an typo.
>
> Signed-off-by: Robert Foss <rober...@collabora.com>

Acked-by: Kees Cook <kees...@chromium.org>

This could be taken directly into the docs tree, I think -- no reason
to make it depend on the rest of the series.

-Kees

> ---
> Documentation/filesystems/proc.txt | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/Documentation/filesystems/proc.txt b/Documentation/filesystems/proc.txt
> index 68080ad..fcc1ac0 100644
> --- a/Documentation/filesystems/proc.txt
> +++ b/Documentation/filesystems/proc.txt
> @@ -145,7 +145,7 @@ Table 1-1: Process specific entries in /proc
> symbol the task is blocked in - or "0" if not blocked.
> pagemap Page table
> stack Report full stack trace, enable via CONFIG_STACKTRACE
> - smaps a extension based on maps, showing the memory consumption of
> + smaps an extension based on maps, showing the memory consumption of
> each mapping and flags associated with it
> numa_maps an extension based on maps, showing the memory locality and
> binding policy as well as mem usage (in pages) of each mapping.
> --
> 2.7.4
>



--
Kees Cook
Nexus Security

Robert Foss
Sep 7, 2016, 8:30:05 PM


On 2016-09-07 07:22 PM, Kees Cook wrote:
> On Mon, Sep 5, 2016 at 1:14 PM, <rober...@collabora.com> wrote:
>> From: Robert Foss <rober...@collabora.com>
>>
>> Fixed a -> an typo.
>>
>> Signed-off-by: Robert Foss <rober...@collabora.com>
>
> Acked-by: Kees Cook <kees...@chromium.org>
>
> This could be taken directly into the docs tree, I think -- no reason
> to make it depend on the rest of the series.

Agreed. Would you like a separate submission for that?


Rob.

Jonathan Corbet
Sep 8, 2016, 2:30:12 AM
On Wed, 7 Sep 2016 20:22:00 -0400
Robert Foss <rober...@collabora.com> wrote:

> > This could be taken directly into the docs tree, I think -- no reason
> > to make it depend on the rest of the series.
>
> Agreed. Would you like a separate submission for that?

Please, I've lost track of the original...

Thanks,

jon

rober...@collabora.com
Sep 8, 2016, 6:50:06 PM
From: Robert Foss <rober...@collabora.com>

Fixed a -> an typo.

Signed-off-by: Robert Foss <rober...@collabora.com>
Acked-by: Kees Cook <kees...@chromium.org>

---
Documentation/filesystems/proc.txt | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/Documentation/filesystems/proc.txt b/Documentation/filesystems/proc.txt
index 68080ad..fcc1ac0 100644
--- a/Documentation/filesystems/proc.txt
+++ b/Documentation/filesystems/proc.txt
@@ -145,7 +145,7 @@ Table 1-1: Process specific entries in /proc
symbol the task is blocked in - or "0" if not blocked.
pagemap Page table
stack Report full stack trace, enable via CONFIG_STACKTRACE
- smaps a extension based on maps, showing the memory consumption of
+ smaps an extension based on maps, showing the memory consumption of
each mapping and flags associated with it
numa_maps an extension based on maps, showing the memory locality and
binding policy as well as mem usage (in pages) of each mapping.
--
2.9.3

Michal Hocko
Sep 12, 2016, 8:10:10 AM
On Mon 05-09-16 16:14:06, rober...@collabora.com wrote:
> From: Robert Foss <rober...@collabora.com>
>
> This series provides the /proc/PID/totmaps feature, which
> summarizes the information provided by /proc/PID/smaps for
> improved performance and usability reasons.
>
> A use case is to speed up monitoring of memory consumption in
> environments where RSS isn't precise.
>
> For example, Chrome tends to have many processes, each of which has hundreds of VMAs
> with a substantial amount of shared memory, and the error of using
> RSS rather than PSS tends to be very large when looking at overall
> memory consumption. PSS isn't kept as a single number that's exported
> like RSS, so to calculate PSS means having to parse a very large smaps
> file.
>
> This process is slow and has to be repeated for many processes, and we
> found that just the act of doing the parsing was taking up a
> significant amount of CPU time, so this patch is an attempt to make
> that process cheaper.

I still maintain my concerns about a single pss value. It might work in
very specific situations where the consumer knows what is shared, but
other than that the value can be more misleading than helpful. So a NACK
from me until I am shown that this is usable in general and still
helpful.

Sonny Rao
Sep 12, 2016, 11:50:07 AM
I know you think Pss isn't useful in general (though I'll point out that
two other independent people said they found it useful), but how about
the other fields like Swap, Private_Dirty and Private_Shared?

If we removed Pss would you still NACK it?

Michal Hocko
Sep 12, 2016, 1:20:10 PM
Sure, and one of them admitted that the value is useful because they
_know_ the resource. The other was too vague for me to understand
all the details. Please try to understand that once you provide a user
API, it will be carved in stone. If the interface is poor and ambiguous,
it will bite us later. One very specific usecase doesn't justify
something that might be really misleading for 90% of cases.

> but how about the other fields like Swap, Private_Dirty and
> Private_Shared?

Private_Shared can be pretty confusing as well without the whole context;
see my other emails in the original thread (just as a reminder,
shmem/tmpfs makes all this really confusing).

Sonny Rao
Sep 12, 2016, 1:30:13 PM
But this is exactly the issue -- RSS can be just as confusing if
you don't know something about the application. I think the issue is
how common that situation is, and you seem to believe that it's so
uncommon that it's actually better to keep the information more
difficult to get for those of us who know something about our systems.

That's fine, I guess we just have to disagree here. Thanks for looking at this.

Robert Foss
Sep 12, 2016, 6:20:06 PM
Hey Oleg!

Thanks for the feedback, I'll keep it in mind, but currently it looks
like the patch is on ice for non-implementation related reasons.


Rob.

Michal Hocko
Sep 13, 2016, 3:20:06 AM
On Mon 12-09-16 10:28:53, Sonny Rao wrote:
> On Mon, Sep 12, 2016 at 10:15 AM, Michal Hocko <mho...@kernel.org> wrote:
> > On Mon 12-09-16 08:31:36, Sonny Rao wrote:
[...]
> >> but how about the other fields like Swap, Private_Dirty and
> >> Private_Shared?
> >
> > Private_Shared can be pretty confusing as well without the whole context;
> > see my other emails in the original thread (just as a reminder,
> > shmem/tmpfs makes all this really confusing).
>
> But this is exactly the issue -- RSS can be just as confusing if
> you don't know something about the application.

I agree that rss can be confusing but we will not make the situation any
better if we add yet another confusing metric.

> I think the issue is
> how common that situation is, and you seem to believe that it's so
> uncommon that it's actually better to keep the information more
> difficult to get for those of us who know something about our systems.
>
> That's fine, I guess we just have to disagree here. Thanks for looking at this.

I think you should just step back and think more about what exactly
you expect from the counter(s). I believe what you want is an
estimate of the freeable memory when the particular process dies or is
killed. That would mean resident single-mapped private anonymous memory
+ unlinked single-mapped shareable mappings + single-mapped swapped-out
memory. Maybe I've missed something, but it should be something along
those lines. Definitely something that the current smaps infrastructure
doesn't give you, though.
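
Or, spelled out as a rough formula (where "single-mapped" means a page or
swap entry with only one mapping, i.e. nothing else keeps it alive):

	freeable(process) ~= resident single-mapped private anonymous memory
	                   + single-mapped pages of unlinked shareable mappings
	                   + single-mapped swapped-out memory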

Sonny Rao
Sep 13, 2016, 4:30:06 PM
Yes, your description of what we want is pretty good. Having a
reasonable lower bound on the estimate is fine, though we probably
want to break out swapped out memory separately. Given that smaps
doesn't provide this in a straightforward way, what do you think is
the right way to provide this information?

Michal Hocko
Sep 14, 2016, 5:20:05 AM
Why would you want to separate that?

> Given that smaps
> doesn't provide this in a straightforward way, what do you think is
> the right way to provide this information?

I would be tempted to sneak it into /proc/<pid>/statm because that looks
like a proper place, but getting this information is not free
performance-wise, so I am not really sure whether something that relies on
this file would see unexpected stalls. Maybe this could be worked around by
some caching... I would suggest checking who is actually using this file
(top/ps etc...)

If this turns out to be unacceptable, then a new file could be considered.

Robert Foss
Sep 19, 2016, 11:20:06 AM
What would this caching look like? Can any information be re-used
between vma walks?

Michal Hocko
Sep 19, 2016, 3:40:05 PM
On Mon 19-09-16 11:16:31, Robert Foss wrote:
> On 2016-09-14 05:12 AM, Michal Hocko wrote:
> > On Tue 13-09-16 13:27:39, Sonny Rao wrote:
[...]
> > > Given that smaps
> > > doesn't provide this in a straightforward way, what do you think is
> > > the right way to provide this information?
> >
> > I would be tempted to sneak it into /proc/<pid>/statm because that looks
> > like a proper place, but getting this information is not free
> > performance-wise, so I am not really sure whether something that relies on
> > this file would see unexpected stalls. Maybe this could be worked around by
> > some caching... I would suggest checking who is actually using this file
> > (top/ps etc...)
>
> What would this caching look like? Can any information be re-used between
> vma walks?

Yes, basically return the same value if called again within HZ or something
similar. But that assumes that statm latency really matters and that it is
called often enough.
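
As a rough illustration of that idea only - all names here are
hypothetical, and a real version would keep the cache per-mm rather than
in function-local statics:

static int totmaps_show_cached(struct seq_file *m, struct mm_struct *mm)
{
	static DEFINE_MUTEX(totmaps_cache_lock);
	static unsigned long totmaps_cache_stamp;
	static struct mem_size_stats totmaps_cache;

	mutex_lock(&totmaps_cache_lock);
	if (!totmaps_cache_stamp ||
	    time_after(jiffies, totmaps_cache_stamp + HZ)) {
		/* the expensive part: a full page table walk, as smaps does */
		compute_totals(mm, &totmaps_cache);
		totmaps_cache_stamp = jiffies;
	}
	/* serve the (possibly slightly stale) cached copy */
	print_totals(m, &totmaps_cache);
	mutex_unlock(&totmaps_cache_lock);
	return 0;
}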

Jann Horn
Sep 19, 2016, 4:00:17 PM
On Mon, Sep 19, 2016 at 09:51:13PM +0200, Michal Hocko wrote:
> [not sure why the CC list was trimmed - do not do that please unless you
> have a strong reason for that - if this was not intentional please
> restore it]

Ah, sorry, pressed the wrong key.


> On Mon 19-09-16 21:40:01, Jann Horn wrote:
> > That sounds horrible. If some application decides that they want to check
> > statm directly after some action or so (like after program startup), this is
> > going to give them a very bad time. That probably doesn't happen
> > often - but still.
> >
> > I can already imagine some developer going "yeah, that usleep()... that's
> > because the kernel API returns stale information for a couple milliseconds
> > after we do something *shrug*".
> >
> > What are you trying to optimize for? Ten users on the same machine, each of
> > which is running "top" because it looks so great?
>
> Please try to read what I wrote again. I didn't say this would be
> needed. The idea was that _if_ /proc/<pid>/statm is used very _often_
> than some caching might help to reduce the overhead. Especially when you
> consider that the information is not precise anyway. It can change
> anytime while you are doing the address space walk.

Sonny Rao
Sep 19, 2016, 4:20:06 PM
Just thinking out loud here -- I haven't looked closely at the code so
please bear with me :-)

Instead of checking when the last read was and returning old data,
what about a scheme where we still have a timestamp for the last stat
read, and any changes to that address space invalidate the timestamp?

The invalidation could be racy because we're not too concerned about
immediate accuracy -- so just a write. The main issue I could see
with this is that it could cause the cacheline holding this timestamp
to bounce around a lot? Maybe there's an existing solution in the
page table locking that could be leveraged here to at least maintain
whatever scalability enhancements are present for this type of
situation where there are many updates happening in parallel.
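
A bare-bones sketch of that scheme, with made-up names (stats_gen would be
a new field in struct mm_struct, and the invalidation would be called from
the mmap/munmap/fault paths):

/* writer side: bump the generation whenever the address space changes */
static inline void mm_stats_invalidate(struct mm_struct *mm)
{
	WRITE_ONCE(mm->stats_gen, READ_ONCE(mm->stats_gen) + 1);
}

/* reader side: reuse a cached summary only if nothing changed meanwhile */
static inline bool mm_stats_cache_valid(struct mm_struct *mm,
					unsigned long cached_gen)
{
	return READ_ONCE(mm->stats_gen) == cached_gen;	/* racy by design */
}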

Robert Foss
Sep 19, 2016, 8:30:05 PM
Any single application querying more often than HZ would presumably do
so for accuracy reasons.
However, for multiple applications that combined query more often than
HZ, this would most definitely be helpful in terms of performance.

@Sonny, does chromiumos fall into the first or second category?

Sonny Rao
Sep 19, 2016, 8:40:06 PM
It's a single application -- and it definitely doesn't query at HZ --
especially given how long it takes to gather the data :-)

Michal Hocko
Sep 30, 2016, 5:50:05 AM
[CC Mike and Mel as they have seen some accounting oddities
when doing performance testing. They can share details but
essentially the system time just gets too high]

For your reference, the email thread started here:
http://lkml.kernel.org/r/20160823143...@dhcp22.suse.cz

I suspect this is mainly for short-lived processes - like a kernel compile
$ /usr/bin/time -v make mm/mmap.o
[...]
User time (seconds): 0.45
System time (seconds): 0.82
Percent of CPU this job got: 111%
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:01.14
$ rm mm/mmap.o
$ /usr/bin/time -v make mm/mmap.o
[...]
User time (seconds): 0.47
System time (seconds): 1.55
Percent of CPU this job got: 107%
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:01.88

This is quite unexpected for a cache-hot compile. I would expect most of
the time to be spent in userspace.

$ perf report | grep kernel.vmlinux
2.01% as [kernel.vmlinux] [k] page_fault
0.59% cc1 [kernel.vmlinux] [k] page_fault
0.15% git [kernel.vmlinux] [k] page_fault
0.12% bash [kernel.vmlinux] [k] page_fault
0.11% sh [kernel.vmlinux] [k] page_fault
0.08% gcc [kernel.vmlinux] [k] page_fault
0.06% make [kernel.vmlinux] [k] page_fault
0.04% rm [kernel.vmlinux] [k] page_fault
0.03% ld [kernel.vmlinux] [k] page_fault
0.02% bash [kernel.vmlinux] [k] entry_SYSCALL_64
0.01% git [kernel.vmlinux] [k] entry_SYSCALL_64
0.01% cat [kernel.vmlinux] [k] page_fault
0.01% collect2 [kernel.vmlinux] [k] page_fault
0.00% sh [kernel.vmlinux] [k] entry_SYSCALL_64
0.00% rm [kernel.vmlinux] [k] entry_SYSCALL_64
0.00% grep [kernel.vmlinux] [k] page_fault

doesn't show anything unexpected.

Rik's original reply follows:

On Tue 23-08-16 17:46:11, Rik van Riel wrote:
> On Tue, 2016-08-23 at 16:33 +0200, Michal Hocko wrote:
[...]
> > OK, so it seems I found it. I was quite lucky because
> > account_user_time
> > is not all that popular a function and there were basically no changes
> > besides Rik's ff9a9b4c4334 ("sched, time: Switch
> > VIRT_CPU_ACCOUNTING_GEN
> > to jiffy granularity") and that seems to cause the regression.
> > Reverting
> > the commit on top of the current mmotm seems to fix the issue for me.
> >
> > And just to give Rik more context. While debugging overhead of the
> > /proc/<pid>/smaps I am getting a misleading output from /usr/bin/time
> > -v
> > (source for ./max_mmap is [1])
> >
> > root@test1:~# uname -r
> > 4.5.0-rc6-bisect1-00025-gff9a9b4c4334
> > root@test1:~# ./max_map 
> > pid:2990 maps:65515
> > root@test1:~# /usr/bin/time -v awk '/^Rss/{rss+=$2} /^Pss/{pss+=$2}
> > END {printf "rss:%d pss:%d\n", rss, pss}' /proc/2990/smaps
> > rss:263368 pss:262203
> >         Command being timed: "awk /^Rss/{rss+=$2} /^Pss/{pss+=$2} END
> > {printf "rss:%d pss:%d\n", rss, pss} /proc/2990/smaps"
> >         User time (seconds): 0.00
> >         System time (seconds): 0.45
> >         Percent of CPU this job got: 98%
> >
>
> > root@test1:~# /usr/bin/time -v awk '/^Rss/{rss+=$2} /^Pss/{pss+=$2}
> > END {printf "rss:%d pss:%d\n", rss, pss}' /proc/3015/smaps
> > rss:263316 pss:262199
> >         Command being timed: "awk /^Rss/{rss+=$2} /^Pss/{pss+=$2} END
> > {printf "rss:%d pss:%d\n", rss, pss} /proc/3015/smaps"
> >         User time (seconds): 0.18
> >         System time (seconds): 0.29
> >         Percent of CPU this job got: 97%
>
> The patch in question makes user and system
> time accounting essentially tick-based. If
> jiffies changes while the task is in user
> mode, time gets accounted as user time, if
> jiffies changes while the task is in system
> mode, time gets accounted as system time.
>
> If you get "unlucky", with a job like the
> above, it is possible all time gets accounted
> to system time.
>
> This would be true both with the system running
> with a periodic timer tick (before and after my
> patch is applied), and in nohz_idle mode (after
> my patch).
>
> However, it does seem quite unlikely that you
> get zero user time, since you have 125 timer
> ticks in half a second. Furthermore, you do not
> even have NO_HZ_FULL enabled...
>
> Does the workload consistently get zero user
> time?
>
> If so, we need to dig further to see under
> what precise circumstances that happens.
>
> On my laptop, with kernel 4.6.3-300.fc24.x86_64
> I get this:
>
> $ /usr/bin/time -v awk '/^Rss/{rss+=$2} /^Pss/{pss+=$2} END {printf
> "rss:%d pss:%d\n", rss, pss}' /proc/19825/smaps
> rss:263368 pss:262145
> Command being timed: "awk /^Rss/{rss+=$2} /^Pss/{pss+=$2} END
> {printf "rss:%d pss:%d\n", rss, pss} /proc/19825/smaps"
> User time (seconds): 0.64
> System time (seconds): 0.19
> Percent of CPU this job got: 99%
> Elapsed (wall clock) time (h:mm:ss or m:ss): 0:00.83
>
> The main difference between your and my
> NO_HZ config seems to be that NO_HZ_FULL
> is set here. However, it is not enabled
> at run time, so both of our systems
> should only really get NO_HZ_IDLE
> effectively.
>
> Running tasks should get sampled with the
> regular timer tick, while they are running.
>
> In other words, vtime accounting should be
> disabled in both of our tests, for everything
> except the idle task.
>
> Do I need to do anything special to reproduce
> your bug, besides running the max mmap program
> and the awk script?

Rik van Riel
Sep 30, 2016, 8:40:05 AM
On Fri, 2016-09-30 at 11:49 +0200, Michal Hocko wrote:
> [CC Mike and Mel as they have seen some accounting oddities
>  when doing performance testing. They can share details but
>  essentially the system time just gets too high]
>
> For your reference the email thread started
> http://lkml.kernel.org/r/20160823143...@dhcp22.suse.cz
>
> I suspect this is mainly for short lived processes - like kernel
> compile
> $ /usr/bin/time -v make mm/mmap.o
> [...]
>         User time (seconds): 0.45
>         System time (seconds): 0.82
>         Percent of CPU this job got: 111%
>         Elapsed (wall clock) time (h:mm:ss or m:ss): 0:01.14
> $ rm mm/mmap.o
> $ /usr/bin/time -v make mm/mmap.o
> [...]
>         User time (seconds): 0.47
>         System time (seconds): 1.55
>         Percent of CPU this job got: 107%
>         Elapsed (wall clock) time (h:mm:ss or m:ss): 0:01.88

I was not able to get the "expected" results from
your last reproducer, but this one does happen on
my system, too.

The bad news is, I still have no clue what is causing
it...

--
All Rights Reversed.