
[PATCH v1 0/3] per-process reclaim


Minchan Kim

Jun 13, 2016, 4:00:05 AM
Hi all,

http://thread.gmane.org/gmane.linux.kernel/1480728

I sent the per-process reclaim patchset three years ago. The last
feedback from akpm was that he wanted to see a real usecase scenario.

Since then, I have gotten questions from embedded people at various
companies asking why it was not merged into mainline, and heard they
have been using the feature as an in-house patch; recently, I noticed
Android on Qualcomm started to use it.

Of course, our product has used it and shipped it in a real product.

Quote from Sangwoo Park <angwoo...@lge.com>
Thanks for the data, Sangwoo!
"
- Test scenario
  - platform: android
  - target: MSM8952, 2G DDR, 16G eMMC
  - scenario
    retry app launch and Back Home with 16 apps and 16 turns
    (total app launch count is 256)
- result:
                  | resume count | cold launching count
  ----------------+--------------+---------------------
  vanilla         |           85 |                  171
  perproc reclaim |          184 |                   72
"

A higher resume count is better, because cold launching needs to load
lots of resource data, which takes 15 ~ 20 seconds for some games,
while a successful resume takes just 1 ~ 5 seconds.

With per-process reclaim and a new management policy, we could reduce
cold launches a lot (i.e., from 171 to 72), which reduces app startup
time a lot.

Another useful aspect of this feature is that it makes swapout easy to
trigger, which is useful for testing swapout stress and workloads.

Thanks.

Cc: Redmond <u934...@gmail.com>
Cc: ZhaoJunmin Zhao(Junmin) <zhaoj...@huawei.com>
Cc: Vinayak Menon <vinm...@codeaurora.org>
Cc: Juneho Choi <juno...@lge.com>
Cc: Sangwoo Park <sangwo...@lge.com>
Cc: Chan Gyun Jeong <chan....@lge.com>

Minchan Kim (3):
mm: vmscan: refactoring force_reclaim
mm: vmscan: shrink_page_list with multiple zones
mm: per-process reclaim

Documentation/filesystems/proc.txt | 15 ++++
fs/proc/base.c | 1 +
fs/proc/internal.h | 1 +
fs/proc/task_mmu.c | 149 +++++++++++++++++++++++++++++++++++++
include/linux/rmap.h | 4 +
mm/vmscan.c | 85 ++++++++++++++++-----
6 files changed, 235 insertions(+), 20 deletions(-)

--
1.9.1

Minchan Kim

Jun 13, 2016, 4:00:06 AM
These days, there are many platforms available in the embedded market,
and sometimes they have more hints about the working set than the
kernel, so they want to be involved in memory management more heavily,
like Android's low memory killer and ashmem, or a user daemon with a
low memory notifier.

This patch adds a new method for userspace to manage memory efficiently
via the knob /proc/<pid>/reclaim, so the platform can reclaim any
process at any time.

One useful usecase is to avoid killing processes to get free memory on
Android, which was a really terrible experience: I lost my best-ever
game score after switching to a phone call while playing, on top of the
slow start-up caused by cold launching.

Our product has shipped with it.

Quote from Sangwoo Park <angwoo...@lge.com>
Thanks for the data, Sangwoo!
"
- Test scenario
  - platform: android
  - target: MSM8952, 2G DDR, 16G eMMC
  - scenario
    retry app launch and Back Home with 16 apps and 16 turns
    (total app launch count is 256)
- result:
                  | resume count | cold launching count
  ----------------+--------------+---------------------
  vanilla         |           85 |                  171
  perproc reclaim |          184 |                   72
"

A higher resume count is better, because cold launching needs to load
lots of resource data, which takes 15 ~ 20 seconds for some games,
while a successful resume takes just 1 ~ 5 seconds.

With per-process reclaim and a new management policy, we could reduce
cold launches a lot (i.e., from 171 to 72), which reduces app startup
time a lot.

Another useful aspect of this feature is that it makes swapout easy to
trigger, which is useful for testing swapout stress and workloads.

Interface:

Reclaim file-backed pages only.
echo 1 > /proc/<pid>/reclaim
Reclaim anonymous pages only.
echo 2 > /proc/<pid>/reclaim
Reclaim all pages
echo 3 > /proc/<pid>/reclaim

bit 1 : file, bit 2 : anon, bit 1 & 2 : all

Note:
If a page is shared by other processes (i.e., page_mapcount(page) > 1),
it cannot be reclaimed.
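
A minimal shell sketch of driving this knob (the helper is hypothetical;
it assumes a kernel carrying this patch and skips the write when the
knob is absent):

```shell
#!/bin/sh
# Bit encoding of the reclaim knob: bit 1 = file, bit 2 = anon.
RECLAIM_FILE=1
RECLAIM_ANON=2
RECLAIM_ALL=$((RECLAIM_FILE | RECLAIM_ANON))   # 3

# Hypothetical helper: write a reclaim type to a pid's knob, if present.
reclaim() {
    pid=$1; type=$2
    knob="/proc/$pid/reclaim"
    if [ -w "$knob" ]; then
        echo "$type" > "$knob"
    fi
}

# Example: push this shell's anon pages to swap; on a patched kernel,
# VmSwap in /proc/<pid>/status should grow afterwards.
reclaim $$ "$RECLAIM_ANON"
grep VmSwap "/proc/$$/status" 2>/dev/null || true
```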

Cc: Sangwoo Park <sangwo...@lge.com>
Signed-off-by: Minchan Kim <min...@kernel.org>
---
Documentation/filesystems/proc.txt | 15 ++++
fs/proc/base.c | 1 +
fs/proc/internal.h | 1 +
fs/proc/task_mmu.c | 149 +++++++++++++++++++++++++++++++++++++
include/linux/rmap.h | 4 +
mm/vmscan.c | 40 ++++++++++
6 files changed, 210 insertions(+)

diff --git a/Documentation/filesystems/proc.txt b/Documentation/filesystems/proc.txt
index 50fcf48f4d58..3b6adf370f3c 100644
--- a/Documentation/filesystems/proc.txt
+++ b/Documentation/filesystems/proc.txt
@@ -138,6 +138,7 @@ Table 1-1: Process specific entries in /proc
maps Memory maps to executables and library files (2.4)
mem Memory held by this process
root Link to the root directory of this process
+ reclaim Reclaim pages in this process
stat Process status
statm Process memory status information
status Process status in human readable form
@@ -536,6 +537,20 @@ To reset the peak resident set size ("high water mark") to the process's

Any other value written to /proc/PID/clear_refs will have no effect.

+The file /proc/PID/reclaim is used to reclaim pages in this process.
+bit 1: file, bit 2: anon, bit 1 & 2: all
+
+To reclaim file-backed pages,
+ > echo 1 > /proc/PID/reclaim
+
+To reclaim anonymous pages,
+ > echo 2 > /proc/PID/reclaim
+
+To reclaim all pages,
+ > echo 3 > /proc/PID/reclaim
+
+If a page is shared by several processes, it cannot be reclaimed.
+
The /proc/pid/pagemap gives the PFN, which can be used to find the pageflags
using /proc/kpageflags and number of times a page is mapped using
/proc/kpagecount. For detailed explanation, see Documentation/vm/pagemap.txt.
diff --git a/fs/proc/base.c b/fs/proc/base.c
index 93e7754fd5b2..b957d929516d 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -2848,6 +2848,7 @@ static const struct pid_entry tgid_base_stuff[] = {
REG("mounts", S_IRUGO, proc_mounts_operations),
REG("mountinfo", S_IRUGO, proc_mountinfo_operations),
REG("mountstats", S_IRUSR, proc_mountstats_operations),
+ REG("reclaim", S_IWUSR, proc_reclaim_operations),
#ifdef CONFIG_PROC_PAGE_MONITOR
REG("clear_refs", S_IWUSR, proc_clear_refs_operations),
REG("smaps", S_IRUGO, proc_pid_smaps_operations),
diff --git a/fs/proc/internal.h b/fs/proc/internal.h
index aa2781095bd1..ef2b01533c97 100644
--- a/fs/proc/internal.h
+++ b/fs/proc/internal.h
@@ -209,6 +209,7 @@ struct pde_opener {
extern const struct inode_operations proc_link_inode_operations;

extern const struct inode_operations proc_pid_link_inode_operations;
+extern const struct file_operations proc_reclaim_operations;

extern void proc_init_inodecache(void);
extern struct inode *proc_get_inode(struct super_block *, struct proc_dir_entry *);
diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 187d84ef9de9..31e4657f8fe9 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -11,6 +11,7 @@
#include <linux/mempolicy.h>
#include <linux/rmap.h>
#include <linux/swap.h>
+#include <linux/mm_inline.h>
#include <linux/swapops.h>
#include <linux/mmu_notifier.h>
#include <linux/page_idle.h>
@@ -1465,6 +1466,154 @@ const struct file_operations proc_pagemap_operations = {
};
#endif /* CONFIG_PROC_PAGE_MONITOR */

+static int reclaim_pte_range(pmd_t *pmd, unsigned long addr,
+ unsigned long end, struct mm_walk *walk)
+{
+ struct mm_struct *mm = walk->mm;
+ struct vm_area_struct *vma = walk->private;
+ pte_t *orig_pte, *pte, ptent;
+ spinlock_t *ptl;
+ struct page *page;
+ LIST_HEAD(page_list);
+ int isolated = 0;
+
+ split_huge_pmd(vma, pmd, addr);
+ if (pmd_trans_unstable(pmd))
+ return 0;
+
+ orig_pte = pte = pte_offset_map_lock(mm, pmd, addr, &ptl);
+ for (; addr != end; pte++, addr += PAGE_SIZE) {
+ ptent = *pte;
+
+ if (!pte_present(ptent))
+ continue;
+
+ page = vm_normal_page(vma, addr, ptent);
+ if (!page)
+ continue;
+
+ if (page_mapcount(page) != 1)
+ continue;
+
+ if (PageTransCompound(page)) {
+ get_page(page);
+ if (!trylock_page(page)) {
+ put_page(page);
+ goto out;
+ }
+ pte_unmap_unlock(orig_pte, ptl);
+
+ if (split_huge_page(page)) {
+ unlock_page(page);
+ put_page(page);
+ orig_pte = pte_offset_map_lock(mm, pmd,
+ addr, &ptl);
+ goto out;
+ }
+ unlock_page(page);
+ put_page(page);
+ pte = pte_offset_map_lock(mm, pmd, addr, &ptl);
+ pte--;
+ addr -= PAGE_SIZE;
+ continue;
+ }
+
+ VM_BUG_ON_PAGE(PageTransCompound(page), page);
+
+ if (isolate_lru_page(page))
+ continue;
+
+ list_add(&page->lru, &page_list);
+ inc_zone_page_state(page, NR_ISOLATED_ANON +
+ page_is_file_cache(page));
+ isolated++;
+ if (isolated >= SWAP_CLUSTER_MAX) {
+ pte_unmap_unlock(orig_pte, ptl);
+ reclaim_pages_from_list(&page_list);
+ isolated = 0;
+ cond_resched();
+ orig_pte = pte_offset_map_lock(mm, pmd, addr, &ptl);
+ }
+ }
+
+out:
+ pte_unmap_unlock(orig_pte, ptl);
+ reclaim_pages_from_list(&page_list);
+
+ cond_resched();
+ return 0;
+}
+
+enum reclaim_type {
+ RECLAIM_FILE = 1,
+ RECLAIM_ANON,
+ RECLAIM_ALL,
+};
+
+static ssize_t reclaim_write(struct file *file, const char __user *buf,
+ size_t count, loff_t *ppos)
+{
+ struct task_struct *task;
+ char buffer[PROC_NUMBUF];
+ struct mm_struct *mm;
+ struct vm_area_struct *vma;
+ int itype;
+ int rv;
+ enum reclaim_type type;
+
+ memset(buffer, 0, sizeof(buffer));
+ if (count > sizeof(buffer) - 1)
+ count = sizeof(buffer) - 1;
+ if (copy_from_user(buffer, buf, count))
+ return -EFAULT;
+ rv = kstrtoint(strstrip(buffer), 10, &itype);
+ if (rv < 0)
+ return rv;
+ type = (enum reclaim_type)itype;
+ if (type < RECLAIM_FILE || type > RECLAIM_ALL)
+ return -EINVAL;
+
+ task = get_proc_task(file->f_path.dentry->d_inode);
+ if (!task)
+ return -ESRCH;
+
+ mm = get_task_mm(task);
+ if (mm) {
+ struct mm_walk reclaim_walk = {
+ .pmd_entry = reclaim_pte_range,
+ .mm = mm,
+ };
+
+ down_read(&mm->mmap_sem);
+ for (vma = mm->mmap; vma; vma = vma->vm_next) {
+ reclaim_walk.private = vma;
+
+ if (is_vm_hugetlb_page(vma))
+ continue;
+
+ if (!vma_is_anonymous(vma) && !(type & RECLAIM_FILE))
+ continue;
+
+ if (vma_is_anonymous(vma) && !(type & RECLAIM_ANON))
+ continue;
+
+ walk_page_range(vma->vm_start, vma->vm_end,
+ &reclaim_walk);
+ }
+ flush_tlb_mm(mm);
+ up_read(&mm->mmap_sem);
+ mmput(mm);
+ }
+ put_task_struct(task);
+
+ return count;
+}
+
+const struct file_operations proc_reclaim_operations = {
+ .write = reclaim_write,
+ .llseek = noop_llseek,
+};
+
#ifdef CONFIG_NUMA

struct numa_maps {
diff --git a/include/linux/rmap.h b/include/linux/rmap.h
index 5704f101b52e..e90a21b78da3 100644
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -10,6 +10,10 @@
#include <linux/rwsem.h>
#include <linux/memcontrol.h>

+extern int isolate_lru_page(struct page *page);
+extern void putback_lru_page(struct page *page);
+extern unsigned long reclaim_pages_from_list(struct list_head *page_list);
+
/*
* The anon_vma heads a list of private "related" vmas, to scan if
* an anonymous page pointing to this anon_vma needs to be unmapped:
diff --git a/mm/vmscan.c b/mm/vmscan.c
index d20c9e863d35..442866f77251 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1212,6 +1212,13 @@ static unsigned long shrink_page_list(struct list_head *page_list,
* appear not as the counts should be low
*/
list_add(&page->lru, &free_pages);
+ /*
+ * If pagelist are from multiple zones, we should decrease
+ * NR_ISOLATED_ANON + x on freed pages in here.
+ */
+ if (!zone)
+ dec_zone_page_state(page, NR_ISOLATED_ANON +
+ page_is_file_cache(page));
continue;

cull_mlocked:
@@ -1280,6 +1287,39 @@ unsigned long reclaim_clean_pages_from_list(struct zone *zone,
return ret;
}

+unsigned long reclaim_pages_from_list(struct list_head *page_list)
+{
+ struct scan_control sc = {
+ .gfp_mask = GFP_KERNEL,
+ .priority = DEF_PRIORITY,
+ .may_writepage = 1,
+ .may_unmap = 1,
+ .may_swap = 1,
+ .force_reclaim = 1,
+ };
+
+ unsigned long nr_reclaimed, dummy1, dummy2, dummy3, dummy4, dummy5;
+ struct page *page;
+
+ list_for_each_entry(page, page_list, lru)
+ ClearPageActive(page);
+
+ nr_reclaimed = shrink_page_list(page_list, &sc,
+ TTU_UNMAP|TTU_IGNORE_ACCESS,
+ &dummy1, &dummy2, &dummy3,
+ &dummy4, &dummy5);
+
+ while (!list_empty(page_list)) {
+ page = lru_to_page(page_list);
+ list_del(&page->lru);
+ dec_zone_page_state(page, NR_ISOLATED_ANON +
+ page_is_file_cache(page));
+ putback_lru_page(page);
+ }
+
+ return nr_reclaimed;
+}
+
/*
* Attempt to remove the specified page from its LRU. Only take this page
* if it is of the appropriate PageActive status. Pages which are being
--
1.9.1

Hillf Danton

Jun 13, 2016, 6:10:05 AM
Check fatal signal after reclaiming a mapping?

Chen Feng

Jun 13, 2016, 8:00:07 AM
Hi Minchan,

Yes, this is a useful interface for when there is memory pressure,
letting userspace (Android) pick processes to reclaim. We have also
taken this series into our platform.

But I have a question on reducing app startup time. Can you also share
your theory (management policy) on how the app can reduce its startup
time?

ZhaoJunmin Zhao(Junmin)

Jun 13, 2016, 8:30:06 AM
Yes, in Huawei devices we use the interface now. According to the
process LRU state in ActivityManagerService, we can reclaim some
processes proactively.


Vinayak Menon

Jun 13, 2016, 9:40:08 AM
Thanks Minchan for bringing this up. When we tried the earlier patchset
in its original form, the resume of the app that was reclaimed took a
lot of time. But from the data shown above it looks to be improving the
resume time. Is that the resume time of "other" apps, which were able
to retain their working set because of the more efficient swapping of
low-priority apps with per-process reclaim?

Because of the higher resume time, we had to modify the logic a bit and
devise a way to pick a "set" of low-priority (oom_score_adj) tasks and
reclaim a certain number of pages (only anon) from each of them, the
number of pages reclaimed from each task being proportional to task
size. This deviates from the original intention of the patch, rescuing
a particular app of interest, but still uses the hints on the working
set provided by userspace while avoiding high resume stalls. The
increased swapping helped maintain a better memory state and less page
cache reclaim, resulting in better app resume times and fewer task
kills.

So would it be better if a userspace knob were provided to tell the
kernel the max number of pages to be reclaimed from a task? This way
userspace can make calculations depending on priority, task size, etc.,
and reclaim the required number of pages from each task, thus avoiding
the resume stall caused by reclaiming an entire task.

And also, would it be possible to implement the same thing using
per-task memcgs, by setting the limits and swappiness in such a way
that it gives the same result per-process reclaim does?

Thanks,
Vinayak

Johannes Weiner

Jun 13, 2016, 11:10:06 AM
Hi Minchan,

On Mon, Jun 13, 2016 at 04:50:58PM +0900, Minchan Kim wrote:
> These days, there are many platforms available in the embedded
> market, and sometimes they have more hints about the working set than
> the kernel, so they want to be involved in memory management more
> heavily, like Android's low memory killer and ashmem, or a user
> daemon with a low memory notifier.
>
> This patch adds a new method for userspace to manage memory
> efficiently via the knob /proc/<pid>/reclaim, so the platform can
> reclaim any process at any time.

Cgroups are our canonical way to control system resources on a per
process or group-of-processes level. I don't like the idea of adding
ad-hoc interfaces for single-use cases like this.

For this particular case, you can already stick each app into its own
cgroup and use memory.force_empty to target-reclaim them.

Or better yet, set the soft limits / memory.low to guide physical
memory pressure, once it actually occurs, toward the least-important
apps? We usually prefer doing work on-demand rather than proactively.

The one-cgroup-per-app model would give Android much more control and
would also remove a *lot* of overhead during task switches, see this:
https://lkml.org/lkml/2014/12/19/358
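
For concreteness, a sketch of that cgroup-v1 route (the group name is
made up; real use needs root and a mounted memory controller, so the
helper is parameterized on the hierarchy root):

```shell
#!/bin/sh
# Create a per-app memory cgroup, move a task into it, and target-reclaim
# its charges with memory.force_empty.
app_reclaim() {
    root=$1; app=$2; pid=$3
    grp="$root/$app"
    mkdir -p "$grp" || return 1
    echo "$pid" > "$grp/cgroup.procs"      # future charges land here
    echo 0 > "$grp/memory.force_empty"     # reclaim the group's memory
}

# Real invocation (root, cgroup v1 memory controller mounted):
#   app_reclaim /sys/fs/cgroup/memory app.example 1234
```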

Rik van Riel

Jun 13, 2016, 1:10:05 PM
On Mon, 2016-06-13 at 16:50 +0900, Minchan Kim wrote:
> These days, there are many platforms available in the embedded
> market, and sometimes they have more hints about the working set than
> the kernel, so they want to be involved in memory management more
> heavily, like Android's low memory killer and ashmem, or a user
> daemon with a low memory notifier.
>
> This patch adds a new method for userspace to manage memory
> efficiently via the knob /proc/<pid>/reclaim, so the platform can
> reclaim any process at any time.
> any process anytime.
>

Could it make sense to invoke this automatically,
perhaps from the Android low memory killer code?

--
All Rights Reversed.


Minchan Kim

Jun 14, 2016, 8:50:05 PM
Yep, we might need it in the page walker.

Thanks.

Minchan Kim

Jun 14, 2016, 8:50:05 PM
Hi Johannes,

I didn't notice that. Thanks for the pointer.
I read the thread you pointed out and read the memcg code.

At first I thought the one-cgroup-per-app model was an abuse of memcg,
but now I feel your suggestion makes sense: it's the right direction
for controlling memory from userspace. My only concern is that I'm not
sure how smoothly we can map the memory management model from global
memory pressure to a per-app pressure model.

One question: it seems cgroup2 doesn't have per-cgroup swappiness.
Why?

I think we need it in the one-cgroup-per-app model.

Minchan Kim

Jun 14, 2016, 8:50:05 PM
Hi Chen,

What I meant about start-up time is as follows.

If an app is killed, it has to launch from the start, so if it was a
game app, it has to load lots of resource files, which takes a long
time. However, if the game was not killed, we can resume it without a
cold start, so startup is very fast.

Sorry for the confusion.

Minchan Kim

Jun 14, 2016, 9:00:10 PM
Sorry for the confusion. I meant the app has to start from scratch if
it was killed, which might mean loading hundreds of megabytes, while a
resume only needs to load the working set memory, which is smaller.

> Because of the higher resume time, we had to modify the logic a bit
> and devise a way to pick a "set" of low-priority (oom_score_adj)
> tasks and reclaim a certain number of pages (only anon) from each of
> them (the number of pages reclaimed from each task being proportional
> to task size). This deviates from the original intention of the patch
> to rescue a particular app of interest, but still uses the hints on
> the working set provided by userspace and avoids high resume stalls.
> The increased swapping helped maintain a better memory state and less
> page cache reclaim, resulting in better app resume time and fewer
> task kills.

Fair enough.

>
> So would it be better if a userspace knob is provided to tell the kernel, the max number of pages to be reclaimed from a task ?
> This way userspace can make calculations depending on priority, task size etc and reclaim the required number of pages from
> each task, and thus avoid the resume stall because of reclaiming an entire task.
>
> And also, would it be possible to implement the same using per task memcg by setting the limits and swappiness in such a
> way that it results in the same thing that per-process reclaim does ?

Yep, I read Johannes's thread suggesting the one-cgroup-per-app model.
It does make sense to me. It is worth trying, although I guess it's
not easy to control memory usage on demand rather than proactively.
If we can, maybe we don't need the per-process reclaim policy, which
is a rather coarse-grained reclaim model.
However, a concern with the one-cgroup-per-app model is that the LRU
list of a cgroup is much smaller, so how well will LRU aging work, and
the LRU churning (e.g., compaction) effect would be more severe than
before.

I guess codeaurora tried the memcg model for Android.
Could you share if you know something?

Thanks.



Minchan Kim

Jun 14, 2016, 9:10:05 PM
It's doable. In fact, it was the first internal implementation in our
product. However, I wanted to use it on platforms which don't have a
lowmemory killer. :)

Vinayak Menon

Jun 16, 2016, 12:30:05 AM

On 6/15/2016 6:27 AM, Minchan Kim wrote:
>
> Yep, I read Johannes's thread suggesting the one-cgroup-per-app
> model. It does make sense to me. It is worth trying, although I
> guess it's not easy to control memory usage on demand rather than
> proactively. If we can, maybe we don't need the per-process reclaim
> policy, which is a rather coarse-grained reclaim model.
> However, a concern with the one-cgroup-per-app model is that the LRU
> list of a cgroup is much smaller, so how well will LRU aging work,
> and the LRU churning (e.g., compaction) effect would be more severe
> than before.
And I was thinking what vmpressure would mean, and how to use it, when the cgroup is per task.
>
> I guess codeaurora tried memcg model for android.
> Could you share if you know something?
>
We tried, but had issues with charge migration, and then Johannes suggested per-task cgroups.
That hasn't been tried yet.

Thanks

Michal Hocko

Jun 16, 2016, 7:10:05 AM
On Wed 15-06-16 09:40:27, Minchan Kim wrote:
[...]
> A question is it seems cgroup2 doesn't have per-cgroup swappiness.
> Why?

There was no strong use case for it AFAICT.

> I think we need it in one-cgroup-per-app model.

I wouldn't be opposed if it is really needed.
--
Michal Hocko
SUSE Labs

Johannes Weiner

Jun 16, 2016, 10:50:05 AM
On Wed, Jun 15, 2016 at 09:40:27AM +0900, Minchan Kim wrote:
> A question is it seems cgroup2 doesn't have per-cgroup swappiness.
> Why?
>
> I think we need it in one-cgroup-per-app model.

Can you explain why you think that?

As we have talked about this recently in the LRU balancing thread,
swappiness is the cost factor between file IO and swapping, so the
only situation I can imagine you'd need a memcg swappiness setting is
when you have different cgroups use different storage devices that do
not have comparable speeds.

So I'm not sure I understand the relationship to an app-group model.

Minchan Kim

Jun 17, 2016, 2:50:06 AM
Hi Hannes,

Sorry for the lack of information. I should have been clearer.
In fact, what we need is *per-memcg swap devices*.

What I want is to avoid killing background applications when memory
overflows, because cold launching an app takes a very long time
compared to resuming it (i.e., just switching). I also want to keep an
amount of free pages in memory so that new application startup does
not get stuck behind reclaim activities.

To get free memory, I want to reclaim a less important app rather than
killing it. At that point, we could support two swap devices.

One is zram; the other is slow storage, but much bigger than the zram
size. Then we could use the storage swap to reclaim pages from
not-important apps, while using the zram swap for important apps
(e.g., foreground app, system services, daemons and so on).

IOW, we want to support multiple swap devices with one-cgroup-per-app,
where the storage speeds are totally different.
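
For context, what mainline already has is a global priority order
between swap devices, not a per-memcg binding; an fstab sketch of that
(device names are examples) fills zram first and falls back to storage:

```
# Two swap devices; the higher pri= value is used first.
/dev/zram0   none   swap   defaults,pri=100   0 0   # fast, small
/swapfile    none   swap   defaults,pri=10    0 0   # slow, large
```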

Balbir Singh

Jun 17, 2016, 3:30:06 AM
Yes, I'd agree. cgroups can group many tasks, but the group size can be
1 as well. Could you try the same test with the recommended approach and
see if it works as desired?

Balbir Singh

Vinayak Menon

Jun 17, 2016, 4:00:06 AM
With cgroup v2, IIUC, there can be only a single hierarchy where all
controllers exist, and a process can be part of only one cgroup. If
that is true, then with per-task cgroups a task can be present only in
its own cgroup. That being the case, would it be feasible to have the
other parallel controllers, like CPU, which would not be able to work
efficiently with per-task cgroups?