[RFC][PATCH 0/7 v2] memcg: prevent failure in pre

KAMEZAWA Hiroyuki

unread,

Apr 27, 2012, 1:47:55 AM4/27/12

to Linux Kernel, linu...@kvack.org, cgr...@vger.kernel.org, Michal Hocko, Johannes Weiner, Frederic Weisbecker, Glauber Costa, Tejun Heo, Han Ying, Aneesh Kumar K.V, Andrew Morton, kamezawa...@gmail.com

This is a v2 patch for preventing failure in memcg->pre_destroy().
With this patch, ->pre_destroy() will never return error code and
users will not see warning at rmdir(). And this work will simplify
memcg->pre_destroy(), largely.

This patch is based on linux-next + hugetlb memory control patches.

I post this as RFC because I'll have vacation in the next week and
hugetlb patches are not visible in linux-next yet.
So, I'm not in hurry. Please review when you have time.

I'll rebase this onto memcg-devel in the next post.
== BTW, memory cgroup's github is here ==
git://github.com/mstsxfx/memcg-devel.git

Since v1, Whole patch designs are changed. In this version, I didn't
remove ->pre_destroy() but make it succeed always. There are no
asynchronous operation and no big patches. But this introduces
2 changes to cgroup core.

After this series, if use_hierarchy==0, all resources will be moved
to root cgroup at rmdir() or force_empty().

Brief patch conents are

0001 : my version of compile-fix for linux-next, Aneesh will post his own version.
0002 : fix error code in hugetlb_force_memcg_empty
0003 : add res_counter_uncharge_until()
0004 : use res_counter_uncharge_until() at move_parent()
0005 : move charges to root cgroup at rmdir, if use_hierarchy=0
0006 : clean up mem_cgroup_move_account()
0007 : cgroup : avoid attaching task to cgroup where ->pre_destroy() is running.
0008 : cgroup : avoid creating a new cgroup under a cgroup where ->pre_destroy() is running.
0009 : remove -EINTR from memcg->pre_destroy().

Thanks,
-Kame

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majo...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

KAMEZAWA Hiroyuki

unread,

Apr 27, 2012, 1:51:31 AM4/27/12

to Linux Kernel, linu...@kvack.org, cgr...@vger.kernel.org, Michal Hocko, Johannes Weiner, Frederic Weisbecker, Glauber Costa, Tejun Heo, Han Ying, Aneesh Kumar K.V, Andrew Morton, kamezawa...@gmail.com

Maybe Aneesh will post his own version. This is just for my work.

From eaf25c555ec809006220ef22ef152aa04c30c1af Mon Sep 17 00:00:00 2001
From: KAMEZAWA Hiroyuki <kamezaw...@jp.fujitsu.com>
Date: Thu, 26 Apr 2012 17:51:52 +0900
Subject: [PATCH 1/9] compile-fix

My version of compile fix for linux-next, discussed between Andrew and
Aneesh.

Signed-off-by: KAMEZAWA Hiroyuki <kamezaw...@jp.fujitsu.com>
---
include/linux/hugetlb.h | 4 ----
include/linux/memcontrol.h | 15 +++++++++++++++
2 files changed, 15 insertions(+), 4 deletions(-)

diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index 828b073..fc226be 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -343,8 +343,4 @@ static inline unsigned int pages_per_huge_page(struct hstate *h)
#define hstate_index(h) 0
#endif

-#ifdef CONFIG_MEM_RES_CTLR_HUGETLB
-extern int hugetlb_force_memcg_empty(struct cgroup *cgroup);
-#endif
-
#endif /* _LINUX_HUGETLB_H */
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index f173811..ca0914a 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -450,9 +450,14 @@ extern int mem_cgroup_move_hugetlb_parent(int idx, struct cgroup *cgroup,
struct page *page);
extern bool mem_cgroup_have_hugetlb_usage(struct cgroup *cgroup);

+extern int hugetlb_force_memcg_empty(struct cgroup *cgroup);
+
extern void mem_cgroup_hugetlb_migrate(struct page *oldhpage,
struct page *newhpage);
#else
+struct cgroup;
+struct mem_cgroup;
+
static inline int
mem_cgroup_hugetlb_charge_page(int idx, unsigned long nr_pages,
struct mem_cgroup **ptr)
@@ -494,6 +499,16 @@ mem_cgroup_move_hugetlb_parent(int idx, struct cgroup *cgroup,
return 0;
}

+static inline void mem_cgroup_hugetlb_migrate(struct page *oldhpage,
+ struct page *newhpage)
+{
+ return;
+}
+
+static inline int hugetlb_force_memcg_empty(struct cgroup *cgroup)
+{
+ return 0;
+}
#endif /* CONFIG_MEM_RES_CTLR_HUGETLB */
#endif /* _LINUX_MEMCONTROL_H */

--
1.7.4.1

KAMEZAWA Hiroyuki

unread,

Apr 27, 2012, 1:53:12 AM4/27/12

to Linux Kernel, linu...@kvack.org, cgr...@vger.kernel.org, Michal Hocko, Johannes Weiner, Frederic Weisbecker, Glauber Costa, Tejun Heo, Han Ying, Aneesh Kumar K.V, Andrew Morton, kamezawa...@gmail.com

EBUSY should be returned.

Signed-off-by: KAMEZAWA Hiroyuki <kamezaw...@jp.fujitsu.com>
---

mm/hugetlb.c | 5 ++++-
1 files changed, 4 insertions(+), 1 deletions(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 17ae2e4..4dd6b39 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1922,8 +1922,11 @@ int hugetlb_force_memcg_empty(struct cgroup *cgroup)
int ret = 0, idx = 0;

do {
- if (cgroup_task_count(cgroup) || !list_empty(&cgroup->children))
+ if (cgroup_task_count(cgroup)
+ || !list_empty(&cgroup->children)) {
+ ret = -EBUSY;
goto out;
+ }
/*
* If the task doing the cgroup_rmdir got a signal
* we don't really need to loop till the hugetlb resource
--
1.7.4.1

KAMEZAWA Hiroyuki

unread,

Apr 27, 2012, 1:55:05 AM4/27/12

to Linux Kernel, linu...@kvack.org, cgr...@vger.kernel.org, Michal Hocko, Johannes Weiner, Frederic Weisbecker, Glauber Costa, Tejun Heo, Han Ying, Aneesh Kumar K.V, Andrew Morton, kamezawa...@gmail.com

From bb0168d5c85f62f36434956e4728a67d0cc41e55 Mon Sep 17 00:00:00 2001
From: KAMEZAWA Hiroyuki <kamezaw...@jp.fujitsu.com>
Date: Thu, 26 Apr 2012 18:48:07 +0900
Subject: [PATCH 3/9] memcg: add res_counter_uncharge_until()

At killing res_counter which is a child of other counter,
we need to do
res_counter_uncharge(child, xxx)
res_counter_charge(parent, xxx)

This is not atomic and wasting cpu. This patch adds
res_counter_uncharge_until(). This function's uncharge propagates
to ancestors until specified res_counter.

res_counter_uncharge_until(child, parent, xxx)

This ops is atomic and more efficient.

Originaly-written-by: Frederic Weisbecker <fwei...@gmail.com>

Signed-off-by: KAMEZAWA Hiroyuki <kamezaw...@jp.fujitsu.com>
---

Documentation/cgroups/resource_counter.txt | 8 ++++++++
include/linux/res_counter.h | 3 +++
kernel/res_counter.c | 13 +++++++++++--
3 files changed, 22 insertions(+), 2 deletions(-)

diff --git a/Documentation/cgroups/resource_counter.txt b/Documentation/cgroups/resource_counter.txt
index 95b24d7..703103a 100644
--- a/Documentation/cgroups/resource_counter.txt
+++ b/Documentation/cgroups/resource_counter.txt
@@ -92,6 +92,14 @@ to work with it.

The _locked routines imply that the res_counter->lock is taken.

+ f. void res_counter_uncharge_until
+ (struct res_counter *rc, struct res_counter *top,
+ unsinged long val)
+
+ Almost same as res_cunter_uncharge() but propagation of uncharge
+ stops when rc == top. This is useful when kill a res_coutner in
+ child cgroup.
+
2.1 Other accounting routines

There are more routines that may help you with common needs, like
diff --git a/include/linux/res_counter.h b/include/linux/res_counter.h
index da81af0..d11c1cd 100644
--- a/include/linux/res_counter.h
+++ b/include/linux/res_counter.h
@@ -135,6 +135,9 @@ int __must_check res_counter_charge_nofail(struct res_counter *counter,
void res_counter_uncharge_locked(struct res_counter *counter, unsigned long val);
void res_counter_uncharge(struct res_counter *counter, unsigned long val);

+void res_counter_uncharge_until(struct res_counter *counter,
+ struct res_counter *top,
+ unsigned long val);
/**
* res_counter_margin - calculate chargeable space of a counter
* @cnt: the counter
diff --git a/kernel/res_counter.c b/kernel/res_counter.c
index d508363..f4ec411 100644
--- a/kernel/res_counter.c
+++ b/kernel/res_counter.c
@@ -66,6 +66,8 @@ done:
return ret;
}

+
+
int res_counter_charge_nofail(struct res_counter *counter, unsigned long val,
struct res_counter **limit_fail_at)
{
@@ -99,13 +101,15 @@ void res_counter_uncharge_locked(struct res_counter *counter, unsigned long val)
counter->usage -= val;
}

-void res_counter_uncharge(struct res_counter *counter, unsigned long val)
+void res_counter_uncharge_until(struct res_counter *counter,
+ struct res_counter *top,
+ unsigned long val)
{
unsigned long flags;
struct res_counter *c;

local_irq_save(flags);
- for (c = counter; c != NULL; c = c->parent) {
+ for (c = counter; c != top; c = c->parent) {
spin_lock(&c->lock);
res_counter_uncharge_locked(c, val);
spin_unlock(&c->lock);
@@ -113,6 +117,11 @@ void res_counter_uncharge(struct res_counter *counter, unsigned long val)
local_irq_restore(flags);
}

+void res_counter_uncharge(struct res_counter *counter, unsigned long val)
+{
+ res_counter_uncharge_until(counter, NULL, val);
+}
+

static inline unsigned long long *
res_counter_member(struct res_counter *counter, int member)
--
1.7.4.1

KAMEZAWA Hiroyuki

unread,

Apr 27, 2012, 1:57:00 AM4/27/12

to Linux Kernel, linu...@kvack.org, cgr...@vger.kernel.org, Michal Hocko, Johannes Weiner, Frederic Weisbecker, Glauber Costa, Tejun Heo, Han Ying, Aneesh Kumar K.V, Andrew Morton, kamezawa...@gmail.com

By using res_counter_uncharge_until(), we can avoid
unnecessary charging.

Signed-off-by: KAMEZAWA Hiroyuki <kamezaw...@jp.fujitsu.com>
---

mm/memcontrol.c | 63 ++++++++++++++++++++++++++++++++++++------------------
1 files changed, 42 insertions(+), 21 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 613bb15..ed53d64 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2420,6 +2420,24 @@ static void __mem_cgroup_cancel_charge(struct mem_cgroup *memcg,
}

/*
+ * Cancel chages in this cgroup....doesn't propagates to parent cgroup.
+ * This is useful when moving usage to parent cgroup.
+ */
+static void __mem_cgroup_cancel_local_charge(struct mem_cgroup *memcg,
+ unsigned int nr_pages)
+{
+ if (!mem_cgroup_is_root(memcg)) {
+ unsigned long bytes = nr_pages * PAGE_SIZE;
+
+ res_counter_uncharge_until(&memcg->res,
+ memcg->res.parent, bytes);
+ if (do_swap_account)
+ res_counter_uncharge_until(&memcg->memsw,
+ memcg->memsw.parent, bytes);
+ }
+}
+
+/*
* A helper function to get mem_cgroup from ID. must be called under
* rcu_read_lock(). The caller must check css_is_removed() or some if
* it's concern. (dropping refcnt from swap can be called against removed
@@ -2677,16 +2695,28 @@ static int mem_cgroup_move_parent(struct page *page,
nr_pages = hpage_nr_pages(page);

parent = mem_cgroup_from_cont(pcg);
- ret = __mem_cgroup_try_charge(NULL, gfp_mask, nr_pages, &parent, false);
- if (ret)
- goto put_back;
+ if (!parent->use_hierarchy) {
+ ret = __mem_cgroup_try_charge(NULL,
+ gfp_mask, nr_pages, &parent, false);
+ if (ret)
+ goto put_back;
+ }

if (nr_pages > 1)
flags = compound_lock_irqsave(page);

- ret = mem_cgroup_move_account(page, nr_pages, pc, child, parent, true);
- if (ret)
- __mem_cgroup_cancel_charge(parent, nr_pages);
+ if (parent->use_hierarchy) {
+ ret = mem_cgroup_move_account(page, nr_pages,
+ pc, child, parent, false);
+ if (!ret)
+ __mem_cgroup_cancel_local_charge(child, nr_pages);
+ } else {
+ ret = mem_cgroup_move_account(page, nr_pages,
+ pc, child, parent, true);
+
+ if (ret)
+ __mem_cgroup_cancel_charge(parent, nr_pages);
+ }

if (nr_pages > 1)
compound_unlock_irqrestore(page, flags);
@@ -3295,6 +3325,7 @@ int mem_cgroup_move_hugetlb_parent(int idx, struct cgroup *cgroup,
struct cgroup *pcgrp = cgroup->parent;
struct mem_cgroup *parent = mem_cgroup_from_cont(pcgrp);
struct mem_cgroup *memcg = mem_cgroup_from_cont(cgroup);
+ struct res_counter *counter;

if (!get_page_unless_zero(page))
goto out;
@@ -3305,28 +3336,18 @@ int mem_cgroup_move_hugetlb_parent(int idx, struct cgroup *cgroup,
goto err_out;

csize = PAGE_SIZE << compound_order(page);
- /*
- * If we have use_hierarchy set we can never fail here. So instead of
- * using res_counter_uncharge use the open-coded variant which just
- * uncharge the child res_counter. The parent will retain the charge.
- */
- if (parent->use_hierarchy) {
- unsigned long flags;
- struct res_counter *counter;
-
- counter = &memcg->hugepage[idx];
- spin_lock_irqsave(&counter->lock, flags);
- res_counter_uncharge_locked(counter, csize);
- spin_unlock_irqrestore(&counter->lock, flags);
- } else {
+ /* If parent->use_hierarchy == 0, we need to charge parent */
+ if (!parent->use_hierarchy) {
ret = res_counter_charge(&parent->hugepage[idx],
csize, &fail_res);
if (ret) {
ret = -EBUSY;
goto err_out;
}
- res_counter_uncharge(&memcg->hugepage[idx], csize);
}
+ counter = &memcg->hugepage[idx];
+ res_counter_uncharge_until(counter, counter->parent, csize);
+
pc->mem_cgroup = parent;
err_out:
unlock_page_cgroup(pc);
--
1.7.4.1

KAMEZAWA Hiroyuki

unread,

Apr 27, 2012, 2:01:02 AM4/27/12

to Linux Kernel, linu...@kvack.org, cgr...@vger.kernel.org, Michal Hocko, Johannes Weiner, Frederic Weisbecker, Glauber Costa, Tejun Heo, Han Ying, Aneesh Kumar K.V, Andrew Morton, kamezawa...@gmail.com

Now, at removal of cgroup, ->pre_destroy() is called and move charges
to the parent cgroup. A major reason of -EBUSY returned by ->pre_destroy()
is that the 'moving' hits parent's resource limitation. It happens only
when use_hierarchy=0. This was a mistake of original design.(it's me...)

Considering use_hierarchy=0, all cgroups are treated as flat. So, no one
cannot justify moving charges to parent...parent and children are in
flat configuration, not hierarchical.

This patch modifes to move charges to root cgroup at rmdir/force_empty
if use_hierarchy==0. This will much simplify rmdir() and reduce error
in ->pre_destroy.

Signed-off-by: KAMEZAWA Hiroyuki <kamezaw...@jp.fujitsu.com>
---

Documentation/cgroups/memory.txt | 12 ++++++----
mm/memcontrol.c | 39 +++++++++++++------------------------
2 files changed, 21 insertions(+), 30 deletions(-)

diff --git a/Documentation/cgroups/memory.txt b/Documentation/cgroups/memory.txt
index 54c338d..82ce1ef 100644
--- a/Documentation/cgroups/memory.txt
+++ b/Documentation/cgroups/memory.txt
@@ -393,14 +393,14 @@ cgroup might have some charge associated with it, even though all
tasks have migrated away from it. (because we charge against pages, not
against tasks.)

-Such charges are freed or moved to their parent. At moving, both of RSS
-and CACHES are moved to parent.
-rmdir() may return -EBUSY if freeing/moving fails. See 5.1 also.
+Such charges are freed or moved to their parent if use_hierarchy=1.
+if use_hierarchy=0, the charges will be moved to root cgroup.

Charges recorded in swap information is not updated at removal of cgroup.
Recorded information is discarded and a cgroup which uses swap (swapcache)
will be charged as a new owner of it.

+About use_hierarchy, see Section 6.

5. Misc. interfaces.

@@ -413,13 +413,15 @@ will be charged as a new owner of it.

Almost all pages tracked by this memory cgroup will be unmapped and freed.
Some pages cannot be freed because they are locked or in-use. Such pages are
- moved to parent and this cgroup will be empty. This may return -EBUSY if
- VM is too busy to free/move all pages immediately.
+ moved to parent(if use_hierarchy==1) or root (if use_hierarchy==0) and this
+ cgroup will be empty.

Typical use case of this interface is that calling this before rmdir().
Because rmdir() moves all pages to parent, some out-of-use page caches can be
moved to the parent. If you want to avoid that, force_empty will be useful.

+ About use_hierarchy, see Section 6.
+
5.2 stat file

memory.stat file includes following statistics
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index ed53d64..62200f1 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2695,32 +2695,23 @@ static int mem_cgroup_move_parent(struct page *page,

nr_pages = hpage_nr_pages(page);

parent = mem_cgroup_from_cont(pcg);

- if (!parent->use_hierarchy) {
- ret = __mem_cgroup_try_charge(NULL,
- gfp_mask, nr_pages, &parent, false);

- if (ret)
- goto put_back;

- }
+ /*
+ * if use_hierarchy==0, move charges to root cgroup.
+ * in root cgroup, we don't touch res_counter
+ */
+ if (!parent->use_hierarchy)
+ parent = root_mem_cgroup;

if (nr_pages > 1)
flags = compound_lock_irqsave(page);

- if (parent->use_hierarchy) {

- ret = mem_cgroup_move_account(page, nr_pages,

- pc, child, parent, false);
- if (!ret)
- __mem_cgroup_cancel_local_charge(child, nr_pages);
- } else {

- ret = mem_cgroup_move_account(page, nr_pages,

- pc, child, parent, true);

-
- if (ret)
- __mem_cgroup_cancel_charge(parent, nr_pages);

- }

+ ret = mem_cgroup_move_account(page, nr_pages,
+ pc, child, parent, false);
+ if (!ret)
+ __mem_cgroup_cancel_local_charge(child, nr_pages);

if (nr_pages > 1)
compound_unlock_irqrestore(page, flags);

-put_back:
putback_lru_page(page);
put:
put_page(page);
@@ -3338,12 +3329,10 @@ int mem_cgroup_move_hugetlb_parent(int idx, struct cgroup *cgroup,
csize = PAGE_SIZE << compound_order(page);

/* If parent->use_hierarchy == 0, we need to charge parent */

if (!parent->use_hierarchy) {
- ret = res_counter_charge(&parent->hugepage[idx],
- csize, &fail_res);
- if (ret) {
- ret = -EBUSY;
- goto err_out;
- }
+ parent = root_mem_cgroup;
+ /* root has no limit */
+ res_counter_charge_nofail(&parent->hugepage[idx],
+ csize, &fail_res);

}
counter = &memcg->hugepage[idx];

res_counter_uncharge_until(counter, counter->parent, csize);
--
1.7.4.1

KAMEZAWA Hiroyuki

unread,

Apr 27, 2012, 2:02:28 AM4/27/12

to Linux Kernel, linu...@kvack.org, cgr...@vger.kernel.org, Michal Hocko, Johannes Weiner, Frederic Weisbecker, Glauber Costa, Tejun Heo, Han Ying, Aneesh Kumar K.V, Andrew Morton, kamezawa...@gmail.com

Now, all caller passes 'false' for 'bool uncharge', remove the argument.

Signed-off-by: KAMEZAWA Hiroyuki <kamezaw...@jp.fujitsu.com>
---

mm/memcontrol.c | 20 ++++++--------------
1 files changed, 6 insertions(+), 14 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 62200f1..2715223 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2589,23 +2589,19 @@ void mem_cgroup_split_huge_fixup(struct page *head)
* @pc: page_cgroup of the page.
* @from: mem_cgroup which the page is moved from.
* @to: mem_cgroup which the page is moved to. @from != @to.
- * @uncharge: whether we should call uncharge and css_put against @from.
*
* The caller must confirm following.
* - page is not on LRU (isolate_page() is useful.)
* - compound_lock is held when nr_pages > 1
*
- * This function doesn't do "charge" nor css_get to new cgroup. It should be
- * done by a caller(__mem_cgroup_try_charge would be useful). If @uncharge is
- * true, this function does "uncharge" from old cgroup, but it doesn't if
- * @uncharge is false, so a caller should do "uncharge".
+ * This function doesn't do "charge" to new cgroup and doesn't do "uncharge"
+ * from old cgroup.
*/
static int mem_cgroup_move_account(struct page *page,
unsigned int nr_pages,
struct page_cgroup *pc,
struct mem_cgroup *from,
- struct mem_cgroup *to,
- bool uncharge)
+ struct mem_cgroup *to)
{
unsigned long flags;
int ret;
@@ -2639,9 +2635,6 @@ static int mem_cgroup_move_account(struct page *page,
preempt_enable();
}
mem_cgroup_charge_statistics(from, anon, -nr_pages);
- if (uncharge)
- /* This is not "cancel", but cancel_charge does all we need. */
- __mem_cgroup_cancel_charge(from, nr_pages);

/* caller should have done css_get */
pc->mem_cgroup = to;
@@ -2706,7 +2699,7 @@ static int mem_cgroup_move_parent(struct page *page,
flags = compound_lock_irqsave(page);

ret = mem_cgroup_move_account(page, nr_pages,
- pc, child, parent, false);

+ pc, child, parent);
if (!ret)
__mem_cgroup_cancel_local_charge(child, nr_pages);

@@ -5724,8 +5717,7 @@ static int mem_cgroup_move_charge_pte_range(pmd_t *pmd,
if (!isolate_lru_page(page)) {
pc = lookup_page_cgroup(page);
if (!mem_cgroup_move_account(page, HPAGE_PMD_NR,
- pc, mc.from, mc.to,
- false)) {
+ pc, mc.from, mc.to)) {
mc.precharge -= HPAGE_PMD_NR;
mc.moved_charge += HPAGE_PMD_NR;
}
@@ -5755,7 +5747,7 @@ retry:
goto put;
pc = lookup_page_cgroup(page);
if (!mem_cgroup_move_account(page, 1, pc,
- mc.from, mc.to, false)) {
+ mc.from, mc.to)) {
mc.precharge--;
/* we uncharge from mc.from later. */
mc.moved_charge++;
--
1.7.4.1

KAMEZAWA Hiroyuki

unread,

Apr 27, 2012, 2:04:31 AM4/27/12

to Linux Kernel, linu...@kvack.org, cgr...@vger.kernel.org, Michal Hocko, Johannes Weiner, Frederic Weisbecker, Glauber Costa, Tejun Heo, Han Ying, Aneesh Kumar K.V, Andrew Morton, kamezawa...@gmail.com

attach_task() is done under cgroup_mutex() but ->pre_destroy() callback
in rmdir() isn't called under cgroup_mutex().

It's better to avoid attaching a task to a cgroup which
is under pre_destroy(). Considering memcg, the attached task may
increase resource usage after memcg's pre_destroy() confirms that
memcg is empty. This is not good.

Signed-off-by: KAMEZAWA Hiroyuki <kamezaw...@jp.fujitsu.com>
---

kernel/cgroup.c | 5 ++++-

1 files changed, 4 insertions(+), 1 deletions(-)

diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index ad8eae5..7a3076b 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -1953,6 +1953,9 @@ int cgroup_attach_task(struct cgroup *cgrp, struct task_struct *tsk)
if (cgrp == oldcgrp)
return 0;

+ if (test_bit(CGRP_WAIT_ON_RMDIR, &cgrp->flags))
+ return -EBUSY;
+
tset.single.task = tsk;
tset.single.cgrp = oldcgrp;

@@ -4181,7 +4184,6 @@ again:
mutex_unlock(&cgroup_mutex);
return -EBUSY;
}
- mutex_unlock(&cgroup_mutex);

/*
* In general, subsystem has no css->refcnt after pre_destroy(). But
@@ -4193,6 +4195,7 @@ again:
* and css_tryget() and cgroup_wakeup_rmdir_waiter() implementation.
*/
set_bit(CGRP_WAIT_ON_RMDIR, &cgrp->flags);
+ mutex_unlock(&cgroup_mutex);

/*
* Call pre_destroy handlers of subsys. Notify subsystems
--
1.7.4.1

KAMEZAWA Hiroyuki

unread,

Apr 27, 2012, 2:06:23 AM4/27/12

to Linux Kernel, linu...@kvack.org, cgr...@vger.kernel.org, Michal Hocko, Johannes Weiner, Frederic Weisbecker, Glauber Costa, Tejun Heo, Han Ying, Aneesh Kumar K.V, Andrew Morton, kamezawa...@gmail.com

When ->pre_destroy() is called, it should be guaranteed that
new child cgroup is not created under a cgroup, where pre_destroy()
is running. If not, ->pre_destroy() must check children and
return -EBUSY, which causes warning.

Signed-off-by: KAMEZAWA Hiroyuki <kamezaw...@jp.fujitsu.com>
---

kernel/cgroup.c | 9 +++++++++
1 files changed, 9 insertions(+), 0 deletions(-)

diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index 7a3076b..003ceed 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -3970,6 +3970,15 @@ static long cgroup_create(struct cgroup *parent, struct dentry *dentry,

mutex_lock(&cgroup_mutex);

+ /*
+ * prevent making child cgroup when ->pre_destroy() is running.
+ */
+ if (test_bit(CGRP_WAIT_ON_RMDIR, &parent->flags)) {
+ mutex_unlock(&cgroup_mutex);
+ atomic_dec(&sb->s_active);
+ return -EBUSY;
+ }
+
init_cgroup_housekeeping(cgrp);

cgrp->parent = parent;
--
1.7.4.1

KAMEZAWA Hiroyuki

unread,

Apr 27, 2012, 2:08:32 AM4/27/12

to Linux Kernel, linu...@kvack.org, cgr...@vger.kernel.org, Michal Hocko, Johannes Weiner, Frederic Weisbecker, Glauber Costa, Tejun Heo, Han Ying, Aneesh Kumar K.V, Andrew Morton, kamezawa...@gmail.com

When force_empty() called by ->pre_destroy(), no memory reclaim happens
and it doesn't take very long time which requires signal_pending() check.
And if we return -EINTR from pre_destroy(), cgroup.c show warning.

This patch removes signal check in force_empty(). By this, ->pre_destroy()
returns success always.

Note: check for 'cgroup is empty' remains for force_empty interface.

Signed-off-by: KAMEZAWA Hiroyuki <kamezaw...@jp.fujitsu.com>
---

mm/hugetlb.c | 10 +---------
mm/memcontrol.c | 14 +++++---------
2 files changed, 6 insertions(+), 18 deletions(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 4dd6b39..770f1642 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1922,20 +1922,12 @@ int hugetlb_force_memcg_empty(struct cgroup *cgroup)

int ret = 0, idx = 0;

do {

+ /* see memcontrol.c::mem_cgroup_force_empty() */
if (cgroup_task_count(cgroup)
|| !list_empty(&cgroup->children)) {

ret = -EBUSY;
goto out;
}

- /*
- * If the task doing the cgroup_rmdir got a signal
- * we don't really need to loop till the hugetlb resource
- * usage become zero.
- */
- if (signal_pending(current)) {
- ret = -EINTR;
- goto out;
- }
for_each_hstate(h) {
spin_lock(&hugetlb_lock);
list_for_each_entry(page, &h->hugepage_activelist, lru) {
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 2715223..ee350c5 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -3852,8 +3852,6 @@ static int mem_cgroup_force_empty_list(struct mem_cgroup *memcg,
pc = lookup_page_cgroup(page);

ret = mem_cgroup_move_parent(page, pc, memcg, GFP_KERNEL);
- if (ret == -ENOMEM || ret == -EINTR)
- break;

if (ret == -EBUSY || ret == -EINVAL) {
/* found lock contention or "pc" is obsolete. */
@@ -3863,7 +3861,7 @@ static int mem_cgroup_force_empty_list(struct mem_cgroup *memcg,
busy = NULL;
}

- if (!ret && !list_empty(list))
+ if (!loop)
return -EBUSY;
return ret;
}
@@ -3893,11 +3891,12 @@ static int mem_cgroup_force_empty(struct mem_cgroup *memcg, bool free_all)
move_account:
do {
ret = -EBUSY;
+ /*
+ * This never happens when this is called by ->pre_destroy().
+ * But we need to take care of force_empty interface.
+ */
if (cgroup_task_count(cgrp) || !list_empty(&cgrp->children))
goto out;
- ret = -EINTR;
- if (signal_pending(current))
- goto out;
/* This is for making all *used* pages to be on LRU. */
lru_add_drain_all();
drain_all_stock_sync(memcg);
@@ -3918,9 +3917,6 @@ move_account:
}
mem_cgroup_end_move(memcg);
memcg_oom_recover(memcg);
- /* it seems parent cgroup doesn't have enough mem */
- if (ret == -ENOMEM)
- goto try_to_free;
cond_resched();
/* "ret" should also be checked to ensure all lists are empty. */
} while (res_counter_read_u64(&memcg->res, RES_USAGE) > 0 || ret);
--
1.7.4.1

Frederic Weisbecker

unread,

Apr 27, 2012, 6:39:47 AM4/27/12

to KAMEZAWA Hiroyuki, Linux Kernel, linu...@kvack.org, cgr...@vger.kernel.org, Michal Hocko, Johannes Weiner, Glauber Costa, Tejun Heo, Han Ying, Aneesh Kumar K.V, Andrew Morton, kamezawa...@gmail.com

On Fri, Apr 27, 2012 at 03:02:22PM +0900, KAMEZAWA Hiroyuki wrote:
> attach_task() is done under cgroup_mutex() but ->pre_destroy() callback
> in rmdir() isn't called under cgroup_mutex().
>
> It's better to avoid attaching a task to a cgroup which
> is under pre_destroy(). Considering memcg, the attached task may
> increase resource usage after memcg's pre_destroy() confirms that
> memcg is empty. This is not good.
>
> Signed-off-by: KAMEZAWA Hiroyuki <kamezaw...@jp.fujitsu.com>
> ---
> kernel/cgroup.c | 5 ++++-
> 1 files changed, 4 insertions(+), 1 deletions(-)
>
> diff --git a/kernel/cgroup.c b/kernel/cgroup.c
> index ad8eae5..7a3076b 100644
> --- a/kernel/cgroup.c
> +++ b/kernel/cgroup.c
> @@ -1953,6 +1953,9 @@ int cgroup_attach_task(struct cgroup *cgrp, struct task_struct *tsk)
> if (cgrp == oldcgrp)
> return 0;
>
> + if (test_bit(CGRP_WAIT_ON_RMDIR, &cgrp->flags))
> + return -EBUSY;
> +

You probably need to update cgroup_attach_proc() as well?

Tejun Heo

unread,

Apr 27, 2012, 2:16:53 PM4/27/12

to KAMEZAWA Hiroyuki, Linux Kernel, linu...@kvack.org, cgr...@vger.kernel.org, Michal Hocko, Johannes Weiner, Frederic Weisbecker, Glauber Costa, Han Ying, Aneesh Kumar K.V, Andrew Morton, kamezawa...@gmail.com

Hello,

On Fri, Apr 27, 2012 at 02:45:30PM +0900, KAMEZAWA Hiroyuki wrote:
> This is a v2 patch for preventing failure in memcg->pre_destroy().
> With this patch, ->pre_destroy() will never return error code and
> users will not see warning at rmdir(). And this work will simplify
> memcg->pre_destroy(), largely.
>
> This patch is based on linux-next + hugetlb memory control patches.

Ergh... can you please set up a git branch somewhere for review
purposes?

Thanks.

--
tejun

Tejun Heo

unread,

Apr 27, 2012, 2:18:50 PM4/27/12

to KAMEZAWA Hiroyuki, Linux Kernel, linu...@kvack.org, cgr...@vger.kernel.org, Michal Hocko, Johannes Weiner, Frederic Weisbecker, Glauber Costa, Han Ying, Aneesh Kumar K.V, Andrew Morton, kamezawa...@gmail.com

On Fri, Apr 27, 2012 at 02:53:03PM +0900, KAMEZAWA Hiroyuki wrote:
> diff --git a/kernel/res_counter.c b/kernel/res_counter.c
> index d508363..f4ec411 100644
> --- a/kernel/res_counter.c
> +++ b/kernel/res_counter.c
> @@ -66,6 +66,8 @@ done:
> return ret;
> }
>
> +
> +

Contamination?

--
tejun

Tejun Heo

unread,

Apr 27, 2012, 2:20:33 PM4/27/12

to KAMEZAWA Hiroyuki, Linux Kernel, linu...@kvack.org, cgr...@vger.kernel.org, Michal Hocko, Johannes Weiner, Frederic Weisbecker, Glauber Costa, Han Ying, Aneesh Kumar K.V, Andrew Morton, kamezawa...@gmail.com

On Fri, Apr 27, 2012 at 02:54:58PM +0900, KAMEZAWA Hiroyuki wrote:
> By using res_counter_uncharge_until(), we can avoid
> unnecessary charging.
>
> Signed-off-by: KAMEZAWA Hiroyuki <kamezaw...@jp.fujitsu.com>
> ---
> mm/memcontrol.c | 63 ++++++++++++++++++++++++++++++++++++------------------
> 1 files changed, 42 insertions(+), 21 deletions(-)
>
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 613bb15..ed53d64 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -2420,6 +2420,24 @@ static void __mem_cgroup_cancel_charge(struct mem_cgroup *memcg,
> }
>
> /*
> + * Cancel chages in this cgroup....doesn't propagates to parent cgroup.

^typo ^ unnecessary s

--
tejun

Ying Han

unread,

Apr 27, 2012, 2:26:28 PM4/27/12

to Glauber Costa, KAMEZAWA Hiroyuki, Linux Kernel, linu...@kvack.org, cgr...@vger.kernel.org, Michal Hocko, Johannes Weiner, Frederic Weisbecker, Tejun Heo, Aneesh Kumar K.V, Andrew Morton, kamezawa...@gmail.com

On Fri, Apr 27, 2012 at 10:16 AM, Glauber Costa <glo...@parallels.com> wrote:

> Kame, this is a nitpick, but I usually prefer to write this like:
>
> if (mem_cgroup_is_root(memcg))
> return;
>
> res_counter...
>
> Specially with memcg, where function names are bigger than average, in
> comparison.
>
> the code itself seems fine.

>
>> +/*
>> * A helper function to get mem_cgroup from ID. must be called under
>> * rcu_read_lock(). The caller must check css_is_removed() or some if
>> * it's concern. (dropping refcnt from swap can be called against removed
>> @@ -2677,16 +2695,28 @@ static int mem_cgroup_move_parent(struct page *page,
>> nr_pages = hpage_nr_pages(page);
>>
>> parent = mem_cgroup_from_cont(pcg);

>> - ret = __mem_cgroup_try_charge(NULL, gfp_mask, nr_pages,&parent, false);

>> - if (ret)
>> - goto put_back;
>> + if (!parent->use_hierarchy) {

> Can we avoid testing for use hierarchy ?
> Specially given this might go away.
>
> parent_mem_cgroup() already bundles this information. So maybe we can
> test for parent_mem_cgroup(parent) == NULL. It is the same thing after all.
>> + ret = __mem_cgroup_try_charge(NULL,
>> + gfp_mask, nr_pages,&parent, false);

>> + if (ret)
>> + goto put_back;
>> + }
>

> Why? If we are not hierarchical, we should not charge the parent, right?

This is how it is implemented today and I think he changed that to
move to root on the next patch.

>
>> if (nr_pages> 1)
>> flags = compound_lock_irqsave(page);
>>
>> - ret = mem_cgroup_move_account(page, nr_pages, pc, child, parent, true);
>> - if (ret)
>> - __mem_cgroup_cancel_charge(parent, nr_pages);
>> + if (parent->use_hierarchy) {
>> + ret = mem_cgroup_move_account(page, nr_pages,
>> + pc, child, parent, false);
>> + if (!ret)
>> + __mem_cgroup_cancel_local_charge(child, nr_pages);
>> + } else {
>> + ret = mem_cgroup_move_account(page, nr_pages,
>> + pc, child, parent, true);
>> +
>> + if (ret)
>> + __mem_cgroup_cancel_charge(parent, nr_pages);
>> + }
>

> Calling move account also seems not necessary to me. If we are not
> uncharging + charging, we won't even touch the parent.

Today for user_hierarchy = 0, the charge is moved to parent as well as
the stats. But that is changed on the following patches.

--Ying

Ying Han

unread,

Apr 27, 2012, 3:13:00 PM4/27/12

to KAMEZAWA Hiroyuki, Linux Kernel, linu...@kvack.org, cgr...@vger.kernel.org, Michal Hocko, Johannes Weiner, Frederic Weisbecker, Glauber Costa, Tejun Heo, Aneesh Kumar K.V, Andrew Morton, kamezawa...@gmail.com

On Thu, Apr 26, 2012 at 10:58 PM, KAMEZAWA Hiroyuki
<kamezaw...@jp.fujitsu.com> wrote:
> Now, at removal of cgroup, ->pre_destroy() is called and move charges
> to the parent cgroup. A major reason of -EBUSY returned by ->pre_destroy()
> is that the 'moving' hits parent's resource limitation. It happens only
> when use_hierarchy=0. This was a mistake of original design.(it's me...)

Nice patch, i can see how broken it is now with use_hierarchy=0...

nitpick on the documentation below:

>
> Considering use_hierarchy=0, all cgroups are treated as flat. So, no one
> cannot justify moving charges to parent...parent and children are in
> flat configuration, not hierarchical.
>
> This patch modifes to move charges to root cgroup at rmdir/force_empty
> if use_hierarchy==0. This will much simplify rmdir() and reduce error
> in ->pre_destroy.
>
> Signed-off-by: KAMEZAWA Hiroyuki <kamezaw...@jp.fujitsu.com>
> ---
> Documentation/cgroups/memory.txt | 12 ++++++----
> mm/memcontrol.c | 39 +++++++++++++------------------------
> 2 files changed, 21 insertions(+), 30 deletions(-)
>
> diff --git a/Documentation/cgroups/memory.txt b/Documentation/cgroups/memory.txt
> index 54c338d..82ce1ef 100644
> --- a/Documentation/cgroups/memory.txt
> +++ b/Documentation/cgroups/memory.txt
> @@ -393,14 +393,14 @@ cgroup might have some charge associated with it, even though all
> tasks have migrated away from it. (because we charge against pages, not
> against tasks.)
>
> -Such charges are freed or moved to their parent. At moving, both of RSS
> -and CACHES are moved to parent.
> -rmdir() may return -EBUSY if freeing/moving fails. See 5.1 also.
> +Such charges are freed or moved to their parent if use_hierarchy=1.
> +if use_hierarchy=0, the charges will be moved to root cgroup.

It is more clear that we move the stats to root (if use_hierarchy==0)
or parent (if use_hierarchy==1), and no change on the charge except
uncharging from the child.

--Ying

Tejun Heo

unread,

Apr 27, 2012, 4:32:12 PM4/27/12

to KAMEZAWA Hiroyuki, Linux Kernel, linu...@kvack.org, cgr...@vger.kernel.org, Michal Hocko, Johannes Weiner, Frederic Weisbecker, Glauber Costa, Han Ying, Aneesh Kumar K.V, Andrew Morton, kamezawa...@gmail.com

On Fri, Apr 27, 2012 at 03:02:22PM +0900, KAMEZAWA Hiroyuki wrote:

> attach_task() is done under cgroup_mutex() but ->pre_destroy() callback
> in rmdir() isn't called under cgroup_mutex().
>
> It's better to avoid attaching a task to a cgroup which
> is under pre_destroy(). Considering memcg, the attached task may
> increase resource usage after memcg's pre_destroy() confirms that
> memcg is empty. This is not good.
>
> Signed-off-by: KAMEZAWA Hiroyuki <kamezaw...@jp.fujitsu.com>

Hmm... once memcg's pre_destroy() can't fail, I think what we should
do is marking a cgroup DEAD before calling pre_destroy() and the
existing cgroup_is_removed() check should be enough. Patches upto
this point already make ->pre_destroy() not fail, right?

Thanks.

--
tejun

Tejun Heo

unread,

Apr 27, 2012, 4:33:48 PM4/27/12

to KAMEZAWA Hiroyuki, Linux Kernel, linu...@kvack.org, cgr...@vger.kernel.org, Michal Hocko, Johannes Weiner, Frederic Weisbecker, Glauber Costa, Han Ying, Aneesh Kumar K.V, Andrew Morton, kamezawa...@gmail.com

On Fri, Apr 27, 2012 at 01:31:59PM -0700, Tejun Heo wrote:
> On Fri, Apr 27, 2012 at 03:02:22PM +0900, KAMEZAWA Hiroyuki wrote:
> > attach_task() is done under cgroup_mutex() but ->pre_destroy() callback
> > in rmdir() isn't called under cgroup_mutex().
> >
> > It's better to avoid attaching a task to a cgroup which
> > is under pre_destroy(). Considering memcg, the attached task may
> > increase resource usage after memcg's pre_destroy() confirms that
> > memcg is empty. This is not good.
> >
> > Signed-off-by: KAMEZAWA Hiroyuki <kamezaw...@jp.fujitsu.com>
>
> Hmm... once memcg's pre_destroy() can't fail, I think what we should
> do is marking a cgroup DEAD before calling pre_destroy() and the
> existing cgroup_is_removed() check should be enough. Patches upto
> this point already make ->pre_destroy() not fail, right?

Ooh, I was confused about patch order. This patch, probably with the
update suggested by Frederic looks good as an interim solution for
now.

Tejun Heo

unread,

Apr 27, 2012, 4:40:48 PM4/27/12

to KAMEZAWA Hiroyuki, Linux Kernel, linu...@kvack.org, cgr...@vger.kernel.org, Michal Hocko, Johannes Weiner, Frederic Weisbecker, Glauber Costa, Han Ying, Aneesh Kumar K.V, Andrew Morton, kamezawa...@gmail.com

On Fri, Apr 27, 2012 at 03:04:14PM +0900, KAMEZAWA Hiroyuki wrote:
> When ->pre_destroy() is called, it should be guaranteed that
> new child cgroup is not created under a cgroup, where pre_destroy()
> is running. If not, ->pre_destroy() must check children and
> return -EBUSY, which causes warning.
>
> Signed-off-by: KAMEZAWA Hiroyuki <kamezaw...@jp.fujitsu.com>

Hmm... I'm getting confused more. Why do we need these cgroup changes
at all? cgroup still has cgrp->count check and
cgroup_clear_css_refs() after pre_destroy() calls. The order of
changes should be,

* Make memcg pre_destroy() not fail; however, pre_destroy() should
still be ready to be retried. That's the defined interface.

* cgroup core updated to drop pre_destroy() retrying and guarantee
that pre_destroy() invocation will happen only once.

* memcg and other cgroups can update their pre_destroy() if the "won't
be retried" part can simplify their implementations.

So, there's no reason to be updating cgroup pre_destroy() semantics at
this point and these updates actually break cgroup API as it currently
stands. The only change necessary is memcg's pre_destroy() not
returning zero.

Thanks.

--
tejun

Tejun Heo

unread,

Apr 27, 2012, 4:41:40 PM4/27/12

to KAMEZAWA Hiroyuki, Linux Kernel, linu...@kvack.org, cgr...@vger.kernel.org, Michal Hocko, Johannes Weiner, Frederic Weisbecker, Glauber Costa, Han Ying, Aneesh Kumar K.V, Andrew Morton, kamezawa...@gmail.com

On Fri, Apr 27, 2012 at 01:40:35PM -0700, Tejun Heo wrote:
> stands. The only change necessary is memcg's pre_destroy() not
> returning zero.

Umm.. that should have been "always returning zero". :)

Ying Han

unread,

Apr 27, 2012, 5:28:53 PM4/27/12

to KAMEZAWA Hiroyuki, Linux Kernel, linu...@kvack.org, cgr...@vger.kernel.org, Michal Hocko, Johannes Weiner, Frederic Weisbecker, Glauber Costa, Tejun Heo, Aneesh Kumar K.V, Andrew Morton, kamezawa...@gmail.com

This looks a bit strange to me... why we make the change ?

--Ying

Hiroyuki Kamezawa

unread,

Apr 27, 2012, 7:48:28 PM4/27/12

to Tejun Heo, KAMEZAWA Hiroyuki, Linux Kernel, linu...@kvack.org, cgr...@vger.kernel.org, Michal Hocko, Johannes Weiner, Frederic Weisbecker, Glauber Costa, Han Ying, Aneesh Kumar K.V, Andrew Morton

On Sat, Apr 28, 2012 at 3:16 AM, Tejun Heo <t...@kernel.org> wrote:
> Hello,
>
> On Fri, Apr 27, 2012 at 02:45:30PM +0900, KAMEZAWA Hiroyuki wrote:
>> This is a v2 patch for preventing failure in memcg->pre_destroy().
>> With this patch, ->pre_destroy() will never return error code and
>> users will not see warning at rmdir(). And this work will simplify
>> memcg->pre_destroy(), largely.
>>
>> This patch is based on linux-next + hugetlb memory control patches.
>
> Ergh... can you please set up a git branch somewhere for review
> purposes?
>

I'm sorry...I can't. (To do that, I need to pass many my company's check.)
I'll repost all a week later, hugetlb tree will be seen in memcg-devel or
linux-next.

Thanks,
-Kame

Hiroyuki Kamezawa

unread,

Apr 27, 2012, 7:51:20 PM4/27/12

to Glauber Costa, KAMEZAWA Hiroyuki, Linux Kernel, linu...@kvack.org, cgr...@vger.kernel.org, Michal Hocko, Johannes Weiner, Frederic Weisbecker, Tejun Heo, Han Ying, Aneesh Kumar K.V, Andrew Morton

On Sat, Apr 28, 2012 at 2:08 AM, Glauber Costa <glo...@parallels.com> wrote:

> On 04/27/2012 02:53 AM, KAMEZAWA Hiroyuki wrote:
>> From bb0168d5c85f62f36434956e4728a67d0cc41e55 Mon Sep 17 00:00:00 2001
>> From: KAMEZAWA Hiroyuki<kamezaw...@jp.fujitsu.com>
>> Date: Thu, 26 Apr 2012 18:48:07 +0900
>> Subject: [PATCH 3/9] memcg: add res_counter_uncharge_until()
>>
>> At killing res_counter which is a child of other counter,
>> we need to do
>> res_counter_uncharge(child, xxx)
>> res_counter_charge(parent, xxx)
>>
>> This is not atomic and wasting cpu. This patch adds
>> res_counter_uncharge_until(). This function's uncharge propagates
>> to ancestors until specified res_counter.
>>
>> res_counter_uncharge_until(child, parent, xxx)
>>
>> This ops is atomic and more efficient.
>>
>> Originaly-written-by: Frederic Weisbecker<fwei...@gmail.com>
>> Signed-off-by: KAMEZAWA Hiroyuki<kamezaw...@jp.fujitsu.com>

> I have been carrying Frederic's patch itself in my code for a while now.
>
> Why not just use it? What are you doing differently to justify writing a
> patch yourself? It's a bit of credit giving as well

I don't need "charge" part for my purpose and have no justification to add it.
And task-limit cgroup has no justification, either.

Thanks,
-Kame

Hiroyuki Kamezawa

unread,

Apr 27, 2012, 7:51:49 PM4/27/12

to Tejun Heo, KAMEZAWA Hiroyuki, Linux Kernel, linu...@kvack.org, cgr...@vger.kernel.org, Michal Hocko, Johannes Weiner, Frederic Weisbecker, Glauber Costa, Han Ying, Aneesh Kumar K.V, Andrew Morton

On Sat, Apr 28, 2012 at 3:18 AM, Tejun Heo <t...@kernel.org> wrote:
> On Fri, Apr 27, 2012 at 02:53:03PM +0900, KAMEZAWA Hiroyuki wrote:
>> diff --git a/kernel/res_counter.c b/kernel/res_counter.c
>> index d508363..f4ec411 100644
>> --- a/kernel/res_counter.c
>> +++ b/kernel/res_counter.c
>> @@ -66,6 +66,8 @@ done:
>> return ret;
>> }
>>
>> +
>> +
>
> Contamination?

Ah, sorry. I'll fix.

-Kame

Hiroyuki Kamezawa

unread,

Apr 27, 2012, 7:58:17 PM4/27/12

to Glauber Costa, KAMEZAWA Hiroyuki, Linux Kernel, linu...@kvack.org, cgr...@vger.kernel.org, Michal Hocko, Johannes Weiner, Frederic Weisbecker, Tejun Heo, Han Ying, Aneesh Kumar K.V, Andrew Morton

On Sat, Apr 28, 2012 at 2:16 AM, Glauber Costa <glo...@parallels.com> wrote:
> On 04/27/2012 02:54 AM, KAMEZAWA Hiroyuki wrote:

> Kame, this is a nitpick, but I usually prefer to write this like:
>
> if (mem_cgroup_is_root(memcg))
> return;
>
> res_counter...
>
> Specially with memcg, where function names are bigger than average, in
> comparison.
>
> the code itself seems fine.
>

Ok, I'll use that style in the next post.

>> +/*
>> * A helper function to get mem_cgroup from ID. must be called under
>> * rcu_read_lock(). The caller must check css_is_removed() or some if
>> * it's concern. (dropping refcnt from swap can be called against removed
>> @@ -2677,16 +2695,28 @@ static int mem_cgroup_move_parent(struct page *page,
>> nr_pages = hpage_nr_pages(page);
>>
>> parent = mem_cgroup_from_cont(pcg);

>> - ret = __mem_cgroup_try_charge(NULL, gfp_mask, nr_pages,&parent, false);

>> - if (ret)
>> - goto put_back;
>> + if (!parent->use_hierarchy) {

> Can we avoid testing for use hierarchy ?
> Specially given this might go away.
>
> parent_mem_cgroup() already bundles this information. So maybe we can
> test for parent_mem_cgroup(parent) == NULL. It is the same thing after all.

We need to find parent even if use_hierarchy==0 in this patch.
I'll consider to use it in later patch, thank you for pointing out.

>> + ret = __mem_cgroup_try_charge(NULL,
>> + gfp_mask, nr_pages,&parent, false);

>> + if (ret)
>> + goto put_back;
>> + }
>

> Why? If we are not hierarchical, we should not charge the parent, right?
Current implementation moves charges to parent regardless of use_hierarchy.
It's handled in a following patch.

>
>> if (nr_pages> 1)
>> flags = compound_lock_irqsave(page);
>>
>> - ret = mem_cgroup_move_account(page, nr_pages, pc, child, parent, true);
>> - if (ret)
>> - __mem_cgroup_cancel_charge(parent, nr_pages);
>> + if (parent->use_hierarchy) {
>> + ret = mem_cgroup_move_account(page, nr_pages,
>> + pc, child, parent, false);
>> + if (!ret)
>> + __mem_cgroup_cancel_local_charge(child, nr_pages);
>> + } else {
>> + ret = mem_cgroup_move_account(page, nr_pages,
>> + pc, child, parent, true);
>> +
>> + if (ret)
>> + __mem_cgroup_cancel_charge(parent, nr_pages);
>> + }
>

> Calling move account also seems not necessary to me. If we are not
> uncharging + charging, we won't even touch the parent.

we need to overwrite pc->mem_cgroup and touch other statistics.

Thanks,
-Kame

Hiroyuki Kamezawa

unread,

Apr 27, 2012, 7:59:17 PM4/27/12

to Tejun Heo, KAMEZAWA Hiroyuki, Linux Kernel, linu...@kvack.org, cgr...@vger.kernel.org, Michal Hocko, Johannes Weiner, Frederic Weisbecker, Glauber Costa, Han Ying, Aneesh Kumar K.V, Andrew Morton

On Sat, Apr 28, 2012 at 3:20 AM, Tejun Heo <t...@kernel.org> wrote:
> On Fri, Apr 27, 2012 at 02:54:58PM +0900, KAMEZAWA Hiroyuki wrote:
>> By using res_counter_uncharge_until(), we can avoid
>> unnecessary charging.
>>
>> Signed-off-by: KAMEZAWA Hiroyuki <kamezaw...@jp.fujitsu.com>
>> ---
>> mm/memcontrol.c | 63 ++++++++++++++++++++++++++++++++++++------------------
>> 1 files changed, 42 insertions(+), 21 deletions(-)
>>
>> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
>> index 613bb15..ed53d64 100644
>> --- a/mm/memcontrol.c
>> +++ b/mm/memcontrol.c
>> @@ -2420,6 +2420,24 @@ static void __mem_cgroup_cancel_charge(struct mem_cgroup *memcg,
>> }
>>
>> /*
>> + * Cancel chages in this cgroup....doesn't propagates to parent cgroup.
>
> ^typo ^ unnecessary s

Ya, thanks. I'll fix it.

-Kame

Hiroyuki Kamezawa

unread,

Apr 27, 2012, 8:01:30 PM4/27/12

to Ying Han, KAMEZAWA Hiroyuki, Linux Kernel, linu...@kvack.org, cgr...@vger.kernel.org, Michal Hocko, Johannes Weiner, Frederic Weisbecker, Glauber Costa, Tejun Heo, Aneesh Kumar K.V, Andrew Morton

Seems nicer. I'll use your text in next ver.

Thanks,
-Kame

Hiroyuki Kamezawa

unread,

Apr 27, 2012, 8:06:27 PM4/27/12

to Frederic Weisbecker, KAMEZAWA Hiroyuki, Linux Kernel, linu...@kvack.org, cgr...@vger.kernel.org, Michal Hocko, Johannes Weiner, Glauber Costa, Tejun Heo, Han Ying, Aneesh Kumar K.V, Andrew Morton

On Fri, Apr 27, 2012 at 7:39 PM, Frederic Weisbecker <fwei...@gmail.com> wrote:
> On Fri, Apr 27, 2012 at 03:02:22PM +0900, KAMEZAWA Hiroyuki wrote:
>> attach_task() is done under cgroup_mutex() but ->pre_destroy() callback
>> in rmdir() isn't called under cgroup_mutex().
>>
>> It's better to avoid attaching a task to a cgroup which
>> is under pre_destroy(). Considering memcg, the attached task may
>> increase resource usage after memcg's pre_destroy() confirms that
>> memcg is empty. This is not good.
>>
>> Signed-off-by: KAMEZAWA Hiroyuki <kamezaw...@jp.fujitsu.com>
>> ---
>> kernel/cgroup.c | 5 ++++-
>> 1 files changed, 4 insertions(+), 1 deletions(-)
>>
>> diff --git a/kernel/cgroup.c b/kernel/cgroup.c
>> index ad8eae5..7a3076b 100644
>> --- a/kernel/cgroup.c
>> +++ b/kernel/cgroup.c
>> @@ -1953,6 +1953,9 @@ int cgroup_attach_task(struct cgroup *cgrp, struct task_struct *tsk)
>> if (cgrp == oldcgrp)
>> return 0;
>>
>> + if (test_bit(CGRP_WAIT_ON_RMDIR, &cgrp->flags))
>> + return -EBUSY;
>> +
>
> You probably need to update cgroup_attach_proc() as well?
>

Ahh...I missed that. Thank you for pointing out !

Thanks,
-Kame

Hiroyuki Kamezawa

unread,

Apr 27, 2012, 8:20:58 PM4/27/12

to Tejun Heo, KAMEZAWA Hiroyuki, Linux Kernel, linu...@kvack.org, cgr...@vger.kernel.org, Michal Hocko, Johannes Weiner, Frederic Weisbecker, Glauber Costa, Han Ying, Aneesh Kumar K.V, Andrew Morton

On Sat, Apr 28, 2012 at 5:40 AM, Tejun Heo <t...@kernel.org> wrote:
> On Fri, Apr 27, 2012 at 03:04:14PM +0900, KAMEZAWA Hiroyuki wrote:
>> When ->pre_destroy() is called, it should be guaranteed that
>> new child cgroup is not created under a cgroup, where pre_destroy()
>> is running. If not, ->pre_destroy() must check children and
>> return -EBUSY, which causes warning.
>>
>> Signed-off-by: KAMEZAWA Hiroyuki <kamezaw...@jp.fujitsu.com>
>
> Hmm... I'm getting confused more. Why do we need these cgroup changes
> at all? cgroup still has cgrp->count check and
> cgroup_clear_css_refs() after pre_destroy() calls. The order of
> changes should be,
>
> * Make memcg pre_destroy() not fail; however, pre_destroy() should
> still be ready to be retried. That's the defined interface.
>
> * cgroup core updated to drop pre_destroy() retrying and guarantee
> that pre_destroy() invocation will happen only once.
>
> * memcg and other cgroups can update their pre_destroy() if the "won't
> be retried" part can simplify their implementations.
>

What I thought was...
Assume a memory cgoup A, with use_hierarchy==1.

1. thread:0 start calling pre->destroy of cgroup A
2. thread:0 it sometimes calls cond_resched or other sleep functions.
3. thread:1 create a cgroup B under "A"
4. thread:1 attach a thread X to cgroup A/B
5. res_counter of A charged up. but pre_destroy() can't find what happens
because it scans LRU of A.

So, we have -EBUSY now. I considered some options to fix this.

option 1) just return 0 instead of -EBUSY when pre_destroy() finds a
task or a child.

There is a race....even if we return 0 here and expects cgroup code
can catch it,
the thread or a child we found may be moved to other cgroup before we check it
in cgroup's final check.
In that case, the cgroup will be freed before full-ack of
pre_destory() and the charges
will be lost.

option 2) move all codes to ->destory()
That was previous version of this set.

This is option3 that preventing creation of new child.

If you don't like this, I'll move all codes to ->destroy() and use
asynchronous again.

Thanks,
-Kame

Hiroyuki Kamezawa

unread,

Apr 27, 2012, 8:25:15 PM4/27/12

to Ying Han, KAMEZAWA Hiroyuki, Linux Kernel, linu...@kvack.org, cgr...@vger.kernel.org, Michal Hocko, Johannes Weiner, Frederic Weisbecker, Glauber Costa, Tejun Heo, Aneesh Kumar K.V, Andrew Morton

Ah, I should this move to an independet patch.
Because we don't have -ENOMEM path to exit loop, the return value of
this function
is
0 (if loop !=0 this means lru is empty under the lru lock )
-EBUSY (if loop== 0)

I'll move this part out as an independent clean up patch

thanks,
-kame

Tejun Heo

unread,

Apr 27, 2012, 10:00:19 PM4/27/12

to Hiroyuki Kamezawa, KAMEZAWA Hiroyuki, Linux Kernel, linu...@kvack.org, cgr...@vger.kernel.org, Michal Hocko, Johannes Weiner, Frederic Weisbecker, Glauber Costa, Han Ying, Aneesh Kumar K.V, Andrew Morton

Hi, KAME.

On Sat, Apr 28, 2012 at 09:20:52AM +0900, Hiroyuki Kamezawa wrote:
> What I thought was...
> Assume a memory cgoup A, with use_hierarchy==1.
>
> 1. thread:0 start calling pre->destroy of cgroup A
> 2. thread:0 it sometimes calls cond_resched or other sleep functions.
> 3. thread:1 create a cgroup B under "A"
> 4. thread:1 attach a thread X to cgroup A/B
> 5. res_counter of A charged up. but pre_destroy() can't find what happens
> because it scans LRU of A.
>
> So, we have -EBUSY now. I considered some options to fix this.
>
> option 1) just return 0 instead of -EBUSY when pre_destroy() finds a
> task or a child.
>
> There is a race....even if we return 0 here and expects cgroup code
> can catch it,
> the thread or a child we found may be moved to other cgroup before we check it
> in cgroup's final check.
> In that case, the cgroup will be freed before full-ack of
> pre_destory() and the charges
> will be lost.

So, cgroup code won't proceed with rmdir if children are created
inbetween and note that the race condition of lost charge you
described above existed before this change - ie. new cgroup could be
created after pre_destroy() is complete.

The current cgroup rmdir code is transitional. It has to support both
retrying and non-retrying pre_destroy()s and that means we can't mark
the cgroup DEAD before starting invoking pre_destroy(); however, we
can do that once memcg's pre_destroy() is converted which will also
remove all the WAIT_ON_RMDIR mechanism and the above described race.

There really isn't much point in trying to make the current cgroup
rmdir behave perfectly when the next step is removing all the fixed up
parts.

So, IMHO, just making pre_destroy() clean up its own charges and
always returning 0 is enough. There's no need to fix up old
non-critical race condition at this point in the patch stream. cgroup
rmdir simplification will make them disappear anyway.

Thanks.

--
tejun

Hiroyuki Kamezawa

unread,

Apr 28, 2012, 5:32:28 AM4/28/12

to Tejun Heo, KAMEZAWA Hiroyuki, Linux Kernel, linu...@kvack.org, cgr...@vger.kernel.org, Michal Hocko, Johannes Weiner, Frederic Weisbecker, Glauber Costa, Han Ying, Aneesh Kumar K.V, Andrew Morton

So, hmm, ok. I'll drop patch 7 & 8. memcg may return -EBUSY in very very
race case but users will not see it in the most case.
I'll fix limit, move-charge and use_hierarchy problem first.
Thanks,
-Kame

Michal Hocko

unread,

Apr 28, 2012, 12:14:13 PM4/28/12

to Hiroyuki Kamezawa, Tejun Heo, KAMEZAWA Hiroyuki, Linux Kernel, linu...@kvack.org, cgr...@vger.kernel.org, Johannes Weiner, Frederic Weisbecker, Glauber Costa, Han Ying, Aneesh Kumar K.V, Andrew Morton

On Sat 28-04-12 08:48:18, Hiroyuki Kamezawa wrote:
> On Sat, Apr 28, 2012 at 3:16 AM, Tejun Heo <t...@kernel.org> wrote:
> > Hello,
> >
> > On Fri, Apr 27, 2012 at 02:45:30PM +0900, KAMEZAWA Hiroyuki wrote:
> >> This is a v2 patch for preventing failure in memcg->pre_destroy().
> >> With this patch, ->pre_destroy() will never return error code and
> >> users will not see warning at rmdir(). And this work will simplify
> >> memcg->pre_destroy(), largely.
> >>
> >> This patch is based on linux-next + hugetlb memory control patches.
> >
> > Ergh... can you please set up a git branch somewhere for review
> > purposes?
> >
> I'm sorry...I can't. (To do that, I need to pass many my company's check.)
> I'll repost all a week later, hugetlb tree will be seen in memcg-devel or
> linux-next.

I can push it to memcg-devel tree if you want.

>
> Thanks,
> -Kame
>
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majo...@kvack.org. For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
> Don't email: <a href=mailto:"do...@kvack.org"> em...@kvack.org </a>

--
Michal Hocko
SUSE Labs
SUSE LINUX s.r.o.
Lihovarska 1060/12
190 00 Praha 9
Czech Republic

Tejun Heo

unread,

Apr 28, 2012, 5:32:04 PM4/28/12

to Hiroyuki Kamezawa, KAMEZAWA Hiroyuki, Linux Kernel, linu...@kvack.org, cgr...@vger.kernel.org, Michal Hocko, Johannes Weiner, Frederic Weisbecker, Glauber Costa, Han Ying, Aneesh Kumar K.V, Andrew Morton

Hello, KAME.

On Sat, Apr 28, 2012 at 06:31:38PM +0900, Hiroyuki Kamezawa wrote:
> > So, IMHO, just making pre_destroy() clean up its own charges and
> > always returning 0 is enough. There's no need to fix up old
> > non-critical race condition at this point in the patch stream. cgroup
> > rmdir simplification will make them disappear anyway.
>
> So, hmm, ok. I'll drop patch 7 & 8. memcg may return -EBUSY in very very
> race case but users will not see it in the most case.
> I'll fix limit, move-charge and use_hierarchy problem first.

IIUC, memcg can just return 0 when child creation races against
pre_destroy(). cgroup will retry if child exists after pre_destroy()
completion. If child comes and goes before cgroup checks its
existence, some charges may be lost but that race already exists and
it will be gone once the retry logic is removed. Also, returning
-errno will trigger WARN_ON() w/o the legacy behavior flag.

Thank you very much.

--
tejun

Michal Hocko

unread,

Apr 29, 2012, 2:03:20 AM4/29/12

to Hiroyuki Kamezawa, Tejun Heo, KAMEZAWA Hiroyuki, Linux Kernel, linu...@kvack.org, cgr...@vger.kernel.org, Johannes Weiner, Frederic Weisbecker, Glauber Costa, Han Ying, Aneesh Kumar K.V, Andrew Morton

On Sat 28-04-12 18:13:58, Michal Hocko wrote:
> On Sat 28-04-12 08:48:18, Hiroyuki Kamezawa wrote:
> > On Sat, Apr 28, 2012 at 3:16 AM, Tejun Heo <t...@kernel.org> wrote:
> > > Hello,
> > >
> > > On Fri, Apr 27, 2012 at 02:45:30PM +0900, KAMEZAWA Hiroyuki wrote:
> > >> This is a v2 patch for preventing failure in memcg->pre_destroy().
> > >> With this patch, ->pre_destroy() will never return error code and
> > >> users will not see warning at rmdir(). And this work will simplify
> > >> memcg->pre_destroy(), largely.
> > >>
> > >> This patch is based on linux-next + hugetlb memory control patches.
> > >
> > > Ergh... can you please set up a git branch somewhere for review
> > > purposes?
> > >
> > I'm sorry...I can't. (To do that, I need to pass many my company's check.)
> > I'll repost all a week later, hugetlb tree will be seen in memcg-devel or
> > linux-next.
>
> I can push it to memcg-devel tree if you want.

As a separate branch of course...

Aneesh Kumar K.V

unread,

Apr 30, 2012, 5:06:05 AM4/30/12

to KAMEZAWA Hiroyuki, Linux Kernel, linu...@kvack.org, cgr...@vger.kernel.org, Michal Hocko, Johannes Weiner, Frederic Weisbecker, Glauber Costa, Tejun Heo, Han Ying, Andrew Morton, kamezawa...@gmail.com

KAMEZAWA Hiroyuki <kamezaw...@jp.fujitsu.com> writes:

> By using res_counter_uncharge_until(), we can avoid
> unnecessary charging.
>
> Signed-off-by: KAMEZAWA Hiroyuki <kamezaw...@jp.fujitsu.com>

Reviewed-by: Aneesh Kumar K.V <aneesh...@linux.vnet.ibm.com>

> ---
> mm/memcontrol.c | 63 ++++++++++++++++++++++++++++++++++++------------------
> 1 files changed, 42 insertions(+), 21 deletions(-)
>

> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 613bb15..ed53d64 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c

> @@ -2420,6 +2420,24 @@ static void __mem_cgroup_cancel_charge(struct mem_cgroup *memcg,
> }
>
> /*
> + * Cancel chages in this cgroup....doesn't propagates to parent cgroup.

> + * This is useful when moving usage to parent cgroup.
> + */
> +static void __mem_cgroup_cancel_local_charge(struct mem_cgroup *memcg,
> + unsigned int nr_pages)
> +{
> + if (!mem_cgroup_is_root(memcg)) {
> + unsigned long bytes = nr_pages * PAGE_SIZE;
> +
> + res_counter_uncharge_until(&memcg->res,
> + memcg->res.parent, bytes);
> + if (do_swap_account)
> + res_counter_uncharge_until(&memcg->memsw,
> + memcg->memsw.parent, bytes);
> + }
> +}

> +

> +/*
> * A helper function to get mem_cgroup from ID. must be called under
> * rcu_read_lock(). The caller must check css_is_removed() or some if
> * it's concern. (dropping refcnt from swap can be called against removed
> @@ -2677,16 +2695,28 @@ static int mem_cgroup_move_parent(struct page *page,
> nr_pages = hpage_nr_pages(page);
>
> parent = mem_cgroup_from_cont(pcg);

> - ret = __mem_cgroup_try_charge(NULL, gfp_mask, nr_pages, &parent, false);

> - if (ret)
> - goto put_back;
> + if (!parent->use_hierarchy) {

> + ret = __mem_cgroup_try_charge(NULL,
> + gfp_mask, nr_pages, &parent, false);

> + if (ret)
> + goto put_back;
> + }
>

> if (nr_pages > 1)
> flags = compound_lock_irqsave(page);
>
> - ret = mem_cgroup_move_account(page, nr_pages, pc, child, parent, true);
> - if (ret)
> - __mem_cgroup_cancel_charge(parent, nr_pages);
> + if (parent->use_hierarchy) {
> + ret = mem_cgroup_move_account(page, nr_pages,
> + pc, child, parent, false);
> + if (!ret)
> + __mem_cgroup_cancel_local_charge(child, nr_pages);
> + } else {
> + ret = mem_cgroup_move_account(page, nr_pages,
> + pc, child, parent, true);
> +
> + if (ret)
> + __mem_cgroup_cancel_charge(parent, nr_pages);
> + }
>

May be a comment around this ? I had to look closer to find why
there is a if (!ret) and if (ret) difference.

> if (nr_pages > 1)
> compound_unlock_irqrestore(page, flags);

> @@ -3295,6 +3325,7 @@ int mem_cgroup_move_hugetlb_parent(int idx, struct cgroup *cgroup,
> struct cgroup *pcgrp = cgroup->parent;
> struct mem_cgroup *parent = mem_cgroup_from_cont(pcgrp);
> struct mem_cgroup *memcg = mem_cgroup_from_cont(cgroup);
> + struct res_counter *counter;
>
> if (!get_page_unless_zero(page))
> goto out;
> @@ -3305,28 +3336,18 @@ int mem_cgroup_move_hugetlb_parent(int idx, struct cgroup *cgroup,
> goto err_out;
>
> csize = PAGE_SIZE << compound_order(page);
> - /*
> - * If we have use_hierarchy set we can never fail here. So instead of
> - * using res_counter_uncharge use the open-coded variant which just
> - * uncharge the child res_counter. The parent will retain the charge.
> - */
> - if (parent->use_hierarchy) {
> - unsigned long flags;
> - struct res_counter *counter;
> -
> - counter = &memcg->hugepage[idx];
> - spin_lock_irqsave(&counter->lock, flags);
> - res_counter_uncharge_locked(counter, csize);
> - spin_unlock_irqrestore(&counter->lock, flags);
> - } else {
> + /* If parent->use_hierarchy == 0, we need to charge parent */
> + if (!parent->use_hierarchy) {

> ret = res_counter_charge(&parent->hugepage[idx],

> csize, &fail_res);
> if (ret) {
> ret = -EBUSY;
> goto err_out;
> }
> - res_counter_uncharge(&memcg->hugepage[idx], csize);
> }
> + counter = &memcg->hugepage[idx];
> + res_counter_uncharge_until(counter, counter->parent, csize);
> +
> pc->mem_cgroup = parent;
> err_out:
> unlock_page_cgroup(pc);

-aneesh

Aneesh Kumar K.V

unread,

Apr 30, 2012, 5:10:08 AM4/30/12

to KAMEZAWA Hiroyuki, Linux Kernel, linu...@kvack.org, cgr...@vger.kernel.org, Michal Hocko, Johannes Weiner, Frederic Weisbecker, Glauber Costa, Tejun Heo, Han Ying, Andrew Morton, kamezawa...@gmail.com

KAMEZAWA Hiroyuki <kamezaw...@jp.fujitsu.com> writes:

> Now, at removal of cgroup, ->pre_destroy() is called and move charges
> to the parent cgroup. A major reason of -EBUSY returned by ->pre_destroy()
> is that the 'moving' hits parent's resource limitation. It happens only
> when use_hierarchy=0. This was a mistake of original design.(it's me...)
>

> Considering use_hierarchy=0, all cgroups are treated as flat. So, no one
> cannot justify moving charges to parent...parent and children are in
> flat configuration, not hierarchical.
>
> This patch modifes to move charges to root cgroup at rmdir/force_empty
> if use_hierarchy==0. This will much simplify rmdir() and reduce error
> in ->pre_destroy.
>

> Signed-off-by: KAMEZAWA Hiroyuki <kamezaw...@jp.fujitsu.com>

Reviewed-by: Anees Kumar K.V <aneesh...@linux.vnet.ibm.com>

> ---
> Documentation/cgroups/memory.txt | 12 ++++++----
> mm/memcontrol.c | 39 +++++++++++++------------------------
> 2 files changed, 21 insertions(+), 30 deletions(-)
>
> diff --git a/Documentation/cgroups/memory.txt b/Documentation/cgroups/memory.txt
> index 54c338d..82ce1ef 100644
> --- a/Documentation/cgroups/memory.txt
> +++ b/Documentation/cgroups/memory.txt
> @@ -393,14 +393,14 @@ cgroup might have some charge associated with it, even though all
> tasks have migrated away from it. (because we charge against pages, not
> against tasks.)
>
> -Such charges are freed or moved to their parent. At moving, both of RSS
> -and CACHES are moved to parent.
> -rmdir() may return -EBUSY if freeing/moving fails. See 5.1 also.
> +Such charges are freed or moved to their parent if use_hierarchy=1.
> +if use_hierarchy=0, the charges will be moved to root cgroup.
>

> Charges recorded in swap information is not updated at removal of cgroup.
> Recorded information is discarded and a cgroup which uses swap (swapcache)
> will be charged as a new owner of it.
>
> +About use_hierarchy, see Section 6.
>
> 5. Misc. interfaces.
>
> @@ -413,13 +413,15 @@ will be charged as a new owner of it.
>
> Almost all pages tracked by this memory cgroup will be unmapped and freed.
> Some pages cannot be freed because they are locked or in-use. Such pages are
> - moved to parent and this cgroup will be empty. This may return -EBUSY if
> - VM is too busy to free/move all pages immediately.
> + moved to parent(if use_hierarchy==1) or root (if use_hierarchy==0) and this
> + cgroup will be empty.
>
> Typical use case of this interface is that calling this before rmdir().
> Because rmdir() moves all pages to parent, some out-of-use page caches can be
> moved to the parent. If you want to avoid that, force_empty will be useful.
>
> + About use_hierarchy, see Section 6.
> +
> 5.2 stat file
>
> memory.stat file includes following statistics

> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index ed53d64..62200f1 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -2695,32 +2695,23 @@ static int mem_cgroup_move_parent(struct page *page,

> nr_pages = hpage_nr_pages(page);
>
> parent = mem_cgroup_from_cont(pcg);

> - if (!parent->use_hierarchy) {
> - ret = __mem_cgroup_try_charge(NULL,

> - gfp_mask, nr_pages, &parent, false);

> - if (ret)
> - goto put_back;

> - }
> + /*
> + * if use_hierarchy==0, move charges to root cgroup.
> + * in root cgroup, we don't touch res_counter

> + */
> + if (!parent->use_hierarchy)
> + parent = root_mem_cgroup;

>
> if (nr_pages > 1)
> flags = compound_lock_irqsave(page);
>

> - if (parent->use_hierarchy) {

> - ret = mem_cgroup_move_account(page, nr_pages,

> - pc, child, parent, false);
> - if (!ret)
> - __mem_cgroup_cancel_local_charge(child, nr_pages);
> - } else {

> - ret = mem_cgroup_move_account(page, nr_pages,

> - pc, child, parent, true);

> -
> - if (ret)
> - __mem_cgroup_cancel_charge(parent, nr_pages);

> - }

> + ret = mem_cgroup_move_account(page, nr_pages,
> + pc, child, parent, false);
> + if (!ret)
> + __mem_cgroup_cancel_local_charge(child, nr_pages);
>

> if (nr_pages > 1)
> compound_unlock_irqrestore(page, flags);

> -put_back:
> putback_lru_page(page);
> put:
> put_page(page);
> @@ -3338,12 +3329,10 @@ int mem_cgroup_move_hugetlb_parent(int idx, struct cgroup *cgroup,
> csize = PAGE_SIZE << compound_order(page);

> /* If parent->use_hierarchy == 0, we need to charge parent */

> if (!parent->use_hierarchy) {
> - ret = res_counter_charge(&parent->hugepage[idx],
> - csize, &fail_res);
> - if (ret) {
> - ret = -EBUSY;
> - goto err_out;
> - }
> + parent = root_mem_cgroup;
> + /* root has no limit */
> + res_counter_charge_nofail(&parent->hugepage[idx],
> + csize, &fail_res);
> }

> counter = &memcg->hugepage[idx];

> res_counter_uncharge_until(counter, counter->parent, csize);
> --

Ying Han

unread,

Apr 30, 2012, 1:03:01 PM4/30/12

to Hiroyuki Kamezawa, KAMEZAWA Hiroyuki, Linux Kernel, linu...@kvack.org, cgr...@vger.kernel.org, Michal Hocko, Johannes Weiner, Frederic Weisbecker, Glauber Costa, Tejun Heo, Aneesh Kumar K.V, Andrew Morton

Thanks ~

--Ying

Suleiman Souhlal

unread,

May 1, 2012, 6:28:54 PM5/1/12

to KAMEZAWA Hiroyuki, Linux Kernel, linu...@kvack.org, cgr...@vger.kernel.org, Michal Hocko, Johannes Weiner, Frederic Weisbecker, Glauber Costa, Tejun Heo, Han Ying, Aneesh Kumar K.V, Andrew Morton, kamezawa...@gmail.com

2012/4/26 KAMEZAWA Hiroyuki <kamezaw...@jp.fujitsu.com>:

> return -EBUSY;
> return ret;
> }
> @@ -3893,11 +3891,12 @@ static int mem_cgroup_force_empty(struct mem_cgroup *memcg, bool free_all)
> move_account:
> do {
> ret = -EBUSY;
> + /*
> + * This never happens when this is called by ->pre_destroy().
> + * But we need to take care of force_empty interface.
> + */
> if (cgroup_task_count(cgrp) || !list_empty(&cgrp->children))
> goto out;

Are you sure this never happens when called by ->pre_destroy()?
Can't a task still get attached to the cgroup while ->pre_destroy() is running?

At least, I don't see anything in the cgroup code that prevents
someone from newly attaching a task at that point.
In fact, there is code that seems to handle the case when someone
attached to the cgroup after pre_destroy() has run: See the
cgroup_wakeup_rmdir_waiter() call in cgroup_attach_task().

-- Suleiman

Hiroyuki Kamezawa

unread,

May 1, 2012, 11:34:44 PM5/1/12

to Suleiman Souhlal, KAMEZAWA Hiroyuki, Linux Kernel, linu...@kvack.org, cgr...@vger.kernel.org, Michal Hocko, Johannes Weiner, Frederic Weisbecker, Glauber Costa, Tejun Heo, Han Ying, Aneesh Kumar K.V, Andrew Morton

see whole series of patch series, 7 & 8 is against that probelm.
But they will be dropped and this race will remain. And this patch's
title will be
changed to be "remove -EINTR" rather than "remove failure of pre_destroy*.
pre_destrou() will continue to fail until cgroup core is fixed.

Thanks,
-Kame

[RFC][PATCH 0/7 v2] memcg: prevent failure in pre_destroy()

KAMEZAWA Hiroyuki

KAMEZAWA Hiroyuki

KAMEZAWA Hiroyuki

KAMEZAWA Hiroyuki

KAMEZAWA Hiroyuki

KAMEZAWA Hiroyuki

KAMEZAWA Hiroyuki

KAMEZAWA Hiroyuki

KAMEZAWA Hiroyuki

KAMEZAWA Hiroyuki

Frederic Weisbecker

Tejun Heo

Tejun Heo

Tejun Heo

Ying Han

Ying Han

Tejun Heo

Tejun Heo

Tejun Heo

Tejun Heo

Ying Han

Hiroyuki Kamezawa

Hiroyuki Kamezawa

Hiroyuki Kamezawa

Hiroyuki Kamezawa

Hiroyuki Kamezawa

Hiroyuki Kamezawa

Hiroyuki Kamezawa

Hiroyuki Kamezawa

Hiroyuki Kamezawa

Tejun Heo

Hiroyuki Kamezawa

Michal Hocko

Tejun Heo

Michal Hocko

Aneesh Kumar K.V

Aneesh Kumar K.V

Ying Han

Suleiman Souhlal

Hiroyuki Kamezawa