This is a v2 patch for preventing failure in memcg->pre_destroy().
With this patch, ->pre_destroy() will never return error code and
users will not see warning at rmdir(). And this work will simplify
memcg->pre_destroy(), largely.
This patch is based on linux-next + hugetlb memory control patches.
I post this as RFC because I'll have vacation in the next week and
hugetlb patches are not visible in linux-next yet.
So, I'm not in hurry. Please review when you have time.
I'll rebase this onto memcg-devel in the next post.
== BTW, memory cgroup's github is here == git://github.com/mstsxfx/memcg-devel.git
Since v1, Whole patch designs are changed. In this version, I didn't
remove ->pre_destroy() but make it succeed always. There are no
asynchronous operation and no big patches. But this introduces
2 changes to cgroup core.
After this series, if use_hierarchy==0, all resources will be moved
to root cgroup at rmdir() or force_empty().
Brief patch conents are
0001 : my version of compile-fix for linux-next, Aneesh will post his own version.
0002 : fix error code in hugetlb_force_memcg_empty
0003 : add res_counter_uncharge_until()
0004 : use res_counter_uncharge_until() at move_parent()
0005 : move charges to root cgroup at rmdir, if use_hierarchy=0
0006 : clean up mem_cgroup_move_account()
0007 : cgroup : avoid attaching task to cgroup where ->pre_destroy() is running.
0008 : cgroup : avoid creating a new cgroup under a cgroup where ->pre_destroy() is running.
0009 : remove -EINTR from memcg->pre_destroy().
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 17ae2e4..4dd6b39 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1922,8 +1922,11 @@ int hugetlb_force_memcg_empty(struct cgroup *cgroup)
int ret = 0, idx = 0;
do {
- if (cgroup_task_count(cgroup) || !list_empty(&cgroup->children))
+ if (cgroup_task_count(cgroup)
+ || !list_empty(&cgroup->children)) {
+ ret = -EBUSY;
goto out;
+ }
/*
* If the task doing the cgroup_rmdir got a signal
* we don't really need to loop till the hugetlb resource
-- 1.7.4.1
At killing res_counter which is a child of other counter,
we need to do
res_counter_uncharge(child, xxx)
res_counter_charge(parent, xxx)
This is not atomic and wasting cpu. This patch adds
res_counter_uncharge_until(). This function's uncharge propagates
to ancestors until specified res_counter.
diff --git a/Documentation/cgroups/resource_counter.txt b/Documentation/cgroups/resource_counter.txt
index 95b24d7..703103a 100644
--- a/Documentation/cgroups/resource_counter.txt
+++ b/Documentation/cgroups/resource_counter.txt
@@ -92,6 +92,14 @@ to work with it.
The _locked routines imply that the res_counter->lock is taken.
+ f. void res_counter_uncharge_until
+ (struct res_counter *rc, struct res_counter *top,
+ unsinged long val)
+
+ Almost same as res_cunter_uncharge() but propagation of uncharge
+ stops when rc == top. This is useful when kill a res_coutner in
+ child cgroup.
+
2.1 Other accounting routines
There are more routines that may help you with common needs, like
diff --git a/include/linux/res_counter.h b/include/linux/res_counter.h
index da81af0..d11c1cd 100644
--- a/include/linux/res_counter.h
+++ b/include/linux/res_counter.h
@@ -135,6 +135,9 @@ int __must_check res_counter_charge_nofail(struct res_counter *counter,
void res_counter_uncharge_locked(struct res_counter *counter, unsigned long val);
void res_counter_uncharge(struct res_counter *counter, unsigned long val);
+void res_counter_uncharge_until(struct res_counter *counter,
+ struct res_counter *top,
+ unsigned long val);
/**
* res_counter_margin - calculate chargeable space of a counter
* @cnt: the counter
diff --git a/kernel/res_counter.c b/kernel/res_counter.c
index d508363..f4ec411 100644
--- a/kernel/res_counter.c
+++ b/kernel/res_counter.c
@@ -66,6 +66,8 @@ done:
return ret;
}
+
+
int res_counter_charge_nofail(struct res_counter *counter, unsigned long val,
struct res_counter **limit_fail_at)
{
@@ -99,13 +101,15 @@ void res_counter_uncharge_locked(struct res_counter *counter, unsigned long val)
counter->usage -= val;
}
-void res_counter_uncharge(struct res_counter *counter, unsigned long val)
+void res_counter_uncharge_until(struct res_counter *counter,
+ struct res_counter *top,
+ unsigned long val)
{
unsigned long flags;
struct res_counter *c;
local_irq_save(flags);
- for (c = counter; c != NULL; c = c->parent) {
+ for (c = counter; c != top; c = c->parent) {
spin_lock(&c->lock);
res_counter_uncharge_locked(c, val);
spin_unlock(&c->lock);
@@ -113,6 +117,11 @@ void res_counter_uncharge(struct res_counter *counter, unsigned long val)
local_irq_restore(flags);
}
/*
+ * Cancel chages in this cgroup....doesn't propagates to parent cgroup.
+ * This is useful when moving usage to parent cgroup.
+ */
+static void __mem_cgroup_cancel_local_charge(struct mem_cgroup *memcg,
+ unsigned int nr_pages)
+{
+ if (!mem_cgroup_is_root(memcg)) {
+ unsigned long bytes = nr_pages * PAGE_SIZE;
+
+ res_counter_uncharge_until(&memcg->res,
+ memcg->res.parent, bytes);
+ if (do_swap_account)
+ res_counter_uncharge_until(&memcg->memsw,
+ memcg->memsw.parent, bytes);
+ }
+}
+
+/*
* A helper function to get mem_cgroup from ID. must be called under
* rcu_read_lock(). The caller must check css_is_removed() or some if
* it's concern. (dropping refcnt from swap can be called against removed
@@ -2677,16 +2695,28 @@ static int mem_cgroup_move_parent(struct page *page,
nr_pages = hpage_nr_pages(page);
parent = mem_cgroup_from_cont(pcg);
- ret = __mem_cgroup_try_charge(NULL, gfp_mask, nr_pages, &parent, false);
- if (ret)
- goto put_back;
+ if (!parent->use_hierarchy) {
+ ret = __mem_cgroup_try_charge(NULL,
+ gfp_mask, nr_pages, &parent, false);
+ if (ret)
+ goto put_back;
+ }
if (nr_pages > 1)
flags = compound_lock_irqsave(page);
- ret = mem_cgroup_move_account(page, nr_pages, pc, child, parent, true);
- if (ret)
- __mem_cgroup_cancel_charge(parent, nr_pages);
+ if (parent->use_hierarchy) {
+ ret = mem_cgroup_move_account(page, nr_pages,
+ pc, child, parent, false);
+ if (!ret)
+ __mem_cgroup_cancel_local_charge(child, nr_pages);
+ } else {
+ ret = mem_cgroup_move_account(page, nr_pages,
+ pc, child, parent, true);
+
+ if (ret)
+ __mem_cgroup_cancel_charge(parent, nr_pages);
+ }
if (!get_page_unless_zero(page))
goto out;
@@ -3305,28 +3336,18 @@ int mem_cgroup_move_hugetlb_parent(int idx, struct cgroup *cgroup,
goto err_out;
csize = PAGE_SIZE << compound_order(page);
- /*
- * If we have use_hierarchy set we can never fail here. So instead of
- * using res_counter_uncharge use the open-coded variant which just
- * uncharge the child res_counter. The parent will retain the charge.
- */
- if (parent->use_hierarchy) {
- unsigned long flags;
- struct res_counter *counter;
-
- counter = &memcg->hugepage[idx];
- spin_lock_irqsave(&counter->lock, flags);
- res_counter_uncharge_locked(counter, csize);
- spin_unlock_irqrestore(&counter->lock, flags);
- } else {
+ /* If parent->use_hierarchy == 0, we need to charge parent */
+ if (!parent->use_hierarchy) {
ret = res_counter_charge(&parent->hugepage[idx],
csize, &fail_res);
if (ret) {
ret = -EBUSY;
goto err_out;
}
- res_counter_uncharge(&memcg->hugepage[idx], csize);
}
+ counter = &memcg->hugepage[idx];
+ res_counter_uncharge_until(counter, counter->parent, csize);
+
pc->mem_cgroup = parent;
err_out:
unlock_page_cgroup(pc);
-- 1.7.4.1
Now, at removal of cgroup, ->pre_destroy() is called and move charges
to the parent cgroup. A major reason of -EBUSY returned by ->pre_destroy()
is that the 'moving' hits parent's resource limitation. It happens only
when use_hierarchy=0. This was a mistake of original design.(it's me...)
Considering use_hierarchy=0, all cgroups are treated as flat. So, no one
cannot justify moving charges to parent...parent and children are in
flat configuration, not hierarchical.
This patch modifes to move charges to root cgroup at rmdir/force_empty
if use_hierarchy==0. This will much simplify rmdir() and reduce error
in ->pre_destroy.
diff --git a/Documentation/cgroups/memory.txt b/Documentation/cgroups/memory.txt
index 54c338d..82ce1ef 100644
--- a/Documentation/cgroups/memory.txt
+++ b/Documentation/cgroups/memory.txt
@@ -393,14 +393,14 @@ cgroup might have some charge associated with it, even though all
tasks have migrated away from it. (because we charge against pages, not
against tasks.)
-Such charges are freed or moved to their parent. At moving, both of RSS
-and CACHES are moved to parent.
-rmdir() may return -EBUSY if freeing/moving fails. See 5.1 also.
+Such charges are freed or moved to their parent if use_hierarchy=1.
+if use_hierarchy=0, the charges will be moved to root cgroup.
Charges recorded in swap information is not updated at removal of cgroup.
Recorded information is discarded and a cgroup which uses swap (swapcache)
will be charged as a new owner of it.
+About use_hierarchy, see Section 6.
5. Misc. interfaces.
@@ -413,13 +413,15 @@ will be charged as a new owner of it.
Almost all pages tracked by this memory cgroup will be unmapped and freed.
Some pages cannot be freed because they are locked or in-use. Such pages are
- moved to parent and this cgroup will be empty. This may return -EBUSY if
- VM is too busy to free/move all pages immediately.
+ moved to parent(if use_hierarchy==1) or root (if use_hierarchy==0) and this
+ cgroup will be empty.
Typical use case of this interface is that calling this before rmdir().
Because rmdir() moves all pages to parent, some out-of-use page caches can be
moved to the parent. If you want to avoid that, force_empty will be useful.
+ About use_hierarchy, see Section 6.
+
5.2 stat file
memory.stat file includes following statistics
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index ed53d64..62200f1 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2695,32 +2695,23 @@ static int mem_cgroup_move_parent(struct page *page,
nr_pages = hpage_nr_pages(page);
parent = mem_cgroup_from_cont(pcg);
- if (!parent->use_hierarchy) {
- ret = __mem_cgroup_try_charge(NULL,
- gfp_mask, nr_pages, &parent, false);
- if (ret)
- goto put_back;
- }
+ /*
+ * if use_hierarchy==0, move charges to root cgroup.
+ * in root cgroup, we don't touch res_counter
+ */
+ if (!parent->use_hierarchy)
+ parent = root_mem_cgroup;
if (nr_pages > 1)
flags = compound_lock_irqsave(page);
- if (parent->use_hierarchy) {
- ret = mem_cgroup_move_account(page, nr_pages,
- pc, child, parent, false);
- if (!ret)
- __mem_cgroup_cancel_local_charge(child, nr_pages);
- } else {
- ret = mem_cgroup_move_account(page, nr_pages,
- pc, child, parent, true);
-
- if (ret)
- __mem_cgroup_cancel_charge(parent, nr_pages);
- }
+ ret = mem_cgroup_move_account(page, nr_pages,
+ pc, child, parent, false);
+ if (!ret)
+ __mem_cgroup_cancel_local_charge(child, nr_pages);
if (nr_pages > 1)
compound_unlock_irqrestore(page, flags);
-put_back:
putback_lru_page(page);
put:
put_page(page);
@@ -3338,12 +3329,10 @@ int mem_cgroup_move_hugetlb_parent(int idx, struct cgroup *cgroup,
csize = PAGE_SIZE << compound_order(page);
/* If parent->use_hierarchy == 0, we need to charge parent */
if (!parent->use_hierarchy) {
- ret = res_counter_charge(&parent->hugepage[idx],
- csize, &fail_res);
- if (ret) {
- ret = -EBUSY;
- goto err_out;
- }
+ parent = root_mem_cgroup;
+ /* root has no limit */
+ res_counter_charge_nofail(&parent->hugepage[idx],
+ csize, &fail_res);
}
counter = &memcg->hugepage[idx];
res_counter_uncharge_until(counter, counter->parent, csize);
-- 1.7.4.1
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 62200f1..2715223 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2589,23 +2589,19 @@ void mem_cgroup_split_huge_fixup(struct page *head)
* @pc: page_cgroup of the page.
* @from: mem_cgroup which the page is moved from.
* @to: mem_cgroup which the page is moved to. @from != @to.
- * @uncharge: whether we should call uncharge and css_put against @from.
*
* The caller must confirm following.
* - page is not on LRU (isolate_page() is useful.)
* - compound_lock is held when nr_pages > 1
*
- * This function doesn't do "charge" nor css_get to new cgroup. It should be
- * done by a caller(__mem_cgroup_try_charge would be useful). If @uncharge is
- * true, this function does "uncharge" from old cgroup, but it doesn't if
- * @uncharge is false, so a caller should do "uncharge".
+ * This function doesn't do "charge" to new cgroup and doesn't do "uncharge"
+ * from old cgroup.
*/
static int mem_cgroup_move_account(struct page *page,
unsigned int nr_pages,
struct page_cgroup *pc,
struct mem_cgroup *from,
- struct mem_cgroup *to,
- bool uncharge)
+ struct mem_cgroup *to)
{
unsigned long flags;
int ret;
@@ -2639,9 +2635,6 @@ static int mem_cgroup_move_account(struct page *page,
preempt_enable();
}
mem_cgroup_charge_statistics(from, anon, -nr_pages);
- if (uncharge)
- /* This is not "cancel", but cancel_charge does all we need. */
- __mem_cgroup_cancel_charge(from, nr_pages);
/* caller should have done css_get */
pc->mem_cgroup = to;
@@ -2706,7 +2699,7 @@ static int mem_cgroup_move_parent(struct page *page,
flags = compound_lock_irqsave(page);
ret = mem_cgroup_move_account(page, nr_pages,
- pc, child, parent, false);
+ pc, child, parent);
if (!ret)
__mem_cgroup_cancel_local_charge(child, nr_pages);
@@ -5724,8 +5717,7 @@ static int mem_cgroup_move_charge_pte_range(pmd_t *pmd,
if (!isolate_lru_page(page)) {
pc = lookup_page_cgroup(page);
if (!mem_cgroup_move_account(page, HPAGE_PMD_NR,
- pc, mc.from, mc.to,
- false)) {
+ pc, mc.from, mc.to)) {
mc.precharge -= HPAGE_PMD_NR;
mc.moved_charge += HPAGE_PMD_NR;
}
@@ -5755,7 +5747,7 @@ retry:
goto put;
pc = lookup_page_cgroup(page);
if (!mem_cgroup_move_account(page, 1, pc,
- mc.from, mc.to, false)) {
+ mc.from, mc.to)) {
mc.precharge--;
/* we uncharge from mc.from later. */
mc.moved_charge++;
-- 1.7.4.1
attach_task() is done under cgroup_mutex() but ->pre_destroy() callback
in rmdir() isn't called under cgroup_mutex().
It's better to avoid attaching a task to a cgroup which
is under pre_destroy(). Considering memcg, the attached task may
increase resource usage after memcg's pre_destroy() confirms that
memcg is empty. This is not good.
/*
* In general, subsystem has no css->refcnt after pre_destroy(). But
@@ -4193,6 +4195,7 @@ again:
* and css_tryget() and cgroup_wakeup_rmdir_waiter() implementation.
*/
set_bit(CGRP_WAIT_ON_RMDIR, &cgrp->flags);
+ mutex_unlock(&cgroup_mutex);
When ->pre_destroy() is called, it should be guaranteed that
new child cgroup is not created under a cgroup, where pre_destroy()
is running. If not, ->pre_destroy() must check children and
return -EBUSY, which causes warning.
When force_empty() called by ->pre_destroy(), no memory reclaim happens
and it doesn't take very long time which requires signal_pending() check.
And if we return -EINTR from pre_destroy(), cgroup.c show warning.
This patch removes signal check in force_empty(). By this, ->pre_destroy()
returns success always.
Note: check for 'cgroup is empty' remains for force_empty interface.
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 4dd6b39..770f1642 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1922,20 +1922,12 @@ int hugetlb_force_memcg_empty(struct cgroup *cgroup)
int ret = 0, idx = 0;
do {
+ /* see memcontrol.c::mem_cgroup_force_empty() */
if (cgroup_task_count(cgroup)
|| !list_empty(&cgroup->children)) {
ret = -EBUSY;
goto out;
}
- /*
- * If the task doing the cgroup_rmdir got a signal
- * we don't really need to loop till the hugetlb resource
- * usage become zero.
- */
- if (signal_pending(current)) {
- ret = -EINTR;
- goto out;
- }
for_each_hstate(h) {
spin_lock(&hugetlb_lock);
list_for_each_entry(page, &h->hugepage_activelist, lru) {
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 2715223..ee350c5 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -3852,8 +3852,6 @@ static int mem_cgroup_force_empty_list(struct mem_cgroup *memcg,
pc = lookup_page_cgroup(page);
ret = mem_cgroup_move_parent(page, pc, memcg, GFP_KERNEL);
- if (ret == -ENOMEM || ret == -EINTR)
- break;
if (ret == -EBUSY || ret == -EINVAL) {
/* found lock contention or "pc" is obsolete. */
@@ -3863,7 +3861,7 @@ static int mem_cgroup_force_empty_list(struct mem_cgroup *memcg,
busy = NULL;
}
- if (!ret && !list_empty(list))
+ if (!loop)
return -EBUSY;
return ret;
}
@@ -3893,11 +3891,12 @@ static int mem_cgroup_force_empty(struct mem_cgroup *memcg, bool free_all)
move_account:
do {
ret = -EBUSY;
+ /*
+ * This never happens when this is called by ->pre_destroy().
+ * But we need to take care of force_empty interface.
+ */
if (cgroup_task_count(cgrp) || !list_empty(&cgrp->children))
goto out;
- ret = -EINTR;
- if (signal_pending(current))
- goto out;
/* This is for making all *used* pages to be on LRU. */
lru_add_drain_all();
drain_all_stock_sync(memcg);
@@ -3918,9 +3917,6 @@ move_account:
}
mem_cgroup_end_move(memcg);
memcg_oom_recover(memcg);
- /* it seems parent cgroup doesn't have enough mem */
- if (ret == -ENOMEM)
- goto try_to_free;
cond_resched();
/* "ret" should also be checked to ensure all lists are empty. */
} while (res_counter_read_u64(&memcg->res, RES_USAGE) > 0 || ret);
-- 1.7.4.1
On Fri, Apr 27, 2012 at 03:02:22PM +0900, KAMEZAWA Hiroyuki wrote:
> attach_task() is done under cgroup_mutex() but ->pre_destroy() callback
> in rmdir() isn't called under cgroup_mutex().
> It's better to avoid attaching a task to a cgroup which
> is under pre_destroy(). Considering memcg, the attached task may
> increase resource usage after memcg's pre_destroy() confirms that
> memcg is empty. This is not good.
> /*
> * In general, subsystem has no css->refcnt after pre_destroy(). But
> @@ -4193,6 +4195,7 @@ again:
> * and css_tryget() and cgroup_wakeup_rmdir_waiter() implementation.
> */
> set_bit(CGRP_WAIT_ON_RMDIR, &cgrp->flags);
> + mutex_unlock(&cgroup_mutex);
On Fri, Apr 27, 2012 at 02:45:30PM +0900, KAMEZAWA Hiroyuki wrote:
> This is a v2 patch for preventing failure in memcg->pre_destroy().
> With this patch, ->pre_destroy() will never return error code and
> users will not see warning at rmdir(). And this work will simplify
> memcg->pre_destroy(), largely.
> This patch is based on linux-next + hugetlb memory control patches.
Ergh... can you please set up a git branch somewhere for review
purposes?
On Fri, Apr 27, 2012 at 10:16 AM, Glauber Costa <glom...@parallels.com> wrote:
> On 04/27/2012 02:54 AM, KAMEZAWA Hiroyuki wrote:
>> By using res_counter_uncharge_until(), we can avoid
>> unnecessary charging.
>> /*
>> + * Cancel chages in this cgroup....doesn't propagates to parent cgroup.
>> + * This is useful when moving usage to parent cgroup.
>> + */
>> +static void __mem_cgroup_cancel_local_charge(struct mem_cgroup *memcg,
>> + unsigned int nr_pages)
>> +{
>> + if (!mem_cgroup_is_root(memcg)) {
>> + unsigned long bytes = nr_pages * PAGE_SIZE;
>> +
>> + res_counter_uncharge_until(&memcg->res,
>> + memcg->res.parent, bytes);
>> + if (do_swap_account)
>> + res_counter_uncharge_until(&memcg->memsw,
>> + memcg->memsw.parent, bytes);
>> + }
>> +}
> Kame, this is a nitpick, but I usually prefer to write this like:
> if (mem_cgroup_is_root(memcg))
> return;
> res_counter...
> Specially with memcg, where function names are bigger than average, in
> comparison.
> the code itself seems fine.
>> +/*
>> * A helper function to get mem_cgroup from ID. must be called under
>> * rcu_read_lock(). The caller must check css_is_removed() or some if
>> * it's concern. (dropping refcnt from swap can be called against removed
>> @@ -2677,16 +2695,28 @@ static int mem_cgroup_move_parent(struct page *page,
>> nr_pages = hpage_nr_pages(page);
>> parent = mem_cgroup_from_cont(pcg);
>> - ret = __mem_cgroup_try_charge(NULL, gfp_mask, nr_pages,&parent, false);
>> - if (ret)
>> - goto put_back;
>> + if (!parent->use_hierarchy) {
> Can we avoid testing for use hierarchy ?
> Specially given this might go away.
> parent_mem_cgroup() already bundles this information. So maybe we can
> test for parent_mem_cgroup(parent) == NULL. It is the same thing after all.
>> + ret = __mem_cgroup_try_charge(NULL,
>> + gfp_mask, nr_pages,&parent, false);
>> + if (ret)
>> + goto put_back;
>> + }
> Why? If we are not hierarchical, we should not charge the parent, right?
This is how it is implemented today and I think he changed that to
move to root on the next patch.
On Thu, Apr 26, 2012 at 10:58 PM, KAMEZAWA Hiroyuki
<kamezawa.hir...@jp.fujitsu.com> wrote:
> Now, at removal of cgroup, ->pre_destroy() is called and move charges
> to the parent cgroup. A major reason of -EBUSY returned by ->pre_destroy()
> is that the 'moving' hits parent's resource limitation. It happens only
> when use_hierarchy=0. This was a mistake of original design.(it's me...)
Nice patch, i can see how broken it is now with use_hierarchy=0...
> Considering use_hierarchy=0, all cgroups are treated as flat. So, no one
> cannot justify moving charges to parent...parent and children are in
> flat configuration, not hierarchical.
> This patch modifes to move charges to root cgroup at rmdir/force_empty
> if use_hierarchy==0. This will much simplify rmdir() and reduce error
> in ->pre_destroy.
> diff --git a/Documentation/cgroups/memory.txt b/Documentation/cgroups/memory.txt
> index 54c338d..82ce1ef 100644
> --- a/Documentation/cgroups/memory.txt
> +++ b/Documentation/cgroups/memory.txt
> @@ -393,14 +393,14 @@ cgroup might have some charge associated with it, even though all
> tasks have migrated away from it. (because we charge against pages, not
> against tasks.)
> -Such charges are freed or moved to their parent. At moving, both of RSS
> -and CACHES are moved to parent.
> -rmdir() may return -EBUSY if freeing/moving fails. See 5.1 also.
> +Such charges are freed or moved to their parent if use_hierarchy=1.
> +if use_hierarchy=0, the charges will be moved to root cgroup.
It is more clear that we move the stats to root (if use_hierarchy==0)
or parent (if use_hierarchy==1), and no change on the charge except
uncharging from the child.
> Charges recorded in swap information is not updated at removal of cgroup.
> Recorded information is discarded and a cgroup which uses swap (swapcache)
> will be charged as a new owner of it.
> +About use_hierarchy, see Section 6.
> 5. Misc. interfaces.
> @@ -413,13 +413,15 @@ will be charged as a new owner of it.
> Almost all pages tracked by this memory cgroup will be unmapped and freed.
> Some pages cannot be freed because they are locked or in-use. Such pages are
> - moved to parent and this cgroup will be empty. This may return -EBUSY if
> - VM is too busy to free/move all pages immediately.
> + moved to parent(if use_hierarchy==1) or root (if use_hierarchy==0) and this
> + cgroup will be empty.
> Typical use case of this interface is that calling this before rmdir().
> Because rmdir() moves all pages to parent, some out-of-use page caches can be
> moved to the parent. If you want to avoid that, force_empty will be useful.
> + About use_hierarchy, see Section 6.
> +
> 5.2 stat file
> memory.stat file includes following statistics
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index ed53d64..62200f1 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -2695,32 +2695,23 @@ static int mem_cgroup_move_parent(struct page *page,
> nr_pages = hpage_nr_pages(page);
> parent = mem_cgroup_from_cont(pcg);
> - if (!parent->use_hierarchy) {
> - ret = __mem_cgroup_try_charge(NULL,
> - gfp_mask, nr_pages, &parent, false);
> - if (ret)
> - goto put_back;
> - }
> + /*
> + * if use_hierarchy==0, move charges to root cgroup.
> + * in root cgroup, we don't touch res_counter
> + */
> + if (!parent->use_hierarchy)
> + parent = root_mem_cgroup;
> if (nr_pages > 1)
> flags = compound_lock_irqsave(page);
On Fri, Apr 27, 2012 at 03:02:22PM +0900, KAMEZAWA Hiroyuki wrote:
> attach_task() is done under cgroup_mutex() but ->pre_destroy() callback
> in rmdir() isn't called under cgroup_mutex().
> It's better to avoid attaching a task to a cgroup which
> is under pre_destroy(). Considering memcg, the attached task may
> increase resource usage after memcg's pre_destroy() confirms that
> memcg is empty. This is not good.
Hmm... once memcg's pre_destroy() can't fail, I think what we should
do is marking a cgroup DEAD before calling pre_destroy() and the
existing cgroup_is_removed() check should be enough. Patches upto
this point already make ->pre_destroy() not fail, right?
On Fri, Apr 27, 2012 at 01:31:59PM -0700, Tejun Heo wrote:
> On Fri, Apr 27, 2012 at 03:02:22PM +0900, KAMEZAWA Hiroyuki wrote:
> > attach_task() is done under cgroup_mutex() but ->pre_destroy() callback
> > in rmdir() isn't called under cgroup_mutex().
> > It's better to avoid attaching a task to a cgroup which
> > is under pre_destroy(). Considering memcg, the attached task may
> > increase resource usage after memcg's pre_destroy() confirms that
> > memcg is empty. This is not good.
> Hmm... once memcg's pre_destroy() can't fail, I think what we should
> do is marking a cgroup DEAD before calling pre_destroy() and the
> existing cgroup_is_removed() check should be enough. Patches upto
> this point already make ->pre_destroy() not fail, right?
Ooh, I was confused about patch order. This patch, probably with the
update suggested by Frederic looks good as an interim solution for
now.
On Fri, Apr 27, 2012 at 03:04:14PM +0900, KAMEZAWA Hiroyuki wrote:
> When ->pre_destroy() is called, it should be guaranteed that
> new child cgroup is not created under a cgroup, where pre_destroy()
> is running. If not, ->pre_destroy() must check children and
> return -EBUSY, which causes warning.
Hmm... I'm getting confused more. Why do we need these cgroup changes
at all? cgroup still has cgrp->count check and
cgroup_clear_css_refs() after pre_destroy() calls. The order of
changes should be,
* Make memcg pre_destroy() not fail; however, pre_destroy() should
still be ready to be retried. That's the defined interface.
* cgroup core updated to drop pre_destroy() retrying and guarantee
that pre_destroy() invocation will happen only once.
* memcg and other cgroups can update their pre_destroy() if the "won't
be retried" part can simplify their implementations.
So, there's no reason to be updating cgroup pre_destroy() semantics at
this point and these updates actually break cgroup API as it currently
stands. The only change necessary is memcg's pre_destroy() not
returning zero.
<kamezawa.hir...@jp.fujitsu.com> wrote:
> When force_empty() called by ->pre_destroy(), no memory reclaim happens
> and it doesn't take very long time which requires signal_pending() check.
> And if we return -EINTR from pre_destroy(), cgroup.c show warning.
> This patch removes signal check in force_empty(). By this, ->pre_destroy()
> returns success always.
> Note: check for 'cgroup is empty' remains for force_empty interface.
> return -EBUSY;
> return ret;
> }
> @@ -3893,11 +3891,12 @@ static int mem_cgroup_force_empty(struct mem_cgroup *memcg, bool free_all)
> move_account:
> do {
> ret = -EBUSY;
> + /*
> + * This never happens when this is called by ->pre_destroy().
> + * But we need to take care of force_empty interface.
> + */
> if (cgroup_task_count(cgrp) || !list_empty(&cgrp->children))
> goto out;
> - ret = -EINTR;
> - if (signal_pending(current))
> - goto out;
> /* This is for making all *used* pages to be on LRU. */
> lru_add_drain_all();
> drain_all_stock_sync(memcg);
> @@ -3918,9 +3917,6 @@ move_account:
> }
> mem_cgroup_end_move(memcg);
> memcg_oom_recover(memcg);
> - /* it seems parent cgroup doesn't have enough mem */
> - if (ret == -ENOMEM)
> - goto try_to_free;
> cond_resched();
> /* "ret" should also be checked to ensure all lists are empty. */
> } while (res_counter_read_u64(&memcg->res, RES_USAGE) > 0 || ret);
> --
> 1.7.4.1
On Sat, Apr 28, 2012 at 3:16 AM, Tejun Heo <t...@kernel.org> wrote:
> Hello,
> On Fri, Apr 27, 2012 at 02:45:30PM +0900, KAMEZAWA Hiroyuki wrote:
>> This is a v2 patch for preventing failure in memcg->pre_destroy().
>> With this patch, ->pre_destroy() will never return error code and
>> users will not see warning at rmdir(). And this work will simplify
>> memcg->pre_destroy(), largely.
>> This patch is based on linux-next + hugetlb memory control patches.
> Ergh... can you please set up a git branch somewhere for review
> purposes?
I'm sorry...I can't. (To do that, I need to pass many my company's check.)
I'll repost all a week later, hugetlb tree will be seen in memcg-devel or
linux-next.
On Sat, Apr 28, 2012 at 2:08 AM, Glauber Costa <glom...@parallels.com> wrote:
> On 04/27/2012 02:53 AM, KAMEZAWA Hiroyuki wrote:
>> From bb0168d5c85f62f36434956e4728a67d0cc41e55 Mon Sep 17 00:00:00 2001
>> From: KAMEZAWA Hiroyuki<kamezawa.hir...@jp.fujitsu.com>
>> Date: Thu, 26 Apr 2012 18:48:07 +0900
>> Subject: [PATCH 3/9] memcg: add res_counter_uncharge_until()
>> At killing res_counter which is a child of other counter,
>> we need to do
>> res_counter_uncharge(child, xxx)
>> res_counter_charge(parent, xxx)
>> This is not atomic and wasting cpu. This patch adds
>> res_counter_uncharge_until(). This function's uncharge propagates
>> to ancestors until specified res_counter.
>> res_counter_uncharge_until(child, parent, xxx)
>> This ops is atomic and more efficient.
>> Originaly-written-by: Frederic Weisbecker<fweis...@gmail.com>
>> Signed-off-by: KAMEZAWA Hiroyuki<kamezawa.hir...@jp.fujitsu.com>
> I have been carrying Frederic's patch itself in my code for a while now.
> Why not just use it? What are you doing differently to justify writing a
> patch yourself? It's a bit of credit giving as well
I don't need "charge" part for my purpose and have no justification to add it.
And task-limit cgroup has no justification, either.
On Sat, Apr 28, 2012 at 2:16 AM, Glauber Costa <glom...@parallels.com> wrote:
> On 04/27/2012 02:54 AM, KAMEZAWA Hiroyuki wrote:
>> By using res_counter_uncharge_until(), we can avoid
>> unnecessary charging.
>> +/*
>> * A helper function to get mem_cgroup from ID. must be called under
>> * rcu_read_lock(). The caller must check css_is_removed() or some if
>> * it's concern. (dropping refcnt from swap can be called against removed
>> @@ -2677,16 +2695,28 @@ static int mem_cgroup_move_parent(struct page *page,
>> nr_pages = hpage_nr_pages(page);
>> parent = mem_cgroup_from_cont(pcg);
>> - ret = __mem_cgroup_try_charge(NULL, gfp_mask, nr_pages,&parent, false);
>> - if (ret)
>> - goto put_back;
>> + if (!parent->use_hierarchy) {
> Can we avoid testing for use hierarchy ?
> Specially given this might go away.
> parent_mem_cgroup() already bundles this information. So maybe we can
> test for parent_mem_cgroup(parent) == NULL. It is the same thing after all.
We need to find parent even if use_hierarchy==0 in this patch.
I'll consider to use it in later patch, thank you for pointing out.