The per-cgroup dirty limit caps the maximum amount of dirty (hard to reclaim)
page cache that any single cgroup can use. So, in the case of multiple cgroup
writers, no cgroup will be able to consume more than its designated share of
dirty pages, and it will be forced to perform write-out if it crosses that limit.
The overall design is the following:
- account dirty pages per cgroup
- limit the number of dirty pages via memory.dirty_ratio / memory.dirty_bytes
and memory.dirty_background_ratio / memory.dirty_background_bytes in
cgroupfs
- start to write-out (background or actively) when the cgroup limits are
exceeded
This feature is meant to work closely with any underlying IO controller
implementation, so that we can stop increasing dirty pages at the VM layer
and enforce a write-out before any cgroup consumes the global amount of
dirty pages defined by the /proc/sys/vm/dirty_ratio|dirty_bytes and
/proc/sys/vm/dirty_background_ratio|dirty_background_bytes limits.
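Just to make the interaction between the per-cgroup and the global limits more
concrete, here is a minimal sketch (not actual code from the patchset; the
helpers memcg_dirty_limit_pages(), global_dirty_limit_pages() and
memcg_dirty_pages() are hypothetical placeholders for the real statistics):
==
/* A cgroup can never dirty more than the system-wide limit. */
static unsigned long effective_dirty_limit(struct mem_cgroup *memcg)
{
	unsigned long global_limit = global_dirty_limit_pages();

	if (!memcg)	/* root / no cgroup: the global rules apply */
		return global_limit;

	return min(memcg_dirty_limit_pages(memcg), global_limit);
}

/* Write-out (background or direct) starts once this returns true. */
static bool memcg_over_dirty_limit(struct mem_cgroup *memcg)
{
	return memcg_dirty_pages(memcg) > effective_dirty_limit(memcg);
}
==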
Changelog (v5 -> v6)
~~~~~~~~~~~~~~~~~~~~~~
* always disable/enable IRQs at lock/unlock_page_cgroup(): this allows dropping
  the previous complicated locking scheme in favor of a simpler one, even
  though this obviously adds some overhead (see the results below and the
  rough sketch right after this changelog)
* drop FUSE and NILFS2 dirty pages accounting for now (this depends on
charging bounce pages per cgroup)
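Roughly, the simple scheme amounts to something like the following (just a
sketch, not the patch itself; the *_irqsave/_irqrestore names are only
illustrative):
==
static inline unsigned long lock_page_cgroup_irqsave(struct page_cgroup *pc)
{
	unsigned long flags;

	/*
	 * IRQs are always disabled around the page_cgroup bit-spinlock,
	 * so the lock can be taken from any context without a separate,
	 * more complicated locking protocol.
	 */
	local_irq_save(flags);
	bit_spin_lock(PCG_LOCK, &pc->flags);
	return flags;
}

static inline void unlock_page_cgroup_irqrestore(struct page_cgroup *pc,
						 unsigned long flags)
{
	bit_spin_unlock(PCG_LOCK, &pc->flags);
	local_irq_restore(flags);
}
==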
Results
~~~~~~~
I ran some tests using a kernel build (2.6.33 x86_64_defconfig) on an
Intel Core 2 @ 1.2GHz as the test case, using different kernels:
- mmotm "vanilla"
- mmotm with cgroup-dirty-memory using the previous "complex" locking scheme
(my previous patchset + the fixes reported by Kame-san and Daisuke-san)
- mmotm with cgroup-dirty-memory using the simple locking scheme
(lock_page_cgroup() with IRQs disabled)
Here are the results:
<before>
- mmotm "vanilla", root cgroup: 11m51.983s
- mmotm "vanilla", child cgroup: 11m56.596s
<after>
- mmotm, "complex" locking scheme, root cgroup: 11m53.037s
- mmotm, "complex" locking scheme, child cgroup: 11m57.896s
- mmotm, lock_page_cgroup+irq_disabled, root cgroup: 12m5.499s
- mmotm, lock_page_cgroup+irq_disabled, child cgroup: 12m9.920s
With the "complex" locking solution, the overhead introduced by the
cgroup dirty memory accounting is minimal (0.14%), compared with the overhead
introduced by the lock_page_cgroup+irq_disabled solution (1.90%).
The performance overhead is not huge with either solution, but the impact on
performance is clearly smaller with the complicated one...
Maybe we can go ahead with the simplest implementation for now and start
thinking about an alternative implementation of the page_cgroup locking and
charge/uncharge of pages.
If someone is interested or wants to repeat the tests (maybe on a bigger
machine), I can also post the other version of the patchset. Just let me know.
-Andrea
Documentation/cgroups/memory.txt | 36 +++
fs/nfs/write.c | 4 +
include/linux/memcontrol.h | 87 +++++++-
include/linux/page_cgroup.h | 42 ++++-
include/linux/writeback.h | 2 -
mm/filemap.c | 1 +
mm/memcontrol.c | 475 +++++++++++++++++++++++++++++++++-----
mm/page-writeback.c | 215 +++++++++++-------
mm/rmap.c | 4 +-
mm/truncate.c | 1 +
10 files changed, 722 insertions(+), 145 deletions(-)
This is a cause for big concern; any chance you could test this on a large
system? I am concerned about the root overhead the most.
--
Three Cheers,
Balbir
Hmm....isn't this bigger than expected ?
> The performance overhead is not huge with either solution, but the impact on
> performance is clearly smaller with the complicated one...
>
> Maybe we can go ahead with the simplest implementation for now and start
> thinking about an alternative implementation of the page_cgroup locking and
> charge/uncharge of pages.
>
Maybe. But in these 2 years, one of our biggest concerns has been performance.
So we do something complex in memcg. But complex locking is, yes, complex.
Hmm.. I wouldn't bet that we can fix the locking scheme without something complex.
Thanks,
-Kame
IIUC, this series affects the trigger for background write-out.
Could you show some scores where dirty_ratio gives us a benefit, in the cases of
- copying a file in a memcg which hits its limit
  ex) copying a 100M file under a 120MB limit, etc.
- kernel make performance in a limited memcg
  ex) making a kernel under a 100MB limit (too large ?)
etc.... (when an application does many writes and hits the memcg's limit.)
But please get enough acks for the changes to the generic dirty_ratio code.
Thank you for your work.
Regards,
FWIW bit spinlocks suck massive.
> >
> > Maybe. But in these 2 years, one of our biggest concerns has been performance.
> > So we do something complex in memcg. But complex locking is, yes, complex.
> > Hmm.. I wouldn't bet that we can fix the locking scheme without something complex.
> >
> But the overall patch set seems good (to me). And dirty_ratio and dirty_background_ratio
> will give us more benefit (in performance) than we lose to the small overheads.
Well, the !cgroup or root case should really have no performance impact.
> IIUC, this series affects the trigger for background write-out.
Not sure though; while this does the accounting, the actual writeout is
still !cgroup aware and can definitely impact performance negatively by
shrinking too much.
> On Thu, 2010-03-11 at 10:17 +0900, KAMEZAWA Hiroyuki wrote:
> > On Thu, 11 Mar 2010 09:39:13 +0900
> > KAMEZAWA Hiroyuki <kamezaw...@jp.fujitsu.com> wrote:
> > > > The performance overhead is not huge with either solution, but the impact on
> > > > performance is clearly smaller with the complicated one...
> > > >
> > > > Maybe we can go ahead with the simplest implementation for now and start
> > > > thinking about an alternative implementation of the page_cgroup locking and
> > > > charge/uncharge of pages.
>
> FWIW bit spinlocks suck massive.
>
> > >
> > > Maybe. But in these 2 years, one of our biggest concerns has been performance.
> > > So we do something complex in memcg. But complex locking is, yes, complex.
> > > Hmm.. I wouldn't bet that we can fix the locking scheme without something complex.
> > >
> > But the overall patch set seems good (to me). And dirty_ratio and dirty_background_ratio
> > will give us more benefit (in performance) than we lose to the small overheads.
>
> Well, the !cgroup or root case should really have no performance impact.
>
> > IIUC, this series affects the trigger for background write-out.
>
> Not sure though; while this does the accounting, the actual writeout is
> still !cgroup aware and can definitely impact performance negatively by
> shrinking too much.
>
Ah, okay, your point is the !cgroup (ROOT cgroup) case.
I don't think accounting these file cache statistics against the root cgroup is necessary.
BTW, in another thread, I'm now proposing this style.
==
+void mem_cgroup_update_stat(struct page *page, int idx, bool charge)
+{
+ struct page_cgroup *pc;
+
+ pc = lookup_page_cgroup(page);
+ if (unlikely(!pc))
+ return;
+
+ if (trylock_page_cgroup(pc)) {
+ __mem_cgroup_update_stat(pc, idx, charge);
+ unlock_page_cgroup(pc);
+ }
+ return;
+}
==
Then it's not a problem to check whether pc->mem_cgroup is the root cgroup or
not without the spinlock.
==
void mem_cgroup_update_stat(struct page *page, int idx, bool charge)
{
pc = lookup_page_cgroup(page);
if (unlikely(!pc) || mem_cgroup_is_root(pc->mem_cgroup))
return;
...
}
==
This can be handled with the same logic as the "lock failure" path,
and we just ignore the accounting.
There will be no spinlocks.... To do more than this,
I think we have to use "struct page" rather than "struct page_cgroup".
Thanks,
-Kame
From: KAMEZAWA Hiroyuki <kamezaw...@jp.fujitsu.com>
Now, file-mapped is maintained. But a more generic update function
will be needed for dirty page accounting.
For accounting page status, we have to guarantee that lock_page_cgroup()
will never be called with tree_lock held.
To guarantee that, we use trylock when updating status.
By this, we do fuzzy accounting, but in almost all cases it's correct.
Changelog:
- removed unnecessary preempt_disable()
- added root cgroup check. By this, we do no lock/account in root cgroup.
Signed-off-by: KAMEZAWA Hiroyuki <kamezaw...@jp.fujitsu.com>
---
include/linux/memcontrol.h | 7 ++-
include/linux/page_cgroup.h | 15 +++++++
mm/memcontrol.c | 92 +++++++++++++++++++++++++++++++++-----------
mm/rmap.c | 4 -
4 files changed, 94 insertions(+), 24 deletions(-)
Index: mmotm-2.6.34-Mar9/mm/memcontrol.c
===================================================================
--- mmotm-2.6.34-Mar9.orig/mm/memcontrol.c
+++ mmotm-2.6.34-Mar9/mm/memcontrol.c
@@ -1348,30 +1348,83 @@ bool mem_cgroup_handle_oom(struct mem_cg
* Currently used to update mapped file statistics, but the routine can be
* generalized to update other statistics as well.
*/
-void mem_cgroup_update_file_mapped(struct page *page, int val)
+void __mem_cgroup_update_stat(struct page_cgroup *pc, int idx, bool charge)
{
struct mem_cgroup *mem;
- struct page_cgroup *pc;
-
- pc = lookup_page_cgroup(page);
- if (unlikely(!pc))
- return;
+ int val;
- lock_page_cgroup(pc);
mem = pc->mem_cgroup;
- if (!mem)
- goto done;
+ if (!mem || !PageCgroupUsed(pc))
+ return;
- if (!PageCgroupUsed(pc))
- goto done;
+ if (charge)
+ val = 1;
+ else
+ val = -1;
+ switch (idx) {
+ case MEMCG_NR_FILE_MAPPED:
+ if (charge) {
+ if (!PageCgroupFileMapped(pc))
+ SetPageCgroupFileMapped(pc);
+ else
+ val = 0;
+ } else {
+ if (PageCgroupFileMapped(pc))
+ ClearPageCgroupFileMapped(pc);
+ else
+ val = 0;
+ }
+ idx = MEM_CGROUP_STAT_FILE_MAPPED;
+ break;
+ default:
+ BUG();
+ break;
+ }
/*
* Preemption is already disabled. We can use __this_cpu_xxx
*/
- __this_cpu_add(mem->stat->count[MEM_CGROUP_STAT_FILE_MAPPED], val);
+ __this_cpu_add(mem->stat->count[idx], val);
+}
-done:
- unlock_page_cgroup(pc);
+void mem_cgroup_update_stat(struct page *page, int idx, bool charge)
+{
+ struct page_cgroup *pc;
+
+ pc = lookup_page_cgroup(page);
+ if (!pc || mem_cgroup_is_root(pc->mem_cgroup))
+ return;
+
+ if (trylock_page_cgroup(pc)) {
+ __mem_cgroup_update_stat(pc, idx, charge);
+ unlock_page_cgroup(pc);
+ }
+ return;
+}
+
+static void mem_cgroup_migrate_stat(struct page_cgroup *pc,
+ struct mem_cgroup *from, struct mem_cgroup *to)
+{
+ if (PageCgroupFileMapped(pc)) {
+ __this_cpu_dec(from->stat->count[MEM_CGROUP_STAT_FILE_MAPPED]);
+ if (!mem_cgroup_is_root(to)) {
+ __this_cpu_inc(to->stat->count[MEM_CGROUP_STAT_FILE_MAPPED]);
+ } else {
+ ClearPageCgroupFileMapped(pc);
+ }
+ }
+}
+
+static void
+__mem_cgroup_stat_fixup(struct page_cgroup *pc, struct mem_cgroup *mem)
+{
+ if (mem_cgroup_is_root(mem))
+ return;
+ /* We're in uncharge() and lock_page_cgroup */
+ if (PageCgroupFileMapped(pc)) {
+ __this_cpu_dec(mem->stat->count[MEM_CGROUP_STAT_FILE_MAPPED]);
+ ClearPageCgroupFileMapped(pc);
+ }
}
/*
@@ -1810,13 +1863,7 @@ static void __mem_cgroup_move_account(st
VM_BUG_ON(pc->mem_cgroup != from);
page = pc->page;
- if (page_mapped(page) && !PageAnon(page)) {
- /* Update mapped_file data for mem_cgroup */
- preempt_disable();
- __this_cpu_dec(from->stat->count[MEM_CGROUP_STAT_FILE_MAPPED]);
- __this_cpu_inc(to->stat->count[MEM_CGROUP_STAT_FILE_MAPPED]);
- preempt_enable();
- }
+ mem_cgroup_migrate_stat(pc, from, to);
mem_cgroup_charge_statistics(from, pc, false);
if (uncharge)
/* This is not "cancel", but cancel_charge does all we need. */
@@ -2208,6 +2255,9 @@ __mem_cgroup_uncharge_common(struct page
__do_uncharge(mem, ctype);
if (ctype == MEM_CGROUP_CHARGE_TYPE_SWAPOUT)
mem_cgroup_swap_statistics(mem, true);
+ if (unlikely(PCG_PageStatMask & pc->flags))
+ __mem_cgroup_stat_fixup(pc, mem);
+
mem_cgroup_charge_statistics(mem, pc, false);
ClearPageCgroupUsed(pc);
Index: mmotm-2.6.34-Mar9/include/linux/page_cgroup.h
===================================================================
--- mmotm-2.6.34-Mar9.orig/include/linux/page_cgroup.h
+++ mmotm-2.6.34-Mar9/include/linux/page_cgroup.h
@@ -39,6 +39,8 @@ enum {
PCG_CACHE, /* charged as cache */
PCG_USED, /* this object is in use. */
PCG_ACCT_LRU, /* page has been accounted for */
+ /* for cache-status accounting */
+ PCG_FILE_MAPPED,
};
#define TESTPCGFLAG(uname, lname) \
@@ -57,6 +59,10 @@ static inline void ClearPageCgroup##unam
static inline int TestClearPageCgroup##uname(struct page_cgroup *pc) \
{ return test_and_clear_bit(PCG_##lname, &pc->flags); }
+/* Page/File stat flag mask */
+#define PCG_PageStatMask ((1 << PCG_FILE_MAPPED))
+
+
TESTPCGFLAG(Locked, LOCK)
/* Cache flag is set only once (at allocation) */
@@ -73,6 +79,10 @@ CLEARPCGFLAG(AcctLRU, ACCT_LRU)
TESTPCGFLAG(AcctLRU, ACCT_LRU)
TESTCLEARPCGFLAG(AcctLRU, ACCT_LRU)
+TESTPCGFLAG(FileMapped, FILE_MAPPED)
+SETPCGFLAG(FileMapped, FILE_MAPPED)
+CLEARPCGFLAG(FileMapped, FILE_MAPPED)
+
static inline int page_cgroup_nid(struct page_cgroup *pc)
{
return page_to_nid(pc->page);
@@ -93,6 +103,11 @@ static inline void unlock_page_cgroup(st
bit_spin_unlock(PCG_LOCK, &pc->flags);
}
+static inline int trylock_page_cgroup(struct page_cgroup *pc)
+{
+ return bit_spin_trylock(PCG_LOCK, &pc->flags);
+}
+
#else /* CONFIG_CGROUP_MEM_RES_CTLR */
struct page_cgroup;
Index: mmotm-2.6.34-Mar9/include/linux/memcontrol.h
===================================================================
--- mmotm-2.6.34-Mar9.orig/include/linux/memcontrol.h
+++ mmotm-2.6.34-Mar9/include/linux/memcontrol.h
@@ -124,7 +124,12 @@ static inline bool mem_cgroup_disabled(v
return false;
}
-void mem_cgroup_update_file_mapped(struct page *page, int val);
+enum mem_cgroup_page_stat_item {
+ MEMCG_NR_FILE_MAPPED,
+ MEMCG_NR_FILE_NSTAT,
+};
+
+void mem_cgroup_update_stat(struct page *page, int idx, bool charge);
unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order,
gfp_t gfp_mask, int nid,
int zid);
Index: mmotm-2.6.34-Mar9/mm/rmap.c
===================================================================
--- mmotm-2.6.34-Mar9.orig/mm/rmap.c
+++ mmotm-2.6.34-Mar9/mm/rmap.c
@@ -829,7 +829,7 @@ void page_add_file_rmap(struct page *pag
{
if (atomic_inc_and_test(&page->_mapcount)) {
__inc_zone_page_state(page, NR_FILE_MAPPED);
- mem_cgroup_update_file_mapped(page, 1);
+ mem_cgroup_update_stat(page, MEMCG_NR_FILE_MAPPED, true);
}
}
@@ -861,7 +861,7 @@ void page_remove_rmap(struct page *page)
__dec_zone_page_state(page, NR_ANON_PAGES);
} else {
__dec_zone_page_state(page, NR_FILE_MAPPED);
- mem_cgroup_update_file_mapped(page, -1);
+ mem_cgroup_update_stat(page, MEMCG_NR_FILE_MAPPED, false);
}
/*
* It would be tidy to reset the PageAnon mapping here,
I think what Peter meant was that, with memory cgroups created, we will do
writeouts much more aggressively.
In balance_dirty_pages()
if (bdi_nr_reclaimable + bdi_nr_writeback <= bdi_thresh)
break;
Now with Andrea's patches, we are calculating bdi_thres per memory cgroup
(almost)
bdi_thres ~= per_memory_cgroup_dirty * bdi_fraction
But bdi_nr_reclaimable and bdi_nr_writeback stats are still global.
So for the same number of dirty pages system wide on this bdi, we will be
triggering writeouts much more aggressively if somebody has created few
memory cgroups and tasks are running in those cgroups.
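A rough sketch of the effect (illustrative only, not code from the patches):
==
/* The condition that lets balance_dirty_pages() break out of its loop. */
static bool bdi_under_thresh(unsigned long bdi_nr_reclaimable,
			     unsigned long bdi_nr_writeback,
			     unsigned long bdi_thresh)
{
	return bdi_nr_reclaimable + bdi_nr_writeback <= bdi_thresh;
}

/*
 * Before: bdi_thresh ~= global_dirty_limit * bdi_fraction
 * After:  bdi_thresh ~= memcg_dirty_limit  * bdi_fraction   (much smaller)
 *
 * Since bdi_nr_reclaimable and bdi_nr_writeback stay system-wide, the
 * memcg-based threshold makes bdi_under_thresh() return false much
 * earlier, so tasks in that cgroup keep doing write-out instead of
 * breaking out of the loop.
 */
==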
I guess it might cause performance regressions in case of small file
writeouts because previously one could have written the file to cache and
be done with it, but with this patch set there are higher chances that
you will be throttled to write the pages back to disk.
I guess we need two pieces to resolve this.
- BDI stats per cgroup.
- Writeback of inodes from same cgroup.
I think BDI stats per cgroup will increase the complexity.
I am still setting up the system to test whether we see any speedup in
writeout of large files within a memory cgroup with small memory limits.
I am assuming that we are expecting a speedup because we will start
writeouts early and background writeouts probably are faster than direct
reclaim?
Thanks
Vivek
Hi Andrea,
I am doing a simple dd test of writing a 4G file. This machine has got
64G of memory and I have created one cgroup with 100M as limit_in_bytes.
I ran the following dd program both in the root cgroup and in the test1/
cgroup (100M limit), one after the other.
In root cgroup
==============
dd if=/dev/zero of=/root/zerofile bs=4K count=1000000
1000000+0 records in
1000000+0 records out
4096000000 bytes (4.1 GB) copied, 59.5571 s, 68.8 MB/s
In test1/ cgroup
===============
dd if=/dev/zero of=/root/zerofile bs=4K count=1000000
1000000+0 records in
1000000+0 records out
4096000000 bytes (4.1 GB) copied, 20.6683 s, 198 MB/s
Isn't it strange that we are throttling the process in the root group much
more than the process in the test1/ cgroup?
Thanks
Vivek
Consider that I'm not running the kernel build on tmpfs, but on a fs
defined on /dev/sda. So some additional overhead compared to the mmotm
vanilla, where there's only the FILE_MAPPED accounting, is to be
expected.
-Andrea
This kind of accounting shouldn't be a big problem for the dirty memory
write-out. The benefit in terms of performance is much more important, I
think.
The missing accounting of root cgroup statistics could be an issue if we
move a lot of pages from the root cgroup into a child cgroup (when migration
of file cache pages is supported and enabled). But at worst we'll
continue to write out pages using the global settings. Remember that
memcg dirty memory is always the min(memcg_dirty_memory, total_dirty_memory),
so even if we're leaking dirty memory accounting, at worst we'll hit
the global dirty limit and fall back to the current write-out
implementation.
I'll merge this patch, re-run some tests (kernel build and large file
copy) and post a new version.
Unfortunately at the moment I don't have a big machine to use for these
tests, but maybe I can get some help. Vivek probably has some nice hardware
to test this code.. ;)
Thanks!
-Andrea
Correct. More exactly:
bdi_thresh = memcg dirty memory limit * BDI's share of the global dirty memory
Before:
bdi_thresh = global dirty memory limit * BDI's share of the global dirty memory
>
> So for the same number of dirty pages system wide on this bdi, we will be
> triggering writeouts much more aggressively if somebody has created few
> memory cgroups and tasks are running in those cgroups.
Right, if we don't touch per-cgroup dirty limits.
>
> I guess it might cause performance regressions in case of small file
> writeouts because previously one could have written the file to cache and
> be done with it, but with this patch set there are higher chances that
> you will be throttled to write the pages back to disk.
>
> I guess we need two pieces to resolve this.
> - BDI stats per cgroup.
> - Writeback of inodes from same cgroup.
>
> I think BDI stats per cgroup will increase the complexity.
There'll be the opposite problem, I think: the number of dirty pages
(system-wide) will increase, because in this way we'll consider the BDI
shares of memcg dirty memory. So I think we need both per-memcg BDI
stats and system-wide BDI stats, and then we need to take the min of the two
when evaluating bdi_thresh. Maybe... I'm not really sure about this, and I
need to figure this part out better. So I started with the simplest
implementation: global BDI stats, and per-memcg dirty memory.
I totally agree about the other point, writeback of inodes per cgroup is
another feature that we need.
> I am still setting up the system to test whether we see any speedup in
> writeout of large files within a memory cgroup with small memory limits.
> I am assuming that we are expecting a speedup because we will start
> writeouts early and background writeouts probably are faster than direct
> reclaim?
mmh... speedup? I think with a large file write + reduced dirty limits
you'll get a more uniform write-out (more frequent small writes),
compared to fewer and less frequent large writes. The system will be more
reactive, but I don't think you'll be able to see a speedup in the large
write itself.
Thanks,
-Andrea
hmm.
>
> bdi_thres ~= per_memory_cgroup_dirty * bdi_fraction
>
> But bdi_nr_reclaimable and bdi_nr_writeback stats are still global.
>
Why doesn't the bdi_thresh of the ROOT cgroup depend on the global number?
> So for the same number of dirty pages system wide on this bdi, we will be
> triggering writeouts much more aggressively if somebody has created few
> memory cgroups and tasks are running in those cgroups.
>
> I guess it might cause performance regressions in case of small file
> writeouts because previously one could have written the file to cache and
> > be done with it, but with this patch set there are higher chances that
> you will be throttled to write the pages back to disk.
>
> I guess we need two pieces to resolve this.
> - BDI stats per cgroup.
> - Writeback of inodes from same cgroup.
>
> > I think BDI stats per cgroup will increase the complexity.
>
Thank you for the clarification. IIUC, the dirty_limit implementation should
assume there is an I/O resource controller; usual users will probably use the
I/O resource controller and memcg at the same time.
Then, my question is: what happens when this is used with the I/O resource controller?
> I am still setting up the system to test whether we see any speedup in
> writeout of large files within a memory cgroup with small memory limits.
> I am assuming that we are expecting a speedup because we will start
> writeouts early and background writeouts probably are faster than direct
> reclaim?
>
Yes. I think so.
Ah, sorry. I misunderstood something. But it depends on the dirty_ratio params.
If
  dirty_background_ratio = 5
  dirty_ratio = 100
under a 100M cgroup, I think background write-out will be a help.
(nonsense ? ;)
And I wonder whether make -j can get a better number.... Hmm.
Thanks,
-Kame
mmmh.. strange, on my side I get something as expected:
<root cgroup>
$ dd if=/dev/zero of=test bs=1M count=500
500+0 records in
500+0 records out
524288000 bytes (524 MB) copied, 6.28377 s, 83.4 MB/s
<child cgroup with 100M memory.limit_in_bytes>
$ dd if=/dev/zero of=test bs=1M count=500
500+0 records in
500+0 records out
524288000 bytes (524 MB) copied, 11.8884 s, 44.1 MB/s
Did you change the global /proc/sys/vm/dirty_* or memcg dirty
parameters?
-Andrea
what happens when bs=4k count=1000000 under 100M ? no changes ?
-Kame
Very true. mem_cgroup_has_dirty_limit() must always return false in the case
of the root cgroup, so that the global numbers are used.
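Something along these lines (only a sketch of the intended behaviour; the
real function in the patchset may differ, e.g. in its signature):
==
static bool mem_cgroup_has_dirty_limit(void)
{
	struct mem_cgroup *memcg;
	bool ret;

	if (mem_cgroup_disabled())
		return false;

	rcu_read_lock();
	memcg = mem_cgroup_from_task(current);
	/* root cgroup (or no cgroup): fall back to the global dirty limits */
	ret = memcg && !mem_cgroup_is_root(memcg);
	rcu_read_unlock();

	return ret;
}
==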
Thanks,
-Andrea
> ==
>
> From: KAMEZAWA Hiroyuki <kamezaw...@jp.fujitsu.com>
>
> Now, file-mapped is maintained. But a more generic update function
> will be needed for dirty page accounting.
>
> For accounting page status, we have to guarantee that lock_page_cgroup()
> will never be called with tree_lock held.
> To guarantee that, we use trylock when updating status.
> By this, we do fuzzy accounting, but in almost all cases it's correct.
>
> Changelog:
> - removed unnecessary preempt_disable()
> - added root cgroup check. By this, we do no lock/account in root cgroup.
>
> Signed-off-by: KAMEZAWA Hiroyuki <kamezaw...@jp.fujitsu.com>
Looks good overall.
Thanks,
Daisuke Nishimura.
> On Thu, 11 Mar 2010 18:42:44 +0900, KAMEZAWA Hiroyuki <kamezaw...@jp.fujitsu.com> wrote:
> > On Thu, 11 Mar 2010 18:25:00 +0900
> > KAMEZAWA Hiroyuki <kamezaw...@jp.fujitsu.com> wrote:
> > > Then it's not a problem to check whether pc->mem_cgroup is the root cgroup or
> > > not without the spinlock.
> > > ==
> > > void mem_cgroup_update_stat(struct page *page, int idx, bool charge)
> > > {
> > > pc = lookup_page_cgroup(page);
> > > if (unlikely(!pc) || mem_cgroup_is_root(pc->mem_cgroup))
> > > return;
> > > ...
> > > }
> > > ==
> > > This can be handled with the same logic as the "lock failure" path,
> > > and we just ignore the accounting.
> > >
> > > There will be no spinlocks.... To do more than this,
> > > I think we have to use "struct page" rather than "struct page_cgroup".
> > >
> > Hmm..like this ? The bad point of this patch is that this will corrupt FILE_MAPPED
> > status in root cgroup. This kind of change is not very good.
> > So, one way is to use this kind of function only for new parameters. Hmm.
> IMHO, if we disable accounting file stats in root cgroup, it would be better
> not to show them in memory.stat to avoid confusing users.
agreed.
> But, hmm, I think accounting them in root cgroup isn't so meaningless.
> Isn't making mem_cgroup_has_dirty_limit() return false in case of root cgroup enough?
>
The problem is the spinlock overhead.
IMHO, there are 2 excuses for "not accounting" in the root cgroup:
1. Low overhead is always appreciated.
2. Root's statistics can be obtained as "total - sum of children".
And another thought is: how is the root cgroup used when there are children?
What's the benefit of placing a task into the "unlimited/no control" group when
some important tasks are placed into children groups?
I think administrators don't want to place tasks which they want to watch
into the root cgroup because of the lack of accounting...
Yes, it would be best if the root cgroup worked like the other children, but
unfortunately we know cgroup accounting adds some burden (and it's not avoidable).
But there will be a trade-off: if the accounting is useful, it should be there.
My concern is the cost which we have to pay even when the cgroup is _not_ mounted.
Thanks,
-Kame
OK, I confirm the results found by Vivek. Repeating the tests 10 times:
root cgroup ~= 34.05 MB/s average
child cgroup (100M) ~= 38.80 MB/s average
So, actually the child cgroup with the 100M limit seems to perform
better in terms of throughput.
IIUC, with the large write and the 100M memory limit it happens that
direct write-out is enforced more frequently and a single write chunk is
enough to meet the bdi_thresh or the global background_thresh +
dirty_thresh limits. This means the task is never (or less often) throttled
with io_schedule_timeout() in the balance_dirty_pages() loop. And the
child cgroup gets better performance than the root cgroup.
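For reference, this is roughly the logic of the balance_dirty_pages() loop
I'm referring to (a heavily simplified paraphrase, not the literal code):
==
for (;;) {
	get_dirty_limits(&background_thresh, &dirty_thresh, &bdi_thresh, bdi);

	if (bdi_nr_reclaimable + bdi_nr_writeback <= bdi_thresh)
		break;		/* already under the limit: no throttling */

	/* ... do one chunk of direct write-out, then re-check ... */

	if (bdi_nr_reclaimable + bdi_nr_writeback <= bdi_thresh)
		break;		/* one write chunk was enough: no sleep */

	io_schedule_timeout(pause);	/* otherwise the task is throttled here */
}
==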
-Andrea
Or just show the same values that we show in /proc/meminfo.. (I mean,
not actually the same, but coherent with them).
> But, hmm, I think accounting them in root cgroup isn't so meaningless.
> Isn't making mem_cgroup_has_dirty_limit() return false in case of root cgroup enough?
Agreed. Returning false from mem_cgroup_has_dirty_limit() is enough to
always use global stats for the writeback, so this shouldn't introduce
any overhead for the root cgroup (at least for this part).
-Andrea
Right, in this case background flusher threads will help a lot to
write-out the cgroup dirty memory and it'll get better performance.
-Andrea
In the current form of the patch, the number of dirty pages system-wide will not
increase, as the following condition will be false more often and we
will be doing writeout more aggressively.
	if (bdi_nr_reclaimable + bdi_nr_writeback <= bdi_thresh)
		break;
Once we implement per-cgroup per-BDI stats for bdi_nr_reclaimable and
bdi_nr_writeback, then in theory we will have the same number of dirty pages
in the system as today.
Thanks
Vivek
No, I did not change any memcg dirty parameters.
Vivek
I think in the current implementation the ROOT cgroup's bdi_thresh is always the
same as the global number. It is only for the other child groups that it differs
from the global number, because of the reduced dirtyable_memory() limit. And we
don't seem to be allowing any control on the root group.
But I am wondering what happens in the following case.
IIUC, with use_hierarchy=0, if I create two test groups test1 and test2, then the
hierarchy looks as follows.
root test1 test2
Now the root group's DIRTYABLE is still system-wide, but test1's and test2's
dirtyable memory will be reduced based on the RES_LIMIT in those groups.
Conceptually, the per-cgroup dirty ratio is like fixing the page cache share of
each group. So effectively we are saying that these limits apply only to the
child groups of root, but not to root as such?
> > So for the same number of dirty pages system wide on this bdi, we will be
> > triggering writeouts much more aggressively if somebody has created few
> > memory cgroups and tasks are running in those cgroups.
> >
> > I guess it might cause performance regressions in case of small file
> > writeouts because previously one could have written the file to cache and
> > be done with it, but with this patch set there are higher chances that
> > you will be throttled to write the pages back to disk.
> >
> > I guess we need two pieces to resolve this.
> > - BDI stats per cgroup.
> > - Writeback of inodes from same cgroup.
> >
> > I think BDI stats per cgroup will increase the complexity.
> >
> Thank you for the clarification. IIUC, the dirty_limit implementation should
> assume there is an I/O resource controller; usual users will probably use the
> I/O resource controller and memcg at the same time.
> Then, my question is: what happens when this is used with the I/O resource controller?
>
Currently the IO resource controller keeps all the async IO queues in the root
group, so we can't measure exactly. But my guess is that until and unless we
at least implement "writeback inodes from same cgroup" we will not see an
increased flow of writes from one cgroup over another.
Thanks
Vivek
IIUC, "total - sum of children" works only if use_hierarchy=1? At the same time,
it does not work if there are tasks in the root cgroup and not in a children group.
Vivek
Correct. In this implementation root cgroup means "outside all cgroups".
I think this can be an acceptable behaviour since in general we don't
set any limit to the root cgroup.
>
> > > So for the same number of dirty pages system wide on this bdi, we will be
> > > triggering writeouts much more aggressively if somebody has created few
> > > memory cgroups and tasks are running in those cgroups.
> > >
> > > I guess it might cause performance regressions in case of small file
> > > writeouts because previously one could have written the file to cache and
> > > be done with it, but with this patch set there are higher chances that
> > > you will be throttled to write the pages back to disk.
> > >
> > > I guess we need two pieces to resolve this.
> > > - BDI stats per cgroup.
> > > - Writeback of inodes from same cgroup.
> > >
> > > I think BDI stats per cgroup will increase the complexity.
> > >
> > Thank you for the clarification. IIUC, the dirty_limit implementation should
> > assume there is an I/O resource controller; usual users will probably use the
> > I/O resource controller and memcg at the same time.
> > Then, my question is: what happens when this is used with the I/O resource controller?
> >
>
> Currently the IO resource controller keeps all the async IO queues in the root
> group, so we can't measure exactly. But my guess is that until and unless we
> at least implement "writeback inodes from same cgroup" we will not see an
> increased flow of writes from one cgroup over another.
Agreed. And I plan to look at the "writeback inodes per cgroup" feature
soon. I'm sorry, but I have some deadlines this week, so I'll probably
start working on this next weekend.
-Andrea