> Changelog
> since v1: dropped the removal of the "recent_xxx" debug statistics, as
> Balbir suggested
>
> ===========================================
>
> The anon_scan_ratio feature isn't only useful for global VM pressure
> analysis; it is also useful for memcg memory pressure analysis.
>
> Thus, this patch adds an anon_scan_ratio field to the memory.stat file too.
>
> Cc: Balbir Singh <bal...@linux.vnet.ibm.com>
> Cc: KAMEZAWA Hiroyuki <kamezaw...@jp.fujitsu.com>
> Signed-off-by: KOSAKI Motohiro <kosaki....@jp.fujitsu.com>
Acked-by: KAMEZAWA Hiroyuki <kamezaw...@jp.fujitsu.com>
Thanks,
-Kame
===========================================
The anon_scan_ratio feature isn't only useful for global VM pressure
analysis; it is also useful for memcg memory pressure analysis.
Thus, this patch adds an anon_scan_ratio field to the memory.stat file too.
Cc: Balbir Singh <bal...@linux.vnet.ibm.com>
Cc: KAMEZAWA Hiroyuki <kamezaw...@jp.fujitsu.com>
Signed-off-by: KOSAKI Motohiro <kosaki....@jp.fujitsu.com>
---
mm/memcontrol.c | 65 +++++++++++++++++++++++++++++++++++-------------------
1 files changed, 42 insertions(+), 23 deletions(-)
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 325df12..7348edc 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2950,6 +2950,11 @@ static int mem_control_stat_show(struct cgroup *cont, struct cftype *cft,
{
struct mem_cgroup *mem_cont = mem_cgroup_from_cont(cont);
struct mcs_total_stat mystat;
+ struct zone *zone;
+ unsigned long total_anon = 0;
+ unsigned long total_scan_anon = 0;
+ unsigned long recent_rotated[2] = {0};
+ unsigned long recent_scanned[2] = {0};
int i;
memset(&mystat, 0, sizeof(mystat));
@@ -2978,33 +2983,47 @@ static int mem_control_stat_show(struct cgroup *cont, struct cftype *cft,
cb->fill(cb, memcg_stat_strings[i].total_name, mystat.stat[i]);
}
-#ifdef CONFIG_DEBUG_VM
cb->fill(cb, "inactive_ratio", calc_inactive_ratio(mem_cont, NULL));
- {
- int nid, zid;
+ for_each_populated_zone(zone) {
+ int nid = zone->zone_pgdat->node_id;
+ int zid = zone_idx(zone);
struct mem_cgroup_per_zone *mz;
- unsigned long recent_rotated[2] = {0, 0};
- unsigned long recent_scanned[2] = {0, 0};
-
- for_each_online_node(nid)
- for (zid = 0; zid < MAX_NR_ZONES; zid++) {
- mz = mem_cgroup_zoneinfo(mem_cont, nid, zid);
-
- recent_rotated[0] +=
- mz->reclaim_stat.recent_rotated[0];
- recent_rotated[1] +=
- mz->reclaim_stat.recent_rotated[1];
- recent_scanned[0] +=
- mz->reclaim_stat.recent_scanned[0];
- recent_scanned[1] +=
- mz->reclaim_stat.recent_scanned[1];
- }
- cb->fill(cb, "recent_rotated_anon", recent_rotated[0]);
- cb->fill(cb, "recent_rotated_file", recent_rotated[1]);
- cb->fill(cb, "recent_scanned_anon", recent_scanned[0]);
- cb->fill(cb, "recent_scanned_file", recent_scanned[1]);
+ unsigned long anon;
+ unsigned long ratio;
+
+ mz = mem_cgroup_zoneinfo(mem_cont, nid, zid);
+
+ anon = MEM_CGROUP_ZSTAT(mz, LRU_INACTIVE_ANON);
+ anon += MEM_CGROUP_ZSTAT(mz, LRU_ACTIVE_ANON);
+
+ ratio = get_anon_scan_ratio(zone, mem_cont, mem_cont->swappiness);
+
+ /*
+ * We have a per-zone anon scan ratio, but we don't want to display such
+ * per-zone values directly. Instead, we display the following formula:
+ *
+ * sum(anon * ratio/100)
+ * --------------------- * 100
+ * sum(anon)
+ */
+ total_anon += anon;
+ total_scan_anon += anon * ratio;
+
+#ifdef CONFIG_DEBUG_VM
+ recent_rotated[0] += mz->reclaim_stat.recent_rotated[0];
+ recent_rotated[1] += mz->reclaim_stat.recent_rotated[1];
+ recent_scanned[0] += mz->reclaim_stat.recent_scanned[0];
+ recent_scanned[1] += mz->reclaim_stat.recent_scanned[1];
+#endif
}
+ cb->fill(cb, "anon_scan_ratio", total_scan_anon / total_anon);
+
+#ifdef CONFIG_DEBUG_VM
+ cb->fill(cb, "recent_rotated_anon", recent_rotated[0]);
+ cb->fill(cb, "recent_rotated_file", recent_rotated[1]);
+ cb->fill(cb, "recent_scanned_anon", recent_scanned[0]);
+ cb->fill(cb, "recent_scanned_file", recent_scanned[1]);
#endif
return 0;
--
1.6.5.2
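To make the weighted average in that comment concrete, here is a minimal
userspace sketch (not kernel code; the per-zone anon sizes and ratios are
made up) of the same computation the new memory.stat code performs:

#include <stdio.h>

int main(void)
{
        /* hypothetical per-zone anon LRU sizes and per-zone scan ratios (%) */
        unsigned long anon[]  = { 1000, 4000, 500 };
        unsigned long ratio[] = {   20,   60,  90 };
        unsigned long total_anon = 0, total_scan_anon = 0;
        int i;

        for (i = 0; i < 3; i++) {
                total_anon      += anon[i];
                total_scan_anon += anon[i] * ratio[i];
        }

        /* same computation as the patch: sum(anon * ratio) / sum(anon) */
        printf("anon_scan_ratio = %lu\n", total_scan_anon / total_anon);
        return 0;
}

With these example numbers the large 4000-page zone dominates and the
reported value comes out to 55.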
======================================
Vmscan folks have been asked "why does my system do so much swap-out?"
on lkml several times.
Each time, I made a debug patch to show the recent_anon_{scanned/rotated}
parameters; I have done so at least three times now.
Thus, these parameters should be shown in /proc/zoneinfo. It helps
vmscan folks with debugging.
Signed-off-by: KOSAKI Motohiro <kosaki....@jp.fujitsu.com>
Reviewed-by: Rik van Riel <ri...@redhat.com>
Reviewed-by: KAMEZAWA Hiroyuki <kamezaw...@jp.fujitsu.com>
---
include/linux/swap.h | 2 ++
mm/vmscan.c | 50 ++++++++++++++++++++++++++++++++++++--------------
mm/vmstat.c | 7 +++++--
3 files changed, 43 insertions(+), 16 deletions(-)
diff --git a/include/linux/swap.h b/include/linux/swap.h
index a2602a8..e95d7ed 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -280,6 +280,8 @@ extern void scan_unevictable_unregister_node(struct node *node);
extern int kswapd_run(int nid);
extern void kswapd_stop(int nid);
+unsigned long get_anon_scan_ratio(struct zone *zone, struct mem_cgroup *memcg, int swappiness);
+
#ifdef CONFIG_MMU
/* linux/mm/shmem.c */
extern int shmem_unuse(swp_entry_t entry, struct page *page);
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 640486b..0900931 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1493,8 +1493,8 @@ static unsigned long shrink_list(enum lru_list lru, unsigned long nr_to_scan,
* percent[0] specifies how much pressure to put on ram/swap backed
* memory, while percent[1] determines pressure on the file LRUs.
*/
-static void get_scan_ratio(struct zone *zone, struct scan_control *sc,
- unsigned long *percent)
+static void __get_scan_ratio(struct zone *zone, struct scan_control *sc,
+ int need_update, unsigned long *percent)
{
unsigned long anon, file, free;
unsigned long anon_prio, file_prio;
@@ -1535,18 +1535,19 @@ static void get_scan_ratio(struct zone *zone, struct scan_control *sc,
*
* anon in [0], file in [1]
*/
- if (unlikely(reclaim_stat->recent_scanned[0] > anon / 4)) {
- spin_lock_irq(&zone->lru_lock);
- reclaim_stat->recent_scanned[0] /= 2;
- reclaim_stat->recent_rotated[0] /= 2;
- spin_unlock_irq(&zone->lru_lock);
- }
-
- if (unlikely(reclaim_stat->recent_scanned[1] > file / 4)) {
- spin_lock_irq(&zone->lru_lock);
- reclaim_stat->recent_scanned[1] /= 2;
- reclaim_stat->recent_rotated[1] /= 2;
- spin_unlock_irq(&zone->lru_lock);
+ if (need_update) {
+ if (unlikely(reclaim_stat->recent_scanned[0] > anon / 4)) {
+ spin_lock_irq(&zone->lru_lock);
+ reclaim_stat->recent_scanned[0] /= 2;
+ reclaim_stat->recent_rotated[0] /= 2;
+ spin_unlock_irq(&zone->lru_lock);
+ }
+ if (unlikely(reclaim_stat->recent_scanned[1] > file / 4)) {
+ spin_lock_irq(&zone->lru_lock);
+ reclaim_stat->recent_scanned[1] /= 2;
+ reclaim_stat->recent_rotated[1] /= 2;
+ spin_unlock_irq(&zone->lru_lock);
+ }
}
/*
@@ -1572,6 +1573,27 @@ static void get_scan_ratio(struct zone *zone, struct scan_control *sc,
percent[1] = 100 - percent[0];
}
+static void get_scan_ratio(struct zone *zone, struct scan_control *sc,
+ unsigned long *percent)
+{
+ __get_scan_ratio(zone, sc, 1, percent);
+}
+
+unsigned long get_anon_scan_ratio(struct zone *zone, struct mem_cgroup *memcg, int swappiness)
+{
+ unsigned long percent[2];
+ struct scan_control sc = {
+ .may_swap = 1,
+ .swappiness = swappiness,
+ .mem_cgroup = memcg,
+ };
+
+ __get_scan_ratio(zone, &sc, 0, percent);
+
+ return percent[0];
+}
+
+
/*
* Smallish @nr_to_scan's are deposited in @nr_saved_scan,
* until we collected @swap_cluster_max pages to scan.
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 6051fba..f690117 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -15,6 +15,7 @@
#include <linux/cpu.h>
#include <linux/vmstat.h>
#include <linux/sched.h>
+#include <linux/swap.h>
#ifdef CONFIG_VM_EVENT_COUNTERS
DEFINE_PER_CPU(struct vm_event_state, vm_event_states) = {{0}};
@@ -760,11 +761,13 @@ static void zoneinfo_show_print(struct seq_file *m, pg_data_t *pgdat,
"\n all_unreclaimable: %u"
"\n prev_priority: %i"
"\n start_pfn: %lu"
- "\n inactive_ratio: %u",
+ "\n inactive_ratio: %u"
+ "\n anon_scan_ratio: %lu",
zone_is_all_unreclaimable(zone),
zone->prev_priority,
zone->zone_start_pfn,
- zone->inactive_ratio);
+ zone->inactive_ratio,
+ get_anon_scan_ratio(zone, NULL, vm_swappiness));
seq_putc(m, '\n');
[snip]
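With both patches applied, the tail of each per-zone block in /proc/zoneinfo
would look something like the following (the numbers here are purely
illustrative):

  all_unreclaimable: 0
  prev_priority:     12
  start_pfn:         4096
  inactive_ratio:    3
  anon_scan_ratio:   37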
Looks good to me
Acked-by: Balbir Singh <bal...@linux.vnet.ibm.com>
On Wed, Jan 13, 2010 at 5:21 PM, KOSAKI Motohiro
<kosaki....@jp.fujitsu.com> wrote:
> Changelog
> from v1
> - get_anon_scan_ratio doesn't take zone->lru_lock anymore,
> because zoneinfo_show_print takes zone->lock.
When I first saw this changelog, I got confused,
because there is no relation between lru_lock and lock in struct zone.
Do you mean zoneinfo is allowed to have stale data?
I tend to agree with that.
>
>
> ======================================
> Vmscan folks have been asked "why does my system do so much swap-out?"
> on lkml several times.
> Each time, I made a debug patch to show the recent_anon_{scanned/rotated}
> parameters; I have done so at least three times now.
>
> Thus, these parameters should be shown in /proc/zoneinfo. It helps
> vmscan folks with debugging.
I support this suggestion.
>
> Signed-off-by: KOSAKI Motohiro <kosaki....@jp.fujitsu.com>
> Reviewed-by: Rik van Riel <ri...@redhat.com>
> Reviewed-by: KAMEZAWA Hiroyuki <kamezaw...@jp.fujitsu.com>
> ---
> include/linux/swap.h | 2 ++
> mm/vmscan.c | 50 ++++++++++++++++++++++++++++++++++++--------------
> mm/vmstat.c | 7 +++++--
> 3 files changed, 43 insertions(+), 16 deletions(-)
>
> diff --git a/include/linux/swap.h b/include/linux/swap.h
> index a2602a8..e95d7ed 100644
> --- a/include/linux/swap.h
> +++ b/include/linux/swap.h
> @@ -280,6 +280,8 @@ extern void scan_unevictable_unregister_node(struct node *node);
> extern int kswapd_run(int nid);
> extern void kswapd_stop(int nid);
>
> +unsigned long get_anon_scan_ratio(struct zone *zone, struct mem_cgroup *memcg, int swappiness);
Today Andrew said to me :)
"The vmscan.c code makes an effort to look nice in an 80-col display."
Why did you add the new parameter 'need_update'?
Do you see any heavy lru_lock contention (reclaim vs. cat /proc/zoneinfo)?
I think probably not, so I am not sure a lock-free version is needed.
> + if (need_update) {
> + if (unlikely(reclaim_stat->recent_scanned[0] > anon / 4)) {
> + spin_lock_irq(&zone->lru_lock);
> + reclaim_stat->recent_scanned[0] /= 2;
> + reclaim_stat->recent_rotated[0] /= 2;
> + spin_unlock_irq(&zone->lru_lock);
> + }
> + if (unlikely(reclaim_stat->recent_scanned[1] > file / 4)) {
> + spin_lock_irq(&zone->lru_lock);
> + reclaim_stat->recent_scanned[1] /= 2;
> + reclaim_stat->recent_rotated[1] /= 2;
> + spin_unlock_irq(&zone->lru_lock);
> + }
> }
>
> /*
> @@ -1572,6 +1573,27 @@ static void get_scan_ratio(struct zone *zone, struct scan_control *sc,
> percent[1] = 100 - percent[0];
> }
>
> +static void get_scan_ratio(struct zone *zone, struct scan_control *sc,
> + unsigned long *percent)
> +{
> + __get_scan_ratio(zone, sc, 1, percent);
> +}
> +
If we really need this version and your changelog is right,
let's add a note. ;-)
/* Caller has to hold zone->lock */
--
Kind regards,
Minchan Kim
> Hi, Kosaki.
>
> On Wed, Jan 13, 2010 at 5:21 PM, KOSAKI Motohiro
> <kosaki....@jp.fujitsu.com> wrote:
> > Changelog
> > from v1
> > - get_anon_scan_ratio doesn't take zone->lru_lock anymore,
> > because zoneinfo_show_print takes zone->lock.
>
> When I first saw this changelog, I got confused,
> because there is no relation between lru_lock and lock in struct zone.
> Do you mean zoneinfo is allowed to have stale data?
> I tend to agree with that.
Well, zone->lock and zone->lru_lock should not be taken at the same time.
Patch [1/4] of my last version removed zone->lock, so get_anon_scan_ratio()
could take zone->lru_lock, but I dropped that patch; thus get_anon_scan_ratio()
can't take zone->lru_lock.
That is why I added the need_update parameter.
This is not a lock ordering issue. Both zone->lock and zone->lru_lock are
hot-path locks, so grabbing both at the same time might cause a performance impact.
thanks.
I only looked over the code, since I am out of the office.
I can't find any locking problem between zone->lock and zone->lru_lock.
Do you know of any lock ordering problem?
Could you explain it with a call graph, if you don't mind?
I am out of the office until tomorrow, so I can't reply quickly.
Sorry for the late response.
--
Kind regards,
Minchan Kim
On Thu, Jan 14, 2010 at 2:18 PM, KOSAKI Motohiro
<kosaki....@jp.fujitsu.com> wrote:
>> > Well, zone->lock and zone->lru_lock should not be taken at the same time.
>>
>> I only looked over the code, since I am out of the office.
>> I can't find any locking problem between zone->lock and zone->lru_lock.
>> Do you know of any lock ordering problem?
>> Could you explain it with a call graph, if you don't mind?
>>
>> I am out of the office until tomorrow, so I can't reply quickly.
>> Sorry for the late response.
>
> This is not a lock ordering issue. Both zone->lock and zone->lru_lock are
> hot-path locks, so grabbing both at the same time might cause a performance impact.
Sorry for the late response.
Your patch makes the anon_scan_ratio in zoneinfo stale.
The performance impact you mentioned matters when VM pressure is high.
I think stale data is all right normally.
But when VM pressure is high and we want to see the information in zoneinfo
(which is exactly the case you described), stale data is not good, I think.
If that is not a strong objection, I would like to use the old get_scan_ratio
in get_anon_scan_ratio.
--
Kind regards,
Minchan Kim
Please look at that function again.
Usually we use the recent_rotated/recent_scanned ratio, so the following
halving doesn't change the meaning of the scan ratio at all; it only
prevents the stat counters from overflowing.
if (unlikely(reclaim_stat->recent_scanned[0] > anon / 4)) {
spin_lock_irq(&zone->lru_lock);
reclaim_stat->recent_scanned[0] /= 2;
reclaim_stat->recent_rotated[0] /= 2;
spin_unlock_irq(&zone->lru_lock);
}
So, I don't think the current implementation can show stale data.
Thanks.
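A minimal standalone sketch of that point (illustrative values only, not
kernel code): halving both counters leaves the rotated/scanned ratio
unchanged while keeping the counters bounded.

#include <stdio.h>

int main(void)
{
        unsigned long scanned = 800000, rotated = 200000;

        printf("ratio before: %.3f\n", (double)rotated / scanned);  /* 0.250 */

        /* what get_scan_ratio() does once scanned exceeds anon / 4 */
        scanned /= 2;
        rotated /= 2;

        printf("ratio after:  %.3f\n", (double)rotated / scanned);  /* 0.250 */
        return 0;
}

Both printfs show 0.250, which is why the halving only affects overflow,
not the computed scan ratio.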
It also plays a primary role as a floating (decaying) average, as well as
preventing overflow. :)
So, it's important.
>
> if (unlikely(reclaim_stat->recent_scanned[0] > anon / 4)) {
> spin_lock_irq(&zone->lru_lock);
> reclaim_stat->recent_scanned[0] /= 2;
> reclaim_stat->recent_rotated[0] /= 2;
> spin_unlock_irq(&zone->lru_lock);
> }
>
>
> So, I don't think the current implementation can show stale data.
It can produce stale data when memory pressure is high.
>
> Thanks.
>
Moreover, I don't want to add more complexity (i.e. need_update) than the
old code has if it doesn't bring some benefit. (I think the lru_lock isn't a big overhead.)
--
Kind regards,
Minchan Kim
?? Why, and when?
I think that depends on what 'stale' means.
Currently (i.e. before the patch), get_scan_ratio() uses the following formula,
and in this region recent_scanned is not protected by zone->lru_lock.
ap = (anon_prio + 1) * (reclaim_stat->recent_scanned[0] + 1);
ap /= reclaim_stat->recent_rotated[0] + 1;
fp = (file_prio + 1) * (reclaim_stat->recent_scanned[1] + 1);
fp /= reclaim_stat->recent_rotated[1] + 1;
percent[0] = 100 * ap / (ap + fp + 1);
percent[1] = 100 - percent[0];
It means shrink_zone() doesn't use an exact recent_scanned value either, so
zoneinfo can use the same inexact value.
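For what it's worth, here is a rough worked example with made-up counter
values, assuming swappiness = 60 (i.e. anon_prio = 60 and file_prio = 140):

#include <stdio.h>

int main(void)
{
        unsigned long anon_prio = 60, file_prio = 140;      /* swappiness = 60 */
        unsigned long recent_scanned[2] = { 4000, 12000 };  /* [0] anon, [1] file */
        unsigned long recent_rotated[2] = { 1000, 2000 };
        unsigned long ap, fp, percent0;

        ap = (anon_prio + 1) * (recent_scanned[0] + 1);
        ap /= recent_rotated[0] + 1;
        fp = (file_prio + 1) * (recent_scanned[1] + 1);
        fp /= recent_rotated[1] + 1;

        percent0 = 100 * ap / (ap + fp + 1);
        printf("percent[0] = %lu, percent[1] = %lu\n", percent0, 100 - percent0);
        return 0;
}

With these numbers ap is about 243 and fp about 845, giving percent[0] = 22;
the result depends on the rotated/scanned ratios, so a slightly stale
recent_scanned barely moves it.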
> Moreover, I don't want to add more complexity (i.e. need_update) than the
> old code has if it doesn't bring some benefit. (I think the lru_lock isn't a big overhead.)
Hmm..
I think the lru_lock can cause a big overhead.
I believe the current code is intentional.
On Mon, Jan 18, 2010 at 10:54 AM, KOSAKI Motohiro
Absolutely right. I missed that. Thanks.
get_scan_ratio() takes the lru_lock around updating reclaim_stat->recent_xxxx,
but it doesn't take the lru_lock when computing ap/fp.
Is that intentional? I think you or Rik would know. :)
I think if we want exact values, we have to hold the lru_lock until
ap/fp have been computed.
If not, we don't need the lru_lock when we touch reclaim_stat->recent_xxxx either.
What do you think about it?
>
>
>> Moreover, I don't want to add more complexity (i.e. need_update) than the
>> old code has if it doesn't bring some benefit. (I think the lru_lock isn't a big overhead.)
>
> Hmm..
> I think the lru_lock can cause a big overhead.
I don't want to argue strongly about this,
because I haven't actually seen that contention myself.
If you have a concern about the lru_lock, I don't oppose your patch.
--
Kind regards,
Minchan Kim
> Absolutely right. I missed that. Thanks.
> get_scan_ratio() takes the lru_lock around updating reclaim_stat->recent_xxxx,
> but it doesn't take the lru_lock when computing ap/fp.
>
> Is that intentional? I think you or Rik would know. :)
> I think if we want exact values, we have to hold the lru_lock until
> ap/fp have been computed.
> If not, we don't need the lru_lock when we touch reclaim_stat->recent_xxxx either.
>
> What do you think about it?
This is definitely not intentional.
Race conditions in this code could throw off the
statistics by a factor of 2. I do not know how serious that
would be for the VM, or whether (and how quickly) it would
self-correct.
--
All rights reversed.
Okay. How about making a patch to get exact ap/fp values?
Even if the skew is not serious, or recovers quickly, I think it would be better
to hold the lru_lock for consistency, provided the lru_lock is not a heavily contended lock.
>
> --
> All rights reversed.
>
--
Kind regards,
Minchan Kim
Really?
OK, then I'll post the next patch.
Thanks.