Not only can the system suffer enormous slowdowns because of
lock contention (and conditional reschedules) between thousands
of processes in the page reclaim code, but each process will try
to free up to SWAP_CLUSTER_MAX pages, even when the system already
has lots of memory free. In Larry's case, this resulted in over
6000 processes fighting over locks in the page reclaim code, even
though the system already had 1.5GB of free memory.
It should be possible to avoid both of those issues at once, by
simply limiting how many processes are active in the page reclaim
code simultaneously.
If too many processes are active doing page reclaim in one zone,
simply go to sleep in shrink_zone().
On wakeup, check whether enough memory has been freed already
before jumping into the page reclaim code ourselves. We want
to use the same threshold here that is used in the page allocator
for deciding whether or not to call the page reclaim code in the
first place, otherwise some unlucky processes could end up freeing
memory for the rest of the system.
Reported-by: Larry Woodman <lwoo...@redhat.com>
Signed-off-by: Rik van Riel <ri...@redhat.com>
---
This patch is against today's MMOTM tree. It has only been compile tested,
I do not have an AIM7 system standing by.
Larry, does this fix your issue?
Documentation/sysctl/vm.txt | 18 ++++++++++++++++++
include/linux/mmzone.h | 4 ++++
include/linux/swap.h | 1 +
kernel/sysctl.c | 7 +++++++
mm/page_alloc.c | 3 +++
mm/vmscan.c | 38 ++++++++++++++++++++++++++++++++++++++
6 files changed, 71 insertions(+)
diff --git a/Documentation/sysctl/vm.txt b/Documentation/sysctl/vm.txt
index fc5790d..5cf766f 100644
--- a/Documentation/sysctl/vm.txt
+++ b/Documentation/sysctl/vm.txt
@@ -32,6 +32,7 @@ Currently, these files are in /proc/sys/vm:
- legacy_va_layout
- lowmem_reserve_ratio
- max_map_count
+- max_zone_concurrent_reclaim
- memory_failure_early_kill
- memory_failure_recovery
- min_free_kbytes
@@ -278,6 +279,23 @@ The default value is 65536.
=============================================================
+max_zone_concurrent_reclaim:
+
+The number of processes that are allowed to simultaneously reclaim
+memory from a particular memory zone.
+
+With certain workloads, hundreds of processes end up in the page
+reclaim code simultaneously. This can cause large slowdowns due
+to lock contention, freeing of way too much memory and occasionally
+false OOM kills.
+
+To avoid these problems, only allow a smaller number of processes
+to reclaim pages from each memory zone simultaneously.
+
+The default value is 8.
+
+=============================================================
+
memory_failure_early_kill:
Control how to kill processes when uncorrected memory error (typically
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 30fe668..ed614b8 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -345,6 +345,10 @@ struct zone {
/* Zone statistics */
atomic_long_t vm_stat[NR_VM_ZONE_STAT_ITEMS];
+ /* Number of processes running page reclaim code on this zone. */
+ atomic_t concurrent_reclaimers;
+ wait_queue_head_t reclaim_wait;
+
/*
* prev_priority holds the scanning priority for this zone. It is
* defined as the scanning priority at which we achieved our reclaim
diff --git a/include/linux/swap.h b/include/linux/swap.h
index a2602a8..661eec7 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -254,6 +254,7 @@ extern unsigned long shrink_all_memory(unsigned long nr_pages);
extern int vm_swappiness;
extern int remove_mapping(struct address_space *mapping, struct page *page);
extern long vm_total_pages;
+extern int max_zone_concurrent_reclaimers;
#ifdef CONFIG_NUMA
extern int zone_reclaim_mode;
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 6ff0ae6..89b919c 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -1270,6 +1270,13 @@ static struct ctl_table vm_table[] = {
.extra1 = &zero,
.extra2 = &one,
},
+ {
+ .procname = "max_zone_concurrent_reclaimers",
+ .data = &max_zone_concurrent_reclaimers,
+ .maxlen = sizeof(max_zone_concurrent_reclaimers),
+ .mode = 0644,
+ .proc_handler = proc_dointvec,
+ },
#endif
/*
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 11ae66e..ca9cae1 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3852,6 +3852,9 @@ static void __paginginit free_area_init_core(struct pglist_data *pgdat,
zone->prev_priority = DEF_PRIORITY;
+ atomic_set(&zone->concurrent_reclaimers, 0);
+ init_waitqueue_head(&zone->reclaim_wait);
+
zone_pcp_init(zone);
for_each_lru(l) {
INIT_LIST_HEAD(&zone->lru[l].list);
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 2bbee91..cf3ef29 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -40,6 +40,7 @@
#include <linux/memcontrol.h>
#include <linux/delayacct.h>
#include <linux/sysctl.h>
+#include <linux/wait.h>
#include <asm/tlbflush.h>
#include <asm/div64.h>
@@ -129,6 +130,17 @@ struct scan_control {
int vm_swappiness = 60;
long vm_total_pages; /* The total number of pages which the VM controls */
+/*
+ * Maximum number of processes concurrently running the page
+ * reclaim code in a memory zone. Having too many processes
+ * just results in them burning CPU time waiting for locks,
+ * so we're better off limiting page reclaim to a sane number
+ * of processes at a time. We do this per zone so local node
+ * reclaim on one NUMA node will not block other nodes from
+ * making progress.
+ */
+int max_zone_concurrent_reclaimers = 8;
+
static LIST_HEAD(shrinker_list);
static DECLARE_RWSEM(shrinker_rwsem);
@@ -1600,6 +1612,29 @@ static void shrink_zone(int priority, struct zone *zone,
struct zone_reclaim_stat *reclaim_stat = get_reclaim_stat(zone, sc);
int noswap = 0;
+ if (!current_is_kswapd() && atomic_read(&zone->concurrent_reclaimers) >
+ max_zone_concurrent_reclaimers) {
+ /*
+ * Do not add to the lock contention if this zone has
+ * enough processes doing page reclaim already, since
+ * we would just make things slower.
+ */
+ sleep_on(&zone->reclaim_wait);
+
+ /*
+ * If other processes freed enough memory while we waited,
+ * break out of the loop and go back to the allocator.
+ */
+ if (zone_watermark_ok(zone, sc->order, low_wmark_pages(zone),
+ 0, 0)) {
+ wake_up(&zone->reclaim_wait);
+ sc->nr_reclaimed += nr_to_reclaim;
+ return;
+ }
+ }
+
+ atomic_inc(&zone->concurrent_reclaimers);
+
/* If we have no swap space, do not bother scanning anon pages. */
if (!sc->may_swap || (nr_swap_pages <= 0)) {
noswap = 1;
@@ -1655,6 +1690,9 @@ static void shrink_zone(int priority, struct zone *zone,
shrink_active_list(SWAP_CLUSTER_MAX, zone, sc, priority, 0);
throttle_vm_writeout(sc->gfp_mask);
+
+ atomic_dec(&zone->concurrent_reclaimers);
+ wake_up(&zone->reclaim_wait);
}
/*
--
I like this, but why did you select the constant 8 as the default value?
Do you have any reason?
I think it would be better to select a number proportional to NR_CPU,
e.g. NR_CPU * 2 or something.
Otherwise it looks good to me.
Reviewed-by: Minchan Kim <minch...@gmail.com>
--
Kind regards,
Minchan Kim
>> +The default value is 8.
>> +
>> +=============================================================
>
> I like this. but why do you select default value as constant 8?
> Do you have any reason?
>
> I think it would be better to select the number proportional to NR_CPU.
> ex) NR_CPU * 2 or something.
>
> Otherwise looks good to me.
Pessimistically, I assume that the pageout code spends maybe
10% of its time on locking (we have seen far, far worse than
this with thousands of processes in the pageout code). That
means if we have more than 10 threads in the pageout code,
we could end up spending more time on locking and less doing
real work - slowing everybody down.
I rounded it down to the closest power of 2 to come up with
an arbitrary number that looked safe :)
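(Spelling that arithmetic out, with the 10% figure as a pure assumption: if each
reclaimer spends a fraction f ~= 0.1 of its time on the reclaim locks, those
locks saturate once about 1/f = 10 reclaimers are contending, and any further
reclaimers mostly add waiting time; 2^3 = 8 is the nearest power of two below
that.)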
However, this number is per zone - I imagine that really large
systems will have multiple memory zones, so they can run with
more than 8 processes in the pageout code simultaneously.
> Reviewed-by: Minchan Kim<minch...@gmail.com>
Thank you.
--
All rights reversed.
Thanks for giving me this precious information. :)
We always end up with questions about magic values that have no comment.
Actually, I don't want to add another magic value without a comment,
so I would like to add your good explanation to the changelog
or as a comment in the code.
>
> I rounded it down to the closest power of 2 to come up with
> an arbitrary number that looked safe :)
>
> However, this number is per zone - I imagine that really large
> systems will have multiple memory zones, so they can run with
> more than 8 processes in the pageout code simultaneously.
>
>> Reviewed-by: Minchan Kim<minch...@gmail.com>
>
> Thank you.
Thanks for quick reply. :)
> --
> All rights reversed.
>
--
Kind regards,
Minchan Kim
I will backport this patch and test it, as well as testing it on the latest kernel.
Larry
On Fri, Dec 11, 2009 at 9:07 PM, Larry Woodman <lwoo...@redhat.com> wrote:
> Minchan Kim wrote:
>>
>> I like this. but why do you select default value as constant 8?
>> Do you have any reason?
>>
>> I think it would be better to select the number proportional to NR_CPU.
>> ex) NR_CPU * 2 or something.
>>
>> Otherwise looks good to me.
>>
>> Reviewed-by: Minchan Kim <minch...@gmail.com>
>>
>>
>
> This is a per-zone count so perhaps a reasonable default is the number of
> CPUs on the
> NUMA node that the zone resides on ?
For example, assume one CPU per node.
That means your default value would be 1.
On that CPU, process A tries to reclaim the HIGHMEM zone,
and process B wants to reclaim the NORMAL zone.
But process B can't enter the reclaim path, since the throttle default value is 1.
Even kswapd can't reclaim.
I think it's a really aggressive throttling approach, although it would
solve your problem.
I have another idea.
We make the default value rather big and we provide a latency value as a knob.
So at first many processes can enter the reclaim path. When the shrinking time
exceeds our knob (e.g. some number of HZ), we decrease the number of concurrent
reclaim processes. If the shrink time is still long although we did that, we
can decrease the value again. When the shrink time is fast, we can allow more
processes to enter the reclaim path again by increasing the number.
It's like the old pdflush mechanism, but it's more complex than Rik's approach.
If Rik's approach solves this problem well, my approach is rather
overkill, I think.
I am looking forward to Rik's approach working well.
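A rough sketch of that feedback idea, purely for illustration -- none of these
identifiers exist in the kernel and the constants are made up:

	/*
	 * Hypothetical adaptive throttle: shrink_latency_knob is the
	 * tunable latency target, zone_reclaim_limit the current cap on
	 * concurrent reclaimers for a zone.  Called after each shrink
	 * pass with how long that pass took, in jiffies.
	 */
	static unsigned long shrink_latency_knob = HZ / 10;
	static int zone_reclaim_limit = 64;	/* start permissive */

	static void adjust_reclaim_limit(unsigned long shrink_time)
	{
		if (shrink_time > shrink_latency_knob) {
			/* shrinking is slow: let fewer tasks into reclaim */
			if (zone_reclaim_limit > 1)
				zone_reclaim_limit /= 2;
		} else if (zone_reclaim_limit < 64) {
			/* shrinking is fast: admit more reclaimers again */
			zone_reclaim_limit *= 2;
		}
	}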
>
>
--
Kind regards,
Minchan Kim
One reason I made it tunable is so people can easily test
what a good value would be :)
--
All rights reversed.
1) the value is per zone, so process B can go ahead
2) kswapd is always exempt from this limit, since
there is only 1 kswapd per node anyway
--
All rights reversed.
Sorry, I misunderstood Larry's point.
I thought Larry meant a global limit, not a per-zone one.
> 2) kswapd is always excempt from this limit, since
> there is only 1 kswapd per node anyway
Larry could test with Rik's patch to find a good default value.
If that proves NR_CPU per node is the proper default value,
we can change the default value to that.
> --
> All rights reversed.
>
--
Kind regards,
Minchan Kim
> Otherwise looks good to me.
>
> Reviewed-by: Minchan Kim<minch...@gmail.com>
OK, we found three issues with my patch :)
1) there is a typo in sysctl.c
2) there is another typo in Documentation/sysctl/vm.txt
3) the code in vmscan.c has a bug, where tasks without
__GFP_IO or __GFP_FS can end up waiting for tasks
with __GFP_IO or __GFP_FS, leading to a deadlock
I will fix these issues and send out a new patch.
--
All rights reversed.
Not only can the system suffer enormous slowdowns because of
lock contention (and conditional reschedules) between thousands
of processes in the page reclaim code, but each process will try
to free up to SWAP_CLUSTER_MAX pages, even when the system already
has lots of memory free.
It should be possible to avoid both of those issues at once, by
simply limiting how many processes are active in the page reclaim
code simultaneously.
If too many processes are active doing page reclaim in one zone,
simply go to sleep in shrink_zone().
On wakeup, check whether enough memory has been freed already
before jumping into the page reclaim code ourselves. We want
to use the same threshold here that is used in the page allocator
for deciding whether or not to call the page reclaim code in the
first place, otherwise some unlucky processes could end up freeing
memory for the rest of the system.
Reported-by: Larry Woodman <lwoo...@redhat.com>
Signed-off-by: Rik van Riel <ri...@redhat.com>
---
v2:
- fix typos in sysctl.c and vm.txt
- move the code in sysctl.c out from under the ifdef
- only __GFP_FS|__GFP_IO tasks can wait
Documentation/sysctl/vm.txt | 18 ++++++++++++++
include/linux/mmzone.h | 4 +++
include/linux/swap.h | 1 +
kernel/sysctl.c | 7 +++++
mm/page_alloc.c | 3 ++
mm/vmscan.c | 40 +++++++++++++++++++++++++++++++++
6 files changed, 73 insertions(+), 0 deletions(-)
diff --git a/Documentation/sysctl/vm.txt b/Documentation/sysctl/vm.txt
index fc5790d..8bd1a96 100644
--- a/Documentation/sysctl/vm.txt
+++ b/Documentation/sysctl/vm.txt
@@ -32,6 +32,7 @@ Currently, these files are in /proc/sys/vm:
- legacy_va_layout
- lowmem_reserve_ratio
- max_map_count
+- max_zone_concurrent_reclaimers
- memory_failure_early_kill
- memory_failure_recovery
- min_free_kbytes
@@ -278,6 +279,23 @@ The default value is 65536.
=============================================================
+max_zone_concurrent_reclaimers:
index 6ff0ae6..4ec17ed 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -1271,6 +1271,13 @@ static struct ctl_table vm_table[] = {
.extra2 = &one,
},
#endif
+ {
+ .procname = "max_zone_concurrent_reclaimers",
+ .data = &max_zone_concurrent_reclaimers,
+ .maxlen = sizeof(max_zone_concurrent_reclaimers),
+ .mode = 0644,
+ .proc_handler = &proc_dointvec,
+ },
/*
* NOTE: do not add new entries to this table unless you have read
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 11ae66e..ca9cae1 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3852,6 +3852,9 @@ static void __paginginit free_area_init_core(struct pglist_data *pgdat,
zone->prev_priority = DEF_PRIORITY;
+ atomic_set(&zone->concurrent_reclaimers, 0);
+ init_waitqueue_head(&zone->reclaim_wait);
+
zone_pcp_init(zone);
for_each_lru(l) {
INIT_LIST_HEAD(&zone->lru[l].list);
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 2bbee91..ecfe28c 100644
@@ -1600,6 +1612,31 @@ static void shrink_zone(int priority, struct zone *zone,
struct zone_reclaim_stat *reclaim_stat = get_reclaim_stat(zone, sc);
int noswap = 0;
+ if (!current_is_kswapd() && atomic_read(&zone->concurrent_reclaimers) >
+ max_zone_concurrent_reclaimers &&
+ (sc->gfp_mask & (__GFP_IO|__GFP_FS)) ==
+ (__GFP_IO|__GFP_FS)) {
+ /*
+ * Do not add to the lock contention if this zone has
+ * enough processes doing page reclaim already, since
+ * we would just make things slower.
+ */
+ sleep_on(&zone->reclaim_wait);
+
+ /*
+ * If other processes freed enough memory while we waited,
+ * break out of the loop and go back to the allocator.
+ */
+ if (zone_watermark_ok(zone, sc->order, low_wmark_pages(zone),
+ 0, 0)) {
+ wake_up(&zone->reclaim_wait);
+ sc->nr_reclaimed += nr_to_reclaim;
+ return;
+ }
+ }
+
+ atomic_inc(&zone->concurrent_reclaimers);
+
/* If we have no swap space, do not bother scanning anon pages. */
if (!sc->may_swap || (nr_swap_pages <= 0)) {
noswap = 1;
@@ -1655,6 +1692,9 @@ static void shrink_zone(int priority, struct zone *zone,
shrink_active_list(SWAP_CLUSTER_MAX, zone, sc, priority, 0);
throttle_vm_writeout(sc->gfp_mask);
+
+ atomic_dec(&zone->concurrent_reclaimers);
+ wake_up(&zone->reclaim_wait);
}
/*
On Sat, Dec 12, 2009 at 6:46 AM, Rik van Riel <ri...@redhat.com> wrote:
> Under very heavy multi-process workloads, like AIM7, the VM can
> get into trouble in a variety of ways. The trouble start when
> there are hundreds, or even thousands of processes active in the
> page reclaim code.
>
> Not only can the system suffer enormous slowdowns because of
> lock contention (and conditional reschedules) between thousands
> of processes in the page reclaim code, but each process will try
> to free up to SWAP_CLUSTER_MAX pages, even when the system already
> has lots of memory free.
>
> It should be possible to avoid both of those issues at once, by
> simply limiting how many processes are active in the page reclaim
> code simultaneously.
>
> If too many processes are active doing page reclaim in one zone,
> simply go to sleep in shrink_zone().
>
> On wakeup, check whether enough memory has been freed already
> before jumping into the page reclaim code ourselves. We want
> to use the same threshold here that is used in the page allocator
> for deciding whether or not to call the page reclaim code in the
> first place, otherwise some unlucky processes could end up freeing
> memory for the rest of the system.
I am worried about one thing.
Now we can put too many processes on reclaim_wait in TASK_UNINTERRUPTIBLE state.
If OOM happens, the OOM killer will kill many innocent processes, since
an uninterruptible task
can't handle the kill signal until it is released from the reclaim_wait list.
I think the time spent on the reclaim_wait list might be long if VM pressure is heavy.
Is this an exaggeration?
If it is a serious problem, how about this:
we add a new PF_RECLAIM_BLOCK flag and don't pick such processes
in select_bad_process().
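Just to illustrate the idea -- PF_RECLAIM_BLOCK does not exist, it would need a
free PF_ bit, and this is only a fragment, not a real patch:

	/* hypothetical: set while the task sleeps on zone->reclaim_wait,
	 * cleared again when it wakes up */
	current->flags |= PF_RECLAIM_BLOCK;

	/* ...and in select_bad_process(), inside the loop over tasks: */
	if (p->flags & PF_RECLAIM_BLOCK)
		continue;	/* don't pick a task that is merely
				 * throttled in page reclaim */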
--
Kind regards,
Minchan Kim
> On Sat, Dec 12, 2009 at 6:46 AM, Rik van Riel<ri...@redhat.com> wrote:
>> If too many processes are active doing page reclaim in one zone,
>> simply go to sleep in shrink_zone().
> I am worried about one.
>
> Now, we can put too many processes reclaim_wait with NR_UNINTERRUBTIBLE state.
> If OOM happens, OOM will kill many innocent processes since
> uninterruptible task
> can't handle kill signal until the processes free from reclaim_wait list.
>
> I think reclaim_wait list staying time might be long if VM pressure is heavy.
> Is this a exaggeration?
>
> If it is serious problem, how about this?
>
> We add new PF_RECLAIM_BLOCK flag and don't pick the process
> in select_bad_process.
A simpler solution may be to use interruptible_sleep_on(), and
simply have the process continue into shrink_zone() if it
gets a signal.
--
All rights reversed.
I thought of that, but I was not sure.
Okay, if it is possible, it's simpler.
Could you repost the patch with that?
Sorry, but I have one request.
===
+The default value is 8.
+
+=============================================================
I like this. but why do you select default value as constant 8?
Do you have any reason?
I think it would be better to select the number proportional to NR_CPU.
ex) NR_CPU * 2 or something.
Otherwise looks good to me.
Pessimistically, I assume that the pageout code spends maybe
10% of its time on locking (we have seen far, far worse than
this with thousands of processes in the pageout code). That
means if we have more than 10 threads in the pageout code,
we could end up spending more time on locking and less doing
real work - slowing everybody down.
I rounded it down to the closest power of 2 to come up with
an arbitrary number that looked safe :)
===
We discussed this above.
I want to add your description to the changelog.
That's because, after a long time, we won't remember why we selected '8' as
the default value.
Your description in the changelog will explain it to people who follow up. :)
Sorry for bothering you.
> --
> All rights reversed.
>
--
Kind regards,
Minchan Kim
>> A simpler solution may be to use sleep_on_interruptible, and
>> simply have the process continue into shrink_zone() if it
>> gets a signal.
>
> I thought it but I was not sure.
> Okay. If it is possible, It' more simple.
> Could you repost patch with that?
Sure, not a problem.
> +The default value is 8.
> +
> +=============================================================
>
>
> I like this. but why do you select default value as constant 8?
> Do you have any reason?
>
> I think it would be better to select the number proportional to NR_CPU.
> ex) NR_CPU * 2 or something.
>
> Otherwise looks good to me.
>
>
> Pessimistically, I assume that the pageout code spends maybe
> 10% of its time on locking (we have seen far, far worse than
> this with thousands of processes in the pageout code). That
> means if we have more than 10 threads in the pageout code,
> we could end up spending more time on locking and less doing
> real work - slowing everybody down.
>
> I rounded it down to the closest power of 2 to come up with
> an arbitrary number that looked safe :)
> ===
>
> We discussed above.
> I want to add your desciption into changelog.
The thing is, I don't know if 8 is the best value for
the default, which is a reason I made it tunable in
the first place.
There are a lot of assumptions in my reasoning, and
they may be wrong. I suspect that documenting something
wrong is probably worse than letting people wonder about
the default (and maybe find a better one).
--
All rights reversed.
--
Indeed. But whenever I see magic values in the kernel, I have questions
about them.
Actually, I tend to doubt such a value because I guess
"that value was determined by server or desktop experiments,
so it's probably not proper for a small system."
At least if we write down the logical reasoning, even if it might be wrong,
later people can decide that the value is no longer proper, or not proper for
their system (e.g. a small or super computer and so on), based on our
description,
so they can improve it.
I think it isn't important whether your reasoning is right or wrong.
The most important thing is which logical reason determined the value.
I don't want to bother you if you mind my suggestion.
Please think of it as just a nitpick. :)
> --
> All rights reversed.
>
--
Kind regards,
Minchan Kim
This patch seems very similar to my old effort. AFAIK, there are another
two benefits.
1. Improved resource guarantee
If thousands of tasks start vmscan at the same time, they eat all the memory
reserved for PF_MEMALLOC. That might cause another dangerous problem: some
filesystems and I/O devices don't handle allocation failure properly.
2. Improved OOM-condition behavior
Currently, vmscan doesn't handle SIGKILL at all, so if the system
is under heavy memory pressure, the OOM killer can't kill the target process
soon. That might cause the OOM killer to kill the next innocent process.
This patch can fix that.
> @@ -1600,6 +1612,31 @@ static void shrink_zone(int priority, struct zone *zone,
> struct zone_reclaim_stat *reclaim_stat = get_reclaim_stat(zone, sc);
> int noswap = 0;
>
> + if (!current_is_kswapd() && atomic_read(&zone->concurrent_reclaimers) >
> + max_zone_concurrent_reclaimers &&
> + (sc->gfp_mask & (__GFP_IO|__GFP_FS)) ==
> + (__GFP_IO|__GFP_FS)) {
> + /*
> + * Do not add to the lock contention if this zone has
> + * enough processes doing page reclaim already, since
> + * we would just make things slower.
> + */
> + sleep_on(&zone->reclaim_wait);
Oops, this is a bug: sleep_on() is not SMP safe.
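For reference, the lost-wakeup window looks roughly like this (illustrative
only, using the names from Rik's patch):

	if (atomic_read(&zone->concurrent_reclaimers) >
				max_zone_concurrent_reclaimers) {
		/*
		 * Another CPU can decrement the counter and call
		 * wake_up() right here, before this task is on the
		 * wait queue...
		 */
		sleep_on(&zone->reclaim_wait);	/* ...so that wakeup is
						 * lost and we can sleep
						 * forever */
	}

prepare_to_wait()/finish_wait() close that window by putting the task on the
wait queue before the condition is (re)checked.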
I made a few fix patches today; I'll post them soon.
BTW, the following are measurement results from hackbench.
================
unit: sec

  parameter               old       new
  130 (5200 processes)    5.463     4.442
  140 (5600 processes)  479.357     7.792
  150 (6000 processes)  729.640    20.529
The sleep_on() function is SMP and/or kernel-preemption unsafe. We shouldn't
use it in new code.
Signed-off-by: KOSAKI Motohiro <kosaki....@jp.fujitsu.com>
---
include/linux/wait.h | 8 ++++----
1 files changed, 4 insertions(+), 4 deletions(-)
diff --git a/include/linux/wait.h b/include/linux/wait.h
index a48e16b..bf76627 100644
--- a/include/linux/wait.h
+++ b/include/linux/wait.h
@@ -427,11 +427,11 @@ static inline void remove_wait_queue_locked(wait_queue_head_t *q,
* They are racy. DO NOT use them, use the wait_event* interfaces above.
* We plan to remove these interfaces.
*/
-extern void sleep_on(wait_queue_head_t *q);
-extern long sleep_on_timeout(wait_queue_head_t *q,
+extern void __deprecated sleep_on(wait_queue_head_t *q);
+extern long __deprecated sleep_on_timeout(wait_queue_head_t *q,
signed long timeout);
-extern void interruptible_sleep_on(wait_queue_head_t *q);
-extern long interruptible_sleep_on_timeout(wait_queue_head_t *q,
+extern void __deprecated interruptible_sleep_on(wait_queue_head_t *q);
+extern long __deprecated interruptible_sleep_on_timeout(wait_queue_head_t *q,
signed long timeout);
/*
--
1.6.5.2
Signed-off-by: KOSAKI Motohiro <kosaki....@jp.fujitsu.com>
---
mm/vmscan.c | 58 +++++++++++++++++++++++++++++++++++-----------------------
1 files changed, 35 insertions(+), 23 deletions(-)
diff --git a/mm/vmscan.c b/mm/vmscan.c
index ecfe28c..74c36fe 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1597,25 +1597,11 @@ static unsigned long nr_scan_try_batch(unsigned long nr_to_scan,
return nr;
}
-/*
- * This is a basic per-zone page freer. Used by both kswapd and direct reclaim.
- */
-static void shrink_zone(int priority, struct zone *zone,
- struct scan_control *sc)
+static int shrink_zone_begin(struct zone *zone, struct scan_control *sc)
{
- unsigned long nr[NR_LRU_LISTS];
- unsigned long nr_to_scan;
- unsigned long percent[2]; /* anon @ 0; file @ 1 */
- enum lru_list l;
- unsigned long nr_reclaimed = sc->nr_reclaimed;
- unsigned long nr_to_reclaim = sc->nr_to_reclaim;
- struct zone_reclaim_stat *reclaim_stat = get_reclaim_stat(zone, sc);
- int noswap = 0;
-
- if (!current_is_kswapd() && atomic_read(&zone->concurrent_reclaimers) >
- max_zone_concurrent_reclaimers &&
- (sc->gfp_mask & (__GFP_IO|__GFP_FS)) ==
- (__GFP_IO|__GFP_FS)) {
+ if (!current_is_kswapd() &&
+ atomic_read(&zone->concurrent_reclaimers) > max_zone_concurrent_reclaimers &&
+ (sc->gfp_mask & (__GFP_IO|__GFP_FS)) == (__GFP_IO|__GFP_FS)) {
/*
* Do not add to the lock contention if this zone has
* enough processes doing page reclaim already, since
@@ -1630,12 +1616,40 @@ static void shrink_zone(int priority, struct zone *zone,
if (zone_watermark_ok(zone, sc->order, low_wmark_pages(zone),
0, 0)) {
wake_up(&zone->reclaim_wait);
- sc->nr_reclaimed += nr_to_reclaim;
- return;
+ sc->nr_reclaimed += sc->nr_to_reclaim;
+ return -ERESTARTSYS;
}
}
atomic_inc(&zone->concurrent_reclaimers);
+ return 0;
+}
+
+static void shrink_zone_end(struct zone *zone, struct scan_control *sc)
+{
+ atomic_dec(&zone->concurrent_reclaimers);
+ wake_up(&zone->reclaim_wait);
+}
+
+/*
+ * This is a basic per-zone page freer. Used by both kswapd and direct reclaim.
+ */
+static void shrink_zone(int priority, struct zone *zone,
+ struct scan_control *sc)
+{
+ unsigned long nr[NR_LRU_LISTS];
+ unsigned long nr_to_scan;
+ unsigned long percent[2]; /* anon @ 0; file @ 1 */
+ enum lru_list l;
+ unsigned long nr_reclaimed = sc->nr_reclaimed;
+ unsigned long nr_to_reclaim = sc->nr_to_reclaim;
+ struct zone_reclaim_stat *reclaim_stat = get_reclaim_stat(zone, sc);
+ int noswap = 0;
+ int ret;
+
+ ret = shrink_zone_begin(zone, sc);
+ if (ret)
+ return;
/* If we have no swap space, do not bother scanning anon pages. */
if (!sc->may_swap || (nr_swap_pages <= 0)) {
@@ -1692,9 +1706,7 @@ static void shrink_zone(int priority, struct zone *zone,
shrink_active_list(SWAP_CLUSTER_MAX, zone, sc, priority, 0);
throttle_vm_writeout(sc->gfp_mask);
-
- atomic_dec(&zone->concurrent_reclaimers);
- wake_up(&zone->reclaim_wait);
+ shrink_zone_end(zone, sc);
}
/*
--
1.6.5.2
Signed-off-by: KOSAKI Motohiro <kosaki....@jp.fujitsu.com>
---
mm/vmscan.c | 40 ++++++++++++++++++++++++++++++----------
1 files changed, 30 insertions(+), 10 deletions(-)
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 74c36fe..3be5345 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1599,15 +1599,32 @@ static unsigned long nr_scan_try_batch(unsigned long nr_to_scan,
static int shrink_zone_begin(struct zone *zone, struct scan_control *sc)
{
- if (!current_is_kswapd() &&
- atomic_read(&zone->concurrent_reclaimers) > max_zone_concurrent_reclaimers &&
- (sc->gfp_mask & (__GFP_IO|__GFP_FS)) == (__GFP_IO|__GFP_FS)) {
- /*
- * Do not add to the lock contention if this zone has
- * enough processes doing page reclaim already, since
- * we would just make things slower.
- */
- sleep_on(&zone->reclaim_wait);
+ DEFINE_WAIT(wait);
+ wait_queue_head_t *wq = &zone->reclaim_wait;
+
+ if (current_is_kswapd())
+ goto out;
+
+ /*
+ * GFP_NOIO and GFP_NOFS mean caller may have some lock implicitly.
+ * Thus, we can't wait here. otherwise it might cause deadlock.
+ */
+ if ((sc->gfp_mask & (__GFP_IO|__GFP_FS)) != (__GFP_IO|__GFP_FS))
+ goto out;
+
+ /*
+ * Do not add to the lock contention if this zone has
+ * enough processes doing page reclaim already, since
+ * we would just make things slower.
+ */
+ for (;;) {
+ prepare_to_wait(wq, &wait, TASK_UNINTERRUPTIBLE);
+
+ if (atomic_read(&zone->concurrent_reclaimers) <=
+ max_zone_concurrent_reclaimers)
+ break;
+
+ schedule();
/*
* If other processes freed enough memory while we waited,
@@ -1615,12 +1632,15 @@ static int shrink_zone_begin(struct zone *zone, struct scan_control *sc)
*/
if (zone_watermark_ok(zone, sc->order, low_wmark_pages(zone),
0, 0)) {
- wake_up(&zone->reclaim_wait);
+ wake_up(wq);
+ finish_wait(wq, &wait);
sc->nr_reclaimed += sc->nr_to_reclaim;
return -ERESTARTSYS;
}
}
+ finish_wait(wq, &wait);
+ out:
atomic_inc(&zone->concurrent_reclaimers);
return 0;
}
--
1.6.5.2
Signed-off-by: KOSAKI Motohiro <kosaki....@jp.fujitsu.com>
---
mm/page_alloc.c | 8 ++++++++
1 files changed, 8 insertions(+), 0 deletions(-)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index ca9cae1..8a9cbaa 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1878,6 +1878,14 @@ rebalance:
goto got_pg;
/*
+ * If the allocation is for userland page and we have fatal signal,
+ * there isn't any reason to continue allocation. instead, the task
+ * should exit soon.
+ */
+ if (fatal_signal_pending(current) && (gfp_mask & __GFP_HIGHMEM))
+ goto nopage;
+
+ /*
* If we failed to make any progress reclaiming, then we are
* running out of options and have to consider going OOM
*/
--
1.6.5.2
Signed-off-by: KOSAKI Motohiro <kosaki....@jp.fujitsu.com>
---
mm/vmscan.c | 16 ++++++++++++----
1 files changed, 12 insertions(+), 4 deletions(-)
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 0880668..bf229d3 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1654,7 +1654,7 @@ static void shrink_zone_end(struct zone *zone, struct scan_control *sc)
/*
* This is a basic per-zone page freer. Used by both kswapd and direct reclaim.
*/
-static void shrink_zone(int priority, struct zone *zone,
+static int shrink_zone(int priority, struct zone *zone,
struct scan_control *sc)
{
unsigned long nr[NR_LRU_LISTS];
@@ -1669,7 +1669,7 @@ static void shrink_zone(int priority, struct zone *zone,
ret = shrink_zone_begin(zone, sc);
if (ret)
- return;
+ return ret;
/* If we have no swap space, do not bother scanning anon pages. */
if (!sc->may_swap || (nr_swap_pages <= 0)) {
@@ -1692,6 +1692,7 @@ static void shrink_zone(int priority, struct zone *zone,
&reclaim_stat->nr_saved_scan[l]);
}
+ ret = 0;
while (nr[LRU_INACTIVE_ANON] || nr[LRU_ACTIVE_FILE] ||
nr[LRU_INACTIVE_FILE]) {
for_each_evictable_lru(l) {
@@ -1712,8 +1713,10 @@ static void shrink_zone(int priority, struct zone *zone,
* with multiple processes reclaiming pages, the total
* freeing target can get unreasonably large.
*/
- if (nr_reclaimed >= nr_to_reclaim && priority < DEF_PRIORITY)
+ if (nr_reclaimed >= nr_to_reclaim && priority < DEF_PRIORITY) {
+ ret = -ERESTARTSYS;
break;
+ }
}
sc->nr_reclaimed = nr_reclaimed;
@@ -1727,6 +1730,8 @@ static void shrink_zone(int priority, struct zone *zone,
throttle_vm_writeout(sc->gfp_mask);
shrink_zone_end(zone, sc);
+
+ return ret;
}
/*
@@ -1751,6 +1756,7 @@ static void shrink_zones(int priority, struct zonelist *zonelist,
enum zone_type high_zoneidx = gfp_zone(sc->gfp_mask);
struct zoneref *z;
struct zone *zone;
+ int ret;
sc->all_unreclaimable = 1;
for_each_zone_zonelist_nodemask(zone, z, zonelist, high_zoneidx,
@@ -1780,7 +1786,9 @@ static void shrink_zones(int priority, struct zonelist *zonelist,
priority);
}
- shrink_zone(priority, zone, sc);
+ ret = shrink_zone(priority, zone, sc);
+ if (ret)
+ return;
}
}
--
1.6.5.2
This patch fixes it.
Signed-off-by: KOSAKI Motohiro <kosaki....@jp.fujitsu.com>
---
mm/vmscan.c | 20 +++++++++++++-------
1 files changed, 13 insertions(+), 7 deletions(-)
diff --git a/mm/vmscan.c b/mm/vmscan.c
index bf229d3..e537361 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1618,7 +1618,10 @@ static int shrink_zone_begin(struct zone *zone, struct scan_control *sc)
* we would just make things slower.
*/
for (;;) {
- prepare_to_wait_exclusive(wq, &wait, TASK_UNINTERRUPTIBLE);
+ prepare_to_wait_exclusive(wq, &wait, TASK_KILLABLE);
+
+ if (fatal_signal_pending(current))
+ goto stop_reclaim;
if (atomic_read(&zone->concurrent_reclaimers) <=
max_zone_concurrent_reclaimers)
@@ -1631,18 +1634,21 @@ static int shrink_zone_begin(struct zone *zone, struct scan_control *sc)
* break out of the loop and go back to the allocator.
*/
if (zone_watermark_ok(zone, sc->order, low_wmark_pages(zone),
- 0, 0)) {
- wake_up_all(wq);
- finish_wait(wq, &wait);
- sc->nr_reclaimed += sc->nr_to_reclaim;
- return -ERESTARTSYS;
- }
+ 0, 0))
+ goto found_lots_memory;
}
finish_wait(wq, &wait);
out:
atomic_inc(&zone->concurrent_reclaimers);
return 0;
+
+ found_lots_memory:
+ wake_up_all(wq);
+ stop_reclaim:
+ finish_wait(wq, &wait);
+ sc->nr_reclaimed += sc->nr_to_reclaim;
+ return -ERESTARTSYS;
}
static void shrink_zone_end(struct zone *zone, struct scan_control *sc)
--
1.6.5.2
Signed-off-by: KOSAKI Motohiro <kosaki....@jp.fujitsu.com>
---
mm/vmscan.c | 2 +-
1 files changed, 1 insertions(+), 1 deletions(-)
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 3562a2d..0880668 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1624,7 +1624,7 @@ static int shrink_zone_begin(struct zone *zone, struct scan_control *sc)
max_zone_concurrent_reclaimers)
break;
- schedule();
+ io_schedule();
/*
* If other processes freed enough memory while we waited,
--
1.6.5.2
Signed-off-by: KOSAKI Motohiro <kosaki....@jp.fujitsu.com>
---
mm/vmscan.c | 4 ++--
1 files changed, 2 insertions(+), 2 deletions(-)
diff --git a/mm/vmscan.c b/mm/vmscan.c
index e0cb834..3562a2d 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1618,7 +1618,7 @@ static int shrink_zone_begin(struct zone *zone, struct scan_control *sc)
* we would just make things slower.
*/
for (;;) {
- prepare_to_wait(wq, &wait, TASK_UNINTERRUPTIBLE);
+ prepare_to_wait_exclusive(wq, &wait, TASK_UNINTERRUPTIBLE);
if (atomic_read(&zone->concurrent_reclaimers) <=
max_zone_concurrent_reclaimers)
@@ -1632,7 +1632,7 @@ static int shrink_zone_begin(struct zone *zone, struct scan_control *sc)
*/
if (zone_watermark_ok(zone, sc->order, low_wmark_pages(zone),
0, 0)) {
- wake_up(wq);
+ wake_up_all(wq);
finish_wait(wq, &wait);
sc->nr_reclaimed += sc->nr_to_reclaim;
return -ERESTARTSYS;
--
1.6.5.2
old means mmotm1208
new means mmotm1208 + Rik's patch + my patches
And the best way to achieve this is to remove the function.
In Linus' current tree I find:
- 5 instances of sleep_on(), all in old and obscure block drivers
- 2 instances of sleep_on_timeout(), both in old and obscure drivers
- 28 instances of interruptible_sleep_on_timeout(), mostly in obscure
drivers with a high concentration in the old oss core which should be
killed anyway. And unfortunately a few relatively recent additions
like the SGI xpc driver or usbvision driver
- tons of instances of interruptible_sleep_on all over the drivers code
So I think you'd be better off just killing the first two ASAP instead
of just deprecating them.
For the other two, deprecating them and removing some of the horrible drivers
still using them might be best.
> +max_zone_concurrent_reclaim:
> +
> +The number of processes that are allowed to simultaneously reclaim
> +memory from a particular memory zone.
> +
> +With certain workloads, hundreds of processes end up in the page
> +reclaim code simultaneously. This can cause large slowdowns due
> +to lock contention, freeing of way too much memory and occasionally
> +false OOM kills.
> +
> +To avoid these problems, only allow a smaller number of processes
> +to reclaim pages from each memory zone simultaneously.
> +
> +The default value is 8.
I don't like the hardcoded number. Is the same number good for a 128MB
embedded system as for a 1TB server? Seems doubtful.
Perhaps this should be scaled with memory size and number of CPUs?
> +/*
> + * Maximum number of processes concurrently running the page
> + * reclaim code in a memory zone. Having too many processes
> + * just results in them burning CPU time waiting for locks,
> + * so we're better off limiting page reclaim to a sane number
> + * of processes at a time. We do this per zone so local node
> + * reclaim on one NUMA node will not block other nodes from
> + * making progress.
> + */
> +int max_zone_concurrent_reclaimers = 8;
__read_mostly
> +
> static LIST_HEAD(shrinker_list);
> static DECLARE_RWSEM(shrinker_rwsem);
>
> @@ -1600,6 +1612,29 @@ static void shrink_zone(int priority, struct zone *zone,
> struct zone_reclaim_stat *reclaim_stat = get_reclaim_stat(zone, sc);
> int noswap = 0;
>
> + if (!current_is_kswapd() && atomic_read(&zone->concurrent_reclaimers) >
> + max_zone_concurrent_reclaimers) {
> + /*
> + * Do not add to the lock contention if this zone has
> + * enough processes doing page reclaim already, since
> + * we would just make things slower.
> + */
> + sleep_on(&zone->reclaim_wait);
wait_event()? sleep_on is a really deprecated racy interface.
This would still badly thunder the herd if not enough memory is freed,
wouldn't it? It would be better to only wake up a single process if memory got freed.
How about doing a wake-up for one thread for each page freed?
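For comparison, a wait_event() based version of the throttle could look
roughly like this (just a sketch of the idea, not the posted patch):

	/*
	 * Sleep until either a reclaimer slot frees up or other tasks
	 * have pushed the zone back above its low watermark.  The
	 * condition is re-checked after every wakeup, so the race in
	 * sleep_on() goes away.
	 */
	wait_event(zone->reclaim_wait,
		   atomic_read(&zone->concurrent_reclaimers) <
				max_zone_concurrent_reclaimers ||
		   zone_watermark_ok(zone, sc->order,
				     low_wmark_pages(zone), 0, 0));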
-Andi
--
a...@linux.intel.com -- Speaking for myself only.
This sounds like a very good argument against using direct reclaim at
all. It reminds me a bit of the issue we had in XFS with lots of processes
pushing the AIL and causing massive slowdowns due to lock contention
and cacheline bouncing. Moving all the AIL pushing into a dedicated
thread solved that nicely. In the VM we already have that dedicated
per-node kswapd thread, so moving as much work as possible off to it
should be equivalent.
Of course any of this kind of tuning really requires a lot of testing
and benchmarking to verify those assumptions.
Remember this is a per-zone number.
--
Some of the new systems have 16 CPUs per-node.
Reviewed-by: Rik van Riel <ri...@redhat.com>
--
All rights reversed.
Reviewed-by: Rik van Riel <ri...@redhat.com>
--
All rights reversed.
Reviewed-by: Rik van Riel <ri...@redhat.com>
--
All rights reversed.
> if (zone_watermark_ok(zone, sc->order, low_wmark_pages(zone),
> 0, 0)) {
> - wake_up(wq);
> + wake_up_all(wq);
> finish_wait(wq,&wait);
> sc->nr_reclaimed += sc->nr_to_reclaim;
> return -ERESTARTSYS;
I believe we want to wake the processes up one at a time
here. If the queue of waiting processes is very large
and the amount of excess free memory is fairly low, the
first processes that wake up can take the amount of free
memory back down below the threshold. The rest of the
waiters should stay asleep when this happens.
--
All rights reversed.
I'm not sure we really are in I/O wait when waiting on this
queue, but there's a fair chance we may be, so this is a
reasonable change.
> Signed-off-by: KOSAKI Motohiro<kosaki....@jp.fujitsu.com>
Acked-by: Rik van Riel <ri...@redhat.com>
--
All rights reversed.
Reviewed-by: Rik van Riel <ri...@redhat.com>
--
All rights reversed.
The limit is per _zone_, so the number of concurrent reclaimers
is automatically scaled by the number of memory zones in the
system.
Scaling up the per-zone value as well looks like it could lead
to the kind of lock contention we are aiming to avoid in the
first place.
--
All rights reversed.
As with patch 4/8 I am not convinced that wake_up_all is
the correct thing to do.
Other than that:
Reviewed-by: Rik van Riel <ri...@redhat.com>
--
All rights reversed.
IIRC there is one reason - keeping equal pageout pressure
between zones.
However, it may be enough if just kswapd keeps evening out
the pressure, now that we limit the number of concurrent
direct reclaimers in the system.
Since kswapd does not use shrink_zones ...
> Signed-off-by: KOSAKI Motohiro<kosaki....@jp.fujitsu.com>
Reviewed-by: Rik van Riel <ri...@redhat.com>
--
All rights reversed.
>> It should be possible to avoid both of those issues at once, by
>> simply limiting how many processes are active in the page reclaim
>> code simultaneously.
>>
>
> This sounds like a very good argument against using direct reclaim at
> all.
Not completely possible. Processes that call try_to_free_pages
without __GFP_FS or __GFP_IO in their gfp_flags may be holding
some kind of lock that kswapd could end up waiting on.
That means those tasks cannot wait on kswapd, because a deadlock
could happen.
Having said that, maybe we can get away with making direct
reclaim a limited subset of what it does today. Kswapd could
be the only process scanning and unmapping ptes, submitting
IOs and scanning the active list. Direct reclaim could limit
itself to reclaiming clean inactive pages and sleeping in
congestion_wait().
--
All rights reversed.
> On Mon, Dec 14, 2009 at 09:24:40PM +0900, KOSAKI Motohiro wrote:
> >
> >
> > sleep_on() function is SMP and/or kernel preemption unsafe. we
> > shouldn't use it on new code.
>
> And the best way to archive this is to remove the function.
>
> In Linus' current tree I find:
>
> - 5 instances of sleep_on(), all in old and obscure block drivers
> - 2 instances of sleep_on_timeout(), both in old and obscure drivers
these should both die; the sleep_on() ones using BROKEN in Kconfig..
.. sleep_on() has not worked in the 2.6 series ever.... ;)
>
> - 28 instances of interruptible_sleep_on_timeout(), mostly in obscure
> drivers with a high concentration in the old oss core which should
> be killed anyway. And unfortunately a few relatively recent additions
> like the SGI xpc driver or usbvision driver
can we also make sure that checkpatch.pl catches any new addition?
(not saying checkpatch.pl is the end-all, but the people who do run it
at least now have a chance ;-)
--
Arjan van de Ven Intel Open Source Technology Centre
For development, discussion and tips for power savings,
visit http://www.lesswatts.org
A zone could be 64MB or 32GB. And the system could have 1 or 1024 CPUs.
-Andi
--
a...@linux.intel.com -- Speaking for myself only.
Rik, the latest patch appears to have a problem, although I don't know
what the problem is yet. When the system ran out of memory we saw
thousands of runnable processes and 100% system time:

 9420  2  29824  79856  62676  19564  0  0  0  0  8054  379  0 100  0  0  0
 9420  2  29824  79368  62292  19564  0  0  0  0  8691  413  0 100  0  0  0
 9421  1  29824  79780  61780  19820  0  0  0  0  8928  408  0 100  0  0  0

The system would not respond, so I don't know what's going on yet. I'll
add debug code to figure out why it's in that state as soon as I get
access to the hardware.
Larry
> Under very heavy multi-process workloads, like AIM7, the VM can
> get into trouble in a variety of ways. The trouble start when
> there are hundreds, or even thousands of processes active in the
> page reclaim code.
>
> Not only can the system suffer enormous slowdowns because of
> lock contention (and conditional reschedules) between thousands
> of processes in the page reclaim code, but each process will try
> to free up to SWAP_CLUSTER_MAX pages, even when the system already
> has lots of memory free.
>
> It should be possible to avoid both of those issues at once, by
> simply limiting how many processes are active in the page reclaim
> code simultaneously.
>
> If too many processes are active doing page reclaim in one zone,
> simply go to sleep in shrink_zone().
>
> On wakeup, check whether enough memory has been freed already
> before jumping into the page reclaim code ourselves. We want
> to use the same threshold here that is used in the page allocator
> for deciding whether or not to call the page reclaim code in the
> first place, otherwise some unlucky processes could end up freeing
> memory for the rest of the system.
>
> Reported-by: Larry Woodman <lwoo...@redhat.com>
> Signed-off-by: Rik van Riel <ri...@redhat.com>
>
> ---
> v2:
> - fix typos in sysctl.c and vm.txt
> - move the code in sysctl.c out from under the ifdef
> - only __GFP_FS|__GFP_IO tasks can wait
>
> Documentation/sysctl/vm.txt | 18 ++++++++++++++
> include/linux/mmzone.h | 4 +++
> include/linux/swap.h | 1 +
> kernel/sysctl.c | 7 +++++
> mm/page_alloc.c | 3 ++
> mm/vmscan.c | 40 +++++++++++++++++++++++++++++++++
> 6 files changed, 73 insertions(+), 0 deletions(-)
>
> diff --git a/Documentation/sysctl/vm.txt b/Documentation/sysctl/vm.txt
> index fc5790d..8bd1a96 100644
> --- a/Documentation/sysctl/vm.txt
> +++ b/Documentation/sysctl/vm.txt
> @@ -32,6 +32,7 @@ Currently, these files are in /proc/sys/vm:
> - legacy_va_layout
> - lowmem_reserve_ratio
> - max_map_count
> +- max_zone_concurrent_reclaimers
> - memory_failure_early_kill
> - memory_failure_recovery
> - min_free_kbytes
> @@ -278,6 +279,23 @@ The default value is 65536.
>
> =============================================================
>
> +max_zone_concurrent_reclaimers:
> +
> +The number of processes that are allowed to simultaneously reclaim
> +memory from a particular memory zone.
> +
> +With certain workloads, hundreds of processes end up in the page
> +reclaim code simultaneously. This can cause large slowdowns due
> +to lock contention, freeing of way too much memory and occasionally
> +false OOM kills.
> +
> +To avoid these problems, only allow a smaller number of processes
> +to reclaim pages from each memory zone simultaneously.
> +
> +The default value is 8.
> +
> +=============================================================
> +
> memory_failure_early_kill:
>
> Control how to kill processes when uncorrected memory error (typically
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index 30fe668..ed614b8 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -345,6 +345,10 @@ struct zone {
> /* Zone statistics */
> atomic_long_t vm_stat[NR_VM_ZONE_STAT_ITEMS];
>
> + /* Number of processes running page reclaim code on this zone. */
> + atomic_t concurrent_reclaimers;
> + wait_queue_head_t reclaim_wait;
> +
> /*
> * prev_priority holds the scanning priority for this zone. It is
> * defined as the scanning priority at which we achieved our reclaim
> diff --git a/include/linux/swap.h b/include/linux/swap.h
> index a2602a8..661eec7 100644
> --- a/include/linux/swap.h
> +++ b/include/linux/swap.h
> @@ -254,6 +254,7 @@ extern unsigned long shrink_all_memory(unsigned long nr_pages);
> extern int vm_swappiness;
> extern int remove_mapping(struct address_space *mapping, struct page *page);
> extern long vm_total_pages;
> +extern int max_zone_concurrent_reclaimers;
>
> #ifdef CONFIG_NUMA
> extern int zone_reclaim_mode;
> diff --git a/kernel/sysctl.c b/kernel/sysctl.c
> index 6ff0ae6..4ec17ed 100644
> --- a/kernel/sysctl.c
> +++ b/kernel/sysctl.c
> @@ -1271,6 +1271,13 @@ static struct ctl_table vm_table[] = {
> .extra2 = &one,
> },
> #endif
> + {
> + .procname = "max_zone_concurrent_reclaimers",
> + .data = &max_zone_concurrent_reclaimers,
> + .maxlen = sizeof(max_zone_concurrent_reclaimers),
> + .mode = 0644,
> + .proc_handler = &proc_dointvec,
> + },
>
> /*
> * NOTE: do not add new entries to this table unless you have read
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 11ae66e..ca9cae1 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -3852,6 +3852,9 @@ static void __paginginit free_area_init_core(struct pglist_data *pgdat,
>
> zone->prev_priority = DEF_PRIORITY;
>
> + atomic_set(&zone->concurrent_reclaimers, 0);
> + init_waitqueue_head(&zone->reclaim_wait);
> +
> zone_pcp_init(zone);
> for_each_lru(l) {
> INIT_LIST_HEAD(&zone->lru[l].list);
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 2bbee91..ecfe28c 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -40,6 +40,7 @@
> #include <linux/memcontrol.h>
> #include <linux/delayacct.h>
> #include <linux/sysctl.h>
> +#include <linux/wait.h>
>
> #include <asm/tlbflush.h>
> #include <asm/div64.h>
> @@ -129,6 +130,17 @@ struct scan_control {
> int vm_swappiness = 60;
> long vm_total_pages; /* The total number of pages which the VM controls */
>
> +/*
> + * Maximum number of processes concurrently running the page
> + * reclaim code in a memory zone. Having too many processes
> + * just results in them burning CPU time waiting for locks,
> + * so we're better off limiting page reclaim to a sane number
> + * of processes at a time. We do this per zone so local node
> + * reclaim on one NUMA node will not block other nodes from
> + * making progress.
> + */
> +int max_zone_concurrent_reclaimers = 8;
> +
> static LIST_HEAD(shrinker_list);
> static DECLARE_RWSEM(shrinker_rwsem);
>
> @@ -1600,6 +1612,31 @@ static void shrink_zone(int priority, struct zone *zone,
> struct zone_reclaim_stat *reclaim_stat = get_reclaim_stat(zone, sc);
> int noswap = 0;
>
> + if (!current_is_kswapd() && atomic_read(&zone->concurrent_reclaimers) >
> + max_zone_concurrent_reclaimers &&
> + (sc->gfp_mask & (__GFP_IO|__GFP_FS)) ==
> + (__GFP_IO|__GFP_FS)) {
> + /*
> + * Do not add to the lock contention if this zone has
> + * enough processes doing page reclaim already, since
> + * we would just make things slower.
> + */
> + sleep_on(&zone->reclaim_wait);
> +
> + /*
> + * If other processes freed enough memory while we waited,
> + * break out of the loop and go back to the allocator.
> + */
> + if (zone_watermark_ok(zone, sc->order, low_wmark_pages(zone),
> + 0, 0)) {
> + wake_up(&zone->reclaim_wait);
> + sc->nr_reclaimed += nr_to_reclaim;
> + return;
> + }
> + }
> +
> + atomic_inc(&zone->concurrent_reclaimers);
> +
> /* If we have no swap space, do not bother scanning anon pages. */
> if (!sc->may_swap || (nr_swap_pages <= 0)) {
> noswap = 1;
> @@ -1655,6 +1692,9 @@ static void shrink_zone(int priority, struct zone *zone,
> shrink_active_list(SWAP_CLUSTER_MAX, zone, sc, priority, 0);
>
> throttle_vm_writeout(sc->gfp_mask);
> +
> + atomic_dec(&zone->concurrent_reclaimers);
> + wake_up(&zone->reclaim_wait);
> }
>
> /*
>
> --
> concurrent_reclaimers limitation related code made mess to shrink_zone.
> To introduce helper function increase readability.
>
> Signed-off-by: KOSAKI Motohiro <kosaki....@jp.fujitsu.com>
Reviewed-by: Minchan Kim <minch...@gmail.com>
Looking over the patch series, it's a nice cleanup.
--
Kind regards,
Minchan Kim
>
>
> sleep_on() function is SMP and/or kernel preemption unsafe. we shouldn't
> use it on new code.
>
> Signed-off-by: KOSAKI Motohiro <kosaki....@jp.fujitsu.com>
It would be better to remove this function,
but deprecating it is enough for this patch series.
We should remove sleep_on() in another patch series.
Reviewed-by: Minchan Kim <minch...@gmail.com>
> sleep_on() is SMP and/or kernel preemption unsafe. This patch
> replace it with safe code.
>
> Signed-off-by: KOSAKI Motohiro <kosaki....@jp.fujitsu.com>
Reviewed-by: Minchan Kim <minch...@gmail.com>
I think it's a typo. The description in the changelog says we want "wake_up".
Otherwise, looks good to me.
Reviewed-by: Minchan Kim <minch...@gmail.com>
--
Kind regards,
Minchan Kim
> All task sleeping point in vmscan (e.g. congestion_wait) use
> io_schedule. then shrink_zone_begin use it too.
>
> Signed-off-by: KOSAKI Motohiro <kosaki....@jp.fujitsu.com>
> ---
> mm/vmscan.c | 2 +-
> 1 files changed, 1 insertions(+), 1 deletions(-)
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 3562a2d..0880668 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -1624,7 +1624,7 @@ static int shrink_zone_begin(struct zone *zone, struct scan_control *sc)
> max_zone_concurrent_reclaimers)
> break;
>
> - schedule();
> + io_schedule();
Hmm. We have many cond_resched() calls, which are not io_schedule(), in vmscan.c.
In addition, if the system doesn't have swap space and runs out of page cache
due to heavy memory pressure, the VM might scan and drop pages until priority is zero
or the zone is unreclaimable.
I think that would not be an I/O wait.
>
> /*
> * If other processes freed enough memory while we waited,
> --
> 1.6.5.2
>
>
>
--
Kind regards,
Minchan Kim
> When fork bomb invoke OOM Killer, almost task might start to reclaim and
> sleep on shrink_zone_begin(). if we use TASK_UNINTERRUPTIBLE, OOM killer
> can't kill such task. it mean we never recover from fork bomb.
>
> This patch fixes it.
>
> Signed-off-by: KOSAKI Motohiro <kosaki....@jp.fujitsu.com>
I doubt wake_up_all() too; I think it's a typo.
Otherwise, looks good to me.
Reviewed-by: Minchan Kim <minch...@gmail.com>
--
Sure. That's exactly my point.
Plus, balance_pgdat() scans only one node, so zone balancing is
meaningful there. But shrink_zones() scans all zones in all nodes; we don't
need inter-node balancing. It's vmscan's business.
> > Signed-off-by: KOSAKI Motohiro<kosaki....@jp.fujitsu.com>
>
> Reviewed-by: Rik van Riel <ri...@redhat.com>
Thanks.
> In OOM case, almost processes may be in vmscan. There isn't any reason
> the killed process continue allocation. process exiting free lots pages
> rather than greedy vmscan.
>
> Signed-off-by: KOSAKI Motohiro <kosaki....@jp.fujitsu.com>
> ---
> mm/page_alloc.c | 8 ++++++++
> 1 files changed, 8 insertions(+), 0 deletions(-)
>
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index ca9cae1..8a9cbaa 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -1878,6 +1878,14 @@ rebalance:
> goto got_pg;
>
> /*
> + * If the allocation is for userland page and we have fatal signal,
> + * there isn't any reason to continue allocation. instead, the task
> + * should exit soon.
> + */
> + if (fatal_signal_pending(current) && (gfp_mask & __GFP_HIGHMEM))
> + goto nopage;
If we jump to nopage, we hit dump_stack() and show_mem().
We can even meet the OOM killer, which might kill an innocent process.
> +
> + /*
> * If we failed to make any progress reclaiming, then we are
> * running out of options and have to consider going OOM
> */
> --
> 1.6.5.2
>
>
>
--
Kind regards,
Minchan Kim
>
> From latency view, There isn't any reason shrink_zones() continue to
> reclaim another zone's page if the task reclaimed enough lots pages.
>
> Signed-off-by: KOSAKI Motohiro <kosaki....@jp.fujitsu.com>
> ---
> mm/vmscan.c | 16 ++++++++++++----
> 1 files changed, 12 insertions(+), 4 deletions(-)
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 0880668..bf229d3 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -1654,7 +1654,7 @@ static void shrink_zone_end(struct zone *zone, struct scan_control *sc)
> /*
> * This is a basic per-zone page freer. Used by both kswapd and direct reclaim.
> */
> -static void shrink_zone(int priority, struct zone *zone,
> +static int shrink_zone(int priority, struct zone *zone,
> struct scan_control *sc)
> {
> unsigned long nr[NR_LRU_LISTS];
> @@ -1669,7 +1669,7 @@ static void shrink_zone(int priority, struct zone *zone,
>
> ret = shrink_zone_begin(zone, sc);
> if (ret)
> - return;
> + return ret;
>
> /* If we have no swap space, do not bother scanning anon pages. */
> if (!sc->may_swap || (nr_swap_pages <= 0)) {
> @@ -1692,6 +1692,7 @@ static void shrink_zone(int priority, struct zone *zone,
> &reclaim_stat->nr_saved_scan[l]);
> }
>
> + ret = 0;
> while (nr[LRU_INACTIVE_ANON] || nr[LRU_ACTIVE_FILE] ||
> nr[LRU_INACTIVE_FILE]) {
> for_each_evictable_lru(l) {
> @@ -1712,8 +1713,10 @@ static void shrink_zone(int priority, struct zone *zone,
> * with multiple processes reclaiming pages, the total
> * freeing target can get unreasonably large.
> */
> - if (nr_reclaimed >= nr_to_reclaim && priority < DEF_PRIORITY)
> + if (nr_reclaimed >= nr_to_reclaim && priority < DEF_PRIORITY) {
> + ret = -ERESTARTSYS;
Just a nitpick:
does shrink_zone's return value matter?
shrink_zones() never handles it.
> break;
> + }
> }
>
> sc->nr_reclaimed = nr_reclaimed;
> @@ -1727,6 +1730,8 @@ static void shrink_zone(int priority, struct zone *zone,
>
> throttle_vm_writeout(sc->gfp_mask);
> shrink_zone_end(zone, sc);
> +
> + return ret;
> }
>
> /*
> @@ -1751,6 +1756,7 @@ static void shrink_zones(int priority, struct zonelist *zonelist,
> enum zone_type high_zoneidx = gfp_zone(sc->gfp_mask);
> struct zoneref *z;
> struct zone *zone;
> + int ret;
>
> sc->all_unreclaimable = 1;
> for_each_zone_zonelist_nodemask(zone, z, zonelist, high_zoneidx,
> @@ -1780,7 +1786,9 @@ static void shrink_zones(int priority, struct zonelist *zonelist,
> priority);
> }
>
> - shrink_zone(priority, zone, sc);
> + ret = shrink_zone(priority, zone, sc);
> + if (ret)
> + return;
> }
> }
> --
> 1.6.5.2
>
>
>
As a matter of fact, I am worried about this patch.
My opinion is that we should put this patch aside until we can solve Larry's problem.
We could apply it in the future.
I don't want to see side effects while we focus on Larry's problem.
But if you mind my suggestion, I won't bother you with this nitpick.
Thanks for the great cleanup and for improving the VM, Kosaki. :)
shrink_zones() stops vmscan quickly if ret is non-zero.
If we have already reclaimed more than nr_to_reclaim pages, we can stop
vmscan.
> As a matter of fact, I am worried about this patch.
>
> My opinion is we put aside this patch until we can solve Larry's problem.
> We could apply this patch in future.
>
> I don't want to see the side effect while we focus Larry's problem.
> But If you mind my suggestion, I also will not bother you by this nitpick.
>
> Thanks for great cleanup and improving VM, Kosaki. :)
I agree that Larry's issue is the highest priority.
There are 9421 running processes. That means the concurrent task
limitation doesn't work well, hmm?
OK.
Actually, wake_up() and wake_up_all() aren't so different.
Even though we use wake_up(), each task wakes up the next task before it
tries to allocate memory, so it ends up behaving much like wake_up_all().
However, there is one difference: the recent scheduler latency improvement
effort reduced the default scheduler latency target. That means that if we
have lots of tasks in the running state, each task gets a very small time
slice, and too-frequent context switching decreases VM efficiency.
Thank you, Rik. I hadn't noticed that wake_up() performs better than
wake_up_all() on current kernels.
Subject: [PATCH 9/8] replace wake_up_all with wake_up
Fix typo.
Signed-off-by: KOSAKI Motohiro <kosaki....@jp.fujitsu.com>
---
mm/vmscan.c | 2 +-
1 files changed, 1 insertions(+), 1 deletions(-)
diff --git a/mm/vmscan.c b/mm/vmscan.c
index e5adb7a..b3b4e77 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1644,7 +1644,7 @@ static int shrink_zone_begin(struct zone *zone, struct scan_control *sc)
return 0;
found_lots_memory:
- wake_up_all(wq);
+ wake_up(wq);
stop_reclaim:
finish_wait(wq, &wait);
sc->nr_reclaimed += sc->nr_to_reclaim;
--
1.6.5.2
Which point do you oppose? Is not printing better?
cond_resched() doesn't mean sleeping on a wait queue; it's similar to a yield.
> In addition, if the system doesn't have swap space and runs out of page
> cache due to heavy memory pressure, the VM might scan & drop pages until
> priority is zero or the zone is unreclaimable.
>
> I think it would not be an I/O wait.
Two points.
1. For a long time, administrators have usually watched iowait% under heavy
memory pressure. I don't want to change that without a good reason.
2. iowait gives a small scheduler bonus, and a vmscan task should get more
time slice than a memory-consuming task; that improves VM stability.
But I agree the benefit isn't so big. If we have a good reason, I don't
oppose using schedule().
> > > /*
> > > + * If the allocation is for userland page and we have fatal signal,
> > > + * there isn't any reason to continue allocation. instead, the task
> > > + * should exit soon.
> > > + */
> > > + if (fatal_signal_pending(current) && (gfp_mask & __GFP_HIGHMEM))
> > > + goto nopage;
> >
> > If we jump nopage, we meets dump_stack and show_mem.
> > Even, we can meet OOM which might kill innocent process.
>
> Which point you oppose? noprint is better?
>
>
Sorry for not being clear.
My points were the following.
First,
I don't want to print.
Why do we print the stack and memory info when the process receives SIGKILL?
Second,
1) A process tries to allocate an anon page in do_anonymous_page.
2) The process receives SIGKILL.
3) The kernel doesn't allocate the page to the process, because of your patch.
4) do_anonymous_page returns VM_FAULT_OOM.
5) mm_fault_error is called.
6) out_of_memory is called.
7) It might kill an innocent task.
If I missed something, please correct me. :)
--
Kind regards,
Minchan Kim
Doh, you are completely right. I had forgotten the recent change in the
meaning of VM_FAULT_OOM.
Yes, this patch is crap. I need to remake it.
I got confused.
Thanks for correcting me. :)
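(For illustration only, a hedged sketch of one direction such a remake could take; this is not anything posted in this thread. Instead of failing the allocation for a SIGKILLed task, which bubbles up as VM_FAULT_OOM and reaches out_of_memory(), the dying task could be allowed to dip into the memory reserves so the fault completes and the task exits quickly. alloc_from_reserves() is a hypothetical stand-in for an ALLOC_NO_WATERMARKS attempt; gfp_mask, order and the got_pg label come from the slowpath hunk quoted above.)

	/*
	 * Hedged sketch, not the posted patch: keep a SIGKILLed task out
	 * of the nopage/OOM path by serving it from reserves instead of
	 * failing the allocation.
	 */
	if (fatal_signal_pending(current) && (gfp_mask & __GFP_HIGHMEM)) {
		page = alloc_from_reserves(gfp_mask, order); /* hypothetical */
		if (page)
			goto got_pg;
		/* Reserves empty too: fall through, do not goto nopage. */
	}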
>
>> In addition, if system doesn't have swap device space and out of page cache
>> due to heavy memory pressue, VM might scan & drop pages until priority is zero
>> or zone is unreclaimable.
>>
>> I think it would be not a IO wait.
>
> two point.
> 1. For long time, Administrator usually watch iowait% at heavy memory pressure. I
> don't hope change this without reasonable reason. 2. iowait makes scheduler
> bonus a bit, vmscan task should have many time slice than memory consume
> task. it improve VM stabilization.
AFAIK, the CFS scheduler doesn't give a bonus to I/O-wait tasks any more.
>
> but I agree the benefit isn't so big. if we have reasonable reason, I
> don't oppose use schedule().
>
>
>
>
--
Kind regards,
Minchan Kim
What happens to waiters should running tasks not allocate for a while?
> However, there are few difference. recent scheduler latency improvement
> effort reduce default scheduler latency target. it mean, if we have
> lots tasks of running state, the task have very few time slice. too
> frequently context switch decrease VM efficiency.
> Thank you, Rik. I didn't notice wake_up() makes better performance than
> wake_up_all() on current kernel.
Perhaps this is a spot where an explicit wake_up_all_nopreempt() would
be handy. Excessive wakeup preemption from wake_up_all() has long been
annoying when there are many waiters, but converting it to only have the
first wakeup be preemptive proved harmful to performance. Recent tweaks
will have aggravated the problem somewhat, but it's not new.
-Mike
Maybe something like below. I can also imagine that under _heavy_ vm
pressure, it'd likely be good for throughput to not provide for sleeper
fairness for these wakeups as well, as that increases vruntime spread,
and thus increases preemption with no benefit in sight.
---
include/linux/sched.h | 1 +
include/linux/wait.h | 3 +++
kernel/sched.c | 21 +++++++++++++++++++++
kernel/sched_fair.c | 2 +-
4 files changed, 26 insertions(+), 1 deletion(-)
Index: linux-2.6/include/linux/sched.h
===================================================================
--- linux-2.6.orig/include/linux/sched.h
+++ linux-2.6/include/linux/sched.h
@@ -1065,6 +1065,7 @@ struct sched_domain;
*/
#define WF_SYNC 0x01 /* waker goes to sleep after wakup */
#define WF_FORK 0x02 /* child wakeup after fork */
+#define WF_NOPREEMPT 0x04 /* wakeup is not preemptive */
struct sched_class {
const struct sched_class *next;
Index: linux-2.6/include/linux/wait.h
===================================================================
--- linux-2.6.orig/include/linux/wait.h
+++ linux-2.6/include/linux/wait.h
@@ -140,6 +140,7 @@ static inline void __remove_wait_queue(w
}
void __wake_up(wait_queue_head_t *q, unsigned int mode, int nr, void *key);
+void __wake_up_nopreempt(wait_queue_head_t *q, unsigned int mode, int nr, void *key);
void __wake_up_locked_key(wait_queue_head_t *q, unsigned int mode, void *key);
void __wake_up_sync_key(wait_queue_head_t *q, unsigned int mode, int nr,
void *key);
@@ -154,8 +155,10 @@ int out_of_line_wait_on_bit_lock(void *,
wait_queue_head_t *bit_waitqueue(void *, int);
#define wake_up(x) __wake_up(x, TASK_NORMAL, 1, NULL)
+#define wake_up_nopreempt(x) __wake_up_nopreempt(x, TASK_NORMAL, 1, NULL)
#define wake_up_nr(x, nr) __wake_up(x, TASK_NORMAL, nr, NULL)
#define wake_up_all(x) __wake_up(x, TASK_NORMAL, 0, NULL)
+#define wake_up_all_nopreempt(x) __wake_up_nopreempt(x, TASK_NORMAL, 0, NULL)
#define wake_up_locked(x) __wake_up_locked((x), TASK_NORMAL)
#define wake_up_interruptible(x) __wake_up(x, TASK_INTERRUPTIBLE, 1, NULL)
Index: linux-2.6/kernel/sched.c
===================================================================
--- linux-2.6.orig/kernel/sched.c
+++ linux-2.6/kernel/sched.c
@@ -5682,6 +5682,27 @@ void __wake_up(wait_queue_head_t *q, uns
}
EXPORT_SYMBOL(__wake_up);
+/**
+ * __wake_up_nopreempt - wake up threads blocked on a waitqueue.
+ * @q: the waitqueue
+ * @mode: which threads
+ * @nr_exclusive: how many wake-one or wake-many threads to wake up
+ * @key: is directly passed to the wakeup function
+ *
+ * It may be assumed that this function implies a write memory barrier before
+ * changing the task state if and only if any tasks are woken up.
+ */
+void __wake_up_nopreempt(wait_queue_head_t *q, unsigned int mode,
+ int nr_exclusive, void *key)
+{
+ unsigned long flags;
+
+ spin_lock_irqsave(&q->lock, flags);
+ __wake_up_common(q, mode, nr_exclusive, WF_NOPREEMPT, key);
+ spin_unlock_irqrestore(&q->lock, flags);
+}
+EXPORT_SYMBOL(__wake_up_nopreempt);
+
/*
* Same as __wake_up but called with the spinlock in wait_queue_head_t held.
*/
Index: linux-2.6/kernel/sched_fair.c
===================================================================
--- linux-2.6.orig/kernel/sched_fair.c
+++ linux-2.6/kernel/sched_fair.c
@@ -1709,7 +1709,7 @@ static void check_preempt_wakeup(struct
pse->avg_overlap < sysctl_sched_migration_cost)
goto preempt;
- if (!sched_feat(WAKEUP_PREEMPT))
+ if (!sched_feat(WAKEUP_PREEMPT) || (wake_flags & WF_NOPREEMPT))
return;
update_curr(cfs_rq);
Copy/pasting some methods, and hardcoding futexes, where I know vmark
loads to the point of ridiculous on my little box, it's good for ~17%
throughput boost. Used prudently, something along these lines could
save some thrashing when core code knows it's handling a surge. It
would have a very negative effect at low to modest load though.
Hohum.
---
include/linux/completion.h | 2
include/linux/sched.h | 10 ++-
include/linux/wait.h | 3
kernel/sched.c | 140 ++++++++++++++++++++++++++++++++++++---------
kernel/sched_fair.c | 32 +++++-----
kernel/sched_idletask.c | 2
kernel/sched_rt.c | 6 -
7 files changed, 146 insertions(+), 49 deletions(-)
Index: linux-2.6/include/linux/sched.h
===================================================================
--- linux-2.6.orig/include/linux/sched.h
+++ linux-2.6/include/linux/sched.h
@@ -1065,12 +1065,16 @@ struct sched_domain;
*/
#define WF_SYNC 0x01 /* waker goes to sleep after wakup */
#define WF_FORK 0x02 /* child wakeup after fork */
+#define WF_BATCH 0x04 /* batch wakeup, not preemptive */
+#define WF_REQUEUE 0x00 /* task requeue */
+#define WF_WAKE 0x10 /* task waking */
+#define WF_SLEEP 0x20 /* task going to sleep */
struct sched_class {
const struct sched_class *next;
- void (*enqueue_task) (struct rq *rq, struct task_struct *p, int wakeup);
- void (*dequeue_task) (struct rq *rq, struct task_struct *p, int sleep);
+ void (*enqueue_task) (struct rq *rq, struct task_struct *p, int flags);
+ void (*dequeue_task) (struct rq *rq, struct task_struct *p, int flags);
void (*yield_task) (struct rq *rq);
void (*check_preempt_curr) (struct rq *rq, struct task_struct *p, int flags);
@@ -2028,6 +2032,8 @@ extern void do_timer(unsigned long ticks
extern int wake_up_state(struct task_struct *tsk, unsigned int state);
extern int wake_up_process(struct task_struct *tsk);
+extern int wake_up_state_batch(struct task_struct *tsk, unsigned int state);
+extern int wake_up_process_batch(struct task_struct *tsk);
extern void wake_up_new_task(struct task_struct *tsk,
unsigned long clone_flags);
#ifdef CONFIG_SMP
Index: linux-2.6/include/linux/wait.h
===================================================================
--- linux-2.6.orig/include/linux/wait.h
+++ linux-2.6/include/linux/wait.h
@@ -140,6 +140,7 @@ static inline void __remove_wait_queue(w
}
void __wake_up(wait_queue_head_t *q, unsigned int mode, int nr, void *key);
+void __wake_up_batch(wait_queue_head_t *q, unsigned int mode, int nr, void *key);
void __wake_up_locked_key(wait_queue_head_t *q, unsigned int mode, void *key);
void __wake_up_sync_key(wait_queue_head_t *q, unsigned int mode, int nr,
void *key);
@@ -154,8 +155,10 @@ int out_of_line_wait_on_bit_lock(void *,
wait_queue_head_t *bit_waitqueue(void *, int);
#define wake_up(x) __wake_up(x, TASK_NORMAL, 1, NULL)
+#define wake_up_batch(x) __wake_up_batch(x, TASK_NORMAL, 1, NULL)
#define wake_up_nr(x, nr) __wake_up(x, TASK_NORMAL, nr, NULL)
#define wake_up_all(x) __wake_up(x, TASK_NORMAL, 0, NULL)
+#define wake_up_all_batch(x) __wake_up_batch(x, TASK_NORMAL, 0, NULL)
#define wake_up_locked(x) __wake_up_locked((x), TASK_NORMAL)
#define wake_up_interruptible(x) __wake_up(x, TASK_INTERRUPTIBLE, 1, NULL)
Index: linux-2.6/kernel/sched.c
===================================================================
--- linux-2.6.orig/kernel/sched.c
+++ linux-2.6/kernel/sched.c
@@ -1392,7 +1392,7 @@ static const u32 prio_to_wmult[40] = {
/* 15 */ 119304647, 148102320, 186737708, 238609294, 286331153,
};
-static void activate_task(struct rq *rq, struct task_struct *p, int wakeup);
+static void activate_task(struct rq *rq, struct task_struct *p, int flags);
/*
* runqueue iterator, to support SMP load-balancing between different
@@ -1962,24 +1962,24 @@ static int effective_prio(struct task_st
/*
* activate_task - move a task to the runqueue.
*/
-static void activate_task(struct rq *rq, struct task_struct *p, int wakeup)
+static void activate_task(struct rq *rq, struct task_struct *p, int flags)
{
if (task_contributes_to_load(p))
rq->nr_uninterruptible--;
- enqueue_task(rq, p, wakeup);
+ enqueue_task(rq, p, flags);
inc_nr_running(rq);
}
/*
* deactivate_task - remove a task from the runqueue.
*/
-static void deactivate_task(struct rq *rq, struct task_struct *p, int sleep)
+static void deactivate_task(struct rq *rq, struct task_struct *p, int flags)
{
if (task_contributes_to_load(p))
rq->nr_uninterruptible++;
- dequeue_task(rq, p, sleep);
+ dequeue_task(rq, p, flags);
dec_nr_running(rq);
}
@@ -2415,7 +2415,7 @@ out_activate:
schedstat_inc(p, se.nr_wakeups_local);
else
schedstat_inc(p, se.nr_wakeups_remote);
- activate_task(rq, p, 1);
+ activate_task(rq, p, wake_flags);
success = 1;
/*
@@ -2474,13 +2474,35 @@ out:
*/
int wake_up_process(struct task_struct *p)
{
- return try_to_wake_up(p, TASK_ALL, 0);
+ return try_to_wake_up(p, TASK_ALL, WF_WAKE);
}
EXPORT_SYMBOL(wake_up_process);
+/**
+ * wake_up_process_batch - Wake up a specific process
+ * @p: The process to be woken up.
+ *
+ * Attempt to wake up the nominated process and move it to the set of runnable
+ * processes. Returns 1 if the process was woken up, 0 if it was already
+ * running.
+ *
+ * It may be assumed that this function implies a write memory barrier before
+ * changing the task state if and only if any tasks are woken up.
+ */
+int wake_up_process_batch(struct task_struct *p)
+{
+ return try_to_wake_up(p, TASK_ALL, WF_WAKE|WF_BATCH);
+}
+EXPORT_SYMBOL(wake_up_process_batch);
+
int wake_up_state(struct task_struct *p, unsigned int state)
{
- return try_to_wake_up(p, state, 0);
+ return try_to_wake_up(p, state, WF_WAKE);
+}
+
+int wake_up_state_batch(struct task_struct *p, unsigned int state)
+{
+ return try_to_wake_up(p, state, WF_WAKE|WF_BATCH);
}
/*
@@ -2628,7 +2650,7 @@ void wake_up_new_task(struct task_struct
rq = task_rq_lock(p, &flags);
BUG_ON(p->state != TASK_RUNNING);
update_rq_clock(rq);
- activate_task(rq, p, 0);
+ activate_task(rq, p, WF_WAKE|WF_FORK);
trace_sched_wakeup_new(rq, p, 1);
check_preempt_curr(rq, p, WF_FORK);
#ifdef CONFIG_SMP
@@ -3156,9 +3178,9 @@ void sched_exec(void)
static void pull_task(struct rq *src_rq, struct task_struct *p,
struct rq *this_rq, int this_cpu)
{
- deactivate_task(src_rq, p, 0);
+ deactivate_task(src_rq, p, WF_REQUEUE);
set_task_cpu(p, this_cpu);
- activate_task(this_rq, p, 0);
+ activate_task(this_rq, p, WF_REQUEUE);
check_preempt_curr(this_rq, p, 0);
}
@@ -5468,7 +5490,7 @@ need_resched_nonpreemptible:
if (unlikely(signal_pending_state(prev->state, prev)))
prev->state = TASK_RUNNING;
else
- deactivate_task(rq, prev, 1);
+ deactivate_task(rq, prev, WF_SLEEP);
switch_count = &prev->nvcsw;
}
@@ -5634,7 +5656,7 @@ asmlinkage void __sched preempt_schedule
int default_wake_function(wait_queue_t *curr, unsigned mode, int wake_flags,
void *key)
{
- return try_to_wake_up(curr->private, mode, wake_flags);
+ return try_to_wake_up(curr->private, mode, wake_flags|WF_WAKE);
}
EXPORT_SYMBOL(default_wake_function);
@@ -5677,22 +5699,43 @@ void __wake_up(wait_queue_head_t *q, uns
unsigned long flags;
spin_lock_irqsave(&q->lock, flags);
- __wake_up_common(q, mode, nr_exclusive, 0, key);
+ __wake_up_common(q, mode, nr_exclusive, WF_WAKE, key);
spin_unlock_irqrestore(&q->lock, flags);
}
EXPORT_SYMBOL(__wake_up);
+/**
+ * __wake_up_batch - wake up threads blocked on a waitqueue.
+ * @q: the waitqueue
+ * @mode: which threads
+ * @nr_exclusive: how many wake-one or wake-many threads to wake up
+ * @key: is directly passed to the wakeup function
+ *
+ * It may be assumed that this function implies a write memory barrier before
+ * changing the task state if and only if any tasks are woken up.
+ */
+void __wake_up_batch(wait_queue_head_t *q, unsigned int mode,
+ int nr_exclusive, void *key)
+{
+ unsigned long flags;
+
+ spin_lock_irqsave(&q->lock, flags);
+ __wake_up_common(q, mode, nr_exclusive, WF_WAKE|WF_BATCH, key);
+ spin_unlock_irqrestore(&q->lock, flags);
+}
+EXPORT_SYMBOL(__wake_up_batch);
+
/*
* Same as __wake_up but called with the spinlock in wait_queue_head_t held.
*/
void __wake_up_locked(wait_queue_head_t *q, unsigned int mode)
{
- __wake_up_common(q, mode, 1, 0, NULL);
+ __wake_up_common(q, mode, 1, WF_WAKE, NULL);
}
void __wake_up_locked_key(wait_queue_head_t *q, unsigned int mode, void *key)
{
- __wake_up_common(q, mode, 1, 0, key);
+ __wake_up_common(q, mode, 1, WF_WAKE, key);
}
/**
@@ -5716,7 +5759,7 @@ void __wake_up_sync_key(wait_queue_head_
int nr_exclusive, void *key)
{
unsigned long flags;
- int wake_flags = WF_SYNC;
+ int wake_flags = WF_WAKE|WF_SYNC;
if (unlikely(!q))
return;
@@ -5757,12 +5800,35 @@ void complete(struct completion *x)
spin_lock_irqsave(&x->wait.lock, flags);
x->done++;
- __wake_up_common(&x->wait, TASK_NORMAL, 1, 0, NULL);
+ __wake_up_common(&x->wait, TASK_NORMAL, 1, WF_WAKE, NULL);
spin_unlock_irqrestore(&x->wait.lock, flags);
}
EXPORT_SYMBOL(complete);
/**
+ * complete_batch: - signals a single thread waiting on this completion
+ * @x: holds the state of this particular completion
+ *
+ * This will wake up a single thread waiting on this completion. Threads will be
+ * awakened in the same order in which they were queued.
+ *
+ * See also complete_all(), wait_for_completion() and related routines.
+ *
+ * It may be assumed that this function implies a write memory barrier before
+ * changing the task state if and only if any tasks are woken up.
+ */
+void complete_batch(struct completion *x)
+{
+ unsigned long flags;
+
+ spin_lock_irqsave(&x->wait.lock, flags);
+ x->done++;
+ __wake_up_common(&x->wait, TASK_NORMAL, 1, WF_WAKE|WF_BATCH, NULL);
+ spin_unlock_irqrestore(&x->wait.lock, flags);
+}
+EXPORT_SYMBOL(complete_batch);
+
+/**
* complete_all: - signals all threads waiting on this completion
* @x: holds the state of this particular completion
*
@@ -5777,11 +5843,31 @@ void complete_all(struct completion *x)
spin_lock_irqsave(&x->wait.lock, flags);
x->done += UINT_MAX/2;
- __wake_up_common(&x->wait, TASK_NORMAL, 0, 0, NULL);
+ __wake_up_common(&x->wait, TASK_NORMAL, 0, WF_WAKE, NULL);
spin_unlock_irqrestore(&x->wait.lock, flags);
}
EXPORT_SYMBOL(complete_all);
+/**
+ * complete_all_batch: - signals all threads waiting on this completion
+ * @x: holds the state of this particular completion
+ *
+ * This will wake up all threads waiting on this particular completion event.
+ *
+ * It may be assumed that this function implies a write memory barrier before
+ * changing the task state if and only if any tasks are woken up.
+ */
+void complete_all_batch(struct completion *x)
+{
+ unsigned long flags;
+
+ spin_lock_irqsave(&x->wait.lock, flags);
+ x->done += UINT_MAX/2;
+ __wake_up_common(&x->wait, TASK_NORMAL, 0, WF_WAKE|WF_BATCH, NULL);
+ spin_unlock_irqrestore(&x->wait.lock, flags);
+}
+EXPORT_SYMBOL(complete_all_batch);
+
static inline long __sched
do_wait_for_common(struct completion *x, long timeout, int state)
{
@@ -6344,7 +6430,7 @@ recheck:
on_rq = p->se.on_rq;
running = task_current(rq, p);
if (on_rq)
- deactivate_task(rq, p, 0);
+ deactivate_task(rq, p, WF_REQUEUE);
if (running)
p->sched_class->put_prev_task(rq, p);
@@ -6356,7 +6442,7 @@ recheck:
if (running)
p->sched_class->set_curr_task(rq);
if (on_rq) {
- activate_task(rq, p, 0);
+ activate_task(rq, p, WF_REQUEUE);
check_class_changed(rq, p, prev_class, oldprio, running);
}
@@ -7172,11 +7258,11 @@ static int __migrate_task(struct task_st
on_rq = p->se.on_rq;
if (on_rq)
- deactivate_task(rq_src, p, 0);
+ deactivate_task(rq_src, p, WF_REQUEUE);
set_task_cpu(p, dest_cpu);
if (on_rq) {
- activate_task(rq_dest, p, 0);
+ activate_task(rq_dest, p, WF_REQUEUE);
check_preempt_curr(rq_dest, p, 0);
}
done:
@@ -7368,7 +7454,7 @@ void sched_idle_next(void)
__setscheduler(rq, p, SCHED_FIFO, MAX_RT_PRIO-1);
update_rq_clock(rq);
- activate_task(rq, p, 0);
+ activate_task(rq, p, WF_REQUEUE);
raw_spin_unlock_irqrestore(&rq->lock, flags);
}
@@ -7707,7 +7793,7 @@ migration_call(struct notifier_block *nf
/* Idle task back to normal (off runqueue, low prio) */
raw_spin_lock_irq(&rq->lock);
update_rq_clock(rq);
- deactivate_task(rq, rq->idle, 0);
+ deactivate_task(rq, rq->idle, WF_REQUEUE);
__setscheduler(rq, rq->idle, SCHED_NORMAL, 0);
rq->idle->sched_class = &idle_sched_class;
migrate_dead_tasks(cpu);
@@ -9698,10 +9784,10 @@ static void normalize_task(struct rq *rq
update_rq_clock(rq);
on_rq = p->se.on_rq;
if (on_rq)
- deactivate_task(rq, p, 0);
+ deactivate_task(rq, p, WF_REQUEUE);
__setscheduler(rq, p, SCHED_NORMAL, 0);
if (on_rq) {
- activate_task(rq, p, 0);
+ activate_task(rq, p, WF_REQUEUE);
resched_task(rq->curr);
}
}
Index: linux-2.6/kernel/sched_fair.c
===================================================================
--- linux-2.6.orig/kernel/sched_fair.c
+++ linux-2.6/kernel/sched_fair.c
@@ -722,7 +722,7 @@ static void check_spread(struct cfs_rq *
}
static void
-place_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int initial)
+place_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
{
u64 vruntime = cfs_rq->min_vruntime;
@@ -732,11 +732,11 @@ place_entity(struct cfs_rq *cfs_rq, stru
* little, place the new task so that it fits in the slot that
* stays open at the end.
*/
- if (initial && sched_feat(START_DEBIT))
+ if (flags & WF_FORK && sched_feat(START_DEBIT))
vruntime += sched_vslice(cfs_rq, se);
/* sleeps up to a single latency don't count. */
- if (!initial && sched_feat(FAIR_SLEEPERS)) {
+ if (!(flags & (WF_FORK|WF_BATCH)) && sched_feat(FAIR_SLEEPERS)) {
unsigned long thresh = sysctl_sched_latency;
/*
@@ -766,7 +766,7 @@ place_entity(struct cfs_rq *cfs_rq, stru
}
static void
-enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int wakeup)
+enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
{
/*
* Update run-time statistics of the 'current'.
@@ -774,8 +774,8 @@ enqueue_entity(struct cfs_rq *cfs_rq, st
update_curr(cfs_rq);
account_entity_enqueue(cfs_rq, se);
- if (wakeup) {
- place_entity(cfs_rq, se, 0);
+ if (flags & WF_WAKE) {
+ place_entity(cfs_rq, se, flags);
enqueue_sleeper(cfs_rq, se);
}
@@ -801,7 +801,7 @@ static void clear_buddies(struct cfs_rq
}
static void
-dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int sleep)
+dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
{
/*
* Update run-time statistics of the 'current'.
@@ -809,7 +809,7 @@ dequeue_entity(struct cfs_rq *cfs_rq, st
update_curr(cfs_rq);
update_stats_dequeue(cfs_rq, se);
- if (sleep) {
+ if (flags & WF_SLEEP) {
#ifdef CONFIG_SCHEDSTATS
if (entity_is_task(se)) {
struct task_struct *tsk = task_of(se);
@@ -1034,7 +1034,7 @@ static inline void hrtick_update(struct
* increased. Here we update the fair scheduling stats and
* then put the task into the rbtree:
*/
-static void enqueue_task_fair(struct rq *rq, struct task_struct *p, int wakeup)
+static void enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
{
struct cfs_rq *cfs_rq;
struct sched_entity *se = &p->se;
@@ -1043,8 +1043,8 @@ static void enqueue_task_fair(struct rq
if (se->on_rq)
break;
cfs_rq = cfs_rq_of(se);
- enqueue_entity(cfs_rq, se, wakeup);
- wakeup = 1;
+ enqueue_entity(cfs_rq, se, flags);
+ flags |= WF_WAKE;
}
hrtick_update(rq);
@@ -1055,18 +1055,18 @@ static void enqueue_task_fair(struct rq
* decreased. We remove the task from the rbtree and
* update the fair scheduling stats:
*/
-static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int sleep)
+static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
{
struct cfs_rq *cfs_rq;
struct sched_entity *se = &p->se;
for_each_sched_entity(se) {
cfs_rq = cfs_rq_of(se);
- dequeue_entity(cfs_rq, se, sleep);
+ dequeue_entity(cfs_rq, se, flags);
/* Don't dequeue parent if it has other entities besides us */
if (cfs_rq->load.weight)
break;
- sleep = 1;
+ flags |= WF_SLEEP;
}
hrtick_update(rq);
@@ -1709,7 +1709,7 @@ static void check_preempt_wakeup(struct
pse->avg_overlap < sysctl_sched_migration_cost)
goto preempt;
- if (!sched_feat(WAKEUP_PREEMPT))
+ if (!sched_feat(WAKEUP_PREEMPT) || (wake_flags & WF_BATCH))
return;
update_curr(cfs_rq);
@@ -1964,7 +1964,7 @@ static void task_fork_fair(struct task_s
if (curr)
se->vruntime = curr->vruntime;
- place_entity(cfs_rq, se, 1);
+ place_entity(cfs_rq, se, WF_FORK);
if (sysctl_sched_child_runs_first && curr && entity_before(curr, se)) {
/*
Index: linux-2.6/include/linux/completion.h
===================================================================
--- linux-2.6.orig/include/linux/completion.h
+++ linux-2.6/include/linux/completion.h
@@ -88,6 +88,8 @@ extern bool completion_done(struct compl
extern void complete(struct completion *);
extern void complete_all(struct completion *);
+extern void complete_batch(struct completion *);
+extern void complete_all_batch(struct completion *);
/**
* INIT_COMPLETION: - reinitialize a completion structure
Index: linux-2.6/kernel/sched_rt.c
===================================================================
--- linux-2.6.orig/kernel/sched_rt.c
+++ linux-2.6/kernel/sched_rt.c
@@ -878,11 +878,11 @@ static void dequeue_rt_entity(struct sch
/*
* Adding/removing a task to/from a priority array:
*/
-static void enqueue_task_rt(struct rq *rq, struct task_struct *p, int wakeup)
+static void enqueue_task_rt(struct rq *rq, struct task_struct *p, int flags)
{
struct sched_rt_entity *rt_se = &p->rt;
- if (wakeup)
+ if (flags & WF_WAKE)
rt_se->timeout = 0;
enqueue_rt_entity(rt_se);
@@ -891,7 +891,7 @@ static void enqueue_task_rt(struct rq *r
enqueue_pushable_task(rq, p);
}
-static void dequeue_task_rt(struct rq *rq, struct task_struct *p, int sleep)
+static void dequeue_task_rt(struct rq *rq, struct task_struct *p, int flags)
{
struct sched_rt_entity *rt_se = &p->rt;
Index: linux-2.6/kernel/sched_idletask.c
===================================================================
--- linux-2.6.orig/kernel/sched_idletask.c
+++ linux-2.6/kernel/sched_idletask.c
@@ -32,7 +32,7 @@ static struct task_struct *pick_next_tas
* message if some code attempts to do it:
*/
static void
-dequeue_task_idle(struct rq *rq, struct task_struct *p, int sleep)
+dequeue_task_idle(struct rq *rq, struct task_struct *p, int flags)
{
raw_spin_unlock_irq(&rq->lock);
pr_err("bad: scheduling from the idle thread!\n");
>> Actually, wake_up() and wake_up_all() aren't different so much.
>> Although we use wake_up(), the task wake up next task before
>> try to alloate memory. then, it's similar to wake_up_all().
That is a good point. Maybe processes need to wait a little
in this if() condition, before the wake_up(). That would give
the previous process a chance to allocate memory and we can
avoid waking up too many processes.
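(A minimal sketch of that idea, against the shrink_zone_begin() hunk quoted earlier in this thread; the one-jiffy delay is an arbitrary placeholder, not a tuned value.)

found_lots_memory:
	/*
	 * Sketch only: give the task that was just woken a moment to take
	 * its allocation before we pass the "plenty free" baton on, so a
	 * whole chain of waiters does not wake against the same free pages.
	 */
	schedule_timeout_uninterruptible(1);
	wake_up(wq);
stop_reclaim:
	finish_wait(wq, &wait);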
> What happens to waiters should running tasks not allocate for a while?
When a waiter is woken up, it will either:
1) see that there is enough free memory and wake up the next guy, or
2) run shrink_zone and wake up the next guy
Either way, the processes that just got woken up will ensure that
the sleepers behind them in the queue will get woken up.
--
All rights reversed.
OK, that more or less covers my worry. From the scheduler standpoint
though, you're better off turning them all loose and letting them race,
_with_ the caveat that thundering herds do indeed make thunder (the
reason for the patch). Turning them loose piecemeal spreads things out
over time, which prolongs surge operations, possibly much longer than
necessary. We had the same thing long ago with everyone waiting for
kswapd to do all the work. Sticky problem, this roll-down to inevitable
wait.
Pondering, I think I'd at least wake NR_CPUS. If there's not enough to
go round, oh darn, but if there is, you have full utilization quicker.
$.02.
-Mike
That depends on what the other CPUs in the system are doing.
If they were doing work, you've just wasted some resources.
--
All rights reversed.
If we really need to wait a bit, Mike's wake_up_batch is best, I think.
It means:
- if another CPU is idle, one process is woken up soon; IOW, it doesn't
create meaningless idle time.
- if another CPU is busy, the woken process doesn't start running for a
while; by then, zone_watermark_ok() can calculate a correct value.
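(Sketch of how that would look at the reclaim wait queue, assuming Mike's batch-wakeup patch above were applied; wake_up_batch() is taken from that patch, and the surrounding labels from the shrink_zone_begin() hunk earlier in the thread.)

found_lots_memory:
	/*
	 * Batch (non-preemptive) wakeup: an idle CPU picks the woken task
	 * up immediately, a busy CPU lets the current allocator finish
	 * first, so zone_watermark_ok() sees a settled free-page count.
	 */
	wake_up_batch(wq);
stop_reclaim:
	finish_wait(wq, &wait);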
> if we really need wait a bit, Mike's wake_up_batch is best, I think.
> It mean
> - if another CPU is idle, wake up one process soon. iow, it don't
> make meaningless idle.
> - if another CPU is busy, woken process don't start to run awhile.
> then, zone_watermark_ok() can calculate correct value.
Agreed, that should work great.
Along those lines, there are also NEWIDLE balancing considerations. That
idle may result in a task being pulled, which may or may not hurt a bit.
'course, if you're jamming up on memory allocation, that's the least of
your worries, but every idle avoided is potentially a pull avoided.
Just a thought.
-Mike
Yes, sorry for the delay, but I don't have direct or exclusive access to
these large systems and workloads. As far as I can tell this patch series
does help prevent total system hangs running AIM7. I did have trouble
with the early postings, mostly due to their use of sleep_on() and
wakeup(), but those appear to be fixed.
However, I did add more debug code and see ~10000 processes blocked in
shrink_zone_begin(). This is expected but bothersome; practically all of
the processes remain runnable for the entire duration of these AIM runs.
Collectively all these runnable processes overwhelm the VM system. There
are many more runnable processes now than were ever seen before, ~10000
now versus ~100 on RHEL5 (2.6.18 based). So, we have also been
experimenting with some of the CFS scheduler tunables to see if this is
responsible...
> plus, I integrated the page_referenced() improvement patch series and
> the limit concurrent reclaimers patch series privately. I plan to post
> them to lkml this weekend. Comments are welcome.
>
The only problem I noticed with the page_referenced patch was an increase
in try_to_unmap() failures, which causes more re-activations. This is very
obvious with the tracepoints I have posted over the past few months, but
they were never included. I didn't get a chance to figure out the exact
cause because of limited access to the hardware and workload. This patch
series also seems to help with the overall stalls in the VM system.
>
> changelog from last post:
> - remade the "limit concurrent reclaimers" series and sorted out its patch order
> - changed the default max concurrent reclaimers from 8 to num_online_cpu().
>   In the last post Andi's only negative comment was that he dislikes a
>   constant default value; besides, anything above num_online_cpu() is
>   really silly. IOW, this is a really low-risk change.
>   (We might still change the default value. As far as I have measured,
>   a small value gives better benchmark results, but I'm not sure a small
>   value doesn't cause a regression.)
> - Improved OOM and SIGKILL behavior.
>   (RHEL5's vmscan has TIF_MEMDIE recovery logic, but current mainline
>   doesn't. I don't want RHEL6 to have a regression.)
>
>
>
>
>> On Fri, 2009-12-11 at 16:46 -0500, Rik van Riel wrote:
>>
>> Rik, the latest patch appears to have a problem although I don't know
>> what the problem is yet. When the system ran out of memory we see
>> thousands of runnable processes and 100% system time:
>>
>> 9420 2 29824 79856 62676 19564 0 0 0 0 8054 379 0 100 0 0 0
>> 9420 2 29824 79368 62292 19564 0 0 0 0 8691 413 0 100 0 0 0
>> 9421 1 29824 79780 61780 19820 0 0 0 0 8928 408 0 100 0 0 0
>>
>> The system would not respond so I don't know what's going on yet. I'll
>> add debug code to figure out why it's in that state as soon as I get
>> access to the hardware.
>>
This was in response to Rik's first patch and seems to be fixed by the
latest patch set.
Finally, having said all that, the system still struggles to reclaim
memory with ~10000 processes trying at the same time; you fix one
bottleneck and it moves somewhere else. The latest run showed all but one
running process spinning in page_lock_anon_vma(), trying for the anon_vma
lock. I noticed that there are ~5000 vmas linked to one anon_vma, which
seems excessive!!!
I changed anon_vma->lock to an rwlock_t and page_lock_anon_vma() to use
read_lock(), so multiple callers can execute the page_referenced_anon
code. This seems to help quite a bit.
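(Roughly what that experiment looks like; a sketch reconstructed from the description above, not the actual diff. The struct layout and the PAGE_MAPPING_ANON handling follow the mainline mm/rmap.c of this era; only the lock type and the read_lock() in page_lock_anon_vma() change.)

struct anon_vma {
	rwlock_t lock;			/* was: spinlock_t lock */
	struct list_head head;		/* list of private "related" vmas */
};

static struct anon_vma *page_lock_anon_vma(struct page *page)
{
	struct anon_vma *anon_vma;
	unsigned long anon_mapping;

	rcu_read_lock();
	anon_mapping = (unsigned long)page->mapping;
	if (!(anon_mapping & PAGE_MAPPING_ANON))
		goto out;
	if (!page_mapped(page))
		goto out;

	anon_vma = (struct anon_vma *)(anon_mapping - PAGE_MAPPING_ANON);
	read_lock(&anon_vma->lock);	/* was: spin_lock(); readers now run in parallel */
	return anon_vma;
out:
	rcu_read_unlock();
	return NULL;
}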
>> Larry
>>> The system would not respond so I dont know whats going on yet. I'll
>>> add debug code to figure out why its in that state as soon as I get
>>> access to the hardware.
>
> This was in response to Rik's first patch and seems to be fixed by the
> latest path set.
>
> Finally, having said all that, the system still struggles reclaiming
> memory with
> ~10000 processes trying at the same time, you fix one bottleneck and it
> moves
> somewhere else. The latest run showed all but one running process
> spinning in
> page_lock_anon_vma() trying for the anon_vma_lock. I noticed that there are
> ~5000 vma's linked to one anon_vma, this seems excessive!!!
I have some ideas on how to keep processes waiting better
on the per zone reclaim_wait waitqueue.
For one, we should probably only do the lots-free wakeup
if we have more than zone->pages_high free pages in the
zone - having each of the waiters free some memory one
after another should not be a problem as long as we do
not have too much free memory in the zone.
Currently it's a hair trigger, with the threshold for
processes going into the page reclaim path and processes
exiting it "plenty free" being exactly the same.
Some hysteresis there could help.
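(A hedged sketch of that hysteresis, not a posted patch; the helper name is made up for illustration, and whether the tree spells the threshold zone->pages_high or high_wmark_pages(zone) depends on the kernel version.)

/*
 * Enter reclaim at the usual allocation watermark, but only declare
 * "plenty free", and wake the reclaim_wait queue, once the zone is
 * above the high watermark, so the entry and exit thresholds are no
 * longer the same hair trigger.
 */
static int zone_reclaim_plenty_free(struct zone *zone, struct scan_control *sc)
{
	return zone_watermark_ok(zone, sc->order, high_wmark_pages(zone),
				 0, 0);
}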
--
All rights reversed.
Larry Woodman wrote:
> Finally, having said all that, the system still struggles reclaiming
> memory with
> ~10000 processes trying at the same time, you fix one bottleneck and it
> moves
> somewhere else. The latest run showed all but one running process
> spinning in
> page_lock_anon_vma() trying for the anon_vma_lock. I noticed that there
> are
> ~5000 vma's linked to one anon_vma, this seems excessive!!!
>
> I changed the anon_vma->lock to a rwlock_t and page_lock_anon_vma() to use
> read_lock() so multiple callers could execute the page_reference_anon code.
> This seems to help quite a bit.
The system has 10000 processes, all of which are child
processes of the same parent.
Pretty much all memory is anonymous memory.
This means that pretty much every anonymous page in the
system:
1) belongs to just one process, but
2) belongs to an anon_vma which is attached to 10,000 VMAs!
This results in page_referenced scanning 10,000 VMAs for
every page, despite the fact that each page is typically
only mapped into one process.
This seems to be our real scalability issue.
The only way out I can think is to have a new anon_vma
when we start a child process and to have COW place new
pages in the new anon_vma.
However, this is a bit of a paradigm shift in our object
rmap system and I am wondering if somebody else has a
better idea :)
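(To make the cost concrete, a simplified sketch of the anon rmap walk being described; this is not verbatim kernel code: page_anon_vma() stands in for the mapping-pointer decoding and error handling is omitted. The point is the single loop over every vma attached to the anon_vma, even though the page is usually mapped by just one of them.)

static int page_referenced_anon_sketch(struct page *page,
				       unsigned long *vm_flags)
{
	struct anon_vma *anon_vma = page_anon_vma(page);  /* assumed helper */
	struct vm_area_struct *vma;
	unsigned int mapcount = page_mapcount(page);
	int referenced = 0;

	spin_lock(&anon_vma->lock);
	/* With 10,000 forked children, this walks 10,000 vmas per page. */
	list_for_each_entry(vma, &anon_vma->head, anon_vma_node) {
		unsigned long address = vma_address(page, vma);

		if (address == -EFAULT)
			continue;	/* page falls outside this vma */
		referenced += page_referenced_one(page, vma, address,
						  &mapcount, vm_flags);
		if (!mapcount)
			break;		/* every mapping already visited */
	}
	spin_unlock(&anon_vma->lock);
	return referenced;
}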
Please first clarify whether what Larry is running is actually
a workload that people need to behave well in real life.
From time to time such cases have been constructed, but we've
usually found better things to do than solve them, because
they've been no more than academic problems.
I'm not asserting that this one is purely academic, but I do
think we need more than an artificial case to worry much about it.
An rwlock there has been proposed on several occasions, but
we resist because that change benefits this case but performs
worse on more common cases (I believe: no numbers to back that up).
Substitute a MAP_SHARED file underneath those 10000 vmas,
and don't you have an equal problem with the prio_tree,
which would be harder to solve than the anon_vma case?
Hugh
AIM7 is fairly artificial, but real life workloads
like Oracle, PostgreSQL and Apache can also fork off
large numbers of child processes, which also cause
the system to end up with lots of VMAs attached to
the anon_vmas which all the anonymous pages belong
to.
10,000 is fairly extreme, but very large Oracle
workloads can get up to 1,000 or 2,000 today.
This number is bound to grow in the future.
Which point bothers you: throughput, latency, or something else? Actually,
unfairness itself is the right thing from the VM's point of view, because a
perfectly fair VM easily livelocks (e.g. process A swaps out process B's
pages while process B swaps out process A's pages). The swap token solves
the simplest such case, but running many processes easily creates similar
circular dependencies. Recovering from heavy memory pressure needs a lot of
unfairness.
Of course, if the unfairness causes a performance regression, that's a bug
and it should be fixed.
> The only problem I noticed with the page_referenced patch was an
> increase in the
> try_to_unmap() failures which causes more re-activations. This is very
> obvious with
> the using tracepoints I have posted over the past few months but they
> were never
> included. I didnt get a chance to figure out the exact cause due to
> access to the hardware
> and workload. This patch series also seems to help the overall stalls
> in the VM system.
I (and many VM developers) haven't forgotten your tracepoint effort; we
only want to solve the regression first.
Ugh, no. An rw-spinlock is evil, please don't use it. It has bad
performance characteristics: plenty of read_lock() holders can block
write_lock() for a very long time.
And I would like to confirm one thing: the anon_vma design hasn't changed
for many years. Is this really a performance regression? Are we hitting
the right regression point?
Counting semaphore?
--
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.
>> Finally, having said all that, the system still struggles reclaiming
>> memory with
>> ~10000 processes trying at the same time, you fix one bottleneck and it
>> moves
>> somewhere else. The latest run showed all but one running process
>> spinning in
>> page_lock_anon_vma() trying for the anon_vma_lock. I noticed that there
>> are
>> ~5000 vma's linked to one anon_vma, this seems excessive!!!
>>
>> I changed the anon_vma->lock to a rwlock_t and page_lock_anon_vma() to use
>> read_lock() so multiple callers could execute the page_reference_anon code.
>> This seems to help quite a bit.
>
> Ug. no. rw-spinlock is evil. please don't use it. rw-spinlock has bad
> performance characteristics, plenty read_lock block write_lock for very
> long time.
>
> and I would like to confirm one thing. anon_vma design didn't change
> for long year. Is this really performance regression? Do we strike
> right regression point?
In 2.6.9 and 2.6.18 the system would hit different contention
points before getting to the anon_vma lock. Now that we've
gotten the other contention points out of the way, this one
has finally been exposed.
--
All rights reversed.
>> + /* Number of processes running page reclaim code on this zone. */
>> + atomic_t concurrent_reclaimers;
>> + wait_queue_head_t reclaim_wait;
>> +
>
> Counting semaphore?
I don't see a safe way to adjust a counting semaphore
through /proc/sys.
--
All rights reversed.
True. Maybe it's worthwhile to add such a way if this pattern has more
uses.
--
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.
Anything with 10000 connections using a connection-per-thread/process
model should use threads if good performance is expected, not processes.
Most things that use a multi-process design will never use a
one-connection-per-process design (yes, there are exceptions, and no, we
can't expect to fix those as they're proprietary ;). So I'm not
particularly worried.
Also make sure this happens on older kernels too: newer kernels use the
rmap chains and mangle over the ptes even when there's no VM pressure,
for no good reason. Older kernels would only hit the anon_vma chain of an
anon page after that page had been converted to swapcache and swap was
hit, so it makes a whole lot of difference. Anon_vma chains should only
be touched once we are I/O bound if anybody is to expect decent
performance out of the kernel.
> I'm not asserting that this one is purely academic, but I do
> think we need more than an artificial case to worry much about it.
Tend to agree.
> An rwlock there has been proposed on several occasions, but
> we resist because that change benefits this case but performs
> worse on more common cases (I believe: no numbers to back that up).
I think an rwlock for the anon_vma is a must. Whatever extra overhead
there is on the uncontended fast path is practically zero, and on large
SMP it allows rmap on long chains to run in parallel, so it's very much
worth it: the downside is practically zero and the upside may be
measurable in certain corner cases. I don't think it'll be enough, but I
definitely like it.
> Substitute a MAP_SHARED file underneath those 10000 vmas,
> and don't you have an equal problem with the prio_tree,
> which would be harder to solve than the anon_vma case?
That is a very good point.
Rik suggested to me having a COWed, newly allocated page use its own
anon_vma. Conceptually Rik's idea is a fine one, but the complication
then is how to chain the same vma into multiple anon_vmas (in practice
insert/removal will be slower, and more metadata will be needed for the
additional anon_vmas and for vmas queued into more than one anon_vma).
But this will only help if the mapcount of the page is 1; if the mapcount
is 10000, no change to the anon_vma or the prio_tree will solve it, and
we would instead have to start breaking the rmap loop after 64
test_and_clear_young calls, to mitigate the inefficiency on pages that
are in active use and will never go to swap, where wasting 10000
cachelines just because the page eventually ends up at the tail of the
LRU is entirely pointless.
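(A hedged sketch of that mitigation, reusing the walk sketched a little earlier in this thread; the cutoff constant and its name are illustrative only, not a proposed value.)

#define RMAP_WALK_YOUNG_LIMIT	64	/* illustrative cutoff, not tuned */

	int checked = 0;

	list_for_each_entry(vma, &anon_vma->head, anon_vma_node) {
		unsigned long address = vma_address(page, vma);

		if (address == -EFAULT)
			continue;
		referenced += page_referenced_one(page, vma, address,
						  &mapcount, vm_flags);
		if (!mapcount)
			break;
		/*
		 * After a bounded number of young-bit tests, give up: a page
		 * with this many active mappings will not be reclaimed, and
		 * walking thousands more vmas only wastes cachelines.
		 */
		if (++checked >= RMAP_WALK_YOUNG_LIMIT)
			break;
	}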
>> An rwlock there has been proposed on several occasions, but
>> we resist because that change benefits this case but performs
>> worse on more common cases (I believe: no numbers to back that up).
>
> I think rwlock for anon_vma is a must. Whatever higher overhead of the
> fast path with no contention is practically zero, and in large smp it
> allows rmap on long chains to run in parallel, so very much worth it
> because downside is practically zero and upside may be measurable
> instead in certain corner cases. I don't think it'll be enough, but I
> definitely like it.
I agree, changing the anon_vma lock to an rwlock should
work a lot better than what we have today. The tradeoff
is a tiny slowdown in medium contention cases, at the
benefit of avoiding catastrophic slowdown in some cases.
With Nick Piggin's fair rwlocks, there should be no issue
at all.
> Rik suggested to me to have a cowed newly allocated page to use its
> own anon_vma. Conceptually Rik's idea is fine one, but the only
> complication then is how to chain the same vma into multiple anon_vma
> (in practice insert/removal will be slower and more metadata will be
> needed for additional anon_vmas and vams queued in more than
> anon_vma). But this only will help if the mapcount of the page is 1,
> if the mapcount is 10000 no change to anon_vma or prio_tree will solve
> this,
It's even more complex than this for anonymous pages.
Anonymous pages get COW copied in child (and parent)
processes, potentially resulting in one page, at each
offset into the anon_vma, for every process attached
to the anon_vma.
As a result, with 10000 child processes, page_referenced
can end up searching through 10000 VMAs even for pages
with a mapcount of 1!
--
All rights reversed.