
[PATCH] vmscan: limit concurrent reclaimers in shrink_zone


Rik van Riel

Dec 11, 2009, 12:00:02 AM
Under very heavy multi-process workloads, like AIM7, the VM can
get into trouble in a variety of ways. The trouble starts when
there are hundreds, or even thousands of processes active in the
page reclaim code.

Not only can the system suffer enormous slowdowns because of
lock contention (and conditional reschedules) between thousands
of processes in the page reclaim code, but each process will try
to free up to SWAP_CLUSTER_MAX pages, even when the system already
has lots of memory free. In Larry's case, this resulted in over
6000 processes fighting over locks in the page reclaim code, even
though the system already had 1.5GB of free memory.

It should be possible to avoid both of those issues at once, by
simply limiting how many processes are active in the page reclaim
code simultaneously.

If too many processes are active doing page reclaim in one zone,
simply go to sleep in shrink_zone().

On wakeup, check whether enough memory has been freed already
before jumping into the page reclaim code ourselves. We want
to use the same threshold here that is used in the page allocator
for deciding whether or not to call the page reclaim code in the
first place, otherwise some unlucky processes could end up freeing
memory for the rest of the system.

Reported-by: Larry Woodman <lwoo...@redhat.com>
Signed-off-by: Rik van Riel <ri...@redhat.com>

---
This patch is against today's MMOTM tree. It has only been compile-tested;
I do not have an AIM7 system standing by.

Larry, does this fix your issue?

Documentation/sysctl/vm.txt | 18 ++++++++++++++++++
include/linux/mmzone.h | 4 ++++
include/linux/swap.h | 1 +
kernel/sysctl.c | 7 +++++++
mm/page_alloc.c | 3 +++
mm/vmscan.c | 38 ++++++++++++++++++++++++++++++++++++++
6 files changed, 71 insertions(+)

diff --git a/Documentation/sysctl/vm.txt b/Documentation/sysctl/vm.txt
index fc5790d..5cf766f 100644
--- a/Documentation/sysctl/vm.txt
+++ b/Documentation/sysctl/vm.txt
@@ -32,6 +32,7 @@ Currently, these files are in /proc/sys/vm:
- legacy_va_layout
- lowmem_reserve_ratio
- max_map_count
+- max_zone_concurrent_reclaim
- memory_failure_early_kill
- memory_failure_recovery
- min_free_kbytes
@@ -278,6 +279,23 @@ The default value is 65536.

=============================================================

+max_zone_concurrent_reclaim:
+
+The number of processes that are allowed to simultaneously reclaim
+memory from a particular memory zone.
+
+With certain workloads, hundreds of processes end up in the page
+reclaim code simultaneously. This can cause large slowdowns due
+to lock contention, freeing of way too much memory and occasionally
+false OOM kills.
+
+To avoid these problems, only allow a smaller number of processes
+to reclaim pages from each memory zone simultaneously.
+
+The default value is 8.
+
+=============================================================
+
memory_failure_early_kill:

Control how to kill processes when uncorrected memory error (typically
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 30fe668..ed614b8 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -345,6 +345,10 @@ struct zone {
/* Zone statistics */
atomic_long_t vm_stat[NR_VM_ZONE_STAT_ITEMS];

+ /* Number of processes running page reclaim code on this zone. */
+ atomic_t concurrent_reclaimers;
+ wait_queue_head_t reclaim_wait;
+
/*
* prev_priority holds the scanning priority for this zone. It is
* defined as the scanning priority at which we achieved our reclaim
diff --git a/include/linux/swap.h b/include/linux/swap.h
index a2602a8..661eec7 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -254,6 +254,7 @@ extern unsigned long shrink_all_memory(unsigned long nr_pages);
extern int vm_swappiness;
extern int remove_mapping(struct address_space *mapping, struct page *page);
extern long vm_total_pages;
+extern int max_zone_concurrent_reclaimers;

#ifdef CONFIG_NUMA
extern int zone_reclaim_mode;
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 6ff0ae6..89b919c 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -1270,6 +1270,13 @@ static struct ctl_table vm_table[] = {
.extra1 = &zero,
.extra2 = &one,
},
+ {
+ .procname = "max_zone_concurrent_reclaimers",
+ .data = &max_zone_concurrent_reclaimers,
+ .maxlen = sizeof(max_zone_concurrent_reclaimers),
+ .mode = 0644,
+ .proc_handler = proc_dointvec,
+ },
#endif

/*
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 11ae66e..ca9cae1 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3852,6 +3852,9 @@ static void __paginginit free_area_init_core(struct pglist_data *pgdat,

zone->prev_priority = DEF_PRIORITY;

+ atomic_set(&zone->concurrent_reclaimers, 0);
+ init_waitqueue_head(&zone->reclaim_wait);
+
zone_pcp_init(zone);
for_each_lru(l) {
INIT_LIST_HEAD(&zone->lru[l].list);
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 2bbee91..cf3ef29 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -40,6 +40,7 @@
#include <linux/memcontrol.h>
#include <linux/delayacct.h>
#include <linux/sysctl.h>
+#include <linux/wait.h>

#include <asm/tlbflush.h>
#include <asm/div64.h>
@@ -129,6 +130,17 @@ struct scan_control {
int vm_swappiness = 60;
long vm_total_pages; /* The total number of pages which the VM controls */

+/*
+ * Maximum number of processes concurrently running the page
+ * reclaim code in a memory zone. Having too many processes
+ * just results in them burning CPU time waiting for locks,
+ * so we're better off limiting page reclaim to a sane number
+ * of processes at a time. We do this per zone so local node
+ * reclaim on one NUMA node will not block other nodes from
+ * making progress.
+ */
+int max_zone_concurrent_reclaimers = 8;
+
static LIST_HEAD(shrinker_list);
static DECLARE_RWSEM(shrinker_rwsem);

@@ -1600,6 +1612,29 @@ static void shrink_zone(int priority, struct zone *zone,
struct zone_reclaim_stat *reclaim_stat = get_reclaim_stat(zone, sc);
int noswap = 0;

+ if (!current_is_kswapd() && atomic_read(&zone->concurrent_reclaimers) >
+ max_zone_concurrent_reclaimers) {
+ /*
+ * Do not add to the lock contention if this zone has
+ * enough processes doing page reclaim already, since
+ * we would just make things slower.
+ */
+ sleep_on(&zone->reclaim_wait);
+
+ /*
+ * If other processes freed enough memory while we waited,
+ * break out of the loop and go back to the allocator.
+ */
+ if (zone_watermark_ok(zone, sc->order, low_wmark_pages(zone),
+ 0, 0)) {
+ wake_up(&zone->reclaim_wait);
+ sc->nr_reclaimed += nr_to_reclaim;
+ return;
+ }
+ }
+
+ atomic_inc(&zone->concurrent_reclaimers);
+
/* If we have no swap space, do not bother scanning anon pages. */
if (!sc->may_swap || (nr_swap_pages <= 0)) {
noswap = 1;
@@ -1655,6 +1690,9 @@ static void shrink_zone(int priority, struct zone *zone,
shrink_active_list(SWAP_CLUSTER_MAX, zone, sc, priority, 0);

throttle_vm_writeout(sc->gfp_mask);
+
+ atomic_dec(&zone->concurrent_reclaimers);
+ wake_up(&zone->reclaim_wait);
}

/*

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majo...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

Minchan Kim

Dec 11, 2009, 2:10:01 AM
Hi, Rik.

I like this. but why do you select default value as constant 8?
Do you have any reason?

I think it would be better to select the number proportional to NR_CPU.
ex) NR_CPU * 2 or something.

Otherwise looks good to me.

Reviewed-by: Minchan Kim <minch...@gmail.com>

--
Kind regards,
Minchan Kim

Rik van Riel

Dec 11, 2009, 3:30:02 AM
On 12/10/2009 09:03 PM, Minchan Kim wrote:

>> +The default value is 8.
>> +
>> +=============================================================
>
> I like this. but why do you select default value as constant 8?
> Do you have any reason?
>
> I think it would be better to select the number proportional to NR_CPU.
> ex) NR_CPU * 2 or something.
>
> Otherwise looks good to me.

Pessimistically, I assume that the pageout code spends maybe
10% of its time on locking (we have seen far, far worse than
this with thousands of processes in the pageout code). That
means if we have more than 10 threads in the pageout code,
we could end up spending more time on locking and less doing
real work - slowing everybody down.

I rounded it down to the closest power of 2 to come up with
an arbitrary number that looked safe :)

However, this number is per zone - I imagine that really large
systems will have multiple memory zones, so they can run with
more than 8 processes in the pageout code simultaneously.
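
(Spelling out the arithmetic behind that guess, with the 10% figure taken
as an assumption rather than a measurement:

	fraction of reclaim time spent on locks:    f  ~= 0.10
	point where lock time starts to dominate:   1/f ~= 10 reclaimers
	nearest power of two below that:            8

Beyond roughly 1/f concurrent reclaimers, additional tasks mostly add lock
waiting rather than useful scanning.)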

> Reviewed-by: Minchan Kim<minch...@gmail.com>

Thank you.

--
All rights reversed.

Minchan Kim

Dec 11, 2009, 3:50:01 AM
On Fri, Dec 11, 2009 at 12:19 PM, Rik van Riel <ri...@redhat.com> wrote:
> On 12/10/2009 09:03 PM, Minchan Kim wrote:
>
>>> +The default value is 8.
>>> +
>>> +=============================================================
>>
>> I like this. but why do you select default value as constant 8?
>> Do you have any reason?
>>
>> I think it would be better to select the number proportional to NR_CPU.
>> ex) NR_CPU * 2 or something.
>>
>> Otherwise looks good to me.
>
> Pessimistically, I assume that the pageout code spends maybe
> 10% of its time on locking (we have seen far, far worse than
> this with thousands of processes in the pageout code).  That
> means if we have more than 10 threads in the pageout code,
> we could end up spending more time on locking and less doing
> real work - slowing everybody down.

Thanks for the valuable information. :)

We always end up questioning magic values that have no comment.

I don't want to add another magic value without an explanation,
so I would like to add your good explanation to the changelog
or as a comment in the code.

>
> I rounded it down to the closest power of 2 to come up with
> an arbitrary number that looked safe :)
>
> However, this number is per zone - I imagine that really large
> systems will have multiple memory zones, so they can run with
> more than 8 processes in the pageout code simultaneously.
>
>> Reviewed-by: Minchan Kim<minch...@gmail.com>
>
> Thank you.

Thanks for quick reply. :)

> --
> All rights reversed.
>

--
Kind regards,
Minchan Kim

Larry Woodman

Dec 11, 2009, 12:00:01 PM
FYI everyone, I put together a similar patch (doesn't block explicitly,
relies on vm_throttle) for RHEL5 (2.6.18-based kernel) and it fixes hangs
due to several CPUs spinning on zone->lru_lock. This might eliminate the
need to add complexity to the reclaim code. It's impossible to make it
parallel enough to allow hundreds or even thousands of processes in this
code at the same time.

Will backport this patch & test it, as well as testing it on the latest kernel.

Larry

Larry Woodman

Dec 11, 2009, 12:20:02 PM
Minchan Kim wrote:
>
> I like this. but why do you select default value as constant 8?
> Do you have any reason?
>
> I think it would be better to select the number proportional to NR_CPU.
> ex) NR_CPU * 2 or something.
>
> Otherwise looks good to me.
>
> Reviewed-by: Minchan Kim <minch...@gmail.com>
>
>
This is a per-zone count, so perhaps a reasonable default is the number
of CPUs on the NUMA node that the zone resides on?

Minchan Kim

Dec 11, 2009, 1:50:02 PM
Hi, Larry.

On Fri, Dec 11, 2009 at 9:07 PM, Larry Woodman <lwoo...@redhat.com> wrote:
> Minchan Kim wrote:
>>
>> I like this. but why do you select default value as constant 8?
>> Do you have any reason?
>>
>> I think it would be better to select the number proportional to NR_CPU.
>> ex) NR_CPU * 2 or something.
>>
>> Otherwise looks good to me.
>>
>> Reviewed-by: Minchan Kim <minch...@gmail.com>
>>
>>
>
> This is a per-zone count so perhaps a reasonable default is the number of
> CPUs on the
> NUMA node that the zone resides on ?

For example, assume one CPU per node.
That means your default value is 1.
On that CPU, process A tries to reclaim the HIGH zone.
Process B wants to reclaim the NORMAL zone.
But process B can't enter the reclaim path, since the throttle default value is 1.
Even kswapd can't reclaim.

I think it's a really aggressive throttling approach, although it would
solve your problem.

I have another idea.

We make the default value rather big and provide a latency value as a knob.
At first, many processes can enter the reclaim path. When the shrinking time
exceeds our knob (e.g. some fraction of HZ), we decrease the allowed number
of concurrent reclaim processes. If the shrink time is still long after we do
that, we can decrease the value again. When the shrink time is fast, we can
let more processes enter the reclaim path by increasing the number again.
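
A very rough sketch of what I mean (illustrative only; the struct, the
field and the RECLAIM_LATENCY_TARGET knob below are made up and do not
exist in the kernel):

	/* assumed tunable: target wall-clock time of one shrink pass, in jiffies */
	#define RECLAIM_LATENCY_TARGET	10

	/* hypothetical per-zone throttle state */
	struct zone_throttle {
		int limit;	/* current max number of concurrent reclaimers */
	};

	/* called after each shrink pass with the time the pass took */
	static void adjust_reclaim_limit(struct zone_throttle *zt, unsigned long elapsed)
	{
		if (elapsed > RECLAIM_LATENCY_TARGET) {
			/* shrinking is slow: admit fewer reclaimers next time */
			if (zt->limit > 1)
				zt->limit--;
		} else if (elapsed < RECLAIM_LATENCY_TARGET / 2) {
			/* shrinking is fast again: let one more process in */
			zt->limit++;
		}
	}

In practice an upper clamp on the limit would also be needed.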

It's like the old pdflush mechanism, but it's more complex than Rik's approach.
If Rik's approach solves this problem well, my approach is rather
overkill, I think.

I am looking forward to Rik's approach work well.

>
>

--
Kind regards,
Minchan Kim

Rik van Riel

Dec 11, 2009, 1:50:03 PM
On 12/11/2009 07:07 AM, Larry Woodman wrote:
> Minchan Kim wrote:
>>
>> I like this. but why do you select default value as constant 8?
>> Do you have any reason?
>>
>> I think it would be better to select the number proportional to NR_CPU.
>> ex) NR_CPU * 2 or something.
>>
>> Otherwise looks good to me.
>>
>> Reviewed-by: Minchan Kim <minch...@gmail.com>
>>
> This is a per-zone count so perhaps a reasonable default is the number
> of CPUs on the
> NUMA node that the zone resides on ?

One reason I made it tunable is so people can easily test
what a good value would be :)

--
All rights reversed.

Rik van Riel

Dec 11, 2009, 2:00:02 PM
On 12/11/2009 08:41 AM, Minchan Kim wrote:
> Hi, Larry.
>
> On Fri, Dec 11, 2009 at 9:07 PM, Larry Woodman<lwoo...@redhat.com> wrote:
>> Minchan Kim wrote:
>>>
>>> I like this. but why do you select default value as constant 8?
>>> Do you have any reason?
>>>
>>> I think it would be better to select the number proportional to NR_CPU.
>>> ex) NR_CPU * 2 or something.
>>>
>>> Otherwise looks good to me.
>>>
>>> Reviewed-by: Minchan Kim<minch...@gmail.com>
>>>
>>>
>>
>> This is a per-zone count so perhaps a reasonable default is the number of
>> CPUs on the
>> NUMA node that the zone resides on ?
>
> For example, It assume one CPU per node.
> It means your default value is 1.
> On the CPU, process A try to reclaim HIGH zone.
> Process B want to reclaim NORMAL zone.
> But Process B can't enter reclaim path sincev throttle default value is 1
> Even kswap can't reclaim.

1) the value is per zone, so process B can go ahead

2) kswapd is always excempt from this limit, since
there is only 1 kswapd per node anyway

--
All rights reversed.

Minchan Kim

Dec 11, 2009, 2:10:01 PM
On Fri, Dec 11, 2009 at 10:51 PM, Rik van Riel <ri...@redhat.com> wrote:
> On 12/11/2009 08:41 AM, Minchan Kim wrote:
>>
>> Hi, Larry.
>>
>> On Fri, Dec 11, 2009 at 9:07 PM, Larry Woodman<lwoo...@redhat.com>
>>  wrote:
>>>
>>> Minchan Kim wrote:
>>>>
>>>> I like this. but why do you select default value as constant 8?
>>>> Do you have any reason?
>>>>
>>>> I think it would be better to select the number proportional to NR_CPU.
>>>> ex) NR_CPU * 2 or something.
>>>>
>>>> Otherwise looks good to me.
>>>>
>>>> Reviewed-by: Minchan Kim<minch...@gmail.com>
>>>>
>>>>
>>>
>>> This is a per-zone count so perhaps a reasonable default is the number of
>>> CPUs on the
>>> NUMA node that the zone resides on ?
>>
>> For example, It assume one CPU per node.
>> It means your default value is 1.
>> On the CPU, process A try to reclaim HIGH zone.
>> Process B want to reclaim NORMAL zone.
>> But Process B can't enter reclaim path sincev throttle default value is 1
>> Even kswap can't reclaim.
>
> 1) the value is per zone, so process B can go ahead

Sorry, I misunderstood Larry's point.
I thought Larry meant a global limit, not a per-zone one.

> 2) kswapd is always excempt from this limit, since
>   there is only 1 kswapd per node anyway

Larry could test with Rik's patch to find a good default value.
If that proves the number of CPUs on the node is the proper default,
we can change the default value to that.


> --
> All rights reversed.
>

--
Kind regards,
Minchan Kim

Rik van Riel

Dec 11, 2009, 9:30:01 PM
On 12/10/2009 09:03 PM, Minchan Kim wrote:
> On Fri, Dec 11, 2009 at 8:56 AM, Rik van Riel<ri...@redhat.com> wrote:
>> Under very heavy multi-process workloads, like AIM7, the VM can
>> get into trouble in a variety of ways. The trouble start when
>> there are hundreds, or even thousands of processes active in the
>> page reclaim code.

> Otherwise looks good to me.
>
> Reviewed-by: Minchan Kim<minch...@gmail.com>

OK, we found three issues with my patch :)

1) there is a typo in sysctl.c

2) there is another typo in Documentation/sysctl/vm.txt

3) the code in vmscan.c has a bug, where tasks without
__GFP_IO or __GFP_FS can end up waiting for tasks
with __GFP_IO or __GFP_FS, leading to a deadlock

I will fix these issues and send out a new patch.

--
All rights reversed.

Rik van Riel

Dec 11, 2009, 9:50:03 PM
Under very heavy multi-process workloads, like AIM7, the VM can
get into trouble in a variety of ways. The trouble starts when
there are hundreds, or even thousands of processes active in the
page reclaim code.

Not only can the system suffer enormous slowdowns because of
lock contention (and conditional reschedules) between thousands
of processes in the page reclaim code, but each process will try
to free up to SWAP_CLUSTER_MAX pages, even when the system already
has lots of memory free.

It should be possible to avoid both of those issues at once, by
simply limiting how many processes are active in the page reclaim
code simultaneously.

If too many processes are active doing page reclaim in one zone,
simply go to sleep in shrink_zone().

On wakeup, check whether enough memory has been freed already
before jumping into the page reclaim code ourselves. We want
to use the same threshold here that is used in the page allocator
for deciding whether or not to call the page reclaim code in the
first place, otherwise some unlucky processes could end up freeing
memory for the rest of the system.

Reported-by: Larry Woodman <lwoo...@redhat.com>
Signed-off-by: Rik van Riel <ri...@redhat.com>

---
v2:
- fix typos in sysctl.c and vm.txt
- move the code in sysctl.c out from under the ifdef
- only __GFP_FS|__GFP_IO tasks can wait

Documentation/sysctl/vm.txt | 18 ++++++++++++++
include/linux/mmzone.h | 4 +++
include/linux/swap.h | 1 +
kernel/sysctl.c | 7 +++++
mm/page_alloc.c | 3 ++
mm/vmscan.c | 40 +++++++++++++++++++++++++++++++++
6 files changed, 73 insertions(+), 0 deletions(-)

diff --git a/Documentation/sysctl/vm.txt b/Documentation/sysctl/vm.txt
index fc5790d..8bd1a96 100644


--- a/Documentation/sysctl/vm.txt
+++ b/Documentation/sysctl/vm.txt
@@ -32,6 +32,7 @@ Currently, these files are in /proc/sys/vm:
- legacy_va_layout
- lowmem_reserve_ratio
- max_map_count

+- max_zone_concurrent_reclaimers


- memory_failure_early_kill
- memory_failure_recovery
- min_free_kbytes
@@ -278,6 +279,23 @@ The default value is 65536.

=============================================================

+max_zone_concurrent_reclaimers:

index 6ff0ae6..4ec17ed 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -1271,6 +1271,13 @@ static struct ctl_table vm_table[] = {
.extra2 = &one,
},
#endif


+ {
+ .procname = "max_zone_concurrent_reclaimers",
+ .data = &max_zone_concurrent_reclaimers,
+ .maxlen = sizeof(max_zone_concurrent_reclaimers),
+ .mode = 0644,

+ .proc_handler = &proc_dointvec,
+ },

/*
* NOTE: do not add new entries to this table unless you have read


diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 11ae66e..ca9cae1 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3852,6 +3852,9 @@ static void __paginginit free_area_init_core(struct pglist_data *pgdat,

zone->prev_priority = DEF_PRIORITY;

+ atomic_set(&zone->concurrent_reclaimers, 0);
+ init_waitqueue_head(&zone->reclaim_wait);
+
zone_pcp_init(zone);
for_each_lru(l) {
INIT_LIST_HEAD(&zone->lru[l].list);
diff --git a/mm/vmscan.c b/mm/vmscan.c

index 2bbee91..ecfe28c 100644

@@ -1600,6 +1612,31 @@ static void shrink_zone(int priority, struct zone *zone,


struct zone_reclaim_stat *reclaim_stat = get_reclaim_stat(zone, sc);
int noswap = 0;

+ if (!current_is_kswapd() && atomic_read(&zone->concurrent_reclaimers) >

+ max_zone_concurrent_reclaimers &&
+ (sc->gfp_mask & (__GFP_IO|__GFP_FS)) ==
+ (__GFP_IO|__GFP_FS)) {


+ /*
+ * Do not add to the lock contention if this zone has
+ * enough processes doing page reclaim already, since
+ * we would just make things slower.
+ */
+ sleep_on(&zone->reclaim_wait);
+
+ /*
+ * If other processes freed enough memory while we waited,
+ * break out of the loop and go back to the allocator.
+ */
+ if (zone_watermark_ok(zone, sc->order, low_wmark_pages(zone),
+ 0, 0)) {
+ wake_up(&zone->reclaim_wait);
+ sc->nr_reclaimed += nr_to_reclaim;
+ return;
+ }
+ }
+
+ atomic_inc(&zone->concurrent_reclaimers);
+
/* If we have no swap space, do not bother scanning anon pages. */
if (!sc->may_swap || (nr_swap_pages <= 0)) {
noswap = 1;

@@ -1655,6 +1692,9 @@ static void shrink_zone(int priority, struct zone *zone,


shrink_active_list(SWAP_CLUSTER_MAX, zone, sc, priority, 0);

throttle_vm_writeout(sc->gfp_mask);
+
+ atomic_dec(&zone->concurrent_reclaimers);
+ wake_up(&zone->reclaim_wait);
}

/*

Minchan Kim

Dec 14, 2009, 12:20:03 AM
Hi, Rik.

On Sat, Dec 12, 2009 at 6:46 AM, Rik van Riel <ri...@redhat.com> wrote:
> Under very heavy multi-process workloads, like AIM7, the VM can
> get into trouble in a variety of ways.  The trouble start when
> there are hundreds, or even thousands of processes active in the
> page reclaim code.
>
> Not only can the system suffer enormous slowdowns because of
> lock contention (and conditional reschedules) between thousands
> of processes in the page reclaim code, but each process will try
> to free up to SWAP_CLUSTER_MAX pages, even when the system already
> has lots of memory free.
>
> It should be possible to avoid both of those issues at once, by
> simply limiting how many processes are active in the page reclaim
> code simultaneously.
>
> If too many processes are active doing page reclaim in one zone,
> simply go to sleep in shrink_zone().
>
> On wakeup, check whether enough memory has been freed already
> before jumping into the page reclaim code ourselves.  We want
> to use the same threshold here that is used in the page allocator
> for deciding whether or not to call the page reclaim code in the
> first place, otherwise some unlucky processes could end up freeing
> memory for the rest of the system.

I am worried about one thing.

Now we can put too many processes on reclaim_wait in TASK_UNINTERRUPTIBLE state.
If OOM happens, the OOM killer will kill many innocent processes, since an
uninterruptible task can't handle the kill signal until the process is freed
from the reclaim_wait list.

I think the time spent on the reclaim_wait list might be long if VM pressure
is heavy. Is this an exaggeration?

If it is a serious problem, how about this?

We add a new PF_RECLAIM_BLOCK flag and don't pick such processes
in select_bad_process().

--
Kind regards,
Minchan Kim

Rik van Riel

Dec 14, 2009, 4:20:01 AM
On 12/13/2009 07:14 PM, Minchan Kim wrote:

> On Sat, Dec 12, 2009 at 6:46 AM, Rik van Riel<ri...@redhat.com> wrote:

>> If too many processes are active doing page reclaim in one zone,
>> simply go to sleep in shrink_zone().

> I am worried about one.


>
> Now, we can put too many processes reclaim_wait with NR_UNINTERRUBTIBLE state.
> If OOM happens, OOM will kill many innocent processes since
> uninterruptible task
> can't handle kill signal until the processes free from reclaim_wait list.
>
> I think reclaim_wait list staying time might be long if VM pressure is heavy.
> Is this a exaggeration?
>
> If it is serious problem, how about this?
>
> We add new PF_RECLAIM_BLOCK flag and don't pick the process
> in select_bad_process.

A simpler solution may be to use sleep_on_interruptible, and
simply have the process continue into shrink_zone() if it
gets a signal.

--
All rights reversed.

Minchan Kim

Dec 14, 2009, 4:30:02 AM
On Mon, Dec 14, 2009 at 1:09 PM, Rik van Riel <ri...@redhat.com> wrote:
> On 12/13/2009 07:14 PM, Minchan Kim wrote:
>
>> On Sat, Dec 12, 2009 at 6:46 AM, Rik van Riel<ri...@redhat.com>  wrote:
>
>>> If too many processes are active doing page reclaim in one zone,
>>> simply go to sleep in shrink_zone().
>
>> I am worried about one.
>>
>> Now, we can put too many processes reclaim_wait with NR_UNINTERRUBTIBLE
>> state.
>> If OOM happens, OOM will kill many innocent processes since
>> uninterruptible task
>> can't handle kill signal until the processes free from reclaim_wait list.
>>
>> I think reclaim_wait list staying time might be long if VM pressure is
>> heavy.
>> Is this a exaggeration?
>>
>> If it is serious problem, how about this?
>>
>> We add new PF_RECLAIM_BLOCK flag and don't pick the process
>> in select_bad_process.
>
> A simpler solution may be to use sleep_on_interruptible, and
> simply have the process continue into shrink_zone() if it
> gets a signal.

I thought of that but I was not sure.
Okay, if it is possible, it's simpler.
Could you repost the patch with that?


Sorry, but I have one request.


===

+The default value is 8.
+
+=============================================================

I like this. but why do you select default value as constant 8?
Do you have any reason?

I think it would be better to select the number proportional to NR_CPU.
ex) NR_CPU * 2 or something.

Otherwise looks good to me.


Pessimistically, I assume that the pageout code spends maybe
10% of its time on locking (we have seen far, far worse than
this with thousands of processes in the pageout code). That
means if we have more than 10 threads in the pageout code,
we could end up spending more time on locking and less doing
real work - slowing everybody down.

I rounded it down to the closest power of 2 to come up with


an arbitrary number that looked safe :)

===

We discussed this above.
I want to add your description to the changelog.
That's because, a long time from now, we won't remember why we selected '8'
as the default value.
Your description in the changelog will explain it to the people who follow. :)

Sorry for bothering you.


> --
> All rights reversed.
>

--
Kind regards,
Minchan Kim

Rik van Riel

Dec 14, 2009, 4:40:01 AM
On 12/13/2009 11:19 PM, Minchan Kim wrote:
> On Mon, Dec 14, 2009 at 1:09 PM, Rik van Riel<ri...@redhat.com> wrote:

>> A simpler solution may be to use sleep_on_interruptible, and
>> simply have the process continue into shrink_zone() if it
>> gets a signal.
>
> I thought it but I was not sure.
> Okay. If it is possible, It' more simple.
> Could you repost patch with that?

Sure, not a problem.

> +The default value is 8.
> +
> +=============================================================
>
>
> I like this. but why do you select default value as constant 8?
> Do you have any reason?
>
> I think it would be better to select the number proportional to NR_CPU.
> ex) NR_CPU * 2 or something.
>
> Otherwise looks good to me.
>
>
> Pessimistically, I assume that the pageout code spends maybe
> 10% of its time on locking (we have seen far, far worse than
> this with thousands of processes in the pageout code). That
> means if we have more than 10 threads in the pageout code,
> we could end up spending more time on locking and less doing
> real work - slowing everybody down.
>
> I rounded it down to the closest power of 2 to come up with
> an arbitrary number that looked safe :)
> ===
>
> We discussed above.
> I want to add your desciption into changelog.

The thing is, I don't know if 8 is the best value for
the default, which is a reason I made it tunable in
the first place.

There are a lot of assumptions in my reasoning, and
they may be wrong. I suspect that documenting something
wrong is probably worse than letting people wonder about
the default (and maybe find a better one).

--
All rights reversed.
--

Minchan Kim

Dec 14, 2009, 5:10:02 AM

Indeed. But whenever I see magic values in the kernel, I question them.
Actually, I tend to doubt such a value because I guess
"that value was determined by server or desktop experiments,
so it probably isn't proper for a small system."
At least if we write down the reasoning, even if it might be wrong,
later people can decide that the value is no longer proper, or not proper
for their system (e.g. small systems, supercomputers and so on), based on
our description, and then they can improve it.

I think it isn't important whether your reasoning is right or wrong.
The most important thing is what logical reason determines that value.

I don't want to bother you if you mind my suggestion.
Please consider it just a nitpick. :)


> --
> All rights reversed.
>

--
Kind regards,
Minchan Kim

KOSAKI Motohiro

Dec 14, 2009, 12:30:01 PM
> Under very heavy multi-process workloads, like AIM7, the VM can
> get into trouble in a variety of ways. The trouble start when
> there are hundreds, or even thousands of processes active in the
> page reclaim code.
>
> Not only can the system suffer enormous slowdowns because of
> lock contention (and conditional reschedules) between thousands
> of processes in the page reclaim code, but each process will try
> to free up to SWAP_CLUSTER_MAX pages, even when the system already
> has lots of memory free.
>
> It should be possible to avoid both of those issues at once, by
> simply limiting how many processes are active in the page reclaim
> code simultaneously.
>
> If too many processes are active doing page reclaim in one zone,
> simply go to sleep in shrink_zone().
>
> On wakeup, check whether enough memory has been freed already
> before jumping into the page reclaim code ourselves. We want
> to use the same threshold here that is used in the page allocator
> for deciding whether or not to call the page reclaim code in the
> first place, otherwise some unlucky processes could end up freeing
> memory for the rest of the system.

This patch seems very similar to my old effort. AFAIK, there are another
two benefits.

1. Improve the resource guarantee
If thousands of tasks start vmscan at the same time, they eat all of the
memory reserved for PF_MEMALLOC. That might cause another dangerous problem;
some filesystems and IO devices don't handle allocation failure properly.

2. Improve behavior under OOM conditions
Currently, vmscan doesn't handle SIGKILL at all. So if the system is under
heavy memory pressure, the OOM killer can't kill the target process soon,
which might cause the OOM killer to kill the next innocent process.
This patch can fix it.

> @@ -1600,6 +1612,31 @@ static void shrink_zone(int priority, struct zone *zone,
> struct zone_reclaim_stat *reclaim_stat = get_reclaim_stat(zone, sc);
> int noswap = 0;
>
> + if (!current_is_kswapd() && atomic_read(&zone->concurrent_reclaimers) >
> + max_zone_concurrent_reclaimers &&
> + (sc->gfp_mask & (__GFP_IO|__GFP_FS)) ==
> + (__GFP_IO|__GFP_FS)) {
> + /*
> + * Do not add to the lock contention if this zone has
> + * enough processes doing page reclaim already, since
> + * we would just make things slower.
> + */
> + sleep_on(&zone->reclaim_wait);

Oops, this is a bug: sleep_on() is not SMP-safe.
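
(An illustrative interleaving of how the wakeup can be lost; the ordering
shown is hypothetical, the identifiers are the ones from the patch:

	reclaiming task                          task leaving reclaim
	---------------                          --------------------
	sees concurrent_reclaimers > max,
	decides to sleep
	                                         atomic_dec(&zone->concurrent_reclaimers);
	                                         wake_up(&zone->reclaim_wait);  /* queue still empty */
	sleep_on(&zone->reclaim_wait);           /* wakeup already gone, sleeps indefinitely */

The prepare_to_wait()/finish_wait() pattern, used in the follow-up fix,
closes this window by putting the task on the wait queue and setting its
task state before the condition is checked again.)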


I made a few patches today to fix this. I'll post them soon.


btw, the following is a measurement result from hackbench.
================

unit: sec

parameter               old       new
130 (5200 processes)    5.463     4.442
140 (5600 processes)    479.357   7.792
150 (6000 processes)    729.640   20.529

KOSAKI Motohiro

Dec 14, 2009, 12:30:02 PM

The sleep_on() function is SMP and/or kernel-preemption unsafe. We shouldn't
use it in new code.

Signed-off-by: KOSAKI Motohiro <kosaki....@jp.fujitsu.com>
---
include/linux/wait.h | 8 ++++----
1 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/include/linux/wait.h b/include/linux/wait.h
index a48e16b..bf76627 100644
--- a/include/linux/wait.h
+++ b/include/linux/wait.h
@@ -427,11 +427,11 @@ static inline void remove_wait_queue_locked(wait_queue_head_t *q,
* They are racy. DO NOT use them, use the wait_event* interfaces above.
* We plan to remove these interfaces.
*/
-extern void sleep_on(wait_queue_head_t *q);
-extern long sleep_on_timeout(wait_queue_head_t *q,
+extern void __deprecated sleep_on(wait_queue_head_t *q);
+extern long __deprecated sleep_on_timeout(wait_queue_head_t *q,
signed long timeout);
-extern void interruptible_sleep_on(wait_queue_head_t *q);
-extern long interruptible_sleep_on_timeout(wait_queue_head_t *q,
+extern void __deprecated interruptible_sleep_on(wait_queue_head_t *q);
+extern long __deprecated interruptible_sleep_on_timeout(wait_queue_head_t *q,
signed long timeout);

/*
--
1.6.5.2

KOSAKI Motohiro

Dec 14, 2009, 12:30:02 PM
The concurrent_reclaimers limitation code made a mess of shrink_zone().
Introduce helper functions to increase readability.

Signed-off-by: KOSAKI Motohiro <kosaki....@jp.fujitsu.com>
---

mm/vmscan.c | 58 +++++++++++++++++++++++++++++++++++-----------------------
1 files changed, 35 insertions(+), 23 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index ecfe28c..74c36fe 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1597,25 +1597,11 @@ static unsigned long nr_scan_try_batch(unsigned long nr_to_scan,
return nr;
}

-/*
- * This is a basic per-zone page freer. Used by both kswapd and direct reclaim.
- */
-static void shrink_zone(int priority, struct zone *zone,
- struct scan_control *sc)
+static int shrink_zone_begin(struct zone *zone, struct scan_control *sc)
{
- unsigned long nr[NR_LRU_LISTS];
- unsigned long nr_to_scan;
- unsigned long percent[2]; /* anon @ 0; file @ 1 */
- enum lru_list l;
- unsigned long nr_reclaimed = sc->nr_reclaimed;
- unsigned long nr_to_reclaim = sc->nr_to_reclaim;
- struct zone_reclaim_stat *reclaim_stat = get_reclaim_stat(zone, sc);
- int noswap = 0;
-
- if (!current_is_kswapd() && atomic_read(&zone->concurrent_reclaimers) >
- max_zone_concurrent_reclaimers &&
- (sc->gfp_mask & (__GFP_IO|__GFP_FS)) ==
- (__GFP_IO|__GFP_FS)) {
+ if (!current_is_kswapd() &&
+ atomic_read(&zone->concurrent_reclaimers) > max_zone_concurrent_reclaimers &&
+ (sc->gfp_mask & (__GFP_IO|__GFP_FS)) == (__GFP_IO|__GFP_FS)) {
/*


* Do not add to the lock contention if this zone has

* enough processes doing page reclaim already, since

@@ -1630,12 +1616,40 @@ static void shrink_zone(int priority, struct zone *zone,


if (zone_watermark_ok(zone, sc->order, low_wmark_pages(zone),

0, 0)) {
wake_up(&zone->reclaim_wait);
- sc->nr_reclaimed += nr_to_reclaim;
- return;
+ sc->nr_reclaimed += sc->nr_to_reclaim;
+ return -ERESTARTSYS;
}
}

atomic_inc(&zone->concurrent_reclaimers);
+ return 0;
+}
+
+static void shrink_zone_end(struct zone *zone, struct scan_control *sc)


+{
+ atomic_dec(&zone->concurrent_reclaimers);
+ wake_up(&zone->reclaim_wait);

+}
+
+/*
+ * This is a basic per-zone page freer. Used by both kswapd and direct reclaim.
+ */
+static void shrink_zone(int priority, struct zone *zone,
+ struct scan_control *sc)
+{
+ unsigned long nr[NR_LRU_LISTS];
+ unsigned long nr_to_scan;
+ unsigned long percent[2]; /* anon @ 0; file @ 1 */
+ enum lru_list l;
+ unsigned long nr_reclaimed = sc->nr_reclaimed;
+ unsigned long nr_to_reclaim = sc->nr_to_reclaim;
+ struct zone_reclaim_stat *reclaim_stat = get_reclaim_stat(zone, sc);
+ int noswap = 0;
+ int ret;
+
+ ret = shrink_zone_begin(zone, sc);
+ if (ret)
+ return;



/* If we have no swap space, do not bother scanning anon pages. */
if (!sc->may_swap || (nr_swap_pages <= 0)) {

@@ -1692,9 +1706,7 @@ static void shrink_zone(int priority, struct zone *zone,


shrink_active_list(SWAP_CLUSTER_MAX, zone, sc, priority, 0);

throttle_vm_writeout(sc->gfp_mask);

-
- atomic_dec(&zone->concurrent_reclaimers);
- wake_up(&zone->reclaim_wait);
+ shrink_zone_end(zone, sc);
}

/*
--
1.6.5.2

KOSAKI Motohiro

Dec 14, 2009, 12:30:01 PM
sleep_on() is SMP and/or kernel-preemption unsafe. This patch
replaces it with safe code.

Signed-off-by: KOSAKI Motohiro <kosaki....@jp.fujitsu.com>
---

mm/vmscan.c | 40 ++++++++++++++++++++++++++++++----------
1 files changed, 30 insertions(+), 10 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 74c36fe..3be5345 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1599,15 +1599,32 @@ static unsigned long nr_scan_try_batch(unsigned long nr_to_scan,



static int shrink_zone_begin(struct zone *zone, struct scan_control *sc)
{

- if (!current_is_kswapd() &&
- atomic_read(&zone->concurrent_reclaimers) > max_zone_concurrent_reclaimers &&
- (sc->gfp_mask & (__GFP_IO|__GFP_FS)) == (__GFP_IO|__GFP_FS)) {
- /*
- * Do not add to the lock contention if this zone has
- * enough processes doing page reclaim already, since
- * we would just make things slower.
- */
- sleep_on(&zone->reclaim_wait);
+ DEFINE_WAIT(wait);
+ wait_queue_head_t *wq = &zone->reclaim_wait;
+
+ if (current_is_kswapd())
+ goto out;
+
+ /*
+ * GFP_NOIO and GFP_NOFS mean caller may have some lock implicitly.
+ * Thus, we can't wait here. otherwise it might cause deadlock.
+ */
+ if ((sc->gfp_mask & (__GFP_IO|__GFP_FS)) != (__GFP_IO|__GFP_FS))
+ goto out;
+


+ /*
+ * Do not add to the lock contention if this zone has
+ * enough processes doing page reclaim already, since
+ * we would just make things slower.
+ */

+ for (;;) {
+ prepare_to_wait(wq, &wait, TASK_UNINTERRUPTIBLE);
+
+ if (atomic_read(&zone->concurrent_reclaimers) <=
+ max_zone_concurrent_reclaimers)
+ break;
+
+ schedule();

/*


* If other processes freed enough memory while we waited,

@@ -1615,12 +1632,15 @@ static int shrink_zone_begin(struct zone *zone, struct scan_control *sc)
*/


if (zone_watermark_ok(zone, sc->order, low_wmark_pages(zone),

0, 0)) {
- wake_up(&zone->reclaim_wait);
+ wake_up(wq);
+ finish_wait(wq, &wait);


sc->nr_reclaimed += sc->nr_to_reclaim;

return -ERESTARTSYS;
}
}
+ finish_wait(wq, &wait);

+ out:
atomic_inc(&zone->concurrent_reclaimers);
return 0;
}
--
1.6.5.2

KOSAKI Motohiro

Dec 14, 2009, 12:40:02 PM
In the OOM case, almost all processes may be in vmscan. There isn't any
reason for the killed process to continue allocating; a process exiting
frees lots of pages, unlike greedy vmscan.

Signed-off-by: KOSAKI Motohiro <kosaki....@jp.fujitsu.com>
---

mm/page_alloc.c | 8 ++++++++
1 files changed, 8 insertions(+), 0 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index ca9cae1..8a9cbaa 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1878,6 +1878,14 @@ rebalance:
goto got_pg;

/*
+ * If the allocation is for userland page and we have fatal signal,
+ * there isn't any reason to continue allocation. instead, the task
+ * should exit soon.
+ */
+ if (fatal_signal_pending(current) && (gfp_mask & __GFP_HIGHMEM))
+ goto nopage;
+
+ /*
* If we failed to make any progress reclaiming, then we are
* running out of options and have to consider going OOM
*/
--
1.6.5.2

KOSAKI Motohiro

Dec 14, 2009, 12:40:02 PM

From a latency point of view, there isn't any reason for shrink_zones() to
continue reclaiming another zone's pages if the task has already reclaimed
enough pages.

Signed-off-by: KOSAKI Motohiro <kosaki....@jp.fujitsu.com>
---

mm/vmscan.c | 16 ++++++++++++----
1 files changed, 12 insertions(+), 4 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 0880668..bf229d3 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1654,7 +1654,7 @@ static void shrink_zone_end(struct zone *zone, struct scan_control *sc)
/*


* This is a basic per-zone page freer. Used by both kswapd and direct reclaim.

*/
-static void shrink_zone(int priority, struct zone *zone,
+static int shrink_zone(int priority, struct zone *zone,
struct scan_control *sc)
{
unsigned long nr[NR_LRU_LISTS];
@@ -1669,7 +1669,7 @@ static void shrink_zone(int priority, struct zone *zone,

ret = shrink_zone_begin(zone, sc);
if (ret)
- return;
+ return ret;



/* If we have no swap space, do not bother scanning anon pages. */
if (!sc->may_swap || (nr_swap_pages <= 0)) {

@@ -1692,6 +1692,7 @@ static void shrink_zone(int priority, struct zone *zone,
&reclaim_stat->nr_saved_scan[l]);
}

+ ret = 0;
while (nr[LRU_INACTIVE_ANON] || nr[LRU_ACTIVE_FILE] ||
nr[LRU_INACTIVE_FILE]) {
for_each_evictable_lru(l) {
@@ -1712,8 +1713,10 @@ static void shrink_zone(int priority, struct zone *zone,
* with multiple processes reclaiming pages, the total
* freeing target can get unreasonably large.
*/
- if (nr_reclaimed >= nr_to_reclaim && priority < DEF_PRIORITY)
+ if (nr_reclaimed >= nr_to_reclaim && priority < DEF_PRIORITY) {
+ ret = -ERESTARTSYS;
break;
+ }
}

sc->nr_reclaimed = nr_reclaimed;
@@ -1727,6 +1730,8 @@ static void shrink_zone(int priority, struct zone *zone,

throttle_vm_writeout(sc->gfp_mask);
shrink_zone_end(zone, sc);
+
+ return ret;
}

/*
@@ -1751,6 +1756,7 @@ static void shrink_zones(int priority, struct zonelist *zonelist,
enum zone_type high_zoneidx = gfp_zone(sc->gfp_mask);
struct zoneref *z;
struct zone *zone;
+ int ret;

sc->all_unreclaimable = 1;
for_each_zone_zonelist_nodemask(zone, z, zonelist, high_zoneidx,
@@ -1780,7 +1786,9 @@ static void shrink_zones(int priority, struct zonelist *zonelist,
priority);
}

- shrink_zone(priority, zone, sc);
+ ret = shrink_zone(priority, zone, sc);


+ if (ret)
+ return;
}
}

--
1.6.5.2

KOSAKI Motohiro

Dec 14, 2009, 12:40:02 PM
When a fork bomb invokes the OOM killer, almost all tasks might start to
reclaim and sleep in shrink_zone_begin(). If we use TASK_UNINTERRUPTIBLE, the
OOM killer can't kill such tasks, meaning we never recover from the fork bomb.

This patch fixes it.

Signed-off-by: KOSAKI Motohiro <kosaki....@jp.fujitsu.com>
---

mm/vmscan.c | 20 +++++++++++++-------
1 files changed, 13 insertions(+), 7 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index bf229d3..e537361 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1618,7 +1618,10 @@ static int shrink_zone_begin(struct zone *zone, struct scan_control *sc)


* we would just make things slower.

*/
for (;;) {
- prepare_to_wait_exclusive(wq, &wait, TASK_UNINTERRUPTIBLE);
+ prepare_to_wait_exclusive(wq, &wait, TASK_KILLABLE);
+
+ if (fatal_signal_pending(current))
+ goto stop_reclaim;

if (atomic_read(&zone->concurrent_reclaimers) <=
max_zone_concurrent_reclaimers)
@@ -1631,18 +1634,21 @@ static int shrink_zone_begin(struct zone *zone, struct scan_control *sc)


* break out of the loop and go back to the allocator.

*/


if (zone_watermark_ok(zone, sc->order, low_wmark_pages(zone),

- 0, 0)) {
- wake_up_all(wq);
- finish_wait(wq, &wait);
- sc->nr_reclaimed += sc->nr_to_reclaim;
- return -ERESTARTSYS;
- }
+ 0, 0))
+ goto found_lots_memory;
}
finish_wait(wq, &wait);



out:
atomic_inc(&zone->concurrent_reclaimers);
return 0;

+
+ found_lots_memory:
+ wake_up_all(wq);
+ stop_reclaim:
+ finish_wait(wq, &wait);
+ sc->nr_reclaimed += sc->nr_to_reclaim;
+ return -ERESTARTSYS;


}

static void shrink_zone_end(struct zone *zone, struct scan_control *sc)

--
1.6.5.2

KOSAKI Motohiro

Dec 14, 2009, 12:40:01 PM
All task sleeping points in vmscan (e.g. congestion_wait) use
io_schedule(), so make shrink_zone_begin() use it too.

Signed-off-by: KOSAKI Motohiro <kosaki....@jp.fujitsu.com>
---

mm/vmscan.c | 2 +-
1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 3562a2d..0880668 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1624,7 +1624,7 @@ static int shrink_zone_begin(struct zone *zone, struct scan_control *sc)
max_zone_concurrent_reclaimers)
break;

- schedule();
+ io_schedule();

/*


* If other processes freed enough memory while we waited,

--
1.6.5.2

KOSAKI Motohiro

Dec 14, 2009, 12:40:03 PM
If we don't use an exclusive wait queue, the wake_up() function wakes _all_
waiting tasks. This simply wastes CPU.

Signed-off-by: KOSAKI Motohiro <kosaki....@jp.fujitsu.com>
---

mm/vmscan.c | 4 ++--
1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index e0cb834..3562a2d 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1618,7 +1618,7 @@ static int shrink_zone_begin(struct zone *zone, struct scan_control *sc)


* we would just make things slower.

*/
for (;;) {
- prepare_to_wait(wq, &wait, TASK_UNINTERRUPTIBLE);
+ prepare_to_wait_exclusive(wq, &wait, TASK_UNINTERRUPTIBLE);



if (atomic_read(&zone->concurrent_reclaimers) <=
max_zone_concurrent_reclaimers)

@@ -1632,7 +1632,7 @@ static int shrink_zone_begin(struct zone *zone, struct scan_control *sc)
*/


if (zone_watermark_ok(zone, sc->order, low_wmark_pages(zone),

0, 0)) {
- wake_up(wq);
+ wake_up_all(wq);
finish_wait(wq, &wait);


sc->nr_reclaimed += sc->nr_to_reclaim;

return -ERESTARTSYS;
--
1.6.5.2

KOSAKI Motohiro

Dec 14, 2009, 12:50:02 PM
> btw, following is mesurement result by hackbench.
> ================
>
> unit: sec
>
> parameter old new
> 130 (5200 processes) 5.463 4.442
> 140 (5600 processes) 479.357 7.792
> 150 (6000 processes) 729.640 20.529

old means mmotm1208
new means mmotm1208 + Rik's patch + my patches

Christoph Hellwig

Dec 14, 2009, 1:10:02 PM
On Mon, Dec 14, 2009 at 09:24:40PM +0900, KOSAKI Motohiro wrote:
>
>
> sleep_on() function is SMP and/or kernel preemption unsafe. we shouldn't
> use it on new code.

And the best way to achieve this is to remove the function.

In Linus' current tree I find:

- 5 instances of sleep_on(), all in old and obscure block drivers
- 2 instances of sleep_on_timeout(), both in old and obscure drivers

- 28 instances of interruptible_sleep_on_timeout(), mostly in obscure
drivers with a high concentration in the old oss core which should be
killed anyway. And unfortunately a few relatively recent additions
like the SGI xpc driver or usbvision driver
- tons of instances of interruptible_sleep_on all over the drivers code

So I think you'd be better off just killing the first two ASAP instead
of just deprecating them.

For the other two, deprecating them and removing some of the horrible drivers
still using them might be best.

Andi Kleen

Dec 14, 2009, 1:10:02 PM
Rik van Riel <ri...@redhat.com> writes:

> +max_zone_concurrent_reclaim:


> +
> +The number of processes that are allowed to simultaneously reclaim
> +memory from a particular memory zone.
> +
> +With certain workloads, hundreds of processes end up in the page
> +reclaim code simultaneously. This can cause large slowdowns due
> +to lock contention, freeing of way too much memory and occasionally
> +false OOM kills.
> +
> +To avoid these problems, only allow a smaller number of processes
> +to reclaim pages from each memory zone simultaneously.
> +

> +The default value is 8.

I don't like the hardcoded number. Is the same number good for a 128MB
embedded system as for a 1TB server? Seems doubtful.

This should be perhaps scaled with memory size and number of CPUs?

> +/*
> + * Maximum number of processes concurrently running the page
> + * reclaim code in a memory zone. Having too many processes
> + * just results in them burning CPU time waiting for locks,
> + * so we're better off limiting page reclaim to a sane number
> + * of processes at a time. We do this per zone so local node
> + * reclaim on one NUMA node will not block other nodes from
> + * making progress.
> + */
> +int max_zone_concurrent_reclaimers = 8;

__read_mostly

> +
> static LIST_HEAD(shrinker_list);
> static DECLARE_RWSEM(shrinker_rwsem);
>

> @@ -1600,6 +1612,29 @@ static void shrink_zone(int priority, struct zone *zone,


> struct zone_reclaim_stat *reclaim_stat = get_reclaim_stat(zone, sc);

> int noswap = 0;
>
> + if (!current_is_kswapd() && atomic_read(&zone->concurrent_reclaimers) >
> + max_zone_concurrent_reclaimers) {


> + /*
> + * Do not add to the lock contention if this zone has
> + * enough processes doing page reclaim already, since

> + * we would just make things slower.
> + */
> + sleep_on(&zone->reclaim_wait);

wait_event()? sleep_on is a really deprecated racy interface.

This would still badly thunder the herd if not enough memory is freed,
wouldn't it? It would be better to only wake up a single process when memory
gets freed.

How about doing a wake-up of one thread for each page freed?


-Andi
--
a...@linux.intel.com -- Speaking for myself only.

Christoph Hellwig

Dec 14, 2009, 1:20:02 PM
On Thu, Dec 10, 2009 at 06:56:26PM -0500, Rik van Riel wrote:
> Under very heavy multi-process workloads, like AIM7, the VM can
> get into trouble in a variety of ways. The trouble start when
> there are hundreds, or even thousands of processes active in the
> page reclaim code.
>
> Not only can the system suffer enormous slowdowns because of
> lock contention (and conditional reschedules) between thousands
> of processes in the page reclaim code, but each process will try
> to free up to SWAP_CLUSTER_MAX pages, even when the system already
> has lots of memory free. In Larry's case, this resulted in over
> 6000 processes fighting over locks in the page reclaim code, even
> though the system already had 1.5GB of free memory.
>
> It should be possible to avoid both of those issues at once, by
> simply limiting how many processes are active in the page reclaim
> code simultaneously.
>

This sounds like a very good argument against using direct reclaim at
all. It reminds me a bit of the issue we had in XFS with lots of processes
pushing the AIL and causing massive slowdowns due to lock contention
and cacheline bouncing. Moving all the AIL pushing into a dedicated
thread solved that nicely. In the VM we already have a dedicated
per-node kswapd thread, so moving off as much work as possible to it
should be equivalent.

Of course any of this kind of tuning really requires a lot of testing
and benchmarking to verify those assumptions.

Larry Woodman

Dec 14, 2009, 2:30:02 PM
On Mon, 2009-12-14 at 14:08 +0100, Andi Kleen wrote:
> Rik van Riel <ri...@redhat.com> writes:
>
> > +max_zone_concurrent_reclaim:
> > +
> > +The number of processes that are allowed to simultaneously reclaim
> > +memory from a particular memory zone.
> > +
> > +With certain workloads, hundreds of processes end up in the page
> > +reclaim code simultaneously. This can cause large slowdowns due
> > +to lock contention, freeing of way too much memory and occasionally
> > +false OOM kills.
> > +
> > +To avoid these problems, only allow a smaller number of processes
> > +to reclaim pages from each memory zone simultaneously.
> > +
> > +The default value is 8.
>
> I don't like the hardcoded number. Is the same number good for a 128MB
> embedded system as for as 1TB server? Seems doubtful.
>
> This should be perhaps scaled with memory size and number of CPUs?

Remember this is a per-zone number.

--

Larry Woodman

Dec 14, 2009, 2:30:02 PM
On Mon, 2009-12-14 at 08:14 -0500, Christoph Hellwig wrote:
> On Thu, Dec 10, 2009 at 06:56:26PM -0500, Rik van Riel wrote:
> > Under very heavy multi-process workloads, like AIM7, the VM can
> > get into trouble in a variety of ways. The trouble start when
> > there are hundreds, or even thousands of processes active in the
> > page reclaim code.
> >
> > Not only can the system suffer enormous slowdowns because of
> > lock contention (and conditional reschedules) between thousands
> > of processes in the page reclaim code, but each process will try
> > to free up to SWAP_CLUSTER_MAX pages, even when the system already
> > has lots of memory free. In Larry's case, this resulted in over
> > 6000 processes fighting over locks in the page reclaim code, even
> > though the system already had 1.5GB of free memory.
> >
> > It should be possible to avoid both of those issues at once, by
> > simply limiting how many processes are active in the page reclaim
> > code simultaneously.
> >
>
> This sounds like a very good argument against using direct reclaim at
> all. It reminds a bit of the issue we had in XFS with lots of processes
> pushing the AIL and causing massive slowdowns due to lock contention
> and cacheline bonucing. Moving all the AIL pushing into a dedicated
> thread solved that nicely. In the VM we already have that dedicated
> per-node kswapd thread, so moving off as much as possible work to
> should be equivalent.

Some of the new systems have 16 CPUs per node.

Rik van Riel

Dec 14, 2009, 2:40:02 PM
On 12/14/2009 07:29 AM, KOSAKI Motohiro wrote:
> sleep_on() is SMP and/or kernel preemption unsafe. This patch
> replace it with safe code.
>
> Signed-off-by: KOSAKI Motohiro<kosaki....@jp.fujitsu.com>

Reviewed-by: Rik van Riel <ri...@redhat.com>

--
All rights reversed.

Rik van Riel

Dec 14, 2009, 2:40:02 PM
On 12/14/2009 07:23 AM, KOSAKI Motohiro wrote:
> concurrent_reclaimers limitation related code made mess to shrink_zone.
> To introduce helper function increase readability.
>
> Signed-off-by: KOSAKI Motohiro<kosaki....@jp.fujitsu.com>

Reviewed-by: Rik van Riel <ri...@redhat.com>

--
All rights reversed.

Rik van Riel

Dec 14, 2009, 2:40:02 PM
On 12/14/2009 07:24 AM, KOSAKI Motohiro wrote:
>
>
> sleep_on() function is SMP and/or kernel preemption unsafe. we shouldn't
> use it on new code.
>
> Signed-off-by: KOSAKI Motohiro<kosaki....@jp.fujitsu.com>

Reviewed-by: Rik van Riel <ri...@redhat.com>

--
All rights reversed.

Rik van Riel

Dec 14, 2009, 2:40:02 PM
On 12/14/2009 07:30 AM, KOSAKI Motohiro wrote:
> if we don't use exclusive queue, wake_up() function wake _all_ waited
> task. This is simply cpu wasting.
>
> Signed-off-by: KOSAKI Motohiro<kosaki....@jp.fujitsu.com>

> if (zone_watermark_ok(zone, sc->order, low_wmark_pages(zone),


> 0, 0)) {
> - wake_up(wq);
> + wake_up_all(wq);
> finish_wait(wq,&wait);
> sc->nr_reclaimed += sc->nr_to_reclaim;
> return -ERESTARTSYS;

I believe we want to wake the processes up one at a time
here. If the queue of waiting processes is very large
and the amount of excess free memory is fairly low, the
first processes that wake up can take the amount of free
memory back down below the threshold. The rest of the
waiters should stay asleep when this happens.
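
Concretely (a sketch only, not a replacement patch, reusing the identifiers
already used in this series): with exclusive waiters a plain wake_up() wakes
just one task, and that task passes the wakeup along only while there is
still surplus memory, so the queue drains one process at a time:

	prepare_to_wait_exclusive(wq, &wait, TASK_UNINTERRUPTIBLE);
	...
	if (zone_watermark_ok(zone, sc->order, low_wmark_pages(zone), 0, 0)) {
		wake_up(wq);		/* wake only the next exclusive waiter */
		finish_wait(wq, &wait);
		sc->nr_reclaimed += sc->nr_to_reclaim;
		return -ERESTARTSYS;
	}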

--
All rights reversed.

Rik van Riel

Dec 14, 2009, 2:40:02 PM
On 12/14/2009 07:30 AM, KOSAKI Motohiro wrote:
> All task sleeping point in vmscan (e.g. congestion_wait) use
> io_schedule. then shrink_zone_begin use it too.

I'm not sure we really are in io wait when waiting on this
queue, but there's a fair chance we may be, so this is a
reasonable change.

> Signed-off-by: KOSAKI Motohiro<kosaki....@jp.fujitsu.com>

Acked-by: Rik van Riel <ri...@redhat.com>

--
All rights reversed.

Rik van Riel

Dec 14, 2009, 2:50:02 PM
On 12/14/2009 07:32 AM, KOSAKI Motohiro wrote:
> In OOM case, almost processes may be in vmscan. There isn't any reason
> the killed process continue allocation. process exiting free lots pages
> rather than greedy vmscan.
>
> Signed-off-by: KOSAKI Motohiro<kosaki....@jp.fujitsu.com>

Reviewed-by: Rik van Riel <ri...@redhat.com>

--
All rights reversed.

Rik van Riel

Dec 14, 2009, 2:50:02 PM
On 12/14/2009 08:08 AM, Andi Kleen wrote:
> Rik van Riel<ri...@redhat.com> writes:
>
>> +max_zone_concurrent_reclaim:
>> +
>> +The number of processes that are allowed to simultaneously reclaim
>> +memory from a particular memory zone.
>> +
>> +With certain workloads, hundreds of processes end up in the page
>> +reclaim code simultaneously. This can cause large slowdowns due
>> +to lock contention, freeing of way too much memory and occasionally
>> +false OOM kills.
>> +
>> +To avoid these problems, only allow a smaller number of processes
>> +to reclaim pages from each memory zone simultaneously.
>> +
>> +The default value is 8.
>
> I don't like the hardcoded number. Is the same number good for a 128MB
> embedded system as for as 1TB server? Seems doubtful.
>
> This should be perhaps scaled with memory size and number of CPUs?

The limit is per _zone_, so the number of concurrent reclaimers
is automatically scaled by the number of memory zones in the
system.

Scaling up the per-zone value as well looks like it could lead
to the kind of lock contention we are aiming to avoid in the
first place.

--
All rights reversed.

Rik van Riel

Dec 14, 2009, 2:50:02 PM
On 12/14/2009 07:32 AM, KOSAKI Motohiro wrote:
> When fork bomb invoke OOM Killer, almost task might start to reclaim and
> sleep on shrink_zone_begin(). if we use TASK_UNINTERRUPTIBLE, OOM killer
> can't kill such task. it mean we never recover from fork bomb.
>
> This patch fixes it.
>
> Signed-off-by: KOSAKI Motohiro<kosaki....@jp.fujitsu.com>

As with patch 4/8, I am not convinced that wake_up_all is
the correct thing to do.

Other than that:

Reviewed-by: Rik van Riel <ri...@redhat.com>

--
All rights reversed.

Rik van Riel

Dec 14, 2009, 2:50:02 PM

On 12/14/2009 07:31 AM, KOSAKI Motohiro wrote:
>
> From latency view, There isn't any reason shrink_zones() continue to
> reclaim another zone's page if the task reclaimed enough lots pages.

IIRC there is one reason - keeping equal pageout pressure
between zones.

However, it may be enough if just kswapd keeps evening out
the pressure, now that we limit the number of concurrent
direct reclaimers in the system.

Since kswapd does not use shrink_zones ...

> Signed-off-by: KOSAKI Motohiro<kosaki....@jp.fujitsu.com>

Reviewed-by: Rik van Riel <ri...@redhat.com>

--
All rights reversed.

Rik van Riel

Dec 14, 2009, 3:00:02 PM

On 12/14/2009 08:14 AM, Christoph Hellwig wrote:
> On Thu, Dec 10, 2009 at 06:56:26PM -0500, Rik van Riel wrote:

>> It should be possible to avoid both of those issues at once, by
>> simply limiting how many processes are active in the page reclaim
>> code simultaneously.
>>
>
> This sounds like a very good argument against using direct reclaim at
> all.

Not completely possible. Processes that call try_to_free_pages
without __GFP_FS or __GFP_IO in their gfp_flags may be holding
some kind of lock that kswapd could end up waiting on.

That means those tasks cannot wait on kswapd, because a deadlock
could happen.
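
To make that constraint concrete, here is a standalone restatement of
the gate the patch applies inline in shrink_zone() (nothing new, the
helper name is just illustrative):

	/* Only callers allowed to do IO and FS work may wait for other reclaimers. */
	static bool may_wait_on_reclaimers(gfp_t gfp_mask)
	{
		return (gfp_mask & (__GFP_IO | __GFP_FS)) == (__GFP_IO | __GFP_FS);
	}

Everyone else has to keep doing direct reclaim themselves.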

Having said that, maybe we can get away with making direct
reclaim a limited subset of what it does today. Kswapd could
be the only process scanning and unmapping ptes, submitting
IOs and scanning the active list. Direct reclaim could limit
itself to reclaiming clean inactive pages and sleeping in
congestion_wait().
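
A hypothetical shape for such a trimmed-down direct reclaim path (none
of this is in the posted patches; the function name is invented and the
scan_control knobs are only meant to make the idea concrete):

	static void direct_reclaim_light(struct zone *zone, struct scan_control *sc,
					 int priority)
	{
		sc->may_writepage = 0;	/* no page-out from this context */
		sc->may_unmap = 0;	/* leave mapped pages (and their ptes) alone */

		/* pick off clean, unmapped pages from the file LRU only */
		shrink_inactive_list(SWAP_CLUSTER_MAX, zone, sc, priority, 1);

		/* still short on memory? throttle and let kswapd do the real work */
		if (!zone_watermark_ok(zone, sc->order, low_wmark_pages(zone), 0, 0))
			congestion_wait(BLK_RW_ASYNC, HZ/10);
	}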

--
All rights reversed.

Arjan van de Ven

Dec 14, 2009, 4:10:02 PM

On Mon, 14 Dec 2009 08:03:02 -0500
Christoph Hellwig <h...@infradead.org> wrote:

> On Mon, Dec 14, 2009 at 09:24:40PM +0900, KOSAKI Motohiro wrote:
> >
> >
> > sleep_on() function is SMP and/or kernel preemption unsafe. we
> > shouldn't use it on new code.
>
> And the best way to archive this is to remove the function.
>
> In Linus' current tree I find:
>
> - 5 instances of sleep_on(), all in old and obscure block drivers
> - 2 instances of sleep_on_timeout(), both in old and obscure drivers

these should both die; the sleep_on() ones using BROKEN in Kconfig..
.. sleep_on() has not worked in the 2.6 series ever.... ;)


>
> - 28 instances of interruptible_sleep_on_timeout(), mostly in obscure
> drivers with a high concentration in the old oss core which should
> be killed anyway. And unfortunately a few relatively recent additions
> like the SGI xpc driver or usbvision driver

can we also make sure that checkpatch.pl catches any new additions?
(not saying checkpatch.pl is the end-all, but the people who do run it
at least now have a chance ;-)


--
Arjan van de Ven Intel Open Source Technology Centre
For development, discussion and tips for power savings,
visit http://www.lesswatts.org

Andi Kleen

Dec 14, 2009, 4:30:02 PM

On Mon, Dec 14, 2009 at 09:23:19AM -0500, Larry Woodman wrote:
> On Mon, 2009-12-14 at 14:08 +0100, Andi Kleen wrote:
> > Rik van Riel <ri...@redhat.com> writes:
> >
> > > +max_zone_concurrent_reclaim:
> > > +
> > > +The number of processes that are allowed to simultaneously reclaim
> > > +memory from a particular memory zone.
> > > +
> > > +With certain workloads, hundreds of processes end up in the page
> > > +reclaim code simultaneously. This can cause large slowdowns due
> > > +to lock contention, freeing of way too much memory and occasionally
> > > +false OOM kills.
> > > +
> > > +To avoid these problems, only allow a smaller number of processes
> > > +to reclaim pages from each memory zone simultaneously.
> > > +
> > > +The default value is 8.
> >
> > I don't like the hardcoded number. Is the same number good for a 128MB
> > embedded system as for as 1TB server? Seems doubtful.
> >
> > This should be perhaps scaled with memory size and number of CPUs?
>
> Remember this a per-zone number.

A zone could be 64MB or 32GB. And the system could have 1 or 1024 CPUs.

-Andi

--
a...@linux.intel.com -- Speaking for myself only.

Larry Woodman

Dec 14, 2009, 5:10:01 PM

On Fri, 2009-12-11 at 16:46 -0500, Rik van Riel wrote:

Rik, the latest patch appears to have a problem, although I don't know
what the problem is yet. When the system ran out of memory we saw
thousands of runnable processes and 100% system time:


9420 2 29824 79856 62676 19564 0 0 0 0 8054 379 0 100 0 0 0
9420 2 29824 79368 62292 19564 0 0 0 0 8691 413 0 100 0 0 0
9421 1 29824 79780 61780 19820 0 0 0 0 8928 408 0 100 0 0 0

The system would not respond, so I don't know what's going on yet. I'll
add debug code to figure out why it's in that state as soon as I get
access to the hardware.

Larry


> Under very heavy multi-process workloads, like AIM7, the VM can
> get into trouble in a variety of ways. The trouble start when
> there are hundreds, or even thousands of processes active in the
> page reclaim code.
>
> Not only can the system suffer enormous slowdowns because of
> lock contention (and conditional reschedules) between thousands
> of processes in the page reclaim code, but each process will try
> to free up to SWAP_CLUSTER_MAX pages, even when the system already
> has lots of memory free.
>

> It should be possible to avoid both of those issues at once, by
> simply limiting how many processes are active in the page reclaim
> code simultaneously.
>

> If too many processes are active doing page reclaim in one zone,
> simply go to sleep in shrink_zone().
>
> On wakeup, check whether enough memory has been freed already
> before jumping into the page reclaim code ourselves. We want
> to use the same threshold here that is used in the page allocator
> for deciding whether or not to call the page reclaim code in the
> first place, otherwise some unlucky processes could end up freeing
> memory for the rest of the system.
>
> Reported-by: Larry Woodman <lwoo...@redhat.com>
> Signed-off-by: Rik van Riel <ri...@redhat.com>
>
> ---
> v2:
> - fix typos in sysctl.c and vm.txt
> - move the code in sysctl.c out from under the ifdef
> - only __GFP_FS|__GFP_IO tasks can wait
>
> Documentation/sysctl/vm.txt | 18 ++++++++++++++
> include/linux/mmzone.h | 4 +++
> include/linux/swap.h | 1 +
> kernel/sysctl.c | 7 +++++
> mm/page_alloc.c | 3 ++
> mm/vmscan.c | 40 +++++++++++++++++++++++++++++++++
> 6 files changed, 73 insertions(+), 0 deletions(-)
>
> diff --git a/Documentation/sysctl/vm.txt b/Documentation/sysctl/vm.txt
> index fc5790d..8bd1a96 100644
> --- a/Documentation/sysctl/vm.txt
> +++ b/Documentation/sysctl/vm.txt
> @@ -32,6 +32,7 @@ Currently, these files are in /proc/sys/vm:
> - legacy_va_layout
> - lowmem_reserve_ratio
> - max_map_count
> +- max_zone_concurrent_reclaimers
> - memory_failure_early_kill
> - memory_failure_recovery
> - min_free_kbytes
> @@ -278,6 +279,23 @@ The default value is 65536.
>
> =============================================================
>
> +max_zone_concurrent_reclaimers:


> +
> +The number of processes that are allowed to simultaneously reclaim
> +memory from a particular memory zone.
> +
> +With certain workloads, hundreds of processes end up in the page
> +reclaim code simultaneously. This can cause large slowdowns due
> +to lock contention, freeing of way too much memory and occasionally
> +false OOM kills.
> +
> +To avoid these problems, only allow a smaller number of processes
> +to reclaim pages from each memory zone simultaneously.
> +
> +The default value is 8.

> +
> +=============================================================
> +
> memory_failure_early_kill:
>
> Control how to kill processes when uncorrected memory error (typically
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index 30fe668..ed614b8 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -345,6 +345,10 @@ struct zone {
> /* Zone statistics */
> atomic_long_t vm_stat[NR_VM_ZONE_STAT_ITEMS];
>
> + /* Number of processes running page reclaim code on this zone. */
> + atomic_t concurrent_reclaimers;
> + wait_queue_head_t reclaim_wait;
> +
> /*
> * prev_priority holds the scanning priority for this zone. It is
> * defined as the scanning priority at which we achieved our reclaim
> diff --git a/include/linux/swap.h b/include/linux/swap.h
> index a2602a8..661eec7 100644
> --- a/include/linux/swap.h
> +++ b/include/linux/swap.h
> @@ -254,6 +254,7 @@ extern unsigned long shrink_all_memory(unsigned long nr_pages);
> extern int vm_swappiness;
> extern int remove_mapping(struct address_space *mapping, struct page *page);
> extern long vm_total_pages;
> +extern int max_zone_concurrent_reclaimers;
>
> #ifdef CONFIG_NUMA
> extern int zone_reclaim_mode;
> diff --git a/kernel/sysctl.c b/kernel/sysctl.c
> index 6ff0ae6..4ec17ed 100644
> --- a/kernel/sysctl.c
> +++ b/kernel/sysctl.c
> @@ -1271,6 +1271,13 @@ static struct ctl_table vm_table[] = {
> .extra2 = &one,
> },
> #endif
> + {
> + .procname = "max_zone_concurrent_reclaimers",
> + .data = &max_zone_concurrent_reclaimers,
> + .maxlen = sizeof(max_zone_concurrent_reclaimers),
> + .mode = 0644,
> + .proc_handler = &proc_dointvec,
> + },
>
> /*
> * NOTE: do not add new entries to this table unless you have read
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 11ae66e..ca9cae1 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -3852,6 +3852,9 @@ static void __paginginit free_area_init_core(struct pglist_data *pgdat,
>
> zone->prev_priority = DEF_PRIORITY;
>
> + atomic_set(&zone->concurrent_reclaimers, 0);
> + init_waitqueue_head(&zone->reclaim_wait);
> +
> zone_pcp_init(zone);
> for_each_lru(l) {
> INIT_LIST_HEAD(&zone->lru[l].list);
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 2bbee91..ecfe28c 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -40,6 +40,7 @@
> #include <linux/memcontrol.h>
> #include <linux/delayacct.h>
> #include <linux/sysctl.h>
> +#include <linux/wait.h>
>
> #include <asm/tlbflush.h>
> #include <asm/div64.h>
> @@ -129,6 +130,17 @@ struct scan_control {
> int vm_swappiness = 60;
> long vm_total_pages; /* The total number of pages which the VM controls */


>
> +/*
> + * Maximum number of processes concurrently running the page
> + * reclaim code in a memory zone. Having too many processes
> + * just results in them burning CPU time waiting for locks,
> + * so we're better off limiting page reclaim to a sane number
> + * of processes at a time. We do this per zone so local node
> + * reclaim on one NUMA node will not block other nodes from
> + * making progress.
> + */
> +int max_zone_concurrent_reclaimers = 8;

> +
> static LIST_HEAD(shrinker_list);
> static DECLARE_RWSEM(shrinker_rwsem);
>

> @@ -1600,6 +1612,31 @@ static void shrink_zone(int priority, struct zone *zone,


> struct zone_reclaim_stat *reclaim_stat = get_reclaim_stat(zone, sc);
> int noswap = 0;
>
> + if (!current_is_kswapd() && atomic_read(&zone->concurrent_reclaimers) >

> + max_zone_concurrent_reclaimers &&
> + (sc->gfp_mask & (__GFP_IO|__GFP_FS)) ==
> + (__GFP_IO|__GFP_FS)) {


> + /*
> + * Do not add to the lock contention if this zone has
> + * enough processes doing page reclaim already, since
> + * we would just make things slower.
> + */
> + sleep_on(&zone->reclaim_wait);

> +
> + /*
> + * If other processes freed enough memory while we waited,
> + * break out of the loop and go back to the allocator.
> + */
> + if (zone_watermark_ok(zone, sc->order, low_wmark_pages(zone),
> + 0, 0)) {
> + wake_up(&zone->reclaim_wait);
> + sc->nr_reclaimed += nr_to_reclaim;
> + return;
> + }
> + }
> +
> + atomic_inc(&zone->concurrent_reclaimers);
> +


> /* If we have no swap space, do not bother scanning anon pages. */
> if (!sc->may_swap || (nr_swap_pages <= 0)) {

> noswap = 1;
> @@ -1655,6 +1692,9 @@ static void shrink_zone(int priority, struct zone *zone,


> shrink_active_list(SWAP_CLUSTER_MAX, zone, sc, priority, 0);
>
> throttle_vm_writeout(sc->gfp_mask);

> +
> + atomic_dec(&zone->concurrent_reclaimers);
> + wake_up(&zone->reclaim_wait);
> }
>

> /*
>
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majo...@kvack.org. For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"do...@kvack.org"> em...@kvack.org </a>

Minchan Kim

Dec 14, 2009, 11:00:02 PM

On Mon, 14 Dec 2009 21:23:46 +0900 (JST)
KOSAKI Motohiro <kosaki....@jp.fujitsu.com> wrote:

> concurrent_reclaimers limitation related code made mess to shrink_zone.
> To introduce helper function increase readability.
>

> Signed-off-by: KOSAKI Motohiro <kosaki....@jp.fujitsu.com>
Reviewed-by: Minchan Kim <minch...@gmail.com>

Looking over the patch series, it's a nice cleanup.

--
Kind regards,
Minchan Kim

Minchan Kim

Dec 14, 2009, 11:00:02 PM

On Mon, 14 Dec 2009 21:24:40 +0900 (JST)
KOSAKI Motohiro <kosaki....@jp.fujitsu.com> wrote:

>
>
> sleep_on() function is SMP and/or kernel preemption unsafe. we shouldn't
> use it on new code.
>

> Signed-off-by: KOSAKI Motohiro <kosaki....@jp.fujitsu.com>

It would be better to remove this function entirely,
but this change is enough for this patch series;
sleep_on() can be removed in a separate series.

Reviewed-by: Minchan Kim <minch...@gmail.com>

Minchan Kim

Dec 14, 2009, 11:00:03 PM

On Mon, 14 Dec 2009 21:29:28 +0900 (JST)
KOSAKI Motohiro <kosaki....@jp.fujitsu.com> wrote:

> sleep_on() is SMP and/or kernel preemption unsafe. This patch

> replace it with safe code.
>
> Signed-off-by: KOSAKI Motohiro <kosaki....@jp.fujitsu.com>
Reviewed-by: Minchan Kim <minch...@gmail.com>

Minchan Kim

Dec 14, 2009, 11:10:02 PM


I think it's a typo. The description in the changelog says we want "wake_up".
Otherwise, looks good to me.

Reviewed-by: Minchan Kim <minch...@gmail.com>

--
Kind regards,
Minchan Kim

Minchan Kim

Dec 15, 2009, 12:00:01 AM

On Mon, 14 Dec 2009 21:30:54 +0900 (JST)
KOSAKI Motohiro <kosaki....@jp.fujitsu.com> wrote:

> All task sleeping point in vmscan (e.g. congestion_wait) use
> io_schedule. then shrink_zone_begin use it too.
>
> Signed-off-by: KOSAKI Motohiro <kosaki....@jp.fujitsu.com>
> ---
> mm/vmscan.c | 2 +-
> 1 files changed, 1 insertions(+), 1 deletions(-)
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 3562a2d..0880668 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -1624,7 +1624,7 @@ static int shrink_zone_begin(struct zone *zone, struct scan_control *sc)
> max_zone_concurrent_reclaimers)
> break;
>
> - schedule();
> + io_schedule();

Hmm. We have many cond_resched() calls in vmscan.c that are not io_schedule().
In addition, if the system has no swap space and runs out of page cache
due to heavy memory pressure, the VM might scan and drop pages until priority
reaches zero or the zone becomes unreclaimable.

I don't think that would be an I/O wait.

>
> /*
> * If other processes freed enough memory while we waited,
> --
> 1.6.5.2
>
>
>


--
Kind regards,
Minchan Kim

Minchan Kim

Dec 15, 2009, 12:00:01 AM

On Mon, 14 Dec 2009 21:32:18 +0900 (JST)
KOSAKI Motohiro <kosaki....@jp.fujitsu.com> wrote:

> When fork bomb invoke OOM Killer, almost task might start to reclaim and
> sleep on shrink_zone_begin(). if we use TASK_UNINTERRUPTIBLE, OOM killer
> can't kill such task. it mean we never recover from fork bomb.
>
> This patch fixes it.
>
> Signed-off-by: KOSAKI Motohiro <kosaki....@jp.fujitsu.com>

I doubt wake_up_all too; I think it's a typo.
Otherwise, looks good to me.

Reviewed-by: Minchan Kim <minch...@gmail.com>

--

KOSAKI Motohiro

Dec 15, 2009, 12:00:03 AM

> On 12/14/2009 07:31 AM, KOSAKI Motohiro wrote:
> >
> > From latency view, There isn't any reason shrink_zones() continue to
> > reclaim another zone's page if the task reclaimed enough lots pages.
>
> IIRC there is one reason - keeping equal pageout pressure
> between zones.
>
> However, it may be enough if just kswapd keeps evening out
> the pressure, now that we limit the number of concurrent
> direct reclaimers in the system.
>
> Since kswapd does not use shrink_zones ...

Sure. That's exactly my point.
Plus, balance_pgdat() scans only one node, so zone balancing is
meaningful there. But shrink_zones() scans every zone on every node;
we don't need inter-node balancing there. That's vmscan's business.


> > Signed-off-by: KOSAKI Motohiro<kosaki....@jp.fujitsu.com>
>
> Reviewed-by: Rik van Riel <ri...@redhat.com>

Thanks.

Minchan Kim

Dec 15, 2009, 12:10:03 AM

On Mon, 14 Dec 2009 21:32:58 +0900 (JST)
KOSAKI Motohiro <kosaki....@jp.fujitsu.com> wrote:

> In OOM case, almost processes may be in vmscan. There isn't any reason
> the killed process continue allocation. process exiting free lots pages
> rather than greedy vmscan.
>

> Signed-off-by: KOSAKI Motohiro <kosaki....@jp.fujitsu.com>
> ---

> mm/page_alloc.c | 8 ++++++++
> 1 files changed, 8 insertions(+), 0 deletions(-)
>
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index ca9cae1..8a9cbaa 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -1878,6 +1878,14 @@ rebalance:
> goto got_pg;
>
> /*
> + * If the allocation is for userland page and we have fatal signal,
> + * there isn't any reason to continue allocation. instead, the task
> + * should exit soon.
> + */
> + if (fatal_signal_pending(current) && (gfp_mask & __GFP_HIGHMEM))
> + goto nopage;

If we jump to nopage, we hit dump_stack() and show_mem().
We can even end up in the OOM killer, which might kill an innocent process.

> +
> + /*
> * If we failed to make any progress reclaiming, then we are
> * running out of options and have to consider going OOM
> */
> --
> 1.6.5.2
>
>
>


--
Kind regards,
Minchan Kim

Minchan Kim

Dec 15, 2009, 12:20:02 AM

On Mon, 14 Dec 2009 21:31:36 +0900 (JST)
KOSAKI Motohiro <kosaki....@jp.fujitsu.com> wrote:

>
> From latency view, There isn't any reason shrink_zones() continue to
> reclaim another zone's page if the task reclaimed enough lots pages.
>

> Signed-off-by: KOSAKI Motohiro <kosaki....@jp.fujitsu.com>
> ---

> mm/vmscan.c | 16 ++++++++++++----
> 1 files changed, 12 insertions(+), 4 deletions(-)
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 0880668..bf229d3 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -1654,7 +1654,7 @@ static void shrink_zone_end(struct zone *zone, struct scan_control *sc)
> /*
> * This is a basic per-zone page freer. Used by both kswapd and direct reclaim.
> */
> -static void shrink_zone(int priority, struct zone *zone,
> +static int shrink_zone(int priority, struct zone *zone,
> struct scan_control *sc)
> {
> unsigned long nr[NR_LRU_LISTS];
> @@ -1669,7 +1669,7 @@ static void shrink_zone(int priority, struct zone *zone,
>
> ret = shrink_zone_begin(zone, sc);
> if (ret)
> - return;
> + return ret;


>
> /* If we have no swap space, do not bother scanning anon pages. */
> if (!sc->may_swap || (nr_swap_pages <= 0)) {

> @@ -1692,6 +1692,7 @@ static void shrink_zone(int priority, struct zone *zone,
> &reclaim_stat->nr_saved_scan[l]);
> }
>
> + ret = 0;
> while (nr[LRU_INACTIVE_ANON] || nr[LRU_ACTIVE_FILE] ||
> nr[LRU_INACTIVE_FILE]) {
> for_each_evictable_lru(l) {
> @@ -1712,8 +1713,10 @@ static void shrink_zone(int priority, struct zone *zone,
> * with multiple processes reclaiming pages, the total
> * freeing target can get unreasonably large.
> */
> - if (nr_reclaimed >= nr_to_reclaim && priority < DEF_PRIORITY)
> + if (nr_reclaimed >= nr_to_reclaim && priority < DEF_PRIORITY) {
> + ret = -ERESTARTSYS;

Just a nitpick.

Does shrink_zone's return value matter?
shrink_zones() never handles it.

> break;
> + }
> }
>
> sc->nr_reclaimed = nr_reclaimed;
> @@ -1727,6 +1730,8 @@ static void shrink_zone(int priority, struct zone *zone,
>
> throttle_vm_writeout(sc->gfp_mask);
> shrink_zone_end(zone, sc);
> +
> + return ret;
> }
>
> /*
> @@ -1751,6 +1756,7 @@ static void shrink_zones(int priority, struct zonelist *zonelist,
> enum zone_type high_zoneidx = gfp_zone(sc->gfp_mask);
> struct zoneref *z;
> struct zone *zone;
> + int ret;
>
> sc->all_unreclaimable = 1;
> for_each_zone_zonelist_nodemask(zone, z, zonelist, high_zoneidx,
> @@ -1780,7 +1786,9 @@ static void shrink_zones(int priority, struct zonelist *zonelist,
> priority);
> }
>
> - shrink_zone(priority, zone, sc);
> + ret = shrink_zone(priority, zone, sc);
> + if (ret)
> + return;
> }
> }


> --
> 1.6.5.2
>
>
>

As a matter of fact, I am worried about this patch.

My opinion is that we should put it aside until we can solve Larry's problem;
we could apply it in the future.

I don't want to see side effects while we focus on Larry's problem.
But if you mind my suggestion, I won't bother you with this nitpick.


Thanks for the great cleanup and for improving the VM, Kosaki. :)

KOSAKI Motohiro

Dec 15, 2009, 12:40:01 AM


shrink_zones() stops vmscan quickly if ret is non-zero.
If we have already reclaimed more than nr_to_reclaim, we can stop vmscan.


> As a matter of fact, I am worried about this patch.
>
> My opinion is we put aside this patch until we can solve Larry's problem.
> We could apply this patch in future.
>
> I don't want to see the side effect while we focus Larry's problem.
> But If you mind my suggestion, I also will not bother you by this nitpick.
>
> Thanks for great cleanup and improving VM, Kosaki. :)

I agree that Larry's issue is the highest priority.

KOSAKI Motohiro

Dec 15, 2009, 12:50:01 AM

> On Fri, 2009-12-11 at 16:46 -0500, Rik van Riel wrote:
>
> Rik, the latest patch appears to have a problem although I dont know
> what the problem is yet. When the system ran out of memory we see
> thousands of runnable processes and 100% system time:
>
>
> 9420 2 29824 79856 62676 19564 0 0 0 0 8054 379 0 100 0 0 0
> 9420 2 29824 79368 62292 19564 0 0 0 0 8691 413 0 100 0 0 0
> 9421 1 29824 79780 61780 19820 0 0 0 0 8928 408 0 100 0 0 0
>
> The system would not respond so I dont know whats going on yet. I'll
> add debug code to figure out why its in that state as soon as I get
> access to the hardware.
>
> Larry

There are 9421 runnable processes. That means the concurrent reclaimer
limitation doesn't work well. Hmm?

KOSAKI Motohiro

Dec 15, 2009, 12:50:02 AM

> On 12/14/2009 07:30 AM, KOSAKI Motohiro wrote:
> > if we don't use exclusive queue, wake_up() function wake _all_ waited
> > task. This is simply cpu wasting.
> >
> > Signed-off-by: KOSAKI Motohiro<kosaki....@jp.fujitsu.com>
>
> > if (zone_watermark_ok(zone, sc->order, low_wmark_pages(zone),
> > 0, 0)) {
> > - wake_up(wq);
> > + wake_up_all(wq);
> > finish_wait(wq,&wait);
> > sc->nr_reclaimed += sc->nr_to_reclaim;
> > return -ERESTARTSYS;
>
> I believe we want to wake the processes up one at a time
> here. If the queue of waiting processes is very large
> and the amount of excess free memory is fairly low, the
> first processes that wake up can take the amount of free
> memory back down below the threshold. The rest of the
> waiters should stay asleep when this happens.

OK.

Actually, wake_up() and wake_up_all() aren't all that different here.
Even though we use wake_up(), each woken task wakes the next task before
trying to allocate memory, so the end result is similar to wake_up_all().

However, there is a small difference. The recent scheduler latency
improvements reduced the default scheduler latency target. That means
that if we have lots of runnable tasks, each task gets a very small time
slice, and too-frequent context switching decreases VM efficiency.
Thank you, Rik. I hadn't noticed that wake_up() performs better than
wake_up_all() on current kernels.


Subject: [PATCH 9/8] replace wake_up_all with wake_up

Fix typo.

Signed-off-by: KOSAKI Motohiro <kosaki....@jp.fujitsu.com>
---

mm/vmscan.c | 2 +-
1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index e5adb7a..b3b4e77 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1644,7 +1644,7 @@ static int shrink_zone_begin(struct zone *zone, struct scan_control *sc)
return 0;

found_lots_memory:
- wake_up_all(wq);
+ wake_up(wq);
stop_reclaim:
finish_wait(wq, &wait);


sc->nr_reclaimed += sc->nr_to_reclaim;

--
1.6.5.2

KOSAKI Motohiro

Dec 15, 2009, 1:00:01 AM

> > /*
> > + * If the allocation is for userland page and we have fatal signal,
> > + * there isn't any reason to continue allocation. instead, the task
> > + * should exit soon.
> > + */
> > + if (fatal_signal_pending(current) && (gfp_mask & __GFP_HIGHMEM))
> > + goto nopage;
>
> If we jump nopage, we meets dump_stack and show_mem.
> Even, we can meet OOM which might kill innocent process.

Which point do you object to? Would suppressing the printout be better?

KOSAKI Motohiro

Dec 15, 2009, 1:00:01 AM

> On Mon, 14 Dec 2009 21:30:54 +0900 (JST)
> KOSAKI Motohiro <kosaki....@jp.fujitsu.com> wrote:
>
> > All task sleeping point in vmscan (e.g. congestion_wait) use
> > io_schedule. then shrink_zone_begin use it too.
> >
> > Signed-off-by: KOSAKI Motohiro <kosaki....@jp.fujitsu.com>
> > ---
> > mm/vmscan.c | 2 +-
> > 1 files changed, 1 insertions(+), 1 deletions(-)
> >
> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > index 3562a2d..0880668 100644
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -1624,7 +1624,7 @@ static int shrink_zone_begin(struct zone *zone, struct scan_control *sc)
> > max_zone_concurrent_reclaimers)
> > break;
> >
> > - schedule();
> > + io_schedule();
>
> Hmm. We have many cond_resched which is not io_schedule in vmscan.c.

cond_resched() doesn't mean sleeping on a wait queue; it's more like yield.

> In addition, if system doesn't have swap device space and out of page cache
> due to heavy memory pressue, VM might scan & drop pages until priority is zero
> or zone is unreclaimable.
>
> I think it would be not a IO wait.

Two points:
1. For a long time, administrators have watched iowait% under heavy memory
   pressure. I don't want to change that without a good reason.
2. iowait gives a small scheduler bonus, and a task doing vmscan should get
   more time slice than a memory-consuming task. That improves VM stability.

But I agree the benefit isn't that big. If we have a good reason, I don't
oppose using schedule().

Minchan Kim

Dec 15, 2009, 1:10:02 AM

On Tue, 15 Dec 2009 09:50:47 +0900 (JST)
KOSAKI Motohiro <kosaki....@jp.fujitsu.com> wrote:

> > > /*
> > > + * If the allocation is for userland page and we have fatal signal,
> > > + * there isn't any reason to continue allocation. instead, the task
> > > + * should exit soon.
> > > + */
> > > + if (fatal_signal_pending(current) && (gfp_mask & __GFP_HIGHMEM))
> > > + goto nopage;
> >
> > If we jump nopage, we meets dump_stack and show_mem.
> > Even, we can meet OOM which might kill innocent process.
>
> Which point you oppose? noprint is better?
>
>

Sorry for not being clear.
My points are as follows.

First,
I don't want to print.
Why do we print the stack and memory info when the process receives SIGKILL?

Second,
1) A process tries to allocate an anon page in do_anonymous_page().
2) The process receives SIGKILL.
3) With your patch, the kernel doesn't allocate the page to the process.
4) do_anonymous_page() returns VM_FAULT_OOM.
5) mm_fault_error() is called.
6) out_of_memory() is called.
7) It might kill an innocent task.

If I missed something, please correct me. :)

--
Kind regards,
Minchan Kim

KOSAKI Motohiro

Dec 15, 2009, 1:20:01 AM


Doh, you are completely right. I had forgotten the recent change in the
meaning of VM_FAULT_OOM. Yes, this patch is crap. I need to rework it.

Minchan Kim

Dec 15, 2009, 1:20:01 AM

On Tue, Dec 15, 2009 at 9:56 AM, KOSAKI Motohiro
<kosaki....@jp.fujitsu.com> wrote:
>> On Mon, 14 Dec 2009 21:30:54 +0900 (JST)
>> KOSAKI Motohiro <kosaki....@jp.fujitsu.com> wrote:
>>
>> > All task sleeping point in vmscan (e.g. congestion_wait) use
>> > io_schedule. then shrink_zone_begin use it too.
>> >
>> > Signed-off-by: KOSAKI Motohiro <kosaki....@jp.fujitsu.com>
>> > ---
>> >  mm/vmscan.c |    2 +-
>> >  1 files changed, 1 insertions(+), 1 deletions(-)
>> >
>> > diff --git a/mm/vmscan.c b/mm/vmscan.c
>> > index 3562a2d..0880668 100644
>> > --- a/mm/vmscan.c
>> > +++ b/mm/vmscan.c
>> > @@ -1624,7 +1624,7 @@ static int shrink_zone_begin(struct zone *zone, struct scan_control *sc)
>> >                 max_zone_concurrent_reclaimers)
>> >                     break;
>> >
>> > -           schedule();
>> > +           io_schedule();
>>
>> Hmm. We have many cond_resched which is not io_schedule in vmscan.c.
>
> cond_resched don't mean sleep on wait queue. it's similar to yield.

I was confused.
Thanks for correcting me. :)

>
>> In addition, if system doesn't have swap device space and out of page cache
>> due to heavy memory pressue, VM might scan & drop pages until priority is zero
>> or zone is unreclaimable.
>>
>> I think it would be not a IO wait.
>
> two point.
> 1. For long time, Administrator usually watch iowait% at heavy memory pressure. I
> don't hope change this without reasonable reason. 2. iowait makes scheduler
> bonus a bit, vmscan task should have many time slice than memory consume
> task. it improve VM stabilization.

AFAIK, the CFS scheduler doesn't give a bonus to I/O-wait tasks any more.

>
> but I agree the benefit isn't so big. if we have reasonable reason, I
> don't oppose use schedule().
>
>
>
>

--
Kind regards,
Minchan Kim

Mike Galbraith

Dec 15, 2009, 5:40:02 AM

On Tue, 2009-12-15 at 09:45 +0900, KOSAKI Motohiro wrote:
> > On 12/14/2009 07:30 AM, KOSAKI Motohiro wrote:
> > > if we don't use exclusive queue, wake_up() function wake _all_ waited
> > > task. This is simply cpu wasting.
> > >
> > > Signed-off-by: KOSAKI Motohiro<kosaki....@jp.fujitsu.com>
> >
> > > if (zone_watermark_ok(zone, sc->order, low_wmark_pages(zone),
> > > 0, 0)) {
> > > - wake_up(wq);
> > > + wake_up_all(wq);
> > > finish_wait(wq,&wait);
> > > sc->nr_reclaimed += sc->nr_to_reclaim;
> > > return -ERESTARTSYS;
> >
> > I believe we want to wake the processes up one at a time
> > here. If the queue of waiting processes is very large
> > and the amount of excess free memory is fairly low, the
> > first processes that wake up can take the amount of free
> > memory back down below the threshold. The rest of the
> > waiters should stay asleep when this happens.
>
> OK.
>
> Actually, wake_up() and wake_up_all() aren't different so much.
> Although we use wake_up(), the task wake up next task before
> try to alloate memory. then, it's similar to wake_up_all().

What happens to waiters should running tasks not allocate for a while?

> However, there are few difference. recent scheduler latency improvement
> effort reduce default scheduler latency target. it mean, if we have
> lots tasks of running state, the task have very few time slice. too
> frequently context switch decrease VM efficiency.
> Thank you, Rik. I didn't notice wake_up() makes better performance than
> wake_up_all() on current kernel.

Perhaps this is a spot where an explicit wake_up_all_nopreempt() would
be handy. Excessive wakeup preemption from wake_up_all() has long been
annoying when there are many waiters, but converting it to only have the
first wakeup be preemptive proved harmful to performance. Recent tweaks
will have aggravated the problem somewhat, but it's not new.

-Mike

Mike Galbraith

Dec 15, 2009, 8:30:02 AM

> be handy....

Maybe something like below. I can also imagine that under _heavy_ vm
pressure, it'd likely be good for throughput to not provide for sleeper
fairness for these wakeups as well, as that increases vruntime spread,
and thus increases preemption with no benefit in sight.

---
include/linux/sched.h | 1 +
include/linux/wait.h | 3 +++
kernel/sched.c | 21 +++++++++++++++++++++
kernel/sched_fair.c | 2 +-
4 files changed, 26 insertions(+), 1 deletion(-)

Index: linux-2.6/include/linux/sched.h
===================================================================
--- linux-2.6.orig/include/linux/sched.h
+++ linux-2.6/include/linux/sched.h
@@ -1065,6 +1065,7 @@ struct sched_domain;
*/
#define WF_SYNC 0x01 /* waker goes to sleep after wakup */
#define WF_FORK 0x02 /* child wakeup after fork */
+#define WF_NOPREEMPT 0x04 /* wakeup is not preemptive */

struct sched_class {
const struct sched_class *next;
Index: linux-2.6/include/linux/wait.h
===================================================================
--- linux-2.6.orig/include/linux/wait.h
+++ linux-2.6/include/linux/wait.h
@@ -140,6 +140,7 @@ static inline void __remove_wait_queue(w
}

void __wake_up(wait_queue_head_t *q, unsigned int mode, int nr, void *key);
+void __wake_up_nopreempt(wait_queue_head_t *q, unsigned int mode, int nr, void *key);
void __wake_up_locked_key(wait_queue_head_t *q, unsigned int mode, void *key);
void __wake_up_sync_key(wait_queue_head_t *q, unsigned int mode, int nr,
void *key);
@@ -154,8 +155,10 @@ int out_of_line_wait_on_bit_lock(void *,
wait_queue_head_t *bit_waitqueue(void *, int);

#define wake_up(x) __wake_up(x, TASK_NORMAL, 1, NULL)
+#define wake_up_nopreempt(x) __wake_up_nopreempt(x, TASK_NORMAL, 1, NULL)
#define wake_up_nr(x, nr) __wake_up(x, TASK_NORMAL, nr, NULL)
#define wake_up_all(x) __wake_up(x, TASK_NORMAL, 0, NULL)
+#define wake_up_all_nopreempt(x) __wake_up_nopreempt(x, TASK_NORMAL, 0, NULL)
#define wake_up_locked(x) __wake_up_locked((x), TASK_NORMAL)

#define wake_up_interruptible(x) __wake_up(x, TASK_INTERRUPTIBLE, 1, NULL)
Index: linux-2.6/kernel/sched.c
===================================================================
--- linux-2.6.orig/kernel/sched.c
+++ linux-2.6/kernel/sched.c
@@ -5682,6 +5682,27 @@ void __wake_up(wait_queue_head_t *q, uns
}
EXPORT_SYMBOL(__wake_up);

+/**
+ * __wake_up_nopreempt - wake up threads blocked on a waitqueue.
+ * @q: the waitqueue
+ * @mode: which threads
+ * @nr_exclusive: how many wake-one or wake-many threads to wake up
+ * @key: is directly passed to the wakeup function
+ *
+ * It may be assumed that this function implies a write memory barrier before
+ * changing the task state if and only if any tasks are woken up.
+ */
+void __wake_up_nopreempt(wait_queue_head_t *q, unsigned int mode,
+ int nr_exclusive, void *key)
+{
+ unsigned long flags;
+
+ spin_lock_irqsave(&q->lock, flags);
+ __wake_up_common(q, mode, nr_exclusive, WF_NOPREEMPT, key);
+ spin_unlock_irqrestore(&q->lock, flags);
+}
+EXPORT_SYMBOL(__wake_up_nopreempt);
+
/*
* Same as __wake_up but called with the spinlock in wait_queue_head_t held.
*/
Index: linux-2.6/kernel/sched_fair.c
===================================================================
--- linux-2.6.orig/kernel/sched_fair.c
+++ linux-2.6/kernel/sched_fair.c
@@ -1709,7 +1709,7 @@ static void check_preempt_wakeup(struct
pse->avg_overlap < sysctl_sched_migration_cost)
goto preempt;

- if (!sched_feat(WAKEUP_PREEMPT))
+ if (!sched_feat(WAKEUP_PREEMPT) || (wake_flags & WF_NOPREEMPT))
return;

update_curr(cfs_rq);

Mike Galbraith

Dec 15, 2009, 2:40:02 PM


Copy/pasting some methods and hardcoding futexes, where I know vmark
loads my little box to a ridiculous degree, it's good for a ~17%
throughput boost. Used prudently, something along these lines could
save some thrashing when core code knows it's handling a surge. It
would have a very negative effect at low to modest load, though.

Hohum.

---
include/linux/completion.h | 2
include/linux/sched.h | 10 ++-
include/linux/wait.h | 3
kernel/sched.c | 140 ++++++++++++++++++++++++++++++++++++---------
kernel/sched_fair.c | 32 +++++-----
kernel/sched_idletask.c | 2
kernel/sched_rt.c | 6 -
7 files changed, 146 insertions(+), 49 deletions(-)

Index: linux-2.6/include/linux/sched.h
===================================================================
--- linux-2.6.orig/include/linux/sched.h
+++ linux-2.6/include/linux/sched.h

@@ -1065,12 +1065,16 @@ struct sched_domain;


*/
#define WF_SYNC 0x01 /* waker goes to sleep after wakup */
#define WF_FORK 0x02 /* child wakeup after fork */

+#define WF_BATCH 0x04 /* batch wakeup, not preemptive */
+#define WF_REQUEUE 0x00 /* task requeue */
+#define WF_WAKE 0x10 /* task waking */
+#define WF_SLEEP 0x20 /* task going to sleep */



struct sched_class {
const struct sched_class *next;

- void (*enqueue_task) (struct rq *rq, struct task_struct *p, int wakeup);
- void (*dequeue_task) (struct rq *rq, struct task_struct *p, int sleep);
+ void (*enqueue_task) (struct rq *rq, struct task_struct *p, int flags);
+ void (*dequeue_task) (struct rq *rq, struct task_struct *p, int flags);
void (*yield_task) (struct rq *rq);

void (*check_preempt_curr) (struct rq *rq, struct task_struct *p, int flags);
@@ -2028,6 +2032,8 @@ extern void do_timer(unsigned long ticks

extern int wake_up_state(struct task_struct *tsk, unsigned int state);
extern int wake_up_process(struct task_struct *tsk);
+extern int wake_up_state_batch(struct task_struct *tsk, unsigned int state);
+extern int wake_up_process_batch(struct task_struct *tsk);
extern void wake_up_new_task(struct task_struct *tsk,
unsigned long clone_flags);
#ifdef CONFIG_SMP


Index: linux-2.6/include/linux/wait.h
===================================================================
--- linux-2.6.orig/include/linux/wait.h
+++ linux-2.6/include/linux/wait.h
@@ -140,6 +140,7 @@ static inline void __remove_wait_queue(w
}

void __wake_up(wait_queue_head_t *q, unsigned int mode, int nr, void *key);

+void __wake_up_batch(wait_queue_head_t *q, unsigned int mode, int nr, void *key);


void __wake_up_locked_key(wait_queue_head_t *q, unsigned int mode, void *key);
void __wake_up_sync_key(wait_queue_head_t *q, unsigned int mode, int nr,
void *key);
@@ -154,8 +155,10 @@ int out_of_line_wait_on_bit_lock(void *,
wait_queue_head_t *bit_waitqueue(void *, int);

#define wake_up(x) __wake_up(x, TASK_NORMAL, 1, NULL)

+#define wake_up_batch(x) __wake_up_batch(x, TASK_NORMAL, 1, NULL)


#define wake_up_nr(x, nr) __wake_up(x, TASK_NORMAL, nr, NULL)
#define wake_up_all(x) __wake_up(x, TASK_NORMAL, 0, NULL)

+#define wake_up_all_batch(x) __wake_up_batch(x, TASK_NORMAL, 0, NULL)


#define wake_up_locked(x) __wake_up_locked((x), TASK_NORMAL)

#define wake_up_interruptible(x) __wake_up(x, TASK_INTERRUPTIBLE, 1, NULL)
Index: linux-2.6/kernel/sched.c
===================================================================
--- linux-2.6.orig/kernel/sched.c
+++ linux-2.6/kernel/sched.c

@@ -1392,7 +1392,7 @@ static const u32 prio_to_wmult[40] = {
/* 15 */ 119304647, 148102320, 186737708, 238609294, 286331153,
};

-static void activate_task(struct rq *rq, struct task_struct *p, int wakeup);
+static void activate_task(struct rq *rq, struct task_struct *p, int flags);

/*
* runqueue iterator, to support SMP load-balancing between different
@@ -1962,24 +1962,24 @@ static int effective_prio(struct task_st
/*
* activate_task - move a task to the runqueue.
*/
-static void activate_task(struct rq *rq, struct task_struct *p, int wakeup)
+static void activate_task(struct rq *rq, struct task_struct *p, int flags)
{
if (task_contributes_to_load(p))
rq->nr_uninterruptible--;

- enqueue_task(rq, p, wakeup);
+ enqueue_task(rq, p, flags);
inc_nr_running(rq);
}

/*
* deactivate_task - remove a task from the runqueue.
*/
-static void deactivate_task(struct rq *rq, struct task_struct *p, int sleep)
+static void deactivate_task(struct rq *rq, struct task_struct *p, int flags)
{
if (task_contributes_to_load(p))
rq->nr_uninterruptible++;

- dequeue_task(rq, p, sleep);
+ dequeue_task(rq, p, flags);
dec_nr_running(rq);
}

@@ -2415,7 +2415,7 @@ out_activate:
schedstat_inc(p, se.nr_wakeups_local);
else
schedstat_inc(p, se.nr_wakeups_remote);
- activate_task(rq, p, 1);
+ activate_task(rq, p, wake_flags);
success = 1;

/*
@@ -2474,13 +2474,35 @@ out:
*/
int wake_up_process(struct task_struct *p)
{
- return try_to_wake_up(p, TASK_ALL, 0);
+ return try_to_wake_up(p, TASK_ALL, WF_WAKE);
}
EXPORT_SYMBOL(wake_up_process);

+/**
+ * wake_up_process_batch - Wake up a specific process
+ * @p: The process to be woken up.
+ *
+ * Attempt to wake up the nominated process and move it to the set of runnable
+ * processes. Returns 1 if the process was woken up, 0 if it was already
+ * running.


+ *
+ * It may be assumed that this function implies a write memory barrier before
+ * changing the task state if and only if any tasks are woken up.
+ */

+int wake_up_process_batch(struct task_struct *p)
+{
+ return try_to_wake_up(p, TASK_ALL, WF_WAKE|WF_BATCH);
+}
+EXPORT_SYMBOL(wake_up_process_batch);
+
int wake_up_state(struct task_struct *p, unsigned int state)
{
- return try_to_wake_up(p, state, 0);
+ return try_to_wake_up(p, state, WF_WAKE);
+}
+
+int wake_up_state_batch(struct task_struct *p, unsigned int state)
+{
+ return try_to_wake_up(p, state, WF_WAKE|WF_BATCH);
}

/*
@@ -2628,7 +2650,7 @@ void wake_up_new_task(struct task_struct
rq = task_rq_lock(p, &flags);
BUG_ON(p->state != TASK_RUNNING);
update_rq_clock(rq);
- activate_task(rq, p, 0);
+ activate_task(rq, p, WF_WAKE|WF_FORK);
trace_sched_wakeup_new(rq, p, 1);
check_preempt_curr(rq, p, WF_FORK);
#ifdef CONFIG_SMP
@@ -3156,9 +3178,9 @@ void sched_exec(void)
static void pull_task(struct rq *src_rq, struct task_struct *p,
struct rq *this_rq, int this_cpu)
{
- deactivate_task(src_rq, p, 0);
+ deactivate_task(src_rq, p, WF_REQUEUE);
set_task_cpu(p, this_cpu);
- activate_task(this_rq, p, 0);
+ activate_task(this_rq, p, WF_REQUEUE);
check_preempt_curr(this_rq, p, 0);
}

@@ -5468,7 +5490,7 @@ need_resched_nonpreemptible:
if (unlikely(signal_pending_state(prev->state, prev)))
prev->state = TASK_RUNNING;
else
- deactivate_task(rq, prev, 1);
+ deactivate_task(rq, prev, WF_SLEEP);
switch_count = &prev->nvcsw;
}

@@ -5634,7 +5656,7 @@ asmlinkage void __sched preempt_schedule
int default_wake_function(wait_queue_t *curr, unsigned mode, int wake_flags,
void *key)
{
- return try_to_wake_up(curr->private, mode, wake_flags);
+ return try_to_wake_up(curr->private, mode, wake_flags|WF_WAKE);
}
EXPORT_SYMBOL(default_wake_function);

@@ -5677,22 +5699,43 @@ void __wake_up(wait_queue_head_t *q, uns
unsigned long flags;

spin_lock_irqsave(&q->lock, flags);
- __wake_up_common(q, mode, nr_exclusive, 0, key);
+ __wake_up_common(q, mode, nr_exclusive, WF_WAKE, key);
spin_unlock_irqrestore(&q->lock, flags);
}
EXPORT_SYMBOL(__wake_up);

+/**
+ * __wake_up_batch - wake up threads blocked on a waitqueue.


+ * @q: the waitqueue
+ * @mode: which threads
+ * @nr_exclusive: how many wake-one or wake-many threads to wake up
+ * @key: is directly passed to the wakeup function
+ *
+ * It may be assumed that this function implies a write memory barrier before
+ * changing the task state if and only if any tasks are woken up.
+ */

+void __wake_up_batch(wait_queue_head_t *q, unsigned int mode,


+ int nr_exclusive, void *key)
+{
+ unsigned long flags;
+
+ spin_lock_irqsave(&q->lock, flags);

+ __wake_up_common(q, mode, nr_exclusive, WF_WAKE|WF_BATCH, key);


+ spin_unlock_irqrestore(&q->lock, flags);
+}

+EXPORT_SYMBOL(__wake_up_batch);


+
/*
* Same as __wake_up but called with the spinlock in wait_queue_head_t held.
*/

void __wake_up_locked(wait_queue_head_t *q, unsigned int mode)
{
- __wake_up_common(q, mode, 1, 0, NULL);
+ __wake_up_common(q, mode, 1, WF_WAKE, NULL);


}

void __wake_up_locked_key(wait_queue_head_t *q, unsigned int mode, void *key)

{
- __wake_up_common(q, mode, 1, 0, key);
+ __wake_up_common(q, mode, 1, WF_WAKE, key);
}

/**
@@ -5716,7 +5759,7 @@ void __wake_up_sync_key(wait_queue_head_
int nr_exclusive, void *key)
{
unsigned long flags;
- int wake_flags = WF_SYNC;
+ int wake_flags = WF_WAKE|WF_SYNC;

if (unlikely(!q))
return;
@@ -5757,12 +5800,35 @@ void complete(struct completion *x)

spin_lock_irqsave(&x->wait.lock, flags);
x->done++;
- __wake_up_common(&x->wait, TASK_NORMAL, 1, 0, NULL);
+ __wake_up_common(&x->wait, TASK_NORMAL, 1, WF_WAKE, NULL);
spin_unlock_irqrestore(&x->wait.lock, flags);
}
EXPORT_SYMBOL(complete);

/**
+ * complete_batch: - signals a single thread waiting on this completion
+ * @x: holds the state of this particular completion
+ *
+ * This will wake up a single thread waiting on this completion. Threads will be
+ * awakened in the same order in which they were queued.
+ *
+ * See also complete_all(), wait_for_completion() and related routines.


+ *
+ * It may be assumed that this function implies a write memory barrier before
+ * changing the task state if and only if any tasks are woken up.
+ */

+void complete_batch(struct completion *x)


+{
+ unsigned long flags;
+

+ spin_lock_irqsave(&x->wait.lock, flags);
+ x->done++;
+ __wake_up_common(&x->wait, TASK_NORMAL, 1, WF_WAKE|WF_BATCH, NULL);
+ spin_unlock_irqrestore(&x->wait.lock, flags);
+}
+EXPORT_SYMBOL(complete_batch);
+
+/**
* complete_all: - signals all threads waiting on this completion
* @x: holds the state of this particular completion
*
@@ -5777,11 +5843,31 @@ void complete_all(struct completion *x)

spin_lock_irqsave(&x->wait.lock, flags);
x->done += UINT_MAX/2;
- __wake_up_common(&x->wait, TASK_NORMAL, 0, 0, NULL);
+ __wake_up_common(&x->wait, TASK_NORMAL, 0, WF_WAKE, NULL);
spin_unlock_irqrestore(&x->wait.lock, flags);
}
EXPORT_SYMBOL(complete_all);

+/**
+ * complete_all_batch: - signals all threads waiting on this completion
+ * @x: holds the state of this particular completion
+ *
+ * This will wake up all threads waiting on this particular completion event.


+ *
+ * It may be assumed that this function implies a write memory barrier before
+ * changing the task state if and only if any tasks are woken up.
+ */

+void complete_all_batch(struct completion *x)


+{
+ unsigned long flags;
+

+ spin_lock_irqsave(&x->wait.lock, flags);
+ x->done += UINT_MAX/2;
+ __wake_up_common(&x->wait, TASK_NORMAL, 0, WF_WAKE|WF_BATCH, NULL);
+ spin_unlock_irqrestore(&x->wait.lock, flags);
+}
+EXPORT_SYMBOL(complete_all_batch);
+
static inline long __sched
do_wait_for_common(struct completion *x, long timeout, int state)
{
@@ -6344,7 +6430,7 @@ recheck:
on_rq = p->se.on_rq;
running = task_current(rq, p);
if (on_rq)
- deactivate_task(rq, p, 0);
+ deactivate_task(rq, p, WF_REQUEUE);
if (running)
p->sched_class->put_prev_task(rq, p);

@@ -6356,7 +6442,7 @@ recheck:
if (running)
p->sched_class->set_curr_task(rq);
if (on_rq) {
- activate_task(rq, p, 0);
+ activate_task(rq, p, WF_REQUEUE);

check_class_changed(rq, p, prev_class, oldprio, running);
}
@@ -7172,11 +7258,11 @@ static int __migrate_task(struct task_st

on_rq = p->se.on_rq;
if (on_rq)
- deactivate_task(rq_src, p, 0);
+ deactivate_task(rq_src, p, WF_REQUEUE);

set_task_cpu(p, dest_cpu);
if (on_rq) {
- activate_task(rq_dest, p, 0);
+ activate_task(rq_dest, p, WF_REQUEUE);
check_preempt_curr(rq_dest, p, 0);
}
done:
@@ -7368,7 +7454,7 @@ void sched_idle_next(void)
__setscheduler(rq, p, SCHED_FIFO, MAX_RT_PRIO-1);

update_rq_clock(rq);
- activate_task(rq, p, 0);
+ activate_task(rq, p, WF_REQUEUE);

raw_spin_unlock_irqrestore(&rq->lock, flags);
}
@@ -7707,7 +7793,7 @@ migration_call(struct notifier_block *nf
/* Idle task back to normal (off runqueue, low prio) */
raw_spin_lock_irq(&rq->lock);
update_rq_clock(rq);
- deactivate_task(rq, rq->idle, 0);
+ deactivate_task(rq, rq->idle, WF_REQUEUE);
__setscheduler(rq, rq->idle, SCHED_NORMAL, 0);
rq->idle->sched_class = &idle_sched_class;
migrate_dead_tasks(cpu);
@@ -9698,10 +9784,10 @@ static void normalize_task(struct rq *rq
update_rq_clock(rq);
on_rq = p->se.on_rq;
if (on_rq)
- deactivate_task(rq, p, 0);
+ deactivate_task(rq, p, WF_REQUEUE);
__setscheduler(rq, p, SCHED_NORMAL, 0);
if (on_rq) {
- activate_task(rq, p, 0);
+ activate_task(rq, p, WF_REQUEUE);
resched_task(rq->curr);


}
}
Index: linux-2.6/kernel/sched_fair.c
===================================================================
--- linux-2.6.orig/kernel/sched_fair.c
+++ linux-2.6/kernel/sched_fair.c

@@ -722,7 +722,7 @@ static void check_spread(struct cfs_rq *
}

static void
-place_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int initial)
+place_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
{
u64 vruntime = cfs_rq->min_vruntime;

@@ -732,11 +732,11 @@ place_entity(struct cfs_rq *cfs_rq, stru
* little, place the new task so that it fits in the slot that
* stays open at the end.
*/
- if (initial && sched_feat(START_DEBIT))
+ if (flags & WF_FORK && sched_feat(START_DEBIT))
vruntime += sched_vslice(cfs_rq, se);

/* sleeps up to a single latency don't count. */
- if (!initial && sched_feat(FAIR_SLEEPERS)) {
+ if (!(flags & (WF_FORK|WF_BATCH)) && sched_feat(FAIR_SLEEPERS)) {
unsigned long thresh = sysctl_sched_latency;

/*
@@ -766,7 +766,7 @@ place_entity(struct cfs_rq *cfs_rq, stru
}

static void
-enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int wakeup)
+enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
{
/*
* Update run-time statistics of the 'current'.
@@ -774,8 +774,8 @@ enqueue_entity(struct cfs_rq *cfs_rq, st
update_curr(cfs_rq);
account_entity_enqueue(cfs_rq, se);

- if (wakeup) {
- place_entity(cfs_rq, se, 0);
+ if (flags & WF_WAKE) {
+ place_entity(cfs_rq, se, flags);
enqueue_sleeper(cfs_rq, se);
}

@@ -801,7 +801,7 @@ static void clear_buddies(struct cfs_rq
}

static void
-dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int sleep)
+dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
{
/*
* Update run-time statistics of the 'current'.
@@ -809,7 +809,7 @@ dequeue_entity(struct cfs_rq *cfs_rq, st
update_curr(cfs_rq);

update_stats_dequeue(cfs_rq, se);
- if (sleep) {
+ if (flags & WF_SLEEP) {
#ifdef CONFIG_SCHEDSTATS
if (entity_is_task(se)) {
struct task_struct *tsk = task_of(se);
@@ -1034,7 +1034,7 @@ static inline void hrtick_update(struct
* increased. Here we update the fair scheduling stats and
* then put the task into the rbtree:
*/
-static void enqueue_task_fair(struct rq *rq, struct task_struct *p, int wakeup)
+static void enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
{
struct cfs_rq *cfs_rq;
struct sched_entity *se = &p->se;
@@ -1043,8 +1043,8 @@ static void enqueue_task_fair(struct rq
if (se->on_rq)
break;
cfs_rq = cfs_rq_of(se);
- enqueue_entity(cfs_rq, se, wakeup);
- wakeup = 1;
+ enqueue_entity(cfs_rq, se, flags);
+ flags |= WF_WAKE;
}

hrtick_update(rq);
@@ -1055,18 +1055,18 @@ static void enqueue_task_fair(struct rq
* decreased. We remove the task from the rbtree and
* update the fair scheduling stats:
*/
-static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int sleep)
+static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
{
struct cfs_rq *cfs_rq;
struct sched_entity *se = &p->se;

for_each_sched_entity(se) {
cfs_rq = cfs_rq_of(se);
- dequeue_entity(cfs_rq, se, sleep);
+ dequeue_entity(cfs_rq, se, flags);
/* Don't dequeue parent if it has other entities besides us */
if (cfs_rq->load.weight)
break;
- sleep = 1;
+ flags |= WF_SLEEP;
}

hrtick_update(rq);


@@ -1709,7 +1709,7 @@ static void check_preempt_wakeup(struct
pse->avg_overlap < sysctl_sched_migration_cost)
goto preempt;

- if (!sched_feat(WAKEUP_PREEMPT))

+ if (!sched_feat(WAKEUP_PREEMPT) || (wake_flags & WF_BATCH))
return;

update_curr(cfs_rq);
@@ -1964,7 +1964,7 @@ static void task_fork_fair(struct task_s

if (curr)
se->vruntime = curr->vruntime;
- place_entity(cfs_rq, se, 1);
+ place_entity(cfs_rq, se, WF_FORK);

if (sysctl_sched_child_runs_first && curr && entity_before(curr, se)) {
/*
Index: linux-2.6/include/linux/completion.h
===================================================================
--- linux-2.6.orig/include/linux/completion.h
+++ linux-2.6/include/linux/completion.h
@@ -88,6 +88,8 @@ extern bool completion_done(struct compl

extern void complete(struct completion *);
extern void complete_all(struct completion *);
+extern void complete_batch(struct completion *);
+extern void complete_all_batch(struct completion *);

/**
* INIT_COMPLETION: - reinitialize a completion structure
Index: linux-2.6/kernel/sched_rt.c
===================================================================
--- linux-2.6.orig/kernel/sched_rt.c
+++ linux-2.6/kernel/sched_rt.c
@@ -878,11 +878,11 @@ static void dequeue_rt_entity(struct sch
/*
* Adding/removing a task to/from a priority array:
*/
-static void enqueue_task_rt(struct rq *rq, struct task_struct *p, int wakeup)
+static void enqueue_task_rt(struct rq *rq, struct task_struct *p, int flags)
{
struct sched_rt_entity *rt_se = &p->rt;

- if (wakeup)
+ if (flags & WF_WAKE)
rt_se->timeout = 0;

enqueue_rt_entity(rt_se);
@@ -891,7 +891,7 @@ static void enqueue_task_rt(struct rq *r
enqueue_pushable_task(rq, p);
}

-static void dequeue_task_rt(struct rq *rq, struct task_struct *p, int sleep)
+static void dequeue_task_rt(struct rq *rq, struct task_struct *p, int flags)
{
struct sched_rt_entity *rt_se = &p->rt;

Index: linux-2.6/kernel/sched_idletask.c
===================================================================
--- linux-2.6.orig/kernel/sched_idletask.c
+++ linux-2.6/kernel/sched_idletask.c
@@ -32,7 +32,7 @@ static struct task_struct *pick_next_tas
* message if some code attempts to do it:
*/
static void
-dequeue_task_idle(struct rq *rq, struct task_struct *p, int sleep)
+dequeue_task_idle(struct rq *rq, struct task_struct *p, int flags)
{
raw_spin_unlock_irq(&rq->lock);
pr_err("bad: scheduling from the idle thread!\n");

Rik van Riel

Dec 15, 2009, 3:00:02 PM

On 12/15/2009 12:32 AM, Mike Galbraith wrote:
> On Tue, 2009-12-15 at 09:45 +0900, KOSAKI Motohiro wrote:
>>> On 12/14/2009 07:30 AM, KOSAKI Motohiro wrote:
>>>> if we don't use exclusive queue, wake_up() function wake _all_ waited
>>>> task. This is simply cpu wasting.
>>>>
>>>> Signed-off-by: KOSAKI Motohiro<kosaki....@jp.fujitsu.com>
>>>
>>>> if (zone_watermark_ok(zone, sc->order, low_wmark_pages(zone),
>>>> 0, 0)) {
>>>> - wake_up(wq);
>>>> + wake_up_all(wq);
>>>> finish_wait(wq,&wait);
>>>> sc->nr_reclaimed += sc->nr_to_reclaim;
>>>> return -ERESTARTSYS;
>>>
>>> I believe we want to wake the processes up one at a time
>>> here.

>> Actually, wake_up() and wake_up_all() aren't different so much.


>> Although we use wake_up(), the task wake up next task before
>> try to alloate memory. then, it's similar to wake_up_all().

That is a good point. Maybe processes need to wait a little
in this if() condition, before the wake_up(). That would give
the previous process a chance to allocate memory and we can
avoid waking up too many processes.

> What happens to waiters should running tasks not allocate for a while?

When a waiter is woken up, it will either:
1) see that there is enough free memory and wake up the next guy, or
2) run shrink_zone and wake up the next guy

Either way, the processes that just got woken up will ensure that
the sleepers behind them in the queue will get woken up.
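
In code form, that hand-off looks roughly like this (a sketch, not the
posted patch; do_shrink_zone() is a stand-in name for the actual
per-zone scan loop):

	/* Path 1: enough memory was freed while we slept. */
	if (shrink_zone_begin(zone, sc))
		return;		/* shrink_zone_begin() already woke the next waiter */

	/* Path 2: do the reclaim work ourselves ... */
	do_shrink_zone(priority, zone, sc);

	/* ... and pass the baton on the way out. */
	atomic_dec(&zone->concurrent_reclaimers);
	wake_up(&zone->reclaim_wait);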

--
All rights reversed.

Mike Galbraith

Dec 15, 2009, 6:20:02 PM


OK, that more or less covers my worry. From the scheduler standpoint,
though, you're better off turning them all loose and letting them race,
_with_ the caveat that thundering herds do indeed make thunder (the reason
for the patch). Turning them loose piecemeal spreads things out over time,
which prolongs surge operations, possibly much longer than necessary.
We had the same problem long ago with everyone waiting for kswapd to do
all the work. Sticky problem, this roll-down to the inevitable wait.

-Mike

Mike Galbraith

Dec 15, 2009, 6:50:02 PM

On Tue, 2009-12-15 at 09:58 -0500, Rik van Riel wrote:
> On 12/15/2009 12:32 AM, Mike Galbraith wrote:
> > On Tue, 2009-12-15 at 09:45 +0900, KOSAKI Motohiro wrote:
> >>> On 12/14/2009 07:30 AM, KOSAKI Motohiro wrote:
> >>>> if we don't use exclusive queue, wake_up() function wake _all_ waited
> >>>> task. This is simply cpu wasting.
> >>>>
> >>>> Signed-off-by: KOSAKI Motohiro<kosaki....@jp.fujitsu.com>
> >>>
> >>>> if (zone_watermark_ok(zone, sc->order, low_wmark_pages(zone),
> >>>> 0, 0)) {
> >>>> - wake_up(wq);
> >>>> + wake_up_all(wq);
> >>>> finish_wait(wq,&wait);
> >>>> sc->nr_reclaimed += sc->nr_to_reclaim;
> >>>> return -ERESTARTSYS;
> >>>
> >>> I believe we want to wake the processes up one at a time
> >>> here.
>
> >> Actually, wake_up() and wake_up_all() aren't different so much.
> >> Although we use wake_up(), the task wake up next task before
> >> try to alloate memory. then, it's similar to wake_up_all().
>
> That is a good point. Maybe processes need to wait a little
> in this if() condition, before the wake_up(). That would give
> the previous process a chance to allocate memory and we can
> avoid waking up too many processes.

Pondering, I think I'd at least wake NR_CPUS. If there's not enough to
go round, oh darn, but if there is, you have full utilization quicker.

$.02.

-Mike

Rik van Riel

Dec 15, 2009, 7:40:01 PM


That depends on what the other CPUs in the system are doing.

If they were doing work, you've just wasted some resources.

--
All rights reversed.

KOSAKI Motohiro

unread,
Dec 16, 2009, 12:50:01 AM12/16/09
to
> On 12/15/2009 12:32 AM, Mike Galbraith wrote:
> > On Tue, 2009-12-15 at 09:45 +0900, KOSAKI Motohiro wrote:
> >>> On 12/14/2009 07:30 AM, KOSAKI Motohiro wrote:
> >>>> if we don't use exclusive queue, wake_up() function wake _all_ waited
> >>>> task. This is simply cpu wasting.
> >>>>
> >>>> Signed-off-by: KOSAKI Motohiro<kosaki....@jp.fujitsu.com>
> >>>
> >>>> if (zone_watermark_ok(zone, sc->order, low_wmark_pages(zone),
> >>>> 0, 0)) {
> >>>> - wake_up(wq);
> >>>> + wake_up_all(wq);
> >>>> finish_wait(wq,&wait);
> >>>> sc->nr_reclaimed += sc->nr_to_reclaim;
> >>>> return -ERESTARTSYS;
> >>>
> >>> I believe we want to wake the processes up one at a time
> >>> here.
>
> >> Actually, wake_up() and wake_up_all() aren't different so much.
> >> Although we use wake_up(), the task wake up next task before
> >> try to allocate memory. then, it's similar to wake_up_all().
>
> That is a good point. Maybe processes need to wait a little
> in this if() condition, before the wake_up(). That would give
> the previous process a chance to allocate memory and we can
> avoid waking up too many processes.

If we really need to wait a bit, Mike's wake_up_batch idea is best, I think.
It means:
- if another CPU is idle, wake up one process right away; IOW, we don't
  leave CPUs needlessly idle.
- if the other CPUs are busy, the woken processes won't start running for a
  while, so zone_watermark_ok() can then calculate a correct value.
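
The wake_up_batch() being referred to isn't shown in this part of the thread,
so the following is only a guess at its shape: a hypothetical helper, built
from existing primitives (for_each_online_cpu(), idle_cpu(), wake_up_nr()),
that wakes roughly one exclusive waiter per currently idle CPU.

/*
 * Hypothetical sketch: idle CPUs get a reclaimer to run immediately,
 * while busy CPUs are not flooded with woken tasks whose
 * zone_watermark_ok() checks would go stale before they ever run.
 */
static void wake_up_batch(wait_queue_head_t *wq)
{
	int cpu, nr_idle = 0;

	for_each_online_cpu(cpu)
		if (idle_cpu(cpu))
			nr_idle++;

	/* always wake at least one waiter so the chain keeps moving */
	wake_up_nr(wq, max(nr_idle, 1));
}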

Rik van Riel

unread,
Dec 16, 2009, 2:50:01 AM12/16/09
to
On 12/15/2009 07:48 PM, KOSAKI Motohiro wrote:

> if we really need wait a bit, Mike's wake_up_batch is best, I think.
> It mean
> - if another CPU is idle, wake up one process soon. iow, it don't
> make meaningless idle.
> - if another CPU is busy, woken process don't start to run awhile.
> then, zone_watermark_ok() can calculate correct value.

Agreed, that should work great.

Mike Galbraith

unread,
Dec 16, 2009, 5:50:02 AM12/16/09
to

Along those lines, there are also NEWIDLE balancing considerations. That
idle CPU may result in a task being pulled, which may or may not hurt a bit.

'Course, if you're jamming up on memory allocation, that's the least of
your worries, but every idle avoided is potentially a pull avoided.

Just a thought.

-Mike

Larry Woodman

unread,
Dec 17, 2009, 12:30:03 PM12/17/09
to
KOSAKI Motohiro wrote:
> (offlist)
>
> Larry, may I ask the current status of your issue? I can't reproduce it,
> and I don't want to keep lots of patches up in the air.
>

Yes, sorry for the delay, but I don't have direct or exclusive access to
these large systems and workloads. As far as I can tell this patch series
does help prevent total system hangs running AIM7. I did have trouble with
the early postings, mostly due to using sleep_on() and wakeup(), but those
appear to be fixed.

However, I did add more debug code and see ~10000 processes blocked in
shrink_zone_begin(). This is expected but bothersome; practically all of
the processes remain runnable for the entire duration of these AIM runs.
Collectively all these runnable processes overwhelm the VM system. There
are many more runnable processes now than were ever seen before, ~10000
now versus ~100 on RHEL5 (2.6.18 based). So we have also been experimenting
with some of the CFS scheduler tunables to see if this is responsible...
> Plus, I integrated the page_referenced() improvement patch series and the
> limit-concurrent-reclaimers patch series privately. I plan to post them to
> lkml this weekend. Comments are welcome.
>

The only problem I noticed with the page_referenced patch was an increase
in try_to_unmap() failures, which causes more re-activations. This is very
obvious with the tracepoints I have posted over the past few months, but
they were never included. I didn't get a chance to figure out the exact
cause due to limited access to the hardware and workload. This patch series
also seems to help the overall stalls in the VM system.
>
> Changelog from the last post:
> - remake the limit-concurrent-reclaimers series and sort out its patch order
> - change the default max concurrent reclaimers from 8 to num_online_cpu().
>   In the last post only Andi gave negative feedback: he disliked a constant
>   default value. Plus, anything over num_online_cpu() is really silly, so
>   this is a really low-risk change.
>   (We might still change the default value. As far as I have measured, a
>   small value gives better benchmark results, but I'm not sure a small
>   value doesn't cause regressions.)
> - Improve OOM and SIGKILL behavior.
>   (RHEL5's vmscan has TIF_MEMDIE recovery logic but current mainline
>   doesn't; I don't want RHEL6 to have a regression.)


>
>
>
>
>> On Fri, 2009-12-11 at 16:46 -0500, Rik van Riel wrote:
>>
>> Rik, the latest patch appears to have a problem although I dont know
>> what the problem is yet. When the system ran out of memory we see
>> thousands of runnable processes and 100% system time:
>>
>>
>> 9420 2 29824 79856 62676 19564 0 0 0 0 8054 379 0 100 0 0 0
>> 9420 2 29824 79368 62292 19564 0 0 0 0 8691 413 0 100 0 0 0
>> 9421 1 29824 79780 61780 19820 0 0 0 0 8928 408 0 100 0 0 0
>>
>> The system would not respond so I dont know whats going on yet. I'll
>> add debug code to figure out why its in that state as soon as I get
>> access to the hardware.
>>

This was in response to Rik's first patch and seems to be fixed by the
latest patch set.

Finally, having said all that, the system still struggles reclaiming memory
with ~10000 processes trying at the same time; you fix one bottleneck and
it moves somewhere else. The latest run showed all but one running process
spinning in page_lock_anon_vma() trying for the anon_vma lock. I noticed
that there are ~5000 VMAs linked to one anon_vma, which seems excessive!!!

I changed the anon_vma->lock to an rwlock_t and page_lock_anon_vma() to use
read_lock() so multiple callers could execute the page_referenced_anon code.
This seems to help quite a bit.
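
Larry's patch itself isn't attached here, but against the 2.6.32-era
mm/rmap.c the experiment would look roughly like the sketch below, assuming
anon_vma.lock is converted from spinlock_t to rwlock_t and the paths that
modify the anon_vma list (anon_vma_link()/anon_vma_unlink()) take
write_lock():

/* Sketch: anon_vma->lock as an rwlock_t, with the rmap walkers as readers. */
struct anon_vma *page_lock_anon_vma(struct page *page)
{
	struct anon_vma *anon_vma;
	unsigned long anon_mapping;

	rcu_read_lock();
	anon_mapping = (unsigned long)page->mapping;
	if (!(anon_mapping & PAGE_MAPPING_ANON))
		goto out;
	if (!page_mapped(page))
		goto out;

	anon_vma = (struct anon_vma *)(anon_mapping - PAGE_MAPPING_ANON);
	/*
	 * Multiple page_referenced()/try_to_unmap() callers can now walk
	 * the same anon_vma list concurrently; list modifications take
	 * write_lock(&anon_vma->lock) instead.
	 */
	read_lock(&anon_vma->lock);
	return anon_vma;
out:
	rcu_read_unlock();
	return NULL;
}

void page_unlock_anon_vma(struct anon_vma *anon_vma)
{
	read_unlock(&anon_vma->lock);
	rcu_read_unlock();
}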


>> Larry

aim.patch

Rik van Riel

unread,
Dec 17, 2009, 2:50:01 PM12/17/09
to
On 12/17/2009 07:23 AM, Larry Woodman wrote:

>>> The system would not respond so I dont know whats going on yet. I'll
>>> add debug code to figure out why its in that state as soon as I get
>>> access to the hardware.
>
> This was in response to Rik's first patch and seems to be fixed by the
> latest path set.
>
> Finally, having said all that, the system still struggles reclaiming
> memory with
> ~10000 processes trying at the same time, you fix one bottleneck and it
> moves
> somewhere else. The latest run showed all but one running process
> spinning in
> page_lock_anon_vma() trying for the anon_vma_lock. I noticed that there are
> ~5000 vma's linked to one anon_vma, this seems excessive!!!

I have some ideas on how to keep processes waiting better
on the per zone reclaim_wait waitqueue.

For one, we should probably only do the lots-free wakeup
if we have more than zone->pages_high free pages in the
zone - having each of the waiters free some memory one
after another should not be a problem as long as we do
not have too much free memory in the zone.

Currently it's a hair trigger, with the threshold for
processes going into the page reclaim path and processes
exiting it "plenty free" being exactly the same.

Some hysteresis there could help.
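
In code, that hysteresis could be as simple as the sketch below (illustrative
only; high_wmark_pages() is the in-tree accessor for the zone->pages_high
threshold mentioned above). The early-return branch of the waiting loop, the
one that currently tests low_wmark_pages() before doing the wake_up(), would
test this instead:

/*
 * Sketch of the hysteresis check: tasks enter the reclaim path based on
 * the low watermark, but queued reclaimers only skip their work once the
 * zone is comfortably above the high watermark.
 */
static bool zone_reclaim_done(struct zone *zone, struct scan_control *sc)
{
	return zone_watermark_ok(zone, sc->order,
				 high_wmark_pages(zone), 0, 0);
}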

--
All rights reversed.

Rik van Riel

unread,
Dec 17, 2009, 8:00:02 PM12/17/09
to
After removing some more immediate bottlenecks with
the patches by Kosaki and me, Larry ran into a really
big one:

Larry Woodman wrote:

> Finally, having said all that, the system still struggles reclaiming
> memory with
> ~10000 processes trying at the same time, you fix one bottleneck and it
> moves
> somewhere else. The latest run showed all but one running process
> spinning in
> page_lock_anon_vma() trying for the anon_vma_lock. I noticed that there
> are
> ~5000 vma's linked to one anon_vma, this seems excessive!!!
>
> I changed the anon_vma->lock to a rwlock_t and page_lock_anon_vma() to use
> read_lock() so multiple callers could execute the page_reference_anon code.
> This seems to help quite a bit.

The system has 10000 processes, all of which are child
processes of the same parent.

Pretty much all memory is anonymous memory.

This means that pretty much every anonymous page in the
system:
1) belongs to just one process, but
2) belongs to an anon_vma which is attached to 10,000 VMAs!

This results in page_referenced scanning 10,000 VMAs for
every page, despite the fact that each page is typically
only mapped into one process.

This seems to be our real scalability issue.
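
The walk in question is page_referenced_anon() in mm/rmap.c; a simplified
sketch of its 2.6.32-era shape (the memcg filtering argument is omitted
here):

/* Simplified sketch of the anon rmap walk. */
static int page_referenced_anon(struct page *page, unsigned long *vm_flags)
{
	struct anon_vma *anon_vma;
	struct vm_area_struct *vma;
	unsigned int mapcount;
	int referenced = 0;

	anon_vma = page_lock_anon_vma(page);
	if (!anon_vma)
		return 0;

	mapcount = page_mapcount(page);
	/*
	 * One list entry per VMA sharing this anon_vma: with 10,000
	 * forked children that is ~10,000 iterations for every page,
	 * even when the page is mapped by only one of them.
	 */
	list_for_each_entry(vma, &anon_vma->head, anon_vma_node) {
		referenced += page_referenced_one(page, vma,
						  &mapcount, vm_flags);
		if (!mapcount)
			break;	/* saw every mapping of this page */
	}

	page_unlock_anon_vma(anon_vma);
	return referenced;
}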

The only way out I can think of is to have a new anon_vma
when we start a child process and to have COW place new
pages in the new anon_vma.

However, this is a bit of a paradigm shift in our object
rmap system and I am wondering if somebody else has a
better idea :)

Hugh Dickins

unread,
Dec 17, 2009, 9:20:02 PM12/17/09
to

Please first clarify whether what Larry is running is actually
a workload that people need to behave well in real life.

From time to time such cases have been constructed, but we've
usually found better things to do than solve them, because
they've been no more than academic problems.

I'm not asserting that this one is purely academic, but I do
think we need more than an artificial case to worry much about it.

An rwlock there has been proposed on several occasions, but
we resist because that change benefits this case but performs
worse on more common cases (I believe: no numbers to back that up).

Substitute a MAP_SHARED file underneath those 10000 vmas,
and don't you have an equal problem with the prio_tree,
which would be harder to solve than the anon_vma case?

Hugh

Rik van Riel

unread,
Dec 17, 2009, 11:00:01 PM12/17/09
to

AIM7 is fairly artificial, but real life workloads
like Oracle, PostgreSQL and Apache can also fork off
large numbers of child processes, which also cause
the system to end up with lots of VMAs attached to
the anon_vmas which all the anonymous pages belong
to.

10,000 is fairly extreme, but very large Oracle
workloads can get up to 1,000 or 2,000 today.
This number is bound to grow in the future.

KOSAKI Motohiro

unread,
Dec 18, 2009, 10:30:02 AM12/18/09
to
> KOSAKI Motohiro wrote:
> > (offlist)
> >
> > Larry, may I ask the current status of your issue? I can't reproduce it,
> > and I don't want to keep lots of patches up in the air.
> >
>
> Yes, sorry for the delay but I dont have direct or exclusive access to
> these large systems
> and workloads. As far as I can tell this patch series does help prevent
> total system
> hangs running AIM7. I did have trouble with the early postings mostly
> due to using sleep_on()
> and wakeup() but those appear to be fixed.
>
> However, I did add more debug code and see ~10000 processes blocked in
> shrink_zone_begin().
> This is expected but bothersome, practically all of the processes remain
> runnable for the entire
> duration of these AIM runs. Collectively all these runnable processes
> overwhelm the VM system.
> There are many more runnable processes now than were ever seen before,
> ~10000 now versus
> ~100 on RHEL5(2.6.18 based). So, we have also been experimenting around
> with some of the
> CFS scheduler tunables to see of this is responsible...

Which point bothers you: throughput, latency, or something else? Actually,
unfairness itself is the right thing from the VM's point of view, because a
perfectly fair VM easily livelocks (e.g. process A swaps out process B's
pages while process B swaps out process A's pages). The swap token solves
the simplest such case, but running many processes easily creates similar
circular dependencies. Recovering from heavy memory pressure needs a lot of
unfairness.

Of course, if the unfairness causes a performance regression, that's a bug
and it should be fixed.


> The only problem I noticed with the page_referenced patch was an
> increase in the
> try_to_unmap() failures which causes more re-activations. This is very
> obvious with
> the using tracepoints I have posted over the past few months but they
> were never
> included. I didnt get a chance to figure out the exact cause due to
> access to the hardware
> and workload. This patch series also seems to help the overall stalls
> in the VM system.

I (and many VM developers) haven't forgotten your tracepoint effort; we just
want to solve the regression first.

Ugh, no. rw-spinlocks are evil, please don't use them. rw-spinlocks have
bad performance characteristics: plenty of read_lock() holders can block a
write_lock() for a very long time.

Also, I would like to confirm one thing: the anon_vma design hasn't changed
for many years. Is this really a performance regression? Are we hitting the
right regression point?

Avi Kivity

unread,
Dec 18, 2009, 1:40:01 PM12/18/09
to
On 12/11/2009 11:46 PM, Rik van Riel wrote:
> Under very heavy multi-process workloads, like AIM7, the VM can
> get into trouble in a variety of ways. The trouble start when
> there are hundreds, or even thousands of processes active in the
> page reclaim code.
>
> Not only can the system suffer enormous slowdowns because of
> lock contention (and conditional reschedules) between thousands
> of processes in the page reclaim code, but each process will try
> to free up to SWAP_CLUSTER_MAX pages, even when the system already
> has lots of memory free.
>
> It should be possible to avoid both of those issues at once, by
> simply limiting how many processes are active in the page reclaim
> code simultaneously.
>
> If too many processes are active doing page reclaim in one zone,
> simply go to sleep in shrink_zone().
>
> On wakeup, check whether enough memory has been freed already
> before jumping into the page reclaim code ourselves. We want
> to use the same threshold here that is used in the page allocator
> for deciding whether or not to call the page reclaim code in the
> first place, otherwise some unlucky processes could end up freeing
> memory for the rest of the system.
>
>
>
> Control how to kill processes when uncorrected memory error (typically
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index 30fe668..ed614b8 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -345,6 +345,10 @@ struct zone {
> /* Zone statistics */
> atomic_long_t vm_stat[NR_VM_ZONE_STAT_ITEMS];
>
> + /* Number of processes running page reclaim code on this zone. */
> + atomic_t concurrent_reclaimers;
> + wait_queue_head_t reclaim_wait;
> +
>

Counting semaphore?

--
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.

Rik van Riel

unread,
Dec 18, 2009, 2:20:02 PM12/18/09
to
On 12/18/2009 05:27 AM, KOSAKI Motohiro wrote:
>> KOSAKI Motohiro wrote:

>> Finally, having said all that, the system still struggles reclaiming
>> memory with
>> ~10000 processes trying at the same time, you fix one bottleneck and it
>> moves
>> somewhere else. The latest run showed all but one running process
>> spinning in
>> page_lock_anon_vma() trying for the anon_vma_lock. I noticed that there
>> are
>> ~5000 vma's linked to one anon_vma, this seems excessive!!!
>>
>> I changed the anon_vma->lock to a rwlock_t and page_lock_anon_vma() to use
>> read_lock() so multiple callers could execute the page_reference_anon code.
>> This seems to help quite a bit.
>
> Ug. no. rw-spinlock is evil. please don't use it. rw-spinlock has bad
> performance characteristics, plenty read_lock block write_lock for very
> long time.
>
> and I would like to confirm one thing. anon_vma design didn't change
> for long year. Is this really performance regression? Do we strike
> right regression point?

In 2.6.9 and 2.6.18 the system would hit different contention
points before getting to the anon_vma lock. Now that we've
gotten the other contention points out of the way, this one
has finally been exposed.

--
All rights reversed.

Rik van Riel

unread,
Dec 18, 2009, 2:20:03 PM12/18/09
to
On 12/18/2009 08:38 AM, Avi Kivity wrote:

>> + /* Number of processes running page reclaim code on this zone. */
>> + atomic_t concurrent_reclaimers;
>> + wait_queue_head_t reclaim_wait;
>> +
>
> Counting semaphore?

I don't see a safe way to adjust a counting semaphore
through /proc/sys.

--
All rights reversed.

Avi Kivity

unread,
Dec 18, 2009, 2:20:03 PM12/18/09
to
On 12/18/2009 04:12 PM, Rik van Riel wrote:
> On 12/18/2009 08:38 AM, Avi Kivity wrote:
>
>>> + /* Number of processes running page reclaim code on this zone. */
>>> + atomic_t concurrent_reclaimers;
>>> + wait_queue_head_t reclaim_wait;
>>> +
>>
>> Counting semaphore?
>
> I don't see a safe way to adjust a counting semaphore
> through /proc/sys.

True. Maybe it's worthwhile to add such a way if this pattern has more
uses.

--
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.


Andrea Arcangeli

unread,
Dec 18, 2009, 4:30:02 PM12/18/09
to
On Thu, Dec 17, 2009 at 09:05:23PM +0000, Hugh Dickins wrote:
> Please first clarify whether what Larry is running is actually
> a workload that people need to behave well in real life.

Anything with 10000 connections using a connection-per-thread/process
model should use threads, not processes, if good performance is expected.
Most things that use a multi-process design will never use a
one-connection-per-process design (yes, there are exceptions, and no, we
can't expect to fix those as they're proprietary ;). So I'm not
particularly worried.

Also make sure this happens on older kernels too: newer kernels walk the
rmap chains and mangle the ptes even when there's no VM pressure, for no
good reason. Older kernels would only hit the anon_vma chain of an anon
page after that page was converted to swapcache and swap was hit, so it
makes a whole lot of difference. Anon_vma chains should only be touched
after we are I/O bound, if anybody is to expect decent performance out of
the kernel.

> I'm not asserting that this one is purely academic, but I do
> think we need more than an artificial case to worry much about it.

Tend to agree.

> An rwlock there has been proposed on several occasions, but
> we resist because that change benefits this case but performs
> worse on more common cases (I believe: no numbers to back that up).

I think an rwlock for the anon_vma is a must. Whatever extra overhead it
adds to the uncontended fast path is practically zero, and on large SMP it
allows rmap walks of long chains to run in parallel, so it's very much
worth it: the downside is practically zero and the upside may be measurable
in certain corner cases. I don't think it'll be enough, but I definitely
like it.

> Substitute a MAP_SHARED file underneath those 10000 vmas,
> and don't you have an equal problem with the prio_tree,
> which would be harder to solve than the anon_vma case?

That is a very good point.

Rik suggested to me to have a COWed, newly allocated page use its own
anon_vma. Conceptually Rik's idea is a fine one; the only complication is
how to chain the same vma into multiple anon_vmas (in practice insert and
removal will be slower, and more metadata will be needed for the additional
anon_vmas and for vmas queued in more than one anon_vma). But this only
helps if the mapcount of the page is 1; if the mapcount is 10000, no change
to the anon_vma or the prio_tree will solve this, and we'd instead have to
start breaking out of the rmap loop after, say, 64 test_and_clear_young()
calls to mitigate the inefficiency on pages that are in use and will never
go to swap, where wasting 10000 cachelines just because the page eventually
lands at the tail of the LRU is entirely wasted effort.
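
A sketch of what capping that loop might look like, as a hypothetical variant
of the anon rmap walk (the cap of 64 and the exact break condition are only
the suggestion above, not merged code):

#define MAX_RMAP_REF_CHECKS 64	/* illustrative cap from the discussion */

/* Hypothetical variant that gives up early on very long anon_vma chains. */
static int page_referenced_anon_capped(struct page *page,
				       unsigned long *vm_flags)
{
	struct anon_vma *anon_vma;
	struct vm_area_struct *vma;
	unsigned int mapcount;
	int referenced = 0, checked = 0;

	anon_vma = page_lock_anon_vma(page);
	if (!anon_vma)
		return 0;

	mapcount = page_mapcount(page);
	list_for_each_entry(vma, &anon_vma->head, anon_vma_node) {
		referenced += page_referenced_one(page, vma,
						  &mapcount, vm_flags);
		if (!mapcount)
			break;
		/*
		 * A heavily shared anon_vma can have thousands of VMAs;
		 * if the page already looks referenced, stop burning
		 * cachelines and leave it on the active list.
		 */
		if (referenced && ++checked >= MAX_RMAP_REF_CHECKS)
			break;
	}

	page_unlock_anon_vma(anon_vma);
	return referenced;
}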

Rik van Riel

unread,
Dec 18, 2009, 5:50:02 PM12/18/09
to
On 12/18/2009 11:23 AM, Andrea Arcangeli wrote:
> On Thu, Dec 17, 2009 at 09:05:23PM +0000, Hugh Dickins wrote:

>> An rwlock there has been proposed on several occasions, but
>> we resist because that change benefits this case but performs
>> worse on more common cases (I believe: no numbers to back that up).
>
> I think rwlock for anon_vma is a must. Whatever higher overhead of the
> fast path with no contention is practically zero, and in large smp it
> allows rmap on long chains to run in parallel, so very much worth it
> because downside is practically zero and upside may be measurable
> instead in certain corner cases. I don't think it'll be enough, but I
> definitely like it.

I agree, changing the anon_vma lock to an rwlock should
work a lot better than what we have today. The tradeoff
is a tiny slowdown in medium-contention cases, in exchange
for avoiding catastrophic slowdowns in some cases.

With Nick Piggin's fair rwlocks, there should be no issue
at all.

> Rik suggested to me to have a cowed newly allocated page to use its
> own anon_vma. Conceptually Rik's idea is fine one, but the only
> complication then is how to chain the same vma into multiple anon_vma
> (in practice insert/removal will be slower and more metadata will be
> needed for additional anon_vmas and vams queued in more than
> anon_vma). But this only will help if the mapcount of the page is 1,
> if the mapcount is 10000 no change to anon_vma or prio_tree will solve
> this,

It's even more complex than this for anonymous pages.

Anonymous pages get COW copied in child (and parent)
processes, potentially resulting in one page, at each
offset into the anon_vma, for every process attached
to the anon_vma.

As a result, with 10000 child processes, page_referenced
can end up searching through 10000 VMAs even for pages
with a mapcount of 1!

--
All rights reversed.
