
question on sched-rt group allocation cap: sched_rt_runtime_us


Anirban Sinha

2009/09/04 21:00
Hi Ingo and rest:

I have been playing around with the sched_rt_runtime_us cap that can be
used to limit the amount of CPU time allocated towards scheduling rt
group threads. I am using 2.6.26 with CONFIG_GROUP_SCHED disabled (we
use only the root user in our embedded setup). I have no other CPU
intensive workloads (RT or otherwise) running on my system. I have
changed no other scheduling parameters from /proc.

I have written a small test program that:

(a) forks two threads, one SCHED_FIFO and one SCHED_OTHER (this thread
is reniced to -20) and ties both of them to a specific core.
(b) runs both the threads in a tight loop (same number of iterations for
both threads) until the SCHED_FIFO thread terminates.
(c) calculates the number of completed iterations of the regular
SCHED_OTHER thread against the fixed number of iterations of the
SCHED_FIFO thread. It then calculates a percentage based on that.

I am running the above workload against varying sched_rt_runtime_us
values (200 ms to 700 ms) keeping the sched_rt_period_us constant at
1000 ms. I have also experimented a little bit by decreasing the value
of sched_rt_period_us (thus increasing the sched granularity) with no
apparent change in behavior.
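
For reference, a minimal sketch of how the two knobs are set for one such run (values in microseconds; using a tiny C helper here instead of a shell one-liner is an arbitrary choice, and the 200 ms / 1 s pair is just the first data point):

/* sketch: cap RT to 200 ms out of every 1000 ms (run as root) */
#include <stdio.h>

static int write_knob(const char *path, long val)
{
    FILE *f = fopen(path, "w");

    if (!f) {
        perror(path);
        return -1;
    }
    fprintf(f, "%ld\n", val);
    return fclose(f);
}

int main(void)
{
    write_knob("/proc/sys/kernel/sched_rt_period_us", 1000000);
    write_knob("/proc/sys/kernel/sched_rt_runtime_us", 200000);
    return 0;
}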

My observations are listed in tabular form:

sched_rt_runtime_us /     # of completed iterations of reg thread /
sched_rt_period_us        # of iterations of RT thread (in %)

0.2                       100 % (regular thread completed all its iterations)
0.3                        73 %
0.4                        45 %
0.5                        17 %
0.6                         0 % (SCHED_OTHER thread completely throttled, never ran)
0.7                         0 %

This result kind of baffles me. Even when we cap the RT group to a
fraction of 0.6 of overall CPU time, the remaining 0.4 *should* still be
available for running regular threads. So my SCHED_OTHER thread *should*
make some progress as opposed to being completely throttled. Similarly,
with any fraction less than 0.5, the SCHED_OTHER thread should complete
before the SCHED_FIFO one.

I do not have an easy way to verify my results on the latest kernel
(2.6.31). Were there any regressions in the scheduling subsystem in
2.6.26? Can this behavior be explained? Do we need to tweak any other
/proc parameters?

Cheers,

Ani



--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majo...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

Lucas De Marchi

2009/09/05 13:50
On Sat, Sep 5, 2009 at 02:55, Anirban Sinha<ASi...@zeugmasystems.com> wrote:
> Hi Ingo and rest:
>
> I have been playing around with the sched_rt_runtime_us cap that can be
> used to limit the amount of CPU time allocated towards scheduling rt
> group threads. I am using 2.6.26 with CONFIG_GROUP_SCHED disabled (we
> use only the root user in our embedded setup). I have no other CPU
> intensive workloads (RT or otherwise) running on my system. I have
> changed no other scheduling parameters from /proc.
>
> I have written a small test program that:

Would you mind sending the source of this test?

Lucas De Marchi

Anirban Sinha

2009/09/05 13:50
Hi again:

I am copying my test code here. I am really hoping to get some answers/
pointers. If there are whitespace/formatting issues in this mail,
please let me know. I am using an alternate mailer.

Cheers,

Ani


/* Test code to experiment with the CPU allocation cap for a FIFO RT thread
 * spinning on a tight loop. Yes, you read it right. RT thread on a
 * tight loop.
 */
#define _GNU_SOURCE

#include <sched.h>
#include <pthread.h>
#include <time.h>
#include <utmpx.h>
#include <stdio.h>
#include <stdlib.h>     /* atoi() */
#include <string.h>
#include <unistd.h>     /* nice() */
#include <limits.h>
#include <assert.h>

/* incremented by the SCHED_OTHER thread, sampled by the SCHED_FIFO thread */
unsigned long reg_count;

void *fifo_thread(void *arg)
{
    int core = (int)(long)arg;
    int i, j;
    cpu_set_t cpuset;
    struct sched_param fifo_schedparam;
    unsigned long start, end;
    unsigned long fifo_count = 0;

    CPU_ZERO(&cpuset);
    CPU_SET(core, &cpuset);

    assert(sched_setaffinity(0, sizeof cpuset, &cpuset) == 0);

    /* RT priority 1 - lowest */
    fifo_schedparam.sched_priority = 1;
    assert(pthread_setschedparam(pthread_self(), SCHED_FIFO,
                                 &fifo_schedparam) == 0);
    start = reg_count;
    printf("start reg_count=%lu\n", start);

    for (i = 0; i < 5; i++) {
        for (j = 0; j < UINT_MAX/10; j++)
            fifo_count++;
    }
    printf("\nRT thread has terminated\n");
    end = reg_count;
    printf("end reg_count=%lu\n", end);
    printf("delta reg count = %lu\n", end - start);
    printf("fifo count = %lu\n", fifo_count);
    printf("%% = %f\n", ((float)(end - start) * 100) / (float)fifo_count);

    return NULL;
}

void *reg_thread(void *arg)
{
    int core = (int)(long)arg;
    int i, j;
    int new_nice;
    cpu_set_t cpuset;

    /* let's renice it to the highest priority level */
    new_nice = nice(-20);
    printf("new nice value for regular thread=%d\n", new_nice);
    printf("regular thread dispatch(%d)\n", core);

    CPU_ZERO(&cpuset);
    CPU_SET(core, &cpuset);

    assert(sched_setaffinity(0, sizeof cpuset, &cpuset) == 0);

    for (i = 0; i < 5; i++) {
        for (j = 0; j < UINT_MAX/10; j++)
            reg_count++;
    }
    printf("\nregular thread has terminated\n");

    return NULL;
}


int main(int argc, char *argv[])
{
    int core;
    pthread_t tid1, tid2;
    pthread_attr_t attr;

    if (argc != 2) {
        fprintf(stderr, "Usage: %s <core-ID>\n", argv[0]);
        return -1;
    }
    reg_count = 0;

    core = atoi(argv[1]);

    /* Note: without pthread_attr_setinheritsched(&attr, PTHREAD_EXPLICIT_SCHED)
     * the policy set in the attr is ignored; the FIFO policy actually takes
     * effect via pthread_setschedparam() inside fifo_thread() above.
     */
    pthread_attr_init(&attr);
    assert(pthread_attr_setschedpolicy(&attr, SCHED_FIFO) == 0);
    assert(pthread_create(&tid1, &attr, fifo_thread, (void *)(long)core) == 0);

    assert(pthread_attr_setschedpolicy(&attr, SCHED_OTHER) == 0);
    assert(pthread_create(&tid2, &attr, reg_thread, (void *)(long)core) == 0);

    pthread_join(tid1, NULL);
    pthread_join(tid2, NULL);

    return 0;
}



Fabio Checconi

2009/09/05 17:40
> From: Anirban Sinha <ASi...@zeugmasystems.com>
> Date: Fri, Sep 04, 2009 05:55:15PM -0700

You say you pin the threads to a single core: how many cores does your
system have?

I don't know if 2.6.26 had anything wrong (from a quick look the relevant
code seems similar to what we have now), but something like that can be
the consequence of the runtime migration logic moving bandwidth from a
second core to the one executing the two tasks.

If this is the case, this behavior is the expected one, the scheduler
tries to reduce the number of migrations, concentrating the bandwidth
of rt tasks on a single core. With your workload it doesn't work well
because runtime migration has freed the other core(s) from rt bandwidth,
so these cores are available to SCHED_OTHER ones, but your SCHED_OTHER
thread is pinned and cannot make use of them.
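
A toy user-space model of that borrowing (purely illustrative; the assumption that every neighbour's unused quota ends up on the busy core is a simplification, and this is not the kernel's do_balance_runtime() code) shows the trend behind the numbers above:

/* toy model: with no RT load elsewhere, the busy core can end up holding
 * (almost) all of the other cores' per-CPU RT quotas, capped at one period. */
#include <stdio.h>

int main(void)
{
    const int ncpus = 2;                 /* the dual-core case above */
    const double period = 1.0;
    double ratio;

    for (ratio = 0.2; ratio < 0.75; ratio += 0.1) {
        double borrowed = ncpus * ratio * period;   /* all quotas migrate here */
        double rt_share = borrowed > period ? period : borrowed;

        printf("runtime/period=%.1f -> RT hog may get %3.0f%% of the pinned core, "
               "SCHED_OTHER at most %3.0f%%\n",
               ratio, 100.0 * rt_share, 100.0 * (period - rt_share));
    }
    return 0;
}

On two cores it predicts complete throttling of the pinned SCHED_OTHER thread once runtime/period reaches about 0.5, which matches the shape (though not the exact numbers) of the table in the original report.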

Lucas De Marchi

2009/09/05 18:50
> You say you pin the threads to a single core: how many cores does your
> system have?
>
> If this is the case, this behavior is the expected one, the scheduler
> tries to reduce the number of migrations, concentrating the bandwidth
> of rt tasks on a single core. With your workload it doesn't work well
> because runtime migration has freed the other core(s) from rt bandwidth,
> so these cores are available to SCHED_OTHER ones, but your SCHED_OTHER
> thread is pinned and cannot make use of them.

Indeed. I've tested this same test program in a single core machine and it
produces the expected behavior:

rt_runtime_us / rt_period_us % loops executed in SCHED_OTHER
95% 4.48%
60% 54.84%
50% 86.03%
40% OTHER completed first


Lucas De Marchi

Anirban Sinha

2009/09/05 20:50

> You say you pin the threads to a single core: how many cores does
> your system have?

The results I sent you were on a dual core blade.


> If this is the case, this behavior is the expected one, the scheduler
> tries to reduce the number of migrations, concentrating the bandwidth
> of rt tasks on a single core. With your workload it doesn't work well
> because runtime migration has freed the other core(s) from rt bandwidth,
> so these cores are available to SCHED_OTHER ones, but your SCHED_OTHER
> thread is pinned and cannot make use of them.

But, I ran the same routine on a quadcore blade and the results this
time were:

rt_runtime/rt_period % of iterations of reg thrd against rt thrd

0.20 46%
0.25 18%
0.26 7%
0.3 0%
0.4 0%
(rest of the cases) 0%

So if the scheduler is concentrating all rt bandwidth to one core, it
should be effectively 0.2 * 4 = 0.8 for this core. Hence, we should
see the percentage closer to 20% but it seems that it's more than
double. At ~0.25, the regular thread should make no progress, but it
seems it does make a little progress.

Ani

2009/09/05 23:20
On Sep 5, 3:50 pm, Lucas De Marchi <lucas.de.mar...@gmail.com> wrote:
>
> Indeed. I've tested this same test program in a single core machine and it
> produces the expected behavior:
>
> rt_runtime_us / rt_period_us     % loops executed in SCHED_OTHER
> 95%                              4.48%
> 60%                              54.84%
> 50%                              86.03%
> 40%                              OTHER completed first
>

Hmm. This does seem to indicate that there is some kind of
relationship with SMP. So I wonder whether there is a way to turn this
'RT bandwidth accumulation' heuristic off. I did an
echo 0 > /proc/sys/kernel/sched_migration_cost
but results were identical to previous.

I figure that if I set it to zero, the regular sched-fair (non-RT)
tasks will be treated as not being cache hot and hence susceptible to
migration. From the code it looks like sched-rt tasks are always
treated as cache cold? Mind you though that I have not yet looked into
the code very rigorously. I knew the O(1) scheduler relatively well,
but I have only just begun digging into the new CFS scheduler code.
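
For what it's worth, a stand-alone paraphrase of that check (simplified names and an assumed 0.5 ms default; not the kernel's actual declarations) illustrates why sched_migration_cost never affects rt tasks:

/* paraphrased model of the ~2.6.2x task_hot() test: rt tasks bail out
 * before sysctl_sched_migration_cost is even consulted. */
#include <stdio.h>

typedef long long s64;
typedef unsigned long long u64;

enum sched_class_id { FAIR_CLASS, RT_CLASS };

struct task {
    enum sched_class_id cls;
    u64 exec_start;                         /* ns timestamp of last dispatch */
};

static s64 sysctl_sched_migration_cost = 500000;    /* ns; assumed default */

static int task_hot(const struct task *p, u64 now)
{
    if (p->cls != FAIR_CLASS)               /* non-fair (e.g. rt): never hot */
        return 0;
    if (sysctl_sched_migration_cost == -1)
        return 1;
    if (sysctl_sched_migration_cost == 0)
        return 0;
    return (s64)(now - p->exec_start) < sysctl_sched_migration_cost;
}

int main(void)
{
    struct task rt  = { RT_CLASS,   1000000 };
    struct task cfs = { FAIR_CLASS, 1000000 };
    u64 now = 1200000;                      /* 200 us after both last ran */

    printf("rt hot=%d  cfs hot=%d\n", task_hot(&rt, now), task_hot(&cfs, now));
    return 0;
}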

On a side note, why is there no documentation explaining the
sched_migration_cost tuning knob? It would be nice to have one - at
least where the sysctl variable is defined.

--Ani

Mike Galbraith

2009/09/06 2:40
On Sat, 2009-09-05 at 19:32 -0700, Ani wrote:
> On Sep 5, 3:50 pm, Lucas De Marchi <lucas.de.mar...@gmail.com> wrote:
> >
> > Indeed. I've tested this same test program in a single core machine and it
> > produces the expected behavior:
> >
> > rt_runtime_us / rt_period_us % loops executed in SCHED_OTHER
> > 95% 4.48%
> > 60% 54.84%
> > 50% 86.03%
> > 40% OTHER completed first
> >
>
> Hmm. This does seem to indicate that there is some kind of
> relationship with SMP. So I wonder whether there is a way to turn this
> 'RT bandwidth accumulation' heuristic off.

No there isn't, but maybe there should be, since this isn't the first
time it's come up. One pro argument is that pinned tasks are thoroughly
screwed when an RT hog lands on their runqueue. On the con side, the
whole RT bandwidth restriction thing is intended (AFAIK) to allow an
admin to regain control should RT app go insane, which the default 5%
aggregate accomplishes just fine.

Dunno. Fly or die little patchlet (toss).

sched: allow the user to disable RT bandwidth aggregation.

Signed-off-by: Mike Galbraith <efa...@gmx.de>
Cc: Ingo Molnar <mi...@elte.hu>
Cc: Peter Zijlstra <a.p.zi...@chello.nl>
LKML-Reference: <new-submission>

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 8736ba1..6e6d4c7 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1881,6 +1881,7 @@ static inline unsigned int get_sysctl_timer_migration(void)
#endif
extern unsigned int sysctl_sched_rt_period;
extern int sysctl_sched_rt_runtime;
+extern int sysctl_sched_rt_bandwidth_aggregate;

int sched_rt_handler(struct ctl_table *table, int write,
struct file *filp, void __user *buffer, size_t *lenp,
diff --git a/kernel/sched.c b/kernel/sched.c
index c512a02..ca6a378 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -864,6 +864,12 @@ static __read_mostly int scheduler_running;
*/
int sysctl_sched_rt_runtime = 950000;

+/*
+ * aggregate bandwidth, ie allow borrowing from neighbors when
+ * bandwidth for an individual runqueue is exhausted.
+ */
+int sysctl_sched_rt_bandwidth_aggregate = 1;
+
static inline u64 global_rt_period(void)
{
return (u64)sysctl_sched_rt_period * NSEC_PER_USEC;
diff --git a/kernel/sched_rt.c b/kernel/sched_rt.c
index 2eb4bd6..75daf88 100644
--- a/kernel/sched_rt.c
+++ b/kernel/sched_rt.c
@@ -495,6 +495,9 @@ static int balance_runtime(struct rt_rq *rt_rq)
{
int more = 0;

+ if (!sysctl_sched_rt_bandwidth_aggregate)
+ return 0;
+
if (rt_rq->rt_time > rt_rq->rt_runtime) {
spin_unlock(&rt_rq->rt_runtime_lock);
more = do_balance_runtime(rt_rq);
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index cdbe8d0..0ad08e5 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -368,6 +368,14 @@ static struct ctl_table kern_table[] = {
},
{
.ctl_name = CTL_UNNUMBERED,
+ .procname = "sched_rt_bandwidth_aggregate",
+ .data = &sysctl_sched_rt_bandwidth_aggregate,
+ .maxlen = sizeof(int),
+ .mode = 0644,
+ .proc_handler = &sched_rt_handler,
+ },
+ {
+ .ctl_name = CTL_UNNUMBERED,
.procname = "sched_compat_yield",
.data = &sysctl_sched_compat_yield,
.maxlen = sizeof(unsigned int),

Mike Galbraith

2009/09/06 6:20
On Sun, 2009-09-06 at 08:32 +0200, Mike Galbraith wrote:
> On Sat, 2009-09-05 at 19:32 -0700, Ani wrote:
> > On Sep 5, 3:50 pm, Lucas De Marchi <lucas.de.mar...@gmail.com> wrote:
> > >
> > > Indeed. I've tested this same test program in a single core machine and it
> > > produces the expected behavior:
> > >
> > > rt_runtime_us / rt_period_us % loops executed in SCHED_OTHER
> > > 95% 4.48%
> > > 60% 54.84%
> > > 50% 86.03%
> > > 40% OTHER completed first
> > >
> >
> > Hmm. This does seem to indicate that there is some kind of
> > relationship with SMP. So I wonder whether there is a way to turn this
> > 'RT bandwidth accumulation' heuristic off.
>
> No there isn't, but maybe there should be, since this isn't the first
> time it's come up. One pro argument is that pinned tasks are thoroughly
> screwed when an RT hog lands on their runqueue. On the con side, the
> whole RT bandwidth restriction thing is intended (AFAIK) to allow an
> admin to regain control should RT app go insane, which the default 5%
> aggregate accomplishes just fine.
>
> Dunno. Fly or die little patchlet (toss).

btw, a _kinda sorta_ pro is that it can prevent IO lockups like the
below. Seems kjournald can end up depending on kblockd/3, which ain't
going anywhere with that 100% RT hog in the way, so the whole box is
fairly hosed. (much better would be to wake some other kblockd)

top - 12:01:49 up 56 min, 20 users, load average: 8.01, 4.96, 2.39
Tasks: 304 total, 4 running, 300 sleeping, 0 stopped, 0 zombie
Cpu(s): 25.8%us, 0.3%sy, 0.0%ni, 0.0%id, 73.7%wa, 0.3%hi, 0.0%si, 0.0%st

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ P COMMAND
13897 root -2 0 7920 592 484 R 100 0.0 1:13.43 3 xx
12716 root 20 0 8868 1328 860 R 1 0.0 0:01.44 0 top
14 root 15 -5 0 0 0 R 0 0.0 0:00.02 3 events/3
94 root 15 -5 0 0 0 R 0 0.0 0:00.00 3 kblockd/3
1212 root 15 -5 0 0 0 D 0 0.0 0:00.04 2 kjournald
14393 root 20 0 9848 2296 756 D 0 0.1 0:00.01 0 make
14404 root 20 0 38012 25m 5552 D 0 0.8 0:00.21 1 cc1
14405 root 20 0 20220 8852 2388 D 0 0.3 0:00.02 1 as
14437 root 20 0 24132 10m 2680 D 0 0.3 0:00.06 2 cc1
14448 root 20 0 18324 1724 1240 D 0 0.1 0:00.00 2 cc1
14452 root 20 0 12540 792 656 D 0 0.0 0:00.00 2 mv

Fabio Checconi

2009/09/06 9:20
> From: Anirban Sinha <a...@anirban.org>
> Date: Sat, Sep 05, 2009 05:47:39PM -0700

So this can be a bug. While it is possible that the kernel does
not succeed in migrating all the runtime (e.g., due to a (system) rt
task consuming some bandwidth on a remote cpu), 46% instead of 20%
is too much.

Running your program I'm unable to reproduce the same issue on a recent
kernel here; for 25ms over 100ms across several runs I get less than 2%.
This number increases, reaching your values, only when using short
periods (where the meaning for short depends on your HZ value), which
is something to be expected, due to the fact that rt throttling uses
the tick to charge runtimes to tasks.
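
Rough arithmetic behind that remark, taking HZ=100 (the value reported later in this thread) purely as an assumption:

/* tick-granularity charging: with HZ=100 a tick is 10 ms, so a 25 ms budget
 * in a 100 ms period is only 2-3 ticks and can overshoot by up to ~one tick. */
#include <stdio.h>

int main(void)
{
    const int hz = 100;                       /* assumed */
    const double tick_ms = 1000.0 / hz;
    const double runtime_ms = 25.0, period_ms = 100.0;

    printf("budget = %.0f ms = %.1f ticks of %.0f ms\n",
           runtime_ms, runtime_ms / tick_ms, tick_ms);
    printf("worst-case overshoot ~ %.0f ms = %.0f%% of the budget (%.0f%% of the period)\n",
           tick_ms, 100.0 * tick_ms / runtime_ms, 100.0 * tick_ms / period_ms);
    return 0;
}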

Looking at the git history, there have been several bugfixes to the rt
bandwidth code from 2.6.26, one of them seems to be strictly related to
runtime accounting with your setup:

commit f6121f4f8708195e88cbdf8dd8d171b226b3f858
Author: Dario Faggioli <rais...@linux.it>
Date: Fri Oct 3 17:40:46 2008 +0200

sched_rt.c: resch needed in rt_rq_enqueue() for the root rt_rq

Mike Galbraith

2009/09/06 11:20
On Sun, 2009-09-06 at 07:53 -0700, Anirban Sinha wrote:
>
>
>
> > Seems kjournald can end up depending on kblockd/3, which ain't
> > going anywhere with that 100% RT hog in the way,
>
> I think in the past AKPM's response to this has been "just don't do
> it", i.e, don't hog the CPU with an RT thread.

Oh yeah, sure. Best to run RT oinkers on isolated cpus. It just
surprised me that the 100% compute RT cpu became involved in IO.

-Mike

Anirban Sinha

2009/09/06 20:30

> Running your program I'm unable to reproduce the same issue on a recent
> kernel here; for 25ms over 100ms across several runs I get less than 2%.
> This number increases, reaching your values, only when using short
> periods (where the meaning for short depends on your HZ value),


In our kernel, the jiffies are configured as 100 HZ.

> which
> is something to be expected, due to the fact that rt throttling uses
> the tick to charge runtimes to tasks.

Hmm. I see. I understand that.


> Looking at the git history, there have been several bugfixes to the rt
> bandwidth code from 2.6.26, one of them seems to be strictly related to
> runtime accounting with your setup:

I will apply these patches on Tuesday and rerun the tests.

Anirban Sinha

2009/09/06 20:30

> Dunno. Fly or die little patchlet (toss).

> sched: allow the user to disable RT bandwidth aggregation.

Hmm. Interesting. With this change, my results are as follows:

rt_runtime/rt_period % of reg iterations

0.2 100%
0.25 100%
0.3 100%
0.4 100%
0.5 82%
0.6 66%
0.7 54%
0.8 46%
0.9 38.5%
0.95 32%


These results are on a quad core blade. Does it still make sense though?
Can anyone else run the same tests on a quadcore over the latest kernel?
I will patch our 2.6.26 kernel with upstream fixes and rerun these
tests on Tuesday.

Ani

Anirban Sinha

2009/09/06 20:50

On 2009-09-06, at 8:09 AM, Mike Galbraith wrote:

> On Sun, 2009-09-06 at 07:53 -0700, Anirban Sinha wrote:
>>
>>
>>
>>> Seems kjournald can end up depending on kblockd/3, which ain't
>>> going anywhere with that 100% RT hog in the way,
>>
>> I think in the past AKPM's response to this has been "just don't do
>> it", i.e, don't hog the CPU with an RT thread.
>
> Oh yeah, sure. Best to run RT oinkers on isolated cpus.

Correct. Unfortunately at some places, the application coders do
stupid things and then the onus falls on the kernel guys to make
things 'just work'.

I would not have any problems if such a cap mechanism did not exist at
all. However, since we do have such a tuning knob, I would say let's
make it do what it is supposed to do. In the documentation it
says "0.05s to be used by SCHED_OTHER". Unfortunately, it never hints
that if your thread is tied to the RT core, you are screwed. The
bandwidth accumulation logic would virtually kill all the remaining
SCHED_OTHER threads well before that 95% cap is reached. Somewhere it
doesn't quite seem right. At the very least, can we have this
clearly written in sched-rt-group.txt?

Cheers,

Ani

Mike Galbraith

2009/09/07 3:00
On Sun, 2009-09-06 at 17:18 -0700, Anirban Sinha wrote:
>
>
> > Dunno. Fly or die little patchlet (toss).
>
> > sched: allow the user to disable RT bandwidth aggregation.
>
> Hmm. Interesting. With this change, my results are as follows:
>
> rt_runtime/rt_period % of reg iterations
>
> 0.2 100%
> 0.25 100%
> 0.3 100%
> 0.4 100%
> 0.5 82%
> 0.6 66%
> 0.7 54%
> 0.8 46%
> 0.9 38.5%
> 0.95 32%
>
>
> This results are on a quad core blade. Does it still makes sense
> though?
> Can anyone else run the same tests on a quadcore over the latest
> kernel? I will patch our 2.6.26 kernel with upstream fixes and rerun
> these tests on tuesday.

I tested tip (v2.6.31-rc9-1357-ge6a3cd0) with a little perturbation
measurement proglet on an isolated Q6600 core.

10s measurement interval results:

sched_rt_runtime_us RT utilization
950000 94.99%
750000 75.00%
500000 50.04%
250000 25.02%
50000 5.03%

Seems to work fine here.
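
Not the actual proglet used above, just a minimal sketch of the idea, with an arbitrary 10 us threshold and a hard-coded core number:

/* minimal perturbation measurement: any gap between successive clock reads
 * that is much larger than the loop cost is time we did not get to run. */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <stdint.h>
#include <time.h>

static uint64_t now_ns(void)
{
    struct timespec ts;

    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000ull + ts.tv_nsec;
}

int main(void)
{
    const uint64_t interval = 10ull * 1000000000ull;   /* 10s window */
    const uint64_t threshold = 10000;                  /* >10us counts as stolen */
    uint64_t start, prev, stolen = 0;
    cpu_set_t set;

    CPU_ZERO(&set);
    CPU_SET(3, &set);                                  /* the isolated core */
    sched_setaffinity(0, sizeof set, &set);

    start = prev = now_ns();
    while (now_ns() - start < interval) {
        uint64_t t = now_ns();

        if (t - prev > threshold)
            stolen += t - prev;
        prev = t;
    }
    printf("stolen: %.2f%% of the interval\n", 100.0 * stolen / interval);
    return 0;
}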

-Mike

Peter Zijlstra

2009/09/07 4:00
On Sun, 2009-09-06 at 08:32 +0200, Mike Galbraith wrote:
> On Sat, 2009-09-05 at 19:32 -0700, Ani wrote:
> > On Sep 5, 3:50 pm, Lucas De Marchi <lucas.de.mar...@gmail.com> wrote:
> > >
> > > Indeed. I've tested this same test program in a single core machine and it
> > > produces the expected behavior:
> > >
> > > rt_runtime_us / rt_period_us % loops executed in SCHED_OTHER
> > > 95% 4.48%
> > > 60% 54.84%
> > > 50% 86.03%
> > > 40% OTHER completed first
> > >
> >
> > Hmm. This does seem to indicate that there is some kind of
> > relationship with SMP. So I wonder whether there is a way to turn this
> > 'RT bandwidth accumulation' heuristic off.
>
> No there isn't..

Actually there is, use cpusets to carve the system into partitions.

Mike Galbraith

2009/09/07 4:30
On Mon, 2009-09-07 at 09:59 +0200, Peter Zijlstra wrote:
> On Sun, 2009-09-06 at 08:32 +0200, Mike Galbraith wrote:
> > On Sat, 2009-09-05 at 19:32 -0700, Ani wrote:
> > > On Sep 5, 3:50 pm, Lucas De Marchi <lucas.de.mar...@gmail.com> wrote:
> > > >
> > > > Indeed. I've tested this same test program in a single core machine and it
> > > > produces the expected behavior:
> > > >
> > > > rt_runtime_us / rt_period_us % loops executed in SCHED_OTHER
> > > > 95% 4.48%
> > > > 60% 54.84%
> > > > 50% 86.03%
> > > > 40% OTHER completed first
> > > >
> > >
> > > Hmm. This does seem to indicate that there is some kind of
> > > relationship with SMP. So I wonder whether there is a way to turn this
> > > 'RT bandwidth accumulation' heuristic off.
> >
> > No there isn't..
>
> Actually there is, use cpusets to carve the system into partitions.

Yeah, I stand corrected. I tend to think in terms of the dirt simplest
configuration only.

-Mike

Peter Zijlstra

2009/09/07 7:10
On Mon, 2009-09-07 at 10:17 +0200, Mike Galbraith wrote:

> [ 774.651779] SysRq : Show Blocked State
> [ 774.655770] task PC stack pid father
> [ 774.655770] evolution.bin D ffff8800bc1575f0 0 7349 6459 0x00000000
> [ 774.676008] ffff8800bc3c9d68 0000000000000086 ffff8800015d9340 ffff8800bb91b780
> [ 774.676008] 000000000000dd28 ffff8800bc3c9fd8 0000000000013340 0000000000013340
> [ 774.676008] 00000000000000fd ffff8800015d9340 ffff8800bc1575f0 ffff8800bc157888
> [ 774.676008] Call Trace:
> [ 774.676008] [<ffffffff812c4a11>] schedule_timeout+0x2d/0x20c
> [ 774.676008] [<ffffffff812c4891>] wait_for_common+0xde/0x155
> [ 774.676008] [<ffffffff8103f1cd>] ? default_wake_function+0x0/0x14
> [ 774.676008] [<ffffffff810c0e63>] ? lru_add_drain_per_cpu+0x0/0x10
> [ 774.676008] [<ffffffff810c0e63>] ? lru_add_drain_per_cpu+0x0/0x10
> [ 774.676008] [<ffffffff812c49ab>] wait_for_completion+0x1d/0x1f
> [ 774.676008] [<ffffffff8105fdf5>] flush_work+0x7f/0x93
> [ 774.676008] [<ffffffff8105f870>] ? wq_barrier_func+0x0/0x14
> [ 774.676008] [<ffffffff81060109>] schedule_on_each_cpu+0xb4/0xed
> [ 774.676008] [<ffffffff810c0c78>] lru_add_drain_all+0x15/0x17
> [ 774.676008] [<ffffffff810d1dbd>] sys_mlock+0x2e/0xde
> [ 774.676008] [<ffffffff8100bc1b>] system_call_fastpath+0x16/0x1b

FWIW, something like the below (prone to explode since its utterly
untested) should (mostly) fix that one case. Something similar needs to
be done for pretty much all machine wide workqueue thingies, possibly
also flush_workqueue().

---
include/linux/workqueue.h | 1 +
kernel/workqueue.c | 52 +++++++++++++++++++++++++++++++++++---------
mm/swap.c | 14 ++++++++---
3 files changed, 52 insertions(+), 15 deletions(-)

diff --git a/include/linux/workqueue.h b/include/linux/workqueue.h
index 6273fa9..95b1df2 100644
--- a/include/linux/workqueue.h
+++ b/include/linux/workqueue.h
@@ -213,6 +213,7 @@ extern int schedule_work_on(int cpu, struct work_struct *work);
extern int schedule_delayed_work(struct delayed_work *work, unsigned long delay);
extern int schedule_delayed_work_on(int cpu, struct delayed_work *work,
unsigned long delay);
+extern int schedule_on_mask(const struct cpumask *mask, work_func_t func);
extern int schedule_on_each_cpu(work_func_t func);
extern int current_is_keventd(void);
extern int keventd_up(void);
diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 3c44b56..81456fc 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -657,6 +657,23 @@ int schedule_delayed_work_on(int cpu,
}
EXPORT_SYMBOL(schedule_delayed_work_on);

+struct sched_work_struct {
+ struct work_struct work;
+ work_func_t func;
+ atomic_t *count;
+ struct completion *completion;
+};
+
+static void do_sched_work(struct work_struct *work)
+{
+ struct sched_work_struct *sws = work;
+
+ sws->func(NULL);
+
+ if (atomic_dec_and_test(sws->count))
+ complete(sws->completion);
+}
+
/**
* schedule_on_each_cpu - call a function on each online CPU from keventd
* @func: the function to call
@@ -666,29 +683,42 @@ EXPORT_SYMBOL(schedule_delayed_work_on);
*
* schedule_on_each_cpu() is very slow.
*/
-int schedule_on_each_cpu(work_func_t func)
+int schedule_on_mask(const struct cpumask *mask, work_func_t func)
{
+ struct completion completion = COMPLETION_INITIALIZER_ONSTACK(completion);
+ atomic_t count = ATOMIC_INIT(cpumask_weight(mask));
+ struct sched_work_struct *works;
int cpu;
- struct work_struct *works;

- works = alloc_percpu(struct work_struct);
+ works = alloc_percpu(struct sched_work_struct);
if (!works)
return -ENOMEM;

- get_online_cpus();
- for_each_online_cpu(cpu) {
- struct work_struct *work = per_cpu_ptr(works, cpu);
+ for_each_cpu(cpu, mask) {
+ struct sched_work_struct *work = per_cpu_ptr(works, cpu);
+ work->count = &count;
+ work->completion = &completion;
+ work->func = func;

- INIT_WORK(work, func);
- schedule_work_on(cpu, work);
+ INIT_WORK(&work->work, do_sched_work);
+ schedule_work_on(cpu, &work->work);
}
- for_each_online_cpu(cpu)
- flush_work(per_cpu_ptr(works, cpu));
- put_online_cpus();
+ wait_for_completion(&completion);
free_percpu(works);
return 0;
}

+int schedule_on_each_cpu(work_func_t func)
+{
+ int ret;
+
+ get_online_cpus();
+ ret = schedule_on_mask(cpu_online_mask, func);
+ put_online_cpus();
+
+ return ret;
+}
+
void flush_scheduled_work(void)
{
flush_workqueue(keventd_wq);
diff --git a/mm/swap.c b/mm/swap.c
index cb29ae5..11e4b1e 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -36,6 +36,7 @@
/* How many pages do we try to swap or page in/out together? */
int page_cluster;

+static cpumask_t lru_drain_mask;
static DEFINE_PER_CPU(struct pagevec[NR_LRU_LISTS], lru_add_pvecs);
static DEFINE_PER_CPU(struct pagevec, lru_rotate_pvecs);

@@ -216,12 +217,15 @@ EXPORT_SYMBOL(mark_page_accessed);

void __lru_cache_add(struct page *page, enum lru_list lru)
{
- struct pagevec *pvec = &get_cpu_var(lru_add_pvecs)[lru];
+ int cpu = get_cpu();
+ struct pagevec *pvec = &per_cpu(lru_add_pvecs, cpu)[lru];
+
+ cpumask_set_cpu(cpu, lru_drain_mask);

page_cache_get(page);
if (!pagevec_add(pvec, page))
____pagevec_lru_add(pvec, lru);
- put_cpu_var(lru_add_pvecs);
+ put_cpu();
}

/**
@@ -294,7 +298,9 @@ static void drain_cpu_pagevecs(int cpu)

void lru_add_drain(void)
{
- drain_cpu_pagevecs(get_cpu());
+ int cpu = get_cpu();
+ cpumask_clear_cpu(cpu, lru_drain_mask);
+ drain_cpu_pagevecs(cpu);
put_cpu();
}

@@ -308,7 +314,7 @@ static void lru_add_drain_per_cpu(struct work_struct *dummy)
*/
int lru_add_drain_all(void)
{
- return schedule_on_each_cpu(lru_add_drain_per_cpu);
+ return schedule_on_mask(lru_drain_mask, lru_add_drain_per_cpu);
}

/*

Oleg Nesterov

2009/09/07 9:50

Failed to google the previous discussion. Could you please point me?
What is the problem?

> +struct sched_work_struct {
> + struct work_struct work;
> + work_func_t func;
> + atomic_t *count;
> + struct completion *completion;
> +};

(not that it matters, but perhaps sched_work_struct should have a single
pointer to the struct which contains func, count, completion).

> -int schedule_on_each_cpu(work_func_t func)
> +int schedule_on_mask(const struct cpumask *mask, work_func_t func)

Looks like a useful helper. But,

> + for_each_cpu(cpu, mask) {
> + struct sched_work_struct *work = per_cpu_ptr(works, cpu);
> + work->count = &count;
> + work->completion = &completion;
> + work->func = func;
>
> - INIT_WORK(work, func);
> - schedule_work_on(cpu, work);
> + INIT_WORK(&work->work, do_sched_work);
> + schedule_work_on(cpu, &work->work);

This means the caller must ensure CPU online and can't go away. Otherwise
we can hang forever.

schedule_on_each_cpu() is fine, it calls us under get_online_cpus().
But,

> int lru_add_drain_all(void)
> {
> - return schedule_on_each_cpu(lru_add_drain_per_cpu);
> + return schedule_on_mask(lru_drain_mask, lru_add_drain_per_cpu);
> }

This doesn't look safe.

Looks like, schedule_on_mask() should take get_online_cpus(), do
cpus_and(mask, mask, online_cpus), then schedule works.

If we don't care the work can migrate to another CPU, schedule_on_mask()
can do put_online_cpus() before wait_for_completion().

Oleg.

Peter Zijlstra

2009/09/07 10:00

Ah, the general problem is that when we carve up the machine into
partitions using cpusets, we still get machine wide tickles on all cpus
from workqueue stuff like schedule_on_each_cpu() and flush_workqueue(),
even if some cpus don't actually use their workqueue.

So the below limits lru_add_drain() activity to cpus that actually have
pages in their per-cpu lists.

flush_workqueue() could limit itself to cpus that had work queued since
the last flush_workqueue() invocation, etc.

This avoids un-needed disruption of these cpus.

Christoph wants this because he's running cpu-bound userspace and simply
doesn't care to donate a few cycles to the kernel maintenance when not
needed (every tiny bit helps in completing the HPC job sooner).

Mike ran into this because he's starving a partitioned cpu using an RT
task -- which currently starves the other cpus because the workqueues
don't get to run and everybody waits...

The lru_add_drain_all() thing is just one of the many cases, and the
below won't fully solve Mike's problem since the cpu could still have
pending work on the per-cpu list from starting the RT task.. but it
shows the direction in which to improve things.

> > +struct sched_work_struct {
> > + struct work_struct work;
> > + work_func_t func;
> > + atomic_t *count;
> > + struct completion *completion;
> > +};
>
> (not that it matters, but perhaps sched_work_struct should have a single
> pointer to the struct which contains func, count, completion).

Sure, it more-or-less grew while writing, I always forget completions
don't count.

> > -int schedule_on_each_cpu(work_func_t func)
> > +int schedule_on_mask(const struct cpumask *mask, work_func_t func)
>
> Looks like a useful helper. But,
>
> > + for_each_cpu(cpu, mask) {
> > + struct sched_work_struct *work = per_cpu_ptr(works, cpu);
> > + work->count = &count;
> > + work->completion = &completion;
> > + work->func = func;
> >
> > - INIT_WORK(work, func);
> > - schedule_work_on(cpu, work);
> > + INIT_WORK(&work->work, do_sched_work);
> > + schedule_work_on(cpu, &work->work);
>
> This means the caller must ensure CPU online and can't go away. Otherwise
> we can hang forever.
>
> schedule_on_each_cpu() is fine, it calls us under get_online_cpus().
> But,
>
> > int lru_add_drain_all(void)
> > {
> > - return schedule_on_each_cpu(lru_add_drain_per_cpu);
> > + return schedule_on_mask(lru_drain_mask, lru_add_drain_per_cpu);
> > }
>
> This doesn't look safe.
>
> Looks like, schedule_on_mask() should take get_online_cpus(), do
> cpus_and(mask, mask, online_cpus), then schedule works.
>
> If we don't care the work can migrate to another CPU, schedule_on_mask()
> can do put_online_cpus() before wait_for_completion().

Ah, right. Like I said, I only quickly hacked this up as an example of how
to improve isolation between cpus and limit unneeded work, in the hope
someone would pick this up and maybe tackle other sites as well.

Peter Zijlstra

2009/09/07 10:30
On Mon, 2009-09-07 at 16:18 +0200, Oleg Nesterov wrote:

> > flush_workqueue() could limit itself to cpus that had work queued since
> > the last flush_workqueue() invocation, etc.
>

> But "work queued since the last flush_workqueue() invocation" just means
> "has work queued". Please note that flush_cpu_workqueue() does nothing
> if there are no works, except it does lock/unlock of cwq->lock.
>
> IIRC, flush_cpu_workqueue() has to lock/unlock to avoid the races with
> CPU hotplug, but _perhaps_ flush_workqueue() can do the check lockless.
>
> Afaics, we can add the workqueue_struct->cpu_map_has_works to help
> flush_workqueue(), but this means we should complicate insert_work()
> and run_workqueue() which should set/clear the bit. But given that
> flush_workqueue() should be avoided anyway, I am not sure.

Ah, indeed. Then nothing new would be needed here, since it will indeed
not interrupt processing on the remote cpus that never queued any work.

Oleg Nesterov

2009/09/07 10:30
On 09/07, Peter Zijlstra wrote:
>
> On Mon, 2009-09-07 at 15:35 +0200, Oleg Nesterov wrote:
> >
> > Failed to google the previous discussion. Could you please point me?
> > What is the problem?
>
> Ah, the general problem is that when we carve up the machine into
> partitions using cpusets, we still get machine wide tickles on all cpus
> from workqueue stuff like schedule_on_each_cpu() and flush_workqueue(),
> even if some cpus don't actually use their workqueue.
>
> So the below limits lru_add_drain() activity to cpus that actually have
> pages in their per-cpu lists.

Thanks Peter!

> flush_workqueue() could limit itself to cpus that had work queued since
> the last flush_workqueue() invocation, etc.

But "work queued since the last flush_workqueue() invocation" just means


"has work queued". Please note that flush_cpu_workqueue() does nothing
if there are no works, except it does lock/unlock of cwq->lock.

IIRC, flush_cpu_workqueue() has to lock/unlock to avoid the races with
CPU hotplug, but _perhaps_ flush_workqueue() can do the check lockless.

Afaics, we can add the workqueue_struct->cpu_map_has_works to help
flush_workqueue(), but this means we should complicate insert_work()
and run_workqueue() which should set/clear the bit. But given that
flush_workqueue() should be avoided anyway, I am not sure.

Oleg.

KOSAKI Motohiro

2009/09/07 20:00
Hi Peter,

> On Mon, 2009-09-07 at 10:17 +0200, Mike Galbraith wrote:
>
> > [ 774.651779] SysRq : Show Blocked State
> > [ 774.655770] task PC stack pid father
> > [ 774.655770] evolution.bin D ffff8800bc1575f0 0 7349 6459 0x00000000
> > [ 774.676008] ffff8800bc3c9d68 0000000000000086 ffff8800015d9340 ffff8800bb91b780
> > [ 774.676008] 000000000000dd28 ffff8800bc3c9fd8 0000000000013340 0000000000013340
> > [ 774.676008] 00000000000000fd ffff8800015d9340 ffff8800bc1575f0 ffff8800bc157888
> > [ 774.676008] Call Trace:
> > [ 774.676008] [<ffffffff812c4a11>] schedule_timeout+0x2d/0x20c
> > [ 774.676008] [<ffffffff812c4891>] wait_for_common+0xde/0x155
> > [ 774.676008] [<ffffffff8103f1cd>] ? default_wake_function+0x0/0x14
> > [ 774.676008] [<ffffffff810c0e63>] ? lru_add_drain_per_cpu+0x0/0x10
> > [ 774.676008] [<ffffffff810c0e63>] ? lru_add_drain_per_cpu+0x0/0x10
> > [ 774.676008] [<ffffffff812c49ab>] wait_for_completion+0x1d/0x1f
> > [ 774.676008] [<ffffffff8105fdf5>] flush_work+0x7f/0x93
> > [ 774.676008] [<ffffffff8105f870>] ? wq_barrier_func+0x0/0x14
> > [ 774.676008] [<ffffffff81060109>] schedule_on_each_cpu+0xb4/0xed
> > [ 774.676008] [<ffffffff810c0c78>] lru_add_drain_all+0x15/0x17
> > [ 774.676008] [<ffffffff810d1dbd>] sys_mlock+0x2e/0xde
> > [ 774.676008] [<ffffffff8100bc1b>] system_call_fastpath+0x16/0x1b
>
> FWIW, something like the below (prone to explode since its utterly
> untested) should (mostly) fix that one case. Something similar needs to
> be done for pretty much all machine wide workqueue thingies, possibly
> also flush_workqueue().

Can you please explain how to reproduce this and describe the problem in detail?

AFAIK, mlock() calls lru_add_drain_all() _before_ grabbing the semaphore, so
it doesn't cause any deadlock.

Anirban Sinha

2009/09/08 3:10

On 2009-09-07, at 9:42 AM, Anirban Sinha wrote:

>
>
>
> -----Original Message-----
> From: Peter Zijlstra [mailto:a.p.zi...@chello.nl]
> Sent: Mon 9/7/2009 12:59 AM
> To: Mike Galbraith
> Cc: Anirban Sinha; Lucas De Marchi; linux-...@vger.kernel.org;
> Ingo Molnar
> Subject: Re: question on sched-rt group allocation cap:
> sched_rt_runtime_us
>

> On Sun, 2009-09-06 at 08:32 +0200, Mike Galbraith wrote:
> > On Sat, 2009-09-05 at 19:32 -0700, Ani wrote:
> > > On Sep 5, 3:50 pm, Lucas De Marchi <lucas.de.mar...@gmail.com> wrote:
> > > >
> > > > Indeed. I've tested this same test program in a single core machine and it
> > > > produces the expected behavior:
> > > >
> > > > rt_runtime_us / rt_period_us     % loops executed in SCHED_OTHER
> > > > 95%                              4.48%
> > > > 60%                              54.84%
> > > > 50%                              86.03%
> > > > 40%                              OTHER completed first
> > > >
> > >
> > > Hmm. This does seem to indicate that there is some kind of
> > > relationship with SMP. So I wonder whether there is a way to turn this
> > > 'RT bandwidth accumulation' heuristic off.
> >
> > No there isn't..
>
> Actually there is, use cpusets to carve the system into partitions.

hmm. ok. I looked at the code a little bit. It seems to me that the
'borrowing' of RT runtimes occurs only from rt runqueues belonging to
the same root domain. And partition_sched_domains() is the only
external interface that can be used to create root domain out of a CPU
set. But then I think it needs to have CGROUPS/USER groups enabled?
Right?

--Ani

Anirban Sinha

2009/09/08 3:20

On 2009-09-07, at 9:44 AM, Anirban Sinha wrote:

>
>
>
> -----Original Message-----
> From: Mike Galbraith [mailto:efa...@gmx.de]
> Sent: Sun 9/6/2009 11:54 PM
> To: Anirban Sinha
> Cc: Lucas De Marchi; linux-...@vger.kernel.org; Peter Zijlstra;
> Ingo Molnar
> Subject: RE: question on sched-rt group allocation cap:
> sched_rt_runtime_us
>

> On Sun, 2009-09-06 at 17:18 -0700, Anirban Sinha wrote:
> >
> >
> > > Dunno. Fly or die little patchlet (toss).
> >
> > > sched: allow the user to disable RT bandwidth aggregation.
> >
> > Hmm. Interesting. With this change, my results are as follows:
> >
> > rt_runtime/rt_period % of reg iterations
> >
> > 0.2 100%
> > 0.25 100%
> > 0.3 100%
> > 0.4 100%
> > 0.5 82%
> > 0.6 66%
> > 0.7 54%
> > 0.8 46%
> > 0.9 38.5%
> > 0.95 32%
> >
> >
> > This results are on a quad core blade. Does it still makes sense
> > though?
> > Can anyone else run the same tests on a quadcore over the latest
> > kernel? I will patch our 2.6.26 kernel with upstream fixes and rerun
> > these tests on tuesday.
>
> I tested tip (v2.6.31-rc9-1357-ge6a3cd0) with a little perturbation
> measurement proglet on an isolated Q6600 core.


Thanks Mike. Is this on a single core machine (or one core carved out
of N)? We may have some newer patches missing from the 2.6.26 kernel
that fixes some accounting bugs. I will do a review and rerun the test
after applying the upstream patches.

Ani

Peter Zijlstra

2009/09/08 4:30

Suppose you have 2 cpus, cpu1 is busy doing a SCHED_FIFO-99 while(1),
cpu0 does mlock()->lru_add_drain_all(), which does
schedule_on_each_cpu(), which then waits for all cpus to complete the
work. Except that cpu1, which is busy with the RT task, will never run
keventd until the RT load goes away.

This is not so much an actual deadlock as a serious starvation case.
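
A hedged sketch of a reproducer for that scenario (the CPU numbers, the 1 MB mlock() buffer and the need to first disable RT throttling are all assumptions; it needs root for SCHED_FIFO):

/* sketch: pin a SCHED_FIFO hog on cpu1, then call mlock() from cpu0.
 * With RT throttling disabled (sched_rt_runtime_us = -1) the mlock() caller
 * can sit in lru_add_drain_all() until the hog goes away. */
#define _GNU_SOURCE
#include <sched.h>
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>
#include <sys/mman.h>

static void *rt_hog(void *arg)
{
    struct sched_param sp = { .sched_priority = 99 };
    cpu_set_t set;

    CPU_ZERO(&set);
    CPU_SET(1, &set);
    sched_setaffinity(0, sizeof set, &set);
    pthread_setschedparam(pthread_self(), SCHED_FIFO, &sp);
    for (;;)
        ;                                   /* 100% RT load; keventd/1 starves */
    return NULL;
}

int main(void)
{
    static char buf[1 << 20];
    pthread_t tid;
    cpu_set_t set;

    pthread_create(&tid, NULL, rt_hog, NULL);

    CPU_ZERO(&set);
    CPU_SET(0, &set);
    sched_setaffinity(0, sizeof set, &set);

    sleep(1);
    printf("calling mlock()...\n");
    mlock(buf, sizeof buf);                 /* may block in lru_add_drain_all() */
    printf("mlock() returned\n");
    return 0;
}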

Peter Zijlstra

2009/09/08 4:50
On Tue, 2009-09-08 at 00:08 -0700, Anirban Sinha wrote:

> > Actually there is, use cpusets to carve the system into partitions.
>
> hmm. ok. I looked at the code a little bit. It seems to me that the
> 'borrowing' of RT runtimes occurs only from rt runqueues belonging to
> the same root domain. And partition_sched_domains() is the only
> external interface that can be used to create root domain out of a CPU
> set. But then I think it needs to have CGROUPS/USER groups enabled?
> Right?

No you need cpusets, you create a partition by disabling load-balancing
on the top set, thereby only allowing load-balancing within the
children.

The runtime sharing is a form of load-balancing.

CONFIG_CPUSETS=y

Documentation/cgroups/cpusets.txt
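
A sketch of what such a partition could look like through the legacy cpuset filesystem; the mount point, cpuset names and CPU/memory-node numbers below are assumptions taken from that document rather than from this thread:

/* sketch: split a quad core into a "system" partition (cpus 0-2) and an
 * isolated "rtcpu" partition (cpu 3), assuming the legacy mount
 * "mount -t cpuset none /dev/cpuset" has already been done. */
#include <stdio.h>
#include <sys/stat.h>

static int put(const char *path, const char *val)
{
    FILE *f = fopen(path, "w");

    if (!f) {
        perror(path);
        return -1;
    }
    fputs(val, f);
    return fclose(f);
}

int main(void)
{
    mkdir("/dev/cpuset/system", 0755);
    put("/dev/cpuset/system/cpus", "0-2");
    put("/dev/cpuset/system/mems", "0");

    mkdir("/dev/cpuset/rtcpu", 0755);
    put("/dev/cpuset/rtcpu/cpus", "3");
    put("/dev/cpuset/rtcpu/mems", "0");

    /* stop balancing (and hence RT runtime sharing) across the two children */
    put("/dev/cpuset/sched_load_balance", "0");

    /* tasks still have to be moved into the children's "tasks" files */
    return 0;
}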

Mike Galbraith

2009/09/08 5:30
On Tue, 2009-09-08 at 00:10 -0700, Anirban Sinha wrote:

> > I tested tip (v2.6.31-rc9-1357-ge6a3cd0) with a little perturbation
> > measurement proglet on an isolated Q6600 core.
>
>
> Thanks Mike. Is this on a single core machine (or one core carved out
> of N)? We may have some newer patches missing from the 2.6.26 kernel
> that fixes some accounting bugs. I will do a review and rerun the test
> after applying the upstream patches.

Q6600 is a quad, test was 1 carved out of 4 (thought I said that).

-Mike

KOSAKI Motohiro

2009/09/08 6:10

This seems to be a flush_work vs RT-thread problem, not only lru_add_drain_all().
Why don't other workqueue flushers hit this issue?

Peter Zijlstra

2009/09/08 6:30

flush_work() will only flush workqueues on which work has been enqueued
as Oleg pointed out.

The problem is with lru_add_drain_all() enqueueing work on all
workqueues.

There is nothing that makes lru_add_drain_all() the only such site, its
the one Mike posted to me, and my patch was a way to deal with that.

I also explained that its not only RT related in that the HPC folks also
want to avoid unneeded work -- for them its not starvation but a
performance issue.

In general we should avoid doing work when there is no work to be done.

KOSAKI Motohiro

2009/09/08 7:50

Thank you for kindly explanation. I gradually become to understand this isssue.
Yes, lru_add_drain_all() use schedule_on_each_cpu() and it have following code

for_each_online_cpu(cpu)
flush_work(per_cpu_ptr(works, cpu));

However, I don't think your approach solve this issue.
lru_add_drain_all() flush lru_add_pvecs and lru_rotate_pvecs.

lru_add_pvecs is accounted when
- lru move
e.g. read(2), write(2), page fault, vmscan, page migration, et al

lru_rotate_pves is accounted when
- page writeback

IOW, if RT-thread call write(2) syscall or page fault, we face the same
problem. I don't think we can assume RT-thread don't make page fault....

hmm, this seems difficult problem. I guess any mm code should use
schedule_on_each_cpu(). I continue to think this issue awhile.


> There is nothing that makes lru_add_drain_all() the only such site, its
> the one Mike posted to me, and my patch was a way to deal with that.

Well, schedule_on_each_cpu() is very limited used function.
Practically we can ignore other caller.


> I also explained that its not only RT related in that the HPC folks also
> want to avoid unneeded work -- for them its not starvation but a
> performance issue.

I think you talked about OS jitter issue. if so, I don't think this issue
make serious problem. OS jitter mainly be caused by periodic action
(e.g. tick update, timer, vmstat update). it's because
little-delay x plenty-times = large-delay

lru_add_drain_all() is called from very limited point. e.g. mlock, shm-lock,
page-migration, memory-hotplug. all caller is not periodic.


> In general we should avoid doing work when there is no work to be done.

Probably. but I'm not sure ;)

Peter Zijlstra

2009/09/08 8:10
On Tue, 2009-09-08 at 20:41 +0900, KOSAKI Motohiro wrote:

> Thank you for kindly explanation. I gradually become to understand this isssue.
> Yes, lru_add_drain_all() use schedule_on_each_cpu() and it have following code
>
> for_each_online_cpu(cpu)
> flush_work(per_cpu_ptr(works, cpu));
>
> However, I don't think your approach solve this issue.
> lru_add_drain_all() flush lru_add_pvecs and lru_rotate_pvecs.
>
> lru_add_pvecs is accounted when
> - lru move
> e.g. read(2), write(2), page fault, vmscan, page migration, et al
>
> lru_rotate_pves is accounted when
> - page writeback
>
> IOW, if RT-thread call write(2) syscall or page fault, we face the same
> problem. I don't think we can assume RT-thread don't make page fault....
>
> hmm, this seems difficult problem. I guess any mm code should use
> schedule_on_each_cpu(). I continue to think this issue awhile.

This is about avoiding work when there is none, clearly when an
application does use the kernel it creates work.

But a clearly userspace, cpu-bound process, while(1), should not get
interrupted by things like lru_add_drain() when it doesn't have any
pages to drain.

> > There is nothing that makes lru_add_drain_all() the only such site, its
> > the one Mike posted to me, and my patch was a way to deal with that.
>
> Well, schedule_on_each_cpu() is very limited used function.
> Practically we can ignore other caller.

No, we need to inspect all callers, having only a few makes that easier.

> > I also explained that its not only RT related in that the HPC folks also
> > want to avoid unneeded work -- for them its not starvation but a
> > performance issue.
>
> I think you talked about OS jitter issue. if so, I don't think this issue
> make serious problem. OS jitter mainly be caused by periodic action
> (e.g. tick update, timer, vmstat update). it's because
> little-delay x plenty-times = large-delay
>
> lru_add_drain_all() is called from very limited point. e.g. mlock, shm-lock,
> page-migration, memory-hotplug. all caller is not periodic.

Doesn't matter, if you want to reduce it, you need to address all of
them, a process 4 nodes away calling mlock() while this partition has
been user-bound for the last hour or so and doesn't have any lru pages
simply needn't be woken.

Christoph Lameter

2009/09/08 10:10
On Tue, 8 Sep 2009, Peter Zijlstra wrote:

> This is about avoiding work when there is none, clearly when an
> application does use the kernel it creates work.

Hmmm. The lru draining in page migration is to reduce the number of pages
that are not on the lru to increase the chance of page migration to be
successful. A page on a per cpu list cannot be drained.

Reducing the number of cpus where we perform the drain results in
increased likelihood that we cannot migrate a page because it's on the per
cpu lists of a cpu not covered.

On the other hand if the cpu is offline then we know that it has no per
cpu pages. That is why I found the idea of the OFFLINE
scheduler attractive.

Peter Zijlstra

2009/09/08 10:30
On Tue, 2009-09-08 at 10:03 -0400, Christoph Lameter wrote:
> On Tue, 8 Sep 2009, Peter Zijlstra wrote:
>
> > This is about avoiding work when there is none, clearly when an
> > application does use the kernel it creates work.
>
> Hmmm. The lru draining in page migration is to reduce the number of pages
> that are not on the lru to increase the chance of page migration to be
> successful. A page on a per cpu list cannot be drained.
>
> Reducing the number of cpus where we perform the drain results in
> increased likelihood that we cannot migrate a page because it's on the per
> cpu lists of a cpu not covered.

Did you even read the patch?

There is _no_ functional difference between before and after, except
fewer wakeups on cpus that don't have any __lru_cache_add activity.

If there are pages on the per cpu lru_add_pvecs list it will be present in
the mask and will be sent a drain request. If it's not, then it won't be
sent.

Anirban Sinha

2009/09/08 10:50

On 2009-09-08, at 1:42 AM, Peter Zijlstra wrote:

> On Tue, 2009-09-08 at 00:08 -0700, Anirban Sinha wrote:
>
>>> Actually there is, use cpusets to carve the system into partitions.
>>
>> hmm. ok. I looked at the code a little bit. It seems to me that the
>> 'borrowing' of RT runtimes occurs only from rt runqueues belonging to
>> the same root domain. And partition_sched_domains() is the only
>> external interface that can be used to create root domain out of a
>> CPU
>> set. But then I think it needs to have CGROUPS/USER groups enabled?
>> Right?
>
> No you need cpusets, you create a partition by disabling load-balancing
> on the top set, thereby only allowing load-balancing within the
> children.
>

Ah I see. Thanks for the clarification.


> The runtime sharing is a form of load-balancing.

sure.


>
> CONFIG_CPUSETS=y


Hmm. Ok. I guess what I meant but did not articulate properly (because
I was thinking in terms of code) was that CPUSETS needs CGROUPS support:

config CPUSETS
bool "Cpuset support"
depends on CGROUPS


Anyway, that's fine. I'll dig around the code a little bit more.


>
> Documentation/cgroups/cpusets.txt

Thanks for the pointer. My bad, I did not care to see the docs. I tend
to ignore docs and read code instead. :D

Christoph Lameter

2009/09/08 11:30
On Tue, 8 Sep 2009, Peter Zijlstra wrote:

> There is _no_ functional difference between before and after, except
> fewer wakeups on cpus that don't have any __lru_cache_add activity.
>
> If there are pages on the per cpu lru_add_pvecs list it will be present in
> the mask and will be sent a drain request. If it's not, then it won't be
> sent.

Ok I see.

A global cpu mask like this will cause cacheline bouncing. After all this
is a hot cpu path. Maybe do not set the bit if its already set
(which may be very frequent)? Then add some benchmarks to show that it
does not cause a regression on a 16p box (Nehalem) or so?

Peter Zijlstra

2009/09/08 11:30
On Tue, 2009-09-08 at 11:22 -0400, Christoph Lameter wrote:
> On Tue, 8 Sep 2009, Peter Zijlstra wrote:
>
> > There is _no_ functional difference between before and after, except
> > fewer wakeups on cpus that don't have any __lru_cache_add activity.
> >
> > If there are pages on the per cpu lru_add_pvecs list it will be present in
> > the mask and will be sent a drain request. If it's not, then it won't be
> > sent.
>
> Ok I see.
>
> A global cpu mask like this will cause cacheline bouncing. After all this
> is a hot cpu path. Maybe do not set the bit if its already set
> (which may be very frequent)? Then add some benchmarks to show that it
> does not cause a regression on a 16p box (Nehalem) or so?

Yeah, testing the bit before poking at it sounds like a good plan.

Unless someone feels inclined to finish this and audit the kernel for
more such places, I'll stick it on the ever growing todo pile.
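
A minimal stand-alone illustration of that test-before-set idea, using plain __sync builtins on an unsigned long rather than the kernel's cpumask API:

/* only the (rare) first marking of a cpu pays for the atomic RMW; later
 * calls just read the shared word, so its cache line is not bounced. */
#include <stdio.h>

static unsigned long lru_drain_mask;

static void mark_cpu_has_pages(int cpu)
{
    unsigned long bit = 1UL << cpu;

    if (!(*(volatile unsigned long *)&lru_drain_mask & bit))
        __sync_fetch_and_or(&lru_drain_mask, bit);
}

int main(void)
{
    mark_cpu_has_pages(3);
    mark_cpu_has_pages(3);      /* second call: read-only check, no write */
    printf("mask=%#lx\n", lru_drain_mask);
    return 0;
}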

Christoph Lameter

2009/09/08 11:40
The usefulness of a scheme like this requires:

1. There are cpus that continually execute user space code
without system interaction.

2. There are repeated VM activities that require page isolation /
migration.

The first page isolation activity will then clear the lru caches of the
processes doing number crunching in user space (and therefore the first
isolation will still interrupt). The second and following isolation will
then no longer interrupt the processes.

2. is rare. So the question is if the additional code in the LRU handling
can be justified. If lru handling is not time sensitive then yes.

Anirban Sinha

2009/09/08 13:40

> Looking at the git history, there have been several bugfixes to the rt
> bandwidth code from 2.6.26, one of them seems to be strictly related
> to
> runtime accounting with your setup:
>
> commit f6121f4f8708195e88cbdf8dd8d171b226b3f858
> Author: Dario Faggioli <rais...@linux.it>
> Date: Fri Oct 3 17:40:46 2008 +0200
>
> sched_rt.c: resch needed in rt_rq_enqueue() for the root rt_rq

Hmm. Indeed there do seem to have been quite a few fixes to the accounting
logic. I back-patched our 2.6.26 kernel with the upstream patches that
seemed relevant and my test code now yields reasonable results.
Applying the above patch did not fix it though, which kind of makes
sense since from the commit log it seems that the patch fixed cases
where the RT task was getting *less* CPU than its bandwidth allocation,
as opposed to more as in my case. I haven't bisected the patchset to
figure out exactly which one fixed it but I intend to do it later just
for fun.

For completeness, these are the results after applying the upstream
patches *and* disabling bandwidth borrowing logic on my 2.6.26 kernel
running on a quad core blade with CONFIG_GROUP_SCHED turned off (100HZ
jiffies):

rt_runtime/
rt_period % of SCHED_OTHER iterations

.40 100%
.50 74%
.60 47%
.70 31%
.80 18%
.90 8%
.95 4%

--Ani

Anirban Sinha

2009/09/08 13:50

On 2009-09-08, at 10:32 AM, Anirban Sinha wrote:

>
>
>
> -----Original Message-----
> From: Mike Galbraith [mailto:efa...@gmx.de]
> Sent: Sat 9/5/2009 11:32 PM
> To: Anirban Sinha
> Cc: Lucas De Marchi; linux-...@vger.kernel.org; Peter Zijlstra;
> Ingo Molnar
> Subject: Re: question on sched-rt group allocation cap:
> sched_rt_runtime_us
>

> On Sat, 2009-09-05 at 19:32 -0700, Ani wrote:
> > On Sep 5, 3:50 pm, Lucas De Marchi <lucas.de.mar...@gmail.com> wrote:
> > >
> > > Indeed. I've tested this same test program in a single core machine and it
> > > produces the expected behavior:
> > >
> > > rt_runtime_us / rt_period_us % loops executed in SCHED_OTHER
> > > 95% 4.48%
> > > 60% 54.84%
> > > 50% 86.03%
> > > 40% OTHER completed first
> > >
> >
> > Hmm. This does seem to indicate that there is some kind of
> > relationship with SMP. So I wonder whether there is a way to turn this
> > 'RT bandwidth accumulation' heuristic off.
>

> No there isn't, but maybe there should be, since this isn't the first
> time it's come up. One pro argument is that pinned tasks are
> thoroughly
> screwed when an RT hog lands on their runqueue. On the con side, the
> whole RT bandwidth restriction thing is intended (AFAIK) to allow an
> admin to regain control should RT app go insane, which the default 5%
> aggregate accomplishes just fine.


>
> Dunno. Fly or die little patchlet (toss).

So it would be nice to have a knob like this when CGROUPS is disabled
(it says 'say N when unsure' :)). CPUSETS depends on CGROUPS.


>
> sched: allow the user to disable RT bandwidth aggregation.
>

> Signed-off-by: Mike Galbraith <efa...@gmx.de>
> Cc: Ingo Molnar <mi...@elte.hu>
> Cc: Peter Zijlstra <a.p.zi...@chello.nl>


Verified-by: Anirban Sinha <asi...@zeugmasystems.com>


> LKML-Reference: <new-submission>
>
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index 8736ba1..6e6d4c7 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -1881,6 +1881,7 @@ static inline unsigned int get_sysctl_timer_migration(void)
> #endif
> extern unsigned int sysctl_sched_rt_period;
> extern int sysctl_sched_rt_runtime;
> +extern int sysctl_sched_rt_bandwidth_aggregate;
>
> int sched_rt_handler(struct ctl_table *table, int write,
> struct file *filp, void __user *buffer, size_t *lenp,
> diff --git a/kernel/sched.c b/kernel/sched.c
> index c512a02..ca6a378 100644
> --- a/kernel/sched.c
> +++ b/kernel/sched.c
> @@ -864,6 +864,12 @@ static __read_mostly int scheduler_running;
> */
> int sysctl_sched_rt_runtime = 950000;
>
> +/*
> + * aggregate bandwidth, ie allow borrowing from neighbors when
> + * bandwidth for an individual runqueue is exhausted.
> + */
> +int sysctl_sched_rt_bandwidth_aggregate = 1;
> +
> static inline u64 global_rt_period(void)
> {
> return (u64)sysctl_sched_rt_period * NSEC_PER_USEC;
> diff --git a/kernel/sched_rt.c b/kernel/sched_rt.c
> index 2eb4bd6..75daf88 100644
> --- a/kernel/sched_rt.c
> +++ b/kernel/sched_rt.c
> @@ -495,6 +495,9 @@ static int balance_runtime(struct rt_rq *rt_rq)
> {
> int more = 0;
>
> + if (!sysctl_sched_rt_bandwidth_aggregate)
> + return 0;
> +
> if (rt_rq->rt_time > rt_rq->rt_runtime) {
> spin_unlock(&rt_rq->rt_runtime_lock);
> more = do_balance_runtime(rt_rq);
> diff --git a/kernel/sysctl.c b/kernel/sysctl.c
> index cdbe8d0..0ad08e5 100644
> --- a/kernel/sysctl.c
> +++ b/kernel/sysctl.c
> @@ -368,6 +368,14 @@ static struct ctl_table kern_table[] = {
> },
> {
> .ctl_name = CTL_UNNUMBERED,
> + .procname = "sched_rt_bandwidth_aggregate",
> + .data = &sysctl_sched_rt_bandwidth_aggregate,
> + .maxlen = sizeof(int),
> + .mode = 0644,
> + .proc_handler = &sched_rt_handler,
> + },
> + {
> + .ctl_name = CTL_UNNUMBERED,
> .procname = "sched_compat_yield",
> .data = &sysctl_sched_compat_yield,
> .maxlen = sizeof(unsigned int),
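
As a usage note: with a patch like the one quoted above applied, the new
knob would presumably appear next to the existing ones under
/proc/sys/kernel/, and turning the borrowing off for a test could look
roughly like this (a sketch only; the file exists only once the patch is
in):

#include <stdio.h>

/* sketch: turn off RT runtime borrowing via the sysctl added by the
 * patch quoted above; the file only exists with that patch applied */
int main(void)
{
	FILE *f = fopen("/proc/sys/kernel/sched_rt_bandwidth_aggregate", "w");

	if (!f) {
		perror("sched_rt_bandwidth_aggregate (is the patch applied?)");
		return 1;
	}
	fputs("0\n", f);
	fclose(f);
	return 0;
}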

Mike Galbraith

2009/09/08 15:10:04
To:
On Tue, 2009-09-08 at 10:41 -0700, Anirban Sinha wrote:
> On 2009-09-08, at 10:32 AM, Anirban Sinha wrote:
>

> > Dunno. Fly or die little patchlet (toss).
>
> So it would be nice to have a knob like this when CGROUPS is disabled
> (it says 'say N when unsure' :)). CPUSETS depends on CGROUPS.

Maybe. Short term hack. My current thoughts on the subject, after some
testing, is that the patchlet should just die, and pondering the larger
solution should happen.

-Mike

Anirban Sinha

2009/09/08 15:40:05
To:

>Maybe. Short term hack. My current thoughts on the subject, after some
>testing, is that the patchlet should just die, and pondering the larger
>solution should happen.

Just curious, what is the larger solution? When everyone adapts to using
control groups?

Anirban Sinha

2009/09/08 17:40:09
To:

KOSAKI Motohiro

2009/09/08 22:10:05
To:
Hi

> > Thank you for kindly explanation. I gradually become to understand this isssue.
> > Yes, lru_add_drain_all() use schedule_on_each_cpu() and it have following code
> >
> > for_each_online_cpu(cpu)
> > flush_work(per_cpu_ptr(works, cpu));
> >
> > However, I don't think your approach solve this issue.
> > lru_add_drain_all() flush lru_add_pvecs and lru_rotate_pvecs.
> >
> > lru_add_pvecs is accounted when
> > - lru move
> > e.g. read(2), write(2), page fault, vmscan, page migration, et al
> >
> > lru_rotate_pves is accounted when
> > - page writeback
> >
> > IOW, if RT-thread call write(2) syscall or page fault, we face the same
> > problem. I don't think we can assume RT-thread don't make page fault....
> >
> > hmm, this seems difficult problem. I guess any mm code should use
> > schedule_on_each_cpu(). I continue to think this issue awhile.
>
> This is about avoiding work when there is non, clearly when an
> application does use the kernel it creates work.
>
> But a clearly userspace, cpu-bound process, while(1), should not get
> interrupted by things like lru_add_drain() when it doesn't have any
> pages to drain.

Yup, makes sense.
So, I think you mean you'd like to tackle this special case as a first step, right?
If yes, I agree.
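
Just to make that special case concrete, I imagine the idea looks
something like the sketch below -- this is not Peter's actual patch, and
cpu_has_lru_pvec_pages() is a made-up helper standing in for "this cpu
has something cached in its lru pagevecs":

#include <linux/cpu.h>
#include <linux/cpumask.h>
#include <linux/errno.h>
#include <linux/gfp.h>
#include <linux/percpu.h>
#include <linux/swap.h>
#include <linux/workqueue.h>

static void lru_add_drain_per_cpu(struct work_struct *dummy)
{
	lru_add_drain();
}

static DEFINE_PER_CPU(struct work_struct, lru_drain_work);

/* rough sketch: queue drain work only on CPUs that actually have pages
 * cached in their per-cpu LRU pagevecs, so a CPU spinning in pure
 * userspace is never interrupted */
int lru_add_drain_all_lazy(void)
{
	cpumask_var_t mask;
	int cpu;

	if (!alloc_cpumask_var(&mask, GFP_KERNEL))
		return -ENOMEM;
	cpumask_clear(mask);

	get_online_cpus();
	for_each_online_cpu(cpu) {
		/* cpu_has_lru_pvec_pages() is a made-up helper; it would
		 * look at lru_add_pvecs/lru_rotate_pvecs for that cpu */
		if (!cpu_has_lru_pvec_pages(cpu))
			continue;
		cpumask_set_cpu(cpu, mask);
		INIT_WORK(&per_cpu(lru_drain_work, cpu), lru_add_drain_per_cpu);
		schedule_work_on(cpu, &per_cpu(lru_drain_work, cpu));
	}
	for_each_cpu(cpu, mask)
		flush_work(&per_cpu(lru_drain_work, cpu));
	put_online_cpus();

	free_cpumask_var(mask);
	return 0;
}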


> > > There is nothing that makes lru_add_drain_all() the only such site, its
> > > the one Mike posted to me, and my patch was a way to deal with that.
> >
> > Well, schedule_on_each_cpu() is very limited used function.
> > Practically we can ignore other caller.
>
> No, we need to inspect all callers, having only a few makes that easier.

Sorry for my poor English. I meant that I don't oppose your patch approach. I don't oppose
the additional work at all.


>
> > > I also explained that its not only RT related in that the HPC folks also
> > > want to avoid unneeded work -- for them its not starvation but a
> > > performance issue.
> >
> > I think you talked about OS jitter issue. if so, I don't think this issue
> > make serious problem. OS jitter mainly be caused by periodic action
> > (e.g. tick update, timer, vmstat update). it's because
> > little-delay x plenty-times = large-delay
> >
> > lru_add_drain_all() is called from very limited point. e.g. mlock, shm-lock,
> > page-migration, memory-hotplug. all caller is not periodic.
>
> Doesn't matter, if you want to reduce it, you need to address all of
> them, a process 4 nodes away calling mlock() while this partition has
> been user-bound for the last hour or so and doesn't have any lru pages
> simply needn't be woken.

Doesn't matter? You mean we can stop discussing the HPC performance issue
that Christoph pointed out?
Hmmm, sorry, I haven't caught your point.

Mike Galbraith

2009/09/09 0:20:14
To:
On Tue, 2009-09-08 at 12:34 -0700, Anirban Sinha wrote:
> >Maybe. Short term hack. My current thoughts on the subject, after some
> >testing, is that the patchlet should just die, and pondering the larger
> >solution should happen.
>
> Just curious, what is the larger solution?

That's what needs pondering :)

-Mike

KOSAKI Motohiro

2009/09/09 0:30:06
To:
> The usefulness of a scheme like this requires:
>
> 1. There are cpus that continually execute user space code
> without system interaction.
>
> 2. There are repeated VM activities that require page isolation /
> migration.
>
> The first page isolation activity will then clear the lru caches of the
> processes doing number crunching in user space (and therefore the first
> isolation will still interrupt). The second and following isolation will
> then no longer interrupt the processes.
>
> 2. is rare. So the question is if the additional code in the LRU handling
> can be justified. If lru handling is not time sensitive then yes.

Christoph, I'd like to discuss a somewhat related (and almost unrelated) thing.
I think page migration doesn't need lru_add_drain_all() to be synchronous, because
page migration retries up to 10 times.

Then an asynchronous lru_add_drain_all() would mean:

 - if the system isn't under heavy pressure, the retry succeeds.
 - if the system is under heavy pressure or an RT thread is running a busy loop, the retry fails.

I don't think this is problematic behavior. Also, mlock can use an asynchronous lru drain.

What do you think?

Christoph Lameter

2009/09/09 10:10:05
To:
On Wed, 9 Sep 2009, KOSAKI Motohiro wrote:

> Christoph, I'd like to discuss a bit related (and almost unrelated) thing.
> I think page migration don't need lru_add_drain_all() as synchronous, because
> page migration have 10 times retry.

True this is only an optimization that increases the chance of isolation
being successful. You dont need draining at all.

> Then asynchronous lru_add_drain_all() cause
>
> - if system isn't under heavy pressure, retry succussfull.
> - if system is under heavy pressure or RT-thread work busy busy loop, retry failure.
>
> I don't think this is problematic bahavior. Also, mlock can use asynchrounous lru drain.
>
> What do you think?

The retries can be very fast if the migrate pages list is small. The
migrate attempts may be finished before the IPI can be processed by the
other cpus.

Minchan Kim

2009/09/09 11:40:05
To:
On Wed, Sep 9, 2009 at 1:27 PM, KOSAKI Motohiro
<kosaki....@jp.fujitsu.com> wrote:
>> The usefulness of a scheme like this requires:
>>
>> 1. There are cpus that continually execute user space code
>>    without system interaction.
>>
>> 2. There are repeated VM activities that require page isolation /
>>    migration.
>>
>> The first page isolation activity will then clear the lru caches of the
>> processes doing number crunching in user space (and therefore the first
>> isolation will still interrupt). The second and following isolation will
>> then no longer interrupt the processes.
>>
>> 2. is rare. So the question is if the additional code in the LRU handling
>> can be justified. If lru handling is not time sensitive then yes.
>
> Christoph, I'd like to discuss a bit related (and almost unrelated) thing.
> I think page migration don't need lru_add_drain_all() as synchronous, because
> page migration have 10 times retry.
>
> Then asynchronous lru_add_drain_all() cause
>
> - if system isn't under heavy pressure, retry succussfull.
> - if system is under heavy pressure or RT-thread work busy busy loop, retry failure.
>
> I don't think this is problematic bahavior. Also, mlock can use asynchrounous lru drain.

I think, more precisely, we don't have to drain lru pages for mlocking.
Mlocked pages will go into the unevictable lru via try_to_unmap when
lru shrinking happens.
How about removing the draining in the case of mlock?

>
> What do you think?
>

--
Kind regards,
Minchan Kim

Lee Schermerhorn

2009/09/09 12:20:08
To:


Remember how the code works: __mlock_vma_pages_range() loops calling
get_user_pages() to fault in batches of 16 pages and returns the page
pointers for mlocking. Mlocking now requires isolation from the lru.
If you don't drain after each call to get_user_pages(), up to a
pagevec's worth of pages [~14] will likely still be in the pagevec and
won't be isolatable/mlockable. We can end up with most of the pages
still on the normal lru lists. If we want to move to an almost
exclusively lazy culling of mlocked pages to the unevictable list, then
we can remove the drain. If we want to be more proactive in culling the
unevictable pages as we populate the vma, we'll want to keep the drain.
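
Roughly, the flow is something like the sketch below -- a simplified
reconstruction from memory, not the actual mm/mlock.c source; addr/end/mm
are as in the real function and the batch size of 16 matches the
description above:

	/* simplified sketch of the batching flow described above; not the
	 * real __mlock_vma_pages_range() */
	while (addr < end) {
		struct page *pages[16];
		int i, ret;

		/* fault in and pin up to 16 pages of the vma (write=1, force=0) */
		ret = get_user_pages(current, mm, addr, 16, 1, 0, pages, NULL);
		if (ret <= 0)
			break;

		/* without a drain here, up to ~14 of these pages may still sit
		 * in a per-cpu pagevec and cannot be isolated for mlocking */
		lru_add_drain();

		for (i = 0; i < ret; i++) {
			lock_page(pages[i]);
			mlock_vma_page(pages[i]);	/* cull to the unevictable list */
			unlock_page(pages[i]);
			put_page(pages[i]);
		}
		addr += ret * PAGE_SIZE;
	}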

Lee

Minchan Kim

2009/09/09 12:50:05
To:
Hi, Lee.
Long time no see. :)

Sorry for the confusion.
I meant not lru_add_drain but lru_add_drain_all.
The problem now is schedule_on_each_cpu.

Anyway, in that case a pagevec's worth of pages will be multiplied
by the number of CPUs, as you pointed out.

> still on the normal lru lists. If we want to move to an almost
> exclusively lazy culling of mlocked pages to the unevictable then we can
> remove the drain. If we want to be more proactive in culling the
> unevictable pages as we populate the vma, we'll want to keep the drain.
>

It's not good that lazy culling of many pages causes high reclaim overhead.
But right now lazy culling during reclaim is done only in shrink_page_list.
We could also do it in shrink_active_list's page_referenced path so that the
cost of lazy culling is spread out.

> Lee
>
>

--
Kind regards,
Minchan Kim

KOSAKI Motohiro

2009/09/09 19:50:06
To:
> On Wed, 9 Sep 2009, KOSAKI Motohiro wrote:
>
> > Christoph, I'd like to discuss a bit related (and almost unrelated) thing.
> > I think page migration don't need lru_add_drain_all() as synchronous, because
> > page migration have 10 times retry.
>
> True this is only an optimization that increases the chance of isolation
> being successful. You dont need draining at all.
>
> > Then asynchronous lru_add_drain_all() cause
> >
> > - if system isn't under heavy pressure, retry succussfull.
> > - if system is under heavy pressure or RT-thread work busy busy loop, retry failure.
> >
> > I don't think this is problematic bahavior. Also, mlock can use asynchrounous lru drain.
> >
> > What do you think?
>
> The retries can be very fast if the migrate pages list is small. The
> migrate attempts may be finished before the IPI can be processed by the
> other cpus.

Ah, I see. Yes, my last proposal is not good. A small migration might fail.

How about this?
- pass 1-2, lru_add_drain_all_async()
- pass 3-10, lru_add_drain_all()

This scheme might save the RT-thread case and never cause a regression
(I think); see the sketch below.

The last remaining problem is: if the pagevec of the CPU the RT thread is
bound to holds a migration-targeted page, migration still faces the same
issue, but we can't solve that...
The RT thread must use /proc/sys/vm/drop_caches properly.
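
In code terms, inside the migration retry loop it would look roughly like
this (only a sketch: lru_add_drain_all_async() is the variant proposed
here and does not exist yet, 'from' is the list of pages to migrate, and
the loop skeleton is modelled on migrate_pages()):

	int pass, retry = 1;
	struct page *page, *page2;

	/* sketch of the proposal: asynchronous drain for the first two
	 * passes, synchronous drain for the later ones */
	for (pass = 0; pass < 10 && retry; pass++) {
		retry = 0;

		if (pass < 2)
			lru_add_drain_all_async();	/* proposed, does not exist yet */
		else
			lru_add_drain_all();		/* fall back to synchronous */

		list_for_each_entry_safe(page, page2, from, lru) {
			/* ... try to isolate and move the page; bump 'retry'
			 * for failures that are worth another pass ... */
		}
	}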

KOSAKI Motohiro

2009/09/09 20:00:12
To:
> On Wed, Sep 9, 2009 at 1:27 PM, KOSAKI Motohiro
> <kosaki....@jp.fujitsu.com> wrote:
> >> The usefulness of a scheme like this requires:
> >>
> >> 1. There are cpus that continually execute user space code
> >>    without system interaction.
> >>
> >> 2. There are repeated VM activities that require page isolation /
> >>    migration.
> >>
> >> The first page isolation activity will then clear the lru caches of the
> >> processes doing number crunching in user space (and therefore the first
> >> isolation will still interrupt). The second and following isolation will
> >> then no longer interrupt the processes.
> >>
> >> 2. is rare. So the question is if the additional code in the LRU handling
> >> can be justified. If lru handling is not time sensitive then yes.
> >
> > Christoph, I'd like to discuss a bit related (and almost unrelated) thing.
> > I think page migration don't need lru_add_drain_all() as synchronous, because
> > page migration have 10 times retry.
> >
> > Then asynchronous lru_add_drain_all() cause
> >
> > - if system isn't under heavy pressure, retry succussfull.
> > - if system is under heavy pressure or RT-thread work busy busy loop, retry failure.
> >
> > I don't think this is problematic bahavior. Also, mlock can use asynchrounous lru drain.
>
> I think, more exactly, we don't have to drain lru pages for mlocking.
> Mlocked pages will go into unevictable lru due to
> try_to_unmap when shrink of lru happens.

Right.

> How about removing draining in case of mlock?

Umm, I don't like this, because doing no drain at all often produces strange test results.
I mean /proc/meminfo::Mlocked might show an unexpected value. It is not a leak, it's only lazy culling,
but many testers and administrators will think it's a bug... ;)

Practically, lru_add_drain_all() is nearly zero cost, because mlock's page fault is a very
costly operation that hides the drain cost. Right now we only want to handle a corner-case issue;
I don't want a dramatic change.

Minchan Kim

2009/09/09 21:10:06
To:

I agree. I have no objection to your approach. :)

> Practically, lru_add_drain_all() is nearly zero cost. because mlock's page fault is very
> costly operation. it hide drain cost. now, we only want to treat corner case issue.
> I don't hope dramatic change.

Another problem is as follows.

Although some CPUs don't have anything to do, we still make them do the work.
HPC guys don't want to consume CPU cycles, as Christoph pointed out.
I liked Peter's idea with regard to this.
My approach can solve it, too.
But I agree it would be a dramatic change.

--
Kind regards,
Minchan Kim

KOSAKI Motohiro

2009/09/09 21:20:05
To:

Is Peter's + my approach bad?

It means:

- the CPU the RT thread is bound to is not holding the page
  -> mlock succeeds thanks to Peter's improvement
- the CPU the RT thread is bound to is holding the page
  -> mlock succeeds thanks to my approach;
     the page is culled later.

Minchan Kim

2009/09/09 21:30:09
To:
On Thu, 10 Sep 2009 10:15:07 +0900 (JST)
KOSAKI Motohiro <kosaki....@jp.fujitsu.com> wrote:

Looks good to me! :)

> It mean,
>
> - RT-thread binding cpu is not grabbing the page
> -> mlock successful by Peter's improvement
> - RT-thread binding cpu is grabbing the page
> -> mlock successful by mine approach
> the page is culled later.
>
>
>
>


--
Kind regards,
Minchan Kim

Christoph Lameter

2009/09/10 14:10:07
To:
On Thu, 10 Sep 2009, KOSAKI Motohiro wrote:

> How about this?
> - pass 1-2, lru_add_drain_all_async()
> - pass 3-10, lru_add_drain_all()
>
> this scheme might save RT-thread case and never cause regression. (I think)

Sounds good.

> The last remain problem is, if RT-thread binding cpu's pagevec has migrate
> targetted page, migration still face the same issue.
> but we can't solve it...
> RT-thread must use /proc/sys/vm/drop_caches properly.

A system call "sys_os_quiet_down" may be useful. It would drain all
caches, fold counters etc etc so that there will be no OS activities
needed for those things later.
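
Purely as an illustration of that idea (no such syscall exists today, and
the body is only a placeholder for "drain everything this CPU has
deferred"):

#include <linux/swap.h>
#include <linux/syscalls.h>

/* illustration only: a hypothetical "quiet down" syscall as floated
 * above; nothing like this exists in the kernel today */
SYSCALL_DEFINE0(os_quiet_down)
{
	lru_add_drain_all();	/* flush per-cpu LRU pagevecs */
	/* ... fold per-cpu vm counters, drain other per-cpu caches, etc ... */
	return 0;
}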
