
[PATCH 0/2] timer: Migrate running timers


Viresh Kumar

Mar 31, 2015, 2:55:36 AM
to Ingo Molnar, Peter Zijlstra, Thomas Gleixner, linaro...@lists.linaro.org, linux-...@vger.kernel.org, Viresh Kumar, Steven Miao
Hi Ingo/Thomas/Peter,

While queuing a timer, we try to migrate it to a non-idle core if the local core
is idle, but we don't try that if the timer is re-armed from its handler.

There were a few unsolved problems due to which this was avoided until now, but
there are cases where solving them is worthwhile. When a timer is always
re-armed from its handler, it never migrates to another core, and many times it
ends up waking an idle core just to service the timer, which a non-idle core
could have handled.

Peter suggested [1] a few changes which can make that work, and the first patch
does exactly that. The second is a minor improvement that replaces the
'running_timer' pointer with a 'busy' flag. That variable was required as part
of a sanity check during the CPU hot-unplug operation. I was not sure whether
we should drop this extra variable ('running_timer' or 'busy') and the sanity
check.

Because we use another bit of the base pointer to keep track of the timer's
running status, we get a build error on blackfin, as it doesn't respect
____cacheline_aligned [2].

kernel/time/timer.c: In function 'init_timers':
kernel/time/timer.c:1731:2: error: call to '__compiletime_assert_1731' declared
with attribute error: BUILD_BUG_ON failed: __alignof__(struct tvec_base)
& TIMER_FLAG_MASK

--
viresh

[1] https://lkml.org/lkml/2015/3/28/32
[2] https://lkml.org/lkml/2015/3/29/178

Cc: Steven Miao <rea...@gmail.com>

Viresh Kumar (2):
timer: Avoid waking up an idle-core by migrate running timer
timer: Replace base-> 'running_timer' with 'busy'

include/linux/timer.h | 3 +-
kernel/time/timer.c | 102 ++++++++++++++++++++++++++++++++++++++------------
2 files changed, 81 insertions(+), 24 deletions(-)

--
2.3.0.rc0.44.ga94655d

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majo...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

Peter Zijlstra

Mar 31, 2015, 10:53:34 AM
to Viresh Kumar, Ingo Molnar, Thomas Gleixner, linaro...@lists.linaro.org, linux-...@vger.kernel.org, Steven Miao
On Tue, Mar 31, 2015 at 12:25:16PM +0530, Viresh Kumar wrote:
> Cc: Steven Miao <rea...@gmail.com>


> +#define TIMER_FLAG_MASK 0x7LU

So Steven, this will break compilation on blackfin because that makes
____cacheline_aligned a NOP while we assume it will generate
__attribute__((__aligned__(SMP_CACHE_BYTES))) with the further
assumption that SMP_CACHE_BYTES >= 8.

Can you explain why blackfin chooses to break ____cacheline_aligned for
UP? We have the *_in_smp variants to distinguish these cases.

Viresh Kumar

Mar 31, 2015, 11:01:34 AM
to Ingo Molnar, Peter Zijlstra, Thomas Gleixner, Linaro Kernel Mailman List, Linux Kernel Mailing List, Viresh Kumar, Steven Miao
On 31 March 2015 at 12:25, Viresh Kumar <viresh...@linaro.org> wrote:
> While queuing a timer, we try to migrate it to a non-idle core if the local core
> is idle, but we don't try that if the timer is re-armed from its handler.
>
> There were a few unsolved problems due to which this was avoided until now, but
> there are cases where solving them is worthwhile. When a timer is always
> re-armed from its handler, it never migrates to another core, and many times it
> ends up waking an idle core just to service the timer, which a non-idle core
> could have handled.
>
> Peter suggested [1] a few changes which can make that work, and the first patch
> does exactly that. The second is a minor improvement that replaces the
> 'running_timer' pointer with a 'busy' flag. That variable was required as part
> of a sanity check during the CPU hot-unplug operation. I was not sure whether
> we should drop this extra variable ('running_timer' or 'busy') and the sanity
> check.

Peter reminded me that I failed to say whether it really worked or not :)

So yes it worked. I tested this on a Dual-core ARM cortex-A15 board with 5
timers getting re-armed 100 times each from their handler.

Most of the time the remote CPU was idle (along with the local one) and
so migration didn't happen.

But whenever the local CPU was idle and the remote one wasn't, timers got
successfully migrated.

My branches are also tested by Fengguang's build-bot, and in case of any
wreckage on other machines, we will be informed :)

--
viresh

Thomas Gleixner

Apr 14, 2015, 7:13:27 PM
to Viresh Kumar, Ingo Molnar, Peter Zijlstra, linaro...@lists.linaro.org, linux-...@vger.kernel.org, Steven Miao
On Tue, 31 Mar 2015, Viresh Kumar wrote:
> @@ -1189,12 +1195,41 @@ static inline void __run_timers(struct tvec_base *base)
> cascade(base, &base->tv5, INDEX(3));
> ++base->timer_jiffies;
> list_replace_init(base->tv1.vec + index, head);
> +
> +again:
> while (!list_empty(head)) {
> void (*fn)(unsigned long);
> unsigned long data;
> bool irqsafe;
>
> - timer = list_first_entry(head, struct timer_list,entry);
> + timer = list_first_entry(head, struct timer_list, entry);
> +
> + if (unlikely(timer_running(timer))) {
> + /*
> + * The only way to get here is if the handler,
> + * running on another base, re-queued itself on
> + * this base, and the handler hasn't finished
> + * yet.
> + */
> +
> + if (list_is_last(&timer->entry, head)) {
> + /*
> + * Drop lock, so that TIMER_RUNNING can
> + * be cleared on another base.
> + */
> + spin_unlock(&base->lock);
> +
> + while (timer_running(timer))
> + cpu_relax();
> +
> + spin_lock(&base->lock);
> + } else {
> + /* Rotate the list and try someone else */
> + list_move_tail(&timer->entry, head);
> + }

Can we please stick that timer into the next bucket and be done with it?

> + goto again;

"continue;" is old school, right?

> + }
> +
> fn = timer->function;
> data = timer->data;
> irqsafe = tbase_get_irqsafe(timer->base);
> @@ -1202,6 +1237,7 @@ static inline void __run_timers(struct tvec_base *base)
> timer_stats_account_timer(timer);
>
> base->running_timer = timer;
> + timer_set_running(timer);
> detach_expired_timer(timer, base);
>
> if (irqsafe) {
> @@ -1213,6 +1249,25 @@ static inline void __run_timers(struct tvec_base *base)
> call_timer_fn(timer, fn, data);
> spin_lock_irq(&base->lock);
> }
> +
> + /*
> + * Handler running on this base, re-queued itself on
> + * another base ?
> + */
> + if (unlikely(timer->base != base)) {
> + unsigned long flags;
> + struct tvec_base *tbase;
> +
> + spin_unlock(&base->lock);
> +
> + tbase = lock_timer_base(timer, &flags);
> + timer_clear_running(timer);
> + spin_unlock(&tbase->lock);
> +
> + spin_lock(&base->lock);
> + } else {
> + timer_clear_running(timer);
> + }

Aside from the above, this is really horrible. Why not do the obvious:

__mod_timer()

if (base != newbase) {
if (timer_running()) {
list_add(&base->migration_list);
goto out_unlock;
}
.....

__run_timers()

while(!list_empty(head)) {
....
}

if (unlikely(!list_empty(&base->migration_list)) {
/* dequeue and requeue again */
}

Simple, isn't it?

No new flags in the timer base, no concurrent expiry, no extra
conditionals in the expiry path. Just a single extra check at the end
of the softirq handler for this rare condition instead of imposing all
that nonsense to the hotpath.

We might even optimize that:

if (timer_running()) {
list_add(&base->migration_list);
base->preferred_target = newbase;
goto out_unlock;
}

if (unlikely(!list_empty(&base->migration_list)) {
/* dequeue and requeue again */
while (!list_empty(&base->migration_list)) {
dequeue_timer();
newbase = base->preferred_target;
unlock(base);
lock(newbase);
enqueue(newbase);
unlock(newbase);
lock(base);
}
}

Thanks,

tglx

Thomas Gleixner

Apr 15, 2015, 11:54:30 AM
to Viresh Kumar, Ingo Molnar, Peter Zijlstra, linaro...@lists.linaro.org, linux-...@vger.kernel.org, Steven Miao
On Tue, 31 Mar 2015, Viresh Kumar wrote:
> @@ -1213,6 +1249,25 @@ static inline void __run_timers(struct tvec_base *base)
> call_timer_fn(timer, fn, data);
> spin_lock_irq(&base->lock);
> }
> +
> + /*
> + * Handler running on this base, re-queued itself on
> + * another base ?
> + */
> + if (unlikely(timer->base != base)) {
> + unsigned long flags;
> + struct tvec_base *tbase;
> +
> + spin_unlock(&base->lock);
> +
> + tbase = lock_timer_base(timer, &flags);
> + timer_clear_running(timer);
> + spin_unlock(&tbase->lock);
> +
> + spin_lock(&base->lock);
> + } else {
> + timer_clear_running(timer);
> + }

And just for the record:

Dereferencing timer _AFTER_ the callback function is a big NONO. The
callback function is allowed to free the timer. See the comment in
call_timer_fn()

Oh well,

tglx

viresh kumar

Apr 17, 2015, 4:13:17 AM
to Thomas Gleixner, Ingo Molnar, Peter Zijlstra, linaro...@lists.linaro.org, linux-...@vger.kernel.org, Steven Miao, sha...@codeaurora.org

Hi Thomas,

On Wednesday 15 April 2015 04:43 AM, Thomas Gleixner wrote:
> No new flags in the timer base, no concurrent expiry, no extra
> conditionals in the expiry path. Just a single extra check at the end
> of the softirq handler for this rare condition instead of imposing all
> that nonsense to the hotpath.

Here is what I could get to based on your suggestions:

diff --git a/kernel/time/timer.c b/kernel/time/timer.c
index 2d3f5c504939..c412511d34b8 100644
--- a/kernel/time/timer.c
+++ b/kernel/time/timer.c
@@ -78,6 +78,8 @@ struct tvec_root {
struct tvec_base {
spinlock_t lock;
struct timer_list *running_timer;
+ struct list_head migration_list;
+ struct tvec_base *preferred_target;
unsigned long timer_jiffies;
unsigned long next_timer;
unsigned long active_timers;
@@ -785,10 +787,18 @@ __mod_timer(struct timer_list *timer, unsigned long expires,
* We are trying to schedule the timer on the local CPU.
* However we can't change timer's base while it is running,
* otherwise del_timer_sync() can't detect that the timer's
- * handler yet has not finished. This also guarantees that
- * the timer is serialized wrt itself.
+ * handler yet has not finished.
+ *
+ * Move timer to migration_list which can be processed after all
+ * timers in current cycle are serviced. This also guarantees
+ * that the timer is serialized wrt itself.
*/
- if (likely(base->running_timer != timer)) {
+ if (unlikely(base->running_timer == timer)) {
+ timer->expires = expires;
+ base->preferred_target = new_base;
+ list_add_tail(&timer->entry, &base->migration_list);
+ goto out_unlock;
+ } else {
/* See the comment in lock_timer_base() */
timer_set_base(timer, NULL);
spin_unlock(&base->lock);
@@ -1214,6 +1224,33 @@ static inline void __run_timers(struct tvec_base *base)
spin_lock_irq(&base->lock);
}
}
+
+ /*
+ * Timer handler re-armed timer and we didn't wanted to add that
+ * on a idle local CPU. Its handler has finished now, lets
+ * enqueue it again.
+ */
+ if (unlikely(!list_empty(&base->migration_list))) {
+ struct tvec_base *new_base = base->preferred_target;
+
+ do {
+ timer = list_first_entry(&base->migration_list,
+ struct timer_list, entry);
+
+ __list_del(timer->entry.prev, timer->entry.next);
+
+ /* See the comment in lock_timer_base() */
+ timer_set_base(timer, NULL);
+ spin_unlock(&base->lock);
+
+ spin_lock(&new_base->lock);
+ timer_set_base(timer, new_base);
+ internal_add_timer(new_base, timer);
+ spin_unlock(&new_base->lock);
+
+ spin_lock(&base->lock);
+ } while (!list_empty(&base->migration_list));
+ }
}
base->running_timer = NULL;
spin_unlock_irq(&base->lock);
@@ -1583,6 +1620,7 @@ static int init_timers_cpu(int cpu)
for (j = 0; j < TVR_SIZE; j++)
INIT_LIST_HEAD(base->tv1.vec + j);

+ INIT_LIST_HEAD(&base->migration_list);
base->timer_jiffies = jiffies;
base->next_timer = base->timer_jiffies;
base->active_timers = 0;


Does this look any better ?

I tested this on Exynos (Dual ARM Cortex-A9) board, with ubuntu over it.
System was fairly idle and I did 'dmesg > /dev/null' on one of the CPUs
in an infinite loop, to get CPUs out of idle.


I have got a few more concerns/diffs over this to further optimize things,
but kept them separate so that I can drop them if they aren't correct.

1.) Should the above list_empty(migration_list) block be moved outside the

while (time_after_eq(jiffies, base->timer_jiffies))

block, so that we check it only once per timer interrupt?

Also ->running_timer is set to the last serviced timer and a
del_timer_sync() might be waiting to remove it. But we process the
migration list first, without clearing ->running_timer. Not sure if this
is important at all.


2.) By the time we finish serving all pending timers, local CPU might not be
idle anymore OR the target CPU may become idle.

diff --git a/kernel/time/timer.c b/kernel/time/timer.c
index c412511d34b8..d75d31e9dc49 100644
--- a/kernel/time/timer.c
+++ b/kernel/time/timer.c
@@ -1233,6 +1233,14 @@ static inline void __run_timers(struct tvec_base *base)
if (unlikely(!list_empty(&base->migration_list))) {
struct tvec_base *new_base = base->preferred_target;

+ if (!idle_cpu(base->cpu)) {
+ /* Local CPU not idle anymore */
+ new_base = base;
+ } else if (idle_cpu(new_base->cpu)) {
+ /* Re-evaluate base, target CPU has gone idle */
+ new_base = per_cpu(tvec_bases, get_nohz_timer_target(false));
+ }
+
do {
timer = list_first_entry(&base->migration_list,
struct timer_list, entry);


The first if block is hit a lot of times in my tests, but the second
one has never been hit (probably because there are only two CPUs;
not sure).


3.) It might be better to update preferred_target every time we choose a
different base. This may help us avoid calling get_nohz_timer_target()
in the second if block above.

diff --git a/kernel/time/timer.c b/kernel/time/timer.c
index d75d31e9dc49..558cd98abd87 100644
--- a/kernel/time/timer.c
+++ b/kernel/time/timer.c
@@ -783,6 +783,8 @@ __mod_timer(struct timer_list *timer, unsigned long expires,
new_base = per_cpu(tvec_bases, cpu);

if (base != new_base) {
+ base->preferred_target = new_base;
+
/*
* We are trying to schedule the timer on the local CPU.
* However we can't change timer's base while it is running,
@@ -795,7 +797,6 @@ __mod_timer(struct timer_list *timer, unsigned long expires,
*/
if (unlikely(base->running_timer == timer)) {
timer->expires = expires;
- base->preferred_target = new_base;
list_add_tail(&timer->entry, &base->migration_list);
goto out_unlock;
} else {



--
viresh

Ingo Molnar

Apr 17, 2015, 4:33:09 AM
to viresh kumar, Thomas Gleixner, Ingo Molnar, Peter Zijlstra, linaro...@lists.linaro.org, linux-...@vger.kernel.org, Steven Miao, sha...@codeaurora.org

* viresh kumar <viresh...@linaro.org> wrote:

> + /*
> + * Timer handler re-armed timer and we didn't wanted to add that
> + * on a idle local CPU. Its handler has finished now, lets
> + * enqueue it again.
> + */

That single comment has 5 grammatical errors!

Thanks,

Ingo

Thomas Gleixner

Apr 21, 2015, 5:32:22 PM
to viresh kumar, Ingo Molnar, Peter Zijlstra, linaro...@lists.linaro.org, linux-...@vger.kernel.org, Steven Miao, sha...@codeaurora.org
On Fri, 17 Apr 2015, viresh kumar wrote:
> Does this look any better ?

Yes, but:

> 1.) Should the above list_empty(migration_list) block be moved outside the
>
> while (time_after_eq(jiffies, base->timer_jiffies))
>
> block, so that we check it only once per timer interrupt?

That's what I suggested.

> Also ->running_timer is set to the last serviced timer and a
> del_timer_sync() might be waiting to remove it. But we process the
> migration list first, without clearing ->running_timer. Not sure if this
> is important at all.

Of course it is and doing it outside of the loop solves that issue.

> 2.) By the time we finish serving all pending timers, local CPU might not be
> idle anymore OR the target CPU may become idle.

That's another good reason to move that migration list outside of the
while loop.

> 3.) It might be better to update preferred_target every time we choose a
> different base. This may help us avoid calling get_nohz_timer_target()
> in the second if block above.

Yuck no. You can do that in that migration code and not add more
pointless crap to the normal case.

Are you realizing that __mod_timer() is a massive hotpath for network
heavy workloads?

This stuff needs to be debloated and not mindlessly packed with more
corner case features.

Thanks,

tglx

Eric Dumazet

Apr 21, 2015, 5:55:04 PM
to Thomas Gleixner, viresh kumar, Ingo Molnar, Peter Zijlstra, linaro...@lists.linaro.org, linux-...@vger.kernel.org, Steven Miao, sha...@codeaurora.org
On Tue, 2015-04-21 at 23:32 +0200, Thomas Gleixner wrote:

>
> Are you realizing that __mod_timer() is a massive hotpath for network
> heavy workloads?

BTW I was considering using mod_timer_pinned() from these networking
timers (ie sk_reset_timer())

get_nohz_timer_target() sounds cool for laptop users, but is one cause
for bad responses to DDOS, when the selected cpu gets stressed.

This is the reason I used mod_timer_pinned() in commit 789f558cfb3680ae
("tcp/dccp: get rid of central timewait timer")

Peter Zijlstra

Apr 22, 2015, 11:29:52 AM
to Eric Dumazet, Thomas Gleixner, viresh kumar, Ingo Molnar, linaro...@lists.linaro.org, linux-...@vger.kernel.org, Steven Miao, sha...@codeaurora.org
On Tue, Apr 21, 2015 at 02:54:55PM -0700, Eric Dumazet wrote:
> On Tue, 2015-04-21 at 23:32 +0200, Thomas Gleixner wrote:
>
> >
> > Are you realizing that __mod_timer() is a massive hotpath for network
> > heavy workloads?
>
> BTW I was considering using mod_timer_pinned() from these networking
> timers (ie sk_reset_timer())
>
> get_nohz_timer_target() sounds cool for laptop users, but is one cause
> for bad responses to DDOS, when the selected cpu gets stressed.
>
> This is the reason I used mod_timer_pinned() in commit 789f558cfb3680ae
> ("tcp/dccp: get rid of central timewait timer")

Hmm, that sounds unfortunate; this would wreck life for the power-aware
laptop/tablet etc. people.

There is already a sysctl to manage this, is that not enough to mitigate
this problem on the server side of things?

Eric Dumazet

Apr 22, 2015, 12:03:06 PM
to Peter Zijlstra, Thomas Gleixner, viresh kumar, Ingo Molnar, linaro...@lists.linaro.org, linux-...@vger.kernel.org, Steven Miao, sha...@codeaurora.org
On Wed, 2015-04-22 at 17:29 +0200, Peter Zijlstra wrote:

> Hmm, that sounds unfortunate, this would wreck life for the power aware
> laptop/tablet etc.. people.
>
> There is already a sysctl to manage this, is that not enough to mitigate
> this problem on the server side of things?

The thing is: 99% of networking timers never fire.

But when they _do_ fire, and the host is under attack, they all fire on
an unrelated cpu and this one cannot keep up.

Added latencies fire monitoring alerts.

Check commit 4a8e320c929991c9480 ("net: sched: use pinned timers")
for a specific example of the problems that can be raised.

When we set a timer to fire in 10 seconds, knowing the _current_ idle
state of the cpus is of no help.

Add to this that softirq processing is not considered to make the
current cpu non-idle.

Networking tried hard to use cpu affinities (and all the techniques
described in Documentation/networking/scaling.txt),
but /proc/sys/kernel/timer_migration adds a fair overhead in many
workloads.

get_nohz_timer_target() has to touch 3 cache lines per cpu...

It's in the top 10 in "perf top" profiles on servers with 72 threads.

This /proc/sys/kernel/timer_migration should have been instead :

/proc/sys/kernel/timer_on_a_single_cpu_for_laptop_sake

Thomas Gleixner

Apr 22, 2015, 2:56:32 PM
to Eric Dumazet, Peter Zijlstra, viresh kumar, Ingo Molnar, linaro...@lists.linaro.org, linux-...@vger.kernel.org, Steven Miao, sha...@codeaurora.org
On Wed, 22 Apr 2015, Eric Dumazet wrote:
> Check commit 4a8e320c929991c9480 ("net: sched: use pinned timers")
> for a specific example of the problems that can be raised.

If you have a problem with the core timer code then it should be fixed
there and not worked around in some place which will ruin stuff for
power saving interested users. I'm so tired of this 'I fix it in my
sandbox' attitude, really. If the core code has a shortcoming we fix
it there right away because you are probably not the only one who runs
into that shortcoming. So if we don't fix it in the core we end up
with a metric ton of slightly different (or broken) workarounds which
affect the workload/system characteristics of other people.

Just for the record. Even the changelog of this commit is blatantly
wrong:

"We can see that timers get migrated into a single cpu, presumably
idle at the time timers are set up."

The timer migration moves timers to non idle cpus to leave the idle
ones alone for power saving sake.

I can see and understand the reason why you want to avoid that, but I
have to ask the question whether this pinning is the correct behaviour
under all workloads and system characteristics. If yes, then the patch
is the right answer, if no, then it is simply the wrong approach.

> but /proc/sys/kernel/timer_migration adds a fair overhead in many
> workloads.
>
> get_nohz_timer_target() has to touch 3 cache lines per cpu...

And this is something we can fix and completely avoid if we think
about it. Looking at the code I have to admit that the out-of-line
call and the sysctl variable lookup are silly. But it's not rocket
science to fix this.

Thanks,

tglx

Eric Dumazet

Apr 22, 2015, 3:59:13 PM
to Thomas Gleixner, Peter Zijlstra, viresh kumar, Ingo Molnar, linaro...@lists.linaro.org, linux-...@vger.kernel.org, Steven Miao, sha...@codeaurora.org
On Wed, 2015-04-22 at 20:56 +0200, Thomas Gleixner wrote:
> On Wed, 22 Apr 2015, Eric Dumazet wrote:
> > Check commit 4a8e320c929991c9480 ("net: sched: use pinned timers")
> > for a specific example of the problems that can be raised.
>
> If you have a problem with the core timer code then it should be fixed
> there and not worked around in some place which will ruin stuff for
> power saving interested users. I'm so tired of this 'I fix it in my
> sandbox' attitude, really. If the core code has a shortcoming we fix
> it there right away because you are probably not the only one who runs
> into that shortcoming. So if we don't fix it in the core we end up
> with a metric ton of slightly different (or broken) workarounds which
> affect the workload/system characteristics of other people.
>
> Just for the record. Even the changelog of this commit is blatantly
> wrong:
>
> "We can see that timers get migrated into a single cpu, presumably
> idle at the time timers are set up."

Spare me the obvious typo. A 'not' is missing.

>
> The timer migration moves timers to non idle cpus to leave the idle
> ones alone for power saving sake.
>
> I can see and understand the reason why you want to avoid that, but I
> have to ask the question whether this pinning is the correct behaviour
> under all workloads and system characteristics. If yes, then the patch
> is the right answer, if no, then it is simply the wrong approach.
>
> > but /proc/sys/kernel/timer_migration adds a fair overhead in many
> > workloads.
> >
> > get_nohz_timer_target() has to touch 3 cache lines per cpu...
>
> And this is something we can fix and completely avoid if we think
> about it. Looking at the code I have to admit that the out of line
> call and the sysctl variable lookup is silly. But its not rocket
> science to fix this.

Awesome.

Thomas Gleixner

Apr 22, 2015, 5:57:08 PM
to Eric Dumazet, Peter Zijlstra, viresh kumar, Ingo Molnar, linaro...@lists.linaro.org, linux-...@vger.kernel.org, Steven Miao, sha...@codeaurora.org
On Wed, 22 Apr 2015, Eric Dumazet wrote:
> On Wed, 2015-04-22 at 20:56 +0200, Thomas Gleixner wrote:
> > On Wed, 22 Apr 2015, Eric Dumazet wrote:
> > > Check commit 4a8e320c929991c9480 ("net: sched: use pinned timers")
> > > for a specific example of the problems that can be raised.
> >
> > If you have a problem with the core timer code then it should be fixed
> > there and not worked around in some place which will ruin stuff for
> > power saving interested users. I'm so tired of this 'I fix it in my
> > sandbox' attitude, really. If the core code has a shortcoming we fix
> > it there right away because you are probably not the only one who runs
> > into that shortcoming. So if we don't fix it in the core we end up
> > with a metric ton of slightly different (or broken) workarounds which
> > affect the workload/system characteristics of other people.
> >
> > Just for the record. Even the changelog of this commit is blatantly
> > wrong:
> >
> > "We can see that timers get migrated into a single cpu, presumably
> > idle at the time timers are set up."
>
> Spare me the obvious typo. A 'not' is missing.

:)

> >
> > The timer migration moves timers to non idle cpus to leave the idle
> > ones alone for power saving sake.
> >
> > I can see and understand the reason why you want to avoid that, but I
> > have to ask the question whether this pinning is the correct behaviour
> > under all workloads and system characteristics. If yes, then the patch
> > is the right answer, if no, then it is simply the wrong approach.

I take your 'Awesome' below as a no then.

> > > but /proc/sys/kernel/timer_migration adds a fair overhead in many
> > > workloads.
> > >
> > > get_nohz_timer_target() has to touch 3 cache lines per cpu...
> >
> > And this is something we can fix and completely avoid if we think
> > about it. Looking at the code I have to admit that the out of line
> > call and the sysctl variable lookup is silly. But its not rocket
> > science to fix this.
>
> Awesome.

Here you go. Completely untested, at least it compiles.

Thanks,

tglx
---

Index: linux/include/linux/sched.h
===================================================================
--- linux.orig/include/linux/sched.h
+++ linux/include/linux/sched.h
@@ -343,14 +343,10 @@ extern int runqueue_is_locked(int cpu);
#if defined(CONFIG_SMP) && defined(CONFIG_NO_HZ_COMMON)
extern void nohz_balance_enter_idle(int cpu);
extern void set_cpu_sd_state_idle(void);
-extern int get_nohz_timer_target(int pinned);
+extern int get_nohz_timer_target(void);
#else
static inline void nohz_balance_enter_idle(int cpu) { }
static inline void set_cpu_sd_state_idle(void) { }
-static inline int get_nohz_timer_target(int pinned)
-{
- return smp_processor_id();
-}
#endif

/*
Index: linux/include/linux/sched/sysctl.h
===================================================================
--- linux.orig/include/linux/sched/sysctl.h
+++ linux/include/linux/sched/sysctl.h
@@ -57,24 +57,12 @@ extern unsigned int sysctl_numa_balancin
extern unsigned int sysctl_sched_migration_cost;
extern unsigned int sysctl_sched_nr_migrate;
extern unsigned int sysctl_sched_time_avg;
-extern unsigned int sysctl_timer_migration;
extern unsigned int sysctl_sched_shares_window;

int sched_proc_update_handler(struct ctl_table *table, int write,
void __user *buffer, size_t *length,
loff_t *ppos);
#endif
-#ifdef CONFIG_SCHED_DEBUG
-static inline unsigned int get_sysctl_timer_migration(void)
-{
- return sysctl_timer_migration;
-}
-#else
-static inline unsigned int get_sysctl_timer_migration(void)
-{
- return 1;
-}
-#endif

/*
* control realtime throttling:
Index: linux/include/linux/timer.h
===================================================================
--- linux.orig/include/linux/timer.h
+++ linux/include/linux/timer.h
@@ -254,6 +254,16 @@ extern void run_local_timers(void);
struct hrtimer;
extern enum hrtimer_restart it_real_fn(struct hrtimer *);

+#if defined(CONFIG_SMP) && defined(CONFIG_NO_HZ_COMMON)
+#include <linux/sysctl.h>
+
+extern unsigned int sysctl_timer_migration;
+int timer_migration_handler(struct ctl_table *table, int write,
+ void __user *buffer, size_t *lenp,
+ loff_t *ppos);
+extern struct static_key timer_migration_enabled;
+#endif
+
unsigned long __round_jiffies(unsigned long j, int cpu);
unsigned long __round_jiffies_relative(unsigned long j, int cpu);
unsigned long round_jiffies(unsigned long j);
Index: linux/kernel/sched/core.c
===================================================================
--- linux.orig/kernel/sched/core.c
+++ linux/kernel/sched/core.c
@@ -593,13 +593,12 @@ void resched_cpu(int cpu)
* selecting an idle cpu will add more delays to the timers than intended
* (as that cpu's timer base may not be uptodate wrt jiffies etc).
*/
-int get_nohz_timer_target(int pinned)
+int get_nohz_timer_target(void)
{
- int cpu = smp_processor_id();
- int i;
+ int i, cpu = smp_processor_id();
struct sched_domain *sd;

- if (pinned || !get_sysctl_timer_migration() || !idle_cpu(cpu))
+ if (!idle_cpu(cpu))
return cpu;

rcu_read_lock();
@@ -7088,8 +7087,6 @@ void __init sched_init_smp(void)
}
#endif /* CONFIG_SMP */

-const_debug unsigned int sysctl_timer_migration = 1;
-
int in_sched_functions(unsigned long addr)
{
return in_lock_functions(addr) ||
Index: linux/kernel/sysctl.c
===================================================================
--- linux.orig/kernel/sysctl.c
+++ linux/kernel/sysctl.c
@@ -349,15 +349,6 @@ static struct ctl_table kern_table[] = {
.mode = 0644,
.proc_handler = proc_dointvec,
},
- {
- .procname = "timer_migration",
- .data = &sysctl_timer_migration,
- .maxlen = sizeof(unsigned int),
- .mode = 0644,
- .proc_handler = proc_dointvec_minmax,
- .extra1 = &zero,
- .extra2 = &one,
- },
#endif /* CONFIG_SMP */
#ifdef CONFIG_NUMA_BALANCING
{
@@ -1132,6 +1123,15 @@ static struct ctl_table kern_table[] = {
.extra1 = &zero,
.extra2 = &one,
},
+#if defined(CONFIG_SMP) && defined(CONFIG_NO_HZ_COMMON)
+ {
+ .procname = "timer_migration",
+ .data = &sysctl_timer_migration,
+ .maxlen = sizeof(unsigned int),
+ .mode = 0644,
+ .proc_handler = timer_migration_handler,
+ },
+#endif
{ }
};

Index: linux/kernel/time/hrtimer.c
===================================================================
--- linux.orig/kernel/time/hrtimer.c
+++ linux/kernel/time/hrtimer.c
@@ -190,6 +190,23 @@ hrtimer_check_target(struct hrtimer *tim
#endif
}

+#if defined(CONFIG_SMP) && defined(CONFIG_NO_HZ_COMMON)
+static inline struct hrtimer_cpu_base *get_target_base(int pinned)
+{
+ if (pinned)
+ return this_cpu_ptr(&hrtimer_bases);
+ if (static_key_true(&timer_migration_enabled))
+ return &per_cpu(hrtimer_bases, get_nohz_timer_target());
+ return this_cpu_ptr(&hrtimer_bases);
+}
+#else
+
+static inline struct hrtimer_cpu_base *get_target_base(int pinned)
+{
+ return this_cpu_ptr(&hrtimer_bases);
+}
+#endif
+
/*
* Switch the timer base to the current CPU when possible.
*/
@@ -197,14 +214,13 @@ static inline struct hrtimer_clock_base
switch_hrtimer_base(struct hrtimer *timer, struct hrtimer_clock_base *base,
int pinned)
{
+ struct hrtimer_cpu_base *new_cpu_base, *this_base;
struct hrtimer_clock_base *new_base;
- struct hrtimer_cpu_base *new_cpu_base;
- int this_cpu = smp_processor_id();
- int cpu = get_nohz_timer_target(pinned);
int basenum = base->index;

+ this_base = this_cpu_ptr(&hrtimer_bases);
+ new_cpu_base = get_target_base(pinned);
again:
- new_cpu_base = &per_cpu(hrtimer_bases, cpu);
new_base = &new_cpu_base->clock_base[basenum];

if (base != new_base) {
@@ -225,17 +241,19 @@ again:
raw_spin_unlock(&base->cpu_base->lock);
raw_spin_lock(&new_base->cpu_base->lock);

- if (cpu != this_cpu && hrtimer_check_target(timer, new_base)) {
- cpu = this_cpu;
+ if (new_cpu_base != this_base &&
+ hrtimer_check_target(timer, new_base)) {
raw_spin_unlock(&new_base->cpu_base->lock);
raw_spin_lock(&base->cpu_base->lock);
+ new_cpu_base = this_base;
timer->base = base;
goto again;
}
timer->base = new_base;
} else {
- if (cpu != this_cpu && hrtimer_check_target(timer, new_base)) {
- cpu = this_cpu;
+ if (new_cpu_base != this_base &&
+ hrtimer_check_target(timer, new_base)) {
+ new_cpu_base = this_base;
goto again;
}
}
Index: linux/kernel/time/timer.c
===================================================================
--- linux.orig/kernel/time/timer.c
+++ linux/kernel/time/timer.c
@@ -104,6 +104,49 @@ EXPORT_SYMBOL(boot_tvec_bases);

static DEFINE_PER_CPU(struct tvec_base *, tvec_bases) = &boot_tvec_bases;

+#if defined(CONFIG_SMP) && defined(CONFIG_NO_HZ_COMMON)
+unsigned int sysctl_timer_migration = 1;
+struct static_key timer_migration_enabled = STATIC_KEY_INIT_TRUE;
+
+int timer_migration_handler(struct ctl_table *table, int write,
+ void __user *buffer, size_t *lenp,
+ loff_t *ppos)
+{
+ static DEFINE_MUTEX(mutex);
+ bool keyon;
+ int ret;
+
+ mutex_lock(&mutex);
+ ret = proc_dointvec(table, write, buffer, lenp, ppos);
+ if (ret || !write)
+ goto unlock;
+
+ keyon = static_key_enabled(&timer_migration_enabled);
+ if (sysctl_timer_migration && !keyon)
+ static_key_slow_inc(&timer_migration_enabled);
+ else if (!sysctl_timer_migration && keyon)
+ static_key_slow_dec(&timer_migration_enabled);
+
+unlock:
+ mutex_unlock(&mutex);
+ return ret;
+}
+
+static inline struct tvec_base *get_target_base(int pinned)
+{
+ if (pinned)
+ return __this_cpu_read(tvec_bases);
+ if (static_key_true(&timer_migration_enabled))
+ return per_cpu(tvec_bases, get_nohz_timer_target());
+ return __this_cpu_read(tvec_bases);
+}
+#else
+static inline struct tvec_base *get_target_base(int pinned)
+{
+ return __this_cpu_read(tvec_bases);
+}
+#endif
+
/* Functions below help us manage 'deferrable' flag */
static inline unsigned int tbase_get_deferrable(struct tvec_base *base)
{
@@ -774,7 +817,7 @@ __mod_timer(struct timer_list *timer, un
{
struct tvec_base *base, *new_base;
unsigned long flags;
- int ret = 0, cpu;
+ int ret = 0;

timer_stats_timer_set_start_info(timer);
BUG_ON(!timer->function);
@@ -787,8 +830,7 @@ __mod_timer(struct timer_list *timer, un

debug_activate(timer, expires);

- cpu = get_nohz_timer_target(pinned);
- new_base = per_cpu(tvec_bases, cpu);
+ new_base = get_target_base(pinned);

if (base != new_base) {
/*

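The proc handler in the hunk above only touches the static key on an actual 0<->1 transition of the sysctl value, so repeated writes of the same value cannot unbalance the key's reference count. A minimal userspace sketch of that toggle logic, with the kernel static-key API modeled as a plain counter (all names here are hypothetical stand-ins, not kernel symbols):

```c
#include <stdbool.h>

/* Models STATIC_KEY_INIT_TRUE: the key starts enabled with count 1. */
static int key_count = 1;
static int sysctl_timer_migration = 1;

static bool key_enabled(void) { return key_count > 0; }

/* Mirrors timer_migration_handler(): after the sysctl value is
 * updated, inc/dec the "key" only if the enabled state actually
 * changed, standing in for static_key_slow_inc()/_slow_dec(). */
static void timer_migration_write(int val)
{
	bool keyon = key_enabled();

	sysctl_timer_migration = val;
	if (sysctl_timer_migration && !keyon)
		key_count++;
	else if (!sysctl_timer_migration && keyon)
		key_count--;
}
```

Writing 0 twice in a row decrements the count once, not twice, which is the property the `keyon` check buys.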
Eric Dumazet

Apr 23, 2015, 2:57:40 AM
to Thomas Gleixner, Peter Zijlstra, viresh kumar, Ingo Molnar, linaro...@lists.linaro.org, linux-...@vger.kernel.org, Steven Miao, sha...@codeaurora.org

On Wed, 2015-04-22 at 23:56 +0200, Thomas Gleixner wrote:

> -int get_nohz_timer_target(int pinned)
> +int get_nohz_timer_target(void)
> {
> - int cpu = smp_processor_id();
> - int i;
> + int i, cpu = smp_processor_id();
> struct sched_domain *sd;
>
> - if (pinned || !get_sysctl_timer_migration() || !idle_cpu(cpu))
> + if (!idle_cpu(cpu))
> return cpu;

Maybe also test in_serving_softirq()?

if (in_serving_softirq() || !idle_cpu(cpu))
return cpu;

There is a fundamental problem with networking load: many CPUs appear
to be idle from the scheduler's perspective because no user/kernel task
is running.

CPUs servicing NIC queues can be very busy handling thousands of packets
per second, yet have no user/kernel task running.

idle_cpu()'s return value says: this CPU is idle. Hmmm, really?

These CPUs are busy, *and* they have to access alien data/locks to
activate timers that hardly ever fire anyway.

When idle_cpu() finally gives the right indication, it is too late:
ksoftirqd might be running on the wrong CPU, and innocent CPUs get
overwhelmed by a sudden timer load and locked into a service loop.

This cannot withstand a DoS, and even with non-malicious traffic the
overhead is high.
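The suggested guard composes with the existing idle check as a simple early-out. A userspace sketch of the resulting target selection, with idle_cpu() and in_serving_softirq() stubbed as globals and the sched-domain walk replaced by a linear scan (all names and the 4-CPU topology are invented for illustration):

```c
#include <stdbool.h>

#define NR_CPUS 4

static bool cpu_idle[NR_CPUS];     /* stub for idle_cpu(cpu) */
static bool serving_softirq;       /* stub for in_serving_softirq() */

static int pick_timer_target(int this_cpu)
{
	/* Eric's suggestion: a CPU servicing softirqs (e.g. NIC
	 * queues) keeps its timers local even though the scheduler
	 * considers it idle. */
	if (serving_softirq || !cpu_idle[this_cpu])
		return this_cpu;

	/* Otherwise migrate to a non-idle CPU; this linear scan
	 * stands in for the sched-domain walk in
	 * get_nohz_timer_target(). */
	for (int cpu = 0; cpu < NR_CPUS; cpu++)
		if (!cpu_idle[cpu])
			return cpu;
	return this_cpu;
}
```

With the guard in place, a softirq-loaded but scheduler-idle CPU never pays for touching another CPU's base lock.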

Thanks.

Thomas Gleixner

Apr 23, 2015, 8:46:01 AM
to Eric Dumazet, Peter Zijlstra, viresh kumar, Ingo Molnar, linaro...@lists.linaro.org, LKML, Steven Miao, sha...@codeaurora.org, Frederic Weisbecker, Paul E. McKenney, c...@kernel.org
On Wed, 22 Apr 2015, Eric Dumazet wrote:
You definitely have a point from the high throughput networking
perspective.

Though in a power optimizing scenario with minimal network traffic
this might be the wrong decision. We have to gather data from the
power maniacs whether this matters or not. The FULL_NO_HZ camp might
be pretty unhappy about the above.

Thanks,

tglx

Eric Dumazet

Apr 25, 2015, 2:37:46 PM
to Thomas Gleixner, Peter Zijlstra, viresh kumar, Ingo Molnar, linaro...@lists.linaro.org, LKML, Steven Miao, sha...@codeaurora.org, Frederic Weisbecker, Paul E. McKenney, c...@kernel.org
On Thu, 2015-04-23 at 14:45 +0200, Thomas Gleixner wrote:

> You definitely have a point from the high throughput networking
> perspective.
>
> Though in a power optimizing scenario with minimal network traffic
> this might be the wrong decision. We have to gather data from the
> power maniacs whether this matters or not. The FULL_NO_HZ camp might
> be pretty unhappy about the above.

Sure, I understand.


To make this clear, here is the profile on a moderately loaded TCP
server pushing ~20Gbits of data. Most of the TCP output is ACK-clock
driven (thus from softirq context).

(We are using regular sendmsg() system calls, which is why
get_nohz_timer_target() is 'only' second in the profile; but add
find_next_bit() to it and it is very close to being in first position.)



PerfTop: 4712 irqs/sec kernel:96.7% exact: 0.0% [4000Hz cycles], (all, 72 CPUs)
-----------------------------------------------------------------------------------------
10.16% [kernel] [k] copy_user_enhanced_fast_string
5.66% [kernel] [k] get_nohz_timer_target
5.59% [kernel] [k] _raw_spin_lock
2.53% [kernel] [k] __netif_receive_skb_core
2.27% [kernel] [k] find_next_bit
1.90% [kernel] [k] tcp_ack

Maybe a reasonable heuristic would be to change the
/proc/sys/kernel/timer_migration default to 0 on hosts with more
than 32 cpus.
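For reference, an administrator can apply that heuristic today without any kernel change; a hypothetical sysctl configuration fragment (the file name is invented, the knob is the one from the patch):

```shell
# /etc/sysctl.d/10-timer-migration.conf  (hypothetical file name)
# Keep timers on the CPU that armed them on large hosts, avoiding the
# get_nohz_timer_target() cost visible in the profile above.
kernel.timer_migration = 0
```

The same effect at runtime: `echo 0 > /proc/sys/kernel/timer_migration`, which with the patch applied also disables the static key via timer_migration_handler().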

profile with timer_migration = 0

PerfTop: 3656 irqs/sec kernel:94.3% exact: 0.0% [4000Hz cycles], (all, 72 CPUs)
-----------------------------------------------------------------------------------------
13.95% [kernel] [k] copy_user_enhanced_fast_string
4.65% [kernel] [k] _raw_spin_lock
2.57% [kernel] [k] __netif_receive_skb_core
2.33% [kernel] [k] tcp_ack

Thanks.

Thomas Gleixner

May 5, 2015, 9:00:57 AM
to Eric Dumazet, Peter Zijlstra, viresh kumar, Ingo Molnar, linaro...@lists.linaro.org, LKML, Steven Miao, sha...@codeaurora.org, Frederic Weisbecker, Paul E. McKenney, c...@kernel.org
Is that with the static key patch applied?

Thanks,

tglx

Eric Dumazet

May 6, 2015, 12:33:46 PM
to Thomas Gleixner, Peter Zijlstra, viresh kumar, Ingo Molnar, linaro...@lists.linaro.org, LKML, Steven Miao, sha...@codeaurora.org, Frederic Weisbecker, Paul E. McKenney, c...@kernel.org
This was without.

I applied your patch on the current Linus tree, but for some reason my
72-cpu host is not liking the resulting kernel. I had to ask for a
repair, and this might take a while.

Note that your kernel works correctly on other hosts, but those have 48
or 32 cpus, so this must be something unrelated.

I'll let you know when I get more interesting data.

Thanks