
[RFC] scheduler: improve SMP fairness in CFS


Tong Li

Jul 23, 2007, 2:50:07 PM

This patch extends CFS to achieve better fairness for SMPs. For example,
with 10 tasks (same priority) on 8 CPUs, it enables each task to receive
equal CPU time (80% of a CPU). The code works on top of CFS and provides SMP
fairness at a coarser time granularity; locally, on each CPU, it relies on
CFS to provide fine-grained fairness and good interactivity.

The code is based on the distributed weighted round-robin (DWRR)
algorithm. It keeps two RB trees on each CPU: one is the original cfs_rq,
referred to as active, and one is a new cfs_rq, called round-expired. Each
CPU keeps a round number, initially zero. The scheduler works exactly the
same way as in CFS, but only runs tasks from the active tree. Each task is
assigned a round slice, equal to its weight times a system constant (e.g.,
100ms), controlled by sysctl_base_round_slice. When a task uses up its
round slice, it moves to the round-expired tree on the same CPU and stops
running. Thus, at any time on each CPU, the active tree contains all tasks
that are running in the current round, while tasks in round-expired have
all finished the current round and are waiting to start the next one. When a
CPU's active tree becomes empty, the CPU calls idle_balance() to grab tasks
of the same round from other CPUs. If none can be moved over, it switches its
active and round-expired trees, thus unleashing the round-expired tasks and
advancing the local round number by one. The invariant maintained is that
the round numbers of any two CPUs in the system differ by at most one.
This property ensures fairness across CPUs. The variable
sysctl_base_round_slice controls the fairness-performance tradeoff: a smaller
value leads to better cross-CPU fairness at a potential cost in performance,
while a larger value brings the system's behavior closer to that of the
default CFS without the patch.
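
As a rough illustration, here is a minimal user-space sketch of the per-task
round-slice accounting described above (the constants and names are
illustrative assumptions, not the patch's exact interfaces; the cross-CPU
round switching is omitted):

#include <stdio.h>
#include <stdint.h>

#define NICE_0_LOAD        1024ULL       /* weight of a nice-0 task          */
#define BASE_ROUND_SLICE   30000000ULL   /* 30 ms in ns (the sysctl default) */

struct task {
	uint64_t weight;               /* load weight                       */
	uint64_t round_slice_used;     /* ns consumed in the current round  */
};

/* Round slice is proportional to weight: a nice-0 task gets the base slice. */
static uint64_t round_slice(const struct task *t)
{
	return t->weight * BASE_ROUND_SLICE / NICE_0_LOAD;
}

/*
 * Called from the periodic tick: charge delta_ns of CPU time and report
 * whether the task should move from the active tree to round-expired.
 */
static int tick(struct task *t, uint64_t delta_ns)
{
	t->round_slice_used += delta_ns;
	if (t->round_slice_used >= round_slice(t)) {
		t->round_slice_used -= round_slice(t);
		return 1;  /* expired: wait in round-expired for the next round */
	}
	return 0;          /* keep running in the current round */
}

int main(void)
{
	struct task t = { .weight = NICE_0_LOAD, .round_slice_used = 0 };
	int i;

	/* A nice-0 task exhausts its 30 ms slice on the third 10 ms tick. */
	for (i = 0; i < 3; i++)
		printf("tick %d: expired=%d\n", i, tick(&t, 10000000ULL));
	return 0;
}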

Any comments and suggestions would be highly appreciated.

Thanks,

tong

Signed-off-by: Tong Li <tong...@intel.com>
---
include/linux/sched.h | 5
kernel/sched.c | 577 ++++++++++++++++++++++++++++++++++++++++++--------
kernel/sched_debug.c | 8
kernel/sched_fair.c | 124 ++++++++--
kernel/sysctl.c | 13 +
5 files changed, 604 insertions(+), 123 deletions(-)

diff -uprN linux-2.6.23-rc1/include/linux/sched.h linux-2.6.23-rc1-dwrr/include/linux/sched.h
--- linux-2.6.23-rc1/include/linux/sched.h 2007-07-22 18:42:28.000000000 -0700
+++ linux-2.6.23-rc1-dwrr/include/linux/sched.h 2007-07-22 18:52:34.000000000 -0700
@@ -887,7 +887,7 @@ struct sched_entity {
s64 fair_key;
struct load_weight load; /* for load-balancing */
struct rb_node run_node;
- unsigned int on_rq;
+ struct cfs_rq *on_rq; /* NULL, active, or round-expired */

u64 wait_start_fair;
u64 wait_start;
@@ -906,6 +906,8 @@ struct sched_entity {
s64 sum_sleep_runtime;
unsigned long wait_runtime_overruns;
unsigned long wait_runtime_underruns;
+ /* How long it has run in the current round. */
+ u64 round_slice_used;
#ifdef CONFIG_FAIR_GROUP_SCHED
struct sched_entity *parent;
/* rq on which this entity is (to be) queued: */
@@ -1382,6 +1384,7 @@ extern unsigned int sysctl_sched_stat_gr
extern unsigned int sysctl_sched_runtime_limit;
extern unsigned int sysctl_sched_child_runs_first;
extern unsigned int sysctl_sched_features;
+extern u64 sysctl_sched_base_round_slice;

#ifdef CONFIG_RT_MUTEXES
extern int rt_mutex_getprio(struct task_struct *p);
diff -uprN linux-2.6.23-rc1/kernel/sched.c linux-2.6.23-rc1-dwrr/kernel/sched.c
--- linux-2.6.23-rc1/kernel/sched.c 2007-07-22 18:42:28.000000000 -0700
+++ linux-2.6.23-rc1-dwrr/kernel/sched.c 2007-07-22 23:05:54.000000000 -0700
@@ -184,17 +184,20 @@ struct cfs_rq {
u64 exec_clock;
s64 wait_runtime;
u64 sleeper_bonus;
+ u64 round; /* round number of this cfs_rq */
unsigned long wait_runtime_overruns, wait_runtime_underruns;
+ u64 wait_start_fair, wait_start; /* Used to track wait times of
+ round-expired tasks. */

struct rb_root tasks_timeline;
struct rb_node *rb_leftmost;
struct rb_node *rb_load_balance_curr;
+ struct rq *rq; /* cpu runqueue to which this cfs_rq is attached */
#ifdef CONFIG_FAIR_GROUP_SCHED
/* 'curr' points to currently running entity on this cfs_rq.
* It is set to NULL otherwise (i.e when none are currently running).
*/
struct sched_entity *curr;
- struct rq *rq; /* cpu runqueue to which this cfs_rq is attached */

/* leaf cfs_rqs are those that hold tasks (lowest schedulable entity in
* a hierarchy). Non-leaf lrqs hold other higher schedulable entities
@@ -239,7 +242,7 @@ struct rq {
unsigned long nr_load_updates;
u64 nr_switches;

- struct cfs_rq cfs;
+ struct cfs_rq *active, *round_expired, cfs[2];
#ifdef CONFIG_FAIR_GROUP_SCHED
struct list_head leaf_cfs_rq_list; /* list of leaf cfs_rq on this cpu */
#endif
@@ -301,6 +304,11 @@ struct rq {
struct lock_class_key rq_lock_key;
};

+/* highest round number in the system */
+__cacheline_aligned u64 dwrr_highest_round = 0;
+ /* lock protecting dwrr_highest_round */
+DEFINE_RWLOCK(dwrr_highest_lock);
+
static DEFINE_PER_CPU_SHARED_ALIGNED(struct rq, runqueues);
static DEFINE_MUTEX(sched_hotcpu_mutex);

@@ -400,7 +408,7 @@ unsigned long long cpu_clock(int cpu)
/* Change a task's ->cfs_rq if it moves across CPUs */
static inline void set_task_cfs_rq(struct task_struct *p)
{
- p->se.cfs_rq = &task_rq(p)->cfs;
+ p->se.cfs_rq = task_rq(p)->active;
}
#else
static inline void set_task_cfs_rq(struct task_struct *p)
@@ -415,6 +423,19 @@ static inline void set_task_cfs_rq(struc
# define finish_arch_switch(prev) do { } while (0)
#endif

+static inline void dwrr_update_idle(struct task_struct *p, struct rq *rq)
+{
+ unsigned long flags;
+
+ if (rt_task(p))
+ return;
+
+ read_lock_irqsave(&dwrr_highest_lock, flags);
+ rq->active->round = dwrr_highest_round;
+ read_unlock_irqrestore(&dwrr_highest_lock, flags);
+ rq->round_expired->round = rq->active->round + 1;
+}
+
#ifndef __ARCH_WANT_UNLOCKED_CTXSW
static inline int task_running(struct rq *rq, struct task_struct *p)
{
@@ -841,7 +862,7 @@ static int balance_tasks(struct rq *this

static void set_load_weight(struct task_struct *p)
{
- task_rq(p)->cfs.wait_runtime -= p->se.wait_runtime;
+ task_rq(p)->active->wait_runtime -= p->se.wait_runtime;
p->se.wait_runtime = 0;

if (task_has_rt_policy(p)) {
@@ -868,14 +889,14 @@ enqueue_task(struct rq *rq, struct task_
{
sched_info_queued(p);
p->sched_class->enqueue_task(rq, p, wakeup, now);
- p->se.on_rq = 1;
+ p->se.on_rq = rq->active;
}

static void
dequeue_task(struct rq *rq, struct task_struct *p, int sleep, u64 now)
{
p->sched_class->dequeue_task(rq, p, sleep, now);
- p->se.on_rq = 0;
+ p->se.on_rq = NULL;
}

/*
@@ -933,9 +954,13 @@ static void activate_task(struct rq *rq,

if (p->state == TASK_UNINTERRUPTIBLE)
rq->nr_uninterruptible--;
+ if (rq->curr == rq->idle)
+ dwrr_update_idle(p, rq);

enqueue_task(rq, p, wakeup, now);
inc_nr_running(p, rq, now);
+ if (wakeup)
+ p->se.round_slice_used = 0;
}

/*
@@ -958,12 +983,14 @@ static inline void activate_idle_task(st
static void deactivate_task(struct rq *rq, struct task_struct *p, int sleep)
{
u64 now = rq_clock(rq);
+ struct cfs_rq *on_rq = p->se.on_rq;

if (p->state == TASK_UNINTERRUPTIBLE)
rq->nr_uninterruptible++;

dequeue_task(rq, p, sleep, now);
- dec_nr_running(p, rq, now);
+ if (on_rq != rq->round_expired)
+ dec_nr_running(p, rq, now);
}

/**
@@ -998,8 +1025,8 @@ void set_task_cpu(struct task_struct *p,
u64 clock_offset, fair_clock_offset;

clock_offset = old_rq->clock - new_rq->clock;
- fair_clock_offset = old_rq->cfs.fair_clock -
- new_rq->cfs.fair_clock;
+ fair_clock_offset = old_rq->active->fair_clock -
+ new_rq->active->fair_clock;
if (p->se.wait_start)
p->se.wait_start -= clock_offset;
if (p->se.wait_start_fair)
@@ -1061,7 +1088,8 @@ migrate_task(struct task_struct *p, int
void wait_task_inactive(struct task_struct *p)
{
unsigned long flags;
- int running, on_rq;
+ int running;
+ struct cfs_rq *on_rq;
struct rq *rq;

repeat:
@@ -1209,6 +1237,8 @@ find_idlest_group(struct sched_domain *s
unsigned long min_load = ULONG_MAX, this_load = 0;
int load_idx = sd->forkexec_idx;
int imbalance = 100 + (sd->imbalance_pct-100)/2;
+ int found_highest_cpu, this_group_ok = 0;
+ struct rq *rq;

do {
unsigned long load, avg_load;
@@ -1224,7 +1254,16 @@ find_idlest_group(struct sched_domain *s
/* Tally up the load of all CPUs in the group */
avg_load = 0;

+ found_highest_cpu = 0;
for_each_cpu_mask(i, group->cpumask) {
+ rq = cpu_rq(i);
+ if (cpu_isset(i, p->cpus_allowed) &&
+ (rq->active->round == dwrr_highest_round
+ || rq->curr == rq->idle)) {
+ if (local_group)
+ this_group_ok = 1;
+ found_highest_cpu = 1;
+ }
/* Bias balancing toward cpus of our domain */
if (local_group)
load = source_load(i, load_idx);
@@ -1238,6 +1277,16 @@ find_idlest_group(struct sched_domain *s
avg_load = sg_div_cpu_power(group,
avg_load * SCHED_LOAD_SCALE);

+ if (!found_highest_cpu && !rt_task(p)) {
+ if (local_group) {
+ this_load = avg_load;
+ this = group;
+ }
+ /* If the group doesn't contain a highest round CPU
+ * or an idle CPU, skip it. */
+ goto nextgroup;
+ }
+
if (local_group) {
this_load = avg_load;
this = group;
@@ -1249,7 +1298,8 @@ nextgroup:
group = group->next;
} while (group != sd->groups);

- if (!idlest || 100*this_load < imbalance*min_load)
+ if (!idlest || (100*this_load < imbalance*min_load &&
+ (rt_task(p) || this_group_ok)))
return NULL;
return idlest;
}
@@ -1264,6 +1314,7 @@ find_idlest_cpu(struct sched_group *grou
unsigned long load, min_load = ULONG_MAX;
int idlest = -1;
int i;
+ struct rq *rq;

/* Traverse only the allowed CPUs */
cpus_and(tmp, group->cpumask, p->cpus_allowed);
@@ -1271,6 +1322,11 @@ find_idlest_cpu(struct sched_group *grou
for_each_cpu_mask(i, tmp) {
load = weighted_cpuload(i);

+ rq = cpu_rq(i);
+ if (!rt_task(p) && rq->curr != rq->idle &&
+ rq->active->round != dwrr_highest_round)
+ continue;
+
if (load < min_load || (load == min_load && i == this_cpu)) {
min_load = load;
idlest = i;
@@ -1281,7 +1337,7 @@ find_idlest_cpu(struct sched_group *grou
}

/*
- * sched_balance_self: balance the current task (running on cpu) in domains
+ * sched_balance_task: balance the given task (running on cpu) in domains
* that have the 'flag' flag set. In practice, this is SD_BALANCE_FORK and
* SD_BALANCE_EXEC.
*
@@ -1291,16 +1347,15 @@ find_idlest_cpu(struct sched_group *grou
*
* preempt must be disabled.
*/
-static int sched_balance_self(int cpu, int flag)
+static int sched_balance_task(int cpu, struct task_struct *t, int flag)
{
- struct task_struct *t = current;
struct sched_domain *tmp, *sd = NULL;

for_each_domain(cpu, tmp) {
/*
* If power savings logic is enabled for a domain, stop there.
*/
- if (tmp->flags & SD_POWERSAVINGS_BALANCE)
+ if (t == current && (tmp->flags & SD_POWERSAVINGS_BALANCE))
break;
if (tmp->flags & flag)
sd = tmp;
@@ -1417,10 +1472,17 @@ static int try_to_wake_up(struct task_st
struct rq *rq;
#ifdef CONFIG_SMP
struct sched_domain *sd, *this_sd = NULL;
- unsigned long load, this_load;
- int new_cpu;
+ unsigned long load, this_load, irqflags = 0;
+ int old_cpu, new_cpu, highest_locked;
+ struct rq *new_rq;
+
+dwrr_start:
+ highest_locked = 0;
+ if (!rt_task(p)) {
+ read_lock_irqsave(&dwrr_highest_lock, irqflags);
+ highest_locked = 1;
+ }
#endif
-
rq = task_rq_lock(p, &flags);
old_state = p->state;
if (!(old_state & state))
@@ -1433,8 +1495,20 @@ static int try_to_wake_up(struct task_st
this_cpu = smp_processor_id();

#ifdef CONFIG_SMP
- if (unlikely(task_running(rq, p)))
- goto out_activate;
+ if (unlikely(task_running(rq, p))) {
+ if (rt_task(p) || rq->active->round == dwrr_highest_round)
+ goto out_activate;
+ else {
+ /* wait until p is not running */
+ task_rq_unlock(rq, &flags);
+ if (highest_locked)
+ read_unlock_irqrestore(&dwrr_highest_lock,
+ irqflags);
+ while (task_running(rq, p))
+ cpu_relax();
+ goto dwrr_start;
+ }
+ }

new_cpu = cpu;

@@ -1511,6 +1585,19 @@ static int try_to_wake_up(struct task_st
new_cpu = cpu; /* Could not wake to this_cpu. Wake to cpu instead */
out_set_cpu:
new_cpu = wake_idle(new_cpu, p);
+ if (highest_locked) {
+ old_cpu = new_cpu;
+ if (!rt_task(p) && !idle_cpu(new_cpu) &&
+ cpu_rq(new_cpu)->active->round != dwrr_highest_round) {
+ /* Need to find a highest round cpu. This is similar
+ to what's done in fork. */
+ new_cpu = sched_balance_task(old_cpu, p,
+ SD_BALANCE_FORK);
+ BUG_ON(new_cpu == -1);
+ }
+
+ new_rq = cpu_rq(new_cpu);
+ }
if (new_cpu != cpu) {
set_task_cpu(p, new_cpu);
task_rq_unlock(rq, &flags);
@@ -1545,6 +1632,10 @@ out_running:
p->state = TASK_RUNNING;
out:
task_rq_unlock(rq, &flags);
+#ifdef CONFIG_SMP
+ if (highest_locked)
+ read_unlock_irqrestore(&dwrr_highest_lock, irqflags);
+#endif

return success;
}
@@ -1590,7 +1681,8 @@ static void __sched_fork(struct task_str
p->se.wait_runtime_underruns = 0;

INIT_LIST_HEAD(&p->run_list);
- p->se.on_rq = 0;
+ p->se.on_rq = NULL;
+ p->se.round_slice_used = 0;

/*
* We mark the process as running here, but have not actually
@@ -1611,7 +1703,10 @@ void sched_fork(struct task_struct *p, i
__sched_fork(p);

#ifdef CONFIG_SMP
- cpu = sched_balance_self(cpu, SD_BALANCE_FORK);
+ /* Non-RT tasks will do sched_balance_task() in wake_up_new_task()
+ * to ensure they start on the "right" CPUs. */
+ if (rt_task(p))
+ cpu = sched_balance_task(cpu, current, SD_BALANCE_FORK);
#endif
__set_task_cpu(p, cpu);

@@ -1651,14 +1746,37 @@ void fastcall wake_up_new_task(struct ta
{
unsigned long flags;
struct rq *rq;
- int this_cpu;
+ int cpu, this_cpu;
+#ifdef CONFIG_SMP
+ unsigned long irqflags = 0;
+ int new_cpu, highest_locked = 0;

+ if (!rt_task(p)) {
+ read_lock_irqsave(&dwrr_highest_lock, irqflags);
+ highest_locked = 1;
+ }
+#endif
rq = task_rq_lock(p, &flags);
BUG_ON(p->state != TASK_RUNNING);
this_cpu = smp_processor_id(); /* parent's CPU */
+ cpu = task_cpu(p);

p->prio = effective_prio(p);

+#ifdef CONFIG_SMP
+ if (highest_locked) {
+ new_cpu = sched_balance_task(cpu, p, SD_BALANCE_FORK);
+ BUG_ON(new_cpu == -1);
+ if (new_cpu != cpu) {
+ cpu = new_cpu;
+ set_task_cpu(p, cpu);
+ task_rq_unlock(rq, &flags);
+ rq = task_rq_lock(p, &flags);
+ BUG_ON(this_cpu != smp_processor_id());
+ }
+ }
+#endif
+
if (!sysctl_sched_child_runs_first || (clone_flags & CLONE_VM) ||
task_cpu(p) != this_cpu || !current->se.on_rq) {
activate_task(rq, p, 0);
@@ -1671,6 +1789,10 @@ void fastcall wake_up_new_task(struct ta
}
check_preempt_curr(rq, p);
task_rq_unlock(rq, &flags);
+#ifdef CONFIG_SMP
+ if (highest_locked)
+ read_unlock_irqrestore(&dwrr_highest_lock, irqflags);
+#endif
}

/**
@@ -2046,7 +2168,13 @@ out:
void sched_exec(void)
{
int new_cpu, this_cpu = get_cpu();
- new_cpu = sched_balance_self(this_cpu, SD_BALANCE_EXEC);
+
+ new_cpu = this_cpu;
+ if (unlikely(!cpu_isset(this_cpu, current->cpus_allowed)))
+ new_cpu = any_online_cpu(current->cpus_allowed);
+ else if (rt_task(current))
+ new_cpu = sched_balance_task(this_cpu, current,
+ SD_BALANCE_EXEC);
put_cpu();
if (new_cpu != this_cpu)
sched_migrate_task(current, new_cpu);
@@ -2070,6 +2198,25 @@ static void pull_task(struct rq *src_rq,
}

/*
+ * pull_expired_task - move a task from a remote round-expired runqueue to
+ * the local runqueue.
+ * Both runqueues must be locked.
+ */
+static void pull_expired_task(struct rq *src_rq, struct task_struct *p,
+ struct rq *this_rq, int this_cpu)
+{
+ u64 now = rq_clock(src_rq);
+ dequeue_task(src_rq, p, 0, now);
+ set_task_cpu(p, this_cpu);
+ activate_task(this_rq, p, 0);
+ /*
+ * Note that idle threads have a prio of MAX_PRIO, for this test
+ * to be always true for them.
+ */
+ check_preempt_curr(this_rq, p);
+}
+
+/*
* can_migrate_task - may task p from runqueue rq be migrated to this_cpu?
*/
static
@@ -2089,6 +2236,8 @@ int can_migrate_task(struct task_struct

if (task_running(rq, p))
return 0;
+ if (p->se.on_rq == rq->round_expired)
+ return 0;

/*
* Aggressive migration if too many balance attempts have failed:
@@ -2099,6 +2248,72 @@ int can_migrate_task(struct task_struct
return 1;
}

+/*
+ * can_migrate_expired_task - may task p from round expired runqueue rq be
+ * migrated to this_cpu?
+ */
+static
+int can_migrate_expired_task(struct task_struct *p, struct rq *rq,
+ int this_cpu)
+{
+ /*
+ * We do not migrate tasks that are:
+ * 1) running (obviously), or
+ * 2) cannot be migrated to this CPU due to cpus_allowed.
+ */
+ if (!cpu_isset(this_cpu, p->cpus_allowed))
+ return 0;
+
+ /*
+ * p could still be the current running task on rq between the time
+ * it was moved to the round_expired queue and the time schedule()
+ * is called to switch it out.
+ */
+ if (task_running(rq, p))
+ return 0;
+
+ return 1;
+}
+
+static int move_round_expired_tasks(struct rq *this_rq, int this_cpu,
+ struct rq *src_rq, unsigned long max_nr_move)
+{
+ int pulled = 0;
+ struct cfs_rq *src_cfs_rq;
+ struct task_struct *p;
+ struct rq_iterator cfs_rq_iterator;
+
+ if (max_nr_move == 0 || !src_rq->round_expired->nr_running)
+ goto out;
+
+ src_cfs_rq = src_rq->round_expired;
+ cfs_rq_iterator.start = load_balance_start_fair;
+ cfs_rq_iterator.next = load_balance_next_fair;
+
+ p = cfs_rq_iterator.start(src_cfs_rq);
+next:
+ if (!p)
+ goto out;
+
+ if (!can_migrate_expired_task(p, src_rq, this_cpu)) {
+ p = cfs_rq_iterator.next(src_cfs_rq);
+ goto next;
+ }
+
+ pull_expired_task(src_rq, p, this_rq, this_cpu);
+ pulled++;
+
+ /*
+ * We only want to steal up to the prescribed number of tasks.
+ */
+ if (pulled < max_nr_move) {
+ p = cfs_rq_iterator.next(src_cfs_rq);
+ goto next;
+ }
+out:
+ return pulled;
+}
+
static int balance_tasks(struct rq *this_rq, int this_cpu, struct rq *busiest,
unsigned long max_nr_move, unsigned long max_load_move,
struct sched_domain *sd, enum cpu_idle_type idle,
@@ -2106,7 +2321,7 @@ static int balance_tasks(struct rq *this
int this_best_prio, int best_prio, int best_prio_seen,
struct rq_iterator *iterator)
{
- int pulled = 0, pinned = 0, skip_for_load;
+ int pulled = 0, pinned = 0;
struct task_struct *p;
long rem_load_move = max_load_move;

@@ -2122,18 +2337,8 @@ static int balance_tasks(struct rq *this
next:
if (!p)
goto out;
- /*
- * To help distribute high priority tasks accross CPUs we don't
- * skip a task if it will be the highest priority task (i.e. smallest
- * prio value) on its new queue regardless of its load weight
- */
- skip_for_load = (p->se.load.weight >> 1) > rem_load_move +
- SCHED_LOAD_SCALE_FUZZ;
- if (skip_for_load && p->prio < this_best_prio)
- skip_for_load = !best_prio_seen && p->prio == best_prio;
- if (skip_for_load ||
- !can_migrate_task(p, busiest, this_cpu, sd, idle, &pinned)) {

+ if (!can_migrate_task(p, busiest, this_cpu, sd, idle, &pinned)) {
best_prio_seen |= p->prio == best_prio;
p = iterator->next(iterator->arg);
goto next;
@@ -2218,6 +2423,10 @@ find_busiest_group(struct sched_domain *
unsigned long min_nr_running = ULONG_MAX;
struct sched_group *group_min = NULL, *group_leader = NULL;
#endif
+ struct rq *this_rq = cpu_rq(this_cpu);
+ unsigned long nr_moved = 0;
+ int found_highest_cpu, highest_cpu = -1, this_group_ok;
+ struct sched_group *highest_group = NULL;

max_load = this_load = total_load = total_pwr = 0;
busiest_load_per_task = busiest_nr_running = 0;
@@ -2244,6 +2453,8 @@ find_busiest_group(struct sched_domain *
/* Tally up the load of all CPUs in the group */
sum_weighted_load = sum_nr_running = avg_load = 0;

+ found_highest_cpu = 0;
+ this_group_ok = 0;
for_each_cpu_mask(i, group->cpumask) {
struct rq *rq;

@@ -2252,6 +2463,17 @@ find_busiest_group(struct sched_domain *

rq = cpu_rq(i);

+ if (idle == CPU_NEWLY_IDLE && !nr_moved &&
+ this_rq->active->round == dwrr_highest_round &&
+ rq->active->round + 1 == dwrr_highest_round) {
+ double_lock_balance(this_rq, rq);
+ nr_moved = move_round_expired_tasks(this_rq,
+ this_cpu, rq,
+ (rq->round_expired->nr_running + 1)/2);
+ spin_unlock(&rq->lock);
+ } else if (rq->active->round == dwrr_highest_round)
+ found_highest_cpu = 1;
+
if (*sd_idle && rq->nr_running)
*sd_idle = 0;

@@ -2269,7 +2491,27 @@ find_busiest_group(struct sched_domain *
avg_load += load;
sum_nr_running += rq->nr_running;
sum_weighted_load += weighted_cpuload(i);
- }
+ if (found_highest_cpu && !local_group &&
+ rq->active->nr_running >= 2) {
+ this_group_ok = 1;
+ if (highest_cpu == -1) {
+ highest_cpu = i;
+ highest_group = group;
+ }
+ }
+ }
+ if (!found_highest_cpu ||
+ (idle == CPU_NEWLY_IDLE && !this_group_ok)) {
+ if (local_group) {
+ avg_load = sg_div_cpu_power(group,
+ avg_load * SCHED_LOAD_SCALE);
+ this_load = avg_load;
+ this = group;
+ this_nr_running = sum_nr_running;
+ this_load_per_task = sum_weighted_load;
+ }
+ goto dwrr_group_next;
+ }

/*
* First idle cpu or the first cpu(busiest) in this sched group
@@ -2361,6 +2603,7 @@ find_busiest_group(struct sched_domain *
}
group_next:
#endif
+dwrr_group_next:
group = group->next;
} while (group != sd->groups);

@@ -2429,6 +2672,9 @@ small_imbalance:
if (max_load - this_load + SCHED_LOAD_SCALE_FUZZ >=
busiest_load_per_task * imbn) {
*imbalance = busiest_load_per_task;
+ if (idle == CPU_NEWLY_IDLE &&
+ (*imbalance == 0 || highest_cpu == -1))
+ goto ret;
return busiest;
}

@@ -2479,10 +2725,20 @@ out_balanced:

if (this == group_leader && group_leader != group_min) {
*imbalance = min_load_per_task;
+ if (idle == CPU_NEWLY_IDLE &&
+ (*imbalance == 0 || highest_cpu == -1))
+ goto ret;
return group_min;
}
#endif
ret:
+ if (idle == CPU_NEWLY_IDLE && highest_cpu != -1) {
+ /* Not enough imbalance, so we force one task to be moved
+ * over to the newly idle cpu to ensure SMP fairness. */
+ *imbalance = LONG_MAX; /* signifies forced migration */
+ BUG_ON(!highest_group);
+ return highest_group;
+ }
*imbalance = 0;
return NULL;
}
@@ -2509,6 +2765,10 @@ find_busiest_queue(struct sched_group *g

if (rq->nr_running == 1 && wl > imbalance)
continue;
+ if (idle == CPU_NEWLY_IDLE && rq->nr_running == 1)
+ continue;
+ if (rq->active->round != dwrr_highest_round)
+ continue;

if (wl > max_load) {
max_load = wl;
@@ -2545,6 +2805,18 @@ static int load_balance(int this_cpu, st
cpumask_t cpus = CPU_MASK_ALL;
unsigned long flags;

+ /* Only CPUs in the highest round perform load balancing and this is
+ * the common case. */
+ if (this_rq->active->round < dwrr_highest_round && !idle_cpu(this_cpu))
+ return 0;
+
+ read_lock_irqsave(&dwrr_highest_lock, flags);
+ if (this_rq->active->round < dwrr_highest_round
+ && !idle_cpu(this_cpu)) {
+ read_unlock_irqrestore(&dwrr_highest_lock, flags);
+ return 0;
+ }
+
/*
* When power savings policy is enabled for the parent domain, idle
* sibling can pick up load irrespective of busy siblings. In this case,
@@ -2615,32 +2887,11 @@ redo:
sd->nr_balance_failed++;

if (unlikely(sd->nr_balance_failed > sd->cache_nice_tries+2)) {
-
- spin_lock_irqsave(&busiest->lock, flags);
-
- /* don't kick the migration_thread, if the curr
- * task on busiest cpu can't be moved to this_cpu
- */
- if (!cpu_isset(this_cpu, busiest->curr->cpus_allowed)) {
- spin_unlock_irqrestore(&busiest->lock, flags);
- all_pinned = 1;
- goto out_one_pinned;
- }
-
- if (!busiest->active_balance) {
- busiest->active_balance = 1;
- busiest->push_cpu = this_cpu;
- active_balance = 1;
- }
- spin_unlock_irqrestore(&busiest->lock, flags);
- if (active_balance)
- wake_up_process(busiest->migration_thread);
-
- /*
- * We've kicked active balancing, reset the failure
- * counter.
- */
- sd->nr_balance_failed = sd->cache_nice_tries+1;
+ /* Don't do active balance---DWRR controls when
+ * load balancing is necessary to ensure SMP
+ * fairness. */
+ all_pinned = 1;
+ goto out_one_pinned;
}
} else
sd->nr_balance_failed = 0;
@@ -2660,8 +2911,11 @@ redo:
}

if (!nr_moved && !sd_idle && sd->flags & SD_SHARE_CPUPOWER &&
- !test_sd_parent(sd, SD_POWERSAVINGS_BALANCE))
+ !test_sd_parent(sd, SD_POWERSAVINGS_BALANCE)) {
+ read_unlock_irqrestore(&dwrr_highest_lock, flags);
return -1;
+ }
+ read_unlock_irqrestore(&dwrr_highest_lock, flags);
return nr_moved;

out_balanced:
@@ -2676,8 +2930,11 @@ out_one_pinned:
sd->balance_interval *= 2;

if (!sd_idle && sd->flags & SD_SHARE_CPUPOWER &&
- !test_sd_parent(sd, SD_POWERSAVINGS_BALANCE))
+ !test_sd_parent(sd, SD_POWERSAVINGS_BALANCE)) {
+ read_unlock_irqrestore(&dwrr_highest_lock, flags);
return -1;
+ }
+ read_unlock_irqrestore(&dwrr_highest_lock, flags);
return 0;
}

@@ -2699,6 +2956,7 @@ load_balance_newidle(int this_cpu, struc
int all_pinned = 0;
cpumask_t cpus = CPU_MASK_ALL;

+ BUG_ON(this_rq->active->round != dwrr_highest_round);
/*
* When power savings policy is enabled for the parent domain, idle
* sibling can pick up load irrespective of busy siblings. In this case,
@@ -2722,7 +2980,11 @@ redo:
&cpus);
if (!busiest) {
schedstat_inc(sd, lb_nobusyq[CPU_NEWLY_IDLE]);
- goto out_balanced;
+ /* find_busiest_group() found a busy group but
+ * find_busiest_queue failed because tasks on the busiest
+ * queue have exited. Let's re-search to find the next
+ * busiest group. */
+ goto redo;
}

BUG_ON(busiest == this_rq);
@@ -2745,6 +3007,17 @@ redo:
goto redo;
}
}
+ else {
+ /*
+ * Two cases we may get here: (1) we found the busiest
+ * queue but all its tasks have exited; (2) the busiest
+ * queue we found has only one, but high load task. In
+ * both cases, we should re-search for the busiest queue.
+ */
+ cpu_clear(cpu_of(busiest), cpus);
+ if (!cpus_empty(cpus))
+ goto redo;
+ }

if (!nr_moved) {
schedstat_inc(sd, lb_failed[CPU_NEWLY_IDLE]);
@@ -2790,7 +3063,7 @@ static void idle_balance(int this_cpu, s
interval = msecs_to_jiffies(sd->balance_interval);
if (time_after(next_balance, sd->last_balance + interval))
next_balance = sd->last_balance + interval;
- if (pulled_task)
+ if (pulled_task > 0)
break;
}
if (pulled_task || time_after(jiffies, this_rq->next_balance)) {
@@ -2815,6 +3088,7 @@ static void active_load_balance(struct r
int target_cpu = busiest_rq->push_cpu;
struct sched_domain *sd;
struct rq *target_rq;
+ unsigned long flags;

/* Is there any task to move? */
if (busiest_rq->nr_running <= 1)
@@ -2829,6 +3103,17 @@ static void active_load_balance(struct r
*/
BUG_ON(busiest_rq == target_rq);

+ preempt_disable();
+ spin_unlock(&busiest_rq->lock);
+ /*
+ * The read lock is necessary for getting the correct
+ * dwrr_highest_round value for target cpu that is potentially idle.
+ * It must be acquired before the rq locks to avoid deadlock.
+ */
+ read_lock_irqsave(&dwrr_highest_lock, flags);
+ spin_lock(&busiest_rq->lock);
+ preempt_enable();
+
/* move a task from busiest_rq to target_rq */
double_lock_balance(busiest_rq, target_rq);

@@ -2850,6 +3135,7 @@ static void active_load_balance(struct r
schedstat_inc(sd, alb_failed);
}
spin_unlock(&target_rq->lock);
+ read_unlock_irqrestore(&dwrr_highest_lock, flags);
}

#ifdef CONFIG_NO_HZ
@@ -3329,7 +3615,7 @@ pick_next_task(struct rq *rq, struct tas
* Optimization: we know that if all tasks are in
* the fair class we can call that function directly:
*/
- if (likely(rq->nr_running == rq->cfs.nr_running)) {
+ if (likely(rq->nr_running == rq->active->nr_running)) {
p = fair_sched_class.pick_next_task(rq, now);
if (likely(p))
return p;
@@ -3375,6 +3661,10 @@ need_resched_nonpreemptible:
spin_lock_irq(&rq->lock);
clear_tsk_need_resched(prev);

+ /* prev was inserted into round_expired. */
+ if (prev->se.on_rq == rq->round_expired)
+ dec_nr_running(prev, rq, __rq_clock(rq));
+
if (prev->state && !(preempt_count() & PREEMPT_ACTIVE)) {
if (unlikely((prev->state & TASK_INTERRUPTIBLE) &&
unlikely(signal_pending(prev)))) {
@@ -3385,13 +3675,80 @@ need_resched_nonpreemptible:
switch_count = &prev->nvcsw;
}

- if (unlikely(!rq->nr_running))
- idle_balance(cpu, rq);
-
- now = __rq_clock(rq);
+ now = 0;
+ if (unlikely(!rq->nr_running)) {
+ unsigned long flags = 0;
+ int write_unlocked = 0;
+ if (rq->active->round == dwrr_highest_round) {
+ preempt_disable();
+ spin_unlock(&rq->lock);
+ write_lock_irqsave(&dwrr_highest_lock, flags);
+ spin_lock(&rq->lock);
+ preempt_enable();
+ if (rq->active->round != dwrr_highest_round ||
+ rq->nr_running) {
+ write_unlocked = 1;
+ write_unlock_irqrestore(&dwrr_highest_lock,
+ flags);
+ } else
+ idle_balance(cpu, rq);
+ } else
+ write_unlocked = 1;
+
+ if (!rq->nr_running) {
+ if (unlikely(!rq->round_expired->nr_running)) {
+ if (!write_unlocked)
+ write_unlock_irqrestore(
+ &dwrr_highest_lock, flags);
+ goto switch_tasks;
+ } else {
+ /* Switch active and round_expired. */
+ struct cfs_rq *cfs_rq = rq->active;
+ rq->active = rq->round_expired;
+ rq->active->fair_clock = cfs_rq->fair_clock;
+ rq->active->exec_clock = cfs_rq->exec_clock;
+ rq->active->sleeper_bonus =
+ cfs_rq->sleeper_bonus;
+ rq->active->wait_runtime_overruns =
+ cfs_rq->wait_runtime_overruns;
+ rq->active->wait_runtime_underruns =
+ cfs_rq->wait_runtime_underruns;
+ rq->round_expired = cfs_rq;
+ rq->round_expired->round =
+ rq->active->round + 1;
+ rq->nr_running = rq->active->nr_running;
+ update_load_add(&rq->ls.load,
+ rq->active->load.weight);
+ if (rq->active->round > dwrr_highest_round)
+ dwrr_highest_round = rq->active->round;
+ /* Since we bypassed enqueue_entity(), each
+ * task's wait_start and wait_start_fair
+ * were not set properly. For all tasks,
+ * they should equal now. Record it in rq. */
+ now = __rq_clock(rq);
+ rq->active->wait_start_fair =
+ rq->active->fair_clock;
+ rq->active->wait_start = now;
+ if (!write_unlocked)
+ write_unlock_irqrestore(
+ &dwrr_highest_lock, flags);
+ }
+ }
+ else if (!write_unlocked)
+ write_unlock_irqrestore(&dwrr_highest_lock, flags);
+ }
+
+switch_tasks:
+ if (!now)
+ now = __rq_clock(rq);
prev->sched_class->put_prev_task(rq, prev, now);
next = pick_next_task(rq, prev, now);

+ if (next == rq->idle) {
+ rq->active->round = 0;
+ rq->round_expired->round = 1;
+ }
+
sched_info_switch(prev, next);

if (likely(prev != next)) {
@@ -3831,7 +4188,8 @@ EXPORT_SYMBOL(sleep_on_timeout);
void rt_mutex_setprio(struct task_struct *p, int prio)
{
unsigned long flags;
- int oldprio, on_rq;
+ int oldprio;
+ struct cfs_rq *on_rq;
struct rq *rq;
u64 now;

@@ -3842,7 +4200,8 @@ void rt_mutex_setprio(struct task_struct

oldprio = p->prio;
on_rq = p->se.on_rq;
- if (on_rq)
+ if (on_rq == rq->active ||
+ (on_rq == rq->round_expired && rt_prio(prio)))
dequeue_task(rq, p, 0, now);

if (rt_prio(prio))
@@ -3852,8 +4211,11 @@ void rt_mutex_setprio(struct task_struct

p->prio = prio;

- if (on_rq) {
+ if (on_rq == rq->active ||
+ (on_rq == rq->round_expired && rt_prio(prio))) {
enqueue_task(rq, p, 0, now);
+ if (on_rq == rq->round_expired)
+ inc_nr_running(p, rq, now);
/*
* Reschedule if we are currently running on this runqueue and
* our priority decreased, or if we are not currently running on
@@ -3873,7 +4235,8 @@ void rt_mutex_setprio(struct task_struct

void set_user_nice(struct task_struct *p, long nice)
{
- int old_prio, delta, on_rq;
+ int old_prio, delta;
+ struct cfs_rq *on_rq;
unsigned long flags;
struct rq *rq;
u64 now;
@@ -3897,7 +4260,7 @@ void set_user_nice(struct task_struct *p
goto out_unlock;
}
on_rq = p->se.on_rq;
- if (on_rq) {
+ if (on_rq == rq->active) {
dequeue_task(rq, p, 0, now);
dec_load(rq, p, now);
}
@@ -3908,7 +4271,7 @@ void set_user_nice(struct task_struct *p
p->prio = effective_prio(p);
delta = p->prio - old_prio;

- if (on_rq) {
+ if (on_rq == rq->active) {
enqueue_task(rq, p, 0, now);
inc_load(rq, p, now);
/*
@@ -4066,7 +4429,8 @@ __setscheduler(struct rq *rq, struct tas
int sched_setscheduler(struct task_struct *p, int policy,
struct sched_param *param)
{
- int retval, oldprio, oldpolicy = -1, on_rq;
+ int retval, oldprio, oldpolicy = -1;
+ struct cfs_rq *on_rq;
unsigned long flags;
struct rq *rq;

@@ -4885,7 +5249,9 @@ EXPORT_SYMBOL_GPL(set_cpus_allowed);
static int __migrate_task(struct task_struct *p, int src_cpu, int dest_cpu)
{
struct rq *rq_dest, *rq_src;
- int ret = 0, on_rq;
+ struct cfs_rq *on_rq;
+ int ret = 0, highest_locked = 0;
+ unsigned long flags = 0;

if (unlikely(cpu_is_offline(dest_cpu)))
return ret;
@@ -4893,6 +5259,23 @@ static int __migrate_task(struct task_st
rq_src = cpu_rq(src_cpu);
rq_dest = cpu_rq(dest_cpu);

+ /* If we move a task to an idle rq_dest, we'll need to acquire
+ * dwrr_highest_read_lock so that we can correctly update the idle
+ * cpu's round number. We could acquire the lock only if rq_dest is
+ * idle, but a subtle case is that rq_dest is busy now but later
+ * becomes idle when we enqueue a task to it. Thus, we
+ * conservatively acquire the lock regardless of the state of
+ * rq_dest. */
+
+ if (!rt_task(p)) {
+ read_lock_irqsave(&dwrr_highest_lock, flags);
+ highest_locked = 1;
+ if (cpu_isset(src_cpu, p->cpus_allowed) &&
+ rq_dest->curr != rq_dest->idle &&
+ rq_dest->active->round != dwrr_highest_round)
+ goto out_dwrr;
+ }
+
double_rq_lock(rq_src, rq_dest);
/* Already moved. */
if (task_cpu(p) != src_cpu)
@@ -4912,6 +5295,9 @@ static int __migrate_task(struct task_st
ret = 1;
out:
double_rq_unlock(rq_src, rq_dest);
+out_dwrr:
+ if (highest_locked)
+ read_unlock_irqrestore(&dwrr_highest_lock, flags);
return ret;
}

@@ -5842,6 +6228,8 @@ static int build_sched_domains(const cpu
SD_NODES_PER_DOMAIN*cpus_weight(nodemask)) {
sd = &per_cpu(allnodes_domains, i);
*sd = SD_ALLNODES_INIT;
+ sd->flags |= (SD_BALANCE_NEWIDLE | SD_BALANCE_FORK
+ | SD_BALANCE_EXEC);
sd->span = *cpu_map;
cpu_to_allnodes_group(i, cpu_map, &sd->groups);
p = sd;
@@ -5851,6 +6239,7 @@ static int build_sched_domains(const cpu

sd = &per_cpu(node_domains, i);
*sd = SD_NODE_INIT;
+ sd->flags |= SD_BALANCE_NEWIDLE;
sd->span = sched_domain_node_span(cpu_to_node(i));
sd->parent = p;
if (p)
@@ -5861,6 +6250,7 @@ static int build_sched_domains(const cpu
p = sd;
sd = &per_cpu(phys_domains, i);
*sd = SD_CPU_INIT;
+ sd->flags |= SD_BALANCE_FORK;
sd->span = nodemask;
sd->parent = p;
if (p)
@@ -5871,6 +6261,7 @@ static int build_sched_domains(const cpu
p = sd;
sd = &per_cpu(core_domains, i);
*sd = SD_MC_INIT;
+ sd->flags |= SD_BALANCE_FORK;
sd->span = cpu_coregroup_map(i);
cpus_and(sd->span, sd->span, *cpu_map);
sd->parent = p;
@@ -5882,6 +6273,7 @@ static int build_sched_domains(const cpu
p = sd;
sd = &per_cpu(cpu_domains, i);
*sd = SD_SIBLING_INIT;
+ sd->flags |= SD_BALANCE_FORK;
sd->span = cpu_sibling_map[i];
cpus_and(sd->span, sd->span, *cpu_map);
sd->parent = p;
@@ -6271,13 +6663,18 @@ int in_sched_functions(unsigned long add
&& addr < (unsigned long)__sched_text_end);
}

-static inline void init_cfs_rq(struct cfs_rq *cfs_rq, struct rq *rq)
+static inline void init_cfs_rq(struct rq *rq)
{
- cfs_rq->tasks_timeline = RB_ROOT;
- cfs_rq->fair_clock = 1;
-#ifdef CONFIG_FAIR_GROUP_SCHED
- cfs_rq->rq = rq;
-#endif
+ rq->active = rq->cfs;
+ rq->round_expired = rq->cfs + 1;
+ rq->active->tasks_timeline = RB_ROOT;
+ rq->active->fair_clock = 1;
+ rq->active->round = 0;
+ rq->round_expired->tasks_timeline = RB_ROOT;
+ rq->round_expired->fair_clock = 1;
+ rq->round_expired->round = 1;
+ rq->active->rq = rq;
+ rq->round_expired->rq = rq;
}

void __init sched_init(void)
@@ -6302,10 +6699,10 @@ void __init sched_init(void)
lockdep_set_class(&rq->lock, &rq->rq_lock_key);
rq->nr_running = 0;
rq->clock = 1;
- init_cfs_rq(&rq->cfs, rq);
+ init_cfs_rq(rq);
#ifdef CONFIG_FAIR_GROUP_SCHED
INIT_LIST_HEAD(&rq->leaf_cfs_rq_list);
- list_add(&rq->cfs.leaf_cfs_rq_list, &rq->leaf_cfs_rq_list);
+ list_add(&rq->active->leaf_cfs_rq_list, &rq->leaf_cfs_rq_list);
#endif
rq->ls.load_update_last = now;
rq->ls.load_update_start = now;
@@ -6394,7 +6791,7 @@ void normalize_rt_tasks(void)
struct task_struct *g, *p;
unsigned long flags;
struct rq *rq;
- int on_rq;
+ struct cfs_rq *on_rq;

read_lock_irq(&tasklist_lock);
do_each_thread(g, p) {
@@ -6406,7 +6803,7 @@ void normalize_rt_tasks(void)
p->se.sleep_start = 0;
p->se.sleep_start_fair = 0;
p->se.block_start = 0;
- task_rq(p)->cfs.fair_clock = 0;
+ task_rq(p)->active->fair_clock = 0;
task_rq(p)->clock = 0;

if (!rt_task(p)) {
diff -uprN linux-2.6.23-rc1/kernel/sched_debug.c linux-2.6.23-rc1-dwrr/kernel/sched_debug.c
--- linux-2.6.23-rc1/kernel/sched_debug.c 2007-07-22 18:42:28.000000000 -0700
+++ linux-2.6.23-rc1-dwrr/kernel/sched_debug.c 2007-07-22 21:37:19.000000000 -0700
@@ -40,7 +40,7 @@ print_task(struct seq_file *m, struct rq
"%15Ld %15Ld %15Ld %15Ld %15Ld\n",
p->comm, p->pid,
(long long)p->se.fair_key,
- (long long)(p->se.fair_key - rq->cfs.fair_clock),
+ (long long)(p->se.fair_key - rq->active->fair_clock),
(long long)p->se.wait_runtime,
(long long)(p->nvcsw + p->nivcsw),
p->prio,
@@ -72,6 +72,11 @@ static void print_rq(struct seq_file *m,
if (!p->se.on_rq || task_cpu(p) != rq_cpu)
continue;

+ if (p->se.on_rq == rq->active)
+ SEQ_printf(m, "[active] ");
+ else
+ SEQ_printf(m, "[expired] ");
+
print_task(m, rq, p, now);
} while_each_thread(g, p);

@@ -114,6 +119,7 @@ void print_cfs_rq(struct seq_file *m, in
P(wait_runtime_overruns);
P(wait_runtime_underruns);
P(sleeper_bonus);
+ P(round);
#undef P

print_cfs_rq_runtime_sum(m, cpu, cfs_rq);
diff -uprN linux-2.6.23-rc1/kernel/sched_fair.c linux-2.6.23-rc1-dwrr/kernel/sched_fair.c
--- linux-2.6.23-rc1/kernel/sched_fair.c 2007-07-22 18:42:28.000000000 -0700
+++ linux-2.6.23-rc1-dwrr/kernel/sched_fair.c 2007-07-22 21:58:05.000000000 -0700
@@ -62,6 +62,14 @@ unsigned int sysctl_sched_stat_granulari
unsigned int sysctl_sched_runtime_limit __read_mostly;

/*
+ * DWRR base round slice. For each sched_entity, its round slice equals its
+ * normalized weight (i.e., weight/NICE_0_LOAD) multiplied by the base round
+ * slice and controls how long it runs in a round.
+ * (default: 30 msec, units: nanoseconds)
+ */
+u64 sysctl_sched_base_round_slice __read_mostly = 30000000;
+
+/*
* Debugging: various feature bits
*/
enum {
@@ -87,14 +95,14 @@ extern struct sched_class fair_sched_cla
* CFS operations on generic schedulable entities:
*/

-#ifdef CONFIG_FAIR_GROUP_SCHED
-
/* cpu runqueue to which this cfs_rq is attached */
static inline struct rq *rq_of(struct cfs_rq *cfs_rq)
{
return cfs_rq->rq;
}

+#ifdef CONFIG_FAIR_GROUP_SCHED
+
/* currently running entity (if any) on this cfs_rq */
static inline struct sched_entity *cfs_rq_curr(struct cfs_rq *cfs_rq)
{
@@ -112,11 +120,6 @@ set_cfs_rq_curr(struct cfs_rq *cfs_rq, s

#else /* CONFIG_FAIR_GROUP_SCHED */

-static inline struct rq *rq_of(struct cfs_rq *cfs_rq)
-{
- return container_of(cfs_rq, struct rq, cfs);
-}
-
static inline struct sched_entity *cfs_rq_curr(struct cfs_rq *cfs_rq)
{
struct rq *rq = rq_of(cfs_rq);
@@ -139,6 +142,12 @@ static inline struct task_struct *task_o
return container_of(se, struct task_struct, se);
}

+static inline u64 weight_to_round_slice(unsigned long weight)
+{
+ /* Nice 0 receives round slice of sysctl_sched_base_round_slice;
+ * others proportional to their weight. */
+ return (weight * sysctl_sched_base_round_slice) >> NICE_0_SHIFT;
+}

/**************************************************************
* Scheduling class tree data structure manipulation methods:
@@ -150,6 +159,7 @@ static inline struct task_struct *task_o
static inline void
__enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se)
{
+ struct rq *rq = rq_of(cfs_rq);
struct rb_node **link = &cfs_rq->tasks_timeline.rb_node;
struct rb_node *parent = NULL;
struct sched_entity *entry;
@@ -185,7 +195,9 @@ __enqueue_entity(struct cfs_rq *cfs_rq,
rb_insert_color(&se->run_node, &cfs_rq->tasks_timeline);
update_load_add(&cfs_rq->load, se->load.weight);
cfs_rq->nr_running++;
- se->on_rq = 1;
+ se->on_rq = cfs_rq;
+ if (cfs_rq == rq->round_expired)
+ se->wait_runtime = 0;
}

static inline void
@@ -196,7 +208,7 @@ __dequeue_entity(struct cfs_rq *cfs_rq,
rb_erase(&se->run_node, &cfs_rq->tasks_timeline);
update_load_sub(&cfs_rq->load, se->load.weight);
cfs_rq->nr_running--;
- se->on_rq = 0;
+ se->on_rq = NULL;
}

static inline struct rb_node *first_fair(struct cfs_rq *cfs_rq)
@@ -340,6 +352,7 @@ static void update_curr(struct cfs_rq *c
delta_exec = (unsigned long)(now - curr->exec_start);

curr->delta_exec += delta_exec;
+ curr->round_slice_used += delta_exec;

if (unlikely(curr->delta_exec > sysctl_sched_stat_granularity)) {
__update_curr(cfs_rq, curr, now);
@@ -383,13 +396,19 @@ calc_weighted(unsigned long delta, unsig
static void
update_stats_enqueue(struct cfs_rq *cfs_rq, struct sched_entity *se, u64 now)
{
+ struct rq *rq = rq_of(cfs_rq);
s64 key;

+ /* Are we enqueueing a round-expired task? */
+ if (cfs_rq == rq->round_expired)
+ /* Flag wait_start as invalid and re-compute it when
+ * the task moves back to active. */
+ se->wait_start = ULLONG_MAX;
/*
* Are we enqueueing a waiting task? (for current tasks
* a dequeue/enqueue event is a NOP)
*/
- if (se != cfs_rq_curr(cfs_rq))
+ else if (se != cfs_rq_curr(cfs_rq))
update_stats_wait_start(cfs_rq, se, now);
/*
* Update the key:
@@ -445,8 +464,19 @@ update_stats_wait_end(struct cfs_rq *cfs
{
unsigned long delta_fair;

- delta_fair = (unsigned long)min((u64)(2*sysctl_sched_runtime_limit),
+ if (se->on_rq == rq_of(cfs_rq)->round_expired)
+ delta_fair = 0;
+ else {
+ if (se->wait_start == ULLONG_MAX) {
+ /* First time here since se was moved from
+ * round-expired to active. */
+ se->wait_start_fair = cfs_rq->wait_start_fair;
+ se->wait_start = cfs_rq->wait_start;
+ }
+ delta_fair = (unsigned long)min(
+ (u64)(2*sysctl_sched_runtime_limit),
(u64)(cfs_rq->fair_clock - se->wait_start_fair));
+ }

se->delta_fair_run += delta_fair;
if (unlikely(abs(se->delta_fair_run) >=
@@ -462,7 +492,7 @@ update_stats_wait_end(struct cfs_rq *cfs
static inline void
update_stats_dequeue(struct cfs_rq *cfs_rq, struct sched_entity *se, u64 now)
{
- update_curr(cfs_rq, now);
+ update_curr(rq_of(cfs_rq)->active, now);
/*
* Mark the end of the wait period if dequeueing a
* waiting task:
@@ -585,12 +615,14 @@ static void
enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se,
int wakeup, u64 now)
{
+ struct rq *rq = rq_of(cfs_rq);
+
/*
* Update the fair clock.
*/
- update_curr(cfs_rq, now);
+ update_curr(rq->active, now);

- if (wakeup)
+ if (cfs_rq == rq->active && wakeup)
enqueue_sleeper(cfs_rq, se, now);

update_stats_enqueue(cfs_rq, se, now);
@@ -626,15 +658,22 @@ static void
__check_preempt_curr_fair(struct cfs_rq *cfs_rq, struct sched_entity *se,
struct sched_entity *curr, unsigned long granularity)
{
- s64 __delta = curr->fair_key - se->fair_key;
+ s64 __delta = 0;
+ struct rq *rq = rq_of(cfs_rq);
+
+ if (se)
+ __delta = curr->fair_key - se->fair_key;

/*
* Take scheduling granularity into account - do not
* preempt the current task unless the best task has
- * a larger than sched_granularity fairness advantage:
- */
- if (__delta > niced_granularity(curr, granularity))
- resched_task(rq_of(cfs_rq)->curr);
+ * a larger than sched_granularity fairness advantage.
+ * Always preempt the current task if it's been moved
+ * to round_expired.
+ */
+ if (__delta > niced_granularity(curr, granularity)
+ || curr->on_rq == rq->round_expired)
+ resched_task(rq->curr);
}

static inline void
@@ -664,6 +703,8 @@ static struct sched_entity *pick_next_en
static void
put_prev_entity(struct cfs_rq *cfs_rq, struct sched_entity *prev, u64 now)
{
+ struct rq *rq = rq_of(cfs_rq);
+
/*
* If still on the runqueue then deactivate_task()
* was not called and update_curr() has to be done:
@@ -673,7 +714,7 @@ put_prev_entity(struct cfs_rq *cfs_rq, s

update_stats_curr_end(cfs_rq, prev, now);

- if (prev->on_rq)
+ if (prev->on_rq == rq->active)
update_stats_wait_start(cfs_rq, prev, now);
set_cfs_rq_curr(cfs_rq, NULL);
}
@@ -683,20 +724,37 @@ static void entity_tick(struct cfs_rq *c
struct rq *rq = rq_of(cfs_rq);
struct sched_entity *next;
u64 now = __rq_clock(rq);
+ u64 round_slice;
+
+ if (curr->on_rq == rq->round_expired) {
+ /* Task is in round expired but was not scheduled yet. */
+ set_tsk_need_resched(rq_of(cfs_rq)->curr);
+ return;
+ }

/*
* Dequeue and enqueue the task to update its
* position within the tree:
*/
dequeue_entity(cfs_rq, curr, 0, now);
- enqueue_entity(cfs_rq, curr, 0, now);
+
+ /* Check if the task has used up its entitled round slice. */
+ round_slice = weight_to_round_slice(curr->load.weight);
+ if (curr->round_slice_used >= round_slice) {
+ curr->round_slice_used -= round_slice;
+ enqueue_entity(rq->round_expired, curr, 0, now);
+ } else
+ enqueue_entity(cfs_rq, curr, 0, now);

/*
* Reschedule if another task tops the current one.
*/
- next = __pick_next_entity(cfs_rq);
- if (next == curr)
- return;
+ if (cfs_rq->nr_running) {
+ next = __pick_next_entity(cfs_rq);
+ if (next == curr)
+ return;
+ } else
+ next = NULL;

__check_preempt_curr_fair(cfs_rq, next, curr, sysctl_sched_granularity);
}
@@ -734,7 +792,7 @@ static inline struct cfs_rq *group_cfs_r
static inline struct cfs_rq *cpu_cfs_rq(struct cfs_rq *cfs_rq, int this_cpu)
{
/* A later patch will take group into account */
- return &cpu_rq(this_cpu)->cfs;
+ return cpu_rq(this_cpu)->active;
}

/* Iterate thr' all leaf cfs_rq's on a runqueue */
@@ -757,7 +815,7 @@ static inline int is_same_group(struct t

static inline struct cfs_rq *task_cfs_rq(struct task_struct *p)
{
- return &task_rq(p)->cfs;
+ return task_rq(p)->active;
}

static inline struct cfs_rq *cfs_rq_of(struct sched_entity *se)
@@ -765,7 +823,7 @@ static inline struct cfs_rq *cfs_rq_of(s
struct task_struct *p = task_of(se);
struct rq *rq = task_rq(p);

- return &rq->cfs;
+ return rq->active;
}

/* runqueue "owned" by this group */
@@ -776,11 +834,11 @@ static inline struct cfs_rq *group_cfs_r

static inline struct cfs_rq *cpu_cfs_rq(struct cfs_rq *cfs_rq, int this_cpu)
{
- return &cpu_rq(this_cpu)->cfs;
+ return cpu_rq(this_cpu)->active;
}

#define for_each_leaf_cfs_rq(rq, cfs_rq) \
- for (cfs_rq = &rq->cfs; cfs_rq; cfs_rq = NULL)
+ for (cfs_rq = rq->active; cfs_rq; cfs_rq = NULL)

static inline int is_same_group(struct task_struct *curr, struct task_struct *p)
{
@@ -820,7 +878,7 @@ dequeue_task_fair(struct rq *rq, struct
struct sched_entity *se = &p->se;

for_each_sched_entity(se) {
- cfs_rq = cfs_rq_of(se);
+ cfs_rq = se->on_rq;
dequeue_entity(cfs_rq, se, sleep, now);
/* Don't dequeue parent if it has other entities besides us */
if (cfs_rq->load.weight)
@@ -836,6 +894,8 @@ static void yield_task_fair(struct rq *r
struct cfs_rq *cfs_rq = task_cfs_rq(p);
u64 now = __rq_clock(rq);

+ if (p->se.on_rq == rq->round_expired)
+ return;
/*
* Dequeue and enqueue the task to update its
* position within the tree:
@@ -872,7 +932,7 @@ static void check_preempt_curr_fair(stru

static struct task_struct *pick_next_task_fair(struct rq *rq, u64 now)
{
- struct cfs_rq *cfs_rq = &rq->cfs;
+ struct cfs_rq *cfs_rq = rq->active;
struct sched_entity *se;

if (unlikely(!cfs_rq->nr_running))
@@ -1073,6 +1133,8 @@ static void task_new_fair(struct rq *rq,

__enqueue_entity(cfs_rq, se);
inc_nr_running(p, rq, now);
+ if (rq->curr == rq->idle)
+ dwrr_update_idle(p, rq);
}

#ifdef CONFIG_FAIR_GROUP_SCHED
diff -uprN linux-2.6.23-rc1/kernel/sysctl.c linux-2.6.23-rc1-dwrr/kernel/sysctl.c
--- linux-2.6.23-rc1/kernel/sysctl.c 2007-07-22 18:42:28.000000000 -0700
+++ linux-2.6.23-rc1-dwrr/kernel/sysctl.c 2007-07-22 21:59:46.000000000 -0700
@@ -217,6 +217,8 @@ static unsigned long min_sched_granulari
static unsigned long max_sched_granularity_ns = 1000000000; /* 1 second */
static unsigned long min_wakeup_granularity_ns; /* 0 usecs */
static unsigned long max_wakeup_granularity_ns = 1000000000; /* 1 second */
+static unsigned long min_sched_base_round_slice = NSEC_PER_TICK * 4;
+static unsigned long max_sched_base_round_slice = ULONG_MAX;
#endif

static ctl_table kern_table[] = {
@@ -276,6 +278,17 @@ static ctl_table kern_table[] = {
.extra1 = &min_sched_granularity_ns,
.extra2 = &max_sched_granularity_ns,
},
+ {
+ .ctl_name = CTL_UNNUMBERED,
+ .procname = "sched_base_round_slice",
+ .data = &sysctl_sched_base_round_slice,
+ .maxlen = sizeof(unsigned long),
+ .mode = 0644,
+ .proc_handler = &proc_doulongvec_minmax,
+ .strategy = &sysctl_intvec,
+ .extra1 = &min_sched_base_round_slice,
+ .extra2 = &max_sched_base_round_slice,
+ },
{
.ctl_name = CTL_UNNUMBERED,
.procname = "sched_child_runs_first",

Andi Kleen

Jul 23, 2007, 3:10:11 PM

Tong Li <tong...@intel.com> writes:
> +
> + read_lock_irqsave(&dwrr_highest_lock, flags);

That lock looks like it would bounce between CPUs like crazy.
Did you try any benchmarks on a larger system?

-Andi

Li, Tong N

Jul 23, 2007, 5:20:10 PM

I benchmarked an early version of this code (against 2.6.21) with
SPECjbb, SPEComp, kernbench, etc. on an 8-processor system, and didn't
see any slowdown compared to the stock scheduler. I'll generate the data
again with this version of the code. On the other hand, if locking does
become a problem for certain systems/workloads, increasing
sysctl_base_round_slice can reduce the locking frequency and alleviate
the problem, at the cost of being relatively less fair across the CPUs.

tong

Chris Friesen

Jul 23, 2007, 5:30:09 PM

Li, Tong N wrote:
> On the other hand, if locking does
> become a problem for certain systems/workloads, increasing
> sysctl_base_round_slice can reduce the locking frequency and alleviate
> the problem, at the cost of being relatively less fair across the CPUs.

If locking does become a problem, it may make sense to switch instead to
a lockless algorithm of some sort...

Chris

Chris Snook

Jul 23, 2007, 7:50:08 PM

This patch is massive overkill. Maybe you're not seeing the overhead on your
8-way box, but I bet we'd see it on a 4096-way NUMA box with a partially-RT
workload. Do you have any data justifying the need for this patch?

Doing anything globally is expensive, and should be avoided at all costs. The
scheduler already rebalances when a CPU is idle, so you're really just
rebalancing the overload here. On a server workload, we don't necessarily want
to do that, since the overload may be multiple threads spawned to service a
single request, and could be sharing a lot of data.

Instead of an explicit system-wide fairness invariant (which will get very hard
to enforce when you throw SCHED_FIFO processes into the mix and the scheduler
isn't running on some CPUs), try a simpler invariant. If we guarantee that the
load on CPU X does not differ from the load on CPU (X+1)%N by more than some
small constant, then we know that the system is fairly balanced. We can achieve
global fairness with local balancing, and avoid all this overhead. This has the
added advantage of keeping most of the migrations core/socket/node-local on
SMT/multicore/NUMA systems.
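
As a rough sketch of that invariant (illustrative user-space code with
made-up names and numbers, not a proposed kernel implementation), balancing
each CPU only against its ring neighbour already bounds the global imbalance:

#include <stdio.h>

#define NCPUS 8
#define DELTA 1                   /* tolerated neighbour load difference */

static int load[NCPUS] = { 5, 1, 1, 1, 1, 1, 1, 1 };

/* One local balancing step: CPU x exchanges load only with CPU (x+1)%N. */
static void balance_step(int x)
{
	int y = (x + 1) % NCPUS;

	while (load[x] > load[y] + DELTA) {   /* push excess to the neighbour */
		load[x]--;
		load[y]++;
	}
	while (load[y] > load[x] + DELTA) {   /* pull excess from the neighbour */
		load[y]--;
		load[x]++;
	}
}

int main(void)
{
	int pass, x;

	/*
	 * After repeated local passes every adjacent pair differs by at most
	 * DELTA, so the spread across the whole ring is bounded by
	 * DELTA * (NCPUS - 1) without any global lock.
	 */
	for (pass = 0; pass < 2 * NCPUS; pass++)
		for (x = 0; x < NCPUS; x++)
			balance_step(x);

	for (x = 0; x < NCPUS; x++)
		printf("cpu%d: load %d\n", x, load[x]);
	return 0;
}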

-- Chris

Chris Snook

Jul 24, 2007, 4:10:08 AM

To clarify, I'm not suggesting that the "balance with cpu (x+1)%n only"
algorithm is the only way to do this. Rather, I'm pointing out that
even an extremely simple algorithm can give you fair loading when you
already have CFS managing the runqueues. There are countless more
sophisticated ways we could do this without using global locking, or
possibly without any locking at all, other than the locking we already
use during migration.

Andi Kleen

Jul 24, 2007, 5:50:14 AM

On Mon, Jul 23, 2007 at 03:25:12PM -0600, Chris Friesen wrote:
> Li, Tong N wrote:
> >On the other hand, if locking does
> >become a problem for certain systems/workloads, increasing
> >sysctl_base_round_slice can reduce the locking frequency and alleviate
> >the problem, at the cost of being relatively less fair across the CPUs.
>
> If locking does become a problem, it may make sense to switch instead to
> a lockless algorithm of some sort...

That usually has the same issue with cache lines. It would be better
to analyze first if such global synchronization is really needed.
I hope not.

-Andi

Tong Li

Jul 24, 2007, 1:10:09 PM

On Mon, 23 Jul 2007, Chris Snook wrote:

> This patch is massive overkill. Maybe you're not seeing the overhead on your
> 8-way box, but I bet we'd see it on a 4096-way NUMA box with a partially-RT
> workload. Do you have any data justifying the need for this patch?
>
> Doing anything globally is expensive, and should be avoided at all costs.
> The scheduler already rebalances when a CPU is idle, so you're really just
> rebalancing the overload here. On a server workload, we don't necessarily
> want to do that, since the overload may be multiple threads spawned to
> service a single request, and could be sharing a lot of data.
>
> Instead of an explicit system-wide fairness invariant (which will get very
> hard to enforce when you throw SCHED_FIFO processes into the mix and the
> scheduler isn't running on some CPUs), try a simpler invariant. If we
> guarantee that the load on CPU X does not differ from the load on CPU (X+1)%N
> by more than some small constant, then we know that the system is fairly
> balanced. We can achieve global fairness with local balancing, and avoid all
> this overhead. This has the added advantage of keeping most of the
> migrations core/socket/node-local on SMT/multicore/NUMA systems.
>

Chris,

These are all good comments. Thanks. I see three concerns and I'll try to
address each.

1. Unjustified effort/cost

My view is that fairness (or proportional fairness) is a first-order
metric and necessary in many cases even at the cost of performance. A
server running multiple client apps certainly doesn't want the clients to
see that they are getting different amounts of service, assuming the
clients are of equal importance (priority). When the clients have
different priorities, the server also wants to give them service time
proportional to their priority/weight. The same is true for desktops,
where users want to nice tasks and see an effect that's consistent with
what they expect, i.e., task CPU time should be proportional to their nice
values. The point is that it's important to enforce fairness because it
enables users to control the system in a deterministic way and it helps
each task get good response time. CFS achieves this on local CPUs and this
patch makes the support stronger for SMPs. It's overkill to enforce an
unnecessary degree of fairness, but it is necessary to enforce an error
bound, even if large, such that the user can reliably know what kind of
CPU time (even performance) he'd get after making a nice value change.
This patch ensures an error bound of (max task weight currently in system)
* sysctl_base_round_slice compared to an idealized fair system.
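
As an illustrative calculation of that bound (assuming the 2.6.23
prio_to_weight table, where nice 0 has weight 1024 and nice -20 has weight
88761, and the default 30ms sysctl_base_round_slice): with only nice-0 tasks
the maximum normalized weight is 1, so each task's CPU time stays within
about 30ms of the idealized system; if a nice -20 task is present, its
normalized weight is about 86.7, so the bound grows to roughly 86.7 * 30ms,
i.e., about 2.6 seconds.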

2. High performance overhead

Two sources of overhead: (1) the global rw_lock, and (2) task migrations.
I agree they can be problems on NUMA, but I'd argue they are not on SMPs.
Any global lock can cause two performance problems: (1) serialization, and
(2) excessive remote cache accesses and traffic. IMO (1) is not a problem
since this is a rw_lock and a write_lock occurs infrequently only when all
tasks in the system finish the current round. (2) could be a problem as
every read/write lock causes an invalidation. It could be improved by
using Nick's ticket lock. On the other hand, this is a single cache line
and it's invalidated only when a CPU finishes all tasks in its local
active RB tree, where each nice 0 task takes sysctl_base_round_slice
(e.g., 30ms) to finish, so it looks to me the invalidations would be
infrequent enough and could be noise in the whole system.

The patch can introduce more task migrations. I don't think it's a problem
in SMPs. For one, CFS already is doing context switches more frequently
than before, and thus even if a task doesn't migrate, it may still miss
in the local cache because the previous task kicked out its data. In this
sense, migration doesn't add more cache misses. Now, in NUMA, the penalty
of a cache miss can be much higher if we migrate off-node. Here, I agree
that this can be a problem. For NUMA, I'd expect the user to tune
sysctl_base_round_slice to a large enough value to avoid frequent
migrations (or just affinitize tasks). Benchmarking could also help us
determine a reasonable default sysctl_base_round_slice for NUMA.

3. Hard to enforce fairness when there are non-SCHED_FAIR tasks

All we want is to enforce fairness among the SCHED_FAIR tasks. Leveraging
CFS, this patch makes sure the relative CPU time of any two SCHED_FAIR tasks
on an SMP equals their weight ratio. In other words, if the SCHED_FAIR tasks
as a whole are given X amount of CPU time, then a task of weight w is
guaranteed a share of X proportional to w in any interval of time.

tong

Li, Tong N

Jul 24, 2007, 1:20:09 PM

On Tue, 2007-07-24 at 04:07 -0400, Chris Snook wrote:

> To clarify, I'm not suggesting that the "balance with cpu (x+1)%n only"
> algorithm is the only way to do this. Rather, I'm pointing out that
> even an extremely simple algorithm can give you fair loading when you
> already have CFS managing the runqueues. There are countless more
> sophisticated ways we could do this without using global locking, or
> possibly without any locking at all, other than the locking we already
> use during migration.
>
> -- Chris

Yes, as Andi and Chris also pointed out, I'll think about whether global
synchronization can be removed or relaxed.

Thanks,

tong

Chris Snook

unread,
Jul 24, 2007, 2:10:10 PM7/24/07
to

In the cases where it's critical, we have realtime. In the cases where it's
important, this implementation won't keep latency low enough to make people
happier. If you've got a test case to prove me wrong, I'd like to see it.

> A
> server running multiple client apps certainly doesn't want the clients
> to see that they are getting different amounts of service, assuming the
> clients are of equal importance (priority).

A conventional server receives client requests, does a brief amount of work, and
then gives a response. This patch doesn't help that workload. This patch helps
the case where you've got batch jobs running on a slightly overloaded compute
server, and unfairness means you end up waiting for a couple threads to finish
at the end while CPUs sit idle. I don't think it's that big of a problem, and
if it is, I think we can solve it in a more elegant way than reintroducing
expired queues.

> When the clients have
> different priorities, the server also wants to give them service time
> proportional to their priority/weight. The same is true for desktops,
> where users want to nice tasks and see an effect that's consistent with
> what they expect, i.e., task CPU time should be proportional to their
> nice values. The point is that it's important to enforce fairness
> because it enables users to control the system in a deterministic way
> and it helps each task get good response time. CFS achieves this on
> local CPUs and this patch makes the support stronger for SMPs. It's
> overkill to enforce unnecessary degree of fairness, but it is necessary
> to enforce an error bound, even if large, such that the user can
> reliably know what kind of CPU time (even performance) he'd get after
> making a nice value change.

Doesn't CFS already do this?

> This patch ensures an error bound of (max
> task weight currently in system) * sysctl_base_round_slice compared to
> an idealized fair system.

The thing that bugs me about this is the diminishing returns. It looks like it
will only give a substantial benefit when system load is somewhere between 1.0
and 2.0. On a heavily-loaded system, CFS will do the right thing within a good
margin of error, and on an underloaded system, even a naive scheduler will do
the right thing. If you want to optimize smp fairness in this range, that's
great, but there's probably a lighter-weight way to do it.

> 2. High performance overhead
>
> Two sources of overhead: (1) the global rw_lock, and (2) task
> migrations. I agree they can be problems on NUMA, but I'd argue they are
> not on SMPs. Any global lock can cause two performance problems: (1)
> serialization, and (2) excessive remote cache accesses and traffic. IMO
> (1) is not a problem since this is a rw_lock and a write_lock occurs
> infrequently only when all tasks in the system finish the current round.
> (2) could be a problem as every read/write lock causes an invalidation.
> It could be improved by using Nick's ticket lock. On the other hand,
> this is a single cache line and it's invalidated only when a CPU
> finishes all tasks in its local active RB tree, where each nice 0 task
> takes sysctl_base_round_slice (e.g., 30ms) to finish, so it looks to me
> the invalidations would be infrequent enough and could be noise in the
> whole system.

Task migrations don't bother me all that much. Since we're migrating the
*overload*, I expect those processes to be fairly cache-cold whenever we get
around to them anyway. It'd be nice to be SMT/multicore/NUMA-smart about the
migrations, but that's an implementation detail.

The global lock is what really bothers me. It's not just that it'll suck in
certain rare cases (though I think it will), but it's really more the top-down
design. If we take a bottom-up approach, first keeping fairness between
neighbors, it'll function identically on the low-end systems, but it'll be a
huge win on the big systems, instead of a potential bottleneck.

> The patch can introduce more task migrations. I don't think it's a
> problem in SMPs. For one, CFS already is doing context switches more
> frequently than before, and thus even if a task doesn't migration, it
> may still miss in the local cache because the previous task kicked out
> its data. In this sense, migration doesn't add more cache misses. Now,
> in NUMA, the penalty of a cache miss can be much higher if we migrate
> off-node. Here, I agree that this can be a problem. For NUMA, I'd expect
> the user to tune sysctl_base_round_slice to a large enough value to
> avoid frequent migrations (or just affinitize tasks). Benchmarking could
> also help us determine a reasonable default sysctl_base_round_slice for
> NUMA.

The kernel should be tuned reasonably by default. The days in which NUMA
implied a supercomputer tuned by specialists are over. NUMA is commodity now.
As a simple heuristic, I suggest multiplying your default
sysctl_base_round_slice by (log2(num_nodes)+1). This scales the migration cost
proportionally on most simple topologies. People with gigantic torus-topology
supercomputers who still bring in specialists to tune everything might still
want to raise it.
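
A sketch of that heuristic (illustrative only; the helper below is
hypothetical, not existing kernel code):

#include <linux/log2.h>
#include <linux/types.h>

/*
 * Scale the base round slice by log2(nr_nodes) + 1 so that larger NUMA
 * topologies use longer rounds and therefore migrate less often.
 * Assumes nr_nodes >= 1.
 */
static u64 dwrr_scaled_round_slice(u64 base_round_slice, unsigned int nr_nodes)
{
        return base_round_slice * (ilog2(nr_nodes) + 1);
}

With one node this leaves the default unchanged, with two nodes it doubles
it, and with four nodes it triples it.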

> 3. Hard to enforce fairness when there are non-SCHED_FAIR tasks
>
> All we want is to enforce fairness among the SCHED_FAIR tasks.
> Leveraging CFS, this patch makes sure relative CPU time of two
> SCHED_FAIR tasks in an SMP equals their weight ratio. In other words, if
> the entire SCHED_FAIR tasks are given X amount of CPU time, then a
> weight w task is guaranteed with w/X time in any interval of time.

The problem is that there are various reasons why a particular CPU might not get
to execute this codepath for a little while, and then either your round
invariant is broken, or you have a performance-killing barrier. Relaxing your
invariants and your locking would mitigate this.

Concerns aside, I agree that fairness is important, and I'd really like to see a
test case that demonstrates the problem.

-- Chris

Chris Snook

unread,
Jul 24, 2007, 4:50:09 PM7/24/07
to
Chris Friesen wrote:

> Chris Snook wrote:
>
>> Concerns aside, I agree that fairness is important, and I'd really
>> like to see a test case that demonstrates the problem.
>
> One place that might be useful is the case of fairness between resource
> groups, where the load balancer needs to consider each group separately.

You mean like the CFS group scheduler patches? I don't see how this patch is
related to that, besides working on top of it.

> Now it may be the case that trying to keep the load of each class within
> X% of the other cpus is sufficient, but it's not trivial.

I agree. My suggestion is that we try being fair from the bottom-up, rather
than top-down. If most of the rebalancing is local, we can minimize expensive
locking and cross-node migrations, and scale very nicely on large NUMA boxes.

> Consider the case where you have a resource group that is allocated 50%
> of each cpu in a dual cpu system, and only have a single task in that
> group. This means that in order to make use of the full group
> allocation, that task needs to be load-balanced to the other cpu as soon
> as it gets scheduled out. Most load-balancers can't handle that kind of
> granularity, but I have guys in our engineering team that would really
> like this level of performance.

Divining the intentions of the administrator is an AI-complete problem and we're
not going to try to solve that in the kernel. An intelligent administrator
could also allocate 50% of each CPU to a resource group containing all the
*other* processes. Then, when the other processes are scheduled out, your
single task will run on whichever CPU is idle. This will very quickly
equilibrate to the scheduling ping-pong you seem to want. The scheduler
deliberately avoids this kind of migration by default because it hurts cache and
TLB performance, so if you want to override this very sane default behavior,
you're going to have to explicitly configure it yourself.

> We currently use CKRM on an SMP machine, but the only way we can get
> away with it is because our main app is affined to one cpu and just
> about everything else is affined to the other.

If you're not explicitly allocating resources, you're just low-latency, not
truly realtime. Realtime requires guaranteed resources, so messing with
affinities is a necessary evil.

> We have another SMP box that would benefit from group scheduling, but we
> can't use it because the load balancer is not nearly good enough.

Which scheduler? Have you tried the CFS group scheduler patches?

Li, Tong N

unread,
Jul 24, 2007, 5:00:10 PM7/24/07
to
On Tue, 2007-07-24 at 16:39 -0400, Chris Snook wrote:

> Divining the intentions of the administrator is an AI-complete problem and we're
> not going to try to solve that in the kernel. An intelligent administrator
> could also allocate 50% of each CPU to a resource group containing all the
> *other* processes. Then, when the other processes are scheduled out, your
> single task will run on whichever CPU is idle. This will very quickly
> equilibrate to the scheduling ping-pong you seem to want. The scheduler
> deliberately avoids this kind of migration by default because it hurts cache and
> TLB performance, so if you want to override this very sane default behavior,
> you're going to have to explicitly configure it yourself.
>

Well, the admin wouldn't specifically ask for 50% of each CPU. He would
just allocate 50% of total CPU time---it's up to the scheduler to
fulfill that. If a task is entitled to one CPU, then it'll stay there
and have no migration. Migration occurs only if there's overload, in
which case I think you agree in your last email that the cache and TLB
impact is not an issue (at least in SMP).

tong

Chris Snook

unread,
Jul 24, 2007, 5:10:06 PM7/24/07
to
Li, Tong N wrote:
> On Tue, 2007-07-24 at 16:39 -0400, Chris Snook wrote:
>
>> Divining the intentions of the administrator is an AI-complete problem and we're
>> not going to try to solve that in the kernel. An intelligent administrator
>> could also allocate 50% of each CPU to a resource group containing all the
>> *other* processes. Then, when the other processes are scheduled out, your
>> single task will run on whichever CPU is idle. This will very quickly
>> equilibrate to the scheduling ping-pong you seem to want. The scheduler
>> deliberately avoids this kind of migration by default because it hurts cache and
>> TLB performance, so if you want to override this very sane default behavior,
>> you're going to have to explicitly configure it yourself.
>>
>
> Well, the admin wouldn't specifically ask for 50% of each CPU. He would
> just allocate 50% of total CPU time---it's up to the scheduler to
> fulfill that. If a task is entitled to one CPU, then it'll stay there
> and have no migration. Migration occurs only if there's overload, in
> which case I think you agree in your last email that the cache and TLB
> impact is not an issue (at least in SMP).

I don't think Chris's scenario has much bearing on your patch. What he wants is
to have a task that will always be running, but can't monopolize either CPU.
This is useful for certain realtime workloads, but as I've said before, realtime
requires explicit resource allocation. I don't think this is very relevant to
SCHED_FAIR balancing.

-- Chris

Chris Friesen

unread,
Jul 24, 2007, 5:20:07 PM7/24/07
to
Chris Snook wrote:

>> We have another SMP box that would benefit from group scheduling, but
>> we can't use it because the load balancer is not nearly good enough.

> Which scheduler? Have you tried the CFS group scheduler patches?

CKRM as well.

Haven't tried the CFS group scheduler, as we're stuck on a way old
kernel and I've been too busy to play with the new stuff.

Chris Friesen

unread,
Jul 24, 2007, 3:50:10 PM7/24/07
to
Chris Snook wrote:

> Concerns aside, I agree that fairness is important, and I'd really like
> to see a test case that demonstrates the problem.

One place that might be useful is the case of fairness between resource
groups, where the load balancer needs to consider each group separately.

Now it may be the case that trying to keep the load of each class within
X% of the other cpus is sufficient, but it's not trivial.

Consider the case where you have a resource group that is allocated 50%
of each cpu in a dual cpu system, and only have a single task in that
group. This means that in order to make use of the full group
allocation, that task needs to be load-balanced to the other cpu as soon
as it gets scheduled out. Most load-balancers can't handle that kind of
granularity, but I have guys in our engineering team that would really
like this level of performance.

We currently use CKRM on an SMP machine, but the only way we can get
away with it is because our main app is affined to one cpu and just
about everything else is affined to the other.

We have another SMP box that would benefit from group scheduling, but we
can't use it because the load balancer is not nearly good enough.

Chris

Chris Friesen

unread,
Jul 24, 2007, 5:30:10 PM7/24/07
to
Chris Snook wrote:

> I don't think Chris's scenario has much bearing on your patch. What he
> wants is to have a task that will always be running, but can't
> monopolize either CPU. This is useful for certain realtime workloads,
> but as I've said before, realtime requires explicit resource
> allocation. I don't think this is very relevant to SCHED_FAIR balancing.

I'm not actually using the scenario I described, its just sort of a
worst-case load-balancing thought experiment.

What we want to be able to do is to specify a fraction of each cpu for
each task group. We don't want to have to affine tasks to particular cpus.

This means that the load balancer must be group-aware, and must trigger
a re-balance (possibly just for a particular group) as soon as the cpu
allocation for that group is used up on a particular cpu.

I haven't tried the latest CFS group scheduler patches, so they may
provide the sort of accuracy we're looking for. I'll have to try and
find some time to test them out.

Chris Snook

unread,
Jul 24, 2007, 5:30:11 PM7/24/07
to
Bill Huey (hui) wrote:

> On Tue, Jul 24, 2007 at 04:39:47PM -0400, Chris Snook wrote:
>> Chris Friesen wrote:
>>> We currently use CKRM on an SMP machine, but the only way we can get away
>>> with it is because our main app is affined to one cpu and just about
>>> everything else is affined to the other.
>> If you're not explicitly allocating resources, you're just low-latency, not
>> truly realtime. Realtime requires guaranteed resources, so messing with
>> affinities is a necessary evil.
>
> You've mentioned this twice in this thread. If you're going to talk about this
> you should characterize this more specifically because resource allocation is
> a rather incomplete area in the Linux.

Well, you need enough CPU time to meet your deadlines. You need pre-allocated
memory, or to be able to guarantee that you can allocate memory fast enough to
meet your deadlines. This principle extends to any other shared resource, such
as disk or network. I'm being vague because it's open-ended. If a medical
device fails to meet realtime guarantees because the battery fails, the
patient's family isn't going to care how correct the software is. Realtime
engineering is hard.

> Rebalancing is still an open research
> problem the last time I looked.

Actually, it's worse than merely an open problem. A clairvoyant fair scheduler
with perfect future knowledge can underperform a heuristic fair scheduler,
because the heuristic scheduler can guess the future incorrectly resulting in
unfair but higher-throughput behavior. This is a perfect example of why we only
try to be as fair as is beneficial.

> Tong's previous trio patch is an attempt at resolving this using a generic
> grouping mechanism and some constructive discussion should come of it.

Sure, but it seems to me to be largely orthogonal to this patch.

Bill Huey

unread,
Jul 24, 2007, 5:40:08 PM7/24/07
to
On Tue, Jul 24, 2007 at 04:39:47PM -0400, Chris Snook wrote:
> Chris Friesen wrote:
>> We currently use CKRM on an SMP machine, but the only way we can get away
>> with it is because our main app is affined to one cpu and just about
>> everything else is affined to the other.
>
> If you're not explicitly allocating resources, you're just low-latency, not
> truly realtime. Realtime requires guaranteed resources, so messing with
> affinities is a necessary evil.

You've mentioned this twice in this thread. If you're going to talk about this
you should characterize this more specifically because resource allocation is
a rather incomplete area in the Linux. Rebalancing is still an open research
problem the last time I looked.

Tong's previous trio patch is an attempt at resolving this using a generic
grouping mechanism and some constructive discussion should come of it.

bill

Chris Snook

unread,
Jul 24, 2007, 5:50:09 PM7/24/07
to
Chris Friesen wrote:
> Chris Snook wrote:
>
>> I don't think Chris's scenario has much bearing on your patch. What
>> he wants is to have a task that will always be running, but can't
>> monopolize either CPU. This is useful for certain realtime workloads,
>> but as I've said before, realtime requires explicit resource
>> allocation. I don't think this is very relevant to SCHED_FAIR balancing.
>
> I'm not actually using the scenario I described, its just sort of a
> worst-case load-balancing thought experiment.
>
> What we want to be able to do is to specify a fraction of each cpu for
> each task group. We don't want to have to affine tasks to particular cpus.

A fraction of *each* CPU, or a fraction of *total* CPU? Per-cpu granularity
doesn't make anything more fair. You've got a big bucket of MIPS you want to
divide between certain groups, but it shouldn't make a difference which CPUs
those MIPS come from, other than the fact that we try to minimize overhead
induced by migration.

> This means that the load balancer must be group-aware, and must trigger
> a re-balance (possibly just for a particular group) as soon as the cpu
> allocation for that group is used up on a particular cpu.

If I have two threads with the same priority, and two CPUs, the scheduler will
put one on each CPU, and they'll run happily without any migration or balancing.
It sounds like you're saying that every X milliseconds, you want both to
expire, be forbidden from running on the current CPU for the next X
milliseconds, and then migrated to the other CPU. There's no gain in fairness
here, and there's a big drop in performance.

I suggested local fairness as a means to achieve global fairness because it
could reduce overhead, and by adding the margin of error at each level in the
locality hierarchy, you can get an algorithm which naturally tolerates the level
of unfairness beyond which it is impossible to optimize. Strict local fairness
for its own sake doesn't accomplish anything that's better than global fairness.

-- Chris

Bill Huey

unread,
Jul 24, 2007, 7:20:06 PM7/24/07
to
On Tue, Jul 24, 2007 at 05:22:47PM -0400, Chris Snook wrote:
> Bill Huey (hui) wrote:
> Well, you need enough CPU time to meet your deadlines. You need
> pre-allocated memory, or to be able to guarantee that you can allocate
> memory fast enough to meet your deadlines. This principle extends to any
> other shared resource, such as disk or network. I'm being vague because
> it's open-ended. If a medical device fails to meet realtime guarantees
> because the battery fails, the patient's family isn't going to care how
> correct the software is. Realtime engineering is hard.
...

> Actually, it's worse than merely an open problem. A clairvoyant fair
> scheduler with perfect future knowledge can underperform a heuristic fair
> scheduler, because the heuristic scheduler can guess the future incorrectly
> resulting in unfair but higher-throughput behavior. This is a perfect
> example of why we only try to be as fair as is beneficial.

I'm glad we agree on the above points. :)

It might be that there needs to be another, stiffer policy than what goes
into SCHED_OTHER: a SCHED_ISO or something with stricter rebalancing
semantics for -rt applications, sort of a super SCHED_RR.
That's definitely needed and I don't see how the current CFS implementation
can deal with this properly even with numerical running averages, etc...
at this time.

SCHED_FIFO is another issue, but it is actually more complicated than just
per-cpu run queues in that it requires a global priority analysis. I don't
see how CFS can deal with SCHED_FIFO efficiently without moving to a single
run queue. This is kind of a complicated problem with a significant set of
trade-offs to take into account (cpu binding, etc.).

>> Tong's previous trio patch is an attempt at resolving this using a generic
>> grouping mechanism and some constructive discussion should come of it.
>
> Sure, but it seems to me to be largely orthogonal to this patch.

It's based on the same kinds of ideas that he's been experimenting with in
Trio. I can't name a single other engineer who's posted to lkml recently
with quite the depth of experience in this area that he has. It would be
nice to facilitate/incorporate some of his ideas, or get him to work on
something to this end that's suitable for inclusion in some tree somewhere.

bill

Chris Friesen

unread,
Jul 24, 2007, 7:40:07 PM7/24/07
to
Chris Snook wrote:

> A fraction of *each* CPU, or a fraction of *total* CPU? Per-cpu
> granularity doesn't make anything more fair.

Well, our current solution uses per-cpu weights, because our vendor
couldn't get the load balancer working accurately enough. Having
per-cpu weights and cpu affinity gives acceptable results for the case
where we're currently using it.

If the load balancer is good enough, per-system weights would be fine.
It would have to play nicely with affinity though, in the case where it
makes sense to lock tasks to particular cpus.

> If I have two threads with the same priority, and two CPUs, the
> scheduler will put one on each CPU, and they'll run happily without any
> migration or balancing.

Sure. Now add a third thread. How often do you migrate? Put another
way, over what time quantum do we ensure fairness?

Ingo Molnar

unread,
Jul 25, 2007, 7:10:07 AM7/25/07
to

* Tong Li <tong...@intel.com> wrote:

> This patch extends CFS to achieve better fairness for SMPs. For
> example, with 10 tasks (same priority) on 8 CPUs, it enables each task

> to receive equal CPU time (80%). [...]

hm, CFS should already offer reasonable long-term SMP fairness. It
certainly works on a dual-core box, i just started 3 tasks of the same
priority on 2 CPUs, and on vanilla 2.6.23-rc1 the distribution is this:

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
7084 mingo 20 0 1576 248 196 R 67 0.0 0:50.13 loop
7083 mingo 20 0 1576 244 196 R 66 0.0 0:48.86 loop
7085 mingo 20 0 1576 244 196 R 66 0.0 0:49.45 loop

so each task gets a perfect 66% of CPU time.

prior to CFS, we indeed did a 50%/50%/100% split - so for example on
v2.6.22:

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
2256 mingo 25 0 1580 248 196 R 100 0.0 1:03.19 loop
2255 mingo 25 0 1580 248 196 R 50 0.0 0:31.79 loop
2257 mingo 25 0 1580 248 196 R 50 0.0 0:31.69 loop

but CFS has changed that behavior.

I'll check your 10-tasks-on-8-cpus example on an 8-way box too, maybe we
regressed somewhere ...

Ingo

Ingo Molnar

unread,
Jul 25, 2007, 8:10:08 AM7/25/07
to

* Ingo Molnar <mi...@elte.hu> wrote:

> > This patch extends CFS to achieve better fairness for SMPs. For
> > example, with 10 tasks (same priority) on 8 CPUs, it enables each task
> > to receive equal CPU time (80%). [...]
>
> hm, CFS should already offer reasonable long-term SMP fairness. It
> certainly works on a dual-core box, i just started 3 tasks of the same
> priority on 2 CPUs, and on vanilla 2.6.23-rc1 the distribution is
> this:
>
> PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
> 7084 mingo 20 0 1576 248 196 R 67 0.0 0:50.13 loop
> 7083 mingo 20 0 1576 244 196 R 66 0.0 0:48.86 loop
> 7085 mingo 20 0 1576 244 196 R 66 0.0 0:49.45 loop
>
> so each task gets a perfect 66% of CPU time.
>
> prior CFS, we indeed did a 50%/50%/100% split - so for example on
> v2.6.22:
>
> PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
> 2256 mingo 25 0 1580 248 196 R 100 0.0 1:03.19 loop
> 2255 mingo 25 0 1580 248 196 R 50 0.0 0:31.79 loop
> 2257 mingo 25 0 1580 248 196 R 50 0.0 0:31.69 loop
>
> but CFS has changed that behavior.
>
> I'll check your 10-tasks-on-8-cpus example on an 8-way box too, maybe
> we regressed somewhere ...

ok, i just tried it on an 8-cpu box and indeed, unlike the dual-core
case, the scheduler does not distribute tasks well enough:

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND

2572 mingo 20 0 1576 244 196 R 100 0.0 1:03.61 loop
2578 mingo 20 0 1576 248 196 R 100 0.0 1:03.59 loop
2576 mingo 20 0 1576 248 196 R 100 0.0 1:03.52 loop
2571 mingo 20 0 1576 244 196 R 100 0.0 1:03.46 loop
2569 mingo 20 0 1576 244 196 R 99 0.0 1:03.36 loop
2570 mingo 20 0 1576 244 196 R 95 0.0 1:00.55 loop
2577 mingo 20 0 1576 248 196 R 50 0.0 0:31.88 loop
2574 mingo 20 0 1576 248 196 R 50 0.0 0:31.87 loop
2573 mingo 20 0 1576 248 196 R 50 0.0 0:31.86 loop
2575 mingo 20 0 1576 248 196 R 50 0.0 0:31.86 loop

but this is relatively easy to fix - with the patch below applied, it
looks a lot better:

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND

2681 mingo 20 0 1576 244 196 R 85 0.0 3:51.68 loop
2688 mingo 20 0 1576 244 196 R 81 0.0 3:46.35 loop
2682 mingo 20 0 1576 244 196 R 80 0.0 3:43.68 loop
2685 mingo 20 0 1576 248 196 R 80 0.0 3:45.97 loop
2683 mingo 20 0 1576 248 196 R 80 0.0 3:40.25 loop
2679 mingo 20 0 1576 244 196 R 80 0.0 3:33.53 loop
2680 mingo 20 0 1576 244 196 R 79 0.0 3:43.53 loop
2686 mingo 20 0 1576 244 196 R 79 0.0 3:39.31 loop
2687 mingo 20 0 1576 244 196 R 78 0.0 3:33.31 loop
2684 mingo 20 0 1576 244 196 R 77 0.0 3:27.52 loop

they now nicely converge to the expected 80% long-term CPU usage.

so, could you please try the patch below, does it work for you too?

Ingo

--------------------------->
Subject: sched: increase SCHED_LOAD_SCALE_FUZZ
From: Ingo Molnar <mi...@elte.hu>

increase SCHED_LOAD_SCALE_FUZZ, which adds a small amount of
over-balancing, to help distribute CPU-bound tasks more fairly on SMP
systems.

the problem of unfair balancing was noticed and reported by Tong N Li.

10 CPU-bound tasks running on 8 CPUs, v2.6.23-rc1:

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND

2572 mingo 20 0 1576 244 196 R 100 0.0 1:03.61 loop
2578 mingo 20 0 1576 248 196 R 100 0.0 1:03.59 loop
2576 mingo 20 0 1576 248 196 R 100 0.0 1:03.52 loop
2571 mingo 20 0 1576 244 196 R 100 0.0 1:03.46 loop
2569 mingo 20 0 1576 244 196 R 99 0.0 1:03.36 loop
2570 mingo 20 0 1576 244 196 R 95 0.0 1:00.55 loop
2577 mingo 20 0 1576 248 196 R 50 0.0 0:31.88 loop
2574 mingo 20 0 1576 248 196 R 50 0.0 0:31.87 loop
2573 mingo 20 0 1576 248 196 R 50 0.0 0:31.86 loop
2575 mingo 20 0 1576 248 196 R 50 0.0 0:31.86 loop

v2.6.23-rc1 + patch:

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND

2681 mingo 20 0 1576 244 196 R 85 0.0 3:51.68 loop
2688 mingo 20 0 1576 244 196 R 81 0.0 3:46.35 loop
2682 mingo 20 0 1576 244 196 R 80 0.0 3:43.68 loop
2685 mingo 20 0 1576 248 196 R 80 0.0 3:45.97 loop
2683 mingo 20 0 1576 248 196 R 80 0.0 3:40.25 loop
2679 mingo 20 0 1576 244 196 R 80 0.0 3:33.53 loop
2680 mingo 20 0 1576 244 196 R 79 0.0 3:43.53 loop
2686 mingo 20 0 1576 244 196 R 79 0.0 3:39.31 loop
2687 mingo 20 0 1576 244 196 R 78 0.0 3:33.31 loop
2684 mingo 20 0 1576 244 196 R 77 0.0 3:27.52 loop

so they now nicely converge to the expected 80% long-term CPU usage.

Signed-off-by: Ingo Molnar <mi...@elte.hu>
---
include/linux/sched.h | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

Index: linux/include/linux/sched.h
===================================================================
--- linux.orig/include/linux/sched.h
+++ linux/include/linux/sched.h
@@ -681,7 +681,7 @@ enum cpu_idle_type {
#define SCHED_LOAD_SHIFT 10
#define SCHED_LOAD_SCALE (1L << SCHED_LOAD_SHIFT)

-#define SCHED_LOAD_SCALE_FUZZ (SCHED_LOAD_SCALE >> 5)
+#define SCHED_LOAD_SCALE_FUZZ (SCHED_LOAD_SCALE >> 1)

#ifdef CONFIG_SMP
#define SD_LOAD_BALANCE 1 /* Do load balancing on this domain. */
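
(For scale: with SCHED_LOAD_SHIFT = 10 above, SCHED_LOAD_SCALE is 1024, so
this one-liner raises SCHED_LOAD_SCALE_FUZZ from 1024 >> 5 = 32 to
1024 >> 1 = 512, i.e. from roughly 3% to 50% of a single nice-0 task's load
of 1024; that larger fuzz is the extra over-balancing bias the changelog
describes.)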

Tong Li

unread,
Jul 25, 2007, 1:30:14 PM7/25/07
to
On Wed, 25 Jul 2007, Ingo Molnar wrote:

> they now nicely converte to the expected 80% long-term CPU usage.
>
> so, could you please try the patch below, does it work for you too?
>

Thanks for the patch. It doesn't work well on my 8-way box. Here's the
output at two different times. It's also changing all the time.

[time 1]
3717 tongli 20 0 7864 108 16 R 85 0.0 1:38.00 loop
3721 tongli 20 0 7864 108 16 R 85 0.0 1:37.51 loop
3723 tongli 20 0 7864 108 16 R 84 0.0 1:30.48 loop
3714 tongli 20 0 7864 108 16 R 83 0.0 1:33.25 loop
3720 tongli 20 0 7864 108 16 R 81 0.0 1:31.89 loop
3715 tongli 20 0 7864 108 16 R 78 0.0 1:36.38 loop
3719 tongli 20 0 7864 108 16 R 78 0.0 1:28.80 loop
3718 tongli 20 0 7864 108 16 R 78 0.0 1:33.84 loop
3722 tongli 20 0 7864 108 16 R 76 0.0 1:34.69 loop
3716 tongli 20 0 7864 108 16 R 70 0.0 1:33.15 loop

[time 2]
3721 tongli 20 0 7864 108 16 R 91 0.0 1:44.67 loop
3720 tongli 20 0 7864 108 16 R 89 0.0 1:40.55 loop
3722 tongli 20 0 7864 108 16 R 86 0.0 1:41.40 loop
3723 tongli 20 0 7864 108 16 R 82 0.0 1:38.94 loop
3715 tongli 20 0 7864 108 16 R 82 0.0 1:43.09 loop
3717 tongli 20 0 7864 108 16 R 79 0.0 1:46.38 loop
3718 tongli 20 0 7864 108 16 R 79 0.0 1:40.72 loop
3714 tongli 20 0 7864 108 16 R 77 0.0 1:40.17 loop
3719 tongli 20 0 7864 108 16 R 73 0.0 1:35.41 loop
3716 tongli 20 0 7864 108 16 R 61 0.0 1:38.66 loop

I'm running 2.6.23-rc1 with your patch on a two-socket quad-core system.
Top refresh rate is the default 3s.

tong

Li, Tong N

unread,
Jul 25, 2007, 2:30:09 PM7/25/07
to
On Wed, 2007-07-25 at 14:03 +0200, Ingo Molnar wrote:

> Signed-off-by: Ingo Molnar <mi...@elte.hu>
> ---
> include/linux/sched.h | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> Index: linux/include/linux/sched.h
> ===================================================================
> --- linux.orig/include/linux/sched.h
> +++ linux/include/linux/sched.h
> @@ -681,7 +681,7 @@ enum cpu_idle_type {
> #define SCHED_LOAD_SHIFT 10
> #define SCHED_LOAD_SCALE (1L << SCHED_LOAD_SHIFT)
>
> -#define SCHED_LOAD_SCALE_FUZZ (SCHED_LOAD_SCALE >> 5)
> +#define SCHED_LOAD_SCALE_FUZZ (SCHED_LOAD_SCALE >> 1)
>
> #ifdef CONFIG_SMP
> #define SD_LOAD_BALANCE 1 /* Do load balancing on this domain. */

Hi Ingo,

The problem I see is that the current load balancing code is quite
heuristic and can be quite inaccurate sometimes. Its goal is to maintain
roughly equal load on each core, where load is defined to be the sum of
the weights of all tasks on a core. If all tasks have the same weight,
this is a simple problem. If different weights exist, this is an NP-hard
problem and our current heuristic can perform badly under various
workloads. A simple example: if we have four tasks on two cores and they
have weights 1, 5, 7, 7, the most balanced partition would have 1 and 7 on
core 1 and 5 and 7 on core 2. But you can see the load isn't evenly
distributed; in fact, it's not possible to evenly distribute the load in
this case. Thus, my opinion is that only balancing the load as we do now
is not enough to achieve SMP fairness. My patch was intended to address
this problem.
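
Spelling out the arithmetic for that example: the total weight is
1 + 5 + 7 + 7 = 20, so an ideal split would put weight 10 on each core.
The best achievable partition is {1, 7} = 8 versus {5, 7} = 12, so each
unit of weight on the first core gets 1/8 of a CPU while each unit on the
second gets only 1/12, a 50% difference in per-weight CPU time that no
static placement can remove; only moving tasks between the cores over time
can even it out.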

tong

PS. I now have a lockless implementation of my patch. Will post later
today.

Ingo Molnar

unread,
Jul 25, 2007, 3:20:10 PM7/25/07
to

* Li, Tong N <tong...@intel.com> wrote:

> The problem I see is that the current load balancing code is quite
> hueristic and can be quite inaccurate sometimes. Its goal is to
> maintain roughly equal load on each core, where load is defined to be
> the sum of the weights of all tasks on a core. If all tasks have the
> same weight, this is a simple problem. If different weights exist,
> this is an NP-hard problem and our current hueristic can perform badly
> under various workloads. A simple example, if we have four tasks on
> two cores and they have weights 1, 5, 7, 7. The balanced partition
> would have 1 and 7 on core 1 and 5 and 7 on core 2. But you can see
> the load isn't evenly distributed; in fact, it's not possible to
> evenly distribute the load in this case. Thus, my opinion is that only
> balancing the load as we do now is not enough to achieve SMP fairness.
> My patch was intended to address this problem.

could you please try my patch nevertheless and see how much it differs
in practice from your solution, and cite specific examples and
measurements? (please measure over longer periods of time such as 120
seconds or 300 seconds - so that slow-rate balancing has its chance to
work)

Ingo

Ingo Molnar

unread,
Jul 25, 2007, 3:30:22 PM7/25/07
to

* Tong Li <tong...@intel.com> wrote:

> Thanks for the patch. It doesn't work well on my 8-way box. Here's the
> output at two different times. It's also changing all the time.

you need to measure it over longer periods of time. Its not worth
balancing for such a thing in any high-frequency manner. (we'd trash the
cache constantly migrating tasks back and forth.)

> [time 2]
> 3721 tongli 20 0 7864 108 16 R 91 0.0 1:44.67 loop
> 3720 tongli 20 0 7864 108 16 R 89 0.0 1:40.55 loop
> 3722 tongli 20 0 7864 108 16 R 86 0.0 1:41.40 loop
> 3723 tongli 20 0 7864 108 16 R 82 0.0 1:38.94 loop
> 3715 tongli 20 0 7864 108 16 R 82 0.0 1:43.09 loop
> 3717 tongli 20 0 7864 108 16 R 79 0.0 1:46.38 loop
> 3718 tongli 20 0 7864 108 16 R 79 0.0 1:40.72 loop
> 3714 tongli 20 0 7864 108 16 R 77 0.0 1:40.17 loop
> 3719 tongli 20 0 7864 108 16 R 73 0.0 1:35.41 loop
> 3716 tongli 20 0 7864 108 16 R 61 0.0 1:38.66 loop

the runtime ranges from 1:38.66 to 1:46.38 - that's a +-3% spread which
is pretty acceptable for ~90 seconds of runtime. Note that the
percentages you see above were likely computed over a shorter period;
that's why you see them range from 61% to 91%.

> I'm running 2.6.23-rc1 with your patch on a two-socket quad-core
> system. Top refresh rate is the default 3s.

the 3s is the problem: change that to 60s! We no way want to
over-migrate for SMP fairness, the change i did gives us reasonable
long-term SMP fairness without the need for high-rate rebalancing.

Ingo

Chris Friesen

unread,
Jul 25, 2007, 4:40:09 PM7/25/07
to
Ingo Molnar wrote:

> the 3s is the problem: change that to 60s! We no way want to
> over-migrate for SMP fairness, the change i did gives us reasonable
> long-term SMP fairness without the need for high-rate rebalancing.

Actually, I do have requirements from our engineering guys for
short-term fairness. They'd actually like decent fairness over even
shorter intervals...1 second would be nice, 2 is acceptable.

They are willing to trade off random peak performance for predictability.

Chris

Chris Snook

unread,
Jul 25, 2007, 5:00:20 PM7/25/07
to
Chris Friesen wrote:
> Ingo Molnar wrote:
>
>> the 3s is the problem: change that to 60s! We no way want to
>> over-migrate for SMP fairness, the change i did gives us reasonable
>> long-term SMP fairness without the need for high-rate rebalancing.
>
> Actually, I do have requirements from our engineering guys for
> short-term fairness. They'd actually like decent fairness over even
> shorter intervals...1 second would be nice, 2 is acceptable.
>
> They are willing to trade off random peak performance for predictability.
>
> Chris
>

The sysctls for CFS have nanosecond resolution. They default to
millisecond-order values, but you can set them much lower. See sched_fair.c for
the knobs and their explanations.

-- Chris

Li, Tong N

unread,
Jul 25, 2007, 5:20:11 PM7/25/07
to
On Wed, 2007-07-25 at 16:55 -0400, Chris Snook wrote:
> Chris Friesen wrote:
> > Ingo Molnar wrote:
> >
> >> the 3s is the problem: change that to 60s! We no way want to
> >> over-migrate for SMP fairness, the change i did gives us reasonable
> >> long-term SMP fairness without the need for high-rate rebalancing.
> >
> > Actually, I do have requirements from our engineering guys for
> > short-term fairness. They'd actually like decent fairness over even
> > shorter intervals...1 second would be nice, 2 is acceptable.
> >
> > They are willing to trade off random peak performance for predictability.
> >
> > Chris
> >
>
> The sysctls for CFS have nanosecond resolution. They default to
> millisecond-order values, but you can set them much lower. See sched_fair.c for
> the knobs and their explanations.
>
> -- Chris

This is incorrect. Those knobs control local-CPU fairness granularity
but have no control over fairness across CPUs.

I'll do some benchmarking as Ingo suggested.

tong

Chris Snook

unread,
Jul 25, 2007, 6:30:14 PM7/25/07
to
Li, Tong N wrote:
> On Wed, 2007-07-25 at 16:55 -0400, Chris Snook wrote:
>> Chris Friesen wrote:
>>> Ingo Molnar wrote:
>>>
>>>> the 3s is the problem: change that to 60s! We no way want to
>>>> over-migrate for SMP fairness, the change i did gives us reasonable
>>>> long-term SMP fairness without the need for high-rate rebalancing.
>>> Actually, I do have requirements from our engineering guys for
>>> short-term fairness. They'd actually like decent fairness over even
>>> shorter intervals...1 second would be nice, 2 is acceptable.
>>>
>>> They are willing to trade off random peak performance for predictability.
>>>
>>> Chris
>>>
>> The sysctls for CFS have nanosecond resolution. They default to
>> millisecond-order values, but you can set them much lower. See sched_fair.c for
>> the knobs and their explanations.
>>
>> -- Chris
>
> This is incorrect. Those knobs control local-CPU fairness granularity
> but have no control over fairness across CPUs.
>
> I'll do some benchmarking as Ingo suggested.
>
> tong

CFS naturally enforces cross-CPU fairness anyway, as Ingo demonstrated.
Lowering the local CPU parameters should cause system-wide fairness to converge
faster. It might be worthwhile to create a more explicit knob for this, but I'm
inclined to believe we could do it in much less than 700 lines. Ingo's
one-liner to improve the 10/8 balancing case, and the resulting improvement,
were exactly what I was saying should be possible and desirable. TCP Nagle
aside, it generally shouldn't take 700 lines of code to speed up the rate of
convergence of something that already converges.

Until now I've been watching the scheduler rewrite from the sidelines, but I'm
digging into it now. I'll try to give some more constructive criticism soon.

-- Chris

Tong Li

unread,
Jul 26, 2007, 3:10:14 PM7/26/07
to
On Wed, 25 Jul 2007, Ingo Molnar wrote:

>
> * Tong Li <tong...@intel.com> wrote:
>
> > Thanks for the patch. It doesn't work well on my 8-way box. Here's the
> > output at two different times. It's also changing all the time.
>
> you need to measure it over longer periods of time. Its not worth
> balancing for such a thing in any high-frequency manner. (we'd trash the
> cache constantly migrating tasks back and forth.)

I have some data below, but before that, I'd like to say, at the same load
balancing rate, my proposed approach would allow us to have fairness on
the order of seconds. I'm less concerned about trashing the cache. The
important thing is to have a knob that allows users to trade off fairness
and performance based on their needs. This is the same spirit as in CFS,
which by default can have much more frequent context switches than the old
O(1) scheduler. A valid concern is that this is going to hurt cache and
this is something that has not yet been benchmarked thoroughly. On the
other hand, I fully support CFS on this because I like the idea of having
a powerful infrastructure in place and exporting a set of knobs that give
users full flexibility. This is also what I'm trying to achieve with my
patch for SMP fairness.

>
> the runtime ranges from 1:38.66 to 1:46.38 - that's a +-3% spread which
> is pretty acceptable for ~90 seconds of runtime. Note that the
> percentages you see above were likely done over a shorter period that's
> why you see them range from 61% to 91%.
>
> > I'm running 2.6.23-rc1 with your patch on a two-socket quad-core
> > system. Top refresh rate is the default 3s.
>
> the 3s is the problem: change that to 60s! We no way want to
> over-migrate for SMP fairness, the change i did gives us reasonable
> long-term SMP fairness without the need for high-rate rebalancing.
>

I ran 10 nice 0 tasks on 8 CPUs. Each task does a trivial while (1) loop.
I ran the whole thing 300 seconds and, at every 60 seconds, I measured for
each task the following:

1. Actual CPU time the task received during the past 60 seconds.
2. Ideal CPU time it would receive under a perfect fair scheduler.
3. Lag = ideal time - actual time
Positive lag means the task received less CPU time than its fair share and
negative means it received more.
4. Error = lag / ideal time
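
(For the numbers below: with 10 identical tasks on 8 CPUs, the ideal CPU
time over each 60-second window is 60 s * 8 CPUs / 10 tasks = 48 s, which
is where the 48.00 in the "Ideal Time" columns comes from; lag is then
48 s minus the measured CPU time, and error is lag divided by 48 s.)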

Results of CFS (2.6.23-rc1 with Ingo's SCHED_LOAD_SCALE_FUZZ change):

sampling interval 60 sampling length 300 num tasks 10
---------------------------------------------------------
Task 0 (unit: second) nice 0 weight 1024
----------------------------------------
Wall Clock CPU Time Ideal Time Lag Error
04:17:09.314 51.14 48.00 -3.14 -6.54%
04:18:09.316 46.05 48.00 1.95 4.06%
04:19:09.317 44.13 48.00 3.87 8.06%
04:20:09.318 44.78 48.00 3.22 6.71%
04:21:09.320 45.35 48.00 2.65 5.52%
---------------------------------------------------------
Task 1 (unit: second) nice 0 weight 1024
----------------------------------------
Wall Clock CPU Time Ideal Time Lag Error
04:17:09.314 47.06 48.00 0.94 1.96%
04:18:09.316 45.80 48.00 2.20 4.58%
04:19:09.317 49.95 48.00 -1.95 -4.06%
04:20:09.318 47.59 48.00 0.41 0.85%
04:21:09.320 46.50 48.00 1.50 3.12%
---------------------------------------------------------
Task 2 (unit: second) nice 0 weight 1024
----------------------------------------
Wall Clock CPU Time Ideal Time Lag Error
04:17:09.314 48.14 48.00 -0.14 -0.29%
04:18:09.316 48.66 48.00 -0.66 -1.37%
04:19:09.317 47.52 48.00 0.48 1.00%
04:20:09.318 45.90 48.00 2.10 4.37%
04:21:09.320 45.89 48.00 2.11 4.40%
---------------------------------------------------------
Task 3 (unit: second) nice 0 weight 1024
----------------------------------------
Wall Clock CPU Time Ideal Time Lag Error
04:17:09.314 48.71 48.00 -0.71 -1.48%
04:18:09.316 44.70 48.00 3.30 6.87%
04:19:09.317 45.86 48.00 2.14 4.46%
04:20:09.318 45.85 48.00 2.15 4.48%
04:21:09.320 44.91 48.00 3.09 6.44%
---------------------------------------------------------
Task 4 (unit: second) nice 0 weight 1024
----------------------------------------
Wall Clock CPU Time Ideal Time Lag Error
04:17:09.314 49.24 48.00 -1.24 -2.58%
04:18:09.316 45.73 48.00 2.27 4.73%
04:19:09.317 49.13 48.00 -1.13 -2.35%
04:20:09.318 45.37 48.00 2.63 5.48%
04:21:09.320 45.75 48.00 2.25 4.69%
---------------------------------------------------------
Task 5 (unit: second) nice 0 weight 1024
----------------------------------------
Wall Clock CPU Time Ideal Time Lag Error
04:17:09.314 49.28 48.00 -1.28 -2.67%
04:18:09.316 50.46 48.00 -2.46 -5.12%
04:19:09.317 43.92 48.00 4.08 8.50%
04:20:09.318 55.18 48.00 -7.18 -14.96%
04:21:09.320 56.97 48.00 -8.97 -18.69%
---------------------------------------------------------
Task 6 (unit: second) nice 0 weight 1024
----------------------------------------
Wall Clock CPU Time Ideal Time Lag Error
04:17:09.314 45.22 48.00 2.78 5.79%
04:18:09.316 44.80 48.00 3.20 6.67%
04:19:09.317 45.32 48.00 2.68 5.58%
04:20:09.318 46.66 48.00 1.34 2.79%
04:21:09.320 48.25 48.00 -0.25 -0.52%
---------------------------------------------------------
Task 7 (unit: second) nice 0 weight 1024
----------------------------------------
Wall Clock CPU Time Ideal Time Lag Error
04:17:09.314 47.52 48.00 0.48 1.00%
04:18:09.316 49.99 48.00 -1.99 -4.15%
04:19:09.317 53.40 48.00 -5.40 -11.25%
04:20:09.318 48.85 48.00 -0.85 -1.77%
04:21:09.320 44.67 48.00 3.33 6.94%
---------------------------------------------------------
Task 8 (unit: second) nice 0 weight 1024
----------------------------------------
Wall Clock CPU Time Ideal Time Lag Error
04:17:09.314 45.69 48.00 2.31 4.81%
04:18:09.316 47.92 48.00 0.08 0.17%
04:19:09.317 48.49 48.00 -0.49 -1.02%
04:20:09.318 54.47 48.00 -6.47 -13.48%
04:21:09.320 57.80 48.00 -9.80 -20.42%
---------------------------------------------------------
Task 9 (unit: second) nice 0 weight 1024
----------------------------------------
Wall Clock CPU Time Ideal Time Lag Error
04:17:09.314 46.86 48.00 1.14 2.37%
04:18:09.316 54.75 48.00 -6.75 -14.06%
04:19:09.317 51.13 48.00 -3.13 -6.52%
04:20:09.318 44.22 48.00 3.78 7.87%
04:21:09.320 42.76 48.00 5.24 10.92%
---------------------------------------------------------
System-wide stats:
Sampling interval: 60 s
Sampling length 300 s
Num threads: 10
Num CPUs: 8
Max error: 10.92%
Min error: -20.42%

----------------------------------------------------------------
Results of my patch:

sampling interval 60 sampling length 300 num tasks 10
---------------------------------------------------------
Task 0 (unit: second) nice 0 weight 1024
----------------------------------------
Wall Clock CPU Time Ideal Time Lag Error
04:42:35.248 47.92 48.00 0.08 0.17%
04:43:35.249 47.89 48.00 0.11 0.23%
04:44:35.250 47.93 48.00 0.07 0.15%
04:45:35.252 47.89 48.00 0.11 0.23%
04:46:35.254 47.90 48.00 0.10 0.21%
---------------------------------------------------------
Task 1 (unit: second) nice 0 weight 1024
----------------------------------------
Wall Clock CPU Time Ideal Time Lag Error
04:42:35.248 47.88 48.00 0.12 0.25%
04:43:35.249 47.90 48.00 0.10 0.21%
04:44:35.250 47.86 48.00 0.14 0.29%
04:45:35.252 47.89 48.00 0.11 0.23%
04:46:35.254 47.87 48.00 0.13 0.27%
---------------------------------------------------------
Task 2 (unit: second) nice 0 weight 1024
----------------------------------------
Wall Clock CPU Time Ideal Time Lag Error
04:42:35.248 47.89 48.00 0.11 0.23%
04:43:35.249 47.88 48.00 0.12 0.25%
04:44:35.250 47.88 48.00 0.12 0.25%
04:45:35.252 47.91 48.00 0.09 0.19%
04:46:35.254 47.85 48.00 0.15 0.31%
---------------------------------------------------------
Task 3 (unit: second) nice 0 weight 1024
----------------------------------------
Wall Clock CPU Time Ideal Time Lag Error
04:42:35.248 47.87 48.00 0.13 0.27%
04:43:35.249 47.81 48.00 0.19 0.40%
04:44:35.250 47.85 48.00 0.15 0.31%
04:45:35.252 47.92 48.00 0.08 0.17%
04:46:35.254 47.83 48.00 0.17 0.35%
---------------------------------------------------------
Task 4 (unit: second) nice 0 weight 1024
----------------------------------------
Wall Clock CPU Time Ideal Time Lag Error
04:42:35.248 47.84 48.00 0.16 0.33%
04:43:35.249 47.83 48.00 0.17 0.35%
04:44:35.250 47.94 48.00 0.06 0.13%
04:45:35.252 47.87 48.00 0.13 0.27%
04:46:35.254 47.93 48.00 0.07 0.15%
---------------------------------------------------------
Task 5 (unit: second) nice 0 weight 1024
----------------------------------------
Wall Clock CPU Time Ideal Time Lag Error
04:42:35.248 47.84 48.00 0.16 0.33%
04:43:35.249 47.93 48.00 0.07 0.15%
04:44:35.250 47.88 48.00 0.12 0.25%
04:45:35.252 47.85 48.00 0.15 0.31%
04:46:35.254 47.92 48.00 0.08 0.17%
---------------------------------------------------------
Task 6 (unit: second) nice 0 weight 1024
----------------------------------------
Wall Clock CPU Time Ideal Time Lag Error
04:42:35.248 47.89 48.00 0.11 0.23%
04:43:35.249 47.90 48.00 0.10 0.21%
04:44:35.250 47.86 48.00 0.14 0.29%
04:45:35.252 47.89 48.00 0.11 0.23%
04:46:35.254 47.86 48.00 0.14 0.29%
---------------------------------------------------------
Task 7 (unit: second) nice 0 weight 1024
----------------------------------------
Wall Clock CPU Time Ideal Time Lag Error
04:42:35.248 47.93 48.00 0.07 0.15%
04:43:35.249 47.86 48.00 0.14 0.29%
04:44:35.250 47.89 48.00 0.11 0.23%
04:45:35.252 47.86 48.00 0.14 0.29%
04:46:35.254 47.91 48.00 0.09 0.19%
---------------------------------------------------------
Task 8 (unit: second) nice 0 weight 1024
----------------------------------------
Wall Clock CPU Time Ideal Time Lag Error
04:42:35.248 47.88 48.00 0.12 0.25%
04:43:35.249 47.87 48.00 0.13 0.27%
04:44:35.250 47.87 48.00 0.13 0.27%
04:45:35.252 47.87 48.00 0.13 0.27%
04:46:35.254 47.89 48.00 0.11 0.23%
---------------------------------------------------------
Task 9 (unit: second) nice 0 weight 1024
----------------------------------------
Wall Clock CPU Time Ideal Time Lag Error
04:42:35.248 47.87 48.00 0.13 0.27%
04:43:35.249 47.92 48.00 0.08 0.17%
04:44:35.250 47.90 48.00 0.10 0.21%
04:45:35.252 47.88 48.00 0.12 0.25%
04:46:35.254 47.88 48.00 0.12 0.25%
---------------------------------------------------------
System-wide stats:
Sampling interval: 60 s
Sampling length 300 s
Num threads: 10
Num CPUs: 8
Max error: 0.40%
Min error: 0.13%

tong

Ingo Molnar

unread,
Jul 26, 2007, 5:40:10 PM7/26/07
to

* Tong Li <tong...@intel.com> wrote:

> > you need to measure it over longer periods of time. Its not worth
> > balancing for such a thing in any high-frequency manner. (we'd trash
> > the cache constantly migrating tasks back and forth.)
>
> I have some data below, but before that, I'd like to say, at the same
> load balancing rate, my proposed approach would allow us to have
> fairness on the order of seconds. I'm less concerned about trashing
> the cache. The important thing is to have a knob that allow users to

> trade off fairness and performance based on their needs. [...]

such a knob already exists to a certain degree, but i havent tested its
full effects on SMP fairness yet. If you pull my scheduler tree:

git://git.kernel.org/pub/scm/linux/kernel/git/mingo/linux-2.6-sched.git

and if you enable CONFIG_SCHED_DEBUG, then all the sched-domain
parameters become runtime tunable under /proc/sys/cpu*.

Could you try to increase the cross-CPU rebalancing frequency and see
how it impacts the precision of your measurement? Tune 'min_interval'
and 'max_interval' down to increase the frequency of rebalancing.

Ingo

Li, Tong N

unread,
Jul 26, 2007, 6:10:10 PM7/26/07
to
On Thu, 2007-07-26 at 23:31 +0200, Ingo Molnar wrote:
> * Tong Li <tong...@intel.com> wrote:
>
> > > you need to measure it over longer periods of time. Its not worth
> > > balancing for such a thing in any high-frequency manner. (we'd trash
> > > the cache constantly migrating tasks back and forth.)
> >
> > I have some data below, but before that, I'd like to say, at the same
> > load balancing rate, my proposed approach would allow us to have
> > fairness on the order of seconds. I'm less concerned about trashing
> > the cache. The important thing is to have a knob that allow users to
> > trade off fairness and performance based on their needs. [...]
>
> such a knob already exists to a certain degree, but i havent tested its
> full effects on SMP fairness yet. If you pull my scheduler tree:
>
> git://git.kernel.org/pub/scm/linux/kernel/git/mingo/linux-2.6-sched.git
>
> and if you enable CONFIG_SCHED_DEBUG, then all the sched-domain
> parameters become runtime tunable under /proc/sys/cpu*.
>
> Could you try to increase the cross-CPU rebalancing frequency and see
> how it impacts the precision of your measurement? Tune 'min_interval'
> and 'max_interval' down to increase the frequency of rebalancing.
>
> Ingo

Yes, I'll do it when I find time. If anyone is willing to do the
testing, please let me know and I can post my benchmark. On the other
hand, I don't think tuning the existing knobs will help solve the
problem. The root of the problem is that the current load balancing
doesn't take into account how much time each task has been running and
how much it's entitled to. This is why I'm proposing a new approach for
solving it. The new approach, as I said, will be much more fair/accurate
than the current one even without tuning those balancing intervals
(i.e., it's able to provide good fairness even if balancing is less
frequent).

tong

Tong Li

unread,
Jul 26, 2007, 9:40:07 PM7/26/07
to
I'd like to clarify that I'm not trying to push this particular code to
the kernel. I'm a researcher. My intent was to point out that we have a
problem in the scheduler and my dwrr algorithm can potentially help fix
it. The patch itself was merely a proof-of-concept. I'd be thrilled if the
algorithm can be proven useful in the real world. I appreciate the people
who have given me comments. Since then, I've revised my algorithm/code.
Now it doesn't require global locking but retains strong fairness
properties (which I was able to prove mathematically).

Thank you,

tong

Chris Snook

unread,
Jul 27, 2007, 1:20:06 PM7/27/07
to
Tong Li wrote:
> I'd like to clarify that I'm not trying to push this particular code to
> the kernel. I'm a researcher. My intent was to point out that we have a
> problem in the scheduler and my dwrr algorithm can potentially help fix
> it. The patch itself was merely a proof-of-concept. I'd be thrilled if
> the algorithm can be proven useful in the real world. I appreciate the
> people who have given me comments. Since then, I've revised my
> algorithm/code. Now it doesn't require global locking but retains strong
> fairness properties (which I was able to prove mathematically).

Thanks for doing this work. Please don't take the implementation criticism as a
lack of appreciation for the work. I'd like to see dwrr in the scheduler, but
I'm skeptical that re-introducing expired runqueues is the most efficient way to
do it.

Given the inherently controversial nature of scheduler code, particularly that
which attempts to enforce fairness, perhaps a concise design document would help
us come to an agreement about what we think the scheduler should do and what
tradeoffs we're willing to make to do those things. Do you have a design
document we could discuss?

-- Chris

Tong Li

unread,
Jul 27, 2007, 3:10:13 PM7/27/07
to
On Fri, 27 Jul 2007, Chris Snook wrote:

> Tong Li wrote:
>> I'd like to clarify that I'm not trying to push this particular code to the
>> kernel. I'm a researcher. My intent was to point out that we have a problem
>> in the scheduler and my dwrr algorithm can potentially help fix it. The
>> patch itself was merely a proof-of-concept. I'd be thrilled if the
>> algorithm can be proven useful in the real world. I appreciate the people
>> who have given me comments. Since then, I've revised my algorithm/code. Now
>> it doesn't require global locking but retains strong fairness properties
>> (which I was able to prove mathematically).
>
> Thanks for doing this work. Please don't take the implementation criticism
> as a lack of appreciation for the work. I'd like to see dwrr in the
> scheduler, but I'm skeptical that re-introducing expired runqueues is the
> most efficient way to do it.
>
> Given the inherently controversial nature of scheduler code, particularly
> that which attempts to enforce fairness, perhaps a concise design document
> would help us come to an agreement about what we think the scheduler should
> do and what tradeoffs we're willing to make to do those things. Do you have
> a design document we could discuss?
>
> -- Chris
>

Thanks for the interest. Attached is a design doc I wrote several months
ago (with small modifications). It talks about the two pieces of my
design: group scheduling and dwrr. The description was based on the
original O(1) scheduler, but as my CFS patch showed, the algorithm is
applicable to other underlying schedulers as well. It's interesting that I
started working on this in January for the purpose of eventually writing a
paper about it. So I knew reasonably well the related research work but
was totally unaware that people in the Linux community were also working
on similar things. This is good. If you are interested, I'd like to help
with the algorithms and theory side of the things.

tong

-------------------------------------------
Overview:

Trio extends the existing Linux scheduler with support for
proportional-share scheduling. It uses a scheduling algorithm, called
Distributed Weighted Round-Robin (DWRR), which retains the existing
scheduler design as much as possible, and extends it to achieve
proportional fairness with O(1) time complexity and a constant error
bound relative to the ideal fair scheduling algorithm. The goal of Trio
is not to improve interactive performance; rather, it relies on the
existing scheduler for interactivity and extends it to support MP
proportional fairness.

Trio has two unique features: (1) it enables users to control shares of
CPU time for any thread or group of threads (e.g., a process, an
application, etc.), and (2) it enables fair sharing of CPU time across
multiple CPUs. For example, with ten tasks running on eight CPUs, Trio
allows each task to take an equal fraction of the total CPU time. These
features enable Trio to complement the existing Linux scheduler to enable
greater user flexibility and stronger fairness.

Background:

Over the years, there has been a lot of criticism that conventional Unix
priorities and the nice interface provide insufficient support for users
to accurately control CPU shares of different threads or applications.
Many have studied scheduling algorithms that achieve proportional
fairness. Assuming that each thread has a weight that expresses its
desired CPU share, informally, a scheduler is proportionally fair if (1)
it is work-conserving, and (2) it allocates CPU time to threads in exact
proportion to their weights in any time interval. Ideal proportional
fairness is impractical since it requires that all runnable threads be
running simultaneously and scheduled with infinitesimally small quanta. In
practice, every proportional-share scheduling algorithm approximates the
ideal algorithm with the goal of achieving a constant error bound. For
more theoretical background, please refer to the following papers:

[1] A. K. Parekh and R. G. Gallager. A generalized processor sharing
approach to flow control in integrated services networks: The single-node
case. IEEE/ACM Transactions on Networking, 1(3):344-357, June 1993.

[2] C. R. Bennett and H. Zhang. WF2Q: Worst-case fair weighted fair
queueing. In Proceedings of IEEE INFOCOM '94, pages 120-128, Mar. 1996.

Previous proportional-share scheduling algorithms, however, suffer one or
more of the following problems:

(1) Inaccurate fairness with non-constant error bounds;
(2) High run-time overhead (e.g., logarithmic);
(3) Poor scalability due to the use of a global thread queue;
(4) Inefficient support for latency-sensitive applications.

Since the Linux scheduler has been successful at avoiding problems 2 to 4,
this design attempts to extend it with support for accurate proportional
fairness while retaining all of its existing benefits.

User Interface:

By default, each thread is assigned a weight proportional to its static
priority. A set of system calls also allow users to specify a weight or
reservation for any thread. Weights are relative. For example, for two
threads with weights 3 and 1, the scheduler ensures that the ratio of
their CPU time is 3:1. Reservations are absolute and in the form of X% of
the total CPU time. For example, a reservation of 80% for a thread means
that the thread always receives at least 80% of the total CPU time
regardless of other threads.

The system calls also support specifying weights or reservations for
groups of threads. For example, one can specify an 80% reservation for a
group of threads (e.g., a process) to control the total CPU share to which
the member threads are collectively entitled. Within the group, the user
can further specify local weights to different threads to control their
relative shares.
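For concreteness, a purely hypothetical sketch of what such an interface
could look like. The design doc does not name the system calls, so the
prototypes and the example below (sched_setweight, sched_setreservation)
are illustrative only, not part of the patch:

#include <sys/types.h>

/* Hypothetical prototypes -- not real kernel syscalls. */
int sched_setweight(pid_t pid, int weight);        /* relative share */
int sched_setreservation(pid_t pid, int percent);  /* absolute share, in % */

void example(pid_t a, pid_t b, pid_t group)
{
        /* Two threads at a 3:1 CPU-time ratio. */
        sched_setweight(a, 3);
        sched_setweight(b, 1);

        /* Reserve 80% of total CPU time for a whole thread group;
         * member threads then split it by their local weights. */
        sched_setreservation(group, 80);
}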

Scheduling Algorithm:

The scheduler keeps a set of data structures, called Trio groups, to maintain
the weight or reservation of each thread group (including one or more
threads) and the local weight of each member thread. When scheduling a
thread, it consults these data structures and computes (in constant time)
a system-wide weight for the thread that represents an equivalent CPU
share. Consequently, the scheduling algorithm, DWRR, operates solely based
on the system-wide weight (or weight for short, hereafter) of each thread.
Having a flat space of system-wide weights for individual threads avoids
performing separate scheduling at each level of the group hierarchy and
thus greatly simplifies the implementation for group scheduling.

For each processor, besides the existing active and expired arrays, DWRR
keeps one more array, called round-expired. It also keeps a round number
for each processor, initially all zero. A thread is said to be in round R
if it is in the active or expired array of a round-R processor. For each
thread, DWRR associates it with a round slice, equal to its weight
multiplied by a system constant, called base round slice, which controls
the total time that the thread can run in any round. When a thread
exhausts its time slice, as in the existing scheduler, DWRR moves it to
the expired array. However, when it exhausts its round slice, DWRR moves
it to the round-expired array, indicating that the thread has finished
round R. In this way, all threads in the active and expired array on a
round-R processor are running in round R, while the threads in the
round-expired array have finished round R and are waiting to start round
R+1. Threads in the active and expired arrays are scheduled the same way
as the existing scheduler.
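As a rough illustration of the round-slice accounting described above,
here is a minimal standalone C sketch (not the actual patch; the struct
and helper names are made up):

#include <stdbool.h>
#include <stdint.h>

/* Base round slice, e.g. 30 ms expressed in nanoseconds. */
static uint64_t base_round_slice_ns = 30ULL * 1000 * 1000;

struct dwrr_task {
        uint64_t weight;             /* system-wide weight */
        uint64_t round_slice_used;   /* ns run in the current round */
};

/* A task may run weight * base_round_slice in each round. */
static uint64_t round_slice(const struct dwrr_task *t)
{
        return t->weight * base_round_slice_ns;
}

/* Charge 'delta' ns of CPU time; return true when the task has used up
 * its round slice and should move to the round-expired array. */
static bool charge_round_slice(struct dwrr_task *t, uint64_t delta)
{
        t->round_slice_used += delta;
        return t->round_slice_used >= round_slice(t);
}

/* Reset the per-round usage when the task enters the next round. */
static void enter_next_round(struct dwrr_task *t)
{
        t->round_slice_used = 0;
}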

When a processor's active array is empty, as usual, the active and expired
arrays are switched. When both active and expired are empty, DWRR
eventually wants to switch the active and round-expired arrays, thus
advancing the current processor to the next round. However, to guarantee
fairness, it needs to maintain the invariant that the differences of all
processors' rounds are bounded by a constant, where the smaller this
constant is, the stronger fairness it can guarantee (the following assumes
the constant is 1). With this invariant, it can be shown that, during any
time interval, the number of rounds that any two threads go through
differs by the constant, which is key to ensuring DWRR's constant error
bound compared to the ideal algorithm.

To enforce the above invariant, DWRR keeps track of the highest round
(referred to as highest) among all processors at any time and ensures that
no processor in round highest can advance to round highest+1 (thus
updating highest), if there exists at least one thread in the system that
is still in round highest. There are at least two approaches to maintain a
global highest round variable. One is to associate it with a global lock
to ensure consistency of its value. However, this may not be scalable.
Thus, a second approach is to use no locking, but it could lead to
inconsistencies in the value. However, such inconsistencies don't affect
correctness of the kernel and the only impact is that the fairness error
of the scheduler can be twice as big as the locking approach, but the
error is still bounded by a constant and thus sufficient in most cases.
The following describes the operations of DWRR, assuming the locking
approach, while the non-locking approach requires only simple changes.

On any processor p, whenever both the active and expired arrays become
empty, DWRR compares the round of p with highest. If equal, it performs
idle load balancing in two steps: (1) it identifies runnable threads that
are in round highest but not currently running. Such threads can be in the
active or expired array of a round highest processor, or in the
round-expired array of a round highest - 1 processor. (2) Among those
threads from step 1, move X of them to the active array of p, where X is a
design choice and does not impact the fairness properties of DWRR. If step
1 returns no suitable threads, DWRR proceeds as if the round of processor
p is less than highest, in which case DWRR switches p's active and
round-expired arrays, and increments p's round by one, thus allowing all
threads in its round-expired array to advance to the next round.
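The idle-balance path above could be sketched roughly as follows
(standalone C, with hypothetical helpers standing in for the real array
and load-balancing code; a reading aid, not the implementation):

#include <stdint.h>

struct dwrr_cpu {
        uint64_t round;            /* this CPU's current round number */
        int nr_round_expired;      /* tasks that have finished this round */
};

static uint64_t highest;           /* highest round among all CPUs */

/* Hypothetical: pull runnable round-'highest' threads from other CPUs'
 * active/expired arrays, or from the round-expired arrays of CPUs in
 * round highest - 1; returns the number of threads moved. */
static int pull_round_highest_threads(struct dwrr_cpu *p);

/* Hypothetical: swap p's (empty) active/expired arrays with its
 * round-expired array. */
static void switch_active_and_round_expired(struct dwrr_cpu *p);

/* Called when both the active and expired arrays of p become empty. */
static void dwrr_idle_balance(struct dwrr_cpu *p)
{
        /* Steps 1 and 2: a CPU already in the highest round must first
         * try to pull remaining round-'highest' work before advancing. */
        if (p->round == highest && pull_round_highest_threads(p) > 0)
                return;

        if (p->nr_round_expired == 0)
                return;          /* truly idle; its round is reset to 0 */

        /* No suitable threads (or p is behind): unleash the round-expired
         * tasks and advance the local round by one. */
        switch_active_and_round_expired(p);
        p->round++;
        if (p->round > highest)
                highest = p->round;   /* under the lock, or lock-free */
}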

Whenever the system creates a new thread or awakens an existing one, DWRR
inserts the thread into the active array of an idle processor and sets the
processor's round to the current value of highest. If no idle processor
exists, it starts the thread on the least loaded processor among those in
round highest.

Whenever a processor goes idle (i.e., all of its three arrays are empty),
DWRR resets its round to zero. Similar to the existing scheduler, DWRR
also performs periodic load balancing but only among processors in round
highest. Unlike idle load balancing, periodic load balancing only improves
performance and is not necessary for fairness.

Bill Huey

unread,
Jul 27, 2007, 6:30:11 PM7/27/07
to
On Fri, Jul 27, 2007 at 12:03:28PM -0700, Tong Li wrote:
> Thanks for the interest. Attached is a design doc I wrote several months
> ago (with small modifications). It talks about the two pieces of my design:
> group scheduling and dwrr. The description was based on the original O(1)
> scheduler, but as my CFS patch showed, the algorithm is applicable to other
> underlying schedulers as well. It's interesting that I started working on
> this in January for the purpose of eventually writing a paper about it. So
> I knew reasonably well the related research work but was totally unaware
> that people in the Linux community were also working on similar things.
> This is good. If you are interested, I'd like to help with the algorithms
> and theory side of the things.

Tong,

This is sufficient as an overview of the algorithm but not detailed enough
for it to be a discussable design doc, I believe. You should ask Chris what
he means by this.

Some examples of your rebalancing scheme and how your invariant applies
across processor rounds would be helpful for me and possibly others as well.

bill

Chris Snook

unread,
Jul 27, 2007, 7:40:08 PM7/27/07
to

I don't think that achieving a constant error bound is always a good thing. We
all know that fairness has overhead. If I have 3 threads and 2 processors, and
I have a choice between fairly giving each thread 1.0 billion cycles during the
next second, or unfairly giving two of them 1.1 billion cycles and giving the
other 0.9 billion cycles, then we can have a useful discussion about where we
want to draw the line on the fairness/performance tradeoff. On the other hand,
if we can give two of them 1.1 billion cycles and still give the other one 1.0
billion cycles, it's madness to waste those 0.2 billion cycles just to avoid
user jealousy. The more complex the memory topology of a system, the more
"free" cycles you'll get by tolerating short-term unfairness. As a crude
heuristic, scaling some fairly low tolerance by log2(NCPUS) seems appropriate,
but eventually we should take the boot-time computed migration costs into
consideration.

> [1] A. K. Parekh and R. G. Gallager. A generalized processor sharing
> approach to flow control in integrated services networks: The single-node
> case. IEEE/ACM Transactions on Networking, 1(3):344-357, June 1993.
>
> [2] C. R. Bennett and H. Zhang. WF2Q: Worst-case fair weighted fair
> queueing. In Proceedings of IEEE INFOCOM '94, pages 120-128, Mar. 1996.
>
> Previous proportional-share scheduling algorithms, however, suffer one
> or more of the following problems:
>
> (1) Inaccurate fairness with non-constant error bounds;
> (2) High run-time overhead (e.g., logarithmic);
> (3) Poor scalability due to the use of a global thread queue;
> (4) Inefficient support for latency-sensitive applications.
>
> Since the Linux scheduler has been successful at avoiding problems 2 to
> 4, this design attempts to extend it with support for accurate
> proportional fairness while retaining all of its existing benefits.

If we allow a little short-term fairness (and I think we should) we can still
account for this unfairness and compensate for it (again, with the same
tolerance) at the next rebalancing.

> User Interface:
>
> By default, each thread is assigned a weight proportional to its static
> priority. A set of system calls also allow users to specify a weight or
> reservation for any thread. Weights are relative. For example, for two
> threads with weights 3 and 1, the scheduler ensures that the ratio of
> their CPU time is 3:1. Reservations are absolute and in the form of X%
> of the total CPU time. For example, a reservation of 80% for a thread
> means that the thread always receives at least 80% of the total CPU time
> regardless of other threads.
>
> The system calls also support specifying weights or reservations for
> groups of threads. For example, one can specify an 80% reservation for a
> group of threads (e.g., a process) to control the total CPU share to
> which the member threads are collectively entitled. Within the group,
> the user can further specify local weights to different threads to
> control their relative shares.

Adding system calls, while great for research, is not something which is done
lightly in the published kernel. If we're going to implement a user interface
beyond simply interpreting existing priorities more precisely, it would be nice
if this was part of a framework with a broader vision, such as a scheduler economy.

> Scheduling Algorithm:
>
> The scheduler keeps a set of data structures, called Trio groups, to
> maintain the weight or reservation of each thread group (including one
> or more threads) and the local weight of each member thread. When
> scheduling a thread, it consults these data structures and computes (in
> constant time) a system-wide weight for the thread that represents an
> equivalent CPU share. Consequently, the scheduling algorithm, DWRR,
> operates solely based on the system-wide weight (or weight for short,
> hereafter) of each thread. Having a flat space of system-wide weights
> for individual threads avoids performing separate scheduling at each
> level of the group hierarchy and thus greatly simplifies the
> implementation for group scheduling.

Implementing a flat weight space efficiently is nontrivial. I'm curious to see
how you reworked the original patch without global locking.

> For each processor, besides the existing active and expired arrays, DWRR
> keeps one more array, called round-expired. It also keeps a round number
> for each processor, initially all zero. A thread is said to be in round
> R if it is in the active or expired array of a round-R processor. For
> each thread, DWRR associates it with a round slice, equal to its weight
> multiplied by a system constant, called base round slice, which controls
> the total time that the thread can run in any round. When a thread
> exhausts its time slice, as in the existing scheduler, DWRR moves it to
> the expired array. However, when it exhausts its round slice, DWRR moves
> it to the round-expired array, indicating that the thread has finished
> round R. In this way, all threads in the active and expired array on a
> round-R processor are running in round R, while the threads in the
> round-expired array have finished round R and are awaiting to start
> round R+1. Threads in the active and expired arrays are scheduled the
> same way as the existing scheduler.

I had a feeling this patch was originally designed for the O(1) scheduler, and
this is why. The old scheduler had expired arrays, so adding a round-expired
array wasn't a radical departure from the design. CFS does not have an expired
rbtree, so adding one *is* a radical departure from the design. I think we can
implement DWRR or something very similar without using this implementation
method. Since we've already got a tree of queued tasks, it might be easiest to
basically break off one subtree (usually just one task, but not necessarily) and
migrate it to a less loaded tree whenever we can reduce the difference between
the load on the two trees by at least half. This would prevent both
overcorrection and undercorrection.
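A minimal sketch of the "reduce the difference by at least half" test
described here might look like the following (assumed names; moving a
subtree of load w changes the gap from d to |d - 2w|):

#include <stdbool.h>
#include <stdint.h>

/* Would migrating a subtree with load 'subtree_load' from the busier
 * tree to the idler one cut the load gap at least in half? */
static bool reduces_gap_by_half(uint64_t busy_load, uint64_t idle_load,
                                uint64_t subtree_load)
{
        uint64_t gap = busy_load - idle_load;   /* assume busy >= idle */
        uint64_t new_gap;

        if (2 * subtree_load > gap)
                new_gap = 2 * subtree_load - gap;   /* gap overshoots */
        else
                new_gap = gap - 2 * subtree_load;

        return 2 * new_gap <= gap;
}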

The idea of rounds was another implementation detail that bothered me. In the
old scheduler, quantizing CPU time was a necessary evil. Now that we can
account for CPU time with nanosecond resolution, doing things on an as-needed
basis seems more appropriate, and should reduce the need for global synchronization.

> On any processor p, whenever both the active and expired arrays become
> empty, DWRR compares the round of p with highest. If equal, it performs
> idle load balancing in two steps: (1) it identifies runnable threads
> that are in round highest but not currently running. Such threads can be
> in the active or expired array of a round highest processor, or in the
> round-expired array of a round highest - 1 processor. (2) Among those
> threads from step 1, move X of them to the active array of p, where X is
> a design choice and does not impact the fairness properties of DWRR. If
> step 1 returns no suitable threads, DWRR proceeds as if the round of
> processor p is less than highest, in which case DWRR switches p's active
> and round-expired arrays, and increments p's round by one, thus allowing
> all threads in its round-expired array to advance to the next round.
>
> Whenever the system creates a new thread or awakens an existing one,
> DWRR inserts the thread into the active array of an idle processor and
> sets the processor's round to the current value of highest. If no idle
> processor exists, it starts the thread on the least loaded processor
> among those in round highest.
>
> Whenever a processor goes idle (i.e., all of its three arrays are
> empty), DWRR resets its round to zero. Similar to the existing
> scheduler, DWRR also performs periodic load balancing but only among
> processors in round highest. Unlike idle load balancing, periodic load
> balancing only improves performance and is not necessary for fairness.

In summary, I think the accounting is sound, but the enforcement is sub-optimal
for the new scheduler. A revision of the algorithm more cognizant of the
capabilities and design of the current scheduler would seem to be in order.

I've referenced many times my desire to account for CPU/memory hierarchy in
these patches. At present, I'm not sure we have sufficient infrastructure in
the kernel to automatically optimize for system topology, but I think whatever
design we pursue should have some concept of this hierarchy, even if we end up
using a depth-1 tree in the short term while we figure out how to optimize this.

-- Chris

Bill Huey

unread,
Jul 27, 2007, 9:00:17 PM7/27/07
to
On Fri, Jul 27, 2007 at 07:36:17PM -0400, Chris Snook wrote:
> I don't think that achieving a constant error bound is always a good thing.
> We all know that fairness has overhead. If I have 3 threads and 2
> processors, and I have a choice between fairly giving each thread 1.0
> billion cycles during the next second, or unfairly giving two of them 1.1
> billion cycles and giving the other 0.9 billion cycles, then we can have a
> useful discussion about where we want to draw the line on the
> fairness/performance tradeoff. On the other hand, if we can give two of
> them 1.1 billion cycles and still give the other one 1.0 billion cycles,
> it's madness to waste those 0.2 billion cycles just to avoid user jealousy.
> The more complex the memory topology of a system, the more "free" cycles
> you'll get by tolerating short-term unfairness. As a crude heuristic,
> scaling some fairly low tolerance by log2(NCPUS) seems appropriate, but
> eventually we should take the boot-time computed migration costs into
> consideration.

You have to consider the target for this kind of code. There are applications
where you need something that falls within a constant error bound. According
to the numbers, the current CFS rebalancing logic doesn't achieve that to
any degree of rigor. So CFS is ok for SCHED_OTHER, but not for anything more
strict than that.

Even the rt overload code (from my memory) is subject to these limitations
as well until it's moved to use a single global queue while using CPU
binding to turn off that logic. It's the price you pay for accuracy.

> If we allow a little short-term fairness (and I think we should) we can
> still account for this unfairness and compensate for it (again, with the
> same tolerance) at the next rebalancing.

Again, it's a function of *when* and depends on that application.

> Adding system calls, while great for research, is not something which is
> done lightly in the published kernel. If we're going to implement a user
> interface beyond simply interpreting existing priorities more precisely, it
> would be nice if this was part of a framework with a broader vision, such
> as a scheduler economy.

I'm not sure what you mean by scheduler economy, but CFS can and should
be extended to handle proportional scheduling which is outside of the
traditional Unix priority semantics. Having a new API to get at this is
unavoidable if you want it to eventually support -rt oriented applications
that have bandwidth semantics.

All deadline based schedulers have API mechanisms like this to support
extended semantics. This is no different.

> I had a feeling this patch was originally designed for the O(1) scheduler,
> and this is why. The old scheduler had expired arrays, so adding a
> round-expired array wasn't a radical departure from the design. CFS does
> not have an expired rbtree, so adding one *is* a radical departure from the
> design. I think we can implement DWRR or something very similar without
> using this implementation method. Since we've already got a tree of queued
> tasks, it might be easiest to basically break off one subtree (usually just
> one task, but not necessarily) and migrate it to a less loaded tree
> whenever we can reduce the difference between the load on the two trees by
> at least half. This would prevent both overcorrection and undercorrection.

> The idea of rounds was another implementation detail that bothered me. In

> the old scheduler, quantizing CPU time was a necessary evil. Now that we
> can account for CPU time with nanosecond resolution, doing things on an
> as-needed basis seems more appropriate, and should reduce the need for
> global synchronization.

Well, there's nanosecond resolution with no mechanism that exploits it for
rebalancing. Rebalancing in general is a pain and the code for it is
generally orthogonal to the in-core scheduler data structures that are in
use, so I don't understand the objection to this argument and the choice
of methods. If it gets the job done, then these kinds of choices don't
have that much meaning.

> In summary, I think the accounting is sound, but the enforcement is
> sub-optimal for the new scheduler. A revision of the algorithm more
> cognizant of the capabilities and design of the current scheduler would
> seem to be in order.

That would be nice. But the amount of error in Tong's solution is much
less than in the current CFS logic, as previous testing showed, even without
consideration of high-resolution clocks.

So you have to give some kind of credit for that approach and recognize
that current methods in CFS are technically a dead end if there's a need for
strict fairness in a more rigorous run category than SCHED_OTHER.

> I've referenced many times my desire to account for CPU/memory hierarchy in
> these patches. At present, I'm not sure we have sufficient infrastructure
> in the kernel to automatically optimize for system topology, but I think
> whatever design we pursue should have some concept of this hierarchy, even
> if we end up using a depth-1 tree in the short term while we figure out how
> to optimize this.

Yes, it would be nice. :)

bill

Chris Snook

unread,
Jul 27, 2007, 11:10:07 PM7/27/07
to
Bill Huey (hui) wrote:
> On Fri, Jul 27, 2007 at 07:36:17PM -0400, Chris Snook wrote:
>> I don't think that achieving a constant error bound is always a good thing.
>> We all know that fairness has overhead. If I have 3 threads and 2
>> processors, and I have a choice between fairly giving each thread 1.0
>> billion cycles during the next second, or unfairly giving two of them 1.1
>> billion cycles and giving the other 0.9 billion cycles, then we can have a
>> useful discussion about where we want to draw the line on the
>> fairness/performance tradeoff. On the other hand, if we can give two of
>> them 1.1 billion cycles and still give the other one 1.0 billion cycles,
>> it's madness to waste those 0.2 billion cycles just to avoid user jealousy.
>> The more complex the memory topology of a system, the more "free" cycles
>> you'll get by tolerating short-term unfairness. As a crude heuristic,
>> scaling some fairly low tolerance by log2(NCPUS) seems appropriate, but
>> eventually we should take the boot-time computed migration costs into
>> consideration.
>
> You have to consider the target for this kind of code. There are applications
> where you need something that falls within a constant error bound. According
> to the numbers, the current CFS rebalancing logic doesn't achieve that to
> any degree of rigor. So CFS is ok for SCHED_OTHER, but not for anything more
> strict than that.

I've said from the beginning that I think that anyone who desperately needs
perfect fairness should be explicitly enforcing it with the aid of realtime
priorities. The problem is that configuring and tuning a realtime application
is a pain, and people want to be able to approximate this behavior without doing
a whole lot of dirty work themselves. I believe that CFS can and should be
enhanced to ensure SMP-fairness over potentially short, user-configurable
intervals, even for SCHED_OTHER. I do not, however, believe that we should take
it to the extreme of wasting CPU cycles on migrations that will not improve
performance for *any* task, just to avoid letting some tasks get ahead of
others. We should be as fair as possible but no fairer. If we've already made
it as fair as possible, we should account for the margin of error and correct
for it the next time we rebalance. We should not burn the surplus just to get
rid of it.

On a non-NUMA box with single-socket, non-SMT processors, a constant error bound
is fine. Once we add SMT, go multi-core, go NUMA, and add inter-chassis
interconnects on top of that, we need to multiply this error bound at each stage
in the hierarchy, or else we'll end up wasting CPU cycles on migrations that
actually hurt the processes they're supposed to be helping, and hurt everyone
else even more. I believe we should enforce an error bound that is proportional
to migration cost.

> Even the rt overload code (from my memory) is subject to these limitations
> as well until it's moved to use a single global queue while using CPU
> binding to turn off that logic. It's the price you pay for accuracy.
>
>> If we allow a little short-term fairness (and I think we should) we can
>> still account for this unfairness and compensate for it (again, with the
>> same tolerance) at the next rebalancing.
>
> Again, it's a function of *when* and depends on that application.
>
>> Adding system calls, while great for research, is not something which is
>> done lightly in the published kernel. If we're going to implement a user
>> interface beyond simply interpreting existing priorities more precisely, it
>> would be nice if this was part of a framework with a broader vision, such
>> as a scheduler economy.
>
> I'm not sure what you mean by scheduler economy, but CFS can and should
> be extended to handle proportional scheduling which is outside of the
> traditional Unix priority semantics. Having a new API to get at this is
> unavoidable if you want it to eventually support -rt oriented appications
> that have bandwidth semantics.

A scheduler economy is basically a credit scheduler, augmented to allow
processes to exchange credits with each other. If you want to get more
sophisticated with fairness, you could price CPU time proportional to load on
that CPU.

I've been house-hunting lately, so I like to think of it in real estate terms.
If you're comfortable with your standard of living and you have enough money,
you can rent the apartment in the chic part of town, right next to the subway
station. If you want to be more frugal because you're saving for retirement,
you can get a place out in the suburbs, but the commute will be more of a pain.
If you can't make up your mind and keep moving back and forth, you spend a lot
on moving and all your stuff gets dented and scratched.

> All deadline based schedulers have API mechanisms like this to support
> extended semantics. This is no different.
>
>> I had a feeling this patch was originally designed for the O(1) scheduler,
>> and this is why. The old scheduler had expired arrays, so adding a
>> round-expired array wasn't a radical departure from the design. CFS does
>> not have an expired rbtree, so adding one *is* a radical departure from the
>> design. I think we can implement DWRR or something very similar without
>> using this implementation method. Since we've already got a tree of queued
>> tasks, it might be easiest to basically break off one subtree (usually just
>> one task, but not necessarily) and migrate it to a less loaded tree
>> whenever we can reduce the difference between the load on the two trees by
>> at least half. This would prevent both overcorrection and undercorrection.
>
>> The idea of rounds was another implementation detail that bothered me. In
>> the old scheduler, quantizing CPU time was a necessary evil. Now that we
>> can account for CPU time with nanosecond resolution, doing things on an
>> as-needed basis seems more appropriate, and should reduce the need for
>> global synchronization.
>
> Well, there's nanosecond resolution with no mechanism that exploits it for
> rebalancing. Rebalancing in general is a pain and the code for it is
> generally orthogonal to the in-core scheduler data structures that are in
> use, so I don't understand the objection to this argument and the choice
> of methods. If it gets the job done, then these kinds of choices don't
> have that much meaning.

If we stick to the CFS way of doing things, instead of slapping on a construct
that is alien to the current code, we can implement it in 20 lines of code,
rather than 700. That's an optimization worth making.

>> In summary, I think the accounting is sound, but the enforcement is
>> sub-optimal for the new scheduler. A revision of the algorithm more
>> cognizant of the capabilities and design of the current scheduler would
>> seem to be in order.
>
> That would be nice. But the amount of error in Tong's solution is much
> less than in the current CFS logic, as previous testing showed, even without
> consideration of high-resolution clocks.
>
> So you have to give some kind of credit for that approach and recognize
> that current methods in CFS are technically a dead end if there's a need for
> strict fairness in a more rigorous run category than SCHED_OTHER.

But this patch is only relevant to SCHED_OTHER. The realtime scheduler doesn't
have a concept of fairness, just priorities. That why each realtime priority
level has its own separate runqueue. Realtime schedulers are supposed to be
dumb as a post, so they cannot heuristically decide to do anything other than
precisely what you configured them to do, and so they don't get in the way when
you're context switching a million times a second.

What you're describing is group fairness, which is precisely what I believe we
can improve in SCHED_OTHER or something similar using the general DWRR concept,
but implemented in a more efficient and maintainable manner.

Tong Li

unread,
Jul 28, 2007, 3:30:09 PM7/28/07
to
On Fri, 27 Jul 2007, Chris Snook wrote:

> I don't think that achieving a constant error bound is always a good thing.
> We all know that fairness has overhead. If I have 3 threads and 2
> processors, and I have a choice between fairly giving each thread 1.0 billion
> cycles during the next second, or unfairly giving two of them 1.1 billion
> cycles and giving the other 0.9 billion cycles, then we can have a useful
> discussion about where we want to draw the line on the fairness/performance
> tradeoff. On the other hand, if we can give two of them 1.1 billion cycles
> and still give the other one 1.0 billion cycles, it's madness to waste those
> 0.2 billion cycles just to avoid user jealousy. The more complex the memory
> topology of a system, the more "free" cycles you'll get by tolerating
> short-term unfairness. As a crude heuristic, scaling some fairly low
> tolerance by log2(NCPUS) seems appropriate, but eventually we should take the
> boot-time computed migration costs into consideration.

I think we are in agreement. To avoid confusion, I think we should be more
precise on what fairness means. Lag (i.e., ideal fair time - actual
service time) is the commonly used metric for fairness. The definition is
that a scheduler is proportionally fair if for any task in any time
interval, the task's lag is bounded by a constant (note it's in terms of
absolute time). The knob here is this constant, which can help trade off
performance and fairness. The reason for a constant bound is that we want
consistent fairness properties regardless of the number of tasks. For
example, we don't want the system to be much less fair as the number of
tasks increases. With DWRR, the lag bound is the max weight of currently
running tasks, multiplied by sysctl_base_round_slice. So if all tasks are
of nice 0, i.e., weight 1, and sysctl_base_round_slice equals 30 ms, then
we are guaranteed each task is at most 30ms off of the ideal case. This is
a useful property. Just like what you mentioned about the migration cost,
this property allows the scheduler or user to accurately reason about the
tradeoffs. If we want to trade fairness for performance, we can increase
sysctl_base_round_slice to, say, 100ms; doing so we also know accurately
the worst impact it has on fairness.
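A tiny worked example of that bound (standalone C, just arithmetic; the
2x factor for the lock-free variant is explained further down in the
thread):

#include <stdio.h>
#include <stdint.h>

int main(void)
{
        uint64_t base_round_slice_ms = 30;  /* sysctl_base_round_slice */
        uint64_t max_weight = 1;            /* all tasks at nice 0 */

        uint64_t lag_locked   = max_weight * base_round_slice_ms;
        uint64_t lag_lockfree = 2 * lag_locked;

        printf("lag bound with global lock: %llu ms\n",
               (unsigned long long)lag_locked);
        printf("lag bound without locking:  %llu ms\n",
               (unsigned long long)lag_lockfree);
        return 0;
}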

> Adding system calls, while great for research, is not something which is done
> lightly in the published kernel. If we're going to implement a user
> interface beyond simply interpreting existing priorities more precisely, it
> would be nice if this was part of a framework with a broader vision, such as
> a scheduler economy.

Agreed. I've seen papers on scheduler economies but am not familiar enough
with them to comment.

>
>> Scheduling Algorithm:
>>
>> The scheduler keeps a set of data structures, called Trio groups, to maintain
>> the weight or reservation of each thread group (including one or more
>> threads) and the local weight of each member thread. When scheduling a
>> thread, it consults these data structures and computes (in constant time) a
>> system-wide weight for the thread that represents an equivalent CPU share.
>> Consequently, the scheduling algorithm, DWRR, operates solely based on the
>> system-wide weight (or weight for short, hereafter) of each thread. Having
>> a flat space of system-wide weights for individual threads avoids
>> performing separate scheduling at each level of the group hierarchy and
>> thus greatly simplifies the implementation for group scheduling.
>
> Implementing a flat weight space efficiently is nontrivial. I'm curious to
> see how you reworked the original patch without global locking.

I simply removed the locking and changed a little bit in idle_balance().
The lock was there to prevent a thread from reading or writing the global
highest round value while another thread is writing to it. For writes,
it's simple to ensure, without locking, that only one write takes effect
when multiple writes are concurrent. In the case where one write is going
on and multiple threads read, without locking, the only problem is that a
reader may read a stale value and thus think the current highest round is
X while it's actually X + 1. The end effect is that a thread can be at
most two rounds behind the highest round. This changes DWRR's lag bound to
2 * (max weight of current tasks) * sysctl_base_round_slice, which is
still constant.
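One way to write that lock-free update, sketched with C11 atomics rather
than the kernel's primitives (the names are assumptions, not the revised
patch):

#include <stdatomic.h>
#include <stdint.h>

static _Atomic uint64_t highest_round;

/* A CPU that advances to round 'r' proposes it as the new highest.
 * Concurrent writers race benignly: the largest value wins. */
static void propose_highest(uint64_t r)
{
        uint64_t cur = atomic_load_explicit(&highest_round,
                                            memory_order_relaxed);
        while (r > cur &&
               !atomic_compare_exchange_weak_explicit(&highest_round,
                                                      &cur, r,
                                                      memory_order_relaxed,
                                                      memory_order_relaxed))
                ;   /* 'cur' is refreshed on failure; retry while r is newer */
}

/* Readers tolerate staleness: at worst they see X while the real value
 * is already X + 1, which doubles the lag bound but keeps it constant. */
static uint64_t read_highest(void)
{
        return atomic_load_explicit(&highest_round, memory_order_relaxed);
}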

> I had a feeling this patch was originally designed for the O(1) scheduler,
> and this is why. The old scheduler had expired arrays, so adding a
> round-expired array wasn't a radical departure from the design. CFS does not
> have an expired rbtree, so adding one *is* a radical departure from the
> design. I think we can implement DWRR or something very similar without
> using this implementation method. Since we've already got a tree of queued
> tasks, it might be easiest to basically break off one subtree (usually just
> one task, but not necessarily) and migrate it to a less loaded tree whenever
> we can reduce the difference between the load on the two trees by at least
> half. This would prevent both overcorrection and undercorrection.

Yes, the description was based on O(1) and the intent was exactly not to
be much a departure from its design. I totally agree the same philosophy
should apply to an implementation based on CFS.

> The idea of rounds was another implementation detail that bothered me. In
> the old scheduler, quantizing CPU time was a necessary evil. Now that we can
> account for CPU time with nanosecond resolution, doing things on an as-needed
> basis seems more appropriate, and should reduce the need for global
> synchronization.

Without the global locking, the global synchronization here is simply
ping-ponging a cache line once in a while. This doesn't look expensive to
me, but if it does after benchmarking, adjusting sysctl_base_round_slice
can reduce the ping-pong frequency. There might also be a smart
implementation that can alleviate this problem.

I don't understand why quantizing CPU time is a bad thing. Could you
educate me on this?

I guess it's worth mentioning that although we now have nanosecond-level
accounting, scheduling in the common case still occurs at timer tick
granularity.

> In summary, I think the accounting is sound, but the enforcement is
> sub-optimal for the new scheduler. A revision of the algorithm more
> cognizant of the capabilities and design of the current scheduler would seem
> to be in order.
>
> I've referenced many times my desire to account for CPU/memory hierarchy in
> these patches. At present, I'm not sure we have sufficient infrastructure in
> the kernel to automatically optimize for system topology, but I think
> whatever design we pursue should have some concept of this hierarchy, even if
> we end up using a depth-1 tree in the short term while we figure out how to
> optimize this.
>

Agreed.

tong

Tong Li

unread,
Jul 28, 2007, 3:40:06 PM7/28/07
to
On Fri, 27 Jul 2007, Chris Snook wrote:

> Bill Huey (hui) wrote:
>> You have to consider the target for this kind of code. There are
>> applications
>> where you need something that falls within a constant error bound.
>> According
>> to the numbers, the current CFS rebalancing logic doesn't achieve that to
>> any degree of rigor. So CFS is ok for SCHED_OTHER, but not for anything
>> more
>> strict than that.
>
> I've said from the beginning that I think that anyone who desperately needs
> perfect fairness should be explicitly enforcing it with the aid of realtime
> priorities. The problem is that configuring and tuning a realtime
> application is a pain, and people want to be able to approximate this
> behavior without doing a whole lot of dirty work themselves. I believe that
> CFS can and should be enhanced to ensure SMP-fairness over potentially short,
> user-configurable intervals, even for SCHED_OTHER. I do not, however,
> believe that we should take it to the extreme of wasting CPU cycles on
> migrations that will not improve performance for *any* task, just to avoid
> letting some tasks get ahead of others. We should be as fair as possible but
> no fairer. If we've already made it as fair as possible, we should account
> for the margin of error and correct for it the next time we rebalance. We
> should not burn the surplus just to get rid of it.

Proportional-share scheduling actually has one of its roots in real-time
scheduling, and a p-fair scheduler is essential for (soft) real-time apps.

>
> On a non-NUMA box with single-socket, non-SMT processors, a constant error
> bound is fine. Once we add SMT, go multi-core, go NUMA, and add
> inter-chassis interconnects on top of that, we need to multiply this error
> bound at each stage in the hierarchy, or else we'll end up wasting CPU cycles
> on migrations that actually hurt the processes they're supposed to be
> helping, and hurt everyone else even more. I believe we should enforce an
> error bound that is proportional to migration cost.
>

I think we are actually in agreement. When I say constant bound, it can
certainly be a constant that's determined based on inputs from the memory
hierarchy. The point is that it needs to be a constant independent of
things like # of tasks.

> But this patch is only relevant to SCHED_OTHER. The realtime scheduler
> doesn't have a concept of fairness, just priorities. That's why each realtime
> priority level has its own separate runqueue. Realtime schedulers are
> supposed to be dumb as a post, so they cannot heuristically decide to do
> anything other than precisely what you configured them to do, and so they
> don't get in the way when you're context switching a million times a second.

Are you referring to hard real-time? As I said, an infrastructure that
enables p-fair scheduling, EDF, or the like is the foundation for
real-time. I designed DWRR, however, with a target of non-RT apps,
although I was hoping the research results might be applicable to RT.

tong

Chris Snook

unread,
Jul 28, 2007, 10:50:05 PM7/28/07
to

Sounds like another scheduler class might be in order. I find CFS to be
fair enough for most purposes. If the code that gives us near-perfect
fairness at the expense of efficiency only runs when tasks have been
given boosted priority by a privileged user, and only on the CPUs that
have such tasks queued on them, the run time overhead and code
complexity become much smaller concerns.

>>
>> On a non-NUMA box with single-socket, non-SMT processors, a constant
>> error bound is fine. Once we add SMT, go multi-core, go NUMA, and add
>> inter-chassis interconnects on top of that, we need to multiply this
>> error bound at each stage in the hierarchy, or else we'll end up
>> wasting CPU cycles on migrations that actually hurt the processes
>> they're supposed to be helping, and hurt everyone else even more. I
>> believe we should enforce an error bound that is proportional to
>> migration cost.
>>
>
> I think we are actually in agreement. When I say constant bound, it can
> certainly be a constant that's determined based on inputs from the
> memory hierarchy. The point is that it needs to be a constant
> independent of things like # of tasks.

Agreed.

>> But this patch is only relevant to SCHED_OTHER. The realtime
>> scheduler doesn't have a concept of fairness, just priorities. That's
>> why each realtime priority level has its own separate runqueue.
>> Realtime schedulers are supposed to be dumb as a post, so they cannot
>> heuristically decide to do anything other than precisely what you
>> configured them to do, and so they don't get in the way when you're
>> context switching a million times a second.
>
> Are you referring to hard real-time? As I said, an infrastructure that
> enables p-fair scheduling, EDF, or things alike is the foundation for
> real-time. I designed DWRR, however, with a target of non-RT apps,
> although I was hoping the research results might be applicable to RT.

I'm referring to the static priority SCHED_FIFO and SCHED_RR schedulers,
which are (intentionally) dumb as a post, allowing userspace to manage
CPU time explicitly. Proportionally fair scheduling is a cool
capability, but not a design goal of those schedulers.

-- Chris

Chris Snook

unread,
Jul 28, 2007, 11:10:07 PM7/28/07
to
Tong Li wrote:
> Without the global locking, the global synchronization here is simply
> ping-ponging a cache line once in a while. This doesn't look expensive to
> me, but if it does after benchmarking, adjusting sysctl_base_round_slice
> can reduce the ping-pong frequency. There might also be a smart
> implementation that can alleviate this problem.

Scaling it proportionally to migration cost and log2(cpus) should suffice.
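Something along these lines, for example (a sketch with assumed constants
and helpers, not existing scheduler code):

#include <stdint.h>

static unsigned int log2u(unsigned int n)
{
        unsigned int r = 0;

        while (n >>= 1)
                r++;
        return r;
}

/* base_tolerance_ns: smallest imbalance worth correcting on a flat,
 * single-socket box; migration_cost_ns: boot-time measured cost of
 * moving a task between the two domains being balanced. */
static uint64_t fairness_tolerance(uint64_t base_tolerance_ns,
                                   uint64_t migration_cost_ns,
                                   unsigned int nr_cpus)
{
        uint64_t tol = base_tolerance_ns * (log2u(nr_cpus) + 1);

        /* Never chase imbalances cheaper than the migration itself. */
        if (tol < migration_cost_ns)
                tol = migration_cost_ns;
        return tol;
}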

> I don't understand why quantizing CPU time is a bad thing. Could you
> educate me on this?

It depends on how precisely you do it. We save a lot of power going
tickless. If round expiration is re-introducing ticks on idle CPUs, we
could waste a lot of power. Hardware is getting even more aggressive
about power saving, to the point of allowing individual cores to be
completely powered off when idle. We need to make sure the scheduler
doesn't interfere with power management.

-- Chris
