0) Power Capping Priority:
After we finish a lazy injection, we walk the task groups in order
of increasing priority. For each task group, we attempt to assign
as much vruntime as possible, to cover the time that was spent doing
the lazy injection. Within each priority level, we round-robin among
the task groups across invocations to make sure that we don't
consistently penalize the same one.
The priorities themselves are specified through the value
cpu.power_capping_priority in the parent CPU cgroup of the tasks.
1) Load balancer awareness
The idle cycle injector is an RT thread. A consequence is that, from the
load balancer's point of view, it is a particularly heavy thread. While
we appreciate its ability to preempt any CFS thread, it is useful
for it to have a lesser weight, since a heavy weight makes an injected
CPU disproportionately less desirable than other CPUs. We provide this
by faking the weight of the idle cycle injector to be equivalent to
that of a CFS thread with a user-controllable nice value.
Signed-off-by: Salman Qazi <sq...@google.com>
---
Documentation/kidled.txt | 38 ++++++++++++++++++++++-
include/linux/kidled.h | 6 ++++
kernel/kidled.c | 2 +
kernel/sched.c | 75 +++++++++++++++++++++++++++++++++++++++++++--
kernel/sched_fair.c | 77 +++++++++++++++++++++++++++++++++++++++++++++-
5 files changed, 192 insertions(+), 6 deletions(-)
diff --git a/Documentation/kidled.txt b/Documentation/kidled.txt
index 564aa00..400b97b 100644
--- a/Documentation/kidled.txt
+++ b/Documentation/kidled.txt
@@ -6,7 +6,7 @@ Overview:
Provides a kernel interface for causing the CPUs to have some
minimum percentage of the idle time.
-Interfaces:
+Basic Interfaces:
Under /proc/sys/kernel/kidled/, we can find the following files:
@@ -51,3 +51,39 @@ tasks become runnable, they are more likely to fall in an interval when we
aren't forcing the CPU idle.
+Power Capping Priority:
+
+The time taken up by the idle cycle injector normally affects all of the
+interactive processes in the same way. Essentially, that length of time
+disappears from CFS's decisions.
+
+However, this isn't always desirable. Ideally, we want
+to be able to shield some tasks from the consequences of power capping, while
+letting other tasks take the brunt of the impact. We accomplish this by
+stealing time from tasks, as if they were running while we were lazy
+injecting. We do this in a user-specified priority order. The priorities
+are specified as power_capping_priority in the parent CPU cgroup of the tasks.
+The higher the priority, the better it is for the task. The run delay
+introduced by power capping is first given to the lowest priority tasks,
+but if they aren't able to absorb it (i.e., it exceeds the time that they
+would have had available to run), it is passed on to the higher priorities.
+In case of a tie, we round-robin the order of the tasks for this penalty.
+
+Note that we reserve the power capping priority treatment for lazy injections
+only. Eagerly injected cycles are distributed equally among all the
+tasks. Since interactive tasks are unaffected by eager injection, this
+is fine.
+
+Pretending to be a CFS thread for the LB:
+
+kidled is an RT thread so that it can preempt almost anything.
+As such, it would normally have the weight associated with an RT thread.
+However, this makes a CPU receiving an idle cycle injection
+suddenly much less desirable than other CPUs running only CFS tasks.
+To provide a way to remedy this, we allow the setting of a fake nice value
+for the kidled thread. Normally these threads are nice -20. But the value
+can be adjusted by the user with /proc/sys/kernel/kidled/lb_prio. This is
+specified as a non-negative integer. 0 corresponds to nice -20 (default)
+and 39 corresponds to nice 19.
+
+
diff --git a/include/linux/kidled.h b/include/linux/kidled.h
index 05c4ae5..199915a 100644
--- a/include/linux/kidled.h
+++ b/include/linux/kidled.h
@@ -69,9 +69,15 @@ static inline int ici_in_eager_mode(void)
int kidled_running(void);
struct task_struct *get_kidled_task(int cpu);
+int get_ici_lb_prio(void);
int is_ici_thread(struct task_struct *p);
void kidled_interrupt_enter(void);
void set_cpu_idle_ratio(int cpu, long idle_time, long busy_time);
void get_cpu_idle_ratio(int cpu, long *idle_time, long *busy_time);
extern int should_eager_inject(void);
+void power_capping_reshuffle_runqueue(long injected, long period);
+extern int should_eager_inject(void);
+
+#define MAX_POWER_CAPPING_PRIORITY (48)
+
#endif
diff --git a/kernel/kidled.c b/kernel/kidled.c
index 4e7aff3..5cd6911 100644
--- a/kernel/kidled.c
+++ b/kernel/kidled.c
@@ -218,6 +218,8 @@ static void lazy_inject(long nsecs, long interval)
}
__get_cpu_var(still_lazy_injecting) = 0;
hrtimer_cancel(&halt_timer);
+
+ power_capping_reshuffle_runqueue(nsecs, interval);
}
static DEFINE_PER_CPU(int, still_monitoring);
diff --git a/kernel/sched.c b/kernel/sched.c
index 486cab2..f2e89cd 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -260,6 +260,8 @@ struct task_group {
unsigned long shares;
#ifdef CONFIG_IDLE_CYCLE_INJECTOR
int power_interactive;
+ int power_capping_priority;
+ struct list_head pcp_queue_list[NR_CPUS];
#endif
#endif
@@ -552,6 +554,9 @@ struct rq {
#ifdef CONFIG_FAIR_GROUP_SCHED
/* list of leaf cfs_rq on this cpu: */
struct list_head leaf_cfs_rq_list;
+#ifdef CONFIG_IDLE_CYCLE_INJECTOR
+ struct list_head pwrcap_prio_queue[MAX_POWER_CAPPING_PRIORITY];
+#endif
#endif
#ifdef CONFIG_RT_GROUP_SCHED
struct list_head leaf_rt_rq_list;
@@ -1867,8 +1872,20 @@ static void dec_nr_running(struct rq *rq)
static void set_load_weight(struct task_struct *p)
{
if (task_has_rt_policy(p)) {
- p->se.load.weight = prio_to_weight[0] * 2;
- p->se.load.inv_weight = prio_to_wmult[0] >> 1;
+#ifdef CONFIG_IDLE_CYCLE_INJECTOR
+ if (!is_ici_thread(p)) {
+#endif
+ p->se.load.weight = prio_to_weight[0] * 2;
+ p->se.load.inv_weight = prio_to_wmult[0] >> 1;
+#ifdef CONFIG_IDLE_CYCLE_INJECTOR
+ } else {
+ int lb_prio = get_ici_lb_prio();
+ p->se.load.weight =
+ prio_to_weight[lb_prio];
+ p->se.load.inv_weight =
+ prio_to_wmult[lb_prio];
+ }
+#endif
return;
}
@@ -9599,7 +9616,12 @@ void __init sched_init(void)
#ifdef CONFIG_GROUP_SCHED
list_add(&init_task_group.list, &task_groups);
INIT_LIST_HEAD(&init_task_group.children);
-
+#ifdef CONFIG_FAIR_GROUP_SCHED
+#ifdef CONFIG_IDLE_CYCLE_INJECTOR
+ for_each_possible_cpu(i)
+ INIT_LIST_HEAD(&init_task_group.pcp_queue_list[i]);
+#endif
+#endif
#ifdef CONFIG_USER_SCHED
INIT_LIST_HEAD(&root_task_group.children);
init_task_group.parent = &root_task_group;
@@ -9627,6 +9649,10 @@ void __init sched_init(void)
#ifdef CONFIG_FAIR_GROUP_SCHED
init_task_group.shares = init_task_group_load;
INIT_LIST_HEAD(&rq->leaf_cfs_rq_list);
+#ifdef CONFIG_IDLE_CYCLE_INJECTOR
+ for (j = 0; j < MAX_POWER_CAPPING_PRIORITY; j++)
+ INIT_LIST_HEAD(&rq->pwrcap_prio_queue[j]);
+#endif
#ifdef CONFIG_CGROUP_SCHED
/*
* How much cpu bandwidth does init_task_group get?
@@ -10110,6 +10136,11 @@ struct task_group *sched_create_group(struct task_group *parent)
WARN_ON(!parent); /* root should already exist */
+#ifdef CONFIG_IDLE_CYCLE_INJECTOR
+ for_each_possible_cpu(i)
+ INIT_LIST_HEAD(&tg->pcp_queue_list[i]);
+#endif
+
tg->parent = parent;
INIT_LIST_HEAD(&tg->children);
list_add_rcu(&tg->siblings, &parent->children);
@@ -10676,6 +10707,39 @@ static int cpu_power_interactive_write_u64(struct cgroup *cgrp,
tg->power_interactive = interactive;
return 0;
}
+
+static u64 cpu_power_capping_priority_read_u64(struct cgroup *cgrp,
+ struct cftype *cft)
+{
+ struct task_group *tg = cgroup_tg(cgrp);
+ return (u64) tg->power_capping_priority;
+}
+
+static int cpu_power_capping_priority_write_u64(struct cgroup *cgrp,
+ struct cftype *cftype,
+ u64 priority)
+{
+ struct task_group *tg = cgroup_tg(cgrp);
+ int i;
+
+ if (priority >= MAX_POWER_CAPPING_PRIORITY)
+ return -EINVAL;
+
+ tg->power_capping_priority = priority;
+
+ for_each_online_cpu(i) {
+ struct rq *rq = cpu_rq(i);
+
+ raw_spin_lock_irq(&rq->lock);
+ if (!list_empty(&tg->pcp_queue_list[i])) {
+ list_move_tail(&tg->pcp_queue_list[i],
+ &rq->pwrcap_prio_queue[priority]);
+ }
+ raw_spin_unlock_irq(&rq->lock);
+ }
+
+ return 0;
+}
#endif /* CONFIG_IDLE_CYCLE_INJECTOR */
#endif /* CONFIG_FAIR_GROUP_SCHED */
@@ -10712,6 +10776,11 @@ static struct cftype cpu_files[] = {
},
#ifdef CONFIG_IDLE_CYCLE_INJECTOR
{
+ .name = "power_capping_priority",
+ .read_u64 = cpu_power_capping_priority_read_u64,
+ .write_u64 = cpu_power_capping_priority_write_u64,
+ },
+ {
.name = "power_interactive",
.read_u64 = cpu_power_interactive_read_u64,
.write_u64 = cpu_power_interactive_write_u64,
diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
index 8fe7ee8..715a3ae 100644
--- a/kernel/sched_fair.c
+++ b/kernel/sched_fair.c
@@ -625,8 +625,23 @@ static void
account_entity_enqueue(struct cfs_rq *cfs_rq, struct sched_entity *se)
{
update_load_add(&cfs_rq->load, se->load.weight);
- if (!parent_entity(se))
+ if (!parent_entity(se)) {
+
+#ifdef CONFIG_IDLE_CYCLE_INJECTOR
+ struct task_group *tg = NULL;
+
+ if (group_cfs_rq(se))
+ tg = group_cfs_rq(se)->tg;
+ if (tg && tg->parent) {
+ int cpu = cpu_of(rq_of(cfs_rq));
+ int pcp_prio = tg->power_capping_priority;
+ list_add_tail(&tg->pcp_queue_list[cpu],
+ &rq_of(cfs_rq)->pwrcap_prio_queue[pcp_prio]);
+ }
+#endif
+
inc_cpu_load(rq_of(cfs_rq), se->load.weight);
+ }
if (entity_is_task(se)) {
add_cfs_task_weight(cfs_rq, se->load.weight);
list_add(&se->group_node, &cfs_rq->tasks);
@@ -639,8 +654,19 @@ static void
account_entity_dequeue(struct cfs_rq *cfs_rq, struct sched_entity *se)
{
update_load_sub(&cfs_rq->load, se->load.weight);
- if (!parent_entity(se))
+ if (!parent_entity(se)) {
+
+#ifdef CONFIG_IDLE_CYCLE_INJECTOR
+ struct task_group *tg = NULL;
+
+ if (group_cfs_rq(se))
+ tg = group_cfs_rq(se)->tg;
+ if (tg && tg->parent)
+ list_del_init(&tg->pcp_queue_list[cfs_rq->rq->cpu]);
+#endif
+
dec_cpu_load(rq_of(cfs_rq), se->load.weight);
+ }
if (entity_is_task(se)) {
add_cfs_task_weight(cfs_rq, -se->load.weight);
list_del_init(&se->group_node);
@@ -988,6 +1014,53 @@ entity_tick(struct cfs_rq *cfs_rq, struct sched_entity *curr, int queued)
check_preempt_tick(cfs_rq, curr);
}
+#ifdef CONFIG_IDLE_CYCLE_INJECTOR
+/* reshuffle run queue order base on power capping priority */
+void power_capping_reshuffle_runqueue(long injected, long ici_period)
+{
+ int i;
+ int cpu = smp_processor_id();
+ struct rq *rq = this_rq_lock();
+ struct task_group *tg;
+ struct task_group *next;
+
+ for (i = 0; i < MAX_POWER_CAPPING_PRIORITY; i++) {
+ struct list_head tmp_list;
+ INIT_LIST_HEAD(&tmp_list);
+ list_for_each_entry_safe(tg, next, &rq->pwrcap_prio_queue[i],
+ pcp_queue_list[cpu]) {
+ struct sched_entity *se;
+ struct cfs_rq *cfs_rq;
+ long slice, charge;
+
+ se = tg->se[cpu];
+ cfs_rq = se->cfs_rq;
+
+ slice = sched_slice(cfs_rq, se) * ici_period /
+ __sched_period(cfs_rq->nr_running);
+ charge = min(slice, injected);
+
+ __dequeue_entity(cfs_rq, se);
+ se->vruntime += calc_delta_fair(charge, se);
+ __enqueue_entity(cfs_rq, se);
+
+ injected -= charge;
+ list_del(&tg->pcp_queue_list[cpu]);
+ list_add_tail(&tg->pcp_queue_list[cpu], &tmp_list);
+ if (injected <= 0) {
+ list_splice(&tmp_list,
+ rq->pwrcap_prio_queue[i].prev);
+ goto done;
+ }
+ }
+ list_splice(&tmp_list, &rq->pwrcap_prio_queue[i]);
+ }
+done:
+ raw_spin_unlock_irq(&rq->lock);
+ return;
+}
+#endif
+
/**************************************************
* CFS operations on tasks:
*/
--
kidled is a kernel thread that implements idle cycle injection for
the purposes of power capping. It measures the naturally occuring
idle time as necessary to avoid injecting idle cycles when the
CPU is already sufficiently idle. The actual idle cycle injection
takes places in a realtime kernel thread, where as the measurements
take place in hrtimer callback functions.
Signed-off-by: Salman Qazi <sq...@google.com>
---
Documentation/kidled.txt | 40 +++
arch/x86/Kconfig | 1
arch/x86/include/asm/idle.h | 1
arch/x86/kernel/process_64.c | 2
drivers/misc/Gconfig.ici | 1
include/linux/kidled.h | 45 +++
kernel/Kconfig.ici | 6
kernel/Makefile | 1
kernel/kidled.c | 547 ++++++++++++++++++++++++++++++++++++++++++
kernel/softirq.c | 15 +
kernel/sysctl.c | 11 +
11 files changed, 664 insertions(+), 6 deletions(-)
create mode 100644 Documentation/kidled.txt
create mode 100644 drivers/misc/Gconfig.ici
create mode 100644 include/linux/kidled.h
create mode 100644 kernel/Kconfig.ici
create mode 100644 kernel/kidled.c
diff --git a/Documentation/kidled.txt b/Documentation/kidled.txt
new file mode 100644
index 0000000..1149e3f
--- /dev/null
+++ b/Documentation/kidled.txt
@@ -0,0 +1,40 @@
+Idle Cycle Injector:
+====================
+
+Overview:
+
+Provides a kernel interface for causing the CPUs to have some
+minimum percentage of the idle time.
+
+Interfaces:
+
+Under /proc/sys/kernel/kidled/, we can find the following files:
+
+cpu/*/interval
+cpu/*/min_idle_percent
+cpu/*/stats
+
+interval specifies the period of time over which we attempt to make the
+CPU min_idle_percent idle. stats provides three fields. The first is
+the naturally occurring idle time. The second is the busy time, and the last
+is the injected idle time. All three values are reported in the units of
+nanoseconds.
+
+** VERY IMPORTANT NOTE: ** In all kernel stats except for cpu/*/stats, the
+injected idle cycles are by convention reported as busy time, attributed to
+kidled.
+
+
+Operation:
+
+The injecting component of the idle cycle injector is the kernel thread
+kidled. The measurements to determine when to inject idle cycles are done
+in hrtimer callbacks. The idea is to avoid injecting idle cycles when
+the CPU is already sufficiently idle. This is accomplished by always setting
+the next timer expiry to the minimum of when we expect to run out of CPU time
+(running at full tilt) or the end of the interval. When the timer expires,
+we evaluate if we need to inject idle cycles right away to avoid blowing our
+quota. If that's the case, then we inject idle cycles until the end of the
+interval.
+
+
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index eb40925..cd384e1 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -754,6 +754,7 @@ config SCHED_MC
increased overhead in some places. If unsure say N here.
source "kernel/Kconfig.preempt"
+source "kernel/Kconfig.ici"
config X86_UP_APIC
bool "Local APIC support on uniprocessors"
diff --git a/arch/x86/include/asm/idle.h b/arch/x86/include/asm/idle.h
index 38d8737..e36c5b4 100644
--- a/arch/x86/include/asm/idle.h
+++ b/arch/x86/include/asm/idle.h
@@ -10,6 +10,7 @@ void idle_notifier_unregister(struct notifier_block *n);
#ifdef CONFIG_X86_64
void enter_idle(void);
+void __exit_idle(void);
void exit_idle(void);
#else /* !CONFIG_X86_64 */
static inline void enter_idle(void) { }
diff --git a/arch/x86/kernel/process_64.c b/arch/x86/kernel/process_64.c
index 126f0b4..a7c8932 100644
--- a/arch/x86/kernel/process_64.c
+++ b/arch/x86/kernel/process_64.c
@@ -77,7 +77,7 @@ void enter_idle(void)
atomic_notifier_call_chain(&idle_notifier, IDLE_START, NULL);
}
-static void __exit_idle(void)
+void __exit_idle(void)
{
if (x86_test_and_clear_bit_percpu(0, is_idle) == 0)
return;
diff --git a/drivers/misc/Gconfig.ici b/drivers/misc/Gconfig.ici
new file mode 100644
index 0000000..ecad2be
--- /dev/null
+++ b/drivers/misc/Gconfig.ici
@@ -0,0 +1 @@
+CONFIG_IDLE_CYCLE_INJECTOR=y
diff --git a/include/linux/kidled.h b/include/linux/kidled.h
new file mode 100644
index 0000000..7940dfa
--- /dev/null
+++ b/include/linux/kidled.h
@@ -0,0 +1,45 @@
+/*
+ * Copyright 2008 Google Inc.
+ *
+ * Author: sq...@google.com
+ *
+ */
+
+#include <linux/tick.h>
+
+#ifndef _IDLED_H
+#define _IDLED_H
+
+DECLARE_PER_CPU(unsigned long, cpu_lazy_inject_count);
+
+static inline s64 current_cpu_lazy_inject_count(void)
+{
+ /* We'll update this value in the idle cycle injector */
+ return __get_cpu_var(cpu_lazy_inject_count);
+}
+
+static inline s64 current_cpu_inject_count(void)
+{
+ return current_cpu_lazy_inject_count();
+}
+
+
+static inline s64 current_cpu_idle_count(void)
+{
+ int cpu = smp_processor_id();
+ struct tick_sched *ts = tick_get_tick_sched(cpu);
+ return ktime_to_ns(ts->idle_sleeptime) + current_cpu_inject_count();
+}
+
+static inline s64 current_cpu_busy_count(void)
+{
+ int cpu = smp_processor_id();
+ struct tick_sched *ts = tick_get_tick_sched(cpu);
+ return ktime_to_ns(ktime_sub(ktime_get(), ts->idle_sleeptime)) -
+ current_cpu_inject_count();
+}
+
+void kidled_interrupt_enter(void);
+void set_cpu_idle_ratio(int cpu, long idle_time, long busy_time);
+void get_cpu_idle_ratio(int cpu, long *idle_time, long *busy_time);
+#endif
diff --git a/kernel/Kconfig.ici b/kernel/Kconfig.ici
new file mode 100644
index 0000000..db5db95
--- /dev/null
+++ b/kernel/Kconfig.ici
@@ -0,0 +1,6 @@
+config IDLE_CYCLE_INJECTOR
+ bool "Idle Cycle Injector"
+ default n
+ help
+ Reduces power consumption by making sure that each CPU is
+ idle the given percentage of time.
diff --git a/kernel/Makefile b/kernel/Makefile
index 864ff75..fc82197 100644
--- a/kernel/Makefile
+++ b/kernel/Makefile
@@ -24,6 +24,7 @@ CFLAGS_REMOVE_sched_clock.o = -pg
CFLAGS_REMOVE_perf_event.o = -pg
endif
+obj-$(CONFIG_IDLE_CYCLE_INJECTOR) += kidled.o
obj-$(CONFIG_FREEZER) += freezer.o
obj-$(CONFIG_PROFILING) += profile.o
obj-$(CONFIG_SYSCTL_SYSCALL_CHECK) += sysctl_check.o
diff --git a/kernel/kidled.c b/kernel/kidled.c
new file mode 100644
index 0000000..f590178
--- /dev/null
+++ b/kernel/kidled.c
@@ -0,0 +1,547 @@
+/*
+ * Copyright 2008 Google Inc.
+ *
+ * Idle Cycle Injector, also affectionately known as "kidled".
+ *
+ * Allows us to force each processor to have a specific amount of idle
+ * cycles for the purposes of controlling the power consumed by the machine.
+ *
+ * Authors:
+ *
+ * Salman Qazi <sq...@google.com>
+ * Ken Chen <ken...@google.com>
+ */
+
+#include <linux/module.h>
+#include <linux/kthread.h>
+#include <linux/cpu.h>
+#include <linux/timer.h>
+#include <linux/uaccess.h>
+#include <linux/proc_fs.h>
+#include <linux/sched.h>
+#include <linux/kidled.h>
+#include <linux/poll.h>
+#include <linux/hrtimer.h>
+#include <linux/spinlock.h>
+#include <linux/sysctl.h>
+#include <linux/irqflags.h>
+#include <linux/timer.h>
+#include <asm/atomic.h>
+#include <asm/idle.h>
+
+#ifdef CONFIG_HIGH_RES_TIMERS
+#define SLEEP_GRANULARITY (20*NSEC_PER_USEC)
+#else
+#define SLEEP_GRANULARITY (NSEC_PER_MSEC)
+#endif
+
+#define KIDLED_PRIO (MAX_RT_PRIO - 2)
+#define KIDLED_DEFAULT_INTERVAL (100 * NSEC_PER_MSEC)
+
+struct kidled_inputs {
+ spinlock_t lock;
+ long idle_time;
+ long busy_time;
+};
+
+static int kidled_init_completed;
+static DEFINE_PER_CPU(struct task_struct *, kidled_thread);
+static DEFINE_PER_CPU(struct kidled_inputs, kidled_inputs);
+
+DEFINE_PER_CPU(unsigned long, cpu_lazy_inject_count);
+
+struct monitor_cpu_data {
+ int cpu;
+ long base_clock_count;
+ long base_cpu_count;
+ long max_clock_time;
+ long max_cpu_time;
+ long clock_time;
+ long cpu_time;
+};
+
+static DEFINE_PER_CPU(struct monitor_cpu_data, monitor_cpu_data);
+
+
+static DEFINE_PER_CPU(int, in_lazy_inject);
+static DEFINE_PER_CPU(unsigned long, inject_start);
+static void __enter_lazy_inject(void)
+{
+ if (!__get_cpu_var(in_lazy_inject)) {
+ __get_cpu_var(inject_start) = ktime_to_ns(ktime_get());
+ __get_cpu_var(in_lazy_inject) = 1;
+ }
+ enter_idle();
+}
+
+static void __exit_lazy_inject(void)
+{
+ if (__get_cpu_var(in_lazy_inject)) {
+ get_cpu_var(cpu_lazy_inject_count) +=
+ ktime_to_ns(ktime_get()) - __get_cpu_var(inject_start);
+ __get_cpu_var(in_lazy_inject) = 0;
+ }
+ __exit_idle();
+}
+
+static void enter_lazy_inject(void)
+{
+ local_irq_disable();
+ __enter_lazy_inject();
+ local_irq_enable();
+}
+
+static void exit_lazy_inject(void)
+{
+ local_irq_disable();
+ __exit_lazy_inject();
+ local_irq_enable();
+}
+
+/* Caller must have interrupts disabled */
+void kidled_interrupt_enter(void)
+{
+ if (!kidled_init_completed)
+ return;
+
+ __exit_lazy_inject();
+}
+
+static DEFINE_PER_CPU(int, still_lazy_injecting);
+static enum hrtimer_restart lazy_inject_timer_func(struct hrtimer *timer)
+{
+ __get_cpu_var(still_lazy_injecting) = 0;
+ return HRTIMER_NORESTART;
+}
+
+static void do_idle(void)
+{
+ void (*idle)(void) = NULL;
+
+ idle = pm_idle;
+ if (!idle)
+ idle = default_idle;
+
+ /* Put CPU to sleep until next interrupt */
+ idle();
+}
+
+/* Halts the CPU for the given number of nanoseconds.
+ *
+ * The cond_resched in there must be used responsibly, in the sense
+ * that we should have a minimal amount of work that the kernel
+ * wants done even when we are injecting idle cycles. This work
+ * should be accounted for by higher level users.
+ */
+static void lazy_inject(long nsecs, long interval)
+{
+ struct hrtimer halt_timer;
+
+ if (nsecs <= 0)
+ return;
+
+ __get_cpu_var(still_lazy_injecting) = 1;
+ hrtimer_init(&halt_timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
+ hrtimer_set_expires(&halt_timer, ktime_set(0, nsecs));
+ halt_timer.function = lazy_inject_timer_func;
+ hrtimer_start(&halt_timer, ktime_set(0, nsecs), HRTIMER_MODE_REL);
+
+ while (__get_cpu_var(still_lazy_injecting)) {
+
+ enter_lazy_inject();
+
+ /* Put CPU to sleep until next interrupt */
+ do_idle();
+ exit_lazy_inject();
+
+ /* The supervising userland thread needs to run with
+ * minimal latency. We yield to higher priority threads
+ */
+ cond_resched();
+ }
+ __get_cpu_var(still_lazy_injecting) = 0;
+ hrtimer_cancel(&halt_timer);
+}
+
+static DEFINE_PER_CPU(int, still_monitoring);
+
+/*
+ * Tells us when we would need to wake up next.
+ */
+long get_next_timer(struct monitor_cpu_data *data)
+{
+ long lazy;
+
+ lazy = min(data->max_cpu_time - data->cpu_time,
+ data->max_clock_time - data->clock_time);
+
+ lazy -= SLEEP_GRANULARITY - 1;
+
+ return lazy;
+}
+
+/*
+ * Figures out if the idle cycle injector needs to be woken up at the moment.
+ * If yes, then we go ahead and wake it up. If no, then we figure out the
+ * next time when we should make the same decision. The idea is to always
+ * make the decision before the applications use up the available CPU or
+ * clock time.
+ *
+ */
+static enum hrtimer_restart monitor_cpu_timer_func(struct hrtimer *timer)
+{
+ long next_timer;
+ struct monitor_cpu_data *data = &__get_cpu_var(monitor_cpu_data);
+
+ BUG_ON(data->cpu != smp_processor_id());
+ data->clock_time = ktime_to_ns(ktime_get()) - data->base_clock_count;
+ data->cpu_time = current_cpu_busy_count() - data->base_cpu_count;
+
+ if ((data->max_clock_time - data->clock_time < SLEEP_GRANULARITY) ||
+ (data->max_cpu_time - data->cpu_time < SLEEP_GRANULARITY)) {
+ __get_cpu_var(still_monitoring) = 0;
+
+ wake_up_process(__get_cpu_var(kidled_thread));
+ return HRTIMER_NORESTART;
+ } else {
+ next_timer = get_next_timer(data);
+
+ hrtimer_forward_now(timer, ktime_set(0, next_timer));
+ return HRTIMER_RESTART;
+ }
+}
+
+/*
+ * Allow other processes to use CPU for up to max_clock_time
+ * clock time, and max_cpu_time CPU time.
+ *
+ * Accurate only up to resolution of hrtimers.
+ *
+ * @return: Clock time left
+ */
+static unsigned long monitor_cpu(long max_clock_time, long max_cpu_time,
+ long *left_cpu_time)
+{
+ long first_timer;
+ struct hrtimer sleep_timer;
+ struct monitor_cpu_data *data = &__get_cpu_var(monitor_cpu_data);
+ data->max_clock_time = max_clock_time;
+ data->max_cpu_time = max_cpu_time;
+ data->base_clock_count = ktime_to_ns(ktime_get());
+ data->base_cpu_count = current_cpu_busy_count();
+ data->clock_time = 0;
+ data->cpu_time = 0;
+ data->cpu = smp_processor_id();
+
+ first_timer = get_next_timer(data);
+ if (first_timer <= 0) {
+ if (left_cpu_time)
+ *left_cpu_time = max_cpu_time;
+
+ return max_clock_time;
+ }
+
+ __get_cpu_var(still_monitoring) = 1;
+ hrtimer_init(&sleep_timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
+ hrtimer_set_expires(&sleep_timer, ktime_set(0, first_timer));
+ sleep_timer.function = monitor_cpu_timer_func;
+ hrtimer_start(&sleep_timer, ktime_set(0, first_timer),
+ HRTIMER_MODE_REL);
+ while (1) {
+ set_current_state(TASK_INTERRUPTIBLE);
+ if (!__get_cpu_var(still_monitoring))
+ break;
+ schedule();
+ }
+
+ __get_cpu_var(still_monitoring) = 0;
+ hrtimer_cancel(&sleep_timer);
+
+ if (left_cpu_time)
+ *left_cpu_time = max(data->max_cpu_time - data->cpu_time, 0L);
+
+ return max(data->max_clock_time - data->clock_time, 0L);
+}
+
+static int kidled(void *p)
+{
+ struct kidled_inputs *inputs = (struct kidled_inputs *)p;
+ long idle_time = 0;
+ long busy_time = 0;
+ long old_idle_time;
+ long old_busy_time;
+ long interval = 0;
+ unsigned long nsecs_left = 0;
+ __get_cpu_var(still_lazy_injecting) = 0;
+ allow_signal(SIGHUP);
+
+ while (1) {
+ old_idle_time = idle_time;
+ old_busy_time = busy_time;
+ spin_lock(&inputs->lock);
+ busy_time = inputs->busy_time;
+ idle_time = inputs->idle_time;
+
+ /* Just in case we get spurious SIGHUPs */
+ if ((old_idle_time != idle_time) ||
+ (old_busy_time != busy_time)) {
+ interval = idle_time + busy_time;
+ }
+ flush_signals(current);
+ spin_unlock(&inputs->lock);
+
+ /* Keep overhead low when dormant */
+ if (idle_time == 0) {
+ while (!signal_pending(current)) {
+ schedule_timeout_interruptible(
+ MAX_SCHEDULE_TIMEOUT);
+ }
+ }
+
+ while (!signal_pending(current)) {
+ nsecs_left = monitor_cpu(interval, busy_time, NULL);
+ lazy_inject(nsecs_left, interval);
+ }
+ }
+}
+
+void set_cpu_idle_ratio(int cpu, long idle_time, long busy_time)
+{
+ spin_lock(&per_cpu(kidled_inputs, cpu).lock);
+ per_cpu(kidled_inputs, cpu).idle_time = idle_time;
+ per_cpu(kidled_inputs, cpu).busy_time = busy_time;
+ send_sig(SIGHUP, per_cpu(kidled_thread, cpu), 1);
+ spin_unlock(&per_cpu(kidled_inputs, cpu).lock);
+}
+
+void get_cpu_idle_ratio(int cpu, long *idle_time, long *busy_time)
+{
+ spin_lock(&per_cpu(kidled_inputs, cpu).lock);
+ *idle_time = per_cpu(kidled_inputs, cpu).idle_time;
+ *busy_time = per_cpu(kidled_inputs, cpu).busy_time;
+ spin_unlock(&per_cpu(kidled_inputs, cpu).lock);
+}
+
+static long get_kidled_interval(int cpu)
+{
+ long idle_time;
+ long busy_time;
+ get_cpu_idle_ratio(cpu, &idle_time, &busy_time);
+ return idle_time + busy_time;
+}
+
+static void set_kidled_interval(int cpu, long interval)
+{
+ int old_interval;
+ spin_lock(&per_cpu(kidled_inputs, cpu).lock);
+ old_interval = per_cpu(kidled_inputs, cpu).busy_time +
+ per_cpu(kidled_inputs, cpu).idle_time;
+ per_cpu(kidled_inputs, cpu).idle_time =
+ (per_cpu(kidled_inputs, cpu).idle_time
+ * interval) / old_interval;
+ per_cpu(kidled_inputs, cpu).busy_time = interval -
+ per_cpu(kidled_inputs, cpu).idle_time;
+ send_sig(SIGHUP, per_cpu(kidled_thread, cpu), 1);
+ spin_unlock(&per_cpu(kidled_inputs, cpu).lock);
+}
+
+static int proc_min_idle_percent(struct ctl_table *table, int write,
+ void __user *buffer, size_t *lenp,
+ loff_t *ppos)
+{
+ long idle_time;
+ long busy_time;
+ int ratio;
+ struct ctl_table fake = {};
+ int zero = 0;
+ int hundred = 100;
+ int ret;
+
+ int cpu = (int)((long)table->extra1);
+
+ fake.data = &ratio;
+ fake.maxlen = sizeof(int);
+ fake.extra1 = &zero;
+ fake.extra2 = &hundred;
+
+
+ if (!write) {
+ get_cpu_idle_ratio(cpu, &idle_time, &busy_time);
+ ratio = (int)((idle_time * 100) / (idle_time + busy_time));
+ }
+
+ ret = proc_dointvec_minmax(&fake, write, buffer, lenp, ppos);
+
+ if (!ret && write) {
+ int idle_interval;
+
+ idle_interval = get_kidled_interval(cpu);
+ idle_time = ((long)ratio * idle_interval) / 100;
+
+ /* round down new_idle to timer resolution */
+ idle_time = (idle_time / SLEEP_GRANULARITY) *
+ SLEEP_GRANULARITY;
+
+ set_cpu_idle_ratio(cpu, idle_time,
+ idle_interval - idle_time);
+ }
+
+ return ret;
+}
+
+static int proc_interval(struct ctl_table *table, int write,
+ void __user *buffer, size_t *lenp, loff_t *ppos)
+{
+ long idle_time;
+ long busy_time;
+ int interval;
+ struct ctl_table fake = {};
+ int min = 1;
+ int max = 500;
+ int ret;
+
+ int cpu = (int)((long)table->extra1);
+
+ fake.data = &interval;
+ fake.maxlen = sizeof(int);
+ fake.extra1 = &min;
+ fake.extra2 = &max;
+
+
+ if (!write) {
+ get_cpu_idle_ratio(cpu, &idle_time, &busy_time);
+ interval = (int)((idle_time + busy_time) / NSEC_PER_MSEC);
+ }
+
+ ret = proc_dointvec_minmax(&fake, write, buffer, lenp, ppos);
+
+ if (!ret && write)
+ set_kidled_interval(cpu, (long)interval * NSEC_PER_MSEC);
+
+ return ret;
+}
+
+static void getstats(void *info)
+{
+ unsigned long *stats = (unsigned long *)info;
+ stats[0] = current_cpu_idle_count();
+ stats[1] = current_cpu_busy_count();
+ stats[2] = current_cpu_lazy_inject_count();
+}
+
+
+static int proc_stats(struct ctl_table *table, int write,
+ void __user *buffer, size_t *lenp, loff_t *ppos)
+{
+ int ret;
+ unsigned long stats[3];
+ int cpu = (int)((long)table->extra1);
+ struct ctl_table fake = {};
+
+ if (write)
+ return -EINVAL;
+
+ fake.data = stats;
+ fake.maxlen = 3*sizeof(unsigned long);
+
+ ret = smp_call_function_single(cpu, getstats, &stats, 1);
+ if (ret)
+ return ret;
+
+ return proc_doulongvec_minmax(&fake, write, buffer, lenp, ppos);
+
+}
+
+#define NUM_CPU_CTLS 3
+#define CPU_NUM_SIZE 5
+
+static struct ctl_table kidled_cpu_dir_prot[NUM_CPU_CTLS + 1] = {
+ {
+ .procname = "min_idle_percent",
+ .proc_handler = proc_min_idle_percent,
+ .mode = 0644,
+ },
+ {
+ .procname = "interval",
+ .proc_handler = proc_interval,
+ .mode = 0644,
+ },
+ {
+ .procname = "stats",
+ .proc_handler = proc_stats,
+ .mode = 0444,
+ },
+
+ { }
+
+};
+static DEFINE_PER_CPU(char[CPU_NUM_SIZE], cpu_num);
+
+static DEFINE_PER_CPU(struct ctl_table[NUM_CPU_CTLS + 1],
+ kidled_cpu_dir_table);
+
+/* This is the kidled/cpu/ directory */
+static struct ctl_table kidled_cpu_table[NR_CPUS + 1];
+
+static int zero;
+
+struct ctl_table kidled_table[] = {
+ {
+ .procname = "cpu",
+ .mode = 0555,
+ .child = kidled_cpu_table,
+ },
+ { }
+};
+
+static int __init kidled_init(void)
+{
+ int cpu;
+ int i;
+
+ /*
+ * One priority level below maximum. The next higher priority level
+ * will be used by a userland thread supervising us.
+ */
+ struct sched_param param = { .sched_priority = KIDLED_PRIO };
+
+ if (!proc_mkdir("driver/kidled", NULL))
+ return 1;
+
+ for_each_online_cpu(cpu) {
+ spin_lock_init(&per_cpu(kidled_inputs, cpu).lock);
+ per_cpu(kidled_inputs, cpu).idle_time = 0;
+ per_cpu(kidled_inputs, cpu).busy_time =
+ KIDLED_DEFAULT_INTERVAL;
+ per_cpu(kidled_thread, cpu) = kthread_create(kidled,
+ &per_cpu(kidled_inputs, cpu), "kidled/%d", cpu);
+ if (IS_ERR(per_cpu(kidled_thread, cpu))) {
+ printk(KERN_ERR "Failed to start kidled on CPU %d\n",
+ cpu);
+ BUG();
+ }
+
+ kthread_bind(per_cpu(kidled_thread, cpu), cpu);
+ sched_setscheduler(per_cpu(kidled_thread, cpu),
+ SCHED_FIFO, &param);
+ wake_up_process(per_cpu(kidled_thread, cpu));
+
+ snprintf(per_cpu(cpu_num, cpu), CPU_NUM_SIZE, "%d", cpu);
+ kidled_cpu_table[cpu].procname = per_cpu(cpu_num, cpu);
+ kidled_cpu_table[cpu].mode = 0555;
+ kidled_cpu_table[cpu].child = per_cpu(kidled_cpu_dir_table,
+ cpu);
+
+ memcpy(per_cpu(kidled_cpu_dir_table, cpu), kidled_cpu_dir_prot,
+ sizeof(kidled_cpu_dir_prot));
+
+ for (i = 0; i < NUM_CPU_CTLS; i++) {
+ per_cpu(kidled_cpu_dir_table[i], cpu).extra1 =
+ (void *)((long)cpu);
+ }
+
+ }
+ kidled_init_completed = 1;
+ return 0;
+}
+module_init(kidled_init);
diff --git a/kernel/softirq.c b/kernel/softirq.c
index 7c1a67e..97d6193 100644
--- a/kernel/softirq.c
+++ b/kernel/softirq.c
@@ -24,6 +24,7 @@
#include <linux/ftrace.h>
#include <linux/smp.h>
#include <linux/tick.h>
+#include <linux/kidled.h>
#define CREATE_TRACE_POINTS
#include <trace/events/irq.h>
@@ -278,11 +279,15 @@ void irq_enter(void)
int cpu = smp_processor_id();
rcu_irq_enter();
- if (idle_cpu(cpu) && !in_interrupt()) {
- __irq_enter();
- tick_check_idle(cpu);
- } else
- __irq_enter();
+ __irq_enter();
+ if (!in_interrupt()) {
+ if (idle_cpu(cpu))
+ tick_check_idle(cpu);
+
+#ifdef CONFIG_IDLE_CYCLE_INJECTOR
+ kidled_interrupt_enter();
+#endif
+ }
}
#ifdef __ARCH_IRQ_EXIT_IRQS_DISABLED
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 8a68b24..eaec177 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -190,6 +190,9 @@ static struct ctl_table fs_table[];
static struct ctl_table debug_table[];
static struct ctl_table dev_table[];
extern struct ctl_table random_table[];
+#ifdef CONFIG_IDLE_CYCLE_INJECTOR
+extern struct ctl_table kidled_table[];
+#endif
#ifdef CONFIG_INOTIFY_USER
extern struct ctl_table inotify_table[];
#endif
@@ -601,6 +604,14 @@ static struct ctl_table kern_table[] = {
.mode = 0555,
.child = random_table,
},
+
+#ifdef CONFIG_IDLE_CYCLE_INJECTOR
+ {
+ .procname = "kidled",
+ .mode = 0555,
+ .child = kidled_table,
+ },
+#endif
{
.procname = "overflowuid",
.data = &overflowuid,
We add the concept of a "power interactive" task group. This is a task
group that, for the purposes of power capping, will receive special treatment.
When there are no power interactive tasks on the runqueue, we inject idle
cycles unless we have already met the quota. However, when there are
power interactive tasks on the runqueue, we only inject idle cycles if we
would otherwise fail to meet the quota. As a result, we try our very best
to not hit the interactive tasks with the idle cycles. The power
interactivity status of a task group is determined by the boolean value
in cpu.power_interactive.
Signed-off-by: Salman Qazi <sq...@google.com>
---
Documentation/kidled.txt | 15 ++++
include/linux/kidled.h | 34 +++++++++
include/linux/sched.h | 3 +
kernel/kidled.c | 166 +++++++++++++++++++++++++++++++++++++++++++---
kernel/sched.c | 80 ++++++++++++++++++++++
5 files changed, 285 insertions(+), 13 deletions(-)
diff --git a/Documentation/kidled.txt b/Documentation/kidled.txt
index 1149e3f..564aa00 100644
--- a/Documentation/kidled.txt
+++ b/Documentation/kidled.txt
@@ -25,7 +25,7 @@ injected idle cycles are by convention reported as busy time, attributed to
kidled.
-Operation:
+Basic Operation:
The injecting component of the idle cycle injector is the kernel thread
kidled. The measurements to determine when to inject idle cycles is done
@@ -38,3 +38,16 @@ quota. If that's the case, then we inject idle cycles until the end of the
interval.
+Eager Injection:
+
+The above holds when there is at least one task marked "interactive" on
+the CPU runqueue for the duration of the interval. Marking a task
+interactive involves setting power_interactive to 1 in its parent CPU
+cgroup. When no such task is runnable and we have not achieved
+the minimum idle percentage for the interval, we eagerly inject idle cycles.
+The purpose for doing so is to inject as many of the idle cycles as possible
+while the interactive tasks are not running. Thus, when the interactive
+tasks become runnable, they are more likely to fall in an interval when we
+aren't forcing the CPU idle.
+
+
diff --git a/include/linux/kidled.h b/include/linux/kidled.h
index 7940dfa..05c4ae5 100644
--- a/include/linux/kidled.h
+++ b/include/linux/kidled.h
@@ -11,6 +11,7 @@
#define _IDLED_H
DECLARE_PER_CPU(unsigned long, cpu_lazy_inject_count);
+DECLARE_PER_CPU(unsigned long, cpu_eager_inject_count);
static inline s64 current_cpu_lazy_inject_count(void)
{
@@ -18,9 +19,16 @@ static inline s64 current_cpu_lazy_inject_count(void)
return __get_cpu_var(cpu_lazy_inject_count);
}
+static inline s64 current_cpu_eager_inject_count(void)
+{
+ /* We update this value in the idle cycle injector */
+ return __get_cpu_var(cpu_eager_inject_count);
+}
+
static inline s64 current_cpu_inject_count(void)
{
- return current_cpu_lazy_inject_count();
+ return current_cpu_lazy_inject_count() +
+ current_cpu_eager_inject_count();
}
@@ -42,4 +50,28 @@ static inline s64 current_cpu_busy_count(void)
void kidled_interrupt_enter(void);
void set_cpu_idle_ratio(int cpu, long idle_time, long busy_time);
void get_cpu_idle_ratio(int cpu, long *idle_time, long *busy_time);
+
+enum ici_enum {
+ ICI_LAZY,
+ ICI_EAGER,
+};
+
+DECLARE_PER_CPU(enum ici_enum, ici_state);
+
+static inline int ici_in_eager_mode(void)
+{
+#ifdef CONFIG_IDLE_CYCLE_INJECTOR
+ return (__get_cpu_var(ici_state) == ICI_EAGER);
+#else
+ return 0;
+#endif
+}
+
+int kidled_running(void);
+struct task_struct *get_kidled_task(int cpu);
+int is_ici_thread(struct task_struct *p);
+void kidled_interrupt_enter(void);
+void set_cpu_idle_ratio(int cpu, long idle_time, long busy_time);
+void get_cpu_idle_ratio(int cpu, long *idle_time, long *busy_time);
+extern int should_eager_inject(void);
#endif
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 78efe7c..1f94f21 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1566,6 +1566,9 @@ struct task_struct {
unsigned long memsw_bytes; /* uncharged mem+swap usage */
} memcg_batch;
#endif
+#ifdef CONFIG_IDLE_CYCLE_INJECTOR
+ int power_interactive;
+#endif
};
/* Future-safe accessor for struct task_struct's cpus_allowed. */
diff --git a/kernel/kidled.c b/kernel/kidled.c
index f590178..4e7aff3 100644
--- a/kernel/kidled.c
+++ b/kernel/kidled.c
@@ -45,10 +45,16 @@ struct kidled_inputs {
};
static int kidled_init_completed;
+
+DEFINE_PER_CPU(enum ici_enum, ici_state);
static DEFINE_PER_CPU(struct task_struct *, kidled_thread);
static DEFINE_PER_CPU(struct kidled_inputs, kidled_inputs);
DEFINE_PER_CPU(unsigned long, cpu_lazy_inject_count);
+DEFINE_PER_CPU(unsigned long, cpu_eager_inject_count);
+
+static int sysctl_ici_lb_prio;
+static int ici_lb_prio_max = MAX_PRIO - MAX_RT_PRIO - 1;
struct monitor_cpu_data {
int cpu;
@@ -58,10 +64,26 @@ struct monitor_cpu_data {
long max_cpu_time;
long clock_time;
long cpu_time;
+ long eager_inject_goal;
};
static DEFINE_PER_CPU(struct monitor_cpu_data, monitor_cpu_data);
+int get_ici_lb_prio(void)
+{
+ return sysctl_ici_lb_prio;
+}
+
+int is_ici_thread(struct task_struct *p)
+{
+ return per_cpu(kidled_thread, task_cpu(p)) == p;
+}
+
+int kidled_running(void)
+{
+ return __get_cpu_var(kidled_thread)->se.on_rq;
+}
+
static DEFINE_PER_CPU(int, in_lazy_inject);
static DEFINE_PER_CPU(unsigned long, inject_start);
@@ -98,6 +120,40 @@ static void exit_lazy_inject(void)
local_irq_enable();
}
+static DEFINE_PER_CPU(int, in_eager_inject);
+static void __enter_eager_inject(void)
+{
+ if (!__get_cpu_var(in_eager_inject)) {
+ __get_cpu_var(inject_start) = ktime_to_ns(ktime_get());
+ __get_cpu_var(in_eager_inject) = 1;
+ }
+ enter_idle();
+}
+
+static void __exit_eager_inject(void)
+{
+ if (__get_cpu_var(in_eager_inject)) {
+ __get_cpu_var(cpu_eager_inject_count) +=
+ ktime_to_ns(ktime_get()) - __get_cpu_var(inject_start);
+ __get_cpu_var(in_eager_inject) = 0;
+ }
+ __exit_idle();
+}
+
+static void enter_eager_inject(void)
+{
+ local_irq_disable();
+ __enter_eager_inject();
+ local_irq_enable();
+}
+
+static void exit_eager_inject(void)
+{
+ local_irq_disable();
+ __exit_eager_inject();
+ local_irq_enable();
+}
+
/* Caller must have interrupts disabled */
void kidled_interrupt_enter(void)
{
@@ -105,6 +161,7 @@ void kidled_interrupt_enter(void)
return;
__exit_lazy_inject();
+ __exit_eager_inject();
}
static DEFINE_PER_CPU(int, still_lazy_injecting);
@@ -168,8 +225,25 @@ static DEFINE_PER_CPU(int, still_monitoring);
/*
* Tells us when we would need to wake up next.
*/
-long get_next_timer(struct monitor_cpu_data *data)
+static void eager_inject(void)
+{
+ while (should_eager_inject() && __get_cpu_var(still_monitoring)
+ && ici_in_eager_mode()) {
+ enter_eager_inject();
+ do_idle();
+ exit_eager_inject();
+ cond_resched();
+ }
+}
+
+/*
+ * Tells us when we would need to wake up next
+ */
+long get_next_timer(struct monitor_cpu_data *data,
+ enum ici_enum *state)
{
+ long next_timer;
+ long rounded_eager;
long lazy;
lazy = min(data->max_cpu_time - data->cpu_time,
@@ -177,7 +251,19 @@ long get_next_timer(struct monitor_cpu_data *data)
lazy -= SLEEP_GRANULARITY - 1;
- return lazy;
+ if (data->eager_inject_goal > 0) {
+ *state = ICI_EAGER;
+ if (!should_eager_inject())
+ rounded_eager = NSEC_PER_MSEC;
+ else
+ rounded_eager = roundup(data->eager_inject_goal,
+ SLEEP_GRANULARITY);
+ next_timer = min(lazy, rounded_eager);
+ } else {
+ *state = ICI_LAZY;
+ next_timer = lazy;
+ }
+ return next_timer;
}
/*
@@ -191,32 +277,51 @@ long get_next_timer(struct monitor_cpu_data *data)
static enum hrtimer_restart monitor_cpu_timer_func(struct hrtimer *timer)
{
long next_timer;
+ enum ici_enum old_state;
struct monitor_cpu_data *data = &__get_cpu_var(monitor_cpu_data);
BUG_ON(data->cpu != smp_processor_id());
data->clock_time = ktime_to_ns(ktime_get()) - data->base_clock_count;
data->cpu_time = current_cpu_busy_count() - data->base_cpu_count;
+ data->eager_inject_goal = (data->max_clock_time - data->max_cpu_time) -
+ (data->clock_time - data->cpu_time);
if ((data->max_clock_time - data->clock_time < SLEEP_GRANULARITY) ||
(data->max_cpu_time - data->cpu_time < SLEEP_GRANULARITY)) {
__get_cpu_var(still_monitoring) = 0;
+ __get_cpu_var(ici_state) = ICI_LAZY;
wake_up_process(__get_cpu_var(kidled_thread));
return HRTIMER_NORESTART;
} else {
- next_timer = get_next_timer(data);
+ old_state = __get_cpu_var(ici_state);
+ next_timer = get_next_timer(data, &__get_cpu_var(ici_state));
+
+ if (__get_cpu_var(ici_state) != old_state)
+ set_tsk_need_resched(current);
+
+ if (ici_in_eager_mode() && should_eager_inject() &&
+ !kidled_running())
+ wake_up_process(__get_cpu_var(kidled_thread));
hrtimer_forward_now(timer, ktime_set(0, next_timer));
return HRTIMER_RESTART;
}
}
+struct task_struct *get_kidled_task(int cpu)
+{
+ return per_cpu(kidled_thread, cpu);
+}
+
/*
* Allow other processes to use CPU for up to max_clock_time
* clock time, and max_cpu_time CPU time.
*
* Accurate only up to resolution of hrtimers.
*
+ * Invariant: This function should return with ici_state == ICI_LAZY.
+ *
* @return: Clock time left
*/
static unsigned long monitor_cpu(long max_clock_time, long max_cpu_time,
@@ -232,12 +337,14 @@ static unsigned long monitor_cpu(long max_clock_time, long max_cpu_time,
data->clock_time = 0;
data->cpu_time = 0;
data->cpu = smp_processor_id();
+ data->eager_inject_goal = max_clock_time - max_cpu_time;
- first_timer = get_next_timer(data);
+ first_timer = get_next_timer(data, &__get_cpu_var(ici_state));
if (first_timer <= 0) {
if (left_cpu_time)
*left_cpu_time = max_cpu_time;
+ __get_cpu_var(ici_state) = ICI_LAZY;
return max_clock_time;
}
@@ -247,11 +354,19 @@ static unsigned long monitor_cpu(long max_clock_time, long max_cpu_time,
sleep_timer.function = monitor_cpu_timer_func;
hrtimer_start(&sleep_timer, ktime_set(0, first_timer),
HRTIMER_MODE_REL);
- while (1) {
- set_current_state(TASK_INTERRUPTIBLE);
- if (!__get_cpu_var(still_monitoring))
- break;
- schedule();
+
+ while (__get_cpu_var(still_monitoring)) {
+ while (1) {
+ set_current_state(TASK_INTERRUPTIBLE);
+ if (!__get_cpu_var(still_monitoring) ||
+ (ici_in_eager_mode() && should_eager_inject())) {
+ set_current_state(TASK_RUNNING);
+ break;
+ }
+ schedule();
+ }
+
+ eager_inject();
}
__get_cpu_var(still_monitoring) = 0;
@@ -345,6 +460,25 @@ static void set_kidled_interval(int cpu, long interval)
spin_unlock(&per_cpu(kidled_inputs, cpu).lock);
}
+static int proc_ici_lb_prio(struct ctl_table *table, int write,
+ void __user *buffer, size_t *lenp, loff_t *ppos)
+{
+ int ret;
+ int cpu;
+ struct sched_param param = { .sched_priority = KIDLED_PRIO };
+ ret = proc_dointvec_minmax(table, write, buffer, lenp, ppos);
+
+ if (!ret && write) {
+ /* Make the scheduler set the load weight again */
+ for_each_online_cpu(cpu) {
+ sched_setscheduler(per_cpu(kidled_thread, cpu),
+ SCHED_FIFO, &param);
+ }
+ }
+
+ return ret;
+}
+
static int proc_min_idle_percent(struct ctl_table *table, int write,
void __user *buffer, size_t *lenp,
loff_t *ppos)
@@ -427,6 +561,7 @@ static void getstats(void *info)
stats[0] = current_cpu_idle_count();
stats[1] = current_cpu_busy_count();
stats[2] = current_cpu_lazy_inject_count();
+ stats[3] = current_cpu_eager_inject_count();
}
@@ -434,7 +569,7 @@ static int proc_stats(struct ctl_table *table, int write,
void __user *buffer, size_t *lenp, loff_t *ppos)
{
int ret;
- unsigned long stats[3];
+ unsigned long stats[4];
int cpu = (int)((long)table->extra1);
struct ctl_table fake = {};
@@ -442,7 +577,7 @@ static int proc_stats(struct ctl_table *table, int write,
return -EINVAL;
fake.data = stats;
- fake.maxlen = 3*sizeof(unsigned long);
+ fake.maxlen = 4*sizeof(unsigned long);
ret = smp_call_function_single(cpu, getstats, &stats, 1);
if (ret)
@@ -487,6 +622,15 @@ static int zero;
struct ctl_table kidled_table[] = {
{
+ .procname = "lb_prio",
+ .data = &sysctl_ici_lb_prio,
+ .maxlen = sizeof(int),
+ .proc_handler = proc_ici_lb_prio,
+ .extra1 = &zero,
+ .extra2 = &ici_lb_prio_max,
+ .mode = 0644,
+ },
+ {
.procname = "cpu",
.mode = 0555,
.child = kidled_cpu_table,
diff --git a/kernel/sched.c b/kernel/sched.c
index 3a8fb30..486cab2 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -71,6 +71,7 @@
#include <linux/debugfs.h>
#include <linux/ctype.h>
#include <linux/ftrace.h>
+#include <linux/kidled.h>
#include <asm/tlb.h>
#include <asm/irq_regs.h>
@@ -257,6 +258,9 @@ struct task_group {
/* runqueue "owned" by this group on each cpu */
struct cfs_rq **cfs_rq;
unsigned long shares;
+#ifdef CONFIG_IDLE_CYCLE_INJECTOR
+ int power_interactive;
+#endif
#endif
#ifdef CONFIG_RT_GROUP_SCHED
@@ -626,6 +630,10 @@ struct rq {
/* BKL stats */
unsigned int bkl_count;
#endif
+
+#ifdef CONFIG_IDLE_CYCLE_INJECTOR
+ unsigned int nr_interactive;
+#endif
};
static DEFINE_PER_CPU_SHARED_ALIGNED(struct rq, runqueues);
@@ -1888,6 +1896,13 @@ static void enqueue_task(struct rq *rq, struct task_struct *p, int wakeup)
if (wakeup)
p->se.start_runtime = p->se.sum_exec_runtime;
+#ifdef CONFIG_IDLE_CYCLE_INJECTOR
+ if (!p->se.on_rq) {
+ p->power_interactive = task_group(p)->power_interactive;
+ rq->nr_interactive += p->power_interactive;
+ }
+#endif
+
sched_info_queued(p);
p->sched_class->enqueue_task(rq, p, wakeup);
p->se.on_rq = 1;
@@ -1906,6 +1921,11 @@ static void dequeue_task(struct rq *rq, struct task_struct *p, int sleep)
}
}
+#ifdef CONFIG_IDLE_CYCLE_INJECTOR
+ if (p->se.on_rq)
+ rq->nr_interactive -= p->power_interactive;
+#endif
+
sched_info_dequeued(p);
p->sched_class->dequeue_task(rq, p, sleep);
p->se.on_rq = 0;
@@ -5443,6 +5463,19 @@ static void put_prev_task(struct rq *rq, struct task_struct *prev)
prev->sched_class->put_prev_task(rq, prev);
}
+#ifdef CONFIG_IDLE_CYCLE_INJECTOR
+int curr_rq_has_interactive(void)
+{
+ return (this_rq()->nr_interactive > 0);
+}
+
+int should_eager_inject(void)
+{
+ return !curr_rq_has_interactive() && (!this_rq()->rt.rt_nr_running
+ || ((this_rq()->rt.rt_nr_running == 1) && kidled_running()));
+}
+#endif
+
/*
* Pick up the highest-prio task:
*/
@@ -5452,6 +5485,23 @@ pick_next_task(struct rq *rq)
const struct sched_class *class;
struct task_struct *p;
+#ifdef CONFIG_IDLE_CYCLE_INJECTOR
+ if (ici_in_eager_mode() && should_eager_inject() &&
+ !kidled_running()) {
+ p = get_kidled_task(cpu_of(rq));
+
+ current->se.last_wakeup = current->se.sum_exec_runtime;
+
+#if defined(CONFIG_SMP) && defined(CONFIG_SCHEDSTATS)
+ schedstat_inc(rq, ttwu_count);
+ schedstat_inc(rq, ttwu_local);
+#endif
+
+ set_task_state(p, TASK_RUNNING);
+ activate_task(rq, p, 1);
+ }
+#endif
+
/*
* Optimization: we know that if all tasks are in
* the fair class we can call that function directly:
@@ -9567,6 +9617,9 @@ void __init sched_init(void)
rq = cpu_rq(i);
raw_spin_lock_init(&rq->lock);
rq->nr_running = 0;
+#ifdef CONFIG_IDLE_CYCLE_INJECTOR
+ rq->nr_interactive = 0;
+#endif
rq->calc_load_active = 0;
rq->calc_load_update = jiffies + LOAD_FREQ;
init_cfs_rq(&rq->cfs, rq);
@@ -10604,6 +10657,26 @@ static u64 cpu_shares_read_u64(struct cgroup *cgrp, struct cftype *cft)
return (u64) tg->shares;
}
+
+#ifdef CONFIG_IDLE_CYCLE_INJECTOR
+static u64 cpu_power_interactive_read_u64(struct cgroup *cgrp,
+ struct cftype *cft)
+{
+ struct task_group *tg = cgroup_tg(cgrp);
+ return (u64) tg->power_interactive;
+}
+
+static int cpu_power_interactive_write_u64(struct cgroup *cgrp,
+ struct cftype *cft, u64 interactive)
+{
+ struct task_group *tg = cgroup_tg(cgrp);
+ if (interactive > 1) /* interactive is u64; only 0 or 1 valid */
+ return -EINVAL;
+
+ tg->power_interactive = interactive;
+ return 0;
+}
+#endif /* CONFIG_IDLE_CYCLE_INJECTOR */
#endif /* CONFIG_FAIR_GROUP_SCHED */
#ifdef CONFIG_RT_GROUP_SCHED
@@ -10637,6 +10710,13 @@ static struct cftype cpu_files[] = {
.read_u64 = cpu_shares_read_u64,
.write_u64 = cpu_shares_write_u64,
},
+#ifdef CONFIG_IDLE_CYCLE_INJECTOR
+ {
+ .name = "power_interactive",
+ .read_u64 = cpu_power_interactive_read_u64,
+ .write_u64 = cpu_power_interactive_write_u64,
+ },
+#endif
#endif
#ifdef CONFIG_RT_GROUP_SCHED
{
Haven't read the whole thing, but do any of these stats really
need to execute on the target CPU? They seem to be just readable
fields.
Or does it simply not matter because this proc call is too infrequent?
Anyway, global broadcasts are discouraged; there is typically
always someone whose RT latency gets messed up by them.
-Andi
--
a...@linux.intel.com -- Speaking for myself only.
To capture all the quantities for a CPU atomically, they must be read
on the CPU. Basically, reading them on that CPU prevents them from
changing as we read them.
Also, if the CPU is idle (injected or otherwise), the quantities won't
get updated.
>
> Or does it simply not matter because this proc call is too infrequent?
It should be infrequent. The idle cycle injector does all the hard
work. These interfaces are for monitoring.
>
> Anyways global broadcasts are discouraged, there is typically
> always someone who feels their RT latency be messed up by them.
I will look at it one more time to see if there is something else that
can be done.
Who cares? By the time they reach userspace they've changed anyway.