
[PATCH 4/6] numa,sched: tracepoints for NUMA balancing active nodemask changes


ri...@redhat.com

Jan 17, 2014, 6:30:02 AM

From: Rik van Riel <ri...@surriel.com>

Being able to see how the active nodemask changes over time, and why,
can be quite useful.

Cc: Peter Zijlstra <pet...@infradead.org>
Cc: Mel Gorman <mgo...@suse.de>
Cc: Ingo Molnar <mi...@redhat.com>
Cc: Chegu Vinod <chegu...@hp.com>
Signed-off-by: Rik van Riel <ri...@redhat.com>
Signed-off-by: Rik van Riel <ri...@surriel.com>
---
include/trace/events/sched.h | 34 ++++++++++++++++++++++++++++++++++
kernel/sched/fair.c | 8 ++++++--
2 files changed, 40 insertions(+), 2 deletions(-)

diff --git a/include/trace/events/sched.h b/include/trace/events/sched.h
index 67e1bbf..91726b6 100644
--- a/include/trace/events/sched.h
+++ b/include/trace/events/sched.h
@@ -530,6 +530,40 @@ TRACE_EVENT(sched_swap_numa,
__entry->dst_pid, __entry->dst_tgid, __entry->dst_ngid,
__entry->dst_cpu, __entry->dst_nid)
);
+
+TRACE_EVENT(update_numa_active_nodes_mask,
+
+ TP_PROTO(int pid, int gid, int nid, int set, long faults, long max_faults),
+
+ TP_ARGS(pid, gid, nid, set, faults, max_faults),
+
+ TP_STRUCT__entry(
+ __field( pid_t, pid)
+ __field( pid_t, gid)
+ __field( int, nid)
+ __field( int, set)
+ __field( long, faults)
+ __field( long, max_faults)
+ ),
+
+ TP_fast_assign(
+ __entry->pid = pid;
+ __entry->gid = gid;
+ __entry->nid = nid;
+ __entry->set = set;
+ __entry->faults = faults;
+ __entry->max_faults = max_faults;
+ ),
+
+ TP_printk("pid=%d gid=%d nid=%d set=%d faults=%ld max_faults=%ld",
+ __entry->pid,
+ __entry->gid,
+ __entry->nid,
+ __entry->set,
+ __entry->faults,
+ __entry->max_faults)
+
+);
#endif /* _TRACE_SCHED_H */

/* This part must be outside protection */
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index aa680e2..3551009 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1300,10 +1300,14 @@ static void update_numa_active_node_mask(struct task_struct *p)
faults = numa_group->faults_from[task_faults_idx(nid, 0)] +
numa_group->faults_from[task_faults_idx(nid, 1)];
if (!node_isset(nid, numa_group->active_nodes)) {
- if (faults > max_faults * 4 / 10)
+ if (faults > max_faults * 4 / 10) {
+ trace_update_numa_active_nodes_mask(current->pid, numa_group->gid, nid, true, faults, max_faults);
node_set(nid, numa_group->active_nodes);
- } else if (faults < max_faults * 2 / 10)
+ }
+ } else if (faults < max_faults * 2 / 10) {
+ trace_update_numa_active_nodes_mask(current->pid, numa_group->gid, nid, false, faults, max_faults);
node_clear(nid, numa_group->active_nodes);
+ }
}
}

--
1.8.4.2


ri...@redhat.com

Jan 17, 2014, 6:30:02 AM

The current automatic NUMA balancing code base has issues with
workloads that do not fit on one NUMA node. Page migration is
slowed down, but memory distribution between the nodes where
the workload runs is essentially random, often resulting in a
suboptimal amount of memory bandwidth being available to the
workload.

In order to maximize performance of workloads that do not fit in one NUMA
node, we want to satisfy the following criteria:
1) keep private memory local to each thread
2) avoid excessive NUMA migration of pages
3) distribute shared memory across the active nodes, to
maximize memory bandwidth available to the workload

This patch series identifies the NUMA nodes on which the workload
is actively running, and balances (somewhat lazily) the memory
between those nodes, satisfying the criteria above.

As usual, the series has had some performance testing, but it
could always benefit from more testing, on other systems.


Some performance numbers, with two 40-warehouse specjbb instances
on an 8 node system with 10 CPU cores per node, using a pre-cleanup
version of these patches, courtesy of Chegu Vinod:

numactl manual pinning
spec1.txt: throughput = 755900.20 SPECjbb2005 bops
spec2.txt: throughput = 754914.40 SPECjbb2005 bops

NO-pinning results (Automatic NUMA balancing, with patches)
spec1.txt: throughput = 706439.84 SPECjbb2005 bops
spec2.txt: throughput = 729347.75 SPECjbb2005 bops

NO-pinning results (Automatic NUMA balancing, without patches)
spec1.txt: throughput = 667988.47 SPECjbb2005 bops
spec2.txt: throughput = 638220.45 SPECjbb2005 bops

No Automatic NUMA and NO-pinning results
spec1.txt: throughput = 544120.97 SPECjbb2005 bops
spec2.txt: throughput = 453553.41 SPECjbb2005 bops


My own performance numbers are not as relevant, since I have been
running with a more hostile workload on purpose, and I have run
into a scheduler issue that caused the workload to run on only
two of the four NUMA nodes on my test system...

ri...@redhat.com

Jan 17, 2014, 6:30:02 AM

From: Rik van Riel <ri...@surriel.com>

Excessive migration of pages can hurt the performance of workloads
that span multiple NUMA nodes. However, it turns out that the
p->numa_migrate_deferred knob is a really big hammer, which does
reduce migration rates, but does not actually help performance.

Now that the second stage of the automatic numa balancing code
has stabilized, it is time to replace the simplistic migration
deferral code with something smarter.

Cc: Peter Zijlstra <pet...@infradead.org>
Cc: Mel Gorman <mgo...@suse.de>
Cc: Ingo Molnar <mi...@redhat.com>
Cc: Chegu Vinod <chegu...@hp.com>
Signed-off-by: Rik van Riel <ri...@redhat.com>
Signed-off-by: Rik van Riel <ri...@surriel.com>
---
include/linux/sched.h | 1 -
kernel/sched/fair.c | 8 --------
kernel/sysctl.c | 7 -------
mm/mempolicy.c | 45 ---------------------------------------------
4 files changed, 61 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 68a0e84..97efba4 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1469,7 +1469,6 @@ struct task_struct {
unsigned int numa_scan_period;
unsigned int numa_scan_period_max;
int numa_preferred_nid;
- int numa_migrate_deferred;
unsigned long numa_migrate_retry;
u64 node_stamp; /* migration stamp */
struct callback_head numa_work;
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 867b0a4..41e2176 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -819,14 +819,6 @@ unsigned int sysctl_numa_balancing_scan_size = 256;
/* Scan @scan_size MB every @scan_period after an initial @scan_delay in ms */
unsigned int sysctl_numa_balancing_scan_delay = 1000;

-/*
- * After skipping a page migration on a shared page, skip N more numa page
- * migrations unconditionally. This reduces the number of NUMA migrations
- * in shared memory workloads, and has the effect of pulling tasks towards
- * where their memory lives, over pulling the memory towards the task.
- */
-unsigned int sysctl_numa_balancing_migrate_deferred = 16;
-
static unsigned int task_nr_scan_windows(struct task_struct *p)
{
unsigned long rss = 0;
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 096db74..4d19492 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -384,13 +384,6 @@ static struct ctl_table kern_table[] = {
.proc_handler = proc_dointvec,
},
{
- .procname = "numa_balancing_migrate_deferred",
- .data = &sysctl_numa_balancing_migrate_deferred,
- .maxlen = sizeof(unsigned int),
- .mode = 0644,
- .proc_handler = proc_dointvec,
- },
- {
.procname = "numa_balancing",
.data = NULL, /* filled in by handler */
.maxlen = sizeof(unsigned int),
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 36cb46c..052abac 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -2301,35 +2301,6 @@ static void sp_free(struct sp_node *n)
kmem_cache_free(sn_cache, n);
}

-#ifdef CONFIG_NUMA_BALANCING
-static bool numa_migrate_deferred(struct task_struct *p, int last_cpupid)
-{
- /* Never defer a private fault */
- if (cpupid_match_pid(p, last_cpupid))
- return false;
-
- if (p->numa_migrate_deferred) {
- p->numa_migrate_deferred--;
- return true;
- }
- return false;
-}
-
-static inline void defer_numa_migrate(struct task_struct *p)
-{
- p->numa_migrate_deferred = sysctl_numa_balancing_migrate_deferred;
-}
-#else
-static inline bool numa_migrate_deferred(struct task_struct *p, int last_cpupid)
-{
- return false;
-}
-
-static inline void defer_numa_migrate(struct task_struct *p)
-{
-}
-#endif /* CONFIG_NUMA_BALANCING */
-
/**
* mpol_misplaced - check whether current page node is valid in policy
*
@@ -2432,24 +2403,8 @@ int mpol_misplaced(struct page *page, struct vm_area_struct *vma, unsigned long
*/
last_cpupid = page_cpupid_xchg_last(page, this_cpupid);
if (!cpupid_pid_unset(last_cpupid) && cpupid_to_nid(last_cpupid) != thisnid) {
-
- /* See sysctl_numa_balancing_migrate_deferred comment */
- if (!cpupid_match_pid(current, last_cpupid))
- defer_numa_migrate(current);
-
goto out;
}
-
- /*
- * The quadratic filter above reduces extraneous migration
- * of shared pages somewhat. This code reduces it even more,
- * reducing the overhead of page migrations of shared pages.
- * This makes workloads with shared pages rely more on
- * "move task near its memory", and less on "move memory
- * towards its task", which is exactly what we want.
- */
- if (numa_migrate_deferred(current, last_cpupid))
- goto out;
}

if (curnid != polnid)
--
1.8.4.2

ri...@redhat.com

Jan 17, 2014, 6:30:02 AM

From: Rik van Riel <ri...@surriel.com>

The faults_from statistics are used to maintain an active_nodes nodemask
per numa_group. This allows us to be smarter about when to do numa migrations.
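
As a rough illustration of the hysteresis used below, here is a small
userspace sketch with invented fault counts; the arrays are hypothetical
stand-ins for the per-group statistics, not the kernel data structures:

	/* Userspace sketch of the 40%/20% hysteresis in
	 * update_numa_active_node_mask(); all numbers are made up. */
	#include <stdbool.h>
	#include <stdio.h>

	#define NR_NODES 4

	int main(void)
	{
		long faults_from[NR_NODES] = { 1000, 450, 350, 150 };
		/* node 2 is already in the active set */
		bool active[NR_NODES] = { false, false, true, false };
		long max_faults = 0;
		int nid;

		for (nid = 0; nid < NR_NODES; nid++)
			if (faults_from[nid] > max_faults)
				max_faults = faults_from[nid];

		for (nid = 0; nid < NR_NODES; nid++) {
			if (!active[nid]) {
				/* join the set above 40% of the maximum */
				if (faults_from[nid] > max_faults * 4 / 10)
					active[nid] = true;
			} else if (faults_from[nid] < max_faults * 2 / 10) {
				/* leave the set only below 20% */
				active[nid] = false;
			}
			printf("node %d: %4ld faults -> %sactive\n", nid,
			       faults_from[nid], active[nid] ? "" : "in");
		}
		return 0;
	}

With these numbers a node needs more than 400 faults to be added, yet
node 2, already in the set with 350 faults, is only dropped once it falls
below 200. The gap between the two thresholds is what keeps nodes from
flip-flopping in and out of the mask.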

Cc: Peter Zijlstra <pet...@infradead.org>
Cc: Mel Gorman <mgo...@suse.de>
Cc: Ingo Molnar <mi...@redhat.com>
Cc: Chegu Vinod <chegu...@hp.com>
Signed-off-by: Rik van Riel <ri...@redhat.com>
Signed-off-by: Rik van Riel <ri...@surriel.com>
---
kernel/sched/fair.c | 38 ++++++++++++++++++++++++++++++++++++++
1 file changed, 38 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 1945ddc..aa680e2 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -885,6 +885,7 @@ struct numa_group {
struct list_head task_list;

struct rcu_head rcu;
+ nodemask_t active_nodes;
unsigned long total_faults;
unsigned long *faults_from;
unsigned long faults[0];
@@ -1275,6 +1276,38 @@ static void numa_migrate_preferred(struct task_struct *p)
}

/*
+ * Iterate over the nodes from which NUMA hinting faults were triggered, in
+ * other words where the CPUs that incurred NUMA hinting faults are. The
+ * bitmask is used to limit NUMA page migrations, and spread out memory
+ * between the actively used nodes. To prevent flip-flopping, and excessive
+ * page migrations, nodes are added when they cause over 40% of the maximum
+ * number of faults, but only removed when they drop below 20%.
+ */
+static void update_numa_active_node_mask(struct task_struct *p)
+{
+ unsigned long faults, max_faults = 0;
+ struct numa_group *numa_group = p->numa_group;
+ int nid;
+
+ for_each_online_node(nid) {
+ faults = numa_group->faults_from[task_faults_idx(nid, 0)] +
+ numa_group->faults_from[task_faults_idx(nid, 1)];
+ if (faults > max_faults)
+ max_faults = faults;
+ }
+
+ for_each_online_node(nid) {
+ faults = numa_group->faults_from[task_faults_idx(nid, 0)] +
+ numa_group->faults_from[task_faults_idx(nid, 1)];
+ if (!node_isset(nid, numa_group->active_nodes)) {
+ if (faults > max_faults * 4 / 10)
+ node_set(nid, numa_group->active_nodes);
+ } else if (faults < max_faults * 2 / 10)
+ node_clear(nid, numa_group->active_nodes);
+ }
+}
+
+/*
* When adapting the scan rate, the period is divided into NUMA_PERIOD_SLOTS
* increments. The more local the fault statistics are, the higher the scan
* period will be for the next scan window. If local/remote ratio is below
@@ -1416,6 +1449,7 @@ static void task_numa_placement(struct task_struct *p)
update_task_scan_period(p, fault_types[0], fault_types[1]);

if (p->numa_group) {
+ update_numa_active_node_mask(p);
/*
* If the preferred task and group nids are different,
* iterate over the nodes again to find the best place.
@@ -1478,6 +1512,8 @@ static void task_numa_group(struct task_struct *p, int cpupid, int flags,
/* Second half of the array tracks where faults come from */
grp->faults_from = grp->faults + 2 * nr_node_ids;

+ node_set(task_node(current), grp->active_nodes);
+
for (i = 0; i < 4*nr_node_ids; i++)
grp->faults[i] = p->numa_faults[i];

@@ -1547,6 +1583,8 @@ static void task_numa_group(struct task_struct *p, int cpupid, int flags,
my_grp->nr_tasks--;
grp->nr_tasks++;

+ update_numa_active_node_mask(p);
+
spin_unlock(&my_grp->lock);
spin_unlock(&grp->lock);

--
1.8.4.2

ri...@redhat.com

Jan 17, 2014, 6:30:02 AM

From: Rik van Riel <ri...@surriel.com>

The tracepoint has made it abundantly clear that the naive
implementation of the faults_from code has issues.

Specifically, the garbage collector in some workloads will
access orders of magnitude more memory than the threads
that do all the active work. This resulted in the node with
the garbage collector being marked the only active node in
the group.

This issue is avoided if we weigh the statistics by CPU use
of each task in the numa group, instead of by how many faults
each thread has incurred.

To achieve this, we normalize the number of faults to the
fraction of faults that occurred on each node, and then
multiply that fraction by the fraction of CPU time the
task has used since the last time task_numa_placement was
invoked.

This way the nodes in the active node mask will be the ones
where the tasks from the numa group are most actively running,
and the influence of e.g. the garbage collector and other
do-little threads is properly minimized.
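
A small standalone sketch of that weighting, using the same shape of
formula as the code below; the task names and all numbers are invented,
and the +1 in the denominator only keeps the toy example from dividing
by zero:

	/* Sketch of the CPU-use weighting applied to faults_from:
	 * a "GC-like" task faults a lot but barely runs, while a worker
	 * faults less but uses most of the CPU. */
	#include <stdio.h>

	static long f_weight(long runtime, long period,
			     long node_faults, long total_faults)
	{
		return (1024 * runtime * node_faults) /
		       (total_faults * period + 1);
	}

	int main(void)
	{
		/* GC-like: 4000 of 4000 faults on its node, 5% of the CPU */
		printf("gc     -> %ld\n", f_weight( 50, 1000, 4000, 4000));
		/* worker: 300 of 400 faults on its node, 90% of the CPU */
		printf("worker -> %ld\n", f_weight(900, 1000,  300,  400));
		return 0;
	}

Despite raising ten times as many faults, the mostly idle task adds a
weight of roughly 51 to its node, while the busy worker adds about 691,
so the active node mask ends up tracking where the CPU time is spent.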

Cc: Peter Zijlstra <pet...@infradead.org>
Cc: Mel Gorman <mgo...@suse.de>
Cc: Ingo Molnar <mi...@redhat.com>
Cc: Chegu Vinod <chegu...@hp.com>
Signed-off-by: Rik van Riel <ri...@redhat.com>
Signed-off-by: Rik van Riel <ri...@surriel.com>
---
include/linux/sched.h | 2 ++
kernel/sched/core.c | 2 ++
kernel/sched/fair.c | 49 +++++++++++++++++++++++++++++++++++++++++++++++--
3 files changed, 51 insertions(+), 2 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 0af6c1a..52de567 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1471,6 +1471,8 @@ struct task_struct {
int numa_preferred_nid;
unsigned long numa_migrate_retry;
u64 node_stamp; /* migration stamp */
+ u64 last_task_numa_placement;
+ u64 last_sum_exec_runtime;
struct callback_head numa_work;

struct list_head numa_entry;
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 7f45fd5..9a0908a 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1758,6 +1758,8 @@ static void __sched_fork(unsigned long clone_flags, struct task_struct *p)
p->numa_work.next = &p->numa_work;
p->numa_faults = NULL;
p->numa_faults_buffer = NULL;
+ p->last_task_numa_placement = 0;
+ p->last_sum_exec_runtime = 0;

INIT_LIST_HEAD(&p->numa_entry);
p->numa_group = NULL;
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 8e0a53a..1601c68 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1422,11 +1422,41 @@ static void update_task_scan_period(struct task_struct *p,
memset(p->numa_faults_locality, 0, sizeof(p->numa_faults_locality));
}

+/*
+ * Get the fraction of time the task has been running since the last
+ * NUMA placement cycle. The scheduler keeps similar statistics, but
+ * decays those on a 32ms period, which is orders of magnitude off
+ * from the dozens-of-seconds NUMA balancing period. Use the scheduler
+ * stats only if the task is so new there are no NUMA statistics yet.
+ */
+static u64 numa_get_avg_runtime(struct task_struct *p, u64 *period)
+{
+ u64 runtime, delta, now;
+ /* Use the start of this time slice to avoid calculations. */
+ now = p->se.exec_start;
+ runtime = p->se.sum_exec_runtime;
+
+ if (p->last_task_numa_placement) {
+ delta = runtime - p->last_sum_exec_runtime;
+ *period = now - p->last_task_numa_placement;
+ } else {
+ delta = p->se.avg.runnable_avg_sum;
+ *period = p->se.avg.runnable_avg_period;
+ }
+
+ p->last_sum_exec_runtime = runtime;
+ p->last_task_numa_placement = now;
+
+ return delta;
+}
+
static void task_numa_placement(struct task_struct *p)
{
int seq, nid, max_nid = -1, max_group_nid = -1;
unsigned long max_faults = 0, max_group_faults = 0;
unsigned long fault_types[2] = { 0, 0 };
+ unsigned long total_faults;
+ u64 runtime, period;
spinlock_t *group_lock = NULL;

seq = ACCESS_ONCE(p->mm->numa_scan_seq);
@@ -1435,6 +1465,10 @@ static void task_numa_placement(struct task_struct *p)
p->numa_scan_seq = seq;
p->numa_scan_period_max = task_scan_max(p);

+ total_faults = p->numa_faults_locality[0] +
+ p->numa_faults_locality[1] + 1;
+ runtime = numa_get_avg_runtime(p, &period);
+
/* If the task is part of a group prevent parallel updates to group stats */
if (p->numa_group) {
group_lock = &p->numa_group->lock;
@@ -1447,7 +1481,7 @@ static void task_numa_placement(struct task_struct *p)
int priv, i;

for (priv = 0; priv < 2; priv++) {
- long diff, f_diff;
+ long diff, f_diff, f_weight;

i = task_faults_idx(nid, priv);
diff = -p->numa_faults[i];
@@ -1459,8 +1493,19 @@ static void task_numa_placement(struct task_struct *p)
fault_types[priv] += p->numa_faults_buffer[i];
p->numa_faults_buffer[i] = 0;

+ /*
+ * Normalize the faults_from, so all tasks in a group
+ * count according to CPU use, instead of by the raw
+ * number of faults. This prevents the situation where
+ * the garbage collector totally dominates the stats,
+ * and the access patterns of the worker threads are
+ * ignored.
+ */
+ f_weight = (1024 * runtime *
+ p->numa_faults_from_buffer[i]) /
+ (total_faults * period);
p->numa_faults_from[i] >>= 1;
- p->numa_faults_from[i] += p->numa_faults_from_buffer[i];
+ p->numa_faults_from[i] += f_weight;
p->numa_faults_from_buffer[i] = 0;

faults += p->numa_faults[i];
--
1.8.4.2

ri...@redhat.com

Jan 17, 2014, 6:30:02 AM

From: Rik van Riel <ri...@surriel.com>

Track which nodes NUMA faults are triggered from, in other words
the CPUs on which the NUMA faults happened. This uses a similar
mechanism to what is used to track the memory involved in numa faults.

The next patches use this to build up a bitmap of which nodes a
workload is actively running on.
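
The new counters share one allocation with the existing ones; below is a
small sketch of that layout, with malloc-style allocation standing in for
the kernel allocator and the two-counters-per-node indexing implied by
the patch:

	/* Sketch of the per-task fault statistics layout after this
	 * patch: four arrays of 2 * nr_node_ids counters each, carved
	 * out of a single zeroed block. */
	#include <stdio.h>
	#include <stdlib.h>

	int main(void)
	{
		long nr_node_ids = 4;
		size_t size = sizeof(unsigned long) * 4 * nr_node_ids;
		unsigned long *numa_faults = calloc(2, size); /* kzalloc(size * 2) */

		unsigned long *numa_faults_from        = numa_faults + 2 * nr_node_ids;
		unsigned long *numa_faults_buffer      = numa_faults + 4 * nr_node_ids;
		unsigned long *numa_faults_from_buffer = numa_faults + 6 * nr_node_ids;

		/* two counters per node, indexed 2 * nid + priv, where priv
		 * distinguishes private from shared faults */
		int nid = 1, priv = 0;
		numa_faults_buffer[2 * nid + priv] += 16;      /* memory on nid */
		numa_faults_from_buffer[2 * nid + priv] += 16; /* CPU on nid    */

		printf("faults_from at +%ld, buffer at +%ld, from_buffer at +%ld\n",
		       (long)(numa_faults_from - numa_faults),
		       (long)(numa_faults_buffer - numa_faults),
		       (long)(numa_faults_from_buffer - numa_faults));

		free(numa_faults);
		return 0;
	}

The group side mirrors this: the numa_group's faults array doubles from
2*nr_node_ids to 4*nr_node_ids entries, with faults_from pointing at the
second half.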

Cc: Peter Zijlstra <pet...@infradead.org>
Cc: Mel Gorman <mgo...@suse.de>
Cc: Ingo Molnar <mi...@redhat.com>
Cc: Chegu Vinod <chegu...@hp.com>
Signed-off-by: Rik van Riel <ri...@redhat.com>
Signed-off-by: Rik van Riel <ri...@surriel.com>
---
include/linux/sched.h | 10 ++++++++--
kernel/sched/fair.c | 30 +++++++++++++++++++++++-------
2 files changed, 31 insertions(+), 9 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 97efba4..a9f7f05 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1492,6 +1492,14 @@ struct task_struct {
unsigned long *numa_faults_buffer;

/*
+ * Track the nodes where faults are incurred. This is not very
+ * interesting on a per-task basis, but it helps with smarter
+ * numa memory placement for groups of processes.
+ */
+ unsigned long *numa_faults_from;
+ unsigned long *numa_faults_from_buffer;
+
+ /*
* numa_faults_locality tracks if faults recorded during the last
* scan window were remote/local. The task scan period is adapted
* based on the locality of the faults with different weights
@@ -1594,8 +1602,6 @@ extern void task_numa_fault(int last_node, int node, int pages, int flags);
extern pid_t task_numa_group_id(struct task_struct *p);
extern void set_numabalancing_state(bool enabled);
extern void task_numa_free(struct task_struct *p);
-
-extern unsigned int sysctl_numa_balancing_migrate_deferred;
#else
static inline void task_numa_fault(int last_node, int node, int pages,
int flags)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 41e2176..1945ddc 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -886,6 +886,7 @@ struct numa_group {

struct rcu_head rcu;
unsigned long total_faults;
+ unsigned long *faults_from;
unsigned long faults[0];
};

@@ -1372,10 +1373,11 @@ static void task_numa_placement(struct task_struct *p)
int priv, i;

for (priv = 0; priv < 2; priv++) {
- long diff;
+ long diff, f_diff;

i = task_faults_idx(nid, priv);
diff = -p->numa_faults[i];
+ f_diff = -p->numa_faults_from[i];

/* Decay existing window, copy faults since last scan */
p->numa_faults[i] >>= 1;
@@ -1383,12 +1385,18 @@ static void task_numa_placement(struct task_struct *p)
fault_types[priv] += p->numa_faults_buffer[i];
p->numa_faults_buffer[i] = 0;

+ p->numa_faults_from[i] >>= 1;
+ p->numa_faults_from[i] += p->numa_faults_from_buffer[i];
+ p->numa_faults_from_buffer[i] = 0;
+
faults += p->numa_faults[i];
diff += p->numa_faults[i];
+ f_diff += p->numa_faults_from[i];
p->total_numa_faults += diff;
if (p->numa_group) {
/* safe because we can only change our own group */
p->numa_group->faults[i] += diff;
+ p->numa_group->faults_from[i] += f_diff;
p->numa_group->total_faults += diff;
group_faults += p->numa_group->faults[i];
}
@@ -1457,7 +1465,7 @@ static void task_numa_group(struct task_struct *p, int cpupid, int flags,

if (unlikely(!p->numa_group)) {
unsigned int size = sizeof(struct numa_group) +
- 2*nr_node_ids*sizeof(unsigned long);
+ 4*nr_node_ids*sizeof(unsigned long);

grp = kzalloc(size, GFP_KERNEL | __GFP_NOWARN);
if (!grp)
@@ -1467,8 +1475,10 @@ static void task_numa_group(struct task_struct *p, int cpupid, int flags,
spin_lock_init(&grp->lock);
INIT_LIST_HEAD(&grp->task_list);
grp->gid = p->pid;
+ /* Second half of the array tracks where faults come from */
+ grp->faults_from = grp->faults + 2 * nr_node_ids;

- for (i = 0; i < 2*nr_node_ids; i++)
+ for (i = 0; i < 4*nr_node_ids; i++)
grp->faults[i] = p->numa_faults[i];

grp->total_faults = p->total_numa_faults;
@@ -1526,7 +1536,7 @@ static void task_numa_group(struct task_struct *p, int cpupid, int flags,

double_lock(&my_grp->lock, &grp->lock);

- for (i = 0; i < 2*nr_node_ids; i++) {
+ for (i = 0; i < 4*nr_node_ids; i++) {
my_grp->faults[i] -= p->numa_faults[i];
grp->faults[i] += p->numa_faults[i];
}
@@ -1558,7 +1568,7 @@ void task_numa_free(struct task_struct *p)

if (grp) {
spin_lock(&grp->lock);
- for (i = 0; i < 2*nr_node_ids; i++)
+ for (i = 0; i < 4*nr_node_ids; i++)
grp->faults[i] -= p->numa_faults[i];
grp->total_faults -= p->total_numa_faults;

@@ -1571,6 +1581,8 @@ void task_numa_free(struct task_struct *p)

p->numa_faults = NULL;
p->numa_faults_buffer = NULL;
+ p->numa_faults_from = NULL;
+ p->numa_faults_from_buffer = NULL;
kfree(numa_faults);
}

@@ -1581,6 +1593,7 @@ void task_numa_fault(int last_cpupid, int node, int pages, int flags)
{
struct task_struct *p = current;
bool migrated = flags & TNF_MIGRATED;
+ int this_node = task_node(current);
int priv;

if (!numabalancing_enabled)
@@ -1596,7 +1609,7 @@ void task_numa_fault(int last_cpupid, int node, int pages, int flags)

/* Allocate buffer to track faults on a per-node basis */
if (unlikely(!p->numa_faults)) {
- int size = sizeof(*p->numa_faults) * 2 * nr_node_ids;
+ int size = sizeof(*p->numa_faults) * 4 * nr_node_ids;

/* numa_faults and numa_faults_buffer share the allocation */
p->numa_faults = kzalloc(size * 2, GFP_KERNEL|__GFP_NOWARN);
@@ -1604,7 +1617,9 @@ void task_numa_fault(int last_cpupid, int node, int pages, int flags)
return;

BUG_ON(p->numa_faults_buffer);
- p->numa_faults_buffer = p->numa_faults + (2 * nr_node_ids);
+ p->numa_faults_from = p->numa_faults + (2 * nr_node_ids);
+ p->numa_faults_buffer = p->numa_faults + (4 * nr_node_ids);
+ p->numa_faults_from_buffer = p->numa_faults + (6 * nr_node_ids);
p->total_numa_faults = 0;
memset(p->numa_faults_locality, 0, sizeof(p->numa_faults_locality));
}
@@ -1634,6 +1649,7 @@ void task_numa_fault(int last_cpupid, int node, int pages, int flags)
p->numa_pages_migrated += pages;

p->numa_faults_buffer[task_faults_idx(node, priv)] += pages;
+ p->numa_faults_from_buffer[task_faults_idx(this_node, priv)] += pages;
p->numa_faults_locality[!!(flags & TNF_FAULT_LOCAL)] += pages;
}

--
1.8.4.2

ri...@redhat.com

Jan 17, 2014, 6:30:02 AM

From: Rik van Riel <ri...@surriel.com>

Use the active_nodes nodemask to make smarter decisions on NUMA migrations.

In order to maximize performance of workloads that do not fit in one NUMA
node, we want to satisfy the following criteria:
1) keep private memory local to each thread
2) avoid excessive NUMA migration of pages
3) distribute shared memory across the active nodes, to
maximize memory bandwidth available to the workload

This patch accomplishes that by implementing the following policy for
NUMA migrations:
1) always migrate on a private fault
2) never migrate to a node that is not in the set of active nodes
for the numa_group
3) always migrate from a node outside of the set of active nodes,
to a node that is in that set
4) within the set of active nodes in the numa_group, only migrate
from a node with more NUMA page faults, to a node with fewer
NUMA page faults, with a 25% margin to avoid ping-ponging

This results in most pages of a workload ending up on the actively
used nodes, with reduced ping-ponging of pages between those nodes.
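
A rough userspace model of those four rules, with made-up fault counts
and a plain array standing in for the active_nodes nodemask (this is an
illustration, not the kernel implementation below):

	#include <stdbool.h>
	#include <stdio.h>

	#define NR_NODES 4

	static bool node_active[NR_NODES]  = { true, true, false, false };
	static long group_faults[NR_NODES] = { 900, 600, 50, 10 };

	static bool sketch_should_migrate(bool priv, int src, int dst)
	{
		if (priv)               /* 1) private memory follows its thread    */
			return true;
		if (!node_active[dst])  /* 2) never move onto an inactive node     */
			return false;
		if (!node_active[src])  /* 3) always pull pages off inactive nodes */
			return true;
		/* 4) between active nodes, require a 25% fault imbalance */
		return group_faults[dst] < group_faults[src] * 3 / 4;
	}

	int main(void)
	{
		printf("%d\n", sketch_should_migrate(false, 2, 0)); /* inactive -> active: migrate */
		printf("%d\n", sketch_should_migrate(false, 0, 1)); /* 600 < 900 * 3/4: migrate    */
		printf("%d\n", sketch_should_migrate(false, 1, 0)); /* 900 < 600 * 3/4 fails: stay */
		return 0;
	}

With these numbers pages drain off the unused nodes and drift from the
more heavily faulted active node toward the lighter one, but not back,
which is the intended spreading with a margin against ping-ponging.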

Cc: Peter Zijlstra <pet...@infradead.org>
Cc: Mel Gorman <mgo...@suse.de>
Cc: Ingo Molnar <mi...@redhat.com>
Cc: Chegu Vinod <chegu...@hp.com>
Signed-off-by: Rik van Riel <ri...@redhat.com>
Signed-off-by: Rik van Riel <ri...@surriel.com>
---
include/linux/sched.h | 7 +++++++
kernel/sched/fair.c | 37 +++++++++++++++++++++++++++++++++++++
mm/mempolicy.c | 3 +++
3 files changed, 47 insertions(+)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index a9f7f05..0af6c1a 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1602,6 +1602,8 @@ extern void task_numa_fault(int last_node, int node, int pages, int flags);
extern pid_t task_numa_group_id(struct task_struct *p);
extern void set_numabalancing_state(bool enabled);
extern void task_numa_free(struct task_struct *p);
+extern bool should_numa_migrate(struct task_struct *p, int last_cpupid,
+ int src_nid, int dst_nid);
#else
static inline void task_numa_fault(int last_node, int node, int pages,
int flags)
@@ -1617,6 +1619,11 @@ static inline void set_numabalancing_state(bool enabled)
static inline void task_numa_free(struct task_struct *p)
{
}
+static inline bool should_numa_migrate(struct task_struct *p, int last_cpupid,
+ int src_nid, int dst_nid)
+{
+ return true;
+}
#endif

static inline struct pid *task_pid(struct task_struct *task)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 3551009..8e0a53a 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -948,6 +948,43 @@ static inline unsigned long group_weight(struct task_struct *p, int nid)
return 1000 * group_faults(p, nid) / p->numa_group->total_faults;
}

+bool should_numa_migrate(struct task_struct *p, int last_cpupid,
+ int src_nid, int dst_nid)
+{
+ struct numa_group *ng = p->numa_group;
+
+ /* Always allow migrate on private faults */
+ if (cpupid_match_pid(p, last_cpupid))
+ return true;
+
+ /* A shared fault, but p->numa_group has not been set up yet. */
+ if (!ng)
+ return true;
+
+ /*
+ * Do not migrate if the destination is not a node that
+ * is actively used by this numa group.
+ */
+ if (!node_isset(dst_nid, ng->active_nodes))
+ return false;
+
+ /*
+ * Source is a node that is not actively used by this
+ * numa group, while the destination is. Migrate.
+ */
+ if (!node_isset(src_nid, ng->active_nodes))
+ return true;
+
+ /*
+ * Both source and destination are nodes in active
+ * use by this numa group. Maximize memory bandwidth
+ * by migrating from more heavily used nodes, to less
+ * heavily used ones, spreading the load around.
+ * Use a 1/4 hysteresis to avoid spurious page movement.
+ */
+ return group_faults(p, dst_nid) < (group_faults(p, src_nid) * 3 / 4);
+}
+
static unsigned long weighted_cpuload(const int cpu);
static unsigned long source_load(int cpu, int type);
static unsigned long target_load(int cpu, int type);
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 052abac..050962b 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -2405,6 +2405,9 @@ int mpol_misplaced(struct page *page, struct vm_area_struct *vma, unsigned long
if (!cpupid_pid_unset(last_cpupid) && cpupid_to_nid(last_cpupid) != thisnid) {
goto out;
}
+
+ if (!should_numa_migrate(current, last_cpupid, curnid, polnid))
+ goto out;
}

if (curnid != polnid)
--
1.8.4.2

Peter Zijlstra

Jan 17, 2014, 10:00:02 AM

On Fri, Jan 17, 2014 at 01:17:36AM -0500, ri...@redhat.com wrote:
> + /*
> + * Normalize the faults_from, so all tasks in a group
> + * count according to CPU use, instead of by the raw
> + * number of faults. This prevents the situation where
> + * the garbage collector totally dominates the stats,
> + * and the access patterns of the worker threads are
> + * ignored.
> + */

Instead of focusing on the one example (GC) here, I would suggest
saying something along the lines of: Tasks with little runtime have
little over-all impact on throughput and thus their faults are less
important.

Rik van Riel

Jan 17, 2014, 3:00:03 PM

On 01/17/2014 04:54 AM, Peter Zijlstra wrote:
> On Fri, Jan 17, 2014 at 01:17:36AM -0500, ri...@redhat.com wrote:
>> + /*
>> + * Normalize the faults_from, so all tasks in a group
>> + * count according to CPU use, instead of by the raw
>> + * number of faults. This prevents the situation where
>> + * the garbage collector totally dominates the stats,
>> + * and the access patterns of the worker threads are
>> + * ignored.
>> + */
>
> Instead of focusing on the one example (GC) here, I would suggest
> saying something along the lines of: Tasks with little runtime have
> little over-all impact on throughput and thus their faults are less
> important.

Thanks for the review, I have updated the comment
accordingly.

Does anybody else have comments on this series, or should
I start begging for Acked-by: and Reviewed-by: lines? :)

--
All rights reversed

Peter Zijlstra

Jan 17, 2014, 3:10:03 PM

On Fri, Jan 17, 2014 at 09:51:44AM -0500, Rik van Riel wrote:
> On 01/17/2014 04:54 AM, Peter Zijlstra wrote:
> > On Fri, Jan 17, 2014 at 01:17:36AM -0500, ri...@redhat.com wrote:
> >> + /*
> >> + * Normalize the faults_from, so all tasks in a group
> >> + * count according to CPU use, instead of by the raw
> >> + * number of faults. This prevents the situation where
> >> + * the garbage collector totally dominates the stats,
> >> + * and the access patterns of the worker threads are
> >> + * ignored.
> >> + */
> >
> > Instead of focusing on the one example (GC) here, I would suggest
> > saying something along the lines of: Tasks with little runtime have
> > little over-all impact on throughput and thus their faults are less
> > important.
>
> Thanks for the review, I have updated the comment
> accordingly.
>
> Does anybody else have comments on this series, or should
> I start begging for Acked-by: and Reviewed-by: lines? :)

I still need to get my head around part of the previous patches, and I
suppose Mel might have a look, give it a few days ;-)

Rik van Riel

Jan 17, 2014, 6:50:02 PM

On 01/17/2014 10:04 AM, Peter Zijlstra wrote:
> On Fri, Jan 17, 2014 at 09:51:44AM -0500, Rik van Riel wrote:
>> On 01/17/2014 04:54 AM, Peter Zijlstra wrote:
>>> On Fri, Jan 17, 2014 at 01:17:36AM -0500, ri...@redhat.com wrote:
>>>> + /*
>>>> + * Normalize the faults_from, so all tasks in a group
>>>> + * count according to CPU use, instead of by the raw
>>>> + * number of faults. This prevents the situation where
>>>> + * the garbage collector totally dominates the stats,
>>>> + * and the access patterns of the worker threads are
>>>> + * ignored.
>>>> + */
>>>
>>> Instead of focusing on the one example (GC) here, I would suggest
>>> saying something along the lines of: Tasks with little runtime have
>>> little over-all impact on throughput and thus their faults are less
>>> important.
>>
>> Thanks for the review, I have updated the comment
>> accordingly.
>>
>> Does anybody else have comments on this series, or should
>> I start begging for Acked-by: and Reviewed-by: lines? :)
>
> I still need to get my head around part of the previous patches, and I
> suppose Mel might have a look, give it a few days ;-)

Well, you didn't spot the divide by zero that Vinod somehow
managed to trigger :)

I am also going to add another patch that prevents the temporary
values from "p->numa_faults[i] >>= 1;" becoming visible, by using
local variables for those calculations.

Expect a v2 soonish.

--
All rights reversed

ri...@redhat.com

Jan 17, 2014, 9:20:01 PM

From: Rik van Riel <ri...@redhat.com>

Being able to see how the active nodemask changes over time, and why,
can be quite useful.

Cc: Peter Zijlstra <pet...@infradead.org>
Cc: Mel Gorman <mgo...@suse.de>
Cc: Ingo Molnar <mi...@redhat.com>
Cc: Chegu Vinod <chegu...@hp.com>
Signed-off-by: Rik van Riel <ri...@redhat.com>
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index aa680e2..3551009 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1300,10 +1300,14 @@ static void update_numa_active_node_mask(struct task_struct *p)
faults = numa_group->faults_from[task_faults_idx(nid, 0)] +
numa_group->faults_from[task_faults_idx(nid, 1)];
if (!node_isset(nid, numa_group->active_nodes)) {
- if (faults > max_faults * 4 / 10)
+ if (faults > max_faults * 4 / 10) {
+ trace_update_numa_active_nodes_mask(current->pid, numa_group->gid, nid, true, faults, max_faults);
node_set(nid, numa_group->active_nodes);
- } else if (faults < max_faults * 2 / 10)
+ }
+ } else if (faults < max_faults * 2 / 10) {
+ trace_update_numa_active_nodes_mask(current->pid, numa_group->gid, nid, false, faults, max_faults);
node_clear(nid, numa_group->active_nodes);
+ }
}
}

--
1.8.4.2

ri...@redhat.com

Jan 17, 2014, 9:20:02 PM

From: Rik van Riel <ri...@redhat.com>

Track which nodes NUMA faults are triggered from, in other words
the CPUs on which the NUMA faults happened. This uses a similar
mechanism to what is used to track the memory involved in numa faults.

The next patches use this to build up a bitmap of which nodes a
workload is actively running on.

Cc: Peter Zijlstra <pet...@infradead.org>
Cc: Mel Gorman <mgo...@suse.de>
Cc: Ingo Molnar <mi...@redhat.com>
Cc: Chegu Vinod <chegu...@hp.com>
Signed-off-by: Rik van Riel <ri...@redhat.com>
---
include/linux/sched.h | 10 ++++++++--
kernel/sched/fair.c | 30 +++++++++++++++++++++++-------
2 files changed, 31 insertions(+), 9 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 97efba4..a9f7f05 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1492,6 +1492,14 @@ struct task_struct {
unsigned long *numa_faults_buffer;

/*
+ * Track the nodes where faults are incurred. This is not very
+ * interesting on a per-task basis, but it helps with smarter
+ * numa memory placement for groups of processes.
+ */
+ unsigned long *numa_faults_from;
+ unsigned long *numa_faults_from_buffer;
+
+ /*
* numa_faults_locality tracks if faults recorded during the last
* scan window were remote/local. The task scan period is adapted
* based on the locality of the faults with different weights
@@ -1594,8 +1602,6 @@ extern void task_numa_fault(int last_node, int node, int pages, int flags);
extern pid_t task_numa_group_id(struct task_struct *p);
extern void set_numabalancing_state(bool enabled);
extern void task_numa_free(struct task_struct *p);
-
-extern unsigned int sysctl_numa_balancing_migrate_deferred;
#else
static inline void task_numa_fault(int last_node, int node, int pages,
int flags)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 41e2176..1945ddc 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -886,6 +886,7 @@ struct numa_group {

struct rcu_head rcu;
unsigned long total_faults;
+ unsigned long *faults_from;
unsigned long faults[0];
};

@@ -1372,10 +1373,11 @@ static void task_numa_placement(struct task_struct *p)
int priv, i;

for (priv = 0; priv < 2; priv++) {
- long diff;
+ long diff, f_diff;

i = task_faults_idx(nid, priv);
diff = -p->numa_faults[i];
+ f_diff = -p->numa_faults_from[i];

/* Decay existing window, copy faults since last scan */
p->numa_faults[i] >>= 1;

ri...@redhat.com

Jan 17, 2014, 9:20:02 PM

The current automatic NUMA balancing code base has issues with
workloads that do not fit on one NUMA node. Page migration is
slowed down, but memory distribution between the nodes where
the workload runs is essentially random, often resulting in a
suboptimal amount of memory bandwidth being available to the
workload.

In order to maximize performance of workloads that do not fit in one NUMA
node, we want to satisfy the following criteria:
1) keep private memory local to each thread
2) avoid excessive NUMA migration of pages
3) distribute shared memory across the active nodes, to
maximize memory bandwidth available to the workload

This patch series identifies the NUMA nodes on which the workload
is actively running, and balances (somewhat lazily) the memory
between those nodes, satisfying the criteria above.

As usual, the series has had some performance testing, but it
could always benefit from more testing, on other systems.

Changes since v1:
- fix divide by zero found by Chegu Vinod
- improve comment, as suggested by Peter Zijlstra
- do stats calculations in task_numa_placement in local variables


Some performance numbers, with two 40-warehouse specjbb instances
on an 8 node system with 10 CPU cores per node, using a pre-cleanup
version of these patches, courtesy of Chegu Vinod:

numactl manual pinning
spec1.txt: throughput = 755900.20 SPECjbb2005 bops
spec2.txt: throughput = 754914.40 SPECjbb2005 bops

NO-pinning results (Automatic NUMA balancing, with patches)
spec1.txt: throughput = 706439.84 SPECjbb2005 bops
spec2.txt: throughput = 729347.75 SPECjbb2005 bops

NO-pinning results (Automatic NUMA balancing, without patches)
spec1.txt: throughput = 667988.47 SPECjbb2005 bops
spec2.txt: throughput = 638220.45 SPECjbb2005 bops

No Automatic NUMA and NO-pinning results
spec1.txt: throughput = 544120.97 SPECjbb2005 bops
spec2.txt: throughput = 453553.41 SPECjbb2005 bops


My own performance numbers are not as relevant, since I have been
running with a more hostile workload on purpose, and I have run
into a scheduler issue that caused the workload to run on only
two of the four NUMA nodes on my test system...

ri...@redhat.com

Jan 17, 2014, 9:20:03 PM

From: Rik van Riel <ri...@redhat.com>

Use the active_nodes nodemask to make smarter decisions on NUMA migrations.

In order to maximize performance of workloads that do not fit in one NUMA
node, we want to satisfy the following criteria:
1) keep private memory local to each thread
2) avoid excessive NUMA migration of pages
3) distribute shared memory across the active nodes, to
maximize memory bandwidth available to the workload

This patch accomplishes that by implementing the following policy for
NUMA migrations:
1) always migrate on a private fault
2) never migrate to a node that is not in the set of active nodes
for the numa_group
3) always migrate from a node outside of the set of active nodes,
to a node that is in that set
4) within the set of active nodes in the numa_group, only migrate
from a node with more NUMA page faults, to a node with fewer
NUMA page faults, with a 25% margin to avoid ping-ponging

This results in most pages of a workload ending up on the actively
used nodes, with reduced ping-ponging of pages between those nodes.

Cc: Peter Zijlstra <pet...@infradead.org>
Cc: Mel Gorman <mgo...@suse.de>
Cc: Ingo Molnar <mi...@redhat.com>
Cc: Chegu Vinod <chegu...@hp.com>
Signed-off-by: Rik van Riel <ri...@redhat.com>
---
include/linux/sched.h | 7 +++++++
kernel/sched/fair.c | 37 +++++++++++++++++++++++++++++++++++++
mm/mempolicy.c | 3 +++
3 files changed, 47 insertions(+)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index a9f7f05..0af6c1a 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1602,6 +1602,8 @@ extern void task_numa_fault(int last_node, int node, int pages, int flags);
extern pid_t task_numa_group_id(struct task_struct *p);
extern void set_numabalancing_state(bool enabled);
extern void task_numa_free(struct task_struct *p);
+extern bool should_numa_migrate(struct task_struct *p, int last_cpupid,
+ int src_nid, int dst_nid);
#else
static inline void task_numa_fault(int last_node, int node, int pages,
int flags)
@@ -1617,6 +1619,11 @@ static inline void set_numabalancing_state(bool enabled)
static inline void task_numa_free(struct task_struct *p)
{
}
+static inline bool should_numa_migrate(struct task_struct *p, int last_cpupid,
+ int src_nid, int dst_nid)
+{
+ return true;
+}
#endif

static inline struct pid *task_pid(struct task_struct *task)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 3551009..8e0a53a 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c

ri...@redhat.com

Jan 17, 2014, 9:20:03 PM

From: Rik van Riel <ri...@redhat.com>

The faults_from statistics are used to maintain an active_nodes nodemask
per numa_group. This allows us to be smarter about when to do numa migrations.

Cc: Peter Zijlstra <pet...@infradead.org>
Cc: Mel Gorman <mgo...@suse.de>
Cc: Ingo Molnar <mi...@redhat.com>
Cc: Chegu Vinod <chegu...@hp.com>
Signed-off-by: Rik van Riel <ri...@redhat.com>
---
kernel/sched/fair.c | 38 ++++++++++++++++++++++++++++++++++++++
1 file changed, 38 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 1945ddc..aa680e2 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -885,6 +885,7 @@ struct numa_group {
struct list_head task_list;

struct rcu_head rcu;
+ nodemask_t active_nodes;
unsigned long total_faults;
unsigned long *faults_from;
unsigned long faults[0];
@@ -1275,6 +1276,38 @@ static void numa_migrate_preferred(struct task_struct *p)
}

/*
+ * Iterate over the nodes from which NUMA hinting faults were triggered, in
+ * other words where the CPUs that incurred NUMA hinting faults are. The
+ * bitmask is used to limit NUMA page migrations, and spread out memory
+ * between the actively used nodes. To prevent flip-flopping, and excessive
+ * page migrations, nodes are added when they cause over 40% of the maximum
+ * number of faults, but only removed when they drop below 20%.
+ */
+static void update_numa_active_node_mask(struct task_struct *p)
+{
+ unsigned long faults, max_faults = 0;
+ struct numa_group *numa_group = p->numa_group;
+ int nid;
+
+ for_each_online_node(nid) {
+ faults = numa_group->faults_from[task_faults_idx(nid, 0)] +
+ numa_group->faults_from[task_faults_idx(nid, 1)];
+ if (faults > max_faults)
+ max_faults = faults;
+ }
+
+ for_each_online_node(nid) {
+ faults = numa_group->faults_from[task_faults_idx(nid, 0)] +
+ numa_group->faults_from[task_faults_idx(nid, 1)];
+ if (!node_isset(nid, numa_group->active_nodes)) {
+ if (faults > max_faults * 4 / 10)
+ node_set(nid, numa_group->active_nodes);
+ } else if (faults < max_faults * 2 / 10)
+ node_clear(nid, numa_group->active_nodes);
+ }
+}
+
+/*
* When adapting the scan rate, the period is divided into NUMA_PERIOD_SLOTS
* increments. The more local the fault statistics are, the higher the scan
* period will be for the next scan window. If local/remote ratio is below
@@ -1416,6 +1449,7 @@ static void task_numa_placement(struct task_struct *p)
update_task_scan_period(p, fault_types[0], fault_types[1]);

if (p->numa_group) {
+ update_numa_active_node_mask(p);
/*
* If the preferred task and group nids are different,
* iterate over the nodes again to find the best place.
@@ -1478,6 +1512,8 @@ static void task_numa_group(struct task_struct *p, int cpupid, int flags,
/* Second half of the array tracks where faults come from */
grp->faults_from = grp->faults + 2 * nr_node_ids;

+ node_set(task_node(current), grp->active_nodes);
+
for (i = 0; i < 4*nr_node_ids; i++)
grp->faults[i] = p->numa_faults[i];

@@ -1547,6 +1583,8 @@ static void task_numa_group(struct task_struct *p, int cpupid, int flags,
my_grp->nr_tasks--;
grp->nr_tasks++;

+ update_numa_active_node_mask(p);
+
spin_unlock(&my_grp->lock);
spin_unlock(&grp->lock);

--
1.8.4.2

ri...@redhat.com

Jan 17, 2014, 9:20:03 PM

From: Rik van Riel <ri...@redhat.com>

Excessive migration of pages can hurt the performance of workloads
that span multiple NUMA nodes. However, it turns out that the
p->numa_migrate_deferred knob is a really big hammer, which does
reduce migration rates, but does not actually help performance.

Now that the second stage of the automatic numa balancing code
has stabilized, it is time to replace the simplistic migration
deferral code with something smarter.

Cc: Peter Zijlstra <pet...@infradead.org>
Cc: Mel Gorman <mgo...@suse.de>
Cc: Ingo Molnar <mi...@redhat.com>
Cc: Chegu Vinod <chegu...@hp.com>
Signed-off-by: Rik van Riel <ri...@redhat.com>
---
include/linux/sched.h | 1 -
kernel/sched/fair.c | 8 --------
kernel/sysctl.c | 7 -------
mm/mempolicy.c | 45 ---------------------------------------------
4 files changed, 61 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 68a0e84..97efba4 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1469,7 +1469,6 @@ struct task_struct {
unsigned int numa_scan_period;
unsigned int numa_scan_period_max;
int numa_preferred_nid;
- int numa_migrate_deferred;
unsigned long numa_migrate_retry;
u64 node_stamp; /* migration stamp */
struct callback_head numa_work;
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 867b0a4..41e2176 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 36cb46c..052abac 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
if (!cpupid_pid_unset(last_cpupid) && cpupid_to_nid(last_cpupid) != thisnid) {
-
- /* See sysctl_numa_balancing_migrate_deferred comment */
- if (!cpupid_match_pid(current, last_cpupid))
- defer_numa_migrate(current);
-
goto out;
}
-
- /*
- * The quadratic filter above reduces extraneous migration
- * of shared pages somewhat. This code reduces it even more,
- * reducing the overhead of page migrations of shared pages.
- * This makes workloads with shared pages rely more on
- * "move task near its memory", and less on "move memory
- * towards its task", which is exactly what we want.
- */
- if (numa_migrate_deferred(current, last_cpupid))
- goto out;
}

if (curnid != polnid)
--
1.8.4.2

ri...@redhat.com

Jan 17, 2014, 9:30:02 PM

From: Rik van Riel <ri...@redhat.com>

The tracepoint has made it abundantly clear that the naive
implementation of the faults_from code has issues.

Specifically, the garbage collector in some workloads will
access orders of magnitude more memory than the threads
that do all the active work. This resulted in the node with
the garbage collector being marked the only active node in
the group.

This issue is avoided if we weigh the statistics by CPU use
of each task in the numa group, instead of by how many faults
each thread has incurred.

To achieve this, we normalize the number of faults to the
fraction of faults that occurred on each node, and then
multiply that fraction by the fraction of CPU time the
task has used since the last time task_numa_placement was
invoked.

This way the nodes in the active node mask will be the ones
where the tasks from the numa group are most actively running,
and the influence of e.g. the garbage collector and other
do-little threads is properly minimized.

Cc: Peter Zijlstra <pet...@infradead.org>
Cc: Mel Gorman <mgo...@suse.de>
Cc: Ingo Molnar <mi...@redhat.com>
Cc: Chegu Vinod <chegu...@hp.com>
Signed-off-by: Rik van Riel <ri...@redhat.com>
---
include/linux/sched.h | 2 ++
kernel/sched/core.c | 2 ++
kernel/sched/fair.c | 48 ++++++++++++++++++++++++++++++++++++++++++++++--
3 files changed, 50 insertions(+), 2 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 0af6c1a..52de567 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1471,6 +1471,8 @@ struct task_struct {
int numa_preferred_nid;
unsigned long numa_migrate_retry;
u64 node_stamp; /* migration stamp */
+ u64 last_task_numa_placement;
+ u64 last_sum_exec_runtime;
struct callback_head numa_work;

struct list_head numa_entry;
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 7f45fd5..9a0908a 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1758,6 +1758,8 @@ static void __sched_fork(unsigned long clone_flags, struct task_struct *p)
p->numa_work.next = &p->numa_work;
p->numa_faults = NULL;
p->numa_faults_buffer = NULL;
+ p->last_task_numa_placement = 0;
+ p->last_sum_exec_runtime = 0;

INIT_LIST_HEAD(&p->numa_entry);
p->numa_group = NULL;
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 8e0a53a..0d395a0 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1422,11 +1422,41 @@ static void update_task_scan_period(struct task_struct *p,
memset(p->numa_faults_locality, 0, sizeof(p->numa_faults_locality));
}

static void task_numa_placement(struct task_struct *p)
{
int seq, nid, max_nid = -1, max_group_nid = -1;
unsigned long max_faults = 0, max_group_faults = 0;
unsigned long fault_types[2] = { 0, 0 };
+ unsigned long total_faults;
+ u64 runtime, period;
spinlock_t *group_lock = NULL;

seq = ACCESS_ONCE(p->mm->numa_scan_seq);
@@ -1435,6 +1465,10 @@ static void task_numa_placement(struct task_struct *p)
p->numa_scan_seq = seq;
p->numa_scan_period_max = task_scan_max(p);

+ total_faults = p->numa_faults_locality[0] +
+ p->numa_faults_locality[1] + 1;
+ runtime = numa_get_avg_runtime(p, &period);
+
/* If the task is part of a group prevent parallel updates to group stats */
if (p->numa_group) {
group_lock = &p->numa_group->lock;
@@ -1447,7 +1481,7 @@ static void task_numa_placement(struct task_struct *p)
int priv, i;

for (priv = 0; priv < 2; priv++) {
- long diff, f_diff;
+ long diff, f_diff, f_weight;

i = task_faults_idx(nid, priv);
diff = -p->numa_faults[i];
@@ -1459,8 +1493,18 @@ static void task_numa_placement(struct task_struct *p)
fault_types[priv] += p->numa_faults_buffer[i];
p->numa_faults_buffer[i] = 0;

+ /*
+ * Normalize the faults_from, so all tasks in a group
+ * count according to CPU use, instead of by the raw
+ * number of faults. Tasks with little runtime have
+ * little over-all impact on throughput, and thus their
+ * faults are less important.
+ */
+ f_weight = (1024 * runtime *
+ p->numa_faults_from_buffer[i]) /
+ (total_faults * period + 1);
p->numa_faults_from[i] >>= 1;
- p->numa_faults_from[i] += p->numa_faults_from_buffer[i];
+ p->numa_faults_from[i] += f_weight;
p->numa_faults_from_buffer[i] = 0;

faults += p->numa_faults[i];
--
1.8.4.2

ri...@redhat.com

Jan 17, 2014, 9:30:02 PM

From: Rik van Riel <ri...@redhat.com>

The current code in task_numa_placement calculates the difference
between the old and the new value, but also temporarily stores half
of the old value in the per-process variables.

The NUMA balancing code looks at those per-process variables, and
having other tasks temporarily see halved statistics could lead to
unwanted numa migrations. This can be avoided by doing all the math
in local variables.

This change also simplifies the code a little.
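
A toy illustration of the window this closes; ordinary variables stand in
for the per-task fields, so it only shows the ordering of the stores, not
an actual concurrent run:

	#include <stdio.h>

	int main(void)
	{
		unsigned long faults = 100, buffer = 40;
		long diff;

		/* old sequence: the decayed value is visible on its own */
		faults >>= 1;       /* a reader here sees 50 ...    */
		faults += buffer;   /* ... rather than the final 90 */
		printf("old sequence ends at %lu\n", faults);

		/* new sequence: compute the delta locally, publish once */
		faults = 100;
		diff = (long)buffer - (long)(faults / 2);  /* 40 - 50 = -10 */
		faults += diff;                            /* single update: 90 */
		printf("new sequence ends at %lu\n", faults);
		return 0;
	}

Both orderings end at the same value; the difference is that the second
never exposes the halved intermediate to other tasks scanning the same
statistics.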

Cc: Peter Zijlstra <pet...@infradead.org>
Cc: Mel Gorman <mgo...@suse.de>
Cc: Ingo Molnar <mi...@redhat.com>
Cc: Chegu Vinod <chegu...@hp.com>
Signed-off-by: Rik van Riel <ri...@redhat.com>
---
kernel/sched/fair.c | 12 ++++--------
1 file changed, 4 insertions(+), 8 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 0d395a0..0f48382 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1484,12 +1484,9 @@ static void task_numa_placement(struct task_struct *p)
long diff, f_diff, f_weight;

i = task_faults_idx(nid, priv);
- diff = -p->numa_faults[i];
- f_diff = -p->numa_faults_from[i];

/* Decay existing window, copy faults since last scan */
- p->numa_faults[i] >>= 1;
- p->numa_faults[i] += p->numa_faults_buffer[i];
+ diff = p->numa_faults_buffer[i] - p->numa_faults[i] / 2;
fault_types[priv] += p->numa_faults_buffer[i];
p->numa_faults_buffer[i] = 0;

@@ -1503,13 +1500,12 @@ static void task_numa_placement(struct task_struct *p)
f_weight = (1024 * runtime *
p->numa_faults_from_buffer[i]) /
(total_faults * period + 1);
- p->numa_faults_from[i] >>= 1;
- p->numa_faults_from[i] += f_weight;
+ f_diff = f_weight - p->numa_faults_from[i] / 2;
p->numa_faults_from_buffer[i] = 0;

+ p->numa_faults[i] += diff;
+ p->numa_faults_from[i] += f_diff;
faults += p->numa_faults[i];
- diff += p->numa_faults[i];
- f_diff += p->numa_faults_from[i];
p->total_numa_faults += diff;
if (p->numa_group) {
/* safe because we can only change our own group */
--
1.8.4.2

Rik van Riel

Jan 18, 2014, 3:40:01 AM

On 01/17/2014 04:12 PM, ri...@redhat.com wrote:
> From: Rik van Riel <ri...@redhat.com>
>
> The current code in task_numa_placement calculates the difference
> between the old and the new value, but also temporarily stores half
> of the old value in the per-process variables.
>
> The NUMA balancing code looks at those per-process variables, and
> having other tasks temporarily see halved statistics could lead to
> unwanted numa migrations. This can be avoided by doing all the math
> in local variables.
>
> This change also simplifies the code a little.

I am seeing what looks like a performance improvement
with this patch, so it is not just a theoretical bug.
The improvement is small, as is to be expected with
such a small race, but with two 32-warehouse specjbb
instances on a 4-node, 10core/20thread per node system,
I see the following change in performance, and reduced
numa page migrations.

Without the patch:
run 1: throughput 367660 367660, migrated 3112982
run 2: throughput 353821 355612, migrated 2881317
run 3: throughput 355027 355027, migrated 3358105
run 4: throughput 354366 354366, migrated 3466687
run 5: throughput 356186 356186, migrated 3152194
run 6: throughput 361431 361431, migrated 3336219
run 7: throughput 354704 354704, migrated 3345418
run 8: throughput 363770 363770, migrated 3642925
run 9: throughput 363380 363380, migrated 3192836
run 10: throughput 358440 358440, migrated 3354028
avg: throughput 358968, migrated 3284271

With the patch:
run 1: throughput 360580 360580, migrated 3169872
run 2: throughput 361303 361303, migrated 3220280
run 3: throughput 367692 367692, migrated 3096093
run 4: throughput 362320 362320, migrated 2981762
run 5: throughput 364201 364201, migrated 3089107
run 6: throughput 364561 364561, migrated 2892364
run 7: throughput 360771 360771, migrated 3086638
run 8: throughput 361530 361530, migrated 2933256
run 9: throughput 365841 365841, migrated 3356944
run 10: throughput 359188 359188, migrated 3394545
avg: throughput 362798, migrated 3122086


--
All rights reversed

Chegu Vinod

Jan 18, 2014, 10:10:02 PM



Acked-by: Chegu Vinod <chegu...@hp.com>

----

Here are some results using the v2 version of the patches
on an 8 socket box using SPECjbb2005 as a workload :

I) Eight 1-socket wide instances (10 warehouse threads each):

                         Without patches    With patches
                         ---------------    ------------
a) numactl pinning results
spec1.txt: throughput = 270620.04 273675.10
spec2.txt: throughput = 274115.33 272845.17
spec3.txt: throughput = 277830.09 272057.33
spec4.txt: throughput = 270898.52 270670.54
spec5.txt: throughput = 270397.30 270906.82
spec6.txt: throughput = 270451.93 268217.55
spec7.txt: throughput = 269511.07 269354.46
spec8.txt: throughput = 269386.06 270540.00

b)Automatic NUMA balancing results
spec1.txt: throughput = 244333.41 248072.72
spec2.txt: throughput = 252166.99 251818.30
spec3.txt: throughput = 251365.58 258266.24
spec4.txt: throughput = 245247.91 256873.51
spec5.txt: throughput = 245579.68 247743.18
spec6.txt: throughput = 249767.38 256285.86
spec7.txt: throughput = 244570.64 255343.99
spec8.txt: throughput = 245703.60 254434.36

c)NO Automatic NUMA balancing and NO-pinning results
spec1.txt: throughput = 132959.73 136957.12
spec2.txt: throughput = 127937.11 129326.23
spec3.txt: throughput = 130697.10 125772.11
spec4.txt: throughput = 134978.49 141607.58
spec5.txt: throughput = 127574.34 126748.18
spec6.txt: throughput = 138699.99 128597.95
spec7.txt: throughput = 133247.25 137344.57
spec8.txt: throughput = 124548.00 139040.98

------

II) Four 2-socket wide instances (20 warehouse threads each):

                         Without patches    With patches
                         ---------------    ------------
a) numactl pinning results
spec1.txt: throughput = 479931.16 472467.58
spec2.txt: throughput = 466652.15 466237.10
spec3.txt: throughput = 473591.51 466891.98
spec4.txt: throughput = 462346.62 466891.98

b)Automatic NUMA balancing results
spec1.txt: throughput = 383758.29 437489.99
spec2.txt: throughput = 370926.06 435692.97
spec3.txt: throughput = 368872.72 444615.08
spec4.txt: throughput = 404422.82 435236.20

c)NO Automatic NUMA balancing and NO-pinning results
spec1.txt: throughput = 252752.12 231762.30
spec2.txt: throughput = 255391.51 253250.95
spec3.txt: throughput = 264764.00 263721.03
spec4.txt: throughput = 254833.39 242892.72

------

III) Two 4-socket wide instances(40 warehouse threads each)

                              Without patches      With patches
                              ---------------      ------------
a) numactl pinning results
spec1.txt: throughput = 771340.84 769039.53
spec2.txt: throughput = 762184.48 760745.65

b)Automatic NUMA balancing results
spec1.txt: throughput = 667182.98 720197.01
spec2.txt: throughput = 692564.11 739872.51

c)NO Automatic NUMA balancing and NO-pinning results
spec1.txt: throughput = 457079.28 467199.30
spec2.txt: throughput = 479790.47 456279.07

-----

IV) One 8-socket wide instance(80 warehouse threads)

                              Without patches      With patches
                              ---------------      ------------
a) numactl pinning results
spec1.txt: throughput = 982113.03 985836.96

b)Automatic NUMA balancing results
spec1.txt: throughput = 755615.94 843632.09

c)NO Automatic NUMA balancing and NO-pinning results
spec1.txt: throughput = 671583.26 661768.54

Peter Zijlstra

unread,
Jan 20, 2014, 4:40:02 PM1/20/14
to
Why not use 6/16 and 3/16 resp.? That avoids an actual division.
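
For illustration, a minimal stand-alone sketch of the two threshold styles
(the max_faults value is made up; the 6/16 and 3/16 variant turns the divide
into a shift):

#include <stdio.h>

int main(void)
{
	unsigned long max_faults = 1000;	/* made-up value */

	/* current thresholds: 40% to add a node, 20% to remove it */
	unsigned long add_div = max_faults * 4 / 10;
	unsigned long del_div = max_faults * 2 / 10;

	/* suggested thresholds: 6/16 and 3/16, so the divide becomes a shift */
	unsigned long add_shift = (max_faults * 6) >> 4;
	unsigned long del_shift = (max_faults * 3) >> 4;

	printf("add: %lu vs %lu, remove: %lu vs %lu\n",
	       add_div, add_shift, del_div, del_shift);
	return 0;
}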

Peter Zijlstra

unread,
Jan 20, 2014, 5:00:01 PM1/20/14
to
On Fri, Jan 17, 2014 at 04:12:08PM -0500, ri...@redhat.com wrote:
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index 0af6c1a..52de567 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -1471,6 +1471,8 @@ struct task_struct {
> int numa_preferred_nid;
> unsigned long numa_migrate_retry;
> u64 node_stamp; /* migration stamp */
> + u64 last_task_numa_placement;
> + u64 last_sum_exec_runtime;
> struct callback_head numa_work;
>
> struct list_head numa_entry;

Have you tried what happens if you use p->se.avg.runnable_avg_sum /
p->se.avg.runnable_avg_period instead? If that also works it avoids
growing the datastructures and keeping of yet another set of runtime
stats.
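
For illustration only, a user-space sketch of the idea (the struct and values
are stand-ins for the scheduler's runnable averages, not the real fields):

#include <stdio.h>

/* stand-in for the p->se.avg fields referred to above */
struct sched_avg_sketch {
	unsigned long runnable_avg_sum;		/* time spent runnable */
	unsigned long runnable_avg_period;	/* total tracked period */
};

int main(void)
{
	struct sched_avg_sketch avg = { .runnable_avg_sum = 750,
					.runnable_avg_period = 1000 };

	/* fixed-point CPU-use fraction, scaled by 1024; no extra stats kept */
	unsigned long cpu_use = (avg.runnable_avg_sum << 10) /
				(avg.runnable_avg_period + 1);

	printf("cpu use ~= %lu/1024\n", cpu_use);
	return 0;
}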

Peter Zijlstra

unread,
Jan 20, 2014, 5:00:02 PM1/20/14
to
On Fri, Jan 17, 2014 at 04:12:05PM -0500, ri...@redhat.com wrote:
> /*
> + * Iterate over the nodes from which NUMA hinting faults were triggered, in
> + * other words where the CPUs that incurred NUMA hinting faults are. The
> + * bitmask is used to limit NUMA page migrations, and spread out memory
> + * between the actively used nodes. To prevent flip-flopping, and excessive
> + * page migrations, nodes are added when they cause over 40% of the maximum
> + * number of faults, but only removed when they drop below 20%.
> + */

Maybe break the above into two paragraphs for added readability.

Also, I think this might be a good spot to explain why you need the
second fault metric -- that is, why can't we create the interleave mask
from the existing memory location faults.

> +static void update_numa_active_node_mask(struct task_struct *p)
> +{
> + unsigned long faults, max_faults = 0;
> + struct numa_group *numa_group = p->numa_group;
> + int nid;
> +
> + for_each_online_node(nid) {
> + faults = numa_group->faults_from[task_faults_idx(nid, 0)] +
> + numa_group->faults_from[task_faults_idx(nid, 1)];
> + if (faults > max_faults)
> + max_faults = faults;
> + }
> +
> + for_each_online_node(nid) {
> + faults = numa_group->faults_from[task_faults_idx(nid, 0)] +
> + numa_group->faults_from[task_faults_idx(nid, 1)];
> + if (!node_isset(nid, numa_group->active_nodes)) {
> + if (faults > max_faults * 4 / 10)
> + node_set(nid, numa_group->active_nodes);
> + } else if (faults < max_faults * 2 / 10)
> + node_clear(nid, numa_group->active_nodes);
> + }
> +}

Peter Zijlstra

unread,
Jan 20, 2014, 5:00:02 PM1/20/14
to
On Fri, Jan 17, 2014 at 04:12:06PM -0500, ri...@redhat.com wrote:
> From: Rik van Riel <ri...@redhat.com>
>
> Being able to see how the active nodemask changes over time, and why,
> can be quite useful.
>
> Cc: Peter Zijlstra <pet...@infradead.org>
> Cc: Mel Gorman <mgo...@suse.de>
> Cc: Ingo Molnar <mi...@redhat.com>
> Cc: Chegu Vinod <chegu...@hp.com>
> Signed-off-by: Rik van Riel <ri...@redhat.com>
> ---
> include/trace/events/sched.h | 34 ++++++++++++++++++++++++++++++++++
> kernel/sched/fair.c | 8 ++++++--
> 2 files changed, 40 insertions(+), 2 deletions(-)
>
> diff --git a/include/trace/events/sched.h b/include/trace/events/sched.h
> index 67e1bbf..91726b6 100644
> --- a/include/trace/events/sched.h
> +++ b/include/trace/events/sched.h
> @@ -530,6 +530,40 @@ TRACE_EVENT(sched_swap_numa,
> __entry->dst_pid, __entry->dst_tgid, __entry->dst_ngid,
> __entry->dst_cpu, __entry->dst_nid)
> );
> +
> +TRACE_EVENT(update_numa_active_nodes_mask,

Please stick to the sched_ naming for these things.

Ideally we'd rename the sysctls too :/

> +++ b/kernel/sched/fair.c
> @@ -1300,10 +1300,14 @@ static void update_numa_active_node_mask(struct task_struct *p)
> faults = numa_group->faults_from[task_faults_idx(nid, 0)] +
> numa_group->faults_from[task_faults_idx(nid, 1)];
> if (!node_isset(nid, numa_group->active_nodes)) {
> - if (faults > max_faults * 4 / 10)
> + if (faults > max_faults * 4 / 10) {
> + trace_update_numa_active_nodes_mask(current->pid, numa_group->gid, nid, true, faults, max_faults);

While I think the tracepoint hookery is smart enough to avoid evaluating
arguments when they're disabled, it might be best to simply pass:
current and numa_group and do the dereference in fast_assign().

That said, this is the first and only numa tracepoint, I'm not sure why
this qualifies and other metrics do not.

Rik van Riel

unread,
Jan 20, 2014, 7:00:02 PM1/20/14
to
On 01/20/2014 11:52 AM, Peter Zijlstra wrote:
> On Fri, Jan 17, 2014 at 04:12:06PM -0500, ri...@redhat.com wrote:

>> +++ b/kernel/sched/fair.c
>> @@ -1300,10 +1300,14 @@ static void update_numa_active_node_mask(struct task_struct *p)
>> faults = numa_group->faults_from[task_faults_idx(nid, 0)] +
>> numa_group->faults_from[task_faults_idx(nid, 1)];
>> if (!node_isset(nid, numa_group->active_nodes)) {
>> - if (faults > max_faults * 4 / 10)
>> + if (faults > max_faults * 4 / 10) {
>> + trace_update_numa_active_nodes_mask(current->pid, numa_group->gid, nid, true, faults, max_faults);
>
> While I think the tracepoint hookery is smart enough to avoid evaluating
> arguments when they're disabled, it might be best to simply pass:
> current and numa_group and do the dereference in fast_assign().
>
> That said, this is the first and only numa tracepoint, I'm not sure why
> this qualifies and other metrics do not.

It's there because I needed it in development.

If you think it is not merge material, I would be comfortable
leaving it out.

--
All rights reversed

Steven Rostedt

unread,
Jan 20, 2014, 7:10:01 PM1/20/14
to
On Mon, 20 Jan 2014 17:52:05 +0100
Peter Zijlstra <pet...@infradead.org> wrote:

> On Fri, Jan 17, 2014 at 04:12:06PM -0500, ri...@redhat.com wrote:
> > From: Rik van Riel <ri...@redhat.com>
> >

> > +++ b/kernel/sched/fair.c
> > @@ -1300,10 +1300,14 @@ static void update_numa_active_node_mask(struct task_struct *p)
> > faults = numa_group->faults_from[task_faults_idx(nid, 0)] +
> > numa_group->faults_from[task_faults_idx(nid, 1)];
> > if (!node_isset(nid, numa_group->active_nodes)) {
> > - if (faults > max_faults * 4 / 10)
> > + if (faults > max_faults * 4 / 10) {
> > + trace_update_numa_active_nodes_mask(current->pid, numa_group->gid, nid, true, faults, max_faults);
>
> While I think the tracepoint hookery is smart enough to avoid evaluating
> arguments when they're disabled, it might be best to simply pass:
> current and numa_group and do the dereference in fast_assign().

It's really up to gcc to optimize it. But that said, it is more
efficient to just pass the pointer and do the dereferencing in the
fast_assign(). At least it keeps any bad optimization in gcc from
infecting the tracepoint caller.

It also makes it easier to get other information if you want to later
extend that tracepoint.

Does this tracepoint always use current? If so, why bother passing it
in?

-- Steve

Rik van Riel

unread,
Jan 20, 2014, 7:10:01 PM1/20/14
to
That is what I started out with, and the results were not
as stable as with this calculation.

Having said that, I did that before I came up with patch 7/7,
so maybe the effect is no longer as pronounced as it was before...

I can send in a simplified version, if you prefer.

--
All rights reversed

Peter Zijlstra

unread,
Jan 20, 2014, 7:20:01 PM1/20/14
to
On Mon, Jan 20, 2014 at 02:02:32PM -0500, Rik van Riel wrote:
> That is what I started out with, and the results were not
> as stable as with this calculation.
>
> Having said that, I did that before I came up with patch 7/7,
> so maybe the effect would no longer be as pronounced any more
> as it was before...
>
> I can send in a simplified version, if you prefer.

If you could retry with 7/7, I don't mind adding the extra stats too
much, but it would be nice if we can avoid it.

ri...@redhat.com

unread,
Jan 20, 2014, 7:30:02 PM1/20/14
to
The current automatic NUMA balancing code base has issues with
workloads that do not fit on one NUMA node. Page migration is
slowed down, but memory distribution between the nodes where
the workload runs is essentially random, often resulting in a
suboptimal amount of memory bandwidth being available to the
workload.

In order to maximize performance of workloads that do not fit in one NUMA
node, we want to satisfy the following criteria:
1) keep private memory local to each thread
2) avoid excessive NUMA migration of pages
3) distribute shared memory across the active nodes, to
maximize memory bandwidth available to the workload

This patch series identifies the NUMA nodes on which the workload
is actively running, and balances (somewhat lazily) the memory
between those nodes, satisfying the criteria above.

As usual, the series has had some performance testing, but it
could always benefit from more testing, on other systems.

Changes since v2:
- dropped tracepoint (for now?)
- implement obvious improvements suggested by Peter
- use the scheduler maintained CPU use statistics, drop
the NUMA specific ones for now. We can add those later
if they turn out to be beneficial
Changes since v1:
- fix divide by zero found by Chegu Vinod
- improve comment, as suggested by Peter Zijlstra
- do stats calculations in task_numa_placement in local variables


Some performance numbers, with two 40-warehouse specjbb instances
on an 8 node system with 10 CPU cores per node, using a pre-cleanup
version of these patches, courtesy of Chegu Vinod:

numactl manual pinning
spec1.txt: throughput = 755900.20 SPECjbb2005 bops
spec2.txt: throughput = 754914.40 SPECjbb2005 bops

NO-pinning results (Automatic NUMA balancing, with patches)
spec1.txt: throughput = 706439.84 SPECjbb2005 bops
spec2.txt: throughput = 729347.75 SPECjbb2005 bops

NO-pinning results (Automatic NUMA balancing, without patches)
spec1.txt: throughput = 667988.47 SPECjbb2005 bops
spec2.txt: throughput = 638220.45 SPECjbb2005 bops

No Automatic NUMA and NO-pinning results
spec1.txt: throughput = 544120.97 SPECjbb2005 bops
spec2.txt: throughput = 453553.41 SPECjbb2005 bops


My own performance numbers are not as relevant, since I have been
running with a more hostile workload on purpose, and I have run
into a scheduler issue that caused the workload to run on only
two of the four NUMA nodes on my test system...

ri...@redhat.com

unread,
Jan 20, 2014, 7:30:02 PM1/20/14
to
From: Rik van Riel <ri...@redhat.com>

Use the active_nodes nodemask to make smarter decisions on NUMA migrations.

In order to maximize performance of workloads that do not fit in one NUMA
node, we want to satisfy the following criteria:
1) keep private memory local to each thread
2) avoid excessive NUMA migration of pages
3) distribute shared memory across the active nodes, to
maximize memory bandwidth available to the workload

This patch accomplishes that by implementing the following policy for
NUMA migrations:
1) always migrate on a private fault
2) never migrate to a node that is not in the set of active nodes
for the numa_group
3) always migrate from a node outside of the set of active nodes,
to a node that is in that set
4) within the set of active nodes in the numa_group, only migrate
from a node with more NUMA page faults, to a node with fewer
NUMA page faults, with a 25% margin to avoid ping-ponging

This results in most pages of a workload ending up on the actively
used nodes, with reduced ping-ponging of pages between those nodes.

Cc: Peter Zijlstra <pet...@infradead.org>
Cc: Mel Gorman <mgo...@suse.de>
Cc: Ingo Molnar <mi...@redhat.com>
Cc: Chegu Vinod <chegu...@hp.com>
Signed-off-by: Rik van Riel <ri...@redhat.com>
---
include/linux/sched.h | 7 +++++++
kernel/sched/fair.c | 37 +++++++++++++++++++++++++++++++++++++
mm/mempolicy.c | 3 +++
3 files changed, 47 insertions(+)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index a9f7f05..0af6c1a 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1602,6 +1602,8 @@ extern void task_numa_fault(int last_node, int node, int pages, int flags);
extern pid_t task_numa_group_id(struct task_struct *p);
extern void set_numabalancing_state(bool enabled);
extern void task_numa_free(struct task_struct *p);
+extern bool should_numa_migrate(struct task_struct *p, int last_cpupid,
+ int src_nid, int dst_nid);
#else
static inline void task_numa_fault(int last_node, int node, int pages,
int flags)
@@ -1617,6 +1619,11 @@ static inline void set_numabalancing_state(bool enabled)
static inline void task_numa_free(struct task_struct *p)
{
}
+static inline bool should_numa_migrate(struct task_struct *p, int last_cpupid,
+ int src_nid, int dst_nid)
+{
+ return true;
+}
#endif

static inline struct pid *task_pid(struct task_struct *task)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index ea8b2ae..ea873b6 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 052abac..050962b 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -2405,6 +2405,9 @@ int mpol_misplaced(struct page *page, struct vm_area_struct *vma, unsigned long
if (!cpupid_pid_unset(last_cpupid) && cpupid_to_nid(last_cpupid) != thisnid) {
goto out;
}
+
+ if (!should_numa_migrate(current, last_cpupid, curnid, polnid))
+ goto out;
}

if (curnid != polnid)
--
1.8.4.2
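
A sketch of the four rules above as a single helper, reconstructed from the
fragments quoted later in the thread (treat details as approximate, not as
the exact hunk):

bool should_numa_migrate(struct task_struct *p, int last_cpupid,
			 int src_nid, int dst_nid)
{
	struct numa_group *ng = p->numa_group;

	/* 1) always migrate on a private fault */
	if (cpupid_match_pid(p, last_cpupid))
		return true;

	/* shared fault, but no numa_group has been set up yet */
	if (!ng)
		return true;

	/* 2) never migrate to a node outside the group's active set */
	if (!node_isset(dst_nid, ng->active_nodes))
		return false;

	/* 3) always migrate from an inactive node to an active one */
	if (!node_isset(src_nid, ng->active_nodes))
		return true;

	/*
	 * 4) within the active set, only migrate from a busier node to a
	 * less busy one, with a 25% margin to avoid ping-ponging
	 */
	return group_faults(p, dst_nid) < group_faults(p, src_nid) * 3 / 4;
}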

ri...@redhat.com

unread,
Jan 20, 2014, 7:30:02 PM1/20/14
to
From: Rik van Riel <ri...@redhat.com>

The tracepoint has made it abundantly clear that the naive
implementation of the faults_from code has issues.

Specifically, the garbage collector in some workloads will
access orders of magnitudes more memory than the threads
that do all the active work. This resulted in the node with
the garbage collector being marked the only active node in
the group.

This issue is avoided if we weigh the statistics by CPU use
of each task in the numa group, instead of by how many faults
each thread has incurred.

To achieve this, we normalize the number of faults to the
fraction of faults that occurred on each node, and then
multiply that fraction by the fraction of CPU time the
task has used since the last time task_numa_placement was
invoked.

This way the nodes in the active node mask will be the ones
where the tasks from the numa group are most actively running,
and the influence of eg. the garbage collector and other
do-little threads is properly minimized.
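
To put rough, purely illustrative numbers on it: a thread that used 25% of
its tracked CPU time and incurred 300 of its 1000 faults from node 0 would
add a weight of about 16384 * 0.25 * 300/1000 ~= 1228 for that node, while a
garbage-collector-style thread with the same fault distribution but only 1%
CPU use would add only about 49, so it barely moves the active node mask.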

Cc: Peter Zijlstra <pet...@infradead.org>
Cc: Mel Gorman <mgo...@suse.de>
Cc: Ingo Molnar <mi...@redhat.com>
Cc: Chegu Vinod <chegu...@hp.com>
Signed-off-by: Rik van Riel <ri...@redhat.com>
---
kernel/sched/fair.c | 21 +++++++++++++++++++--
1 file changed, 19 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index ea873b6..203877d 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1426,6 +1426,8 @@ static void task_numa_placement(struct task_struct *p)
int seq, nid, max_nid = -1, max_group_nid = -1;
unsigned long max_faults = 0, max_group_faults = 0;
unsigned long fault_types[2] = { 0, 0 };
+ unsigned long total_faults;
+ u64 runtime, period;
spinlock_t *group_lock = NULL;

seq = ACCESS_ONCE(p->mm->numa_scan_seq);
@@ -1434,6 +1436,11 @@ static void task_numa_placement(struct task_struct *p)
p->numa_scan_seq = seq;
p->numa_scan_period_max = task_scan_max(p);

+ total_faults = p->numa_faults_locality[0] +
+ p->numa_faults_locality[1] + 1;
+ runtime = p->se.avg.runnable_avg_sum;
+ period = p->se.avg.runnable_avg_period;
+
/* If the task is part of a group prevent parallel updates to group stats */
if (p->numa_group) {
group_lock = &p->numa_group->lock;
@@ -1446,7 +1453,7 @@ static void task_numa_placement(struct task_struct *p)
int priv, i;

for (priv = 0; priv < 2; priv++) {
- long diff, f_diff;
+ long diff, f_diff, f_weight;

i = task_faults_idx(nid, priv);
diff = -p->numa_faults[i];
@@ -1458,8 +1465,18 @@ static void task_numa_placement(struct task_struct *p)
fault_types[priv] += p->numa_faults_buffer[i];
p->numa_faults_buffer[i] = 0;

+ /*
+ * Normalize the faults_from, so all tasks in a group
+ * count according to CPU use, instead of by the raw
+ * number of faults. Tasks with little runtime have
+ * little over-all impact on throughput, and thus their
+ * faults are less important.
+ */
+ f_weight = (16384 * runtime *
+ p->numa_faults_from_buffer[i]) /
+ (total_faults * period + 1);
p->numa_faults_from[i] >>= 1;
- p->numa_faults_from[i] += p->numa_faults_from_buffer[i];
+ p->numa_faults_from[i] += f_weight;
p->numa_faults_from_buffer[i] = 0;

faults += p->numa_faults[i];
--
1.8.4.2

ri...@redhat.com

unread,
Jan 20, 2014, 7:30:02 PM1/20/14
to
From: Rik van Riel <ri...@redhat.com>

Track which nodes NUMA faults are triggered from, in other words
the CPUs on which the NUMA faults happened. This uses a similar
mechanism to what is used to track the memory involved in numa faults.

The next patches use this to build up a bitmap of which nodes a
workload is actively running on.

Cc: Peter Zijlstra <pet...@infradead.org>
Cc: Mel Gorman <mgo...@suse.de>
Cc: Ingo Molnar <mi...@redhat.com>
Cc: Chegu Vinod <chegu...@hp.com>
Signed-off-by: Rik van Riel <ri...@redhat.com>
---
include/linux/sched.h | 10 ++++++++--
kernel/sched/fair.c | 30 +++++++++++++++++++++++-------
2 files changed, 31 insertions(+), 9 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 97efba4..a9f7f05 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1492,6 +1492,14 @@ struct task_struct {
unsigned long *numa_faults_buffer;

/*
+ * Track the nodes where faults are incurred. This is not very
* interesting on a per-task basis, but it helps with smarter
+ * numa memory placement for groups of processes.
+ */
+ unsigned long *numa_faults_from;
+ unsigned long *numa_faults_from_buffer;
+
+ /*
* numa_faults_locality tracks if faults recorded during the last
* scan window were remote/local. The task scan period is adapted
* based on the locality of the faults with different weights
@@ -1594,8 +1602,6 @@ extern void task_numa_fault(int last_node, int node, int pages, int flags);
extern pid_t task_numa_group_id(struct task_struct *p);
extern void set_numabalancing_state(bool enabled);
extern void task_numa_free(struct task_struct *p);
-
-extern unsigned int sysctl_numa_balancing_migrate_deferred;
#else
static inline void task_numa_fault(int last_node, int node, int pages,
int flags)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 41e2176..1945ddc 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -886,6 +886,7 @@ struct numa_group {

struct rcu_head rcu;
unsigned long total_faults;
+ unsigned long *faults_from;
unsigned long faults[0];
};

@@ -1372,10 +1373,11 @@ static void task_numa_placement(struct task_struct *p)
int priv, i;

for (priv = 0; priv < 2; priv++) {
- long diff;
+ long diff, f_diff;

i = task_faults_idx(nid, priv);
diff = -p->numa_faults[i];
+ f_diff = -p->numa_faults_from[i];

/* Decay existing window, copy faults since last scan */
p->numa_faults[i] >>= 1;
@@ -1383,12 +1385,18 @@ static void task_numa_placement(struct task_struct *p)
fault_types[priv] += p->numa_faults_buffer[i];
p->numa_faults_buffer[i] = 0;

+ p->numa_faults_from[i] >>= 1;
+ p->numa_faults_from[i] += p->numa_faults_from_buffer[i];
+ p->numa_faults_from_buffer[i] = 0;
+
faults += p->numa_faults[i];
diff += p->numa_faults[i];
+ f_diff += p->numa_faults_from[i];
p->total_numa_faults += diff;
if (p->numa_group) {
/* safe because we can only change our own group */
p->numa_group->faults[i] += diff;
+ p->numa_group->faults_from[i] += f_diff;
p->numa_group->total_faults += diff;
group_faults += p->numa_group->faults[i];
}
@@ -1457,7 +1465,7 @@ static void task_numa_group(struct task_struct *p, int cpupid, int flags,

if (unlikely(!p->numa_group)) {
unsigned int size = sizeof(struct numa_group) +
- 2*nr_node_ids*sizeof(unsigned long);
+ 4*nr_node_ids*sizeof(unsigned long);

grp = kzalloc(size, GFP_KERNEL | __GFP_NOWARN);
if (!grp)
@@ -1467,8 +1475,10 @@ static void task_numa_group(struct task_struct *p, int cpupid, int flags,
spin_lock_init(&grp->lock);
INIT_LIST_HEAD(&grp->task_list);
grp->gid = p->pid;
+ /* Second half of the array tracks where faults come from */
+ grp->faults_from = grp->faults + 2 * nr_node_ids;

- for (i = 0; i < 2*nr_node_ids; i++)
+ for (i = 0; i < 4*nr_node_ids; i++)
grp->faults[i] = p->numa_faults[i];

grp->total_faults = p->total_numa_faults;
@@ -1526,7 +1536,7 @@ static void task_numa_group(struct task_struct *p, int cpupid, int flags,

double_lock(&my_grp->lock, &grp->lock);

- for (i = 0; i < 2*nr_node_ids; i++) {
+ for (i = 0; i < 4*nr_node_ids; i++) {
my_grp->faults[i] -= p->numa_faults[i];
grp->faults[i] += p->numa_faults[i];
}
@@ -1558,7 +1568,7 @@ void task_numa_free(struct task_struct *p)

if (grp) {
spin_lock(&grp->lock);
- for (i = 0; i < 2*nr_node_ids; i++)
+ for (i = 0; i < 4*nr_node_ids; i++)
grp->faults[i] -= p->numa_faults[i];
grp->total_faults -= p->total_numa_faults;

@@ -1571,6 +1581,8 @@ void task_numa_free(struct task_struct *p)

p->numa_faults = NULL;
p->numa_faults_buffer = NULL;
+ p->numa_faults_from = NULL;
+ p->numa_faults_from_buffer = NULL;
kfree(numa_faults);
}

@@ -1581,6 +1593,7 @@ void task_numa_fault(int last_cpupid, int node, int pages, int flags)
{
struct task_struct *p = current;
bool migrated = flags & TNF_MIGRATED;
+ int this_node = task_node(current);
int priv;

if (!numabalancing_enabled)
@@ -1596,7 +1609,7 @@ void task_numa_fault(int last_cpupid, int node, int pages, int flags)

/* Allocate buffer to track faults on a per-node basis */
if (unlikely(!p->numa_faults)) {
- int size = sizeof(*p->numa_faults) * 2 * nr_node_ids;
+ int size = sizeof(*p->numa_faults) * 4 * nr_node_ids;

/* numa_faults and numa_faults_buffer share the allocation */
p->numa_faults = kzalloc(size * 2, GFP_KERNEL|__GFP_NOWARN);
@@ -1604,7 +1617,9 @@ void task_numa_fault(int last_cpupid, int node, int pages, int flags)
return;

BUG_ON(p->numa_faults_buffer);
- p->numa_faults_buffer = p->numa_faults + (2 * nr_node_ids);
+ p->numa_faults_from = p->numa_faults + (2 * nr_node_ids);
+ p->numa_faults_buffer = p->numa_faults + (4 * nr_node_ids);
+ p->numa_faults_from_buffer = p->numa_faults + (6 * nr_node_ids);
p->total_numa_faults = 0;
memset(p->numa_faults_locality, 0, sizeof(p->numa_faults_locality));
}
@@ -1634,6 +1649,7 @@ void task_numa_fault(int last_cpupid, int node, int pages, int flags)
p->numa_pages_migrated += pages;

p->numa_faults_buffer[task_faults_idx(node, priv)] += pages;
+ p->numa_faults_from_buffer[task_faults_idx(this_node, priv)] += pages;
p->numa_faults_locality[!!(flags & TNF_FAULT_LOCAL)] += pages;
}

--
1.8.4.2

ri...@redhat.com

unread,
Jan 20, 2014, 7:30:02 PM1/20/14
to
From: Rik van Riel <ri...@redhat.com>

Excessive migration of pages can hurt the performance of workloads
that span multiple NUMA nodes. However, it turns out that the
p->numa_migrate_deferred knob is a really big hammer, which does
reduce migration rates, but does not actually help performance.

Now that the second stage of the automatic numa balancing code
has stabilized, it is time to replace the simplistic migration
deferral code with something smarter.

Cc: Peter Zijlstra <pet...@infradead.org>
Cc: Mel Gorman <mgo...@suse.de>
Cc: Ingo Molnar <mi...@redhat.com>
Cc: Chegu Vinod <chegu...@hp.com>
Signed-off-by: Rik van Riel <ri...@redhat.com>
---
include/linux/sched.h | 1 -
kernel/sched/fair.c | 8 --------
kernel/sysctl.c | 7 -------
mm/mempolicy.c | 45 ---------------------------------------------
4 files changed, 61 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 68a0e84..97efba4 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1469,7 +1469,6 @@ struct task_struct {
unsigned int numa_scan_period;
unsigned int numa_scan_period_max;
int numa_preferred_nid;
- int numa_migrate_deferred;
unsigned long numa_migrate_retry;
u64 node_stamp; /* migration stamp */
struct callback_head numa_work;
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 867b0a4..41e2176 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 36cb46c..052abac 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
if (!cpupid_pid_unset(last_cpupid) && cpupid_to_nid(last_cpupid) != thisnid) {
-
- /* See sysctl_numa_balancing_migrate_deferred comment */
- if (!cpupid_match_pid(current, last_cpupid))
- defer_numa_migrate(current);
-
goto out;
}
-
- /*
- * The quadratic filter above reduces extraneous migration
- * of shared pages somewhat. This code reduces it even more,
- * reducing the overhead of page migrations of shared pages.
- * This makes workloads with shared pages rely more on
- * "move task near its memory", and less on "move memory
- * towards its task", which is exactly what we want.
- */
- if (numa_migrate_deferred(current, last_cpupid))
- goto out;
}

if (curnid != polnid)
--
1.8.4.2

ri...@redhat.com

unread,
Jan 20, 2014, 7:30:02 PM1/20/14
to
From: Rik van Riel <ri...@redhat.com>

The faults_from statistics are used to maintain an active_nodes nodemask
per numa_group. This allows us to be smarter about when to do numa migrations.

Cc: Peter Zijlstra <pet...@infradead.org>
Cc: Mel Gorman <mgo...@suse.de>
Cc: Ingo Molnar <mi...@redhat.com>
Cc: Chegu Vinod <chegu...@hp.com>
Signed-off-by: Rik van Riel <ri...@redhat.com>
---
kernel/sched/fair.c | 41 +++++++++++++++++++++++++++++++++++++++++
1 file changed, 41 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 1945ddc..ea8b2ae 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -885,6 +885,7 @@ struct numa_group {
struct list_head task_list;

struct rcu_head rcu;
+ nodemask_t active_nodes;
unsigned long total_faults;
unsigned long *faults_from;
unsigned long faults[0];
@@ -1275,6 +1276,41 @@ static void numa_migrate_preferred(struct task_struct *p)
}

/*
+ * Find the nodes on which the workload is actively running. We do this by
+ * tracking the nodes from which NUMA hinting faults are triggered. This can
+ * be different from the set of nodes where the workload's memory is currently
+ * located.
+ *
+ * The bitmask is used to make smarter decisions on when to do NUMA page
+ * migrations. To prevent flip-flopping, and excessive page migrations, nodes
+ * are added when they cause over 6/16 of the maximum number of faults, but
+ * only removed when they drop below 3/16.
+ */
+static void update_numa_active_node_mask(struct task_struct *p)
+{
+ unsigned long faults, max_faults = 0;
+ struct numa_group *numa_group = p->numa_group;
+ int nid;
+
+ for_each_online_node(nid) {
+ faults = numa_group->faults_from[task_faults_idx(nid, 0)] +
+ numa_group->faults_from[task_faults_idx(nid, 1)];
+ if (faults > max_faults)
+ max_faults = faults;
+ }
+
+ for_each_online_node(nid) {
+ faults = numa_group->faults_from[task_faults_idx(nid, 0)] +
+ numa_group->faults_from[task_faults_idx(nid, 1)];
+ if (!node_isset(nid, numa_group->active_nodes)) {
+ if (faults > max_faults * 6 / 16)
+ node_set(nid, numa_group->active_nodes);
+ } else if (faults < max_faults * 3 / 16)
+ node_clear(nid, numa_group->active_nodes);
+ }
+}
+
+/*
* When adapting the scan rate, the period is divided into NUMA_PERIOD_SLOTS
* increments. The more local the fault statistics are, the higher the scan
* period will be for the next scan window. If local/remote ratio is below
@@ -1416,6 +1452,7 @@ static void task_numa_placement(struct task_struct *p)
update_task_scan_period(p, fault_types[0], fault_types[1]);

if (p->numa_group) {
+ update_numa_active_node_mask(p);
/*
* If the preferred task and group nids are different,
* iterate over the nodes again to find the best place.
@@ -1478,6 +1515,8 @@ static void task_numa_group(struct task_struct *p, int cpupid, int flags,
/* Second half of the array tracks where faults come from */
grp->faults_from = grp->faults + 2 * nr_node_ids;

+ node_set(task_node(current), grp->active_nodes);
+
for (i = 0; i < 4*nr_node_ids; i++)
grp->faults[i] = p->numa_faults[i];

@@ -1547,6 +1586,8 @@ static void task_numa_group(struct task_struct *p, int cpupid, int flags,
my_grp->nr_tasks--;
grp->nr_tasks++;

+ update_numa_active_node_mask(p);
+
spin_unlock(&my_grp->lock);
spin_unlock(&grp->lock);

--
1.8.4.2

ri...@redhat.com

unread,
Jan 20, 2014, 7:30:02 PM1/20/14
to
From: Rik van Riel <ri...@redhat.com>

The current code in task_numa_placement calculates the difference
between the old and the new value, but also temporarily stores half
of the old value in the per-process variables.

The NUMA balancing code looks at those per-process variables, and
having other tasks temporarily see halved statistics could lead to
unwanted numa migrations. This can be avoided by doing all the math
in local variables.

This change also simplifies the code a little.

Cc: Peter Zijlstra <pet...@infradead.org>
Cc: Mel Gorman <mgo...@suse.de>
Cc: Ingo Molnar <mi...@redhat.com>
Cc: Chegu Vinod <chegu...@hp.com>
Signed-off-by: Rik van Riel <ri...@redhat.com>
---
kernel/sched/fair.c | 12 ++++--------
1 file changed, 4 insertions(+), 8 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 203877d..ad30d14 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1456,12 +1456,9 @@ static void task_numa_placement(struct task_struct *p)
long diff, f_diff, f_weight;

i = task_faults_idx(nid, priv);
- diff = -p->numa_faults[i];
- f_diff = -p->numa_faults_from[i];

/* Decay existing window, copy faults since last scan */
- p->numa_faults[i] >>= 1;
- p->numa_faults[i] += p->numa_faults_buffer[i];
+ diff = p->numa_faults_buffer[i] - p->numa_faults[i] / 2;
fault_types[priv] += p->numa_faults_buffer[i];
p->numa_faults_buffer[i] = 0;

@@ -1475,13 +1472,12 @@ static void task_numa_placement(struct task_struct *p)
f_weight = (16384 * runtime *
p->numa_faults_from_buffer[i]) /
(total_faults * period + 1);
- p->numa_faults_from[i] >>= 1;
- p->numa_faults_from[i] += f_weight;
+ f_diff = f_weight - p->numa_faults_from[i] / 2;
p->numa_faults_from_buffer[i] = 0;

+ p->numa_faults[i] += diff;
+ p->numa_faults_from[i] += f_diff;
faults += p->numa_faults[i];
- diff += p->numa_faults[i];
- f_diff += p->numa_faults_from[i];
p->total_numa_faults += diff;
if (p->numa_group) {
/* safe because we can only change our own group */
--
1.8.4.2

Rik van Riel

unread,
Jan 20, 2014, 7:40:01 PM1/20/14
to
OK, will do.

--
All rights reversed

Mel Gorman

unread,
Jan 21, 2014, 12:00:08 PM1/21/14
to
On Mon, Jan 20, 2014 at 02:21:02PM -0500, ri...@redhat.com wrote:
> From: Rik van Riel <ri...@redhat.com>
>
> Excessive migration of pages can hurt the performance of workloads
> that span multiple NUMA nodes. However, it turns out that the
> p->numa_migrate_deferred knob is a really big hammer, which does
> reduce migration rates, but does not actually help performance.
>
> Now that the second stage of the automatic numa balancing code
> has stabilized, it is time to replace the simplistic migration
> deferral code with something smarter.
>
> Cc: Peter Zijlstra <pet...@infradead.org>
> Cc: Mel Gorman <mgo...@suse.de>
> Cc: Ingo Molnar <mi...@redhat.com>
> Cc: Chegu Vinod <chegu...@hp.com>
> Signed-off-by: Rik van Riel <ri...@redhat.com>

When I added a tracepoint to track deferred migration I was surprised how
often it triggered for some workloads. I agree that we want to do something
better because it was a crutch albeit a necessary one at the time.

Note that the knob was not about performance as such, it was about avoiding
worst-case behaviour. We should keep an eye out for bugs that look like
excessive migration on workloads that are not converging. Reintroducing this
hammer would be a last resort for working around the problem.

Finally, the sysctl is documented in Documentation/sysctl/kernel.txt and
this patch should also remove it.

Functionally, the patch looks fine and it's time to reinvestigate whether
the deferral is still necessary, so assuming the documentation gets removed:

Acked-by: Mel Gorman <mgo...@suse.de>

--
Mel Gorman
SUSE Labs

Mel Gorman

unread,
Jan 21, 2014, 12:30:02 PM1/21/14
to
As an aside I wonder if we can derive any useful metric from this. One
potential sanity check would be the number of nodes that a task is incurring
faults on. It would be best if the highest number of faults were recorded
on the node the task is currently running on. After that we either want
to minimise the number of nodes trapping faults or interleave between
all available nodes to avoid applying too much memory pressure on any
one node. For interleaving to always be the best option we would have to
assume that all nodes are equal distance but that would be a reasonable
assumption to start with.
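
As a sketch of that sanity check (not part of the series; helper names as
used elsewhere in the thread, the rest hypothetical):

static int task_fault_node_count(struct task_struct *p)
{
	int nid, nr = 0;

	if (!p->numa_faults)
		return 0;

	/* count the nodes on which this task has recorded hinting faults */
	for_each_online_node(nid) {
		if (p->numa_faults[task_faults_idx(nid, 0)] +
		    p->numa_faults[task_faults_idx(nid, 1)])
			nr++;
	}
	return nr;
}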

> + /*
> * numa_faults_locality tracks if faults recorded during the last
> * scan window were remote/local. The task scan period is adapted
> * based on the locality of the faults with different weights
> @@ -1594,8 +1602,6 @@ extern void task_numa_fault(int last_node, int node, int pages, int flags);
> extern pid_t task_numa_group_id(struct task_struct *p);
> extern void set_numabalancing_state(bool enabled);
> extern void task_numa_free(struct task_struct *p);
> -
> -extern unsigned int sysctl_numa_balancing_migrate_deferred;
> #else
> static inline void task_numa_fault(int last_node, int node, int pages,
> int flags)
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 41e2176..1945ddc 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -886,6 +886,7 @@ struct numa_group {
>
> struct rcu_head rcu;
> unsigned long total_faults;
> + unsigned long *faults_from;
> unsigned long faults[0];
> };
>

faults_from is not ambiguous but it does not tell us a lot of information
either. If I am reading this right then fundamentally this patch means we
are tracking two pieces of information

1. The node the data resided on at the time of the hinting fault (numa_faults)
2. The node the accessing task was residing on at the time of the fault (faults_from)

We should be able to have names that reflect that. How about
memory_faults_locality and cpu_faults_locality with a preparation patch
doing a simple rename for numa_faults and this patch adding
cpu_faults_locality?

It will be tough to be consistent about this but the clearer we are about
making decisions based on task location vs data location the happier we
will be in the long run.
Should we convert that magic number to a define? NR_NUMA_HINT_FAULT_STATS?

> grp = kzalloc(size, GFP_KERNEL | __GFP_NOWARN);
> if (!grp)
> @@ -1467,8 +1475,10 @@ static void task_numa_group(struct task_struct *p, int cpupid, int flags,
> spin_lock_init(&grp->lock);
> INIT_LIST_HEAD(&grp->task_list);
> grp->gid = p->pid;
> + /* Second half of the array tracks where faults come from */
> + grp->faults_from = grp->faults + 2 * nr_node_ids;
>

We have accessors when we overload arrays like this such as task_faults_idx
for example. We should have similar accessors for this in case those
offsets ever change.

> - for (i = 0; i < 2*nr_node_ids; i++)
> + for (i = 0; i < 4*nr_node_ids; i++)
> grp->faults[i] = p->numa_faults[i];
>

This is a little obscure now. Functionally it is copying both numa_faults and
numa_faults_from but a casual reading of that will get confused. Minimally
it needs a comment explaining what is being copied here. Also, why did we
not use memcpy?
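
As a sketch of both suggestions (an accessor for the second half of the
array, plus a memcpy for the copy; names are hypothetical):

/* index into the faults_from half of the combined faults array */
static inline int task_faults_from_idx(int nid, int priv)
{
	return 2 * nr_node_ids + task_faults_idx(nid, priv);
}

/* copy both halves (faults and faults_from) in one explicit step */
memcpy(grp->faults, p->numa_faults,
       4 * nr_node_ids * sizeof(unsigned long));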

> grp->total_faults = p->total_numa_faults;
> @@ -1526,7 +1536,7 @@ static void task_numa_group(struct task_struct *p, int cpupid, int flags,
>
> double_lock(&my_grp->lock, &grp->lock);
>
> - for (i = 0; i < 2*nr_node_ids; i++) {
> + for (i = 0; i < 4*nr_node_ids; i++) {
> my_grp->faults[i] -= p->numa_faults[i];
> grp->faults[i] += p->numa_faults[i];
> }

The same obscure trick is used throughout and I'm not sure how
maintainable that will be. Would it be better to be explicit about this?

/* NUMA hinting faults may be either shared or private faults */
#define NR_NUMA_HINT_FAULT_TYPES 2

/* Track shared and private faults */
#define NR_NUMA_HINT_FAULT_STATS (NR_NUMA_HINT_FAULT_TYPES*2)

this_node and node are similarly ambiguous names. A rename to
data_node and cpu_node would have been clearer.

--
Mel Gorman
SUSE Labs

Mel Gorman

unread,
Jan 21, 2014, 2:20:02 PM1/21/14
to
On Mon, Jan 20, 2014 at 02:21:04PM -0500, ri...@redhat.com wrote:
> From: Rik van Riel <ri...@redhat.com>
>
> The faults_from statistics are used to maintain an active_nodes nodemask
> per numa_group. This allows us to be smarter about when to do numa migrations.
>
> Cc: Peter Zijlstra <pet...@infradead.org>
> Cc: Mel Gorman <mgo...@suse.de>
> Cc: Ingo Molnar <mi...@redhat.com>
> Cc: Chegu Vinod <chegu...@hp.com>
> Signed-off-by: Rik van Riel <ri...@redhat.com>
> ---
> kernel/sched/fair.c | 41 +++++++++++++++++++++++++++++++++++++++++
> 1 file changed, 41 insertions(+)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 1945ddc..ea8b2ae 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -885,6 +885,7 @@ struct numa_group {
> struct list_head task_list;
>
> struct rcu_head rcu;
> + nodemask_t active_nodes;
> unsigned long total_faults;
> unsigned long *faults_from;
> unsigned long faults[0];

It's not a concern for now but in the land of unicorns and ponies we'll
relook at the size of some of these structures and see what can be
optimised.

Similar to my comment on faults_from I think we could potentially evaluate
the fitness of the automatic NUMA balancing feature by looking at the
weight of the active_nodes for a numa_group. If
bitmask_weight(active_nodes) == nr_online_nodes
for all numa_groups in the system then I think it would be an indication
that the algorithm has collapsed.

It's not a comment on the patch itself. We could just do with more
metrics that help analyse this thing when debugging problems.
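
A sketch of that collapse check (nodes_weight() being the nodemask
equivalent of bitmask_weight; the function name is hypothetical):

/* flag a numa_group whose active set has grown to cover every node */
static bool numa_group_collapsed(struct numa_group *ng)
{
	return nodes_weight(ng->active_nodes) == nr_online_nodes;
}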

> @@ -1275,6 +1276,41 @@ static void numa_migrate_preferred(struct task_struct *p)
> }
>
> /*
> + * Find the nodes on which the workload is actively running. We do this by

hmm, it's not the workload though, it's a single NUMA group and a workload
may consist of multiple NUMA groups. For example, in an ideal world and
a JVM-based workload the application threads and the GC threads would be
in different NUMA groups.

The signature is even more misleading because the signature implies that
the function is concerned with tasks. Pass in p->numa_group


> + * tracking the nodes from which NUMA hinting faults are triggered. This can
> + * be different from the set of nodes where the workload's memory is currently
> + * located.
> + *
> + * The bitmask is used to make smarter decisions on when to do NUMA page
> + * migrations, To prevent flip-flopping, and excessive page migrations, nodes
> + * are added when they cause over 6/16 of the maximum number of faults, but
> + * only removed when they drop below 3/16.
> + */

Looking at the values, I'm guessing you did it this way to use shifts
instead of divides. That's fine, but how did you arrive at those values?
Experimentally or just felt reasonable?

> +static void update_numa_active_node_mask(struct task_struct *p)
> +{
> + unsigned long faults, max_faults = 0;
> + struct numa_group *numa_group = p->numa_group;
> + int nid;
> +
> + for_each_online_node(nid) {
> + faults = numa_group->faults_from[task_faults_idx(nid, 0)] +
> + numa_group->faults_from[task_faults_idx(nid, 1)];

task_faults() implements a helper for p->numa_faults equivalent of this.
Just as with the other renaming, it would not hurt to rename task_faults()
to something like task_faults_memory() and add a task_faults_cpu() for
this. The objective again is to be clear about whether we care about CPU
or memory locality information.
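
For illustration, the CPU-side helper suggested here could look something
like the following (name and exact form hypothetical):

static inline unsigned long task_faults_cpu(struct task_struct *p, int nid)
{
	/* sum private and shared faults incurred while running on @nid */
	return p->numa_faults_from[task_faults_idx(nid, 0)] +
	       p->numa_faults_from[task_faults_idx(nid, 1)];
}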

> + if (faults > max_faults)
> + max_faults = faults;
> + }
> +
> + for_each_online_node(nid) {
> + faults = numa_group->faults_from[task_faults_idx(nid, 0)] +
> + numa_group->faults_from[task_faults_idx(nid, 1)];

group_faults would need similar adjustment.

> + if (!node_isset(nid, numa_group->active_nodes)) {
> + if (faults > max_faults * 6 / 16)
> + node_set(nid, numa_group->active_nodes);
> + } else if (faults < max_faults * 3 / 16)
> + node_clear(nid, numa_group->active_nodes);
> + }
> +}
> +

I think there is a subtle problem here

/*
* Be mindful that this is subject to sampling error. As we only have
* data on hinting faults active_nodes may miss a heavily referenced
* node due to the references being to a small number of pages. If
* there is a large linear scanner in the same numa group as a
* task operating on a small amount of memory then the latter task
* may be ignored.
*/

I have no suggestion on how to handle this because we're vulnerable to
sampling errors in a number of places but it does not hurt to be reminded
of that in a few places.

> +/*
> * When adapting the scan rate, the period is divided into NUMA_PERIOD_SLOTS
> * increments. The more local the fault statistics are, the higher the scan
> * period will be for the next scan window. If local/remote ratio is below
> @@ -1416,6 +1452,7 @@ static void task_numa_placement(struct task_struct *p)
> update_task_scan_period(p, fault_types[0], fault_types[1]);
>
> if (p->numa_group) {
> + update_numa_active_node_mask(p);

We are updating that thing once per scan window, that's fine. There is
potentially a wee issue though. If all the tasks in the group are threads
then they share p->mm->numa_scan_seq and only one task does the update
per scan window. If they are different processes then we could be updating
more frequently than necessary.

Functionally it'll be fine but higher cost than necessary. I do not have a
better suggestion right now as superficially a numa_scan_seq per numa_group
would not be a good fit.

If we think of nothing better and the issue is real then we can at least
stick a comment there for future reference.

> /*
> * If the preferred task and group nids are different,
> * iterate over the nodes again to find the best place.
> @@ -1478,6 +1515,8 @@ static void task_numa_group(struct task_struct *p, int cpupid, int flags,
> /* Second half of the array tracks where faults come from */
> grp->faults_from = grp->faults + 2 * nr_node_ids;
>
> + node_set(task_node(current), grp->active_nodes);
> +
> for (i = 0; i < 4*nr_node_ids; i++)
> grp->faults[i] = p->numa_faults[i];
>
> @@ -1547,6 +1586,8 @@ static void task_numa_group(struct task_struct *p, int cpupid, int flags,
> my_grp->nr_tasks--;
> grp->nr_tasks++;
>
> + update_numa_active_node_mask(p);
> +

This may be subtle enough to deserve a comment

/* Tasks have joined/left groups and the active_mask is no longer valid */

If we left a group, we update our new group. Is the old group now out of
date and in need of updating too? If so, then we should update both and
only update the old group if it still has tasks in it.

--
Mel Gorman
SUSE Labs

Mel Gorman

unread,
Jan 21, 2014, 3:10:02 PM1/21/14
to
On Mon, Jan 20, 2014 at 02:21:05PM -0500, ri...@redhat.com wrote:
> From: Rik van Riel <ri...@redhat.com>
>
> Use the active_nodes nodemask to make smarter decisions on NUMA migrations.
>
> In order to maximize performance of workloads that do not fit in one NUMA
> node, we want to satisfy the following criteria:
> 1) keep private memory local to each thread
> 2) avoid excessive NUMA migration of pages
> 3) distribute shared memory across the active nodes, to
> maximize memory bandwidth available to the workload
>
> This patch accomplishes that by implementing the following policy for
> NUMA migrations:
> 1) always migrate on a private fault

Makes sense

> 2) never migrate to a node that is not in the set of active nodes
> for the numa_group

This will work out in every case *except* where we miss an active node
because the task running there is faulting a very small number of pages.
Worth recording that in case we ever see a bug that could be explained
by it.

> 3) always migrate from a node outside of the set of active nodes,
> to a node that is in that set

Clever

A *potential* consequence of this is that we may see large amounts of
migration traffic if we ever implement something that causes tasks to
enter/leave numa groups frequently.

> 4) within the set of active nodes in the numa_group, only migrate
> from a node with more NUMA page faults, to a node with fewer
> NUMA page faults, with a 25% margin to avoid ping-ponging
>

Of the four, this is the highest risk again because we might miss tasks
in an active node due to them accessing a small number of pages.

Not suggesting you change the policy at this point, we should just keep
an eye out for it. It could be argued that a task accessing a small amount
of memory on a large NUMA machine is not a task we care about anyway :/
In light of the memory/data distinction, how about

should_numa_migrate_memory?

> + struct numa_group *ng = p->numa_group;
> +
> + /* Always allow migrate on private faults */
> + if (cpupid_match_pid(p, last_cpupid))
> + return true;
> +

We now have the two-stage filter detection in mpol_misplaced and the rest
of the migration decision logic here. Keep them in the same place? It
might necessitate passing in the page being faulted as well but then the
return value will be clearer

/*
* This function returns true if the @page is misplaced and should be
* migrated.
*/

It may need a name change as well if you decide to move everything into
this function including the call to page_cpupid_xchg_last

> + /* A shared fault, but p->numa_group has not been set up yet. */
> + if (!ng)
> + return true;
> +
> + /*
> + * Do not migrate if the destination is not a node that
> + * is actively used by this numa group.
> + */
> + if (!node_isset(dst_nid, ng->active_nodes))
> + return false;
> +

If I'm right about the sampling error potentially missing tasks accessing a
small number of pages then a reminder about the sampling error would not hurt

> + /*
> + * Source is a node that is not actively used by this
> + * numa group, while the destination is. Migrate.
> + */
> + if (!node_isset(src_nid, ng->active_nodes))
> + return true;
> +
> + /*
> + * Both source and destination are nodes in active
> + * use by this numa group. Maximize memory bandwidth
> + * by migrating from more heavily used groups, to less
> + * heavily used ones, spreading the load around.
> + * Use a 1/4 hysteresis to avoid spurious page movement.
> + */
> + return group_faults(p, dst_nid) < (group_faults(p, src_nid) * 3 / 4);
> +}

I worried initially about how this would interact with the scheduler
placement which is concerned with the number of faults per node. I think
it's ok though because it should flatten out and the interleaved nodes
should not look like good scheduling candidates. Something to keep in
mind in the future.

I do not see why this is a 1/4 hysteresis though. It looks more like a
threshold based on the number of faults than anything to do with
hysteresis.

Finally, something like this is approximately the same as three-quarters
but does not use divides as a micro-optimisation. The approximation will
always be a greater value but the difference in error is marginal

src_group_faults = group_faults(p, src_nid);
src_group_faults -= src_group_faults >> 2;


> +
> static unsigned long weighted_cpuload(const int cpu);
> static unsigned long source_load(int cpu, int type);
> static unsigned long target_load(int cpu, int type);
> diff --git a/mm/mempolicy.c b/mm/mempolicy.c
> index 052abac..050962b 100644
> --- a/mm/mempolicy.c
> +++ b/mm/mempolicy.c
> @@ -2405,6 +2405,9 @@ int mpol_misplaced(struct page *page, struct vm_area_struct *vma, unsigned long
> if (!cpupid_pid_unset(last_cpupid) && cpupid_to_nid(last_cpupid) != thisnid) {
> goto out;
> }
> +
> + if (!should_numa_migrate(current, last_cpupid, curnid, polnid))
> + goto out;
> }
>
> if (curnid != polnid)
> --
> 1.8.4.2
>

--
Mel Gorman
SUSE Labs

Rik van Riel

unread,
Jan 21, 2014, 3:20:02 PM1/21/14
to
On 01/21/2014 09:19 AM, Mel Gorman wrote:
> On Mon, Jan 20, 2014 at 02:21:04PM -0500, ri...@redhat.com wrote:

>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index 1945ddc..ea8b2ae 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -885,6 +885,7 @@ struct numa_group {
>> struct list_head task_list;
>>
>> struct rcu_head rcu;
>> + nodemask_t active_nodes;
>> unsigned long total_faults;
>> unsigned long *faults_from;
>> unsigned long faults[0];
>
> It's not a concern for now but in the land of unicorns and ponies we'll
> relook at the size of some of these structures and see what can be
> optimised.

Unsigned int should be enough for systems with less than 8TB
of memory per node :)

> Similar to my comment on faults_from I think we could potentially evaluate
> the fitness of the automatic NUMA balancing feature by looking at the
> weight of the active_nodes for a numa_group. If
> bitmask_weight(active_nodes) == nr_online_nodes
> for all numa_groups in the system then I think it would be an indication
> that the algorithm has collapsed.

If the system runs one very large workload, I would expect the
scheduler to spread that workload across all nodes.

In that situation, it is perfectly legitimate for all nodes
to end up being marked as active nodes, and for the system
to try distribute the workload's memory somewhat evenly
between them.

> It's not a comment on the patch itself. We could just do with more
> metrics that help analyse this thing when debugging problems.
>
>> @@ -1275,6 +1276,41 @@ static void numa_migrate_preferred(struct task_struct *p)
>> }
>>
>> /*
>> + * Find the nodes on which the workload is actively running. We do this by
>
> hmm, it's not the workload though, it's a single NUMA group and a workload
> may consist of multiple NUMA groups. For example, in an ideal world and
> a JVM-based workload the application threads and the GC threads would be
> in different NUMA groups.

Why should they be in a different numa group?

The rest of the series contains patches to make sure they
should be just fine together in the same group...

> The signature is even more misleading because the signature implies that
> the function is concerned with tasks. Pass in p->numa_group

Will do.

>> + * tracking the nodes from which NUMA hinting faults are triggered. This can
>> + * be different from the set of nodes where the workload's memory is currently
>> + * located.
>> + *
>> + * The bitmask is used to make smarter decisions on when to do NUMA page
>> + * migrations, To prevent flip-flopping, and excessive page migrations, nodes
>> + * are added when they cause over 6/16 of the maximum number of faults, but
>> + * only removed when they drop below 3/16.
>> + */
>
> Looking at the values, I'm guessing you did it this way to use shifts
> instead of divides. That's fine, but how did you arrive at those values?
> Experimentally or just felt reasonable?

Experimentally I got to 20% and 40%. Peter suggested I change it
to 3/16 and 6/16, which appear to give identical performance.

>> +static void update_numa_active_node_mask(struct task_struct *p)
>> +{
>> + unsigned long faults, max_faults = 0;
>> + struct numa_group *numa_group = p->numa_group;
>> + int nid;
>> +
>> + for_each_online_node(nid) {
>> + faults = numa_group->faults_from[task_faults_idx(nid, 0)] +
>> + numa_group->faults_from[task_faults_idx(nid, 1)];
>
> task_faults() implements a helper for p->numa_faults equivalent of this.
> Just as with the other renaming, it would not hurt to rename task_faults()
> to something like task_faults_memory() and add a task_faults_cpu() for
> this. The objective again is to be clear about whether we care about CPU
> or memory locality information.

Will do.

>> + if (faults > max_faults)
>> + max_faults = faults;
>> + }
>> +
>> + for_each_online_node(nid) {
>> + faults = numa_group->faults_from[task_faults_idx(nid, 0)] +
>> + numa_group->faults_from[task_faults_idx(nid, 1)];
>
> group_faults would need similar adjustment.
>
>> + if (!node_isset(nid, numa_group->active_nodes)) {
>> + if (faults > max_faults * 6 / 16)
>> + node_set(nid, numa_group->active_nodes);
>> + } else if (faults < max_faults * 3 / 16)
>> + node_clear(nid, numa_group->active_nodes);
>> + }
>> +}
>> +
>
> I think there is a subtle problem here

Can you be more specific about what problem you think the hysteresis
could be causing?

> /*
> * Be mindful that this is subject to sampling error. As we only have
> * data on hinting faults active_nodes may miss a heavily referenced
> * node due to the references being to a small number of pages. If
> * there is a large linear scanner in the same numa group as a
> * task operating on a small amount of memory then the latter task
> * may be ignored.
> */
>
> I have no suggestion on how to handle this

Since the numa_faults_cpu statistics are all about driving
memory-follows-cpu, there actually is a decent way to handle
it. See patch 5 :)

>> +/*
>> * When adapting the scan rate, the period is divided into NUMA_PERIOD_SLOTS
>> * increments. The more local the fault statistics are, the higher the scan
>> * period will be for the next scan window. If local/remote ratio is below
>> @@ -1416,6 +1452,7 @@ static void task_numa_placement(struct task_struct *p)
>> update_task_scan_period(p, fault_types[0], fault_types[1]);
>>
>> if (p->numa_group) {
>> + update_numa_active_node_mask(p);
>
> We are updating that thing once per scan window, that's fine. There is
> potentially a wee issue though. If all the tasks in the group are threads
> then they share p->mm->numa_scan_seq and only one task does the update
> per scan window. If they are different processes then we could be updating
> more frequently than necessary.
>
> Functionally it'll be fine but higher cost than necessary. I do not have a
> better suggestion right now as superficially a numa_scan_seq per numa_group
> would not be a good fit.

I suspect this cost will be small anyway, compared to the costs
incurred in both the earlier part of task_numa_placement, and
in the code where we may look for a better place to migrate the
task to.

This just iterates over memory we have already touched before
(likely to still be cached), and does some cheap comparisons.

>> /*
>> * If the preferred task and group nids are different,
>> * iterate over the nodes again to find the best place.
>> @@ -1478,6 +1515,8 @@ static void task_numa_group(struct task_struct *p, int cpupid, int flags,
>> /* Second half of the array tracks where faults come from */
>> grp->faults_from = grp->faults + 2 * nr_node_ids;
>>
>> + node_set(task_node(current), grp->active_nodes);
>> +
>> for (i = 0; i < 4*nr_node_ids; i++)
>> grp->faults[i] = p->numa_faults[i];
>>
>> @@ -1547,6 +1586,8 @@ static void task_numa_group(struct task_struct *p, int cpupid, int flags,
>> my_grp->nr_tasks--;
>> grp->nr_tasks++;
>>
>> + update_numa_active_node_mask(p);
>> +
>
> This may be subtle enough to deserve a comment
>
> /* Tasks have joined/left groups and the active_mask is no longer valid */

I have added a comment.

> If we left a group, we update our new group. Is the old group now out of
> date and in need of updating too?

The entire old group will join the new group, and the old group
is freed.

--
All rights reversed

Mel Gorman

unread,
Jan 21, 2014, 3:50:02 PM1/21/14
to
On Tue, Jan 21, 2014 at 10:09:14AM -0500, Rik van Riel wrote:
> On 01/21/2014 09:19 AM, Mel Gorman wrote:
> > On Mon, Jan 20, 2014 at 02:21:04PM -0500, ri...@redhat.com wrote:
>
> >> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> >> index 1945ddc..ea8b2ae 100644
> >> --- a/kernel/sched/fair.c
> >> +++ b/kernel/sched/fair.c
> >> @@ -885,6 +885,7 @@ struct numa_group {
> >> struct list_head task_list;
> >>
> >> struct rcu_head rcu;
> >> + nodemask_t active_nodes;
> >> unsigned long total_faults;
> >> unsigned long *faults_from;
> >> unsigned long faults[0];
> >
> > It's not a concern for now but in the land of unicorns and ponies we'll
> > relook at the size of some of these structures and see what can be
> > optimised.
>
> Unsigned int should be enough for systems with less than 8TB
> of memory per node :)
>

Is it not bigger than that?

typedef struct { DECLARE_BITMAP(bits, MAX_NUMNODES); } nodemask_t;

so it depends on the value of NODES_SHIFT? Anyway, not worth getting
into a twist over.
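
(For context, a quick user-space sketch — not kernel code, and NODES_SHIFT=10
below is only an assumption — showing how the mask size scales:)

/* nodemask_t is a bitmap of MAX_NUMNODES = 1 << NODES_SHIFT bits,
 * rounded up to whole longs. */
#include <stdio.h>

#define NODES_SHIFT      10
#define MAX_NUMNODES     (1 << NODES_SHIFT)
#define BITS_PER_LONG    (8 * sizeof(long))
#define BITS_TO_LONGS(n) (((n) + BITS_PER_LONG - 1) / BITS_PER_LONG)

typedef struct { unsigned long bits[BITS_TO_LONGS(MAX_NUMNODES)]; } nodemask_t;

int main(void)
{
        printf("NODES_SHIFT=%d -> sizeof(nodemask_t) = %zu bytes\n",
               NODES_SHIFT, sizeof(nodemask_t));
        return 0;
}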

> > Similar to my comment on faults_from I think we could potentially evaluate
> > the fitness of the automatic NUMA balancing feature by looking at the
> > weight of the active_nodes for a numa_group. If
> > bitmask_weight(active_nodes) == nr_online_nodes
> > for all numa_groups in the system then I think it would be an indication
> > that the algorithm has collapsed.
>
> If the system runs one very large workload, I would expect the
> scheduler to spread that workload across all nodes.
>
> In that situation, it is perfectly legitimate for all nodes
> to end up being marked as active nodes, and for the system
> to try distribute the workload's memory somewhat evenly
> between them.
>

In the specific case where the workload is not partitioned and really
accessing all of memory then sure, it'll be spread throughout the
system. However, if we are looking at a case like multiple JVMs sized to
fit within nodes then the metric would hold.

> > It's not a comment on the patch itself. We could just do with more
> > metrics that help analyse this thing when debugging problems.
> >
> >> @@ -1275,6 +1276,41 @@ static void numa_migrate_preferred(struct task_struct *p)
> >> }
> >>
> >> /*
> >> + * Find the nodes on which the workload is actively running. We do this by
> >
> > hmm, it's not the workload though, it's a single NUMA group and a workload
> > may consist of multiple NUMA groups. For example, in an ideal world and
> > a JVM-based workload the application threads and the GC threads would be
> > in different NUMA groups.
>
> Why should they be in a different numa group?
>

It would be ideal if they were in different groups, so the hinting faults
incurred by the garbage collector (a linear scan of the address space)
do not affect scheduling and placement decisions based on the numa
group's fault statistics.

> The rest of the series contains patches to make sure they
> should be just fine together in the same group...
>
> > The signature is even more misleading because the signature implies that
> > the function is concerned with tasks. Pass in p->numa_group
>
> Will do.
>
> >> + * tracking the nodes from which NUMA hinting faults are triggered. This can
> >> + * be different from the set of nodes where the workload's memory is currently
> >> + * located.
> >> + *
> >> + * The bitmask is used to make smarter decisions on when to do NUMA page
> >> + * migrations, To prevent flip-flopping, and excessive page migrations, nodes
> >> + * are added when they cause over 6/16 of the maximum number of faults, but
> >> + * only removed when they drop below 3/16.
> >> + */
> >
> > Looking at the values, I'm guessing you did it this way to use shifts
> > instead of divides. That's fine, but how did you arrive at those values?
> > Experimentally or just felt reasonable?
>
> Experimentally I got to 20% and 40%. Peter suggested I change it
> to 3/16 and 6/16, which appear to give identical performance.
>

Cool
Let's say

Thread A: Most important thread for performance, accesses small amounts
of memory during each scan window. Let's say it's doing calculations
over a large cache-aware structure of some description

Thread B: Big stupid linear scanner accessing all of memory for whatever
reason.

Thread B will incur more NUMA hinting faults because it is accessing
idle memory that is unused by Thread A. The fault stats and placement
decisions are then skewed in favour of Thread B because Thread A did not
trap enough hinting faults.

It's a theoretical problem.
Fair enough. It'll show up in profiles if it's a problem anyway.
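
To put made-up numbers on that scenario, here is a user-space sketch (all
fault counts and CPU fractions are hypothetical) of how weighting each
thread's faults by its CPU use, as the later patch in this series does,
puts the busy thread's node back in front:

/* Thread B (the scanner) traps far more hinting faults than thread A,
 * but uses far less CPU. */
#include <stdio.h>

int main(void)
{
        long faults_a[2] = {  900,  100 };      /* thread A, runs on node 0 */
        long faults_b[2] = { 1000, 9000 };      /* thread B, scans node 1 */
        double cpu_a = 0.95, cpu_b = 0.05;      /* fraction of CPU used */

        for (int nid = 0; nid < 2; nid++) {
                long raw = faults_a[nid] + faults_b[nid];
                double weighted = cpu_a * faults_a[nid] + cpu_b * faults_b[nid];
                printf("node %d: raw=%ld weighted=%.0f\n", nid, raw, weighted);
        }
        return 0;       /* raw stats favour node 1, weighted stats favour node 0 */
}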

> >> /*
> >> * If the preferred task and group nids are different,
> >> * iterate over the nodes again to find the best place.
> >> @@ -1478,6 +1515,8 @@ static void task_numa_group(struct task_struct *p, int cpupid, int flags,
> >> /* Second half of the array tracks where faults come from */
> >> grp->faults_from = grp->faults + 2 * nr_node_ids;
> >>
> >> + node_set(task_node(current), grp->active_nodes);
> >> +
> >> for (i = 0; i < 4*nr_node_ids; i++)
> >> grp->faults[i] = p->numa_faults[i];
> >>
> >> @@ -1547,6 +1586,8 @@ static void task_numa_group(struct task_struct *p, int cpupid, int flags,
> >> my_grp->nr_tasks--;
> >> grp->nr_tasks++;
> >>
> >> + update_numa_active_node_mask(p);
> >> +
> >
> > This may be subtle enough to deserve a comment
> >
> > /* Tasks have joined/left groups and the active_mask is no longer valid */
>
> I have added a comment.
>
> > If we left a group, we update our new group. Is the old group now out of
> > date and in need of updating too?
>
> The entire old group will join the new group, and the old group
> is freed.
>

We reference count the old group so that it only gets freed when the
last task leaves it. If the old group was guaranteed to be destroyed
there would be no need to do stuff like

list_move(&p->numa_entry, &grp->task_list);
my_grp->total_faults -= p->total_numa_faults;
my_grp->nr_tasks--;

All that reads as "a single task is moving group" and not the entire
old group joins the new group. I expected that the old group was only
guaranteed to be destroyed in the case where we had just allocated it
because p->numa_group was NULL when task_numa_group was called.

--
Mel Gorman
SUSE Labs

Mel Gorman

Jan 21, 2014, 4:00:02 PM
On Mon, Jan 20, 2014 at 02:21:06PM -0500, ri...@redhat.com wrote:
> From: Rik van Riel <ri...@redhat.com>
>
> The tracepoint has made it abundantly clear that the naive
> implementation of the faults_from code has issues.
>
> Specifically, the garbage collector in some workloads will
> access orders of magnitudes more memory than the threads
> that do all the active work. This resulted in the node with
> the garbage collector being marked the only active node in
> the group.
>

Maybe I should have read this patch before getting into a twist about the
earlier patches in the series and the treatment of active_mask :(. On the
plus side, even without reading the code I can still see the motivation
for this paragraph.

> This issue is avoided if we weigh the statistics by CPU use
> of each task in the numa group, instead of by how many faults
> each thread has occurred.
>

Bah, yes. Because in my earlier review I was worried about the faults
being missed. If the fault stats are scaled by the CPU statistics then it
is a very rough proxy measure for how heavily a particular node is being
referenced by a process.
Depending on how you reacted to the review of other patches this may or
may not have a helper now.

> + runtime = p->se.avg.runnable_avg_sum;
> + period = p->se.avg.runnable_avg_period;
> +

Ok, IIRC these stats are based on a decaying average of recent
history, so heavy activity followed by long periods of idle will not skew
the stats.

> /* If the task is part of a group prevent parallel updates to group stats */
> if (p->numa_group) {
> group_lock = &p->numa_group->lock;
> @@ -1446,7 +1453,7 @@ static void task_numa_placement(struct task_struct *p)
> int priv, i;
>
> for (priv = 0; priv < 2; priv++) {
> - long diff, f_diff;
> + long diff, f_diff, f_weight;
>
> i = task_faults_idx(nid, priv);
> diff = -p->numa_faults[i];
> @@ -1458,8 +1465,18 @@ static void task_numa_placement(struct task_struct *p)
> fault_types[priv] += p->numa_faults_buffer[i];
> p->numa_faults_buffer[i] = 0;
>
> + /*
> + * Normalize the faults_from, so all tasks in a group
> + * count according to CPU use, instead of by the raw
> + * number of faults. Tasks with little runtime have
> + * little over-all impact on throughput, and thus their
> + * faults are less important.
> + */
> + f_weight = (16384 * runtime *
> + p->numa_faults_from_buffer[i]) /
> + (total_faults * period + 1);

Why 16384? It looks like a scaling factor to deal with integer approximations
but I'm not 100% sure and I do not see how you arrived at that value.

> p->numa_faults_from[i] >>= 1;
> - p->numa_faults_from[i] += p->numa_faults_from_buffer[i];
> + p->numa_faults_from[i] += f_weight;
> p->numa_faults_from_buffer[i] = 0;
>

numa_faults_from needs a big comment that it's no longer about the
raw number of faults in it. It's the sum of faults measured by the group,
weighted by CPU use.

> faults += p->numa_faults[i];
> --
> 1.8.4.2
>

--
Mel Gorman
SUSE Labs

Mel Gorman

Jan 21, 2014, 4:20:02 PM
On Mon, Jan 20, 2014 at 02:21:07PM -0500, ri...@redhat.com wrote:
> From: Rik van Riel <ri...@redhat.com>
>
> The current code in task_numa_placement calculates the difference
> between the old and the new value, but also temporarily stores half
> of the old value in the per-process variables.
>
> The NUMA balancing code looks at those per-process variables, and
> having other tasks temporarily see halved statistics could lead to
> unwanted numa migrations. This can be avoided by doing all the math
> in local variables.
>
> This change also simplifies the code a little.
>
> Cc: Peter Zijlstra <pet...@infradead.org>
> Cc: Mel Gorman <mgo...@suse.de>
> Cc: Ingo Molnar <mi...@redhat.com>
> Cc: Chegu Vinod <chegu...@hp.com>
> Signed-off-by: Rik van Riel <ri...@redhat.com>

Acked-by: Mel Gorman <mgo...@suse.de>

--
Mel Gorman
SUSE Labs

Rik van Riel

Jan 21, 2014, 9:10:02 PM
On 01/21/2014 10:56 AM, Mel Gorman wrote:
> On Mon, Jan 20, 2014 at 02:21:06PM -0500, ri...@redhat.com wrote:

>> @@ -1434,6 +1436,11 @@ static void task_numa_placement(struct task_struct *p)
>> p->numa_scan_seq = seq;
>> p->numa_scan_period_max = task_scan_max(p);
>>
>> + total_faults = p->numa_faults_locality[0] +
>> + p->numa_faults_locality[1] + 1;
>
> Depending on how you reacted to the review of other patches this may or
> may not have a helper now.

This is a faults "buffer", zeroed quickly after we take these
faults, so we should probably not tempt others by having a helper
function to get these numbers...

>> + runtime = p->se.avg.runnable_avg_sum;
>> + period = p->se.avg.runnable_avg_period;
>> +
>
> Ok, IIRC these stats are based on a decaying average of recent
> history, so heavy activity followed by long periods of idle will not skew
> the stats.

Turns out that using a longer time statistic results in a 1% performance
gain, so expect this code to change again in the next version :)

>> @@ -1458,8 +1465,18 @@ static void task_numa_placement(struct task_struct *p)
>> fault_types[priv] += p->numa_faults_buffer[i];
>> p->numa_faults_buffer[i] = 0;
>>
>> + /*
>> + * Normalize the faults_from, so all tasks in a group
>> + * count according to CPU use, instead of by the raw
>> + * number of faults. Tasks with little runtime have
>> + * little over-all impact on throughput, and thus their
>> + * faults are less important.
>> + */
>> + f_weight = (16384 * runtime *
>> + p->numa_faults_from_buffer[i]) /
>> + (total_faults * period + 1);
>
> Why 16384? It looks like a scaling factor to deal with integer approximations
> but I'm not 100% sure and I do not see how you arrived at that value.

Indeed, it is simply a fixed point math scaling factor.

I used 1024 before, but that is kind of a small number when we could
be dealing with a node that has 20% of the accesses, and a task that
used 10% CPU time.

Having the numbers a little larger could help, and certainly should
not hurt, as long as we keep the number small enough to avoid overflows.
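
As a worked example (user-space sketch, all numbers made up): for a task
that used 10% of its period and took 20% of its faults on one node, the
two candidate scaling factors give:

#include <stdio.h>

int main(void)
{
        unsigned long runtime = 100, period = 1000;             /* 10% CPU use */
        unsigned long node_faults = 20, total_faults = 100;     /* 20% of faults */

        unsigned long w1024  = (1024  * runtime * node_faults) /
                               (total_faults * period + 1);
        unsigned long w16384 = (16384 * runtime * node_faults) /
                               (total_faults * period + 1);

        printf("scale 1024:  f_weight = %lu\n", w1024);         /* 20 */
        printf("scale 16384: f_weight = %lu\n", w16384);        /* 327 */
        return 0;
}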

>> p->numa_faults_from[i] >>= 1;
>> - p->numa_faults_from[i] += p->numa_faults_from_buffer[i];
>> + p->numa_faults_from[i] += f_weight;
>> p->numa_faults_from_buffer[i] = 0;
>>
>
> numa_faults_from needs a big comment that it's no longer about the
> raw number of faults in it. It's the sum of faults measured by the group,
> weighted by CPU use.

Agreed.

--
All rights reversed

ri...@redhat.com

Jan 21, 2014, 10:30:01 PM
From: Rik van Riel <ri...@redhat.com>

Track which nodes NUMA faults are triggered from, in other words
the nodes of the CPUs on which the NUMA faults happened. This uses a similar
mechanism to what is used to track the memory involved in numa faults.

The next patches use this to build up a bitmap of which nodes a
workload is actively running on.

Cc: Peter Zijlstra <pet...@infradead.org>
Cc: Mel Gorman <mgo...@suse.de>
Cc: Ingo Molnar <mi...@redhat.com>
Cc: Chegu Vinod <chegu...@hp.com>
Signed-off-by: Rik van Riel <ri...@redhat.com>
---
include/linux/sched.h | 10 ++++++++--
kernel/sched/fair.c | 30 +++++++++++++++++++++++-------
2 files changed, 31 insertions(+), 9 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index b8f8476..d14d9fe 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1492,6 +1492,14 @@ struct task_struct {
unsigned long *numa_faults_buffer_memory;

/*
+ * Track the nodes where faults are incurred. This is not very
+ * interesting on a per-task basis, but it helps with smarter
+ * numa memory placement for groups of processes.
+ */
+ unsigned long *numa_faults_cpu;
+ unsigned long *numa_faults_buffer_cpu;
+
+ /*
* numa_faults_locality tracks if faults recorded during the last
* scan window were remote/local. The task scan period is adapted
* based on the locality of the faults with different weights
@@ -1594,8 +1602,6 @@ extern void task_numa_fault(int last_node, int node, int pages, int flags);
extern pid_t task_numa_group_id(struct task_struct *p);
extern void set_numabalancing_state(bool enabled);
extern void task_numa_free(struct task_struct *p);
-
-extern unsigned int sysctl_numa_balancing_migrate_deferred;
#else
static inline void task_numa_fault(int last_node, int node, int pages,
int flags)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 6560145..b98ed61 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -886,6 +886,7 @@ struct numa_group {

struct rcu_head rcu;
unsigned long total_faults;
+ unsigned long *faults_cpu;
unsigned long faults[0];
};

@@ -1372,10 +1373,11 @@ static void task_numa_placement(struct task_struct *p)
int priv, i;

for (priv = 0; priv < 2; priv++) {
- long diff;
+ long diff, f_diff;

i = task_faults_idx(nid, priv);
diff = -p->numa_faults_memory[i];
+ f_diff = -p->numa_faults_cpu[i];

/* Decay existing window, copy faults since last scan */
p->numa_faults_memory[i] >>= 1;
@@ -1383,12 +1385,18 @@ static void task_numa_placement(struct task_struct *p)
fault_types[priv] += p->numa_faults_buffer_memory[i];
p->numa_faults_buffer_memory[i] = 0;

+ p->numa_faults_cpu[i] >>= 1;
+ p->numa_faults_cpu[i] += p->numa_faults_buffer_cpu[i];
+ p->numa_faults_buffer_cpu[i] = 0;
+
faults += p->numa_faults_memory[i];
diff += p->numa_faults_memory[i];
+ f_diff += p->numa_faults_cpu[i];
p->total_numa_faults += diff;
if (p->numa_group) {
/* safe because we can only change our own group */
p->numa_group->faults[i] += diff;
+ p->numa_group->faults_cpu[i] += f_diff;
p->numa_group->total_faults += diff;
group_faults += p->numa_group->faults[i];
}
@@ -1457,7 +1465,7 @@ static void task_numa_group(struct task_struct *p, int cpupid, int flags,

if (unlikely(!p->numa_group)) {
unsigned int size = sizeof(struct numa_group) +
- 2*nr_node_ids*sizeof(unsigned long);
+ 4*nr_node_ids*sizeof(unsigned long);

grp = kzalloc(size, GFP_KERNEL | __GFP_NOWARN);
if (!grp)
@@ -1467,8 +1475,10 @@ static void task_numa_group(struct task_struct *p, int cpupid, int flags,
spin_lock_init(&grp->lock);
INIT_LIST_HEAD(&grp->task_list);
grp->gid = p->pid;
+ /* Second half of the array tracks nids where faults happen */
+ grp->faults_cpu = grp->faults + 2 * nr_node_ids;

- for (i = 0; i < 2*nr_node_ids; i++)
+ for (i = 0; i < 4*nr_node_ids; i++)
grp->faults[i] = p->numa_faults_memory[i];

grp->total_faults = p->total_numa_faults;
@@ -1526,7 +1536,7 @@ static void task_numa_group(struct task_struct *p, int cpupid, int flags,

double_lock(&my_grp->lock, &grp->lock);

- for (i = 0; i < 2*nr_node_ids; i++) {
+ for (i = 0; i < 4*nr_node_ids; i++) {
my_grp->faults[i] -= p->numa_faults_memory[i];
grp->faults[i] += p->numa_faults_memory[i];
}
@@ -1558,7 +1568,7 @@ void task_numa_free(struct task_struct *p)

if (grp) {
spin_lock(&grp->lock);
- for (i = 0; i < 2*nr_node_ids; i++)
+ for (i = 0; i < 4*nr_node_ids; i++)
grp->faults[i] -= p->numa_faults_memory[i];
grp->total_faults -= p->total_numa_faults;

@@ -1571,6 +1581,8 @@ void task_numa_free(struct task_struct *p)

p->numa_faults_memory = NULL;
p->numa_faults_buffer_memory = NULL;
+ p->numa_faults_cpu = NULL;
+ p->numa_faults_buffer_cpu = NULL;
kfree(numa_faults);
}

@@ -1581,6 +1593,7 @@ void task_numa_fault(int last_cpupid, int node, int pages, int flags)
{
struct task_struct *p = current;
bool migrated = flags & TNF_MIGRATED;
+ int this_node = task_node(current);
int priv;

if (!numabalancing_enabled)
@@ -1596,7 +1609,7 @@ void task_numa_fault(int last_cpupid, int node, int pages, int flags)

/* Allocate buffer to track faults on a per-node basis */
if (unlikely(!p->numa_faults_memory)) {
- int size = sizeof(*p->numa_faults_memory) * 2 * nr_node_ids;
+ int size = sizeof(*p->numa_faults_memory) * 4 * nr_node_ids;

/* numa_faults and numa_faults_buffer share the allocation */
p->numa_faults_memory = kzalloc(size * 2, GFP_KERNEL|__GFP_NOWARN);
@@ -1604,7 +1617,9 @@ void task_numa_fault(int last_cpupid, int node, int pages, int flags)
return;

BUG_ON(p->numa_faults_buffer_memory);
- p->numa_faults_buffer_memory = p->numa_faults_memory + (2 * nr_node_ids);
+ p->numa_faults_cpu = p->numa_faults_memory + (2 * nr_node_ids);
+ p->numa_faults_buffer_memory = p->numa_faults_memory + (4 * nr_node_ids);
+ p->numa_faults_buffer_cpu = p->numa_faults_memory + (6 * nr_node_ids);
p->total_numa_faults = 0;
memset(p->numa_faults_locality, 0, sizeof(p->numa_faults_locality));
}
@@ -1634,6 +1649,7 @@ void task_numa_fault(int last_cpupid, int node, int pages, int flags)
p->numa_pages_migrated += pages;

p->numa_faults_buffer_memory[task_faults_idx(node, priv)] += pages;
+ p->numa_faults_buffer_cpu[task_faults_idx(this_node, priv)] += pages;
p->numa_faults_locality[!!(flags & TNF_FAULT_LOCAL)] += pages;
}

--
1.8.4.2
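
For anyone following along, a small user-space sketch of the indexing the
arrays above use (NR_NODES and the fault counts are made up, not taken
from the patch):

/* Each array has two slots per node, one per fault type (shared or
 * private), so a fault lands at index 2 * nid + priv. */
#include <stdio.h>

#define NR_NODES 4

static int task_faults_idx(int nid, int priv)
{
        return 2 * nid + priv;
}

int main(void)
{
        long faults_buffer_cpu[2 * NR_NODES] = { 0 };

        /* e.g. a fault of one type, handled by a CPU on node 2, 8 pages */
        faults_buffer_cpu[task_faults_idx(2, 1)] += 8;

        for (int nid = 0; nid < NR_NODES; nid++)
                printf("node %d: type0=%ld type1=%ld\n", nid,
                       faults_buffer_cpu[task_faults_idx(nid, 0)],
                       faults_buffer_cpu[task_faults_idx(nid, 1)]);
        return 0;
}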

Rik van Riel

Jan 21, 2014, 10:30:02 PM
On 01/21/2014 07:21 AM, Mel Gorman wrote:
> On Mon, Jan 20, 2014 at 02:21:03PM -0500, ri...@redhat.com wrote:

>> +++ b/include/linux/sched.h
>> @@ -1492,6 +1492,14 @@ struct task_struct {
>> unsigned long *numa_faults_buffer;
>>
>> /*
>> + * Track the nodes where faults are incurred. This is not very
>> + * interesting on a per-task basis, but it help with smarter
>> + * numa memory placement for groups of processes.
>> + */
>> + unsigned long *numa_faults_from;
>> + unsigned long *numa_faults_from_buffer;
>> +
>
> As an aside I wonder if we can derive any useful metric from this

It may provide for a better way to tune the numa scan interval
than the current code, since the "local vs remote" ratio is not
going to provide us much useful info when dealing with a workload
that is spread across multiple numa nodes.

>> grp->total_faults = p->total_numa_faults;
>> @@ -1526,7 +1536,7 @@ static void task_numa_group(struct task_struct *p, int cpupid, int flags,
>>
>> double_lock(&my_grp->lock, &grp->lock);
>>
>> - for (i = 0; i < 2*nr_node_ids; i++) {
>> + for (i = 0; i < 4*nr_node_ids; i++) {
>> my_grp->faults[i] -= p->numa_faults[i];
>> grp->faults[i] += p->numa_faults[i];
>> }
>
> The same obscure trick is used throughout and I'm not sure how
> maintainable that will be. Would it be better to be explicit about this?

I have made a cleanup patch for this, using the defines you
suggested.

>> @@ -1634,6 +1649,7 @@ void task_numa_fault(int last_cpupid, int node, int pages, int flags)
>> p->numa_pages_migrated += pages;
>>
>> p->numa_faults_buffer[task_faults_idx(node, priv)] += pages;
>> + p->numa_faults_from_buffer[task_faults_idx(this_node, priv)] += pages;
>> p->numa_faults_locality[!!(flags & TNF_FAULT_LOCAL)] += pages;
>
> this_node and node is similarly ambiguous in terms of name. Rename of
> data_node and cpu_node would have been clearer.

I added a patch in the next version of the series.

Don't want to make the series too large, though :)

--
All rights reversed

ri...@redhat.com

Jan 21, 2014, 10:30:02 PM
From: Rik van Riel <ri...@redhat.com>

The numa_faults_cpu statistics are used to maintain an active_nodes nodemask
per numa_group. This allows us to be smarter about when to do numa migrations.

Cc: Peter Zijlstra <pet...@infradead.org>
Cc: Mel Gorman <mgo...@suse.de>
Cc: Ingo Molnar <mi...@redhat.com>
Cc: Chegu Vinod <chegu...@hp.com>
Signed-off-by: Rik van Riel <ri...@redhat.com>
---
kernel/sched/fair.c | 51 +++++++++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 51 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index b98ed61..d4f6df5 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -885,6 +885,7 @@ struct numa_group {
struct list_head task_list;

struct rcu_head rcu;
+ nodemask_t active_nodes;
unsigned long total_faults;
unsigned long *faults_cpu;
unsigned long faults[0];
@@ -918,6 +919,12 @@ static inline unsigned long group_faults(struct task_struct *p, int nid)
p->numa_group->faults[task_faults_idx(nid, 1)];
}

+static inline unsigned long group_faults_cpu(struct numa_group *group, int nid)
+{
+ return group->faults_cpu[task_faults_idx(nid, 0)] +
+ group->faults_cpu[task_faults_idx(nid, 1)];
+}
+
/*
* These return the fraction of accesses done by a particular task, or
* task group, on a particular numa node. The group weight is given a
@@ -1275,6 +1282,40 @@ static void numa_migrate_preferred(struct task_struct *p)
}

/*
+ * Find the nodes on which the workload is actively running. We do this by
+ * tracking the nodes from which NUMA hinting faults are triggered. This can
+ * be different from the set of nodes where the workload's memory is currently
+ * located.
+ *
+ * The bitmask is used to make smarter decisions on when to do NUMA page
+ * migrations. To prevent flip-flopping and excessive page migrations, nodes
+ * are added when they cause over 6/16 of the maximum number of faults, but
+ * only removed when they drop below 3/16.
+ */
+static void update_numa_active_node_mask(struct numa_group *numa_group)
+{
+ unsigned long faults, max_faults = 0;
+ int nid;
+
+ for_each_online_node(nid) {
+ faults = numa_group->faults_cpu[task_faults_idx(nid, 0)] +
+ numa_group->faults_cpu[task_faults_idx(nid, 1)];
+ if (faults > max_faults)
+ max_faults = faults;
+ }
+
+ for_each_online_node(nid) {
+ faults = numa_group->faults_cpu[task_faults_idx(nid, 0)] +
+ numa_group->faults_cpu[task_faults_idx(nid, 1)];
+ if (!node_isset(nid, numa_group->active_nodes)) {
+ if (faults > max_faults * 6 / 16)
+ node_set(nid, numa_group->active_nodes);
+ } else if (faults < max_faults * 3 / 16)
+ node_clear(nid, numa_group->active_nodes);
+ }
+}
+
+/*
* When adapting the scan rate, the period is divided into NUMA_PERIOD_SLOTS
* increments. The more local the fault statistics are, the higher the scan
* period will be for the next scan window. If local/remote ratio is below
@@ -1416,6 +1457,7 @@ static void task_numa_placement(struct task_struct *p)
update_task_scan_period(p, fault_types[0], fault_types[1]);

if (p->numa_group) {
+ update_numa_active_node_mask(p->numa_group);
/*
* If the preferred task and group nids are different,
* iterate over the nodes again to find the best place.
@@ -1478,6 +1520,8 @@ static void task_numa_group(struct task_struct *p, int cpupid, int flags,
/* Second half of the array tracks nids where faults happen */
grp->faults_cpu = grp->faults + 2 * nr_node_ids;

+ node_set(task_node(current), grp->active_nodes);
+
for (i = 0; i < 4*nr_node_ids; i++)
grp->faults[i] = p->numa_faults_memory[i];

@@ -1547,6 +1591,13 @@ static void task_numa_group(struct task_struct *p, int cpupid, int flags,
my_grp->nr_tasks--;
grp->nr_tasks++;

+ /*
+ * We just joined a new group, the set of active nodes may have
+ * changed. Do not update the nodemask of the old group, since
+ * the tasks in that group will probably join the new group soon.
+ */
+ update_numa_active_node_mask(grp);
+
spin_unlock(&my_grp->lock);
spin_unlock(&grp->lock);

--
1.8.4.2
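
A minimal user-space sketch of the hysteresis above, with made-up fault
counts (6/16 is 37.5% and 3/16 is 18.75%, close to the experimentally
found 40%/20%); this is only an illustration, not part of the patch:

#include <stdio.h>
#include <stdbool.h>

#define NR_NODES 4

static void update_active(unsigned long faults[NR_NODES], bool active[NR_NODES])
{
        unsigned long max_faults = 0;
        int nid;

        for (nid = 0; nid < NR_NODES; nid++)
                if (faults[nid] > max_faults)
                        max_faults = faults[nid];

        for (nid = 0; nid < NR_NODES; nid++) {
                if (!active[nid]) {
                        /* only added above 6/16 of the busiest node */
                        if (faults[nid] > max_faults * 6 / 16)
                                active[nid] = true;
                } else if (faults[nid] < max_faults * 3 / 16) {
                        /* only removed below 3/16, so nodes near the
                         * threshold do not flip-flop */
                        active[nid] = false;
                }
        }
}

int main(void)
{
        unsigned long faults[NR_NODES] = { 1000, 400, 250, 50 };
        bool active[NR_NODES] = { false };
        int nid;

        update_active(faults, active);
        for (nid = 0; nid < NR_NODES; nid++)
                printf("node %d: faults=%lu active=%d\n",
                       nid, faults[nid], active[nid]);
        return 0;       /* nodes 0 and 1 end up active (above 375 faults) */
}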

ri...@redhat.com

Jan 21, 2014, 10:30:02 PM
From: Rik van Riel <ri...@redhat.com>

Tracing the code that decides the active nodes has made it abundantly clear
that the naive implementation of the faults_from code has issues.

Specifically, the garbage collector in some workloads will access orders
of magnitudes more memory than the threads that do all the active work.
This resulted in the node with the garbage collector being marked the only
active node in the group.

This issue is avoided if we weigh the statistics by CPU use of each task in
the numa group, instead of by how many faults each thread has incurred.

To achieve this, we normalize the number of faults to the fraction of faults
that occurred on each node, and then multiply that fraction by the fraction
of CPU time the task has used since the last time task_numa_placement was
invoked.

This way the nodes in the active node mask will be the ones where the tasks
from the numa group are most actively running, and the influence of e.g. the
garbage collector and other do-little threads is properly minimized.

On a 4 node system, using CPU use statistics calculated over a longer interval
results in about 1% fewer page migrations with two 32-warehouse specjbb runs,
and about 5% fewer page migrations, as well as 1% better throughput, with two
8-warehouse specjbb runs, compared with the shorter term statistics kept by
the scheduler.

Cc: Peter Zijlstra <pet...@infradead.org>
Cc: Mel Gorman <mgo...@suse.de>
Cc: Ingo Molnar <mi...@redhat.com>
Cc: Chegu Vinod <chegu...@hp.com>
Signed-off-by: Rik van Riel <ri...@redhat.com>
---
include/linux/sched.h | 2 ++
kernel/sched/core.c | 2 ++
kernel/sched/fair.c | 53 +++++++++++++++++++++++++++++++++++++++++++++++++--
3 files changed, 55 insertions(+), 2 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 0bf4ac2..2b3cd41 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1471,6 +1471,8 @@ struct task_struct {
int numa_preferred_nid;
unsigned long numa_migrate_retry;
u64 node_stamp; /* migration stamp */
+ u64 last_task_numa_placement;
+ u64 last_sum_exec_runtime;
struct callback_head numa_work;

struct list_head numa_entry;
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 224871f..1bd0727 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1758,6 +1758,8 @@ static void __sched_fork(unsigned long clone_flags, struct task_struct *p)
p->numa_work.next = &p->numa_work;
p->numa_faults_memory = NULL;
p->numa_faults_buffer_memory = NULL;
+ p->last_task_numa_placement = 0;
+ p->last_sum_exec_runtime = 0;

INIT_LIST_HEAD(&p->numa_entry);
p->numa_group = NULL;
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index ddbf20c..bd2100c 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -887,6 +887,11 @@ struct numa_group {
struct rcu_head rcu;
nodemask_t active_nodes;
unsigned long total_faults;
+ /*
+ * Faults_cpu is used to decide whether memory should move
+ * towards the CPU. As a consequence, these stats are weighted
+ * more by CPU use than by memory faults.
+ */
unsigned long *faults_cpu;
unsigned long faults[0];
};
@@ -1451,11 +1456,41 @@ static void update_task_scan_period(struct task_struct *p,
memset(p->numa_faults_locality, 0, sizeof(p->numa_faults_locality));
}

+
static void task_numa_placement(struct task_struct *p)
{
int seq, nid, max_nid = -1, max_group_nid = -1;
unsigned long max_faults = 0, max_group_faults = 0;
unsigned long fault_types[2] = { 0, 0 };
+ unsigned long total_faults;
+ u64 runtime, period;
spinlock_t *group_lock = NULL;

seq = ACCESS_ONCE(p->mm->numa_scan_seq);
@@ -1464,6 +1499,10 @@ static void task_numa_placement(struct task_struct *p)
p->numa_scan_seq = seq;
p->numa_scan_period_max = task_scan_max(p);

+ total_faults = p->numa_faults_locality[0] +
+ p->numa_faults_locality[1] + 1;
+ runtime = numa_get_avg_runtime(p, &period);
+
/* If the task is part of a group prevent parallel updates to group stats */
if (p->numa_group) {
group_lock = &p->numa_group->lock;
@@ -1476,7 +1515,7 @@ static void task_numa_placement(struct task_struct *p)
int priv, i;

for (priv = 0; priv < 2; priv++) {
- long diff, f_diff;
+ long diff, f_diff, f_weight;

i = task_faults_idx(nid, priv);
diff = -p->numa_faults_memory[i];
@@ -1488,8 +1527,18 @@ static void task_numa_placement(struct task_struct *p)
fault_types[priv] += p->numa_faults_buffer_memory[i];
p->numa_faults_buffer_memory[i] = 0;

+ /*
+ * Normalize the faults_from, so all tasks in a group
+ * count according to CPU use, instead of by the raw
+ * number of faults. Tasks with little runtime have
+ * little over-all impact on throughput, and thus their
+ * faults are less important.
+ */
+ f_weight = (16384 * runtime *
+ p->numa_faults_buffer_cpu[i]) /
+ (total_faults * period + 1);
p->numa_faults_cpu[i] >>= 1;
- p->numa_faults_cpu[i] += p->numa_faults_buffer_cpu[i];
+ p->numa_faults_cpu[i] += f_weight;
p->numa_faults_buffer_cpu[i] = 0;

faults += p->numa_faults_memory[i];

ri...@redhat.com

Jan 21, 2014, 10:30:02 PM
From: Rik van Riel <ri...@redhat.com>

In order to get a more consistent naming scheme, making it clear
which fault statistics track memory locality, and which track
CPU locality, rename the memory fault statistics.

Suggested-by: Mel Gorman <mgo...@suse.de>
Signed-off-by: Rik van Riel <ri...@redhat.com>
---
include/linux/sched.h | 8 ++++----
kernel/sched/core.c | 4 ++--
kernel/sched/debug.c | 6 +++---
kernel/sched/fair.c | 56 +++++++++++++++++++++++++--------------------------
4 files changed, 37 insertions(+), 37 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 97efba4..b8f8476 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1481,15 +1481,15 @@ struct task_struct {
* Scheduling placement decisions are made based on the these counts.
* The values remain static for the duration of a PTE scan
*/
- unsigned long *numa_faults;
+ unsigned long *numa_faults_memory;
unsigned long total_numa_faults;

/*
* numa_faults_buffer records faults per node during the current
- * scan window. When the scan completes, the counts in numa_faults
- * decay and these values are copied.
+ * scan window. When the scan completes, the counts in
+ * numa_faults_memory decay and these values are copied.
*/
- unsigned long *numa_faults_buffer;
+ unsigned long *numa_faults_buffer_memory;

/*
* numa_faults_locality tracks if faults recorded during the last
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 7f45fd5..224871f 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1756,8 +1756,8 @@ static void __sched_fork(unsigned long clone_flags, struct task_struct *p)
p->numa_scan_seq = p->mm ? p->mm->numa_scan_seq : 0;
p->numa_scan_period = sysctl_numa_balancing_scan_delay;
p->numa_work.next = &p->numa_work;
- p->numa_faults = NULL;
- p->numa_faults_buffer = NULL;
+ p->numa_faults_memory = NULL;
+ p->numa_faults_buffer_memory = NULL;

INIT_LIST_HEAD(&p->numa_entry);
p->numa_group = NULL;
diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
index dd52e7f..31b908d 100644
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -533,15 +533,15 @@ static void sched_show_numa(struct task_struct *p, struct seq_file *m)
unsigned long nr_faults = -1;
int cpu_current, home_node;

- if (p->numa_faults)
- nr_faults = p->numa_faults[2*node + i];
+ if (p->numa_faults_memory)
+ nr_faults = p->numa_faults_memory[2*node + i];

cpu_current = !i ? (task_node(p) == node) :
(pol && node_isset(node, pol->v.nodes));

home_node = (p->numa_preferred_nid == node);

- SEQ_printf(m, "numa_faults, %d, %d, %d, %d, %ld\n",
+ SEQ_printf(m, "numa_faults_memory, %d, %d, %d, %d, %ld\n",
i, node, cpu_current, home_node, nr_faults);
}
}
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 41e2176..6560145 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -901,11 +901,11 @@ static inline int task_faults_idx(int nid, int priv)

static inline unsigned long task_faults(struct task_struct *p, int nid)
{
- if (!p->numa_faults)
+ if (!p->numa_faults_memory)
return 0;

- return p->numa_faults[task_faults_idx(nid, 0)] +
- p->numa_faults[task_faults_idx(nid, 1)];
+ return p->numa_faults_memory[task_faults_idx(nid, 0)] +
+ p->numa_faults_memory[task_faults_idx(nid, 1)];
}

static inline unsigned long group_faults(struct task_struct *p, int nid)
@@ -927,7 +927,7 @@ static inline unsigned long task_weight(struct task_struct *p, int nid)
{
unsigned long total_faults;

- if (!p->numa_faults)
+ if (!p->numa_faults_memory)
return 0;

total_faults = p->total_numa_faults;
@@ -1259,7 +1259,7 @@ static int task_numa_migrate(struct task_struct *p)
static void numa_migrate_preferred(struct task_struct *p)
{
/* This task has no NUMA fault statistics yet */
- if (unlikely(p->numa_preferred_nid == -1 || !p->numa_faults))
+ if (unlikely(p->numa_preferred_nid == -1 || !p->numa_faults_memory))
return;

/* Periodically retry migrating the task to the preferred node */
@@ -1375,16 +1375,16 @@ static void task_numa_placement(struct task_struct *p)
long diff;

i = task_faults_idx(nid, priv);
- diff = -p->numa_faults[i];
+ diff = -p->numa_faults_memory[i];

/* Decay existing window, copy faults since last scan */
- p->numa_faults[i] >>= 1;
- p->numa_faults[i] += p->numa_faults_buffer[i];
- fault_types[priv] += p->numa_faults_buffer[i];
- p->numa_faults_buffer[i] = 0;
+ p->numa_faults_memory[i] >>= 1;
+ p->numa_faults_memory[i] += p->numa_faults_buffer_memory[i];
+ fault_types[priv] += p->numa_faults_buffer_memory[i];
+ p->numa_faults_buffer_memory[i] = 0;

- faults += p->numa_faults[i];
- diff += p->numa_faults[i];
+ faults += p->numa_faults_memory[i];
+ diff += p->numa_faults_memory[i];
p->total_numa_faults += diff;
if (p->numa_group) {
/* safe because we can only change our own group */
@@ -1469,7 +1469,7 @@ static void task_numa_group(struct task_struct *p, int cpupid, int flags,
grp->gid = p->pid;

for (i = 0; i < 2*nr_node_ids; i++)
- grp->faults[i] = p->numa_faults[i];
+ grp->faults[i] = p->numa_faults_memory[i];

grp->total_faults = p->total_numa_faults;

@@ -1527,8 +1527,8 @@ static void task_numa_group(struct task_struct *p, int cpupid, int flags,
double_lock(&my_grp->lock, &grp->lock);

for (i = 0; i < 2*nr_node_ids; i++) {
- my_grp->faults[i] -= p->numa_faults[i];
- grp->faults[i] += p->numa_faults[i];
+ my_grp->faults[i] -= p->numa_faults_memory[i];
+ grp->faults[i] += p->numa_faults_memory[i];
}
my_grp->total_faults -= p->total_numa_faults;
grp->total_faults += p->total_numa_faults;
@@ -1554,12 +1554,12 @@ void task_numa_free(struct task_struct *p)
{
struct numa_group *grp = p->numa_group;
int i;
- void *numa_faults = p->numa_faults;
+ void *numa_faults = p->numa_faults_memory;

if (grp) {
spin_lock(&grp->lock);
for (i = 0; i < 2*nr_node_ids; i++)
- grp->faults[i] -= p->numa_faults[i];
+ grp->faults[i] -= p->numa_faults_memory[i];
grp->total_faults -= p->total_numa_faults;

list_del(&p->numa_entry);
@@ -1569,8 +1569,8 @@ void task_numa_free(struct task_struct *p)
put_numa_group(grp);
}

- p->numa_faults = NULL;
- p->numa_faults_buffer = NULL;
+ p->numa_faults_memory = NULL;
+ p->numa_faults_buffer_memory = NULL;
kfree(numa_faults);
}

@@ -1595,16 +1595,16 @@ void task_numa_fault(int last_cpupid, int node, int pages, int flags)
return;

/* Allocate buffer to track faults on a per-node basis */
- if (unlikely(!p->numa_faults)) {
- int size = sizeof(*p->numa_faults) * 2 * nr_node_ids;
+ if (unlikely(!p->numa_faults_memory)) {
+ int size = sizeof(*p->numa_faults_memory) * 2 * nr_node_ids;

/* numa_faults and numa_faults_buffer share the allocation */
- p->numa_faults = kzalloc(size * 2, GFP_KERNEL|__GFP_NOWARN);
- if (!p->numa_faults)
+ p->numa_faults_memory = kzalloc(size * 2, GFP_KERNEL|__GFP_NOWARN);
+ if (!p->numa_faults_memory)
return;

- BUG_ON(p->numa_faults_buffer);
- p->numa_faults_buffer = p->numa_faults + (2 * nr_node_ids);
+ BUG_ON(p->numa_faults_buffer_memory);
+ p->numa_faults_buffer_memory = p->numa_faults_memory + (2 * nr_node_ids);
p->total_numa_faults = 0;
memset(p->numa_faults_locality, 0, sizeof(p->numa_faults_locality));
}
@@ -1633,7 +1633,7 @@ void task_numa_fault(int last_cpupid, int node, int pages, int flags)
if (migrated)
p->numa_pages_migrated += pages;

- p->numa_faults_buffer[task_faults_idx(node, priv)] += pages;
+ p->numa_faults_buffer_memory[task_faults_idx(node, priv)] += pages;
p->numa_faults_locality[!!(flags & TNF_FAULT_LOCAL)] += pages;
}

@@ -4781,7 +4781,7 @@ static bool migrate_improves_locality(struct task_struct *p, struct lb_env *env)
{
int src_nid, dst_nid;

- if (!sched_feat(NUMA_FAVOUR_HIGHER) || !p->numa_faults ||
+ if (!sched_feat(NUMA_FAVOUR_HIGHER) || !p->numa_faults_memory ||
!(env->sd->flags & SD_NUMA)) {
return false;
}
@@ -4812,7 +4812,7 @@ static bool migrate_degrades_locality(struct task_struct *p, struct lb_env *env)
if (!sched_feat(NUMA) || !sched_feat(NUMA_RESIST_LOWER))
return false;

- if (!p->numa_faults || !(env->sd->flags & SD_NUMA))
+ if (!p->numa_faults_memory || !(env->sd->flags & SD_NUMA))
return false;

src_nid = cpu_to_node(env->src_cpu);

ri...@redhat.com

Jan 21, 2014, 10:30:03 PM
From: Rik van Riel <ri...@redhat.com>

We track both the node of the memory after a NUMA fault, and the node
of the CPU on which the fault happened. Rename the local variables in
task_numa_fault to make things more explicit.

Suggested-by: Mel Gorman <mgo...@suse.de>
Signed-off-by: Rik van Riel <ri...@redhat.com>
---
kernel/sched/fair.c | 8 ++++----
1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index f713f3a..43ca8c4 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1747,11 +1747,11 @@ void task_numa_free(struct task_struct *p)
/*
* Got a PROT_NONE fault for a page on @node.
*/
-void task_numa_fault(int last_cpupid, int node, int pages, int flags)
+void task_numa_fault(int last_cpupid, int mem_node, int pages, int flags)
{
struct task_struct *p = current;
bool migrated = flags & TNF_MIGRATED;
- int this_node = task_node(current);
+ int cpu_node = task_node(current);
int priv;

if (!numabalancing_enabled)
@@ -1806,8 +1806,8 @@ void task_numa_fault(int last_cpupid, int node, int pages, int flags)
if (migrated)
p->numa_pages_migrated += pages;

- p->numa_faults_buffer_memory[task_faults_idx(node, priv)] += pages;
- p->numa_faults_buffer_cpu[task_faults_idx(this_node, priv)] += pages;
+ p->numa_faults_buffer_memory[task_faults_idx(mem_node, priv)] += pages;
+ p->numa_faults_buffer_cpu[task_faults_idx(cpu_node, priv)] += pages;
p->numa_faults_locality[!!(flags & TNF_FAULT_LOCAL)] += pages;
}

ri...@redhat.com

Jan 21, 2014, 10:30:03 PM
From: Rik van Riel <ri...@redhat.com>

Cleanup suggested by Mel Gorman. Now the code contains some more
hints on what statistics go where.

Suggested-by: Mel Gorman <mgo...@suse.de>
Signed-off-by: Rik van Riel <ri...@redhat.com>
---
kernel/sched/fair.c | 34 +++++++++++++++++++++++++---------
1 file changed, 25 insertions(+), 9 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 43ca8c4..6b9d27c 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -896,6 +896,15 @@ struct numa_group {
unsigned long faults[0];
};

+/* Shared or private faults. */
+#define NR_NUMA_HINT_FAULT_TYPES 2
+
+/* Memory and CPU locality */
+#define NR_NUMA_HINT_FAULT_STATS (NR_NUMA_HINT_FAULT_TYPES * 2)
+
+/* Averaged statistics, and temporary buffers. */
+#define NR_NUMA_HINT_BUCKETS (NUMA_HINT_FAULT_STATS * 2)
+
pid_t task_numa_group_id(struct task_struct *p)
{
return p->numa_group ? p->numa_group->gid : 0;
@@ -903,7 +912,7 @@ pid_t task_numa_group_id(struct task_struct *p)

static inline int task_faults_idx(int nid, int priv)
{
- return 2 * nid + priv;
+ return NR_NUMA_HINT_FAULT_TYPES * nid + priv;
}

static inline unsigned long task_faults(struct task_struct *p, int nid)
@@ -1514,7 +1523,7 @@ static void task_numa_placement(struct task_struct *p)
unsigned long faults = 0, group_faults = 0;
int priv, i;

- for (priv = 0; priv < 2; priv++) {
+ for (priv = 0; priv < NR_NUMA_HINT_FAULT_TYPES; priv++) {
long diff, f_diff, f_weight;

i = task_faults_idx(nid, priv);
@@ -1625,11 +1634,12 @@ static void task_numa_group(struct task_struct *p, int cpupid, int flags,
INIT_LIST_HEAD(&grp->task_list);
grp->gid = p->pid;
/* Second half of the array tracks nids where faults happen */
- grp->faults_cpu = grp->faults + 2 * nr_node_ids;
+ grp->faults_cpu = grp->faults + NR_NUMA_HINT_FAULT_TYPES *
+ nr_node_ids;

node_set(task_node(current), grp->active_nodes);

- for (i = 0; i < 4*nr_node_ids; i++)
+ for (i = 0; i < NR_NUMA_HINT_FAULT_STATS * nr_node_ids; i++)
grp->faults[i] = p->numa_faults_memory[i];

grp->total_faults = p->total_numa_faults;
@@ -1687,7 +1697,7 @@ static void task_numa_group(struct task_struct *p, int cpupid, int flags,

double_lock(&my_grp->lock, &grp->lock);

- for (i = 0; i < 4*nr_node_ids; i++) {
+ for (i = 0; i < NR_NUMA_HINT_FAULT_STATS * nr_node_ids; i++) {
my_grp->faults[i] -= p->numa_faults_memory[i];
grp->faults[i] += p->numa_faults_memory[i];
}
@@ -1726,7 +1736,7 @@ void task_numa_free(struct task_struct *p)

if (grp) {
spin_lock(&grp->lock);
- for (i = 0; i < 4*nr_node_ids; i++)
+ for (i = 0; i < NR_NUMA_HINT_FAULT_STATS * nr_node_ids; i++)
grp->faults[i] -= p->numa_faults_memory[i];
grp->total_faults -= p->total_numa_faults;

@@ -1767,14 +1777,20 @@ void task_numa_fault(int last_cpupid, int mem_node, int pages, int flags)

/* Allocate buffer to track faults on a per-node basis */
if (unlikely(!p->numa_faults_memory)) {
- int size = sizeof(*p->numa_faults_memory) * 4 * nr_node_ids;
+ int size = sizeof(*p->numa_faults_memory) *
+ NR_NUMA_HINT_FAULT_BUCKETS * nr_node_ids;

- /* numa_faults and numa_faults_buffer share the allocation */
- p->numa_faults_memory = kzalloc(size * 2, GFP_KERNEL|__GFP_NOWARN);
+ p->numa_faults_memory = kzalloc(size, GFP_KERNEL|__GFP_NOWARN);
if (!p->numa_faults_memory)
return;

BUG_ON(p->numa_faults_buffer_memory);
+ /*
+ * The averaged statistics, shared & private, memory & cpu,
+ * occupy the first half of the array. The second half of the
+ * array is for current counters, which are averaged into the
+ * first set by task_numa_placement.
+ */
p->numa_faults_cpu = p->numa_faults_memory + (2 * nr_node_ids);
p->numa_faults_buffer_memory = p->numa_faults_memory + (4 * nr_node_ids);
p->numa_faults_buffer_cpu = p->numa_faults_memory + (6 * nr_node_ids);
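
A tiny sketch of how the single allocation above is carved up (the
nr_node_ids value is just an example):

#include <stdio.h>

int main(void)
{
        int nr_node_ids = 4;
        int per_view = 2 * nr_node_ids;         /* shared + private, per node */

        printf("numa_faults_memory        at index %d\n", 0 * per_view);
        printf("numa_faults_cpu           at index %d\n", 1 * per_view);
        printf("numa_faults_buffer_memory at index %d\n", 2 * per_view);
        printf("numa_faults_buffer_cpu    at index %d\n", 3 * per_view);
        printf("total entries in the allocation: %d\n", 4 * per_view);
        return 0;
}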

ri...@redhat.com

Jan 21, 2014, 10:30:02 PM
From: Rik van Riel <ri...@redhat.com>

Excessive migration of pages can hurt the performance of workloads
that span multiple NUMA nodes. However, it turns out that the
p->numa_migrate_deferred knob is a really big hammer, which does
reduce migration rates, but does not actually help performance.

Now that the second stage of the automatic numa balancing code
has stabilized, it is time to replace the simplistic migration
deferral code with something smarter.

Cc: Peter Zijlstra <pet...@infradead.org>
Cc: Mel Gorman <mgo...@suse.de>
Cc: Ingo Molnar <mi...@redhat.com>
Cc: Chegu Vinod <chegu...@hp.com>
Acked-by: Mel Gorman <mgo...@suse.de>
Signed-off-by: Rik van Riel <ri...@redhat.com>
---
Documentation/sysctl/kernel.txt | 10 +--------
include/linux/sched.h | 1 -
kernel/sched/fair.c | 8 --------
kernel/sysctl.c | 7 -------
mm/mempolicy.c | 45 -----------------------------------------
5 files changed, 1 insertion(+), 70 deletions(-)

diff --git a/Documentation/sysctl/kernel.txt b/Documentation/sysctl/kernel.txt
index ee9a2f9..8c78658 100644
--- a/Documentation/sysctl/kernel.txt
+++ b/Documentation/sysctl/kernel.txt
@@ -399,8 +399,7 @@ feature should be disabled. Otherwise, if the system overhead from the
feature is too high then the rate the kernel samples for NUMA hinting
faults may be controlled by the numa_balancing_scan_period_min_ms,
numa_balancing_scan_delay_ms, numa_balancing_scan_period_max_ms,
-numa_balancing_scan_size_mb, numa_balancing_settle_count sysctls and
-numa_balancing_migrate_deferred.
+numa_balancing_scan_size_mb, and numa_balancing_settle_count sysctls.

==============================================================

@@ -441,13 +440,6 @@ rate for each task.
numa_balancing_scan_size_mb is how many megabytes worth of pages are
scanned for a given scan.

-numa_balancing_migrate_deferred is how many page migrations get skipped
-unconditionally, after a page migration is skipped because a page is shared
-with other tasks. This reduces page migration overhead, and determines
-how much stronger the "move task near its memory" policy scheduler becomes,
-versus the "move memory near its task" memory management policy, for workloads
-with shared memory.
-
==============================================================

osrelease, ostype & version:
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 68a0e84..97efba4 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1469,7 +1469,6 @@ struct task_struct {
unsigned int numa_scan_period;
unsigned int numa_scan_period_max;
int numa_preferred_nid;
- int numa_migrate_deferred;
unsigned long numa_migrate_retry;
u64 node_stamp; /* migration stamp */
struct callback_head numa_work;
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 867b0a4..41e2176 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 36cb46c..052abac 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
if (!cpupid_pid_unset(last_cpupid) && cpupid_to_nid(last_cpupid) != thisnid) {
-
- /* See sysctl_numa_balancing_migrate_deferred comment */
- if (!cpupid_match_pid(current, last_cpupid))
- defer_numa_migrate(current);
-
goto out;
}
-
- /*
- * The quadratic filter above reduces extraneous migration
- * of shared pages somewhat. This code reduces it even more,
- * reducing the overhead of page migrations of shared pages.
- * This makes workloads with shared pages rely more on
- * "move task near its memory", and less on "move memory
- * towards its task", which is exactly what we want.
- */
- if (numa_migrate_deferred(current, last_cpupid))
- goto out;
}

if (curnid != polnid)

ri...@redhat.com

Jan 21, 2014, 10:30:03 PM
The current automatic NUMA balancing code base has issues with
workloads that do not fit on one NUMA node. Page migration is
slowed down, but memory distribution between the nodes where
the workload runs is essentially random, often resulting in a
suboptimal amount of memory bandwidth being available to the
workload.

In order to maximize performance of workloads that do not fit in one NUMA
node, we want to satisfy the following criteria:
1) keep private memory local to each thread
2) avoid excessive NUMA migration of pages
3) distribute shared memory across the active nodes, to
maximize memory bandwidth available to the workload

This patch series identifies the NUMA nodes on which the workload
is actively running, and balances (somewhat lazily) the memory
between those nodes, satisfying the criteria above.

As usual, the series has had some performance testing, but it
could always benefit from more testing, on other systems.

Changes since v3:
- various code cleanups suggested by Mel Gorman (some in their own patches)
- after some testing, switch back to the NUMA specific CPU use stats,
since that results in a 1% performance increase for two 8-warehouse
specjbb instances on a 4-node system, and reduced page migration across
the board

ri...@redhat.com

Jan 21, 2014, 10:30:03 PM
From: Rik van Riel <ri...@redhat.com>

The current code in task_numa_placement calculates the difference
between the old and the new value, but also temporarily stores half
of the old value in the per-process variables.

The NUMA balancing code looks at those per-process variables, and
having other tasks temporarily see halved statistics could lead to
unwanted numa migrations. This can be avoided by doing all the math
in local variables.

This change also simplifies the code a little.

Cc: Peter Zijlstra <pet...@infradead.org>
Cc: Mel Gorman <mgo...@suse.de>
Cc: Ingo Molnar <mi...@redhat.com>
Cc: Chegu Vinod <chegu...@hp.com>
Acked-by: Mel Gorman <mgo...@suse.de>
Signed-off-by: Rik van Riel <ri...@redhat.com>
---
kernel/sched/fair.c | 12 ++++--------
1 file changed, 4 insertions(+), 8 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index bd2100c..f713f3a 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1518,12 +1518,9 @@ static void task_numa_placement(struct task_struct *p)
long diff, f_diff, f_weight;

i = task_faults_idx(nid, priv);
- diff = -p->numa_faults_memory[i];
- f_diff = -p->numa_faults_cpu[i];

/* Decay existing window, copy faults since last scan */
- p->numa_faults_memory[i] >>= 1;
- p->numa_faults_memory[i] += p->numa_faults_buffer_memory[i];
+ diff = p->numa_faults_buffer_memory[i] - p->numa_faults_memory[i] / 2;
fault_types[priv] += p->numa_faults_buffer_memory[i];
p->numa_faults_buffer_memory[i] = 0;

@@ -1537,13 +1534,12 @@ static void task_numa_placement(struct task_struct *p)
f_weight = (16384 * runtime *
p->numa_faults_buffer_cpu[i]) /
(total_faults * period + 1);
- p->numa_faults_cpu[i] >>= 1;
- p->numa_faults_cpu[i] += f_weight;
+ f_diff = f_weight - p->numa_faults_cpu[i] / 2;
p->numa_faults_buffer_cpu[i] = 0;

+ p->numa_faults_memory[i] += diff;
+ p->numa_faults_cpu[i] += f_diff;
faults += p->numa_faults_memory[i];
- diff += p->numa_faults_memory[i];
- f_diff += p->numa_faults_cpu[i];
p->total_numa_faults += diff;
if (p->numa_group) {
/* safe because we can only change our own group */
--
1.8.4.2
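
A worked example of the rewrite, with made-up numbers (user-space sketch):
both forms end up with the same value, but the new form never stores the
halved intermediate where other tasks could observe it.

#include <stdio.h>

int main(void)
{
        unsigned long old = 800, buffer = 300;

        /* old style: decay in place, then add the new scan window */
        unsigned long tmp = old;
        tmp >>= 1;                      /* other tasks could briefly see 400 */
        tmp += buffer;

        /* new style: compute the delta in a local, apply it once */
        long diff = (long)buffer - (long)(old / 2);
        unsigned long new_val = old + diff;

        printf("old style: %lu, new style: %lu\n", tmp, new_val);  /* 700 700 */
        return 0;
}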

ri...@redhat.com

Jan 21, 2014, 10:30:03 PM
From: Rik van Riel <ri...@redhat.com>

Use the active_nodes nodemask to make smarter decisions on NUMA migrations.

In order to maximize performance of workloads that do not fit in one NUMA
node, we want to satisfy the following criteria:
1) keep private memory local to each thread
2) avoid excessive NUMA migration of pages
3) distribute shared memory across the active nodes, to
maximize memory bandwidth available to the workload

This patch accomplishes that by implementing the following policy for
NUMA migrations:
1) always migrate on a private fault
2) never migrate to a node that is not in the set of active nodes
for the numa_group
3) always migrate from a node outside of the set of active nodes,
to a node that is in that set
4) within the set of active nodes in the numa_group, only migrate
from a node with more NUMA page faults, to a node with fewer
NUMA page faults, with a 25% margin to avoid ping-ponging

This results in most pages of a workload ending up on the actively
used nodes, with reduced ping-ponging of pages between those nodes.

Cc: Peter Zijlstra <pet...@infradead.org>
Cc: Mel Gorman <mgo...@suse.de>
Cc: Ingo Molnar <mi...@redhat.com>
Cc: Chegu Vinod <chegu...@hp.com>
Signed-off-by: Rik van Riel <ri...@redhat.com>
---
include/linux/sched.h | 7 ++++++
kernel/sched/fair.c | 62 +++++++++++++++++++++++++++++++++++++++++++++++++++
mm/mempolicy.c | 29 +-----------------------
3 files changed, 70 insertions(+), 28 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index d14d9fe..0bf4ac2 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1602,6 +1602,8 @@ extern void task_numa_fault(int last_node, int node, int pages, int flags);
extern pid_t task_numa_group_id(struct task_struct *p);
extern void set_numabalancing_state(bool enabled);
extern void task_numa_free(struct task_struct *p);
+extern bool should_numa_migrate_memory(struct task_struct *p, struct page *page,
+ int src_nid, int dst_nid);
#else
static inline void task_numa_fault(int last_node, int node, int pages,
int flags)
@@ -1617,6 +1619,11 @@ static inline void set_numabalancing_state(bool enabled)
static inline void task_numa_free(struct task_struct *p)
{
}
+static inline bool should_numa_migrate_memory(struct task_struct *p,
+ struct page *page, int src_nid, int dst_nid)
+{
+ return true;
+}
#endif

static inline struct pid *task_pid(struct task_struct *task)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index d4f6df5..ddbf20c 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -954,6 +954,68 @@ static inline unsigned long group_weight(struct task_struct *p, int nid)
return 1000 * group_faults(p, nid) / p->numa_group->total_faults;
}

+bool should_numa_migrate_memory(struct task_struct *p, struct page * page,
+ int src_nid, int dst_nid)
+{
+ struct numa_group *ng = p->numa_group;
+ int last_cpupid, this_cpupid;
+
+ this_cpupid = cpu_pid_to_cpupid(dst_nid, current->pid);
+ last_cpupid = page_cpupid_xchg_last(page, this_cpupid);
+
+ /*
+ * Multi-stage node selection is used in conjunction with a periodic
+ * migration fault to build a temporal task<->page relation. By using
+ * a two-stage filter we remove short/unlikely relations.
+ *
+ * Using P(p) ~ n_p / n_t as per frequentist probability, we can equate
+ * a task's usage of a particular page (n_p) per total usage of this
+ * page (n_t) (in a given time-span) to a probability.
+ *
+ * Our periodic faults will sample this probability and getting the
+ * same result twice in a row, given these samples are fully
+ * independent, is then given by P(n)^2, provided our sample period
+ * is sufficiently short compared to the usage pattern.
+ *
+ * This quadric squishes small probabilities, making it less likely we
+ * act on an unlikely task<->page relation.
+ */
+ if (!cpupid_pid_unset(last_cpupid) &&
+ cpupid_to_nid(last_cpupid) != dst_nid)
+ return false;
+
+ /* Always allow migrate on private faults */
+ if (cpupid_match_pid(p, last_cpupid))
+ return true;
+
+ /* A shared fault, but p->numa_group has not been set up yet. */
+ if (!ng)
+ return true;
+
+ /*
+ * Do not migrate if the destination is not a node that
+ * is actively used by this numa group.
+ */
+ if (!node_isset(dst_nid, ng->active_nodes))
+ return false;
+
+ /*
+ * Source is a node that is not actively used by this
+ * numa group, while the destination is. Migrate.
+ */
+ if (!node_isset(src_nid, ng->active_nodes))
+ return true;
+
+ /*
+ * Both source and destination are nodes in active
+ * use by this numa group. Maximize memory bandwidth
+ * by migrating from more heavily used groups, to less
+ * heavily used ones, spreading the load around.
+ * Use a 1/4 hysteresis to avoid spurious page movement.
+ */
+ return group_faults(p, dst_nid) < (group_faults(p, src_nid) * 3 / 4);
+}
+
static unsigned long weighted_cpuload(const int cpu);
static unsigned long source_load(int cpu, int type);
static unsigned long target_load(int cpu, int type);
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 052abac..95cdfe5 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -2374,37 +2374,10 @@ int mpol_misplaced(struct page *page, struct vm_area_struct *vma, unsigned long

/* Migrate the page towards the node whose CPU is referencing it */
if (pol->flags & MPOL_F_MORON) {
- int last_cpupid;
- int this_cpupid;
-
polnid = thisnid;
- this_cpupid = cpu_pid_to_cpupid(thiscpu, current->pid);

- /*
- * Multi-stage node selection is used in conjunction
- * with a periodic migration fault to build a temporal
- * task<->page relation. By using a two-stage filter we
- * remove short/unlikely relations.
- *
- * Using P(p) ~ n_p / n_t as per frequentist
- * probability, we can equate a task's usage of a
- * particular page (n_p) per total usage of this
- * page (n_t) (in a given time-span) to a probability.
- *
- * Our periodic faults will sample this probability and
- * getting the same result twice in a row, given these
- * samples are fully independent, is then given by
- * P(n)^2, provided our sample period is sufficiently
- * short compared to the usage pattern.
- *
- * This quadric squishes small probabilities, making
- * it less likely we act on an unlikely task<->page
- * relation.
- */
- last_cpupid = page_cpupid_xchg_last(page, this_cpupid);
- if (!cpupid_pid_unset(last_cpupid) && cpupid_to_nid(last_cpupid) != thisnid) {
+ if (!should_numa_migrate_memory(current, page, curnid, polnid))
goto out;
- }
}

if (curnid != polnid)
--
1.8.4.2
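
The decision logic above is easier to see with a small standalone sketch.
Everything below is made up for illustration (group_faults_stub() and the
fault counts are stand-ins, not part of the patch); the point is only the
1/4 hysteresis on the final check. The two-stage filter earlier in the
function works because, if a task uses a page with probability p, two
consecutive hinting faults landing on the same destination node happen
with probability roughly p^2, so weak task<->page relations rarely trigger
a migration.

#include <stdbool.h>
#include <stdio.h>

/* Made-up per-node group fault counts, standing in for group_faults(). */
static long group_faults_stub(int nid)
{
	static const long faults[] = { 800, 500, 100, 90 };
	return faults[nid];
}

/*
 * Sketch of the last check in should_numa_migrate_memory(): both nodes are
 * in the group's active set, so only migrate towards the node with clearly
 * fewer group faults (1/4 hysteresis against ping-ponging pages).
 */
static bool migrate_between_active_nodes(int src_nid, int dst_nid)
{
	return group_faults_stub(dst_nid) < group_faults_stub(src_nid) * 3 / 4;
}

int main(void)
{
	/* 500 < 800 * 3 / 4 = 600, so migrate; 100 < 90 * 3 / 4 = 67 fails, so stay. */
	printf("node 0 -> 1: %d\n", migrate_between_active_nodes(0, 1));
	printf("node 3 -> 2: %d\n", migrate_between_active_nodes(3, 2));
	return 0;
}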

Rik van Riel

Jan 22, 2014, 12:00:02 AM
On Tue, 21 Jan 2014 17:20:11 -0500
ri...@redhat.com wrote:

> From: Rik van Riel <ri...@redhat.com>
>
> Cleanup suggested by Mel Gorman. Now the code contains some more
> hints on what statistics go where.
>
> Suggested-by: Mel Gorman <mgo...@suse.de>
> Signed-off-by: Rik van Riel <ri...@redhat.com>

... and patch fatigue hit, causing me to mess up this simple
patch with no functional changes (which is why I did not test
it yet).

Here is one that compiles.

---8<---

Subject: numa,sched: define some magic numbers

Cleanup suggested by Mel Gorman. Now the code contains some more
hints on what statistics go where.

Suggested-by: Mel Gorman <mgo...@suse.de>
Signed-off-by: Rik van Riel <ri...@redhat.com>
---
kernel/sched/fair.c | 34 +++++++++++++++++++++++++---------
1 file changed, 25 insertions(+), 9 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 43ca8c4..6b9d27c 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -896,6 +896,15 @@ struct numa_group {
unsigned long faults[0];
};

+/* Shared or private faults. */
+#define NR_NUMA_HINT_FAULT_TYPES 2
+
+/* Memory and CPU locality */
+#define NR_NUMA_HINT_FAULT_STATS (NR_NUMA_HINT_FAULT_TYPES * 2)
+
+/* Averaged statistics, and temporary buffers. */
+#define NR_NUMA_HINT_FAULT_BUCKETS (NR_NUMA_HINT_FAULT_STATS * 2)
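
As a quick illustration of what these names buy: the 2*nr_node_ids and
4*nr_node_ids loop bounds discussed earlier in the thread are this same
arithmetic spelled out. The snippet below is a userspace sketch written for
this note, not kernel code, and the node count is made up.

#include <stdio.h>

/* Shared or private faults. */
#define NR_NUMA_HINT_FAULT_TYPES	2
/* Memory and CPU locality. */
#define NR_NUMA_HINT_FAULT_STATS	(NR_NUMA_HINT_FAULT_TYPES * 2)
/* Averaged statistics, and temporary buffers. */
#define NR_NUMA_HINT_FAULT_BUCKETS	(NR_NUMA_HINT_FAULT_STATS * 2)

int main(void)
{
	int nr_node_ids = 4;	/* example machine */

	/* 2 fault types x {memory, cpu} locality = 4 counters per node. */
	printf("stats per node:            %d\n", NR_NUMA_HINT_FAULT_STATS);
	/* The "4*nr_node_ids" loop bound, written with the new names. */
	printf("memory+cpu stats, 4 nodes: %d\n",
	       NR_NUMA_HINT_FAULT_STATS * nr_node_ids);
	/* Each stat also has a per-scan-window buffer copy. */
	printf("with buffers, 4 nodes:     %d\n",
	       NR_NUMA_HINT_FAULT_BUCKETS * nr_node_ids);
	return 0;
}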

Mel Gorman

Jan 24, 2014, 2:20:01 PM
On Tue, Jan 21, 2014 at 05:26:39PM -0500, Rik van Riel wrote:
> On 01/21/2014 07:21 AM, Mel Gorman wrote:
> > On Mon, Jan 20, 2014 at 02:21:03PM -0500, ri...@redhat.com wrote:
>
> >> +++ b/include/linux/sched.h
> >> @@ -1492,6 +1492,14 @@ struct task_struct {
> >> unsigned long *numa_faults_buffer;
> >>
> >> /*
> >> + * Track the nodes where faults are incurred. This is not very
> >> * interesting on a per-task basis, but it helps with smarter
> >> + * numa memory placement for groups of processes.
> >> + */
> >> + unsigned long *numa_faults_from;
> >> + unsigned long *numa_faults_from_buffer;
> >> +
> >
> > As an aside I wonder if we can derive any useful metric from this
>
> It may provide a better way to tune the numa scan interval
> than the current code, since the "local vs remote" ratio is not
> going to provide us much useful info when dealing with a workload
> that is spread across multiple numa nodes.
>

Agreed. Local vs Remote handles the easier cases, particularly where the
workload has been configured to have aspects of it fit within NUMA nodes
(e.g. multiple JVMs, multiple virtual machines, etc.), but it's nowhere near
as useful for large single-image workloads spanning the full machine.

I think in this New World Order, for single-instance workloads we would
instead take the balance of all remote nodes into account. So if we've
decided to interleave and the remote active nodes are roughly evenly used,
slow the scan rate.

It's not something for this series just yet but I have observed a higher
system CPU usage as a result of this series. It's still far lower than
the overhead we had in the past but this is one potential idea that would
allow us to reduce the system overhead again in the future.
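
Purely to illustrate the idea (nothing in this series does this), "roughly
even remote usage" could be tested with something like the sketch below. The
25% threshold and the plain array of per-node fault counts are made up for
the example; a caller might then back off the scan period when this returns
true, which is the behaviour described above.

#include <stdbool.h>

/*
 * Illustration only: decide whether the remote nodes' fault counts are
 * close enough to even that slowing the scan rate would be reasonable.
 * "Even" means no remote node is more than 25% above the mean; the
 * threshold is arbitrary and exists only for this sketch.
 */
static bool remote_nodes_roughly_even(const unsigned long *faults,
				      int nr_nodes, int local_node)
{
	unsigned long sum = 0, max = 0;
	int nid, nr_remote = 0;

	for (nid = 0; nid < nr_nodes; nid++) {
		if (nid == local_node)
			continue;
		sum += faults[nid];
		if (faults[nid] > max)
			max = faults[nid];
		nr_remote++;
	}
	if (!nr_remote || !sum)
		return false;

	/* max <= 1.25 * mean, kept in integer arithmetic. */
	return max * 4 * nr_remote <= sum * 5;
}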

> >> grp->total_faults = p->total_numa_faults;
> >> @@ -1526,7 +1536,7 @@ static void task_numa_group(struct task_struct *p, int cpupid, int flags,
> >>
> >> double_lock(&my_grp->lock, &grp->lock);
> >>
> >> - for (i = 0; i < 2*nr_node_ids; i++) {
> >> + for (i = 0; i < 4*nr_node_ids; i++) {
> >> my_grp->faults[i] -= p->numa_faults[i];
> >> grp->faults[i] += p->numa_faults[i];
> >> }
> >
> > The same obscure trick is used throughout and I'm not sure how
> > maintainable that will be. Would it be better to be explicit about this?
>
> I have made a cleanup patch for this, using the defines you
> suggested.
>
> >> @@ -1634,6 +1649,7 @@ void task_numa_fault(int last_cpupid, int node, int pages, int flags)
> >> p->numa_pages_migrated += pages;
> >>
> >> p->numa_faults_buffer[task_faults_idx(node, priv)] += pages;
> >> + p->numa_faults_from_buffer[task_faults_idx(this_node, priv)] += pages;
> >> p->numa_faults_locality[!!(flags & TNF_FAULT_LOCAL)] += pages;
> >
> > this_node and node are similarly ambiguous in terms of name. Renaming to
> > data_node and cpu_node would have been clearer.
>
> I added a patch in the next version of the series.
>
> Don't want to make the series too large, though :)
>

Understood, it's a bit of a mouthful already.

--
Mel Gorman
SUSE Labs

Mel Gorman

Jan 24, 2014, 3:30:02 PM
On Tue, Jan 21, 2014 at 05:20:04PM -0500, ri...@redhat.com wrote:
> From: Rik van Riel <ri...@redhat.com>
>
> In order to get a more consistent naming scheme, making it clear
> which fault statistics track memory locality, and which track
> CPU locality, rename the memory fault statistics.
>
> Suggested-by: Mel Gorman <mgo...@suse.de>
> Signed-off-by: Rik van Riel <ri...@redhat.com>

Thanks.

Acked-by: Mel Gorman <mgo...@suse.de>

--
Mel Gorman
SUSE Labs

Mel Gorman

Jan 24, 2014, 3:40:02 PM
faults = group_faults_cpu(numa_group, nid)

?

> + if (faults > max_faults)
> + max_faults = faults;
> + }
> +
> + for_each_online_node(nid) {
> + faults = numa_group->faults_cpu[task_faults_idx(nid, 0)] +
> + numa_group->faults_cpu[task_faults_idx(nid, 1)];

Same?
Ok, I guess this stops the old group from making very different migration
decisions just because one task left the group. That has difficult-to-predict
consequences, so assuming the new group_faults_cpu helper gets used:

Acked-by: Mel Gorman <mgo...@suse.de>

--
Mel Gorman
SUSE Labs
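
For anyone reading along: the helper being asked for above would just wrap
the open-coded sum quoted in the patch. A minimal sketch, assuming the
faults_cpu array and task_faults_idx() indexing used elsewhere in the
series; this is not the actual patch:

/*
 * Sketch of the suggested helper: total CPU-locality faults (shared plus
 * private) that a numa group has seen for one node.
 */
static inline unsigned long group_faults_cpu(struct numa_group *group, int nid)
{
	return group->faults_cpu[task_faults_idx(nid, 0)] +
	       group->faults_cpu[task_faults_idx(nid, 1)];
}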

Mel Gorman

Jan 24, 2014, 3:40:02 PM
/*
* Track the nodes the process was running on when a NUMA hinting fault
* was incurred ......
*/

?

Otherwise the comment is very similar to numa_faults_memory. I'm not
that bothered because the name is descriptive enough.


> + /*
> * numa_faults_locality tracks if faults recorded during the last
> * scan window were remote/local. The task scan period is adapted
> * based on the locality of the faults with different weights
> @@ -1594,8 +1602,6 @@ extern void task_numa_fault(int last_node, int node, int pages, int flags);
> extern pid_t task_numa_group_id(struct task_struct *p);
> extern void set_numabalancing_state(bool enabled);
> extern void task_numa_free(struct task_struct *p);
> -
> -extern unsigned int sysctl_numa_balancing_migrate_deferred;
> #else
> static inline void task_numa_fault(int last_node, int node, int pages,
> int flags)

Should this hunk move to patch 1?

Whether you make the changes or not

Acked-by: Mel Gorman <mgo...@suse.de>

In my last review I complained about magic numbers but I see a later
patch has a subject that at least implies it deals with the numbers.

--
Mel Gorman
SUSE Labs

Mel Gorman

Jan 24, 2014, 3:50:01 PM
On Tue, Jan 21, 2014 at 05:20:10PM -0500, ri...@redhat.com wrote:
> From: Rik van Riel <ri...@redhat.com>
>
> We track both the node of the memory after a NUMA fault, and the node
> of the CPU on which the fault happened. Rename the local variables in
> task_numa_fault to make things more explicit.
>
> Suggested-by: Mel Gorman <mgo...@suse.de>
> Signed-off-by: Rik van Riel <ri...@redhat.com>

Acked-by: Mel Gorman <mgo...@suse.de>

--
Mel Gorman
SUSE Labs

Mel Gorman

Jan 24, 2014, 3:50:01 PM
On Tue, Jan 21, 2014 at 06:58:26PM -0500, Rik van Riel wrote:
> On Tue, 21 Jan 2014 17:20:11 -0500
> ri...@redhat.com wrote:
>
> > From: Rik van Riel <ri...@redhat.com>
> >
> > Cleanup suggested by Mel Gorman. Now the code contains some more
> > hints on what statistics go where.
> >
> > Suggested-by: Mel Gorman <mgo...@suse.de>
> > Signed-off-by: Rik van Riel <ri...@redhat.com>
>
> ... and patch fatigue hit, causing me to mess up this simple
> patch with no functional changes (which is why I did not test
> it yet).
>
> Here is one that compiles.
>
> ---8<---
>
> Subject: numa,sched: define some magic numbers
>
> Cleanup suggested by Mel Gorman. Now the code contains some more
> hints on what statistics go where.
>
> Suggested-by: Mel Gorman <mgo...@suse.de>
> Signed-off-by: Rik van Riel <ri...@redhat.com>

Acked-by: Mel Gorman <mgo...@suse.de>

--
Mel Gorman
SUSE Labs