
[PATCH 4/6] numa,sched: tracepoints for NUMA balancing active nodemask changes


ri...@redhat.com

Jan 17, 2014, 6:30:02 AM

From: Rik van Riel <ri...@surriel.com>

Being able to see how the active nodemask changes over time, and why,
can be quite useful.

Cc: Peter Zijlstra <pet...@infradead.org>
Cc: Mel Gorman <mgo...@suse.de>
Cc: Ingo Molnar <mi...@redhat.com>
Cc: Chegu Vinod <chegu...@hp.com>
Signed-off-by: Rik van Riel <ri...@redhat.com>
Signed-off-by: Rik van Riel <ri...@surriel.com>
---
include/trace/events/sched.h | 34 ++++++++++++++++++++++++++++++++++
kernel/sched/fair.c | 8 ++++++--
2 files changed, 40 insertions(+), 2 deletions(-)

diff --git a/include/trace/events/sched.h b/include/trace/events/sched.h
index 67e1bbf..91726b6 100644
--- a/include/trace/events/sched.h
+++ b/include/trace/events/sched.h
@@ -530,6 +530,40 @@ TRACE_EVENT(sched_swap_numa,
__entry->dst_pid, __entry->dst_tgid, __entry->dst_ngid,
__entry->dst_cpu, __entry->dst_nid)
);
+
+TRACE_EVENT(update_numa_active_nodes_mask,
+
+ TP_PROTO(int pid, int gid, int nid, int set, long faults, long max_faults),
+
+ TP_ARGS(pid, gid, nid, set, faults, max_faults),
+
+ TP_STRUCT__entry(
+ __field( pid_t, pid)
+ __field( pid_t, gid)
+ __field( int, nid)
+ __field( int, set)
+ __field( long, faults)
+ __field( long, max_faults)
+ ),
+
+ TP_fast_assign(
+ __entry->pid = pid;
+ __entry->gid = gid;
+ __entry->nid = nid;
+ __entry->set = set;
+ __entry->faults = faults;
+ __entry->max_faults = max_faults;
+ ),
+
+ TP_printk("pid=%d gid=%d nid=%d set=%d faults=%ld max_faults=%ld",
+ __entry->pid,
+ __entry->gid,
+ __entry->nid,
+ __entry->set,
+ __entry->faults,
+ __entry->max_faults)
+
+);
#endif /* _TRACE_SCHED_H */

/* This part must be outside protection */
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index aa680e2..3551009 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1300,10 +1300,14 @@ static void update_numa_active_node_mask(struct task_struct *p)
faults = numa_group->faults_from[task_faults_idx(nid, 0)] +
numa_group->faults_from[task_faults_idx(nid, 1)];
if (!node_isset(nid, numa_group->active_nodes)) {
- if (faults > max_faults * 4 / 10)
+ if (faults > max_faults * 4 / 10) {
+ trace_update_numa_active_nodes_mask(current->pid, numa_group->gid, nid, true, faults, max_faults);
node_set(nid, numa_group->active_nodes);
- } else if (faults < max_faults * 2 / 10)
+ }
+ } else if (faults < max_faults * 2 / 10) {
+ trace_update_numa_active_nodes_mask(current->pid, numa_group->gid, nid, false, faults, max_faults);
node_clear(nid, numa_group->active_nodes);
+ }
}
}

--
1.8.4.2


ri...@redhat.com

Jan 17, 2014, 6:30:02 AM

The current automatic NUMA balancing code base has issues with
workloads that do not fit on one NUMA node. Page migration is
slowed down, but memory distribution between the nodes where
the workload runs is essentially random, often resulting in a
suboptimal amount of memory bandwidth being available to the
workload.

In order to maximize performance of workloads that do not fit in one NUMA
node, we want to satisfy the following criteria:
1) keep private memory local to each thread
2) avoid excessive NUMA migration of pages
3) distribute shared memory across the active nodes, to
maximize memory bandwidth available to the workload

This patch series identifies the NUMA nodes on which the workload
is actively running, and balances (somewhat lazily) the memory
between those nodes, satisfying the criteria above.

As usual, the series has had some performance testing, but it
could always benefit from more testing, on other systems.


Some performance numbers, with two 40-warehouse specjbb instances
on an 8 node system with 10 CPU cores per node, using a pre-cleanup
version of these patches, courtesy of Chegu Vinod:

numactl manual pinning
spec1.txt: throughput = 755900.20 SPECjbb2005 bops
spec2.txt: throughput = 754914.40 SPECjbb2005 bops

NO-pinning results (Automatic NUMA balancing, with patches)
spec1.txt: throughput = 706439.84 SPECjbb2005 bops
spec2.txt: throughput = 729347.75 SPECjbb2005 bops

NO-pinning results (Automatic NUMA balancing, without patches)
spec1.txt: throughput = 667988.47 SPECjbb2005 bops
spec2.txt: throughput = 638220.45 SPECjbb2005 bops

No Automatic NUMA and NO-pinning results
spec1.txt: throughput = 544120.97 SPECjbb2005 bops
spec2.txt: throughput = 453553.41 SPECjbb2005 bops


My own performance numbers are not as relevant, since I have been
running with a more hostile workload on purpose, and I have run
into a scheduler issue that caused the workload to run on only
two of the four NUMA nodes on my test system...

ri...@redhat.com

Jan 17, 2014, 6:30:02 AM

From: Rik van Riel <ri...@surriel.com>

Excessive migration of pages can hurt the performance of workloads
that span multiple NUMA nodes. However, it turns out that the
p->numa_migrate_deferred knob is a really big hammer, which does
reduce migration rates, but does not actually help performance.

Now that the second stage of the automatic numa balancing code
has stabilized, it is time to replace the simplistic migration
deferral code with something smarter.

Cc: Peter Zijlstra <pet...@infradead.org>
Cc: Mel Gorman <mgo...@suse.de>
Cc: Ingo Molnar <mi...@redhat.com>
Cc: Chegu Vinod <chegu...@hp.com>
Signed-off-by: Rik van Riel <ri...@redhat.com>
Signed-off-by: Rik van Riel <ri...@surriel.com>
---
include/linux/sched.h | 1 -
kernel/sched/fair.c | 8 --------
kernel/sysctl.c | 7 -------
mm/mempolicy.c | 45 ---------------------------------------------
4 files changed, 61 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 68a0e84..97efba4 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1469,7 +1469,6 @@ struct task_struct {
unsigned int numa_scan_period;
unsigned int numa_scan_period_max;
int numa_preferred_nid;
- int numa_migrate_deferred;
unsigned long numa_migrate_retry;
u64 node_stamp; /* migration stamp */
struct callback_head numa_work;
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 867b0a4..41e2176 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -819,14 +819,6 @@ unsigned int sysctl_numa_balancing_scan_size = 256;
/* Scan @scan_size MB every @scan_period after an initial @scan_delay in ms */
unsigned int sysctl_numa_balancing_scan_delay = 1000;

-/*
- * After skipping a page migration on a shared page, skip N more numa page
- * migrations unconditionally. This reduces the number of NUMA migrations
- * in shared memory workloads, and has the effect of pulling tasks towards
- * where their memory lives, over pulling the memory towards the task.
- */
-unsigned int sysctl_numa_balancing_migrate_deferred = 16;
-
static unsigned int task_nr_scan_windows(struct task_struct *p)
{
unsigned long rss = 0;
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 096db74..4d19492 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -384,13 +384,6 @@ static struct ctl_table kern_table[] = {
.proc_handler = proc_dointvec,
},
{
- .procname = "numa_balancing_migrate_deferred",
- .data = &sysctl_numa_balancing_migrate_deferred,
- .maxlen = sizeof(unsigned int),
- .mode = 0644,
- .proc_handler = proc_dointvec,
- },
- {
.procname = "numa_balancing",
.data = NULL, /* filled in by handler */
.maxlen = sizeof(unsigned int),
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 36cb46c..052abac 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -2301,35 +2301,6 @@ static void sp_free(struct sp_node *n)
kmem_cache_free(sn_cache, n);
}

-#ifdef CONFIG_NUMA_BALANCING
-static bool numa_migrate_deferred(struct task_struct *p, int last_cpupid)
-{
- /* Never defer a private fault */
- if (cpupid_match_pid(p, last_cpupid))
- return false;
-
- if (p->numa_migrate_deferred) {
- p->numa_migrate_deferred--;
- return true;
- }
- return false;
-}
-
-static inline void defer_numa_migrate(struct task_struct *p)
-{
- p->numa_migrate_deferred = sysctl_numa_balancing_migrate_deferred;
-}
-#else
-static inline bool numa_migrate_deferred(struct task_struct *p, int last_cpupid)
-{
- return false;
-}
-
-static inline void defer_numa_migrate(struct task_struct *p)
-{
-}
-#endif /* CONFIG_NUMA_BALANCING */
-
/**
* mpol_misplaced - check whether current page node is valid in policy
*
@@ -2432,24 +2403,8 @@ int mpol_misplaced(struct page *page, struct vm_area_struct *vma, unsigned long
*/
last_cpupid = page_cpupid_xchg_last(page, this_cpupid);
if (!cpupid_pid_unset(last_cpupid) && cpupid_to_nid(last_cpupid) != thisnid) {
-
- /* See sysctl_numa_balancing_migrate_deferred comment */
- if (!cpupid_match_pid(current, last_cpupid))
- defer_numa_migrate(current);
-
goto out;
}
-
- /*
- * The quadratic filter above reduces extraneous migration
- * of shared pages somewhat. This code reduces it even more,
- * reducing the overhead of page migrations of shared pages.
- * This makes workloads with shared pages rely more on
- * "move task near its memory", and less on "move memory
- * towards its task", which is exactly what we want.
- */
- if (numa_migrate_deferred(current, last_cpupid))
- goto out;
}

if (curnid != polnid)
--
1.8.4.2

ri...@redhat.com

Jan 17, 2014, 6:30:02 AM

From: Rik van Riel <ri...@surriel.com>

The faults_from statistics are used to maintain an active_nodes nodemask
per numa_group. This allows us to be smarter about when to do numa migrations.
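
As a rough illustration of the hysteresis used below, here is a small
userspace sketch with invented fault counts; the arrays are hypothetical
stand-ins for the per-group statistics, not the kernel data structures:

	/* Userspace sketch of the 40%/20% hysteresis in
	 * update_numa_active_node_mask(); all numbers are made up. */
	#include <stdbool.h>
	#include <stdio.h>

	#define NR_NODES 4

	int main(void)
	{
		long faults_from[NR_NODES] = { 1000, 450, 350, 150 };
		/* node 2 is already in the active set */
		bool active[NR_NODES] = { false, false, true, false };
		long max_faults = 0;
		int nid;

		for (nid = 0; nid < NR_NODES; nid++)
			if (faults_from[nid] > max_faults)
				max_faults = faults_from[nid];

		for (nid = 0; nid < NR_NODES; nid++) {
			if (!active[nid]) {
				/* join the set above 40% of the maximum */
				if (faults_from[nid] > max_faults * 4 / 10)
					active[nid] = true;
			} else if (faults_from[nid] < max_faults * 2 / 10) {
				/* leave the set only below 20% */
				active[nid] = false;
			}
			printf("node %d: %4ld faults -> %sactive\n", nid,
			       faults_from[nid], active[nid] ? "" : "in");
		}
		return 0;
	}

With these numbers a node needs more than 400 faults to be added, yet
node 2, already in the set with 350 faults, is only dropped once it falls
below 200. The gap between the two thresholds is what keeps nodes from
flip-flopping in and out of the mask.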

Cc: Peter Zijlstra <pet...@infradead.org>
Cc: Mel Gorman <mgo...@suse.de>
Cc: Ingo Molnar <mi...@redhat.com>
Cc: Chegu Vinod <chegu...@hp.com>
Signed-off-by: Rik van Riel <ri...@redhat.com>
Signed-off-by: Rik van Riel <ri...@surriel.com>
---
kernel/sched/fair.c | 38 ++++++++++++++++++++++++++++++++++++++
1 file changed, 38 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 1945ddc..aa680e2 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -885,6 +885,7 @@ struct numa_group {
struct list_head task_list;

struct rcu_head rcu;
+ nodemask_t active_nodes;
unsigned long total_faults;
unsigned long *faults_from;
unsigned long faults[0];
@@ -1275,6 +1276,38 @@ static void numa_migrate_preferred(struct task_struct *p)
}

/*
+ * Iterate over the nodes from which NUMA hinting faults were triggered, in
+ * other words where the CPUs that incurred NUMA hinting faults are. The
+ * bitmask is used to limit NUMA page migrations, and spread out memory
+ * between the actively used nodes. To prevent flip-flopping, and excessive
+ * page migrations, nodes are added when they cause over 40% of the maximum
+ * number of faults, but only removed when they drop below 20%.
+ */
+static void update_numa_active_node_mask(struct task_struct *p)
+{
+ unsigned long faults, max_faults = 0;
+ struct numa_group *numa_group = p->numa_group;
+ int nid;
+
+ for_each_online_node(nid) {
+ faults = numa_group->faults_from[task_faults_idx(nid, 0)] +
+ numa_group->faults_from[task_faults_idx(nid, 1)];
+ if (faults > max_faults)
+ max_faults = faults;
+ }
+
+ for_each_online_node(nid) {
+ faults = numa_group->faults_from[task_faults_idx(nid, 0)] +
+ numa_group->faults_from[task_faults_idx(nid, 1)];
+ if (!node_isset(nid, numa_group->active_nodes)) {
+ if (faults > max_faults * 4 / 10)
+ node_set(nid, numa_group->active_nodes);
+ } else if (faults < max_faults * 2 / 10)
+ node_clear(nid, numa_group->active_nodes);
+ }
+}
+
+/*
* When adapting the scan rate, the period is divided into NUMA_PERIOD_SLOTS
* increments. The more local the fault statistics are, the higher the scan
* period will be for the next scan window. If local/remote ratio is below
@@ -1416,6 +1449,7 @@ static void task_numa_placement(struct task_struct *p)
update_task_scan_period(p, fault_types[0], fault_types[1]);

if (p->numa_group) {
+ update_numa_active_node_mask(p);
/*
* If the preferred task and group nids are different,
* iterate over the nodes again to find the best place.
@@ -1478,6 +1512,8 @@ static void task_numa_group(struct task_struct *p, int cpupid, int flags,
/* Second half of the array tracks where faults come from */
grp->faults_from = grp->faults + 2 * nr_node_ids;

+ node_set(task_node(current), grp->active_nodes);
+
for (i = 0; i < 4*nr_node_ids; i++)
grp->faults[i] = p->numa_faults[i];

@@ -1547,6 +1583,8 @@ static void task_numa_group(struct task_struct *p, int cpupid, int flags,
my_grp->nr_tasks--;
grp->nr_tasks++;

+ update_numa_active_node_mask(p);
+
spin_unlock(&my_grp->lock);
spin_unlock(&grp->lock);

--
1.8.4.2

ri...@redhat.com

Jan 17, 2014, 6:30:02 AM

From: Rik van Riel <ri...@surriel.com>

The tracepoint has made it abundantly clear that the naive
implementation of the faults_from code has issues.

Specifically, the garbage collector in some workloads will
access orders of magnitude more memory than the threads
that do all the active work. This resulted in the node with
the garbage collector being marked the only active node in
the group.

This issue is avoided if we weigh the statistics by CPU use
of each task in the numa group, instead of by how many faults
each thread has incurred.

To achieve this, we normalize the number of faults to the
fraction of faults that occurred on each node, and then
multiply that fraction by the fraction of CPU time the
task has used since the last time task_numa_placement was
invoked.

This way the nodes in the active node mask will be the ones
where the tasks from the numa group are most actively running,
and the influence of e.g. the garbage collector and other
do-little threads is properly minimized.
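
A small standalone sketch of that weighting, using the same shape of
formula as the code below; the task names and all numbers are invented,
and the +1 in the denominator only keeps the toy example from dividing
by zero:

	/* Sketch of the CPU-use weighting applied to faults_from:
	 * a "GC-like" task faults a lot but barely runs, while a worker
	 * faults less but uses most of the CPU. */
	#include <stdio.h>

	static long f_weight(long runtime, long period,
			     long node_faults, long total_faults)
	{
		return (1024 * runtime * node_faults) /
		       (total_faults * period + 1);
	}

	int main(void)
	{
		/* GC-like: 4000 of 4000 faults on its node, 5% of the CPU */
		printf("gc     -> %ld\n", f_weight( 50, 1000, 4000, 4000));
		/* worker: 300 of 400 faults on its node, 90% of the CPU */
		printf("worker -> %ld\n", f_weight(900, 1000,  300,  400));
		return 0;
	}

Despite raising ten times as many faults, the mostly idle task adds a
weight of roughly 51 to its node, while the busy worker adds about 691,
so the active node mask ends up tracking where the CPU time is spent.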

Cc: Peter Zijlstra <pet...@infradead.org>
Cc: Mel Gorman <mgo...@suse.de>
Cc: Ingo Molnar <mi...@redhat.com>
Cc: Chegu Vinod <chegu...@hp.com>
Signed-off-by: Rik van Riel <ri...@redhat.com>
Signed-off-by: Rik van Riel <ri...@surriel.com>
---
include/linux/sched.h | 2 ++
kernel/sched/core.c | 2 ++
kernel/sched/fair.c | 49 +++++++++++++++++++++++++++++++++++++++++++++++--
3 files changed, 51 insertions(+), 2 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 0af6c1a..52de567 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1471,6 +1471,8 @@ struct task_struct {
int numa_preferred_nid;
unsigned long numa_migrate_retry;
u64 node_stamp; /* migration stamp */
+ u64 last_task_numa_placement;
+ u64 last_sum_exec_runtime;
struct callback_head numa_work;

struct list_head numa_entry;
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 7f45fd5..9a0908a 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1758,6 +1758,8 @@ static void __sched_fork(unsigned long clone_flags, struct task_struct *p)
p->numa_work.next = &p->numa_work;
p->numa_faults = NULL;
p->numa_faults_buffer = NULL;
+ p->last_task_numa_placement = 0;
+ p->last_sum_exec_runtime = 0;

INIT_LIST_HEAD(&p->numa_entry);
p->numa_group = NULL;
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 8e0a53a..1601c68 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1422,11 +1422,41 @@ static void update_task_scan_period(struct task_struct *p,
memset(p->numa_faults_locality, 0, sizeof(p->numa_faults_locality));
}

+/*
+ * Get the fraction of time the task has been running since the last
+ * NUMA placement cycle. The scheduler keeps similar statistics, but
+ * decays those on a 32ms period, which is orders of magnitude off
+ * from the dozens-of-seconds NUMA balancing period. Use the scheduler
+ * stats only if the task is so new there are no NUMA statistics yet.
+ */
+static u64 numa_get_avg_runtime(struct task_struct *p, u64 *period)
+{
+ u64 runtime, delta, now;
+ /* Use the start of this time slice to avoid calculations. */
+ now = p->se.exec_start;
+ runtime = p->se.sum_exec_runtime;
+
+ if (p->last_task_numa_placement) {
+ delta = runtime - p->last_sum_exec_runtime;
+ *period = now - p->last_task_numa_placement;
+ } else {
+ delta = p->se.avg.runnable_avg_sum;
+ *period = p->se.avg.runnable_avg_period;
+ }
+
+ p->last_sum_exec_runtime = runtime;
+ p->last_task_numa_placement = now;
+
+ return delta;
+}
+
static void task_numa_placement(struct task_struct *p)
{
int seq, nid, max_nid = -1, max_group_nid = -1;
unsigned long max_faults = 0, max_group_faults = 0;
unsigned long fault_types[2] = { 0, 0 };
+ unsigned long total_faults;
+ u64 runtime, period;
spinlock_t *group_lock = NULL;

seq = ACCESS_ONCE(p->mm->numa_scan_seq);
@@ -1435,6 +1465,10 @@ static void task_numa_placement(struct task_struct *p)
p->numa_scan_seq = seq;
p->numa_scan_period_max = task_scan_max(p);

+ total_faults = p->numa_faults_locality[0] +
+ p->numa_faults_locality[1] + 1;
+ runtime = numa_get_avg_runtime(p, &period);
+
/* If the task is part of a group prevent parallel updates to group stats */
if (p->numa_group) {
group_lock = &p->numa_group->lock;
@@ -1447,7 +1481,7 @@ static void task_numa_placement(struct task_struct *p)
int priv, i;

for (priv = 0; priv < 2; priv++) {
- long diff, f_diff;
+ long diff, f_diff, f_weight;

i = task_faults_idx(nid, priv);
diff = -p->numa_faults[i];
@@ -1459,8 +1493,19 @@ static void task_numa_placement(struct task_struct *p)
fault_types[priv] += p->numa_faults_buffer[i];
p->numa_faults_buffer[i] = 0;

+ /*
+ * Normalize the faults_from, so all tasks in a group
+ * count according to CPU use, instead of by the raw
+ * number of faults. This prevents the situation where
+ * the garbage collector totally dominates the stats,
+ * and the access patterns of the worker threads are
+ * ignored.
+ */
+ f_weight = (1024 * runtime *
+ p->numa_faults_from_buffer[i]) /
+ (total_faults * period);
p->numa_faults_from[i] >>= 1;
- p->numa_faults_from[i] += p->numa_faults_from_buffer[i];
+ p->numa_faults_from[i] += f_weight;
p->numa_faults_from_buffer[i] = 0;

faults += p->numa_faults[i];
--
1.8.4.2

ri...@redhat.com

Jan 17, 2014, 6:30:02 AM

From: Rik van Riel <ri...@surriel.com>

Track which nodes NUMA faults are triggered from, in other words
the CPUs on which the NUMA faults happened. This uses a similar
mechanism to what is used to track the memory involved in numa faults.

The next patches use this to build up a bitmap of which nodes a
workload is actively running on.
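
The new counters share one allocation with the existing ones; below is a
small sketch of that layout, with malloc-style allocation standing in for
the kernel allocator and the two-counters-per-node indexing implied by
the patch:

	/* Sketch of the per-task fault statistics layout after this
	 * patch: four arrays of 2 * nr_node_ids counters each, carved
	 * out of a single zeroed block. */
	#include <stdio.h>
	#include <stdlib.h>

	int main(void)
	{
		long nr_node_ids = 4;
		size_t size = sizeof(unsigned long) * 4 * nr_node_ids;
		unsigned long *numa_faults = calloc(2, size); /* kzalloc(size * 2) */

		unsigned long *numa_faults_from        = numa_faults + 2 * nr_node_ids;
		unsigned long *numa_faults_buffer      = numa_faults + 4 * nr_node_ids;
		unsigned long *numa_faults_from_buffer = numa_faults + 6 * nr_node_ids;

		/* two counters per node, indexed 2 * nid + priv, where priv
		 * distinguishes private from shared faults */
		int nid = 1, priv = 0;
		numa_faults_buffer[2 * nid + priv] += 16;      /* memory on nid */
		numa_faults_from_buffer[2 * nid + priv] += 16; /* CPU on nid    */

		printf("faults_from at +%ld, buffer at +%ld, from_buffer at +%ld\n",
		       (long)(numa_faults_from - numa_faults),
		       (long)(numa_faults_buffer - numa_faults),
		       (long)(numa_faults_from_buffer - numa_faults));

		free(numa_faults);
		return 0;
	}

The group side mirrors this: the numa_group's faults array doubles from
2*nr_node_ids to 4*nr_node_ids entries, with faults_from pointing at the
second half.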

Cc: Peter Zijlstra <pet...@infradead.org>
Cc: Mel Gorman <mgo...@suse.de>
Cc: Ingo Molnar <mi...@redhat.com>
Cc: Chegu Vinod <chegu...@hp.com>
Signed-off-by: Rik van Riel <ri...@redhat.com>
Signed-off-by: Rik van Riel <ri...@surriel.com>
---
include/linux/sched.h | 10 ++++++++--
kernel/sched/fair.c | 30 +++++++++++++++++++++++-------
2 files changed, 31 insertions(+), 9 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 97efba4..a9f7f05 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1492,6 +1492,14 @@ struct task_struct {
unsigned long *numa_faults_buffer;

/*
+ * Track the nodes where faults are incurred. This is not very
+ * interesting on a per-task basis, but it helps with smarter
+ * numa memory placement for groups of processes.
+ */
+ unsigned long *numa_faults_from;
+ unsigned long *numa_faults_from_buffer;
+
+ /*
* numa_faults_locality tracks if faults recorded during the last
* scan window were remote/local. The task scan period is adapted
* based on the locality of the faults with different weights
@@ -1594,8 +1602,6 @@ extern void task_numa_fault(int last_node, int node, int pages, int flags);
extern pid_t task_numa_group_id(struct task_struct *p);
extern void set_numabalancing_state(bool enabled);
extern void task_numa_free(struct task_struct *p);
-
-extern unsigned int sysctl_numa_balancing_migrate_deferred;
#else
static inline void task_numa_fault(int last_node, int node, int pages,
int flags)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 41e2176..1945ddc 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -886,6 +886,7 @@ struct numa_group {

struct rcu_head rcu;
unsigned long total_faults;
+ unsigned long *faults_from;
unsigned long faults[0];
};

@@ -1372,10 +1373,11 @@ static void task_numa_placement(struct task_struct *p)
int priv, i;

for (priv = 0; priv < 2; priv++) {
- long diff;
+ long diff, f_diff;

i = task_faults_idx(nid, priv);
diff = -p->numa_faults[i];
+ f_diff = -p->numa_faults_from[i];

/* Decay existing window, copy faults since last scan */
p->numa_faults[i] >>= 1;
@@ -1383,12 +1385,18 @@ static void task_numa_placement(struct task_struct *p)
fault_types[priv] += p->numa_faults_buffer[i];
p->numa_faults_buffer[i] = 0;

+ p->numa_faults_from[i] >>= 1;
+ p->numa_faults_from[i] += p->numa_faults_from_buffer[i];
+ p->numa_faults_from_buffer[i] = 0;
+
faults += p->numa_faults[i];
diff += p->numa_faults[i];
+ f_diff += p->numa_faults_from[i];
p->total_numa_faults += diff;
if (p->numa_group) {
/* safe because we can only change our own group */
p->numa_group->faults[i] += diff;
+ p->numa_group->faults_from[i] += f_diff;
p->numa_group->total_faults += diff;
group_faults += p->numa_group->faults[i];
}
@@ -1457,7 +1465,7 @@ static void task_numa_group(struct task_struct *p, int cpupid, int flags,

if (unlikely(!p->numa_group)) {
unsigned int size = sizeof(struct numa_group) +
- 2*nr_node_ids*sizeof(unsigned long);
+ 4*nr_node_ids*sizeof(unsigned long);

grp = kzalloc(size, GFP_KERNEL | __GFP_NOWARN);
if (!grp)
@@ -1467,8 +1475,10 @@ static void task_numa_group(struct task_struct *p, int cpupid, int flags,
spin_lock_init(&grp->lock);
INIT_LIST_HEAD(&grp->task_list);
grp->gid = p->pid;
+ /* Second half of the array tracks where faults come from */
+ grp->faults_from = grp->faults + 2 * nr_node_ids;

- for (i = 0; i < 2*nr_node_ids; i++)
+ for (i = 0; i < 4*nr_node_ids; i++)
grp->faults[i] = p->numa_faults[i];

grp->total_faults = p->total_numa_faults;
@@ -1526,7 +1536,7 @@ static void task_numa_group(struct task_struct *p, int cpupid, int flags,

double_lock(&my_grp->lock, &grp->lock);

- for (i = 0; i < 2*nr_node_ids; i++) {
+ for (i = 0; i < 4*nr_node_ids; i++) {
my_grp->faults[i] -= p->numa_faults[i];
grp->faults[i] += p->numa_faults[i];
}
@@ -1558,7 +1568,7 @@ void task_numa_free(struct task_struct *p)

if (grp) {
spin_lock(&grp->lock);
- for (i = 0; i < 2*nr_node_ids; i++)
+ for (i = 0; i < 4*nr_node_ids; i++)
grp->faults[i] -= p->numa_faults[i];
grp->total_faults -= p->total_numa_faults;

@@ -1571,6 +1581,8 @@ void task_numa_free(struct task_struct *p)

p->numa_faults = NULL;
p->numa_faults_buffer = NULL;
+ p->numa_faults_from = NULL;
+ p->numa_faults_from_buffer = NULL;
kfree(numa_faults);
}

@@ -1581,6 +1593,7 @@ void task_numa_fault(int last_cpupid, int node, int pages, int flags)
{
struct task_struct *p = current;
bool migrated = flags & TNF_MIGRATED;
+ int this_node = task_node(current);
int priv;

if (!numabalancing_enabled)
@@ -1596,7 +1609,7 @@ void task_numa_fault(int last_cpupid, int node, int pages, int flags)

/* Allocate buffer to track faults on a per-node basis */
if (unlikely(!p->numa_faults)) {
- int size = sizeof(*p->numa_faults) * 2 * nr_node_ids;
+ int size = sizeof(*p->numa_faults) * 4 * nr_node_ids;

/* numa_faults and numa_faults_buffer share the allocation */
p->numa_faults = kzalloc(size * 2, GFP_KERNEL|__GFP_NOWARN);
@@ -1604,7 +1617,9 @@ void task_numa_fault(int last_cpupid, int node, int pages, int flags)
return;

BUG_ON(p->numa_faults_buffer);
- p->numa_faults_buffer = p->numa_faults + (2 * nr_node_ids);
+ p->numa_faults_from = p->numa_faults + (2 * nr_node_ids);
+ p->numa_faults_buffer = p->numa_faults + (4 * nr_node_ids);
+ p->numa_faults_from_buffer = p->numa_faults + (6 * nr_node_ids);
p->total_numa_faults = 0;
memset(p->numa_faults_locality, 0, sizeof(p->numa_faults_locality));
}
@@ -1634,6 +1649,7 @@ void task_numa_fault(int last_cpupid, int node, int pages, int flags)
p->numa_pages_migrated += pages;

p->numa_faults_buffer[task_faults_idx(node, priv)] += pages;
+ p->numa_faults_from_buffer[task_faults_idx(this_node, priv)] += pages;
p->numa_faults_locality[!!(flags & TNF_FAULT_LOCAL)] += pages;
}

--
1.8.4.2

ri...@redhat.com

Jan 17, 2014, 6:30:02 AM

From: Rik van Riel <ri...@surriel.com>

Use the active_nodes nodemask to make smarter decisions on NUMA migrations.

In order to maximize performance of workloads that do not fit in one NUMA
node, we want to satisfy the following criteria:
1) keep private memory local to each thread
2) avoid excessive NUMA migration of pages
3) distribute shared memory across the active nodes, to
maximize memory bandwidth available to the workload

This patch accomplishes that by implementing the following policy for
NUMA migrations:
1) always migrate on a private fault
2) never migrate to a node that is not in the set of active nodes
for the numa_group
3) always migrate from a node outside of the set of active nodes,
to a node that is in that set
4) within the set of active nodes in the numa_group, only migrate
from a node with more NUMA page faults, to a node with fewer
NUMA page faults, with a 25% margin to avoid ping-ponging

This results in most pages of a workload ending up on the actively
used nodes, with reduced ping-ponging of pages between those nodes.
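
A rough userspace model of those four rules, with made-up fault counts
and a plain array standing in for the active_nodes nodemask (this is an
illustration, not the kernel implementation below):

	#include <stdbool.h>
	#include <stdio.h>

	#define NR_NODES 4

	static bool node_active[NR_NODES]  = { true, true, false, false };
	static long group_faults[NR_NODES] = { 900, 600, 50, 10 };

	static bool sketch_should_migrate(bool priv, int src, int dst)
	{
		if (priv)               /* 1) private memory follows its thread    */
			return true;
		if (!node_active[dst])  /* 2) never move onto an inactive node     */
			return false;
		if (!node_active[src])  /* 3) always pull pages off inactive nodes */
			return true;
		/* 4) between active nodes, require a 25% fault imbalance */
		return group_faults[dst] < group_faults[src] * 3 / 4;
	}

	int main(void)
	{
		printf("%d\n", sketch_should_migrate(false, 2, 0)); /* inactive -> active: migrate */
		printf("%d\n", sketch_should_migrate(false, 0, 1)); /* 600 < 900 * 3/4: migrate    */
		printf("%d\n", sketch_should_migrate(false, 1, 0)); /* 900 < 600 * 3/4 fails: stay */
		return 0;
	}

With these numbers pages drain off the unused nodes and drift from the
more heavily faulted active node toward the lighter one, but not back,
which is the intended spreading with a margin against ping-ponging.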

Cc: Peter Zijlstra <pet...@infradead.org>
Cc: Mel Gorman <mgo...@suse.de>
Cc: Ingo Molnar <mi...@redhat.com>
Cc: Chegu Vinod <chegu...@hp.com>
Signed-off-by: Rik van Riel <ri...@redhat.com>
Signed-off-by: Rik van Riel <ri...@surriel.com>
---
include/linux/sched.h | 7 +++++++
kernel/sched/fair.c | 37 +++++++++++++++++++++++++++++++++++++
mm/mempolicy.c | 3 +++
3 files changed, 47 insertions(+)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index a9f7f05..0af6c1a 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1602,6 +1602,8 @@ extern void task_numa_fault(int last_node, int node, int pages, int flags);
extern pid_t task_numa_group_id(struct task_struct *p);
extern void set_numabalancing_state(bool enabled);
extern void task_numa_free(struct task_struct *p);
+extern bool should_numa_migrate(struct task_struct *p, int last_cpupid,
+ int src_nid, int dst_nid);
#else
static inline void task_numa_fault(int last_node, int node, int pages,
int flags)
@@ -1617,6 +1619,11 @@ static inline void set_numabalancing_state(bool enabled)
static inline void task_numa_free(struct task_struct *p)
{
}
+static inline bool should_numa_migrate(struct task_struct *p, int last_cpupid,
+ int src_nid, int dst_nid)
+{
+ return true;
+}
#endif

static inline struct pid *task_pid(struct task_struct *task)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 3551009..8e0a53a 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -948,6 +948,43 @@ static inline unsigned long group_weight(struct task_struct *p, int nid)
return 1000 * group_faults(p, nid) / p->numa_group->total_faults;
}

+bool should_numa_migrate(struct task_struct *p, int last_cpupid,
+ int src_nid, int dst_nid)
+{
+ struct numa_group *ng = p->numa_group;
+
+ /* Always allow migrate on private faults */
+ if (cpupid_match_pid(p, last_cpupid))
+ return true;
+
+ /* A shared fault, but p->numa_group has not been set up yet. */
+ if (!ng)
+ return true;
+
+ /*
+ * Do not migrate if the destination is not a node that
+ * is actively used by this numa group.
+ */
+ if (!node_isset(dst_nid, ng->active_nodes))
+ return false;
+
+ /*
+ * Source is a node that is not actively used by this
+ * numa group, while the destination is. Migrate.
+ */
+ if (!node_isset(src_nid, ng->active_nodes))
+ return true;
+
+ /*
+ * Both source and destination are nodes in active
+ * use by this numa group. Maximize memory bandwidth
+ * by migrating from more heavily used nodes, to less
+ * heavily used ones, spreading the load around.
+ * Use a 1/4 hysteresis to avoid spurious page movement.
+ */
+ return group_faults(p, dst_nid) < (group_faults(p, src_nid) * 3 / 4);
+}
+
static unsigned long weighted_cpuload(const int cpu);
static unsigned long source_load(int cpu, int type);
static unsigned long target_load(int cpu, int type);
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 052abac..050962b 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -2405,6 +2405,9 @@ int mpol_misplaced(struct page *page, struct vm_area_struct *vma, unsigned long
if (!cpupid_pid_unset(last_cpupid) && cpupid_to_nid(last_cpupid) != thisnid) {
goto out;
}
+
+ if (!should_numa_migrate(current, last_cpupid, curnid, polnid))
+ goto out;
}

if (curnid != polnid)
--
1.8.4.2

Peter Zijlstra

Jan 17, 2014, 10:00:02 AM

On Fri, Jan 17, 2014 at 01:17:36AM -0500, ri...@redhat.com wrote:
> + /*
> + * Normalize the faults_from, so all tasks in a group
> + * count according to CPU use, instead of by the raw
> + * number of faults. This prevents the situation where
> + * the garbage collector totally dominates the stats,
> + * and the access patterns of the worker threads are
> + * ignored.
> + */

Instead of focusing on the one example (GC) here, I would suggest
saying something along the lines of: Tasks with little runtime have
little over-all impact on throughput and thus their faults are less
important.

Rik van Riel

Jan 17, 2014, 3:00:03 PM

On 01/17/2014 04:54 AM, Peter Zijlstra wrote:
> On Fri, Jan 17, 2014 at 01:17:36AM -0500, ri...@redhat.com wrote:
>> + /*
>> + * Normalize the faults_from, so all tasks in a group
>> + * count according to CPU use, instead of by the raw
>> + * number of faults. This prevents the situation where
>> + * the garbage collector totally dominates the stats,
>> + * and the access patterns of the worker threads are
>> + * ignored.
>> + */
>
> Instead of focusing on the one example (GC) here, I would suggest
> saying something along the lines of: Tasks with little runtime have
> little over-all impact on throughput and thus their faults are less
> important.

Thanks for the review, I have updated the comment
accordingly.

Does anybody else have comments on this series, or should
I start begging for Acked-by: and Reviewed-by: lines? :)

--
All rights reversed

Peter Zijlstra

Jan 17, 2014, 3:10:03 PM

On Fri, Jan 17, 2014 at 09:51:44AM -0500, Rik van Riel wrote:
> On 01/17/2014 04:54 AM, Peter Zijlstra wrote:
> > On Fri, Jan 17, 2014 at 01:17:36AM -0500, ri...@redhat.com wrote:
> >> + /*
> >> + * Normalize the faults_from, so all tasks in a group
> >> + * count according to CPU use, instead of by the raw
> >> + * number of faults. This prevents the situation where
> >> + * the garbage collector totally dominates the stats,
> >> + * and the access patterns of the worker threads are
> >> + * ignored.
> >> + */
> >
> > Instead of focusing on the one example (GC) here, I would suggest
> > saying something along the lines of: Tasks with little runtime have
> > little over-all impact on throughput and thus their faults are less
> > important.
>
> Thanks for the review, I have updated the comment
> accordingly.
>
> Does anybody else have comments on this series, or should
> I start begging for Acked-by: and Reviewed-by: lines? :)

I still need to get my head around part of the previous patches, and I
suppose Mel might have a look, give it a few days ;-)

Rik van Riel

Jan 17, 2014, 6:50:02 PM

On 01/17/2014 10:04 AM, Peter Zijlstra wrote:
> On Fri, Jan 17, 2014 at 09:51:44AM -0500, Rik van Riel wrote:
>> On 01/17/2014 04:54 AM, Peter Zijlstra wrote:
>>> On Fri, Jan 17, 2014 at 01:17:36AM -0500, ri...@redhat.com wrote:
>>>> + /*
>>>> + * Normalize the faults_from, so all tasks in a group
>>>> + * count according to CPU use, instead of by the raw
>>>> + * number of faults. This prevents the situation where
>>>> + * the garbage collector totally dominates the stats,
>>>> + * and the access patterns of the worker threads are
>>>> + * ignored.
>>>> + */
>>>
>>> Instead of focusing on the one example (GC) here, I would suggest
>>> saying something along the lines of: Tasks with little runtime have
>>> little over-all impact on throughput and thus their faults are less
>>> important.
>>
>> Thanks for the review, I have updated the comment
>> accordingly.
>>
>> Does anybody else have comments on this series, or should
>> I start begging for Acked-by: and Reviewed-by: lines? :)
>
> I still need to get my head around part of the previous patches, and I
> suppose Mel might have a look, give it a few days ;-)

Well, you didn't spot the divide by zero that Vinod somehow
managed to trigger :)

I am also going to add another patch that prevents the temporary
values from "p->numa_faults[i] >>= 1;" becoming visible, by using
local variables for those calculations.

Expect a v2 soonish.

--
All rights reversed

ri...@redhat.com

Jan 17, 2014, 9:20:01 PM

From: Rik van Riel <ri...@redhat.com>

Being able to see how the active nodemask changes over time, and why,
can be quite useful.

Cc: Peter Zijlstra <pet...@infradead.org>
Cc: Mel Gorman <mgo...@suse.de>
Cc: Ingo Molnar <mi...@redhat.com>
Cc: Chegu Vinod <chegu...@hp.com>
Signed-off-by: Rik van Riel <ri...@redhat.com>
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index aa680e2..3551009 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1300,10 +1300,14 @@ static void update_numa_active_node_mask(struct task_struct *p)
faults = numa_group->faults_from[task_faults_idx(nid, 0)] +
numa_group->faults_from[task_faults_idx(nid, 1)];
if (!node_isset(nid, numa_group->active_nodes)) {
- if (faults > max_faults * 4 / 10)
+ if (faults > max_faults * 4 / 10) {
+ trace_update_numa_active_nodes_mask(current->pid, numa_group->gid, nid, true, faults, max_faults);
node_set(nid, numa_group->active_nodes);
- } else if (faults < max_faults * 2 / 10)
+ }
+ } else if (faults < max_faults * 2 / 10) {
+ trace_update_numa_active_nodes_mask(current->pid, numa_group->gid, nid, false, faults, max_faults);
node_clear(nid, numa_group->active_nodes);
+ }
}
}

--
1.8.4.2

ri...@redhat.com

Jan 17, 2014, 9:20:02 PM

From: Rik van Riel <ri...@redhat.com>

Track which nodes NUMA faults are triggered from, in other words
the CPUs on which the NUMA faults happened. This uses a similar
mechanism to what is used to track the memory involved in numa faults.

The next patches use this to build up a bitmap of which nodes a
workload is actively running on.

Cc: Peter Zijlstra <pet...@infradead.org>
Cc: Mel Gorman <mgo...@suse.de>
Cc: Ingo Molnar <mi...@redhat.com>
Cc: Chegu Vinod <chegu...@hp.com>
Signed-off-by: Rik van Riel <ri...@redhat.com>
---
include/linux/sched.h | 10 ++++++++--
kernel/sched/fair.c | 30 +++++++++++++++++++++++-------
2 files changed, 31 insertions(+), 9 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 97efba4..a9f7f05 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1492,6 +1492,14 @@ struct task_struct {
unsigned long *numa_faults_buffer;

/*
+ * Track the nodes where faults are incurred. This is not very
+ * interesting on a per-task basis, but it helps with smarter
+ * numa memory placement for groups of processes.
+ */
+ unsigned long *numa_faults_from;
+ unsigned long *numa_faults_from_buffer;
+
+ /*
* numa_faults_locality tracks if faults recorded during the last
* scan window were remote/local. The task scan period is adapted
* based on the locality of the faults with different weights
@@ -1594,8 +1602,6 @@ extern void task_numa_fault(int last_node, int node, int pages, int flags);
extern pid_t task_numa_group_id(struct task_struct *p);
extern void set_numabalancing_state(bool enabled);
extern void task_numa_free(struct task_struct *p);
-
-extern unsigned int sysctl_numa_balancing_migrate_deferred;
#else
static inline void task_numa_fault(int last_node, int node, int pages,
int flags)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 41e2176..1945ddc 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -886,6 +886,7 @@ struct numa_group {

struct rcu_head rcu;
unsigned long total_faults;
+ unsigned long *faults_from;
unsigned long faults[0];
};

@@ -1372,10 +1373,11 @@ static void task_numa_placement(struct task_struct *p)
int priv, i;

for (priv = 0; priv < 2; priv++) {
- long diff;
+ long diff, f_diff;

i = task_faults_idx(nid, priv);
diff = -p->numa_faults[i];
+ f_diff = -p->numa_faults_from[i];

/* Decay existing window, copy faults since last scan */
p->numa_faults[i] >>= 1;

ri...@redhat.com

Jan 17, 2014, 9:20:02 PM

The current automatic NUMA balancing code base has issues with
workloads that do not fit on one NUMA node. Page migration is
slowed down, but memory distribution between the nodes where
the workload runs is essentially random, often resulting in a
suboptimal amount of memory bandwidth being available to the
workload.

In order to maximize performance of workloads that do not fit in one NUMA
node, we want to satisfy the following criteria:
1) keep private memory local to each thread
2) avoid excessive NUMA migration of pages
3) distribute shared memory across the active nodes, to
maximize memory bandwidth available to the workload

This patch series identifies the NUMA nodes on which the workload
is actively running, and balances (somewhat lazily) the memory
between those nodes, satisfying the criteria above.

As usual, the series has had some performance testing, but it
could always benefit from more testing, on other systems.

Changes since v1:
- fix divide by zero found by Chegu Vinod
- improve comment, as suggested by Peter Zijlstra
- do stats calculations in task_numa_placement in local variables


Some performance numbers, with two 40-warehouse specjbb instances
on an 8 node system with 10 CPU cores per node, using a pre-cleanup
version of these patches, courtesy of Chegu Vinod:

numactl manual pinning
spec1.txt: throughput = 755900.20 SPECjbb2005 bops
spec2.txt: throughput = 754914.40 SPECjbb2005 bops

NO-pinning results (Automatic NUMA balancing, with patches)
spec1.txt: throughput = 706439.84 SPECjbb2005 bops
spec2.txt: throughput = 729347.75 SPECjbb2005 bops

NO-pinning results (Automatic NUMA balancing, without patches)
spec1.txt: throughput = 667988.47 SPECjbb2005 bops
spec2.txt: throughput = 638220.45 SPECjbb2005 bops

No Automatic NUMA and NO-pinning results
spec1.txt: throughput = 544120.97 SPECjbb2005 bops
spec2.txt: throughput = 453553.41 SPECjbb2005 bops


My own performance numbers are not as relevant, since I have been
running with a more hostile workload on purpose, and I have run
into a scheduler issue that caused the workload to run on only
two of the four NUMA nodes on my test system...

ri...@redhat.com

Jan 17, 2014, 9:20:03 PM

From: Rik van Riel <ri...@redhat.com>

Use the active_nodes nodemask to make smarter decisions on NUMA migrations.

In order to maximize performance of workloads that do not fit in one NUMA
node, we want to satisfy the following criteria:
1) keep private memory local to each thread
2) avoid excessive NUMA migration of pages
3) distribute shared memory across the active nodes, to
maximize memory bandwidth available to the workload

This patch accomplishes that by implementing the following policy for
NUMA migrations:
1) always migrate on a private fault
2) never migrate to a node that is not in the set of active nodes
for the numa_group
3) always migrate from a node outside of the set of active nodes,
to a node that is in that set
4) within the set of active nodes in the numa_group, only migrate
from a node with more NUMA page faults, to a node with fewer
NUMA page faults, with a 25% margin to avoid ping-ponging

This results in most pages of a workload ending up on the actively
used nodes, with reduced ping-ponging of pages between those nodes.

Cc: Peter Zijlstra <pet...@infradead.org>
Cc: Mel Gorman <mgo...@suse.de>
Cc: Ingo Molnar <mi...@redhat.com>
Cc: Chegu Vinod <chegu...@hp.com>
Signed-off-by: Rik van Riel <ri...@redhat.com>
---
include/linux/sched.h | 7 +++++++
kernel/sched/fair.c | 37 +++++++++++++++++++++++++++++++++++++
mm/mempolicy.c | 3 +++
3 files changed, 47 insertions(+)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index a9f7f05..0af6c1a 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1602,6 +1602,8 @@ extern void task_numa_fault(int last_node, int node, int pages, int flags);
extern pid_t task_numa_group_id(struct task_struct *p);
extern void set_numabalancing_state(bool enabled);
extern void task_numa_free(struct task_struct *p);
+extern bool should_numa_migrate(struct task_struct *p, int last_cpupid,
+ int src_nid, int dst_nid);
#else
static inline void task_numa_fault(int last_node, int node, int pages,
int flags)
@@ -1617,6 +1619,11 @@ static inline void set_numabalancing_state(bool enabled)
static inline void task_numa_free(struct task_struct *p)
{
}
+static inline bool should_numa_migrate(struct task_struct *p, int last_cpupid,
+ int src_nid, int dst_nid)
+{
+ return true;
+}
#endif

static inline struct pid *task_pid(struct task_struct *task)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 3551009..8e0a53a 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c

ri...@redhat.com

Jan 17, 2014, 9:20:03 PM

From: Rik van Riel <ri...@redhat.com>

The faults_from statistics are used to maintain an active_nodes nodemask
per numa_group. This allows us to be smarter about when to do numa migrations.

Cc: Peter Zijlstra <pet...@infradead.org>
Cc: Mel Gorman <mgo...@suse.de>
Cc: Ingo Molnar <mi...@redhat.com>
Cc: Chegu Vinod <chegu...@hp.com>
Signed-off-by: Rik van Riel <ri...@redhat.com>
---
kernel/sched/fair.c | 38 ++++++++++++++++++++++++++++++++++++++
1 file changed, 38 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 1945ddc..aa680e2 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -885,6 +885,7 @@ struct numa_group {
struct list_head task_list;

struct rcu_head rcu;
+ nodemask_t active_nodes;
unsigned long total_faults;
unsigned long *faults_from;
unsigned long faults[0];
@@ -1275,6 +1276,38 @@ static void numa_migrate_preferred(struct task_struct *p)
}

/*
+ * Iterate over the nodes from which NUMA hinting faults were triggered, in
+ * other words where the CPUs that incurred NUMA hinting faults are. The
+ * bitmask is used to limit NUMA page migrations, and spread out memory
+ * between the actively used nodes. To prevent flip-flopping, and excessive
+ * page migrations, nodes are added when they cause over 40% of the maximum
+ * number of faults, but only removed when they drop below 20%.
+ */
+static void update_numa_active_node_mask(struct task_struct *p)
+{
+ unsigned long faults, max_faults = 0;
+ struct numa_group *numa_group = p->numa_group;
+ int nid;
+
+ for_each_online_node(nid) {
+ faults = numa_group->faults_from[task_faults_idx(nid, 0)] +
+ numa_group->faults_from[task_faults_idx(nid, 1)];
+ if (faults > max_faults)
+ max_faults = faults;
+ }
+
+ for_each_online_node(nid) {
+ faults = numa_group->faults_from[task_faults_idx(nid, 0)] +
+ numa_group->faults_from[task_faults_idx(nid, 1)];
+ if (!node_isset(nid, numa_group->active_nodes)) {
+ if (faults > max_faults * 4 / 10)
+ node_set(nid, numa_group->active_nodes);
+ } else if (faults < max_faults * 2 / 10)
+ node_clear(nid, numa_group->active_nodes);
+ }
+}
+
+/*
* When adapting the scan rate, the period is divided into NUMA_PERIOD_SLOTS
* increments. The more local the fault statistics are, the higher the scan
* period will be for the next scan window. If local/remote ratio is below
@@ -1416,6 +1449,7 @@ static void task_numa_placement(struct task_struct *p)
update_task_scan_period(p, fault_types[0], fault_types[1]);

if (p->numa_group) {
+ update_numa_active_node_mask(p);
/*
* If the preferred task and group nids are different,
* iterate over the nodes again to find the best place.
@@ -1478,6 +1512,8 @@ static void task_numa_group(struct task_struct *p, int cpupid, int flags,
/* Second half of the array tracks where faults come from */
grp->faults_from = grp->faults + 2 * nr_node_ids;

+ node_set(task_node(current), grp->active_nodes);
+
for (i = 0; i < 4*nr_node_ids; i++)
grp->faults[i] = p->numa_faults[i];

@@ -1547,6 +1583,8 @@ static void task_numa_group(struct task_struct *p, int cpupid, int flags,
my_grp->nr_tasks--;
grp->nr_tasks++;

+ update_numa_active_node_mask(p);
+
spin_unlock(&my_grp->lock);
spin_unlock(&grp->lock);

--
1.8.4.2

ri...@redhat.com

Jan 17, 2014, 9:20:03 PM

From: Rik van Riel <ri...@redhat.com>

Excessive migration of pages can hurt the performance of workloads
that span multiple NUMA nodes. However, it turns out that the
p->numa_migrate_deferred knob is a really big hammer, which does
reduce migration rates, but does not actually help performance.

Now that the second stage of the automatic numa balancing code
has stabilized, it is time to replace the simplistic migration
deferral code with something smarter.

Cc: Peter Zijlstra <pet...@infradead.org>
Cc: Mel Gorman <mgo...@suse.de>
Cc: Ingo Molnar <mi...@redhat.com>
Cc: Chegu Vinod <chegu...@hp.com>
Signed-off-by: Rik van Riel <ri...@redhat.com>
---
include/linux/sched.h | 1 -
kernel/sched/fair.c | 8 --------
kernel/sysctl.c | 7 -------
mm/mempolicy.c | 45 ---------------------------------------------
4 files changed, 61 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 68a0e84..97efba4 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1469,7 +1469,6 @@ struct task_struct {
unsigned int numa_scan_period;
unsigned int numa_scan_period_max;
int numa_preferred_nid;
- int numa_migrate_deferred;
unsigned long numa_migrate_retry;
u64 node_stamp; /* migration stamp */
struct callback_head numa_work;
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 867b0a4..41e2176 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 36cb46c..052abac 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
if (!cpupid_pid_unset(last_cpupid) && cpupid_to_nid(last_cpupid) != thisnid) {
-
- /* See sysctl_numa_balancing_migrate_deferred comment */
- if (!cpupid_match_pid(current, last_cpupid))
- defer_numa_migrate(current);
-
goto out;
}
-
- /*
- * The quadratic filter above reduces extraneous migration
- * of shared pages somewhat. This code reduces it even more,
- * reducing the overhead of page migrations of shared pages.
- * This makes workloads with shared pages rely more on
- * "move task near its memory", and less on "move memory
- * towards its task", which is exactly what we want.
- */
- if (numa_migrate_deferred(current, last_cpupid))
- goto out;
}

if (curnid != polnid)
--
1.8.4.2

ri...@redhat.com

Jan 17, 2014, 9:30:02 PM

From: Rik van Riel <ri...@redhat.com>

The tracepoint has made it abundantly clear that the naive
implementation of the faults_from code has issues.

Specifically, the garbage collector in some workloads will
access orders of magnitude more memory than the threads
that do all the active work. This resulted in the node with
the garbage collector being marked the only active node in
the group.

This issue is avoided if we weigh the statistics by CPU use
of each task in the numa group, instead of by how many faults
each thread has incurred.

To achieve this, we normalize the number of faults to the
fraction of faults that occurred on each node, and then
multiply that fraction by the fraction of CPU time the
task has used since the last time task_numa_placement was
invoked.

This way the nodes in the active node mask will be the ones
where the tasks from the numa group are most actively running,
and the influence of e.g. the garbage collector and other
do-little threads is properly minimized.

Cc: Peter Zijlstra <pet...@infradead.org>
Cc: Mel Gorman <mgo...@suse.de>
Cc: Ingo Molnar <mi...@redhat.com>
Cc: Chegu Vinod <chegu...@hp.com>
Signed-off-by: Rik van Riel <ri...@redhat.com>
---
include/linux/sched.h | 2 ++
kernel/sched/core.c | 2 ++
kernel/sched/fair.c | 48 ++++++++++++++++++++++++++++++++++++++++++++++--
3 files changed, 50 insertions(+), 2 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 0af6c1a..52de567 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1471,6 +1471,8 @@ struct task_struct {
int numa_preferred_nid;
unsigned long numa_migrate_retry;
u64 node_stamp; /* migration stamp */
+ u64 last_task_numa_placement;
+ u64 last_sum_exec_runtime;
struct callback_head numa_work;

struct list_head numa_entry;
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 7f45fd5..9a0908a 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1758,6 +1758,8 @@ static void __sched_fork(unsigned long clone_flags, struct task_struct *p)
p->numa_work.next = &p->numa_work;
p->numa_faults = NULL;
p->numa_faults_buffer = NULL;
+ p->last_task_numa_placement = 0;
+ p->last_sum_exec_runtime = 0;

INIT_LIST_HEAD(&p->numa_entry);
p->numa_group = NULL;
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 8e0a53a..0d395a0 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1422,11 +1422,41 @@ static void update_task_scan_period(struct task_struct *p,
memset(p->numa_faults_locality, 0, sizeof(p->numa_faults_locality));
}

static void task_numa_placement(struct task_struct *p)
{
int seq, nid, max_nid = -1, max_group_nid = -1;
unsigned long max_faults = 0, max_group_faults = 0;
unsigned long fault_types[2] = { 0, 0 };
+ unsigned long total_faults;
+ u64 runtime, period;
spinlock_t *group_lock = NULL;

seq = ACCESS_ONCE(p->mm->numa_scan_seq);
@@ -1435,6 +1465,10 @@ static void task_numa_placement(struct task_struct *p)
p->numa_scan_seq = seq;
p->numa_scan_period_max = task_scan_max(p);

+ total_faults = p->numa_faults_locality[0] +
+ p->numa_faults_locality[1] + 1;
+ runtime = numa_get_avg_runtime(p, &period);
+
/* If the task is part of a group prevent parallel updates to group stats */
if (p->numa_group) {
group_lock = &p->numa_group->lock;
@@ -1447,7 +1481,7 @@ static void task_numa_placement(struct task_struct *p)
int priv, i;

for (priv = 0; priv < 2; priv++) {
- long diff, f_diff;
+ long diff, f_diff, f_weight;

i = task_faults_idx(nid, priv);
diff = -p->numa_faults[i];
@@ -1459,8 +1493,18 @@ static void task_numa_placement(struct task_struct *p)
fault_types[priv] += p->numa_faults_buffer[i];
p->numa_faults_buffer[i] = 0;

+ /*
+ * Normalize the faults_from, so all tasks in a group
+ * count according to CPU use, instead of by the raw
+ * number of faults. Tasks with little runtime have
+ * little over-all impact on throughput, and thus their
+ * faults are less important.
+ */
+ f_weight = (1024 * runtime *
+ p->numa_faults_from_buffer[i]) /
+ (total_faults * period + 1);
p->numa_faults_from[i] >>= 1;
- p->numa_faults_from[i] += p->numa_faults_from_buffer[i];
+ p->numa_faults_from[i] += f_weight;
p->numa_faults_from_buffer[i] = 0;

faults += p->numa_faults[i];
--
1.8.4.2

ri...@redhat.com

Jan 17, 2014, 9:30:02 PM

From: Rik van Riel <ri...@redhat.com>

The current code in task_numa_placement calculates the difference
between the old and the new value, but also temporarily stores half
of the old value in the per-process variables.

The NUMA balancing code looks at those per-process variables, and
having other tasks temporarily see halved statistics could lead to
unwanted numa migrations. This can be avoided by doing all the math
in local variables.

This change also simplifies the code a little.
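
A toy illustration of the window this closes; ordinary variables stand in
for the per-task fields, so it only shows the ordering of the stores, not
an actual concurrent run:

	#include <stdio.h>

	int main(void)
	{
		unsigned long faults = 100, buffer = 40;
		long diff;

		/* old sequence: the decayed value is visible on its own */
		faults >>= 1;       /* a reader here sees 50 ...    */
		faults += buffer;   /* ... rather than the final 90 */
		printf("old sequence ends at %lu\n", faults);

		/* new sequence: compute the delta locally, publish once */
		faults = 100;
		diff = (long)buffer - (long)(faults / 2);  /* 40 - 50 = -10 */
		faults += diff;                            /* single update: 90 */
		printf("new sequence ends at %lu\n", faults);
		return 0;
	}

Both orderings end at the same value; the difference is that the second
never exposes the halved intermediate to other tasks scanning the same
statistics.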

Cc: Peter Zijlstra <pet...@infradead.org>
Cc: Mel Gorman <mgo...@suse.de>
Cc: Ingo Molnar <mi...@redhat.com>
Cc: Chegu Vinod <chegu...@hp.com>
Signed-off-by: Rik van Riel <ri...@redhat.com>
---
kernel/sched/fair.c | 12 ++++--------
1 file changed, 4 insertions(+), 8 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 0d395a0..0f48382 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1484,12 +1484,9 @@ static void task_numa_placement(struct task_struct *p)
long diff, f_diff, f_weight;

i = task_faults_idx(nid, priv);
- diff = -p->numa_faults[i];
- f_diff = -p->numa_faults_from[i];

/* Decay existing window, copy faults since last scan */
- p->numa_faults[i] >>= 1;
- p->numa_faults[i] += p->numa_faults_buffer[i];
+ diff = p->numa_faults_buffer[i] - p->numa_faults[i] / 2;
fault_types[priv] += p->numa_faults_buffer[i];
p->numa_faults_buffer[i] = 0;

@@ -1503,13 +1500,12 @@ static void task_numa_placement(struct task_struct *p)
f_weight = (1024 * runtime *
p->numa_faults_from_buffer[i]) /
(total_faults * period + 1);
- p->numa_faults_from[i] >>= 1;
- p->numa_faults_from[i] += f_weight;
+ f_diff = f_weight - p->numa_faults_from[i] / 2;
p->numa_faults_from_buffer[i] = 0;

+ p->numa_faults[i] += diff;
+ p->numa_faults_from[i] += f_diff;
faults += p->numa_faults[i];
- diff += p->numa_faults[i];
- f_diff += p->numa_faults_from[i];
p->total_numa_faults += diff;
if (p->numa_group) {
/* safe because we can only change our own group */
--
1.8.4.2

Rik van Riel

Jan 18, 2014, 3:40:01 AM

On 01/17/2014 04:12 PM, ri...@redhat.com wrote:
> From: Rik van Riel <ri...@redhat.com>
>
> The current code in task_numa_placement calculates the difference
> between the old and the new value, but also temporarily stores half
> of the old value in the per-process variables.
>
> The NUMA balancing code looks at those per-process variables, and
> having other tasks temporarily see halved statistics could lead to
> unwanted numa migrations. This can be avoided by doing all the math
> in local variables.
>
> This change also simplifies the code a little.

I am seeing what looks like a performance improvement
with this patch, so it is not just a theoretical bug.
The improvement is small, as is to be expected with
such a small race, but with two 32-warehouse specjbb
instances on a 4-node, 10core/20thread per node system,
I see the following change in performance, and reduced
numa page migrations.

Without the patch:
run 1: throughput 367660 367660, migrated 3112982
run 2: throughput 353821 355612, migrated 2881317
run 3: throughput 355027 355027, migrated 3358105
run 4: throughput 354366 354366, migrated 3466687
run 5: throughput 356186 356186, migrated 3152194
run 6: throughput 361431 361431, migrated 3336219
run 7: throughput 354704 354704, migrated 3345418
run 8: throughput 363770 363770, migrated 3642925
run 9: throughput 363380 363380, migrated 3192836
run 10: throughput 358440 358440, migrated 3354028
avg: throughput 358968, migrated 3284271

With the patch:
run 1: throughput 360580 360580, migrated 3169872
run 2: throughput 361303 361303, migrated 3220280
run 3: throughput 367692 367692, migrated 3096093
run 4: throughput 362320 362320, migrated 2981762
run 5: throughput 364201 364201, migrated 3089107
run 6: throughput 364561 364561, migrated 2892364
run 7: throughput 360771 360771, migrated 3086638
run 8: throughput 361530 361530, migrated 2933256
run 9: throughput 365841 365841, migrated 3356944
run 10: throughput 359188 359188, migrated 3394545
avg: throughput 362798, migrated 3122086


--
All rights reversed

Chegu Vinod

Jan 18, 2014, 10:10:02 PM



Acked-by: Chegu Vinod <chegu...@hp.com>

----

Here are some results using the v2 version of the patches
on an 8 socket box using SPECjbb2005 as a workload :

I) Eight 1-socket wide instances (10 warehouse threads each):

                         Without patches    With patches
                         ---------------    ------------
a) numactl pinning results
spec1.txt: throughput = 270620.04 273675.10
spec2.txt: throughput = 274115.33 272845.17
spec3.txt: throughput = 277830.09 272057.33
spec4.txt: throughput = 270898.52 270670.54
spec5.txt: throughput = 270397.30 270906.82
spec6.txt: throughput = 270451.93 268217.55
spec7.txt: throughput = 269511.07 269354.46
spec8.txt: throughput = 269386.06 270540.00

b)Automatic NUMA balancing results
spec1.txt: throughput = 244333.41 248072.72
spec2.txt: throughput = 252166.99 251818.30
spec3.txt: throughput = 251365.58 258266.24
spec4.txt: throughput = 245247.91 256873.51
spec5.txt: throughput = 245579.68 247743.18
spec6.txt: throughput = 249767.38 256285.86
spec7.txt: throughput = 244570.64 255343.99
spec8.txt: throughput = 245703.60 254434.36

c)NO Automatic NUMA balancing and NO-pinning results
spec1.txt: throughput = 132959.73 136957.12
spec2.txt: throughput = 127937.11 129326.23
spec3.txt: throughput = 130697.10 125772.11
spec4.txt: throughput = 134978.49 141607.58
spec5.txt: throughput = 127574.34 126748.18
spec6.txt: throughput = 138699.99 128597.95
spec7.txt: throughput = 133247.25 137344.57
spec8.txt: throughput = 124548.00 139040.98

------

II) Four 2-socket wide instances (20 warehouse threads each):

                         Without patches    With patches
                         ---------------    ------------
a) numactl pinning results
spec1.txt: throughput = 479931.16 472467.58
spec2.txt: throughput = 466652.15 466237.10
spec3.txt: throughput = 473591.51 466891.98
spec4.txt: throughput = 462346.62 466891.98

b)Automatic NUMA balancing results
spec1.txt: throughput = 383758.29 437489.99
spec2.txt: throughput = 370926.06 435692.97
spec3.txt: throughput = 368872.72 444615.08
spec4.txt: throughput = 404422.82 435236.20

c)NO Automatic NUMA balancing and NO-pinning results
spec1.txt: throughput = 252752.12 231762.30
spec2.txt: throughput = 255391.51 253250.95
spec3.txt: throughput = 264764.00 263721.03
spec4.txt: throughput = 254833.39 242892.72

------

III) Two 4-socket wide instances(40 warehouse threads each)

                              Without patches      With patches
                              ---------------      ------------
a) numactl pinning results
spec1.txt: throughput = 771340.84 769039.53
spec2.txt: throughput = 762184.48 760745.65

b)Automatic NUMA balancing results
spec1.txt: throughput = 667182.98 720197.01
spec2.txt: throughput = 692564.11 739872.51

c)NO Automatic NUMA balancing and NO-pinning results
spec1.txt: throughput = 457079.28 467199.30
spec2.txt: throughput = 479790.47 456279.07

-----

IV) One 8-socket wide instance(80 warehouse threads)

                              Without patches      With patches
                              ---------------      ------------
a) numactl pinning results
spec1.txt: throughput = 982113.03 985836.96

b)Automatic NUMA balancing results
spec1.txt: throughput = 755615.94 843632.09

c)NO Automatic NUMA balancing and NO-pinning results
spec1.txt: throughput = 671583.26 661768.54

Peter Zijlstra

unread,
Jan 20, 2014, 4:40:02 PM1/20/14
to
Why not use 6/16 and 3/16 resp.? That avoids an actual division.
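
For illustration, a minimal stand-alone sketch of the two threshold styles
(the max_faults value is made up; the 6/16 and 3/16 variant turns the divide
into a shift):

#include <stdio.h>

int main(void)
{
	unsigned long max_faults = 1000;	/* made-up value */

	/* current thresholds: 40% to add a node, 20% to remove it */
	unsigned long add_div = max_faults * 4 / 10;
	unsigned long del_div = max_faults * 2 / 10;

	/* suggested thresholds: 6/16 and 3/16, so the divide becomes a shift */
	unsigned long add_shift = (max_faults * 6) >> 4;
	unsigned long del_shift = (max_faults * 3) >> 4;

	printf("add: %lu vs %lu, remove: %lu vs %lu\n",
	       add_div, add_shift, del_div, del_shift);
	return 0;
}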

Peter Zijlstra

unread,
Jan 20, 2014, 5:00:01 PM1/20/14
to
On Fri, Jan 17, 2014 at 04:12:08PM -0500, ri...@redhat.com wrote:
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index 0af6c1a..52de567 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -1471,6 +1471,8 @@ struct task_struct {
> int numa_preferred_nid;
> unsigned long numa_migrate_retry;
> u64 node_stamp; /* migration stamp */
> + u64 last_task_numa_placement;
> + u64 last_sum_exec_runtime;
> struct callback_head numa_work;
>
> struct list_head numa_entry;

Have you tried what happens if you use p->se.avg.runnable_avg_sum /
p->se.avg.runnable_avg_period instead? If that also works it avoids
growing the datastructures and keeping of yet another set of runtime
stats.
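
For illustration only, a user-space sketch of the idea (the struct and values
are stand-ins for the scheduler's runnable averages, not the real fields):

#include <stdio.h>

/* stand-in for the p->se.avg fields referred to above */
struct sched_avg_sketch {
	unsigned long runnable_avg_sum;		/* time spent runnable */
	unsigned long runnable_avg_period;	/* total tracked period */
};

int main(void)
{
	struct sched_avg_sketch avg = { .runnable_avg_sum = 750,
					.runnable_avg_period = 1000 };

	/* fixed-point CPU-use fraction, scaled by 1024; no extra stats kept */
	unsigned long cpu_use = (avg.runnable_avg_sum << 10) /
				(avg.runnable_avg_period + 1);

	printf("cpu use ~= %lu/1024\n", cpu_use);
	return 0;
}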

Peter Zijlstra

unread,
Jan 20, 2014, 5:00:02 PM1/20/14
to
On Fri, Jan 17, 2014 at 04:12:05PM -0500, ri...@redhat.com wrote:
> /*
> + * Iterate over the nodes from which NUMA hinting faults were triggered, in
> + * other words where the CPUs that incurred NUMA hinting faults are. The
> + * bitmask is used to limit NUMA page migrations, and spread out memory
> + * between the actively used nodes. To prevent flip-flopping, and excessive
> + * page migrations, nodes are added when they cause over 40% of the maximum
> + * number of faults, but only removed when they drop below 20%.
> + */

Maybe break the above into two paragraphs for added readability.

Also, I think this might be a good spot to explain why you need the
second fault metric -- that is, why can't we create the interleave mask
from the existing memory location faults.

> +static void update_numa_active_node_mask(struct task_struct *p)
> +{
> + unsigned long faults, max_faults = 0;
> + struct numa_group *numa_group = p->numa_group;
> + int nid;
> +
> + for_each_online_node(nid) {
> + faults = numa_group->faults_from[task_faults_idx(nid, 0)] +
> + numa_group->faults_from[task_faults_idx(nid, 1)];
> + if (faults > max_faults)
> + max_faults = faults;
> + }
> +
> + for_each_online_node(nid) {
> + faults = numa_group->faults_from[task_faults_idx(nid, 0)] +
> + numa_group->faults_from[task_faults_idx(nid, 1)];
> + if (!node_isset(nid, numa_group->active_nodes)) {
> + if (faults > max_faults * 4 / 10)
> + node_set(nid, numa_group->active_nodes);
> + } else if (faults < max_faults * 2 / 10)
> + node_clear(nid, numa_group->active_nodes);
> + }
> +}

Peter Zijlstra

unread,
Jan 20, 2014, 5:00:02 PM1/20/14
to
On Fri, Jan 17, 2014 at 04:12:06PM -0500, ri...@redhat.com wrote:
> From: Rik van Riel <ri...@redhat.com>
>
> Being able to see how the active nodemask changes over time, and why,
> can be quite useful.
>
> Cc: Peter Zijlstra <pet...@infradead.org>
> Cc: Mel Gorman <mgo...@suse.de>
> Cc: Ingo Molnar <mi...@redhat.com>
> Cc: Chegu Vinod <chegu...@hp.com>
> Signed-off-by: Rik van Riel <ri...@redhat.com>
> ---
> include/trace/events/sched.h | 34 ++++++++++++++++++++++++++++++++++
> kernel/sched/fair.c | 8 ++++++--
> 2 files changed, 40 insertions(+), 2 deletions(-)
>
> diff --git a/include/trace/events/sched.h b/include/trace/events/sched.h
> index 67e1bbf..91726b6 100644
> --- a/include/trace/events/sched.h
> +++ b/include/trace/events/sched.h
> @@ -530,6 +530,40 @@ TRACE_EVENT(sched_swap_numa,
> __entry->dst_pid, __entry->dst_tgid, __entry->dst_ngid,
> __entry->dst_cpu, __entry->dst_nid)
> );
> +
> +TRACE_EVENT(update_numa_active_nodes_mask,

Please stick to the sched_ naming for these things.

Ideally we'd rename the sysctls too :/

> +++ b/kernel/sched/fair.c
> @@ -1300,10 +1300,14 @@ static void update_numa_active_node_mask(struct task_struct *p)
> faults = numa_group->faults_from[task_faults_idx(nid, 0)] +
> numa_group->faults_from[task_faults_idx(nid, 1)];
> if (!node_isset(nid, numa_group->active_nodes)) {
> - if (faults > max_faults * 4 / 10)
> + if (faults > max_faults * 4 / 10) {
> + trace_update_numa_active_nodes_mask(current->pid, numa_group->gid, nid, true, faults, max_faults);

While I think the tracepoint hookery is smart enough to avoid evaluating
arguments when they're disabled, it might be best to simply pass:
current and numa_group and do the dereference in fast_assign().

That said, this is the first and only numa tracepoint, I'm not sure why
this qualifies and other metrics do not.

Rik van Riel

unread,
Jan 20, 2014, 7:00:02 PM1/20/14
to
On 01/20/2014 11:52 AM, Peter Zijlstra wrote:
> On Fri, Jan 17, 2014 at 04:12:06PM -0500, ri...@redhat.com wrote:

>> +++ b/kernel/sched/fair.c
>> @@ -1300,10 +1300,14 @@ static void update_numa_active_node_mask(struct task_struct *p)
>> faults = numa_group->faults_from[task_faults_idx(nid, 0)] +
>> numa_group->faults_from[task_faults_idx(nid, 1)];
>> if (!node_isset(nid, numa_group->active_nodes)) {
>> - if (faults > max_faults * 4 / 10)
>> + if (faults > max_faults * 4 / 10) {
>> + trace_update_numa_active_nodes_mask(current->pid, numa_group->gid, nid, true, faults, max_faults);
>
> While I think the tracepoint hookery is smart enough to avoid evaluating
> arguments when they're disabled, it might be best to simply pass:
> current and numa_group and do the dereference in fast_assign().
>
> That said, this is the first and only numa tracepoint, I'm not sure why
> this qualifies and other metrics do not.

It's there because I needed it in development.

If you think it is not merge material, I would be comfortable
leaving it out.

--
All rights reversed

Steven Rostedt

unread,
Jan 20, 2014, 7:10:01 PM1/20/14
to
On Mon, 20 Jan 2014 17:52:05 +0100
Peter Zijlstra <pet...@infradead.org> wrote:

> On Fri, Jan 17, 2014 at 04:12:06PM -0500, ri...@redhat.com wrote:
> > From: Rik van Riel <ri...@redhat.com>
> >

> > +++ b/kernel/sched/fair.c
> > @@ -1300,10 +1300,14 @@ static void update_numa_active_node_mask(struct task_struct *p)
> > faults = numa_group->faults_from[task_faults_idx(nid, 0)] +
> > numa_group->faults_from[task_faults_idx(nid, 1)];
> > if (!node_isset(nid, numa_group->active_nodes)) {
> > - if (faults > max_faults * 4 / 10)
> > + if (faults > max_faults * 4 / 10) {
> > + trace_update_numa_active_nodes_mask(current->pid, numa_group->gid, nid, true, faults, max_faults);
>
> While I think the tracepoint hookery is smart enough to avoid evaluating
> arguments when they're disabled, it might be best to simply pass:
> current and numa_group and do the dereference in fast_assign().

It's really up to gcc to optimize it. But that said, it is more
efficient to just pass the pointer and do the dereferencing in the
fast_assign(). At least it keeps any bad optimization in gcc from
infecting the tracepoint caller.

It also makes it easier to get other information if you want to later
extend that tracepoint.

Does this tracepoint always use current? If so, why bother passing it
in?

-- Steve

Rik van Riel

unread,
Jan 20, 2014, 7:10:01 PM1/20/14
to
That is what I started out with, and the results were not
as stable as with this calculation.

Having said that, I did that before I came up with patch 7/7,
so maybe the effect is no longer as pronounced as it was before...

I can send in a simplified version, if you prefer.

--
All rights reversed

Peter Zijlstra

unread,
Jan 20, 2014, 7:20:01 PM1/20/14
to
On Mon, Jan 20, 2014 at 02:02:32PM -0500, Rik van Riel wrote:
> That is what I started out with, and the results were not
> as stable as with this calculation.
>
> Having said that, I did that before I came up with patch 7/7,
> so maybe the effect would no longer be as pronounced any more
> as it was before...
>
> I can send in a simplified version, if you prefer.

If you could retry with 7/7, I don't mind adding the extra stats too
much, but it would be nice if we can avoid it.

ri...@redhat.com

unread,
Jan 20, 2014, 7:30:02 PM1/20/14
to
The current automatic NUMA balancing code base has issues with
workloads that do not fit on one NUMA node. Page migration is
slowed down, but memory distribution between the nodes where
the workload runs is essentially random, often resulting in a
suboptimal amount of memory bandwidth being available to the
workload.

In order to maximize performance of workloads that do not fit in one NUMA
node, we want to satisfy the following criteria:
1) keep private memory local to each thread
2) avoid excessive NUMA migration of pages
3) distribute shared memory across the active nodes, to
maximize memory bandwidth available to the workload

This patch series identifies the NUMA nodes on which the workload
is actively running, and balances (somewhat lazily) the memory
between those nodes, satisfying the criteria above.

As usual, the series has had some performance testing, but it
could always benefit from more testing, on other systems.

Changes since v2:
- dropped tracepoint (for now?)
- implement obvious improvements suggested by Peter
- use the scheduler maintained CPU use statistics, drop
the NUMA specific ones for now. We can add those later
if they turn out to be beneficial
Changes since v1:
- fix divide by zero found by Chegu Vinod
- improve comment, as suggested by Peter Zijlstra
- do stats calculations in task_numa_placement in local variables


Some performance numbers, with two 40-warehouse specjbb instances
on an 8 node system with 10 CPU cores per node, using a pre-cleanup
version of these patches, courtesy of Chegu Vinod:

numactl manual pinning
spec1.txt: throughput = 755900.20 SPECjbb2005 bops
spec2.txt: throughput = 754914.40 SPECjbb2005 bops

NO-pinning results (Automatic NUMA balancing, with patches)
spec1.txt: throughput = 706439.84 SPECjbb2005 bops
spec2.txt: throughput = 729347.75 SPECjbb2005 bops

NO-pinning results (Automatic NUMA balancing, without patches)
spec1.txt: throughput = 667988.47 SPECjbb2005 bops
spec2.txt: throughput = 638220.45 SPECjbb2005 bops

No Automatic NUMA and NO-pinning results
spec1.txt: throughput = 544120.97 SPECjbb2005 bops
spec2.txt: throughput = 453553.41 SPECjbb2005 bops


My own performance numbers are not as relevant, since I have been
running with a more hostile workload on purpose, and I have run
into a scheduler issue that caused the workload to run on only
two of the four NUMA nodes on my test system...

ri...@redhat.com

unread,
Jan 20, 2014, 7:30:02 PM1/20/14
to
From: Rik van Riel <ri...@redhat.com>

Use the active_nodes nodemask to make smarter decisions on NUMA migrations.

In order to maximize performance of workloads that do not fit in one NUMA
node, we want to satisfy the following criteria:
1) keep private memory local to each thread
2) avoid excessive NUMA migration of pages
3) distribute shared memory across the active nodes, to
maximize memory bandwidth available to the workload

This patch accomplishes that by implementing the following policy for
NUMA migrations:
1) always migrate on a private fault
2) never migrate to a node that is not in the set of active nodes
for the numa_group
3) always migrate from a node outside of the set of active nodes,
to a node that is in that set
4) within the set of active nodes in the numa_group, only migrate
from a node with more NUMA page faults, to a node with fewer
NUMA page faults, with a 25% margin to avoid ping-ponging

This results in most pages of a workload ending up on the actively
used nodes, with reduced ping-ponging of pages between those nodes.

Cc: Peter Zijlstra <pet...@infradead.org>
Cc: Mel Gorman <mgo...@suse.de>
Cc: Ingo Molnar <mi...@redhat.com>
Cc: Chegu Vinod <chegu...@hp.com>
Signed-off-by: Rik van Riel <ri...@redhat.com>
---
include/linux/sched.h | 7 +++++++
kernel/sched/fair.c | 37 +++++++++++++++++++++++++++++++++++++
mm/mempolicy.c | 3 +++
3 files changed, 47 insertions(+)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index a9f7f05..0af6c1a 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1602,6 +1602,8 @@ extern void task_numa_fault(int last_node, int node, int pages, int flags);
extern pid_t task_numa_group_id(struct task_struct *p);
extern void set_numabalancing_state(bool enabled);
extern void task_numa_free(struct task_struct *p);
+extern bool should_numa_migrate(struct task_struct *p, int last_cpupid,
+ int src_nid, int dst_nid);
#else
static inline void task_numa_fault(int last_node, int node, int pages,
int flags)
@@ -1617,6 +1619,11 @@ static inline void set_numabalancing_state(bool enabled)
static inline void task_numa_free(struct task_struct *p)
{
}
+static inline bool should_numa_migrate(struct task_struct *p, int last_cpupid,
+ int src_nid, int dst_nid)
+{
+ return true;
+}
#endif

static inline struct pid *task_pid(struct task_struct *task)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index ea8b2ae..ea873b6 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 052abac..050962b 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -2405,6 +2405,9 @@ int mpol_misplaced(struct page *page, struct vm_area_struct *vma, unsigned long
if (!cpupid_pid_unset(last_cpupid) && cpupid_to_nid(last_cpupid) != thisnid) {
goto out;
}
+
+ if (!should_numa_migrate(current, last_cpupid, curnid, polnid))
+ goto out;
}

if (curnid != polnid)
--
1.8.4.2
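
A sketch of the four rules above as a single helper, reconstructed from the
fragments quoted later in the thread (treat details as approximate, not as
the exact hunk):

bool should_numa_migrate(struct task_struct *p, int last_cpupid,
			 int src_nid, int dst_nid)
{
	struct numa_group *ng = p->numa_group;

	/* 1) always migrate on a private fault */
	if (cpupid_match_pid(p, last_cpupid))
		return true;

	/* shared fault, but no numa_group has been set up yet */
	if (!ng)
		return true;

	/* 2) never migrate to a node outside the group's active set */
	if (!node_isset(dst_nid, ng->active_nodes))
		return false;

	/* 3) always migrate from an inactive node to an active one */
	if (!node_isset(src_nid, ng->active_nodes))
		return true;

	/*
	 * 4) within the active set, only migrate from a busier node to a
	 * less busy one, with a 25% margin to avoid ping-ponging
	 */
	return group_faults(p, dst_nid) < group_faults(p, src_nid) * 3 / 4;
}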

ri...@redhat.com

unread,
Jan 20, 2014, 7:30:02 PM1/20/14
to
From: Rik van Riel <ri...@redhat.com>

The tracepoint has made it abundantly clear that the naive
implementation of the faults_from code has issues.

Specifically, the garbage collector in some workloads will
access orders of magnitudes more memory than the threads
that do all the active work. This resulted in the node with
the garbage collector being marked the only active node in
the group.

This issue is avoided if we weigh the statistics by CPU use
of each task in the numa group, instead of by how many faults
each thread has incurred.

To achieve this, we normalize the number of faults to the
fraction of faults that occurred on each node, and then
multiply that fraction by the fraction of CPU time the
task has used since the last time task_numa_placement was
invoked.

This way the nodes in the active node mask will be the ones
where the tasks from the numa group are most actively running,
and the influence of eg. the garbage collector and other
do-little threads is properly minimized.
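
To put rough, purely illustrative numbers on it: a thread that used 25% of
its tracked CPU time and incurred 300 of its 1000 faults from node 0 would
add a weight of about 16384 * 0.25 * 300/1000 ~= 1228 for that node, while a
garbage-collector-style thread with the same fault distribution but only 1%
CPU use would add only about 49, so it barely moves the active node mask.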

Cc: Peter Zijlstra <pet...@infradead.org>
Cc: Mel Gorman <mgo...@suse.de>
Cc: Ingo Molnar <mi...@redhat.com>
Cc: Chegu Vinod <chegu...@hp.com>
Signed-off-by: Rik van Riel <ri...@redhat.com>
---
kernel/sched/fair.c | 21 +++++++++++++++++++--
1 file changed, 19 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index ea873b6..203877d 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1426,6 +1426,8 @@ static void task_numa_placement(struct task_struct *p)
int seq, nid, max_nid = -1, max_group_nid = -1;
unsigned long max_faults = 0, max_group_faults = 0;
unsigned long fault_types[2] = { 0, 0 };
+ unsigned long total_faults;
+ u64 runtime, period;
spinlock_t *group_lock = NULL;

seq = ACCESS_ONCE(p->mm->numa_scan_seq);
@@ -1434,6 +1436,11 @@ static void task_numa_placement(struct task_struct *p)
p->numa_scan_seq = seq;
p->numa_scan_period_max = task_scan_max(p);

+ total_faults = p->numa_faults_locality[0] +
+ p->numa_faults_locality[1] + 1;
+ runtime = p->se.avg.runnable_avg_sum;
+ period = p->se.avg.runnable_avg_period;
+
/* If the task is part of a group prevent parallel updates to group stats */
if (p->numa_group) {
group_lock = &p->numa_group->lock;
@@ -1446,7 +1453,7 @@ static void task_numa_placement(struct task_struct *p)
int priv, i;

for (priv = 0; priv < 2; priv++) {
- long diff, f_diff;
+ long diff, f_diff, f_weight;

i = task_faults_idx(nid, priv);
diff = -p->numa_faults[i];
@@ -1458,8 +1465,18 @@ static void task_numa_placement(struct task_struct *p)
fault_types[priv] += p->numa_faults_buffer[i];
p->numa_faults_buffer[i] = 0;

+ /*
+ * Normalize the faults_from, so all tasks in a group
+ * count according to CPU use, instead of by the raw
+ * number of faults. Tasks with little runtime have
+ * little over-all impact on throughput, and thus their
+ * faults are less important.
+ */
+ f_weight = (16384 * runtime *
+ p->numa_faults_from_buffer[i]) /
+ (total_faults * period + 1);
p->numa_faults_from[i] >>= 1;
- p->numa_faults_from[i] += p->numa_faults_from_buffer[i];
+ p->numa_faults_from[i] += f_weight;
p->numa_faults_from_buffer[i] = 0;

faults += p->numa_faults[i];
--
1.8.4.2

ri...@redhat.com

unread,
Jan 20, 2014, 7:30:02 PM1/20/14
to
From: Rik van Riel <ri...@redhat.com>

Track which nodes NUMA faults are triggered from, in other words
the CPUs on which the NUMA faults happened. This uses a similar
mechanism to what is used to track the memory involved in numa faults.

The next patches use this to build up a bitmap of which nodes a
workload is actively running on.

Cc: Peter Zijlstra <pet...@infradead.org>
Cc: Mel Gorman <mgo...@suse.de>
Cc: Ingo Molnar <mi...@redhat.com>
Cc: Chegu Vinod <chegu...@hp.com>
Signed-off-by: Rik van Riel <ri...@redhat.com>
---
include/linux/sched.h | 10 ++++++++--
kernel/sched/fair.c | 30 +++++++++++++++++++++++-------
2 files changed, 31 insertions(+), 9 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 97efba4..a9f7f05 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1492,6 +1492,14 @@ struct task_struct {
unsigned long *numa_faults_buffer;

/*
+ * Track the nodes where faults are incurred. This is not very
* interesting on a per-task basis, but it helps with smarter
+ * numa memory placement for groups of processes.
+ */
+ unsigned long *numa_faults_from;
+ unsigned long *numa_faults_from_buffer;
+
+ /*
* numa_faults_locality tracks if faults recorded during the last
* scan window were remote/local. The task scan period is adapted
* based on the locality of the faults with different weights
@@ -1594,8 +1602,6 @@ extern void task_numa_fault(int last_node, int node, int pages, int flags);
extern pid_t task_numa_group_id(struct task_struct *p);
extern void set_numabalancing_state(bool enabled);
extern void task_numa_free(struct task_struct *p);
-
-extern unsigned int sysctl_numa_balancing_migrate_deferred;
#else
static inline void task_numa_fault(int last_node, int node, int pages,
int flags)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 41e2176..1945ddc 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -886,6 +886,7 @@ struct numa_group {

struct rcu_head rcu;
unsigned long total_faults;
+ unsigned long *faults_from;
unsigned long faults[0];
};

@@ -1372,10 +1373,11 @@ static void task_numa_placement(struct task_struct *p)
int priv, i;

for (priv = 0; priv < 2; priv++) {
- long diff;
+ long diff, f_diff;

i = task_faults_idx(nid, priv);
diff = -p->numa_faults[i];
+ f_diff = -p->numa_faults_from[i];

/* Decay existing window, copy faults since last scan */
p->numa_faults[i] >>= 1;
@@ -1383,12 +1385,18 @@ static void task_numa_placement(struct task_struct *p)
fault_types[priv] += p->numa_faults_buffer[i];
p->numa_faults_buffer[i] = 0;

+ p->numa_faults_from[i] >>= 1;
+ p->numa_faults_from[i] += p->numa_faults_from_buffer[i];
+ p->numa_faults_from_buffer[i] = 0;
+
faults += p->numa_faults[i];
diff += p->numa_faults[i];
+ f_diff += p->numa_faults_from[i];
p->total_numa_faults += diff;
if (p->numa_group) {
/* safe because we can only change our own group */
p->numa_group->faults[i] += diff;
+ p->numa_group->faults_from[i] += f_diff;
p->numa_group->total_faults += diff;
group_faults += p->numa_group->faults[i];
}
@@ -1457,7 +1465,7 @@ static void task_numa_group(struct task_struct *p, int cpupid, int flags,

if (unlikely(!p->numa_group)) {
unsigned int size = sizeof(struct numa_group) +
- 2*nr_node_ids*sizeof(unsigned long);
+ 4*nr_node_ids*sizeof(unsigned long);

grp = kzalloc(size, GFP_KERNEL | __GFP_NOWARN);
if (!grp)
@@ -1467,8 +1475,10 @@ static void task_numa_group(struct task_struct *p, int cpupid, int flags,
spin_lock_init(&grp->lock);
INIT_LIST_HEAD(&grp->task_list);
grp->gid = p->pid;
+ /* Second half of the array tracks where faults come from */
+ grp->faults_from = grp->faults + 2 * nr_node_ids;

- for (i = 0; i < 2*nr_node_ids; i++)
+ for (i = 0; i < 4*nr_node_ids; i++)
grp->faults[i] = p->numa_faults[i];

grp->total_faults = p->total_numa_faults;
@@ -1526,7 +1536,7 @@ static void task_numa_group(struct task_struct *p, int cpupid, int flags,

double_lock(&my_grp->lock, &grp->lock);

- for (i = 0; i < 2*nr_node_ids; i++) {
+ for (i = 0; i < 4*nr_node_ids; i++) {
my_grp->faults[i] -= p->numa_faults[i];
grp->faults[i] += p->numa_faults[i];
}
@@ -1558,7 +1568,7 @@ void task_numa_free(struct task_struct *p)

if (grp) {
spin_lock(&grp->lock);
- for (i = 0; i < 2*nr_node_ids; i++)
+ for (i = 0; i < 4*nr_node_ids; i++)
grp->faults[i] -= p->numa_faults[i];
grp->total_faults -= p->total_numa_faults;

@@ -1571,6 +1581,8 @@ void task_numa_free(struct task_struct *p)

p->numa_faults = NULL;
p->numa_faults_buffer = NULL;
+ p->numa_faults_from = NULL;
+ p->numa_faults_from_buffer = NULL;
kfree(numa_faults);
}

@@ -1581,6 +1593,7 @@ void task_numa_fault(int last_cpupid, int node, int pages, int flags)
{
struct task_struct *p = current;
bool migrated = flags & TNF_MIGRATED;
+ int this_node = task_node(current);
int priv;

if (!numabalancing_enabled)
@@ -1596,7 +1609,7 @@ void task_numa_fault(int last_cpupid, int node, int pages, int flags)

/* Allocate buffer to track faults on a per-node basis */
if (unlikely(!p->numa_faults)) {
- int size = sizeof(*p->numa_faults) * 2 * nr_node_ids;
+ int size = sizeof(*p->numa_faults) * 4 * nr_node_ids;

/* numa_faults and numa_faults_buffer share the allocation */
p->numa_faults = kzalloc(size * 2, GFP_KERNEL|__GFP_NOWARN);
@@ -1604,7 +1617,9 @@ void task_numa_fault(int last_cpupid, int node, int pages, int flags)
return;

BUG_ON(p->numa_faults_buffer);
- p->numa_faults_buffer = p->numa_faults + (2 * nr_node_ids);
+ p->numa_faults_from = p->numa_faults + (2 * nr_node_ids);
+ p->numa_faults_buffer = p->numa_faults + (4 * nr_node_ids);
+ p->numa_faults_from_buffer = p->numa_faults + (6 * nr_node_ids);
p->total_numa_faults = 0;
memset(p->numa_faults_locality, 0, sizeof(p->numa_faults_locality));
}
@@ -1634,6 +1649,7 @@ void task_numa_fault(int last_cpupid, int node, int pages, int flags)
p->numa_pages_migrated += pages;

p->numa_faults_buffer[task_faults_idx(node, priv)] += pages;
+ p->numa_faults_from_buffer[task_faults_idx(this_node, priv)] += pages;
p->numa_faults_locality[!!(flags & TNF_FAULT_LOCAL)] += pages;
}

--
1.8.4.2

ri...@redhat.com

unread,
Jan 20, 2014, 7:30:02 PM1/20/14
to
From: Rik van Riel <ri...@redhat.com>

Excessive migration of pages can hurt the performance of workloads
that span multiple NUMA nodes. However, it turns out that the
p->numa_migrate_deferred knob is a really big hammer, which does
reduce migration rates, but does not actually help performance.

Now that the second stage of the automatic numa balancing code
has stabilized, it is time to replace the simplistic migration
deferral code with something smarter.

Cc: Peter Zijlstra <pet...@infradead.org>
Cc: Mel Gorman <mgo...@suse.de>
Cc: Ingo Molnar <mi...@redhat.com>
Cc: Chegu Vinod <chegu...@hp.com>
Signed-off-by: Rik van Riel <ri...@redhat.com>
---
include/linux/sched.h | 1 -
kernel/sched/fair.c | 8 --------
kernel/sysctl.c | 7 -------
mm/mempolicy.c | 45 ---------------------------------------------
4 files changed, 61 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 68a0e84..97efba4 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1469,7 +1469,6 @@ struct task_struct {
unsigned int numa_scan_period;
unsigned int numa_scan_period_max;
int numa_preferred_nid;
- int numa_migrate_deferred;
unsigned long numa_migrate_retry;
u64 node_stamp; /* migration stamp */
struct callback_head numa_work;
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 867b0a4..41e2176 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 36cb46c..052abac 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
if (!cpupid_pid_unset(last_cpupid) && cpupid_to_nid(last_cpupid) != thisnid) {
-
- /* See sysctl_numa_balancing_migrate_deferred comment */
- if (!cpupid_match_pid(current, last_cpupid))
- defer_numa_migrate(current);
-
goto out;
}
-
- /*
- * The quadratic filter above reduces extraneous migration
- * of shared pages somewhat. This code reduces it even more,
- * reducing the overhead of page migrations of shared pages.
- * This makes workloads with shared pages rely more on
- * "move task near its memory", and less on "move memory
- * towards its task", which is exactly what we want.
- */
- if (numa_migrate_deferred(current, last_cpupid))
- goto out;
}

if (curnid != polnid)
--
1.8.4.2

ri...@redhat.com

unread,
Jan 20, 2014, 7:30:02 PM1/20/14
to
From: Rik van Riel <ri...@redhat.com>

The faults_from statistics are used to maintain an active_nodes nodemask
per numa_group. This allows us to be smarter about when to do numa migrations.

Cc: Peter Zijlstra <pet...@infradead.org>
Cc: Mel Gorman <mgo...@suse.de>
Cc: Ingo Molnar <mi...@redhat.com>
Cc: Chegu Vinod <chegu...@hp.com>
Signed-off-by: Rik van Riel <ri...@redhat.com>
---
kernel/sched/fair.c | 41 +++++++++++++++++++++++++++++++++++++++++
1 file changed, 41 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 1945ddc..ea8b2ae 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -885,6 +885,7 @@ struct numa_group {
struct list_head task_list;

struct rcu_head rcu;
+ nodemask_t active_nodes;
unsigned long total_faults;
unsigned long *faults_from;
unsigned long faults[0];
@@ -1275,6 +1276,41 @@ static void numa_migrate_preferred(struct task_struct *p)
}

/*
+ * Find the nodes on which the workload is actively running. We do this by
+ * tracking the nodes from which NUMA hinting faults are triggered. This can
+ * be different from the set of nodes where the workload's memory is currently
+ * located.
+ *
+ * The bitmask is used to make smarter decisions on when to do NUMA page
+ * migrations. To prevent flip-flopping, and excessive page migrations, nodes
+ * are added when they cause over 6/16 of the maximum number of faults, but
+ * only removed when they drop below 3/16.
+ */
+static void update_numa_active_node_mask(struct task_struct *p)
+{
+ unsigned long faults, max_faults = 0;
+ struct numa_group *numa_group = p->numa_group;
+ int nid;
+
+ for_each_online_node(nid) {
+ faults = numa_group->faults_from[task_faults_idx(nid, 0)] +
+ numa_group->faults_from[task_faults_idx(nid, 1)];
+ if (faults > max_faults)
+ max_faults = faults;
+ }
+
+ for_each_online_node(nid) {
+ faults = numa_group->faults_from[task_faults_idx(nid, 0)] +
+ numa_group->faults_from[task_faults_idx(nid, 1)];
+ if (!node_isset(nid, numa_group->active_nodes)) {
+ if (faults > max_faults * 6 / 16)
+ node_set(nid, numa_group->active_nodes);
+ } else if (faults < max_faults * 3 / 16)
+ node_clear(nid, numa_group->active_nodes);
+ }
+}
+
+/*
* When adapting the scan rate, the period is divided into NUMA_PERIOD_SLOTS
* increments. The more local the fault statistics are, the higher the scan
* period will be for the next scan window. If local/remote ratio is below
@@ -1416,6 +1452,7 @@ static void task_numa_placement(struct task_struct *p)
update_task_scan_period(p, fault_types[0], fault_types[1]);

if (p->numa_group) {
+ update_numa_active_node_mask(p);
/*
* If the preferred task and group nids are different,
* iterate over the nodes again to find the best place.
@@ -1478,6 +1515,8 @@ static void task_numa_group(struct task_struct *p, int cpupid, int flags,
/* Second half of the array tracks where faults come from */
grp->faults_from = grp->faults + 2 * nr_node_ids;

+ node_set(task_node(current), grp->active_nodes);
+
for (i = 0; i < 4*nr_node_ids; i++)
grp->faults[i] = p->numa_faults[i];

@@ -1547,6 +1586,8 @@ static void task_numa_group(struct task_struct *p, int cpupid, int flags,
my_grp->nr_tasks--;
grp->nr_tasks++;

+ update_numa_active_node_mask(p);
+
spin_unlock(&my_grp->lock);
spin_unlock(&grp->lock);

--
1.8.4.2

ri...@redhat.com

unread,
Jan 20, 2014, 7:30:02 PM1/20/14
to
From: Rik van Riel <ri...@redhat.com>

The current code in task_numa_placement calculates the difference
between the old and the new value, but also temporarily stores half
of the old value in the per-process variables.

The NUMA balancing code looks at those per-process variables, and
having other tasks temporarily see halved statistics could lead to
unwanted numa migrations. This can be avoided by doing all the math
in local variables.

This change also simplifies the code a little.

Cc: Peter Zijlstra <pet...@infradead.org>
Cc: Mel Gorman <mgo...@suse.de>
Cc: Ingo Molnar <mi...@redhat.com>
Cc: Chegu Vinod <chegu...@hp.com>
Signed-off-by: Rik van Riel <ri...@redhat.com>
---
kernel/sched/fair.c | 12 ++++--------
1 file changed, 4 insertions(+), 8 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 203877d..ad30d14 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1456,12 +1456,9 @@ static void task_numa_placement(struct task_struct *p)
long diff, f_diff, f_weight;

i = task_faults_idx(nid, priv);
- diff = -p->numa_faults[i];
- f_diff = -p->numa_faults_from[i];

/* Decay existing window, copy faults since last scan */
- p->numa_faults[i] >>= 1;
- p->numa_faults[i] += p->numa_faults_buffer[i];
+ diff = p->numa_faults_buffer[i] - p->numa_faults[i] / 2;
fault_types[priv] += p->numa_faults_buffer[i];
p->numa_faults_buffer[i] = 0;

@@ -1475,13 +1472,12 @@ static void task_numa_placement(struct task_struct *p)
f_weight = (16384 * runtime *
p->numa_faults_from_buffer[i]) /
(total_faults * period + 1);
- p->numa_faults_from[i] >>= 1;
- p->numa_faults_from[i] += f_weight;
+ f_diff = f_weight - p->numa_faults_from[i] / 2;
p->numa_faults_from_buffer[i] = 0;

+ p->numa_faults[i] += diff;
+ p->numa_faults_from[i] += f_diff;
faults += p->numa_faults[i];
- diff += p->numa_faults[i];
- f_diff += p->numa_faults_from[i];
p->total_numa_faults += diff;
if (p->numa_group) {
/* safe because we can only change our own group */
--
1.8.4.2

Rik van Riel

unread,
Jan 20, 2014, 7:40:01 PM1/20/14
to
OK, will do.

--
All rights reversed

Mel Gorman

unread,
Jan 21, 2014, 12:00:08 PM1/21/14
to
On Mon, Jan 20, 2014 at 02:21:02PM -0500, ri...@redhat.com wrote:
> From: Rik van Riel <ri...@redhat.com>
>
> Excessive migration of pages can hurt the performance of workloads
> that span multiple NUMA nodes. However, it turns out that the
> p->numa_migrate_deferred knob is a really big hammer, which does
> reduce migration rates, but does not actually help performance.
>
> Now that the second stage of the automatic numa balancing code
> has stabilized, it is time to replace the simplistic migration
> deferral code with something smarter.
>
> Cc: Peter Zijlstra <pet...@infradead.org>
> Cc: Mel Gorman <mgo...@suse.de>
> Cc: Ingo Molnar <mi...@redhat.com>
> Cc: Chegu Vinod <chegu...@hp.com>
> Signed-off-by: Rik van Riel <ri...@redhat.com>

When I added a tracepoint to track deferred migration I was surprised how
often it triggered for some workloads. I agree that we want to do something
better because it was a crutch albeit a necessary one at the time.

Note that the knob was not about performance as such, it was about avoiding
worst-case behaviour. We should keep an eye out for bugs that look like
excessive migration on workloads that are not converging. Reintroducing this
hammer would be a last resort for working around the problem.

Finally, the sysctl is documented in Documentation/sysctl/kernel.txt and
this patch should also remove it.

Functionally, the patch looks fine and it's time to reinvestigate whether
the deferral is still necessary, so assuming the documentation gets removed:

Acked-by: Mel Gorman <mgo...@suse.de>

--
Mel Gorman
SUSE Labs

Mel Gorman

unread,
Jan 21, 2014, 12:30:02 PM1/21/14
to
As an aside I wonder if we can derive any useful metric from this. One
potential sanity check would be the number of nodes that a task is incurring
faults on. It would be best if the highest number of faults were recorded
on the node the task is currently running on. After that we either want
to minimise the number of nodes trapping faults or interleave between
all available nodes to avoid applying too much memory pressure on any
one node. For interleaving to always be the best option we would have to
assume that all nodes are equal distance but that would be a reasonable
assumption to start with.
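
As a sketch of that sanity check (not part of the series; helper names as
used elsewhere in the thread, the rest hypothetical):

static int task_fault_node_count(struct task_struct *p)
{
	int nid, nr = 0;

	if (!p->numa_faults)
		return 0;

	/* count the nodes on which this task has recorded hinting faults */
	for_each_online_node(nid) {
		if (p->numa_faults[task_faults_idx(nid, 0)] +
		    p->numa_faults[task_faults_idx(nid, 1)])
			nr++;
	}
	return nr;
}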

> + /*
> * numa_faults_locality tracks if faults recorded during the last
> * scan window were remote/local. The task scan period is adapted
> * based on the locality of the faults with different weights
> @@ -1594,8 +1602,6 @@ extern void task_numa_fault(int last_node, int node, int pages, int flags);
> extern pid_t task_numa_group_id(struct task_struct *p);
> extern void set_numabalancing_state(bool enabled);
> extern void task_numa_free(struct task_struct *p);
> -
> -extern unsigned int sysctl_numa_balancing_migrate_deferred;
> #else
> static inline void task_numa_fault(int last_node, int node, int pages,
> int flags)
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 41e2176..1945ddc 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -886,6 +886,7 @@ struct numa_group {
>
> struct rcu_head rcu;
> unsigned long total_faults;
> + unsigned long *faults_from;
> unsigned long faults[0];
> };
>

faults_from is not ambiguous but it does not tell us a lot of information
either. If I am reading this right then fundamentally this patch means we
are tracking two pieces of information

1. The node the data resided on at the time of the hinting fault (numa_faults)
2. The node the accessing task was residing on at the time of the fault (faults_from)

We should be able to have names that reflect that. How about
memory_faults_locality and cpu_faults_locality with a preparation patch
doing a simple rename for numa_faults and this patch adding
cpu_faults_locality?

It will be tough to be consistent about this but the clearer we are about
making decisions based on task location vs data location the happier we
will be in the long run.
Should we convert that magic number to a define? NR_NUMA_HINT_FAULT_STATS?

> grp = kzalloc(size, GFP_KERNEL | __GFP_NOWARN);
> if (!grp)
> @@ -1467,8 +1475,10 @@ static void task_numa_group(struct task_struct *p, int cpupid, int flags,
> spin_lock_init(&grp->lock);
> INIT_LIST_HEAD(&grp->task_list);
> grp->gid = p->pid;
> + /* Second half of the array tracks where faults come from */
> + grp->faults_from = grp->faults + 2 * nr_node_ids;
>

We have accessors when we overload arrays like this such as task_faults_idx
for example. We should have similar accessors for this in case those
offsets ever change.

> - for (i = 0; i < 2*nr_node_ids; i++)
> + for (i = 0; i < 4*nr_node_ids; i++)
> grp->faults[i] = p->numa_faults[i];
>

This is a little obscure now. Functionally it is copying both numa_faults and
numa_faults_from but a casual reading of that will get confused. Minimally
it needs a comment explaining what is being copied here. Also, why did we
not use memcpy?
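
As a sketch of both suggestions (an accessor for the second half of the
array, plus a memcpy for the copy; names are hypothetical):

/* index into the faults_from half of the combined faults array */
static inline int task_faults_from_idx(int nid, int priv)
{
	return 2 * nr_node_ids + task_faults_idx(nid, priv);
}

/* copy both halves (faults and faults_from) in one explicit step */
memcpy(grp->faults, p->numa_faults,
       4 * nr_node_ids * sizeof(unsigned long));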

> grp->total_faults = p->total_numa_faults;
> @@ -1526,7 +1536,7 @@ static void task_numa_group(struct task_struct *p, int cpupid, int flags,
>
> double_lock(&my_grp->lock, &grp->lock);
>
> - for (i = 0; i < 2*nr_node_ids; i++) {
> + for (i = 0; i < 4*nr_node_ids; i++) {
> my_grp->faults[i] -= p->numa_faults[i];
> grp->faults[i] += p->numa_faults[i];
> }

The same obscure trick is used throughout and I'm not sure how
maintainable that will be. Would it be better to be explicit about this?

/* NUMA hinting faults may be either shared or private faults */
#define NR_NUMA_HINT_FAULT_TYPES 2

/* Track shared and private faults */
#define NR_NUMA_HINT_FAULT_STATS (NR_NUMA_HINT_FAULT_TYPES*2)

this_node and node are similarly ambiguous names. A rename to
data_node and cpu_node would have been clearer.

--
Mel Gorman
SUSE Labs

Mel Gorman

unread,
Jan 21, 2014, 2:20:02 PM1/21/14
to
On Mon, Jan 20, 2014 at 02:21:04PM -0500, ri...@redhat.com wrote:
> From: Rik van Riel <ri...@redhat.com>
>
> The faults_from statistics are used to maintain an active_nodes nodemask
> per numa_group. This allows us to be smarter about when to do numa migrations.
>
> Cc: Peter Zijlstra <pet...@infradead.org>
> Cc: Mel Gorman <mgo...@suse.de>
> Cc: Ingo Molnar <mi...@redhat.com>
> Cc: Chegu Vinod <chegu...@hp.com>
> Signed-off-by: Rik van Riel <ri...@redhat.com>
> ---
> kernel/sched/fair.c | 41 +++++++++++++++++++++++++++++++++++++++++
> 1 file changed, 41 insertions(+)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 1945ddc..ea8b2ae 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -885,6 +885,7 @@ struct numa_group {
> struct list_head task_list;
>
> struct rcu_head rcu;
> + nodemask_t active_nodes;
> unsigned long total_faults;
> unsigned long *faults_from;
> unsigned long faults[0];

It's not a concern for now but in the land of unicorns and ponies we'll
relook at the size of some of these structures and see what can be
optimised.

Similar to my comment on faults_from I think we could potentially evaluate
the fitness of the automatic NUMA balancing feature by looking at the
weight of the active_nodes for a numa_group. If
bitmask_weight(active_nodes) == nr_online_nodes
for all numa_groups in the system then I think it would be an indication
that the algorithm has collapsed.

It's not a comment on the patch itself. We could just do with more
metrics that help analyse this thing when debugging problems.
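
A sketch of that collapse check (nodes_weight() being the nodemask
equivalent of bitmask_weight; the function name is hypothetical):

/* flag a numa_group whose active set has grown to cover every node */
static bool numa_group_collapsed(struct numa_group *ng)
{
	return nodes_weight(ng->active_nodes) == nr_online_nodes;
}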

> @@ -1275,6 +1276,41 @@ static void numa_migrate_preferred(struct task_struct *p)
> }
>
> /*
> + * Find the nodes on which the workload is actively running. We do this by

hmm, it's not the workload though, it's a single NUMA group and a workload
may consist of multiple NUMA groups. For example, in an ideal world and
a JVM-based workload the application threads and the GC threads would be
in different NUMA groups.

The signature is even more misleading because the signature implies that
the function is concerned with tasks. Pass in p->numa_group


> + * tracking the nodes from which NUMA hinting faults are triggered. This can
> + * be different from the set of nodes where the workload's memory is currently
> + * located.
> + *
> + * The bitmask is used to make smarter decisions on when to do NUMA page
> + * migrations, To prevent flip-flopping, and excessive page migrations, nodes
> + * are added when they cause over 6/16 of the maximum number of faults, but
> + * only removed when they drop below 3/16.
> + */

Looking at the values, I'm guessing you did it this way to use shifts
instead of divides. That's fine, but how did you arrive at those values?
Experimentally or just felt reasonable?

> +static void update_numa_active_node_mask(struct task_struct *p)
> +{
> + unsigned long faults, max_faults = 0;
> + struct numa_group *numa_group = p->numa_group;
> + int nid;
> +
> + for_each_online_node(nid) {
> + faults = numa_group->faults_from[task_faults_idx(nid, 0)] +
> + numa_group->faults_from[task_faults_idx(nid, 1)];

task_faults() implements a helper for p->numa_faults equivalent of this.
Just as with the other renaming, it would not hurt to rename task_faults()
to something like task_faults_memory() and add a task_faults_cpu() for
this. The objective again is to be clear about whether we care about CPU
or memory locality information.
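
For illustration, the CPU-side helper suggested here could look something
like the following (name and exact form hypothetical):

static inline unsigned long task_faults_cpu(struct task_struct *p, int nid)
{
	/* sum private and shared faults incurred while running on @nid */
	return p->numa_faults_from[task_faults_idx(nid, 0)] +
	       p->numa_faults_from[task_faults_idx(nid, 1)];
}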

> + if (faults > max_faults)
> + max_faults = faults;
> + }
> +
> + for_each_online_node(nid) {
> + faults = numa_group->faults_from[task_faults_idx(nid, 0)] +
> + numa_group->faults_from[task_faults_idx(nid, 1)];

group_faults would need similar adjustment.

> + if (!node_isset(nid, numa_group->active_nodes)) {
> + if (faults > max_faults * 6 / 16)
> + node_set(nid, numa_group->active_nodes);
> + } else if (faults < max_faults * 3 / 16)
> + node_clear(nid, numa_group->active_nodes);
> + }
> +}
> +

I think there is a subtle problem here

/*
* Be mindful that this is subject to sampling error. As we only have
* data on hinting faults active_nodes may miss a heavily referenced
* node due to the references being to a small number of pages. If
* there is a large linear scanner in the same numa group as a
* task operating on a small amount of memory then the latter task
* may be ignored.
*/

I have no suggestion on how to handle this because we're vulnerable to
sampling errors in a number of places but it does not hurt to be reminded
of that in a few places.

> +/*
> * When adapting the scan rate, the period is divided into NUMA_PERIOD_SLOTS
> * increments. The more local the fault statistics are, the higher the scan
> * period will be for the next scan window. If local/remote ratio is below
> @@ -1416,6 +1452,7 @@ static void task_numa_placement(struct task_struct *p)
> update_task_scan_period(p, fault_types[0], fault_types[1]);
>
> if (p->numa_group) {
> + update_numa_active_node_mask(p);

We are updating that thing once per scan window, that's fine. There is
potentially a wee issue though. If all the tasks in the group are threads
then they share p->mm->numa_scan_seq and only one task does the update
per scan window. If they are different processes then we could be updating
more frequently than necessary.

Functionally it'll be fine but higher cost than necessary. I do not have a
better suggestion right now as superficially a numa_scan_seq per numa_group
would not be a good fit.

If we think of nothing better and the issue is real then we can at least
stick a comment there for future reference.

> /*
> * If the preferred task and group nids are different,
> * iterate over the nodes again to find the best place.
> @@ -1478,6 +1515,8 @@ static void task_numa_group(struct task_struct *p, int cpupid, int flags,
> /* Second half of the array tracks where faults come from */
> grp->faults_from = grp->faults + 2 * nr_node_ids;
>
> + node_set(task_node(current), grp->active_nodes);
> +
> for (i = 0; i < 4*nr_node_ids; i++)
> grp->faults[i] = p->numa_faults[i];
>
> @@ -1547,6 +1586,8 @@ static void task_numa_group(struct task_struct *p, int cpupid, int flags,
> my_grp->nr_tasks--;
> grp->nr_tasks++;
>
> + update_numa_active_node_mask(p);
> +

This may be subtle enough to deserve a comment

/* Tasks have joined/left groups and the active_mask is no longer valid */

If we left a group, we update our new group. Is the old group now out of
date and in need of updating too? If so, then we should update both and
only update the old group if it still has tasks in it.

--
Mel Gorman
SUSE Labs

Mel Gorman

unread,
Jan 21, 2014, 3:10:02 PM1/21/14
to
On Mon, Jan 20, 2014 at 02:21:05PM -0500, ri...@redhat.com wrote:
> From: Rik van Riel <ri...@redhat.com>
>
> Use the active_nodes nodemask to make smarter decisions on NUMA migrations.
>
> In order to maximize performance of workloads that do not fit in one NUMA
> node, we want to satisfy the following criteria:
> 1) keep private memory local to each thread
> 2) avoid excessive NUMA migration of pages
> 3) distribute shared memory across the active nodes, to
> maximize memory bandwidth available to the workload
>
> This patch accomplishes that by implementing the following policy for
> NUMA migrations:
> 1) always migrate on a private fault

Makes sense

> 2) never migrate to a node that is not in the set of active nodes
> for the numa_group

This will work out in every case *except* where we miss an active node
because the task running there is faulting a very small number of pages.
Worth recording that in case we ever see a bug that could be explained
by it.

> 3) always migrate from a node outside of the set of active nodes,
> to a node that is in that set

Clever

A *potential* consequence of this is that we may see large amounts of
migration traffic if we ever implement something that causes tasks to
enter/leave numa groups frequently.

> 4) within the set of active nodes in the numa_group, only migrate
> from a node with more NUMA page faults, to a node with fewer
> NUMA page faults, with a 25% margin to avoid ping-ponging
>

Of the four, this is the highest risk again because we might miss tasks
in an active node due to them accessing a small number of pages.

Not suggesting you change the policy at this point, we should just keep
an eye out for it. It could be argued that a task accessing a small amount
of memory on a large NUMA machine is not a task we care about anyway :/
In light of the memory/data distinction, how about

should_numa_migrate_memory?

> + struct numa_group *ng = p->numa_group;
> +
> + /* Always allow migrate on private faults */
> + if (cpupid_match_pid(p, last_cpupid))
> + return true;
> +

We now have the two-stage filter detection in mpol_misplaced and the rest
of the migration decision logic here. Keep them in the same place? It
might necessitate passing in the page being faulted as well but then the
return value will be clearer

/*
* This function returns true if the @page is misplaced and should be
* migrated.
*/

It may need a name change as well if you decide to move everything into
this function including the call to page_cpupid_xchg_last

> + /* A shared fault, but p->numa_group has not been set up yet. */
> + if (!ng)
> + return true;
> +
> + /*
> + * Do not migrate if the destination is not a node that
> + * is actively used by this numa group.
> + */
> + if (!node_isset(dst_nid, ng->active_nodes))
> + return false;
> +

If I'm right about the sampling error potentially missing tasks accessing a
small number of pages then a reminder about the sampling error would not hurt

> + /*
> + * Source is a node that is not actively used by this
> + * numa group, while the destination is. Migrate.
> + */
> + if (!node_isset(src_nid, ng->active_nodes))
> + return true;
> +
> + /*
> + * Both source and destination are nodes in active
> + * use by this numa group. Maximize memory bandwidth
> + * by migrating from more heavily used groups, to less
> + * heavily used ones, spreading the load around.
> + * Use a 1/4 hysteresis to avoid spurious page movement.
> + */
> + return group_faults(p, dst_nid) < (group_faults(p, src_nid) * 3 / 4);
> +}

I worried initially about how this would interact with the scheduler
placement which is concerned with the number of faults per node. I think
it's ok though because it should flatten out and the interleaved nodes
should not look like good scheduling candidates. Something to keep in
mind in the future.

I do not see why this is a 1/4 hysteresis though. It looks more like a
threshold based on the number of faults than anything to do with
hysteresis.

Finally, something like this is approximately the same as three-quarters
but does not use divides as a micro-optimisation. The approximation will
always be a greater value but the difference in error is marginal

src_group_faults = group_faults(p, src_nid);
src_group_faults -= src_group_faults >> 2;


> +
> static unsigned long weighted_cpuload(const int cpu);
> static unsigned long source_load(int cpu, int type);
> static unsigned long target_load(int cpu, int type);
> diff --git a/mm/mempolicy.c b/mm/mempolicy.c
> index 052abac..050962b 100644
> --- a/mm/mempolicy.c
> +++ b/mm/mempolicy.c
> @@ -2405,6 +2405,9 @@ int mpol_misplaced(struct page *page, struct vm_area_struct *vma, unsigned long
> if (!cpupid_pid_unset(last_cpupid) && cpupid_to_nid(last_cpupid) != thisnid) {
> goto out;
> }
> +
> + if (!should_numa_migrate(current, last_cpupid, curnid, polnid))
> + goto out;
> }
>
> if (curnid != polnid)
> --
> 1.8.4.2
>

--
Mel Gorman
SUSE Labs

Rik van Riel

unread,
Jan 21, 2014, 3:20:02 PM1/21/14
to
On 01/21/2014 09:19 AM, Mel Gorman wrote:
> On Mon, Jan 20, 2014 at 02:21:04PM -0500, ri...@redhat.com wrote:

>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index 1945ddc..ea8b2ae 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -885,6 +885,7 @@ struct numa_group {
>> struct list_head task_list;
>>
>> struct rcu_head rcu;
>> + nodemask_t active_nodes;
>> unsigned long total_faults;
>> unsigned long *faults_from;
>> unsigned long faults[0];
>
> It's not a concern for now but in the land of unicorns and ponies we'll
> relook at the size of some of these structures and see what can be
> optimised.

Unsigned int should be enough for systems with less than 8TB
of memory per node :)

> Similar to my comment on faults_from I think we could potentially evaluate
> the fitness of the automatic NUMA balancing feature by looking at the
> weight of the active_nodes for a numa_group. If
> bitmask_weight(active_nodes) == nr_online_nodes
> for all numa_groups in the system then I think it would be an indication
> that the algorithm has collapsed.

If the system runs one very large workload, I would expect the
scheduler to spread that workload across all nodes.

In that situation, it is perfectly legitimate for all nodes
to end up being marked as active nodes, and for the system
to try distribute the workload's memory somewhat evenly
between them.

> It's not a comment on the patch itself. We could just do with more
> metrics that help analyse this thing when debugging problems.
>
>> @@ -1275,6 +1276,41 @@ static void numa_migrate_preferred(struct task_struct *p)
>> }
>>
>> /*
>> + * Find the nodes on which the workload is actively running. We do this by
>
> hmm, it's not the workload though, it's a single NUMA group and a workload
> may consist of multiple NUMA groups. For example, in an ideal world and
> a JVM-based workload the application threads and the GC threads would be
> in different NUMA groups.

Why should they be in a different numa group?

The rest of the series contains patches to make sure they
should be just fine together in the same group...

> The signature is even more misleading because the signature implies that
> the function is concerned with tasks. Pass in p->numa_group

Will do.

>> + * tracking the nodes from which NUMA hinting faults are triggered. This can
>> + * be different from the set of nodes where the workload's memory is currently
>> + * located.
>> + *
>> + * The bitmask is used to make smarter decisions on when to do NUMA page
>> + * migrations, To prevent flip-flopping, and excessive page migrations, nodes
>> + * are added when they cause over 6/16 of the maximum number of faults, but
>> + * only removed when they drop below 3/16.
>> + */
>
> Looking at the values, I'm guessing you did it this way to use shifts
> instead of divides. That's fine, but how did you arrive at those values?
> Experimentally or just felt reasonable?

Experimentally I got to 20% and 40%. Peter suggested I change it
to 3/16 and 6/16, which appear to give identical performance.

>> +static void update_numa_active_node_mask(struct task_struct *p)
>> +{
>> + unsigned long faults, max_faults = 0;
>> + struct numa_group *numa_group = p->numa_group;
>> + int nid;
>> +
>> + for_each_online_node(nid) {
>> + faults = numa_group->faults_from[task_faults_idx(nid, 0)] +
>> + numa_group->faults_from[task_faults_idx(nid, 1)];
>
> task_faults() implements a helper for p->numa_faults equivalent of this.
> Just as with the other renaming, it would not hurt to rename task_faults()
> to something like task_faults_memory() and add a task_faults_cpu() for
> this. The objective again is to be clear about whether we care about CPU
> or memory locality information.

Will do.

>> + if (faults > max_faults)
>> + max_faults = faults;
>> + }
>> +
>> + for_each_online_node(nid) {
>> + faults = numa_group->faults_from[task_faults_idx(nid, 0)] +
>> + numa_group->faults_from[task_faults_idx(nid, 1)];
>
> group_faults would need similar adjustment.
>
>> + if (!node_isset(nid, numa_group->active_nodes)) {
>> + if (faults > max_faults * 6 / 16)
>> + node_set(nid, numa_group->active_nodes);
>> + } else if (faults < max_faults * 3 / 16)
>> + node_clear(nid, numa_group->active_nodes);
>> + }
>> +}
>> +
>
> I think there is a subtle problem here

Can you be more specific about what problem you think the hysteresis
could be causing?

> /*
> * Be mindful that this is subject to sampling error. As we only have
> * data on hinting faults active_nodes may miss a heavily referenced
> * node due to the references being to a small number of pages. If
> * there is a large linear scanner in the same numa group as a
> * task operating on a small amount of memory then the latter task
> * may be ignored.
> */
>
> I have no suggestion on how to handle this

Since the numa_faults_cpu statistics are all about driving
memory-follows-cpu, there actually is a decent way to handle
it. See patch 5 :)

>> +/*
>> * When adapting the scan rate, the period is divided into NUMA_PERIOD_SLOTS
>> * increments. The more local the fault statistics are, the higher the scan
>> * period will be for the next scan window. If local/remote ratio is below
>> @@ -1416,6 +1452,7 @@ static void task_numa_placement(struct task_struct *p)
>> update_task_scan_period(p, fault_types[0], fault_types[1]);
>>
>> if (p->numa_group) {
>> + update_numa_active_node_mask(p);
>
> We are updating that thing once per scan window, that's fine. There is
> potentially a wee issue though. If all the tasks in the group are threads
> then they share p->mm->numa_scan_seq and only one task does the update
> per scan window. If they are different processes then we could be updating
> more frequently than necessary.
>
> Functionally it'll be fine but higher cost than necessary. I do not have a
> better suggestion right now as superficially a numa_scan_seq per numa_group
> would not be a good fit.

I suspect this cost will be small anyway, compared to the costs
incurred in both the earlier part of task_numa_placement, and
in the code where we may look for a better place to migrate the
task to.

This just iterates over memory we have already touched before
(likely to still be cached), and does some cheap comparisons.

>> /*
>> * If the preferred task and group nids are different,
>> * iterate over the nodes again to find the best place.
>> @@ -1478,6 +1515,8 @@ static void task_numa_group(struct task_struct *p, int cpupid, int flags,
>> /* Second half of the array tracks where faults come from */
>> grp->faults_from = grp->faults + 2 * nr_node_ids;
>>
>> + node_set(task_node(current), grp->active_nodes);
>> +
>> for (i = 0; i < 4*nr_node_ids; i++)
>> grp->faults[i] = p->numa_faults[i];
>>
>> @@ -1547,6 +1586,8 @@ static void task_numa_group(struct task_struct *p, int cpupid, int flags,
>> my_grp->nr_tasks--;
>> grp->nr_tasks++;
>>
>> + update_numa_active_node_mask(p);
>> +
>
> This may be subtle enough to deserve a comment
>
> /* Tasks have joined/left groups and the active_mask is no longer valid */

I have added a comment.

> If we left a group, we update our new group. Is the old group now out of
> date and in need of updating too?

The entire old group will join the new group, and the old group
is freed.

--
All rights reversed

Mel Gorman

unread,
Jan 21, 2014, 3:50:02 PM1/21/14
to
On Tue, Jan 21, 2014 at 10:09:14AM -0500, Rik van Riel wrote:
> On 01/21/2014 09:19 AM, Mel Gorman wrote:
> > On Mon, Jan 20, 2014 at 02:21:04PM -0500, ri...@redhat.com wrote:
>
> >> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> >> index 1945ddc..ea8b2ae 100644
> >> --- a/kernel/sched/fair.c
> >> +++ b/kernel/sched/fair.c
> >> @@ -885,6 +885,7 @@ struct numa_group {
> >> struct list_head task_list;
> >>
> >> struct rcu_head rcu;
> >> + nodemask_t active_nodes;
> >> unsigned long total_faults;
> >> unsigned long *faults_from;
> >> unsigned long faults[0];
> >
> > It's not a concern for now but in the land of unicorns and ponies we'll
> > relook at the size of some of these structures and see what can be
> > optimised.
>
> Unsigned int should be enough for systems with less than 8TB
> of memory per node :)
>

Is it not bigger than that?

typedef struct { DECLARE_BITMAP(bits, MAX_NUMNODES); } nodemask_t;

so it depends on the value of NODES_SHIFT? Anyway, not worth getting
into a twist over.
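
(For context, a quick user-space sketch — not kernel code, and NODES_SHIFT=10
below is only an assumption — showing how the mask size scales:)

/* nodemask_t is a bitmap of MAX_NUMNODES = 1 << NODES_SHIFT bits,
 * rounded up to whole longs. */
#include <stdio.h>

#define NODES_SHIFT      10
#define MAX_NUMNODES     (1 << NODES_SHIFT)
#define BITS_PER_LONG    (8 * sizeof(long))
#define BITS_TO_LONGS(n) (((n) + BITS_PER_LONG - 1) / BITS_PER_LONG)

typedef struct { unsigned long bits[BITS_TO_LONGS(MAX_NUMNODES)]; } nodemask_t;

int main(void)
{
        printf("NODES_SHIFT=%d -> sizeof(nodemask_t) = %zu bytes\n",
               NODES_SHIFT, sizeof(nodemask_t));
        return 0;
}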

> > Similar to my comment on faults_from I think we could potentially evaluate
> > the fitness of the automatic NUMA balancing feature by looking at the
> > weight of the active_nodes for a numa_group. If
> > bitmask_weight(active_nodes) == nr_online_nodes
> > for all numa_groups in the system then I think it would be an indication
> > that the algorithm has collapsed.
>
> If the system runs one very large workload, I would expect the
> scheduler to spread that workload across all nodes.
>
> In that situation, it is perfectly legitimate for all nodes
> to end up being marked as active nodes, and for the system
> to try distribute the workload's memory somewhat evenly
> between them.
>

In the specific case where the workload is not partitioned and really
accessing all of memory then sure, it'll be spread throughout the
system. However, if we are looking at a case like multiple JVMs sized to
fit within nodes then the metric would hold.

> > It's not a comment on the patch itself. We could just do with more
> > metrics that help analyse this thing when debugging problems.
> >
> >> @@ -1275,6 +1276,41 @@ static void numa_migrate_preferred(struct task_struct *p)
> >> }
> >>
> >> /*
> >> + * Find the nodes on which the workload is actively running. We do this by
> >
> > hmm, it's not the workload though, it's a single NUMA group and a workload
> > may consist of multiple NUMA groups. For example, in an ideal world and
> > a JVM-based workload the application threads and the GC threads would be
> > in different NUMA groups.
>
> Why should they be in a different numa group?
>

It would be ideal if they were in different groups, so the hinting faults
incurred by the garbage collector (a linear scan of the address space)
do not affect scheduling and placement decisions based on the numa
group's fault statistics.

> The rest of the series contains patches to make sure they
> should be just fine together in the same group...
>
> > The signature is even more misleading because the signature implies that
> > the function is concerned with tasks. Pass in p->numa_group
>
> Will do.
>
> >> + * tracking the nodes from which NUMA hinting faults are triggered. This can
> >> + * be different from the set of nodes where the workload's memory is currently
> >> + * located.
> >> + *
> >> + * The bitmask is used to make smarter decisions on when to do NUMA page
> >> + * migrations, To prevent flip-flopping, and excessive page migrations, nodes
> >> + * are added when they cause over 6/16 of the maximum number of faults, but
> >> + * only removed when they drop below 3/16.
> >> + */
> >
> > Looking at the values, I'm guessing you did it this way to use shifts
> > instead of divides. That's fine, but how did you arrive at those values?
> > Experimentally or just felt reasonable?
>
> Experimentally I got to 20% and 40%. Peter suggested I change it
> to 3/16 and 6/16, which appear to give identical performance.
>

Cool
Let's say

Thread A: Most important thread for performance, accesses small amounts
of memory during each scan window. Let's say it's doing calculations
over a large cache-aware structure of some description

Thread B: Big stupid linear scanner accessing all of memory for whatever
reason.

Thread B will incur more NUMA hinting faults because it is accessing
idle memory that is unused by Thread A. The fault stats and placement
decisions are then skewed in favour of Thread B because Thread A did not
trap enough hinting faults.

It's a theoretical problem.
Fair enough. It'll show up in profiles if it's a problem anyway.
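
To put made-up numbers on that scenario, here is a user-space sketch (all
fault counts and CPU fractions are hypothetical) of how weighting each
thread's faults by its CPU use, as the later patch in this series does,
puts the busy thread's node back in front:

/* Thread B (the scanner) traps far more hinting faults than thread A,
 * but uses far less CPU. */
#include <stdio.h>

int main(void)
{
        long faults_a[2] = {  900,  100 };      /* thread A, runs on node 0 */
        long faults_b[2] = { 1000, 9000 };      /* thread B, scans node 1 */
        double cpu_a = 0.95, cpu_b = 0.05;      /* fraction of CPU used */

        for (int nid = 0; nid < 2; nid++) {
                long raw = faults_a[nid] + faults_b[nid];
                double weighted = cpu_a * faults_a[nid] + cpu_b * faults_b[nid];
                printf("node %d: raw=%ld weighted=%.0f\n", nid, raw, weighted);
        }
        return 0;       /* raw stats favour node 1, weighted stats favour node 0 */
}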

> >> /*
> >> * If the preferred task and group nids are different,
> >> * iterate over the nodes again to find the best place.
> >> @@ -1478,6 +1515,8 @@ static void task_numa_group(struct task_struct *p, int cpupid, int flags,
> >> /* Second half of the array tracks where faults come from */
> >> grp->faults_from = grp->faults + 2 * nr_node_ids;
> >>
> >> + node_set(task_node(current), grp->active_nodes);
> >> +
> >> for (i = 0; i < 4*nr_node_ids; i++)
> >> grp->faults[i] = p->numa_faults[i];
> >>
> >> @@ -1547,6 +1586,8 @@ static void task_numa_group(struct task_struct *p, int cpupid, int flags,
> >> my_grp->nr_tasks--;
> >> grp->nr_tasks++;
> >>
> >> + update_numa_active_node_mask(p);
> >> +
> >
> > This may be subtle enough to deserve a comment
> >
> > /* Tasks have joined/left groups and the active_mask is no longer valid */
>
> I have added a comment.
>
> > If we left a group, we update our new group. Is the old group now out of
> > date and in need of updating too?
>
> The entire old group will join the new group, and the old group
> is freed.
>

We reference count the old group so that it only gets freed when the
last task leaves it. If the old group was guaranteed to be destroyed
there would be no need to do stuff like

list_move(&p->numa_entry, &grp->task_list);
my_grp->total_faults -= p->total_numa_faults;
my_grp->nr_tasks--;

All that reads as "a single task is moving group" and not the entire
old group joins the new group. I expected that the old group was only
guaranteed to be destroyed in the case where we had just allocated it
because p->numa_group was NULL when task_numa_group was called.

--
Mel Gorman
SUSE Labs

Mel Gorman

Jan 21, 2014, 4:00:02 PM
On Mon, Jan 20, 2014 at 02:21:06PM -0500, ri...@redhat.com wrote:
> From: Rik van Riel <ri...@redhat.com>
>
> The tracepoint has made it abundantly clear that the naive
> implementation of the faults_from code has issues.
>
> Specifically, the garbage collector in some workloads will
> access orders of magnitudes more memory than the threads
> that do all the active work. This resulted in the node with
> the garbage collector being marked the only active node in
> the group.
>

Maybe I should have read this patch before getting into a twist about the
earlier patches in the series and the treatment of active_mask :(. On the
plus side, even without reading the code I can still see the motivation
for this paragraph.

> This issue is avoided if we weigh the statistics by CPU use
> of each task in the numa group, instead of by how many faults
> each thread has occurred.
>

Bah, yes. Because in my earlier review I was worried about the faults
being missed. If the fault stats are scaled by the CPU statistics then it
is a very rough proxy measure for how heavily a particular node is being
referenced by a process.
Depending on how you reacted to the review of other patches this may or
may not have a helper now.

> + runtime = p->se.avg.runnable_avg_sum;
> + period = p->se.avg.runnable_avg_period;
> +

Ok, IIRC these stats are based on a decaying average of recent
history, so heavy activity followed by long periods of idle will not skew
the stats.

> /* If the task is part of a group prevent parallel updates to group stats */
> if (p->numa_group) {
> group_lock = &p->numa_group->lock;
> @@ -1446,7 +1453,7 @@ static void task_numa_placement(struct task_struct *p)
> int priv, i;
>
> for (priv = 0; priv < 2; priv++) {
> - long diff, f_diff;
> + long diff, f_diff, f_weight;
>
> i = task_faults_idx(nid, priv);
> diff = -p->numa_faults[i];
> @@ -1458,8 +1465,18 @@ static void task_numa_placement(struct task_struct *p)
> fault_types[priv] += p->numa_faults_buffer[i];
> p->numa_faults_buffer[i] = 0;
>
> + /*
> + * Normalize the faults_from, so all tasks in a group
> + * count according to CPU use, instead of by the raw
> + * number of faults. Tasks with little runtime have
> + * little over-all impact on throughput, and thus their
> + * faults are less important.
> + */
> + f_weight = (16384 * runtime *
> + p->numa_faults_from_buffer[i]) /
> + (total_faults * period + 1);

Why 16384? It looks like a scaling factor to deal with integer approximations
but I'm not 100% sure and I do not see how you arrived at that value.

> p->numa_faults_from[i] >>= 1;
> - p->numa_faults_from[i] += p->numa_faults_from_buffer[i];
> + p->numa_faults_from[i] += f_weight;
> p->numa_faults_from_buffer[i] = 0;
>

numa_faults_from needs a big comment that it's no longer about the
raw number of faults in it. It's the sum of faults measured by the group,
weighted by CPU use.

> faults += p->numa_faults[i];
> --
> 1.8.4.2
>

--
Mel Gorman
SUSE Labs

Mel Gorman

Jan 21, 2014, 4:20:02 PM
On Mon, Jan 20, 2014 at 02:21:07PM -0500, ri...@redhat.com wrote:
> From: Rik van Riel <ri...@redhat.com>
>
> The current code in task_numa_placement calculates the difference
> between the old and the new value, but also temporarily stores half
> of the old value in the per-process variables.
>
> The NUMA balancing code looks at those per-process variables, and
> having other tasks temporarily see halved statistics could lead to
> unwanted numa migrations. This can be avoided by doing all the math
> in local variables.
>
> This change also simplifies the code a little.
>
> Cc: Peter Zijlstra <pet...@infradead.org>
> Cc: Mel Gorman <mgo...@suse.de>
> Cc: Ingo Molnar <mi...@redhat.com>
> Cc: Chegu Vinod <chegu...@hp.com>
> Signed-off-by: Rik van Riel <ri...@redhat.com>

Acked-by: Mel Gorman <mgo...@suse.de>

--
Mel Gorman
SUSE Labs

Rik van Riel

Jan 21, 2014, 9:10:02 PM
On 01/21/2014 10:56 AM, Mel Gorman wrote:
> On Mon, Jan 20, 2014 at 02:21:06PM -0500, ri...@redhat.com wrote:

>> @@ -1434,6 +1436,11 @@ static void task_numa_placement(struct task_struct *p)
>> p->numa_scan_seq = seq;
>> p->numa_scan_period_max = task_scan_max(p);
>>
>> + total_faults = p->numa_faults_locality[0] +
>> + p->numa_faults_locality[1] + 1;
>
> Depending on how you reacted to the review of other patches this may or
> may not have a helper now.

This is a faults "buffer", zeroed quickly after we take these
faults, so we should probably not tempt others by having a helper
function to get these numbers...

>> + runtime = p->se.avg.runnable_avg_sum;
>> + period = p->se.avg.runnable_avg_period;
>> +
>
> Ok, IIRC these stats are based on a decaying average of recent
> history, so heavy activity followed by long periods of idle will not skew
> the stats.

Turns out that using a longer time statistic results in a 1% performance
gain, so expect this code to change again in the next version :)

>> @@ -1458,8 +1465,18 @@ static void task_numa_placement(struct task_struct *p)
>> fault_types[priv] += p->numa_faults_buffer[i];
>> p->numa_faults_buffer[i] = 0;
>>
>> + /*
>> + * Normalize the faults_from, so all tasks in a group
>> + * count according to CPU use, instead of by the raw
>> + * number of faults. Tasks with little runtime have
>> + * little over-all impact on throughput, and thus their
>> + * faults are less important.
>> + */
>> + f_weight = (16384 * runtime *
>> + p->numa_faults_from_buffer[i]) /
>> + (total_faults * period + 1);
>
> Why 16384? It looks like a scaling factor to deal with integer approximations
> but I'm not 100% sure and I do not see how you arrived at that value.

Indeed, it is simply a fixed point math scaling factor.

I used 1024 before, but that is kind of a small number when we could
be dealing with a node that has 20% of the accesses, and a task that
used 10% CPU time.

Having the numbers a little larger could help, and certainly should
not hurt, as long as we keep the number small enough to avoid overflows.
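
As a worked example (user-space sketch, all numbers made up): for a task
that used 10% of its period and took 20% of its faults on one node, the
two candidate scaling factors give:

#include <stdio.h>

int main(void)
{
        unsigned long runtime = 100, period = 1000;             /* 10% CPU use */
        unsigned long node_faults = 20, total_faults = 100;     /* 20% of faults */

        unsigned long w1024  = (1024  * runtime * node_faults) /
                               (total_faults * period + 1);
        unsigned long w16384 = (16384 * runtime * node_faults) /
                               (total_faults * period + 1);

        printf("scale 1024:  f_weight = %lu\n", w1024);         /* 20 */
        printf("scale 16384: f_weight = %lu\n", w16384);        /* 327 */
        return 0;
}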

>> p->numa_faults_from[i] >>= 1;
>> - p->numa_faults_from[i] += p->numa_faults_from_buffer[i];
>> + p->numa_faults_from[i] += f_weight;
>> p->numa_faults_from_buffer[i] = 0;
>>
>
> numa_faults_from needs a big comment that it's no longer about the
> raw number of faults in it. It's the sum of faults measured by the group,
> weighted by CPU use.

Agreed.

--
All rights reversed

ri...@redhat.com

Jan 21, 2014, 10:30:01 PM
From: Rik van Riel <ri...@redhat.com>

Track which nodes NUMA faults are triggered from, in other words
the nodes of the CPUs on which the NUMA faults happened. This uses a similar
mechanism to what is used to track the memory involved in numa faults.

The next patches use this to build up a bitmap of which nodes a
workload is actively running on.

Cc: Peter Zijlstra <pet...@infradead.org>
Cc: Mel Gorman <mgo...@suse.de>
Cc: Ingo Molnar <mi...@redhat.com>
Cc: Chegu Vinod <chegu...@hp.com>
Signed-off-by: Rik van Riel <ri...@redhat.com>
---
include/linux/sched.h | 10 ++++++++--
kernel/sched/fair.c | 30 +++++++++++++++++++++++-------
2 files changed, 31 insertions(+), 9 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index b8f8476..d14d9fe 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1492,6 +1492,14 @@ struct task_struct {
unsigned long *numa_faults_buffer_memory;

/*
+ * Track the nodes where faults are incurred. This is not very
+ * interesting on a per-task basis, but it helps with smarter
+ * numa memory placement for groups of processes.
+ */
+ unsigned long *numa_faults_cpu;
+ unsigned long *numa_faults_buffer_cpu;
+
+ /*
* numa_faults_locality tracks if faults recorded during the last
* scan window were remote/local. The task scan period is adapted
* based on the locality of the faults with different weights
@@ -1594,8 +1602,6 @@ extern void task_numa_fault(int last_node, int node, int pages, int flags);
extern pid_t task_numa_group_id(struct task_struct *p);
extern void set_numabalancing_state(bool enabled);
extern void task_numa_free(struct task_struct *p);
-
-extern unsigned int sysctl_numa_balancing_migrate_deferred;
#else
static inline void task_numa_fault(int last_node, int node, int pages,
int flags)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 6560145..b98ed61 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -886,6 +886,7 @@ struct numa_group {

struct rcu_head rcu;
unsigned long total_faults;
+ unsigned long *faults_cpu;
unsigned long faults[0];
};

@@ -1372,10 +1373,11 @@ static void task_numa_placement(struct task_struct *p)
int priv, i;

for (priv = 0; priv < 2; priv++) {
- long diff;
+ long diff, f_diff;

i = task_faults_idx(nid, priv);
diff = -p->numa_faults_memory[i];
+ f_diff = -p->numa_faults_cpu[i];

/* Decay existing window, copy faults since last scan */
p->numa_faults_memory[i] >>= 1;
@@ -1383,12 +1385,18 @@ static void task_numa_placement(struct task_struct *p)
fault_types[priv] += p->numa_faults_buffer_memory[i];
p->numa_faults_buffer_memory[i] = 0;

+ p->numa_faults_cpu[i] >>= 1;
+ p->numa_faults_cpu[i] += p->numa_faults_buffer_cpu[i];
+ p->numa_faults_buffer_cpu[i] = 0;
+
faults += p->numa_faults_memory[i];
diff += p->numa_faults_memory[i];
+ f_diff += p->numa_faults_cpu[i];
p->total_numa_faults += diff;
if (p->numa_group) {
/* safe because we can only change our own group */
p->numa_group->faults[i] += diff;
+ p->numa_group->faults_cpu[i] += f_diff;
p->numa_group->total_faults += diff;
group_faults += p->numa_group->faults[i];
}
@@ -1457,7 +1465,7 @@ static void task_numa_group(struct task_struct *p, int cpupid, int flags,

if (unlikely(!p->numa_group)) {
unsigned int size = sizeof(struct numa_group) +
- 2*nr_node_ids*sizeof(unsigned long);
+ 4*nr_node_ids*sizeof(unsigned long);

grp = kzalloc(size, GFP_KERNEL | __GFP_NOWARN);
if (!grp)
@@ -1467,8 +1475,10 @@ static void task_numa_group(struct task_struct *p, int cpupid, int flags,
spin_lock_init(&grp->lock);
INIT_LIST_HEAD(&grp->task_list);
grp->gid = p->pid;
+ /* Second half of the array tracks nids where faults happen */
+ grp->faults_cpu = grp->faults + 2 * nr_node_ids;

- for (i = 0; i < 2*nr_node_ids; i++)
+ for (i = 0; i < 4*nr_node_ids; i++)
grp->faults[i] = p->numa_faults_memory[i];

grp->total_faults = p->total_numa_faults;
@@ -1526,7 +1536,7 @@ static void task_numa_group(struct task_struct *p, int cpupid, int flags,

double_lock(&my_grp->lock, &grp->lock);

- for (i = 0; i < 2*nr_node_ids; i++) {
+ for (i = 0; i < 4*nr_node_ids; i++) {
my_grp->faults[i] -= p->numa_faults_memory[i];
grp->faults[i] += p->numa_faults_memory[i];
}
@@ -1558,7 +1568,7 @@ void task_numa_free(struct task_struct *p)

if (grp) {
spin_lock(&grp->lock);
- for (i = 0; i < 2*nr_node_ids; i++)
+ for (i = 0; i < 4*nr_node_ids; i++)
grp->faults[i] -= p->numa_faults_memory[i];
grp->total_faults -= p->total_numa_faults;

@@ -1571,6 +1581,8 @@ void task_numa_free(struct task_struct *p)

p->numa_faults_memory = NULL;
p->numa_faults_buffer_memory = NULL;
+ p->numa_faults_cpu = NULL;
+ p->numa_faults_buffer_cpu = NULL;
kfree(numa_faults);
}

@@ -1581,6 +1593,7 @@ void task_numa_fault(int last_cpupid, int node, int pages, int flags)
{
struct task_struct *p = current;
bool migrated = flags & TNF_MIGRATED;
+ int this_node = task_node(current);
int priv;

if (!numabalancing_enabled)
@@ -1596,7 +1609,7 @@ void task_numa_fault(int last_cpupid, int node, int pages, int flags)

/* Allocate buffer to track faults on a per-node basis */
if (unlikely(!p->numa_faults_memory)) {
- int size = sizeof(*p->numa_faults_memory) * 2 * nr_node_ids;
+ int size = sizeof(*p->numa_faults_memory) * 4 * nr_node_ids;

/* numa_faults and numa_faults_buffer share the allocation */
p->numa_faults_memory = kzalloc(size * 2, GFP_KERNEL|__GFP_NOWARN);
@@ -1604,7 +1617,9 @@ void task_numa_fault(int last_cpupid, int node, int pages, int flags)
return;

BUG_ON(p->numa_faults_buffer_memory);
- p->numa_faults_buffer_memory = p->numa_faults_memory + (2 * nr_node_ids);
+ p->numa_faults_cpu = p->numa_faults_memory + (2 * nr_node_ids);
+ p->numa_faults_buffer_memory = p->numa_faults_memory + (4 * nr_node_ids);
+ p->numa_faults_buffer_cpu = p->numa_faults_memory + (6 * nr_node_ids);
p->total_numa_faults = 0;
memset(p->numa_faults_locality, 0, sizeof(p->numa_faults_locality));
}
@@ -1634,6 +1649,7 @@ void task_numa_fault(int last_cpupid, int node, int pages, int flags)
p->numa_pages_migrated += pages;

p->numa_faults_buffer_memory[task_faults_idx(node, priv)] += pages;
+ p->numa_faults_buffer_cpu[task_faults_idx(this_node, priv)] += pages;
p->numa_faults_locality[!!(flags & TNF_FAULT_LOCAL)] += pages;
}

--
1.8.4.2
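
For anyone following along, a small user-space sketch of the indexing the
arrays above use (NR_NODES and the fault counts are made up, not taken
from the patch):

/* Each array has two slots per node, one per fault type (shared or
 * private), so a fault lands at index 2 * nid + priv. */
#include <stdio.h>

#define NR_NODES 4

static int task_faults_idx(int nid, int priv)
{
        return 2 * nid + priv;
}

int main(void)
{
        long faults_buffer_cpu[2 * NR_NODES] = { 0 };

        /* e.g. a fault of one type, handled by a CPU on node 2, 8 pages */
        faults_buffer_cpu[task_faults_idx(2, 1)] += 8;

        for (int nid = 0; nid < NR_NODES; nid++)
                printf("node %d: type0=%ld type1=%ld\n", nid,
                       faults_buffer_cpu[task_faults_idx(nid, 0)],
                       faults_buffer_cpu[task_faults_idx(nid, 1)]);
        return 0;
}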

Rik van Riel

Jan 21, 2014, 10:30:02 PM
On 01/21/2014 07:21 AM, Mel Gorman wrote:
> On Mon, Jan 20, 2014 at 02:21:03PM -0500, ri...@redhat.com wrote:

>> +++ b/include/linux/sched.h
>> @@ -1492,6 +1492,14 @@ struct task_struct {
>> unsigned long *numa_faults_buffer;
>>
>> /*
>> + * Track the nodes where faults are incurred. This is not very
>> + * interesting on a per-task basis, but it help with smarter
>> + * numa memory placement for groups of processes.
>> + */
>> + unsigned long *numa_faults_from;
>> + unsigned long *numa_faults_from_buffer;
>> +
>
> As an aside I wonder if we can derive any useful metric from this

It may provide for a better way to tune the numa scan interval
than the current code, since the "local vs remote" ratio is not
going to provide us much useful info when dealing with a workload
that is spread across multiple numa nodes.

>> grp->total_faults = p->total_numa_faults;
>> @@ -1526,7 +1536,7 @@ static void task_numa_group(struct task_struct *p, int cpupid, int flags,
>>
>> double_lock(&my_grp->lock, &grp->lock);
>>
>> - for (i = 0; i < 2*nr_node_ids; i++) {
>> + for (i = 0; i < 4*nr_node_ids; i++) {
>> my_grp->faults[i] -= p->numa_faults[i];
>> grp->faults[i] += p->numa_faults[i];
>> }
>
> The same obscure trick is used throughout and I'm not sure how
> maintainable that will be. Would it be better to be explicit about this?

I have made a cleanup patch for this, using the defines you
suggested.

>> @@ -1634,6 +1649,7 @@ void task_numa_fault(int last_cpupid, int node, int pages, int flags)
>> p->numa_pages_migrated += pages;
>>
>> p->numa_faults_buffer[task_faults_idx(node, priv)] += pages;
>> + p->numa_faults_from_buffer[task_faults_idx(this_node, priv)] += pages;
>> p->numa_faults_locality[!!(flags & TNF_FAULT_LOCAL)] += pages;
>
> this_node and node is similarly ambiguous in terms of name. Rename of
> data_node and cpu_node would have been clearer.

I added a patch in the next version of the series.

Don't want to make the series too large, though :)

--
All rights reversed

ri...@redhat.com

Jan 21, 2014, 10:30:02 PM
From: Rik van Riel <ri...@redhat.com>

The numa_faults_cpu statistics are used to maintain an active_nodes nodemask
per numa_group. This allows us to be smarter about when to do numa migrations.

Cc: Peter Zijlstra <pet...@infradead.org>
Cc: Mel Gorman <mgo...@suse.de>
Cc: Ingo Molnar <mi...@redhat.com>
Cc: Chegu Vinod <chegu...@hp.com>
Signed-off-by: Rik van Riel <ri...@redhat.com>
---
kernel/sched/fair.c | 51 +++++++++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 51 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index b98ed61..d4f6df5 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -885,6 +885,7 @@ struct numa_group {
struct list_head task_list;

struct rcu_head rcu;
+ nodemask_t active_nodes;
unsigned long total_faults;
unsigned long *faults_cpu;
unsigned long faults[0];
@@ -918,6 +919,12 @@ static inline unsigned long group_faults(struct task_struct *p, int nid)
p->numa_group->faults[task_faults_idx(nid, 1)];
}

+static inline unsigned long group_faults_cpu(struct numa_group *group, int nid)
+{
+ return group->faults_cpu[task_faults_idx(nid, 0)] +
+ group->faults_cpu[task_faults_idx(nid, 1)];
+}
+
/*
* These return the fraction of accesses done by a particular task, or
* task group, on a particular numa node. The group weight is given a
@@ -1275,6 +1282,40 @@ static void numa_migrate_preferred(struct task_struct *p)
}

/*
+ * Find the nodes on which the workload is actively running. We do this by
+ * tracking the nodes from which NUMA hinting faults are triggered. This can
+ * be different from the set of nodes where the workload's memory is currently
+ * located.
+ *
+ * The bitmask is used to make smarter decisions on when to do NUMA page
+ * migrations. To prevent flip-flopping and excessive page migrations, nodes
+ * are added when they cause over 6/16 of the maximum number of faults, but
+ * only removed when they drop below 3/16.
+ */
+static void update_numa_active_node_mask(struct numa_group *numa_group)
+{
+ unsigned long faults, max_faults = 0;
+ int nid;
+
+ for_each_online_node(nid) {
+ faults = numa_group->faults_cpu[task_faults_idx(nid, 0)] +
+ numa_group->faults_cpu[task_faults_idx(nid, 1)];
+ if (faults > max_faults)
+ max_faults = faults;
+ }
+
+ for_each_online_node(nid) {
+ faults = numa_group->faults_cpu[task_faults_idx(nid, 0)] +
+ numa_group->faults_cpu[task_faults_idx(nid, 1)];
+ if (!node_isset(nid, numa_group->active_nodes)) {
+ if (faults > max_faults * 6 / 16)
+ node_set(nid, numa_group->active_nodes);
+ } else if (faults < max_faults * 3 / 16)
+ node_clear(nid, numa_group->active_nodes);
+ }
+}
+
+/*
* When adapting the scan rate, the period is divided into NUMA_PERIOD_SLOTS
* increments. The more local the fault statistics are, the higher the scan
* period will be for the next scan window. If local/remote ratio is below
@@ -1416,6 +1457,7 @@ static void task_numa_placement(struct task_struct *p)
update_task_scan_period(p, fault_types[0], fault_types[1]);

if (p->numa_group) {
+ update_numa_active_node_mask(p->numa_group);
/*
* If the preferred task and group nids are different,
* iterate over the nodes again to find the best place.
@@ -1478,6 +1520,8 @@ static void task_numa_group(struct task_struct *p, int cpupid, int flags,
/* Second half of the array tracks nids where faults happen */
grp->faults_cpu = grp->faults + 2 * nr_node_ids;

+ node_set(task_node(current), grp->active_nodes);
+
for (i = 0; i < 4*nr_node_ids; i++)
grp->faults[i] = p->numa_faults_memory[i];

@@ -1547,6 +1591,13 @@ static void task_numa_group(struct task_struct *p, int cpupid, int flags,
my_grp->nr_tasks--;
grp->nr_tasks++;

+ /*
+ * We just joined a new group, the set of active nodes may have
+ * changed. Do not update the nodemask of the old group, since
+ * the tasks in that group will probably join the new group soon.
+ */
+ update_numa_active_node_mask(grp);
+
spin_unlock(&my_grp->lock);
spin_unlock(&grp->lock);

--
1.8.4.2
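
A minimal user-space sketch of the hysteresis above, with made-up fault
counts (6/16 is 37.5% and 3/16 is 18.75%, close to the experimentally
found 40%/20%); this is only an illustration, not part of the patch:

#include <stdio.h>
#include <stdbool.h>

#define NR_NODES 4

static void update_active(unsigned long faults[NR_NODES], bool active[NR_NODES])
{
        unsigned long max_faults = 0;
        int nid;

        for (nid = 0; nid < NR_NODES; nid++)
                if (faults[nid] > max_faults)
                        max_faults = faults[nid];

        for (nid = 0; nid < NR_NODES; nid++) {
                if (!active[nid]) {
                        /* only added above 6/16 of the busiest node */
                        if (faults[nid] > max_faults * 6 / 16)
                                active[nid] = true;
                } else if (faults[nid] < max_faults * 3 / 16) {
                        /* only removed below 3/16, so nodes near the
                         * threshold do not flip-flop */
                        active[nid] = false;
                }
        }
}

int main(void)
{
        unsigned long faults[NR_NODES] = { 1000, 400, 250, 50 };
        bool active[NR_NODES] = { false };
        int nid;

        update_active(faults, active);
        for (nid = 0; nid < NR_NODES; nid++)
                printf("node %d: faults=%lu active=%d\n",
                       nid, faults[nid], active[nid]);
        return 0;       /* nodes 0 and 1 end up active (above 375 faults) */
}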

ri...@redhat.com

Jan 21, 2014, 10:30:02 PM
From: Rik van Riel <ri...@redhat.com>

Tracing the code that decides the active nodes has made it abundantly clear
that the naive implementation of the faults_from code has issues.

Specifically, the garbage collector in some workloads will access orders
of magnitudes more memory than the threads that do all the active work.
This resulted in the node with the garbage collector being marked the only
active node in the group.

This issue is avoided if we weigh the statistics by CPU use of each task in
the numa group, instead of by how many faults each thread has incurred.

To achieve this, we normalize the number of faults to the fraction of faults
that occurred on each node, and then multiply that fraction by the fraction
of CPU time the task has used since the last time task_numa_placement was
invoked.

This way the nodes in the active node mask will be the ones where the tasks
from the numa group are most actively running, and the influence of e.g. the
garbage collector and other do-little threads is properly minimized.

On a 4 node system, using CPU use statistics calculated over a longer interval
results in about 1% fewer page migrations with two 32-warehouse specjbb runs,
and about 5% fewer page migrations, as well as 1% better throughput, with two
8-warehouse specjbb runs, compared with the shorter term statistics kept by
the scheduler.

Cc: Peter Zijlstra <pet...@infradead.org>
Cc: Mel Gorman <mgo...@suse.de>
Cc: Ingo Molnar <mi...@redhat.com>
Cc: Chegu Vinod <chegu...@hp.com>
Signed-off-by: Rik van Riel <ri...@redhat.com>
---
include/linux/sched.h | 2 ++
kernel/sched/core.c | 2 ++
kernel/sched/fair.c | 53 +++++++++++++++++++++++++++++++++++++++++++++++++--
3 files changed, 55 insertions(+), 2 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 0bf4ac2..2b3cd41 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1471,6 +1471,8 @@ struct task_struct {
int numa_preferred_nid;
unsigned long numa_migrate_retry;
u64 node_stamp; /* migration stamp */
+ u64 last_task_numa_placement;
+ u64 last_sum_exec_runtime;
struct callback_head numa_work;

struct list_head numa_entry;
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 224871f..1bd0727 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1758,6 +1758,8 @@ static void __sched_fork(unsigned long clone_flags, struct task_struct *p)
p->numa_work.next = &p->numa_work;
p->numa_faults_memory = NULL;
p->numa_faults_buffer_memory = NULL;
+ p->last_task_numa_placement = 0;
+ p->last_sum_exec_runtime = 0;

INIT_LIST_HEAD(&p->numa_entry);
p->numa_group = NULL;
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index ddbf20c..bd2100c 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -887,6 +887,11 @@ struct numa_group {
struct rcu_head rcu;
nodemask_t active_nodes;
unsigned long total_faults;
+ /*
+ * Faults_cpu is used to decide whether memory should move
+ * towards the CPU. As a consequence, these stats are weighted
+ * more by CPU use than by memory faults.
+ */
unsigned long *faults_cpu;
unsigned long faults[0];
};
@@ -1451,11 +1456,41 @@ static void update_task_scan_period(struct task_struct *p,
memset(p->numa_faults_locality, 0, sizeof(p->numa_faults_locality));
}

+
static void task_numa_placement(struct task_struct *p)
{
int seq, nid, max_nid = -1, max_group_nid = -1;
unsigned long max_faults = 0, max_group_faults = 0;
unsigned long fault_types[2] = { 0, 0 };
+ unsigned long total_faults;
+ u64 runtime, period;
spinlock_t *group_lock = NULL;

seq = ACCESS_ONCE(p->mm->numa_scan_seq);
@@ -1464,6 +1499,10 @@ static void task_numa_placement(struct task_struct *p)
p->numa_scan_seq = seq;
p->numa_scan_period_max = task_scan_max(p);

+ total_faults = p->numa_faults_locality[0] +
+ p->numa_faults_locality[1] + 1;
+ runtime = numa_get_avg_runtime(p, &period);
+
/* If the task is part of a group prevent parallel updates to group stats */
if (p->numa_group) {
group_lock = &p->numa_group->lock;
@@ -1476,7 +1515,7 @@ static void task_numa_placement(struct task_struct *p)
int priv, i;

for (priv = 0; priv < 2; priv++) {
- long diff, f_diff;
+ long diff, f_diff, f_weight;

i = task_faults_idx(nid, priv);
diff = -p->numa_faults_memory[i];
@@ -1488,8 +1527,18 @@ static void task_numa_placement(struct task_struct *p)
fault_types[priv] += p->numa_faults_buffer_memory[i];
p->numa_faults_buffer_memory[i] = 0;

+ /*
+ * Normalize the faults_from, so all tasks in a group
+ * count according to CPU use, instead of by the raw
+ * number of faults. Tasks with little runtime have
+ * little over-all impact on throughput, and thus their
+ * faults are less important.
+ */
+ f_weight = (16384 * runtime *
+ p->numa_faults_buffer_cpu[i]) /
+ (total_faults * period + 1);
p->numa_faults_cpu[i] >>= 1;
- p->numa_faults_cpu[i] += p->numa_faults_buffer_cpu[i];
+ p->numa_faults_cpu[i] += f_weight;
p->numa_faults_buffer_cpu[i] = 0;

faults += p->numa_faults_memory[i];

ri...@redhat.com

Jan 21, 2014, 10:30:02 PM
From: Rik van Riel <ri...@redhat.com>

In order to get a more consistent naming scheme, making it clear
which fault statistics track memory locality, and which track
CPU locality, rename the memory fault statistics.

Suggested-by: Mel Gorman <mgo...@suse.de>
Signed-off-by: Rik van Riel <ri...@redhat.com>
---
include/linux/sched.h | 8 ++++----
kernel/sched/core.c | 4 ++--
kernel/sched/debug.c | 6 +++---
kernel/sched/fair.c | 56 +++++++++++++++++++++++++--------------------------
4 files changed, 37 insertions(+), 37 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 97efba4..b8f8476 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1481,15 +1481,15 @@ struct task_struct {
* Scheduling placement decisions are made based on the these counts.
* The values remain static for the duration of a PTE scan
*/
- unsigned long *numa_faults;
+ unsigned long *numa_faults_memory;
unsigned long total_numa_faults;

/*
* numa_faults_buffer records faults per node during the current
- * scan window. When the scan completes, the counts in numa_faults
- * decay and these values are copied.
+ * scan window. When the scan completes, the counts in
+ * numa_faults_memory decay and these values are copied.
*/
- unsigned long *numa_faults_buffer;
+ unsigned long *numa_faults_buffer_memory;

/*
* numa_faults_locality tracks if faults recorded during the last
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 7f45fd5..224871f 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1756,8 +1756,8 @@ static void __sched_fork(unsigned long clone_flags, struct task_struct *p)
p->numa_scan_seq = p->mm ? p->mm->numa_scan_seq : 0;
p->numa_scan_period = sysctl_numa_balancing_scan_delay;
p->numa_work.next = &p->numa_work;
- p->numa_faults = NULL;
- p->numa_faults_buffer = NULL;
+ p->numa_faults_memory = NULL;
+ p->numa_faults_buffer_memory = NULL;

INIT_LIST_HEAD(&p->numa_entry);
p->numa_group = NULL;
diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
index dd52e7f..31b908d 100644
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -533,15 +533,15 @@ static void sched_show_numa(struct task_struct *p, struct seq_file *m)
unsigned long nr_faults = -1;
int cpu_current, home_node;

- if (p->numa_faults)
- nr_faults = p->numa_faults[2*node + i];
+ if (p->numa_faults_memory)
+ nr_faults = p->numa_faults_memory[2*node + i];

cpu_current = !i ? (task_node(p) == node) :
(pol && node_isset(node, pol->v.nodes));

home_node = (p->numa_preferred_nid == node);

- SEQ_printf(m, "numa_faults, %d, %d, %d, %d, %ld\n",
+ SEQ_printf(m, "numa_faults_memory, %d, %d, %d, %d, %ld\n",
i, node, cpu_current, home_node, nr_faults);
}
}
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 41e2176..6560145 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -901,11 +901,11 @@ static inline int task_faults_idx(int nid, int priv)

static inline unsigned long task_faults(struct task_struct *p, int nid)
{
- if (!p->numa_faults)
+ if (!p->numa_faults_memory)
return 0;

- return p->numa_faults[task_faults_idx(nid, 0)] +
- p->numa_faults[task_faults_idx(nid, 1)];
+ return p->numa_faults_memory[task_faults_idx(nid, 0)] +
+ p->numa_faults_memory[task_faults_idx(nid, 1)];
}

static inline unsigned long group_faults(struct task_struct *p, int nid)
@@ -927,7 +927,7 @@ static inline unsigned long task_weight(struct task_struct *p, int nid)
{
unsigned long total_faults;

- if (!p->numa_faults)
+ if (!p->numa_faults_memory)
return 0;

total_faults = p->total_numa_faults;
@@ -1259,7 +1259,7 @@ static int task_numa_migrate(struct task_struct *p)
static void numa_migrate_preferred(struct task_struct *p)
{
/* This task has no NUMA fault statistics yet */
- if (unlikely(p->numa_preferred_nid == -1 || !p->numa_faults))
+ if (unlikely(p->numa_preferred_nid == -1 || !p->numa_faults_memory))
return;

/* Periodically retry migrating the task to the preferred node */
@@ -1375,16 +1375,16 @@ static void task_numa_placement(struct task_struct *p)
long diff;

i = task_faults_idx(nid, priv);
- diff = -p->numa_faults[i];
+ diff = -p->numa_faults_memory[i];

/* Decay existing window, copy faults since last scan */
- p->numa_faults[i] >>= 1;
- p->numa_faults[i] += p->numa_faults_buffer[i];
- fault_types[priv] += p->numa_faults_buffer[i];
- p->numa_faults_buffer[i] = 0;
+ p->numa_faults_memory[i] >>= 1;
+ p->numa_faults_memory[i] += p->numa_faults_buffer_memory[i];
+ fault_types[priv] += p->numa_faults_buffer_memory[i];
+ p->numa_faults_buffer_memory[i] = 0;

- faults += p->numa_faults[i];
- diff += p->numa_faults[i];
+ faults += p->numa_faults_memory[i];
+ diff += p->numa_faults_memory[i];
p->total_numa_faults += diff;
if (p->numa_group) {
/* safe because we can only change our own group */
@@ -1469,7 +1469,7 @@ static void task_numa_group(struct task_struct *p, int cpupid, int flags,
grp->gid = p->pid;

for (i = 0; i < 2*nr_node_ids; i++)
- grp->faults[i] = p->numa_faults[i];
+ grp->faults[i] = p->numa_faults_memory[i];

grp->total_faults = p->total_numa_faults;

@@ -1527,8 +1527,8 @@ static void task_numa_group(struct task_struct *p, int cpupid, int flags,
double_lock(&my_grp->lock, &grp->lock);

for (i = 0; i < 2*nr_node_ids; i++) {
- my_grp->faults[i] -= p->numa_faults[i];
- grp->faults[i] += p->numa_faults[i];
+ my_grp->faults[i] -= p->numa_faults_memory[i];
+ grp->faults[i] += p->numa_faults_memory[i];
}
my_grp->total_faults -= p->total_numa_faults;
grp->total_faults += p->total_numa_faults;
@@ -1554,12 +1554,12 @@ void task_numa_free(struct task_struct *p)
{
struct numa_group *grp = p->numa_group;
int i;
- void *numa_faults = p->numa_faults;
+ void *numa_faults = p->numa_faults_memory;

if (grp) {
spin_lock(&grp->lock);
for (i = 0; i < 2*nr_node_ids; i++)
- grp->faults[i] -= p->numa_faults[i];
+ grp->faults[i] -= p->numa_faults_memory[i];
grp->total_faults -= p->total_numa_faults;

list_del(&p->numa_entry);
@@ -1569,8 +1569,8 @@ void task_numa_free(struct task_struct *p)
put_numa_group(grp);
}

- p->numa_faults = NULL;
- p->numa_faults_buffer = NULL;
+ p->numa_faults_memory = NULL;
+ p->numa_faults_buffer_memory = NULL;
kfree(numa_faults);
}

@@ -1595,16 +1595,16 @@ void task_numa_fault(int last_cpupid, int node, int pages, int flags)
return;

/* Allocate buffer to track faults on a per-node basis */
- if (unlikely(!p->numa_faults)) {
- int size = sizeof(*p->numa_faults) * 2 * nr_node_ids;
+ if (unlikely(!p->numa_faults_memory)) {
+ int size = sizeof(*p->numa_faults_memory) * 2 * nr_node_ids;

/* numa_faults and numa_faults_buffer share the allocation */
- p->numa_faults = kzalloc(size * 2, GFP_KERNEL|__GFP_NOWARN);
- if (!p->numa_faults)
+ p->numa_faults_memory = kzalloc(size * 2, GFP_KERNEL|__GFP_NOWARN);
+ if (!p->numa_faults_memory)
return;

- BUG_ON(p->numa_faults_buffer);
- p->numa_faults_buffer = p->numa_faults + (2 * nr_node_ids);
+ BUG_ON(p->numa_faults_buffer_memory);
+ p->numa_faults_buffer_memory = p->numa_faults_memory + (2 * nr_node_ids);
p->total_numa_faults = 0;
memset(p->numa_faults_locality, 0, sizeof(p->numa_faults_locality));
}
@@ -1633,7 +1633,7 @@ void task_numa_fault(int last_cpupid, int node, int pages, int flags)
if (migrated)
p->numa_pages_migrated += pages;

- p->numa_faults_buffer[task_faults_idx(node, priv)] += pages;
+ p->numa_faults_buffer_memory[task_faults_idx(node, priv)] += pages;
p->numa_faults_locality[!!(flags & TNF_FAULT_LOCAL)] += pages;
}

@@ -4781,7 +4781,7 @@ static bool migrate_improves_locality(struct task_struct *p, struct lb_env *env)
{
int src_nid, dst_nid;

- if (!sched_feat(NUMA_FAVOUR_HIGHER) || !p->numa_faults ||
+ if (!sched_feat(NUMA_FAVOUR_HIGHER) || !p->numa_faults_memory ||
!(env->sd->flags & SD_NUMA)) {
return false;
}
@@ -4812,7 +4812,7 @@ static bool migrate_degrades_locality(struct task_struct *p, struct lb_env *env)
if (!sched_feat(NUMA) || !sched_feat(NUMA_RESIST_LOWER))
return false;

- if (!p->numa_faults || !(env->sd->flags & SD_NUMA))
+ if (!p->numa_faults_memory || !(env->sd->flags & SD_NUMA))
return false;

src_nid = cpu_to_node(env->src_cpu);

ri...@redhat.com

Jan 21, 2014, 10:30:03 PM
From: Rik van Riel <ri...@redhat.com>

We track both the node of the memory after a NUMA fault, and the node
of the CPU on which the fault happened. Rename the local variables in
task_numa_fault to make things more explicit.

Suggested-by: Mel Gorman <mgo...@suse.de>
Signed-off-by: Rik van Riel <ri...@redhat.com>
---
kernel/sched/fair.c | 8 ++++----
1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index f713f3a..43ca8c4 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1747,11 +1747,11 @@ void task_numa_free(struct task_struct *p)
/*
* Got a PROT_NONE fault for a page on @node.
*/
-void task_numa_fault(int last_cpupid, int node, int pages, int flags)
+void task_numa_fault(int last_cpupid, int mem_node, int pages, int flags)
{
struct task_struct *p = current;
bool migrated = flags & TNF_MIGRATED;
- int this_node = task_node(current);
+ int cpu_node = task_node(current);
int priv;

if (!numabalancing_enabled)
@@ -1806,8 +1806,8 @@ void task_numa_fault(int last_cpupid, int node, int pages, int flags)
if (migrated)
p->numa_pages_migrated += pages;

- p->numa_faults_buffer_memory[task_faults_idx(node, priv)] += pages;
- p->numa_faults_buffer_cpu[task_faults_idx(this_node, priv)] += pages;
+ p->numa_faults_buffer_memory[task_faults_idx(mem_node, priv)] += pages;
+ p->numa_faults_buffer_cpu[task_faults_idx(cpu_node, priv)] += pages;
p->numa_faults_locality[!!(flags & TNF_FAULT_LOCAL)] += pages;
}

ri...@redhat.com

Jan 21, 2014, 10:30:03 PM
From: Rik van Riel <ri...@redhat.com>

Cleanup suggested by Mel Gorman. Now the code contains some more
hints on what statistics go where.

Suggested-by: Mel Gorman <mgo...@suse.de>
Signed-off-by: Rik van Riel <ri...@redhat.com>
---
kernel/sched/fair.c | 34 +++++++++++++++++++++++++---------
1 file changed, 25 insertions(+), 9 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 43ca8c4..6b9d27c 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -896,6 +896,15 @@ struct numa_group {
unsigned long faults[0];
};

+/* Shared or private faults. */
+#define NR_NUMA_HINT_FAULT_TYPES 2
+
+/* Memory and CPU locality */
+#define NR_NUMA_HINT_FAULT_STATS (NR_NUMA_HINT_FAULT_TYPES * 2)
+
+/* Averaged statistics, and temporary buffers. */
+#define NR_NUMA_HINT_BUCKETS (NUMA_HINT_FAULT_STATS * 2)
+
pid_t task_numa_group_id(struct task_struct *p)
{
return p->numa_group ? p->numa_group->gid : 0;
@@ -903,7 +912,7 @@ pid_t task_numa_group_id(struct task_struct *p)

static inline int task_faults_idx(int nid, int priv)
{
- return 2 * nid + priv;
+ return NR_NUMA_HINT_FAULT_TYPES * nid + priv;
}

static inline unsigned long task_faults(struct task_struct *p, int nid)
@@ -1514,7 +1523,7 @@ static void task_numa_placement(struct task_struct *p)
unsigned long faults = 0, group_faults = 0;
int priv, i;

- for (priv = 0; priv < 2; priv++) {
+ for (priv = 0; priv < NR_NUMA_HINT_FAULT_TYPES; priv++) {
long diff, f_diff, f_weight;

i = task_faults_idx(nid, priv);
@@ -1625,11 +1634,12 @@ static void task_numa_group(struct task_struct *p, int cpupid, int flags,
INIT_LIST_HEAD(&grp->task_list);
grp->gid = p->pid;
/* Second half of the array tracks nids where faults happen */
- grp->faults_cpu = grp->faults + 2 * nr_node_ids;
+ grp->faults_cpu = grp->faults + NR_NUMA_HINT_FAULT_TYPES *
+ nr_node_ids;

node_set(task_node(current), grp->active_nodes);

- for (i = 0; i < 4*nr_node_ids; i++)
+ for (i = 0; i < NR_NUMA_HINT_FAULT_STATS * nr_node_ids; i++)
grp->faults[i] = p->numa_faults_memory[i];

grp->total_faults = p->total_numa_faults;
@@ -1687,7 +1697,7 @@ static void task_numa_group(struct task_struct *p, int cpupid, int flags,

double_lock(&my_grp->lock, &grp->lock);

- for (i = 0; i < 4*nr_node_ids; i++) {
+ for (i = 0; i < NR_NUMA_HINT_FAULT_STATS * nr_node_ids; i++) {
my_grp->faults[i] -= p->numa_faults_memory[i];
grp->faults[i] += p->numa_faults_memory[i];
}
@@ -1726,7 +1736,7 @@ void task_numa_free(struct task_struct *p)

if (grp) {
spin_lock(&grp->lock);
- for (i = 0; i < 4*nr_node_ids; i++)
+ for (i = 0; i < NR_NUMA_HINT_FAULT_STATS * nr_node_ids; i++)
grp->faults[i] -= p->numa_faults_memory[i];
grp->total_faults -= p->total_numa_faults;

@@ -1767,14 +1777,20 @@ void task_numa_fault(int last_cpupid, int mem_node, int pages, int flags)

/* Allocate buffer to track faults on a per-node basis */
if (unlikely(!p->numa_faults_memory)) {
- int size = sizeof(*p->numa_faults_memory) * 4 * nr_node_ids;
+ int size = sizeof(*p->numa_faults_memory) *
+ NR_NUMA_HINT_FAULT_BUCKETS * nr_node_ids;

- /* numa_faults and numa_faults_buffer share the allocation */
- p->numa_faults_memory = kzalloc(size * 2, GFP_KERNEL|__GFP_NOWARN);
+ p->numa_faults_memory = kzalloc(size, GFP_KERNEL|__GFP_NOWARN);
if (!p->numa_faults_memory)
return;

BUG_ON(p->numa_faults_buffer_memory);
+ /*
+ * The averaged statistics, shared & private, memory & cpu,
+ * occupy the first half of the array. The second half of the
+ * array is for current counters, which are averaged into the
+ * first set by task_numa_placement.
+ */
p->numa_faults_cpu = p->numa_faults_memory + (2 * nr_node_ids);
p->numa_faults_buffer_memory = p->numa_faults_memory + (4 * nr_node_ids);
p->numa_faults_buffer_cpu = p->numa_faults_memory + (6 * nr_node_ids);
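
A tiny sketch of how the single allocation above is carved up (the
nr_node_ids value is just an example):

#include <stdio.h>

int main(void)
{
        int nr_node_ids = 4;
        int per_view = 2 * nr_node_ids;         /* shared + private, per node */

        printf("numa_faults_memory        at index %d\n", 0 * per_view);
        printf("numa_faults_cpu           at index %d\n", 1 * per_view);
        printf("numa_faults_buffer_memory at index %d\n", 2 * per_view);
        printf("numa_faults_buffer_cpu    at index %d\n", 3 * per_view);
        printf("total entries in the allocation: %d\n", 4 * per_view);
        return 0;
}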

ri...@redhat.com

Jan 21, 2014, 10:30:02 PM
From: Rik van Riel <ri...@redhat.com>

Excessive migration of pages can hurt the performance of workloads
that span multiple NUMA nodes. However, it turns out that the
p->numa_migrate_deferred knob is a really big hammer, which does
reduce migration rates, but does not actually help performance.

Now that the second stage of the automatic numa balancing code
has stabilized, it is time to replace the simplistic migration
deferral code with something smarter.

Cc: Peter Zijlstra <pet...@infradead.org>
Cc: Mel Gorman <mgo...@suse.de>
Cc: Ingo Molnar <mi...@redhat.com>
Cc: Chegu Vinod <chegu...@hp.com>
Acked-by: Mel Gorman <mgo...@suse.de>
Signed-off-by: Rik van Riel <ri...@redhat.com>
---
Documentation/sysctl/kernel.txt | 10 +--------
include/linux/sched.h | 1 -
kernel/sched/fair.c | 8 --------
kernel/sysctl.c | 7 -------
mm/mempolicy.c | 45 -----------------------------------------
5 files changed, 1 insertion(+), 70 deletions(-)

diff --git a/Documentation/sysctl/kernel.txt b/Documentation/sysctl/kernel.txt
index ee9a2f9..8c78658 100644
--- a/Documentation/sysctl/kernel.txt
+++ b/Documentation/sysctl/kernel.txt
@@ -399,8 +399,7 @@ feature should be disabled. Otherwise, if the system overhead from the
feature is too high then the rate the kernel samples for NUMA hinting
faults may be controlled by the numa_balancing_scan_period_min_ms,
numa_balancing_scan_delay_ms, numa_balancing_scan_period_max_ms,
-numa_balancing_scan_size_mb, numa_balancing_settle_count sysctls and
-numa_balancing_migrate_deferred.
+numa_balancing_scan_size_mb, and numa_balancing_settle_count sysctls.

==============================================================

@@ -441,13 +440,6 @@ rate for each task.
numa_balancing_scan_size_mb is how many megabytes worth of pages are
scanned for a given scan.

-numa_balancing_migrate_deferred is how many page migrations get skipped
-unconditionally, after a page migration is skipped because a page is shared
-with other tasks. This reduces page migration overhead, and determines
-how much stronger the "move task near its memory" policy scheduler becomes,
-versus the "move memory near its task" memory management policy, for workloads
-with shared memory.
-
==============================================================

osrelease, ostype & version:
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 68a0e84..97efba4 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1469,7 +1469,6 @@ struct task_struct {
unsigned int numa_scan_period;
unsigned int numa_scan_period_max;
int numa_preferred_nid;
- int numa_migrate_deferred;
unsigned long numa_migrate_retry;
u64 node_stamp; /* migration stamp */
struct callback_head numa_work;
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 867b0a4..41e2176 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 36cb46c..052abac 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
if (!cpupid_pid_unset(last_cpupid) && cpupid_to_nid(last_cpupid) != thisnid) {
-
- /* See sysctl_numa_balancing_migrate_deferred comment */
- if (!cpupid_match_pid(current, last_cpupid))
- defer_numa_migrate(current);
-
goto out;
}
-
- /*
- * The quadratic filter above reduces extraneous migration
- * of shared pages somewhat. This code reduces it even more,
- * reducing the overhead of page migrations of shared pages.
- * This makes workloads with shared pages rely more on
- * "move task near its memory", and less on "move memory
- * towards its task", which is exactly what we want.
- */
- if (numa_migrate_deferred(current, last_cpupid))
- goto out;
}

if (curnid != polnid)

ri...@redhat.com

Jan 21, 2014, 10:30:03 PM
The current automatic NUMA balancing code base has issues with
workloads that do not fit on one NUMA node. Page migration is
slowed down, but memory distribution between the nodes where
the workload runs is essentially random, often resulting in a
suboptimal amount of memory bandwidth being available to the
workload.

In order to maximize performance of workloads that do not fit in one NUMA
node, we want to satisfy the following criteria:
1) keep private memory local to each thread
2) avoid excessive NUMA migration of pages
3) distribute shared memory across the active nodes, to
maximize memory bandwidth available to the workload

This patch series identifies the NUMA nodes on which the workload
is actively running, and balances (somewhat lazily) the memory
between those nodes, satisfying the criteria above.

As usual, the series has had some performance testing, but it
could always benefit from more testing, on other systems.

Changes since v3:
- various code cleanups suggested by Mel Gorman (some in their own patches)
- after some testing, switch back to the NUMA specific CPU use stats,
since that results in a 1% performance increase for two 8-warehouse
specjbb instances on a 4-node system, and reduced page migration across
the board

ri...@redhat.com

Jan 21, 2014, 10:30:03 PM
From: Rik van Riel <ri...@redhat.com>

The current code in task_numa_placement calculates the difference
between the old and the new value, but also temporarily stores half
of the old value in the per-process variables.

The NUMA balancing code looks at those per-process variables, and
having other tasks temporarily see halved statistics could lead to
unwanted numa migrations. This can be avoided by doing all the math
in local variables.

This change also simplifies the code a little.

Cc: Peter Zijlstra <pet...@infradead.org>
Cc: Mel Gorman <mgo...@suse.de>
Cc: Ingo Molnar <mi...@redhat.com>
Cc: Chegu Vinod <chegu...@hp.com>
Acked-by: Mel Gorman <mgo...@suse.de>
Signed-off-by: Rik van Riel <ri...@redhat.com>
---
kernel/sched/fair.c | 12 ++++--------
1 file changed, 4 insertions(+), 8 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index bd2100c..f713f3a 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1518,12 +1518,9 @@ static void task_numa_placement(struct task_struct *p)
long diff, f_diff, f_weight;

i = task_faults_idx(nid, priv);
- diff = -p->numa_faults_memory[i];
- f_diff = -p->numa_faults_cpu[i];

/* Decay existing window, copy faults since last scan */
- p->numa_faults_memory[i] >>= 1;
- p->numa_faults_memory[i] += p->numa_faults_buffer_memory[i];
+ diff = p->numa_faults_buffer_memory[i] - p->numa_faults_memory[i] / 2;
fault_types[priv] += p->numa_faults_buffer_memory[i];
p->numa_faults_buffer_memory[i] = 0;

@@ -1537,13 +1534,12 @@ static void task_numa_placement(struct task_struct *p)
f_weight = (16384 * runtime *
p->numa_faults_buffer_cpu[i]) /
(total_faults * period + 1);
- p->numa_faults_cpu[i] >>= 1;
- p->numa_faults_cpu[i] += f_weight;
+ f_diff = f_weight - p->numa_faults_cpu[i] / 2;
p->numa_faults_buffer_cpu[i] = 0;

+ p->numa_faults_memory[i] += diff;
+ p->numa_faults_cpu[i] += f_diff;
faults += p->numa_faults_memory[i];
- diff += p->numa_faults_memory[i];
- f_diff += p->numa_faults_cpu[i];
p->total_numa_faults += diff;
if (p->numa_group) {
/* safe because we can only change our own group */
--
1.8.4.2
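
A worked example of the rewrite, with made-up numbers (user-space sketch):
both forms end up with the same value, but the new form never stores the
halved intermediate where other tasks could observe it.

#include <stdio.h>

int main(void)
{
        unsigned long old = 800, buffer = 300;

        /* old style: decay in place, then add the new scan window */
        unsigned long tmp = old;
        tmp >>= 1;                      /* other tasks could briefly see 400 */
        tmp += buffer;

        /* new style: compute the delta in a local, apply it once */
        long diff = (long)buffer - (long)(old / 2);
        unsigned long new_val = old + diff;

        printf("old style: %lu, new style: %lu\n", tmp, new_val);  /* 700 700 */
        return 0;
}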

ri...@redhat.com

Jan 21, 2014, 10:30:03 PM
From: Rik van Riel <ri...@redhat.com>

Use the active_nodes nodemask to make smarter decisions on NUMA migrations.

In order to maximize performance of workloads that do not fit in one NUMA
node, we want to satisfy the following criteria:
1) keep private memory local to each thread
2) avoid excessive NUMA migration of pages
3) distribute shared memory across the active nodes, to
maximize memory bandwidth available to the workload

This patch accomplishes that by implementing the following policy for
NUMA migrations:
1) always migrate on a private fault
2) never migrate to a node that is not in the set of active nodes
for the numa_group
3) always migrate from a node outside of the set of active nodes,
to a node that is in that set
4) within the set of active nodes in the numa_group, only migrate
from a node with more NUMA page faults, to a node with fewer
NUMA page faults, with a 25% margin to avoid ping-ponging

This results in most pages of a workload ending up on the actively
used nodes, with reduced ping-ponging of pages between those nodes.

Cc: Peter Zijlstra <pet...@infradead.org>
Cc: Mel Gorman <mgo...@suse.de>
Cc: Ingo Molnar <mi...@redhat.com>
Cc: Chegu Vinod <chegu...@hp.com>
Signed-off-by: Rik van Riel <ri...@redhat.com>
---
include/linux/sched.h | 7 ++++++
kernel/sched/fair.c | 62 +++++++++++++++++++++++++++++++++++++++++++++++++++
mm/mempolicy.c | 29 +-----------------------
3 files changed, 70 insertions(+), 28 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index d14d9fe..0bf4ac2 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1602,6 +1602,8 @@ extern void task_numa_fault(int last_node, int node, int pages, int flags);
extern pid_t task_numa_group_id(struct task_struct *p);
extern void set_numabalancing_state(bool enabled);
extern void task_numa_free(struct task_struct *p);
+extern bool should_numa_migrate_memory(struct task_struct *p, struct page *page,
+ int src_nid, int dst_nid);
#else
static inline void task_numa_fault(int last_node, int node, int pages,
int flags)
@@ -1617,6 +1619,11 @@ static inline void set_numabalancing_state(bool enabled)
static inline void task_numa_free(struct task_struct *p)
{
}
+static inline bool should_numa_migrate_memory(struct task_struct *p,
+ struct page *page, int src_nid, int dst_nid)
+{
+ return true;
+}
#endif

static inline struct pid *task_pid(struct task_struct *task)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index d4f6df5..ddbf20c 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -954,6 +954,68 @@ static inline unsigned long group_weight(struct task_struct *p, int nid)
return 1000 * group_faults(p, nid) / p->numa_group->total_faults;
}

+bool should_numa_migrate_memory(struct task_struct *p, struct page * page,
+ int src_nid, int dst_nid)
+{
+ struct numa_group *ng = p->numa_group;
+ int last_cpupid, this_cpupid;
+
+ this_cpupid = cpu_pid_to_cpupid(dst_nid, current->pid);
+ last_cpupid = page_cpupid_xchg_last(page, this_cpupid);
+
+ /*
+ * Multi-stage node selection is used in conjunction with a periodic
+ * migration fault to build a temporal task<->page relation. By using
+ * a two-stage filter we remove short/unlikely relations.
+ *
+ * Using P(p) ~ n_p / n_t as per frequentist probability, we can equate
+ * a task's usage of a particular page (n_p) per total usage of this
+ * page (n_t) (in a given time-span) to a probability.
+ *
+ * Our periodic faults will sample this probability and getting the
+ * same result twice in a row, given these samples are fully
+ * independent, is then given by P(n)^2, provided our sample period
+ * is sufficiently short compared to the usage pattern.
+ *
+ * This quadric squishes small probabilities, making it less likely we
+ * act on an unlikely task<->page relation.
+ */
+ if (!cpupid_pid_unset(last_cpupid) &&
+ cpupid_to_nid(last_cpupid) != dst_nid)
+ return false;
+
+ /* Always allow migrate on private faults */
+ if (cpupid_match_pid(p, last_cpupid))
+ return true;
+
+ /* A shared fault, but p->numa_group has not been set up yet. */
+ if (!ng)
+ return true;
+
+ /*
+ * Do not migrate if the destination is not a node that
+ * is actively used by this numa group.
+ */
+ if (!node_isset(dst_nid, ng->active_nodes))
+ return false;
+
+ /*
+ * Source is a node that is not actively used by this
+ * numa group, while the destination is. Migrate.
+ */
+ if (!node_isset(src_nid, ng->active_nodes))
+ return true;
+
+ /*
+ * Both source and destination are nodes in active
+ * use by this numa group. Maximize memory bandwidth
+ * by migrating from more heavily used groups, to less
+ * heavily used ones, spreading the load around.
+ * Use a 1/4 hysteresis to avoid spurious page movement.
+ */
+ return group_faults(p, dst_nid) < (group_faults(p, src_nid) * 3 / 4);
+}
+
static unsigned long weighted_cpuload(const int cpu);
static unsigned long source_load(int cpu, int type);
static unsigned long target_load(int cpu, int type);
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 052abac..95cdfe5 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -2374,37 +2374,10 @@ int mpol_misplaced(struct page *page, struct vm_area_struct *vma, unsigned long

/* Migrate the page towards the node whose CPU is referencing it */
if (pol->flags & MPOL_F_MORON) {
- int last_cpupid;
- int this_cpupid;
-
polnid = thisnid;
- this_cpupid = cpu_pid_to_cpupid(thiscpu, current->pid);

- /*
- * Multi-stage node selection is used in conjunction
- * with a periodic migration fault to build a temporal
- * task<->page relation. By using a two-stage filter we
- * remove short/unlikely relations.
- *
- * Using P(p) ~ n_p / n_t as per frequentist
- * probability, we can equate a task's usage of a
- * particular page (n_p) per total usage of this
- * page (n_t) (in a given time-span) to a probability.
- *
- * Our periodic faults will sample this probability and
- * getting the same result twice in a row, given these
- * samples are fully independent, is then given by
- * P(n)^2, provided our sample period is sufficiently
- * short compared to the usage pattern.
- *
- * This quadric squishes small probabilities, making
- * it less likely we act on an unlikely task<->page
- * relation.
- */
- last_cpupid = page_cpupid_xchg_last(page, this_cpupid);
- if (!cpupid_pid_unset(last_cpupid) && cpupid_to_nid(last_cpupid) != thisnid) {
+ if (!should_numa_migrate_memory(current, page, curnid, polnid))
goto out;
- }
}

if (curnid != polnid)
--
1.8.4.2
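
The decision logic above is easier to see with a small standalone sketch.
Everything below is made up for illustration (group_faults_stub() and the
fault counts are stand-ins, not part of the patch); the point is only the
1/4 hysteresis on the final check. The two-stage filter earlier in the
function works because, if a task uses a page with probability p, two
consecutive hinting faults landing on the same destination node happen
with probability roughly p^2, so weak task<->page relations rarely trigger
a migration.

#include <stdbool.h>
#include <stdio.h>

/* Made-up per-node group fault counts, standing in for group_faults(). */
static long group_faults_stub(int nid)
{
	static const long faults[] = { 800, 500, 100, 90 };
	return faults[nid];
}

/*
 * Sketch of the last check in should_numa_migrate_memory(): both nodes are
 * in the group's active set, so only migrate towards the node with clearly
 * fewer group faults (1/4 hysteresis against ping-ponging pages).
 */
static bool migrate_between_active_nodes(int src_nid, int dst_nid)
{
	return group_faults_stub(dst_nid) < group_faults_stub(src_nid) * 3 / 4;
}

int main(void)
{
	/* 500 < 800 * 3 / 4 = 600, so migrate; 100 < 90 * 3 / 4 = 67 fails, so stay. */
	printf("node 0 -> 1: %d\n", migrate_between_active_nodes(0, 1));
	printf("node 3 -> 2: %d\n", migrate_between_active_nodes(3, 2));
	return 0;
}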

Rik van Riel

Jan 22, 2014, 12:00:02 AM
On Tue, 21 Jan 2014 17:20:11 -0500
ri...@redhat.com wrote:

> From: Rik van Riel <ri...@redhat.com>
>
> Cleanup suggested by Mel Gorman. Now the code contains some more
> hints on what statistics go where.
>
> Suggested-by: Mel Gorman <mgo...@suse.de>
> Signed-off-by: Rik van Riel <ri...@redhat.com>

... and patch fatigue hit, causing me to mess up this simple
patch with no functional changes (which is why I did not test
it yet).

Here is one that compiles.

---8<---

Subject: numa,sched: define some magic numbers

Cleanup suggested by Mel Gorman. Now the code contains some more
hints on what statistics go where.

Suggested-by: Mel Gorman <mgo...@suse.de>
Signed-off-by: Rik van Riel <ri...@redhat.com>
---
kernel/sched/fair.c | 34 +++++++++++++++++++++++++---------
1 file changed, 25 insertions(+), 9 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 43ca8c4..6b9d27c 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -896,6 +896,15 @@ struct numa_group {
unsigned long faults[0];
};

+/* Shared or private faults. */
+#define NR_NUMA_HINT_FAULT_TYPES 2
+
+/* Memory and CPU locality */
+#define NR_NUMA_HINT_FAULT_STATS (NR_NUMA_HINT_FAULT_TYPES * 2)
+
+/* Averaged statistics, and temporary buffers. */
+#define NR_NUMA_HINT_FAULT_BUCKETS (NR_NUMA_HINT_FAULT_STATS * 2)
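
As a quick illustration of what these names buy: the 2*nr_node_ids and
4*nr_node_ids loop bounds discussed earlier in the thread are this same
arithmetic spelled out. The snippet below is a userspace sketch written for
this note, not kernel code, and the node count is made up.

#include <stdio.h>

/* Shared or private faults. */
#define NR_NUMA_HINT_FAULT_TYPES	2
/* Memory and CPU locality. */
#define NR_NUMA_HINT_FAULT_STATS	(NR_NUMA_HINT_FAULT_TYPES * 2)
/* Averaged statistics, and temporary buffers. */
#define NR_NUMA_HINT_FAULT_BUCKETS	(NR_NUMA_HINT_FAULT_STATS * 2)

int main(void)
{
	int nr_node_ids = 4;	/* example machine */

	/* 2 fault types x {memory, cpu} locality = 4 counters per node. */
	printf("stats per node:            %d\n", NR_NUMA_HINT_FAULT_STATS);
	/* The "4*nr_node_ids" loop bound, written with the new names. */
	printf("memory+cpu stats, 4 nodes: %d\n",
	       NR_NUMA_HINT_FAULT_STATS * nr_node_ids);
	/* Each stat also has a per-scan-window buffer copy. */
	printf("with buffers, 4 nodes:     %d\n",
	       NR_NUMA_HINT_FAULT_BUCKETS * nr_node_ids);
	return 0;
}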

Mel Gorman

Jan 24, 2014, 2:20:01 PM
On Tue, Jan 21, 2014 at 05:26:39PM -0500, Rik van Riel wrote:
> On 01/21/2014 07:21 AM, Mel Gorman wrote:
> > On Mon, Jan 20, 2014 at 02:21:03PM -0500, ri...@redhat.com wrote:
>
> >> +++ b/include/linux/sched.h
> >> @@ -1492,6 +1492,14 @@ struct task_struct {
> >> unsigned long *numa_faults_buffer;
> >>
> >> /*
> >> + * Track the nodes where faults are incurred. This is not very
> >> * interesting on a per-task basis, but it helps with smarter
> >> + * numa memory placement for groups of processes.
> >> + */
> >> + unsigned long *numa_faults_from;
> >> + unsigned long *numa_faults_from_buffer;
> >> +
> >
> > As an aside I wonder if we can derive any useful metric from this
>
> It may provide a better way to tune the numa scan interval
> than the current code, since the "local vs remote" ratio is not
> going to provide us much useful info when dealing with a workload
> that is spread across multiple numa nodes.
>

Agreed. Local vs Remote handles the easier cases, particularly where the
workload has been configured to have aspects of it fit within NUMA nodes
(e.g. multiple JVMs, multiple virtual machines, etc.), but it's nowhere near
as useful for large single-image workloads spanning the full machine.

I think in this New World Order, for single-instance workloads we would
instead take the balance of all remote nodes into account. So if we've
decided to interleave and the remote active nodes are roughly evenly used,
slow the scan rate.

It's not something for this series just yet but I have observed a higher
system CPU usage as a result of this series. It's still far lower than
the overhead we had in the past but this is one potential idea that would
allow us to reduce the system overhead again in the future.
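
Purely to illustrate the idea (nothing in this series does this), "roughly
even remote usage" could be tested with something like the sketch below. The
25% threshold and the plain array of per-node fault counts are made up for
the example; a caller might then back off the scan period when this returns
true, which is the behaviour described above.

#include <stdbool.h>

/*
 * Illustration only: decide whether the remote nodes' fault counts are
 * close enough to even that slowing the scan rate would be reasonable.
 * "Even" means no remote node is more than 25% above the mean; the
 * threshold is arbitrary and exists only for this sketch.
 */
static bool remote_nodes_roughly_even(const unsigned long *faults,
				      int nr_nodes, int local_node)
{
	unsigned long sum = 0, max = 0;
	int nid, nr_remote = 0;

	for (nid = 0; nid < nr_nodes; nid++) {
		if (nid == local_node)
			continue;
		sum += faults[nid];
		if (faults[nid] > max)
			max = faults[nid];
		nr_remote++;
	}
	if (!nr_remote || !sum)
		return false;

	/* max <= 1.25 * mean, kept in integer arithmetic. */
	return max * 4 * nr_remote <= sum * 5;
}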

> >> grp->total_faults = p->total_numa_faults;
> >> @@ -1526,7 +1536,7 @@ static void task_numa_group(struct task_struct *p, int cpupid, int flags,
> >>
> >> double_lock(&my_grp->lock, &grp->lock);
> >>
> >> - for (i = 0; i < 2*nr_node_ids; i++) {
> >> + for (i = 0; i < 4*nr_node_ids; i++) {
> >> my_grp->faults[i] -= p->numa_faults[i];
> >> grp->faults[i] += p->numa_faults[i];
> >> }
> >
> > The same obscure trick is used throughout and I'm not sure how
> > maintainable that will be. Would it be better to be explicit about this?
>
> I have made a cleanup patch for this, using the defines you
> suggested.
>
> >> @@ -1634,6 +1649,7 @@ void task_numa_fault(int last_cpupid, int node, int pages, int flags)
> >> p->numa_pages_migrated += pages;
> >>
> >> p->numa_faults_buffer[task_faults_idx(node, priv)] += pages;
> >> + p->numa_faults_from_buffer[task_faults_idx(this_node, priv)] += pages;
> >> p->numa_faults_locality[!!(flags & TNF_FAULT_LOCAL)] += pages;
> >
> > this_node and node are similarly ambiguous in terms of name. Renaming to
> > data_node and cpu_node would have been clearer.
>
> I added a patch in the next version of the series.
>
> Don't want to make the series too large, though :)
>

Understood, it's a bit of a mouthful already.

--
Mel Gorman
SUSE Labs

Mel Gorman

Jan 24, 2014, 3:30:02 PM
On Tue, Jan 21, 2014 at 05:20:04PM -0500, ri...@redhat.com wrote:
> From: Rik van Riel <ri...@redhat.com>
>
> In order to get a more consistent naming scheme, making it clear
> which fault statistics track memory locality, and which track
> CPU locality, rename the memory fault statistics.
>
> Suggested-by: Mel Gorman <mgo...@suse.de>
> Signed-off-by: Rik van Riel <ri...@redhat.com>

Thanks.

Acked-by: Mel Gorman <mgo...@suse.de>

--
Mel Gorman
SUSE Labs

Mel Gorman

Jan 24, 2014, 3:40:02 PM
faults = group_faults_cpu(numa_group, nid)

?

> + if (faults > max_faults)
> + max_faults = faults;
> + }
> +
> + for_each_online_node(nid) {
> + faults = numa_group->faults_cpu[task_faults_idx(nid, 0)] +
> + numa_group->faults_cpu[task_faults_idx(nid, 1)];

Same?
Ok, I guess this stops the old group from making very different migration
decisions just because one task left the group. That has difficult-to-predict
consequences, so assuming the new group_faults_cpu helper gets used:

Acked-by: Mel Gorman <mgo...@suse.de>

--
Mel Gorman
SUSE Labs
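
For anyone reading along: the helper being asked for above would just wrap
the open-coded sum quoted in the patch. A minimal sketch, assuming the
faults_cpu array and task_faults_idx() indexing used elsewhere in the
series; this is not the actual patch:

/*
 * Sketch of the suggested helper: total CPU-locality faults (shared plus
 * private) that a numa group has seen for one node.
 */
static inline unsigned long group_faults_cpu(struct numa_group *group, int nid)
{
	return group->faults_cpu[task_faults_idx(nid, 0)] +
	       group->faults_cpu[task_faults_idx(nid, 1)];
}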

Mel Gorman

Jan 24, 2014, 3:40:02 PM
/*
* Track the nodes the process was running on when a NUMA hinting fault
* was incurred ......
*/

?

Otherwise the comment is very similar to numa_faults_memory. I'm not
that bothered because the name is descriptive enough.


> + /*
> * numa_faults_locality tracks if faults recorded during the last
> * scan window were remote/local. The task scan period is adapted
> * based on the locality of the faults with different weights
> @@ -1594,8 +1602,6 @@ extern void task_numa_fault(int last_node, int node, int pages, int flags);
> extern pid_t task_numa_group_id(struct task_struct *p);
> extern void set_numabalancing_state(bool enabled);
> extern void task_numa_free(struct task_struct *p);
> -
> -extern unsigned int sysctl_numa_balancing_migrate_deferred;
> #else
> static inline void task_numa_fault(int last_node, int node, int pages,
> int flags)

Should this hunk move to patch 1?

Whether you make the changes or not

Acked-by: Mel Gorman <mgo...@suse.de>

In my last review I complained about magic numbers but I see a later
patch has a subject that at least implies it deals with the numbers.

--
Mel Gorman
SUSE Labs

Mel Gorman

Jan 24, 2014, 3:50:01 PM
On Tue, Jan 21, 2014 at 05:20:10PM -0500, ri...@redhat.com wrote:
> From: Rik van Riel <ri...@redhat.com>
>
> We track both the node of the memory after a NUMA fault, and the node
> of the CPU on which the fault happened. Rename the local variables in
> task_numa_fault to make things more explicit.
>
> Suggested-by: Mel Gorman <mgo...@suse.de>
> Signed-off-by: Rik van Riel <ri...@redhat.com>

Acked-by: Mel Gorman <mgo...@suse.de>

--
Mel Gorman
SUSE Labs

Mel Gorman

Jan 24, 2014, 3:50:01 PM
On Tue, Jan 21, 2014 at 06:58:26PM -0500, Rik van Riel wrote:
> On Tue, 21 Jan 2014 17:20:11 -0500
> ri...@redhat.com wrote:
>
> > From: Rik van Riel <ri...@redhat.com>
> >
> > Cleanup suggested by Mel Gorman. Now the code contains some more
> > hints on what statistics go where.
> >
> > Suggested-by: Mel Gorman <mgo...@suse.de>
> > Signed-off-by: Rik van Riel <ri...@redhat.com>
>
> ... and patch fatigue hit, causing me to mess up this simple
> patch with no functional changes (which is why I did not test
> it yet).
>
> Here is one that compiles.
>
> ---8<---
>
> Subject: numa,sched: define some magic numbers
>
> Cleanup suggested by Mel Gorman. Now the code contains some more
> hints on what statistics go where.
>
> Suggested-by: Mel Gorman <mgo...@suse.de>
> Signed-off-by: Rik van Riel <ri...@redhat.com>

Acked-by: Mel Gorman <mgo...@suse.de>

--
Mel Gorman
SUSE Labs