[PATCH 0/3] NUMA based LMB remove


Laurent Dufour

<ldufour@linux.ibm.com>
Dec 7, 2020, 3:44:11 AM12/7/20
to powerpc-utils-devel@googlegroups.com, tyreld@linux.ibm.com, nathanl@linux.ibm.com, cheloha@linux.ibm.com
This series introduces a change in the way the LMBs to be removed are
chosen.

Currently this is done in the kernel, but without taking the NUMA topology
into account. Thus a node running a lot of CPUs may be starved of memory
while a node with no CPU is left with plenty of memory.

The overall idea is to delegate the choice of the LMBs to the drmgr user
space process and to rely on the remove-by-index kernel API to remove them
one LMB at a time. There is no need to do that in the kernel since all the
information is available in user space, and coding it in user space eases
maintenance.
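
To make the mechanism concrete, here is a minimal sketch of what
remove-by-index looks like from user space, assuming the kernel's
/sys/kernel/dlpar interface (an illustration only, not the actual drmgr
code):

#include <stdio.h>
#include <string.h>
#include <errno.h>

static int dlpar_remove_lmb(unsigned int drc_index)
{
	char cmd[64];
	FILE *f;
	int rc = 0;

	/* One command per LMB; the caller chooses the DRC index. */
	snprintf(cmd, sizeof(cmd), "memory remove index 0x%x", drc_index);

	f = fopen("/sys/kernel/dlpar", "w");
	if (!f)
		return -errno;
	if (fwrite(cmd, strlen(cmd), 1, f) != 1)
		rc = -EIO;
	fclose(f);
	return rc;
}

The legacy fallback described below uses the same interface with a single
"memory remove count <n>" command instead.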

Removing the LMBs one by one through the remove-by-DRC-index kernel API
does not impact performance much, and it allows the new drmgr to run on top
of any kernel. I tested removing 1TB on a 16-node PowerPC system configured
with 704 CPUs. The test was run on top of the 5.9 and RHEL 8.3 kernels (10
times each).

Time to remove 1TB on top of the 5.9 kernel:

          base drmgr    new drmgr
mean      00:04:32      00:04:42  (+3.67%)
dev       00:00:07      00:00:08
max       00:05:03      00:04:56
min       00:04:21      00:04:18

Time to remove 1TB on top of the RHEL 8.3 kernel:

          base drmgr    new drmgr
mean      00:25:56      00:26:16  (+1.28%)
dev       00:01:54      00:01:56
max       00:33:28      00:29:54
min       00:23:39      00:22:39


Regarding the spread of the LMB removals, here are the results on the same
PowerPC system, running a 5.9 kernel and removing 1TB (memory is in MB):

                   base drmgr, memory (MB)       new drmgr, memory (MB)
Node  CPUs      before     after     delta      before     after     delta
   0    56      477122    376514   -100608      476610    346562   -130048
   1    64      545717    439733   -105984      545717    545717         0
   2    64      548257    447393   -100864      548257    548257         0
   3    64      548080    444144   -103936      548080    548080         0
   4    56      477622    382646    -94976      477878    347318   -130560
   5    64      545473    431057   -114416      547777    543681     -4096
   6    56      478343    380039    -98304      477831    347527   -130304
   7    56      478182    382438    -95744      477670    347366   -130304
   8    24      205150    166750    -38400      204894    148830    -56064
   9    32      273624    218840    -54784      273624    198616    -75008
  10    32      273624    221400    -52224      273624    198616    -75008
  11    32      273591    226999    -46592      273591    198583    -75008
  12    24      205037    163309    -41728      204781    148973    -55808
  13    32      273305    273305         0      273817    198553    -75264
  14    24      204654    204654         0      204142    148590    -55552
  15    24      204734    204734         0      204222    148670    -55552

Nodes with a higher number of CPUs are less impacted when running the new
drmgr.

In case the NUMA topology can't be read, the previous way of removing
memory, using the remove-by-count kernel API, is used. This is unlikely to
happen, though.

The memory add algorithm is not impacted since the choice of the node where
the memory is added is up to the hypervisor.

Laurent Dufour (3):
drmgr: don't open sysfs file for each command
drmgr: read the CPU NUMA topology
drmgr: introduce NUMA based LMB removal

 Makefile.am                 |   5 +-
 configure.ac                |   4 +
 src/drmgr/common.c          |  22 ++-
 src/drmgr/common_numa.c     | 268 +++++++++++++++++++++++++++++
 src/drmgr/common_numa.h     |  83 +++++++++
 src/drmgr/dr.h              |   3 +-
 src/drmgr/drslot_chrp_mem.c | 335 +++++++++++++++++++++++++++++++++++-
 src/drmgr/ofdt.h            |   2 +
 8 files changed, 710 insertions(+), 12 deletions(-)
 create mode 100644 src/drmgr/common_numa.c
 create mode 100644 src/drmgr/common_numa.h

--
2.29.2

Laurent Dufour

<ldufour@linux.ibm.com>
Dec 7, 2020, 3:44:11 AM12/7/20
to powerpc-utils-devel@googlegroups.com, tyreld@linux.ibm.com, nathanl@linux.ibm.com, cheloha@linux.ibm.com
When the NUMA topology can be read, all the LMBs found in the Device Tree
are linked to their corresponding node. LMBs not associated with a node are
considered unused.

LMBs associated with CPU-less nodes are accounted separately because they
are targeted first for removal. LMBs are removed from the CPU-less nodes so
as to reach an average number of LMBs per CPU-less node.

Nodes with CPUs have a ratio indexed on their number of CPUs: the more CPUs
a node has, the fewer LMBs are removed from it. This way, nodes with a high
number of CPUs keep a higher amount of memory.
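
For example, with made-up numbers and the 90/10 CPU/memory weighting used
by update_node_ratio() below:

    node A: 64 CPUs, 2000 LMBs    node B: 16 CPUs, 2000 LMBs

    cpu_ratio(A) = 64 * 100 / 80 = 80    mem_ratio(A) = 2000 * 100 / 4000 = 50
    cpu_ratio(B) = 16 * 100 / 80 = 20    mem_ratio(B) = 2000 * 100 / 4000 = 50

    ratio(A) = (80 * 9 + 50) / 10 = 77
    ratio(B) = (20 * 9 + 50) / 10 = 23

Node A ends up holding about 77% of the remaining LMBs, while node B, with
fewer CPUs, gives up more memory.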

When an LMB can't be removed (because its memory can't be offlined by the
kernel), the node's LMB count is decremented and the LMB is removed from
the node's LMB list. This way, it is no longer accounted as 'active' and
the removal operation continues without considering it.

The removal is done through the remove-by-DRC-index API, which removes one
LMB at a time. A future optimization would be to extend that API to remove
a linear range of LMBs at once.

If the NUMA topology can't be read, we fall back to the legacy removal
path.

Signed-off-by: Laurent Dufour <ldu...@linux.ibm.com>
---
 src/drmgr/drslot_chrp_mem.c | 335 +++++++++++++++++++++++++++++++++++-
 src/drmgr/ofdt.h            |   2 +
 2 files changed, 336 insertions(+), 1 deletion(-)

diff --git a/src/drmgr/drslot_chrp_mem.c b/src/drmgr/drslot_chrp_mem.c
index 502aa3e9fff0..47d9f7b8ed90 100644
--- a/src/drmgr/drslot_chrp_mem.c
+++ b/src/drmgr/drslot_chrp_mem.c
@@ -31,12 +31,16 @@
#include "dr.h"
#include "ofdt.h"
#include "drmem.h"
+#include "common_numa.h"

static int block_sz_bytes = 0;
static char *state_strs[] = {"offline", "online"};

static char *usagestr = "-c mem {-a | -r} {-q <quantity> -p {variable_weight | ent_capacity} | {-q <quantity> | -s [<drc_name> | <drc_index>]}}";

+static struct numa_topology numa;
+static int numa_enabled = 0;
+
/**
* mem_usage
* @brief return usage string
@@ -306,6 +310,31 @@ get_mem_node_lmbs(struct lmb_list_head *lmb_list)
return rc;
}

+static int link_lmb_to_numa_node(struct dr_node *lmb)
+{
+ int nid;
+ struct numa_node *node;
+
+ nid = numa_aa_index_to_node(&numa, lmb->lmb_aa_index);
+ if (nid == NUMA_NO_NODE)
+ return 0;
+
+ node = numa_fetch_node(&numa, nid);
+ if (!node)
+ return -ENOMEM;
+
+ lmb->lmb_numa_next = node->lmbs;
+ node->lmbs = lmb;
+ node->n_lmbs++;
+
+ if (node->n_cpus)
+ numa.lmb_count++;
+ else
+ numa.cpuless_lmb_count++;
+
+ return 0;
+}
+
int add_lmb(struct lmb_list_head *lmb_list, uint32_t drc_index,
uint64_t address, uint64_t lmb_sz, uint32_t aa_index,
uint32_t flags)
@@ -324,6 +353,9 @@ int add_lmb(struct lmb_list_head *lmb_list, uint32_t drc_index,
lmb->lmb_address = address;
lmb->lmb_aa_index = aa_index;

+ if (numa_enabled && link_lmb_to_numa_node(lmb))
+ return -ENOMEM;
+
if (flags & DRMEM_ASSIGNED) {
int rc;

@@ -490,7 +522,7 @@ get_dynamic_reconfig_lmbs(struct lmb_list_head *lmb_list)

if (stat(DYNAMIC_RECONFIG_MEM_V1, &sbuf) == 0) {
rc = get_dynamic_reconfig_lmbs_v1(lmb_sz, lmb_list);
- } else if (is_lsslot_cmd &&
+ } else if ((is_lsslot_cmd || numa_enabled) &&
stat(DYNAMIC_RECONFIG_MEM_V2, &sbuf) == 0) {
rc = get_dynamic_reconfig_lmbs_v2(lmb_sz, lmb_list);
} else {
@@ -1424,11 +1456,312 @@ int valid_mem_options(void)
return 0;
}

+static int remove_lmb_by_index(uint32_t drc_index)
+{
+ char cmdbuf[128];
+ int offset;
+
+ offset = sprintf(cmdbuf, "memory remove index 0x%x", drc_index);
+
+ return __do_kernel_dlpar(cmdbuf, offset, 1 /* Don't report error */);
+}
+
+static int remove_lmb_from_node(struct numa_node *node, uint32_t count)
+{
+ struct dr_node *lmb;
+ int err, done = 0, unlinked = 0;
+
+ say(DEBUG, "Try removing %d / %d LMBs from node %d\n",
+ count, node->n_lmbs, node->node_id);
+
+ for (lmb = node->lmbs; lmb && done < count; lmb = lmb->lmb_numa_next) {
+ unlinked ++;
+ err = remove_lmb_by_index(lmb->drc_index);
+ if (err)
+ say(WARN,"Can't remove LMB node:%d index:0x%x: %s\n",
+ node->node_id, lmb->drc_index, strerror(-err));
+ else
+ done++;
+ }
+
+ /*
+ * Decrement the node's LMB count since, whatever the outcome of the
+ * removal operation, it will not be tried again on that LMB.
+ */
+ node->n_lmbs -= unlinked;
+
+ /*
+ * Update the node's list of LMBs so the ones we removed, or tried to
+ * remove, are not processed again.
+ */
+ node->lmbs = lmb;
+
+ /* Update numa's counters */
+ if (node->n_cpus)
+ numa.lmb_count -= unlinked;
+ else
+ numa.cpuless_lmb_count -= unlinked;
+
+ if (!node->n_lmbs) {
+ node->ratio = 0; /* for sanity only */
+ if (node->n_cpus)
+ numa.cpu_count -= node->n_cpus;
+ else
+ numa.cpuless_node_count--;
+ }
+
+ say(INFO, "Removed %d LMBs from node %d\n", done, node->node_id);
+ return done;
+}
+
+#define min(a,b) (((a) < (b)) ? (a) : (b))
+
+static void update_cpuless_node_ratio(void)
+{
+ struct numa_node *node;
+ int nid;
+
+ /*
+ * Assumptions:
+ * 1. numa->cpuless_node_count is up to date
+ * 2. numa->cpuless_lmb_count is up to date
+ * Nodes with no memory and nodes with CPUs are ignored here.
+ */
+ numa_foreach_node(&numa, nid, node) {
+ if (node->n_cpus ||!node->n_lmbs)
+ continue;
+ node->ratio = (node->n_lmbs * 100) / numa.cpuless_lmb_count;
+ }
+}
+
+/*
+ * Remove LMBs from CPU-less nodes only.
+ * The more LMBs a node has, the more LMBs are removed from it.
+ *
+ * We have to retry the operation multiple times because some LMBs cannot be
+ * removed due to page usage in the kernel. In that case, the LMB is no
+ * longer taken into account and the node's LMB count is decremented, assuming
+ * that LMB is unremovable at this time. Thus each node's ratio has to be
+ * recomputed on each iteration. This is not a big deal; usually there are not
+ * that many nodes.
+ */
+static int remove_cpuless_lmbs(uint32_t count)
+{
+ struct numa_node *node;
+ int nid;
+ uint32_t total = count, todo, done = 0, this_loop;
+
+ while (count) {
+ count = min(count, numa.cpuless_lmb_count);
+ if (!count)
+ break;
+
+ update_cpuless_node_ratio();
+
+ this_loop = 0;
+ numa_foreach_node(&numa, nid, node) {
+ if (!node->n_lmbs || node->n_cpus)
+ continue;
+
+ todo = (count * node->ratio) / 100;
+ todo = min(todo, node->n_lmbs);
+ /* Fix rounded value to 0 */
+ if (!todo && node->n_lmbs)
+ todo = (count - this_loop);
+
+ if (todo)
+ todo = remove_lmb_from_node(node, todo);
+
+ this_loop += todo;
+ done += todo;
+ if (done >= total)
+ break;
+ }
+
+ /* Don't continue if we didn't make any progress. */
+ if (!this_loop)
+ break;
+
+ count -= this_loop;
+ }
+
+ say(DEBUG, "%d / %d LMBs removed from the CPU less nodes\n",
+ done, total);
+ return done;
+}
+
+static void update_node_ratio(void)
+{
+ int nid;
+ struct numa_node *node, *n, **p;
+ uint32_t cpu_ratio, mem_ratio;
+
+ /*
+ * Assumptions:
+ * 1. numa->cpu_count is up to date
+ * 2. numa->lmb_count is up to date
+ * Nodes with no memory and nodes with no CPU are ignored here.
+ */
+
+ numa.ratio = NULL;
+ numa_foreach_node(&numa, nid, node) {
+ if (!node->n_lmbs || !node->n_cpus)
+ continue;
+ cpu_ratio = (node->n_cpus * 100) / numa.cpu_count;
+ mem_ratio = (node->n_lmbs * 100) / numa.lmb_count;
+
+ /* Say that CPU ratio is 90% of the ratio */
+ node->ratio = (cpu_ratio * 9 + mem_ratio) / 10;
+ }
+
+ /* Create an ordered link of the nodes */
+ numa_foreach_node(&numa, nid, node) {
+ if (!node->n_lmbs || !node->n_cpus)
+ continue;
+
+ p = &numa.ratio;
+ for (n = numa.ratio;
+ n && n->ratio < node->ratio; n = n->ratio_next)
+ p = &n->ratio_next;
+ *p = node;
+ node->ratio_next = n;
+ }
+}
+
+/*
+ * Remove LMBs from nodes with CPUs.
+ *
+ * The fewer CPUs a node has, the more memory is removed from it.
+ *
+ * As for the CPU-less nodes, we must iterate because some LMBs may not be
+ * removable at this time.
+ */
+static int remove_cpu_lmbs(uint32_t count)
+{
+ struct numa_node *node;
+ uint32_t total = count, todo, done = 0, this_loop;
+ uint32_t new_lmb_count;
+
+ while(count) {
+ count = min(count, numa.lmb_count);
+ if (!count)
+ break;
+
+ update_node_ratio();
+
+ new_lmb_count = numa.lmb_count - count;
+
+ this_loop = 0;
+ numa_foreach_node_by_ratio(&numa, node) {
+ if (!node->n_lmbs || !node->n_cpus)
+ continue;
+
+ todo = (new_lmb_count * node->ratio) / 100;
+ todo = node->n_lmbs - min(todo, node->n_lmbs);
+ todo = min(count, todo);
+
+ if (todo) {
+ todo = remove_lmb_from_node(node, todo);
+ count -= todo;
+ this_loop += todo;
+ }
+
+ if (!count)
+ break;
+ }
+
+ /* Don't continue if we didn't make any progress. */
+ if (!this_loop)
+ break;
+ done += this_loop;
+ }
+
+ say(DEBUG, "%d / %d LMBs removed from the CPU nodes\n",
+ done, total);
+ return done;
+}
+
+static void build_numa_topology(void)
+{
+ int rc;
+
+ rc = numa_get_topology(&numa);
+ if (rc)
+ return;
+
+ numa_enabled = 1;
+}
+
+static void clear_numa_lmb_links(void)
+{
+ int nid;
+ struct numa_node *node;
+
+ numa_foreach_node(&numa, nid, node)
+ node->lmbs = NULL;
+}
+
+static int numa_based_remove(uint32_t count)
+{
+ struct lmb_list_head *lmb_list;
+ struct numa_node *node;
+ int nid;
+ uint32_t done = 0;
+
+ /*
+ * Read the LMBs
+ * Link the LMBs to their node
+ * Update global counter
+ */
+ lmb_list = get_lmbs(LMB_NORMAL_SORT);
+ if (lmb_list == NULL) {
+ clear_numa_lmb_links();
+ return -1;
+ }
+
+ if (!numa.node_count) {
+ clear_numa_lmb_links();
+ free_lmbs(lmb_list);
+ return -EINVAL;
+ }
+
+ numa_foreach_node(&numa, nid, node) {
+ say(INFO, "node %4d %4d CPUs %8d LMBs\n",
+ nid, node->n_cpus, node->n_lmbs);
+ }
+
+ done += remove_cpuless_lmbs(count);
+ count -= done;
+
+ done += remove_cpu_lmbs(count);
+
+ report_resource_count(done);
+
+ clear_numa_lmb_links();
+ free_lmbs(lmb_list);
+ return 0;
+}
+
int do_mem_kernel_dlpar(void)
{
char cmdbuf[128];
int rc, offset;

+
+ if (usr_action == REMOVE && usr_drc_count) {
+ build_numa_topology();
+ if (numa_enabled) {
+ if (!numa_based_remove(usr_drc_count))
+ return 0;
+
+ /*
+ * If the NUMA based removal failed, lets try the legacy
+ * way.
+ */
+ say(WARN, "Can't do NUMA based removal operation.\n");
+ }
+ }
+
offset = sprintf(cmdbuf, "%s ", "memory");

switch (usr_action) {
diff --git a/src/drmgr/ofdt.h b/src/drmgr/ofdt.h
index 3850a77229b4..3c2840b2e0ee 100644
--- a/src/drmgr/ofdt.h
+++ b/src/drmgr/ofdt.h
@@ -92,6 +92,7 @@ struct dr_node {
uint32_t _lmb_aa_index;
struct mem_scn *_mem_scns;
struct of_node *_of_node;
+ struct dr_node *_numa_next;
} _smem;

#define lmb_address _node_u._smem._address
@@ -99,6 +100,7 @@ struct dr_node {
#define lmb_aa_index _node_u._smem._lmb_aa_index
#define lmb_mem_scns _node_u._smem._mem_scns
#define lmb_of_node _node_u._smem._of_node
+#define lmb_numa_next _node_u._smem._numa_next

struct hea_info {
uint _port_no;
--
2.29.2

Laurent Dufour

<ldufour@linux.ibm.com>
Dec 8, 2020, 11:15:50 AM12/8/20
to powerpc-utils-devel@googlegroups.com, tyreld@linux.ibm.com, nathanl@linux.ibm.com, cheloha@linux.ibm.com
On 07/12/2020 at 09:44, Laurent Dufour wrote:
> When the NUMA topology can be read, all the LMBs found in the Device Tree
> are linked to their corresponding node. LMBs not associated with a node
> are considered unused.
>
> LMBs associated with CPU-less nodes are accounted separately because they
> are targeted first for removal. LMBs are removed from the CPU-less nodes
> so as to reach an average number of LMBs per CPU-less node.
>
> Nodes with CPUs have a ratio indexed on their number of CPUs: the more
> CPUs a node has, the fewer LMBs are removed from it. This way, nodes with
> a high number of CPUs keep a higher amount of memory.
>
> When an LMB can't be removed (because its memory can't be offlined by the
> kernel), the node's LMB count is decremented and the LMB is removed from
> the node's LMB list. This way, it is no longer accounted as 'active' and
> the removal operation continues without considering it.
>
> The removal is done through the remove-by-DRC-index API, which removes
> one LMB at a time. A future optimization would be to extend that API to
> remove a linear range of LMBs at once.
>
> If the NUMA topology can't be read, we fall back to the legacy removal
> path.

I forgot to mention that when not all the requested LMBs can be removed, a
partial status is reported to the HMC. The following should be added to
that commit's description:

When the requested amount of LMBs cannot be removed, a partial status is
reported. This is a major difference, since the kernel currently adds the
removed LMBs back when the requested amount cannot be reached. That's odd,
and reporting a partial status is better when the user wants to remove as
much memory as possible.

Tyrel, could you add that to the commit's description when picking up that
patch?

Thanks,
Laurent.

Nathan Lynch

<nathanl@linux.ibm.com>
Dec 10, 2020, 12:37:35 AM12/10/20
to Laurent Dufour, powerpc-utils-devel@googlegroups.com, tyreld@linux.ibm.com, cheloha@linux.ibm.com
Laurent Dufour <ldu...@linux.ibm.com> writes:
> This series introduces a change in the way the LMBs to be removed are
> chosen.
>
> Currently this is done in the kernel, but without taking the NUMA
> topology into account. Thus a node running a lot of CPUs may be starved
> of memory while a node with no CPU is left with plenty of memory.
>
> The overall idea is to delegate the choice of the LMBs to the drmgr user
> space process and to rely on the remove-by-index kernel API to remove
> them one LMB at a time. There is no need to do that in the kernel since
> all the information is available in user space, and coding it in user
> space eases maintenance.

I think overall this looks good. I had some minor comments on patch 1
but the new code in patches 2 and 3 seems fine.

Seeing more linked lists makes me wish this project used a utility
library such as GLib that provides nicer data structures, but I suppose
that's a larger issue that shouldn't get in the way of this change.
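
For instance, the per-node LMB chain in patch 3 could be a GLib
singly-linked list instead of the hand-rolled lmb_numa_next pointer
(illustrative only, with a made-up struct name, not a request to rework
this series):

#include <glib.h>

struct dr_node;			/* drmgr's LMB type, from ofdt.h */

struct numa_node_example {
	int node_id;
	GSList *lmbs;		/* each element points to a struct dr_node */
};

static void node_add_lmb(struct numa_node_example *node, struct dr_node *lmb)
{
	/* O(1) prepend, same ordering behaviour as the open-coded chain */
	node->lmbs = g_slist_prepend(node->lmbs, lmb);
}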

Nathan Lynch

<nathanl@linux.ibm.com>
Dec 10, 2020, 9:40:35 AM12/10/20
to Laurent Dufour, powerpc-utils-devel@googlegroups.com, tyreld@linux.ibm.com, cheloha@linux.ibm.com
Just to add to this: while this is a significant behavior change, we've
confirmed with HMC development that partial success has always been a
valid result here, and this doesn't constitute a change in the protocol.

Laurent, are you able to provide a log of the HMC's chhwres command when
drmgr returns partial success? I think that would be good to have in the
mailing list archive at least.

Laurent Dufour

<ldufour@linux.ibm.com>
Dec 10, 2020, 11:52:49 AM12/10/20
to Nathan Lynch, powerpc-utils-devel@googlegroups.com, tyreld@linux.ibm.com, cheloha@linux.ibm.com
Here is the output of the chhwres command when a partial status is reported:

hscroot@ltchmcv2:~> chhwres -m ltcbrazos2-SN107FCD7 -p ltcbrazos2-lp17 -r mem -o r -q 14336
HSCL2932 The dynamic removal of memory resources failed: The operating system
prevented all of the requested memory from being removed. Amount of memory
removed: 7680 MB of 14336 MB. The detailed output of the OS operation follows:
Dec 10 11:51:00 caDlparCommand:execv to drmgr
Validating Memory DLPAR capability...yes.



Please issue the lshwres command to list the memory resources of the partition
and to determine whether or not its pending and runtime memory values match. If
they do not match, problems with future memory-related operations on the managed
system may occur, and it is recommended that the rsthwres command to restore
memory resources be issued on the partition to synchronize its pending memory
value with its runtime memory value.

Laurent Dufour

<ldufour@linux.ibm.com>
Dec 10, 2020, 12:09:03 PM12/10/20
to powerpc-utils-devel@googlegroups.com, tyreld@linux.ibm.com, nathanl@linux.ibm.com, cheloha@linux.ibm.com
This series introduces a change in the way the LMBs to be removed are
chosen.

Currently this is done in the kernel, but without taking the NUMA topology
into account. Thus a node running a lot of CPUs may be starved of memory
while a node with no CPU is left with plenty of memory.

The overall idea is to delegate the choice of the LMBs to the drmgr user
space process and to rely on the remove-by-index kernel API to remove them
one LMB at a time. There is no need to do that in the kernel since all the
information is available in user space, and coding it in user space eases
maintenance.

When the requested amount of LMBs to remove cannot be reached, because
memory offlining is failing for instance, a partial status is reported to
the HMC. This is a major difference, since previously the kernel only
reported success or failure, adding the already removed LMBs back when the
requested amount was not reached. Returning a partial status is more user
friendly: since the user wants to remove memory, let's remove as much as we
can, up to the requested amount.

The HMC reports a partial status in the following way:

vvvvvvvvvvvvvvvv
hscroot@ltchmcv2:~> chhwres -m ltcbrazos2-SN107FCD7 -p ltcbrazos2-lp17 -r mem -o r -q 14336
HSCL2932 The dynamic removal of memory resources failed: The operating
system prevented all of the requested memory from being removed. Amount of
memory removed: 7680 MB of 14336 MB. The detailed output of the OS
operation follows: Dec 10 11:51:00 caDlparCommand:execv to drmgr Validating
Memory DLPAR capability...yes.

Please issue the lshwres command to list the memory resources of the
partition and to determine whether or not its pending and runtime memory
values match. If they do not match, problems with future memory-related
operations on the managed system may occur, and it is recommended that the
rsthwres command to restore memory resources be issued on the partition to
synchronize its pending memory value with its runtime memory value.
^^^^^^^^^^^^^^^^

Changes since V1:
patch 1/3: addressing Nathan's comments
patch 3/3: add comment about the partial status in the commit's
description

Laurent Dufour (3):
drmgr: don't open sysfs file for each command
drmgr: read the CPU NUMA topology
drmgr: introduce NUMA based LMB removal

 Makefile.am                 |   5 +-
 configure.ac                |   4 +
 src/drmgr/common.c          |  30 ++--
 src/drmgr/common_numa.c     | 268 ++++++++++++++++++++++++++++
 src/drmgr/common_numa.h     |  83 +++++++++
 src/drmgr/dr.h              |   6 +-
 src/drmgr/drslot_chrp_mem.c | 336 +++++++++++++++++++++++++++++++++++-
 src/drmgr/ofdt.h            |   2 +
 8 files changed, 718 insertions(+), 16 deletions(-)

Laurent Dufour

<ldufour@linux.ibm.com>
Dec 10, 2020, 12:09:03 PM12/10/20
to powerpc-utils-devel@googlegroups.com, tyreld@linux.ibm.com, nathanl@linux.ibm.com, cheloha@linux.ibm.com
This will be used in the next commit to compute the LMB removal based on
the NUMA topology.

The NUMA topology is read using libnuma, so a dependency on it is added in
the configure file.
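
As a standalone illustration of the libnuma calls this relies on (counting
the CPUs of each node; build with -lnuma; a sketch, not the patch's code):

#include <stdio.h>
#include <numa.h>

int main(void)
{
	struct bitmask *cpus;
	int nid, i, n_cpus;

	if (numa_available() < 0)
		return 1;		/* no NUMA support: caller must fall back */

	cpus = numa_allocate_cpumask();
	for (nid = 0; nid <= numa_max_node(); nid++) {
		if (!numa_bitmask_isbitset(numa_nodes_ptr, nid))
			continue;	/* node not present */
		if (numa_node_to_cpus(nid, cpus) < 0)
			break;
		/* Count the CPUs in that node */
		for (n_cpus = 0, i = 0; i < (int)cpus->size; i++)
			if (numa_bitmask_isbitset(cpus, i))
				n_cpus++;
		printf("node %d: %d CPUs\n", nid, n_cpus);
	}
	numa_bitmask_free(cpus);
	return 0;
}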

Signed-off-by: Laurent Dufour <ldu...@linux.ibm.com>
---
 Makefile.am             |   5 +-
 configure.ac            |   4 +
 src/drmgr/common_numa.c | 268 ++++++++++++++++++++++++++++++++++++++++
 src/drmgr/common_numa.h |  83 +++++++++++++
 4 files changed, 359 insertions(+), 1 deletion(-)
 create mode 100644 src/drmgr/common_numa.c
 create mode 100644 src/drmgr/common_numa.h

diff --git a/Makefile.am b/Makefile.am
index 2ff2232537df..31baaa74b353 100644
--- a/Makefile.am
+++ b/Makefile.am
@@ -155,6 +155,7 @@ src_drmgr_drmgr_SOURCES = \
src/drmgr/common_cpu.c \
src/drmgr/common_ofdt.c \
src/drmgr/common_pci.c \
+ src/drmgr/common_numa.c \
src/drmgr/drmgr.c \
src/drmgr/drmig_chrp_pmig.c \
src/drmgr/drslot_chrp_cpu.c \
@@ -171,13 +172,14 @@ noinst_HEADERS += \
src/drmgr/drcpu.h \
src/drmgr/dr.h \
src/drmgr/drmem.h \
+ src/drmgr/numa.h \
src/drmgr/drpci.h \
src/drmgr/rtas_calls.h \
src/drmgr/ofdt.h \
src/drmgr/rtas_calls.h \
src/drmgr/options.c

-src_drmgr_drmgr_LDADD = -lrtas
+src_drmgr_drmgr_LDADD = -lrtas -lnuma

src_drmgr_lsslot_SOURCES = \
src/drmgr/lsslot.c \
@@ -186,6 +188,7 @@ src_drmgr_lsslot_SOURCES = \
src/drmgr/common_cpu.c \
src/drmgr/common_pci.c \
src/drmgr/common_ofdt.c \
+ src/drmgr/common_numa.c \
src/drmgr/rtas_calls.c \
src/drmgr/drslot_chrp_mem.c \
$(pseries_platform_SOURCES)
diff --git a/configure.ac b/configure.ac
index de3c6758389a..0239754cc4f4 100644
--- a/configure.ac
+++ b/configure.ac
@@ -42,6 +42,10 @@ AC_CHECK_HEADER(zlib.h,
[AC_CHECK_LIB(z, inflate, [], [AC_MSG_FAILURE([zlib library is required for compilation])])],
[AC_MSG_FAILURE([zlib.h is required for compiliation])])

+AC_CHECK_HEADER(numa.h,
+ [AC_CHECK_LIB(numa, numa_available, [], [AC_MSG_FAILURE([numa library is required for compilation])])],
+ [AC_MSG_FAILURE([numa.h is required for compiliation])])
+
# check for librtas
AC_ARG_WITH([librtas],
[AS_HELP_STRING([--without-librtas],
diff --git a/src/drmgr/common_numa.c b/src/drmgr/common_numa.c
new file mode 100644
index 000000000000..5778769b25b6
--- /dev/null
+++ b/src/drmgr/common_numa.c
@@ -0,0 +1,268 @@
+/**
+ * @file common_numa.c
+ *
+ * Copyright (C) IBM Corporation 2020
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version 2
+ * of the License, or (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301, USA.
+ */
+
+#include <stdio.h>
+#include <errno.h>
+#include <numa.h>
+
+#include "dr.h"
+#include "ofdt.h"
+#include "drmem.h" /* for DYNAMIC_RECONFIG_MEM */
+#include "common_numa.h"
+
+#define RTAS_DIRECTORY "/proc/device-tree/rtas"
+#define CHOSEN_DIRECTORY "/proc/device-tree/chosen"
+#define ASSOC_REF_POINTS "ibm,associativity-reference-points"
+#define ASSOC_LOOKUP_ARRAYS "ibm,associativity-lookup-arrays"
+#define ARCHITECTURE_VEC_5 "ibm,architecture-vec-5"
+
+/*
+ * Allocate and read a property, return the size.
+ * The read property is not converted to the host endianess.
+ */
+static int load_property(char *dir, char *prop, uint32_t **buf)
+{
+ int size;
+
+ size = get_property_size(dir, prop);
+ if (!size)
+ return -ENOENT;
+
+ *buf = zalloc(size);
+ if (!*buf) {
+ say(ERROR, "Could not allocate buffer read %s (%d bytes)\n",
+ prop, size);
+ return -ENOMEM;
+ }
+
+ if (get_property(dir, prop, *buf, size)) {
+ free(*buf);
+ say(ERROR, "Can't retrieve %s/%s\n", dir, prop);
+ return -EINVAL;
+ }
+
+ return size;
+}
+
+/*
+ * Get the minimal common depth, based on form 1 of the
+ * ibm,associativity-reference-points property. We only support that form.
+ *
+ * We should check that the "ibm,architecture-vec-5" property byte 5 bit 0
+ * has the value of one.
+ */
+static int get_min_common_depth(struct numa_topology *numa)
+{
+ int size;
+ uint32_t *p;
+ unsigned char val;
+
+ size = load_property(CHOSEN_DIRECTORY, ARCHITECTURE_VEC_5, &p);
+ if (size < 0)
+ return size;
+
+ /* PAPR bytes start at 1 (not 0), but there is the length field */
+ if (size < 6) {
+ report_unknown_error(__FILE__, __LINE__);
+ free(p);
+ return -EINVAL;
+ }
+ val = ((unsigned char *)p)[5];
+ free(p);
+
+ if (!(val & 0x80))
+ return -ENOTSUP;
+
+ size = load_property(RTAS_DIRECTORY, ASSOC_REF_POINTS, &p);
+ if (size <= 0)
+ return size;
+ if (size < sizeof(uint32_t)) {
+ report_unknown_error(__FILE__, __LINE__);
+ free(p);
+ return -EINVAL;
+ }
+
+ /* Get the first entry */
+ numa->min_common_depth = be32toh(*p);
+ free(p);
+ return 0;
+}
+
+static int get_assoc_arrays(struct numa_topology *numa)
+{
+ int size;
+ int rc;
+ uint32_t *prop, i;
+ struct assoc_arrays *aa = &numa->aa;
+
+ size = load_property(DYNAMIC_RECONFIG_MEM, ASSOC_LOOKUP_ARRAYS, &prop);
+ if (size < 0)
+ return size;
+
+ size /= sizeof(uint32_t);
+ if (size < 2) {
+ say(ERROR, "Could not find the associativity lookup arrays\n");
+ free(prop);
+ return -EINVAL;
+ }
+
+ aa->n_arrays = be32toh(prop[0]);
+ aa->array_sz = be32toh(prop[1]);
+
+ rc = -EINVAL;
+ if (numa->min_common_depth > aa->array_sz) {
+ say(ERROR, "Bad min common depth or associativity array size\n");
+ goto out_free;
+ }
+
+ /* Sanity check */
+ if (size != (aa->n_arrays * aa->array_sz + 2)) {
+ say(ERROR, "Bad size of the associativity lookup arrays\n");
+ goto out_free;
+ }
+
+ aa->min_array = zalloc(aa->n_arrays * sizeof(uint32_t));
+
+ /* Keep only the most significant value */
+ for (i = 0; i < aa->n_arrays; i++) {
+ int prop_index = i * aa->array_sz + numa->min_common_depth + 1;
+
+ aa->min_array[i] = be32toh(prop[prop_index]);
+ }
+ rc = 0;
+
+out_free:
+ free(prop);
+ return rc;
+}
+
+struct numa_node *numa_fetch_node(struct numa_topology *numa, int nid)
+{
+ struct numa_node *node;
+
+ if (nid > MAX_NUMNODES) {
+ report_unknown_error(__FILE__, __LINE__);
+ return NULL;
+ }
+
+ node = numa->nodes[nid];
+ if (node)
+ return node;
+
+ node = zalloc(sizeof(struct numa_node));
+ if (!node) {
+ say(ERROR, "Can't allocate a new node\n");
+ return NULL;
+ }
+
+ node->node_id = nid;
+
+ if (!numa->node_count || nid < numa->node_min)
+ numa->node_min = nid;
+ if (nid > numa->node_max)
+ numa->node_max = nid;
+
+ numa->nodes[nid] = node;
+ numa->node_count++;
+
+ return node;
+}
+
+/*
+ * Read the number of CPUs for each node, using libnuma to get the details
+ * from sysfs.
+ */
+static int read_numa_topology(struct numa_topology *numa)
+{
+ struct bitmask *cpus;
+ struct numa_node *node;
+ int rc, max_node, nid, i;
+
+ if (numa_available() < 0)
+ return -ENOENT;
+
+ max_node = numa_max_node();
+ if (max_node >= MAX_NUMNODES) {
+ say(ERROR, "Too many nodes %d (max:%d)\n",
+ max_node, MAX_NUMNODES);
+ return -EINVAL;
+ }
+
+ rc = 0;
+
+ /* In case of allocation error, libnuma calls exit() */
+ cpus = numa_allocate_cpumask();
+
+ for (nid = 0; nid <= max_node; nid++) {
+
+ if (!numa_bitmask_isbitset(numa_nodes_ptr, nid))
+ continue;
+
+ node = numa_fetch_node(numa, nid);
+ if (!node) {
+ rc = -ENOMEM;
+ break;
+ }
+
+ rc = numa_node_to_cpus(nid, cpus);
+ if (rc < 0)
+ break;
+
+ /* Count the CPUs in that node */
+ for (i = 0; i < cpus->size; i++)
+ if (numa_bitmask_isbitset(cpus, i))
+ node->n_cpus++;
+
+ numa->cpu_count += node->n_cpus;
+ }
+
+ numa_bitmask_free(cpus);
+
+ if (rc) {
+ numa_foreach_node(numa, nid, node)
+ node->n_cpus = 0;
+ numa->cpu_count = 0;
+ }
+
+ return rc;
+}
+
+int numa_get_topology(struct numa_topology *numa)
+{
+ int rc;
+
+ rc = get_min_common_depth(numa);
+ if (rc)
+ return rc;
+
+
+ rc = get_assoc_arrays(numa);
+ if (rc)
+ return rc;
+
+ rc = read_numa_topology(numa);
+ if (rc)
+ return rc;
+
+ if (!numa->node_count)
+ return -1;
+
+ return 0;
+}
diff --git a/src/drmgr/common_numa.h b/src/drmgr/common_numa.h
new file mode 100644
index 000000000000..4d0054926819
--- /dev/null
+++ b/src/drmgr/common_numa.h
@@ -0,0 +1,83 @@
+/**
+ * @file common_numa.h
+ *
+ * Copyright (C) IBM Corporation 2020
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version 2
+ * of the License, or (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301, USA.
+ */
+#ifndef _NUMA_H_
+#define _NUMA_H_
+
+#define MAX_NUMNODES 256
+#define NUMA_NO_NODE -1
+
+struct numa_node {
+ int node_id;
+ unsigned int n_cpus;
+ unsigned int n_lmbs;
+ unsigned int ratio;
+ struct dr_node *lmbs; /* linked by lmb_numa_next */
+ struct numa_node *ratio_next;
+};
+
+struct assoc_arrays {
+ uint32_t n_arrays;
+ uint32_t array_sz;
+ uint32_t *min_array;
+};
+
+struct numa_topology {
+ unsigned int cpu_count;
+ unsigned int lmb_count;
+ unsigned int cpuless_node_count;
+ unsigned int cpuless_lmb_count;
+ unsigned int node_count, node_min, node_max;
+ struct numa_node *nodes[MAX_NUMNODES];
+ struct numa_node *ratio;
+ uint32_t min_common_depth;
+ struct assoc_arrays aa;
+};
+
+int numa_get_topology(struct numa_topology *numa);
+struct numa_node *numa_fetch_node(struct numa_topology *numa, int node_id);
+
+static inline int numa_aa_index_to_node(struct numa_topology *numa,
+ uint32_t aa_index)
+{
+ if (aa_index < numa->aa.n_arrays)
+ return numa->aa.min_array[aa_index];
+ return NUMA_NO_NODE;
+}
+
+static inline int next_node(struct numa_topology *numa, int nid,
+ struct numa_node **node)
+{
+ for (nid++; nid <= numa->node_max; nid++)
+ if (numa->nodes[nid]) {
+ *node = numa->nodes[nid];
+ break;
+ }
+ return nid;
+}
+
+#define numa_foreach_node(numa, nid, node) \
+ for (nid = (numa)->node_min, node = (numa)->nodes[nid]; \
+ nid <= (numa)->node_max; \
+ nid = next_node(numa, nid, &(node)))
+
+#define numa_foreach_node_by_ratio(numa, node) \
+ for (node = (numa)->ratio; node; node = node->ratio_next)
+
+#endif /* _NUMA_H_ */
--
2.29.2

Laurent Dufour

<ldufour@linux.ibm.com>
Dec 10, 2020, 12:09:04 PM12/10/20
to powerpc-utils-devel@googlegroups.com, tyreld@linux.ibm.com, nathanl@linux.ibm.com, cheloha@linux.ibm.com
When the NUMA topology can be read, all the LMBs found in the Device Tree
are linked to their corresponding node. LMBs not associated with a node are
considered unused.

LMBs associated with CPU-less nodes are accounted separately because they
are targeted first for removal. LMBs are removed from the CPU-less nodes so
as to reach an average number of LMBs per CPU-less node.

Nodes with CPUs have a ratio indexed on their number of CPUs: the more CPUs
a node has, the fewer LMBs are removed from it. This way, nodes with a high
number of CPUs keep a higher amount of memory.

When an LMB can't be removed (because its memory can't be offlined by the
kernel), the node's LMB count is decremented and the LMB is removed from
the node's LMB list. This way, it is no longer accounted as 'active' and
the removal operation continues without considering it.

The removal is done through the remove-by-DRC-index API, which removes one
LMB at a time. A future optimization would be to extend that API to remove
a linear range of LMBs at once.

When the requested amount of LMBs cannot be removed, a partial status is
reported. This is a major difference, since the kernel currently adds the
removed LMBs back when the requested amount cannot be reached. That's odd,
and reporting a partial status is better when the user wants to remove as
much memory as possible.

If the NUMA topology can't be read, we fall back to the legacy removal
path.

Signed-off-by: Laurent Dufour <ldu...@linux.ibm.com>
---
 src/drmgr/drslot_chrp_mem.c | 336 +++++++++++++++++++++++++++++++++++-
 src/drmgr/ofdt.h            |   2 +
 2 files changed, 337 insertions(+), 1 deletion(-)

diff --git a/src/drmgr/drslot_chrp_mem.c b/src/drmgr/drslot_chrp_mem.c
index 502aa3e9fff0..7aee377f9aa4 100644
--- a/src/drmgr/drslot_chrp_mem.c
+++ b/src/drmgr/drslot_chrp_mem.c
@@ -31,12 +31,16 @@
#include "dr.h"
#include "ofdt.h"
#include "drmem.h"
+#include "common_numa.h"

static int block_sz_bytes = 0;
static char *state_strs[] = {"offline", "online"};

static char *usagestr = "-c mem {-a | -r} {-q <quantity> -p {variable_weight | ent_capacity} | {-q <quantity> | -s [<drc_name> | <drc_index>]}}";

+static struct numa_topology numa;
+static int numa_enabled = 0;
+
/**
* mem_usage
* @brief return usage string
@@ -306,6 +310,31 @@ get_mem_node_lmbs(struct lmb_list_head *lmb_list)
return rc;
}

+static int link_lmb_to_numa_node(struct dr_node *lmb)
+{
+ int nid;
+ struct numa_node *node;
+
+ nid = numa_aa_index_to_node(&numa, lmb->lmb_aa_index);
+ if (nid == NUMA_NO_NODE)
+ return 0;
+
+ node = numa_fetch_node(&numa, nid);
+ if (!node)
+ return -ENOMEM;
+
+ lmb->lmb_numa_next = node->lmbs;
+ node->lmbs = lmb;
+ node->n_lmbs++;
+
+ if (node->n_cpus)
+ numa.lmb_count++;
+ else
+ numa.cpuless_lmb_count++;
+
+ return 0;
+}
+
int add_lmb(struct lmb_list_head *lmb_list, uint32_t drc_index,
uint64_t address, uint64_t lmb_sz, uint32_t aa_index,
uint32_t flags)
@@ -324,6 +353,9 @@ int add_lmb(struct lmb_list_head *lmb_list, uint32_t drc_index,
lmb->lmb_address = address;
lmb->lmb_aa_index = aa_index;

+ if (numa_enabled && link_lmb_to_numa_node(lmb))
+ return -ENOMEM;
+
if (flags & DRMEM_ASSIGNED) {
int rc;

@@ -490,7 +522,7 @@ get_dynamic_reconfig_lmbs(struct lmb_list_head *lmb_list)

if (stat(DYNAMIC_RECONFIG_MEM_V1, &sbuf) == 0) {
rc = get_dynamic_reconfig_lmbs_v1(lmb_sz, lmb_list);
- } else if (is_lsslot_cmd &&
+ } else if ((is_lsslot_cmd || numa_enabled) &&
stat(DYNAMIC_RECONFIG_MEM_V2, &sbuf) == 0) {
rc = get_dynamic_reconfig_lmbs_v2(lmb_sz, lmb_list);
} else {
@@ -1424,11 +1456,313 @@ int valid_mem_options(void)
return 0;
}

+static int remove_lmb_by_index(uint32_t drc_index)
+{
+ char cmdbuf[128];
+ int offset;
+
+ offset = sprintf(cmdbuf, "memory remove index 0x%x", drc_index);
+
+ return do_kernel_dlpar_common(cmdbuf, offset,
+ 1 /* Don't report error */);
+}
+
+static int remove_lmb_from_node(struct numa_node *node, uint32_t count)
+{
+ struct dr_node *lmb;
+ int err, done = 0, unlinked = 0;
+
+ say(DEBUG, "Try removing %d / %d LMBs from node %d\n",
+ count, node->n_lmbs, node->node_id);
+
+{
+ struct numa_node *node;
+ int nid;
+
+ /*
+ * Assumptions:
+ * 1. numa->cpuless_node_count is up to date
+ * 2. numa->cpuless_lmb_count is up to date
+ * Nodes with no memory and nodes with CPUs are ignored here.
+ */
+ numa_foreach_node(&numa, nid, node) {
+ if (node->n_cpus ||!node->n_lmbs)
+ continue;
+ node->ratio = (node->n_lmbs * 100) / numa.cpuless_lmb_count;
+ }
+}
+
+/*
+ * Remove LMBs from CPU-less nodes only.
+ * The more LMBs a node has, the more LMBs are removed from it.
+ *
+ * We have to retry the operation multiple times because some LMBs cannot be
+ * removed due to page usage in the kernel. In that case, the LMB is no
+ * longer taken into account and the node's LMB count is decremented, assuming
+ * that LMB is unremovable at this time. Thus each node's ratio has to be
+ * recomputed on each iteration. This is not a big deal; usually there are not
+ * that many nodes.
+ */
+static int remove_cpuless_lmbs(uint32_t count)
+{
+ struct numa_node *node;
+ break;
+ }
+
+}
+
+/*
+ * Remove LMBs from nodes with CPUs.
+ *
+ * The fewer CPUs a node has, the more memory is removed from it.
+ *
+ * As for the CPU-less nodes, we must iterate because some LMBs may not be
+ * removable at this time.
+ */
+static int remove_cpu_lmbs(uint32_t count)
+{
+ struct numa_node *node;
+ break;
+ }
+
+ /* Don't continue if we didn't make any progress. */
+ if (!this_loop)
+ break;
+ done += this_loop;
+ }
+
+ say(DEBUG, "%d / %d LMBs removed from the CPU nodes\n",
+ done, total);
+ return done;
+}
+
+static void build_numa_topology(void)
+{
+ int rc;
+
+ rc = numa_get_topology(&numa);
+ if (rc)
+ return;
+
+ numa_enabled = 1;
+}
+
+static void clear_numa_lmb_links(void)
+{
+ int nid;
+ struct numa_node *node;
+
+ numa_foreach_node(&numa, nid, node)
+ node->lmbs = NULL;
+}
+
+static int numa_based_remove(uint32_t count)
+{
+ struct lmb_list_head *lmb_list;
+ struct numa_node *node;
+ int nid;
+ uint32_t done = 0;
+
+ /*
+ * Read the LMBs
+ * Link the LMBs to their node
+ * Update global counter
+ */
+ lmb_list = get_lmbs(LMB_NORMAL_SORT);
+ if (lmb_list == NULL) {
+ clear_numa_lmb_links();
+ return -1;
+ }
+
+ if (!numa.node_count) {
+ clear_numa_lmb_links();
+ free_lmbs(lmb_list);
+ return -EINVAL;
+ }
+
+ numa_foreach_node(&numa, nid, node) {
+ say(INFO, "node %4d %4d CPUs %8d LMBs\n",
+ nid, node->n_cpus, node->n_lmbs);
+ }
+
+ done += remove_cpuless_lmbs(count);
+ count -= done;
+
+ done += remove_cpu_lmbs(count);
+
+ report_resource_count(done);
+
+ clear_numa_lmb_links();
+ free_lmbs(lmb_list);
+ return 0;
+}
+
int do_mem_kernel_dlpar(void)
{
char cmdbuf[128];
int rc, offset;

+
+ if (usr_action == REMOVE && usr_drc_count) {
+ build_numa_topology();
+ if (numa_enabled) {
+ if (!numa_based_remove(usr_drc_count))
+ return 0;
+
--
2.29.2

Laurent Dufour

<ldufour@linux.ibm.com>
Dec 15, 2020, 5:10:33 AM12/15/20
to powerpc-utils-devel@googlegroups.com, tyreld@linux.ibm.com, nathanl@linux.ibm.com, cheloha@linux.ibm.com
This will be used in the next commit to compute the LMB removal based on
the NUMA topology.

The NUMA topology is read using libnuma, so a dependency on it is added in
the configure file.

Signed-off-by: Laurent Dufour <ldu...@linux.ibm.com>
---
 Makefile.am             |   5 +-
 configure.ac            |   4 +
 src/drmgr/common_numa.c | 268 ++++++++++++++++++++++++++++++++++++++++
 src/drmgr/common_numa.h |  83 +++++++++++++
 4 files changed, 359 insertions(+), 1 deletion(-)
 create mode 100644 src/drmgr/common_numa.c
 create mode 100644 src/drmgr/common_numa.h

diff --git a/Makefile.am b/Makefile.am
index 2ff2232537df..422503efd07c 100644
--- a/Makefile.am
+++ b/Makefile.am
@@ -155,6 +155,7 @@ src_drmgr_drmgr_SOURCES = \
src/drmgr/common_cpu.c \
src/drmgr/common_ofdt.c \
src/drmgr/common_pci.c \
+ src/drmgr/common_numa.c \
src/drmgr/drmgr.c \
src/drmgr/drmig_chrp_pmig.c \
src/drmgr/drslot_chrp_cpu.c \
@@ -171,13 +172,14 @@ noinst_HEADERS += \
src/drmgr/drcpu.h \
src/drmgr/dr.h \
src/drmgr/drmem.h \
+ src/drmgr/common_numa.h \
+ return -ENOMEM;
+ }
+
+ if (get_property(dir, prop, *buf, size)) {
+ free(*buf);
+ say(ERROR, "Can't retrieve %s/%s\n", dir, prop);
+ return -EINVAL;
+ }
+
+ return size;
+}
+
+/*
+ * Get the minimal common depth, based on form 1 of the
+ * ibm,associativity-reference-points property. We only support that form.
+ *
+ * We should check that the "ibm,architecture-vec-5" property byte 5 bit 0
+ * has the value of one.
+ */
+static int get_min_common_depth(struct numa_topology *numa)
+{
+ int size;
+ uint32_t *p;
+ unsigned char val;
+
+ size = load_property(CHOSEN_DIRECTORY, ARCHITECTURE_VEC_5, &p);
+ if (size < 0)
+ return size;
+
+ /* PAPR bytes start at 1 (not 0), but there is the length field */
+ if (size < 6) {
+ report_unknown_error(__FILE__, __LINE__);
+ free(p);
+ return -EINVAL;
+ }
+ val = ((unsigned char *)p)[5];
+ free(p);
+
+ if (!(val & 0x80))
+ return -ENOTSUP;
+
+ size = load_property(RTAS_DIRECTORY, ASSOC_REF_POINTS, &p);
+ if (size <= 0)
+ return size;
+ if (size < sizeof(uint32_t)) {
+ report_unknown_error(__FILE__, __LINE__);
+ free(p);
+ return -EINVAL;
+ }
+
+ /* Get the first entry */
+ numa->min_common_depth = be32toh(*p);
+ free(p);
+ return 0;
+}
+
+static int get_assoc_arrays(struct numa_topology *numa)
+{
+ int size;
+ int rc;
+ uint32_t *prop, i;
+ struct assoc_arrays *aa = &numa->aa;
+
+ size = load_property(DYNAMIC_RECONFIG_MEM, ASSOC_LOOKUP_ARRAYS, &prop);
+ if (size < 0)
+ return size;
+
+ size /= sizeof(uint32_t);
+ if (size < 2) {
+ say(ERROR, "Could not find the associativity lookup arrays\n");
+ free(prop);
+ return -EINVAL;
+ }
+
+{
+ struct numa_node *node;
+
+}
+
+/*
+ * Read the number of CPUs for each node, using libnuma to get the details
+ * from sysfs.
+ */
+static int read_numa_topology(struct numa_topology *numa)
+{
+ struct bitmask *cpus;
+ struct numa_node *node;
+ int rc, max_node, nid, i;
+
+ if (numa_available() < 0)
+ return -ENOENT;
+
+ max_node = numa_max_node();
+ if (max_node >= MAX_NUMNODES) {
+ say(ERROR, "Too many nodes %d (max:%d)\n",
+ max_node, MAX_NUMNODES);
+ return -EINVAL;
+ }
+
+ rc = 0;
+
+ /* In case of allocation error, libnuma calls exit() */
+ cpus = numa_allocate_cpumask();
+
+ for (nid = 0; nid <= max_node; nid++) {
+
+ if (!numa_bitmask_isbitset(numa_nodes_ptr, nid))
+ continue;
+
+ node = numa_fetch_node(numa, nid);
+ if (!node) {
+ rc = -ENOMEM;
+ break;
+ }
+
+ rc = numa_node_to_cpus(nid, cpus);
+ if (rc < 0)
+ break;
+
+ /* Count the CPUs in that node */
+ for (i = 0; i < cpus->size; i++)
+ if (numa_bitmask_isbitset(cpus, i))
+ node->n_cpus++;
+
+ numa->cpu_count += node->n_cpus;
+ }
+
+ numa_bitmask_free(cpus);
+
+ if (rc) {
+ numa_foreach_node(numa, nid, node)
+ node->n_cpus = 0;
+ numa->cpu_count = 0;
+ }
+
+ return rc;
+}
+
+int numa_get_topology(struct numa_topology *numa)
+{
+ int rc;
+
+ rc = get_min_common_depth(numa);
+ if (rc)
+ return rc;
+
+
+ rc = get_assoc_arrays(numa);
+ if (rc)
+ return rc;
+
+ rc = read_numa_topology(numa);
+ if (rc)
+ return rc;
+
+ if (!numa->node_count)
+ return -1;
+

Laurent Dufour

<ldufour@linux.ibm.com>
unread,
Dec 15, 2020, 5:10:36 AM12/15/20
to powerpc-utils-devel@googlegroups.com, tyreld@linux.ibm.com, nathanl@linux.ibm.com, cheloha@linux.ibm.com
When the NUMA topology can be read, all the LMBs found in the Device Tree
are linked to their corresponding node. LMBs not associated with a node are
considered unused.

LMBs associated with CPU-less nodes are accounted separately because they
are targeted first for removal. LMBs are removed from the CPU-less nodes so
as to reach an average number of LMBs per CPU-less node.

Nodes with CPUs have a ratio indexed on their number of CPUs: the more CPUs
a node has, the fewer LMBs are removed from it. This way, nodes with a high
number of CPUs keep a higher amount of memory.

When an LMB can't be removed (because its memory can't be offlined by the
kernel), the node's LMB count is decremented and the LMB is removed from
the node's LMB list. This way, it is no longer accounted as 'active' and
the removal operation continues without considering it.

The removal is done through the remove-by-DRC-index API, which removes one
LMB at a time. A future optimization would be to extend that API to remove
a linear range of LMBs at once.

When the requested amount of LMBs cannot be removed, a partial status is
reported. This is a major difference, since the kernel currently adds the
removed LMBs back when the requested amount cannot be reached. That's odd,
and reporting a partial status is better when the user wants to remove as
much memory as possible.

If the NUMA topology can't be read, we fall back to the legacy removal
path.

Signed-off-by: Laurent Dufour <ldu...@linux.ibm.com>
---
+ struct numa_node *node;
+
+ nid = numa_aa_index_to_node(&numa, lmb->lmb_aa_index);
+ if (nid == NUMA_NO_NODE)
+ return 0;
+
+ node = numa_fetch_node(&numa, nid);
+ if (!node)
+ return -ENOMEM;
+
+ lmb->lmb_numa_next = node->lmbs;
+ node->lmbs = lmb;
+ node->n_lmbs++;
+
+ if (node->n_cpus)
+ numa.lmb_count++;
+ else
+ numa.cpuless_lmb_count++;
+
+ return 0;
+}
+
+{
+ struct numa_node *node;
+ int nid;
+
+ /*
+ * Assumptions:
+ * 1. numa->cpuless_node_count is up to date
+ * 2. numa->cpuless_lmb_count is up to date
+ * Nodes with no memory and nodes with CPUs are ignored here.
+ */
+ numa_foreach_node(&numa, nid, node) {
+ if (node->n_cpus ||!node->n_lmbs)
+ continue;
+ node->ratio = (node->n_lmbs * 100) / numa.cpuless_lmb_count;
+ }
+}
+
+/*
+ * Remove LMBs from CPU-less nodes only.
+ * The more LMBs a node has, the more LMBs are removed from it.
+ *
+ * We have to retry the operation multiple times because some LMBs cannot be
+ * removed due to page usage in the kernel. In that case, the LMB is no
+ * longer taken into account and the node's LMB count is decremented, assuming
+ * that LMB is unremovable at this time. Thus each node's ratio has to be
+ * recomputed on each iteration. This is not a big deal; usually there are not
+ * that many nodes.
+ */
+static int remove_cpuless_lmbs(uint32_t count)
+{
+ struct numa_node *node;
+ break;
+ }
+
+}
+
+/*
+ * Remove LMBs from nodes with CPUs.
+ *
+ * The fewer CPUs a node has, the more memory is removed from it.
+ *
+ * As for the CPU-less nodes, we must iterate because some LMBs may not be
+ * removable at this time.
+ */
+static int remove_cpu_lmbs(uint32_t count)
+{
+ struct numa_node *node;
+ break;
+ }
+
+ /* Don't continue if we didn't make any progress. */
+ if (!this_loop)
+ break;
+ done += this_loop;
+ }
+
+ say(DEBUG, "%d / %d LMBs removed from the CPU nodes\n",
+ done, total);
+ return done;
+}
+
+static void build_numa_topology(void)
+{
+ int rc;
+
+ rc = numa_get_topology(&numa);
+ if (rc)
+ return;
+
+ numa_enabled = 1;
+}
+
+static void clear_numa_lmb_links(void)
+{
+ int nid;
+ struct numa_node *node;
+
+ numa_foreach_node(&numa, nid, node)
+ node->lmbs = NULL;
+}
+
+static int numa_based_remove(uint32_t count)
+{
+ struct lmb_list_head *lmb_list;
+ struct numa_node *node;
+ int nid;
+ uint32_t done = 0;
+
+ /*
+ * Read the LMBs
+ * Link the LMBs to their node
+ * Update global counter
+ */
+ lmb_list = get_lmbs(LMB_NORMAL_SORT);
+ if (lmb_list == NULL) {
+ clear_numa_lmb_links();
+ return -1;
+ }
+
+ if (!numa.node_count) {
+ clear_numa_lmb_links();
+ free_lmbs(lmb_list);
+ return -EINVAL;
+ }
+
+ numa_foreach_node(&numa, nid, node) {
+ say(INFO, "node %4d %4d CPUs %8d LMBs\n",
+ nid, node->n_cpus, node->n_lmbs);
+ }
+
+ done += remove_cpuless_lmbs(count);
+ count -= done;
+
+ done += remove_cpu_lmbs(count);
+
+ report_resource_count(done);
+
+ clear_numa_lmb_links();
+ free_lmbs(lmb_list);
+ return 0;
+}
+
int do_mem_kernel_dlpar(void)
{
char cmdbuf[128];
int rc, offset;

+
+ if (usr_action == REMOVE && usr_drc_count) {
+ build_numa_topology();
+ if (numa_enabled) {
+ if (!numa_based_remove(usr_drc_count))
+ return 0;
+

Laurent Dufour

<ldufour@linux.ibm.com>
unread,
Dec 15, 2020, 5:13:04 AM12/15/20
to powerpc-utils-devel@googlegroups.com, tyreld@linux.ibm.com, nathanl@linux.ibm.com, cheloha@linux.ibm.com
Changes since V2:
patch 2/3: fix build issue about src/drmgr/common_numa.h in Makefile.am
Changes since V1:
patch 1/3: addressing Nathan's comments
patch 3/3: add comment about the partial status in the commit's
description

Laurent Dufour (3):
drmgr: don't open sysfs file for each command
drmgr: read the CPU NUMA topology
drmgr: introduce NUMA based LMB removal

 Makefile.am                 |   5 +-
 configure.ac                |   4 +
 src/drmgr/common.c          |  30 ++--
 src/drmgr/common_numa.c     | 268 ++++++++++++++++++++++++++++
 src/drmgr/common_numa.h     |  83 +++++++++
 src/drmgr/dr.h              |   6 +-
 src/drmgr/drslot_chrp_mem.c | 336 +++++++++++++++++++++++++++++++++++-
 src/drmgr/ofdt.h            |   2 +
 8 files changed, 718 insertions(+), 16 deletions(-)
 create mode 100644 src/drmgr/common_numa.c
 create mode 100644 src/drmgr/common_numa.h

--
2.29.2

Laurent Dufour

<ldufour@linux.ibm.com>
Dec 15, 2020, 7:03:35 AM12/15/20
to powerpc-utils-devel@googlegroups.com, tyreld@linux.ibm.com, nathanl@linux.ibm.com, cheloha@linux.ibm.com
Please read "Re: [PATCH v3 0/3] NUMA based LMB remove"
^^

Nathan Lynch

<nathanl@linux.ibm.com>
Jan 6, 2021, 11:12:38 AM1/6/21
to Laurent Dufour, powerpc-utils-devel@googlegroups.com, tyreld@linux.ibm.com, cheloha@linux.ibm.com
Hi Laurent,

A couple minor suggestions, with a point about naming at the end.

Laurent Dufour <ldu...@linux.ibm.com> writes:
> This will be used in the next commit to compute LMB removal based on the
> NUMA topology.
>
> The NUMA topology is read using the libnuma, so a dependency against it is
> added in the configure file.

[...]
The build system changes look OK to me.
Best to use const where possible:

static int load_property(const char *dir, const char *prop, uint32_t **buf)

get_property_size() and get_property() both have their corresponding
parameters as const.


> + int size;
> +
> + size = get_property_size(dir, prop);
> + if (!size)
> + return -ENOENT;
> +
> + *buf = zalloc(size);
> + if (!*buf) {
> + say(ERROR, "Could not allocate buffer read %s (%d bytes)\n",
> + prop, size);
> + return -ENOMEM;
> + }
> +
> + if (get_property(dir, prop, *buf, size)) {
> + free(*buf);
> + say(ERROR, "Can't retrieve %s/%s\n", dir, prop);
> + return -EINVAL;
> + }
> +
> + return size;
> +}

[...]

> +int numa_get_topology(struct numa_topology *numa)
> +{
> + int rc;
> +
> + rc = get_min_common_depth(numa);
> + if (rc)
> + return rc;
> +
> +
> + rc = get_assoc_arrays(numa);
> + if (rc)
> + return rc;
> +
> + rc = read_numa_topology(numa);
> + if (rc)
> + return rc;
> +
> + if (!numa->node_count)
> + return -1;
> +
> + return 0;
> +}

Consider checking the result of libnuma's numa_available() before doing
any device tree parsing or calling other libnuma APIs such as
numa_max_node():

From /usr/include/numa.h:

/* NUMA support available. If this returns a negative value all other function
in this library are undefined. */
int numa_available(void);
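
Concretely, the check could move to the top of numa_get_topology() (a
sketch of the suggested ordering, keeping the function's current
structure):

int numa_get_topology(struct numa_topology *numa)
{
	int rc;

	/* Bail out before any device tree parsing or other libnuma calls. */
	if (numa_available() < 0)
		return -ENOENT;

	rc = get_min_common_depth(numa);
	if (rc)
		return rc;

	rc = get_assoc_arrays(numa);
	if (rc)
		return rc;

	rc = read_numa_topology(numa);
	if (rc)
		return rc;

	return numa->node_count ? 0 : -1;
}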
Raising a point about naming. Looking at libnuma's header, I have the
impression that it more or less claims the "numa_*" namespace.

I'm not so concerned about conflicts between the library and the drmgr
code. If one arises due to libnuma introducing a new API, we can just
change drmgr.

I'm more interested in being able to distinguish between calls to
libnuma and calls to internal APIs when reading drmgr's code. Using a
different prefix for internal numa-related identifiers would help a
lot. Maybe ppcnuma_?
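
For instance (hypothetical names, just to illustrate the convention):

struct ppcnuma_node *ppcnuma_fetch_node(struct ppcnuma_topology *numa, int nid);
int ppcnuma_get_topology(struct ppcnuma_topology *numa);

so that a libnuma call such as numa_node_to_cpus() is immediately
distinguishable from a drmgr-internal helper.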

I realize this would cause a lot of churn in your patches, sorry.

Laurent Dufour

<ldufour@linux.ibm.com>
Jan 7, 2021, 5:45:39 AM1/7/21
to Nathan Lynch, powerpc-utils-devel@googlegroups.com, tyreld@linux.ibm.com, cheloha@linux.ibm.com
Hi Nathan,

I have to admit I'm not used to the const prefix. I'll add them here.
Sure, I'll add that check.
Sounds good to me.

>
> I realize this would cause a lot of churn in your patches, sorry.
>

I guess cscope will manage that easily, and drmgr is not that big.
