[PATCH next] drmgr: Patch bug in remove memory

Ryan Whittaker

<ryancwmd@linux.ibm.com>
Jul 22, 2025, 2:04:30 PM
to powerpc-utils-devel@googlegroups.com, tyreld@linux.ibm.com, Ryan Whittaker, Mingming Cao
Previously, when a cpuless node’s lmb ratio was too small compared to the other
cpuless nodes, drmgr would attempt to remove more lmbs than were available
in the node. This upset the ratio of lmbs removed and broke the counters
tracking how many lmbs to remove, so more lmbs than requested were
removed. Removing more lmbs than requested caused the decrement of an unsigned
int to wrap around, eventually leading drmgr to remove all
available memory in the system.

Existing testing exposed the out-of-memory issue triggered by the counter
overflow. The bug was recreated, and testing confirmed that this patch
resolves the issue.

This patch contains the following:

Disallows requesting more lmbs than available in a node.

Adjusts removal todo per node dynamically based on the count value (only if the
amount intended to remove has changed in a particular iteration).

Prevents decrementing the total counter by more than its current value, which
avoids integer overflow.

Changes which counter is decremented after unlinking nodes. The code previously
decremented the numa number of cpuless nodes instead of the number of cpuless
lmbs, causing integer overflow.
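
As a rough illustration of the clamping and guarded decrement described above,
here is a minimal standalone sketch. The helpers clamp_todo() and sat_sub() are
hypothetical names used only for this example and do not exist in drmgr. The
sketch also assumes that all counters are uint32_t and that this_loop never
exceeds count.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper (not in drmgr): compute how many LMBs to request
 * from one node, mirroring the order of checks in the patched loop.
 * Assumption: this_loop <= count, so "count - this_loop" cannot wrap.
 */
static uint32_t clamp_todo(uint32_t count, uint32_t this_loop,
                           uint32_t ratio, uint32_t n_lmbs)
{
    uint32_t remaining, todo;

    assert(this_loop <= count);
    remaining = count - this_loop;
    todo = (count * ratio) / 100;

    /* Share rounded down to 0 on a non-empty node, or share larger
     * than what is still left to remove in this pass. */
    if ((!todo && n_lmbs) || remaining < todo)
        todo = remaining;

    /* Never request more than the node holds. */
    return todo < n_lmbs ? todo : n_lmbs;
}

/* Hypothetical helper (not in drmgr): decrement that stops at zero
 * instead of wrapping around. */
static uint32_t sat_sub(uint32_t count, uint32_t removed)
{
    return removed < count ? count - removed : 0;
}
```

With illustrative inputs, clamp_todo(35, 0, 100, 34) is capped at the node's 34
LMBs, and sat_sub(35, 39) returns 0 rather than a wrapped value.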

Signed-off-by: Ryan Whittaker <ryan...@linux.ibm.com>
Reviewed-by: Mingming Cao <m...@linux.ibm.com>
---
src/drmgr/drslot_chrp_mem.c | 16 +++++++++++-----
1 file changed, 11 insertions(+), 5 deletions(-)

diff --git a/src/drmgr/drslot_chrp_mem.c b/src/drmgr/drslot_chrp_mem.c
index 6206492..a84d91b 100644
--- a/src/drmgr/drslot_chrp_mem.c
+++ b/src/drmgr/drslot_chrp_mem.c
@@ -1502,7 +1502,7 @@ static int remove_lmb_from_node(struct ppcnuma_node *node, uint32_t count)
if (node->n_cpus)
numa.lmb_count -= unlinked;
else
- numa.cpuless_node_count -= unlinked;
+ numa.cpuless_lmb_count -= unlinked;

if (!node->n_lmbs) {
node->ratio = 0; /* for sanity only */
@@ -1565,11 +1565,13 @@ static int remove_cpuless_lmbs(uint32_t count)
continue;

todo = (count * node->ratio) / 100;
- todo = min(todo, node->n_lmbs);
- /* Fix rounded value to 0 */
- if (!todo && node->n_lmbs)
+ /* Fix rounded value to 0 and fix if a 0 ratio has been processed */
+ if ((!todo && node->n_lmbs) || count - this_loop < todo)
todo = (count - this_loop);

+ /* Never request more than available */
+ todo = min(todo, node->n_lmbs);
+
if (todo)
todo = remove_lmb_from_node(node, todo);

@@ -1583,7 +1585,10 @@ static int remove_cpuless_lmbs(uint32_t count)
if (!this_loop)
break;

- count -= this_loop;
+ if (this_loop < count)
+ count -= this_loop;
+ else
+ count = 0;
}

say(DEBUG, "%d / %d LMBs removed from the CPU less nodes\n",
@@ -1751,6 +1756,7 @@ static int numa_mem_dlpar(uint32_t count)
* Link the LMBs to their node
* Update global counter
*/
+
lmb_list = get_lmbs(LMB_NORMAL_SORT);
if (lmb_list == NULL) {
clear_numa_lmb_links();
--
2.47.1

Ryan

<ryancwmd@gmail.com>
Jul 22, 2025, 2:09:03 PM
to Powerpc-utils development mailing list
Here is a sample of the recreated bug, before and after the patch. It demonstrates that the patch works.
In the Before log, more LMBs were removed than requested (39 instead of 35), and the count value wrapped around to 4294967292.
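
The Count value of 4294967292 in the Before log is what a 32-bit unsigned counter holds after 39 removed LMBs are subtracted from 35 requested. A minimal sketch of that arithmetic follows; unguarded_decrement() is a hypothetical name for illustration and is not a function in drmgr.

```c
#include <stdint.h>

/*
 * Hypothetical helper (not in drmgr): the unguarded update
 * "count -= this_loop" on a 32-bit unsigned counter.
 */
static uint32_t unguarded_decrement(uint32_t count, uint32_t this_loop)
{
    /* The uint32_t result is reduced modulo 2^32, so a value that
     * would be negative wraps around to a very large number. */
    return count - this_loop;
}
```

Here unguarded_decrement(35, 39) yields 4294967292, which is 2^32 - 4, matching the log.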

Before:

########## Jul 17 18:14:46 2025 ##########
drmgr: -r -c mem -q 35 -d 5
Validating Memory DLPAR capability...yes.
Found 49 LMBs currently allocated
node    1    0 CPUs       34 LMBs
node    3    0 CPUs       11 LMBs
Try removing 35 / 34 LMBs from node 1
...
Removed 31 LMBs from node 1
Try removing 8 / 11 LMBs from node 3
...
Removed 8 LMBs from node 3
Count value: 4294967292
Try removing 1 / 1 LMBs from node 3
...
Removed 0 LMBs from node 3
39 / 35 LMBs removed from the CPU less nodes
0 / 0 LMBs removed from the CPU nodes
DR_TOTAL_RESOURCES=39
########## Jul 17 18:14:49 2025 ##########

After:

########## Jul 17 18:24:14 2025 ##########
drmgr: -r -c mem -q 35 -d 5
Validating Memory DLPAR capability...yes.
Found 49 LMBs currently allocated
node    1    0 CPUs       34 LMBs
node    3    0 CPUs       11 LMBs
Try removing 34 / 34 LMBs from node 1
...
Removed 31 LMBs from node 1
Try removing 4 / 11 LMBs from node 3
...
Removed 4 LMBs from node 3
Count value: 0
35 / 35 LMBs removed from the CPU less nodes
0 / 0 LMBs removed from the CPU nodes
DR_TOTAL_RESOURCES=35
########## Jul 17 18:24:17 2025 ##########

Tyrel Datwyler

<tyreld@linux.ibm.com>
Aug 15, 2025, 3:41:56 PM
to Ryan Whittaker, powerpc-utils-devel@googlegroups.com, Mingming Cao
On 7/22/25 11:03 AM, Ryan Whittaker wrote:
> Previously, when a cpuless node’s lmb ratio was too small compared to the other
> cpuless nodes, the drmgr would attempt to remove more lmbs than are available
> in the node. This upset the ratio of lmbs removed, further breaking counters
> tracking how many lmbs to remove, causing more lmbs than requested to be
> removed. Removing more lmbs than requested caused a decrement of an unsigned
> int to suffer integer overflow, eventually leading to the drmgr removing all
> available memory in the system.
>
> Existing testing exposed the out of memory issue triggered by counter
> overflow. The bug was recreated and testing confirmed that this patch
> resolves the issue.
>
> This patch contains the following:

The previous sentence suggests that, while related, you are fixing several
discrete things. In practice it is best to submit separate patches for each
logical change.

It seems to me this could be a patchset called "drmgr: fix bugs in remove
memory" that contains the next 4 items broken out into individual patches.

-Tyrel