[PATCH RFC 00/19] slab: replace cpu (partial) slabs with sheaves


Vlastimil Babka

Oct 23, 2025, 9:53:04 AM
to Andrew Morton, Christoph Lameter, David Rientjes, Roman Gushchin, Harry Yoo, Uladzislau Rezki, Liam R. Howlett, Suren Baghdasaryan, Sebastian Andrzej Siewior, Alexei Starovoitov, linu...@kvack.org, linux-...@vger.kernel.org, linux-r...@lists.linux.dev, b...@vger.kernel.org, kasa...@googlegroups.com, Vlastimil Babka, Alexander Potapenko, Marco Elver, Dmitry Vyukov
Percpu sheaves caching was introduced as opt-in but the goal was to
eventually move all caches to them. This is the next step, enabling
sheaves for all caches (except the two bootstrap ones) and then removing
the per cpu (partial) slabs and lots of associated code.

Besides (hopefully) improved performance, this removes the rather
complicated lockless fastpath code (using this_cpu_try_cmpxchg128/64)
and its complications with PREEMPT_RT and kmalloc_nolock().

The lockless slab freelist+counters update operation using
try_cmpxchg128/64 remains and is crucial for freeing remote NUMA objects
without repeating the "alien" array flushing of SLAB, and for flushing
objects from sheaves to slabs mostly without taking the node list_lock.
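
For reference, this retained update is the usual cmpxchg-double loop
over the slab's {freelist, counters} pair. A condensed sketch, based on
the __slab_free()/deactivate_slab() code visible in later hunks (only
slab_update_freelist(), set_freepointer() and the struct slab fields are
real; the helper below is made up for illustration):

/*
 * Illustrative sketch only: push one object back onto slab->freelist
 * without taking the node list_lock, using the cmpxchg128/64 based
 * slab_update_freelist() that this series keeps.
 */
static void flush_one_object_to_slab(struct kmem_cache *s,
				     struct slab *slab, void *object)
{
	struct slab old, new;

	do {
		old.freelist = READ_ONCE(slab->freelist);
		old.counters = READ_ONCE(slab->counters);

		/* link the object in front of the slab's current freelist */
		set_freepointer(s, object, old.freelist);

		new.counters = old.counters;
		new.freelist = object;
		new.inuse--;
	} while (!slab_update_freelist(s, slab,
				       old.freelist, old.counters,
				       new.freelist, new.counters,
				       "flush one object"));

	/* the real free paths also handle partial list transitions here */
}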

This is the first RFC to get feedback. Biggest TODOs are:

- cleanup of stat counters to fit the new scheme
- integration of rcu sheaves handling with kfree_rcu batching
- performance evaluation

Git branch: https://git.kernel.org/pub/scm/linux/kernel/git/vbabka/linux.git/log/?h=b4/sheaves-for-all

Signed-off-by: Vlastimil Babka <vba...@suse.cz>
---
Vlastimil Babka (19):
slab: move kfence_alloc() out of internal bulk alloc
slab: handle pfmemalloc slabs properly with sheaves
slub: remove CONFIG_SLUB_TINY specific code paths
slab: prevent recursive kmalloc() in alloc_empty_sheaf()
slab: add sheaves to most caches
slab: introduce percpu sheaves bootstrap
slab: make percpu sheaves compatible with kmalloc_nolock()/kfree_nolock()
slab: handle kmalloc sheaves bootstrap
slab: add optimized sheaf refill from partial list
slab: remove cpu (partial) slabs usage from allocation paths
slab: remove SLUB_CPU_PARTIAL
slab: remove the do_slab_free() fastpath
slab: remove defer_deactivate_slab()
slab: simplify kmalloc_nolock()
slab: remove struct kmem_cache_cpu
slab: remove unused PREEMPT_RT specific macros
slab: refill sheaves from all nodes
slab: update overview comments
slab: remove frozen slab checks from __slab_free()

include/linux/gfp_types.h | 6 -
include/linux/slab.h | 6 -
mm/Kconfig | 11 -
mm/internal.h | 1 +
mm/page_alloc.c | 5 +
mm/slab.h | 47 +-
mm/slub.c | 2601 ++++++++++++++++-----------------------------
7 files changed, 915 insertions(+), 1762 deletions(-)
---
base-commit: 7b34bb10d15c412cdce0a1ea3b5701888b885673
change-id: 20251002-sheaves-for-all-86ac13dc47a5

Best regards,
--
Vlastimil Babka <vba...@suse.cz>

Vlastimil Babka

Oct 23, 2025, 9:53:04 AM
to Andrew Morton, Christoph Lameter, David Rientjes, Roman Gushchin, Harry Yoo, Uladzislau Rezki, Liam R. Howlett, Suren Baghdasaryan, Sebastian Andrzej Siewior, Alexei Starovoitov, linu...@kvack.org, linux-...@vger.kernel.org, linux-r...@lists.linux.dev, b...@vger.kernel.org, kasa...@googlegroups.com, Vlastimil Babka
Before we enable percpu sheaves for kmalloc caches, we need to make sure
kmalloc_nolock() and kfree_nolock() will continue working properly and
not spin when not allowed to.

Percpu sheaves themselves use local_trylock() so they are already
compatible. We just need to be careful with the barn->lock spin_lock.
Pass a new allow_spin parameter where necessary to use
spin_trylock_irqsave().

In kmalloc_nolock_noprof() we can now attempt alloc_from_pcs() safely;
for now it will always fail, until we enable sheaves for kmalloc caches
next. Similarly, in kfree_nolock() we can attempt free_to_pcs().
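
For context, a minimal usage sketch of the interface these changes keep
safe (the caller function is hypothetical; kmalloc_nolock(),
kfree_nolock() and the gfp flags are the real API):

/*
 * Hypothetical caller: a context that must never spin, e.g. NMI or a
 * tracing hook. With this patch the sheaf fast paths are attempted via
 * local_trylock() / spin_trylock_irqsave(), so the call returns NULL
 * instead of waiting on contended locks.
 */
static void record_event_nolock(u64 value)
{
	u64 *slot = kmalloc_nolock(sizeof(*slot), __GFP_ZERO, NUMA_NO_NODE);

	if (!slot)
		return;		/* caller must tolerate failure */

	*slot = value;
	/* ... record the sample somewhere ... */
	kfree_nolock(slot);
}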

Signed-off-by: Vlastimil Babka <vba...@suse.cz>
---
mm/slub.c | 74 ++++++++++++++++++++++++++++++++++++++++++++-------------------
1 file changed, 52 insertions(+), 22 deletions(-)

diff --git a/mm/slub.c b/mm/slub.c
index ecb10ed5acfe..5d0b2cf66520 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -2876,7 +2876,8 @@ static void pcs_destroy(struct kmem_cache *s)
s->cpu_sheaves = NULL;
}

-static struct slab_sheaf *barn_get_empty_sheaf(struct node_barn *barn)
+static struct slab_sheaf *barn_get_empty_sheaf(struct node_barn *barn,
+ bool allow_spin)
{
struct slab_sheaf *empty = NULL;
unsigned long flags;
@@ -2884,7 +2885,10 @@ static struct slab_sheaf *barn_get_empty_sheaf(struct node_barn *barn)
if (!data_race(barn->nr_empty))
return NULL;

- spin_lock_irqsave(&barn->lock, flags);
+ if (likely(allow_spin))
+ spin_lock_irqsave(&barn->lock, flags);
+ else if (!spin_trylock_irqsave(&barn->lock, flags))
+ return NULL;

if (likely(barn->nr_empty)) {
empty = list_first_entry(&barn->sheaves_empty,
@@ -2961,7 +2965,8 @@ static struct slab_sheaf *barn_get_full_or_empty_sheaf(struct node_barn *barn)
* change.
*/
static struct slab_sheaf *
-barn_replace_empty_sheaf(struct node_barn *barn, struct slab_sheaf *empty)
+barn_replace_empty_sheaf(struct node_barn *barn, struct slab_sheaf *empty,
+ bool allow_spin)
{
struct slab_sheaf *full = NULL;
unsigned long flags;
@@ -2969,7 +2974,10 @@ barn_replace_empty_sheaf(struct node_barn *barn, struct slab_sheaf *empty)
if (!data_race(barn->nr_full))
return NULL;

- spin_lock_irqsave(&barn->lock, flags);
+ if (likely(allow_spin))
+ spin_lock_irqsave(&barn->lock, flags);
+ else if (!spin_trylock_irqsave(&barn->lock, flags))
+ return NULL;

if (likely(barn->nr_full)) {
full = list_first_entry(&barn->sheaves_full, struct slab_sheaf,
@@ -2990,7 +2998,8 @@ barn_replace_empty_sheaf(struct node_barn *barn, struct slab_sheaf *empty)
* barn. But if there are too many full sheaves, reject this with -E2BIG.
*/
static struct slab_sheaf *
-barn_replace_full_sheaf(struct node_barn *barn, struct slab_sheaf *full)
+barn_replace_full_sheaf(struct node_barn *barn, struct slab_sheaf *full,
+ bool allow_spin)
{
struct slab_sheaf *empty;
unsigned long flags;
@@ -3001,7 +3010,10 @@ barn_replace_full_sheaf(struct node_barn *barn, struct slab_sheaf *full)
if (!data_race(barn->nr_empty))
return ERR_PTR(-ENOMEM);

- spin_lock_irqsave(&barn->lock, flags);
+ if (likely(allow_spin))
+ spin_lock_irqsave(&barn->lock, flags);
+ else if (!spin_trylock_irqsave(&barn->lock, flags))
+ return NULL;

if (likely(barn->nr_empty)) {
empty = list_first_entry(&barn->sheaves_empty, struct slab_sheaf,
@@ -5000,7 +5012,8 @@ __pcs_replace_empty_main(struct kmem_cache *s, struct slub_percpu_sheaves *pcs,
return NULL;
}

- full = barn_replace_empty_sheaf(barn, pcs->main);
+ full = barn_replace_empty_sheaf(barn, pcs->main,
+ gfpflags_allow_spinning(gfp));

if (full) {
stat(s, BARN_GET);
@@ -5017,7 +5030,7 @@ __pcs_replace_empty_main(struct kmem_cache *s, struct slub_percpu_sheaves *pcs,
empty = pcs->spare;
pcs->spare = NULL;
} else {
- empty = barn_get_empty_sheaf(barn);
+ empty = barn_get_empty_sheaf(barn, true);
}
}

@@ -5154,7 +5167,8 @@ void *alloc_from_pcs(struct kmem_cache *s, gfp_t gfp, int node)
}

static __fastpath_inline
-unsigned int alloc_from_pcs_bulk(struct kmem_cache *s, size_t size, void **p)
+unsigned int alloc_from_pcs_bulk(struct kmem_cache *s, gfp_t gfp, size_t size,
+ void **p)
{
struct slub_percpu_sheaves *pcs;
struct slab_sheaf *main;
@@ -5188,7 +5202,8 @@ unsigned int alloc_from_pcs_bulk(struct kmem_cache *s, size_t size, void **p)
return allocated;
}

- full = barn_replace_empty_sheaf(barn, pcs->main);
+ full = barn_replace_empty_sheaf(barn, pcs->main,
+ gfpflags_allow_spinning(gfp));

if (full) {
stat(s, BARN_GET);
@@ -5693,7 +5708,7 @@ void *kmalloc_nolock_noprof(size_t size, gfp_t gfp_flags, int node)
gfp_t alloc_gfp = __GFP_NOWARN | __GFP_NOMEMALLOC | gfp_flags;
struct kmem_cache *s;
bool can_retry = true;
- void *ret = ERR_PTR(-EBUSY);
+ void *ret;

VM_WARN_ON_ONCE(gfp_flags & ~(__GFP_ACCOUNT | __GFP_ZERO |
__GFP_NO_OBJ_EXT));
@@ -5720,6 +5735,13 @@ void *kmalloc_nolock_noprof(size_t size, gfp_t gfp_flags, int node)
*/
return NULL;

+ ret = alloc_from_pcs(s, alloc_gfp, node);
+
+ if (ret)
+ goto success;
+
+ ret = ERR_PTR(-EBUSY);
+
/*
* Do not call slab_alloc_node(), since trylock mode isn't
* compatible with slab_pre_alloc_hook/should_failslab and
@@ -5756,6 +5778,7 @@ void *kmalloc_nolock_noprof(size_t size, gfp_t gfp_flags, int node)
ret = NULL;
}

+success:
maybe_wipe_obj_freeptr(s, ret);
slab_post_alloc_hook(s, NULL, alloc_gfp, 1, &ret,
slab_want_init_on_alloc(alloc_gfp, s), size);
@@ -6047,7 +6070,8 @@ static void __pcs_install_empty_sheaf(struct kmem_cache *s,
* unlocked.
*/
static struct slub_percpu_sheaves *
-__pcs_replace_full_main(struct kmem_cache *s, struct slub_percpu_sheaves *pcs)
+__pcs_replace_full_main(struct kmem_cache *s, struct slub_percpu_sheaves *pcs,
+ bool allow_spin)
{
struct slab_sheaf *empty;
struct node_barn *barn;
@@ -6071,7 +6095,7 @@ __pcs_replace_full_main(struct kmem_cache *s, struct slub_percpu_sheaves *pcs)
put_fail = false;

if (!pcs->spare) {
- empty = barn_get_empty_sheaf(barn);
+ empty = barn_get_empty_sheaf(barn, allow_spin);
if (empty) {
pcs->spare = pcs->main;
pcs->main = empty;
@@ -6085,7 +6109,7 @@ __pcs_replace_full_main(struct kmem_cache *s, struct slub_percpu_sheaves *pcs)
return pcs;
}

- empty = barn_replace_full_sheaf(barn, pcs->main);
+ empty = barn_replace_full_sheaf(barn, pcs->main, allow_spin);

if (!IS_ERR(empty)) {
stat(s, BARN_PUT);
@@ -6093,6 +6117,11 @@ __pcs_replace_full_main(struct kmem_cache *s, struct slub_percpu_sheaves *pcs)
return pcs;
}

+ if (!allow_spin) {
+ local_unlock(&s->cpu_sheaves->lock);
+ return NULL;
+ }
+
if (PTR_ERR(empty) == -E2BIG) {
/* Since we got here, spare exists and is full */
struct slab_sheaf *to_flush = pcs->spare;
@@ -6160,7 +6189,7 @@ __pcs_replace_full_main(struct kmem_cache *s, struct slub_percpu_sheaves *pcs)
* The object is expected to have passed slab_free_hook() already.
*/
static __fastpath_inline
-bool free_to_pcs(struct kmem_cache *s, void *object)
+bool free_to_pcs(struct kmem_cache *s, void *object, bool allow_spin)
{
struct slub_percpu_sheaves *pcs;

@@ -6171,7 +6200,7 @@ bool free_to_pcs(struct kmem_cache *s, void *object)

if (unlikely(pcs->main->size == s->sheaf_capacity)) {

- pcs = __pcs_replace_full_main(s, pcs);
+ pcs = __pcs_replace_full_main(s, pcs, allow_spin);
if (unlikely(!pcs))
return false;
}
@@ -6278,7 +6307,7 @@ bool __kfree_rcu_sheaf(struct kmem_cache *s, void *obj)
goto fail;
}

- empty = barn_get_empty_sheaf(barn);
+ empty = barn_get_empty_sheaf(barn, true);

if (empty) {
pcs->rcu_free = empty;
@@ -6398,7 +6427,7 @@ static void free_to_pcs_bulk(struct kmem_cache *s, size_t size, void **p)
goto no_empty;

if (!pcs->spare) {
- empty = barn_get_empty_sheaf(barn);
+ empty = barn_get_empty_sheaf(barn, true);
if (!empty)
goto no_empty;

@@ -6412,7 +6441,7 @@ static void free_to_pcs_bulk(struct kmem_cache *s, size_t size, void **p)
goto do_free;
}

- empty = barn_replace_full_sheaf(barn, pcs->main);
+ empty = barn_replace_full_sheaf(barn, pcs->main, true);
if (IS_ERR(empty)) {
stat(s, BARN_PUT_FAIL);
goto no_empty;
@@ -6659,7 +6688,7 @@ void slab_free(struct kmem_cache *s, struct slab *slab, void *object,

if (likely(!IS_ENABLED(CONFIG_NUMA) || slab_nid(slab) == numa_mem_id())
&& likely(!slab_test_pfmemalloc(slab))) {
- if (likely(free_to_pcs(s, object)))
+ if (likely(free_to_pcs(s, object, true)))
return;
}

@@ -6922,7 +6951,8 @@ void kfree_nolock(const void *object)
* since kasan quarantine takes locks and not supported from NMI.
*/
kasan_slab_free(s, x, false, false, /* skip quarantine */true);
- do_slab_free(s, slab, x, x, 0, _RET_IP_);
+ if (!free_to_pcs(s, x, false))
+ do_slab_free(s, slab, x, x, 0, _RET_IP_);
}
EXPORT_SYMBOL_GPL(kfree_nolock);

@@ -7465,7 +7495,7 @@ int kmem_cache_alloc_bulk_noprof(struct kmem_cache *s, gfp_t flags, size_t size,
size--;
}

- i = alloc_from_pcs_bulk(s, size, p);
+ i = alloc_from_pcs_bulk(s, flags, size, p);

if (i < size) {
/*

--
2.51.1

Vlastimil Babka

Oct 23, 2025, 9:53:07 AM
to Andrew Morton, Christoph Lameter, David Rientjes, Roman Gushchin, Harry Yoo, Uladzislau Rezki, Liam R. Howlett, Suren Baghdasaryan, Sebastian Andrzej Siewior, Alexei Starovoitov, linu...@kvack.org, linux-...@vger.kernel.org, linux-r...@lists.linux.dev, b...@vger.kernel.org, kasa...@googlegroups.com, Vlastimil Babka
We have removed the per cpu partial slab usage from the allocation
paths. Now remove the whole SLUB_CPU_PARTIAL config option and the
associated code.

Signed-off-by: Vlastimil Babka <vba...@suse.cz>
---
mm/Kconfig | 11 ---
mm/slab.h | 29 ------
mm/slub.c | 309 ++++---------------------------------------------------------
3 files changed, 19 insertions(+), 330 deletions(-)

diff --git a/mm/Kconfig b/mm/Kconfig
index 0e26f4fc8717..c83085e34243 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -247,17 +247,6 @@ config SLUB_STATS
out which slabs are relevant to a particular load.
Try running: slabinfo -DA

-config SLUB_CPU_PARTIAL
- default y
- depends on SMP && !SLUB_TINY
- bool "Enable per cpu partial caches"
- help
- Per cpu partial caches accelerate objects allocation and freeing
- that is local to a processor at the price of more indeterminism
- in the latency of the free. On overflow these caches will be cleared
- which requires the taking of locks that may cause latency spikes.
- Typically one would choose no for a realtime system.
-
config RANDOM_KMALLOC_CACHES
default n
depends on !SLUB_TINY
diff --git a/mm/slab.h b/mm/slab.h
index f7b8df56727d..a103da44ab9d 100644
--- a/mm/slab.h
+++ b/mm/slab.h
@@ -61,12 +61,6 @@ struct slab {
struct llist_node llnode;
void *flush_freelist;
};
-#ifdef CONFIG_SLUB_CPU_PARTIAL
- struct {
- struct slab *next;
- int slabs; /* Nr of slabs left */
- };
-#endif
};
/* Double-word boundary */
union {
@@ -206,23 +200,6 @@ static inline size_t slab_size(const struct slab *slab)
return PAGE_SIZE << slab_order(slab);
}

-#ifdef CONFIG_SLUB_CPU_PARTIAL
-#define slub_percpu_partial(c) ((c)->partial)
-
-#define slub_set_percpu_partial(c, p) \
-({ \
- slub_percpu_partial(c) = (p)->next; \
-})
-
-#define slub_percpu_partial_read_once(c) READ_ONCE(slub_percpu_partial(c))
-#else
-#define slub_percpu_partial(c) NULL
-
-#define slub_set_percpu_partial(c, p)
-
-#define slub_percpu_partial_read_once(c) NULL
-#endif // CONFIG_SLUB_CPU_PARTIAL
-
/*
* Word size structure that can be atomically updated or read and that
* contains both the order and the number of objects that a slab of the
@@ -246,12 +223,6 @@ struct kmem_cache {
unsigned int object_size; /* Object size without metadata */
struct reciprocal_value reciprocal_size;
unsigned int offset; /* Free pointer offset */
-#ifdef CONFIG_SLUB_CPU_PARTIAL
- /* Number of per cpu partial objects to keep around */
- unsigned int cpu_partial;
- /* Number of per cpu partial slabs to keep around */
- unsigned int cpu_partial_slabs;
-#endif
unsigned int sheaf_capacity;
struct kmem_cache_order_objects oo;

diff --git a/mm/slub.c b/mm/slub.c
index bd67336e7c1f..d8891d852a8f 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -263,15 +263,6 @@ void *fixup_red_left(struct kmem_cache *s, void *p)
return p;
}

-static inline bool kmem_cache_has_cpu_partial(struct kmem_cache *s)
-{
-#ifdef CONFIG_SLUB_CPU_PARTIAL
- return !kmem_cache_debug(s);
-#else
- return false;
-#endif
-}
-
/*
* Issues still to be resolved:
*
@@ -425,9 +416,6 @@ struct kmem_cache_cpu {
freelist_aba_t freelist_tid;
};
struct slab *slab; /* The slab from which we are allocating */
-#ifdef CONFIG_SLUB_CPU_PARTIAL
- struct slab *partial; /* Partially allocated slabs */
-#endif
local_trylock_t lock; /* Protects the fields above */
#ifdef CONFIG_SLUB_STATS
unsigned int stat[NR_SLUB_STAT_ITEMS];
@@ -660,29 +648,6 @@ static inline unsigned int oo_objects(struct kmem_cache_order_objects x)
return x.x & OO_MASK;
}

-#ifdef CONFIG_SLUB_CPU_PARTIAL
-static void slub_set_cpu_partial(struct kmem_cache *s, unsigned int nr_objects)
-{
- unsigned int nr_slabs;
-
- s->cpu_partial = nr_objects;
-
- /*
- * We take the number of objects but actually limit the number of
- * slabs on the per cpu partial list, in order to limit excessive
- * growth of the list. For simplicity we assume that the slabs will
- * be half-full.
- */
- nr_slabs = DIV_ROUND_UP(nr_objects * 2, oo_objects(s->oo));
- s->cpu_partial_slabs = nr_slabs;
-}
-#elif defined(SLAB_SUPPORTS_SYSFS)
-static inline void
-slub_set_cpu_partial(struct kmem_cache *s, unsigned int nr_objects)
-{
-}
-#endif /* CONFIG_SLUB_CPU_PARTIAL */
-
/*
* If network-based swap is enabled, slub must keep track of whether memory
* were allocated from pfmemalloc reserves.
@@ -3460,12 +3425,6 @@ static void *alloc_single_from_new_slab(struct kmem_cache *s, struct slab *slab,
return object;
}

-#ifdef CONFIG_SLUB_CPU_PARTIAL
-static void put_cpu_partial(struct kmem_cache *s, struct slab *slab, int drain);
-#else
-static inline void put_cpu_partial(struct kmem_cache *s, struct slab *slab,
- int drain) { }
-#endif
static inline bool pfmemalloc_match(struct slab *slab, gfp_t gfpflags);

static bool get_partial_node_bulk(struct kmem_cache *s,
@@ -3891,131 +3850,6 @@ static void deactivate_slab(struct kmem_cache *s, struct slab *slab,
#define local_unlock_cpu_slab(s, flags) \
local_unlock_irqrestore(&(s)->cpu_slab->lock, flags)

-#ifdef CONFIG_SLUB_CPU_PARTIAL
-static void __put_partials(struct kmem_cache *s, struct slab *partial_slab)
-{
- struct kmem_cache_node *n = NULL, *n2 = NULL;
- struct slab *slab, *slab_to_discard = NULL;
- unsigned long flags = 0;
-
- while (partial_slab) {
- slab = partial_slab;
- partial_slab = slab->next;
-
- n2 = get_node(s, slab_nid(slab));
- if (n != n2) {
- if (n)
- spin_unlock_irqrestore(&n->list_lock, flags);
-
- n = n2;
- spin_lock_irqsave(&n->list_lock, flags);
- }
-
- if (unlikely(!slab->inuse && n->nr_partial >= s->min_partial)) {
- slab->next = slab_to_discard;
- slab_to_discard = slab;
- } else {
- add_partial(n, slab, DEACTIVATE_TO_TAIL);
- stat(s, FREE_ADD_PARTIAL);
- }
- }
-
- if (n)
- spin_unlock_irqrestore(&n->list_lock, flags);
-
- while (slab_to_discard) {
- slab = slab_to_discard;
- slab_to_discard = slab_to_discard->next;
-
- stat(s, DEACTIVATE_EMPTY);
- discard_slab(s, slab);
- stat(s, FREE_SLAB);
- }
-}
-
-/*
- * Put all the cpu partial slabs to the node partial list.
- */
-static void put_partials(struct kmem_cache *s)
-{
- struct slab *partial_slab;
- unsigned long flags;
-
- local_lock_irqsave(&s->cpu_slab->lock, flags);
- partial_slab = this_cpu_read(s->cpu_slab->partial);
- this_cpu_write(s->cpu_slab->partial, NULL);
- local_unlock_irqrestore(&s->cpu_slab->lock, flags);
-
- if (partial_slab)
- __put_partials(s, partial_slab);
-}
-
-static void put_partials_cpu(struct kmem_cache *s,
- struct kmem_cache_cpu *c)
-{
- struct slab *partial_slab;
-
- partial_slab = slub_percpu_partial(c);
- c->partial = NULL;
-
- if (partial_slab)
- __put_partials(s, partial_slab);
-}
-
-/*
- * Put a slab into a partial slab slot if available.
- *
- * If we did not find a slot then simply move all the partials to the
- * per node partial list.
- */
-static void put_cpu_partial(struct kmem_cache *s, struct slab *slab, int drain)
-{
- struct slab *oldslab;
- struct slab *slab_to_put = NULL;
- unsigned long flags;
- int slabs = 0;
-
- local_lock_cpu_slab(s, flags);
-
- oldslab = this_cpu_read(s->cpu_slab->partial);
-
- if (oldslab) {
- if (drain && oldslab->slabs >= s->cpu_partial_slabs) {
- /*
- * Partial array is full. Move the existing set to the
- * per node partial list. Postpone the actual unfreezing
- * outside of the critical section.
- */
- slab_to_put = oldslab;
- oldslab = NULL;
- } else {
- slabs = oldslab->slabs;
- }
- }
-
- slabs++;
-
- slab->slabs = slabs;
- slab->next = oldslab;
-
- this_cpu_write(s->cpu_slab->partial, slab);
-
- local_unlock_cpu_slab(s, flags);
-
- if (slab_to_put) {
- __put_partials(s, slab_to_put);
- stat(s, CPU_PARTIAL_DRAIN);
- }
-}
-
-#else /* CONFIG_SLUB_CPU_PARTIAL */
-
-static inline void put_partials(struct kmem_cache *s) { }
-static inline void put_partials_cpu(struct kmem_cache *s,
- struct kmem_cache_cpu *c) { }
-
-#endif /* CONFIG_SLUB_CPU_PARTIAL */
-
static inline void flush_slab(struct kmem_cache *s, struct kmem_cache_cpu *c)
{
unsigned long flags;
@@ -4053,8 +3887,6 @@ static inline void __flush_cpu_slab(struct kmem_cache *s, int cpu)
deactivate_slab(s, slab, freelist);
stat(s, CPUSLAB_FLUSH);
}
-
- put_partials_cpu(s, c);
}

static inline void flush_this_cpu_slab(struct kmem_cache *s)
@@ -4063,15 +3895,13 @@ static inline void flush_this_cpu_slab(struct kmem_cache *s)

if (c->slab)
flush_slab(s, c);
-
- put_partials(s);
}

static bool has_cpu_slab(int cpu, struct kmem_cache *s)
{
struct kmem_cache_cpu *c = per_cpu_ptr(s->cpu_slab, cpu);

- return c->slab || slub_percpu_partial(c);
+ return c->slab;
}

static bool has_pcs_used(int cpu, struct kmem_cache *s)
@@ -5599,21 +5429,18 @@ static void __slab_free(struct kmem_cache *s, struct slab *slab,
new.inuse -= cnt;
if ((!new.inuse || !prior) && !was_frozen) {
/* Needs to be taken off a list */
- if (!kmem_cache_has_cpu_partial(s) || prior) {
-
- n = get_node(s, slab_nid(slab));
- /*
- * Speculatively acquire the list_lock.
- * If the cmpxchg does not succeed then we may
- * drop the list_lock without any processing.
- *
- * Otherwise the list_lock will synchronize with
- * other processors updating the list of slabs.
- */
- spin_lock_irqsave(&n->list_lock, flags);
-
- on_node_partial = slab_test_node_partial(slab);
- }
+ n = get_node(s, slab_nid(slab));
+ /*
+ * Speculatively acquire the list_lock.
+ * If the cmpxchg does not succeed then we may
+ * drop the list_lock without any processing.
+ *
+ * Otherwise the list_lock will synchronize with
+ * other processors updating the list of slabs.
+ */
+ spin_lock_irqsave(&n->list_lock, flags);
+
+ on_node_partial = slab_test_node_partial(slab);
}

} while (!slab_update_freelist(s, slab,
@@ -5629,13 +5456,6 @@ static void __slab_free(struct kmem_cache *s, struct slab *slab,
* activity can be necessary.
*/
stat(s, FREE_FROZEN);
- } else if (kmem_cache_has_cpu_partial(s) && !prior) {
- /*
- * If we started with a full slab then put it onto the
- * per cpu partial list.
- */
- put_cpu_partial(s, slab, 1);
- stat(s, CPU_PARTIAL_FREE);
}

return;
@@ -5657,7 +5477,7 @@ static void __slab_free(struct kmem_cache *s, struct slab *slab,
* Objects left in the slab. If it was not on the partial list before
* then add it.
*/
- if (!kmem_cache_has_cpu_partial(s) && unlikely(!prior)) {
+ if (unlikely(!prior)) {
add_partial(n, slab, DEACTIVATE_TO_TAIL);
stat(s, FREE_ADD_PARTIAL);
}
@@ -6298,8 +6118,8 @@ static __always_inline void do_slab_free(struct kmem_cache *s,
if (unlikely(!allow_spin)) {
/*
* __slab_free() can locklessly cmpxchg16 into a slab,
- * but then it might need to take spin_lock or local_lock
- * in put_cpu_partial() for further processing.
+ * but then it might need to take spin_lock
+ * for further processing.
* Avoid the complexity and simply add to a deferred list.
*/
defer_free(s, head);
@@ -7615,39 +7435,6 @@ static int init_kmem_cache_nodes(struct kmem_cache *s)
return 1;
}

-static void set_cpu_partial(struct kmem_cache *s)
-{
-#ifdef CONFIG_SLUB_CPU_PARTIAL
- unsigned int nr_objects;
-
- /*
- * cpu_partial determined the maximum number of objects kept in the
- * per cpu partial lists of a processor.
- *
- * Per cpu partial lists mainly contain slabs that just have one
- * object freed. If they are used for allocation then they can be
- * filled up again with minimal effort. The slab will never hit the
- * per node partial lists and therefore no locking will be required.
- *
- * For backwards compatibility reasons, this is determined as number
- * of objects, even though we now limit maximum number of pages, see
- * slub_set_cpu_partial()
- */
- if (!kmem_cache_has_cpu_partial(s))
- nr_objects = 0;
- else if (s->size >= PAGE_SIZE)
- nr_objects = 6;
- else if (s->size >= 1024)
- nr_objects = 24;
- else if (s->size >= 256)
- nr_objects = 52;
- else
- nr_objects = 120;
-
- slub_set_cpu_partial(s, nr_objects);
-#endif
-}
-
static unsigned int calculate_sheaf_capacity(struct kmem_cache *s,
struct kmem_cache_args *args)

@@ -8517,8 +8304,6 @@ int do_kmem_cache_create(struct kmem_cache *s, const char *name,
s->min_partial = min_t(unsigned long, MAX_PARTIAL, ilog2(s->size) / 2);
s->min_partial = max_t(unsigned long, MIN_PARTIAL, s->min_partial);

- set_cpu_partial(s);
-
s->cpu_sheaves = alloc_percpu(struct slub_percpu_sheaves);
if (!s->cpu_sheaves) {
err = -ENOMEM;
@@ -8882,20 +8667,6 @@ static ssize_t show_slab_objects(struct kmem_cache *s,
total += x;
nodes[node] += x;

-#ifdef CONFIG_SLUB_CPU_PARTIAL
- slab = slub_percpu_partial_read_once(c);
- if (slab) {
- node = slab_nid(slab);
- if (flags & SO_TOTAL)
- WARN_ON_ONCE(1);
- else if (flags & SO_OBJECTS)
- WARN_ON_ONCE(1);
- else
- x = data_race(slab->slabs);
- total += x;
- nodes[node] += x;
- }
-#endif
}
}

@@ -9030,12 +8801,7 @@ SLAB_ATTR(min_partial);

static ssize_t cpu_partial_show(struct kmem_cache *s, char *buf)
{
- unsigned int nr_partial = 0;
-#ifdef CONFIG_SLUB_CPU_PARTIAL
- nr_partial = s->cpu_partial;
-#endif
-
- return sysfs_emit(buf, "%u\n", nr_partial);
+ return sysfs_emit(buf, "0\n");
}

static ssize_t cpu_partial_store(struct kmem_cache *s, const char *buf,
@@ -9047,11 +8813,9 @@ static ssize_t cpu_partial_store(struct kmem_cache *s, const char *buf,
err = kstrtouint(buf, 10, &objects);
if (err)
return err;
- if (objects && !kmem_cache_has_cpu_partial(s))
+ if (objects)
return -EINVAL;

- slub_set_cpu_partial(s, objects);
- flush_all(s);
return length;
}
SLAB_ATTR(cpu_partial);
@@ -9090,42 +8854,7 @@ SLAB_ATTR_RO(objects_partial);

static ssize_t slabs_cpu_partial_show(struct kmem_cache *s, char *buf)
{
- int objects = 0;
- int slabs = 0;
- int cpu __maybe_unused;
- int len = 0;
-
-#ifdef CONFIG_SLUB_CPU_PARTIAL
- for_each_online_cpu(cpu) {
- struct slab *slab;
-
- slab = slub_percpu_partial(per_cpu_ptr(s->cpu_slab, cpu));
-
- if (slab)
- slabs += data_race(slab->slabs);
- }
-#endif
-
- /* Approximate half-full slabs, see slub_set_cpu_partial() */
- objects = (slabs * oo_objects(s->oo)) / 2;
- len += sysfs_emit_at(buf, len, "%d(%d)", objects, slabs);
-
-#ifdef CONFIG_SLUB_CPU_PARTIAL
- for_each_online_cpu(cpu) {
- struct slab *slab;
-
- slab = slub_percpu_partial(per_cpu_ptr(s->cpu_slab, cpu));
- if (slab) {
- slabs = data_race(slab->slabs);
- objects = (slabs * oo_objects(s->oo)) / 2;
- len += sysfs_emit_at(buf, len, " C%d=%d(%d)",
- cpu, objects, slabs);
- }
- }
-#endif
- len += sysfs_emit_at(buf, len, "\n");
-
- return len;
+ return sysfs_emit(buf, "0(0)\n");
}
SLAB_ATTR_RO(slabs_cpu_partial);


--
2.51.1

Vlastimil Babka

Oct 23, 2025, 9:53:09 AM
to Andrew Morton, Christoph Lameter, David Rientjes, Roman Gushchin, Harry Yoo, Uladzislau Rezki, Liam R. Howlett, Suren Baghdasaryan, Sebastian Andrzej Siewior, Alexei Starovoitov, linu...@kvack.org, linux-...@vger.kernel.org, linux-r...@lists.linux.dev, b...@vger.kernel.org, kasa...@googlegroups.com, Vlastimil Babka, Alexander Potapenko, Marco Elver, Dmitry Vyukov
SLUB's internal bulk allocation __kmem_cache_alloc_bulk() can currently
allocate some objects from KFENCE, e.g. when refilling a sheaf. It
works, but it's conceptually the wrong layer, as KFENCE allocations
should only happen when objects are actually handed out from slab to its
users.

Currently, for sheaf-enabled caches, slab_alloc_node() can return a
KFENCE object via kfence_alloc(), but also via alloc_from_pcs() when a
sheaf was refilled with KFENCE objects. Continuing like this would also
complicate the upcoming sheaf refill changes.

Thus remove KFENCE allocation from __kmem_cache_alloc_bulk() and move
it to the places that return slab objects to users. slab_alloc_node() is
already covered (see above). Add kfence_alloc() to
kmem_cache_alloc_from_sheaf() to handle KFENCE allocations from
prefilled sheaves, with a comment that the caller should not expect the
sheaf size to decrease after every allocation because of this
possibility.
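
To illustrate what that means for users of the prefilled sheaf API, a
hypothetical caller sketch (fill_batch_from_sheaf() is made up; the
kmem_cache_alloc_from_sheaf() call and its argument order are real,
error unwinding is omitted for brevity):

/*
 * Hypothetical caller: grab n objects from a prefilled sheaf. An
 * allocation may be served by KFENCE instead of the sheaf, so
 * sheaf->size does not necessarily drop by one per call - only the
 * returned pointer matters.
 */
static int fill_batch_from_sheaf(struct kmem_cache *cache,
				 struct slab_sheaf *sheaf,
				 void **batch, unsigned int n)
{
	unsigned int i;

	for (i = 0; i < n; i++) {
		/* gfp here is only meant for __GFP_ZERO / __GFP_ACCOUNT */
		batch[i] = kmem_cache_alloc_from_sheaf(cache, __GFP_ZERO, sheaf);
		if (!batch[i])
			return -ENOMEM;
	}

	return 0;
}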

For kmem_cache_alloc_bulk(), implement a different strategy: handle
KFENCE upfront and rely on the internal batched operations afterwards.
Assume there will be at most one KFENCE allocation per bulk allocation
and then assign its index in the array of objects randomly.
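
The random placement boils down to the following small trick (a
standalone illustration; the helper name is made up, the real change
open-codes this in kmem_cache_alloc_bulk_noprof() using
get_random_u32_below()):

/*
 * Illustration only: p[0..size-1] already holds 'size' objects from the
 * batched paths; place one extra object at a uniformly random index in
 * 0..size by moving the displaced object (if any) to the new last slot.
 */
static void bulk_insert_at_random_index(void **p, unsigned int size,
					void *extra)
{
	unsigned int idx = get_random_u32_below(size + 1);

	if (idx != size)
		p[size] = p[idx];	/* keep the displaced object */
	p[idx] = extra;			/* extra object takes the random slot */
}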

Cc: Alexander Potapenko <gli...@google.com>
Cc: Marco Elver <el...@google.com>
Cc: Dmitry Vyukov <dvy...@google.com>
Signed-off-by: Vlastimil Babka <vba...@suse.cz>
---
mm/slub.c | 44 ++++++++++++++++++++++++++++++++++++--------
1 file changed, 36 insertions(+), 8 deletions(-)

diff --git a/mm/slub.c b/mm/slub.c
index 87a1d2f9de0d..4731b9e461c2 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -5530,6 +5530,9 @@ int kmem_cache_refill_sheaf(struct kmem_cache *s, gfp_t gfp,
*
* The gfp parameter is meant only to specify __GFP_ZERO or __GFP_ACCOUNT
* memcg charging is forced over limit if necessary, to avoid failure.
+ *
+ * It is possible that the allocation comes from kfence and then the sheaf
+ * size is not decreased.
*/
void *
kmem_cache_alloc_from_sheaf_noprof(struct kmem_cache *s, gfp_t gfp,
@@ -5541,7 +5544,10 @@ kmem_cache_alloc_from_sheaf_noprof(struct kmem_cache *s, gfp_t gfp,
if (sheaf->size == 0)
goto out;

- ret = sheaf->objects[--sheaf->size];
+ ret = kfence_alloc(s, s->object_size, gfp);
+
+ if (likely(!ret))
+ ret = sheaf->objects[--sheaf->size];

init = slab_want_init_on_alloc(gfp, s);

@@ -7361,14 +7367,8 @@ int __kmem_cache_alloc_bulk(struct kmem_cache *s, gfp_t flags, size_t size,
local_lock_irqsave(&s->cpu_slab->lock, irqflags);

for (i = 0; i < size; i++) {
- void *object = kfence_alloc(s, s->object_size, flags);
-
- if (unlikely(object)) {
- p[i] = object;
- continue;
- }
+ void *object = c->freelist;

- object = c->freelist;
if (unlikely(!object)) {
/*
* We may have removed an object from c->freelist using
@@ -7449,6 +7449,7 @@ int kmem_cache_alloc_bulk_noprof(struct kmem_cache *s, gfp_t flags, size_t size,
void **p)
{
unsigned int i = 0;
+ void *kfence_obj;

if (!size)
return 0;
@@ -7457,6 +7458,20 @@ int kmem_cache_alloc_bulk_noprof(struct kmem_cache *s, gfp_t flags, size_t size,
if (unlikely(!s))
return 0;

+ /*
+ * to make things simpler, only assume at most one kfence allocated
+ * object per bulk allocation and choose its index randomly
+ */
+ kfence_obj = kfence_alloc(s, s->object_size, flags);
+
+ if (unlikely(kfence_obj)) {
+ if (unlikely(size == 1)) {
+ p[0] = kfence_obj;
+ goto out;
+ }
+ size--;
+ }
+
if (s->cpu_sheaves)
i = alloc_from_pcs_bulk(s, size, p);

@@ -7468,10 +7483,23 @@ int kmem_cache_alloc_bulk_noprof(struct kmem_cache *s, gfp_t flags, size_t size,
if (unlikely(__kmem_cache_alloc_bulk(s, flags, size - i, p + i) == 0)) {
if (i > 0)
__kmem_cache_free_bulk(s, i, p);
+ if (kfence_obj)
+ __kfence_free(kfence_obj);
return 0;
}
}

+ if (unlikely(kfence_obj)) {
+ int idx = get_random_u32_below(size + 1);
+
+ if (idx != size)
+ p[size] = p[idx];
+ p[idx] = kfence_obj;
+
+ size++;
+ }
+
+out:
/*
* memcg and kmem_cache debug support and memory initialization.
* Done outside of the IRQ disabled fastpath loop.

--
2.51.1

Vlastimil Babka

Oct 23, 2025, 9:53:12 AM
to Andrew Morton, Christoph Lameter, David Rientjes, Roman Gushchin, Harry Yoo, Uladzislau Rezki, Liam R. Howlett, Suren Baghdasaryan, Sebastian Andrzej Siewior, Alexei Starovoitov, linu...@kvack.org, linux-...@vger.kernel.org, linux-r...@lists.linux.dev, b...@vger.kernel.org, kasa...@googlegroups.com, Vlastimil Babka
The cpu slab is no longer used for allocation or freeing; the remaining
code is for flushing, but it's effectively dead. Remove the whole struct
kmem_cache_cpu, the flushing code and other orphaned functions.

The only field of kmem_cache_cpu still used is the stat array with
CONFIG_SLUB_STATS. Put it instead in a new struct kmem_cache_stats. In
struct kmem_cache, the field is called cpu_stats and placed near the end
of the struct.

Signed-off-by: Vlastimil Babka <vba...@suse.cz>
---
mm/slab.h | 7 +-
mm/slub.c | 298 +++++---------------------------------------------------------
2 files changed, 24 insertions(+), 281 deletions(-)

diff --git a/mm/slab.h b/mm/slab.h
index 7dde0b56a7b0..62db8a347edf 100644
--- a/mm/slab.h
+++ b/mm/slab.h
@@ -21,14 +21,12 @@
# define system_has_freelist_aba() system_has_cmpxchg128()
# define try_cmpxchg_freelist try_cmpxchg128
# endif
-#define this_cpu_try_cmpxchg_freelist this_cpu_try_cmpxchg128
typedef u128 freelist_full_t;
#else /* CONFIG_64BIT */
# ifdef system_has_cmpxchg64
# define system_has_freelist_aba() system_has_cmpxchg64()
# define try_cmpxchg_freelist try_cmpxchg64
# endif
-#define this_cpu_try_cmpxchg_freelist this_cpu_try_cmpxchg64
typedef u64 freelist_full_t;
#endif /* CONFIG_64BIT */

@@ -207,7 +205,6 @@ struct kmem_cache_order_objects {
* Slab cache management.
*/
struct kmem_cache {
- struct kmem_cache_cpu __percpu *cpu_slab;
struct slub_percpu_sheaves __percpu *cpu_sheaves;
/* Used for retrieving partial slabs, etc. */
slab_flags_t flags;
@@ -256,6 +253,10 @@ struct kmem_cache {
unsigned int usersize; /* Usercopy region size */
#endif

+#ifdef CONFIG_SLUB_STATS
+ struct kmem_cache_stats __percpu *cpu_stats;
+#endif
+
struct kmem_cache_node *node[MAX_NUMNODES];
};

diff --git a/mm/slub.c b/mm/slub.c
index 6dd7fd153391..dcf28fc3a112 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -403,24 +403,11 @@ enum stat_item {
NR_SLUB_STAT_ITEMS
};

-/*
- * When changing the layout, make sure freelist and tid are still compatible
- * with this_cpu_cmpxchg_double() alignment requirements.
- */
-struct kmem_cache_cpu {
- union {
- struct {
- void **freelist; /* Pointer to next available object */
- unsigned long tid; /* Globally unique transaction id */
- };
- freelist_aba_t freelist_tid;
- };
- struct slab *slab; /* The slab from which we are allocating */
- local_trylock_t lock; /* Protects the fields above */
#ifdef CONFIG_SLUB_STATS
+struct kmem_cache_stats {
unsigned int stat[NR_SLUB_STAT_ITEMS];
-#endif
};
+#endif

static inline void stat(const struct kmem_cache *s, enum stat_item si)
{
@@ -429,7 +416,7 @@ static inline void stat(const struct kmem_cache *s, enum stat_item si)
* The rmw is racy on a preemptible kernel but this is acceptable, so
* avoid this_cpu_add()'s irq-disable overhead.
*/
- raw_cpu_inc(s->cpu_slab->stat[si]);
+ raw_cpu_inc(s->cpu_stats->stat[si]);
#endif
}

@@ -437,7 +424,7 @@ static inline
void stat_add(const struct kmem_cache *s, enum stat_item si, int v)
{
#ifdef CONFIG_SLUB_STATS
- raw_cpu_add(s->cpu_slab->stat[si], v);
+ raw_cpu_add(s->cpu_stats->stat[si], v);
#endif
}

@@ -1158,20 +1145,6 @@ static void object_err(struct kmem_cache *s, struct slab *slab,
WARN_ON(1);
}

-static bool freelist_corrupted(struct kmem_cache *s, struct slab *slab,
- void **freelist, void *nextfree)
-{
- if ((s->flags & SLAB_CONSISTENCY_CHECKS) &&
- !check_valid_pointer(s, slab, nextfree) && freelist) {
- object_err(s, slab, *freelist, "Freechain corrupt");
- *freelist = NULL;
- slab_fix(s, "Isolate corrupted freechain");
- return true;
- }
-
- return false;
-}
-
static void __slab_err(struct slab *slab)
{
if (slab_in_kunit_test())
@@ -1949,11 +1922,6 @@ static inline void inc_slabs_node(struct kmem_cache *s, int node,
int objects) {}
static inline void dec_slabs_node(struct kmem_cache *s, int node,
int objects) {}
-static bool freelist_corrupted(struct kmem_cache *s, struct slab *slab,
- void **freelist, void *nextfree)
-{
- return false;
-}
#endif /* CONFIG_SLUB_DEBUG */

/*
@@ -3640,195 +3608,6 @@ static void *get_partial(struct kmem_cache *s, int node,
return get_any_partial(s, pc);
}

-#ifdef CONFIG_PREEMPTION
-/*
- * Calculate the next globally unique transaction for disambiguation
- * during cmpxchg. The transactions start with the cpu number and are then
- * incremented by CONFIG_NR_CPUS.
- */
-#define TID_STEP roundup_pow_of_two(CONFIG_NR_CPUS)
-#else
-/*
- * No preemption supported therefore also no need to check for
- * different cpus.
- */
-#define TID_STEP 1
-#endif /* CONFIG_PREEMPTION */
-
-static inline unsigned long next_tid(unsigned long tid)
-{
- return tid + TID_STEP;
-}
-
-#ifdef SLUB_DEBUG_CMPXCHG
-static inline unsigned int tid_to_cpu(unsigned long tid)
-{
- return tid % TID_STEP;
-}
-
-static inline unsigned long tid_to_event(unsigned long tid)
-{
- return tid / TID_STEP;
-}
-#endif
-
-static inline unsigned int init_tid(int cpu)
-{
- return cpu;
-}
-
-static void init_kmem_cache_cpus(struct kmem_cache *s)
-{
- int cpu;
- struct kmem_cache_cpu *c;
-
- for_each_possible_cpu(cpu) {
- c = per_cpu_ptr(s->cpu_slab, cpu);
- local_trylock_init(&c->lock);
- c->tid = init_tid(cpu);
- }
-}
-
-/*
- * Finishes removing the cpu slab. Merges cpu's freelist with slab's freelist,
- * unfreezes the slabs and puts it on the proper list.
- * Assumes the slab has been already safely taken away from kmem_cache_cpu
- * by the caller.
- */
-static void deactivate_slab(struct kmem_cache *s, struct slab *slab,
- void *freelist)
-{
- struct kmem_cache_node *n = get_node(s, slab_nid(slab));
- int free_delta = 0;
- void *nextfree, *freelist_iter, *freelist_tail;
- int tail = DEACTIVATE_TO_HEAD;
- unsigned long flags = 0;
- struct slab new;
- struct slab old;
-
- if (READ_ONCE(slab->freelist)) {
- stat(s, DEACTIVATE_REMOTE_FREES);
- tail = DEACTIVATE_TO_TAIL;
- }
-
- /*
- * Stage one: Count the objects on cpu's freelist as free_delta and
- * remember the last object in freelist_tail for later splicing.
- */
- freelist_tail = NULL;
- freelist_iter = freelist;
- while (freelist_iter) {
- nextfree = get_freepointer(s, freelist_iter);
-
- /*
- * If 'nextfree' is invalid, it is possible that the object at
- * 'freelist_iter' is already corrupted. So isolate all objects
- * starting at 'freelist_iter' by skipping them.
- */
- if (freelist_corrupted(s, slab, &freelist_iter, nextfree))
- break;
-
- freelist_tail = freelist_iter;
- free_delta++;
-
- freelist_iter = nextfree;
- }
-
- /*
- * Stage two: Unfreeze the slab while splicing the per-cpu
- * freelist to the head of slab's freelist.
- */
- do {
- old.freelist = READ_ONCE(slab->freelist);
- old.counters = READ_ONCE(slab->counters);
- VM_BUG_ON(!old.frozen);
-
- /* Determine target state of the slab */
- new.counters = old.counters;
- new.frozen = 0;
- if (freelist_tail) {
- new.inuse -= free_delta;
- set_freepointer(s, freelist_tail, old.freelist);
- new.freelist = freelist;
- } else {
- new.freelist = old.freelist;
- }
- } while (!slab_update_freelist(s, slab,
- old.freelist, old.counters,
- new.freelist, new.counters,
- "unfreezing slab"));
-
- /*
- * Stage three: Manipulate the slab list based on the updated state.
- */
- if (!new.inuse && n->nr_partial >= s->min_partial) {
- stat(s, DEACTIVATE_EMPTY);
- discard_slab(s, slab);
- stat(s, FREE_SLAB);
- } else if (new.freelist) {
- spin_lock_irqsave(&n->list_lock, flags);
- add_partial(n, slab, tail);
- spin_unlock_irqrestore(&n->list_lock, flags);
- stat(s, tail);
- } else {
- stat(s, DEACTIVATE_FULL);
- }
-}
-
-static inline void flush_slab(struct kmem_cache *s, struct kmem_cache_cpu *c)
-{
- unsigned long flags;
- struct slab *slab;
- void *freelist;
-
- local_lock_irqsave(&s->cpu_slab->lock, flags);
-
- slab = c->slab;
- freelist = c->freelist;
-
- c->slab = NULL;
- c->freelist = NULL;
- c->tid = next_tid(c->tid);
-
- local_unlock_irqrestore(&s->cpu_slab->lock, flags);
-
- if (slab) {
- deactivate_slab(s, slab, freelist);
- stat(s, CPUSLAB_FLUSH);
- }
-}
-
-static inline void __flush_cpu_slab(struct kmem_cache *s, int cpu)
-{
- struct kmem_cache_cpu *c = per_cpu_ptr(s->cpu_slab, cpu);
- void *freelist = c->freelist;
- struct slab *slab = c->slab;
-
- c->slab = NULL;
- c->freelist = NULL;
- c->tid = next_tid(c->tid);
-
- if (slab) {
- deactivate_slab(s, slab, freelist);
- stat(s, CPUSLAB_FLUSH);
- }
-}
-
-static inline void flush_this_cpu_slab(struct kmem_cache *s)
-{
- struct kmem_cache_cpu *c = this_cpu_ptr(s->cpu_slab);
-
- if (c->slab)
- flush_slab(s, c);
-}
-
-static bool has_cpu_slab(int cpu, struct kmem_cache *s)
-{
- struct kmem_cache_cpu *c = per_cpu_ptr(s->cpu_slab, cpu);
-
- return c->slab;
-}
-
static bool has_pcs_used(int cpu, struct kmem_cache *s)
{
struct slub_percpu_sheaves *pcs;
@@ -3842,7 +3621,7 @@ static bool has_pcs_used(int cpu, struct kmem_cache *s)
}

/*
- * Flush cpu slab.
+ * Flush percpu sheaves
*
* Called from CPU work handler with migration disabled.
*/
@@ -3857,8 +3636,6 @@ static void flush_cpu_slab(struct work_struct *w)

if (s->sheaf_capacity)
pcs_flush_all(s);
-
- flush_this_cpu_slab(s);
}

static void flush_all_cpus_locked(struct kmem_cache *s)
@@ -3871,7 +3648,7 @@ static void flush_all_cpus_locked(struct kmem_cache *s)

for_each_online_cpu(cpu) {
sfw = &per_cpu(slub_flush, cpu);
- if (!has_cpu_slab(cpu, s) && !has_pcs_used(cpu, s)) {
+ if (!has_pcs_used(cpu, s)) {
sfw->skip = true;
continue;
}
@@ -3976,7 +3753,6 @@ static int slub_cpu_dead(unsigned int cpu)

mutex_lock(&slab_mutex);
list_for_each_entry(s, &slab_caches, list) {
- __flush_cpu_slab(s, cpu);
if (s->sheaf_capacity)
__pcs_flush_all_cpu(s, cpu);
}
@@ -7036,26 +6812,21 @@ init_kmem_cache_node(struct kmem_cache_node *n, struct node_barn *barn)
barn_init(barn);
}

-static inline int alloc_kmem_cache_cpus(struct kmem_cache *s)
+#ifdef CONFIG_SLUB_STATS
+static inline int alloc_kmem_cache_stats(struct kmem_cache *s)
{
BUILD_BUG_ON(PERCPU_DYNAMIC_EARLY_SIZE <
NR_KMALLOC_TYPES * KMALLOC_SHIFT_HIGH *
- sizeof(struct kmem_cache_cpu));
+ sizeof(struct kmem_cache_stats));

- /*
- * Must align to double word boundary for the double cmpxchg
- * instructions to work; see __pcpu_double_call_return_bool().
- */
- s->cpu_slab = __alloc_percpu(sizeof(struct kmem_cache_cpu),
- 2 * sizeof(void *));
+ s->cpu_stats = alloc_percpu(struct kmem_cache_stats);

- if (!s->cpu_slab)
+ if (!s->cpu_stats)
return 0;

- init_kmem_cache_cpus(s);
-
return 1;
}
+#endif

static int init_percpu_sheaves(struct kmem_cache *s)
{
@@ -7166,7 +6937,9 @@ void __kmem_cache_release(struct kmem_cache *s)
{
cache_random_seq_destroy(s);
pcs_destroy(s);
- free_percpu(s->cpu_slab);
+#ifdef CONFIG_SLUB_STATS
+ free_percpu(s->cpu_stats);
+#endif
free_kmem_cache_nodes(s);
}

@@ -7847,12 +7620,6 @@ static struct kmem_cache * __init bootstrap(struct kmem_cache *static_cache)

memcpy(s, static_cache, kmem_cache->object_size);

- /*
- * This runs very early, and only the boot processor is supposed to be
- * up. Even if it weren't true, IRQs are not up so we couldn't fire
- * IPIs around.
- */
- __flush_cpu_slab(s, smp_processor_id());
for_each_kmem_cache_node(s, node, n) {
struct slab *p;

@@ -8092,8 +7859,10 @@ int do_kmem_cache_create(struct kmem_cache *s, const char *name,
if (!init_kmem_cache_nodes(s))
goto out;

- if (!alloc_kmem_cache_cpus(s))
+#ifdef CONFIG_SLUB_STATS
+ if (!alloc_kmem_cache_stats(s))
goto out;
+#endif

err = init_percpu_sheaves(s);
if (err)
@@ -8412,33 +8181,6 @@ static ssize_t show_slab_objects(struct kmem_cache *s,
if (!nodes)
return -ENOMEM;

- if (flags & SO_CPU) {
- int cpu;
-
- for_each_possible_cpu(cpu) {
- struct kmem_cache_cpu *c = per_cpu_ptr(s->cpu_slab,
- cpu);
- int node;
- struct slab *slab;
-
- slab = READ_ONCE(c->slab);
- if (!slab)
- continue;
-
- node = slab_nid(slab);
- if (flags & SO_TOTAL)
- x = slab->objects;
- else if (flags & SO_OBJECTS)
- x = slab->inuse;
- else
- x = 1;
-
- total += x;
- nodes[node] += x;
-
- }
- }
-
/*
* It is impossible to take "mem_hotplug_lock" here with "kernfs_mutex"
* already held which will conflict with an existing lock order:
@@ -8809,7 +8551,7 @@ static int show_stat(struct kmem_cache *s, char *buf, enum stat_item si)
return -ENOMEM;

for_each_online_cpu(cpu) {
- unsigned x = per_cpu_ptr(s->cpu_slab, cpu)->stat[si];
+ unsigned x = per_cpu_ptr(s->cpu_stats, cpu)->stat[si];

data[cpu] = x;
sum += x;
@@ -8835,7 +8577,7 @@ static void clear_stat(struct kmem_cache *s, enum stat_item si)
int cpu;

for_each_online_cpu(cpu)
- per_cpu_ptr(s->cpu_slab, cpu)->stat[si] = 0;
+ per_cpu_ptr(s->cpu_stats, cpu)->stat[si] = 0;
}

#define STAT_ATTR(si, text) \

--
2.51.1

Vlastimil Babka

Oct 23, 2025, 9:53:12 AM
to Andrew Morton, Christoph Lameter, David Rientjes, Roman Gushchin, Harry Yoo, Uladzislau Rezki, Liam R. Howlett, Suren Baghdasaryan, Sebastian Andrzej Siewior, Alexei Starovoitov, linu...@kvack.org, linux-...@vger.kernel.org, linux-r...@lists.linux.dev, b...@vger.kernel.org, kasa...@googlegroups.com, Vlastimil Babka
We have removed cpu slab usage from the allocation paths. Now remove
do_slab_free(), which was freeing objects to the cpu slab when the
object belonged to it. Instead call __slab_free() directly, which was
previously the fallback.

This simplifies kfree_nolock(): when freeing to the percpu sheaf fails,
we can call defer_free() directly.

Also remove functions that became unused.

Signed-off-by: Vlastimil Babka <vba...@suse.cz>
---
mm/slub.c | 149 ++++++--------------------------------------------------------
1 file changed, 13 insertions(+), 136 deletions(-)

diff --git a/mm/slub.c b/mm/slub.c
index d8891d852a8f..a35eb397caa9 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -3671,29 +3671,6 @@ static inline unsigned int init_tid(int cpu)
return cpu;
}

-static inline void note_cmpxchg_failure(const char *n,
- const struct kmem_cache *s, unsigned long tid)
-{
-#ifdef SLUB_DEBUG_CMPXCHG
- unsigned long actual_tid = __this_cpu_read(s->cpu_slab->tid);
-
- pr_info("%s %s: cmpxchg redo ", n, s->name);
-
- if (IS_ENABLED(CONFIG_PREEMPTION) &&
- tid_to_cpu(tid) != tid_to_cpu(actual_tid)) {
- pr_warn("due to cpu change %d -> %d\n",
- tid_to_cpu(tid), tid_to_cpu(actual_tid));
- } else if (tid_to_event(tid) != tid_to_event(actual_tid)) {
- pr_warn("due to cpu running other code. Event %ld->%ld\n",
- tid_to_event(tid), tid_to_event(actual_tid));
- } else {
- pr_warn("for unknown reason: actual=%lx was=%lx target=%lx\n",
- actual_tid, tid, next_tid(tid));
- }
-#endif
- stat(s, CMPXCHG_DOUBLE_CPU_FAIL);
-}
-
static void init_kmem_cache_cpus(struct kmem_cache *s)
{
#ifdef CONFIG_PREEMPT_RT
@@ -4231,18 +4208,6 @@ static inline bool pfmemalloc_match(struct slab *slab, gfp_t gfpflags)
return true;
}

-static inline bool
-__update_cpu_freelist_fast(struct kmem_cache *s,
- void *freelist_old, void *freelist_new,
- unsigned long tid)
-{
- freelist_aba_t old = { .freelist = freelist_old, .counter = tid };
- freelist_aba_t new = { .freelist = freelist_new, .counter = next_tid(tid) };
-
- return this_cpu_try_cmpxchg_freelist(s->cpu_slab->freelist_tid.full,
- &old.full, new.full);
-}
-
/*
* Get the slab's freelist and do not freeze it.
*
@@ -6076,99 +6041,6 @@ void defer_free_barrier(void)
irq_work_sync(&per_cpu_ptr(&defer_free_objects, cpu)->work);
}

-/*
- * Fastpath with forced inlining to produce a kfree and kmem_cache_free that
- * can perform fastpath freeing without additional function calls.
- *
- * The fastpath is only possible if we are freeing to the current cpu slab
- * of this processor. This typically the case if we have just allocated
- * the item before.
- *
- * If fastpath is not possible then fall back to __slab_free where we deal
- * with all sorts of special processing.
- *
- * Bulk free of a freelist with several objects (all pointing to the
- * same slab) possible by specifying head and tail ptr, plus objects
- * count (cnt). Bulk free indicated by tail pointer being set.
- */
-static __always_inline void do_slab_free(struct kmem_cache *s,
- struct slab *slab, void *head, void *tail,
- int cnt, unsigned long addr)
-{
- /* cnt == 0 signals that it's called from kfree_nolock() */
- bool allow_spin = cnt;
- struct kmem_cache_cpu *c;
- unsigned long tid;
- void **freelist;
-
-redo:
- /*
- * Determine the currently cpus per cpu slab.
- * The cpu may change afterward. However that does not matter since
- * data is retrieved via this pointer. If we are on the same cpu
- * during the cmpxchg then the free will succeed.
- */
- c = raw_cpu_ptr(s->cpu_slab);
- tid = READ_ONCE(c->tid);
-
- /* Same with comment on barrier() in __slab_alloc_node() */
- barrier();
-
- if (unlikely(slab != c->slab)) {
- if (unlikely(!allow_spin)) {
- /*
- * __slab_free() can locklessly cmpxchg16 into a slab,
- * but then it might need to take spin_lock
- * for further processing.
- * Avoid the complexity and simply add to a deferred list.
- */
- defer_free(s, head);
- } else {
- __slab_free(s, slab, head, tail, cnt, addr);
- }
- return;
- }
-
- if (unlikely(!allow_spin)) {
- if ((in_nmi() || !USE_LOCKLESS_FAST_PATH()) &&
- local_lock_is_locked(&s->cpu_slab->lock)) {
- defer_free(s, head);
- return;
- }
- cnt = 1; /* restore cnt. kfree_nolock() frees one object at a time */
- }
-
- if (USE_LOCKLESS_FAST_PATH()) {
- freelist = READ_ONCE(c->freelist);
-
- set_freepointer(s, tail, freelist);
-
- if (unlikely(!__update_cpu_freelist_fast(s, freelist, head, tid))) {
- note_cmpxchg_failure("slab_free", s, tid);
- goto redo;
- }
- } else {
- __maybe_unused unsigned long flags = 0;
-
- /* Update the free list under the local lock */
- local_lock_cpu_slab(s, flags);
- c = this_cpu_ptr(s->cpu_slab);
- if (unlikely(slab != c->slab)) {
- local_unlock_cpu_slab(s, flags);
- goto redo;
- }
- tid = c->tid;
- freelist = c->freelist;
-
- set_freepointer(s, tail, freelist);
- c->freelist = head;
- c->tid = next_tid(tid);
-
- local_unlock_cpu_slab(s, flags);
- }
- stat_add(s, FREE_FASTPATH, cnt);
-}
-
static __fastpath_inline
void slab_free(struct kmem_cache *s, struct slab *slab, void *object,
unsigned long addr)
@@ -6185,7 +6057,7 @@ void slab_free(struct kmem_cache *s, struct slab *slab, void *object,
return;
}

- do_slab_free(s, slab, object, object, 1, addr);
+ __slab_free(s, slab, object, object, 1, addr);
}

#ifdef CONFIG_MEMCG
@@ -6194,7 +6066,7 @@ static noinline
void memcg_alloc_abort_single(struct kmem_cache *s, void *object)
{
if (likely(slab_free_hook(s, object, slab_want_init_on_free(s), false)))
- do_slab_free(s, virt_to_slab(object), object, object, 1, _RET_IP_);
+ __slab_free(s, virt_to_slab(object), object, object, 1, _RET_IP_);
}
#endif

@@ -6209,7 +6081,7 @@ void slab_free_bulk(struct kmem_cache *s, struct slab *slab, void *head,
* to remove objects, whose reuse must be delayed.
*/
if (likely(slab_free_freelist_hook(s, &head, &tail, &cnt)))
- do_slab_free(s, slab, head, tail, cnt, addr);
+ __slab_free(s, slab, head, tail, cnt, addr);
}

#ifdef CONFIG_SLUB_RCU_DEBUG
@@ -6235,14 +6107,14 @@ static void slab_free_after_rcu_debug(struct rcu_head *rcu_head)

/* resume freeing */
if (slab_free_hook(s, object, slab_want_init_on_free(s), true))
- do_slab_free(s, slab, object, object, 1, _THIS_IP_);
+ __slab_free(s, slab, object, object, 1, _THIS_IP_);
}
#endif /* CONFIG_SLUB_RCU_DEBUG */

#ifdef CONFIG_KASAN_GENERIC
void ___cache_free(struct kmem_cache *cache, void *x, unsigned long addr)
{
- do_slab_free(cache, virt_to_slab(x), x, x, 1, addr);
+ __slab_free(cache, virt_to_slab(x), x, x, 1, addr);
}
#endif

@@ -6444,8 +6316,13 @@ void kfree_nolock(const void *object)
* since kasan quarantine takes locks and not supported from NMI.
*/
kasan_slab_free(s, x, false, false, /* skip quarantine */true);
+ /*
+ * __slab_free() can locklessly cmpxchg16 into a slab, but then it might
+ * need to take spin_lock for further processing.
+ * Avoid the complexity and simply add to a deferred list.
+ */
if (!free_to_pcs(s, x, false))
- do_slab_free(s, slab, x, x, 0, _RET_IP_);
+ defer_free(s, x);
}
EXPORT_SYMBOL_GPL(kfree_nolock);

@@ -6862,7 +6739,7 @@ static void __kmem_cache_free_bulk(struct kmem_cache *s, size_t size, void **p)
if (kfence_free(df.freelist))
continue;

- do_slab_free(df.s, df.slab, df.freelist, df.tail, df.cnt,
+ __slab_free(df.s, df.slab, df.freelist, df.tail, df.cnt,
_RET_IP_);
} while (likely(size));
}
@@ -6945,7 +6822,7 @@ __refill_objects(struct kmem_cache *s, void **p, gfp_t gfp, unsigned int min,
cnt++;
object = get_freepointer(s, object);
} while (object);
- do_slab_free(s, slab, head, tail, cnt, _RET_IP_);
+ __slab_free(s, slab, head, tail, cnt, _RET_IP_);
}

if (refilled >= max)

--
2.51.1

Vlastimil Babka

Oct 23, 2025, 9:53:14 AM
to Andrew Morton, Christoph Lameter, David Rientjes, Roman Gushchin, Harry Yoo, Uladzislau Rezki, Liam R. Howlett, Suren Baghdasaryan, Sebastian Andrzej Siewior, Alexei Starovoitov, linu...@kvack.org, linux-...@vger.kernel.org, linux-r...@lists.linux.dev, b...@vger.kernel.org, kasa...@googlegroups.com, Vlastimil Babka
The macros slub_get_cpu_ptr()/slub_put_cpu_ptr() are now unused; remove
them. USE_LOCKLESS_FAST_PATH() has lost its true meaning with the
lockless fastpath code removed. Its only remaining usage is in fact
testing whether we can assert that irqs are disabled, because
spin_lock_irqsave() only disables irqs on !PREEMPT_RT. Test for
CONFIG_PREEMPT_RT directly instead.

Signed-off-by: Vlastimil Babka <vba...@suse.cz>
---
mm/slub.c | 24 +-----------------------
1 file changed, 1 insertion(+), 23 deletions(-)

diff --git a/mm/slub.c b/mm/slub.c
index dcf28fc3a112..d55afa9b277f 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -201,28 +201,6 @@ enum slab_flags {
SL_pfmemalloc = PG_active, /* Historical reasons for this bit */
};

-/*
- * We could simply use migrate_disable()/enable() but as long as it's a
- * function call even on !PREEMPT_RT, use inline preempt_disable() there.
- */
-#ifndef CONFIG_PREEMPT_RT
-#define slub_get_cpu_ptr(var) get_cpu_ptr(var)
-#define slub_put_cpu_ptr(var) put_cpu_ptr(var)
-#define USE_LOCKLESS_FAST_PATH() (true)
-#else
-#define slub_get_cpu_ptr(var) \
-({ \
- migrate_disable(); \
- this_cpu_ptr(var); \
-})
-#define slub_put_cpu_ptr(var) \
-do { \
- (void)(var); \
- migrate_enable(); \
-} while (0)
-#define USE_LOCKLESS_FAST_PATH() (false)
-#endif
-
#ifndef CONFIG_SLUB_TINY
#define __fastpath_inline __always_inline
#else
@@ -715,7 +693,7 @@ static inline bool __slab_update_freelist(struct kmem_cache *s, struct slab *sla
{
bool ret;

- if (USE_LOCKLESS_FAST_PATH())
+ if (!IS_ENABLED(CONFIG_PREEMPT_RT))
lockdep_assert_irqs_disabled();

if (s->flags & __CMPXCHG_DOUBLE) {

--
2.51.1

Vlastimil Babka

Oct 23, 2025, 9:53:15 AM
to Andrew Morton, Christoph Lameter, David Rientjes, Roman Gushchin, Harry Yoo, Uladzislau Rezki, Liam R. Howlett, Suren Baghdasaryan, Sebastian Andrzej Siewior, Alexei Starovoitov, linu...@kvack.org, linux-...@vger.kernel.org, linux-r...@lists.linux.dev, b...@vger.kernel.org, kasa...@googlegroups.com, Vlastimil Babka
When a pfmemalloc allocation actually dips into reserves, the slab is
marked accordingly and non-pfmemalloc allocations should not be allowed
to allocate from it. The sheaves percpu caching currently doesn't follow
this rule, so implement it before we expand sheaves usage to all caches.

Make sure objects from pfmemalloc slabs don't end up in percpu sheaves.
When freeing, skip sheaves if the object comes from a pfmemalloc slab.
When refilling sheaves, use __GFP_NOMEMALLOC to override any pfmemalloc
context - the allocation will fall back to regular slab allocations when
sheaves are depleted and can't be refilled because of the override.

For kfree_rcu(), detect pfmemalloc slabs when processing the rcu_sheaf
after the grace period, in __rcu_free_sheaf_prepare(), and simply flush
the sheaf to slabs if any of its objects come from a pfmemalloc slab.

For prefilled sheaves, try to refill them first with __GFP_NOMEMALLOC
and if that fails, retry without __GFP_NOMEMALLOC but then mark the
sheaf pfmemalloc, which makes it get flushed back to slabs when
returned.
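
Condensed, the prefill fallback added below amounts to the following
(restating the __prefill_sheaf_pfmemalloc() hunk with extra comments, no
new logic):

static int __prefill_sheaf_pfmemalloc(struct kmem_cache *s,
				      struct slab_sheaf *sheaf, gfp_t gfp)
{
	/* first try to refill without dipping into the reserves */
	int ret = refill_sheaf(s, sheaf, gfp | __GFP_NOMEMALLOC);

	if (likely(!ret || !gfp_pfmemalloc_allowed(gfp)))
		return ret;

	/*
	 * The context is allowed to use the reserves: refill from them,
	 * but mark the sheaf so that kmem_cache_return_sheaf() flushes it
	 * back to slabs instead of caching pfmemalloc objects.
	 */
	ret = refill_sheaf(s, sheaf, gfp);
	sheaf->pfmemalloc = true;

	return ret;
}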

Signed-off-by: Vlastimil Babka <vba...@suse.cz>
---
mm/slub.c | 65 +++++++++++++++++++++++++++++++++++++++++++++++++--------------
1 file changed, 51 insertions(+), 14 deletions(-)

diff --git a/mm/slub.c b/mm/slub.c
index 4731b9e461c2..ab03f29dc3bf 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -469,7 +469,10 @@ struct slab_sheaf {
struct rcu_head rcu_head;
struct list_head barn_list;
/* only used for prefilled sheafs */
- unsigned int capacity;
+ struct {
+ unsigned int capacity;
+ bool pfmemalloc;
+ };
};
struct kmem_cache *cache;
unsigned int size;
@@ -2645,7 +2648,7 @@ static struct slab_sheaf *alloc_full_sheaf(struct kmem_cache *s, gfp_t gfp)
if (!sheaf)
return NULL;

- if (refill_sheaf(s, sheaf, gfp)) {
+ if (refill_sheaf(s, sheaf, gfp | __GFP_NOMEMALLOC)) {
free_empty_sheaf(s, sheaf);
return NULL;
}
@@ -2723,12 +2726,13 @@ static void sheaf_flush_unused(struct kmem_cache *s, struct slab_sheaf *sheaf)
sheaf->size = 0;
}

-static void __rcu_free_sheaf_prepare(struct kmem_cache *s,
+static bool __rcu_free_sheaf_prepare(struct kmem_cache *s,
struct slab_sheaf *sheaf)
{
bool init = slab_want_init_on_free(s);
void **p = &sheaf->objects[0];
unsigned int i = 0;
+ bool pfmemalloc = false;

while (i < sheaf->size) {
struct slab *slab = virt_to_slab(p[i]);
@@ -2741,8 +2745,13 @@ static void __rcu_free_sheaf_prepare(struct kmem_cache *s,
continue;
}

+ if (slab_test_pfmemalloc(slab))
+ pfmemalloc = true;
+
i++;
}
+
+ return pfmemalloc;
}

static void rcu_free_sheaf_nobarn(struct rcu_head *head)
@@ -5031,7 +5040,7 @@ __pcs_replace_empty_main(struct kmem_cache *s, struct slub_percpu_sheaves *pcs,
return NULL;

if (empty) {
- if (!refill_sheaf(s, empty, gfp)) {
+ if (!refill_sheaf(s, empty, gfp | __GFP_NOMEMALLOC)) {
full = empty;
} else {
/*
@@ -5331,6 +5340,26 @@ void *kmem_cache_alloc_node_noprof(struct kmem_cache *s, gfp_t gfpflags, int nod
}
EXPORT_SYMBOL(kmem_cache_alloc_node_noprof);

+static int __prefill_sheaf_pfmemalloc(struct kmem_cache *s,
+ struct slab_sheaf *sheaf, gfp_t gfp)
+{
+ int ret = 0;
+
+ ret = refill_sheaf(s, sheaf, gfp | __GFP_NOMEMALLOC);
+
+ if (likely(!ret || !gfp_pfmemalloc_allowed(gfp)))
+ return ret;
+
+ /*
+ * if we are allowed to, refill sheaf with pfmemalloc but then remember
+ * it for when it's returned
+ */
+ ret = refill_sheaf(s, sheaf, gfp);
+ sheaf->pfmemalloc = true;
+
+ return ret;
+}
+
/*
* returns a sheaf that has at least the requested size
* when prefilling is needed, do so with given gfp flags
@@ -5401,17 +5430,18 @@ kmem_cache_prefill_sheaf(struct kmem_cache *s, gfp_t gfp, unsigned int size)
if (!sheaf)
sheaf = alloc_empty_sheaf(s, gfp);

- if (sheaf && sheaf->size < size) {
- if (refill_sheaf(s, sheaf, gfp)) {
+ if (sheaf) {
+ sheaf->capacity = s->sheaf_capacity;
+ sheaf->pfmemalloc = false;
+
+ if (sheaf->size < size &&
+ __prefill_sheaf_pfmemalloc(s, sheaf, gfp)) {
sheaf_flush_unused(s, sheaf);
free_empty_sheaf(s, sheaf);
sheaf = NULL;
}
}

- if (sheaf)
- sheaf->capacity = s->sheaf_capacity;
-
return sheaf;
}

@@ -5431,7 +5461,8 @@ void kmem_cache_return_sheaf(struct kmem_cache *s, gfp_t gfp,
struct slub_percpu_sheaves *pcs;
struct node_barn *barn;

- if (unlikely(sheaf->capacity != s->sheaf_capacity)) {
+ if (unlikely((sheaf->capacity != s->sheaf_capacity)
+ || sheaf->pfmemalloc)) {
sheaf_flush_unused(s, sheaf);
kfree(sheaf);
return;
@@ -5497,7 +5528,7 @@ int kmem_cache_refill_sheaf(struct kmem_cache *s, gfp_t gfp,

if (likely(sheaf->capacity >= size)) {
if (likely(sheaf->capacity == s->sheaf_capacity))
- return refill_sheaf(s, sheaf, gfp);
+ return __prefill_sheaf_pfmemalloc(s, sheaf, gfp);

if (!__kmem_cache_alloc_bulk(s, gfp, sheaf->capacity - sheaf->size,
&sheaf->objects[sheaf->size])) {
@@ -6177,8 +6208,12 @@ static void rcu_free_sheaf(struct rcu_head *head)
* handles it fine. The only downside is that sheaf will serve fewer
* allocations when reused. It only happens due to debugging, which is a
* performance hit anyway.
+ *
+ * If it returns true, there was at least one object from pfmemalloc
+ * slab so simply flush everything.
*/
- __rcu_free_sheaf_prepare(s, sheaf);
+ if (__rcu_free_sheaf_prepare(s, sheaf))
+ goto flush;

n = get_node(s, sheaf->node);
if (!n)
@@ -6333,7 +6368,8 @@ static void free_to_pcs_bulk(struct kmem_cache *s, size_t size, void **p)
continue;
}

- if (unlikely(IS_ENABLED(CONFIG_NUMA) && slab_nid(slab) != node)) {
+ if (unlikely((IS_ENABLED(CONFIG_NUMA) && slab_nid(slab) != node)
+ || slab_test_pfmemalloc(slab))) {
remote_objects[remote_nr] = p[i];
p[i] = p[--size];
if (++remote_nr >= PCS_BATCH_MAX)
@@ -6631,7 +6667,8 @@ void slab_free(struct kmem_cache *s, struct slab *slab, void *object,
return;

if (s->cpu_sheaves && likely(!IS_ENABLED(CONFIG_NUMA) ||
- slab_nid(slab) == numa_mem_id())) {
+ slab_nid(slab) == numa_mem_id())
+ && likely(!slab_test_pfmemalloc(slab))) {
if (likely(free_to_pcs(s, object)))
return;
}

--
2.51.1

Vlastimil Babka

Oct 23, 2025, 9:53:17 AM
to Andrew Morton, Christoph Lameter, David Rientjes, Roman Gushchin, Harry Yoo, Uladzislau Rezki, Liam R. Howlett, Suren Baghdasaryan, Sebastian Andrzej Siewior, Alexei Starovoitov, linu...@kvack.org, linux-...@vger.kernel.org, linux-r...@lists.linux.dev, b...@vger.kernel.org, kasa...@googlegroups.com, Vlastimil Babka
Currently slabs are only frozen when consistency checks fail. This
can happen only in caches with debugging enabled, and those use
free_to_partial_list() for freeing. The non-debug operation of
__slab_free() can thus stop considering the frozen field, and we can
remove the FREE_FROZEN stat.

Signed-off-by: Vlastimil Babka <vba...@suse.cz>
---
mm/slub.c | 20 +++++---------------
1 file changed, 5 insertions(+), 15 deletions(-)

diff --git a/mm/slub.c b/mm/slub.c
index 515a2b59cb52..9b551c48c2eb 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -336,7 +336,6 @@ enum stat_item {
FREE_RCU_SHEAF_FAIL, /* Failed to free to a rcu_free sheaf */
FREE_FASTPATH, /* Free to cpu slab */
FREE_SLOWPATH, /* Freeing not to cpu slab */
- FREE_FROZEN, /* Freeing to frozen slab */
FREE_ADD_PARTIAL, /* Freeing moves slab to partial list */
FREE_REMOVE_PARTIAL, /* Freeing removes last object */
ALLOC_FROM_PARTIAL, /* Cpu slab acquired from node partial list */
@@ -5036,7 +5035,6 @@ static void __slab_free(struct kmem_cache *s, struct slab *slab,

{
void *prior;
- int was_frozen;
struct slab new;
unsigned long counters;
struct kmem_cache_node *n = NULL;
@@ -5059,9 +5057,8 @@ static void __slab_free(struct kmem_cache *s, struct slab *slab,
counters = slab->counters;
set_freepointer(s, tail, prior);
new.counters = counters;
- was_frozen = new.frozen;
new.inuse -= cnt;
- if ((!new.inuse || !prior) && !was_frozen) {
+ if (!new.inuse || !prior) {
/* Needs to be taken off a list */
n = get_node(s, slab_nid(slab));
/*
@@ -5083,15 +5080,10 @@ static void __slab_free(struct kmem_cache *s, struct slab *slab,
"__slab_free"));

if (likely(!n)) {
-
- if (likely(was_frozen)) {
- /*
- * The list lock was not taken therefore no list
- * activity can be necessary.
- */
- stat(s, FREE_FROZEN);
- }
-
+ /*
+ * The list lock was not taken therefore no list activity can be
+ * necessary.
+ */
return;
}

@@ -8648,7 +8640,6 @@ STAT_ATTR(FREE_RCU_SHEAF, free_rcu_sheaf);
STAT_ATTR(FREE_RCU_SHEAF_FAIL, free_rcu_sheaf_fail);
STAT_ATTR(FREE_FASTPATH, free_fastpath);
STAT_ATTR(FREE_SLOWPATH, free_slowpath);
-STAT_ATTR(FREE_FROZEN, free_frozen);
STAT_ATTR(FREE_ADD_PARTIAL, free_add_partial);
STAT_ATTR(FREE_REMOVE_PARTIAL, free_remove_partial);
STAT_ATTR(ALLOC_FROM_PARTIAL, alloc_from_partial);
@@ -8753,7 +8744,6 @@ static struct attribute *slab_attrs[] = {
&free_rcu_sheaf_fail_attr.attr,
&free_fastpath_attr.attr,
&free_slowpath_attr.attr,
- &free_frozen_attr.attr,
&free_add_partial_attr.attr,
&free_remove_partial_attr.attr,
&alloc_from_partial_attr.attr,

--
2.51.1

Vlastimil Babka

Oct 23, 2025, 9:53:18 AM
to Andrew Morton, Christoph Lameter, David Rientjes, Roman Gushchin, Harry Yoo, Uladzislau Rezki, Liam R. Howlett, Suren Baghdasaryan, Sebastian Andrzej Siewior, Alexei Starovoitov, linu...@kvack.org, linux-...@vger.kernel.org, linux-r...@lists.linux.dev, b...@vger.kernel.org, kasa...@googlegroups.com, Vlastimil Babka
There are no more cpu slabs, so we don't need their deferred
deactivation. The function is now only used in a place where we
allocate a new slab but then can't spin on the node list_lock to put it
on the partial list. Instead of the deferred action we can free the slab
directly via __free_slab(); we just need to tell it to use _nolock()
freeing of the underlying pages and take care of the accounting.

Since a free_frozen_pages_nolock() variant does not yet exist for code
outside of the page allocator, create it as a trivial wrapper around
__free_frozen_pages(..., FPI_TRYLOCK).

Signed-off-by: Vlastimil Babka <vba...@suse.cz>
---
mm/internal.h | 1 +
mm/page_alloc.c | 5 +++++
mm/slab.h | 8 +-------
mm/slub.c | 50 +++++++++++++++-----------------------------------
4 files changed, 22 insertions(+), 42 deletions(-)

diff --git a/mm/internal.h b/mm/internal.h
index 1561fc2ff5b8..64c5eda7c1ae 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -846,6 +846,7 @@ static inline struct page *alloc_frozen_pages_noprof(gfp_t gfp, unsigned int ord
struct page *alloc_frozen_pages_nolock_noprof(gfp_t gfp_flags, int nid, unsigned int order);
#define alloc_frozen_pages_nolock(...) \
alloc_hooks(alloc_frozen_pages_nolock_noprof(__VA_ARGS__))
+void free_frozen_pages_nolock(struct page *page, unsigned int order);

extern void zone_pcp_reset(struct zone *zone);
extern void zone_pcp_disable(struct zone *zone);
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 600d9e981c23..f8ac3232db41 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2944,6 +2944,11 @@ void free_frozen_pages(struct page *page, unsigned int order)
__free_frozen_pages(page, order, FPI_NONE);
}

+void free_frozen_pages_nolock(struct page *page, unsigned int order)
+{
+ __free_frozen_pages(page, order, FPI_TRYLOCK);
+}
+
/*
* Free a batch of folios
*/
diff --git a/mm/slab.h b/mm/slab.h
index a103da44ab9d..b2663cc594f3 100644
--- a/mm/slab.h
+++ b/mm/slab.h
@@ -55,13 +55,7 @@ struct slab {
struct kmem_cache *slab_cache;
union {
struct {
- union {
- struct list_head slab_list;
- struct { /* For deferred deactivate_slab() */
- struct llist_node llnode;
- void *flush_freelist;
- };
- };
+ struct list_head slab_list;
/* Double-word boundary */
union {
struct {
diff --git a/mm/slub.c b/mm/slub.c
index a35eb397caa9..6f5ca26bbb00 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -3246,7 +3246,7 @@ static struct slab *new_slab(struct kmem_cache *s, gfp_t flags, int node)
flags & (GFP_RECLAIM_MASK | GFP_CONSTRAINT_MASK), node);
}

-static void __free_slab(struct kmem_cache *s, struct slab *slab)
+static void __free_slab(struct kmem_cache *s, struct slab *slab, bool allow_spin)
{
struct folio *folio = slab_folio(slab);
int order = folio_order(folio);
@@ -3257,14 +3257,18 @@ static void __free_slab(struct kmem_cache *s, struct slab *slab)
__folio_clear_slab(folio);
mm_account_reclaimed_pages(pages);
unaccount_slab(slab, order, s);
- free_frozen_pages(&folio->page, order);
+
+ if (allow_spin)
+ free_frozen_pages(&folio->page, order);
+ else
+ free_frozen_pages_nolock(&folio->page, order);
}

static void rcu_free_slab(struct rcu_head *h)
{
struct slab *slab = container_of(h, struct slab, rcu_head);

- __free_slab(slab->slab_cache, slab);
+ __free_slab(slab->slab_cache, slab, true);
}

static void free_slab(struct kmem_cache *s, struct slab *slab)
@@ -3280,7 +3284,7 @@ static void free_slab(struct kmem_cache *s, struct slab *slab)
if (unlikely(s->flags & SLAB_TYPESAFE_BY_RCU))
call_rcu(&slab->rcu_head, rcu_free_slab);
else
- __free_slab(s, slab);
+ __free_slab(s, slab, true);
}

static void discard_slab(struct kmem_cache *s, struct slab *slab)
@@ -3373,8 +3377,6 @@ static void *alloc_single_from_partial(struct kmem_cache *s,
return object;
}

-static void defer_deactivate_slab(struct slab *slab, void *flush_freelist);
-
/*
* Called only for kmem_cache_debug() caches to allocate from a freshly
* allocated slab. Allocate a single object instead of whole freelist
@@ -3390,8 +3392,12 @@ static void *alloc_single_from_new_slab(struct kmem_cache *s, struct slab *slab,
void *object;

if (!allow_spin && !spin_trylock_irqsave(&n->list_lock, flags)) {
- /* Unlucky, discard newly allocated slab */
- defer_deactivate_slab(slab, NULL);
+ /*
+ * Unlucky, discard newly allocated slab.
+ * Since it was just allocated, we can skip the actions
+ * in discard_slab() and free_slab().
+ */
+ __free_slab(s, slab, false);
return NULL;
}

@@ -5949,7 +5955,6 @@ static void free_to_pcs_bulk(struct kmem_cache *s, size_t size, void **p)

struct defer_free {
struct llist_head objects;
- struct llist_head slabs;
struct irq_work work;
};

@@ -5957,7 +5962,6 @@ static void free_deferred_objects(struct irq_work *work);

static DEFINE_PER_CPU(struct defer_free, defer_free_objects) = {
.objects = LLIST_HEAD_INIT(objects),
- .slabs = LLIST_HEAD_INIT(slabs),
.work = IRQ_WORK_INIT(free_deferred_objects),
};

@@ -5970,10 +5974,9 @@ static void free_deferred_objects(struct irq_work *work)
{
struct defer_free *df = container_of(work, struct defer_free, work);
struct llist_head *objs = &df->objects;
- struct llist_head *slabs = &df->slabs;
struct llist_node *llnode, *pos, *t;

- if (llist_empty(objs) && llist_empty(slabs))
+ if (llist_empty(objs))
return;

llnode = llist_del_all(objs);
@@ -5997,16 +6000,6 @@ static void free_deferred_objects(struct irq_work *work)

__slab_free(s, slab, x, x, 1, _THIS_IP_);
}
-
- llnode = llist_del_all(slabs);
- llist_for_each_safe(pos, t, llnode) {
- struct slab *slab = container_of(pos, struct slab, llnode);
-
- if (slab->frozen)
- deactivate_slab(slab->slab_cache, slab, slab->flush_freelist);
- else
- free_slab(slab->slab_cache, slab);
- }
}

static void defer_free(struct kmem_cache *s, void *head)
@@ -6020,19 +6013,6 @@ static void defer_free(struct kmem_cache *s, void *head)
irq_work_queue(&df->work);
}

-static void defer_deactivate_slab(struct slab *slab, void *flush_freelist)
-{
- struct defer_free *df;
-
- slab->flush_freelist = flush_freelist;
-
- guard(preempt)();
-
- df = this_cpu_ptr(&defer_free_objects);
- if (llist_add(&slab->llnode, &df->slabs))
- irq_work_queue(&df->work);
-}
-
void defer_free_barrier(void)
{
int cpu;

--
2.51.1

Vlastimil Babka

Oct 23, 2025, 9:53:20 AM
to Andrew Morton, Christoph Lameter, David Rientjes, Roman Gushchin, Harry Yoo, Uladzislau Rezki, Liam R. Howlett, Suren Baghdasaryan, Sebastian Andrzej Siewior, Alexei Starovoitov, linu...@kvack.org, linux-...@vger.kernel.org, linux-r...@lists.linux.dev, b...@vger.kernel.org, kasa...@googlegroups.com, Vlastimil Babka
We want to expand usage of sheaves to all non-boot caches, including
kmalloc caches. Since sheaves themselves are also allocated by
kmalloc(), we need to prevent excessive or infinite recursion -
depending on the sheaf size, a sheaf can be allocated from a smaller,
the same, or a larger kmalloc size bucket; there's no particular constraint.

This is similar to allocating the obj_exts arrays, so let's just reuse
the existing mechanisms for those. __GFP_NO_OBJ_EXT in alloc_empty_sheaf()
will prevent a nested kmalloc() from allocating a sheaf itself - it will
either have sheaves already, or fall back to a non-sheaf-cached
allocation (so bootstrap of sheaves in a kmalloc cache that allocates
sheaves from its own size bucket is possible). Additionally, reuse
OBJCGS_CLEAR_MASK to clear unwanted gfp flags from the nested
allocation.
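
The nesting that the flag cuts short looks roughly like this (a
simplified sketch; the cache names are just examples):

	/*
	 * alloc_empty_sheaf(kmalloc-96, gfp)
	 *   kzalloc(sheaf_size, gfp | __GFP_NO_OBJ_EXT)  // SLAB_KMALLOC cache
	 *     -> served from some kmalloc-N (possibly kmalloc-96 itself)
	 *     -> if kmalloc-N needs a sheaf to serve this allocation:
	 *        alloc_empty_sheaf(kmalloc-N, gfp | __GFP_NO_OBJ_EXT)
	 *          -> sees __GFP_NO_OBJ_EXT and returns NULL
	 *     -> the nested kzalloc() falls back to a non-sheaf allocation
	 */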

Signed-off-by: Vlastimil Babka <vba...@suse.cz>
---
include/linux/gfp_types.h | 6 ------
mm/slub.c | 36 ++++++++++++++++++++++++++----------
2 files changed, 26 insertions(+), 16 deletions(-)

diff --git a/include/linux/gfp_types.h b/include/linux/gfp_types.h
index 65db9349f905..3de43b12209e 100644
--- a/include/linux/gfp_types.h
+++ b/include/linux/gfp_types.h
@@ -55,9 +55,7 @@ enum {
#ifdef CONFIG_LOCKDEP
___GFP_NOLOCKDEP_BIT,
#endif
-#ifdef CONFIG_SLAB_OBJ_EXT
___GFP_NO_OBJ_EXT_BIT,
-#endif
___GFP_LAST_BIT
};

@@ -98,11 +96,7 @@ enum {
#else
#define ___GFP_NOLOCKDEP 0
#endif
-#ifdef CONFIG_SLAB_OBJ_EXT
#define ___GFP_NO_OBJ_EXT BIT(___GFP_NO_OBJ_EXT_BIT)
-#else
-#define ___GFP_NO_OBJ_EXT 0
-#endif

/*
* Physical address zone modifiers (see linux/mmzone.h - low four bits)
diff --git a/mm/slub.c b/mm/slub.c
index 68867cd52c4f..f2b2a6180759 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -2031,6 +2031,14 @@ static bool freelist_corrupted(struct kmem_cache *s, struct slab *slab,
}
#endif /* CONFIG_SLUB_DEBUG */

+/*
+ * The allocated objcg pointers array is not accounted directly.
+ * Moreover, it should not come from DMA buffer and is not readily
+ * reclaimable. So those GFP bits should be masked off.
+ */
+#define OBJCGS_CLEAR_MASK (__GFP_DMA | __GFP_RECLAIMABLE | \
+ __GFP_ACCOUNT | __GFP_NOFAIL)
+
#ifdef CONFIG_SLAB_OBJ_EXT

#ifdef CONFIG_MEM_ALLOC_PROFILING_DEBUG
@@ -2081,14 +2089,6 @@ static inline void handle_failed_objexts_alloc(unsigned long obj_exts,

#endif /* CONFIG_MEM_ALLOC_PROFILING_DEBUG */

-/*
- * The allocated objcg pointers array is not accounted directly.
- * Moreover, it should not come from DMA buffer and is not readily
- * reclaimable. So those GFP bits should be masked off.
- */
-#define OBJCGS_CLEAR_MASK (__GFP_DMA | __GFP_RECLAIMABLE | \
- __GFP_ACCOUNT | __GFP_NOFAIL)
-
static inline void init_slab_obj_exts(struct slab *slab)
{
slab->obj_exts = 0;
@@ -2590,8 +2590,24 @@ static void *setup_object(struct kmem_cache *s, void *object)

static struct slab_sheaf *alloc_empty_sheaf(struct kmem_cache *s, gfp_t gfp)
{
- struct slab_sheaf *sheaf = kzalloc(struct_size(sheaf, objects,
- s->sheaf_capacity), gfp);
+ struct slab_sheaf *sheaf;
+ size_t sheaf_size;
+
+ if (gfp & __GFP_NO_OBJ_EXT)
+ return NULL;
+
+ gfp &= ~OBJCGS_CLEAR_MASK;
+
+ /*
+ * Prevent recursion to the same cache, or a deep stack of kmallocs of
+ * varying sizes (sheaf capacity might differ for each kmalloc size
+ * bucket)
+ */
+ if (s->flags & SLAB_KMALLOC)
+ gfp |= __GFP_NO_OBJ_EXT;
+
+ sheaf_size = struct_size(sheaf, objects, s->sheaf_capacity);
+ sheaf = kzalloc(sheaf_size, gfp);

if (unlikely(!sheaf))
return NULL;

--
2.51.1

Vlastimil Babka

Oct 23, 2025, 9:53:22 AM
to Andrew Morton, Christoph Lameter, David Rientjes, Roman Gushchin, Harry Yoo, Uladzislau Rezki, Liam R. Howlett, Suren Baghdasaryan, Sebastian Andrzej Siewior, Alexei Starovoitov, linu...@kvack.org, linux-...@vger.kernel.org, linux-r...@lists.linux.dev, b...@vger.kernel.org, kasa...@googlegroups.com, Vlastimil Babka
The kmalloc_nolock() implementation has several complications and
restrictions due to SLUB's cpu slab locking, lockless fastpath and
PREEMPT_RT differences. With cpu slab usage removed, we can simplify
things:

- the local_lock_cpu_slab() macros became unused, remove them

- we no longer need to set up lockdep classes on PREEMPT_RT

- we no longer need to annotate ___slab_alloc as NOKPROBE_SYMBOL
since there's no lockless cpu freelist manipulation anymore

- __slab_alloc_node() can be called from kmalloc_nolock_noprof()
unconditionally

Note that we still need __CMPXCHG_DOUBLE: while we no longer use
cmpxchg16b on the cpu freelist, we still use it on the slab freelist,
and the alternative is slab_lock(), which can be interrupted by an NMI.
Clarify the comment to mention that specifically.
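
The scenario that makes slab_lock() unsuitable here is roughly the
following (a simplified sketch, not taken from the code):

	/*
	 * kfree() -> __slab_free() -> slab_lock(slab)   // bit spinlock taken
	 *   NMI -> bpf -> kmalloc_nolock()
	 *     -> a freelist update on the same slab needs slab_lock(slab)
	 *     -> the bit spinlock is already held by this CPU -> deadlock
	 *
	 * With __CMPXCHG_DOUBLE the freelist+counters update is a single
	 * cmpxchg16b, so there is no locked section an NMI could interrupt.
	 */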

Signed-off-by: Vlastimil Babka <vba...@suse.cz>
---
mm/slab.h | 1 -
mm/slub.c | 100 ++++----------------------------------------------------------
2 files changed, 6 insertions(+), 95 deletions(-)

diff --git a/mm/slab.h b/mm/slab.h
index b2663cc594f3..7dde0b56a7b0 100644
--- a/mm/slab.h
+++ b/mm/slab.h
@@ -208,7 +208,6 @@ struct kmem_cache_order_objects {
*/
struct kmem_cache {
struct kmem_cache_cpu __percpu *cpu_slab;
- struct lock_class_key lock_key;
struct slub_percpu_sheaves __percpu *cpu_sheaves;
/* Used for retrieving partial slabs, etc. */
slab_flags_t flags;
diff --git a/mm/slub.c b/mm/slub.c
index 6f5ca26bbb00..6dd7fd153391 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -3679,29 +3679,12 @@ static inline unsigned int init_tid(int cpu)

static void init_kmem_cache_cpus(struct kmem_cache *s)
{
-#ifdef CONFIG_PREEMPT_RT
- /*
- * Register lockdep key for non-boot kmem caches to avoid
- * WARN_ON_ONCE(static_obj(key))) in lockdep_register_key()
- */
- bool finegrain_lockdep = !init_section_contains(s, 1);
-#else
- /*
- * Don't bother with different lockdep classes for each
- * kmem_cache, since we only use local_trylock_irqsave().
- */
- bool finegrain_lockdep = false;
-#endif
int cpu;
struct kmem_cache_cpu *c;

- if (finegrain_lockdep)
- lockdep_register_key(&s->lock_key);
for_each_possible_cpu(cpu) {
c = per_cpu_ptr(s->cpu_slab, cpu);
local_trylock_init(&c->lock);
- if (finegrain_lockdep)
- lockdep_set_class(&c->lock, &s->lock_key);
c->tid = init_tid(cpu);
}
}
@@ -3792,47 +3775,6 @@ static void deactivate_slab(struct kmem_cache *s, struct slab *slab,
}
}

-/*
- * ___slab_alloc()'s caller is supposed to check if kmem_cache::kmem_cache_cpu::lock
- * can be acquired without a deadlock before invoking the function.
- *
- * Without LOCKDEP we trust the code to be correct. kmalloc_nolock() is
- * using local_lock_is_locked() properly before calling local_lock_cpu_slab(),
- * and kmalloc() is not used in an unsupported context.
- *
- * With LOCKDEP, on PREEMPT_RT lockdep does its checking in local_lock_irqsave().
- * On !PREEMPT_RT we use trylock to avoid false positives in NMI, but
- * lockdep_assert() will catch a bug in case:
- * #1
- * kmalloc() -> ___slab_alloc() -> irqsave -> NMI -> bpf -> kmalloc_nolock()
- * or
- * #2
- * kmalloc() -> ___slab_alloc() -> irqsave -> tracepoint/kprobe -> bpf -> kmalloc_nolock()
- *
- * On PREEMPT_RT an invocation is not possible from IRQ-off or preempt
- * disabled context. The lock will always be acquired and if needed it
- * block and sleep until the lock is available.
- * #1 is possible in !PREEMPT_RT only.
- * #2 is possible in both with a twist that irqsave is replaced with rt_spinlock:
- * kmalloc() -> ___slab_alloc() -> rt_spin_lock(kmem_cache_A) ->
- * tracepoint/kprobe -> bpf -> kmalloc_nolock() -> rt_spin_lock(kmem_cache_B)
- *
- * local_lock_is_locked() prevents the case kmem_cache_A == kmem_cache_B
- */
-#if defined(CONFIG_PREEMPT_RT) || !defined(CONFIG_LOCKDEP)
-#define local_lock_cpu_slab(s, flags) \
- local_lock_irqsave(&(s)->cpu_slab->lock, flags)
-#else
-#define local_lock_cpu_slab(s, flags) \
- do { \
- bool __l = local_trylock_irqsave(&(s)->cpu_slab->lock, flags); \
- lockdep_assert(__l); \
- } while (0)
-#endif
-
-#define local_unlock_cpu_slab(s, flags) \
- local_unlock_irqrestore(&(s)->cpu_slab->lock, flags)
-
static inline void flush_slab(struct kmem_cache *s, struct kmem_cache_cpu *c)
{
unsigned long flags;
@@ -4320,19 +4262,6 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,

return freelist;
}
-/*
- * We disallow kprobes in ___slab_alloc() to prevent reentrance
- *
- * kmalloc() -> ___slab_alloc() -> local_lock_cpu_slab() protected part of
- * ___slab_alloc() manipulating c->freelist -> kprobe -> bpf ->
- * kmalloc_nolock() or kfree_nolock() -> __update_cpu_freelist_fast()
- * manipulating c->freelist without lock.
- *
- * This does not prevent kprobe in functions called from ___slab_alloc() such as
- * local_lock_irqsave() itself, and that is fine, we only need to protect the
- * c->freelist manipulation in ___slab_alloc() itself.
- */
-NOKPROBE_SYMBOL(___slab_alloc);

static __always_inline void *__slab_alloc_node(struct kmem_cache *s,
gfp_t gfpflags, int node, unsigned long addr, size_t orig_size)
@@ -5201,10 +5130,11 @@ void *kmalloc_nolock_noprof(size_t size, gfp_t gfp_flags, int node)
if (!(s->flags & __CMPXCHG_DOUBLE) && !kmem_cache_debug(s))
/*
* kmalloc_nolock() is not supported on architectures that
- * don't implement cmpxchg16b, but debug caches don't use
- * per-cpu slab and per-cpu partial slabs. They rely on
- * kmem_cache_node->list_lock, so kmalloc_nolock() can
- * attempt to allocate from debug caches by
+ * don't implement cmpxchg16b and thus need slab_lock()
+ * which could be preempted by a nmi.
+ * But debug caches don't use that and only rely on
+ * kmem_cache_node->list_lock, so kmalloc_nolock() can attempt
+ * to allocate from debug caches by
* spin_trylock_irqsave(&n->list_lock, ...)
*/
return NULL;
@@ -5214,27 +5144,13 @@ void *kmalloc_nolock_noprof(size_t size, gfp_t gfp_flags, int node)
if (ret)
goto success;

- ret = ERR_PTR(-EBUSY);
-
/*
* Do not call slab_alloc_node(), since trylock mode isn't
* compatible with slab_pre_alloc_hook/should_failslab and
* kfence_alloc. Hence call __slab_alloc_node() (at most twice)
* and slab_post_alloc_hook() directly.
- *
- * In !PREEMPT_RT ___slab_alloc() manipulates (freelist,tid) pair
- * in irq saved region. It assumes that the same cpu will not
- * __update_cpu_freelist_fast() into the same (freelist,tid) pair.
- * Therefore use in_nmi() to check whether particular bucket is in
- * irq protected section.
- *
- * If in_nmi() && local_lock_is_locked(s->cpu_slab) then it means that
- * this cpu was interrupted somewhere inside ___slab_alloc() after
- * it did local_lock_irqsave(&s->cpu_slab->lock, flags).
- * In this case fast path with __update_cpu_freelist_fast() is not safe.
*/
- if (!in_nmi() || !local_lock_is_locked(&s->cpu_slab->lock))
- ret = __slab_alloc_node(s, alloc_gfp, node, _RET_IP_, size);
+ ret = __slab_alloc_node(s, alloc_gfp, node, _RET_IP_, size);

if (PTR_ERR(ret) == -EBUSY) {
if (can_retry) {
@@ -7250,10 +7166,6 @@ void __kmem_cache_release(struct kmem_cache *s)
{
cache_random_seq_destroy(s);
pcs_destroy(s);
-#ifdef CONFIG_PREEMPT_RT
- if (s->cpu_slab)
- lockdep_unregister_key(&s->lock_key);
-#endif
free_percpu(s->cpu_slab);
free_kmem_cache_nodes(s);
}

--
2.51.1

Vlastimil Babka

Oct 23, 2025, 9:53:24 AM
to Andrew Morton, Christoph Lameter, David Rientjes, Roman Gushchin, Harry Yoo, Uladzislau Rezki, Liam R. Howlett, Suren Baghdasaryan, Sebastian Andrzej Siewior, Alexei Starovoitov, linu...@kvack.org, linux-...@vger.kernel.org, linux-r...@lists.linux.dev, b...@vger.kernel.org, kasa...@googlegroups.com, Vlastimil Babka
In the first step to replace cpu (partial) slabs with sheaves, enable
sheaves for almost all caches. Treat args->sheaf_capacity as a minimum,
and calculate sheaf capacity with a formula that roughly follows the
formula for number of objects in cpu partial slabs in set_cpu_partial().
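
A worked example of the calculation below (assuming, for illustration, a
32 byte struct slab_sheaf header on 64-bit; the exact value doesn't
matter for the principle):

	/*
	 * s->size == 512                    -> base capacity 26
	 * struct_size(sheaf, objects, 26)   -> 32 + 26 * 8 = 240 bytes
	 * kmalloc_size_roundup(240)         -> 256 bytes
	 * final capacity: (256 - 32) / 8    -> 28 objects
	 *
	 * i.e. the sheaf grows just enough to exactly fill the kmalloc-256
	 * bucket it would be allocated from anyway.
	 */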

This should achieve roughly similar contention on the barn spin lock as
there currently is on the node list_lock without sheaves, to make
benchmarking results comparable. It can be further tuned later.

Don't enable sheaves for kmalloc caches yet, as that needs further
changes to bootstrapping.

Signed-off-by: Vlastimil Babka <vba...@suse.cz>
---
include/linux/slab.h | 6 ------
mm/slub.c | 51 +++++++++++++++++++++++++++++++++++++++++++++++----
2 files changed, 47 insertions(+), 10 deletions(-)

diff --git a/include/linux/slab.h b/include/linux/slab.h
index cf443f064a66..e42aa6a3d202 100644
--- a/include/linux/slab.h
+++ b/include/linux/slab.h
@@ -57,9 +57,7 @@ enum _slab_flag_bits {
#endif
_SLAB_OBJECT_POISON,
_SLAB_CMPXCHG_DOUBLE,
-#ifdef CONFIG_SLAB_OBJ_EXT
_SLAB_NO_OBJ_EXT,
-#endif
_SLAB_FLAGS_LAST_BIT
};

@@ -238,11 +236,7 @@ enum _slab_flag_bits {
#define SLAB_TEMPORARY SLAB_RECLAIM_ACCOUNT /* Objects are short-lived */

/* Slab created using create_boot_cache */
-#ifdef CONFIG_SLAB_OBJ_EXT
#define SLAB_NO_OBJ_EXT __SLAB_FLAG_BIT(_SLAB_NO_OBJ_EXT)
-#else
-#define SLAB_NO_OBJ_EXT __SLAB_FLAG_UNUSED
-#endif

/*
* ZERO_SIZE_PTR will be returned for zero sized kmalloc requests.
diff --git a/mm/slub.c b/mm/slub.c
index f2b2a6180759..a6e58d3708f4 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -7810,6 +7810,48 @@ static void set_cpu_partial(struct kmem_cache *s)
#endif
}

+static unsigned int calculate_sheaf_capacity(struct kmem_cache *s,
+ struct kmem_cache_args *args)
+
+{
+ unsigned int capacity;
+ size_t size;
+
+
+ if (IS_ENABLED(CONFIG_SLUB_TINY) || s->flags & SLAB_DEBUG_FLAGS)
+ return 0;
+
+ /* bootstrap caches can't have sheaves for now */
+ if (s->flags & SLAB_NO_OBJ_EXT)
+ return 0;
+
+ /*
+ * For now we use roughly similar formula (divided by two as there are
+ * two percpu sheaves) as what was used for percpu partial slabs, which
+ * should result in similar lock contention (barn or list_lock)
+ */
+ if (s->size >= PAGE_SIZE)
+ capacity = 4;
+ else if (s->size >= 1024)
+ capacity = 12;
+ else if (s->size >= 256)
+ capacity = 26;
+ else
+ capacity = 60;
+
+ /* Increment capacity to make sheaf exactly a kmalloc size bucket */
+ size = struct_size_t(struct slab_sheaf, objects, capacity);
+ size = kmalloc_size_roundup(size);
+ capacity = (size - struct_size_t(struct slab_sheaf, objects, 0)) / sizeof(void *);
+
+ /*
+ * Respect an explicit request for capacity that's typically motivated by
+ * expected maximum size of kmem_cache_prefill_sheaf() to not end up
+ * using low-performance oversize sheaves
+ */
+ return max(capacity, args->sheaf_capacity);
+}
+
/*
* calculate_sizes() determines the order and the distribution of data within
* a slab object.
@@ -7944,6 +7986,10 @@ static int calculate_sizes(struct kmem_cache_args *args, struct kmem_cache *s)
if (s->flags & SLAB_RECLAIM_ACCOUNT)
s->allocflags |= __GFP_RECLAIMABLE;

+ /* kmalloc caches need extra care to support sheaves */
+ if (!is_kmalloc_cache(s))
+ s->sheaf_capacity = calculate_sheaf_capacity(s, args);
+
/*
* Determine the number of objects per slab
*/
@@ -8562,15 +8608,12 @@ int do_kmem_cache_create(struct kmem_cache *s, const char *name,

set_cpu_partial(s);

- if (args->sheaf_capacity && !IS_ENABLED(CONFIG_SLUB_TINY)
- && !(s->flags & SLAB_DEBUG_FLAGS)) {
+ if (s->sheaf_capacity) {
s->cpu_sheaves = alloc_percpu(struct slub_percpu_sheaves);
if (!s->cpu_sheaves) {
err = -ENOMEM;
goto out;
}
- // TODO: increase capacity to grow slab_sheaf up to next kmalloc size?
- s->sheaf_capacity = args->sheaf_capacity;
}

#ifdef CONFIG_NUMA

--
2.51.1

Vlastimil Babka

Oct 23, 2025, 9:53:25 AM
to Andrew Morton, Christoph Lameter, David Rientjes, Roman Gushchin, Harry Yoo, Uladzislau Rezki, Liam R. Howlett, Suren Baghdasaryan, Sebastian Andrzej Siewior, Alexei Starovoitov, linu...@kvack.org, linux-...@vger.kernel.org, linux-r...@lists.linux.dev, b...@vger.kernel.org, kasa...@googlegroups.com, Vlastimil Babka
__refill_objects() currently only attempts to get partial slabs from the
local node and then allocates new slab(s). Expand it to also try
other nodes, while observing the remote node defrag ratio, similarly to
get_any_partial().
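
For reference, the reused knob behaves as in get_any_partial() (this is
the existing SLUB semantics, not changed by this patch):

	/*
	 * /sys/kernel/slab/<cache>/remote_node_defrag_ratio accepts 0..100
	 * and is stored multiplied by 10, so e.g. a value of 20 means:
	 *
	 *   get_cycles() % 1024 > 200  ->  ~80% of refills stay local-only,
	 *                                  ~20% may also scan remote nodes
	 *
	 * while the default of 100 (stored as 1000) allows it almost always.
	 */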

This will prevent allocating new slabs on a node while other nodes have
many free slabs. It does mean sheaves will contain non-local objects in
that case. Allocations that care about a specific node will still be
served appropriately, but might get a slowpath allocation.

Like get_any_partial(), we also observe cpuset_zone_allowed(), although
we might be refilling a sheaf that will then be used from a different
allocation context.

We can also use the resulting refill_objects() in
__kmem_cache_alloc_bulk() for non-debug caches. This means
kmem_cache_alloc_bulk() will get better performance when sheaves are
exhausted. kmem_cache_alloc_bulk() cannot indicate a preferred node, so
it's compatible with the sheaf refill's preference for the local node.

Signed-off-by: Vlastimil Babka <vba...@suse.cz>
---
mm/slub.c | 130 ++++++++++++++++++++++++++++++++++++++++++++++++--------------
1 file changed, 102 insertions(+), 28 deletions(-)

diff --git a/mm/slub.c b/mm/slub.c
index d55afa9b277f..4e003493ba60 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -2505,8 +2505,8 @@ static void free_empty_sheaf(struct kmem_cache *s, struct slab_sheaf *sheaf)
}

static unsigned int
-__refill_objects(struct kmem_cache *s, void **p, gfp_t gfp, unsigned int min,
- unsigned int max);
+refill_objects(struct kmem_cache *s, void **p, gfp_t gfp, unsigned int min,
+ unsigned int max);

static int refill_sheaf(struct kmem_cache *s, struct slab_sheaf *sheaf,
gfp_t gfp)
@@ -2517,8 +2517,8 @@ static int refill_sheaf(struct kmem_cache *s, struct slab_sheaf *sheaf,
if (!to_fill)
return 0;

- filled = __refill_objects(s, &sheaf->objects[sheaf->size], gfp,
- to_fill, to_fill);
+ filled = refill_objects(s, &sheaf->objects[sheaf->size], gfp, to_fill,
+ to_fill);

sheaf->size += filled;

@@ -6423,25 +6423,21 @@ void kmem_cache_free_bulk(struct kmem_cache *s, size_t size, void **p)
EXPORT_SYMBOL(kmem_cache_free_bulk);

static unsigned int
-__refill_objects(struct kmem_cache *s, void **p, gfp_t gfp, unsigned int min,
- unsigned int max)
+__refill_objects_node(struct kmem_cache *s, void **p, gfp_t gfp, unsigned int min,
+ unsigned int max, struct kmem_cache_node *n)
{
struct slab *slab, *slab2;
struct partial_context pc;
unsigned int refilled = 0;
unsigned long flags;
void *object;
- int node;

pc.flags = gfp;
pc.min_objects = min;
pc.max_objects = max;

- node = numa_mem_id();
-
- /* TODO: consider also other nodes? */
- if (!get_partial_node_bulk(s, get_node(s, node), &pc))
- goto new_slab;
+ if (!get_partial_node_bulk(s, n, &pc))
+ return 0;

list_for_each_entry_safe(slab, slab2, &pc.slabs, slab_list) {

@@ -6480,8 +6476,6 @@ __refill_objects(struct kmem_cache *s, void **p, gfp_t gfp, unsigned int min,
}

if (unlikely(!list_empty(&pc.slabs))) {
- struct kmem_cache_node *n = get_node(s, node);
-
spin_lock_irqsave(&n->list_lock, flags);

list_for_each_entry_safe(slab, slab2, &pc.slabs, slab_list) {
@@ -6503,13 +6497,91 @@ __refill_objects(struct kmem_cache *s, void **p, gfp_t gfp, unsigned int min,
}
}

+ return refilled;
+}

- if (likely(refilled >= min))
- goto out;
+#ifdef CONFIG_NUMA
+static unsigned int
+__refill_objects_any(struct kmem_cache *s, void **p, gfp_t gfp, unsigned int min,
+ unsigned int max, int local_node)
+{
+ struct zonelist *zonelist;
+ struct zoneref *z;
+ struct zone *zone;
+ enum zone_type highest_zoneidx = gfp_zone(gfp);
+ unsigned int cpuset_mems_cookie;
+ unsigned int refilled = 0;
+
+ /* see get_any_partial() for the defrag ratio description */
+ if (!s->remote_node_defrag_ratio ||
+ get_cycles() % 1024 > s->remote_node_defrag_ratio)
+ return 0;
+
+ do {
+ cpuset_mems_cookie = read_mems_allowed_begin();
+ zonelist = node_zonelist(mempolicy_slab_node(), gfp);
+ for_each_zone_zonelist(zone, z, zonelist, highest_zoneidx) {
+ struct kmem_cache_node *n;
+ unsigned int r;
+
+ n = get_node(s, zone_to_nid(zone));
+
+ if (!n || !cpuset_zone_allowed(zone, gfp) ||
+ n->nr_partial <= s->min_partial)
+ continue;
+
+ r = __refill_objects_node(s, p, gfp, min, max, n);
+ refilled += r;
+
+ if (r >= min) {
+ /*
+ * Don't check read_mems_allowed_retry() here -
+ * if mems_allowed was updated in parallel, that
+ * was a harmless race between allocation and
+ * the cpuset update
+ */
+ return refilled;
+ }
+ p += r;
+ min -= r;
+ max -= r;
+ }
+ } while (read_mems_allowed_retry(cpuset_mems_cookie));
+
+ return refilled;
+}
+#else
+static inline unsigned int
+__refill_objects_any(struct kmem_cache *s, void **p, gfp_t gfp, unsigned int min,
+ unsigned int max, int local_node)
+{
+ return 0;
+}
+#endif
+
+static unsigned int
+refill_objects(struct kmem_cache *s, void **p, gfp_t gfp, unsigned int min,
+ unsigned int max)
+{
+ int local_node = numa_mem_id();
+ unsigned int refilled;
+ unsigned long flags;
+ struct slab *slab;
+ void *object;
+
+ refilled = __refill_objects_node(s, p, gfp, min, max,
+ get_node(s, local_node));
+ if (refilled >= min)
+ return refilled;
+
+ refilled += __refill_objects_any(s, p + refilled, gfp, min - refilled,
+ max - refilled, local_node);
+ if (refilled >= min)
+ return refilled;

new_slab:

- slab = new_slab(s, pc.flags, node);
+ slab = new_slab(s, gfp, local_node);
if (!slab)
goto out;

@@ -6541,8 +6613,8 @@ __refill_objects(struct kmem_cache *s, void **p, gfp_t gfp, unsigned int min,

if (refilled < min)
goto new_slab;
-out:

+out:
return refilled;
}

@@ -6552,18 +6624,20 @@ int __kmem_cache_alloc_bulk(struct kmem_cache *s, gfp_t flags, size_t size,
{
int i;

- /*
- * TODO: this might be more efficient (if necessary) by reusing
- * __refill_objects()
- */
- for (i = 0; i < size; i++) {
+ if (IS_ENABLED(CONFIG_SLUB_TINY) || kmem_cache_debug(s)) {
+ for (i = 0; i < size; i++) {

- p[i] = ___slab_alloc(s, flags, NUMA_NO_NODE, _RET_IP_,
- s->object_size);
- if (unlikely(!p[i]))
- goto error;
+ p[i] = ___slab_alloc(s, flags, NUMA_NO_NODE, _RET_IP_,
+ s->object_size);
+ if (unlikely(!p[i]))
+ goto error;

- maybe_wipe_obj_freeptr(s, p[i]);
+ maybe_wipe_obj_freeptr(s, p[i]);
+ }
+ } else {
+ i = refill_objects(s, p, flags, size, size);
+ if (i < size)
+ goto error;
}

return i;

--
2.51.1

Vlastimil Babka

Oct 23, 2025, 9:53:27 AM
to Andrew Morton, Christoph Lameter, David Rientjes, Roman Gushchin, Harry Yoo, Uladzislau Rezki, Liam R. Howlett, Suren Baghdasaryan, Sebastian Andrzej Siewior, Alexei Starovoitov, linu...@kvack.org, linux-...@vger.kernel.org, linux-r...@lists.linux.dev, b...@vger.kernel.org, kasa...@googlegroups.com, Vlastimil Babka
Until now, kmem_cache->cpu_sheaves was !NULL only for caches with
sheaves enabled. Since we want to enable them for almost all caches,
it's suboptimal to test the pointer in the fast paths, so instead
allocate it for all caches in do_kmem_cache_create(). To recognize
caches (yet) without sheaves, test kmem_cache->sheaf_capacity for being
0 where needed, instead of testing the cpu_sheaves pointer.

However, for the fast paths' sake we also assume that the main sheaf
always exists (pcs->main is !NULL), and during bootstrap we cannot
allocate sheaves yet.

Solve this by introducing a single static bootstrap_sheaf that's
assigned as pcs->main during bootstrap. It has a size of 0, so during
allocations, the fast path will find it's empty. Since the size of 0
matches sheaf_capacity of 0, the freeing fast paths will find it's
"full". In the slow path handlers, we check sheaf_capacity to recognize
that the cache doesn't (yet) have real sheaves, and fall back. Thus
sharing the single bootstrap sheaf like this for multiple caches and
cpus is safe.
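
A minimal sketch of the fast-path shape this relies on (not the exact
code; free_to_slab() is just an illustrative name for the fallback):

	/* alloc: the bootstrap sheaf has size == 0, so it always looks empty */
	if (likely(pcs->main->size > 0)) {
		object = pcs->main->objects[--pcs->main->size];
	} else {
		/* slow path handler sees s->sheaf_capacity == 0 and backs off */
		object = NULL;
	}

	/* free: size (0) == s->sheaf_capacity (0), so it always looks "full" */
	if (likely(pcs->main->size < s->sheaf_capacity)) {
		pcs->main->objects[pcs->main->size++] = object;
	} else {
		/* slow path handler again recognizes sheaf_capacity == 0 */
		free_to_slab(s, object);	/* illustrative name */
	}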

Signed-off-by: Vlastimil Babka <vba...@suse.cz>
---
mm/slub.c | 96 ++++++++++++++++++++++++++++++++++++++++++++++-----------------
1 file changed, 70 insertions(+), 26 deletions(-)

diff --git a/mm/slub.c b/mm/slub.c
index a6e58d3708f4..ecb10ed5acfe 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -2850,6 +2850,10 @@ static void pcs_destroy(struct kmem_cache *s)
if (!pcs->main)
continue;

+ /* bootstrap or debug caches, it's the bootstrap_sheaf */
+ if (!pcs->main->cache)
+ continue;
+
/*
* We have already passed __kmem_cache_shutdown() so everything
* was flushed and there should be no objects allocated from
@@ -4054,7 +4058,7 @@ static void flush_cpu_slab(struct work_struct *w)

s = sfw->s;

- if (s->cpu_sheaves)
+ if (s->sheaf_capacity)
pcs_flush_all(s);

flush_this_cpu_slab(s);
@@ -4176,7 +4180,7 @@ static int slub_cpu_dead(unsigned int cpu)
mutex_lock(&slab_mutex);
list_for_each_entry(s, &slab_caches, list) {
__flush_cpu_slab(s, cpu);
- if (s->cpu_sheaves)
+ if (s->sheaf_capacity)
__pcs_flush_all_cpu(s, cpu);
}
mutex_unlock(&slab_mutex);
@@ -4979,6 +4983,12 @@ __pcs_replace_empty_main(struct kmem_cache *s, struct slub_percpu_sheaves *pcs,

lockdep_assert_held(this_cpu_ptr(&s->cpu_sheaves->lock));

+ /* Bootstrap or debug cache, back off */
+ if (unlikely(!s->sheaf_capacity)) {
+ local_unlock(&s->cpu_sheaves->lock);
+ return NULL;
+ }
+
if (pcs->spare && pcs->spare->size > 0) {
swap(pcs->main, pcs->spare);
return pcs;
@@ -5162,6 +5172,11 @@ unsigned int alloc_from_pcs_bulk(struct kmem_cache *s, size_t size, void **p)
struct slab_sheaf *full;
struct node_barn *barn;

+ if (unlikely(!s->sheaf_capacity)) {
+ local_unlock(&s->cpu_sheaves->lock);
+ return allocated;
+ }
+
if (pcs->spare && pcs->spare->size > 0) {
swap(pcs->main, pcs->spare);
goto do_alloc;
@@ -5241,8 +5256,7 @@ static __fastpath_inline void *slab_alloc_node(struct kmem_cache *s, struct list
if (unlikely(object))
goto out;

- if (s->cpu_sheaves)
- object = alloc_from_pcs(s, gfpflags, node);
+ object = alloc_from_pcs(s, gfpflags, node);

if (!object)
object = __slab_alloc_node(s, gfpflags, node, addr, orig_size);
@@ -6042,6 +6056,12 @@ __pcs_replace_full_main(struct kmem_cache *s, struct slub_percpu_sheaves *pcs)
restart:
lockdep_assert_held(this_cpu_ptr(&s->cpu_sheaves->lock));

+ /* Bootstrap or debug cache, back off */
+ if (unlikely(!s->sheaf_capacity)) {
+ local_unlock(&s->cpu_sheaves->lock);
+ return NULL;
+ }
+
barn = get_barn(s);
if (!barn) {
local_unlock(&s->cpu_sheaves->lock);
@@ -6240,6 +6260,12 @@ bool __kfree_rcu_sheaf(struct kmem_cache *s, void *obj)
struct slab_sheaf *empty;
struct node_barn *barn;

+ /* Bootstrap or debug cache, fall back */
+ if (unlikely(!s->sheaf_capacity)) {
+ local_unlock(&s->cpu_sheaves->lock);
+ goto fail;
+ }
+
if (pcs->spare && pcs->spare->size == 0) {
pcs->rcu_free = pcs->spare;
pcs->spare = NULL;
@@ -6364,6 +6390,9 @@ static void free_to_pcs_bulk(struct kmem_cache *s, size_t size, void **p)
if (likely(pcs->main->size < s->sheaf_capacity))
goto do_free;

+ if (unlikely(!s->sheaf_capacity))
+ goto no_empty;
+
barn = get_barn(s);
if (!barn)
goto no_empty;
@@ -6628,9 +6657,8 @@ void slab_free(struct kmem_cache *s, struct slab *slab, void *object,
if (unlikely(!slab_free_hook(s, object, slab_want_init_on_free(s), false)))
return;

- if (s->cpu_sheaves && likely(!IS_ENABLED(CONFIG_NUMA) ||
- slab_nid(slab) == numa_mem_id())
- && likely(!slab_test_pfmemalloc(slab))) {
+ if (likely(!IS_ENABLED(CONFIG_NUMA) || slab_nid(slab) == numa_mem_id())
+ && likely(!slab_test_pfmemalloc(slab))) {
if (likely(free_to_pcs(s, object)))
return;
}
@@ -7437,8 +7465,7 @@ int kmem_cache_alloc_bulk_noprof(struct kmem_cache *s, gfp_t flags, size_t size,
size--;
}

- if (s->cpu_sheaves)
- i = alloc_from_pcs_bulk(s, size, p);
+ i = alloc_from_pcs_bulk(s, size, p);

if (i < size) {
/*
@@ -7649,6 +7676,7 @@ static inline int alloc_kmem_cache_cpus(struct kmem_cache *s)

static int init_percpu_sheaves(struct kmem_cache *s)
{
+ static struct slab_sheaf bootstrap_sheaf = {};
int cpu;

for_each_possible_cpu(cpu) {
@@ -7658,7 +7686,28 @@ static int init_percpu_sheaves(struct kmem_cache *s)

local_trylock_init(&pcs->lock);

- pcs->main = alloc_empty_sheaf(s, GFP_KERNEL);
+ /*
+ * Bootstrap sheaf has zero size so fast-path allocation fails.
+ * It has also size == s->sheaf_capacity, so fast-path free
+ * fails. In the slow paths we recognize the situation by
+ * checking s->sheaf_capacity. This allows fast paths to assume
+ * s->cpu_sheaves and pcs->main always exist and are valid.
+ * It's also safe to share the single static bootstrap_sheaf
+ * with zero-sized objects array as it's never modified.
+ *
+ * bootstrap_sheaf also has NULL pointer to kmem_cache so we
+ * recognize it and not attempt to free it when destroying the
+ * cache
+ *
+ * We keep bootstrap_sheaf for kmem_cache and kmem_cache_node,
+ * caches with debug enabled, and all caches with SLUB_TINY.
+ * For kmalloc caches it's used temporarily during the initial
+ * bootstrap.
+ */
+ if (!s->sheaf_capacity)
+ pcs->main = &bootstrap_sheaf;
+ else
+ pcs->main = alloc_empty_sheaf(s, GFP_KERNEL);

if (!pcs->main)
return -ENOMEM;
@@ -7733,8 +7782,7 @@ static void free_kmem_cache_nodes(struct kmem_cache *s)
void __kmem_cache_release(struct kmem_cache *s)
{
cache_random_seq_destroy(s);
- if (s->cpu_sheaves)
- pcs_destroy(s);
+ pcs_destroy(s);
#ifdef CONFIG_PREEMPT_RT
if (s->cpu_slab)
lockdep_unregister_key(&s->lock_key);
@@ -7756,7 +7804,7 @@ static int init_kmem_cache_nodes(struct kmem_cache *s)
continue;
}

- if (s->cpu_sheaves) {
+ if (s->sheaf_capacity) {
barn = kmalloc_node(sizeof(*barn), GFP_KERNEL, node);

if (!barn)
@@ -8074,7 +8122,7 @@ int __kmem_cache_shutdown(struct kmem_cache *s)
flush_all_cpus_locked(s);

/* we might have rcu sheaves in flight */
- if (s->cpu_sheaves)
+ if (s->sheaf_capacity)
rcu_barrier();

/* Attempt to free all objects */
@@ -8375,7 +8423,7 @@ static int slab_mem_going_online_callback(int nid)
if (get_node(s, nid))
continue;

- if (s->cpu_sheaves) {
+ if (s->sheaf_capacity) {
barn = kmalloc_node(sizeof(*barn), GFP_KERNEL, nid);

if (!barn) {
@@ -8608,12 +8656,10 @@ int do_kmem_cache_create(struct kmem_cache *s, const char *name,

set_cpu_partial(s);

- if (s->sheaf_capacity) {
- s->cpu_sheaves = alloc_percpu(struct slub_percpu_sheaves);
- if (!s->cpu_sheaves) {
- err = -ENOMEM;
- goto out;
- }
+ s->cpu_sheaves = alloc_percpu(struct slub_percpu_sheaves);
+ if (!s->cpu_sheaves) {
+ err = -ENOMEM;
+ goto out;
}

#ifdef CONFIG_NUMA
@@ -8632,11 +8678,9 @@ int do_kmem_cache_create(struct kmem_cache *s, const char *name,
if (!alloc_kmem_cache_cpus(s))
goto out;

- if (s->cpu_sheaves) {
- err = init_percpu_sheaves(s);
- if (err)
- goto out;
- }
+ err = init_percpu_sheaves(s);
+ if (err)
+ goto out;

err = 0;


--
2.51.1

Vlastimil Babka

Oct 23, 2025, 9:53:28 AM
to Andrew Morton, Christoph Lameter, David Rientjes, Roman Gushchin, Harry Yoo, Uladzislau Rezki, Liam R. Howlett, Suren Baghdasaryan, Sebastian Andrzej Siewior, Alexei Starovoitov, linu...@kvack.org, linux-...@vger.kernel.org, linux-r...@lists.linux.dev, b...@vger.kernel.org, kasa...@googlegroups.com, Vlastimil Babka
The changes related to sheaves made the description of locking and other
details outdated. Update it to reflect the current state.

Also add a new copyright line due to major changes.

Signed-off-by: Vlastimil Babka <vba...@suse.cz>
---
mm/slub.c | 141 +++++++++++++++++++++++++++++---------------------------------
1 file changed, 67 insertions(+), 74 deletions(-)

diff --git a/mm/slub.c b/mm/slub.c
index 4e003493ba60..515a2b59cb52 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -1,13 +1,15 @@
// SPDX-License-Identifier: GPL-2.0
/*
- * SLUB: A slab allocator that limits cache line use instead of queuing
- * objects in per cpu and per node lists.
+ * SLUB: A slab allocator with low overhead percpu array caches and mostly
+ * lockless freeing of objects to slabs in the slowpath.
*
- * The allocator synchronizes using per slab locks or atomic operations
- * and only uses a centralized lock to manage a pool of partial slabs.
+ * The allocator synchronizes using spin_trylock for percpu arrays in the
+ * fastpath, and cmpxchg_double (or bit spinlock) for slowpath freeing.
+ * Uses a centralized lock to manage a pool of partial slabs.
*
* (C) 2007 SGI, Christoph Lameter
* (C) 2011 Linux Foundation, Christoph Lameter
+ * (C) 2025 SUSE, Vlastimil Babka
*/

#include <linux/mm.h>
@@ -53,11 +55,13 @@

/*
* Lock order:
- * 1. slab_mutex (Global Mutex)
- * 2. node->list_lock (Spinlock)
- * 3. kmem_cache->cpu_slab->lock (Local lock)
- * 4. slab_lock(slab) (Only on some arches)
- * 5. object_map_lock (Only for debugging)
+ * 0. cpu_hotplug_lock
+ * 1. slab_mutex (Global Mutex)
+ * 2a. kmem_cache->cpu_sheaves->lock (Local trylock)
+ * 2b. node->barn->lock (Spinlock)
+ * 2c. node->list_lock (Spinlock)
+ * 3. slab_lock(slab) (Only on some arches)
+ * 4. object_map_lock (Only for debugging)
*
* slab_mutex
*
@@ -78,31 +82,38 @@
* C. slab->objects -> Number of objects in slab
* D. slab->frozen -> frozen state
*
- * Frozen slabs
+ * SL_partial slabs
+ *
+ * Slabs on node partial list have at least one free object. A limited number
+ * of slabs on the list can be fully free (slab->inuse == 0), until we start
+ * discarding them. These slabs are marked with SL_partial, and the flag is
+ * cleared while removing them, usually to grab their freelist afterwards.
+ * This clearing also exempts them from list management. Please see
+ * __slab_free() for more details.
*
- * If a slab is frozen then it is exempt from list management. It is
- * the cpu slab which is actively allocated from by the processor that
- * froze it and it is not on any list. The processor that froze the
- * slab is the one who can perform list operations on the slab. Other
- * processors may put objects onto the freelist but the processor that
- * froze the slab is the only one that can retrieve the objects from the
- * slab's freelist.
+ * Full slabs
*
- * CPU partial slabs
+ * For caches without debugging enabled, full slabs (slab->inuse ==
+ * slab->objects and slab->freelist == NULL) are not placed on any list.
+ * The __slab_free() freeing the first object from such a slab will place
+ * it on the partial list. Caches with debugging enabled place such slab
+ * on the full list and use different allocation and freeing paths.
+ *
+ * Frozen slabs
*
- * The partially empty slabs cached on the CPU partial list are used
- * for performance reasons, which speeds up the allocation process.
- * These slabs are not frozen, but are also exempt from list management,
- * by clearing the SL_partial flag when moving out of the node
- * partial list. Please see __slab_free() for more details.
+ * If a slab is frozen then it is exempt from list management. It is used to
+ * indicate a slab that has failed consistency checks and thus cannot be
+ * allocated from anymore - it is also marked as full. Any previously
+ * allocated objects will be simply leaked upon freeing instead of attempting
+ * to modify the potentially corrupted freelist and metadata.
*
* To sum up, the current scheme is:
- * - node partial slab: SL_partial && !frozen
- * - cpu partial slab: !SL_partial && !frozen
- * - cpu slab: !SL_partial && frozen
- * - full slab: !SL_partial && !frozen
+ * - node partial slab: SL_partial && !full && !frozen
+ * - taken off partial list: !SL_partial && !full && !frozen
+ * - full slab, not on any list: !SL_partial && full && !frozen
+ * - frozen due to inconsistency: !SL_partial && full && frozen
*
- * list_lock
+ * node->list_lock (spinlock)
*
* The list_lock protects the partial and full list on each node and
* the partial slab counter. If taken then no new slabs may be added or
@@ -112,47 +123,46 @@
*
* The list_lock is a centralized lock and thus we avoid taking it as
* much as possible. As long as SLUB does not have to handle partial
- * slabs, operations can continue without any centralized lock. F.e.
- * allocating a long series of objects that fill up slabs does not require
- * the list lock.
+ * slabs, operations can continue without any centralized lock.
*
* For debug caches, all allocations are forced to go through a list_lock
* protected region to serialize against concurrent validation.
*
- * cpu_slab->lock local lock
+ * cpu_sheaves->lock (local_trylock)
*
- * This locks protect slowpath manipulation of all kmem_cache_cpu fields
- * except the stat counters. This is a percpu structure manipulated only by
- * the local cpu, so the lock protects against being preempted or interrupted
- * by an irq. Fast path operations rely on lockless operations instead.
+ * This lock protects fastpath operations on the percpu sheaves. On !RT it
+ * only disables preemption and does no atomic operations. As long as the main
+ * or spare sheaf can handle the allocation or free, there is no other
+ * overhead.
*
- * On PREEMPT_RT, the local lock neither disables interrupts nor preemption
- * which means the lockless fastpath cannot be used as it might interfere with
- * an in-progress slow path operations. In this case the local lock is always
- * taken but it still utilizes the freelist for the common operations.
+ * node->barn->lock (spinlock)
*
- * lockless fastpaths
+ * This lock protects the operations on per-NUMA-node barn. It can quickly
+ * serve an empty or full sheaf if available, and avoid more expensive refill
+ * or flush operation.
*
- * The fast path allocation (slab_alloc_node()) and freeing (do_slab_free())
- * are fully lockless when satisfied from the percpu slab (and when
- * cmpxchg_double is possible to use, otherwise slab_lock is taken).
- * They also don't disable preemption or migration or irqs. They rely on
- * the transaction id (tid) field to detect being preempted or moved to
- * another cpu.
+ * Lockless freeing
+ *
+ * Objects may have to be freed to their slabs when they are from a remote
+ * node (where we want to avoid filling local sheaves with remote objects)
+ * or when there are too many full sheaves. On architectures supporting
+ * cmpxchg_double this is done by a lockless update of slab's freelist and
+ * counters, otherwise slab_lock is taken. This only needs to take the
+ * list_lock if it's a first free to a full slab, or when there are too many
+ * fully free slabs and some need to be discarded.
*
* irq, preemption, migration considerations
*
- * Interrupts are disabled as part of list_lock or local_lock operations, or
+ * Interrupts are disabled as part of list_lock or barn lock operations, or
* around the slab_lock operation, in order to make the slab allocator safe
* to use in the context of an irq.
+ * Preemption is disabled as part of local_trylock operations.
+ * kmalloc_nolock() and kfree_nolock() are safe in NMI context but see
+ * their limitations.
*
- * In addition, preemption (or migration on PREEMPT_RT) is disabled in the
- * allocation slowpath, bulk allocation, and put_cpu_partial(), so that the
- * local cpu doesn't change in the process and e.g. the kmem_cache_cpu pointer
- * doesn't have to be revalidated in each section protected by the local lock.
- *
- * SLUB assigns one slab for allocation to each processor.
- * Allocations only occur from these slabs called cpu slabs.
+ * SLUB assigns two object arrays called sheaves for caching allocation and
+ * frees on each cpu, with a NUMA node shared barn for balancing between cpus.
+ * Allocations and frees are primarily served from these sheaves.
*
* Slabs with free elements are kept on a partial list and during regular
* operations no list for full slabs is used. If an object in a full slab is
@@ -160,25 +170,8 @@
* We track full slabs for debugging purposes though because otherwise we
* cannot scan all objects.
*
- * Slabs are freed when they become empty. Teardown and setup is
- * minimal so we rely on the page allocators per cpu caches for
- * fast frees and allocs.
- *
- * slab->frozen The slab is frozen and exempt from list processing.
- * This means that the slab is dedicated to a purpose
- * such as satisfying allocations for a specific
- * processor. Objects may be freed in the slab while
- * it is frozen but slab_free will then skip the usual
- * list operations. It is up to the processor holding
- * the slab to integrate the slab into the slab lists
- * when the slab is no longer needed.
- *
- * One use of this flag is to mark slabs that are
- * used for allocations. Then such a slab becomes a cpu
- * slab. The cpu slab may be equipped with an additional
- * freelist that allows lockless access to
- * free objects in addition to the regular freelist
- * that requires the slab lock.
+ * Slabs are freed when they become empty. Teardown and setup is minimal so we
+ * rely on the page allocators per cpu caches for fast frees and allocs.
*
* SLAB_DEBUG_FLAGS Slab requires special handling due to debug
* options set. This moves slab handling out of

--
2.51.1

Vlastimil Babka

Oct 23, 2025, 9:53:29 AM
to Andrew Morton, Christoph Lameter, David Rientjes, Roman Gushchin, Harry Yoo, Uladzislau Rezki, Liam R. Howlett, Suren Baghdasaryan, Sebastian Andrzej Siewior, Alexei Starovoitov, linu...@kvack.org, linux-...@vger.kernel.org, linux-r...@lists.linux.dev, b...@vger.kernel.org, kasa...@googlegroups.com, Vlastimil Babka
At this point we have sheaves enabled for all caches, but their refill
is done via __kmem_cache_alloc_bulk() which relies on cpu (partial)
slabs - now a redundant caching layer that we are about to remove.

The refill will thus be done from slabs on the node partial list.
Introduce new functions that can do that in an optimized way, as that's
easier than modifying the __kmem_cache_alloc_bulk() call chain.

Extend struct partial_context so it can return a list of slabs taken
from the partial list, with the sum of their free objects within the
requested min and max.

Introduce get_partial_node_bulk() that removes the slabs from the node
partial list and returns them in that list.

Introduce get_freelist_nofreeze() which grabs the freelist without
freezing the slab.

Introduce __refill_objects() that uses the functions above to fill an
array of objects. It has to handle the possibility that the slabs will
contain more objects than were requested, due to concurrent freeing of
objects to those slabs. When no more slabs on partial lists are
available, it will allocate new slabs.

Finally, switch refill_sheaf() to use __refill_objects().
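
Putting the pieces together, the refill roughly proceeds as follows (a
simplified outline of the code below):

	/*
	 * refill_sheaf(s, sheaf, gfp)
	 *   __refill_objects(s, objects, gfp, min, max)
	 *     get_partial_node_bulk()   detach local partial slabs whose free
	 *                               objects add up to roughly min..max
	 *     for each detached slab:
	 *       get_freelist_nofreeze() atomically take the whole freelist and
	 *                               mark all its objects in use, without
	 *                               freezing the slab
	 *       copy objects into the array, up to max; any surplus gained
	 *       from concurrent frees is handled within __refill_objects()
	 *     if still below min:       allocate fresh slab(s) via new_slab()
	 */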

Signed-off-by: Vlastimil Babka <vba...@suse.cz>
---
mm/slub.c | 235 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++--
1 file changed, 230 insertions(+), 5 deletions(-)

diff --git a/mm/slub.c b/mm/slub.c
index a84027fbca78..e2b052657d11 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -246,6 +246,9 @@ struct partial_context {
gfp_t flags;
unsigned int orig_size;
void *object;
+ unsigned int min_objects;
+ unsigned int max_objects;
+ struct list_head slabs;
};

static inline bool kmem_cache_debug(struct kmem_cache *s)
@@ -2633,9 +2636,9 @@ static void free_empty_sheaf(struct kmem_cache *s, struct slab_sheaf *sheaf)
stat(s, SHEAF_FREE);
}

-static int __kmem_cache_alloc_bulk(struct kmem_cache *s, gfp_t flags,
- size_t size, void **p);
-
+static unsigned int
+__refill_objects(struct kmem_cache *s, void **p, gfp_t gfp, unsigned int min,
+ unsigned int max);

static int refill_sheaf(struct kmem_cache *s, struct slab_sheaf *sheaf,
gfp_t gfp)
@@ -2646,8 +2649,8 @@ static int refill_sheaf(struct kmem_cache *s, struct slab_sheaf *sheaf,
if (!to_fill)
return 0;

- filled = __kmem_cache_alloc_bulk(s, gfp, to_fill,
- &sheaf->objects[sheaf->size]);
+ filled = __refill_objects(s, &sheaf->objects[sheaf->size], gfp,
+ to_fill, to_fill);

sheaf->size += filled;

@@ -3508,6 +3511,69 @@ static inline void put_cpu_partial(struct kmem_cache *s, struct slab *slab,
#endif
static inline bool pfmemalloc_match(struct slab *slab, gfp_t gfpflags);

+static bool get_partial_node_bulk(struct kmem_cache *s,
+ struct kmem_cache_node *n,
+ struct partial_context *pc)
+{
+ struct slab *slab, *slab2;
+ unsigned int total_free = 0;
+ unsigned long flags;
+
+ /*
+ * Racy check. If we mistakenly see no partial slabs then we
+ * just allocate an empty slab. If we mistakenly try to get a
+ * partial slab and there is none available then get_partial()
+ * will return NULL.
+ */
+ if (!n || !n->nr_partial)
+ return false;
+
+ INIT_LIST_HEAD(&pc->slabs);
+
+ if (gfpflags_allow_spinning(pc->flags))
+ spin_lock_irqsave(&n->list_lock, flags);
+ else if (!spin_trylock_irqsave(&n->list_lock, flags))
+ return false;
+
+ list_for_each_entry_safe(slab, slab2, &n->partial, slab_list) {
+ struct slab slab_counters;
+ unsigned int slab_free;
+
+ if (!pfmemalloc_match(slab, pc->flags))
+ continue;
+
+ /*
+ * due to atomic updates done by a racing free we should not
+ * read garbage here, but do a sanity check anyway
+ *
+ * slab_free is a lower bound due to subsequent concurrent
+ * freeing, the caller might get more objects than requested and
+ * must deal with it
+ */
+ slab_counters.counters = data_race(READ_ONCE(slab->counters));
+ slab_free = slab_counters.objects - slab_counters.inuse;
+
+ if (unlikely(slab_free > oo_objects(s->oo)))
+ continue;
+
+ /* we already have min and this would get us over the max */
+ if (total_free >= pc->min_objects
+ && total_free + slab_free > pc->max_objects)
+ continue;
+
+ remove_partial(n, slab);
+
+ list_add(&slab->slab_list, &pc->slabs);
+
+ total_free += slab_free;
+ if (total_free >= pc->max_objects)
+ break;
+ }
+
+ spin_unlock_irqrestore(&n->list_lock, flags);
+ return total_free > 0;
+}
+
/*
* Try to allocate a partial slab from a specific node.
*/
@@ -4436,6 +4502,38 @@ static inline void *get_freelist(struct kmem_cache *s, struct slab *slab)
return freelist;
}

+/*
+ * Get the slab's freelist and do not freeze it.
+ *
+ * Assumes the slab is isolated from node partial list and not frozen.
+ *
+ * Assumes this is performed only for caches without debugging so we
+ * don't need to worry about adding the slab to the full list
+ */
+static inline void *get_freelist_nofreeze(struct kmem_cache *s, struct slab *slab)
+{
+ struct slab new;
+ unsigned long counters;
+ void *freelist;
+
+ do {
+ freelist = slab->freelist;
+ counters = slab->counters;
+
+ new.counters = counters;
+ VM_BUG_ON(new.frozen);
+
+ new.inuse = slab->objects;
+ new.frozen = 0;
+
+ } while (!slab_update_freelist(s, slab,
+ freelist, counters,
+ NULL, new.counters,
+ "get_freelist_nofreeze"));
+
+ return freelist;
+}
+
/*
* Freeze the partial slab and return the pointer to the freelist.
*/
@@ -5373,6 +5471,9 @@ static int __prefill_sheaf_pfmemalloc(struct kmem_cache *s,
return ret;
}

+static int __kmem_cache_alloc_bulk(struct kmem_cache *s, gfp_t flags,
+ size_t size, void **p);
+
/*
* returns a sheaf that has at least the requested size
* when prefilling is needed, do so with given gfp flags
@@ -7409,6 +7510,130 @@ void kmem_cache_free_bulk(struct kmem_cache *s, size_t size, void **p)
}
EXPORT_SYMBOL(kmem_cache_free_bulk);

+static unsigned int
+__refill_objects(struct kmem_cache *s, void **p, gfp_t gfp, unsigned int min,
+ unsigned int max)
+{
+ struct slab *slab, *slab2;
+ struct partial_context pc;
+ unsigned int refilled = 0;
+ unsigned long flags;
+ void *object;
+ int node;
+
+ pc.flags = gfp;
+ pc.min_objects = min;
+ pc.max_objects = max;
+
+ node = numa_mem_id();
+
+ /* TODO: consider also other nodes? */
+ if (!get_partial_node_bulk(s, get_node(s, node), &pc))
+ goto new_slab;
+
+ list_for_each_entry_safe(slab, slab2, &pc.slabs, slab_list) {
+
+ list_del(&slab->slab_list);
+
+ object = get_freelist_nofreeze(s, slab);
+
+ while (object && refilled < max) {
+ p[refilled] = object;
+ object = get_freepointer(s, object);
+ maybe_wipe_obj_freeptr(s, p[refilled]);
+
+ refilled++;
+ }
+
+ /*
+ * Freelist had more objects than we can accommodate, we need to
+ * free them back. We can treat it like a detached freelist, just
+ * need to find the tail object.
+ */
+ if (unlikely(object)) {
+ void *head = object;
+ void *tail;
+ int cnt = 0;
+
+ do {
+ tail = object;
+ cnt++;
+ object = get_freepointer(s, object);
+ } while (object);
+ do_slab_free(s, slab, head, tail, cnt, _RET_IP_);
+ }
+
+ if (refilled >= max)
+ break;
+ }
+
+ if (unlikely(!list_empty(&pc.slabs))) {
+ struct kmem_cache_node *n = get_node(s, node);
+
+ spin_lock_irqsave(&n->list_lock, flags);
+
+ list_for_each_entry_safe(slab, slab2, &pc.slabs, slab_list) {
+
+ if (unlikely(!slab->inuse && n->nr_partial >= s->min_partial))
+ continue;
+
+ list_del(&slab->slab_list);
+ add_partial(n, slab, DEACTIVATE_TO_HEAD);
+ }
+
+ spin_unlock_irqrestore(&n->list_lock, flags);
+
+ /* any slabs left are completely free and for discard */
+ list_for_each_entry_safe(slab, slab2, &pc.slabs, slab_list) {
+
+ list_del(&slab->slab_list);
+ discard_slab(s, slab);
+ }
+ }
+
+
+ if (likely(refilled >= min))
+ goto out;
+
+new_slab:
+
+ slab = new_slab(s, pc.flags, node);
+ if (!slab)
+ goto out;
+
+ stat(s, ALLOC_SLAB);
+ inc_slabs_node(s, slab_nid(slab), slab->objects);
+
+ /*
+ * TODO: possible optimization - if we know we will consume the whole
+ * slab we might skip creating the freelist?
+ */
+ object = slab->freelist;
+ while (object && refilled < max) {
+ p[refilled] = object;
+ object = get_freepointer(s, object);
+ maybe_wipe_obj_freeptr(s, p[refilled]);
+
+ slab->inuse++;
+ refilled++;
+ }
+ slab->freelist = object;
+
+ if (slab->freelist) {
+ struct kmem_cache_node *n = get_node(s, slab_nid(slab));
+
+ spin_lock_irqsave(&n->list_lock, flags);
+ add_partial(n, slab, DEACTIVATE_TO_HEAD);
+ spin_unlock_irqrestore(&n->list_lock, flags);
+ }
+
+ if (refilled < min)
+ goto new_slab;
+out:
+
+ return refilled;
+}
+
static inline
int __kmem_cache_alloc_bulk(struct kmem_cache *s, gfp_t flags, size_t size,
void **p)

--
2.51.1

Vlastimil Babka

Oct 23, 2025, 9:53:31 AM
to Andrew Morton, Christoph Lameter, David Rientjes, Roman Gushchin, Harry Yoo, Uladzislau Rezki, Liam R. Howlett, Suren Baghdasaryan, Sebastian Andrzej Siewior, Alexei Starovoitov, linu...@kvack.org, linux-...@vger.kernel.org, linux-r...@lists.linux.dev, b...@vger.kernel.org, kasa...@googlegroups.com, Vlastimil Babka
Enable sheaves for kmalloc caches. For other types than KMALLOC_NORMAL,
we can simply allow them in calculate_sizes() as they are created later
than KMALLOC_NORMAL caches and can allocate sheaves and barns from
those.

For KMALLOC_NORMAL caches we perform an additional step after first
creating them without sheaves. Then bootstrap_cache_sheaves() simply
allocates and initializes barns and sheaves and finally sets
s->sheaf_capacity to make them actually used.

Afterwards the only caches left without sheaves (unless SLUB_TINY or
debugging is enabled) are kmem_cache and kmem_cache_node. These are only
used when creating or destroying other kmem_caches. Thus they are not
performance critical and we can simply leave it that way.

Signed-off-by: Vlastimil Babka <vba...@suse.cz>
---
mm/slub.c | 88 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++---
1 file changed, 84 insertions(+), 4 deletions(-)

diff --git a/mm/slub.c b/mm/slub.c
index 5d0b2cf66520..a84027fbca78 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -2588,7 +2588,8 @@ static void *setup_object(struct kmem_cache *s, void *object)
return object;
}

-static struct slab_sheaf *alloc_empty_sheaf(struct kmem_cache *s, gfp_t gfp)
+static struct slab_sheaf *__alloc_empty_sheaf(struct kmem_cache *s, gfp_t gfp,
+ unsigned int capacity)
{
struct slab_sheaf *sheaf;
size_t sheaf_size;
@@ -2606,7 +2607,7 @@ static struct slab_sheaf *alloc_empty_sheaf(struct kmem_cache *s, gfp_t gfp)
if (s->flags & SLAB_KMALLOC)
gfp |= __GFP_NO_OBJ_EXT;

- sheaf_size = struct_size(sheaf, objects, s->sheaf_capacity);
+ sheaf_size = struct_size(sheaf, objects, capacity);
sheaf = kzalloc(sheaf_size, gfp);

if (unlikely(!sheaf))
@@ -2619,6 +2620,12 @@ static struct slab_sheaf *alloc_empty_sheaf(struct kmem_cache *s, gfp_t gfp)
return sheaf;
}

+static inline struct slab_sheaf *alloc_empty_sheaf(struct kmem_cache *s,
+ gfp_t gfp)
+{
+ return __alloc_empty_sheaf(s, gfp, s->sheaf_capacity);
+}
+
static void free_empty_sheaf(struct kmem_cache *s, struct slab_sheaf *sheaf)
{
kfree(sheaf);
@@ -8064,8 +8071,11 @@ static int calculate_sizes(struct kmem_cache_args *args, struct kmem_cache *s)
if (s->flags & SLAB_RECLAIM_ACCOUNT)
s->allocflags |= __GFP_RECLAIMABLE;

- /* kmalloc caches need extra care to support sheaves */
- if (!is_kmalloc_cache(s))
+ /*
+ * For KMALLOC_NORMAL caches we enable sheaves later by
+ * bootstrap_kmalloc_sheaves() to avoid recursion
+ */
+ if (!is_kmalloc_normal(s))
s->sheaf_capacity = calculate_sheaf_capacity(s, args);

/*
@@ -8549,6 +8559,74 @@ static struct kmem_cache * __init bootstrap(struct kmem_cache *static_cache)
return s;
}

+/*
+ * Finish the sheaves initialization done normally by init_percpu_sheaves() and
+ * init_kmem_cache_nodes(). For normal kmalloc caches we have to bootstrap it
+ * since sheaves and barns are allocated by kmalloc.
+ */
+static void __init bootstrap_cache_sheaves(struct kmem_cache *s)
+{
+ struct kmem_cache_args empty_args = {};
+ unsigned int capacity;
+ bool failed = false;
+ int node, cpu;
+
+ capacity = calculate_sheaf_capacity(s, &empty_args);
+
+ /* capacity can be 0 due to debugging or SLUB_TINY */
+ if (!capacity)
+ return;
+
+ for_each_node_mask(node, slab_nodes) {
+ struct node_barn *barn;
+
+ barn = kmalloc_node(sizeof(*barn), GFP_KERNEL, node);
+
+ if (!barn) {
+ failed = true;
+ goto out;
+ }
+
+ barn_init(barn);
+ get_node(s, node)->barn = barn;
+ }
+
+ for_each_possible_cpu(cpu) {
+ struct slub_percpu_sheaves *pcs;
+
+ pcs = per_cpu_ptr(s->cpu_sheaves, cpu);
+
+ pcs->main = __alloc_empty_sheaf(s, GFP_KERNEL, capacity);
+
+ if (!pcs->main) {
+ failed = true;
+ break;
+ }
+ }
+
+out:
+ /*
+ * It's still early in boot so treat this the same as a failure to
+ * create the kmalloc cache in the first place
+ */
+ if (failed)
+ panic("Out of memory when creating kmem_cache %s\n", s->name);
+
+ s->sheaf_capacity = capacity;
+}
+
+static void __init bootstrap_kmalloc_sheaves(void)
+{
+ enum kmalloc_cache_type type;
+
+ for (type = KMALLOC_NORMAL; type <= KMALLOC_RANDOM_END; type++) {
+ for (int idx = 0; idx < KMALLOC_SHIFT_HIGH + 1; idx++) {
+ if (kmalloc_caches[type][idx])
+ bootstrap_cache_sheaves(kmalloc_caches[type][idx]);
+ }
+ }
+}
+
void __init kmem_cache_init(void)
{
static __initdata struct kmem_cache boot_kmem_cache,
@@ -8592,6 +8670,8 @@ void __init kmem_cache_init(void)
setup_kmalloc_cache_index_table();
create_kmalloc_caches();

+ bootstrap_kmalloc_sheaves();
+
/* Setup random freelists for each cache */
init_freelist_randomization();


--
2.51.1

Vlastimil Babka

Oct 23, 2025, 9:53:33 AM
to Andrew Morton, Christoph Lameter, David Rientjes, Roman Gushchin, Harry Yoo, Uladzislau Rezki, Liam R. Howlett, Suren Baghdasaryan, Sebastian Andrzej Siewior, Alexei Starovoitov, linu...@kvack.org, linux-...@vger.kernel.org, linux-r...@lists.linux.dev, b...@vger.kernel.org, kasa...@googlegroups.com, Vlastimil Babka
We now rely on sheaves as the percpu caching layer and can refill them
directly from partial or newly allocated slabs. Start removing the cpu
(partial) slabs code, first from allocation paths.

This means that any allocation not satisfied from percpu sheaves will
end up in ___slab_alloc(), where we remove the usage of cpu (partial)
slabs, so it will only perform get_partial() or new_slab().

In get_partial_node() we used to return a slab to be frozen as the cpu
slab, and to refill the cpu partial list. Now we only want to return a single
object and leave the slab on the list (unless it became full). We can't
simply reuse alloc_single_from_partial() as that assumes freeing uses
free_to_partial_list(). Instead we need to use __slab_update_freelist()
to work properly against a racing __slab_free().

The rest of the changes removes functions that no longer have any
callers.

Signed-off-by: Vlastimil Babka <vba...@suse.cz>
---
mm/slub.c | 614 ++++++++------------------------------------------------------
1 file changed, 71 insertions(+), 543 deletions(-)

diff --git a/mm/slub.c b/mm/slub.c
index e2b052657d11..bd67336e7c1f 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -245,7 +245,6 @@ static DEFINE_STATIC_KEY_FALSE(strict_numa);
struct partial_context {
gfp_t flags;
unsigned int orig_size;
- void *object;
unsigned int min_objects;
unsigned int max_objects;
struct list_head slabs;
@@ -598,36 +597,6 @@ static inline void *get_freepointer(struct kmem_cache *s, void *object)
return freelist_ptr_decode(s, p, ptr_addr);
}

-static void prefetch_freepointer(const struct kmem_cache *s, void *object)
-{
- prefetchw(object + s->offset);
-}
-
-/*
- * When running under KMSAN, get_freepointer_safe() may return an uninitialized
- * pointer value in the case the current thread loses the race for the next
- * memory chunk in the freelist. In that case this_cpu_cmpxchg_double() in
- * slab_alloc_node() will fail, so the uninitialized value won't be used, but
- * KMSAN will still check all arguments of cmpxchg because of imperfect
- * handling of inline assembly.
- * To work around this problem, we apply __no_kmsan_checks to ensure that
- * get_freepointer_safe() returns initialized memory.
- */
-__no_kmsan_checks
-static inline void *get_freepointer_safe(struct kmem_cache *s, void *object)
-{
- unsigned long freepointer_addr;
- freeptr_t p;
-
- if (!debug_pagealloc_enabled_static())
- return get_freepointer(s, object);
-
- object = kasan_reset_tag(object);
- freepointer_addr = (unsigned long)object + s->offset;
- copy_from_kernel_nofault(&p, (freeptr_t *)freepointer_addr, sizeof(p));
- return freelist_ptr_decode(s, p, freepointer_addr);
-}
-
static inline void set_freepointer(struct kmem_cache *s, void *object, void *fp)
{
unsigned long freeptr_addr = (unsigned long)object + s->offset;
@@ -707,23 +676,11 @@ static void slub_set_cpu_partial(struct kmem_cache *s, unsigned int nr_objects)
nr_slabs = DIV_ROUND_UP(nr_objects * 2, oo_objects(s->oo));
s->cpu_partial_slabs = nr_slabs;
}
-
-static inline unsigned int slub_get_cpu_partial(struct kmem_cache *s)
-{
- return s->cpu_partial_slabs;
-}
-#else
-#ifdef SLAB_SUPPORTS_SYSFS
+#elif defined(SLAB_SUPPORTS_SYSFS)
static inline void
slub_set_cpu_partial(struct kmem_cache *s, unsigned int nr_objects)
{
}
-#endif
-
-static inline unsigned int slub_get_cpu_partial(struct kmem_cache *s)
-{
- return 0;
-}
#endif /* CONFIG_SLUB_CPU_PARTIAL */

/*
@@ -1075,7 +1032,7 @@ static void set_track_update(struct kmem_cache *s, void *object,
p->handle = handle;
#endif
p->addr = addr;
- p->cpu = smp_processor_id();
+ p->cpu = raw_smp_processor_id();
p->pid = current->pid;
p->when = jiffies;
}
@@ -3575,15 +3532,15 @@ static bool get_partial_node_bulk(struct kmem_cache *s,
}

/*
- * Try to allocate a partial slab from a specific node.
+ * Try to allocate an object from a partial slab on a specific node.
*/
-static struct slab *get_partial_node(struct kmem_cache *s,
- struct kmem_cache_node *n,
- struct partial_context *pc)
+static void *get_partial_node(struct kmem_cache *s,
+ struct kmem_cache_node *n,
+ struct partial_context *pc)
{
- struct slab *slab, *slab2, *partial = NULL;
+ struct slab *slab, *slab2;
unsigned long flags;
- unsigned int partial_slabs = 0;
+ void *object;

/*
* Racy check. If we mistakenly see no partial slabs then we
@@ -3599,54 +3556,54 @@ static struct slab *get_partial_node(struct kmem_cache *s,
else if (!spin_trylock_irqsave(&n->list_lock, flags))
return NULL;
list_for_each_entry_safe(slab, slab2, &n->partial, slab_list) {
+
+ unsigned long counters;
+ struct slab new;
+
if (!pfmemalloc_match(slab, pc->flags))
continue;

if (IS_ENABLED(CONFIG_SLUB_TINY) || kmem_cache_debug(s)) {
- void *object = alloc_single_from_partial(s, n, slab,
+ object = alloc_single_from_partial(s, n, slab,
pc->orig_size);
- if (object) {
- partial = slab;
- pc->object = object;
+ if (object)
break;
- }
continue;
}

- remove_partial(n, slab);
-
- if (!partial) {
- partial = slab;
- stat(s, ALLOC_FROM_PARTIAL);
-
- if ((slub_get_cpu_partial(s) == 0)) {
- break;
- }
- } else {
- put_cpu_partial(s, slab, 0);
- stat(s, CPU_PARTIAL_NODE);
-
- if (++partial_slabs > slub_get_cpu_partial(s) / 2) {
- break;
- }
- }
+ /*
+ * get a single object from the slab. This might race against
+ * __slab_free(), which however has to take the list_lock if
+ * it's about to make the slab fully free.
+ */
+ do {
+ object = slab->freelist;
+ counters = slab->counters;
+ new.freelist = get_freepointer(s, object);
+ new.counters = counters;
+ new.inuse++;
+ } while (!__slab_update_freelist(s, slab,
+ object, counters,
+ new.freelist, new.counters,
+ "get_partial_node"));
+
+ if (!new.freelist)
+ remove_partial(n, slab);
}
spin_unlock_irqrestore(&n->list_lock, flags);
- return partial;
+ return object;
}

/*
- * Get a slab from somewhere. Search in increasing NUMA distances.
+ * Get an object from somewhere. Search in increasing NUMA distances.
*/
-static struct slab *get_any_partial(struct kmem_cache *s,
- struct partial_context *pc)
+static void *get_any_partial(struct kmem_cache *s, struct partial_context *pc)
{
#ifdef CONFIG_NUMA
struct zonelist *zonelist;
struct zoneref *z;
struct zone *zone;
enum zone_type highest_zoneidx = gfp_zone(pc->flags);
- struct slab *slab;
unsigned int cpuset_mems_cookie;

/*
@@ -3681,8 +3638,8 @@ static struct slab *get_any_partial(struct kmem_cache *s,

if (n && cpuset_zone_allowed(zone, pc->flags) &&
n->nr_partial > s->min_partial) {
- slab = get_partial_node(s, n, pc);
- if (slab) {
+ void *object = get_partial_node(s, n, pc);
+ if (object) {
/*
* Don't check read_mems_allowed_retry()
* here - if mems_allowed was updated in
@@ -3690,7 +3647,7 @@ static struct slab *get_any_partial(struct kmem_cache *s,
* between allocation and the cpuset
* update
*/
- return slab;
+ return object;
}
}
}
@@ -3700,20 +3657,20 @@ static struct slab *get_any_partial(struct kmem_cache *s,
}

/*
- * Get a partial slab, lock it and return it.
+ * Get an object from a partial slab
*/
-static struct slab *get_partial(struct kmem_cache *s, int node,
- struct partial_context *pc)
+static void *get_partial(struct kmem_cache *s, int node,
+ struct partial_context *pc)
{
- struct slab *slab;
int searchnode = node;
+ void *object;

if (node == NUMA_NO_NODE)
searchnode = numa_mem_id();

- slab = get_partial_node(s, get_node(s, searchnode), pc);
- if (slab || (node != NUMA_NO_NODE && (pc->flags & __GFP_THISNODE)))
- return slab;
+ object = get_partial_node(s, get_node(s, searchnode), pc);
+ if (object || (node != NUMA_NO_NODE && (pc->flags & __GFP_THISNODE)))
+ return object;

return get_any_partial(s, pc);
}
@@ -4272,19 +4229,6 @@ static int slub_cpu_dead(unsigned int cpu)
return 0;
}

-/*
- * Check if the objects in a per cpu structure fit numa
- * locality expectations.
- */
-static inline int node_match(struct slab *slab, int node)
-{
-#ifdef CONFIG_NUMA
- if (node != NUMA_NO_NODE && slab_nid(slab) != node)
- return 0;
-#endif
- return 1;
-}
-
#ifdef CONFIG_SLUB_DEBUG
static int count_free(struct slab *slab)
{
@@ -4469,39 +4413,6 @@ __update_cpu_freelist_fast(struct kmem_cache *s,
&old.full, new.full);
}

-/*
- * Check the slab->freelist and either transfer the freelist to the
- * per cpu freelist or deactivate the slab.
- *
- * The slab is still frozen if the return value is not NULL.
- *
- * If this function returns NULL then the slab has been unfrozen.
- */
-static inline void *get_freelist(struct kmem_cache *s, struct slab *slab)
-{
- struct slab new;
- unsigned long counters;
- void *freelist;
-
- lockdep_assert_held(this_cpu_ptr(&s->cpu_slab->lock));
-
- do {
- freelist = slab->freelist;
- counters = slab->counters;
-
- new.counters = counters;
-
- new.inuse = slab->objects;
- new.frozen = freelist != NULL;
-
- } while (!__slab_update_freelist(s, slab,
- freelist, counters,
- NULL, new.counters,
- "get_freelist"));
-
- return freelist;
-}
-
/*
* Get the slab's freelist and do not freeze it.
*
@@ -4535,197 +4446,23 @@ static inline void *get_freelist_nofreeze(struct kmem_cache *s, struct slab *sla
}

/*
- * Freeze the partial slab and return the pointer to the freelist.
- */
-static inline void *freeze_slab(struct kmem_cache *s, struct slab *slab)
-{
- struct slab new;
- unsigned long counters;
- void *freelist;
-
- do {
- freelist = slab->freelist;
- counters = slab->counters;
-
- new.counters = counters;
- VM_BUG_ON(new.frozen);
-
- new.inuse = slab->objects;
- new.frozen = 1;
-
- } while (!slab_update_freelist(s, slab,
- freelist, counters,
- NULL, new.counters,
- "freeze_slab"));
-
- return freelist;
-}
-
-/*
- * Slow path. The lockless freelist is empty or we need to perform
- * debugging duties.
- *
- * Processing is still very fast if new objects have been freed to the
- * regular freelist. In that case we simply take over the regular freelist
- * as the lockless freelist and zap the regular freelist.
+ * Slow path. We failed to allocate via percpu sheaves, or they are not available
+ * due to bootstrap, enabled debugging, or SLUB_TINY.
*
- * If that is not working then we fall back to the partial lists. We take the
- * first element of the freelist as the object to allocate now and move the
- * rest of the freelist to the lockless freelist.
- *
- * And if we were unable to get a new slab from the partial slab lists then
- * we need to allocate a new slab. This is the slowest path since it involves
- * a call to the page allocator and the setup of a new slab.
- *
- * Version of __slab_alloc to use when we know that preemption is
- * already disabled (which is the case for bulk allocation).
+ * We try to allocate from partial slab lists and fall back to allocating a new
+ * slab.
*/
static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
- unsigned long addr, struct kmem_cache_cpu *c, unsigned int orig_size)
+ unsigned long addr, unsigned int orig_size)
{
bool allow_spin = gfpflags_allow_spinning(gfpflags);
void *freelist;
struct slab *slab;
- unsigned long flags;
struct partial_context pc;
bool try_thisnode = true;

stat(s, ALLOC_SLOWPATH);

-reread_slab:
-
- slab = READ_ONCE(c->slab);
- if (!slab) {
- /*
- * if the node is not online or has no normal memory, just
- * ignore the node constraint
- */
- if (unlikely(node != NUMA_NO_NODE &&
- !node_isset(node, slab_nodes)))
- node = NUMA_NO_NODE;
- goto new_slab;
- }
-
- if (unlikely(!node_match(slab, node))) {
- /*
- * same as above but node_match() being false already
- * implies node != NUMA_NO_NODE.
- *
- * We don't strictly honor pfmemalloc and NUMA preferences
- * when !allow_spin because:
- *
- * 1. Most kmalloc() users allocate objects on the local node,
- * so kmalloc_nolock() tries not to interfere with them by
- * deactivating the cpu slab.
- *
- * 2. Deactivating due to NUMA or pfmemalloc mismatch may cause
- * unnecessary slab allocations even when n->partial list
- * is not empty.
- */
- if (!node_isset(node, slab_nodes) ||
- !allow_spin) {
- node = NUMA_NO_NODE;
- } else {
- stat(s, ALLOC_NODE_MISMATCH);
- goto deactivate_slab;
- }
- }
-
- /*
- * By rights, we should be searching for a slab page that was
- * PFMEMALLOC but right now, we are losing the pfmemalloc
- * information when the page leaves the per-cpu allocator
- */
- if (unlikely(!pfmemalloc_match(slab, gfpflags) && allow_spin))
- goto deactivate_slab;
-
- /* must check again c->slab in case we got preempted and it changed */
- local_lock_cpu_slab(s, flags);
-
- if (unlikely(slab != c->slab)) {
- local_unlock_cpu_slab(s, flags);
- goto reread_slab;
- }
- freelist = c->freelist;
- if (freelist)
- goto load_freelist;
-
- freelist = get_freelist(s, slab);
-
- if (!freelist) {
- c->slab = NULL;
- c->tid = next_tid(c->tid);
- local_unlock_cpu_slab(s, flags);
- stat(s, DEACTIVATE_BYPASS);
- goto new_slab;
- }
-
- stat(s, ALLOC_REFILL);
-
-load_freelist:
-
- lockdep_assert_held(this_cpu_ptr(&s->cpu_slab->lock));
-
- /*
- * freelist is pointing to the list of objects to be used.
- * slab is pointing to the slab from which the objects are obtained.
- * That slab must be frozen for per cpu allocations to work.
- */
- VM_BUG_ON(!c->slab->frozen);
- c->freelist = get_freepointer(s, freelist);
- c->tid = next_tid(c->tid);
- local_unlock_cpu_slab(s, flags);
- return freelist;
-
-deactivate_slab:
-
- local_lock_cpu_slab(s, flags);
- if (slab != c->slab) {
- local_unlock_cpu_slab(s, flags);
- goto reread_slab;
- }
- freelist = c->freelist;
- c->slab = NULL;
- c->freelist = NULL;
- c->tid = next_tid(c->tid);
- local_unlock_cpu_slab(s, flags);
- deactivate_slab(s, slab, freelist);
-
-new_slab:
-
-#ifdef CONFIG_SLUB_CPU_PARTIAL
- while (slub_percpu_partial(c)) {
- local_lock_cpu_slab(s, flags);
- if (unlikely(c->slab)) {
- local_unlock_cpu_slab(s, flags);
- goto reread_slab;
- }
- if (unlikely(!slub_percpu_partial(c))) {
- local_unlock_cpu_slab(s, flags);
- /* we were preempted and partial list got empty */
- goto new_objects;
- }
-
- slab = slub_percpu_partial(c);
- slub_set_percpu_partial(c, slab);
-
- if (likely(node_match(slab, node) &&
- pfmemalloc_match(slab, gfpflags)) ||
- !allow_spin) {
- c->slab = slab;
- freelist = get_freelist(s, slab);
- VM_BUG_ON(!freelist);
- stat(s, CPU_PARTIAL_ALLOC);
- goto load_freelist;
- }
-
- local_unlock_cpu_slab(s, flags);
-
- slab->next = NULL;
- __put_partials(s, slab);
- }
-#endif
-
new_objects:

pc.flags = gfpflags;
@@ -4750,33 +4487,15 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
}

pc.orig_size = orig_size;
- slab = get_partial(s, node, &pc);
- if (slab) {
- if (IS_ENABLED(CONFIG_SLUB_TINY) || kmem_cache_debug(s)) {
- freelist = pc.object;
- /*
- * For debug caches here we had to go through
- * alloc_single_from_partial() so just store the
- * tracking info and return the object.
- *
- * Due to disabled preemption we need to disallow
- * blocking. The flags are further adjusted by
- * gfp_nested_mask() in stack_depot itself.
- */
- if (s->flags & SLAB_STORE_USER)
- set_track(s, freelist, TRACK_ALLOC, addr,
- gfpflags & ~(__GFP_DIRECT_RECLAIM));
+ freelist = get_partial(s, node, &pc);
+ if (freelist) {
+ if (kmem_cache_debug_flags(s, SLAB_STORE_USER))
+ set_track(s, freelist, TRACK_ALLOC, addr, gfpflags);

- return freelist;
- }
-
- freelist = freeze_slab(s, slab);
- goto retry_load_slab;
+ return freelist;
}

- slub_put_cpu_ptr(s->cpu_slab);
slab = new_slab(s, pc.flags, node);
- c = slub_get_cpu_ptr(s->cpu_slab);

if (unlikely(!slab)) {
if (node != NUMA_NO_NODE && !(gfpflags & __GFP_THISNODE)
@@ -4790,66 +4509,15 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,

stat(s, ALLOC_SLAB);

- if (IS_ENABLED(CONFIG_SLUB_TINY) || kmem_cache_debug(s)) {
- freelist = alloc_single_from_new_slab(s, slab, orig_size, gfpflags);
-
- if (unlikely(!freelist))
- goto new_objects;
-
- if (s->flags & SLAB_STORE_USER)
- set_track(s, freelist, TRACK_ALLOC, addr,
- gfpflags & ~(__GFP_DIRECT_RECLAIM));
-
- return freelist;
- }
-
- /*
- * No other reference to the slab yet so we can
- * muck around with it freely without cmpxchg
- */
- freelist = slab->freelist;
- slab->freelist = NULL;
- slab->inuse = slab->objects;
- slab->frozen = 1;
-
- inc_slabs_node(s, slab_nid(slab), slab->objects);
+ freelist = alloc_single_from_new_slab(s, slab, orig_size, gfpflags);

- if (unlikely(!pfmemalloc_match(slab, gfpflags) && allow_spin)) {
- /*
- * For !pfmemalloc_match() case we don't load freelist so that
- * we don't make further mismatched allocations easier.
- */
- deactivate_slab(s, slab, get_freepointer(s, freelist));
- return freelist;
- }
+ if (unlikely(!freelist))
+ goto new_objects;

-retry_load_slab:
+ if (kmem_cache_debug_flags(s, SLAB_STORE_USER))
+ set_track(s, freelist, TRACK_ALLOC, addr, gfpflags);

- local_lock_cpu_slab(s, flags);
- if (unlikely(c->slab)) {
- void *flush_freelist = c->freelist;
- struct slab *flush_slab = c->slab;
-
- c->slab = NULL;
- c->freelist = NULL;
- c->tid = next_tid(c->tid);
-
- local_unlock_cpu_slab(s, flags);
-
- if (unlikely(!allow_spin)) {
- /* Reentrant slub cannot take locks, defer */
- defer_deactivate_slab(flush_slab, flush_freelist);
- } else {
- deactivate_slab(s, flush_slab, flush_freelist);
- }
-
- stat(s, CPUSLAB_FLUSH);
-
- goto retry_load_slab;
- }
- c->slab = slab;
-
- goto load_freelist;
+ return freelist;
}
/*
* We disallow kprobes in ___slab_alloc() to prevent reentrance
@@ -4865,87 +4533,11 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
*/
NOKPROBE_SYMBOL(___slab_alloc);

-/*
- * A wrapper for ___slab_alloc() for contexts where preemption is not yet
- * disabled. Compensates for possible cpu changes by refetching the per cpu area
- * pointer.
- */
-static void *__slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
- unsigned long addr, struct kmem_cache_cpu *c, unsigned int orig_size)
-{
- void *p;
-
-#ifdef CONFIG_PREEMPT_COUNT
- /*
- * We may have been preempted and rescheduled on a different
- * cpu before disabling preemption. Need to reload cpu area
- * pointer.
- */
- c = slub_get_cpu_ptr(s->cpu_slab);
-#endif
- if (unlikely(!gfpflags_allow_spinning(gfpflags))) {
- if (local_lock_is_locked(&s->cpu_slab->lock)) {
- /*
- * EBUSY is an internal signal to kmalloc_nolock() to
- * retry a different bucket. It's not propagated
- * to the caller.
- */
- p = ERR_PTR(-EBUSY);
- goto out;
- }
- }
- p = ___slab_alloc(s, gfpflags, node, addr, c, orig_size);
-out:
-#ifdef CONFIG_PREEMPT_COUNT
- slub_put_cpu_ptr(s->cpu_slab);
-#endif
- return p;
-}
-
static __always_inline void *__slab_alloc_node(struct kmem_cache *s,
gfp_t gfpflags, int node, unsigned long addr, size_t orig_size)
{
- struct kmem_cache_cpu *c;
- struct slab *slab;
- unsigned long tid;
void *object;

-redo:
- /*
- * Must read kmem_cache cpu data via this cpu ptr. Preemption is
- * enabled. We may switch back and forth between cpus while
- * reading from one cpu area. That does not matter as long
- * as we end up on the original cpu again when doing the cmpxchg.
- *
- * We must guarantee that tid and kmem_cache_cpu are retrieved on the
- * same cpu. We read first the kmem_cache_cpu pointer and use it to read
- * the tid. If we are preempted and switched to another cpu between the
- * two reads, it's OK as the two are still associated with the same cpu
- * and cmpxchg later will validate the cpu.
- */
- c = raw_cpu_ptr(s->cpu_slab);
- tid = READ_ONCE(c->tid);
-
- /*
- * Irqless object alloc/free algorithm used here depends on sequence
- * of fetching cpu_slab's data. tid should be fetched before anything
- * on c to guarantee that object and slab associated with previous tid
- * won't be used with current tid. If we fetch tid first, object and
- * slab could be one associated with next tid and our alloc/free
- * request will be failed. In this case, we will retry. So, no problem.
- */
- barrier();
-
- /*
- * The transaction ids are globally unique per cpu and per operation on
- * a per cpu queue. Thus they can be guarantee that the cmpxchg_double
- * occurs on the right processor and that there was no operation on the
- * linked list in between.
- */
-
- object = c->freelist;
- slab = c->slab;
-
#ifdef CONFIG_NUMA
if (static_branch_unlikely(&strict_numa) &&
node == NUMA_NO_NODE) {
@@ -4954,47 +4546,20 @@ static __always_inline void *__slab_alloc_node(struct kmem_cache *s,

if (mpol) {
/*
- * Special BIND rule support. If existing slab
+ * Special BIND rule support. If the local node
* is in permitted set then do not redirect
* to a particular node.
* Otherwise we apply the memory policy to get
* the node we need to allocate on.
*/
- if (mpol->mode != MPOL_BIND || !slab ||
- !node_isset(slab_nid(slab), mpol->nodes))
-
+ if (mpol->mode != MPOL_BIND ||
+ !node_isset(numa_mem_id(), mpol->nodes))
node = mempolicy_slab_node();
}
}
#endif

- if (!USE_LOCKLESS_FAST_PATH() ||
- unlikely(!object || !slab || !node_match(slab, node))) {
- object = __slab_alloc(s, gfpflags, node, addr, c, orig_size);
- } else {
- void *next_object = get_freepointer_safe(s, object);
-
- /*
- * The cmpxchg will only match if there was no additional
- * operation and if we are on the right processor.
- *
- * The cmpxchg does the following atomically (without lock
- * semantics!)
- * 1. Relocate first pointer to the current per cpu area.
- * 2. Verify that tid and freelist have not been changed
- * 3. If they were not changed replace tid and freelist
- *
- * Since this is without lock semantics the protection is only
- * against code executing on this cpu *not* from access by
- * other cpus.
- */
- if (unlikely(!__update_cpu_freelist_fast(s, object, next_object, tid))) {
- note_cmpxchg_failure("slab_alloc", s, tid);
- goto redo;
- }
- prefetch_freepointer(s, next_object);
- stat(s, ALLOC_FASTPATH);
- }
+ object = ___slab_alloc(s, gfpflags, node, addr, orig_size);

return object;
}
@@ -7638,62 +7203,25 @@ static inline
int __kmem_cache_alloc_bulk(struct kmem_cache *s, gfp_t flags, size_t size,
void **p)
{
- struct kmem_cache_cpu *c;
- unsigned long irqflags;
int i;

/*
- * Drain objects in the per cpu slab, while disabling local
- * IRQs, which protects against PREEMPT and interrupts
- * handlers invoking normal fastpath.
+ * TODO: this might be more efficient (if necessary) by reusing
+ * __refill_objects()
*/
- c = slub_get_cpu_ptr(s->cpu_slab);
- local_lock_irqsave(&s->cpu_slab->lock, irqflags);
-
for (i = 0; i < size; i++) {
- void *object = c->freelist;

- if (unlikely(!object)) {
- /*
- * We may have removed an object from c->freelist using
- * the fastpath in the previous iteration; in that case,
- * c->tid has not been bumped yet.
- * Since ___slab_alloc() may reenable interrupts while
- * allocating memory, we should bump c->tid now.
- */
- c->tid = next_tid(c->tid);
+ p[i] = ___slab_alloc(s, flags, NUMA_NO_NODE, _RET_IP_,
+ s->object_size);
+ if (unlikely(!p[i]))
+ goto error;

- local_unlock_irqrestore(&s->cpu_slab->lock, irqflags);
-
- /*
- * Invoking slow path likely have side-effect
- * of re-populating per CPU c->freelist
- */
- p[i] = ___slab_alloc(s, flags, NUMA_NO_NODE,
- _RET_IP_, c, s->object_size);
- if (unlikely(!p[i]))
- goto error;
-
- c = this_cpu_ptr(s->cpu_slab);
- maybe_wipe_obj_freeptr(s, p[i]);
-
- local_lock_irqsave(&s->cpu_slab->lock, irqflags);
-
- continue; /* goto for-loop */
- }
- c->freelist = get_freepointer(s, object);
- p[i] = object;
maybe_wipe_obj_freeptr(s, p[i]);
- stat(s, ALLOC_FASTPATH);
}
- c->tid = next_tid(c->tid);
- local_unlock_irqrestore(&s->cpu_slab->lock, irqflags);
- slub_put_cpu_ptr(s->cpu_slab);

return i;

error:
- slub_put_cpu_ptr(s->cpu_slab);
__kmem_cache_free_bulk(s, i, p);
return 0;


--
2.51.1

Marco Elver

Oct 23, 2025, 11:21:31 AM
to Vlastimil Babka, Andrew Morton, Christoph Lameter, David Rientjes, Roman Gushchin, Harry Yoo, Uladzislau Rezki, Liam R. Howlett, Suren Baghdasaryan, Sebastian Andrzej Siewior, Alexei Starovoitov, linu...@kvack.org, linux-...@vger.kernel.org, linux-r...@lists.linux.dev, b...@vger.kernel.org, kasa...@googlegroups.com, Alexander Potapenko, Dmitry Vyukov
Might be nice to briefly write a comment here in code as well instead
of having to dig through the commit logs.

The tests still pass? (CONFIG_KFENCE_KUNIT_TEST=y)

Chris Mason

Oct 24, 2025, 10:05:46 AM
to Vlastimil Babka, Chris Mason, Andrew Morton, Christoph Lameter, David Rientjes, Roman Gushchin, Harry Yoo, Uladzislau Rezki, Liam R. Howlett, Suren Baghdasaryan, Sebastian Andrzej Siewior, Alexei Starovoitov, linu...@kvack.org, linux-...@vger.kernel.org, linux-r...@lists.linux.dev, b...@vger.kernel.org, kasa...@googlegroups.com
On Thu, 23 Oct 2025 15:52:29 +0200 Vlastimil Babka <vba...@suse.cz> wrote:

> Before we enable percpu sheaves for kmalloc caches, we need to make sure
> kmalloc_nolock() and kfree_nolock() will continue working properly and
> not spin when not allowed to.
>
> Percpu sheaves themselves use local_trylock() so they are already
> compatible. We just need to be careful with the barn->lock spin_lock.
> Pass a new allow_spin parameter where necessary to use
> spin_trylock_irqsave().
>
> In kmalloc_nolock_noprof() we can now attempt alloc_from_pcs() safely,
> for now it will always fail until we enable sheaves for kmalloc caches
> next. Similarly in kfree_nolock() we can attempt free_to_pcs().
>

Hi Vlastimil,

We're trying out the AI patch review automation on the BPF commits and it had
some questions about a few of these. Since the goal is to actually test the
automation, I'm putting them in unedited, but I did try and make sure they
were valid before sending.

This one is a little verbose for s/NULL/ERR_PTR/ suggestions, but:

> diff --git a/mm/slub.c b/mm/slub.c
> index ecb10ed5acfe..5d0b2cf66520 100644
> --- a/mm/slub.c
> +++ b/mm/slub.c

[ ... ]

> @@ -2990,7 +2998,8 @@ barn_replace_empty_sheaf(struct node_barn *barn, struct slab_sheaf *empty)
> * barn. But if there are too many full sheaves, reject this with -E2BIG.
> */
> static struct slab_sheaf *
> -barn_replace_full_sheaf(struct node_barn *barn, struct slab_sheaf *full)
> +barn_replace_full_sheaf(struct node_barn *barn, struct slab_sheaf *full,
> + bool allow_spin)
> {
> struct slab_sheaf *empty;
> unsigned long flags;
> @@ -3001,7 +3010,10 @@ barn_replace_full_sheaf(struct node_barn *barn, struct slab_sheaf *full)
> if (!data_race(barn->nr_empty))
> return ERR_PTR(-ENOMEM);
>
> - spin_lock_irqsave(&barn->lock, flags);
> + if (likely(allow_spin))
> + spin_lock_irqsave(&barn->lock, flags);
> + else if (!spin_trylock_irqsave(&barn->lock, flags))
> + return NULL;

barn_replace_full_sheaf() now returns NULL when the trylock fails, in
addition to its existing ERR_PTR(-ENOMEM) and ERR_PTR(-E2BIG) returns.

[ ... ]

> @@ -6047,7 +6070,8 @@ static void __pcs_install_empty_sheaf(struct kmem_cache *s,
> * unlocked.
> */
> static struct slub_percpu_sheaves *
> -__pcs_replace_full_main(struct kmem_cache *s, struct slub_percpu_sheaves *pcs)
> +__pcs_replace_full_main(struct kmem_cache *s, struct slub_percpu_sheaves *pcs,
> + bool allow_spin)
> {
> struct slab_sheaf *empty;
> struct node_barn *barn;
> @@ -6071,7 +6095,7 @@ __pcs_replace_full_main(struct kmem_cache *s, struct slub_percpu_sheaves *pcs)
> put_fail = false;
>
> if (!pcs->spare) {
> - empty = barn_get_empty_sheaf(barn);
> + empty = barn_get_empty_sheaf(barn, allow_spin);
> if (empty) {
> pcs->spare = pcs->main;
> pcs->main = empty;
> @@ -6085,7 +6109,7 @@ __pcs_replace_full_main(struct kmem_cache *s, struct slub_percpu_sheaves *pcs)
> return pcs;
> }
>
> - empty = barn_replace_full_sheaf(barn, pcs->main);
> + empty = barn_replace_full_sheaf(barn, pcs->main, allow_spin);
>
> if (!IS_ERR(empty)) {
> stat(s, BARN_PUT);
> pcs->main = empty;
> return pcs;
> }

Can this set pcs->main to NULL? When allow_spin=false and the trylock
fails, barn_replace_full_sheaf() returns NULL. Since !IS_ERR(NULL) is
true, the code executes pcs->main = NULL and returns pcs. This violates
the documented invariant that main is "never NULL when unlocked" (from
struct slub_percpu_sheaves comment at mm/slub.c:481).

The caller free_to_pcs() then dereferences pcs->main->objects at
mm/slub.c:6208, which would be a NULL pointer dereference:

pcs->main->objects[pcs->main->size++] = object;

The execution path is:
1. kfree_nolock() calls free_to_pcs() with allow_spin=false
2. free_to_pcs() calls __pcs_replace_full_main() with allow_spin=false
3. pcs->spare exists and is full (size == s->sheaf_capacity)
4. barn_replace_full_sheaf() trylock fails, returns NULL
5. !IS_ERR(NULL) = true, so pcs->main = NULL is executed
6. __pcs_replace_full_main() returns pcs with NULL main
7. free_to_pcs() dereferences pcs->main->objects causing NULL deref

Should the check be if (empty && !IS_ERR(empty)) to properly handle the
new NULL return case?
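
For illustration, that could look something like the following (untested
sketch against the quoted hunk; the NULL case would then fall through to
the !allow_spin bail-out further below):

	empty = barn_replace_full_sheaf(barn, pcs->main, allow_spin);

	if (empty && !IS_ERR(empty)) {
		stat(s, BARN_PUT);
		pcs->main = empty;
		return pcs;
	}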

> @@ -6093,6 +6117,11 @@ __pcs_replace_full_main(struct kmem_cache *s, struct slub_percpu_sheaves *pcs)
> return pcs;
> }
>
> + if (!allow_spin) {
> + local_unlock(&s->cpu_sheaves->lock);
> + return NULL;
> + }

This check is never reached when barn_replace_full_sheaf() returns NULL,
because the early return at the !IS_ERR(empty) check above executes first.

[ ... ]


Chris Mason

Oct 24, 2025, 10:21:59 AM
to Vlastimil Babka, Chris Mason, Andrew Morton, Christoph Lameter, David Rientjes, Roman Gushchin, Harry Yoo, Uladzislau Rezki, Liam R. Howlett, Suren Baghdasaryan, Sebastian Andrzej Siewior, Alexei Starovoitov, linu...@kvack.org, linux-...@vger.kernel.org, linux-r...@lists.linux.dev, b...@vger.kernel.org, kasa...@googlegroups.com
On Thu, 23 Oct 2025 15:52:24 +0200 Vlastimil Babka <vba...@suse.cz> wrote:

> When a pfmemalloc allocation actually dips into reserves, the slab is
> marked accordingly and non-pfmemalloc allocations should not be allowed
> to allocate from it. The sheaves percpu caching currently doesn't follow
> this rule, so implement it before we expand sheaves usage to all caches.
>
> Make sure objects from pfmemalloc slabs don't end up in percpu sheaves.
> When freeing, skip sheaves when freeing an object from pfmemalloc slab.
> When refilling sheaves, use __GFP_NOMEMALLOC to override any pfmemalloc
> context - the allocation will fallback to regular slab allocations when
> sheaves are depleted and can't be refilled because of the override.
>
> For kfree_rcu(), detect pfmemalloc slabs after processing the rcu_sheaf
> after the grace period in __rcu_free_sheaf_prepare() and simply flush
> it if any object is from pfmemalloc slabs.
>
> For prefilled sheaves, try to refill them first with __GFP_NOMEMALLOC
> and if it fails, retry without __GFP_NOMEMALLOC but then mark the sheaf
> pfmemalloc, which makes it flushed back to slabs when returned.
>

Hi Vlastimil,

We're trying out the AI patch review automation on the BPF commits and it had
some questions about a few of these. Since the goal is to actually test the
automation, I'm putting them in unedited, but I did try and make sure they
were valid before sending.

For this one, I'm not sure if the behavior AI is flagging is intentional or
not, so it feels like a reasonable question to me at least:

> diff --git a/mm/slub.c b/mm/slub.c
> index 4731b9e461c2..ab03f29dc3bf 100644
> --- a/mm/slub.c
> +++ b/mm/slub.c

[ ... ]

> @@ -5331,6 +5340,26 @@ void *kmem_cache_alloc_node_noprof(struct kmem_cache *s, gfp_t gfpflags, int nod
> }
> EXPORT_SYMBOL(kmem_cache_alloc_node_noprof);
>
> +static int __prefill_sheaf_pfmemalloc(struct kmem_cache *s,
> + struct slab_sheaf *sheaf, gfp_t gfp)
> +{
> + int ret = 0;
> +
> + ret = refill_sheaf(s, sheaf, gfp | __GFP_NOMEMALLOC);
> +
> + if (likely(!ret || !gfp_pfmemalloc_allowed(gfp)))
> + return ret;
> +
> + /*
> + * if we are allowed to, refill sheaf with pfmemalloc but then remember
> + * it for when it's returned
> + */
> + ret = refill_sheaf(s, sheaf, gfp);
> + sheaf->pfmemalloc = true;
> +
> + return ret;
> +}

[ ... ]

> @@ -5497,7 +5528,7 @@ int kmem_cache_refill_sheaf(struct kmem_cache *s, gfp_t gfp,
>
> if (likely(sheaf->capacity >= size)) {
> if (likely(sheaf->capacity == s->sheaf_capacity))
> - return refill_sheaf(s, sheaf, gfp);
> + return __prefill_sheaf_pfmemalloc(s, sheaf, gfp);
>
> if (!__kmem_cache_alloc_bulk(s, gfp, sheaf->capacity - sheaf->size,
> &sheaf->objects[sheaf->size])) {
^^^

In kmem_cache_refill_sheaf(), does the oversize sheaf path (when
sheaf->capacity != s->sheaf_capacity) need __GFP_NOMEMALLOC too?

The commit message says "When refilling sheaves, use __GFP_NOMEMALLOC
to override any pfmemalloc context", and the normal capacity path now
calls __prefill_sheaf_pfmemalloc() which adds __GFP_NOMEMALLOC.

But this oversize path still calls __kmem_cache_alloc_bulk() with the
raw gfp flags. If the calling context is pfmemalloc-enabled (e.g.,
during swap or network operations), could pfmemalloc objects be
allocated into the sheaf? Those objects would then be returned via
kmem_cache_alloc_from_sheaf() to potentially non-pfmemalloc callers.

Similar issue exists in kmem_cache_prefill_sheaf() for the oversize
path at the function's start, though that code wasn't modified by this
patch.
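
A minimal way to close that in the oversize path would be masking the
flags the same way (untested sketch, names taken from the quoted hunk):

	if (!__kmem_cache_alloc_bulk(s, gfp | __GFP_NOMEMALLOC,
				     sheaf->capacity - sheaf->size,
				     &sheaf->objects[sheaf->size]))
		return -ENOMEM;

though presumably it would also want the retry-without-__GFP_NOMEMALLOC
and mark-pfmemalloc fallback that __prefill_sheaf_pfmemalloc() does for
the normal-capacity path.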

> return -ENOMEM;
> }
> sheaf->size = sheaf->capacity;

[ ... ]


Chris Mason

Oct 24, 2025, 10:29:50 AM
to Vlastimil Babka, Chris Mason, Andrew Morton, Christoph Lameter, David Rientjes, Roman Gushchin, Harry Yoo, Uladzislau Rezki, Liam R. Howlett, Suren Baghdasaryan, Sebastian Andrzej Siewior, Alexei Starovoitov, linu...@kvack.org, linux-...@vger.kernel.org, linux-r...@lists.linux.dev, b...@vger.kernel.org, kasa...@googlegroups.com
On Thu, 23 Oct 2025 15:52:32 +0200 Vlastimil Babka <vba...@suse.cz> wrote:

> We now rely on sheaves as the percpu caching layer and can refill them
> directly from partial or newly allocated slabs. Start removing the cpu
> (partial) slabs code, first from allocation paths.
>
> This means that any allocation not satisfied from percpu sheaves will
> end up in ___slab_alloc(), where we remove the usage of cpu (partial)
> slabs, so it will only perform get_partial() or new_slab().
>
> In get_partial_node() we used to return a slab for freezing as the cpu
> slab and to refill the partial slab. Now we only want to return a single
> object and leave the slab on the list (unless it became full). We can't
> simply reuse alloc_single_from_partial() as that assumes freeing uses
> free_to_partial_list(). Instead we need to use __slab_update_freelist()
> to work properly against a racing __slab_free().
>
> The rest of the changes is removing functions that no longer have any
> callers.
>

Hi Vlastimil,

We're trying out the AI patch review automation on the BPF commits and it had
some questions about a few of these. Since the goal is to actually test the
automation, I'm putting them in unedited, but I did try and make sure they
were valid before sending.

This one seems pretty unlikely but still worth fixing:

>
> diff --git a/mm/slub.c b/mm/slub.c
> index e2b052657d11..bd67336e7c1f 100644
> --- a/mm/slub.c
> +++ b/mm/slub.c

[ ... ]
Can get_partial_node() return an uninitialized pointer? The variable
'object' is declared but never initialized. If all slabs in the partial
list fail the pfmemalloc_match() check, the loop completes without
setting 'object', then returns it at the end of the function.

In the previous version, the equivalent 'partial' variable was explicitly
initialized to NULL. When all slabs were skipped, NULL was returned.
[ ... ]
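
The fix would presumably just be initializing the variable at the top of
get_partial_node(), matching the previous version (untested):

	struct slab *slab, *slab2;
	unsigned long flags;
	void *object = NULL;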


Chris Mason

Oct 24, 2025, 11:29:46 AM
to Vlastimil Babka, Chris Mason, Andrew Morton, Christoph Lameter, David Rientjes, Roman Gushchin, Harry Yoo, Uladzislau Rezki, Liam R. Howlett, Suren Baghdasaryan, Sebastian Andrzej Siewior, Alexei Starovoitov, linu...@kvack.org, linux-...@vger.kernel.org, linux-r...@lists.linux.dev, b...@vger.kernel.org, kasa...@googlegroups.com
On Thu, 23 Oct 2025 15:52:28 +0200 Vlastimil Babka <vba...@suse.cz> wrote:

> Until now, kmem_cache->cpu_sheaves was !NULL only for caches with
> sheaves enabled. Since we want to enable them for almost all caches,
> it's suboptimal to test the pointer in the fast paths, so instead
> allocate it for all caches in do_kmem_cache_create(). Instead of testing
> the cpu_sheaves pointer to recognize caches (yet) without sheaves, test
> kmem_cache->sheaf_capacity for being 0, where needed.
>
> However, for the fast paths sake we also assume that the main sheaf
> always exists (pcs->main is !NULL), and during bootstrap we cannot
> allocate sheaves yet.
>
> Solve this by introducing a single static bootstrap_sheaf that's
> assigned as pcs->main during bootstrap. It has a size of 0, so during
> allocations, the fast path will find it's empty. Since the size of 0
> matches sheaf_capacity of 0, the freeing fast paths will find it's
> "full". In the slow path handlers, we check sheaf_capacity to recognize
> that the cache doesn't (yet) have real sheaves, and fall back. Thus
> sharing the single bootstrap sheaf like this for multiple caches and
> cpus is safe.
>

Hi Vlastimil,

We're trying out the AI patch review automation on the BPF commits and it had
some questions about a few of these. Since the goal is to actually test the
automation, I'm putting them in unedited, but I did try and make sure they
were valid before sending.

> diff --git a/mm/slub.c b/mm/slub.c
> index a6e58d3708f4..ecb10ed5acfe 100644
> --- a/mm/slub.c
> +++ b/mm/slub.c

[ ... ]

> @@ -7733,8 +7782,7 @@ static void free_kmem_cache_nodes(struct kmem_cache *s)
> void __kmem_cache_release(struct kmem_cache *s)
> {
> cache_random_seq_destroy(s);
> - if (s->cpu_sheaves)
> - pcs_destroy(s);
> + pcs_destroy(s);

Can pcs_destroy() handle a NULL cpu_sheaves pointer? Looking at the
error path in do_kmem_cache_create():

s->cpu_sheaves = alloc_percpu(struct slub_percpu_sheaves);
if (!s->cpu_sheaves) {
err = -ENOMEM;
goto out;
}
...
out:
if (err)
__kmem_cache_release(s);

If alloc_percpu() fails, __kmem_cache_release() now calls pcs_destroy()
unconditionally. In pcs_destroy(), the first operation is:

for_each_possible_cpu(cpu) {
pcs = per_cpu_ptr(s->cpu_sheaves, cpu);

Does per_cpu_ptr() dereference s->cpu_sheaves when it's NULL?

> #ifdef CONFIG_PREEMPT_RT
> if (s->cpu_slab)
> lockdep_unregister_key(&s->lock_key);

[ ... ]

> @@ -8608,12 +8656,10 @@ int do_kmem_cache_create(struct kmem_cache *s, const char *name,
>
> set_cpu_partial(s);
>
> - if (s->sheaf_capacity) {
> - s->cpu_sheaves = alloc_percpu(struct slub_percpu_sheaves);
> - if (!s->cpu_sheaves) {
> - err = -ENOMEM;
> - goto out;
> - }
> + s->cpu_sheaves = alloc_percpu(struct slub_percpu_sheaves);
> + if (!s->cpu_sheaves) {
> + err = -ENOMEM;
> + goto out;
> }

This error path triggers the call chain: do_kmem_cache_create() error
path -> __kmem_cache_release() -> pcs_destroy() with NULL cpu_sheaves.
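
One way to keep the simplification while covering that error path would
be a guard at the top of pcs_destroy() (untested sketch), or
alternatively keeping the caller-side check in __kmem_cache_release():

	/* tolerate the do_kmem_cache_create() error path */
	if (!s->cpu_sheaves)
		return;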


Alexei Starovoitov

Oct 24, 2025, 3:43:34 PM
to Vlastimil Babka, Chris Mason, Andrew Morton, Christoph Lameter, David Rientjes, Roman Gushchin, Harry Yoo, Uladzislau Rezki, Liam R. Howlett, Suren Baghdasaryan, Sebastian Andrzej Siewior, Alexei Starovoitov, linux-mm, LKML, linux-r...@lists.linux.dev, bpf, kasan-dev
AI did a good job here. I spent an hour staring at the patch
for other reasons. Noticed this bug too and then went
"ohh, wait, AI mentioned it already". Time to retire.
I would remove the empty line here.
and would add a comment here to elaborate that the next
steps like sheaf_flush_unused() and alloc_empty_sheaf()
cannot handle !allow_spin.
I'm allergic to booleans in arguments. They make callsites
hard to read. Especially if there are multiple bools.
We have horrendous lines in the verifier that we still need
to clean up due to bools:
check_load_mem(env, insn, true, false, false, "atomic_load");

barn_get_empty_sheaf(barn, true); looks benign,
but I would still use enum { DONT_SPIN, ALLOW_SPIN }
and use that in all functions instead of 'bool allow_spin'.
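
i.e. something like this (sketch only; the enum name and the prototype
shape are just approximated from the quoted callsites):

	enum spin_mode {
		DONT_SPIN,
		ALLOW_SPIN,
	};

	static struct slab_sheaf *barn_get_empty_sheaf(struct node_barn *barn,
						       enum spin_mode mode);

	/* callsites then read as */
	empty = barn_get_empty_sheaf(barn, ALLOW_SPIN);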

Aside from that I got worried that the sheaves fast path
may not be optimized well by the compiler:
if (unlikely(pcs->main->size == 0)) ...
object = pcs->main->objects[pcs->main->size - 1];
// object is accessed here
pcs->main->size--;

since object may alias into pcs->main and the compiler
may be tempted to reload 'main'.
Looks like it's fine, since what 'object' points to is not actually read or written.
gcc15 asm looks good:
movq 8(%rbx), %rdx # _68->main, _69
movl 24(%rdx), %eax # _69->size, _70
# ../mm/slub.c:5129: if (unlikely(pcs->main->size == 0)) {
testl %eax, %eax # _70
je .L2076 #,
.L1953:
# ../mm/slub.c:5135: object = pcs->main->objects[pcs->main->size - 1];
leal -1(%rax), %esi #,
# ../mm/slub.c:5135: object = pcs->main->objects[pcs->main->size - 1];
movq 32(%rdx,%rsi,8), %rdi # prephitmp_309->objects[_81], object
# ../mm/slub.c:5135: object = pcs->main->objects[pcs->main->size - 1];
movq %rsi, %rax #,
# ../mm/slub.c:5137: if (unlikely(node_requested)) {
testb %r15b, %r15b # node_requested
jne .L2077 #,
.L1954:
# ../mm/slub.c:5149: pcs->main->size--;
movl %eax, 24(%rdx) # _81, prephitmp_30->size

Alexei Starovoitov

Oct 24, 2025, 4:44:07 PM
to Vlastimil Babka, Andrew Morton, Christoph Lameter, David Rientjes, Roman Gushchin, Harry Yoo, Uladzislau Rezki, Liam R. Howlett, Suren Baghdasaryan, Sebastian Andrzej Siewior, Alexei Starovoitov, linux-mm, LKML, linux-r...@lists.linux.dev, bpf, kasan-dev
On Thu, Oct 23, 2025 at 6:53 AM Vlastimil Babka <vba...@suse.cz> wrote:
>
> static bool has_pcs_used(int cpu, struct kmem_cache *s)
> @@ -5599,21 +5429,18 @@ static void __slab_free(struct kmem_cache *s, struct slab *slab,
> new.inuse -= cnt;
> if ((!new.inuse || !prior) && !was_frozen) {
> /* Needs to be taken off a list */
> - if (!kmem_cache_has_cpu_partial(s) || prior) {

I'm struggling to convince myself that it's correct.
Losing '|| prior' means that we will be grabbing
this "speculative" spin_lock much more often.
While before the change we need spin_lock only when
slab was partially empty
(assuming cpu_partial was on for caches where performance matters).

Also what about later check:
if (prior && !on_node_partial) {
spin_unlock_irqrestore(&n->list_lock, flags);
return;
}
and
if (unlikely(!prior)) {
add_partial(n, slab, DEACTIVATE_TO_TAIL);

Say new.inuse == 0, then 'n' will be set:
do we lose the slab?
Because before the change it would have been handed to put_cpu_partial()?

but... since AI didn't find any bugs here, I must be wrong :)

Alexei Starovoitov

Oct 24, 2025, 6:32:30 PM
to Vlastimil Babka, Andrew Morton, Christoph Lameter, David Rientjes, Roman Gushchin, Harry Yoo, Uladzislau Rezki, Liam R. Howlett, Suren Baghdasaryan, Sebastian Andrzej Siewior, Alexei Starovoitov, linux-mm, LKML, linux-r...@lists.linux.dev, bpf, kasan-dev
On Thu, Oct 23, 2025 at 6:53 AM Vlastimil Babka <vba...@suse.cz> wrote:
> @@ -6444,8 +6316,13 @@ void kfree_nolock(const void *object)
> * since kasan quarantine takes locks and not supported from NMI.
> */
> kasan_slab_free(s, x, false, false, /* skip quarantine */true);
> + /*
> + * __slab_free() can locklessly cmpxchg16 into a slab, but then it might
> + * need to take spin_lock for further processing.
> + * Avoid the complexity and simply add to a deferred list.
> + */
> if (!free_to_pcs(s, x, false))
> - do_slab_free(s, slab, x, x, 0, _RET_IP_);
> + defer_free(s, x);

That should be rare, right?
free_to_pcs() should have good chances to succeed,
and pcs->spare should be there for kmalloc sheaves?
So trylock failure due to contention in barn_get_empty_sheaf()
and in barn_replace_full_sheaf() should be rare.

But needs to be benchmarked, of course.
The current fast path cmpxchg16 in !RT is very reliable
in my tests. Hopefully this doesn't regress.

Alexei Starovoitov

Oct 24, 2025, 7:57:34 PM
to Vlastimil Babka, Andrew Morton, Christoph Lameter, David Rientjes, Roman Gushchin, Harry Yoo, Uladzislau Rezki, Liam R. Howlett, Suren Baghdasaryan, Sebastian Andrzej Siewior, Alexei Starovoitov, linux-mm, LKML, linux-r...@lists.linux.dev, bpf, kasan-dev, Alexander Potapenko, Marco Elver, Dmitry Vyukov
On Thu, Oct 23, 2025 at 6:53 AM Vlastimil Babka <vba...@suse.cz> wrote:
>
> Percpu sheaves caching was introduced as opt-in but the goal was to
> eventually move all caches to them. This is the next step, enabling
> sheaves for all caches (except the two bootstrap ones) and then removing
> the per cpu (partial) slabs and lots of associated code.
>
> Besides (hopefully) improved performance, this removes the rather
> complicated code related to the lockless fastpaths (using
> this_cpu_try_cmpxchg128/64) and its complications with PREEMPT_RT or
> kmalloc_nolock().
>
> The lockless slab freelist+counters update operation using
> try_cmpxchg128/64 remains and is crucial for freeing remote NUMA objects
> without repeating the "alien" array flushing of SLUB, and to allow
> flushing objects from sheaves to slabs mostly without the node
> list_lock.
>
> This is the first RFC to get feedback. Biggest TODOs are:
>
> - cleanup of stat counters to fit the new scheme
> - integration of rcu sheaves handling with kfree_rcu batching

The whole thing looks good, and imo these two are lower priority.

> - performance evaluation

The performance results will be the key.
What kind of benchmarks do you have in mind?

Harry Yoo

Oct 26, 2025, 8:24:36 PM
to Vlastimil Babka, Andrew Morton, Christoph Lameter, David Rientjes, Roman Gushchin, Uladzislau Rezki, Liam R. Howlett, Suren Baghdasaryan, Sebastian Andrzej Siewior, Alexei Starovoitov, linu...@kvack.org, linux-...@vger.kernel.org, linux-r...@lists.linux.dev, b...@vger.kernel.org, kasa...@googlegroups.com
On Thu, Oct 23, 2025 at 03:52:27PM +0200, Vlastimil Babka wrote:
> In the first step to replace cpu (partial) slabs with sheaves, enable
> sheaves for almost all caches. Treat args->sheaf_capacity as a minimum,
> and calculate sheaf capacity with a formula that roughly follows the
> formula for number of objects in cpu partial slabs in set_cpu_partial().

Should we scale sheaf capacity not only based on object size but also
on the number of CPUs, like calculate_order() does?
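
For example (purely hypothetical scaling, just to illustrate the shape;
'capacity' stands for the value computed inside calculate_sheaf_capacity()
in this series, and the fls()/num_present_cpus() part mirrors how
calculate_order() derives min_objects):

	unsigned int nr_cpus = num_present_cpus();

	/* hypothetical: raise the capacity floor on larger machines */
	if (nr_cpus <= 1)
		nr_cpus = nr_cpu_ids;
	capacity = max_t(unsigned int, capacity, 4 * (fls(nr_cpus) + 1));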

> This should achieve roughly similar contention on the barn spin lock as
> there's currently for node list_lock without sheaves, to make
> benchmarking results comparable. It can be further tuned later.
>
> Don't enable sheaves for kmalloc caches yet, as that needs further
> changes to bootstrapping.
>
> Signed-off-by: Vlastimil Babka <vba...@suse.cz>
> ---

--
Cheers,
Harry / Hyeonggon

Harry Yoo

unread,
Oct 27, 2025, 2:12:20 AMOct 27
to Vlastimil Babka, Andrew Morton, Christoph Lameter, David Rientjes, Roman Gushchin, Uladzislau Rezki, Liam R. Howlett, Suren Baghdasaryan, Sebastian Andrzej Siewior, Alexei Starovoitov, linu...@kvack.org, linux-...@vger.kernel.org, linux-r...@lists.linux.dev, b...@vger.kernel.org, kasa...@googlegroups.com
On Thu, Oct 23, 2025 at 03:52:30PM +0200, Vlastimil Babka wrote:
> Enable sheaves for kmalloc caches. For other types than KMALLOC_NORMAL,
> we can simply allow them in calculate_sizes() as they are created later
> than KMALLOC_NORMAL caches and can allocate sheaves and barns from
> those.
>
> For KMALLOC_NORMAL caches we perform an additional step after first
> creating them without sheaves. Then bootstrap_cache_sheaves() simply
> allocates and initializes barns and sheaves and finally sets
> s->sheaf_capacity to make them actually used.
>
> Afterwards the only caches left without sheaves (unless SLUB_TINY or
> debugging is enabled) are kmem_cache and kmem_cache_node. These are only
> used when creating or destroying other kmem_caches. Thus they are not
> performance critical and we can simply leave it that way.
>
> Signed-off-by: Vlastimil Babka <vba...@suse.cz>
> ---
> mm/slub.c | 88 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++---
> 1 file changed, 84 insertions(+), 4 deletions(-)
>
> diff --git a/mm/slub.c b/mm/slub.c
> index 5d0b2cf66520..a84027fbca78 100644
> --- a/mm/slub.c
> +++ b/mm/slub.c
>
> static void free_empty_sheaf(struct kmem_cache *s, struct slab_sheaf *sheaf)
> {
> kfree(sheaf);
> @@ -8064,8 +8071,11 @@ static int calculate_sizes(struct kmem_cache_args *args, struct kmem_cache *s)
> if (s->flags & SLAB_RECLAIM_ACCOUNT)
> s->allocflags |= __GFP_RECLAIMABLE;
>
> - /* kmalloc caches need extra care to support sheaves */
> - if (!is_kmalloc_cache(s))
> + /*
> + * For KMALLOC_NORMAL caches we enable sheaves later by
> + * bootstrap_kmalloc_sheaves() to avoid recursion
> + */
> + if (!is_kmalloc_normal(s))
> s->sheaf_capacity = calculate_sheaf_capacity(s, args);

I was going to say we should differentiate KMALLOC_NORMAL caches that
are created for kmalloc buckets.... but no, they don't have the SLAB_KMALLOC
flag.

> /*
> @@ -8549,6 +8559,74 @@ static struct kmem_cache * __init bootstrap(struct kmem_cache *static_cache)
> return s;
> }
>
> +/*
> + * Finish the sheaves initialization done normally by init_percpu_sheaves() and
> + * init_kmem_cache_nodes(). For normal kmalloc caches we have to bootstrap it
> + * since sheaves and barns are allocated by kmalloc.
> + */
> +static void __init bootstrap_cache_sheaves(struct kmem_cache *s)
> +{
> + struct kmem_cache_args empty_args = {};
> + unsigned int capacity;
> + bool failed = false;
> + int node, cpu;
> +
> + capacity = calculate_sheaf_capacity(s, &empty_args);
> +
> + /* capacity can be 0 due to debugging or SLUB_TINY */
> + if (!capacity)
> + return;

I think pcs->main should still be !NULL in this case?

--
Cheers,
Harry / Hyeonggon

Harry Yoo

unread,
Oct 27, 2025, 3:21:16 AMOct 27
to Vlastimil Babka, Andrew Morton, Christoph Lameter, David Rientjes, Roman Gushchin, Uladzislau Rezki, Liam R. Howlett, Suren Baghdasaryan, Sebastian Andrzej Siewior, Alexei Starovoitov, linu...@kvack.org, linux-...@vger.kernel.org, linux-r...@lists.linux.dev, b...@vger.kernel.org, kasa...@googlegroups.com
It may end up iterating over all slabs in the n->partial list
when the sum of free objects isn't exactly equal to pc->max_objects?

> + }
> +
> + spin_unlock_irqrestore(&n->list_lock, flags);
> + return total_free > 0;
> +}
> +
> /*
> * Try to allocate a partial slab from a specific node.
> */
> @@ -4436,6 +4502,38 @@ static inline void *get_freelist(struct kmem_cache *s, struct slab *slab)
> return freelist;
> }
>
Maybe we don't have to do this if we put slabs into a singly linked list
and use the other word to record the number of objects in the slab.

> +
> + if (refilled >= max)
> + break;
> + }
> +
> + if (unlikely(!list_empty(&pc.slabs))) {
> + struct kmem_cache_node *n = get_node(s, node);
> +
> + spin_lock_irqsave(&n->list_lock, flags);

Do we surely know that trylock will succeed when
we succeeded to acquire it in get_partial_node_bulk()?

I think the answer is yes, but just to double check :)
If slab_nid(slab) != node, we should check gfpflags_allow_spinning()
and call defer_deactivate_slab() if it returns false?

> + }
> +
> + if (refilled < min)
> + goto new_slab;
> +out:
> +
> + return refilled;
> +}
> +
> static inline
> int __kmem_cache_alloc_bulk(struct kmem_cache *s, gfp_t flags, size_t size,
> void **p)
>
> --
> 2.51.1
>

Harry Yoo

unread,
Oct 27, 2025, 5:11:41 AMOct 27
to Vlastimil Babka, Andrew Morton, Christoph Lameter, David Rientjes, Roman Gushchin, Uladzislau Rezki, Liam R. Howlett, Suren Baghdasaryan, Sebastian Andrzej Siewior, Alexei Starovoitov, linu...@kvack.org, linux-...@vger.kernel.org, linux-r...@lists.linux.dev, b...@vger.kernel.org, kasa...@googlegroups.com
On Mon, Oct 27, 2025 at 04:20:56PM +0900, Harry Yoo wrote:
> On Thu, Oct 23, 2025 at 03:52:31PM +0200, Vlastimil Babka wrote:
> > + if (unlikely(!list_empty(&pc.slabs))) {
> > + struct kmem_cache_node *n = get_node(s, node);
> > +
> > + spin_lock_irqsave(&n->list_lock, flags);
>
> Do we surely know that trylock will succeed when
> we succeeded to acquire it in get_partial_node_bulk()?
>
> I think the answer is yes, but just to double check :)

Oh wait, this is not per-cpu lock, so the answer is no!
We need to check gfpflags_allow_spinning() before spinning then.

Vlastimil Babka

unread,
Oct 29, 2025, 10:38:22 AMOct 29
to Marco Elver, Andrew Morton, Christoph Lameter, David Rientjes, Roman Gushchin, Harry Yoo, Uladzislau Rezki, Liam R. Howlett, Suren Baghdasaryan, Sebastian Andrzej Siewior, Alexei Starovoitov, linu...@kvack.org, linux-...@vger.kernel.org, linux-r...@lists.linux.dev, b...@vger.kernel.org, kasa...@googlegroups.com, Alexander Potapenko, Dmitry Vyukov
>> @@ -7457,6 +7458,20 @@ int kmem_cache_alloc_bulk_noprof(struct kmem_cache *s, gfp_t flags, size_t size,
>> if (unlikely(!s))
>> return 0;
>>
>> + /*
>> + * to make things simpler, only assume at most one kfence allocated
>> + * object per bulk allocation and choose its index randomly
>> + */

Here's a comment...

>> + kfence_obj = kfence_alloc(s, s->object_size, flags);
>> +
>> + if (unlikely(kfence_obj)) {
>> + if (unlikely(size == 1)) {
>> + p[0] = kfence_obj;
>> + goto out;
>> + }
>> + size--;
>> + }
>> +
>> if (s->cpu_sheaves)
>> i = alloc_from_pcs_bulk(s, size, p);
>>
>> @@ -7468,10 +7483,23 @@ int kmem_cache_alloc_bulk_noprof(struct kmem_cache *s, gfp_t flags, size_t size,
>> if (unlikely(__kmem_cache_alloc_bulk(s, flags, size - i, p + i) == 0)) {
>> if (i > 0)
>> __kmem_cache_free_bulk(s, i, p);
>> + if (kfence_obj)
>> + __kfence_free(kfence_obj);
>> return 0;
>> }
>> }
>>
>> + if (unlikely(kfence_obj)) {
>
> Might be nice to briefly write a comment here in code as well instead
> of having to dig through the commit logs.

... is the one above enough? The commit log doesn't have much more on this
aspect. Or what would you add?

> The tests still pass? (CONFIG_KFENCE_KUNIT_TEST=y)

They do.

Thanks,
Vlastimil

Vlastimil Babka

unread,
Oct 29, 2025, 11:00:41 AMOct 29
to Chris Mason, Andrew Morton, Christoph Lameter, David Rientjes, Roman Gushchin, Harry Yoo, Uladzislau Rezki, Liam R. Howlett, Suren Baghdasaryan, Sebastian Andrzej Siewior, Alexei Starovoitov, linu...@kvack.org, linux-...@vger.kernel.org, linux-r...@lists.linux.dev, b...@vger.kernel.org, kasa...@googlegroups.com
Yes.

> kmem_cache_alloc_from_sheaf() to potentially non-pfmemalloc callers.

The assumption is the caller will use the prefilled sheaf for its purposes
and not pass it to other callers. The reason for caring about pfmemalloc and
setting sheaf->pfmemalloc is only to recognize them when the prefilled sheaf
is returned - so that it's flushed+freed and not attached as pcs->spare,
which would then be available to other non-pfmemalloc callers.

But we always flush oversize sheaves when those are returned, so it's not
necessary to also track pfmemalloc for them. I'll add a comment about it.

Thanks,
Vlastimil

Marco Elver

unread,
Oct 29, 2025, 11:30:42 AMOct 29
to Vlastimil Babka, Andrew Morton, Christoph Lameter, David Rientjes, Roman Gushchin, Harry Yoo, Uladzislau Rezki, Liam R. Howlett, Suren Baghdasaryan, Sebastian Andrzej Siewior, Alexei Starovoitov, linu...@kvack.org, linux-...@vger.kernel.org, linux-r...@lists.linux.dev, b...@vger.kernel.org, kasa...@googlegroups.com, Alexander Potapenko, Dmitry Vyukov
Good enough - thanks.

> > The tests still pass? (CONFIG_KFENCE_KUNIT_TEST=y)
>
> They do.

Great.

Thanks,
-- Marco

Vlastimil Babka

unread,
Oct 29, 2025, 11:42:46 AMOct 29
to Harry Yoo, Andrew Morton, Christoph Lameter, David Rientjes, Roman Gushchin, Uladzislau Rezki, Liam R. Howlett, Suren Baghdasaryan, Sebastian Andrzej Siewior, Alexei Starovoitov, linu...@kvack.org, linux-...@vger.kernel.org, linux-r...@lists.linux.dev, b...@vger.kernel.org, kasa...@googlegroups.com
On 10/27/25 01:24, Harry Yoo wrote:
> On Thu, Oct 23, 2025 at 03:52:27PM +0200, Vlastimil Babka wrote:
>> In the first step to replace cpu (partial) slabs with sheaves, enable
>> sheaves for almost all caches. Treat args->sheaf_capacity as a minimum,
>> and calculate sheaf capacity with a formula that roughly follows the
>> formula for number of objects in cpu partial slabs in set_cpu_partial().
>
> Should we scale sheaf capacity not only based on object size but also
> on the number of CPUs, like calculate_order() does?

We can try that as a follow-up. Right now it's trying to roughly match the
pre-existing amount of caching so that bots hopefully won't report
regressions just because it became smaller (like we've already seen for
maple nodes).
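
For reference, a minimal sketch of the kind of size-tiered heuristic meant
here, loosely modeled on the object-count tiers SLUB has used in
set_cpu_partial(), with args->sheaf_capacity treated as a minimum per the
commit message. The function name and exact thresholds below are
illustrative only, not the series' actual calculate_sheaf_capacity():

/*
 * Illustrative sketch only - not the actual calculate_sheaf_capacity().
 * Capacity grows as objects get smaller, and the caller-requested
 * minimum acts as a lower bound.
 */
static unsigned int sheaf_capacity_sketch(unsigned int obj_size,
					  unsigned int requested_min)
{
	unsigned int cap;

	if (obj_size >= 4096)		/* roughly PAGE_SIZE objects */
		cap = 6;
	else if (obj_size >= 1024)
		cap = 24;
	else if (obj_size >= 256)
		cap = 52;
	else
		cap = 120;

	return cap > requested_min ? cap : requested_min;
}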

Vlastimil Babka

unread,
Oct 29, 2025, 11:51:06 AMOct 29
to Chris Mason, Andrew Morton, Christoph Lameter, David Rientjes, Roman Gushchin, Harry Yoo, Uladzislau Rezki, Liam R. Howlett, Suren Baghdasaryan, Sebastian Andrzej Siewior, Alexei Starovoitov, linu...@kvack.org, linux-...@vger.kernel.org, linux-r...@lists.linux.dev, b...@vger.kernel.org, kasa...@googlegroups.com
Good catch! I will leave the condition in __kmem_cache_release().
Thanks!

Chris Mason

unread,
Oct 29, 2025, 12:06:48 PMOct 29
to Vlastimil Babka, Andrew Morton, Christoph Lameter, David Rientjes, Roman Gushchin, Harry Yoo, Uladzislau Rezki, Liam R. Howlett, Suren Baghdasaryan, Sebastian Andrzej Siewior, Alexei Starovoitov, linu...@kvack.org, linux-...@vger.kernel.org, linux-r...@lists.linux.dev, b...@vger.kernel.org, kasa...@googlegroups.com
Oh I see, this makes sense. Thanks!

-chris

Vlastimil Babka

unread,
Oct 29, 2025, 1:30:04 PMOct 29
to Chris Mason, Andrew Morton, Christoph Lameter, David Rientjes, Roman Gushchin, Harry Yoo, Uladzislau Rezki, Liam R. Howlett, Suren Baghdasaryan, Sebastian Andrzej Siewior, Alexei Starovoitov, linu...@kvack.org, linux-...@vger.kernel.org, linux-r...@lists.linux.dev, b...@vger.kernel.org, kasa...@googlegroups.com
Good catch, cool it can find such bugs.
I'll return ERR_PTR(-EBUSY) which should be compatible with the callers.

Vlastimil Babka

unread,
Oct 29, 2025, 1:46:13 PMOct 29
to Alexei Starovoitov, Chris Mason, Andrew Morton, Christoph Lameter, David Rientjes, Roman Gushchin, Harry Yoo, Uladzislau Rezki, Liam R. Howlett, Suren Baghdasaryan, Sebastian Andrzej Siewior, Alexei Starovoitov, linux-mm, LKML, linux-r...@lists.linux.dev, bpf, kasan-dev
On 10/24/25 21:43, Alexei Starovoitov wrote:
> On Thu, Oct 23, 2025 at 6:53 AM Vlastimil Babka <vba...@suse.cz> wrote:
>>
>> Before we enable percpu sheaves for kmalloc caches, we need to make sure
>> kmalloc_nolock() and kfree_nolock() will continue working properly and
>> not spin when not allowed to.
>>
>> Percpu sheaves themselves use local_trylock() so they are already
>> compatible. We just need to be careful with the barn->lock spin_lock.
>> Pass a new allow_spin parameter where necessary to use
>> spin_trylock_irqsave().
>>
>> In kmalloc_nolock_noprof() we can now attempt alloc_from_pcs() safely,
>> for now it will always fail until we enable sheaves for kmalloc caches
>> next. Similarly in kfree_nolock() we can attempt free_to_pcs().
>>
>> Signed-off-by: Vlastimil Babka <vba...@suse.cz>
...>> @@ -5720,6 +5735,13 @@ void *kmalloc_nolock_noprof(size_t size, gfp_t gfp_flags, int node)
>> */
>> return NULL;
>>
>> + ret = alloc_from_pcs(s, alloc_gfp, node);
>> +
>
> I would remove the empty line here.

Ack.

>> @@ -6093,6 +6117,11 @@ __pcs_replace_full_main(struct kmem_cache *s, struct slub_percpu_sheaves *pcs)
>> return pcs;
>> }
>>
>> + if (!allow_spin) {
>> + local_unlock(&s->cpu_sheaves->lock);
>> + return NULL;
>> + }
>
> and would add a comment here to elaborate that the next
> steps like sheaf_flush_unused() and alloc_empty_sheaf()
> cannot handle !allow_spin.

Will do.
I'll put it on the TODO list. But I think it's just following the pattern of
what you did in all the work leading to kmalloc_nolock() :)
And it's a single bool for an internal function with limited exposure, so it
might be overkill. Will see.

> Aside from that I got worried that sheaves fast path
> may be not optimized well by the compiler:
> if (unlikely(pcs->main->size == 0)) ...
> object = pcs->main->objects[pcs->main->size - 1];
> // object is accessed here

only by virt_to_folio() which takes a const void *x and is probably inlined
anyway...

> pcs->main->size--;
>
> since object may alias into pcs->main and the compiler
> may be tempted to reload 'main'.

Interesting, I wouldn't have thought about the possibility.

> Looks like it's fine, since object point is not actually read or written.

Wonder if it figures that out or just assumes it would be undefined
behavior (or would we need strict aliasing to allow the assumption?). But
good to know it looks ok, thanks!

Vlastimil Babka

unread,
Oct 29, 2025, 4:06:15 PMOct 29
to Harry Yoo, Andrew Morton, Christoph Lameter, David Rientjes, Roman Gushchin, Uladzislau Rezki, Liam R. Howlett, Suren Baghdasaryan, Sebastian Andrzej Siewior, Alexei Starovoitov, linu...@kvack.org, linux-...@vger.kernel.org, linux-r...@lists.linux.dev, b...@vger.kernel.org, kasa...@googlegroups.com
On 10/27/25 07:12, Harry Yoo wrote:
>> @@ -8549,6 +8559,74 @@ static struct kmem_cache * __init bootstrap(struct kmem_cache *static_cache)
>> return s;
>> }
>>
>> +/*
>> + * Finish the sheaves initialization done normally by init_percpu_sheaves() and
>> + * init_kmem_cache_nodes(). For normal kmalloc caches we have to bootstrap it
>> + * since sheaves and barns are allocated by kmalloc.
>> + */
>> +static void __init bootstrap_cache_sheaves(struct kmem_cache *s)
>> +{
>> + struct kmem_cache_args empty_args = {};
>> + unsigned int capacity;
>> + bool failed = false;
>> + int node, cpu;
>> +
>> + capacity = calculate_sheaf_capacity(s, &empty_args);
>> +
>> + /* capacity can be 0 due to debugging or SLUB_TINY */
>> + if (!capacity)
>> + return;
>
> I think pcs->main should still be !NULL in this case?

It will remain to be set to bootstrap_sheaf, and with s->sheaf_capacity
things will continue to work.

Vlastimil Babka

unread,
Oct 29, 2025, 4:06:49 PMOct 29
to Harry Yoo, Andrew Morton, Christoph Lameter, David Rientjes, Roman Gushchin, Uladzislau Rezki, Liam R. Howlett, Suren Baghdasaryan, Sebastian Andrzej Siewior, Alexei Starovoitov, linu...@kvack.org, linux-...@vger.kernel.org, linux-r...@lists.linux.dev, b...@vger.kernel.org, kasa...@googlegroups.com
... s->sheaf_capacity remaining 0

Vlastimil Babka

unread,
Oct 29, 2025, 4:48:30 PMOct 29
to Harry Yoo, Andrew Morton, Christoph Lameter, David Rientjes, Roman Gushchin, Uladzislau Rezki, Liam R. Howlett, Suren Baghdasaryan, Sebastian Andrzej Siewior, Alexei Starovoitov, linu...@kvack.org, linux-...@vger.kernel.org, linux-r...@lists.linux.dev, b...@vger.kernel.org, kasa...@googlegroups.com
Hmm, I think I meant to have a break; here. That should deal with your concern below?

>> + remove_partial(n, slab);
>> +
>> + list_add(&slab->slab_list, &pc->slabs);
>> +
>> + total_free += slab_free;
>> + if (total_free >= pc->max_objects)
>> + break;
>
> It may end up iterating over all slabs in the n->partial list
> when the sum of free objects isn't exactly equal to pc->max_objects?

Good catch, thanks.
You mean we wouldn't have to do the counting? I think it wouldn't help as
the number could become stale after we record it, due to concurrent freeing.
Maybe get_freelist_nofreeze() could return it together with the freelist as
it can get both atomically.
However the main reason for the loop is not to count, but to find the
tail pointer, and I don't see a way around it?

>> +
>> + if (refilled >= max)
>> + break;
>> + }
>> +
>> + if (unlikely(!list_empty(&pc.slabs))) {
>> + struct kmem_cache_node *n = get_node(s, node);
>> +
>> + spin_lock_irqsave(&n->list_lock, flags);
>
> Do we surely know that trylock will succeed when
> we succeeded to acquire it in get_partial_node_bulk()?
>
> I think the answer is yes, but just to double check :)

Yeah as you corrected, answer is no. However I missed that
__pcs_replace_empty_main() will only let us reach here with
gfpflags_allow_blocking() true in the first place. So I didn't have to even
deal with gfpflags_allow_spinning() in get_partial_node_bulk() then. I think
it's the simplest solution.

(side note: gfpflags_allow_blocking() might be too conservative now that
sheaves will be the only caching layer; that condition could perhaps be
changed to gfpflags_allow_spinning() to allow some cheap refill).

Vlastimil Babka

unread,
Oct 29, 2025, 5:31:37 PMOct 29
to Chris Mason, Andrew Morton, Christoph Lameter, David Rientjes, Roman Gushchin, Harry Yoo, Uladzislau Rezki, Liam R. Howlett, Suren Baghdasaryan, Sebastian Andrzej Siewior, Alexei Starovoitov, linu...@kvack.org, linux-...@vger.kernel.org, linux-r...@lists.linux.dev, b...@vger.kernel.org, kasa...@googlegroups.com
On 10/24/25 16:29, Chris Mason wrote:
>> else if (!spin_trylock_irqsave(&n->list_lock, flags))
>> return NULL;
>> list_for_each_entry_safe(slab, slab2, &n->partial, slab_list) {
>> +
>> + unsigned long counters;
>> + struct slab new;
>> +
>> if (!pfmemalloc_match(slab, pc->flags))
>> continue;
>
> Can get_partial_node() return an uninitialized pointer? The variable
> 'object' is declared but never initialized. If all slabs in the partial
> list fail the pfmemalloc_match() check, the loop completes without
> setting 'object', then returns it at the end of the function.
>
> In the previous version, the equivalent 'partial' variable was explicitly
> initialized to NULL. When all slabs were skipped, NULL was returned.

Indeed, this can happen. Thanks!

Vlastimil Babka

unread,
Oct 29, 2025, 6:31:33 PMOct 29
to Alexei Starovoitov, Andrew Morton, Christoph Lameter, David Rientjes, Roman Gushchin, Harry Yoo, Uladzislau Rezki, Liam R. Howlett, Suren Baghdasaryan, Sebastian Andrzej Siewior, Alexei Starovoitov, linux-mm, LKML, linux-r...@lists.linux.dev, bpf, kasan-dev
On 10/24/25 22:43, Alexei Starovoitov wrote:
> On Thu, Oct 23, 2025 at 6:53 AM Vlastimil Babka <vba...@suse.cz> wrote:
>>
>> static bool has_pcs_used(int cpu, struct kmem_cache *s)
>> @@ -5599,21 +5429,18 @@ static void __slab_free(struct kmem_cache *s, struct slab *slab,
>> new.inuse -= cnt;
>> if ((!new.inuse || !prior) && !was_frozen) {

This line says "if slab is either becoming completely free (1), or becoming
partially free from being full (2)", and at the same time is not frozen (=
exclusively used as a c->slab by a cpu), we might need to take it off the
partial list (1) or add it there (2).

>> /* Needs to be taken off a list */
>> - if (!kmem_cache_has_cpu_partial(s) || prior) {

This line is best explained as a negation. If we have cpu partial lists, and
the slab was full and becoming partially free (case (2)) we will put it on
the cpu partial list, so we will avoid the node partial list and thus don't
need the list_lock. But that's the negation, so if the opposite is true, we
do need it.

And since we're removing the cpu partial lists, we can't put it there even
in case (2) so there's no point in testing for it.
> I'm struggling to convince myself that it's correct.

It should be per above.

> Losing '|| prior' means that we will be grabbing
> this "speculative" spin_lock much more often.
> While before the change we need spin_lock only when
> slab was partially empty
> (assuming cpu_partial was on for caches where performance matters).

That's true. But still, it should happen rarely that a slab transitions from
full to partial; it only happens on the first free after it became full. Sheaves
should make this rare and prevent degenerate corner case scenarios (slab
oscillating between partial and full with every free/alloc). AFAIK the main
benefit of partial slabs was the batching of taking slabs out from node
partial list under single list_lock and that principle remains with "slab:
add optimized sheaf refill from partial list". This avoidance of list_lock
in slab transitions from full to partial was a nice secondary benefit, but
not crucial.

But yeah, the TODOs about meaningful stats gathering and benchmarking should
answer that concern.

> Also what about later check:
> if (prior && !on_node_partial) {
> spin_unlock_irqrestore(&n->list_lock, flags);
> return;
> }

That's unaffected. It's actually for case (1), but we found it wasn't on the
list so we are not removing it. But we had to take the list_lock to
determine on_node_partial safely.

> and
> if (unlikely(!prior)) {
> add_partial(n, slab, DEACTIVATE_TO_TAIL);

This is for case (2) and we re adding it.

> Say, new.inuse == 0 then 'n' will be set,

That's case (1), so it was already on the partial list. We might just leave
it there (when n->nr_partial < s->min_partial); otherwise we goto slab_empty,
where it's removed and discarded.

> do we lose the slab?
> Because before the change it would be added to put_cpu_partial() ?

No, see above. Also the code already handled !kmem_cache_has_cpu_partial(s)
before. This patch simply assumes !kmem_cache_has_cpu_partial(s) is now
always true. You can see in __slab_free() it in fact only removes code that
became dead due to kmem_cache_has_cpu_partial(s) being now compile-time
constant false.

> but... since AI didn't find any bugs here, I must be wrong :)

It's tricky. I think we could add a "bool was_partial = (prior != NULL)" or
something to make it more obvious; that one is rather cryptic.
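
For illustration, a standalone sketch of how such a helper boolean could
spell the condition out (the struct and function names below are mine, not
the actual __slab_free() code):

#include <stdbool.h>
#include <stddef.h>

/* stand-ins for the relevant __slab_free() inputs */
struct free_ctx {
	unsigned int new_inuse;	/* slab's inuse count after this free */
	void *prior;		/* slab->freelist observed before the free */
	bool was_frozen;	/* slab was exclusively owned as a cpu slab */
};

static bool may_need_list_manipulation(const struct free_ctx *c)
{
	bool becomes_empty = (c->new_inuse == 0);	/* case (1) */
	bool was_partial = (c->prior != NULL);		/* already had free objects */

	/* !was_partial is case (2): a full slab becoming partially free */
	return (becomes_empty || !was_partial) && !c->was_frozen;
}

which is equivalent to the "(!new.inuse || !prior) && !was_frozen" test
being discussed.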

Vlastimil Babka

unread,
Oct 29, 2025, 6:44:38 PMOct 29
to Alexei Starovoitov, Andrew Morton, Christoph Lameter, David Rientjes, Roman Gushchin, Harry Yoo, Uladzislau Rezki, Liam R. Howlett, Suren Baghdasaryan, Sebastian Andrzej Siewior, Alexei Starovoitov, linux-mm, LKML, linux-r...@lists.linux.dev, bpf, kasan-dev
On 10/25/25 00:32, Alexei Starovoitov wrote:
> On Thu, Oct 23, 2025 at 6:53 AM Vlastimil Babka <vba...@suse.cz> wrote:
>> @@ -6444,8 +6316,13 @@ void kfree_nolock(const void *object)
>> * since kasan quarantine takes locks and not supported from NMI.
>> */
>> kasan_slab_free(s, x, false, false, /* skip quarantine */true);
>> + /*
>> + * __slab_free() can locklessly cmpxchg16 into a slab, but then it might
>> + * need to take spin_lock for further processing.
>> + * Avoid the complexity and simply add to a deferred list.
>> + */
>> if (!free_to_pcs(s, x, false))
>> - do_slab_free(s, slab, x, x, 0, _RET_IP_);
>> + defer_free(s, x);
>
> That should be rare, right?
> free_to_pcs() should have good chances to succeed,
> and pcs->spare should be there for kmalloc sheaves?

Yes.

> So trylock failure due to contention in barn_get_empty_sheaf()
> and in barn_replace_full_sheaf() should be rare.

Yeah, while of course stress tests like will-it-scale can expose nasty
corner cases.

> But needs to be benchmarked, of course.
> The current fast path cmpxchg16 in !RT is very reliable
> in my tests. Hopefully this doesn't regress.

You mean the one that doesn't go the "if (unlikely(slab != c->slab))" way?
Well, that unlikely() there might be quite misleading. It will hold when the
free follows shortly after the alloc. If not, c->slab can be exhausted and
replaced with a new one. Or the process is migrated to another cpu before
freeing. The probability of slab == c->slab staying true drops quickly.

So if your tests were doing frees shortly after alloc, you would indeed be
hitting it reliably, but is it representative?
However sheaves should work reliably with such a pattern as well, so if
some real code really does that significantly, it will not regress.

Harry Yoo

unread,
Oct 29, 2025, 8:07:33 PMOct 29
to Vlastimil Babka, Andrew Morton, Christoph Lameter, David Rientjes, Roman Gushchin, Uladzislau Rezki, Liam R. Howlett, Suren Baghdasaryan, Sebastian Andrzej Siewior, Alexei Starovoitov, linu...@kvack.org, linux-...@vger.kernel.org, linux-r...@lists.linux.dev, b...@vger.kernel.org, kasa...@googlegroups.com
Yes!
Yes.

> I think it wouldn't help as
> the number could become stale after we record it, due to concurrent freeing.
> Maybe get_freelist_nofreeze() could return it together with the freelist as
> it can get both atomically.
>
> However the main reason for the loop is not to count, but to find the
> tail pointer, and I don't see a way around it?

Uh, right. Nevermind then! I don't see a way around either.

> >> +
> >> + if (refilled >= max)
> >> + break;
> >> + }
> >> +
> >> + if (unlikely(!list_empty(&pc.slabs))) {
> >> + struct kmem_cache_node *n = get_node(s, node);
> >> +
> >> + spin_lock_irqsave(&n->list_lock, flags);
> >
> > Do we surely know that trylock will succeed when
> > we succeeded to acquire it in get_partial_node_bulk()?
> >
> > I think the answer is yes, but just to double check :)
>
> Yeah as you corrected, answer is no. However I missed that
> __pcs_replace_empty_main() will only let us reach here with
> gfpflags_allow_blocking() true in the first place.

Oh right, it's already done before it's called!

As you mentioned, __pcs_replace_empty_main() already knows
gfpflags_allow_blocking() == true when calling refill_sheaf().

And bulk allocation, sheaf prefill/return cannot be called from
kmalloc/kfree_nolock() path.

> So I didn't have to even
> deal with gfpflags_allow_spinning() in get_partial_node_bulk() then. I think
> it's the simplest solution.

Right.

> (side note: gfpflags_allow_blocking() might be too conservative now that
> sheaves will be the only caching layer; that condition could perhaps be
> changed to gfpflags_allow_spinning() to allow some cheap refill).

Sounds good to me.

Harry Yoo

unread,
Oct 29, 2025, 8:11:55 PMOct 29
to Vlastimil Babka, Andrew Morton, Christoph Lameter, David Rientjes, Roman Gushchin, Uladzislau Rezki, Liam R. Howlett, Suren Baghdasaryan, Sebastian Andrzej Siewior, Alexei Starovoitov, linu...@kvack.org, linux-...@vger.kernel.org, linux-r...@lists.linux.dev, b...@vger.kernel.org, kasa...@googlegroups.com
Oh right, it's set to bootstrap_sheaf in init_percpu_sheaves() before
bootstrap_cache_sheaves() is called. Looks good then!

Alexei Starovoitov

unread,
Oct 29, 2025, 8:24:31 PMOct 29
to Vlastimil Babka, Andrew Morton, Christoph Lameter, David Rientjes, Roman Gushchin, Harry Yoo, Uladzislau Rezki, Liam R. Howlett, Suren Baghdasaryan, Sebastian Andrzej Siewior, Alexei Starovoitov, linux-mm, LKML, linux-r...@lists.linux.dev, bpf, kasan-dev
On Wed, Oct 29, 2025 at 3:44 PM Vlastimil Babka <vba...@suse.cz> wrote:
>
>
> You mean the one that doesn't go the "if (unlikely(slab != c->slab))" way?
> Well that unlikely() there might be quite misleading. It will be true when
> free follows shortly after alloc. If not, c->slab can be exhausted and
> replaced with a new one. Or the process is migrated to another cpu before
> freeing. The probability of slab == c->slab staying true drops quickly.
>
> So if your tests were doing frees shortly after alloc, you would be indeed
> hitting it reliably, but is it representative?
> However sheaves should work reliably as well too with such a pattern, so if
> some real code really does that significantly, it will not regress.

I see. The typical usage of bpf map on the tracing side is
to attach two bpf progs to begin/end of something (like function entry/exit),
then map_update() on entry that allocates an element, populate
with data, then consume this data in 2nd bpf prog on exit
that deletes the element.
So alloc/free happen in a quick succession on the same cpu.
This is, of course, just one of use cases, but it was the dominant
one during early days.
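
For the curious, a rough sketch of that pattern (the attach points, map
layout and field names are hypothetical, not taken from any real program):

#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

struct event {
	__u64 start_ns;
};

struct {
	__uint(type, BPF_MAP_TYPE_HASH);
	__uint(max_entries, 4096);
	__type(key, __u32);
	__type(value, struct event);
} inflight SEC(".maps");

SEC("kprobe/vfs_read")
int on_entry(void *ctx)
{
	__u32 tid = (__u32)bpf_get_current_pid_tgid();
	struct event e = { .start_ns = bpf_ktime_get_ns() };

	/* inserting a new element allocates it from the map's element cache */
	bpf_map_update_elem(&inflight, &tid, &e, BPF_NOEXIST);
	return 0;
}

SEC("kretprobe/vfs_read")
int on_exit(void *ctx)
{
	__u32 tid = (__u32)bpf_get_current_pid_tgid();
	struct event *e = bpf_map_lookup_elem(&inflight, &tid);

	if (e)	/* consume e->start_ns, then delete (free) the element */
		bpf_map_delete_elem(&inflight, &tid);
	return 0;
}

char LICENSE[] SEC("license") = "GPL";

So the element allocation on entry and its free on exit happen back to back
on the same CPU, which is exactly the alloc-then-quickly-free pattern above.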

Alexei Starovoitov

unread,
Oct 29, 2025, 8:26:31 PMOct 29
to Vlastimil Babka, Andrew Morton, Christoph Lameter, David Rientjes, Roman Gushchin, Harry Yoo, Uladzislau Rezki, Liam R. Howlett, Suren Baghdasaryan, Sebastian Andrzej Siewior, Alexei Starovoitov, linux-mm, LKML, linux-r...@lists.linux.dev, bpf, kasan-dev
On Wed, Oct 29, 2025 at 3:31 PM Vlastimil Babka <vba...@suse.cz> wrote:
>
> > but... since AI didn't find any bugs here, I must be wrong :)
> It's tricky. I think we could add a "bool was_partial == (prior != NULL)" or
> something to make it more obvious, that one is rather cryptic.

That would help. prior and !prior are hard to think about.
Your explanation makes sense. Thanks

Harry Yoo

unread,
Oct 30, 2025, 12:33:01 AMOct 30
to Vlastimil Babka, Andrew Morton, Christoph Lameter, David Rientjes, Roman Gushchin, Uladzislau Rezki, Liam R. Howlett, Suren Baghdasaryan, Sebastian Andrzej Siewior, Alexei Starovoitov, linu...@kvack.org, linux-...@vger.kernel.org, linux-r...@lists.linux.dev, b...@vger.kernel.org, kasa...@googlegroups.com
On Thu, Oct 23, 2025 at 03:52:32PM +0200, Vlastimil Babka wrote:
> We now rely on sheaves as the percpu caching layer and can refill them
> directly from partial or newly allocated slabs. Start removing the cpu
> (partial) slabs code, first from allocation paths.
>
> This means that any allocation not satisfied from percpu sheaves will
> end up in ___slab_alloc(), where we remove the usage of cpu (partial)
> slabs, so it will only perform get_partial() or new_slab().
>
> In get_partial_node() we used to return a slab for freezing as the cpu
> slab and to refill the partial slab. Now we only want to return a single
> object and leave the slab on the list (unless it became full). We can't
> simply reuse alloc_single_from_partial() as that assumes freeing uses
> free_to_partial_list(). Instead we need to use __slab_update_freelist()
> to work properly against a racing __slab_free().
>
> The rest of the changes is removing functions that no longer have any
> callers.
>
> Signed-off-by: Vlastimil Babka <vba...@suse.cz>
> ---
> mm/slub.c | 614 ++++++++------------------------------------------------------
> 1 file changed, 71 insertions(+), 543 deletions(-)

> diff --git a/mm/slub.c b/mm/slub.c
> index e2b052657d11..bd67336e7c1f 100644
> --- a/mm/slub.c
> +++ b/mm/slub.c
> @@ -4790,66 +4509,15 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
>
> stat(s, ALLOC_SLAB);
>
> - if (IS_ENABLED(CONFIG_SLUB_TINY) || kmem_cache_debug(s)) {
> - freelist = alloc_single_from_new_slab(s, slab, orig_size, gfpflags);
> -
> - if (unlikely(!freelist))
> - goto new_objects;
> -
> - if (s->flags & SLAB_STORE_USER)
> - set_track(s, freelist, TRACK_ALLOC, addr,
> - gfpflags & ~(__GFP_DIRECT_RECLAIM));
> -
> - return freelist;
> - }
> -
> - /*
> - * No other reference to the slab yet so we can
> - * muck around with it freely without cmpxchg
> - */
> - freelist = slab->freelist;
> - slab->freelist = NULL;
> - slab->inuse = slab->objects;
> - slab->frozen = 1;
> -
> - inc_slabs_node(s, slab_nid(slab), slab->objects);
> + freelist = alloc_single_from_new_slab(s, slab, orig_size, gfpflags);
>
> - if (unlikely(!pfmemalloc_match(slab, gfpflags) && allow_spin)) {
> - /*
> - * For !pfmemalloc_match() case we don't load freelist so that
> - * we don't make further mismatched allocations easier.
> - */
> - deactivate_slab(s, slab, get_freepointer(s, freelist));
> - return freelist;
> - }
> + if (unlikely(!freelist))
> + goto new_objects;

We may end up in an endless loop in !allow_spin case?
(e.g., kmalloc_nolock() is called in NMI context and n->list_lock is
held in the process context on the same CPU)

Allocate a new slab, but somebody is holding n->list_lock, so trylock fails,
free the slab, goto new_objects, and repeat.

Vlastimil Babka

unread,
Oct 30, 2025, 9:09:52 AMOct 30
to Harry Yoo, Andrew Morton, Christoph Lameter, David Rientjes, Roman Gushchin, Uladzislau Rezki, Liam R. Howlett, Suren Baghdasaryan, Sebastian Andrzej Siewior, Alexei Starovoitov, linu...@kvack.org, linux-...@vger.kernel.org, linux-r...@lists.linux.dev, b...@vger.kernel.org, kasa...@googlegroups.com
Ugh, yeah. However, AFAICS this possibility already exists prior to this
patch, only it's limited to SLUB_TINY/kmem_cache_debug(s). But we should fix
it in 6.18 then.
How? Grab the single object and defer deactivation of the slab minus one
object? Would work, except that for kmem_cache_debug(s) we open again a race
for inconsistency check failure, and we have to undo the simple slab freeing
fix and handle the accounting issue differently again.
Fail the allocation for the debug case to avoid the consistency check
issues? Would it be acceptable for kmalloc_nolock() users?

Vlastimil Babka

unread,
Oct 30, 2025, 9:18:53 AMOct 30
to Harry Yoo, Andrew Morton, Christoph Lameter, David Rientjes, Roman Gushchin, Uladzislau Rezki, Liam R. Howlett, Suren Baghdasaryan, Sebastian Andrzej Siewior, Alexei Starovoitov, linu...@kvack.org, linux-...@vger.kernel.org, linux-r...@lists.linux.dev, b...@vger.kernel.org, kasa...@googlegroups.com
Hm, now I realized the gfpflags_allow_blocking() check is there to make sure
we can take the local lock without trylock after obtaining a full sheaf, so
we can install it - because it should mean we're not in an interrupt
context. The fact we already succeeded with the trylock earlier should be
enough, but we'd again run into inventing ugly tricks to make lockdep happy.

Or we use trylock and have failure paths that are only possible to hit on RT
in practice...

Alexei Starovoitov

unread,
Oct 30, 2025, 11:28:08 AMOct 30
to Vlastimil Babka, Harry Yoo, Andrew Morton, Christoph Lameter, David Rientjes, Roman Gushchin, Uladzislau Rezki, Liam R. Howlett, Suren Baghdasaryan, Sebastian Andrzej Siewior, Alexei Starovoitov, linux-mm, LKML, linux-r...@lists.linux.dev, bpf, kasan-dev
You mean something like:
diff --git a/mm/slub.c b/mm/slub.c
index a8fcc7e6f25a..e9a8b75f31d7 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -4658,8 +4658,11 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
if (kmem_cache_debug(s)) {
freelist = alloc_single_from_new_slab(s, slab, orig_size, gfpflags);

- if (unlikely(!freelist))
+ if (unlikely(!freelist)) {
+ if (!allow_spin)
+ return NULL;
goto new_objects;
+ }

or I misunderstood the issue?

Vlastimil Babka

unread,
Oct 30, 2025, 11:35:56 AMOct 30
to Alexei Starovoitov, Harry Yoo, Andrew Morton, Christoph Lameter, David Rientjes, Roman Gushchin, Uladzislau Rezki, Liam R. Howlett, Suren Baghdasaryan, Sebastian Andrzej Siewior, Alexei Starovoitov, linux-mm, LKML, linux-r...@lists.linux.dev, bpf, kasan-dev
Yeah that would be the easiest solution, if you can accept the occasional
allocation failures.

Alexei Starovoitov

unread,
Oct 30, 2025, 11:59:30 AMOct 30
to Vlastimil Babka, Harry Yoo, Andrew Morton, Christoph Lameter, David Rientjes, Roman Gushchin, Uladzislau Rezki, Liam R. Howlett, Suren Baghdasaryan, Sebastian Andrzej Siewior, Alexei Starovoitov, linux-mm, LKML, linux-r...@lists.linux.dev, bpf, kasan-dev
Yeah, not worried about the slub debug case.
Let's reassess when the sheaf conversion is over.

Harry Yoo

unread,
Nov 2, 2025, 10:44:39 PMNov 2
to Vlastimil Babka, Alexei Starovoitov, Andrew Morton, Christoph Lameter, David Rientjes, Roman Gushchin, Uladzislau Rezki, Liam R. Howlett, Suren Baghdasaryan, Sebastian Andrzej Siewior, Alexei Starovoitov, linux-mm, LKML, linux-r...@lists.linux.dev, bpf, kasan-dev
Oops, right ;)

> >> How? Grab the single object and defer deactivation of the slab minus one
> >> object? Would work, except that for kmem_cache_debug(s) we open again a race
> >> for inconsistency check failure, and we have to undo the simple slab freeing
> >> fix and handle the accounting issue differently again.

> >> Fail the allocation for the debug case to avoid the consistency check
> >> issues? Would it be acceptable for kmalloc_nolock() users?

I think this should work (and is simple)!

> > You mean something like:
> > diff --git a/mm/slub.c b/mm/slub.c
> > index a8fcc7e6f25a..e9a8b75f31d7 100644
> > --- a/mm/slub.c
> > +++ b/mm/slub.c
> > @@ -4658,8 +4658,11 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
> > if (kmem_cache_debug(s)) {
> > freelist = alloc_single_from_new_slab(s, slab, orig_size, gfpflags);
> >
> > - if (unlikely(!freelist))
> > + if (unlikely(!freelist)) {
> > + if (!allow_spin)
> > + return NULL;
> > goto new_objects;
> > + }
> >
> > or I misunderstood the issue?
>
> Yeah that would be the easiest solution, if you can accept the occasional
> allocation failures.

Looks good to me.

Christoph Lameter (Ampere)

unread,
Nov 4, 2025, 5:11:21 PMNov 4
to Vlastimil Babka, Andrew Morton, David Rientjes, Roman Gushchin, Harry Yoo, Uladzislau Rezki, Liam R. Howlett, Suren Baghdasaryan, Sebastian Andrzej Siewior, Alexei Starovoitov, linu...@kvack.org, linux-...@vger.kernel.org, linux-r...@lists.linux.dev, b...@vger.kernel.org, kasa...@googlegroups.com, Alexander Potapenko, Marco Elver, Dmitry Vyukov
On Thu, 23 Oct 2025, Vlastimil Babka wrote:

> Besides (hopefully) improved performance, this removes the rather
> complicated code related to the lockless fastpaths (using
> this_cpu_try_cmpxchg128/64) and its complications with PREEMPT_RT or
> kmalloc_nolock().

Going back to a strict LIFO scheme for alloc/free removes the following
performance features:

1. Objects are served randomly from a variety of slab pages instead of
serving all available objects from a single slab page and then from the
next. This means that the objects require a larger set of TLB entries to
cover. TLB pressure will increase.

2. The number of partial slabs will increase since the free objects in a
partial page are not used up before moving onto the next. Instead free
objects from random slab pages are used.

Spatial object locality is reduced. Temporal object hotness increases.

> The lockless slab freelist+counters update operation using
> try_cmpxchg128/64 remains and is crucial for freeing remote NUMA objects
> without repeating the "alien" array flushing of SLUB, and to allow
> flushing objects from sheaves to slabs mostly without the node
> list_lock.

Hmm... So potential cache hot objects are lost that way and reused on
another node next. The role of the alien caches in SLAB was to cover that
case and we saw performance regressions without these caches.

The method of freeing still reduces the number of remote partial slabs
that have to be managed and increases the locality of the objects.

Vlastimil Babka

unread,
Nov 5, 2025, 4:05:34 AMNov 5
to Andrew Morton, Christoph Lameter, David Rientjes, Roman Gushchin, Harry Yoo, Liam R. Howlett, Suren Baghdasaryan, Alexei Starovoitov, linu...@kvack.org, linux-...@vger.kernel.org, b...@vger.kernel.org, kasa...@googlegroups.com, Vlastimil Babka, Alexander Potapenko, Marco Elver, Dmitry Vyukov
SLUB's internal bulk allocation __kmem_cache_alloc_bulk() can currently
allocate some objects from KFENCE, i.e. when refilling a sheaf. It works
but it's conceptually the wrong layer, as KFENCE allocations should only
happen when objects are actually handed out from slab to its users.

Currently for sheaf-enabled caches, slab_alloc_node() can return KFENCE
object via kfence_alloc(), but also via alloc_from_pcs() when a sheaf
was refilled with KFENCE objects. Continuing like this would also
complicate the upcoming sheaf refill changes.

Thus remove KFENCE allocation from __kmem_cache_alloc_bulk() and move it
to the places that return slab objects to users. slab_alloc_node() is
already covered (see above). Add kfence_alloc() to
kmem_cache_alloc_from_sheaf() to handle KFENCE allocations from
prefilled sheafs, with a comment that the caller should not expect the
sheaf size to decrease after every allocation because of this
possibility.

For kmem_cache_alloc_bulk() implement a different strategy to handle
KFENCE upfront and rely on internal batched operations afterwards.
Assume there will be at most one KFENCE allocation per bulk allocation
and then assign its index in the array of objects randomly.

Cc: Alexander Potapenko <gli...@google.com>
Cc: Marco Elver <el...@google.com>
Cc: Dmitry Vyukov <dvy...@google.com>
Signed-off-by: Vlastimil Babka <vba...@suse.cz>
---
mm/slub.c | 44 ++++++++++++++++++++++++++++++++++++--------
1 file changed, 36 insertions(+), 8 deletions(-)

diff --git a/mm/slub.c b/mm/slub.c
index 074abe8e79f8..0237a329d4e5 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -5540,6 +5540,9 @@ int kmem_cache_refill_sheaf(struct kmem_cache *s, gfp_t gfp,
*
* The gfp parameter is meant only to specify __GFP_ZERO or __GFP_ACCOUNT
* memcg charging is forced over limit if necessary, to avoid failure.
+ *
+ * It is possible that the allocation comes from kfence and then the sheaf
+ * size is not decreased.
*/
void *
kmem_cache_alloc_from_sheaf_noprof(struct kmem_cache *s, gfp_t gfp,
@@ -5551,7 +5554,10 @@ kmem_cache_alloc_from_sheaf_noprof(struct kmem_cache *s, gfp_t gfp,
if (sheaf->size == 0)
goto out;

- ret = sheaf->objects[--sheaf->size];
+ ret = kfence_alloc(s, s->object_size, gfp);
+
+ if (likely(!ret))
+ ret = sheaf->objects[--sheaf->size];

init = slab_want_init_on_alloc(gfp, s);

@@ -7399,14 +7405,8 @@ int __kmem_cache_alloc_bulk(struct kmem_cache *s, gfp_t flags, size_t size,
local_lock_irqsave(&s->cpu_slab->lock, irqflags);

for (i = 0; i < size; i++) {
- void *object = kfence_alloc(s, s->object_size, flags);
-
- if (unlikely(object)) {
- p[i] = object;
- continue;
- }
+ void *object = c->freelist;

- object = c->freelist;
if (unlikely(!object)) {
/*
* We may have removed an object from c->freelist using
@@ -7487,6 +7487,7 @@ int kmem_cache_alloc_bulk_noprof(struct kmem_cache *s, gfp_t flags, size_t size,
void **p)
{
unsigned int i = 0;
+ void *kfence_obj;

if (!size)
return 0;
@@ -7495,6 +7496,20 @@ int kmem_cache_alloc_bulk_noprof(struct kmem_cache *s, gfp_t flags, size_t size,
if (unlikely(!s))
return 0;

+ /*
+ * to make things simpler, only assume at most one kfence allocated
+ * object per bulk allocation and choose its index randomly
+ */
+ kfence_obj = kfence_alloc(s, s->object_size, flags);
+
+ if (unlikely(kfence_obj)) {
+ if (unlikely(size == 1)) {
+ p[0] = kfence_obj;
+ goto out;
+ }
+ size--;
+ }
+
if (s->cpu_sheaves)
i = alloc_from_pcs_bulk(s, size, p);

@@ -7506,10 +7521,23 @@ int kmem_cache_alloc_bulk_noprof(struct kmem_cache *s, gfp_t flags, size_t size,
if (unlikely(__kmem_cache_alloc_bulk(s, flags, size - i, p + i) == 0)) {
if (i > 0)
__kmem_cache_free_bulk(s, i, p);
+ if (kfence_obj)
+ __kfence_free(kfence_obj);
return 0;
}
}

+ if (unlikely(kfence_obj)) {
+ int idx = get_random_u32_below(size + 1);
+
+ if (idx != size)
+ p[size] = p[idx];
+ p[idx] = kfence_obj;
+
+ size++;
+ }
+
+out:
/*
* memcg and kmem_cache debug support and memory initialization.
* Done outside of the IRQ disabled fastpath loop.

--
2.51.1
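
(A small userspace model of the random-placement trick in
kmem_cache_alloc_bulk_noprof() above, just to illustrate the
p[size] = p[idx] swap; plain C, not kernel code, with made-up values and
rand() standing in for get_random_u32_below():)

#include <stdio.h>
#include <stdlib.h>

int main(void)
{
	/* pretend these pointers came from the bulk/sheaf allocation */
	int p[4] = { 1, 2, 3, 0 };
	size_t size = 3;
	int kfence_obj = 42;

	/*
	 * pick a position in [0, size]; moving the displaced entry to the
	 * end keeps all original objects and places the new one at a
	 * uniformly random index
	 */
	size_t idx = (size_t)rand() % (size + 1);

	if (idx != size)
		p[size] = p[idx];
	p[idx] = kfence_obj;
	size++;

	for (size_t i = 0; i < size; i++)
		printf("%d ", p[i]);
	printf("\n");
	return 0;
}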

Alexei Starovoitov

unread,
Nov 5, 2025, 9:39:24 PMNov 5
to Vlastimil Babka, Andrew Morton, Christoph Lameter, David Rientjes, Roman Gushchin, Harry Yoo, Liam R. Howlett, Suren Baghdasaryan, Alexei Starovoitov, linux-mm, LKML, bpf, kasan-dev, Alexander Potapenko, Marco Elver, Dmitry Vyukov
Judging by this direction, do you plan to add it to kmalloc/alloc_from_pcs too?
If so it will break the sheaves+kmalloc_nolock approach in
your prior patch set, since kfence_alloc() is not trylock-ed.
Or will this stay kmem_cache specific?

Vlastimil Babka

unread,
Nov 6, 2025, 2:23:49 AMNov 6
to Alexei Starovoitov, Andrew Morton, Christoph Lameter, David Rientjes, Roman Gushchin, Harry Yoo, Liam R. Howlett, Suren Baghdasaryan, Alexei Starovoitov, linux-mm, LKML, bpf, kasan-dev, Alexander Potapenko, Marco Elver, Dmitry Vyukov
No, kmem_cache_alloc_from_sheaf() is a new API for use cases like maple
tree, it's different from the internal alloc_from_pcs() caching.

> If so it will break sheaves+kmalloc_nolock approach in
> your prior patch set, since kfence_alloc() is not trylock-ed.
> Or this will stay kmem_cache specific?

I rechecked the result of the full RFC and kfence_alloc() didn't appear in
kmalloc_nolock() path. I would say this patch moved it rather in the
opposite direction, away from internal layers that could end up in
kmalloc_nolock() path when kmalloc caches have sheaves.

Harry Yoo

unread,
Nov 10, 2025, 3:06:36 AMNov 10
to Vlastimil Babka, Andrew Morton, Christoph Lameter, David Rientjes, Roman Gushchin, Liam R. Howlett, Suren Baghdasaryan, Alexei Starovoitov, linu...@kvack.org, linux-...@vger.kernel.org, b...@vger.kernel.org, kasa...@googlegroups.com, Alexander Potapenko, Marco Elver, Dmitry Vyukov
Looks good to me,
Reviewed-by: Harry Yoo <harr...@oracle.com>