[PATCH 00/40] Memory allocation profiling

Suren Baghdasaryan

May 1, 2023, 12:55:08 PM
Memory allocation profiling infrastructure provides a low-overhead
mechanism to make all kernel allocations in the system visible. It can
be used to monitor memory usage, track memory hotspots, detect memory
leaks, and identify memory regressions.

To keep the overhead to a minimum, we record only allocation sizes for
every allocation in the codebase. If users are interested in more
detailed context for a specific allocation, they can enable in-depth
context tracking, which captures the pid, tgid, task name, allocation
size, timestamp and call stack for every allocation at the specified
code location.

The data is exposed to user space via a read-only debugfs file called
allocations. Usage example:

$ sort -hr /sys/kernel/debug/allocations | head
153MiB 8599 mm/slub.c:1826 module:slub func:alloc_slab_page
6.08MiB 49 mm/slab_common.c:950 module:slab_common func:_kmalloc_order
5.09MiB 6335 mm/memcontrol.c:2814 module:memcontrol func:alloc_slab_obj_exts
4.54MiB 78 mm/page_alloc.c:5777 module:page_alloc func:alloc_pages_exact
1.32MiB 338 include/asm-generic/pgalloc.h:63 module:pgtable func:__pte_alloc_one
1.16MiB 603 fs/xfs/xfs_log_priv.h:700 module:xfs func:xlog_kvmalloc
1.00MiB 256 mm/swap_cgroup.c:48 module:swap_cgroup func:swap_cgroup_prepare
734KiB 5380 fs/xfs/kmem.c:20 module:xfs func:kmem_alloc
640KiB 160 kernel/rcu/tree.c:3184 module:tree func:fill_page_cache_func
640KiB 160 drivers/char/virtio_console.c:452 module:virtio_console func:alloc_buf

For allocation context capture, a new debugfs file called allocations.ctx
is used to select which code location should capture allocation context
and to read captured context information. Usage example:

$ cd /sys/kernel/debug/
$ echo "file include/asm-generic/pgalloc.h line 63 enable" > allocations.ctx
$ cat allocations.ctx
920KiB 230 include/asm-generic/pgalloc.h:63 module:pgtable func:__pte_alloc_one
size: 4096
pid: 1474
tgid: 1474
comm: bash
ts: 175332940994
call stack:
pte_alloc_one+0xfe/0x130
__pte_alloc+0x22/0xb0
copy_page_range+0x842/0x1640
dup_mm+0x42d/0x580
copy_process+0xfb1/0x1ac0
kernel_clone+0x92/0x3e0
__do_sys_clone+0x66/0x90
do_syscall_64+0x38/0x90
entry_SYSCALL_64_after_hwframe+0x63/0xcd
...

The implementation utilizes a more generic concept of code tagging,
introduced as part of this patchset. A code tag is a structure
identifying a specific location in the source code; it is generated at
compile time and can be embedded in an application-specific structure.
A number of applications for code tagging were presented in the
original RFC [1]. Code tagging uses the old trick of defining a special
ELF section for objects of a given type so that they can be iterated
over at runtime, and turns it into a proper library.
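
In essence (a minimal sketch with made-up names, not the patchset's
actual API), the trick looks like this:

/* Each DEFINE_MY_TAG() expansion drops one tag into the "my_tags"
 * ELF section; the boundary symbols are provided by the linker (or
 * the linker script) for the section.
 */
struct my_tag {
	const char *file;
	int line;
};

#define DEFINE_MY_TAG(name) \
	static struct my_tag name __used __section("my_tags") = \
		{ __FILE__, __LINE__ }

extern struct my_tag __start_my_tags[], __stop_my_tags[];

static void for_each_my_tag(void (*fn)(struct my_tag *))
{
	struct my_tag *tag;

	for (tag = __start_my_tags; tag < __stop_my_tags; tag++)
		fn(tag);
}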

To profile memory allocations, we instrument the page, slab and percpu
allocators to record the total memory allocated in the code tag
associated with each allocation site. Every time an instrumented
allocator performs an allocation, the code tag at that location
increments its counter by the allocation size; every time the memory is
freed, the counter is decremented. To decrement the counter upon
freeing, the allocated object needs a reference to its code tag. Page
allocators use page_ext to record this reference, while slab allocators
use the memcg_data field (renamed to the more generic slabobj_ext) of
the slab page.
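
Conceptually (a simplified sketch with hypothetical helpers, not the
series' actual functions), the accounting looks like:

/* One counter per allocation callsite; obj_ref lives in page_ext or
 * slabobj_ext and remembers which callsite allocated the object.
 */
struct tag_counter {
	atomic64_t bytes;
};

static void account_alloc(struct tag_counter *tag,
			  struct tag_counter **obj_ref, size_t bytes)
{
	atomic64_add(bytes, &tag->bytes);
	*obj_ref = tag;
}

static void account_free(struct tag_counter **obj_ref, size_t bytes)
{
	if (*obj_ref) {
		atomic64_sub(bytes, &(*obj_ref)->bytes);
		*obj_ref = NULL;
	}
}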

Module allocations are accounted the same way as other kernel
allocations. Module loading and unloading is supported. If a module is
unloaded while one or more of its allocations is still not freed (a
rather rare condition), its data section is kept in memory so the code
tag can still be referenced when the allocation is eventually freed.

As part of this series we introduce several kernel configs:

CONFIG_CODE_TAGGING - enables the code tagging framework
CONFIG_MEM_ALLOC_PROFILING - enables memory allocation profiling
CONFIG_MEM_ALLOC_PROFILING_DEBUG - enables memory allocation profiling
validation

Note: CONFIG_MEM_ALLOC_PROFILING selects CONFIG_PAGE_EXTENSION to store
the code tag reference in the page_ext object.

A nomem_profiling kernel command-line parameter is also provided to
disable the functionality and avoid the performance overhead.

Performance overhead:
To evaluate performance, we implemented an in-kernel test executing
multiple get_free_page/free_page and kmalloc/kfree calls with allocation
sizes growing from 8 to 240 bytes, with the CPU frequency set to max and
CPU affinity pinned to a specific CPU to minimize noise. Below is a
performance comparison between the baseline kernel, profiling disabled
(nomem_profiling=y), profiling enabled, and (for comparison purposes)
the baseline with CONFIG_MEMCG_KMEM enabled and allocations using
__GFP_ACCOUNT:

                    kmalloc             pgalloc
Baseline (6.3-rc7)  9.200s              31.050s
profiling disabled  9.800s  (+6.52%)    32.600s (+4.99%)
profiling enabled   12.500s (+35.87%)   39.010s (+25.60%)
memcg_kmem enabled  41.400s (+350.00%)  70.600s (+127.38%)

[1] https://lore.kernel.org/all/20220830214919...@google.com/

Kent Overstreet (15):
lib/string_helpers: Drop space in string_get_size's output
scripts/kallsyms: Always include __start and __stop symbols
fs: Convert alloc_inode_sb() to a macro
nodemask: Split out include/linux/nodemask_types.h
prandom: Remove unused include
lib/string.c: strsep_no_empty()
Lazy percpu counters
lib: code tagging query helper functions
mm/slub: Mark slab_free_freelist_hook() __always_inline
mempool: Hook up to memory allocation profiling
timekeeping: Fix a circular include dependency
mm: percpu: Introduce pcpuobj_ext
mm: percpu: Add codetag reference into pcpuobj_ext
arm64: Fix circular header dependency
MAINTAINERS: Add entries for code tagging and memory allocation
profiling

Suren Baghdasaryan (25):
mm: introduce slabobj_ext to support slab object extensions
mm: introduce __GFP_NO_OBJ_EXT flag to selectively prevent slabobj_ext
creation
mm/slab: introduce SLAB_NO_OBJ_EXT to avoid obj_ext creation
mm: prevent slabobj_ext allocations for slabobj_ext and kmem_cache
objects
slab: objext: introduce objext_flags as extension to
page_memcg_data_flags
lib: code tagging framework
lib: code tagging module support
lib: prevent module unloading if memory is not freed
lib: add allocation tagging support for memory allocation profiling
lib: introduce support for page allocation tagging
change alloc_pages name in dma_map_ops to avoid name conflicts
mm: enable page allocation tagging
mm/page_ext: enable early_page_ext when
CONFIG_MEM_ALLOC_PROFILING_DEBUG=y
mm: create new codetag references during page splitting
lib: add codetag reference into slabobj_ext
mm/slab: add allocation accounting into slab allocation and free paths
mm/slab: enable slab allocation tagging for kmalloc and friends
mm: percpu: enable per-cpu allocation tagging
move stack capture functionality into a separate function for reuse
lib: code tagging context capture support
lib: implement context capture support for tagged allocations
lib: add memory allocations report in show_mem()
codetag: debug: skip objext checking when it's for objext itself
codetag: debug: mark codetags for reserved pages as empty
codetag: debug: introduce OBJEXTS_ALLOC_FAIL to mark failed slab_ext
allocations

.../admin-guide/kernel-parameters.txt | 2 +
MAINTAINERS | 22 +
arch/arm64/include/asm/spectre.h | 4 +-
arch/x86/kernel/amd_gart_64.c | 2 +-
drivers/iommu/dma-iommu.c | 2 +-
drivers/xen/grant-dma-ops.c | 2 +-
drivers/xen/swiotlb-xen.c | 2 +-
include/asm-generic/codetag.lds.h | 14 +
include/asm-generic/vmlinux.lds.h | 3 +
include/linux/alloc_tag.h | 161 ++++++
include/linux/codetag.h | 159 ++++++
include/linux/codetag_ctx.h | 48 ++
include/linux/dma-map-ops.h | 2 +-
include/linux/fs.h | 6 +-
include/linux/gfp.h | 123 ++--
include/linux/gfp_types.h | 12 +-
include/linux/hrtimer.h | 2 +-
include/linux/lazy-percpu-counter.h | 102 ++++
include/linux/memcontrol.h | 56 +-
include/linux/mempool.h | 73 ++-
include/linux/mm.h | 8 +
include/linux/mm_types.h | 4 +-
include/linux/nodemask.h | 2 +-
include/linux/nodemask_types.h | 9 +
include/linux/page_ext.h | 1 -
include/linux/pagemap.h | 9 +-
include/linux/percpu.h | 19 +-
include/linux/pgalloc_tag.h | 95 ++++
include/linux/prandom.h | 1 -
include/linux/sched.h | 32 +-
include/linux/slab.h | 182 +++---
include/linux/slab_def.h | 2 +-
include/linux/slub_def.h | 4 +-
include/linux/stackdepot.h | 16 +
include/linux/string.h | 1 +
include/linux/time_namespace.h | 2 +
init/Kconfig | 4 +
kernel/dma/mapping.c | 4 +-
kernel/module/main.c | 25 +-
lib/Kconfig | 3 +
lib/Kconfig.debug | 26 +
lib/Makefile | 5 +
lib/alloc_tag.c | 464 +++++++++++++++
lib/codetag.c | 529 ++++++++++++++++++
lib/lazy-percpu-counter.c | 127 +++++
lib/show_mem.c | 15 +
lib/stackdepot.c | 68 +++
lib/string.c | 19 +
lib/string_helpers.c | 3 +-
mm/compaction.c | 9 +-
mm/filemap.c | 6 +-
mm/huge_memory.c | 2 +
mm/kfence/core.c | 14 +-
mm/kfence/kfence.h | 4 +-
mm/memcontrol.c | 56 +-
mm/mempolicy.c | 30 +-
mm/mempool.c | 28 +-
mm/mm_init.c | 1 +
mm/page_alloc.c | 75 ++-
mm/page_ext.c | 21 +-
mm/page_owner.c | 54 +-
mm/percpu-internal.h | 26 +-
mm/percpu.c | 122 ++--
mm/slab.c | 22 +-
mm/slab.h | 224 ++++++--
mm/slab_common.c | 95 +++-
mm/slub.c | 24 +-
mm/util.c | 10 +-
scripts/kallsyms.c | 13 +
scripts/module.lds.S | 7 +
70 files changed, 2765 insertions(+), 554 deletions(-)
create mode 100644 include/asm-generic/codetag.lds.h
create mode 100644 include/linux/alloc_tag.h
create mode 100644 include/linux/codetag.h
create mode 100644 include/linux/codetag_ctx.h
create mode 100644 include/linux/lazy-percpu-counter.h
create mode 100644 include/linux/nodemask_types.h
create mode 100644 include/linux/pgalloc_tag.h
create mode 100644 lib/alloc_tag.c
create mode 100644 lib/codetag.c
create mode 100644 lib/lazy-percpu-counter.c

--
2.40.1.495.gc816e09b53d-goog

Suren Baghdasaryan

May 1, 2023, 12:55:11 PM
From: Kent Overstreet <kent.ov...@linux.dev>

Previously, string_get_size() put a space between the number and the
units, e.g.

9.88 MiB

This changes it to

9.88MiB

which allows the output to be parsed correctly by the 'sort -h' command.
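
For example (a short sketch; the size below is an arbitrary value close
to 9.88 MiB):

{
	char buf[16];

	string_get_size(10360000, 1, STRING_UNITS_2, buf, sizeof(buf));
	pr_info("%s\n", buf);	/* now prints "9.88MiB", not "9.88 MiB" */
}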

Signed-off-by: Kent Overstreet <kent.ov...@linux.dev>
Signed-off-by: Suren Baghdasaryan <sur...@google.com>
Cc: Andy Shevchenko <an...@kernel.org>
Cc: Michael Ellerman <m...@ellerman.id.au>
Cc: Benjamin Herrenschmidt <be...@kernel.crashing.org>
Cc: Paul Mackerras <pau...@samba.org>
Cc: "Michael S. Tsirkin" <m...@redhat.com>
Cc: Jason Wang <jaso...@redhat.com>
Cc: "Noralf Trønnes" <nor...@tronnes.org>
Cc: Jens Axboe <ax...@kernel.dk>
---
lib/string_helpers.c | 3 +--
1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/lib/string_helpers.c b/lib/string_helpers.c
index 230020a2e076..593b29fece32 100644
--- a/lib/string_helpers.c
+++ b/lib/string_helpers.c
@@ -126,8 +126,7 @@ void string_get_size(u64 size, u64 blk_size, const enum string_size_units units,
else
unit = units_str[units][i];

- snprintf(buf, len, "%u%s %s", (u32)size,
- tmp, unit);
+ snprintf(buf, len, "%u%s%s", (u32)size, tmp, unit);
}
EXPORT_SYMBOL(string_get_size);

--
2.40.1.495.gc816e09b53d-goog

Suren Baghdasaryan

May 1, 2023, 12:55:13 PM
From: Kent Overstreet <kent.ov...@linux.dev>

These symbols are used to denote section boundaries: by always including
them we can unify loading sections from modules with loading built-in
sections, which leads to some significant cleanup.

Signed-off-by: Kent Overstreet <kent.ov...@linux.dev>
Signed-off-by: Suren Baghdasaryan <sur...@google.com>
---
scripts/kallsyms.c | 13 +++++++++++++
1 file changed, 13 insertions(+)

diff --git a/scripts/kallsyms.c b/scripts/kallsyms.c
index 0d2db41177b2..7b7dbeb5bd6e 100644
--- a/scripts/kallsyms.c
+++ b/scripts/kallsyms.c
@@ -203,6 +203,11 @@ static int symbol_in_range(const struct sym_entry *s,
return 0;
}

+static bool string_starts_with(const char *s, const char *prefix)
+{
+ return strncmp(s, prefix, strlen(prefix)) == 0;
+}
+
static int symbol_valid(const struct sym_entry *s)
{
const char *name = sym_name(s);
@@ -210,6 +215,14 @@ static int symbol_valid(const struct sym_entry *s)
/* if --all-symbols is not specified, then symbols outside the text
* and inittext sections are discarded */
if (!all_symbols) {
+ /*
+ * Symbols starting with __start and __stop are used to denote
+ * section boundaries, and should always be included:
+ */
+ if (string_starts_with(name, "__start_") ||
+ string_starts_with(name, "__stop_"))
+ return 1;
+
if (symbol_in_range(s, text_ranges,
ARRAY_SIZE(text_ranges)) == 0)
return 0;
--
2.40.1.495.gc816e09b53d-goog

Suren Baghdasaryan

May 1, 2023, 12:55:15 PM
From: Kent Overstreet <kent.ov...@linux.dev>

We're introducing alloc tagging, which tracks memory allocations by
callsite. Converting alloc_inode_sb() to a macro means allocations will
be attributed to the caller, which is a bit more useful.
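
To illustrate why (a hypothetical sketch, not the actual alloc-tagging
macros): a static tag defined inside a macro body is instantiated at
every expansion site, so __FILE__/__LINE__ resolve to the filesystem
caller rather than to include/linux/fs.h:

struct my_tag { const char *file; int line; };
void *my_alloc_tagged(struct my_tag *tag, struct kmem_cache *cache, gfp_t gfp);

#define tracked_cache_alloc(cache, gfp) ({ \
	static struct my_tag _t = { __FILE__, __LINE__ }; \
	my_alloc_tagged(&_t, cache, gfp); \
})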

Signed-off-by: Kent Overstreet <kent.ov...@linux.dev>
Signed-off-by: Suren Baghdasaryan <sur...@google.com>
Cc: Alexander Viro <vi...@zeniv.linux.org.uk>
---
include/linux/fs.h | 6 +-----
1 file changed, 1 insertion(+), 5 deletions(-)

diff --git a/include/linux/fs.h b/include/linux/fs.h
index 21a981680856..4905ce14db0b 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2699,11 +2699,7 @@ int setattr_should_drop_sgid(struct mnt_idmap *idmap,
* This must be used for allocating filesystems specific inodes to set
* up the inode reclaim context correctly.
*/
-static inline void *
-alloc_inode_sb(struct super_block *sb, struct kmem_cache *cache, gfp_t gfp)
-{
- return kmem_cache_alloc_lru(cache, &sb->s_inode_lru, gfp);
-}
+#define alloc_inode_sb(_sb, _cache, _gfp) kmem_cache_alloc_lru(_cache, &_sb->s_inode_lru, _gfp)

extern void __insert_inode_hash(struct inode *, unsigned long hashval);
static inline void insert_inode_hash(struct inode *inode)
--
2.40.1.495.gc816e09b53d-goog

Suren Baghdasaryan

May 1, 2023, 12:55:18 PM
From: Kent Overstreet <kent.ov...@linux.dev>

sched.h, which defines task_struct, needs nodemask_t - but sched.h is a
frequently used header and ideally shouldn't pull in any more code than
it needs to.

This splits out nodemask_types.h, which has the definition sched.h
needs; that avoids a circular header dependency in the alloc tagging
patch series and, as a bonus, should speed up kernel build times.

Signed-off-by: Kent Overstreet <kent.ov...@linux.dev>
Signed-off-by: Suren Baghdasaryan <sur...@google.com>
Cc: Ingo Molnar <mi...@redhat.com>
Cc: Peter Zijlstra <pet...@infradead.org>
---
include/linux/nodemask.h | 2 +-
include/linux/nodemask_types.h | 9 +++++++++
include/linux/sched.h | 2 +-
3 files changed, 11 insertions(+), 2 deletions(-)
create mode 100644 include/linux/nodemask_types.h

diff --git a/include/linux/nodemask.h b/include/linux/nodemask.h
index bb0ee80526b2..fda37b6df274 100644
--- a/include/linux/nodemask.h
+++ b/include/linux/nodemask.h
@@ -93,10 +93,10 @@
#include <linux/threads.h>
#include <linux/bitmap.h>
#include <linux/minmax.h>
+#include <linux/nodemask_types.h>
#include <linux/numa.h>
#include <linux/random.h>

-typedef struct { DECLARE_BITMAP(bits, MAX_NUMNODES); } nodemask_t;
extern nodemask_t _unused_nodemask_arg_;

/**
diff --git a/include/linux/nodemask_types.h b/include/linux/nodemask_types.h
new file mode 100644
index 000000000000..84c2f47c4237
--- /dev/null
+++ b/include/linux/nodemask_types.h
@@ -0,0 +1,9 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef __LINUX_NODEMASK_TYPES_H
+#define __LINUX_NODEMASK_TYPES_H
+
+#include <linux/numa.h>
+
+typedef struct { DECLARE_BITMAP(bits, MAX_NUMNODES); } nodemask_t;
+
+#endif /* __LINUX_NODEMASK_TYPES_H */
diff --git a/include/linux/sched.h b/include/linux/sched.h
index eed5d65b8d1f..35e7efdea2d9 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -20,7 +20,7 @@
#include <linux/hrtimer.h>
#include <linux/irqflags.h>
#include <linux/seccomp.h>
-#include <linux/nodemask.h>
+#include <linux/nodemask_types.h>
#include <linux/rcupdate.h>
#include <linux/refcount.h>
#include <linux/resource.h>
--
2.40.1.495.gc816e09b53d-goog

Suren Baghdasaryan

May 1, 2023, 12:55:20 PM
From: Kent Overstreet <kent.ov...@linux.dev>

prandom.h doesn't use percpu.h - this fixes some circular header issues.

Signed-off-by: Kent Overstreet <kent.ov...@linux.dev>
Signed-off-by: Suren Baghdasaryan <sur...@google.com>
---
include/linux/prandom.h | 1 -
1 file changed, 1 deletion(-)

diff --git a/include/linux/prandom.h b/include/linux/prandom.h
index f2ed5b72b3d6..f7f1e5251c67 100644
--- a/include/linux/prandom.h
+++ b/include/linux/prandom.h
@@ -10,7 +10,6 @@

#include <linux/types.h>
#include <linux/once.h>
-#include <linux/percpu.h>
#include <linux/random.h>

struct rnd_state {
--
2.40.1.495.gc816e09b53d-goog

Suren Baghdasaryan

May 1, 2023, 12:55:22 PM
From: Kent Overstreet <kent.ov...@linux.dev>

This adds a new helper which is like strsep, except that it skips empty
tokens.
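
For example (a short sketch using the new helper), empty tokens between
consecutive delimiters are skipped:

char buf[] = "a,,b,";
char *s = buf, *tok;

while ((tok = strsep_no_empty(&s, ",")))
	pr_info("%s\n", tok);	/* prints "a" then "b" */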

Signed-off-by: Kent Overstreet <kent.ov...@linux.dev>
Signed-off-by: Suren Baghdasaryan <sur...@google.com>
---
include/linux/string.h | 1 +
lib/string.c | 19 +++++++++++++++++++
2 files changed, 20 insertions(+)

diff --git a/include/linux/string.h b/include/linux/string.h
index c062c581a98b..6cd5451c262c 100644
--- a/include/linux/string.h
+++ b/include/linux/string.h
@@ -96,6 +96,7 @@ extern char * strpbrk(const char *,const char *);
#ifndef __HAVE_ARCH_STRSEP
extern char * strsep(char **,const char *);
#endif
+extern char *strsep_no_empty(char **, const char *);
#ifndef __HAVE_ARCH_STRSPN
extern __kernel_size_t strspn(const char *,const char *);
#endif
diff --git a/lib/string.c b/lib/string.c
index 3d55ef890106..dd4914baf45a 100644
--- a/lib/string.c
+++ b/lib/string.c
@@ -520,6 +520,25 @@ char *strsep(char **s, const char *ct)
EXPORT_SYMBOL(strsep);
#endif

+/**
+ * strsep_no_empty - Split a string into tokens, but don't return empty tokens
+ * @s: The string to be searched
+ * @ct: The characters to search for
+ *
+ * Like strsep(), this updates @s to point after the token, ready for the next call.
+ */
+char *strsep_no_empty(char **s, const char *ct)
+{
+ char *ret;
+
+ do {
+ ret = strsep(s, ct);
+ } while (ret && !*ret);
+
+ return ret;
+}
+EXPORT_SYMBOL_GPL(strsep_no_empty);
+
#ifndef __HAVE_ARCH_MEMSET
/**
* memset - Fill a region of memory with the given value
--
2.40.1.495.gc816e09b53d-goog

Suren Baghdasaryan

May 1, 2023, 12:55:25 PM
From: Kent Overstreet <kent.ov...@linux.dev>

This patch adds lib/lazy-percpu-counter.c, which implements counters
that start out as atomics, but lazily switch to percpu mode if the
update rate crosses some threshold (arbitrarily set at 256 per second).
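
A minimal usage sketch, using only the API added here:

struct lazy_percpu_counter c = {};	/* starts in atomic mode */

lazy_percpu_counter_add(&c, 4096);	/* upgrades itself if updates are hot */
lazy_percpu_counter_sub(&c, 1024);
pr_info("%lld\n", lazy_percpu_counter_read(&c));	/* 3072 */
lazy_percpu_counter_exit(&c);	/* frees percpu data, if allocated */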

Signed-off-by: Kent Overstreet <kent.ov...@linux.dev>
Signed-off-by: Suren Baghdasaryan <sur...@google.com>
---
include/linux/lazy-percpu-counter.h | 102 ++++++++++++++++++++++
lib/Kconfig | 3 +
lib/Makefile | 2 +
lib/lazy-percpu-counter.c | 127 ++++++++++++++++++++++++++++
4 files changed, 234 insertions(+)
create mode 100644 include/linux/lazy-percpu-counter.h
create mode 100644 lib/lazy-percpu-counter.c

diff --git a/include/linux/lazy-percpu-counter.h b/include/linux/lazy-percpu-counter.h
new file mode 100644
index 000000000000..45ca9e2ce58b
--- /dev/null
+++ b/include/linux/lazy-percpu-counter.h
@@ -0,0 +1,102 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * Lazy percpu counters:
+ * (C) 2022 Kent Overstreet
+ *
+ * Lazy percpu counters start out in atomic mode, then switch to percpu mode if
+ * the update rate crosses some threshold.
+ *
+ * This means we don't have to decide between low memory overhead atomic
+ * counters and higher performance percpu counters - we can have our cake and
+ * eat it, too!
+ *
+ * Internally we use an atomic64_t, where the low bit indicates whether we're in
+ * percpu mode, and the high 8 bits are a secondary counter that's incremented
+ * when the counter is modified - meaning 55 bits of precision are available for
+ * the counter itself.
+ */
+
+#ifndef _LINUX_LAZY_PERCPU_COUNTER_H
+#define _LINUX_LAZY_PERCPU_COUNTER_H
+
+#include <linux/atomic.h>
+#include <asm/percpu.h>
+
+struct lazy_percpu_counter {
+ atomic64_t v;
+ unsigned long last_wrap;
+};
+
+void lazy_percpu_counter_exit(struct lazy_percpu_counter *c);
+void lazy_percpu_counter_add_slowpath(struct lazy_percpu_counter *c, s64 i);
+void lazy_percpu_counter_add_slowpath_noupgrade(struct lazy_percpu_counter *c, s64 i);
+s64 lazy_percpu_counter_read(struct lazy_percpu_counter *c);
+
+/*
+ * We use the high bits of the atomic counter for a secondary counter, which is
+ * incremented every time the counter is touched. When the secondary counter
+ * wraps, we check the time the counter last wrapped, and if it was recent
+ * enough that means the update frequency has crossed our threshold and we
+ * switch to percpu mode:
+ */
+#define COUNTER_MOD_BITS 8
+#define COUNTER_MOD_MASK ~(~0ULL >> COUNTER_MOD_BITS)
+#define COUNTER_MOD_BITS_START (64 - COUNTER_MOD_BITS)
+
+/*
+ * We use the low bit of the counter to indicate whether we're in atomic mode
+ * (low bit clear), or percpu mode (low bit set, counter is a pointer to actual
+ * percpu counters):
+ */
+#define COUNTER_IS_PCPU_BIT 1
+
+static inline u64 __percpu *lazy_percpu_counter_is_pcpu(u64 v)
+{
+ if (!(v & COUNTER_IS_PCPU_BIT))
+ return NULL;
+
+ v ^= COUNTER_IS_PCPU_BIT;
+ return (u64 __percpu *)(unsigned long)v;
+}
+
+/**
+ * lazy_percpu_counter_add: Add a value to a lazy_percpu_counter
+ *
+ * @c: counter to modify
+ * @i: value to add
+ */
+static inline void lazy_percpu_counter_add(struct lazy_percpu_counter *c, s64 i)
+{
+ u64 v = atomic64_read(&c->v);
+ u64 __percpu *pcpu_v = lazy_percpu_counter_is_pcpu(v);
+
+ if (likely(pcpu_v))
+ this_cpu_add(*pcpu_v, i);
+ else
+ lazy_percpu_counter_add_slowpath(c, i);
+}
+
+/**
+ * lazy_percpu_counter_add_noupgrade: Add a value to a lazy_percpu_counter,
+ * without upgrading to percpu mode
+ *
+ * @c: counter to modify
+ * @i: value to add
+ */
+static inline void lazy_percpu_counter_add_noupgrade(struct lazy_percpu_counter *c, s64 i)
+{
+ u64 v = atomic64_read(&c->v);
+ u64 __percpu *pcpu_v = lazy_percpu_counter_is_pcpu(v);
+
+ if (likely(pcpu_v))
+ this_cpu_add(*pcpu_v, i);
+ else
+ lazy_percpu_counter_add_slowpath_noupgrade(c, i);
+}
+
+static inline void lazy_percpu_counter_sub(struct lazy_percpu_counter *c, s64 i)
+{
+ lazy_percpu_counter_add(c, -i);
+}
+
+#endif /* _LINUX_LAZY_PERCPU_COUNTER_H */
diff --git a/lib/Kconfig b/lib/Kconfig
index 5c2da561c516..7380292a8fcd 100644
--- a/lib/Kconfig
+++ b/lib/Kconfig
@@ -505,6 +505,9 @@ config ASSOCIATIVE_ARRAY

for more information.

+config LAZY_PERCPU_COUNTER
+ bool
+
config HAS_IOMEM
bool
depends on !NO_IOMEM
diff --git a/lib/Makefile b/lib/Makefile
index 876fcdeae34e..293a0858a3f8 100644
--- a/lib/Makefile
+++ b/lib/Makefile
@@ -164,6 +164,8 @@ obj-$(CONFIG_DEBUG_PREEMPT) += smp_processor_id.o
obj-$(CONFIG_DEBUG_LIST) += list_debug.o
obj-$(CONFIG_DEBUG_OBJECTS) += debugobjects.o

+obj-$(CONFIG_LAZY_PERCPU_COUNTER) += lazy-percpu-counter.o
+
obj-$(CONFIG_BITREVERSE) += bitrev.o
obj-$(CONFIG_LINEAR_RANGES) += linear_ranges.o
obj-$(CONFIG_PACKING) += packing.o
diff --git a/lib/lazy-percpu-counter.c b/lib/lazy-percpu-counter.c
new file mode 100644
index 000000000000..4f4e32c2dc09
--- /dev/null
+++ b/lib/lazy-percpu-counter.c
@@ -0,0 +1,127 @@
+// SPDX-License-Identifier: GPL-2.0-only
+
+#include <linux/atomic.h>
+#include <linux/gfp.h>
+#include <linux/jiffies.h>
+#include <linux/lazy-percpu-counter.h>
+#include <linux/percpu.h>
+
+static inline s64 lazy_percpu_counter_atomic_val(s64 v)
+{
+ /* Ensure output is sign extended properly: */
+ return (v << COUNTER_MOD_BITS) >>
+ (COUNTER_MOD_BITS + COUNTER_IS_PCPU_BIT);
+}
+
+static void lazy_percpu_counter_switch_to_pcpu(struct lazy_percpu_counter *c)
+{
+ u64 __percpu *pcpu_v = alloc_percpu_gfp(u64, GFP_ATOMIC|__GFP_NOWARN);
+ u64 old, new, v;
+
+ if (!pcpu_v)
+ return;
+
+ preempt_disable();
+ v = atomic64_read(&c->v);
+ do {
+ if (lazy_percpu_counter_is_pcpu(v)) {
+ /* Raced with another switcher; our allocation is unneeded. */
+ free_percpu(pcpu_v);
+ goto out;
+ }
+
+ old = v;
+ new = (unsigned long)pcpu_v | COUNTER_IS_PCPU_BIT;
+
+ *this_cpu_ptr(pcpu_v) = lazy_percpu_counter_atomic_val(v);
+ } while ((v = atomic64_cmpxchg(&c->v, old, new)) != old);
+out:
+ preempt_enable();
+}
+
+/**
+ * lazy_percpu_counter_exit: Free resources associated with a
+ * lazy_percpu_counter
+ *
+ * @c: counter to exit
+ */
+void lazy_percpu_counter_exit(struct lazy_percpu_counter *c)
+{
+ free_percpu(lazy_percpu_counter_is_pcpu(atomic64_read(&c->v)));
+}
+EXPORT_SYMBOL_GPL(lazy_percpu_counter_exit);
+
+/**
+ * lazy_percpu_counter_read: Read current value of a lazy_percpu_counter
+ *
+ * @c: counter to read
+ */
+s64 lazy_percpu_counter_read(struct lazy_percpu_counter *c)
+{
+ s64 v = atomic64_read(&c->v);
+ u64 __percpu *pcpu_v = lazy_percpu_counter_is_pcpu(v);
+
+ if (pcpu_v) {
+ int cpu;
+
+ v = 0;
+ for_each_possible_cpu(cpu)
+ v += *per_cpu_ptr(pcpu_v, cpu);
+ } else {
+ v = lazy_percpu_counter_atomic_val(v);
+ }
+
+ return v;
+}
+EXPORT_SYMBOL_GPL(lazy_percpu_counter_read);
+
+void lazy_percpu_counter_add_slowpath(struct lazy_percpu_counter *c, s64 i)
+{
+ u64 atomic_i;
+ u64 old, v = atomic64_read(&c->v);
+ u64 __percpu *pcpu_v;
+
+ atomic_i = i << COUNTER_IS_PCPU_BIT;
+ atomic_i &= ~COUNTER_MOD_MASK;
+ atomic_i |= 1ULL << COUNTER_MOD_BITS_START;
+
+ do {
+ pcpu_v = lazy_percpu_counter_is_pcpu(v);
+ if (pcpu_v) {
+ this_cpu_add(*pcpu_v, i);
+ return;
+ }
+
+ old = v;
+ } while ((v = atomic64_cmpxchg(&c->v, old, old + atomic_i)) != old);
+
+ if (unlikely(!(v & COUNTER_MOD_MASK))) {
+ unsigned long now = jiffies;
+
+ if (c->last_wrap &&
+ unlikely(time_after(c->last_wrap + HZ, now)))
+ lazy_percpu_counter_switch_to_pcpu(c);
+ else
+ c->last_wrap = now;
+ }
+}
+EXPORT_SYMBOL(lazy_percpu_counter_add_slowpath);
+
+void lazy_percpu_counter_add_slowpath_noupgrade(struct lazy_percpu_counter *c, s64 i)
+{
+ u64 atomic_i;
+ u64 old, v = atomic64_read(&c->v);
+ u64 __percpu *pcpu_v;
+
+ atomic_i = i << COUNTER_IS_PCPU_BIT;
+ atomic_i &= ~COUNTER_MOD_MASK;
+
+ do {
+ pcpu_v = lazy_percpu_counter_is_pcpu(v);
+ if (pcpu_v) {
+ this_cpu_add(*pcpu_v, i);
+ return;
+ }
+
+ old = v;
+ } while ((v = atomic64_cmpxchg(&c->v, old, old + atomic_i)) != old);
+}
+EXPORT_SYMBOL(lazy_percpu_counter_add_slowpath_noupgrade);
--
2.40.1.495.gc816e09b53d-goog

Suren Baghdasaryan

May 1, 2023, 12:55:28 PM
Currently slab pages can store only vectors of obj_cgroup pointers in
page->memcg_data. Introduce a slabobj_ext structure to allow more data
to be stored for each slab object. Wrap obj_cgroup into slabobj_ext to
support the current functionality while allowing slabobj_ext to be
extended in the future.
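
With this change, per-object extension data is reached through the new
helpers; for instance (a short sketch using the helpers added below,
for an object p of kmem_cache s):

struct slab *slab = virt_to_slab(p);
struct slabobj_ext *exts = slab_obj_exts(slab);

if (exts)
	exts[obj_to_index(s, slab, p)].objcg = objcg;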

Signed-off-by: Suren Baghdasaryan <sur...@google.com>
---
include/linux/memcontrol.h | 20 +++--
include/linux/mm_types.h | 4 +-
init/Kconfig | 4 +
mm/kfence/core.c | 14 ++--
mm/kfence/kfence.h | 4 +-
mm/memcontrol.c | 56 ++------------
mm/page_owner.c | 2 +-
mm/slab.h | 148 +++++++++++++++++++++++++------------
mm/slab_common.c | 47 ++++++++++++
9 files changed, 185 insertions(+), 114 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 222d7370134c..b9fd9732a52b 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -339,8 +339,8 @@ struct mem_cgroup {
extern struct mem_cgroup *root_mem_cgroup;

enum page_memcg_data_flags {
- /* page->memcg_data is a pointer to an objcgs vector */
- MEMCG_DATA_OBJCGS = (1UL << 0),
+ /* page->memcg_data is a pointer to a slabobj_ext vector */
+ MEMCG_DATA_OBJEXTS = (1UL << 0),
/* page has been accounted as a non-slab kernel page */
MEMCG_DATA_KMEM = (1UL << 1),
/* the next bit after the last actual flag */
@@ -378,7 +378,7 @@ static inline struct mem_cgroup *__folio_memcg(struct folio *folio)
unsigned long memcg_data = folio->memcg_data;

VM_BUG_ON_FOLIO(folio_test_slab(folio), folio);
- VM_BUG_ON_FOLIO(memcg_data & MEMCG_DATA_OBJCGS, folio);
+ VM_BUG_ON_FOLIO(memcg_data & MEMCG_DATA_OBJEXTS, folio);
VM_BUG_ON_FOLIO(memcg_data & MEMCG_DATA_KMEM, folio);

return (struct mem_cgroup *)(memcg_data & ~MEMCG_DATA_FLAGS_MASK);
@@ -399,7 +399,7 @@ static inline struct obj_cgroup *__folio_objcg(struct folio *folio)
unsigned long memcg_data = folio->memcg_data;

VM_BUG_ON_FOLIO(folio_test_slab(folio), folio);
- VM_BUG_ON_FOLIO(memcg_data & MEMCG_DATA_OBJCGS, folio);
+ VM_BUG_ON_FOLIO(memcg_data & MEMCG_DATA_OBJEXTS, folio);
VM_BUG_ON_FOLIO(!(memcg_data & MEMCG_DATA_KMEM), folio);

return (struct obj_cgroup *)(memcg_data & ~MEMCG_DATA_FLAGS_MASK);
@@ -496,7 +496,7 @@ static inline struct mem_cgroup *folio_memcg_check(struct folio *folio)
*/
unsigned long memcg_data = READ_ONCE(folio->memcg_data);

- if (memcg_data & MEMCG_DATA_OBJCGS)
+ if (memcg_data & MEMCG_DATA_OBJEXTS)
return NULL;

if (memcg_data & MEMCG_DATA_KMEM) {
@@ -542,7 +542,7 @@ static inline struct mem_cgroup *get_mem_cgroup_from_objcg(struct obj_cgroup *ob
static inline bool folio_memcg_kmem(struct folio *folio)
{
VM_BUG_ON_PGFLAGS(PageTail(&folio->page), &folio->page);
- VM_BUG_ON_FOLIO(folio->memcg_data & MEMCG_DATA_OBJCGS, folio);
+ VM_BUG_ON_FOLIO(folio->memcg_data & MEMCG_DATA_OBJEXTS, folio);
return folio->memcg_data & MEMCG_DATA_KMEM;
}

@@ -1606,6 +1606,14 @@ unsigned long mem_cgroup_soft_limit_reclaim(pg_data_t *pgdat, int order,
}
#endif /* CONFIG_MEMCG */

+/*
+ * Extended information for slab objects stored as an array in page->memcg_data
+ * if MEMCG_DATA_OBJEXTS is set.
+ */
+struct slabobj_ext {
+ struct obj_cgroup *objcg;
+} __aligned(8);
+
static inline void __inc_lruvec_kmem_state(void *p, enum node_stat_item idx)
{
__mod_lruvec_kmem_state(p, idx, 1);
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 306a3d1a0fa6..e79303e1e30c 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -194,7 +194,7 @@ struct page {
/* Usage count. *DO NOT USE DIRECTLY*. See page_ref.h */
atomic_t _refcount;

-#ifdef CONFIG_MEMCG
+#ifdef CONFIG_SLAB_OBJ_EXT
unsigned long memcg_data;
#endif

@@ -320,7 +320,7 @@ struct folio {
void *private;
atomic_t _mapcount;
atomic_t _refcount;
-#ifdef CONFIG_MEMCG
+#ifdef CONFIG_SLAB_OBJ_EXT
unsigned long memcg_data;
#endif
/* private: the union with struct page is transitional */
diff --git a/init/Kconfig b/init/Kconfig
index 32c24950c4ce..44267919a2a2 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -936,10 +936,14 @@ config CGROUP_FAVOR_DYNMODS

Say N if unsure.

+config SLAB_OBJ_EXT
+ bool
+
config MEMCG
bool "Memory controller"
select PAGE_COUNTER
select EVENTFD
+ select SLAB_OBJ_EXT
help
Provides control over the memory footprint of tasks in a cgroup.

diff --git a/mm/kfence/core.c b/mm/kfence/core.c
index dad3c0eb70a0..aea6fa145080 100644
--- a/mm/kfence/core.c
+++ b/mm/kfence/core.c
@@ -590,9 +590,9 @@ static unsigned long kfence_init_pool(void)
continue;

__folio_set_slab(slab_folio(slab));
-#ifdef CONFIG_MEMCG
- slab->memcg_data = (unsigned long)&kfence_metadata[i / 2 - 1].objcg |
- MEMCG_DATA_OBJCGS;
+#ifdef CONFIG_MEMCG_KMEM
+ slab->obj_exts = (unsigned long)&kfence_metadata[i / 2 - 1].obj_exts |
+ MEMCG_DATA_OBJEXTS;
#endif
}

@@ -634,8 +634,8 @@ static unsigned long kfence_init_pool(void)

if (!i || (i % 2))
continue;
-#ifdef CONFIG_MEMCG
- slab->memcg_data = 0;
+#ifdef CONFIG_MEMCG_KMEM
+ slab->obj_exts = 0;
#endif
__folio_clear_slab(slab_folio(slab));
}
@@ -1093,8 +1093,8 @@ void __kfence_free(void *addr)
{
struct kfence_metadata *meta = addr_to_metadata((unsigned long)addr);

-#ifdef CONFIG_MEMCG
- KFENCE_WARN_ON(meta->objcg);
+#ifdef CONFIG_MEMCG_KMEM
+ KFENCE_WARN_ON(meta->obj_exts.objcg);
#endif
/*
* If the objects of the cache are SLAB_TYPESAFE_BY_RCU, defer freeing
diff --git a/mm/kfence/kfence.h b/mm/kfence/kfence.h
index 2aafc46a4aaf..8e0d76c4ea2a 100644
--- a/mm/kfence/kfence.h
+++ b/mm/kfence/kfence.h
@@ -97,8 +97,8 @@ struct kfence_metadata {
struct kfence_track free_track;
/* For updating alloc_covered on frees. */
u32 alloc_stack_hash;
-#ifdef CONFIG_MEMCG
- struct obj_cgroup *objcg;
+#ifdef CONFIG_MEMCG_KMEM
+ struct slabobj_ext obj_exts;
#endif
};

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 4b27e245a055..f2a7fe718117 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2892,13 +2892,6 @@ static void commit_charge(struct folio *folio, struct mem_cgroup *memcg)
}

#ifdef CONFIG_MEMCG_KMEM
-/*
- * The allocated objcg pointers array is not accounted directly.
- * Moreover, it should not come from DMA buffer and is not readily
- * reclaimable. So those GFP bits should be masked off.
- */
-#define OBJCGS_CLEAR_MASK (__GFP_DMA | __GFP_RECLAIMABLE | __GFP_ACCOUNT)
-
/*
* mod_objcg_mlstate() may be called with irq enabled, so
* mod_memcg_lruvec_state() should be used.
@@ -2917,62 +2910,27 @@ static inline void mod_objcg_mlstate(struct obj_cgroup *objcg,
rcu_read_unlock();
}

-int memcg_alloc_slab_cgroups(struct slab *slab, struct kmem_cache *s,
- gfp_t gfp, bool new_slab)
-{
- unsigned int objects = objs_per_slab(s, slab);
- unsigned long memcg_data;
- void *vec;
-
- gfp &= ~OBJCGS_CLEAR_MASK;
- vec = kcalloc_node(objects, sizeof(struct obj_cgroup *), gfp,
- slab_nid(slab));
- if (!vec)
- return -ENOMEM;
-
- memcg_data = (unsigned long) vec | MEMCG_DATA_OBJCGS;
- if (new_slab) {
- /*
- * If the slab is brand new and nobody can yet access its
- * memcg_data, no synchronization is required and memcg_data can
- * be simply assigned.
- */
- slab->memcg_data = memcg_data;
- } else if (cmpxchg(&slab->memcg_data, 0, memcg_data)) {
- /*
- * If the slab is already in use, somebody can allocate and
- * assign obj_cgroups in parallel. In this case the existing
- * objcg vector should be reused.
- */
- kfree(vec);
- return 0;
- }
-
- kmemleak_not_leak(vec);
- return 0;
-}
-
static __always_inline
struct mem_cgroup *mem_cgroup_from_obj_folio(struct folio *folio, void *p)
{
/*
* Slab objects are accounted individually, not per-page.
* Memcg membership data for each individual object is saved in
- * slab->memcg_data.
+ * slab->obj_exts.
*/
if (folio_test_slab(folio)) {
- struct obj_cgroup **objcgs;
+ struct slabobj_ext *obj_exts;
struct slab *slab;
unsigned int off;

slab = folio_slab(folio);
- objcgs = slab_objcgs(slab);
- if (!objcgs)
+ obj_exts = slab_obj_exts(slab);
+ if (!obj_exts)
return NULL;

off = obj_to_index(slab->slab_cache, slab, p);
- if (objcgs[off])
- return obj_cgroup_memcg(objcgs[off]);
+ if (obj_exts[off].objcg)
+ return obj_cgroup_memcg(obj_exts[off].objcg);

return NULL;
}
@@ -2980,7 +2938,7 @@ struct mem_cgroup *mem_cgroup_from_obj_folio(struct folio *folio, void *p)
/*
* folio_memcg_check() is used here, because in theory we can encounter
* a folio where the slab flag has been cleared already, but
- * slab->memcg_data has not been freed yet
+ * slab->obj_exts has not been freed yet
* folio_memcg_check() will guarantee that a proper memory
* cgroup pointer or NULL will be returned.
*/
diff --git a/mm/page_owner.c b/mm/page_owner.c
index 31169b3e7f06..8b6086c666e6 100644
--- a/mm/page_owner.c
+++ b/mm/page_owner.c
@@ -372,7 +372,7 @@ static inline int print_page_owner_memcg(char *kbuf, size_t count, int ret,
if (!memcg_data)
goto out_unlock;

- if (memcg_data & MEMCG_DATA_OBJCGS)
+ if (memcg_data & MEMCG_DATA_OBJEXTS)
ret += scnprintf(kbuf + ret, count - ret,
"Slab cache page\n");

diff --git a/mm/slab.h b/mm/slab.h
index f01ac256a8f5..25d14b3a7280 100644
--- a/mm/slab.h
+++ b/mm/slab.h
@@ -57,8 +57,8 @@ struct slab {
#endif

atomic_t __page_refcount;
-#ifdef CONFIG_MEMCG
- unsigned long memcg_data;
+#ifdef CONFIG_SLAB_OBJ_EXT
+ unsigned long obj_exts;
#endif
};

@@ -67,8 +67,8 @@ struct slab {
SLAB_MATCH(flags, __page_flags);
SLAB_MATCH(compound_head, slab_cache); /* Ensure bit 0 is clear */
SLAB_MATCH(_refcount, __page_refcount);
-#ifdef CONFIG_MEMCG
-SLAB_MATCH(memcg_data, memcg_data);
+#ifdef CONFIG_SLAB_OBJ_EXT
+SLAB_MATCH(memcg_data, obj_exts);
#endif
#undef SLAB_MATCH
static_assert(sizeof(struct slab) <= sizeof(struct page));
@@ -390,36 +390,106 @@ static inline bool kmem_cache_debug_flags(struct kmem_cache *s, slab_flags_t fla
return false;
}

-#ifdef CONFIG_MEMCG_KMEM
+#ifdef CONFIG_SLAB_OBJ_EXT
+
/*
- * slab_objcgs - get the object cgroups vector associated with a slab
+ * slab_obj_exts - get the pointer to the slab object extension vector
+ * associated with a slab.
* @slab: a pointer to the slab struct
*
- * Returns a pointer to the object cgroups vector associated with the slab,
+ * Returns a pointer to the object extension vector associated with the slab,
* or NULL if no such vector has been associated yet.
*/
-static inline struct obj_cgroup **slab_objcgs(struct slab *slab)
+static inline struct slabobj_ext *slab_obj_exts(struct slab *slab)
{
- unsigned long memcg_data = READ_ONCE(slab->memcg_data);
+ unsigned long obj_exts = READ_ONCE(slab->obj_exts);

- VM_BUG_ON_PAGE(memcg_data && !(memcg_data & MEMCG_DATA_OBJCGS),
+#ifdef CONFIG_MEMCG
+ VM_BUG_ON_PAGE(obj_exts && !(obj_exts & MEMCG_DATA_OBJEXTS),
slab_page(slab));
- VM_BUG_ON_PAGE(memcg_data & MEMCG_DATA_KMEM, slab_page(slab));
+ VM_BUG_ON_PAGE(obj_exts & MEMCG_DATA_KMEM, slab_page(slab));

- return (struct obj_cgroup **)(memcg_data & ~MEMCG_DATA_FLAGS_MASK);
+ return (struct slabobj_ext *)(obj_exts & ~MEMCG_DATA_FLAGS_MASK);
+#else
+ return (struct slabobj_ext *)obj_exts;
+#endif
}

-int memcg_alloc_slab_cgroups(struct slab *slab, struct kmem_cache *s,
- gfp_t gfp, bool new_slab);
-void mod_objcg_state(struct obj_cgroup *objcg, struct pglist_data *pgdat,
- enum node_stat_item idx, int nr);
+int alloc_slab_obj_exts(struct slab *slab, struct kmem_cache *s,
+ gfp_t gfp, bool new_slab);

-static inline void memcg_free_slab_cgroups(struct slab *slab)
+static inline bool need_slab_obj_ext(void)
{
- kfree(slab_objcgs(slab));
- slab->memcg_data = 0;
+ /*
+ * CONFIG_MEMCG_KMEM creates a vector of obj_cgroup objects conditionally
+ * inside memcg_slab_post_alloc_hook. No other users for now.
+ */
+ return false;
}

+static inline void free_slab_obj_exts(struct slab *slab)
+{
+ struct slabobj_ext *obj_exts;
+
+ obj_exts = slab_obj_exts(slab);
+ if (!obj_exts)
+ return;
+
+ kfree(obj_exts);
+ slab->obj_exts = 0;
+}
+
+static inline struct slabobj_ext *
+prepare_slab_obj_exts_hook(struct kmem_cache *s, gfp_t flags, void *p)
+{
+ struct slab *slab;
+
+ if (!p)
+ return NULL;
+
+ if (!need_slab_obj_ext())
+ return NULL;
+
+ slab = virt_to_slab(p);
+ if (!slab_obj_exts(slab) &&
+ WARN(alloc_slab_obj_exts(slab, s, flags, false),
+ "%s, %s: Failed to create slab extension vector!\n",
+ __func__, s->name))
+ return NULL;
+
+ return slab_obj_exts(slab) + obj_to_index(s, slab, p);
+}
+
+#else /* CONFIG_SLAB_OBJ_EXT */
+
+static inline struct slabobj_ext *slab_obj_exts(struct slab *slab)
+{
+ return NULL;
+}
+
+static inline int alloc_slab_obj_exts(struct slab *slab,
+ struct kmem_cache *s, gfp_t gfp,
+ bool new_slab)
+{
+ return 0;
+}
+
+static inline void free_slab_obj_exts(struct slab *slab)
+{
+}
+
+static inline struct slabobj_ext *
+prepare_slab_obj_exts_hook(struct kmem_cache *s, gfp_t flags, void *p)
+{
+ return NULL;
+}
+
+#endif /* CONFIG_SLAB_OBJ_EXT */
+
+#ifdef CONFIG_MEMCG_KMEM
+void mod_objcg_state(struct obj_cgroup *objcg, struct pglist_data *pgdat,
+ enum node_stat_item idx, int nr);
+
static inline size_t obj_full_size(struct kmem_cache *s)
{
/*
@@ -487,16 +557,15 @@ static inline void memcg_slab_post_alloc_hook(struct kmem_cache *s,
if (likely(p[i])) {
slab = virt_to_slab(p[i]);

- if (!slab_objcgs(slab) &&
- memcg_alloc_slab_cgroups(slab, s, flags,
- false)) {
+ if (!slab_obj_exts(slab) &&
+ alloc_slab_obj_exts(slab, s, flags, false)) {
obj_cgroup_uncharge(objcg, obj_full_size(s));
continue;
}

off = obj_to_index(s, slab, p[i]);
obj_cgroup_get(objcg);
- slab_objcgs(slab)[off] = objcg;
+ slab_obj_exts(slab)[off].objcg = objcg;
mod_objcg_state(objcg, slab_pgdat(slab),
cache_vmstat_idx(s), obj_full_size(s));
} else {
@@ -509,14 +578,14 @@ static inline void memcg_slab_post_alloc_hook(struct kmem_cache *s,
static inline void memcg_slab_free_hook(struct kmem_cache *s, struct slab *slab,
void **p, int objects)
{
- struct obj_cgroup **objcgs;
+ struct slabobj_ext *obj_exts;
int i;

if (!memcg_kmem_online())
return;

- objcgs = slab_objcgs(slab);
- if (!objcgs)
+ obj_exts = slab_obj_exts(slab);
+ if (!obj_exts)
return;

for (i = 0; i < objects; i++) {
@@ -524,11 +593,11 @@ static inline void memcg_slab_free_hook(struct kmem_cache *s, struct slab *slab,
unsigned int off;

off = obj_to_index(s, slab, p[i]);
- objcg = objcgs[off];
+ objcg = obj_exts[off].objcg;
if (!objcg)
continue;

- objcgs[off] = NULL;
+ obj_exts[off].objcg = NULL;
obj_cgroup_uncharge(objcg, obj_full_size(s));
mod_objcg_state(objcg, slab_pgdat(slab), cache_vmstat_idx(s),
-obj_full_size(s));
@@ -537,27 +606,11 @@ static inline void memcg_slab_free_hook(struct kmem_cache *s, struct slab *slab,
}

#else /* CONFIG_MEMCG_KMEM */
-static inline struct obj_cgroup **slab_objcgs(struct slab *slab)
-{
- return NULL;
-}
-
static inline struct mem_cgroup *memcg_from_slab_obj(void *ptr)
{
return NULL;
}

-static inline int memcg_alloc_slab_cgroups(struct slab *slab,
- struct kmem_cache *s, gfp_t gfp,
- bool new_slab)
-{
- return 0;
-}
-
-static inline void memcg_free_slab_cgroups(struct slab *slab)
-{
-}
-
static inline bool memcg_slab_pre_alloc_hook(struct kmem_cache *s,
struct list_lru *lru,
struct obj_cgroup **objcgp,
@@ -594,7 +647,7 @@ static __always_inline void account_slab(struct slab *slab, int order,
struct kmem_cache *s, gfp_t gfp)
{
if (memcg_kmem_online() && (s->flags & SLAB_ACCOUNT))
- memcg_alloc_slab_cgroups(slab, s, gfp, true);
+ alloc_slab_obj_exts(slab, s, gfp, true);

mod_node_page_state(slab_pgdat(slab), cache_vmstat_idx(s),
PAGE_SIZE << order);
@@ -603,8 +656,7 @@ static __always_inline void account_slab(struct slab *slab, int order,
static __always_inline void unaccount_slab(struct slab *slab, int order,
struct kmem_cache *s)
{
- if (memcg_kmem_online())
- memcg_free_slab_cgroups(slab);
+ free_slab_obj_exts(slab);

mod_node_page_state(slab_pgdat(slab), cache_vmstat_idx(s),
-(PAGE_SIZE << order));
@@ -684,6 +736,7 @@ static inline void slab_post_alloc_hook(struct kmem_cache *s,
unsigned int orig_size)
{
unsigned int zero_size = s->object_size;
+ struct slabobj_ext *obj_exts;
size_t i;

flags &= gfp_allowed_mask;
@@ -714,6 +767,7 @@ static inline void slab_post_alloc_hook(struct kmem_cache *s,
kmemleak_alloc_recursive(p[i], s->object_size, 1,
s->flags, flags);
kmsan_slab_alloc(s, p[i], flags);
+ obj_exts = prepare_slab_obj_exts_hook(s, flags, p[i]);
}

memcg_slab_post_alloc_hook(s, objcg, flags, size, p);
diff --git a/mm/slab_common.c b/mm/slab_common.c
index 607249785c07..f11cc072b01e 100644
--- a/mm/slab_common.c
+++ b/mm/slab_common.c
@@ -204,6 +204,53 @@ struct kmem_cache *find_mergeable(unsigned int size, unsigned int align,
return NULL;
}

+#ifdef CONFIG_SLAB_OBJ_EXT
+/*
+ * The allocated objcg pointers array is not accounted directly.
+ * Moreover, it should not come from DMA buffer and is not readily
+ * reclaimable. So those GFP bits should be masked off.
+ */
+#define OBJCGS_CLEAR_MASK (__GFP_DMA | __GFP_RECLAIMABLE | __GFP_ACCOUNT)
+
+int alloc_slab_obj_exts(struct slab *slab, struct kmem_cache *s,
+ gfp_t gfp, bool new_slab)
+{
+ unsigned int objects = objs_per_slab(s, slab);
+ unsigned long obj_exts;
+ void *vec;
+
+ gfp &= ~OBJCGS_CLEAR_MASK;
+ vec = kcalloc_node(objects, sizeof(struct slabobj_ext), gfp,
+ slab_nid(slab));
+ if (!vec)
+ return -ENOMEM;
+
+ obj_exts = (unsigned long)vec;
+#ifdef CONFIG_MEMCG
+ obj_exts |= MEMCG_DATA_OBJEXTS;
+#endif
+ if (new_slab) {
+ /*
+ * If the slab is brand new and nobody can yet access its
+ * obj_exts, no synchronization is required and obj_exts can
+ * be simply assigned.
+ */
+ slab->obj_exts = obj_exts;
+ } else if (cmpxchg(&slab->obj_exts, 0, obj_exts)) {
+ /*
+ * If the slab is already in use, somebody can allocate and
+ * assign slabobj_exts in parallel. In this case the existing
+ * objcg vector should be reused.
+ */
+ kfree(vec);
+ return 0;
+ }
+
+ kmemleak_not_leak(vec);
+ return 0;
+}
+#endif /* CONFIG_SLAB_OBJ_EXT */
+
static struct kmem_cache *create_cache(const char *name,
unsigned int object_size, unsigned int align,
slab_flags_t flags, unsigned int useroffset,
--
2.40.1.495.gc816e09b53d-goog
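
The lockless publication in alloc_slab_obj_exts() above follows a
common publish-or-discard idiom; a minimal sketch of the same pattern,
for illustration only (not kernel code, error handling omitted):

  unsigned long new_exts = (unsigned long)vec;

  if (cmpxchg(&slab->obj_exts, 0, new_exts)) {
          /* Lost the race: another CPU already published its vector,
           * so discard ours and use theirs via slab_obj_exts(slab). */
          kfree(vec);
  }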

Suren Baghdasaryan

May 1, 2023, 12:55:30 PM
Introduce the __GFP_NO_OBJ_EXT flag to prevent recursive allocations
when allocating a slabobj_ext vector on a slab.
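
A minimal sketch of how the flag breaks the recursion (the actual
wiring lands later in this series): the allocation that creates the
extension vector itself carries __GFP_NO_OBJ_EXT, so the slab hooks
skip creating an extension vector for it in turn.

  /* Sketch: allocate the slabobj_ext vector without recursing into
   * the object-extension hook for the vector's own backing slab. */
  gfp |= __GFP_NO_OBJ_EXT;
  vec = kcalloc_node(objects, sizeof(struct slabobj_ext), gfp,
                     slab_nid(slab));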

Signed-off-by: Suren Baghdasaryan <sur...@google.com>
---
include/linux/gfp_types.h | 12 ++++++++++--
1 file changed, 10 insertions(+), 2 deletions(-)

diff --git a/include/linux/gfp_types.h b/include/linux/gfp_types.h
index 6583a58670c5..aab1959130f9 100644
--- a/include/linux/gfp_types.h
+++ b/include/linux/gfp_types.h
@@ -53,8 +53,13 @@ typedef unsigned int __bitwise gfp_t;
#define ___GFP_SKIP_ZERO 0
#define ___GFP_SKIP_KASAN 0
#endif
+#ifdef CONFIG_SLAB_OBJ_EXT
+#define ___GFP_NO_OBJ_EXT 0x4000000u
+#else
+#define ___GFP_NO_OBJ_EXT 0
+#endif
#ifdef CONFIG_LOCKDEP
-#define ___GFP_NOLOCKDEP 0x4000000u
+#define ___GFP_NOLOCKDEP 0x8000000u
#else
#define ___GFP_NOLOCKDEP 0
#endif
@@ -99,12 +104,15 @@ typedef unsigned int __bitwise gfp_t;
* node with no fallbacks or placement policy enforcements.
*
* %__GFP_ACCOUNT causes the allocation to be accounted to kmemcg.
+ *
+ * %__GFP_NO_OBJ_EXT causes slab allocation to have no object extension.
*/
#define __GFP_RECLAIMABLE ((__force gfp_t)___GFP_RECLAIMABLE)
#define __GFP_WRITE ((__force gfp_t)___GFP_WRITE)
#define __GFP_HARDWALL ((__force gfp_t)___GFP_HARDWALL)
#define __GFP_THISNODE ((__force gfp_t)___GFP_THISNODE)
#define __GFP_ACCOUNT ((__force gfp_t)___GFP_ACCOUNT)
+#define __GFP_NO_OBJ_EXT ((__force gfp_t)___GFP_NO_OBJ_EXT)

/**
* DOC: Watermark modifiers
@@ -249,7 +257,7 @@ typedef unsigned int __bitwise gfp_t;
#define __GFP_NOLOCKDEP ((__force gfp_t)___GFP_NOLOCKDEP)

/* Room for N __GFP_FOO bits */
-#define __GFP_BITS_SHIFT (26 + IS_ENABLED(CONFIG_LOCKDEP))
+#define __GFP_BITS_SHIFT (27 + IS_ENABLED(CONFIG_LOCKDEP))
#define __GFP_BITS_MASK ((__force gfp_t)((1 << __GFP_BITS_SHIFT) - 1))

/**
--
2.40.1.495.gc816e09b53d-goog

Suren Baghdasaryan

May 1, 2023, 12:55:32 PM
Slab extension objects can't be allocated before the slab
infrastructure is initialized, but some caches, like kmem_cache and
kmem_cache_node, are created before that point, so objects from these
caches can't have extension objects. Introduce the SLAB_NO_OBJ_EXT slab
flag to mark such caches and avoid creating extensions for objects
allocated from them.

Signed-off-by: Suren Baghdasaryan <sur...@google.com>
---
include/linux/slab.h | 7 +++++++
mm/slab.c | 2 +-
mm/slub.c | 5 +++--
3 files changed, 11 insertions(+), 3 deletions(-)

diff --git a/include/linux/slab.h b/include/linux/slab.h
index 6b3e155b70bf..99a146f3cedf 100644
--- a/include/linux/slab.h
+++ b/include/linux/slab.h
@@ -147,6 +147,13 @@
#endif
#define SLAB_TEMPORARY SLAB_RECLAIM_ACCOUNT /* Objects are short-lived */

+#ifdef CONFIG_SLAB_OBJ_EXT
+/* Slab created using create_boot_cache */
+#define SLAB_NO_OBJ_EXT ((slab_flags_t __force)0x20000000U)
+#else
+#define SLAB_NO_OBJ_EXT 0
+#endif
+
/*
* ZERO_SIZE_PTR will be returned for zero sized kmalloc requests.
*
diff --git a/mm/slab.c b/mm/slab.c
index bb57f7fdbae1..ccc76f7455e9 100644
--- a/mm/slab.c
+++ b/mm/slab.c
@@ -1232,7 +1232,7 @@ void __init kmem_cache_init(void)
create_boot_cache(kmem_cache, "kmem_cache",
offsetof(struct kmem_cache, node) +
nr_node_ids * sizeof(struct kmem_cache_node *),
- SLAB_HWCACHE_ALIGN, 0, 0);
+ SLAB_HWCACHE_ALIGN | SLAB_NO_OBJ_EXT, 0, 0);
list_add(&kmem_cache->list, &slab_caches);
slab_state = PARTIAL;

diff --git a/mm/slub.c b/mm/slub.c
index c87628cd8a9a..507b71372ee4 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -5020,7 +5020,8 @@ void __init kmem_cache_init(void)
node_set(node, slab_nodes);

create_boot_cache(kmem_cache_node, "kmem_cache_node",
- sizeof(struct kmem_cache_node), SLAB_HWCACHE_ALIGN, 0, 0);
+ sizeof(struct kmem_cache_node),
+ SLAB_HWCACHE_ALIGN | SLAB_NO_OBJ_EXT, 0, 0);

hotplug_memory_notifier(slab_memory_callback, SLAB_CALLBACK_PRI);

@@ -5030,7 +5031,7 @@ void __init kmem_cache_init(void)
create_boot_cache(kmem_cache, "kmem_cache",
offsetof(struct kmem_cache, node) +
nr_node_ids * sizeof(struct kmem_cache_node *),
- SLAB_HWCACHE_ALIGN, 0, 0);
+ SLAB_HWCACHE_ALIGN | SLAB_NO_OBJ_EXT, 0, 0);

kmem_cache = bootstrap(&boot_kmem_cache);
kmem_cache_node = bootstrap(&boot_kmem_cache_node);
--
2.40.1.495.gc816e09b53d-goog

Suren Baghdasaryan

May 1, 2023, 12:55:34 PM
Use __GFP_NO_OBJ_EXT to prevent recursion when allocating slabobj_ext
objects. Also prevent slabobj_ext allocations for kmem_cache objects.

Signed-off-by: Suren Baghdasaryan <sur...@google.com>
---
mm/slab.h | 6 ++++++
mm/slab_common.c | 2 ++
2 files changed, 8 insertions(+)

diff --git a/mm/slab.h b/mm/slab.h
index 25d14b3a7280..b1c22dc87047 100644
--- a/mm/slab.h
+++ b/mm/slab.h
@@ -450,6 +450,12 @@ prepare_slab_obj_exts_hook(struct kmem_cache *s, gfp_t flags, void *p)
if (!need_slab_obj_ext())
return NULL;

+ if (s->flags & SLAB_NO_OBJ_EXT)
+ return NULL;
+
+ if (flags & __GFP_NO_OBJ_EXT)
+ return NULL;
+
slab = virt_to_slab(p);
if (!slab_obj_exts(slab) &&
WARN(alloc_slab_obj_exts(slab, s, flags, false),
diff --git a/mm/slab_common.c b/mm/slab_common.c
index f11cc072b01e..42777d66d0e3 100644
--- a/mm/slab_common.c
+++ b/mm/slab_common.c
@@ -220,6 +220,8 @@ int alloc_slab_obj_exts(struct slab *slab, struct kmem_cache *s,
void *vec;

gfp &= ~OBJCGS_CLEAR_MASK;
+ /* Prevent recursive extension vector allocation */
+ gfp |= __GFP_NO_OBJ_EXT;
vec = kcalloc_node(objects, sizeof(struct slabobj_ext), gfp,
slab_nid(slab));
if (!vec)
--
2.40.1.495.gc816e09b53d-goog

Suren Baghdasaryan

May 1, 2023, 12:55:37 PM
Introduce an objext_flags enum to hold additional slabobj_ext flags
unrelated to memcg.
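
These flags share storage with the slabobj_ext pointer itself: the
vector is at least word aligned, so the low bits of slab->obj_exts are
guaranteed zero and can carry flag bits. A sketch of the pack/unpack
idiom, using the names introduced below:

  /* Pack: stash flag bits in the alignment-guaranteed-zero low bits */
  slab->obj_exts = (unsigned long)vec | MEMCG_DATA_OBJEXTS;

  /* Unpack: mask the flag bits off again before dereferencing */
  struct slabobj_ext *exts =
          (struct slabobj_ext *)(slab->obj_exts & ~OBJEXTS_FLAGS_MASK);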

Signed-off-by: Suren Baghdasaryan <sur...@google.com>
---
include/linux/memcontrol.h | 29 ++++++++++++++++++++++-------
mm/slab.h | 4 +---
2 files changed, 23 insertions(+), 10 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index b9fd9732a52b..5e2da63c525f 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -347,7 +347,22 @@ enum page_memcg_data_flags {
__NR_MEMCG_DATA_FLAGS = (1UL << 2),
};

-#define MEMCG_DATA_FLAGS_MASK (__NR_MEMCG_DATA_FLAGS - 1)
+#define __FIRST_OBJEXT_FLAG __NR_MEMCG_DATA_FLAGS
+
+#else /* CONFIG_MEMCG */
+
+#define __FIRST_OBJEXT_FLAG (1UL << 0)
+
+#endif /* CONFIG_MEMCG */
+
+enum objext_flags {
+ /* the next bit after the last actual flag */
+ __NR_OBJEXTS_FLAGS = __FIRST_OBJEXT_FLAG,
+};
+
+#define OBJEXTS_FLAGS_MASK (__NR_OBJEXTS_FLAGS - 1)
+
+#ifdef CONFIG_MEMCG

static inline bool folio_memcg_kmem(struct folio *folio);

@@ -381,7 +396,7 @@ static inline struct mem_cgroup *__folio_memcg(struct folio *folio)
VM_BUG_ON_FOLIO(memcg_data & MEMCG_DATA_OBJEXTS, folio);
VM_BUG_ON_FOLIO(memcg_data & MEMCG_DATA_KMEM, folio);

- return (struct mem_cgroup *)(memcg_data & ~MEMCG_DATA_FLAGS_MASK);
+ return (struct mem_cgroup *)(memcg_data & ~OBJEXTS_FLAGS_MASK);
}

/*
@@ -402,7 +417,7 @@ static inline struct obj_cgroup *__folio_objcg(struct folio *folio)
VM_BUG_ON_FOLIO(memcg_data & MEMCG_DATA_OBJEXTS, folio);
VM_BUG_ON_FOLIO(!(memcg_data & MEMCG_DATA_KMEM), folio);

- return (struct obj_cgroup *)(memcg_data & ~MEMCG_DATA_FLAGS_MASK);
+ return (struct obj_cgroup *)(memcg_data & ~OBJEXTS_FLAGS_MASK);
}

/*
@@ -459,11 +474,11 @@ static inline struct mem_cgroup *folio_memcg_rcu(struct folio *folio)
if (memcg_data & MEMCG_DATA_KMEM) {
struct obj_cgroup *objcg;

- objcg = (void *)(memcg_data & ~MEMCG_DATA_FLAGS_MASK);
+ objcg = (void *)(memcg_data & ~OBJEXTS_FLAGS_MASK);
return obj_cgroup_memcg(objcg);
}

- return (struct mem_cgroup *)(memcg_data & ~MEMCG_DATA_FLAGS_MASK);
+ return (struct mem_cgroup *)(memcg_data & ~OBJEXTS_FLAGS_MASK);
}

/*
@@ -502,11 +517,11 @@ static inline struct mem_cgroup *folio_memcg_check(struct folio *folio)
if (memcg_data & MEMCG_DATA_KMEM) {
struct obj_cgroup *objcg;

- objcg = (void *)(memcg_data & ~MEMCG_DATA_FLAGS_MASK);
+ objcg = (void *)(memcg_data & ~OBJEXTS_FLAGS_MASK);
return obj_cgroup_memcg(objcg);
}

- return (struct mem_cgroup *)(memcg_data & ~MEMCG_DATA_FLAGS_MASK);
+ return (struct mem_cgroup *)(memcg_data & ~OBJEXTS_FLAGS_MASK);
}

static inline struct mem_cgroup *page_memcg_check(struct page *page)
diff --git a/mm/slab.h b/mm/slab.h
index b1c22dc87047..bec202bdcfb8 100644
--- a/mm/slab.h
+++ b/mm/slab.h
@@ -409,10 +409,8 @@ static inline struct slabobj_ext *slab_obj_exts(struct slab *slab)
slab_page(slab));
VM_BUG_ON_PAGE(obj_exts & MEMCG_DATA_KMEM, slab_page(slab));

- return (struct slabobj_ext *)(obj_exts & ~MEMCG_DATA_FLAGS_MASK);
-#else
- return (struct slabobj_ext *)obj_exts;
#endif
+ return (struct slabobj_ext *)(obj_exts & ~OBJEXTS_FLAGS_MASK);
}

int alloc_slab_obj_exts(struct slab *slab, struct kmem_cache *s,
--
2.40.1.495.gc816e09b53d-goog

Suren Baghdasaryan

May 1, 2023, 12:55:39 PM
Add basic infrastructure to support code tagging, which stores common
tag information consisting of the module name, function name, file name
and line number. Provide functions to register a new code tag type and
to navigate between code tags.
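
For a sense of how the API is consumed: a client declares a descriptor
naming its ELF section and tag stride, registers it, and walks the tags
under the module-list lock. This sketch mirrors what the
allocation-profiling patch later in the series does:

  static const struct codetag_type_desc desc = {
          .section  = "alloc_tags",              /* ELF section name */
          .tag_size = sizeof(struct alloc_tag),  /* array stride */
  };
  struct codetag_type *cttype = codetag_register_type(&desc);
  struct codetag_iterator iter = codetag_get_ct_iter(cttype);
  struct codetag *ct;

  codetag_lock_module_list(cttype, true);
  while ((ct = codetag_next_ct(&iter)))
          pr_info("%s:%u func:%s\n", ct->filename, ct->lineno,
                  ct->function);
  codetag_lock_module_list(cttype, false);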

Co-developed-by: Kent Overstreet <kent.ov...@linux.dev>
Signed-off-by: Kent Overstreet <kent.ov...@linux.dev>
Signed-off-by: Suren Baghdasaryan <sur...@google.com>
---
include/linux/codetag.h | 71 ++++++++++++++
lib/Kconfig.debug | 4 +
lib/Makefile | 1 +
lib/codetag.c | 199 ++++++++++++++++++++++++++++++++++++++++
4 files changed, 275 insertions(+)
create mode 100644 include/linux/codetag.h
create mode 100644 lib/codetag.c

diff --git a/include/linux/codetag.h b/include/linux/codetag.h
new file mode 100644
index 000000000000..a9d7adecc2a5
--- /dev/null
+++ b/include/linux/codetag.h
@@ -0,0 +1,71 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * code tagging framework
+ */
+#ifndef _LINUX_CODETAG_H
+#define _LINUX_CODETAG_H
+
+#include <linux/types.h>
+
+struct codetag_iterator;
+struct codetag_type;
+struct seq_buf;
+struct module;
+
+/*
+ * An instance of this structure is created in a special ELF section at every
+ * code location being tagged. At runtime, the special section is treated as
+ * an array of these.
+ */
+struct codetag {
+ unsigned int flags; /* used in later patches */
+ unsigned int lineno;
+ const char *modname;
+ const char *function;
+ const char *filename;
+} __aligned(8);
+
+union codetag_ref {
+ struct codetag *ct;
+};
+
+struct codetag_range {
+ struct codetag *start;
+ struct codetag *stop;
+};
+
+struct codetag_module {
+ struct module *mod;
+ struct codetag_range range;
+};
+
+struct codetag_type_desc {
+ const char *section;
+ size_t tag_size;
+};
+
+struct codetag_iterator {
+ struct codetag_type *cttype;
+ struct codetag_module *cmod;
+ unsigned long mod_id;
+ struct codetag *ct;
+};
+
+#define CODE_TAG_INIT { \
+ .modname = KBUILD_MODNAME, \
+ .function = __func__, \
+ .filename = __FILE__, \
+ .lineno = __LINE__, \
+ .flags = 0, \
+}
+
+void codetag_lock_module_list(struct codetag_type *cttype, bool lock);
+struct codetag_iterator codetag_get_ct_iter(struct codetag_type *cttype);
+struct codetag *codetag_next_ct(struct codetag_iterator *iter);
+
+void codetag_to_text(struct seq_buf *out, struct codetag *ct);
+
+struct codetag_type *
+codetag_register_type(const struct codetag_type_desc *desc);
+
+#endif /* _LINUX_CODETAG_H */
diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug
index ce51d4dc6803..5078da7d3ffb 100644
--- a/lib/Kconfig.debug
+++ b/lib/Kconfig.debug
@@ -957,6 +957,10 @@ config DEBUG_STACKOVERFLOW

If in doubt, say "N".

+config CODE_TAGGING
+ bool
+ select KALLSYMS
+
source "lib/Kconfig.kasan"
source "lib/Kconfig.kfence"
source "lib/Kconfig.kmsan"
diff --git a/lib/Makefile b/lib/Makefile
index 293a0858a3f8..28d70ecf2976 100644
--- a/lib/Makefile
+++ b/lib/Makefile
@@ -228,6 +228,7 @@ obj-$(CONFIG_OF_RECONFIG_NOTIFIER_ERROR_INJECT) += \
of-reconfig-notifier-error-inject.o
obj-$(CONFIG_FUNCTION_ERROR_INJECTION) += error-inject.o

+obj-$(CONFIG_CODE_TAGGING) += codetag.o
lib-$(CONFIG_GENERIC_BUG) += bug.o

obj-$(CONFIG_HAVE_ARCH_TRACEHOOK) += syscall.o
diff --git a/lib/codetag.c b/lib/codetag.c
new file mode 100644
index 000000000000..7708f8388e55
--- /dev/null
+++ b/lib/codetag.c
@@ -0,0 +1,199 @@
+// SPDX-License-Identifier: GPL-2.0-only
+#include <linux/codetag.h>
+#include <linux/idr.h>
+#include <linux/kallsyms.h>
+#include <linux/module.h>
+#include <linux/seq_buf.h>
+#include <linux/slab.h>
+
+struct codetag_type {
+ struct list_head link;
+ unsigned int count;
+ struct idr mod_idr;
+ struct rw_semaphore mod_lock; /* protects mod_idr */
+ struct codetag_type_desc desc;
+};
+
+static DEFINE_MUTEX(codetag_lock);
+static LIST_HEAD(codetag_types);
+
+void codetag_lock_module_list(struct codetag_type *cttype, bool lock)
+{
+ if (lock)
+ down_read(&cttype->mod_lock);
+ else
+ up_read(&cttype->mod_lock);
+}
+
+struct codetag_iterator codetag_get_ct_iter(struct codetag_type *cttype)
+{
+ struct codetag_iterator iter = {
+ .cttype = cttype,
+ .cmod = NULL,
+ .mod_id = 0,
+ .ct = NULL,
+ };
+
+ return iter;
+}
+
+static inline struct codetag *get_first_module_ct(struct codetag_module *cmod)
+{
+ return cmod->range.start < cmod->range.stop ? cmod->range.start : NULL;
+}
+
+static inline
+struct codetag *get_next_module_ct(struct codetag_iterator *iter)
+{
+ struct codetag *res = (struct codetag *)
+ ((char *)iter->ct + iter->cttype->desc.tag_size);
+
+ return res < iter->cmod->range.stop ? res : NULL;
+}
+
+struct codetag *codetag_next_ct(struct codetag_iterator *iter)
+{
+ struct codetag_type *cttype = iter->cttype;
+ struct codetag_module *cmod;
+ struct codetag *ct;
+
+ lockdep_assert_held(&cttype->mod_lock);
+
+ if (unlikely(idr_is_empty(&cttype->mod_idr)))
+ return NULL;
+
+ ct = NULL;
+ while (true) {
+ cmod = idr_find(&cttype->mod_idr, iter->mod_id);
+
+ /* If module was removed move to the next one */
+ if (!cmod)
+ cmod = idr_get_next_ul(&cttype->mod_idr,
+ &iter->mod_id);
+
+ /* Exit if no more modules */
+ if (!cmod)
+ break;
+
+ if (cmod != iter->cmod) {
+ iter->cmod = cmod;
+ ct = get_first_module_ct(cmod);
+ } else
+ ct = get_next_module_ct(iter);
+
+ if (ct)
+ break;
+
+ iter->mod_id++;
+ }
+
+ iter->ct = ct;
+ return ct;
+}
+
+void codetag_to_text(struct seq_buf *out, struct codetag *ct)
+{
+ seq_buf_printf(out, "%s:%u module:%s func:%s",
+ ct->filename, ct->lineno,
+ ct->modname, ct->function);
+}
+
+static inline size_t range_size(const struct codetag_type *cttype,
+ const struct codetag_range *range)
+{
+ return ((char *)range->stop - (char *)range->start) /
+ cttype->desc.tag_size;
+}
+
+static void *get_symbol(struct module *mod, const char *prefix, const char *name)
+{
+ char buf[64];
+ int res;
+
+ res = snprintf(buf, sizeof(buf), "%s%s", prefix, name);
+ if (WARN_ON(res < 1 || res > sizeof(buf)))
+ return NULL;
+
+ return mod ?
+ (void *)find_kallsyms_symbol_value(mod, buf) :
+ (void *)kallsyms_lookup_name(buf);
+}
+
+static struct codetag_range get_section_range(struct module *mod,
+ const char *section)
+{
+ return (struct codetag_range) {
+ get_symbol(mod, "__start_", section),
+ get_symbol(mod, "__stop_", section),
+ };
+}
+
+static int codetag_module_init(struct codetag_type *cttype, struct module *mod)
+{
+ struct codetag_range range;
+ struct codetag_module *cmod;
+ int err;
+
+ range = get_section_range(mod, cttype->desc.section);
+ if (!range.start || !range.stop) {
+ pr_warn("Failed to load code tags of type %s from the module %s\n",
+ cttype->desc.section,
+ mod ? mod->name : "(built-in)");
+ return -EINVAL;
+ }
+
+ /* Ignore empty ranges */
+ if (range.start == range.stop)
+ return 0;
+
+ BUG_ON(range.start > range.stop);
+
+ cmod = kmalloc(sizeof(*cmod), GFP_KERNEL);
+ if (unlikely(!cmod))
+ return -ENOMEM;
+
+ cmod->mod = mod;
+ cmod->range = range;
+
+ down_write(&cttype->mod_lock);
+ err = idr_alloc(&cttype->mod_idr, cmod, 0, 0, GFP_KERNEL);
+ if (err >= 0)
+ cttype->count += range_size(cttype, &range);
+ up_write(&cttype->mod_lock);
+
+ if (err < 0) {
+ kfree(cmod);
+ return err;
+ }
+
+ return 0;
+}
+
+struct codetag_type *
+codetag_register_type(const struct codetag_type_desc *desc)
+{
+ struct codetag_type *cttype;
+ int err;
+
+ BUG_ON(desc->tag_size <= 0);
+
+ cttype = kzalloc(sizeof(*cttype), GFP_KERNEL);
+ if (unlikely(!cttype))
+ return ERR_PTR(-ENOMEM);
+
+ cttype->desc = *desc;
+ idr_init(&cttype->mod_idr);
+ init_rwsem(&cttype->mod_lock);
+
+ err = codetag_module_init(cttype, NULL);
+ if (unlikely(err)) {
+ kfree(cttype);
+ return ERR_PTR(err);
+ }
+
+ mutex_lock(&codetag_lock);
+ list_add_tail(&cttype->link, &codetag_types);
+ mutex_unlock(&codetag_lock);
+
+ return cttype;
+}
--
2.40.1.495.gc816e09b53d-goog

Suren Baghdasaryan

May 1, 2023, 12:55:41 PM
Add support for code tagging from dynamically loaded modules.
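
Tag types that keep per-module state can hook module load and unload
through the two new callbacks. A hedged sketch of what a consumer might
install (the callback body here is hypothetical; the
allocation-profiling patch later in the series registers a real
module_unload):

  static void my_module_load(struct codetag_type *cttype,
                             struct codetag_module *cmod)
  {
          /* e.g. set up per-tag state for the tags in cmod->range */
  }

  static const struct codetag_type_desc desc = {
          .section     = "alloc_tags",
          .tag_size    = sizeof(struct alloc_tag),
          .module_load = my_module_load,
  };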

Signed-off-by: Suren Baghdasaryan <sur...@google.com>
Co-developed-by: Kent Overstreet <kent.ov...@linux.dev>
Signed-off-by: Kent Overstreet <kent.ov...@linux.dev>
---
include/linux/codetag.h | 12 +++++++++
kernel/module/main.c | 4 +++
lib/codetag.c | 58 +++++++++++++++++++++++++++++++++++++++--
3 files changed, 72 insertions(+), 2 deletions(-)

diff --git a/include/linux/codetag.h b/include/linux/codetag.h
index a9d7adecc2a5..386733e89b31 100644
--- a/include/linux/codetag.h
+++ b/include/linux/codetag.h
@@ -42,6 +42,10 @@ struct codetag_module {
struct codetag_type_desc {
const char *section;
size_t tag_size;
+ void (*module_load)(struct codetag_type *cttype,
+ struct codetag_module *cmod);
+ void (*module_unload)(struct codetag_type *cttype,
+ struct codetag_module *cmod);
};

struct codetag_iterator {
@@ -68,4 +72,12 @@ void codetag_to_text(struct seq_buf *out, struct codetag *ct);
struct codetag_type *
codetag_register_type(const struct codetag_type_desc *desc);

+#ifdef CONFIG_CODE_TAGGING
+void codetag_load_module(struct module *mod);
+void codetag_unload_module(struct module *mod);
+#else
+static inline void codetag_load_module(struct module *mod) {}
+static inline void codetag_unload_module(struct module *mod) {}
+#endif
+
#endif /* _LINUX_CODETAG_H */
diff --git a/kernel/module/main.c b/kernel/module/main.c
index 044aa2c9e3cb..4232e7bff549 100644
--- a/kernel/module/main.c
+++ b/kernel/module/main.c
@@ -56,6 +56,7 @@
#include <linux/dynamic_debug.h>
#include <linux/audit.h>
#include <linux/cfi.h>
+#include <linux/codetag.h>
#include <linux/debugfs.h>
#include <uapi/linux/module.h>
#include "internal.h"
@@ -1249,6 +1250,7 @@ static void free_module(struct module *mod)
{
trace_module_free(mod);

+ codetag_unload_module(mod);
mod_sysfs_teardown(mod);

/*
@@ -2974,6 +2976,8 @@ static int load_module(struct load_info *info, const char __user *uargs,
/* Get rid of temporary copy. */
free_copy(info, flags);

+ codetag_load_module(mod);
+
/* Done! */
trace_module_load(mod);

diff --git a/lib/codetag.c b/lib/codetag.c
index 7708f8388e55..4ea57fb37346 100644
--- a/lib/codetag.c
+++ b/lib/codetag.c
@@ -108,15 +108,20 @@ static inline size_t range_size(const struct codetag_type *cttype,
static void *get_symbol(struct module *mod, const char *prefix, const char *name)
{
char buf[64];
+ void *ret;
int res;

res = snprintf(buf, sizeof(buf), "%s%s", prefix, name);
if (WARN_ON(res < 1 || res > sizeof(buf)))
return NULL;

- return mod ?
+ preempt_disable();
+ ret = mod ?
(void *)find_kallsyms_symbol_value(mod, buf) :
(void *)kallsyms_lookup_name(buf);
+ preempt_enable();
+
+ return ret;
}

static struct codetag_range get_section_range(struct module *mod,
@@ -157,8 +162,11 @@ static int codetag_module_init(struct codetag_type *cttype, struct module *mod)

down_write(&cttype->mod_lock);
err = idr_alloc(&cttype->mod_idr, cmod, 0, 0, GFP_KERNEL);
- if (err >= 0)
+ if (err >= 0) {
cttype->count += range_size(cttype, &range);
+ if (cttype->desc.module_load)
+ cttype->desc.module_load(cttype, cmod);
+ }
up_write(&cttype->mod_lock);

if (err < 0) {
@@ -197,3 +205,49 @@ codetag_register_type(const struct codetag_type_desc *desc)

return cttype;
}
+
+void codetag_load_module(struct module *mod)
+{
+ struct codetag_type *cttype;
+
+ if (!mod)
+ return;
+
+ mutex_lock(&codetag_lock);
+ list_for_each_entry(cttype, &codetag_types, link)
+ codetag_module_init(cttype, mod);
+ mutex_unlock(&codetag_lock);
+}
+
+void codetag_unload_module(struct module *mod)
+{
+ struct codetag_type *cttype;
+
+ if (!mod)
+ return;
+
+ mutex_lock(&codetag_lock);
+ list_for_each_entry(cttype, &codetag_types, link) {
+ struct codetag_module *found = NULL;
+ struct codetag_module *cmod;
+ unsigned long mod_id, tmp;
+
+ down_write(&cttype->mod_lock);
+ idr_for_each_entry_ul(&cttype->mod_idr, cmod, tmp, mod_id) {
+ if (cmod->mod && cmod->mod == mod) {
+ found = cmod;
+ break;
+ }
+ }
+ if (found) {
+ if (cttype->desc.module_unload)
+ cttype->desc.module_unload(cttype, cmod);
+
+ cttype->count -= range_size(cttype, &cmod->range);
+ idr_remove(&cttype->mod_idr, mod_id);
+ kfree(cmod);
+ }
+ up_write(&cttype->mod_lock);
+ }
+ mutex_unlock(&codetag_lock);
+}
--
2.40.1.495.gc816e09b53d-goog

Suren Baghdasaryan

May 1, 2023, 12:55:43 PM
Skip freeing a module's data section if any of its allocation tags are
still non-zero, because otherwise, once those outstanding allocations
are finally freed, accessing their code tags would cause a
use-after-free (UAF).
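
The hazard, spelled out as a sketch (the lookup helper here is
hypothetical, named only for illustration): every outstanding
allocation holds a codetag_ref that points at a struct codetag living
in the module's core data section, and that reference is dereferenced
only when the allocation is finally freed.

  union codetag_ref *ref = ref_for_object(obj);  /* hypothetical */

  /* Freeing dereferences ref->ct to decrement the tag's counter; if
   * the module's data section holding the tag was already freed at
   * unload time, ref->ct dangles and this is a use-after-free. */
  kfree(obj);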

Signed-off-by: Suren Baghdasaryan <sur...@google.com>
---
include/linux/codetag.h | 6 +++---
kernel/module/main.c | 23 +++++++++++++++--------
lib/codetag.c | 11 ++++++++---
3 files changed, 26 insertions(+), 14 deletions(-)

diff --git a/include/linux/codetag.h b/include/linux/codetag.h
index 386733e89b31..d98e4c8e86f0 100644
--- a/include/linux/codetag.h
+++ b/include/linux/codetag.h
@@ -44,7 +44,7 @@ struct codetag_type_desc {
size_t tag_size;
void (*module_load)(struct codetag_type *cttype,
struct codetag_module *cmod);
- void (*module_unload)(struct codetag_type *cttype,
+ bool (*module_unload)(struct codetag_type *cttype,
struct codetag_module *cmod);
};

@@ -74,10 +74,10 @@ codetag_register_type(const struct codetag_type_desc *desc);

#ifdef CONFIG_CODE_TAGGING
void codetag_load_module(struct module *mod);
-void codetag_unload_module(struct module *mod);
+bool codetag_unload_module(struct module *mod);
#else
static inline void codetag_load_module(struct module *mod) {}
-static inline void codetag_unload_module(struct module *mod) {}
+static inline bool codetag_unload_module(struct module *mod) { return true; }
#endif

#endif /* _LINUX_CODETAG_H */
diff --git a/kernel/module/main.c b/kernel/module/main.c
index 4232e7bff549..9ff56f2bb09d 100644
--- a/kernel/module/main.c
+++ b/kernel/module/main.c
@@ -1218,15 +1218,19 @@ static void *module_memory_alloc(unsigned int size, enum mod_mem_type type)
return module_alloc(size);
}

-static void module_memory_free(void *ptr, enum mod_mem_type type)
+static void module_memory_free(void *ptr, enum mod_mem_type type,
+ bool unload_codetags)
{
+ if (!unload_codetags && mod_mem_type_is_core_data(type))
+ return;
+
if (mod_mem_use_vmalloc(type))
vfree(ptr);
else
module_memfree(ptr);
}

-static void free_mod_mem(struct module *mod)
+static void free_mod_mem(struct module *mod, bool unload_codetags)
{
for_each_mod_mem_type(type) {
struct module_memory *mod_mem = &mod->mem[type];
@@ -1237,20 +1241,23 @@ static void free_mod_mem(struct module *mod)
/* Free lock-classes; relies on the preceding sync_rcu(). */
lockdep_free_key_range(mod_mem->base, mod_mem->size);
if (mod_mem->size)
- module_memory_free(mod_mem->base, type);
+ module_memory_free(mod_mem->base, type,
+ unload_codetags);
}

/* MOD_DATA hosts mod, so free it at last */
lockdep_free_key_range(mod->mem[MOD_DATA].base, mod->mem[MOD_DATA].size);
- module_memory_free(mod->mem[MOD_DATA].base, MOD_DATA);
+ module_memory_free(mod->mem[MOD_DATA].base, MOD_DATA, unload_codetags);
}

/* Free a module, remove from lists, etc. */
static void free_module(struct module *mod)
{
+ bool unload_codetags;
+
trace_module_free(mod);

- codetag_unload_module(mod);
+ unload_codetags = codetag_unload_module(mod);
mod_sysfs_teardown(mod);

/*
@@ -1292,7 +1299,7 @@ static void free_module(struct module *mod)
kfree(mod->args);
percpu_modfree(mod);

- free_mod_mem(mod);
+ free_mod_mem(mod, unload_codetags);
}

void *__symbol_get(const char *symbol)
@@ -2294,7 +2301,7 @@ static int move_module(struct module *mod, struct load_info *info)
return 0;
out_enomem:
for (t--; t >= 0; t--)
- module_memory_free(mod->mem[t].base, t);
+ module_memory_free(mod->mem[t].base, t, true);
return ret;
}

@@ -2424,7 +2431,7 @@ static void module_deallocate(struct module *mod, struct load_info *info)
percpu_modfree(mod);
module_arch_freeing_init(mod);

- free_mod_mem(mod);
+ free_mod_mem(mod, true);
}

int __weak module_finalize(const Elf_Ehdr *hdr,
diff --git a/lib/codetag.c b/lib/codetag.c
index 4ea57fb37346..0ad4ea66c769 100644
--- a/lib/codetag.c
+++ b/lib/codetag.c
@@ -5,6 +5,7 @@
#include <linux/module.h>
#include <linux/seq_buf.h>
#include <linux/slab.h>
+#include <linux/vmalloc.h>

struct codetag_type {
struct list_head link;
@@ -219,12 +220,13 @@ void codetag_load_module(struct module *mod)
mutex_unlock(&codetag_lock);
}

-void codetag_unload_module(struct module *mod)
+bool codetag_unload_module(struct module *mod)
{
struct codetag_type *cttype;
+ bool unload_ok = true;

if (!mod)
- return;
+ return true;

mutex_lock(&codetag_lock);
list_for_each_entry(cttype, &codetag_types, link) {
@@ -241,7 +243,8 @@ void codetag_unload_module(struct module *mod)
}
if (found) {
if (cttype->desc.module_unload)
- cttype->desc.module_unload(cttype, cmod);
+ if (!cttype->desc.module_unload(cttype, cmod))
+ unload_ok = false;

cttype->count -= range_size(cttype, &cmod->range);
idr_remove(&cttype->mod_idr, mod_id);
@@ -250,4 +253,6 @@ void codetag_unload_module(struct module *mod)
up_write(&cttype->mod_lock);
}
mutex_unlock(&codetag_lock);
+
+ return unload_ok;
}
--
2.40.1.495.gc816e09b53d-goog

Suren Baghdasaryan

May 1, 2023, 12:55:46 PM
From: Kent Overstreet <kent.ov...@linux.dev>

Provide codetag_query_parse() to parse codetag queries and
codetag_matches_query() to check if the query affects a given codetag.
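
The query format is a whitespace-separated list of "token value" pairs,
where line and index accept ranges. A short sketch of driving the
parser directly (note that it modifies the buffer in place and the
query fields point into it):

  struct codetag_query q = { 0 };
  char buf[] = "func __pte_alloc_one file pgalloc.h line 60-70";
  char *rest = codetag_query_parse(&q, buf);

  /* Now q.function == "__pte_alloc_one", q.filename == "pgalloc.h",
   * q.first_line == 60, q.last_line == 70 and q.match_line is set;
   * "rest" points at the first pair the parser didn't recognize. */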

Signed-off-by: Kent Overstreet <kent.ov...@linux.dev>
Signed-off-by: Suren Baghdasaryan <sur...@google.com>
---
include/linux/codetag.h | 27 ++++++++
lib/codetag.c | 135 ++++++++++++++++++++++++++++++++++++++++
2 files changed, 162 insertions(+)

diff --git a/include/linux/codetag.h b/include/linux/codetag.h
index d98e4c8e86f0..87207f199ac9 100644
--- a/include/linux/codetag.h
+++ b/include/linux/codetag.h
@@ -80,4 +80,31 @@ static inline void codetag_load_module(struct module *mod) {}
static inline bool codetag_unload_module(struct module *mod) { return true; }
#endif

+/* Codetag query parsing */
+
+struct codetag_query {
+ const char *filename;
+ const char *module;
+ const char *function;
+ const char *class;
+ unsigned int first_line, last_line;
+ unsigned int first_index, last_index;
+ unsigned int cur_index;
+
+ bool match_line:1;
+ bool match_index:1;
+
+ unsigned int set_enabled:1;
+ unsigned int enabled:2;
+
+ unsigned int set_frequency:1;
+ unsigned int frequency;
+};
+
+char *codetag_query_parse(struct codetag_query *q, char *buf);
+bool codetag_matches_query(struct codetag_query *q,
+ const struct codetag *ct,
+ const struct codetag_module *mod,
+ const char *class);
+
#endif /* _LINUX_CODETAG_H */
diff --git a/lib/codetag.c b/lib/codetag.c
index 0ad4ea66c769..84f90f3b922c 100644
--- a/lib/codetag.c
+++ b/lib/codetag.c
@@ -256,3 +256,138 @@ bool codetag_unload_module(struct module *mod)

return unload_ok;
}
+
+/* Codetag query parsing */
+
+#define CODETAG_QUERY_TOKENS() \
+ x(func) \
+ x(file) \
+ x(line) \
+ x(module) \
+ x(class) \
+ x(index)
+
+enum tokens {
+#define x(name) TOK_##name,
+ CODETAG_QUERY_TOKENS()
+#undef x
+};
+
+static const char * const token_strs[] = {
+#define x(name) #name,
+ CODETAG_QUERY_TOKENS()
+#undef x
+ NULL
+};
+
+static int parse_range(char *str, unsigned int *first, unsigned int *last)
+{
+ char *first_str = str;
+ char *last_str = strchr(first_str, '-');
+
+ if (last_str)
+ *last_str++ = '\0';
+
+ if (kstrtouint(first_str, 10, first))
+ return -EINVAL;
+
+ if (!last_str)
+ *last = *first;
+ else if (kstrtouint(last_str, 10, last))
+ return -EINVAL;
+
+ return 0;
+}
+
+char *codetag_query_parse(struct codetag_query *q, char *buf)
+{
+ while (1) {
+ char *p = buf;
+ char *str1 = strsep_no_empty(&p, " \t\r\n");
+ char *str2 = strsep_no_empty(&p, " \t\r\n");
+ int ret, token;
+
+ if (!str1 || !str2)
+ break;
+
+ token = match_string(token_strs, ARRAY_SIZE(token_strs), str1);
+ if (token < 0)
+ break;
+
+ switch (token) {
+ case TOK_func:
+ q->function = str2;
+ break;
+ case TOK_file:
+ q->filename = str2;
+ break;
+ case TOK_line:
+ ret = parse_range(str2, &q->first_line, &q->last_line);
+ if (ret)
+ return ERR_PTR(ret);
+ q->match_line = true;
+ break;
+ case TOK_module:
+ q->module = str2;
+ break;
+ case TOK_class:
+ q->class = str2;
+ break;
+ case TOK_index:
+ ret = parse_range(str2, &q->first_index, &q->last_index);
+ if (ret)
+ return ERR_PTR(ret);
+ q->match_index = true;
+ break;
+ }
+
+ buf = p;
+ }
+
+ return buf;
+}
+
+bool codetag_matches_query(struct codetag_query *q,
+ const struct codetag *ct,
+ const struct codetag_module *mod,
+ const char *class)
+{
+ size_t classlen = q->class ? strlen(q->class) : 0;
+
+ if (q->module &&
+ (!mod->mod ||
+ strcmp(q->module, ct->modname)))
+ return false;
+
+ if (q->filename &&
+ strcmp(q->filename, ct->filename) &&
+ strcmp(q->filename, kbasename(ct->filename)))
+ return false;
+
+ if (q->function &&
+ strcmp(q->function, ct->function))
+ return false;
+
+ /* match against the line number range */
+ if (q->match_line &&
+ (ct->lineno < q->first_line ||
+ ct->lineno > q->last_line))
+ return false;
+
+ /* match against the class */
+ if (classlen &&
+ (strncmp(q->class, class, classlen) ||
+ (class[classlen] && class[classlen] != ':')))
+ return false;
+
+ /* match against the fault index */
+ if (q->match_index &&
+ (q->cur_index < q->first_index ||
+ q->cur_index > q->last_index)) {
+ q->cur_index++;
+ return false;
+ }
+
+ q->cur_index++;
+ return true;
+}
--
2.40.1.495.gc816e09b53d-goog

Suren Baghdasaryan

May 1, 2023, 12:55:48 PM
Introduce CONFIG_MEM_ALLOC_PROFILING, which provides definitions to
easily instrument memory allocators. It also registers an "alloc_tags"
codetag type with an "allocations" debugfs interface to output
allocation tag information.
CONFIG_MEM_ALLOC_PROFILING_DEBUG is provided for debugging the memory
allocation profiling instrumentation.
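
The intended instrumentation pattern, sketched with a hypothetical
wrapper ("_kmalloc" stands in for the real allocator entry point hooked
up later in the series): each callsite expands to a static alloc_tag in
the alloc_tags section, which is saved on the current task across the
call so lower layers can charge the allocated bytes to it.

  #define my_kmalloc(size, gfp)                           \
  ({                                                      \
          DEFINE_ALLOC_TAG(_alloc_tag, _old);             \
          void *_res = _kmalloc(size, gfp);               \
          alloc_tag_restore(&_alloc_tag, _old);           \
          _res;                                           \
  })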

Co-developed-by: Kent Overstreet <kent.ov...@linux.dev>
Signed-off-by: Kent Overstreet <kent.ov...@linux.dev>
Signed-off-by: Suren Baghdasaryan <sur...@google.com>
---
.../admin-guide/kernel-parameters.txt | 2 +
include/asm-generic/codetag.lds.h | 14 ++
include/asm-generic/vmlinux.lds.h | 3 +
include/linux/alloc_tag.h | 105 +++++++++++
include/linux/sched.h | 24 +++
lib/Kconfig.debug | 19 ++
lib/Makefile | 2 +
lib/alloc_tag.c | 177 ++++++++++++++++++
scripts/module.lds.S | 7 +
9 files changed, 353 insertions(+)
create mode 100644 include/asm-generic/codetag.lds.h
create mode 100644 include/linux/alloc_tag.h
create mode 100644 lib/alloc_tag.c

diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index 9e5bab29685f..2fd8e56b7af8 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -3770,6 +3770,8 @@

nomce [X86-32] Disable Machine Check Exception

+ nomem_profiling Disable memory allocation profiling.
+
nomfgpt [X86-32] Disable Multi-Function General Purpose
Timer usage (for AMD Geode machines).

diff --git a/include/asm-generic/codetag.lds.h b/include/asm-generic/codetag.lds.h
new file mode 100644
index 000000000000..64f536b80380
--- /dev/null
+++ b/include/asm-generic/codetag.lds.h
@@ -0,0 +1,14 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+#ifndef __ASM_GENERIC_CODETAG_LDS_H
+#define __ASM_GENERIC_CODETAG_LDS_H
+
+#define SECTION_WITH_BOUNDARIES(_name) \
+ . = ALIGN(8); \
+ __start_##_name = .; \
+ KEEP(*(_name)) \
+ __stop_##_name = .;
+
+#define CODETAG_SECTIONS() \
+ SECTION_WITH_BOUNDARIES(alloc_tags)
+
+#endif /* __ASM_GENERIC_CODETAG_LDS_H */
diff --git a/include/asm-generic/vmlinux.lds.h b/include/asm-generic/vmlinux.lds.h
index d1f57e4868ed..985ff045c2a2 100644
--- a/include/asm-generic/vmlinux.lds.h
+++ b/include/asm-generic/vmlinux.lds.h
@@ -50,6 +50,8 @@
* [__nosave_begin, __nosave_end] for the nosave data
*/

+#include <asm-generic/codetag.lds.h>
+
#ifndef LOAD_OFFSET
#define LOAD_OFFSET 0
#endif
@@ -374,6 +376,7 @@
. = ALIGN(8); \
BOUNDED_SECTION_BY(__dyndbg_classes, ___dyndbg_classes) \
BOUNDED_SECTION_BY(__dyndbg, ___dyndbg) \
+ CODETAG_SECTIONS() \
LIKELY_PROFILE() \
BRANCH_PROFILE() \
TRACE_PRINTKS() \
diff --git a/include/linux/alloc_tag.h b/include/linux/alloc_tag.h
new file mode 100644
index 000000000000..d913f8d9a7d8
--- /dev/null
+++ b/include/linux/alloc_tag.h
@@ -0,0 +1,105 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * allocation tagging
+ */
+#ifndef _LINUX_ALLOC_TAG_H
+#define _LINUX_ALLOC_TAG_H
+
+#include <linux/bug.h>
+#include <linux/codetag.h>
+#include <linux/container_of.h>
+#include <linux/lazy-percpu-counter.h>
+#include <linux/static_key.h>
+
+/*
+ * An instance of this structure is created in a special ELF section at every
+ * allocation callsite. At runtime, the special section is treated as
+ * an array of these. Embedded codetag utilizes codetag framework.
+ */
+struct alloc_tag {
+ struct codetag ct;
+ struct lazy_percpu_counter bytes_allocated;
+} __aligned(8);
+
+#ifdef CONFIG_MEM_ALLOC_PROFILING
+
+static inline struct alloc_tag *ct_to_alloc_tag(struct codetag *ct)
+{
+ return container_of(ct, struct alloc_tag, ct);
+}
+
+#define DEFINE_ALLOC_TAG(_alloc_tag, _old) \
+ static struct alloc_tag _alloc_tag __used __aligned(8) \
+ __section("alloc_tags") = { .ct = CODE_TAG_INIT }; \
+ struct alloc_tag * __maybe_unused _old = alloc_tag_save(&_alloc_tag)
+
+extern struct static_key_true mem_alloc_profiling_key;
+
+static inline bool mem_alloc_profiling_enabled(void)
+{
+ return static_branch_likely(&mem_alloc_profiling_key);
+}
+
+static inline void __alloc_tag_sub(union codetag_ref *ref, size_t bytes,
+ bool may_allocate)
+{
+ struct alloc_tag *tag;
+
+#ifdef CONFIG_MEM_ALLOC_PROFILING_DEBUG
+ /* The switch should be checked before this */
+ BUG_ON(!mem_alloc_profiling_enabled());
+
+ WARN_ONCE(ref && !ref->ct, "alloc_tag was not set\n");
+#endif
+ if (!ref || !ref->ct)
+ return;
+
+ tag = ct_to_alloc_tag(ref->ct);
+
+ if (may_allocate)
+ lazy_percpu_counter_add(&tag->bytes_allocated, -bytes);
+ else
+ lazy_percpu_counter_add_noupgrade(&tag->bytes_allocated, -bytes);
+ ref->ct = NULL;
+}
+
+static inline void alloc_tag_sub(union codetag_ref *ref, size_t bytes)
+{
+ __alloc_tag_sub(ref, bytes, true);
+}
+
+static inline void alloc_tag_sub_noalloc(union codetag_ref *ref, size_t bytes)
+{
+ __alloc_tag_sub(ref, bytes, false);
+}
+
+static inline void alloc_tag_add(union codetag_ref *ref, struct alloc_tag *tag, size_t bytes)
+{
+#ifdef CONFIG_MEM_ALLOC_PROFILING_DEBUG
+ /* The switch should be checked before this */
+ BUG_ON(!mem_alloc_profiling_enabled());
+
+ WARN_ONCE(ref && ref->ct,
+ "alloc_tag was not cleared (got tag for %s:%u)\n",\
+ ref->ct->filename, ref->ct->lineno);
+
+ WARN_ONCE(!tag, "current->alloc_tag not set");
+#endif
+ if (!ref || !tag)
+ return;
+
+ ref->ct = &tag->ct;
+ lazy_percpu_counter_add(&tag->bytes_allocated, bytes);
+}
+
+#else
+
+#define DEFINE_ALLOC_TAG(_alloc_tag, _old)
+static inline void alloc_tag_sub(union codetag_ref *ref, size_t bytes) {}
+static inline void alloc_tag_sub_noalloc(union codetag_ref *ref, size_t bytes) {}
+static inline void alloc_tag_add(union codetag_ref *ref, struct alloc_tag *tag,
+ size_t bytes) {}
+
+#endif
+
+#endif /* _LINUX_ALLOC_TAG_H */
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 35e7efdea2d9..33708bf8f191 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -763,6 +763,10 @@ struct task_struct {
unsigned int flags;
unsigned int ptrace;

+#ifdef CONFIG_MEM_ALLOC_PROFILING
+ struct alloc_tag *alloc_tag;
+#endif
+
#ifdef CONFIG_SMP
int on_cpu;
struct __call_single_node wake_entry;
@@ -802,6 +806,7 @@ struct task_struct {
struct task_group *sched_task_group;
#endif

+
#ifdef CONFIG_UCLAMP_TASK
/*
* Clamp values requested for a scheduling entity.
@@ -2444,4 +2449,23 @@ static inline void sched_core_fork(struct task_struct *p) { }

extern void sched_set_stop_task(int cpu, struct task_struct *stop);

+#ifdef CONFIG_MEM_ALLOC_PROFILING
+static inline struct alloc_tag *alloc_tag_save(struct alloc_tag *tag)
+{
+ swap(current->alloc_tag, tag);
+ return tag;
+}
+
+static inline void alloc_tag_restore(struct alloc_tag *tag, struct alloc_tag *old)
+{
+#ifdef CONFIG_MEM_ALLOC_PROFILING_DEBUG
+ WARN(current->alloc_tag != tag, "current->alloc_tag was changed:\n");
+#endif
+ current->alloc_tag = old;
+}
+#else
+static inline struct alloc_tag *alloc_tag_save(struct alloc_tag *tag) { return NULL; }
+#define alloc_tag_restore(_tag, _old)
+#endif
+
#endif
diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug
index 5078da7d3ffb..da0a91ea6042 100644
--- a/lib/Kconfig.debug
+++ b/lib/Kconfig.debug
@@ -961,6 +961,25 @@ config CODE_TAGGING
bool
select KALLSYMS

+config MEM_ALLOC_PROFILING
+ bool "Enable memory allocation profiling"
+ default n
+ depends on DEBUG_FS
+ select CODE_TAGGING
+ select LAZY_PERCPU_COUNTER
+ help
+ Track allocation source code and record total allocation size
+ initiated at that code location. The mechanism can be used to track
+ memory leaks with a low performance impact.
+
+config MEM_ALLOC_PROFILING_DEBUG
+ bool "Memory allocation profiler debugging"
+ default n
+ depends on MEM_ALLOC_PROFILING
+ help
+ Adds warnings with helpful error messages for memory allocation
+ profiling.
+
source "lib/Kconfig.kasan"
source "lib/Kconfig.kfence"
source "lib/Kconfig.kmsan"
diff --git a/lib/Makefile b/lib/Makefile
index 28d70ecf2976..8d09ccb4d30c 100644
--- a/lib/Makefile
+++ b/lib/Makefile
@@ -229,6 +229,8 @@ obj-$(CONFIG_OF_RECONFIG_NOTIFIER_ERROR_INJECT) += \
obj-$(CONFIG_FUNCTION_ERROR_INJECTION) += error-inject.o

obj-$(CONFIG_CODE_TAGGING) += codetag.o
+obj-$(CONFIG_MEM_ALLOC_PROFILING) += alloc_tag.o
+
lib-$(CONFIG_GENERIC_BUG) += bug.o

obj-$(CONFIG_HAVE_ARCH_TRACEHOOK) += syscall.o
diff --git a/lib/alloc_tag.c b/lib/alloc_tag.c
new file mode 100644
index 000000000000..3c4cfeb79862
--- /dev/null
+++ b/lib/alloc_tag.c
@@ -0,0 +1,177 @@
+// SPDX-License-Identifier: GPL-2.0-only
+#include <linux/alloc_tag.h>
+#include <linux/debugfs.h>
+#include <linux/fs.h>
+#include <linux/gfp.h>
+#include <linux/module.h>
+#include <linux/seq_buf.h>
+#include <linux/uaccess.h>
+
+DEFINE_STATIC_KEY_TRUE(mem_alloc_profiling_key);
+
+/*
+ * Won't need to be exported once page allocation accounting is moved to the
+ * correct place:
+ */
+EXPORT_SYMBOL(mem_alloc_profiling_key);
+
+static int __init mem_alloc_profiling_disable(char *s)
+{
+ static_branch_disable(&mem_alloc_profiling_key);
+ return 1;
+}
+__setup("nomem_profiling", mem_alloc_profiling_disable);
+
+struct alloc_tag_file_iterator {
+ struct codetag_iterator ct_iter;
+ struct seq_buf buf;
+ char rawbuf[4096];
+};
+
+struct user_buf {
+ char __user *buf; /* destination user buffer */
+ size_t size; /* size of requested read */
+ ssize_t ret; /* bytes read so far */
+};
+
+static int flush_ubuf(struct user_buf *dst, struct seq_buf *src)
+{
+ if (src->len) {
+ size_t bytes = min_t(size_t, src->len, dst->size);
+ int err = copy_to_user(dst->buf, src->buffer, bytes);
+
+ if (err)
+ return err;
+
+ dst->ret += bytes;
+ dst->buf += bytes;
+ dst->size -= bytes;
+ src->len -= bytes;
+ memmove(src->buffer, src->buffer + bytes, src->len);
+ }
+
+ return 0;
+}
+
+static int allocations_file_open(struct inode *inode, struct file *file)
+{
+ struct codetag_type *cttype = inode->i_private;
+ struct alloc_tag_file_iterator *iter;
+
+ iter = kzalloc(sizeof(*iter), GFP_KERNEL);
+ if (!iter)
+ return -ENOMEM;
+
+ codetag_lock_module_list(cttype, true);
+ iter->ct_iter = codetag_get_ct_iter(cttype);
+ codetag_lock_module_list(cttype, false);
+ seq_buf_init(&iter->buf, iter->rawbuf, sizeof(iter->rawbuf));
+ file->private_data = iter;
+
+ return 0;
+}
+
+static int allocations_file_release(struct inode *inode, struct file *file)
+{
+ struct alloc_tag_file_iterator *iter = file->private_data;
+
+ kfree(iter);
+ return 0;
+}
+
+static void alloc_tag_to_text(struct seq_buf *out, struct codetag *ct)
+{
+ struct alloc_tag *tag = ct_to_alloc_tag(ct);
+ char buf[10];
+
+ string_get_size(lazy_percpu_counter_read(&tag->bytes_allocated), 1,
+ STRING_UNITS_2, buf, sizeof(buf));
+
+ seq_buf_printf(out, "%8s ", buf);
+ codetag_to_text(out, ct);
+ seq_buf_putc(out, '\n');
+}
+
+static ssize_t allocations_file_read(struct file *file, char __user *ubuf,
+ size_t size, loff_t *ppos)
+{
+ struct alloc_tag_file_iterator *iter = file->private_data;
+ struct user_buf buf = { .buf = ubuf, .size = size };
+ struct codetag *ct;
+ int err = 0;
+
+ codetag_lock_module_list(iter->ct_iter.cttype, true);
+ while (1) {
+ err = flush_ubuf(&buf, &iter->buf);
+ if (err || !buf.size)
+ break;
+
+ ct = codetag_next_ct(&iter->ct_iter);
+ if (!ct)
+ break;
+
+ alloc_tag_to_text(&iter->buf, ct);
+ }
+ codetag_lock_module_list(iter->ct_iter.cttype, false);
+
+ return err ? : buf.ret;
+}
+
+static const struct file_operations allocations_file_ops = {
+ .owner = THIS_MODULE,
+ .open = allocations_file_open,
+ .release = allocations_file_release,
+ .read = allocations_file_read,
+};
+
+static int __init dbgfs_init(struct codetag_type *cttype)
+{
+ struct dentry *file;
+
+ file = debugfs_create_file("allocations", 0444, NULL, cttype,
+ &allocations_file_ops);
+
+ return IS_ERR(file) ? PTR_ERR(file) : 0;
+}
+
+static bool alloc_tag_module_unload(struct codetag_type *cttype, struct codetag_module *cmod)
+{
+ struct codetag_iterator iter = codetag_get_ct_iter(cttype);
+ bool module_unused = true;
+ struct alloc_tag *tag;
+ struct codetag *ct;
+ size_t bytes;
+
+ for (ct = codetag_next_ct(&iter); ct; ct = codetag_next_ct(&iter)) {
+ if (iter.cmod != cmod)
+ continue;
+
+ tag = ct_to_alloc_tag(ct);
+ bytes = lazy_percpu_counter_read(&tag->bytes_allocated);
+
+ if (!WARN(bytes, "%s:%u module %s func:%s has %zu bytes allocated at module unload",
+ ct->filename, ct->lineno, ct->modname, ct->function, bytes))
+ lazy_percpu_counter_exit(&tag->bytes_allocated);
+ else
+ module_unused = false;
+ }
+
+ return module_unused;
+}
+
+static int __init alloc_tag_init(void)
+{
+ struct codetag_type *cttype;
+ const struct codetag_type_desc desc = {
+ .section = "alloc_tags",
+ .tag_size = sizeof(struct alloc_tag),
+ .module_unload = alloc_tag_module_unload,
+ };
+
+ cttype = codetag_register_type(&desc);
+ if (IS_ERR_OR_NULL(cttype))
+ return cttype ? PTR_ERR(cttype) : -ENOMEM;
+
+ return dbgfs_init(cttype);
+}
+module_init(alloc_tag_init);
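
For reference, a minimal userspace reader for this file; it is an
illustration only, not part of the patch. Because the iterator lives in
file->private_data and flush_ubuf() carries leftover seq_buf contents over
between calls, short reads simply resume where the previous read stopped:

  #include <fcntl.h>
  #include <stdio.h>
  #include <unistd.h>

  int main(void)
  {
          char buf[4096];
          ssize_t n;
          int fd = open("/sys/kernel/debug/allocations", O_RDONLY);

          if (fd < 0) {
                  perror("open");
                  return 1;
          }
          /* Arbitrary read sizes work: the kernel refills its 4 KiB
           * seq_buf between copies and resumes from leftover output. */
          while ((n = read(fd, buf, sizeof(buf))) > 0)
                  fwrite(buf, 1, n, stdout);
          close(fd);
          return 0;
  }
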
diff --git a/scripts/module.lds.S b/scripts/module.lds.S
index bf5bcf2836d8..45c67a0994f3 100644
--- a/scripts/module.lds.S
+++ b/scripts/module.lds.S
@@ -9,6 +9,8 @@
#define DISCARD_EH_FRAME *(.eh_frame)
#endif

+#include <asm-generic/codetag.lds.h>
+
SECTIONS {
/DISCARD/ : {
*(.discard)
@@ -47,12 +49,17 @@ SECTIONS {
.data : {
*(.data .data.[0-9a-zA-Z_]*)
*(.data..L*)
+ CODETAG_SECTIONS()
}

.rodata : {
*(.rodata .rodata.[0-9a-zA-Z_]*)
*(.rodata..L*)
}
+#else
+ .data : {
+ CODETAG_SECTIONS()
+ }
#endif
}

--
2.40.1.495.gc816e09b53d-goog

Suren Baghdasaryan
May 1, 2023, 12:55:50 PM

Introduce helper functions to easily instrument page allocators: a
pointer to the allocation tag associated with the code that allocated
the page is stored in a page_ext field.
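
For illustration, a hypothetical debugging helper (not part of this patch)
showing how the stored reference can be consumed; it assumes union
codetag_ref exposes the tag pointer as ->ct, as defined earlier in the
series:

  static void pr_page_alloc_tag(struct page *page)
  {
          union codetag_ref *ref = get_page_tag_ref(page);

          /* NULL if profiling is disabled or the page_ext lookup failed */
          if (ref && ref->ct)
                  pr_info("page allocated at %s:%u (%s)\n",
                          ref->ct->filename, ref->ct->lineno,
                          ref->ct->function);
  }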

Signed-off-by: Suren Baghdasaryan <sur...@google.com>
Co-developed-by: Kent Overstreet <kent.ov...@linux.dev>
Signed-off-by: Kent Overstreet <kent.ov...@linux.dev>
---
include/linux/pgalloc_tag.h | 33 +++++++++++++++++++++++++++++++++
lib/Kconfig.debug | 1 +
lib/alloc_tag.c | 17 +++++++++++++++++
mm/page_ext.c | 12 +++++++++---
4 files changed, 60 insertions(+), 3 deletions(-)
create mode 100644 include/linux/pgalloc_tag.h

diff --git a/include/linux/pgalloc_tag.h b/include/linux/pgalloc_tag.h
new file mode 100644
index 000000000000..f8c7b6ef9c75
--- /dev/null
+++ b/include/linux/pgalloc_tag.h
@@ -0,0 +1,33 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * page allocation tagging
+ */
+#ifndef _LINUX_PGALLOC_TAG_H
+#define _LINUX_PGALLOC_TAG_H
+
+#include <linux/alloc_tag.h>
+#include <linux/page_ext.h>
+
+extern struct page_ext_operations page_alloc_tagging_ops;
+struct page_ext *lookup_page_ext(const struct page *page);
+
+static inline union codetag_ref *get_page_tag_ref(struct page *page)
+{
+ if (page && mem_alloc_profiling_enabled()) {
+ struct page_ext *page_ext = lookup_page_ext(page);
+
+ if (page_ext)
+ return (void *)page_ext + page_alloc_tagging_ops.offset;
+ }
+ return NULL;
+}
+
+static inline void pgalloc_tag_dec(struct page *page, unsigned int order)
+{
+ union codetag_ref *ref = get_page_tag_ref(page);
+
+ if (ref)
+ alloc_tag_sub(ref, PAGE_SIZE << order);
+}
+
+#endif /* _LINUX_PGALLOC_TAG_H */
diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug
index da0a91ea6042..d3aa5ee0bf0d 100644
--- a/lib/Kconfig.debug
+++ b/lib/Kconfig.debug
@@ -967,6 +967,7 @@ config MEM_ALLOC_PROFILING
depends on DEBUG_FS
select CODE_TAGGING
select LAZY_PERCPU_COUNTER
+ select PAGE_EXTENSION
help
Track allocation source code and record total allocation size
initiated at that code location. The mechanism can be used to track
diff --git a/lib/alloc_tag.c b/lib/alloc_tag.c
index 3c4cfeb79862..4a0b95a46b2e 100644
--- a/lib/alloc_tag.c
+++ b/lib/alloc_tag.c
@@ -4,6 +4,7 @@
#include <linux/fs.h>
#include <linux/gfp.h>
#include <linux/module.h>
+#include <linux/page_ext.h>
#include <linux/seq_buf.h>
#include <linux/uaccess.h>

@@ -159,6 +160,22 @@ static bool alloc_tag_module_unload(struct codetag_type *cttype, struct codetag_
return module_unused;
}

+static __init bool need_page_alloc_tagging(void)
+{
+ return true;
+}
+
+static __init void init_page_alloc_tagging(void)
+{
+}
+
+struct page_ext_operations page_alloc_tagging_ops = {
+ .size = sizeof(union codetag_ref),
+ .need = need_page_alloc_tagging,
+ .init = init_page_alloc_tagging,
+};
+EXPORT_SYMBOL(page_alloc_tagging_ops);
+
static int __init alloc_tag_init(void)
{
struct codetag_type *cttype;
diff --git a/mm/page_ext.c b/mm/page_ext.c
index dc1626be458b..eaf054ec276c 100644
--- a/mm/page_ext.c
+++ b/mm/page_ext.c
@@ -10,6 +10,7 @@
#include <linux/page_idle.h>
#include <linux/page_table_check.h>
#include <linux/rcupdate.h>
+#include <linux/pgalloc_tag.h>

/*
* struct page extension
@@ -82,6 +83,9 @@ static struct page_ext_operations *page_ext_ops[] __initdata = {
#if defined(CONFIG_PAGE_IDLE_FLAG) && !defined(CONFIG_64BIT)
&page_idle_ops,
#endif
+#ifdef CONFIG_MEM_ALLOC_PROFILING
+ &page_alloc_tagging_ops,
+#endif
#ifdef CONFIG_PAGE_TABLE_CHECK
&page_table_check_ops,
#endif
@@ -90,7 +94,7 @@ static struct page_ext_operations *page_ext_ops[] __initdata = {
unsigned long page_ext_size;

static unsigned long total_usage;
-static struct page_ext *lookup_page_ext(const struct page *page);
+struct page_ext *lookup_page_ext(const struct page *page);

bool early_page_ext __meminitdata;
static int __init setup_early_page_ext(char *str)
@@ -199,7 +203,7 @@ void __meminit pgdat_page_ext_init(struct pglist_data *pgdat)
pgdat->node_page_ext = NULL;
}

-static struct page_ext *lookup_page_ext(const struct page *page)
+struct page_ext *lookup_page_ext(const struct page *page)
{
unsigned long pfn = page_to_pfn(page);
unsigned long index;
@@ -219,6 +223,7 @@ static struct page_ext *lookup_page_ext(const struct page *page)
MAX_ORDER_NR_PAGES);
return get_entry(base, index);
}
+EXPORT_SYMBOL(lookup_page_ext);

static int __init alloc_node_page_ext(int nid)
{
@@ -278,7 +283,7 @@ static bool page_ext_invalid(struct page_ext *page_ext)
return !page_ext || (((unsigned long)page_ext & PAGE_EXT_INVALID) == PAGE_EXT_INVALID);
}

-static struct page_ext *lookup_page_ext(const struct page *page)
+struct page_ext *lookup_page_ext(const struct page *page)
{
unsigned long pfn = page_to_pfn(page);
struct mem_section *section = __pfn_to_section(pfn);
@@ -295,6 +300,7 @@ static struct page_ext *lookup_page_ext(const struct page *page)
return NULL;
return get_entry(page_ext, pfn);
}
+EXPORT_SYMBOL(lookup_page_ext);

static void *__meminit alloc_page_ext(size_t size, int nid)
{
--
2.40.1.495.gc816e09b53d-goog

Suren Baghdasaryan
May 1, 2023, 12:55:52 PM

After alloc_pages is redefined as a macro, all uses of that name are
replaced by the preprocessor. Rename the conflicting identifiers so the
preprocessor does not substitute them where that is not intended.
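
To illustrate the problem: a function-like macro expands wherever its name
is followed by a parenthesized argument list. With a (simplified)
definition like

  #define alloc_pages(_gfp, _order) \
          alloc_hooks(_alloc_pages(_gfp, _order), struct page *, NULL)

the indirect call ops->alloc_pages(dev, size, dma_handle, dir, gfp) in
__dma_alloc_pages() would be parsed as an invocation of the two-argument
macro and fail to compile. Renaming the member to alloc_pages_op keeps the
preprocessor away; the designated initializers are renamed to match.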

Signed-off-by: Suren Baghdasaryan <sur...@google.com>
---
arch/x86/kernel/amd_gart_64.c | 2 +-
drivers/iommu/dma-iommu.c | 2 +-
drivers/xen/grant-dma-ops.c | 2 +-
drivers/xen/swiotlb-xen.c | 2 +-
include/linux/dma-map-ops.h | 2 +-
kernel/dma/mapping.c | 4 ++--
6 files changed, 7 insertions(+), 7 deletions(-)

diff --git a/arch/x86/kernel/amd_gart_64.c b/arch/x86/kernel/amd_gart_64.c
index 56a917df410d..842a0ec5eaa9 100644
--- a/arch/x86/kernel/amd_gart_64.c
+++ b/arch/x86/kernel/amd_gart_64.c
@@ -676,7 +676,7 @@ static const struct dma_map_ops gart_dma_ops = {
.get_sgtable = dma_common_get_sgtable,
.dma_supported = dma_direct_supported,
.get_required_mask = dma_direct_get_required_mask,
- .alloc_pages = dma_direct_alloc_pages,
+ .alloc_pages_op = dma_direct_alloc_pages,
.free_pages = dma_direct_free_pages,
};

diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c
index 7a9f0b0bddbd..76a9d5ca4eee 100644
--- a/drivers/iommu/dma-iommu.c
+++ b/drivers/iommu/dma-iommu.c
@@ -1556,7 +1556,7 @@ static const struct dma_map_ops iommu_dma_ops = {
.flags = DMA_F_PCI_P2PDMA_SUPPORTED,
.alloc = iommu_dma_alloc,
.free = iommu_dma_free,
- .alloc_pages = dma_common_alloc_pages,
+ .alloc_pages_op = dma_common_alloc_pages,
.free_pages = dma_common_free_pages,
.alloc_noncontiguous = iommu_dma_alloc_noncontiguous,
.free_noncontiguous = iommu_dma_free_noncontiguous,
diff --git a/drivers/xen/grant-dma-ops.c b/drivers/xen/grant-dma-ops.c
index 9784a77fa3c9..6c7d984f164d 100644
--- a/drivers/xen/grant-dma-ops.c
+++ b/drivers/xen/grant-dma-ops.c
@@ -282,7 +282,7 @@ static int xen_grant_dma_supported(struct device *dev, u64 mask)
static const struct dma_map_ops xen_grant_dma_ops = {
.alloc = xen_grant_dma_alloc,
.free = xen_grant_dma_free,
- .alloc_pages = xen_grant_dma_alloc_pages,
+ .alloc_pages_op = xen_grant_dma_alloc_pages,
.free_pages = xen_grant_dma_free_pages,
.mmap = dma_common_mmap,
.get_sgtable = dma_common_get_sgtable,
diff --git a/drivers/xen/swiotlb-xen.c b/drivers/xen/swiotlb-xen.c
index 67aa74d20162..5ab2616153f0 100644
--- a/drivers/xen/swiotlb-xen.c
+++ b/drivers/xen/swiotlb-xen.c
@@ -403,6 +403,6 @@ const struct dma_map_ops xen_swiotlb_dma_ops = {
.dma_supported = xen_swiotlb_dma_supported,
.mmap = dma_common_mmap,
.get_sgtable = dma_common_get_sgtable,
- .alloc_pages = dma_common_alloc_pages,
+ .alloc_pages_op = dma_common_alloc_pages,
.free_pages = dma_common_free_pages,
};
diff --git a/include/linux/dma-map-ops.h b/include/linux/dma-map-ops.h
index 31f114f486c4..d741940dcb3b 100644
--- a/include/linux/dma-map-ops.h
+++ b/include/linux/dma-map-ops.h
@@ -27,7 +27,7 @@ struct dma_map_ops {
unsigned long attrs);
void (*free)(struct device *dev, size_t size, void *vaddr,
dma_addr_t dma_handle, unsigned long attrs);
- struct page *(*alloc_pages)(struct device *dev, size_t size,
+ struct page *(*alloc_pages_op)(struct device *dev, size_t size,
dma_addr_t *dma_handle, enum dma_data_direction dir,
gfp_t gfp);
void (*free_pages)(struct device *dev, size_t size, struct page *vaddr,
diff --git a/kernel/dma/mapping.c b/kernel/dma/mapping.c
index 9a4db5cce600..fc42930af14b 100644
--- a/kernel/dma/mapping.c
+++ b/kernel/dma/mapping.c
@@ -570,9 +570,9 @@ static struct page *__dma_alloc_pages(struct device *dev, size_t size,
size = PAGE_ALIGN(size);
if (dma_alloc_direct(dev, ops))
return dma_direct_alloc_pages(dev, size, dma_handle, dir, gfp);
- if (!ops->alloc_pages)
+ if (!ops->alloc_pages_op)
return NULL;
- return ops->alloc_pages(dev, size, dma_handle, dir, gfp);
+ return ops->alloc_pages_op(dev, size, dma_handle, dir, gfp);
}

struct page *dma_alloc_pages(struct device *dev, size_t size,
--
2.40.1.495.gc816e09b53d-goog

Suren Baghdasaryan
May 1, 2023, 12:55:55 PM

Redefine the page allocators to record allocation tags upon invocation.
Instrument post_alloc_hook() and free_pages_prepare() so that pages are
charged to the current allocation tag when allocated and uncharged when
freed.
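
To make the flow concrete, here is a paraphrased expansion of
alloc_pages(GFP_KERNEL, 0) under the new macros. The comments summarize
DEFINE_ALLOC_TAG()/alloc_tag_restore() behavior from the earlier patches
in the series; this is an informal sketch, not literal preprocessor
output:

  struct page *page = ({
          struct page *_res;
          /* Declares a static struct alloc_tag in the "alloc_tags"
           * section for this file:line, saves current->alloc_tag and
           * points it at the new tag. */
          DEFINE_ALLOC_TAG(_alloc_tag, _old);

          /* The allocator reaches post_alloc_hook(), which charges
           * PAGE_SIZE << order to current->alloc_tag through the
           * codetag_ref in the page's page_ext. */
          _res = _alloc_pages(GFP_KERNEL, 0);
          /* Put the saved tag back */
          alloc_tag_restore(&_alloc_tag, _old);
          _res;
  });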

Signed-off-by: Suren Baghdasaryan <sur...@google.com>
---
include/linux/alloc_tag.h | 11 ++++
include/linux/gfp.h | 123 +++++++++++++++++++++++++-----------
include/linux/page_ext.h | 1 -
include/linux/pagemap.h | 9 ++-
include/linux/pgalloc_tag.h | 38 +++++++++--
mm/compaction.c | 9 ++-
mm/filemap.c | 6 +-
mm/mempolicy.c | 30 ++++-----
mm/mm_init.c | 1 +
mm/page_alloc.c | 73 ++++++++++++---------
10 files changed, 208 insertions(+), 93 deletions(-)

diff --git a/include/linux/alloc_tag.h b/include/linux/alloc_tag.h
index d913f8d9a7d8..07922d81b641 100644
--- a/include/linux/alloc_tag.h
+++ b/include/linux/alloc_tag.h
@@ -102,4 +102,15 @@ static inline void alloc_tag_add(union codetag_ref *ref, struct alloc_tag *tag,

#endif

+#define alloc_hooks(_do_alloc, _res_type, _err) \
+({ \
+ _res_type _res; \
+ DEFINE_ALLOC_TAG(_alloc_tag, _old); \
+ \
+ _res = _do_alloc; \
+ alloc_tag_restore(&_alloc_tag, _old); \
+ _res; \
+})
+
+
#endif /* _LINUX_ALLOC_TAG_H */
diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index ed8cb537c6a7..0cb4a515109a 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -6,6 +6,8 @@

#include <linux/mmzone.h>
#include <linux/topology.h>
+#include <linux/alloc_tag.h>
+#include <linux/sched.h>

struct vm_area_struct;

@@ -174,42 +176,57 @@ static inline void arch_free_page(struct page *page, int order) { }
static inline void arch_alloc_page(struct page *page, int order) { }
#endif

-struct page *__alloc_pages(gfp_t gfp, unsigned int order, int preferred_nid,
+struct page *_alloc_pages2(gfp_t gfp, unsigned int order, int preferred_nid,
nodemask_t *nodemask);
-struct folio *__folio_alloc(gfp_t gfp, unsigned int order, int preferred_nid,
+#define __alloc_pages(_gfp, _order, _preferred_nid, _nodemask) \
+ alloc_hooks(_alloc_pages2(_gfp, _order, _preferred_nid, \
+ _nodemask), struct page *, NULL)
+
+struct folio *_folio_alloc2(gfp_t gfp, unsigned int order, int preferred_nid,
nodemask_t *nodemask);
+#define __folio_alloc(_gfp, _order, _preferred_nid, _nodemask) \
+ alloc_hooks(_folio_alloc2(_gfp, _order, _preferred_nid, \
+ _nodemask), struct folio *, NULL)

-unsigned long __alloc_pages_bulk(gfp_t gfp, int preferred_nid,
+unsigned long _alloc_pages_bulk(gfp_t gfp, int preferred_nid,
nodemask_t *nodemask, int nr_pages,
struct list_head *page_list,
struct page **page_array);
-
-unsigned long alloc_pages_bulk_array_mempolicy(gfp_t gfp,
+#define __alloc_pages_bulk(_gfp, _preferred_nid, _nodemask, _nr_pages, \
+ _page_list, _page_array) \
+ alloc_hooks(_alloc_pages_bulk(_gfp, _preferred_nid, \
+ _nodemask, _nr_pages, \
+ _page_list, _page_array), \
+ unsigned long, 0)
+
+unsigned long _alloc_pages_bulk_array_mempolicy(gfp_t gfp,
unsigned long nr_pages,
struct page **page_array);
+#define alloc_pages_bulk_array_mempolicy(_gfp, _nr_pages, _page_array) \
+ alloc_hooks(_alloc_pages_bulk_array_mempolicy(_gfp, \
+ _nr_pages, _page_array), \
+ unsigned long, 0)

/* Bulk allocate order-0 pages */
-static inline unsigned long
-alloc_pages_bulk_list(gfp_t gfp, unsigned long nr_pages, struct list_head *list)
-{
- return __alloc_pages_bulk(gfp, numa_mem_id(), NULL, nr_pages, list, NULL);
-}
+#define alloc_pages_bulk_list(_gfp, _nr_pages, _list) \
+ __alloc_pages_bulk(_gfp, numa_mem_id(), NULL, _nr_pages, _list, NULL)

-static inline unsigned long
-alloc_pages_bulk_array(gfp_t gfp, unsigned long nr_pages, struct page **page_array)
-{
- return __alloc_pages_bulk(gfp, numa_mem_id(), NULL, nr_pages, NULL, page_array);
-}
+#define alloc_pages_bulk_array(_gfp, _nr_pages, _page_array) \
+ __alloc_pages_bulk(_gfp, numa_mem_id(), NULL, _nr_pages, NULL, _page_array)

static inline unsigned long
-alloc_pages_bulk_array_node(gfp_t gfp, int nid, unsigned long nr_pages, struct page **page_array)
+_alloc_pages_bulk_array_node(gfp_t gfp, int nid, unsigned long nr_pages, struct page **page_array)
{
if (nid == NUMA_NO_NODE)
nid = numa_mem_id();

- return __alloc_pages_bulk(gfp, nid, NULL, nr_pages, NULL, page_array);
+ return _alloc_pages_bulk(gfp, nid, NULL, nr_pages, NULL, page_array);
}

+#define alloc_pages_bulk_array_node(_gfp, _nid, _nr_pages, _page_array) \
+ alloc_hooks(_alloc_pages_bulk_array_node(_gfp, _nid, _nr_pages, _page_array), \
+ unsigned long, 0)
+
static inline void warn_if_node_offline(int this_node, gfp_t gfp_mask)
{
gfp_t warn_gfp = gfp_mask & (__GFP_THISNODE|__GFP_NOWARN);
@@ -229,21 +246,25 @@ static inline void warn_if_node_offline(int this_node, gfp_t gfp_mask)
* online. For more general interface, see alloc_pages_node().
*/
static inline struct page *
-__alloc_pages_node(int nid, gfp_t gfp_mask, unsigned int order)
+_alloc_pages_node2(int nid, gfp_t gfp_mask, unsigned int order)
{
VM_BUG_ON(nid < 0 || nid >= MAX_NUMNODES);
warn_if_node_offline(nid, gfp_mask);

- return __alloc_pages(gfp_mask, order, nid, NULL);
+ return _alloc_pages2(gfp_mask, order, nid, NULL);
}

+#define __alloc_pages_node(_nid, _gfp_mask, _order) \
+ alloc_hooks(_alloc_pages_node2(_nid, _gfp_mask, _order), \
+ struct page *, NULL)
+
static inline
struct folio *__folio_alloc_node(gfp_t gfp, unsigned int order, int nid)
{
VM_BUG_ON(nid < 0 || nid >= MAX_NUMNODES);
warn_if_node_offline(nid, gfp);

- return __folio_alloc(gfp, order, nid, NULL);
+ return _folio_alloc2(gfp, order, nid, NULL);
}

/*
@@ -251,32 +272,45 @@ struct folio *__folio_alloc_node(gfp_t gfp, unsigned int order, int nid)
* prefer the current CPU's closest node. Otherwise node must be valid and
* online.
*/
-static inline struct page *alloc_pages_node(int nid, gfp_t gfp_mask,
+static inline struct page *_alloc_pages_node(int nid, gfp_t gfp_mask,
unsigned int order)
{
if (nid == NUMA_NO_NODE)
nid = numa_mem_id();

- return __alloc_pages_node(nid, gfp_mask, order);
+ return _alloc_pages_node2(nid, gfp_mask, order);
}

+#define alloc_pages_node(_nid, _gfp_mask, _order) \
+ alloc_hooks(_alloc_pages_node(_nid, _gfp_mask, _order), \
+ struct page *, NULL)
+
#ifdef CONFIG_NUMA
-struct page *alloc_pages(gfp_t gfp, unsigned int order);
-struct folio *folio_alloc(gfp_t gfp, unsigned order);
-struct folio *vma_alloc_folio(gfp_t gfp, int order, struct vm_area_struct *vma,
+struct page *_alloc_pages(gfp_t gfp, unsigned int order);
+struct folio *_folio_alloc(gfp_t gfp, unsigned int order);
+struct folio *_vma_alloc_folio(gfp_t gfp, int order, struct vm_area_struct *vma,
unsigned long addr, bool hugepage);
#else
-static inline struct page *alloc_pages(gfp_t gfp_mask, unsigned int order)
+static inline struct page *_alloc_pages(gfp_t gfp_mask, unsigned int order)
{
- return alloc_pages_node(numa_node_id(), gfp_mask, order);
+ return _alloc_pages_node(numa_node_id(), gfp_mask, order);
}
-static inline struct folio *folio_alloc(gfp_t gfp, unsigned int order)
+static inline struct folio *_folio_alloc(gfp_t gfp, unsigned int order)
{
return __folio_alloc_node(gfp, order, numa_node_id());
}
-#define vma_alloc_folio(gfp, order, vma, addr, hugepage) \
- folio_alloc(gfp, order)
+#define _vma_alloc_folio(gfp, order, vma, addr, hugepage) \
+ _folio_alloc(gfp, order)
#endif
+
+#define alloc_pages(_gfp, _order) \
+ alloc_hooks(_alloc_pages(_gfp, _order), struct page *, NULL)
+#define folio_alloc(_gfp, _order) \
+ alloc_hooks(_folio_alloc(_gfp, _order), struct folio *, NULL)
+#define vma_alloc_folio(_gfp, _order, _vma, _addr, _hugepage) \
+ alloc_hooks(_vma_alloc_folio(_gfp, _order, _vma, _addr, \
+ _hugepage), struct folio *, NULL)
+
#define alloc_page(gfp_mask) alloc_pages(gfp_mask, 0)
static inline struct page *alloc_page_vma(gfp_t gfp,
struct vm_area_struct *vma, unsigned long addr)
@@ -286,12 +320,21 @@ static inline struct page *alloc_page_vma(gfp_t gfp,
return &folio->page;
}

-extern unsigned long __get_free_pages(gfp_t gfp_mask, unsigned int order);
-extern unsigned long get_zeroed_page(gfp_t gfp_mask);
+extern unsigned long _get_free_pages(gfp_t gfp_mask, unsigned int order);
+#define __get_free_pages(_gfp_mask, _order) \
+ alloc_hooks(_get_free_pages(_gfp_mask, _order), unsigned long, 0)
+extern unsigned long _get_zeroed_page(gfp_t gfp_mask);
+#define get_zeroed_page(_gfp_mask) \
+ alloc_hooks(_get_zeroed_page(_gfp_mask), unsigned long, 0)

-void *alloc_pages_exact(size_t size, gfp_t gfp_mask) __alloc_size(1);
+void *_alloc_pages_exact(size_t size, gfp_t gfp_mask) __alloc_size(1);
+#define alloc_pages_exact(_size, _gfp_mask) \
+ alloc_hooks(_alloc_pages_exact(_size, _gfp_mask), void *, NULL)
void free_pages_exact(void *virt, size_t size);
-__meminit void *alloc_pages_exact_nid(int nid, size_t size, gfp_t gfp_mask) __alloc_size(2);
+
+__meminit void *_alloc_pages_exact_nid(int nid, size_t size, gfp_t gfp_mask) __alloc_size(2);
+#define alloc_pages_exact_nid(_nid, _size, _gfp_mask) \
+ alloc_hooks(_alloc_pages_exact_nid(_nid, _size, _gfp_mask), void *, NULL)

#define __get_free_page(gfp_mask) \
__get_free_pages((gfp_mask), 0)
@@ -354,10 +397,16 @@ static inline bool pm_suspended_storage(void)

#ifdef CONFIG_CONTIG_ALLOC
/* The below functions must be run on a range from a single zone. */
-extern int alloc_contig_range(unsigned long start, unsigned long end,
+extern int _alloc_contig_range(unsigned long start, unsigned long end,
unsigned migratetype, gfp_t gfp_mask);
-extern struct page *alloc_contig_pages(unsigned long nr_pages, gfp_t gfp_mask,
- int nid, nodemask_t *nodemask);
+#define alloc_contig_range(_start, _end, _migratetype, _gfp_mask) \
+ alloc_hooks(_alloc_contig_range(_start, _end, _migratetype, \
+ _gfp_mask), int, -ENOMEM)
+extern struct page *_alloc_contig_pages(unsigned long nr_pages, gfp_t gfp_mask,
+ int nid, nodemask_t *nodemask);
+#define alloc_contig_pages(_nr_pages, _gfp_mask, _nid, _nodemask) \
+ alloc_hooks(_alloc_contig_pages(_nr_pages, _gfp_mask, _nid, \
+ _nodemask), struct page *, NULL)
#endif
void free_contig_range(unsigned long pfn, unsigned long nr_pages);

diff --git a/include/linux/page_ext.h b/include/linux/page_ext.h
index 67314f648aeb..cff15ee5440e 100644
--- a/include/linux/page_ext.h
+++ b/include/linux/page_ext.h
@@ -4,7 +4,6 @@

#include <linux/types.h>
#include <linux/stacktrace.h>
-#include <linux/stackdepot.h>

struct pglist_data;

diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index a56308a9d1a4..b2efafa001f8 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -467,14 +467,17 @@ static inline void *detach_page_private(struct page *page)
}

#ifdef CONFIG_NUMA
-struct folio *filemap_alloc_folio(gfp_t gfp, unsigned int order);
+struct folio *_filemap_alloc_folio(gfp_t gfp, unsigned int order);
#else
-static inline struct folio *filemap_alloc_folio(gfp_t gfp, unsigned int order)
+static inline struct folio *_filemap_alloc_folio(gfp_t gfp, unsigned int order)
{
- return folio_alloc(gfp, order);
+ return _folio_alloc(gfp, order);
}
#endif

+#define filemap_alloc_folio(_gfp, _order) \
+ alloc_hooks(_filemap_alloc_folio(_gfp, _order), struct folio *, NULL)
+
static inline struct page *__page_cache_alloc(gfp_t gfp)
{
return &filemap_alloc_folio(gfp, 0)->page;
diff --git a/include/linux/pgalloc_tag.h b/include/linux/pgalloc_tag.h
index f8c7b6ef9c75..567327c1c46f 100644
--- a/include/linux/pgalloc_tag.h
+++ b/include/linux/pgalloc_tag.h
@@ -6,28 +6,58 @@
#define _LINUX_PGALLOC_TAG_H

#include <linux/alloc_tag.h>
+
+#ifdef CONFIG_MEM_ALLOC_PROFILING
+
#include <linux/page_ext.h>

extern struct page_ext_operations page_alloc_tagging_ops;
-struct page_ext *lookup_page_ext(const struct page *page);
+extern struct page_ext *page_ext_get(struct page *page);
+extern void page_ext_put(struct page_ext *page_ext);
+
+static inline union codetag_ref *codetag_ref_from_page_ext(struct page_ext *page_ext)
+{
+ return (void *)page_ext + page_alloc_tagging_ops.offset;
+}
+
+static inline struct page_ext *page_ext_from_codetag_ref(union codetag_ref *ref)
+{
+ return (void *)ref - page_alloc_tagging_ops.offset;
+}

static inline union codetag_ref *get_page_tag_ref(struct page *page)
{
if (page && mem_alloc_profiling_enabled()) {
- struct page_ext *page_ext = lookup_page_ext(page);
+ struct page_ext *page_ext = page_ext_get(page);

if (page_ext)
- return (void *)page_ext + page_alloc_tagging_ops.offset;
+ return codetag_ref_from_page_ext(page_ext);
}
return NULL;
}

+static inline void put_page_tag_ref(union codetag_ref *ref)
+{
+ if (ref)
+ page_ext_put(page_ext_from_codetag_ref(ref));
+}
+
static inline void pgalloc_tag_dec(struct page *page, unsigned int order)
{
union codetag_ref *ref = get_page_tag_ref(page);

- if (ref)
+ if (ref) {
alloc_tag_sub(ref, PAGE_SIZE << order);
+ put_page_tag_ref(ref);
+ }
}

+#else /* CONFIG_MEM_ALLOC_PROFILING */
+
+static inline union codetag_ref *get_page_tag_ref(struct page *page) { return NULL; }
+static inline void put_page_tag_ref(union codetag_ref *ref) {}
+#define pgalloc_tag_dec(__page, __size) do {} while (0)
+
+#endif /* CONFIG_MEM_ALLOC_PROFILING */
+
#endif /* _LINUX_PGALLOC_TAG_H */
diff --git a/mm/compaction.c b/mm/compaction.c
index c8bcdea15f5f..32707fb62495 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -1684,7 +1684,7 @@ static void isolate_freepages(struct compact_control *cc)
* This is a migrate-callback that "allocates" freepages by taking pages
* from the isolated freelists in the block we are migrating to.
*/
-static struct page *compaction_alloc(struct page *migratepage,
+static struct page *_compaction_alloc(struct page *migratepage,
unsigned long data)
{
struct compact_control *cc = (struct compact_control *)data;
@@ -1704,6 +1704,13 @@ static struct page *compaction_alloc(struct page *migratepage,
return freepage;
}

+static struct page *compaction_alloc(struct page *migratepage,
+ unsigned long data)
+{
+ return alloc_hooks(_compaction_alloc(migratepage, data),
+ struct page *, NULL);
+}
+
/*
* This is a migrate-callback that "frees" freepages back to the isolated
* freelist. All pages on the freelist are from the same zone, so there is no
diff --git a/mm/filemap.c b/mm/filemap.c
index a34abfe8c654..f0f8b782d172 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -958,7 +958,7 @@ int filemap_add_folio(struct address_space *mapping, struct folio *folio,
EXPORT_SYMBOL_GPL(filemap_add_folio);

#ifdef CONFIG_NUMA
-struct folio *filemap_alloc_folio(gfp_t gfp, unsigned int order)
+struct folio *_filemap_alloc_folio(gfp_t gfp, unsigned int order)
{
int n;
struct folio *folio;
@@ -973,9 +973,9 @@ struct folio *filemap_alloc_folio(gfp_t gfp, unsigned int order)

return folio;
}
- return folio_alloc(gfp, order);
+ return _folio_alloc(gfp, order);
}
-EXPORT_SYMBOL(filemap_alloc_folio);
+EXPORT_SYMBOL(_filemap_alloc_folio);
#endif

/*
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 2068b594dc88..80cd33811641 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -2141,7 +2141,7 @@ static struct page *alloc_pages_preferred_many(gfp_t gfp, unsigned int order,
}

/**
- * vma_alloc_folio - Allocate a folio for a VMA.
+ * _vma_alloc_folio - Allocate a folio for a VMA.
* @gfp: GFP flags.
* @order: Order of the folio.
* @vma: Pointer to VMA or NULL if not available.
@@ -2155,7 +2155,7 @@ static struct page *alloc_pages_preferred_many(gfp_t gfp, unsigned int order,
*
* Return: The folio on success or NULL if allocation fails.
*/
-struct folio *vma_alloc_folio(gfp_t gfp, int order, struct vm_area_struct *vma,
+struct folio *_vma_alloc_folio(gfp_t gfp, int order, struct vm_area_struct *vma,
unsigned long addr, bool hugepage)
{
struct mempolicy *pol;
@@ -2240,10 +2240,10 @@ struct folio *vma_alloc_folio(gfp_t gfp, int order, struct vm_area_struct *vma,
out:
return folio;
}
-EXPORT_SYMBOL(vma_alloc_folio);
+EXPORT_SYMBOL(_vma_alloc_folio);

/**
- * alloc_pages - Allocate pages.
+ * _alloc_pages - Allocate pages.
* @gfp: GFP flags.
* @order: Power of two of number of pages to allocate.
*
@@ -2256,7 +2256,7 @@ EXPORT_SYMBOL(vma_alloc_folio);
* flags are used.
* Return: The page on success or NULL if allocation fails.
*/
-struct page *alloc_pages(gfp_t gfp, unsigned order)
+struct page *_alloc_pages(gfp_t gfp, unsigned int order)
{
struct mempolicy *pol = &default_policy;
struct page *page;
@@ -2274,15 +2274,15 @@ struct page *alloc_pages(gfp_t gfp, unsigned order)
page = alloc_pages_preferred_many(gfp, order,
policy_node(gfp, pol, numa_node_id()), pol);
else
- page = __alloc_pages(gfp, order,
+ page = _alloc_pages2(gfp, order,
policy_node(gfp, pol, numa_node_id()),
policy_nodemask(gfp, pol));

return page;
}
-EXPORT_SYMBOL(alloc_pages);
+EXPORT_SYMBOL(_alloc_pages);

-struct folio *folio_alloc(gfp_t gfp, unsigned order)
+struct folio *_folio_alloc(gfp_t gfp, unsigned int order)
{
struct page *page = alloc_pages(gfp | __GFP_COMP, order);

@@ -2290,7 +2290,7 @@ struct folio *folio_alloc(gfp_t gfp, unsigned order)
prep_transhuge_page(page);
return (struct folio *)page;
}
-EXPORT_SYMBOL(folio_alloc);
+EXPORT_SYMBOL(_folio_alloc);

static unsigned long alloc_pages_bulk_array_interleave(gfp_t gfp,
struct mempolicy *pol, unsigned long nr_pages,
@@ -2309,13 +2309,13 @@ static unsigned long alloc_pages_bulk_array_interleave(gfp_t gfp,

for (i = 0; i < nodes; i++) {
if (delta) {
- nr_allocated = __alloc_pages_bulk(gfp,
+ nr_allocated = _alloc_pages_bulk(gfp,
interleave_nodes(pol), NULL,
nr_pages_per_node + 1, NULL,
page_array);
delta--;
} else {
- nr_allocated = __alloc_pages_bulk(gfp,
+ nr_allocated = _alloc_pages_bulk(gfp,
interleave_nodes(pol), NULL,
nr_pages_per_node, NULL, page_array);
}
@@ -2337,11 +2337,11 @@ static unsigned long alloc_pages_bulk_array_preferred_many(gfp_t gfp, int nid,
preferred_gfp = gfp | __GFP_NOWARN;
preferred_gfp &= ~(__GFP_DIRECT_RECLAIM | __GFP_NOFAIL);

- nr_allocated = __alloc_pages_bulk(preferred_gfp, nid, &pol->nodes,
+ nr_allocated = _alloc_pages_bulk(preferred_gfp, nid, &pol->nodes,
nr_pages, NULL, page_array);

if (nr_allocated < nr_pages)
- nr_allocated += __alloc_pages_bulk(gfp, numa_node_id(), NULL,
+ nr_allocated += _alloc_pages_bulk(gfp, numa_node_id(), NULL,
nr_pages - nr_allocated, NULL,
page_array + nr_allocated);
return nr_allocated;
@@ -2353,7 +2353,7 @@ static unsigned long alloc_pages_bulk_array_preferred_many(gfp_t gfp, int nid,
* It can accelerate memory allocation especially interleaving
* allocate memory.
*/
-unsigned long alloc_pages_bulk_array_mempolicy(gfp_t gfp,
+unsigned long _alloc_pages_bulk_array_mempolicy(gfp_t gfp,
unsigned long nr_pages, struct page **page_array)
{
struct mempolicy *pol = &default_policy;
@@ -2369,7 +2369,7 @@ unsigned long alloc_pages_bulk_array_mempolicy(gfp_t gfp,
return alloc_pages_bulk_array_preferred_many(gfp,
numa_node_id(), pol, nr_pages, page_array);

- return __alloc_pages_bulk(gfp, policy_node(gfp, pol, numa_node_id()),
+ return _alloc_pages_bulk(gfp, policy_node(gfp, pol, numa_node_id()),
policy_nodemask(gfp, pol), nr_pages, NULL,
page_array);
}
diff --git a/mm/mm_init.c b/mm/mm_init.c
index 7f7f9c677854..42135fad4d8a 100644
--- a/mm/mm_init.c
+++ b/mm/mm_init.c
@@ -24,6 +24,7 @@
#include <linux/page_ext.h>
#include <linux/pti.h>
#include <linux/pgtable.h>
+#include <linux/stackdepot.h>
#include <linux/swap.h>
#include <linux/cma.h>
#include "internal.h"
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 9de2a18519a1..edd35500f7f6 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -74,6 +74,7 @@
#include <linux/psi.h>
#include <linux/khugepaged.h>
#include <linux/delayacct.h>
+#include <linux/pgalloc_tag.h>
#include <asm/sections.h>
#include <asm/tlbflush.h>
#include <asm/div64.h>
@@ -657,6 +658,7 @@ static inline bool pcp_allowed_order(unsigned int order)

static inline void free_the_page(struct page *page, unsigned int order)
{
+
if (pcp_allowed_order(order)) /* Via pcp? */
free_unref_page(page, order);
else
@@ -1259,6 +1261,7 @@ static __always_inline bool free_pages_prepare(struct page *page,
__memcg_kmem_uncharge_page(page, order);
reset_page_owner(page, order);
page_table_check_free(page, order);
+ pgalloc_tag_dec(page, order);
return false;
}

@@ -1301,6 +1304,7 @@ static __always_inline bool free_pages_prepare(struct page *page,
page->flags &= ~PAGE_FLAGS_CHECK_AT_PREP;
reset_page_owner(page, order);
page_table_check_free(page, order);
+ pgalloc_tag_dec(page, order);

if (!PageHighMem(page)) {
debug_check_no_locks_freed(page_address(page),
@@ -1669,6 +1673,9 @@ inline void post_alloc_hook(struct page *page, unsigned int order,
bool init = !want_init_on_free() && want_init_on_alloc(gfp_flags) &&
!should_skip_init(gfp_flags);
bool zero_tags = init && (gfp_flags & __GFP_ZEROTAGS);
+#ifdef CONFIG_MEM_ALLOC_PROFILING
+ union codetag_ref *ref;
+#endif
int i;

set_page_private(page, 0);
@@ -1721,6 +1728,14 @@ inline void post_alloc_hook(struct page *page, unsigned int order,

set_page_owner(page, order, gfp_flags);
page_table_check_alloc(page, order);
+
+#ifdef CONFIG_MEM_ALLOC_PROFILING
+ ref = get_page_tag_ref(page);
+ if (ref) {
+ alloc_tag_add(ref, current->alloc_tag, PAGE_SIZE << order);
+ put_page_tag_ref(ref);
+ }
+#endif
}

static void prep_new_page(struct page *page, unsigned int order, gfp_t gfp_flags,
@@ -4568,7 +4583,7 @@ static inline bool prepare_alloc_pages(gfp_t gfp_mask, unsigned int order,
*
* Returns the number of pages on the list or array.
*/
-unsigned long __alloc_pages_bulk(gfp_t gfp, int preferred_nid,
+unsigned long _alloc_pages_bulk(gfp_t gfp, int preferred_nid,
nodemask_t *nodemask, int nr_pages,
struct list_head *page_list,
struct page **page_array)
@@ -4704,7 +4719,7 @@ unsigned long __alloc_pages_bulk(gfp_t gfp, int preferred_nid,
pcp_trylock_finish(UP_flags);

failed:
- page = __alloc_pages(gfp, 0, preferred_nid, nodemask);
+ page = _alloc_pages2(gfp, 0, preferred_nid, nodemask);
if (page) {
if (page_list)
list_add(&page->lru, page_list);
@@ -4715,12 +4730,12 @@ unsigned long __alloc_pages_bulk(gfp_t gfp, int preferred_nid,

goto out;
}
-EXPORT_SYMBOL_GPL(__alloc_pages_bulk);
+EXPORT_SYMBOL_GPL(_alloc_pages_bulk);

/*
* This is the 'heart' of the zoned buddy allocator.
*/
-struct page *__alloc_pages(gfp_t gfp, unsigned int order, int preferred_nid,
+struct page *_alloc_pages2(gfp_t gfp, unsigned int order, int preferred_nid,
nodemask_t *nodemask)
{
struct page *page;
@@ -4783,41 +4798,41 @@ struct page *__alloc_pages(gfp_t gfp, unsigned int order, int preferred_nid,

return page;
}
-EXPORT_SYMBOL(__alloc_pages);
+EXPORT_SYMBOL(_alloc_pages2);

-struct folio *__folio_alloc(gfp_t gfp, unsigned int order, int preferred_nid,
+struct folio *_folio_alloc2(gfp_t gfp, unsigned int order, int preferred_nid,
nodemask_t *nodemask)
{
- struct page *page = __alloc_pages(gfp | __GFP_COMP, order,
+ struct page *page = _alloc_pages2(gfp | __GFP_COMP, order,
preferred_nid, nodemask);

if (page && order > 1)
prep_transhuge_page(page);
return (struct folio *)page;
}
-EXPORT_SYMBOL(__folio_alloc);
+EXPORT_SYMBOL(_folio_alloc2);

/*
* Common helper functions. Never use with __GFP_HIGHMEM because the returned
* address cannot represent highmem pages. Use alloc_pages and then kmap if
* you need to access high mem.
*/
-unsigned long __get_free_pages(gfp_t gfp_mask, unsigned int order)
+unsigned long _get_free_pages(gfp_t gfp_mask, unsigned int order)