[PATCH 00/40] Memory allocation profiling


Suren Baghdasaryan

May 1, 2023, 12:55:08 PM
to ak...@linux-foundation.org, kent.ov...@linux.dev, mho...@suse.com, vba...@suse.cz, han...@cmpxchg.org, roman.g...@linux.dev, mgo...@suse.de, da...@stgolabs.net, wi...@infradead.org, liam.h...@oracle.com, cor...@lwn.net, vo...@manifault.com, pet...@infradead.org, juri....@redhat.com, ldu...@linux.ibm.com, catalin...@arm.com, wi...@kernel.org, ar...@arndb.de, tg...@linutronix.de, mi...@redhat.com, dave....@linux.intel.com, x...@kernel.org, pet...@redhat.com, da...@redhat.com, ax...@kernel.dk, mcg...@kernel.org, masa...@kernel.org, nat...@kernel.org, den...@kernel.org, t...@kernel.org, muchu...@linux.dev, rp...@kernel.org, pau...@kernel.org, pasha.t...@soleen.com, yosry...@google.com, yuz...@google.com, dhow...@redhat.com, hu...@google.com, andre...@gmail.com, kees...@chromium.org, ndesau...@google.com, gre...@linuxfoundation.org, ebig...@google.com, ytc...@gmail.com, vincent...@linaro.org, dietmar....@arm.com, ros...@goodmis.org, bse...@google.com, bri...@redhat.com, vsch...@redhat.com, c...@linux.com, pen...@kernel.org, iamjoon...@lge.com, 42.h...@gmail.com, gli...@google.com, el...@google.com, dvy...@google.com, shak...@google.com, songm...@bytedance.com, jba...@akamai.com, rien...@google.com, min...@google.com, kales...@google.com, sur...@google.com, kerne...@android.com, linu...@vger.kernel.org, linux-...@vger.kernel.org, io...@lists.linux.dev, linux...@vger.kernel.org, linux-...@vger.kernel.org, linu...@kvack.org, linux-...@vger.kernel.org, kasa...@googlegroups.com, cgr...@vger.kernel.org
Memory allocation profiling infrastructure provides a low overhead
mechanism to make all kernel allocations in the system visible. It can be
used to monitor memory usage, track memory hotspots, detect memory leaks,
and identify memory regressions.

To keep the overhead to a minimum, we record only allocation sizes for
every allocation in the codebase. With that information, if users are
interested in more detailed context for a specific allocation, they can
enable in-depth context tracking, which includes capturing the pid, tgid,
task name, allocation size, timestamp and call stack for every allocation
at the specified code location.

The data is exposed to user space via a read-only debugfs file called
allocations. Usage example:

$ sort -hr /sys/kernel/debug/allocations|head
153MiB 8599 mm/slub.c:1826 module:slub func:alloc_slab_page
6.08MiB 49 mm/slab_common.c:950 module:slab_common func:_kmalloc_order
5.09MiB 6335 mm/memcontrol.c:2814 module:memcontrol func:alloc_slab_obj_exts
4.54MiB 78 mm/page_alloc.c:5777 module:page_alloc func:alloc_pages_exact
1.32MiB 338 include/asm-generic/pgalloc.h:63 module:pgtable func:__pte_alloc_one
1.16MiB 603 fs/xfs/xfs_log_priv.h:700 module:xfs func:xlog_kvmalloc
1.00MiB 256 mm/swap_cgroup.c:48 module:swap_cgroup func:swap_cgroup_prepare
734KiB 5380 fs/xfs/kmem.c:20 module:xfs func:kmem_alloc
640KiB 160 kernel/rcu/tree.c:3184 module:tree func:fill_page_cache_func
640KiB 160 drivers/char/virtio_console.c:452 module:virtio_console func:alloc_buf

For allocation context capture, a new debugfs file called allocations.ctx
is used to select which code location should capture allocation context
and to read captured context information. Usage example:

$ cd /sys/kernel/debug/
$ echo "file include/asm-generic/pgalloc.h line 63 enable" > allocations.ctx
$ cat allocations.ctx
920KiB 230 include/asm-generic/pgalloc.h:63 module:pgtable func:__pte_alloc_one
size: 4096
pid: 1474
tgid: 1474
comm: bash
ts: 175332940994
call stack:
pte_alloc_one+0xfe/0x130
__pte_alloc+0x22/0xb0
copy_page_range+0x842/0x1640
dup_mm+0x42d/0x580
copy_process+0xfb1/0x1ac0
kernel_clone+0x92/0x3e0
__do_sys_clone+0x66/0x90
do_syscall_64+0x38/0x90
entry_SYSCALL_64_after_hwframe+0x63/0xcd
...

The implementation uses a more generic concept of code tagging, introduced
as part of this patchset. A code tag is a structure that identifies a
specific location in the source code; it is generated at compile time and
can be embedded in an application-specific structure. A number of
applications for code tagging were presented in the original RFC [1].
Code tagging uses the old trick of "define a special ELF section for
objects of a given type so that we can iterate over them at runtime" and
creates a proper library for it.
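
The section trick described above can be sketched in user space. This is
only an illustration, assuming GCC or Clang on an ELF target with GNU ld or
lld; struct code_tag, CODE_TAG_HERE() and the "code_tags" section name are
hypothetical and are not the definitions used by this series.

```c
#include <stddef.h>

/* Hypothetical tag type; not the struct codetag introduced by the series. */
struct code_tag {
	const char *file;
	const char *func;
	int line;
} __attribute__((aligned(8)));

/*
 * Emit one static tag per expansion into the "code_tags" ELF section.
 * The explicit alignment on the variable stops the compiler from raising
 * it further, so the section can be walked as a dense array.
 * Statement expressions are a GNU C extension.
 */
#define CODE_TAG_HERE()							\
	({								\
		static struct code_tag _tag				\
		__attribute__((used, aligned(8), section("code_tags"))) = \
			{ __FILE__, __func__, __LINE__ };		\
		&_tag;							\
	})

/*
 * The linker provides these boundary symbols for any section whose name
 * is a valid C identifier.
 */
extern struct code_tag __start_code_tags[];
extern struct code_tag __stop_code_tags[];

const struct code_tag *tagged_site_one(void)
{
	return CODE_TAG_HERE();
}

const struct code_tag *tagged_site_two(void)
{
	return CODE_TAG_HERE();
}

/* Walk every tag in the binary; no runtime registration is needed. */
size_t count_code_tags(void)
{
	size_t n = 0;

	for (struct code_tag *t = __start_code_tags; t < __stop_code_tags; t++)
		n++;
	return n;
}
```

Every expansion of the macro contributes one entry at build time, so the
iteration sees callsites that have never executed as well as those that
have.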

To profile memory allocations, we instrument page, slab and percpu
allocators to record total memory allocated in the associated code tag at
every allocation in the codebase. Every time an allocation is performed by
an instrumented allocator, the code tag at that location increments its
counter by the allocation size. Every time the memory is freed, the counter
is decremented. To decrement the counter upon freeing, each allocated object
needs a reference to its code tag. Page allocators use page_ext to record
this reference, while slab allocators use the memcg_data field (renamed to
the more generic slabobj_ext) of the slab page.
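
The accounting scheme in the paragraph above can be modelled in user space
as follows. This is a simplified sketch and not the kernel code: the names
alloc_tag_sketch, obj_header, tagged_alloc() and tagged_free() are
hypothetical, the back-reference lives in a header in front of the object
rather than in page_ext or slabobj_ext, and the counters are plain integers
with no concurrency protection.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical per-callsite counters. */
struct alloc_tag_sketch {
	const char *file;
	int line;
	size_t bytes;	/* bytes currently allocated from this callsite */
	size_t calls;	/* live allocations from this callsite */
};

/*
 * Header placed in front of every object. It plays the role of the
 * back-reference that lets the free path find the tag to decrement.
 */
struct obj_header {
	_Alignas(max_align_t) struct alloc_tag_sketch *tag;
	size_t size;
};

void *tagged_alloc(struct alloc_tag_sketch *tag, size_t size)
{
	struct obj_header *h = malloc(sizeof(*h) + size);

	if (!h)
		return NULL;
	h->tag = tag;
	h->size = size;
	tag->bytes += size;	/* increment on allocation */
	tag->calls++;
	return h + 1;
}

void tagged_free(void *p)
{
	struct obj_header *h;

	if (!p)
		return;
	h = (struct obj_header *)p - 1;
	h->tag->bytes -= h->size;	/* decrement on free */
	h->tag->calls--;
	free(h);
}

/* One tag per callsite; here there is a single demo callsite. */
struct alloc_tag_sketch demo_tag = { __FILE__, __LINE__, 0, 0 };

void *demo_alloc(size_t size)
{
	return tagged_alloc(&demo_tag, size);
}
```

Because each object carries a reference to its tag, the free path needs no
lookup by address or by caller; it follows the reference and subtracts the
recorded size.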

Module allocations are accounted the same way as other kernel allocations.
Module loading and unloading are supported. If a module is unloaded while
one or more of its allocations are still not freed (a rather rare
condition), its data section is kept in memory so that the code tags can
still be referenced when those allocations are eventually freed.

As part of this series we introduce several kernel configs:
CONFIG_CODE_TAGGING - to enable the code tagging framework
CONFIG_MEM_ALLOC_PROFILING - to enable memory allocation profiling
CONFIG_MEM_ALLOC_PROFILING_DEBUG - to enable memory allocation profiling
  validation

Note: CONFIG_MEM_ALLOC_PROFILING enables CONFIG_PAGE_EXTENSION to store
the code tag reference in the page_ext object.

The nomem_profiling kernel command-line parameter is also provided to
disable the functionality and avoid the performance overhead.

Performance overhead:
To evaluate performance we implemented an in-kernel test executing
multiple get_free_page/free_page and kmalloc/kfree calls with allocation
sizes growing from 8 to 240 bytes with CPU frequency set to max and CPU
affinity set to a specific CPU to minimize the noise. Below is a performance
comparison between the baseline kernel, profiling when enabled, profiling
when disabled (nomem_profiling=y) and (for comparison purposes) baseline
with CONFIG_MEMCG_KMEM enabled and allocations using __GFP_ACCOUNT:

                        kmalloc               pgalloc
Baseline (6.3-rc7)      9.200s                31.050s
profiling disabled      9.800s (+6.52%)       32.600s (+4.99%)
profiling enabled       12.500s (+35.87%)     39.010s (+25.60%)
memcg_kmem enabled      41.400s (+350.00%)    70.600s (+127.38%)

[1] https://lore.kernel.org/all/20220830214919...@google.com/

Kent Overstreet (15):
lib/string_helpers: Drop space in string_get_size's output
scripts/kallsyms: Always include __start and __stop symbols
fs: Convert alloc_inode_sb() to a macro
nodemask: Split out include/linux/nodemask_types.h
prandom: Remove unused include
lib/string.c: strsep_no_empty()
Lazy percpu counters
lib: code tagging query helper functions
mm/slub: Mark slab_free_freelist_hook() __always_inline
mempool: Hook up to memory allocation profiling
timekeeping: Fix a circular include dependency
mm: percpu: Introduce pcpuobj_ext
mm: percpu: Add codetag reference into pcpuobj_ext
arm64: Fix circular header dependency
MAINTAINERS: Add entries for code tagging and memory allocation
profiling

Suren Baghdasaryan (25):
mm: introduce slabobj_ext to support slab object extensions
mm: introduce __GFP_NO_OBJ_EXT flag to selectively prevent slabobj_ext
creation
mm/slab: introduce SLAB_NO_OBJ_EXT to avoid obj_ext creation
mm: prevent slabobj_ext allocations for slabobj_ext and kmem_cache
objects
slab: objext: introduce objext_flags as extension to
page_memcg_data_flags
lib: code tagging framework
lib: code tagging module support
lib: prevent module unloading if memory is not freed
lib: add allocation tagging support for memory allocation profiling
lib: introduce support for page allocation tagging
change alloc_pages name in dma_map_ops to avoid name conflicts
mm: enable page allocation tagging
mm/page_ext: enable early_page_ext when
CONFIG_MEM_ALLOC_PROFILING_DEBUG=y
mm: create new codetag references during page splitting
lib: add codetag reference into slabobj_ext
mm/slab: add allocation accounting into slab allocation and free paths
mm/slab: enable slab allocation tagging for kmalloc and friends
mm: percpu: enable per-cpu allocation tagging
move stack capture functionality into a separate function for reuse
lib: code tagging context capture support
lib: implement context capture support for tagged allocations
lib: add memory allocations report in show_mem()
codetag: debug: skip objext checking when it's for objext itself
codetag: debug: mark codetags for reserved pages as empty
codetag: debug: introduce OBJEXTS_ALLOC_FAIL to mark failed slab_ext
allocations

.../admin-guide/kernel-parameters.txt | 2 +
MAINTAINERS | 22 +
arch/arm64/include/asm/spectre.h | 4 +-
arch/x86/kernel/amd_gart_64.c | 2 +-
drivers/iommu/dma-iommu.c | 2 +-
drivers/xen/grant-dma-ops.c | 2 +-
drivers/xen/swiotlb-xen.c | 2 +-
include/asm-generic/codetag.lds.h | 14 +
include/asm-generic/vmlinux.lds.h | 3 +
include/linux/alloc_tag.h | 161 ++++++
include/linux/codetag.h | 159 ++++++
include/linux/codetag_ctx.h | 48 ++
include/linux/dma-map-ops.h | 2 +-
include/linux/fs.h | 6 +-
include/linux/gfp.h | 123 ++--
include/linux/gfp_types.h | 12 +-
include/linux/hrtimer.h | 2 +-
include/linux/lazy-percpu-counter.h | 102 ++++
include/linux/memcontrol.h | 56 +-
include/linux/mempool.h | 73 ++-
include/linux/mm.h | 8 +
include/linux/mm_types.h | 4 +-
include/linux/nodemask.h | 2 +-
include/linux/nodemask_types.h | 9 +
include/linux/page_ext.h | 1 -
include/linux/pagemap.h | 9 +-
include/linux/percpu.h | 19 +-
include/linux/pgalloc_tag.h | 95 ++++
include/linux/prandom.h | 1 -
include/linux/sched.h | 32 +-
include/linux/slab.h | 182 +++---
include/linux/slab_def.h | 2 +-
include/linux/slub_def.h | 4 +-
include/linux/stackdepot.h | 16 +
include/linux/string.h | 1 +
include/linux/time_namespace.h | 2 +
init/Kconfig | 4 +
kernel/dma/mapping.c | 4 +-
kernel/module/main.c | 25 +-
lib/Kconfig | 3 +
lib/Kconfig.debug | 26 +
lib/Makefile | 5 +
lib/alloc_tag.c | 464 +++++++++++++++
lib/codetag.c | 529 ++++++++++++++++++
lib/lazy-percpu-counter.c | 127 +++++
lib/show_mem.c | 15 +
lib/stackdepot.c | 68 +++
lib/string.c | 19 +
lib/string_helpers.c | 3 +-
mm/compaction.c | 9 +-
mm/filemap.c | 6 +-
mm/huge_memory.c | 2 +
mm/kfence/core.c | 14 +-
mm/kfence/kfence.h | 4 +-
mm/memcontrol.c | 56 +-
mm/mempolicy.c | 30 +-
mm/mempool.c | 28 +-
mm/mm_init.c | 1 +
mm/page_alloc.c | 75 ++-
mm/page_ext.c | 21 +-
mm/page_owner.c | 54 +-
mm/percpu-internal.h | 26 +-
mm/percpu.c | 122 ++--
mm/slab.c | 22 +-
mm/slab.h | 224 ++++++--
mm/slab_common.c | 95 +++-
mm/slub.c | 24 +-
mm/util.c | 10 +-
scripts/kallsyms.c | 13 +
scripts/module.lds.S | 7 +
70 files changed, 2765 insertions(+), 554 deletions(-)
create mode 100644 include/asm-generic/codetag.lds.h
create mode 100644 include/linux/alloc_tag.h
create mode 100644 include/linux/codetag.h
create mode 100644 include/linux/codetag_ctx.h
create mode 100644 include/linux/lazy-percpu-counter.h
create mode 100644 include/linux/nodemask_types.h
create mode 100644 include/linux/pgalloc_tag.h
create mode 100644 lib/alloc_tag.c
create mode 100644 lib/codetag.c
create mode 100644 lib/lazy-percpu-counter.c

--
2.40.1.495.gc816e09b53d-goog

Suren Baghdasaryan

May 1, 2023, 12:55:11 PM
From: Kent Overstreet <kent.ov...@linux.dev>

Previously, string_get_size() emitted a space between the number and
the units, e.g.
9.88 MiB

This changes it to
9.88MiB

which allows it to be parsed correctly by the 'sort -h' command.
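
The resulting output format can be illustrated with a user-space sketch.
format_size_sketch() is a hypothetical, much simplified stand-in for
string_get_size(): it handles binary units only, always prints two
decimals, and truncates instead of rounding. Its one purpose is to show the
number and the unit joined without a space.

```c
#include <stddef.h>
#include <stdio.h>

static const char *const size_units[] = { "B", "KiB", "MiB", "GiB", "TiB" };

/* Format @size as "<number><unit>" with no space in between. */
int format_size_sketch(unsigned long long size, char *buf, size_t len)
{
	unsigned long long rem = 0;
	size_t i = 0;

	/* Divide down until the value fits the largest suitable unit. */
	while (size >= 1024 && i < 4) {
		rem = size % 1024;
		size /= 1024;
		i++;
	}

	if (i == 0)
		return snprintf(buf, len, "%llu%s", size, size_units[i]);

	/* rem * 100 / 1024 yields two truncated decimal digits. */
	return snprintf(buf, len, "%llu.%02llu%s", size, rem * 100 / 1024,
			size_units[i]);
}
```

With the unit attached to the number, each size is a single
whitespace-delimited field in the profiling output.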

Signed-off-by: Kent Overstreet <kent.ov...@linux.dev>
Signed-off-by: Suren Baghdasaryan <sur...@google.com>
Cc: Andy Shevchenko <an...@kernel.org>
Cc: Michael Ellerman <m...@ellerman.id.au>
Cc: Benjamin Herrenschmidt <be...@kernel.crashing.org>
Cc: Paul Mackerras <pau...@samba.org>
Cc: "Michael S. Tsirkin" <m...@redhat.com>
Cc: Jason Wang <jaso...@redhat.com>
Cc: "Noralf Trønnes" <nor...@tronnes.org>
Cc: Jens Axboe <ax...@kernel.dk>
---
lib/string_helpers.c | 3 +--
1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/lib/string_helpers.c b/lib/string_helpers.c
index 230020a2e076..593b29fece32 100644
--- a/lib/string_helpers.c
+++ b/lib/string_helpers.c
@@ -126,8 +126,7 @@ void string_get_size(u64 size, u64 blk_size, const enum string_size_units units,
else
unit = units_str[units][i];

- snprintf(buf, len, "%u%s %s", (u32)size,
- tmp, unit);
+ snprintf(buf, len, "%u%s%s", (u32)size, tmp, unit);
}
EXPORT_SYMBOL(string_get_size);

--
2.40.1.495.gc816e09b53d-goog

Suren Baghdasaryan

May 1, 2023, 12:55:13 PM
From: Kent Overstreet <kent.ov...@linux.dev>

These symbols are used to denote section boundaries: by always including
them we can unify loading sections from modules with loading built-in
sections, which leads to some significant cleanup.

Signed-off-by: Kent Overstreet <kent.ov...@linux.dev>
Signed-off-by: Suren Baghdasaryan <sur...@google.com>
---
scripts/kallsyms.c | 13 +++++++++++++
1 file changed, 13 insertions(+)

diff --git a/scripts/kallsyms.c b/scripts/kallsyms.c
index 0d2db41177b2..7b7dbeb5bd6e 100644
--- a/scripts/kallsyms.c
+++ b/scripts/kallsyms.c
@@ -203,6 +203,11 @@ static int symbol_in_range(const struct sym_entry *s,
return 0;
}

+static bool string_starts_with(const char *s, const char *prefix)
+{
+ return strncmp(s, prefix, strlen(prefix)) == 0;
+}
+
static int symbol_valid(const struct sym_entry *s)
{
const char *name = sym_name(s);
@@ -210,6 +215,14 @@ static int symbol_valid(const struct sym_entry *s)
/* if --all-symbols is not specified, then symbols outside the text
* and inittext sections are discarded */
if (!all_symbols) {
+ /*
+ * Symbols starting with __start and __stop are used to denote
+ * section boundaries, and should always be included:
+ */
+ if (string_starts_with(name, "__start_") ||
+ string_starts_with(name, "__stop_"))
+ return 1;
+
if (symbol_in_range(s, text_ranges,
ARRAY_SIZE(text_ranges)) == 0)
return 0;
--
2.40.1.495.gc816e09b53d-goog

Suren Baghdasaryan

May 1, 2023, 12:55:15 PM
From: Kent Overstreet <kent.ov...@linux.dev>

We're introducing alloc tagging, which tracks memory allocations by
callsite. Converting alloc_inode_sb() to a macro means allocations will
be tracked by its caller, which is a bit more useful.
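
The difference between a helper function and a macro for callsite
attribution can be shown with __LINE__ alone. This is a sketch with
hypothetical names, not code from the series: anything that captures the
source location expands inside a function body once, but inside a macro at
every place the macro is used.

```c
/* Helper function: __LINE__ expands here, once, inside the helper. */
static int line_seen_by_function(void)
{
	return __LINE__;
}

/* Macro: __LINE__ expands at each place the macro is used. */
#define LINE_SEEN_BY_MACRO() (__LINE__)

struct site_lines {
	int via_function;
	int via_macro;
};

struct site_lines caller_a(void)
{
	struct site_lines s = { line_seen_by_function(), LINE_SEEN_BY_MACRO() };

	return s;
}

struct site_lines caller_b(void)
{
	struct site_lines s = { line_seen_by_function(), LINE_SEEN_BY_MACRO() };

	return s;
}
```

Both callers report the same line through the function and distinct lines
through the macro, which is why a wrapper has to be a macro if allocations
are to be charged to the wrapper's caller.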

Signed-off-by: Kent Overstreet <kent.ov...@linux.dev>
Signed-off-by: Suren Baghdasaryan <sur...@google.com>
Cc: Alexander Viro <vi...@zeniv.linux.org.uk>
---
include/linux/fs.h | 6 +-----
1 file changed, 1 insertion(+), 5 deletions(-)

diff --git a/include/linux/fs.h b/include/linux/fs.h
index 21a981680856..4905ce14db0b 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2699,11 +2699,7 @@ int setattr_should_drop_sgid(struct mnt_idmap *idmap,
* This must be used for allocating filesystems specific inodes to set
* up the inode reclaim context correctly.
*/
-static inline void *
-alloc_inode_sb(struct super_block *sb, struct kmem_cache *cache, gfp_t gfp)
-{
- return kmem_cache_alloc_lru(cache, &sb->s_inode_lru, gfp);
-}
+#define alloc_inode_sb(_sb, _cache, _gfp) kmem_cache_alloc_lru(_cache, &(_sb)->s_inode_lru, _gfp)

extern void __insert_inode_hash(struct inode *, unsigned long hashval);
static inline void insert_inode_hash(struct inode *inode)
--
2.40.1.495.gc816e09b53d-goog

Suren Baghdasaryan

May 1, 2023, 12:55:18 PM
From: Kent Overstreet <kent.ov...@linux.dev>

sched.h, which defines task_struct, needs nodemask_t - but sched.h is a
frequently used header and ideally shouldn't be pulling in any more code
than it needs to.

This splits out nodemask_types.h which has the definition sched.h needs,
which will avoid a circular header dependency in the alloc tagging patch
series, and as a bonus should speed up kernel build times.

Signed-off-by: Kent Overstreet <kent.ov...@linux.dev>
Signed-off-by: Suren Baghdasaryan <sur...@google.com>
Cc: Ingo Molnar <mi...@redhat.com>
Cc: Peter Zijlstra <pet...@infradead.org>
---
include/linux/nodemask.h | 2 +-
include/linux/nodemask_types.h | 9 +++++++++
include/linux/sched.h | 2 +-
3 files changed, 11 insertions(+), 2 deletions(-)
create mode 100644 include/linux/nodemask_types.h

diff --git a/include/linux/nodemask.h b/include/linux/nodemask.h
index bb0ee80526b2..fda37b6df274 100644
--- a/include/linux/nodemask.h
+++ b/include/linux/nodemask.h
@@ -93,10 +93,10 @@
#include <linux/threads.h>
#include <linux/bitmap.h>
#include <linux/minmax.h>
+#include <linux/nodemask_types.h>
#include <linux/numa.h>
#include <linux/random.h>

-typedef struct { DECLARE_BITMAP(bits, MAX_NUMNODES); } nodemask_t;
extern nodemask_t _unused_nodemask_arg_;

/**
diff --git a/include/linux/nodemask_types.h b/include/linux/nodemask_types.h
new file mode 100644
index 000000000000..84c2f47c4237
--- /dev/null
+++ b/include/linux/nodemask_types.h
@@ -0,0 +1,9 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef __LINUX_NODEMASK_TYPES_H
+#define __LINUX_NODEMASK_TYPES_H
+
+#include <linux/numa.h>
+
+typedef struct { DECLARE_BITMAP(bits, MAX_NUMNODES); } nodemask_t;
+
+#endif /* __LINUX_NODEMASK_TYPES_H */
diff --git a/include/linux/sched.h b/include/linux/sched.h
index eed5d65b8d1f..35e7efdea2d9 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -20,7 +20,7 @@
#include <linux/hrtimer.h>
#include <linux/irqflags.h>
#include <linux/seccomp.h>
-#include <linux/nodemask.h>
+#include <linux/nodemask_types.h>
#include <linux/rcupdate.h>
#include <linux/refcount.h>
#include <linux/resource.h>
--
2.40.1.495.gc816e09b53d-goog

Suren Baghdasaryan

May 1, 2023, 12:55:20 PM
From: Kent Overstreet <kent.ov...@linux.dev>

prandom.h doesn't use percpu.h - this fixes some circular header issues.

Signed-off-by: Kent Overstreet <kent.ov...@linux.dev>
Signed-off-by: Suren Baghdasaryan <sur...@google.com>
---
include/linux/prandom.h | 1 -
1 file changed, 1 deletion(-)

diff --git a/include/linux/prandom.h b/include/linux/prandom.h
index f2ed5b72b3d6..f7f1e5251c67 100644
--- a/include/linux/prandom.h
+++ b/include/linux/prandom.h
@@ -10,7 +10,6 @@

#include <linux/types.h>
#include <linux/once.h>
-#include <linux/percpu.h>
#include <linux/random.h>

struct rnd_state {
--
2.40.1.495.gc816e09b53d-goog

Suren Baghdasaryan

May 1, 2023, 12:55:22 PM
From: Kent Overstreet <kent.ov...@linux.dev>

This adds a new helper which is like strsep, except that it skips empty
tokens.
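
The behaviour can be tried in user space with the sketch below. The
sketch_ names are hypothetical; sketch_strsep() is a small stand-in for
strsep() built on standard strpbrk(), so that the example does not rely on
the BSD extension being declared.

```c
#include <stddef.h>
#include <string.h>

/* Minimal strsep() stand-in: split at the first delimiter found. */
static char *sketch_strsep(char **s, const char *ct)
{
	char *start = *s;
	char *end;

	if (!start)
		return NULL;

	end = strpbrk(start, ct);
	if (end)
		*end++ = '\0';
	*s = end;
	return start;
}

/* Same loop as the patch: keep calling until a non-empty token appears. */
char *sketch_strsep_no_empty(char **s, const char *ct)
{
	char *ret;

	do {
		ret = sketch_strsep(s, ct);
	} while (ret && !*ret);

	return ret;
}

/* Count the non-empty tokens in @input, which is modified in place. */
size_t count_tokens(char *input, const char *delims)
{
	size_t n = 0;

	while (sketch_strsep_no_empty(&input, delims))
		n++;
	return n;
}
```

Repeated, leading or trailing delimiters produce empty tokens with plain
strsep(); the wrapper discards them, so callers parsing space-separated
commands do not have to.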

Signed-off-by: Kent Overstreet <kent.ov...@linux.dev>
Signed-off-by: Suren Baghdasaryan <sur...@google.com>
---
include/linux/string.h | 1 +
lib/string.c | 19 +++++++++++++++++++
2 files changed, 20 insertions(+)

diff --git a/include/linux/string.h b/include/linux/string.h
index c062c581a98b..6cd5451c262c 100644
--- a/include/linux/string.h
+++ b/include/linux/string.h
@@ -96,6 +96,7 @@ extern char * strpbrk(const char *,const char *);
#ifndef __HAVE_ARCH_STRSEP
extern char * strsep(char **,const char *);
#endif
+extern char *strsep_no_empty(char **, const char *);
#ifndef __HAVE_ARCH_STRSPN
extern __kernel_size_t strspn(const char *,const char *);
#endif
diff --git a/lib/string.c b/lib/string.c
index 3d55ef890106..dd4914baf45a 100644
--- a/lib/string.c
+++ b/lib/string.c
@@ -520,6 +520,25 @@ char *strsep(char **s, const char *ct)
EXPORT_SYMBOL(strsep);
#endif

+/**
+ * strsep_no_empty - Split a string into tokens, but don't return empty tokens
+ * @s: The string to be searched
+ * @ct: The characters to search for
+ *
+ * strsep_no_empty() updates @s to point after the token, ready for the next call.
+ */
+char *strsep_no_empty(char **s, const char *ct)
+{
+ char *ret;
+
+ do {
+ ret = strsep(s, ct);
+ } while (ret && !*ret);
+
+ return ret;
+}
+EXPORT_SYMBOL_GPL(strsep_no_empty);
+
#ifndef __HAVE_ARCH_MEMSET
/**
* memset - Fill a region of memory with the given value
--
2.40.1.495.gc816e09b53d-goog

Suren Baghdasaryan

May 1, 2023, 12:55:25 PM
From: Kent Overstreet <kent.ov...@linux.dev>

This patch adds lib/lazy-percpu-counter.c, which implements counters
that start out as atomics, but lazily switch to percpu mode if the
update rate crosses some threshold (arbitrarily set at 256 per second).
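
The header added by this patch packs a mode flag and a secondary counter
into a single 64-bit word. The user-space sketch below reuses the
COUNTER_* constants from that header to show the bit layout; encode_pcpu(),
decode_pcpu() and mod_count() are hypothetical helpers, and ordinary
pointers stand in for __percpu pointers.

```c
#include <stddef.h>
#include <stdint.h>

#define COUNTER_MOD_BITS	8
#define COUNTER_MOD_MASK	(~(~0ULL >> COUNTER_MOD_BITS))
#define COUNTER_MOD_BITS_START	(64 - COUNTER_MOD_BITS)
#define COUNTER_IS_PCPU_BIT	1ULL

/*
 * Percpu mode: the word holds a pointer with the low bit set as a flag.
 * This relies on the pointer being at least 2-byte aligned.
 */
uint64_t encode_pcpu(uint64_t *counters)
{
	return (uint64_t)(uintptr_t)counters | COUNTER_IS_PCPU_BIT;
}

/* Mirrors lazy_percpu_counter_is_pcpu(): NULL means atomic mode. */
uint64_t *decode_pcpu(uint64_t v)
{
	if (!(v & COUNTER_IS_PCPU_BIT))
		return NULL;

	return (uint64_t *)(uintptr_t)(v ^ COUNTER_IS_PCPU_BIT);
}

/*
 * Atomic mode: the top COUNTER_MOD_BITS bits hold the secondary counter
 * that tracks how often the word has been modified.
 */
unsigned int mod_count(uint64_t v)
{
	return (unsigned int)((v & COUNTER_MOD_MASK) >> COUNTER_MOD_BITS_START);
}
```

With 8 bits for the secondary counter and 1 bit for the mode flag, 55 bits
remain for the value, matching the figure given in the header comment.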

Signed-off-by: Kent Overstreet <kent.ov...@linux.dev>
Signed-off-by: Suren Baghdasaryan <sur...@google.com>
---
include/linux/lazy-percpu-counter.h | 102 ++++++++++++++++++++++
lib/Kconfig | 3 +
lib/Makefile | 2 +
lib/lazy-percpu-counter.c | 127 ++++++++++++++++++++++++++++
4 files changed, 234 insertions(+)
create mode 100644 include/linux/lazy-percpu-counter.h
create mode 100644 lib/lazy-percpu-counter.c

diff --git a/include/linux/lazy-percpu-counter.h b/include/linux/lazy-percpu-counter.h
new file mode 100644
index 000000000000..45ca9e2ce58b
--- /dev/null
+++ b/include/linux/lazy-percpu-counter.h
@@ -0,0 +1,102 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * Lazy percpu counters:
+ * (C) 2022 Kent Overstreet
+ *
+ * Lazy percpu counters start out in atomic mode, then switch to percpu mode if
+ * the update rate crosses some threshold.
+ *
+ * This means we don't have to decide between low memory overhead atomic
+ * counters and higher performance percpu counters - we can have our cake and
+ * eat it, too!
+ *
+ * Internally we use an atomic64_t, where the low bit indicates whether we're in
+ * percpu mode, and the high 8 bits are a secondary counter that's incremented
+ * when the counter is modified - meaning 55 bits of precision are available for
+ * the counter itself.
+ */
+
+#ifndef _LINUX_LAZY_PERCPU_COUNTER_H
+#define _LINUX_LAZY_PERCPU_COUNTER_H
+
+#include <linux/atomic.h>
+#include <asm/percpu.h>
+
+struct lazy_percpu_counter {
+ atomic64_t v;
+ unsigned long last_wrap;
+};
+
+void lazy_percpu_counter_exit(struct lazy_percpu_counter *c);
+void lazy_percpu_counter_add_slowpath(struct lazy_percpu_counter *c, s64 i);
+void lazy_percpu_counter_add_slowpath_noupgrade(struct lazy_percpu_counter *c, s64 i);
+s64 lazy_percpu_counter_read(struct lazy_percpu_counter *c);
+
+/*
+ * We use the high bits of the atomic counter for a secondary counter, which is
+ * incremented every time the counter is touched. When the secondary counter
+ * wraps, we check the time the counter last wrapped, and if it was recent
+ * enough that means the update frequency has crossed our threshold and we
+ * switch to percpu mode:
+ */
+#define COUNTER_MOD_BITS 8
+#define COUNTER_MOD_MASK ~(~0ULL >> COUNTER_MOD_BITS)
+#define COUNTER_MOD_BITS_START (64 - COUNTER_MOD_BITS)
+
+/*
+ * We use the low bit of the counter to indicate whether we're in atomic mode
+ * (low bit clear), or percpu mode (low bit set, counter is a pointer to actual
+ * percpu counters):
+ */
+#define COUNTER_IS_PCPU_BIT 1
+
+static inline u64 __percpu *lazy_percpu_counter_is_pcpu(u64 v)
+{
+ if (!(v & COUNTER_IS_PCPU_BIT))
+ return NULL;
+
+ v ^= COUNTER_IS_PCPU_BIT;
+ return (u64 __percpu *)(unsigned long)v;
+}
+
+/**
+ * lazy_percpu_counter_add: Add a value to a lazy_percpu_counter
+ *
+ * @c: counter to modify
+ * @i: value to add
+ */
+static inline void lazy_percpu_counter_add(struct lazy_percpu_counter *c, s64 i)
+{
+ u64 v = atomic64_read(&c->v);
+ u64 __percpu *pcpu_v = lazy_percpu_counter_is_pcpu(v);
+
+ if (likely(pcpu_v))
+ this_cpu_add(*pcpu_v, i);
+ else
+ lazy_percpu_counter_add_slowpath(c, i);
+}
+
+/**
+ * lazy_percpu_counter_add_noupgrade: Add a value to a lazy_percpu_counter,
+ * without upgrading to percpu mode
+ *
+ * @c: counter to modify
+ * @i: value to add
+ */
+static inline void lazy_percpu_counter_add_noupgrade(struct lazy_percpu_counter *c, s64 i)
+{
+ u64 v = atomic64_read(&c->v);
+ u64 __percpu *pcpu_v = lazy_percpu_counter_is_pcpu(v);
+
+ if (likely(pcpu_v))
+ this_cpu_add(*pcpu_v, i);
+ else
+ lazy_percpu_counter_add_slowpath_noupgrade(c, i);
+}
+
+static inline void lazy_percpu_counter_sub(struct lazy_percpu_counter *c, s64 i)
+{
+ lazy_percpu_counter_add(c, -i);
+}
+
+#endif /* _LINUX_LAZY_PERCPU_COUNTER_H */
diff --git a/lib/Kconfig b/lib/Kconfig
index 5c2da561c516..7380292a8fcd 100644
--- a/lib/Kconfig
+++ b/lib/Kconfig
@@ -505,6 +505,9 @@ config ASSOCIATIVE_ARRAY

for more information.

+config LAZY_PERCPU_COUNTER
+ bool
+
config HAS_IOMEM
bool
depends on !NO_IOMEM
diff --git a/lib/Makefile b/lib/Makefile
index 876fcdeae34e..293a0858a3f8 100644
--- a/lib/Makefile
+++ b/lib/Makefile
@@ -164,6 +164,8 @@ obj-$(CONFIG_DEBUG_PREEMPT) += smp_processor_id.o
obj-$(CONFIG_DEBUG_LIST) += list_debug.o
obj-$(CONFIG_DEBUG_OBJECTS) += debugobjects.o

+obj-$(CONFIG_LAZY_PERCPU_COUNTER) += lazy-percpu-counter.o
+
obj-$(CONFIG_BITREVERSE) += bitrev.o
obj-$(CONFIG_LINEAR_RANGES) += linear_ranges.o
obj-$(CONFIG_PACKING) += packing.o
diff --git a/lib/lazy-percpu-counter.c b/lib/lazy-percpu-counter.c
new file mode 100644
index 000000000000..4f4e32c2dc09
--- /dev/null
+++ b/lib/lazy-percpu-counter.c
@@ -0,0 +1,127 @@
+// SPDX-License-Identifier: GPL-2.0-only
+
+#include <linux/atomic.h>
+#include <linux/gfp.h>
+#include <linux/jiffies.h>
+#include <linux/lazy-percpu-counter.h>
+#include <linux/percpu.h>
+
+static inline s64 lazy_percpu_counter_atomic_val(s64 v)
+{
+ /* Ensure output is sign extended properly: */
+ return (v << COUNTER_MOD_BITS) >>
+ (COUNTER_MOD_BITS + COUNTER_IS_PCPU_BIT);
+}
+
+static void lazy_percpu_counter_switch_to_pcpu(struct lazy_percpu_counter *c)
+{
+ u64 __percpu *pcpu_v = alloc_percpu_gfp(u64, GFP_ATOMIC|__GFP_NOWARN);
+ u64 old, new, v;
+
+ if (!pcpu_v)
+ return;
+
+ preempt_disable();
+ v = atomic64_read(&c->v);
+ do {
+ if (lazy_percpu_counter_is_pcpu(v)) {
+ free_percpu(pcpu_v);
+ break; /* leave the loop so preempt_enable() below still runs */
+ }
+
+ old = v;
+ new = (unsigned long)pcpu_v | 1;
+
+ *this_cpu_ptr(pcpu_v) = lazy_percpu_counter_atomic_val(v);
+ } while ((v = atomic64_cmpxchg(&c->v, old, new)) != old);
+ preempt_enable();
+}
+
+/**
+ * lazy_percpu_counter_exit: Free resources associated with a
+ * lazy_percpu_counter
+ *
+ * @c: counter to exit
+ */
+void lazy_percpu_counter_exit(struct lazy_percpu_counter *c)
+{
+ free_percpu(lazy_percpu_counter_is_pcpu(atomic64_read(&c->v)));
+}
+EXPORT_SYMBOL_GPL(lazy_percpu_counter_exit);
+
+/**
+ * lazy_percpu_counter_read: Read current value of a lazy_percpu_counter
+ *
+ * @c: counter to read
+ */
+s64 lazy_percpu_counter_read(struct lazy_percpu_counter *c)
+{
+ s64 v = atomic64_read(&c->v);
+ u64 __percpu *pcpu_v = lazy_percpu_counter_is_pcpu(v);
+
+ if (pcpu_v) {
+ int cpu;
+
+ v = 0;
+ for_each_possible_cpu(cpu)
+ v += *per_cpu_ptr(pcpu_v, cpu);
+ } else {
+ v = lazy_percpu_counter_atomic_val(v);
+ }
+
+ return v;
+}
+EXPORT_SYMBOL_GPL(lazy_percpu_counter_read);
+
+void lazy_percpu_counter_add_slowpath(struct lazy_percpu_counter *c, s64 i)
+{
+ u64 atomic_i;
+ u64 old, v = atomic64_read(&c->v);
+ u64 __percpu *pcpu_v;
+
+ atomic_i = i << COUNTER_IS_PCPU_BIT;
+ atomic_i &= ~COUNTER_MOD_MASK;
+ atomic_i |= 1ULL << COUNTER_MOD_BITS_START;
+
+ do {
+ pcpu_v = lazy_percpu_counter_is_pcpu(v);
+ if (pcpu_v) {
+ this_cpu_add(*pcpu_v, i);
+ return;
+ }
+
+ old = v;
+ } while ((v = atomic64_cmpxchg(&c->v, old, old + atomic_i)) != old);
+
+ if (unlikely(!(v & COUNTER_MOD_MASK))) {
+ unsigned long now = jiffies;
+
+ if (c->last_wrap &&
+ unlikely(time_after(c->last_wrap + HZ, now)))
+ lazy_percpu_counter_switch_to_pcpu(c);
+ else
+ c->last_wrap = now;
+ }
+}
+EXPORT_SYMBOL(lazy_percpu_counter_add_slowpath);
+
+void lazy_percpu_counter_add_slowpath_noupgrade(struct lazy_percpu_counter *c, s64 i)
+{
+ u64 atomic_i;
+ u64 old, v = atomic64_read(&c->v);
+ u64 __percpu *pcpu_v;
+
+ atomic_i = i << COUNTER_IS_PCPU_BIT;
+ atomic_i &= ~COUNTER_MOD_MASK;
+
+ do {
+ pcpu_v = lazy_percpu_counter_is_pcpu(v);
+ if (pcpu_v) {
+ this_cpu_add(*pcpu_v, i);
+ return;
+ }
+
+ old = v;
+ } while ((v = atomic64_cmpxchg(&c->v, old, old + atomic_i)) != old);
+}
+EXPORT_SYMBOL(lazy_percpu_counter_add_slowpath_noupgrade);
--
2.40.1.495.gc816e09b53d-goog
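
The bit layout described in the header comment (mode flag in bit 0, a 55-bit signed value above it, an 8-bit update counter in the top byte) can be modelled in userspace. The following is a minimal sketch, not the kernel code: the COUNTER_* constants are copied from the patch, while every model_* name is a hypothetical helper, and plain additions stand in for the atomic64_cmpxchg() loop.

```c
#include <assert.h>
#include <stdint.h>

/* Constants copied from the patch; everything named model_* is hypothetical. */
#define COUNTER_MOD_BITS	8
#define COUNTER_MOD_MASK	(~(~0ULL >> COUNTER_MOD_BITS))
#define COUNTER_MOD_BITS_START	(64 - COUNTER_MOD_BITS)
#define COUNTER_IS_PCPU_BIT	1
#define MODEL_VALUE_BITS	(64 - COUNTER_MOD_BITS - COUNTER_IS_PCPU_BIT)

/* Delta that one update of size i adds to the 64-bit word in atomic mode. */
static uint64_t model_delta(int64_t i)
{
	uint64_t d = (uint64_t)i << COUNTER_IS_PCPU_BIT;	/* keep bit 0 clear */

	d &= ~COUNTER_MOD_MASK;				/* stay out of the top 8 bits */
	d |= 1ULL << COUNTER_MOD_BITS_START;		/* count this update */
	return d;
}

/* Recover the signed 55-bit value; sign-extend by hand to stay portable. */
static int64_t model_value(uint64_t word)
{
	uint64_t field = (word & ~COUNTER_MOD_MASK) >> COUNTER_IS_PCPU_BIT;

	assert(!(word & COUNTER_IS_PCPU_BIT));		/* atomic mode only */
	if (field & (1ULL << (MODEL_VALUE_BITS - 1)))
		return (int64_t)field - (int64_t)(1ULL << MODEL_VALUE_BITS);
	return (int64_t)field;
}

/* Nonzero when the 8-bit update counter is zero (fresh word or just wrapped). */
static int model_mod_wrapped(uint64_t word)
{
	return !(word & COUNTER_MOD_MASK);
}
```

Starting from a zero word, adding model_delta(1) 256 times leaves the update counter at zero again while model_value() reports 256; that wrap is the event the slowpath compares against last_wrap. In this model, a carry out of the value field (for example adding 5 and then -3) spills into the update counter, so the top byte is an approximate update count rather than an exact one.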

Suren Baghdasaryan

May 1, 2023, 12:55:28 PM
Currently, slab pages can store only vectors of obj_cgroup pointers in
page->memcg_data. Introduce the slabobj_ext structure to allow more data
to be stored for each slab object. Wrap obj_cgroup in slabobj_ext to
preserve current functionality while allowing slabobj_ext to be extended
in the future.

Signed-off-by: Suren Baghdasaryan <sur...@google.com>
---
include/linux/memcontrol.h | 20 +++--
include/linux/mm_types.h | 4 +-
init/Kconfig | 4 +
mm/kfence/core.c | 14 ++--
mm/kfence/kfence.h | 4 +-
mm/memcontrol.c | 56 ++------------
mm/page_owner.c | 2 +-
mm/slab.h | 148 +++++++++++++++++++++++++------------
mm/slab_common.c | 47 ++++++++++++
9 files changed, 185 insertions(+), 114 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 222d7370134c..b9fd9732a52b 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -339,8 +339,8 @@ struct mem_cgroup {
extern struct mem_cgroup *root_mem_cgroup;

enum page_memcg_data_flags {
- /* page->memcg_data is a pointer to an objcgs vector */
- MEMCG_DATA_OBJCGS = (1UL << 0),
+ /* page->memcg_data is a pointer to a slabobj_ext vector */
+ MEMCG_DATA_OBJEXTS = (1UL << 0),
/* page has been accounted as a non-slab kernel page */
MEMCG_DATA_KMEM = (1UL << 1),
/* the next bit after the last actual flag */
@@ -378,7 +378,7 @@ static inline struct mem_cgroup *__folio_memcg(struct folio *folio)
unsigned long memcg_data = folio->memcg_data;

VM_BUG_ON_FOLIO(folio_test_slab(folio), folio);
- VM_BUG_ON_FOLIO(memcg_data & MEMCG_DATA_OBJCGS, folio);
+ VM_BUG_ON_FOLIO(memcg_data & MEMCG_DATA_OBJEXTS, folio);
VM_BUG_ON_FOLIO(memcg_data & MEMCG_DATA_KMEM, folio);

return (struct mem_cgroup *)(memcg_data & ~MEMCG_DATA_FLAGS_MASK);
@@ -399,7 +399,7 @@ static inline struct obj_cgroup *__folio_objcg(struct folio *folio)
unsigned long memcg_data = folio->memcg_data;

VM_BUG_ON_FOLIO(folio_test_slab(folio), folio);
- VM_BUG_ON_FOLIO(memcg_data & MEMCG_DATA_OBJCGS, folio);
+ VM_BUG_ON_FOLIO(memcg_data & MEMCG_DATA_OBJEXTS, folio);
VM_BUG_ON_FOLIO(!(memcg_data & MEMCG_DATA_KMEM), folio);

return (struct obj_cgroup *)(memcg_data & ~MEMCG_DATA_FLAGS_MASK);
@@ -496,7 +496,7 @@ static inline struct mem_cgroup *folio_memcg_check(struct folio *folio)
*/
unsigned long memcg_data = READ_ONCE(folio->memcg_data);

- if (memcg_data & MEMCG_DATA_OBJCGS)
+ if (memcg_data & MEMCG_DATA_OBJEXTS)
return NULL;

if (memcg_data & MEMCG_DATA_KMEM) {
@@ -542,7 +542,7 @@ static inline struct mem_cgroup *get_mem_cgroup_from_objcg(struct obj_cgroup *ob
static inline bool folio_memcg_kmem(struct folio *folio)
{
VM_BUG_ON_PGFLAGS(PageTail(&folio->page), &folio->page);
- VM_BUG_ON_FOLIO(folio->memcg_data & MEMCG_DATA_OBJCGS, folio);
+ VM_BUG_ON_FOLIO(folio->memcg_data & MEMCG_DATA_OBJEXTS, folio);
return folio->memcg_data & MEMCG_DATA_KMEM;
}

@@ -1606,6 +1606,14 @@ unsigned long mem_cgroup_soft_limit_reclaim(pg_data_t *pgdat, int order,
}
#endif /* CONFIG_MEMCG */

+/*
+ * Extended information for slab objects stored as an array in page->memcg_data
+ * if MEMCG_DATA_OBJEXTS is set.
+ */
+struct slabobj_ext {
+ struct obj_cgroup *objcg;
+} __aligned(8);
+
static inline void __inc_lruvec_kmem_state(void *p, enum node_stat_item idx)
{
__mod_lruvec_kmem_state(p, idx, 1);
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 306a3d1a0fa6..e79303e1e30c 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -194,7 +194,7 @@ struct page {
/* Usage count. *DO NOT USE DIRECTLY*. See page_ref.h */
atomic_t _refcount;

-#ifdef CONFIG_MEMCG
+#ifdef CONFIG_SLAB_OBJ_EXT
unsigned long memcg_data;
#endif

@@ -320,7 +320,7 @@ struct folio {
void *private;
atomic_t _mapcount;
atomic_t _refcount;
-#ifdef CONFIG_MEMCG
+#ifdef CONFIG_SLAB_OBJ_EXT
unsigned long memcg_data;
#endif
/* private: the union with struct page is transitional */
diff --git a/init/Kconfig b/init/Kconfig
index 32c24950c4ce..44267919a2a2 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -936,10 +936,14 @@ config CGROUP_FAVOR_DYNMODS

Say N if unsure.

+config SLAB_OBJ_EXT
+ bool
+
config MEMCG
bool "Memory controller"
select PAGE_COUNTER
select EVENTFD
+ select SLAB_OBJ_EXT
help
Provides control over the memory footprint of tasks in a cgroup.

diff --git a/mm/kfence/core.c b/mm/kfence/core.c
index dad3c0eb70a0..aea6fa145080 100644
--- a/mm/kfence/core.c
+++ b/mm/kfence/core.c
@@ -590,9 +590,9 @@ static unsigned long kfence_init_pool(void)
continue;

__folio_set_slab(slab_folio(slab));
-#ifdef CONFIG_MEMCG
- slab->memcg_data = (unsigned long)&kfence_metadata[i / 2 - 1].objcg |
- MEMCG_DATA_OBJCGS;
+#ifdef CONFIG_MEMCG_KMEM
+ slab->obj_exts = (unsigned long)&kfence_metadata[i / 2 - 1].obj_exts |
+ MEMCG_DATA_OBJEXTS;
#endif
}

@@ -634,8 +634,8 @@ static unsigned long kfence_init_pool(void)

if (!i || (i % 2))
continue;
-#ifdef CONFIG_MEMCG
- slab->memcg_data = 0;
+#ifdef CONFIG_MEMCG_KMEM
+ slab->obj_exts = 0;
#endif
__folio_clear_slab(slab_folio(slab));
}
@@ -1093,8 +1093,8 @@ void __kfence_free(void *addr)
{
struct kfence_metadata *meta = addr_to_metadata((unsigned long)addr);

-#ifdef CONFIG_MEMCG
- KFENCE_WARN_ON(meta->objcg);
+#ifdef CONFIG_MEMCG_KMEM
+ KFENCE_WARN_ON(meta->obj_exts.objcg);
#endif
/*
* If the objects of the cache are SLAB_TYPESAFE_BY_RCU, defer freeing
diff --git a/mm/kfence/kfence.h b/mm/kfence/kfence.h
index 2aafc46a4aaf..8e0d76c4ea2a 100644
--- a/mm/kfence/kfence.h
+++ b/mm/kfence/kfence.h
@@ -97,8 +97,8 @@ struct kfence_metadata {
struct kfence_track free_track;
/* For updating alloc_covered on frees. */
u32 alloc_stack_hash;
-#ifdef CONFIG_MEMCG
- struct obj_cgroup *objcg;
+#ifdef CONFIG_MEMCG_KMEM
+ struct slabobj_ext obj_exts;
#endif
};

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 4b27e245a055..f2a7fe718117 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2892,13 +2892,6 @@ static void commit_charge(struct folio *folio, struct mem_cgroup *memcg)
}

#ifdef CONFIG_MEMCG_KMEM
-/*
- * The allocated objcg pointers array is not accounted directly.
- * Moreover, it should not come from DMA buffer and is not readily
- * reclaimable. So those GFP bits should be masked off.
- */
-#define OBJCGS_CLEAR_MASK (__GFP_DMA | __GFP_RECLAIMABLE | __GFP_ACCOUNT)
-
/*
* mod_objcg_mlstate() may be called with irq enabled, so
* mod_memcg_lruvec_state() should be used.
@@ -2917,62 +2910,27 @@ static inline void mod_objcg_mlstate(struct obj_cgroup *objcg,
rcu_read_unlock();
}

-int memcg_alloc_slab_cgroups(struct slab *slab, struct kmem_cache *s,
- gfp_t gfp, bool new_slab)
-{
- unsigned int objects = objs_per_slab(s, slab);
- unsigned long memcg_data;
- void *vec;
-
- gfp &= ~OBJCGS_CLEAR_MASK;
- vec = kcalloc_node(objects, sizeof(struct obj_cgroup *), gfp,
- slab_nid(slab));
- if (!vec)
- return -ENOMEM;
-
- memcg_data = (unsigned long) vec | MEMCG_DATA_OBJCGS;
- if (new_slab) {
- /*
- * If the slab is brand new and nobody can yet access its
- * memcg_data, no synchronization is required and memcg_data can
- * be simply assigned.
- */
- slab->memcg_data = memcg_data;
- } else if (cmpxchg(&slab->memcg_data, 0, memcg_data)) {
- /*
- * If the slab is already in use, somebody can allocate and
- * assign obj_cgroups in parallel. In this case the existing
- * objcg vector should be reused.
- */
- kfree(vec);
- return 0;
- }
-
- kmemleak_not_leak(vec);
- return 0;
-}
-
static __always_inline
struct mem_cgroup *mem_cgroup_from_obj_folio(struct folio *folio, void *p)
{
/*
* Slab objects are accounted individually, not per-page.
* Memcg membership data for each individual object is saved in
- * slab->memcg_data.
+ * slab->obj_exts.
*/
if (folio_test_slab(folio)) {
- struct obj_cgroup **objcgs;
+ struct slabobj_ext *obj_exts;
struct slab *slab;
unsigned int off;

slab = folio_slab(folio);
- objcgs = slab_objcgs(slab);
- if (!objcgs)
+ obj_exts = slab_obj_exts(slab);
+ if (!obj_exts)
return NULL;

off = obj_to_index(slab->slab_cache, slab, p);
- if (objcgs[off])
- return obj_cgroup_memcg(objcgs[off]);
+ if (obj_exts[off].objcg)
+ return obj_cgroup_memcg(obj_exts[off].objcg);

return NULL;
}
@@ -2980,7 +2938,7 @@ struct mem_cgroup *mem_cgroup_from_obj_folio(struct folio *folio, void *p)
/*
* folio_memcg_check() is used here, because in theory we can encounter
* a folio where the slab flag has been cleared already, but
- * slab->memcg_data has not been freed yet
+ * slab->obj_exts has not been freed yet
* folio_memcg_check() will guarantee that a proper memory
* cgroup pointer or NULL will be returned.
*/
diff --git a/mm/page_owner.c b/mm/page_owner.c
index 31169b3e7f06..8b6086c666e6 100644
--- a/mm/page_owner.c
+++ b/mm/page_owner.c
@@ -372,7 +372,7 @@ static inline int print_page_owner_memcg(char *kbuf, size_t count, int ret,
if (!memcg_data)
goto out_unlock;

- if (memcg_data & MEMCG_DATA_OBJCGS)
+ if (memcg_data & MEMCG_DATA_OBJEXTS)
ret += scnprintf(kbuf + ret, count - ret,
"Slab cache page\n");

diff --git a/mm/slab.h b/mm/slab.h
index f01ac256a8f5..25d14b3a7280 100644
--- a/mm/slab.h
+++ b/mm/slab.h
@@ -57,8 +57,8 @@ struct slab {
#endif

atomic_t __page_refcount;
-#ifdef CONFIG_MEMCG
- unsigned long memcg_data;
+#ifdef CONFIG_SLAB_OBJ_EXT
+ unsigned long obj_exts;
#endif
};

@@ -67,8 +67,8 @@ struct slab {
SLAB_MATCH(flags, __page_flags);
SLAB_MATCH(compound_head, slab_cache); /* Ensure bit 0 is clear */
SLAB_MATCH(_refcount, __page_refcount);
-#ifdef CONFIG_MEMCG
-SLAB_MATCH(memcg_data, memcg_data);
+#ifdef CONFIG_SLAB_OBJ_EXT
+SLAB_MATCH(memcg_data, obj_exts);
#endif
#undef SLAB_MATCH
static_assert(sizeof(struct slab) <= sizeof(struct page));
@@ -390,36 +390,106 @@ static inline bool kmem_cache_debug_flags(struct kmem_cache *s, slab_flags_t fla
return false;
}

-#ifdef CONFIG_MEMCG_KMEM
+#ifdef CONFIG_SLAB_OBJ_EXT
+
/*
- * slab_objcgs - get the object cgroups vector associated with a slab
+ * slab_obj_exts - get the pointer to the slab object extension vector
+ * associated with a slab.
* @slab: a pointer to the slab struct
*
- * Returns a pointer to the object cgroups vector associated with the slab,
+ * Returns a pointer to the object extension vector associated with the slab,
* or NULL if no such vector has been associated yet.
*/
-static inline struct obj_cgroup **slab_objcgs(struct slab *slab)
+static inline struct slabobj_ext *slab_obj_exts(struct slab *slab)
{
- unsigned long memcg_data = READ_ONCE(slab->memcg_data);
+ unsigned long obj_exts = READ_ONCE(slab->obj_exts);

- VM_BUG_ON_PAGE(memcg_data && !(memcg_data & MEMCG_DATA_OBJCGS),
+#ifdef CONFIG_MEMCG
+ VM_BUG_ON_PAGE(obj_exts && !(obj_exts & MEMCG_DATA_OBJEXTS),
slab_page(slab));
- VM_BUG_ON_PAGE(memcg_data & MEMCG_DATA_KMEM, slab_page(slab));
+ VM_BUG_ON_PAGE(obj_exts & MEMCG_DATA_KMEM, slab_page(slab));

- return (struct obj_cgroup **)(memcg_data & ~MEMCG_DATA_FLAGS_MASK);
+ return (struct slabobj_ext *)(obj_exts & ~MEMCG_DATA_FLAGS_MASK);
+#else
+ return (struct slabobj_ext *)obj_exts;
+#endif
}

-int memcg_alloc_slab_cgroups(struct slab *slab, struct kmem_cache *s,
- gfp_t gfp, bool new_slab);
-void mod_objcg_state(struct obj_cgroup *objcg, struct pglist_data *pgdat,
- enum node_stat_item idx, int nr);
+int alloc_slab_obj_exts(struct slab *slab, struct kmem_cache *s,
+ gfp_t gfp, bool new_slab);

-static inline void memcg_free_slab_cgroups(struct slab *slab)
+static inline bool need_slab_obj_ext(void)
{
- kfree(slab_objcgs(slab));
- slab->memcg_data = 0;
+ /*
+ * CONFIG_MEMCG_KMEM creates vector of obj_cgroup objects conditionally
+ * inside memcg_slab_post_alloc_hook. No other users for now.
+ */
+ return false;
}

+static inline void free_slab_obj_exts(struct slab *slab)
+{
+ struct slabobj_ext *obj_exts;
+
+ obj_exts = slab_obj_exts(slab);
+ if (!obj_exts)
+ return;
+
+ kfree(obj_exts);
+ slab->obj_exts = 0;
+}
+
+static inline struct slabobj_ext *
+prepare_slab_obj_exts_hook(struct kmem_cache *s, gfp_t flags, void *p)
+{
+ struct slab *slab;
+
+ if (!p)
+ return NULL;
+
+ if (!need_slab_obj_ext())
+ return NULL;
+
+ slab = virt_to_slab(p);
+ if (!slab_obj_exts(slab) &&
+ WARN(alloc_slab_obj_exts(slab, s, flags, false),
+ "%s, %s: Failed to create slab extension vector!\n",
+ __func__, s->name))
+ return NULL;
+
+ return slab_obj_exts(slab) + obj_to_index(s, slab, p);
+}
+
+#else /* CONFIG_SLAB_OBJ_EXT */
+
+static inline struct slabobj_ext *slab_obj_exts(struct slab *slab)
+{
+ return NULL;
+}
+
+static inline int alloc_slab_obj_exts(struct slab *slab,
+ struct kmem_cache *s, gfp_t gfp,
+ bool new_slab)
+{
+ return 0;
+}
+
+static inline void free_slab_obj_exts(struct slab *slab)
+{
+}
+
+static inline struct slabobj_ext *
+prepare_slab_obj_exts_hook(struct kmem_cache *s, gfp_t flags, void *p)
+{
+ return NULL;
+}
+
+#endif /* CONFIG_SLAB_OBJ_EXT */
+
+#ifdef CONFIG_MEMCG_KMEM
+void mod_objcg_state(struct obj_cgroup *objcg, struct pglist_data *pgdat,
+ enum node_stat_item idx, int nr);
+
static inline size_t obj_full_size(struct kmem_cache *s)
{
/*
@@ -487,16 +557,15 @@ static inline void memcg_slab_post_alloc_hook(struct kmem_cache *s,
if (likely(p[i])) {
slab = virt_to_slab(p[i]);

- if (!slab_objcgs(slab) &&
- memcg_alloc_slab_cgroups(slab, s, flags,
- false)) {
+ if (!slab_obj_exts(slab) &&
+ alloc_slab_obj_exts(slab, s, flags, false)) {
obj_cgroup_uncharge(objcg, obj_full_size(s));
continue;
}

off = obj_to_index(s, slab, p[i]);
obj_cgroup_get(objcg);
- slab_objcgs(slab)[off] = objcg;
+ slab_obj_exts(slab)[off].objcg = objcg;
mod_objcg_state(objcg, slab_pgdat(slab),
cache_vmstat_idx(s), obj_full_size(s));
} else {
@@ -509,14 +578,14 @@ static inline void memcg_slab_post_alloc_hook(struct kmem_cache *s,
static inline void memcg_slab_free_hook(struct kmem_cache *s, struct slab *slab,
void **p, int objects)
{
- struct obj_cgroup **objcgs;
+ struct slabobj_ext *obj_exts;
int i;

if (!memcg_kmem_online())
return;

- objcgs = slab_objcgs(slab);
- if (!objcgs)
+ obj_exts = slab_obj_exts(slab);
+ if (!obj_exts)
return;

for (i = 0; i < objects; i++) {
@@ -524,11 +593,11 @@ static inline void memcg_slab_free_hook(struct kmem_cache *s, struct slab *slab,
unsigned int off;

off = obj_to_index(s, slab, p[i]);
- objcg = objcgs[off];
+ objcg = obj_exts[off].objcg;
if (!objcg)
continue;

- objcgs[off] = NULL;
+ obj_exts[off].objcg = NULL;
obj_cgroup_uncharge(objcg, obj_full_size(s));
mod_objcg_state(objcg, slab_pgdat(slab), cache_vmstat_idx(s),
-obj_full_size(s));
@@ -537,27 +606,11 @@ static inline void memcg_slab_free_hook(struct kmem_cache *s, struct slab *slab,
}

#else /* CONFIG_MEMCG_KMEM */
-static inline struct obj_cgroup **slab_objcgs(struct slab *slab)
-{
- return NULL;
-}
-
static inline struct mem_cgroup *memcg_from_slab_obj(void *ptr)
{
return NULL;
}

-static inline int memcg_alloc_slab_cgroups(struct slab *slab,
- struct kmem_cache *s, gfp_t gfp,
- bool new_slab)
-{
- return 0;
-}
-
-static inline void memcg_free_slab_cgroups(struct slab *slab)
-{
-}
-
static inline bool memcg_slab_pre_alloc_hook(struct kmem_cache *s,
struct list_lru *lru,
struct obj_cgroup **objcgp,
@@ -594,7 +647,7 @@ static __always_inline void account_slab(struct slab *slab, int order,
struct kmem_cache *s, gfp_t gfp)
{
if (memcg_kmem_online() && (s->flags & SLAB_ACCOUNT))
- memcg_alloc_slab_cgroups(slab, s, gfp, true);
+ alloc_slab_obj_exts(slab, s, gfp, true);

mod_node_page_state(slab_pgdat(slab), cache_vmstat_idx(s),
PAGE_SIZE << order);
@@ -603,8 +656,7 @@ static __always_inline void account_slab(struct slab *slab, int order,
static __always_inline void unaccount_slab(struct slab *slab, int order,
struct kmem_cache *s)
{
- if (memcg_kmem_online())
- memcg_free_slab_cgroups(slab);
+ free_slab_obj_exts(slab);

mod_node_page_state(slab_pgdat(slab), cache_vmstat_idx(s),
-(PAGE_SIZE << order));
@@ -684,6 +736,7 @@ static inline void slab_post_alloc_hook(struct kmem_cache *s,
unsigned int orig_size)
{
unsigned int zero_size = s->object_size;
+ struct slabobj_ext *obj_exts;
size_t i;

flags &= gfp_allowed_mask;
@@ -714,6 +767,7 @@ static inline void slab_post_alloc_hook(struct kmem_cache *s,
kmemleak_alloc_recursive(p[i], s->object_size, 1,
s->flags, flags);
kmsan_slab_alloc(s, p[i], flags);
+ obj_exts = prepare_slab_obj_exts_hook(s, flags, p[i]);
}

memcg_slab_post_alloc_hook(s, objcg, flags, size, p);
diff --git a/mm/slab_common.c b/mm/slab_common.c
index 607249785c07..f11cc072b01e 100644
--- a/mm/slab_common.c
+++ b/mm/slab_common.c
@@ -204,6 +204,53 @@ struct kmem_cache *find_mergeable(unsigned int size, unsigned int align,
return NULL;
}

+#ifdef CONFIG_SLAB_OBJ_EXT
+/*
+ * The allocated slabobj_ext array is not accounted directly.
+ * Moreover, it should not come from DMA buffer and is not readily
+ * reclaimable. So those GFP bits should be masked off.
+ */
+#define OBJCGS_CLEAR_MASK (__GFP_DMA | __GFP_RECLAIMABLE | __GFP_ACCOUNT)
+
+int alloc_slab_obj_exts(struct slab *slab, struct kmem_cache *s,
+ gfp_t gfp, bool new_slab)
+{
+ unsigned int objects = objs_per_slab(s, slab);
+ unsigned long obj_exts;
+ void *vec;
+
+ gfp &= ~OBJCGS_CLEAR_MASK;
+ vec = kcalloc_node(objects, sizeof(struct slabobj_ext), gfp,
+ slab_nid(slab));
+ if (!vec)
+ return -ENOMEM;
+
+ obj_exts = (unsigned long)vec;
+#ifdef CONFIG_MEMCG
+ obj_exts |= MEMCG_DATA_OBJEXTS;
+#endif
+ if (new_slab) {
+ /*
+ * If the slab is brand new and nobody can yet access its
+ * obj_exts, no synchronization is required and obj_exts can
+ * be simply assigned.
+ */
+ slab->obj_exts = obj_exts;
+ } else if (cmpxchg(&slab->obj_exts, 0, obj_exts)) {
+ /*
+ * If the slab is already in use, somebody can allocate and
+ * assign slabobj_exts in parallel. In this case the existing
+ * slabobj_ext vector should be reused.
+ */
+ kfree(vec);
+ return 0;
+ }
+
+ kmemleak_not_leak(vec);
+ return 0;
+}
+#endif /* CONFIG_SLAB_OBJ_EXT */
+
static struct kmem_cache *create_cache(const char *name,
unsigned int object_size, unsigned int align,
slab_flags_t flags, unsigned int useroffset,
--
2.40.1.495.gc816e09b53d-goog
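
The patch keeps the existing trick of storing a flag in the low bits of the vector pointer held in slab->obj_exts. A minimal userspace sketch of that tagging, assuming only that the vector is at least 4-byte aligned: the MODEL_* macros and model_* functions are hypothetical stand-ins for MEMCG_DATA_OBJEXTS, MEMCG_DATA_KMEM, MEMCG_DATA_FLAGS_MASK and slab_obj_exts(), not the kernel definitions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct obj_cgroup;	/* opaque in this sketch */

struct slabobj_ext {
	struct obj_cgroup *objcg;
};

/* Hypothetical stand-ins for the memcg_data flag bits and their mask. */
#define MODEL_DATA_OBJEXTS	((uintptr_t)1 << 0)
#define MODEL_DATA_KMEM		((uintptr_t)1 << 1)
#define MODEL_DATA_FLAGS_MASK	(MODEL_DATA_OBJEXTS | MODEL_DATA_KMEM)

/* Tag a vector pointer the way alloc_slab_obj_exts() does before storing it. */
static uintptr_t model_pack_obj_exts(struct slabobj_ext *vec)
{
	uintptr_t word = (uintptr_t)vec;

	assert(!(word & MODEL_DATA_FLAGS_MASK));	/* low bits must be free */
	return word | MODEL_DATA_OBJEXTS;
}

/* Strip the flag bits again; a zero word means no vector was attached. */
static struct slabobj_ext *model_unpack_obj_exts(uintptr_t word)
{
	/* A nonzero word without the tag would be some other kind of data. */
	assert(!word || (word & MODEL_DATA_OBJEXTS));
	return (struct slabobj_ext *)(word & ~MODEL_DATA_FLAGS_MASK);
}
```

Indexing the unpacked pointer by an object's position in the slab then yields that object's slabobj_ext, which is the access pattern slab_obj_exts(slab)[off].objcg uses in the hunks above.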

Suren Baghdasaryan

May 1, 2023, 12:55:30 PM
Introduce the __GFP_NO_OBJ_EXT flag to prevent recursive allocations
when allocating the slabobj_ext vector for a slab.

Signed-off-by: Suren Baghdasaryan <sur...@google.com>
---
include/linux/gfp_types.h | 12 ++++++++++--
1 file changed, 10 insertions(+), 2 deletions(-)

diff --git a/include/linux/gfp_types.h b/include/linux/gfp_types.h
index 6583a58670c5..aab1959130f9 100644
--- a/include/linux/gfp_types.h
+++ b/include/linux/gfp_types.h
@@ -53,8 +53,13 @@ typedef unsigned int __bitwise gfp_t;
#define ___GFP_SKIP_ZERO 0
#define ___GFP_SKIP_KASAN 0
#endif
+#ifdef CONFIG_SLAB_OBJ_EXT
+#define ___GFP_NO_OBJ_EXT 0x4000000u
+#else
+#define ___GFP_NO_OBJ_EXT 0
+#endif
#ifdef CONFIG_LOCKDEP
-#define ___GFP_NOLOCKDEP 0x4000000u
+#define ___GFP_NOLOCKDEP 0x8000000u
#else
#define ___GFP_NOLOCKDEP 0
#endif
@@ -99,12 +104,15 @@ typedef unsigned int __bitwise gfp_t;
* node with no fallbacks or placement policy enforcements.
*
* %__GFP_ACCOUNT causes the allocation to be accounted to kmemcg.
+ *
+ * %__GFP_NO_OBJ_EXT causes slab allocation to have no object extension.
*/
#define __GFP_RECLAIMABLE ((__force gfp_t)___GFP_RECLAIMABLE)
#define __GFP_WRITE ((__force gfp_t)___GFP_WRITE)
#define __GFP_HARDWALL ((__force gfp_t)___GFP_HARDWALL)
#define __GFP_THISNODE ((__force gfp_t)___GFP_THISNODE)
#define __GFP_ACCOUNT ((__force gfp_t)___GFP_ACCOUNT)
+#define __GFP_NO_OBJ_EXT ((__force gfp_t)___GFP_NO_OBJ_EXT)

/**
* DOC: Watermark modifiers
@@ -249,7 +257,7 @@ typedef unsigned int __bitwise gfp_t;
#define __GFP_NOLOCKDEP ((__force gfp_t)___GFP_NOLOCKDEP)

/* Room for N __GFP_FOO bits */
-#define __GFP_BITS_SHIFT (26 + IS_ENABLED(CONFIG_LOCKDEP))
+#define __GFP_BITS_SHIFT (27 + IS_ENABLED(CONFIG_LOCKDEP))
#define __GFP_BITS_MASK ((__force gfp_t)((1 << __GFP_BITS_SHIFT) - 1))

/**
--
2.40.1.495.gc816e09b53d-goog
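
The bit arithmetic behind this change can be checked in isolation. This is a small sketch using the values from the hunks above; the MODEL_* macros and model_* functions are hypothetical names, and the lockdep argument stands in for IS_ENABLED(CONFIG_LOCKDEP).

```c
#include <assert.h>

/* Values copied from the hunks above; MODEL_* names are hypothetical. */
#define MODEL_GFP_NO_OBJ_EXT	0x4000000u
#define MODEL_GFP_NOLOCKDEP	0x8000000u

static_assert(MODEL_GFP_NO_OBJ_EXT == 1u << 26, "new flag takes bit 26");
static_assert(MODEL_GFP_NOLOCKDEP == 1u << 27, "lockdep flag moves to bit 27");

/* Mirrors __GFP_BITS_MASK for a given CONFIG_LOCKDEP setting (0 or 1). */
static unsigned int model_gfp_bits_mask(int lockdep)
{
	return (1u << (27 + lockdep)) - 1;
}

/* Nonzero when every bit of flag lies inside the mask for that setting. */
static int model_flag_fits(unsigned int flag, int lockdep)
{
	return (flag & ~model_gfp_bits_mask(lockdep)) == 0;
}
```

With the shift raised from 26 to 27, bit 26 is always covered by the mask, and bit 27 is covered only when lockdep is enabled, which is also the only configuration in which ___GFP_NOLOCKDEP is nonzero.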

Suren Baghdasaryan

May 1, 2023, 12:55:32 PM
Slab extension objects can't be allocated before the slab infrastructure
is initialized. Some caches, like kmem_cache and kmem_cache_node, are
created before that point, so objects from these caches can't have
extension objects. Introduce the SLAB_NO_OBJ_EXT slab flag to mark these
caches and avoid creating extensions for objects allocated from their
slabs.

Signed-off-by: Suren Baghdasaryan <sur...@google.com>
---
include/linux/slab.h | 7 +++++++
mm/slab.c | 2 +-
mm/slub.c | 5 +++--
3 files changed, 11 insertions(+), 3 deletions(-)

diff --git a/include/linux/slab.h b/include/linux/slab.h
index 6b3e155b70bf..99a146f3cedf 100644
--- a/include/linux/slab.h
+++ b/include/linux/slab.h
@@ -147,6 +147,13 @@
#endif
#define SLAB_TEMPORARY SLAB_RECLAIM_ACCOUNT /* Objects are short-lived */

+#ifdef CONFIG_SLAB_OBJ_EXT
+/* Slab created using create_boot_cache */
+#define SLAB_NO_OBJ_EXT ((slab_flags_t __force)0x20000000U)
+#else
+#define SLAB_NO_OBJ_EXT 0
+#endif
+
/*
* ZERO_SIZE_PTR will be returned for zero sized kmalloc requests.
*
diff --git a/mm/slab.c b/mm/slab.c
index bb57f7fdbae1..ccc76f7455e9 100644
--- a/mm/slab.c
+++ b/mm/slab.c
@@ -1232,7 +1232,7 @@ void __init kmem_cache_init(void)
create_boot_cache(kmem_cache, "kmem_cache",
offsetof(struct kmem_cache, node) +
nr_node_ids * sizeof(struct kmem_cache_node *),
- SLAB_HWCACHE_ALIGN, 0, 0);
+ SLAB_HWCACHE_ALIGN | SLAB_NO_OBJ_EXT, 0, 0);
list_add(&kmem_cache->list, &slab_caches);
slab_state = PARTIAL;

diff --git a/mm/slub.c b/mm/slub.c
index c87628cd8a9a..507b71372ee4 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -5020,7 +5020,8 @@ void __init kmem_cache_init(void)
node_set(node, slab_nodes);

create_boot_cache(kmem_cache_node, "kmem_cache_node",
- sizeof(struct kmem_cache_node), SLAB_HWCACHE_ALIGN, 0, 0);
+ sizeof(struct kmem_cache_node),
+ SLAB_HWCACHE_ALIGN | SLAB_NO_OBJ_EXT, 0, 0);

hotplug_memory_notifier(slab_memory_callback, SLAB_CALLBACK_PRI);

@@ -5030,7 +5031,7 @@ void __init kmem_cache_init(void)
create_boot_cache(kmem_cache, "kmem_cache",
offsetof(struct kmem_cache, node) +
nr_node_ids * sizeof(struct kmem_cache_node *),
- SLAB_HWCACHE_ALIGN, 0, 0);
+ SLAB_HWCACHE_ALIGN | SLAB_NO_OBJ_EXT, 0, 0);

kmem_cache = bootstrap(&boot_kmem_cache);
kmem_cache_node = bootstrap(&boot_kmem_cache_node);
--
2.40.1.495.gc816e09b53d-goog
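
The flag definition above can be illustrated with a minimal userspace sketch. It is not part of the patch: `struct cache` and `cache_allows_obj_ext()` are hypothetical names, and `slab_flags_t` is reduced to a plain integer. The point it shows is that when the config option is off the flag is 0, so OR-ing it into the creation flags and testing it both become no-ops.

```c
#include <stdbool.h>

/* Assumption: defined here only so the sketch exercises the "enabled" branch. */
#define CONFIG_SLAB_OBJ_EXT 1

/* Simplified stand-in for the kernel type; not the real definition. */
typedef unsigned int slab_flags_t;

#define SLAB_HWCACHE_ALIGN ((slab_flags_t)0x00002000U)

#ifdef CONFIG_SLAB_OBJ_EXT
#define SLAB_NO_OBJ_EXT ((slab_flags_t)0x20000000U)
#else
#define SLAB_NO_OBJ_EXT ((slab_flags_t)0)
#endif

/* Hypothetical, heavily reduced cache descriptor. */
struct cache {
	const char *name;
	slab_flags_t flags;
};

/* True when objects from this cache may carry extension objects. */
static bool cache_allows_obj_ext(const struct cache *c)
{
	return !(c->flags & SLAB_NO_OBJ_EXT);
}
```

With the option disabled, `cache_allows_obj_ext()` would return true for every cache, which is harmless because no extensions are created in that configuration anyway.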

Suren Baghdasaryan

May 1, 2023, 12:55:34 PM
Use __GFP_NO_OBJ_EXT to prevent recursion when allocating slabobj_ext
objects. Also skip slabobj_ext allocation for objects from caches marked
SLAB_NO_OBJ_EXT, such as kmem_cache.

Signed-off-by: Suren Baghdasaryan <sur...@google.com>
---
mm/slab.h | 6 ++++++
mm/slab_common.c | 2 ++
2 files changed, 8 insertions(+)

diff --git a/mm/slab.h b/mm/slab.h
index 25d14b3a7280..b1c22dc87047 100644
--- a/mm/slab.h
+++ b/mm/slab.h
@@ -450,6 +450,12 @@ prepare_slab_obj_exts_hook(struct kmem_cache *s, gfp_t flags, void *p)
if (!need_slab_obj_ext())
return NULL;

+ if (s->flags & SLAB_NO_OBJ_EXT)
+ return NULL;
+
+ if (flags & __GFP_NO_OBJ_EXT)
+ return NULL;
+
slab = virt_to_slab(p);
if (!slab_obj_exts(slab) &&
WARN(alloc_slab_obj_exts(slab, s, flags, false),
diff --git a/mm/slab_common.c b/mm/slab_common.c
index f11cc072b01e..42777d66d0e3 100644
--- a/mm/slab_common.c
+++ b/mm/slab_common.c
@@ -220,6 +220,8 @@ int alloc_slab_obj_exts(struct slab *slab, struct kmem_cache *s,
void *vec;

gfp &= ~OBJCGS_CLEAR_MASK;
+ /* Prevent recursive extension vector allocation */
+ gfp |= __GFP_NO_OBJ_EXT;
vec = kcalloc_node(objects, sizeof(struct slabobj_ext), gfp,
slab_nid(slab));
if (!vec)
--
2.40.1.495.gc816e09b53d-goog
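
The recursion this patch guards against can be sketched in userspace. This is a toy model under stated assumptions, not kernel code: `toy_alloc()` and `toy_alloc_obj_exts()` are hypothetical, the flag values are illustrative, and `malloc()` stands in for the slab allocator. Allocating an object triggers allocation of its extension vector through the same allocator, so the inner call must carry a flag that stops it from asking for yet another vector.

```c
#include <stdlib.h>

typedef unsigned int gfp_t;

/* Illustrative values only; the kernel's real bit assignments differ. */
#define GFP_KERNEL		((gfp_t)0x1u)
#define __GFP_NO_OBJ_EXT	((gfp_t)0x2u)

static int ext_vectors_allocated;

static void *toy_alloc(size_t size, gfp_t gfp);

/* Allocate the extension vector; this allocation must not be tracked again. */
static void *toy_alloc_obj_exts(size_t objects, gfp_t gfp)
{
	gfp |= __GFP_NO_OBJ_EXT;	/* prevent recursive vector allocation */
	ext_vectors_allocated++;
	return toy_alloc(objects * sizeof(void *), gfp);
}

static void *toy_alloc(size_t size, gfp_t gfp)
{
	void *p = malloc(size);

	/* Without this check the two functions would call each other forever. */
	if (p && !(gfp & __GFP_NO_OBJ_EXT)) {
		void *vec = toy_alloc_obj_exts(1, gfp);

		free(vec);	/* toy model: a real allocator keeps the vector */
	}
	return p;
}
```

Setting the flag inside the vector allocator, rather than at every call site, keeps the guard in one place, which appears to be the choice the patch makes in alloc_slab_obj_exts().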

Suren Baghdasaryan

May 1, 2023, 12:55:37 PM
Introduce objext_flags to store additional objext flags unrelated to memcg.

Signed-off-by: Suren Baghdasaryan <sur...@google.com>
---
include/linux/memcontrol.h | 29 ++++++++++++++++++++++-------
mm/slab.h | 4 +---
2 files changed, 23 insertions(+), 10 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index b9fd9732a52b..5e2da63c525f 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -347,7 +347,22 @@ enum page_memcg_data_flags {
__NR_MEMCG_DATA_FLAGS = (1UL << 2),
};

-#define MEMCG_DATA_FLAGS_MASK (__NR_MEMCG_DATA_FLAGS - 1)
+#define __FIRST_OBJEXT_FLAG __NR_MEMCG_DATA_FLAGS
+
+#else /* CONFIG_MEMCG */
+
+#define __FIRST_OBJEXT_FLAG (1UL << 0)
+
+#endif /* CONFIG_MEMCG */
+
+enum objext_flags {
+ /* the next bit after the last actual flag */
+ __NR_OBJEXTS_FLAGS = __FIRST_OBJEXT_FLAG,
+};
+
+#define OBJEXTS_FLAGS_MASK (__NR_OBJEXTS_FLAGS - 1)
+
+#ifdef CONFIG_MEMCG

static inline bool folio_memcg_kmem(struct folio *folio);

@@ -381,7 +396,7 @@ static inline struct mem_cgroup *__folio_memcg(struct folio *folio)
VM_BUG_ON_FOLIO(memcg_data & MEMCG_DATA_OBJEXTS, folio);
VM_BUG_ON_FOLIO(memcg_data & MEMCG_DATA_KMEM, folio);

- return (struct mem_cgroup *)(memcg_data & ~MEMCG_DATA_FLAGS_MASK);
+ return (struct mem_cgroup *)(memcg_data & ~OBJEXTS_FLAGS_MASK);
}

/*
@@ -402,7 +417,7 @@ static inline struct obj_cgroup *__folio_objcg(struct folio *folio)
VM_BUG_ON_FOLIO(memcg_data & MEMCG_DATA_OBJEXTS, folio);
VM_BUG_ON_FOLIO(!(memcg_data & MEMCG_DATA_KMEM), folio);

- return (struct obj_cgroup *)(memcg_data & ~MEMCG_DATA_FLAGS_MASK);
+ return (struct obj_cgroup *)(memcg_data & ~OBJEXTS_FLAGS_MASK);
}

/*
@@ -459,11 +474,11 @@ static inline struct mem_cgroup *folio_memcg_rcu(struct folio *folio)
if (memcg_data & MEMCG_DATA_KMEM) {
struct obj_cgroup *objcg;

- objcg = (void *)(memcg_data & ~MEMCG_DATA_FLAGS_MASK);
+ objcg = (void *)(memcg_data & ~OBJEXTS_FLAGS_MASK);
return obj_cgroup_memcg(objcg);
}

- return (struct mem_cgroup *)(memcg_data & ~MEMCG_DATA_FLAGS_MASK);
+ return (struct mem_cgroup *)(memcg_data & ~OBJEXTS_FLAGS_MASK);
}

/*
@@ -502,11 +517,11 @@ static inline struct mem_cgroup *folio_memcg_check(struct folio *folio)
if (memcg_data & MEMCG_DATA_KMEM) {
struct obj_cgroup *objcg;

- objcg = (void *)(memcg_data & ~MEMCG_DATA_FLAGS_MASK);
+ objcg = (void *)(memcg_data & ~OBJEXTS_FLAGS_MASK);
return obj_cgroup_memcg(objcg);
}

- return (struct mem_cgroup *)(memcg_data & ~MEMCG_DATA_FLAGS_MASK);
+ return (struct mem_cgroup *)(memcg_data & ~OBJEXTS_FLAGS_MASK);
}

static inline struct mem_cgroup *page_memcg_check(struct page *page)
diff --git a/mm/slab.h b/mm/slab.h
index b1c22dc87047..bec202bdcfb8 100644
--- a/mm/slab.h
+++ b/mm/slab.h
@@ -409,10 +409,8 @@ static inline struct slabobj_ext *slab_obj_exts(struct slab *slab)
slab_page(slab));
VM_BUG_ON_PAGE(obj_exts & MEMCG_DATA_KMEM, slab_page(slab));

- return (struct slabobj_ext *)(obj_exts & ~MEMCG_DATA_FLAGS_MASK);
-#else
- return (struct slabobj_ext *)obj_exts;
#endif
+ return (struct slabobj_ext *)(obj_exts & ~OBJEXTS_FLAGS_MASK);
}

int alloc_slab_obj_exts(struct slab *slab, struct kmem_cache *s,
--
2.40.1.495.gc816e09b53d-goog
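
The mask arithmetic in this patch follows the usual "flags in the low bits of an aligned pointer" scheme: the enum's last entry is the first unused bit, so subtracting one yields a mask of all flag bits. A minimal userspace sketch, assuming the two memcg flags occupy bits 0 and 1 as in the CONFIG_MEMCG case (the `pack()` and `unpack_*()` helpers are hypothetical):

```c
#include <stdint.h>

/* Illustrative layout mirroring the CONFIG_MEMCG case. */
#define DATA_OBJEXTS	((uintptr_t)1 << 0)
#define DATA_KMEM	((uintptr_t)1 << 1)
#define NR_DATA_FLAGS	((uintptr_t)1 << 2)	/* next bit after the last flag */

/* All bits below the "next free bit" are flag bits. */
#define FLAGS_MASK	(NR_DATA_FLAGS - 1)

/* Store flags in the low bits of a sufficiently aligned pointer. */
static uintptr_t pack(void *ptr, uintptr_t flags)
{
	return (uintptr_t)ptr | (flags & FLAGS_MASK);
}

/* Recover the pointer by clearing the flag bits. */
static void *unpack_ptr(uintptr_t word)
{
	return (void *)(word & ~FLAGS_MASK);
}

/* Recover the flags by keeping only the flag bits. */
static uintptr_t unpack_flags(uintptr_t word)
{
	return word & FLAGS_MASK;
}
```

In the !CONFIG_MEMCG branch of the patch, __FIRST_OBJEXT_FLAG is bit 0, so OBJEXTS_FLAGS_MASK evaluates to 0 and the masking in slab_obj_exts() leaves the pointer unchanged. That is why the separate `#else` return could be dropped.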

Suren Baghdasaryan

May 1, 2023, 12:55:39 PM
Add basic infrastructure to support code tagging, which stores common tag
information: the module name, function, file name and line number.
Provide functions to register a new code tag type and to iterate over
code tags.

Co-developed-by: Kent Overstreet <kent.ov...@linux.dev>
Signed-off-by: Kent Overstreet <kent.ov...@linux.dev>
Signed-off-by: Suren Baghdasaryan <sur...@google.com>
---
include/linux/codetag.h | 71 ++++++++++++++
lib/Kconfig.debug | 4 +
lib/Makefile | 1 +
lib/codetag.c | 199 ++++++++++++++++++++++++++++++++++++++++
4 files changed, 275 insertions(+)
create mode 100644 include/linux/codetag.h
create mode 100644 lib/codetag.c

diff --git a/include/linux/codetag.h b/include/linux/codetag.h
new file mode 100644
index 000000000000..a9d7adecc2a5
--- /dev/null
+++ b/include/linux/codetag.h
@@ -0,0 +1,71 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * code tagging framework
+ */
+#ifndef _LINUX_CODETAG_H
+#define _LINUX_CODETAG_H
+
+#include <linux/types.h>
+
+struct codetag_iterator;
+struct codetag_type;
+struct seq_buf;
+struct module;
+
+/*
+ * An instance of this structure is created in a special ELF section at every
+ * code location being tagged. At runtime, the special section is treated as
+ * an array of these.
+ */
+struct codetag {
+ unsigned int flags; /* used in later patches */
+ unsigned int lineno;
+ const char *modname;
+ const char *function;
+ const char *filename;
+} __aligned(8);
+
+union codetag_ref {
+ struct codetag *ct;
+};
+
+struct codetag_range {
+ struct codetag *start;
+ struct codetag *stop;
+};
+
+struct codetag_module {
+ struct module *mod;
+ struct codetag_range range;
+};
+
+struct codetag_type_desc {
+ const char *section;
+ size_t tag_size;
+};
+
+struct codetag_iterator {
+ struct codetag_type *cttype;
+ struct codetag_module *cmod;
+ unsigned long mod_id;
+ struct codetag *ct;
+};
+
+#define CODE_TAG_INIT { \
+ .modname = KBUILD_MODNAME, \
+ .function = __func__, \
+ .filename = __FILE__, \
+ .lineno = __LINE__, \
+ .flags = 0, \
+}
+
+void codetag_lock_module_list(struct codetag_type *cttype, bool lock);
+struct codetag_iterator codetag_get_ct_iter(struct codetag_type *cttype);
+struct codetag *codetag_next_ct(struct codetag_iterator *iter);
+
+void codetag_to_text(struct seq_buf *out, struct codetag *ct);
+
+struct codetag_type *
+codetag_register_type(const struct codetag_type_desc *desc);
+
+#endif /* _LINUX_CODETAG_H */
diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug
index ce51d4dc6803..5078da7d3ffb 100644
--- a/lib/Kconfig.debug
+++ b/lib/Kconfig.debug
@@ -957,6 +957,10 @@ config DEBUG_STACKOVERFLOW

If in doubt, say "N".

+config CODE_TAGGING
+ bool
+ select KALLSYMS
+
source "lib/Kconfig.kasan"
source "lib/Kconfig.kfence"
source "lib/Kconfig.kmsan"
diff --git a/lib/Makefile b/lib/Makefile
index 293a0858a3f8..28d70ecf2976 100644
--- a/lib/Makefile
+++ b/lib/Makefile
@@ -228,6 +228,7 @@ obj-$(CONFIG_OF_RECONFIG_NOTIFIER_ERROR_INJECT) += \
of-reconfig-notifier-error-inject.o
obj-$(CONFIG_FUNCTION_ERROR_INJECTION) += error-inject.o

+obj-$(CONFIG_CODE_TAGGING) += codetag.o
lib-$(CONFIG_GENERIC_BUG) += bug.o

obj-$(CONFIG_HAVE_ARCH_TRACEHOOK) += syscall.o
diff --git a/lib/codetag.c b/lib/codetag.c
new file mode 100644
index 000000000000..7708f8388e55
--- /dev/null
+++ b/lib/codetag.c
@@ -0,0 +1,199 @@
+// SPDX-License-Identifier: GPL-2.0-only
+#include <linux/codetag.h>
+#include <linux/idr.h>
+#include <linux/kallsyms.h>
+#include <linux/module.h>
+#include <linux/seq_buf.h>
+#include <linux/slab.h>
+
+struct codetag_type {
+ struct list_head link;
+ unsigned int count;
+ struct idr mod_idr;
+ struct rw_semaphore mod_lock; /* protects mod_idr */
+ struct codetag_type_desc desc;
+};
+
+static DEFINE_MUTEX(codetag_lock);
+static LIST_HEAD(codetag_types);
+
+void codetag_lock_module_list(struct codetag_type *cttype, bool lock)
+{
+ if (lock)
+ down_read(&cttype->mod_lock);
+ else
+ up_read(&cttype->mod_lock);
+}
+
+struct codetag_iterator codetag_get_ct_iter(struct codetag_type *cttype)
+{
+ struct codetag_iterator iter = {
+ .cttype = cttype,
+ .cmod = NULL,
+ .mod_id = 0,
+ .ct = NULL,
+ };
+
+ return iter;
+}
+
+static inline struct codetag *get_first_module_ct(struct codetag_module *cmod)
+{
+ return cmod->range.start < cmod->range.stop ? cmod->range.start : NULL;
+}
+
+static inline
+struct codetag *get_next_module_ct(struct codetag_iterator *iter)
+{
+ struct codetag *res = (struct codetag *)
+ ((char *)iter->ct + iter->cttype->desc.tag_size);
+
+ return res < iter->cmod->range.stop ? res : NULL;
+}
+
+struct codetag *codetag_next_ct(struct codetag_iterator *iter)
+{
+ struct codetag_type *cttype = iter->cttype;
+ struct codetag_module *cmod;
+ struct codetag *ct;
+
+ lockdep_assert_held(&cttype->mod_lock);
+
+ if (unlikely(idr_is_empty(&cttype->mod_idr)))
+ return NULL;
+
+ ct = NULL;
+ while (true) {
+ cmod = idr_find(&cttype->mod_idr, iter->mod_id);
+
+ /* If module was removed move to the next one */
+ if (!cmod)
+ cmod = idr_get_next_ul(&cttype->mod_idr,
+ &iter->mod_id);
+
+ /* Exit if no more modules */
+ if (!cmod)
+ break;
+
+ if (cmod != iter->cmod) {
+ iter->cmod = cmod;
+ ct = get_first_module_ct(cmod);
+ } else
+ ct = get_next_module_ct(iter);
+
+ if (ct)
+ break;
+
+ iter->mod_id++;
+ }
+
+ iter->ct = ct;
+ return ct;
+}
+
+void codetag_to_text(struct seq_buf *out, struct codetag *ct)
+{
+ seq_buf_printf(out, "%s:%u module:%s func:%s",
+ ct->filename, ct->lineno,
+ ct->modname, ct->function);
+}
+
+static inline size_t range_size(const struct codetag_type *cttype,
+ const struct codetag_range *range)
+{
+ return ((char *)range->stop - (char *)range->start) /
+ cttype->desc.tag_size;
+}
+
+static void *get_symbol(struct module *mod, const char *prefix, const char *name)
+{
+ char buf[64];
+ int res;
+
+ res = snprintf(buf, sizeof(buf), "%s%s", prefix, name);
+ if (WARN_ON(res < 1 || res > sizeof(buf)))
+ return NULL;
+
+ return mod ?
+ (void *)find_kallsyms_symbol_value(mod, buf) :
+ (void *)kallsyms_lookup_name(buf);
+}
+
+static struct codetag_range get_section_range(struct module *mod,
+ const char *section)
+{
+ return (struct codetag_range) {
+ get_symbol(mod, "__start_", section),
+ get_symbol(mod, "__stop_", section),
+ };
+}
+
+static int codetag_module_init(struct codetag_type *cttype, struct module *mod)
+{
+ struct codetag_range range;
+ struct codetag_module *cmod;
+ int err;
+
+ range = get_section_range(mod, cttype->desc.section);
+ if (!range.start || !range.stop) {
+ pr_warn("Failed to load code tags of type %s from the module %s\n",
+ cttype->desc.section,
+ mod ? mod->name : "(built-in)");
+ return -EINVAL;
+ }
+
+ /* Ignore empty ranges */
+ if (range.start == range.stop)
+ return 0;
+
+ BUG_ON(range.start > range.stop);
+
+ cmod = kmalloc(sizeof(*cmod), GFP_KERNEL);
+ if (unlikely(!cmod))
+ return -ENOMEM;
+
+ cmod->mod = mod;
+ cmod->range = range;
+
+ down_write(&cttype->mod_lock);
+ err = idr_alloc(&cttype->mod_idr, cmod, 0, 0, GFP_KERNEL);
+ if (err >= 0)
+ cttype->count += range_size(cttype, &range);
+ up_write(&cttype->mod_lock);
+
+ if (err < 0) {
+ kfree(cmod);
+ return err;
+ }
+
+ return 0;
+}
+
+struct codetag_type *
+codetag_register_type(const struct codetag_type_desc *desc)
+{
+ struct codetag_type *cttype;
+ int err;
+
+ BUG_ON(desc->tag_size <= 0);
+
+ cttype = kzalloc(sizeof(*cttype), GFP_KERNEL);
+ if (unlikely(!cttype))
+ return ERR_PTR(-ENOMEM);
+
+ cttype->desc = *desc;
+ idr_init(&cttype->mod_idr);
+ init_rwsem(&cttype->mod_lock);
+
+ err = codetag_module_init(cttype, NULL);
+ if (unlikely(err)) {
+ kfree(cttype);
+ return ERR_PTR(err);
+ }
+
+ mutex_lock(&codetag_lock);
+ list_add_tail(&cttype->link, &codetag_types);
+ mutex_unlock(&codetag_lock);
+
+ return cttype;
+}
--
2.40.1.495.gc816e09b53d-goog
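
One detail of the iterator is worth spelling out: get_next_module_ct() advances by desc.tag_size bytes rather than by sizeof(struct codetag), which allows a tag type to embed struct codetag inside a larger structure. A minimal userspace sketch of that stride walk follows. `struct my_tag` and `next_ct()` are hypothetical, and struct codetag is reduced to two fields.

```c
#include <stddef.h>

/* Reduced version of the structure from the patch. */
struct codetag {
	unsigned int lineno;
	const char *function;
};

/* Hypothetical tag type that embeds struct codetag as its first member. */
struct my_tag {
	struct codetag ct;
	unsigned long bytes;
};

struct codetag_range {
	struct codetag *start;
	struct codetag *stop;
};

/* Pass ct == NULL to get the first tag; returns NULL at the end. */
static struct codetag *next_ct(const struct codetag_range *r,
			       struct codetag *ct, size_t tag_size)
{
	struct codetag *res;

	if (!ct)
		return r->start < r->stop ? r->start : NULL;

	/* Step by the full tag size, not sizeof(struct codetag). */
	res = (struct codetag *)((char *)ct + tag_size);
	return res < r->stop ? res : NULL;
}

/* Number of tags in the range, computed in bytes as in the patch. */
static size_t range_size(const struct codetag_range *r, size_t tag_size)
{
	return (size_t)((char *)r->stop - (char *)r->start) / tag_size;
}
```

A caller would loop with `for (ct = next_ct(&r, NULL, sz); ct; ct = next_ct(&r, ct, sz))`. Doing the arithmetic on `char *` is what makes the walk independent of the embedded structure's size.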

Suren Baghdasaryan

May 1, 2023, 12:55:41 PM
Add support for code tagging from dynamically loaded modules.

Signed-off-by: Suren Baghdasaryan <sur...@google.com>
Co-developed-by: Kent Overstreet <kent.ov...@linux.dev>
Signed-off-by: Kent Overstreet <kent.ov...@linux.dev>
---
include/linux/codetag.h | 12 +++++++++
kernel/module/main.c | 4 +++
lib/codetag.c | 58 +++++++++++++++++++++++++++++++++++++++--
3 files changed, 72 insertions(+), 2 deletions(-)

diff --git a/include/linux/codetag.h b/include/linux/codetag.h
index a9d7adecc2a5..386733e89b31 100644
--- a/include/linux/codetag.h
+++ b/include/linux/codetag.h
@@ -42,6 +42,10 @@ struct codetag_module {
struct codetag_type_desc {
const char *section;
size_t tag_size;
+ void (*module_load)(struct codetag_type *cttype,
+ struct codetag_module *cmod);
+ void (*module_unload)(struct codetag_type *cttype,
+ struct codetag_module *cmod);
};

struct codetag_iterator {
@@ -68,4 +72,12 @@ void codetag_to_text(struct seq_buf *out, struct codetag *ct);
struct codetag_type *
codetag_register_type(const struct codetag_type_desc *desc);

+#ifdef CONFIG_CODE_TAGGING
+void codetag_load_module(struct module *mod);
+void codetag_unload_module(struct module *mod);
+#else
+static inline void codetag_load_module(struct module *mod) {}
+static inline void codetag_unload_module(struct module *mod) {}
+#endif
+
#endif /* _LINUX_CODETAG_H */
diff --git a/kernel/module/main.c b/kernel/module/main.c
index 044aa2c9e3cb..4232e7bff549 100644
--- a/kernel/module/main.c
+++ b/kernel/module/main.c
@@ -56,6 +56,7 @@
#include <linux/dynamic_debug.h>
#include <linux/audit.h>
#include <linux/cfi.h>
+#include <linux/codetag.h>
#include <linux/debugfs.h>
#include <uapi/linux/module.h>
#include "internal.h"
@@ -1249,6 +1250,7 @@ static void free_module(struct module *mod)
{
trace_module_free(mod);

+ codetag_unload_module(mod);
mod_sysfs_teardown(mod);

/*
@@ -2974,6 +2976,8 @@ static int load_module(struct load_info *info, const char __user *uargs,
/* Get rid of temporary copy. */
free_copy(info, flags);

+ codetag_load_module(mod);
+
/* Done! */
trace_module_load(mod);

diff --git a/lib/codetag.c b/lib/codetag.c
index 7708f8388e55..4ea57fb37346 100644
--- a/lib/codetag.c
+++ b/lib/codetag.c
@@ -108,15 +108,20 @@ static inline size_t range_size(const struct codetag_type *cttype,
static void *get_symbol(struct module *mod, const char *prefix, const char *name)
{
char buf[64];
+ void *ret;
int res;

res = snprintf(buf, sizeof(buf), "%s%s", prefix, name);
if (WARN_ON(res < 1 || res > sizeof(buf)))
return NULL;

- return mod ?
+ preempt_disable();
+ ret = mod ?
(void *)find_kallsyms_symbol_value(mod, buf) :
(void *)kallsyms_lookup_name(buf);
+ preempt_enable();
+
+ return ret;
}

static struct codetag_range get_section_range(struct module *mod,
@@ -157,8 +162,11 @@ static int codetag_module_init(struct codetag_type *cttype, struct module *mod)

down_write(&cttype->mod_lock);
err = idr_alloc(&cttype->mod_idr, cmod, 0, 0, GFP_KERNEL);
- if (err >= 0)
+ if (err >= 0) {
cttype->count += range_size(cttype, &range);
+ if (cttype->desc.module_load)
+ cttype->desc.module_load(cttype, cmod);
+ }
up_write(&cttype->mod_lock);

if (err < 0) {
@@ -197,3 +205,49 @@ codetag_register_type(const struct codetag_type_desc *desc)

return cttype;
}
+
+void codetag_load_module(struct module *mod)
+{
+ struct codetag_type *cttype;
+
+ if (!mod)
+ return;
+
+ mutex_lock(&codetag_lock);
+ list_for_each_entry(cttype, &codetag_types, link)
+ codetag_module_init(cttype, mod);
+ mutex_unlock(&codetag_lock);
+}
+
+void codetag_unload_module(struct module *mod)
+{
+ struct codetag_type *cttype;
+
+ if (!mod)
+ return;
+
+ mutex_lock(&codetag_lock);
+ list_for_each_entry(cttype, &codetag_types, link) {
+ struct codetag_module *found = NULL;
+ struct codetag_module *cmod;
+ unsigned long mod_id, tmp;
+
+ down_write(&cttype->mod_lock);
+ idr_for_each_entry_ul(&cttype->mod_idr, cmod, tmp, mod_id) {
+ if (cmod->mod && cmod->mod == mod) {
+ found = cmod;
+ break;
+ }
+ }
+ if (found) {
+ if (cttype->desc.module_unload)
+ cttype->desc.module_unload(cttype, cmod);
+
+ cttype->count -= range_size(cttype, &cmod->range);
+ idr_remove(&cttype->mod_idr, mod_id);
+ kfree(cmod);
+ }
+ up_write(&cttype->mod_lock);
+ }
+ mutex_unlock(&codetag_lock);
+}
--
2.40.1.495.gc816e09b53d-goog
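
get_symbol() builds the `__start_<section>` and `__stop_<section>` names with snprintf() before looking them up. A small userspace sketch of that name-building step follows; `build_symbol_name()` is a hypothetical helper, not a function from the patch. Because snprintf() returns the length it would have written, excluding the terminating NUL, output is truncated whenever the result is greater than or equal to the buffer size. The sketch therefore checks `>=`, whereas the patch checks `res > sizeof(buf)`, which would let a name of exactly 64 characters through in truncated form.

```c
#include <stddef.h>
#include <stdio.h>

/*
 * Build "<prefix><name>" into buf.
 * Returns 0 on success, -1 if the result is empty or does not fit.
 */
static int build_symbol_name(char *buf, size_t size,
			     const char *prefix, const char *name)
{
	int res = snprintf(buf, size, "%s%s", prefix, name);

	/* res counts characters excluding the NUL, so res == size means truncation. */
	if (res < 1 || (size_t)res >= size)
		return -1;

	return 0;
}
```

For example, with a 16-byte buffer, `"__stop_"` plus an 8-character section name fits (15 characters plus NUL), while `"__start_"` plus the same name does not.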

Suren Baghdasaryan

May 1, 2023, 12:55:43 PM
Skip freeing a module's data section if any of its allocation tags still
have live allocations. Otherwise, once these allocations are freed, the
access to their code tags would be a use-after-free (UAF).

Signed-off-by: Suren Baghdasaryan <sur...@google.com>
---
include/linux/codetag.h | 6 +++---
kernel/module/main.c | 23 +++++++++++++++--------
lib/codetag.c | 11 ++++++++---
3 files changed, 26 insertions(+), 14 deletions(-)

diff --git a/include/linux/codetag.h b/include/linux/codetag.h
index 386733e89b31..d98e4c8e86f0 100644
--- a/include/linux/codetag.h
+++ b/include/linux/codetag.h
@@ -44,7 +44,7 @@ struct codetag_type_desc {
size_t tag_size;
void (*module_load)(struct codetag_type *cttype,
struct codetag_module *cmod);
- void (*module_unload)(struct codetag_type *cttype,
+ bool (*module_unload)(struct codetag_type *cttype,
struct codetag_module *cmod);
};

@@ -74,10 +74,10 @@ codetag_register_type(const struct codetag_type_desc *desc);

#ifdef CONFIG_CODE_TAGGING
void codetag_load_module(struct module *mod);
-void codetag_unload_module(struct module *mod);
+bool codetag_unload_module(struct module *mod);
#else
static inline void codetag_load_module(struct module *mod) {}
-static inline void codetag_unload_module(struct module *mod) {}
+static inline bool codetag_unload_module(struct module *mod) { return true; }
#endif

#endif /* _LINUX_CODETAG_H */
diff --git a/kernel/module/main.c b/kernel/module/main.c
index 4232e7bff549..9ff56f2bb09d 100644
--- a/kernel/module/main.c
+++ b/kernel/module/main.c
@@ -1218,15 +1218,19 @@ static void *module_memory_alloc(unsigned int size, enum mod_mem_type type)
return module_alloc(size);
}

-static void module_memory_free(void *ptr, enum mod_mem_type type)
+static void module_memory_free(void *ptr, enum mod_mem_type type,
+ bool unload_codetags)
{
+ if (!unload_codetags && mod_mem_type_is_core_data(type))
+ return;
+
if (mod_mem_use_vmalloc(type))
vfree(ptr);
else
module_memfree(ptr);
}

-static void free_mod_mem(struct module *mod)
+static void free_mod_mem(struct module *mod, bool unload_codetags)
{
for_each_mod_mem_type(type) {
struct module_memory *mod_mem = &mod->mem[type];
@@ -1237,20 +1241,23 @@ static void free_mod_mem(struct module *mod)
/* Free lock-classes; relies on the preceding sync_rcu(). */
lockdep_free_key_range(mod_mem->base, mod_mem->size);
if (mod_mem->size)
- module_memory_free(mod_mem->base, type);
+ module_memory_free(mod_mem->base, type,
+ unload_codetags);
}

/* MOD_DATA hosts mod, so free it at last */
lockdep_free_key_range(mod->mem[MOD_DATA].base, mod->mem[MOD_DATA].size);
- module_memory_free(mod->mem[MOD_DATA].base, MOD_DATA);
+ module_memory_free(mod->mem[MOD_DATA].base, MOD_DATA, unload_codetags);
}

/* Free a module, remove from lists, etc. */
static void free_module(struct module *mod)
{
+ bool unload_codetags;
+
trace_module_free(mod);

- codetag_unload_module(mod);
+ unload_codetags = codetag_unload_module(mod);
mod_sysfs_teardown(mod);

/*
@@ -1292,7 +1299,7 @@ static void free_module(struct module *mod)
kfree(mod->args);
percpu_modfree(mod);

- free_mod_mem(mod);
+ free_mod_mem(mod, unload_codetags);
}

void *__symbol_get(const char *symbol)
@@ -2294,7 +2301,7 @@ static int move_module(struct module *mod, struct load_info *info)
return 0;
out_enomem:
for (t--; t >= 0; t--)
- module_memory_free(mod->mem[t].base, t);
+ module_memory_free(mod->mem[t].base, t, true);
return ret;
}

@@ -2424,7 +2431,7 @@ static void module_deallocate(struct module *mod, struct load_info *info)
percpu_modfree(mod);
module_arch_freeing_init(mod);

- free_mod_mem(mod);
+ free_mod_mem(mod, true);
}

int __weak module_finalize(const Elf_Ehdr *hdr,
diff --git a/lib/codetag.c b/lib/codetag.c
index 4ea57fb37346..0ad4ea66c769 100644
--- a/lib/codetag.c
+++ b/lib/codetag.c
@@ -5,6 +5,7 @@
#include <linux/module.h>
#include <linux/seq_buf.h>
#include <linux/slab.h>
+#include <linux/vmalloc.h>

struct codetag_type {
struct list_head link;
@@ -219,12 +220,13 @@ void codetag_load_module(struct module *mod)
mutex_unlock(&codetag_lock);
}

-void codetag_unload_module(struct module *mod)
+bool codetag_unload_module(struct module *mod)
{
struct codetag_type *cttype;
+ bool unload_ok = true;

if (!mod)
- return;
+ return true;

mutex_lock(&codetag_lock);
list_for_each_entry(cttype, &codetag_types, link) {
@@ -241,7 +243,8 @@ void codetag_unload_module(struct module *mod)
}
if (found) {
if (cttype->desc.module_unload)
- cttype->desc.module_unload(cttype, cmod);
+ if (!cttype->desc.module_unload(cttype, cmod))
+ unload_ok = false;

cttype->count -= range_size(cttype, &cmod->range);
idr_remove(&cttype->mod_idr, mod_id);
@@ -250,4 +253,6 @@ void codetag_unload_module(struct module *mod)
up_write(&cttype->mod_lock);
}
mutex_unlock(&codetag_lock);
+
+ return unload_ok;
}
--
2.40.1.495.gc816e09b53d-goog
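
The control flow introduced here can be summarized in a userspace sketch: every tag type is asked whether the module's tags can go away, the answers are combined into one boolean, and that boolean decides whether the data region is freed or deliberately left in place. All names below (`struct tag_type`, `live_refs`, `unload_all()`, `free_data_region()`) are hypothetical simplifications rather than the kernel's API.

```c
#include <stdbool.h>
#include <stddef.h>

struct tag_type {
	const char *name;
	/* Returns false if tags of this type are still referenced. */
	bool (*module_unload)(const struct tag_type *type);
	size_t live_refs;	/* toy stand-in for "non-zero allocation tags" */
};

static bool unload_if_unused(const struct tag_type *type)
{
	return type->live_refs == 0;
}

/* Ask every type; all are visited even after one reports a failure. */
static bool unload_all(const struct tag_type *types, size_t n)
{
	bool unload_ok = true;

	for (size_t i = 0; i < n; i++) {
		if (types[i].module_unload &&
		    !types[i].module_unload(&types[i]))
			unload_ok = false;
	}
	return unload_ok;
}

/* Returns true if the region was freed, false if it was left in place. */
static bool free_data_region(bool unload_codetags, bool is_core_data)
{
	if (!unload_codetags && is_core_data)
		return false;	/* keep the tags reachable to avoid a UAF */

	return true;
}
```

The trade-off is a deliberate memory leak of the module's core data in exchange for never dereferencing a freed code tag. The error paths in move_module() and module_deallocate() pass `true` because a module that failed to load cannot have outstanding allocations of this kind.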

Suren Baghdasaryan

May 1, 2023, 12:55:46 PM
From: Kent Overstreet <kent.ov...@linux.dev>

Provide codetag_query_parse() to parse codetag queries and
codetag_matches_query() to check if the query affects a given codetag.

Signed-off-by: Kent Overstreet <kent.ov...@linux.dev>
Signed-off-by: Suren Baghdasaryan <sur...@google.com>
---
include/linux/codetag.h | 27 ++++++++
lib/codetag.c | 135 ++++++++++++++++++++++++++++++++++++++++
2 files changed, 162 insertions(+)

diff --git a/include/linux/codetag.h b/include/linux/codetag.h
index d98e4c8e86f0..87207f199ac9 100644
--- a/include/linux/codetag.h
+++ b/include/linux/codetag.h
@@ -80,4 +80,31 @@ static inline void codetag_load_module(struct module *mod) {}
static inline bool codetag_unload_module(struct module *mod) { return true; }
#endif

+/* Codetag query parsing */
+
+struct codetag_query {
+ const char *filename;
+ const char *module;
+ const char *function;
+ const char *class;
+ unsigned int first_line, last_line;
+ unsigned int first_index, last_index;
+ unsigned int cur_index;
+
+ bool match_line:1;
+ bool match_index:1;
+
+ unsigned int set_enabled:1;
+ unsigned int enabled:2;
+
+ unsigned int set_frequency:1;
+ unsigned int frequency;
+};
+
+char *codetag_query_parse(struct codetag_query *q, char *buf);
+bool codetag_matches_query(struct codetag_query *q,
+ const struct codetag *ct,
+ const struct codetag_module *mod,
+ const char *class);
+
#endif /* _LINUX_CODETAG_H */
diff --git a/lib/codetag.c b/lib/codetag.c
index 0ad4ea66c769..84f90f3b922c 100644
--- a/lib/codetag.c
+++ b/lib/codetag.c
@@ -256,3 +256,138 @@ bool codetag_unload_module(struct module *mod)

return unload_ok;
}
+
+/* Codetag query parsing */
+
+#define CODETAG_QUERY_TOKENS() \
+ x(func) \
+ x(file) \
+ x(line) \
+ x(module) \
+ x(class) \
+ x(index)
+
+enum tokens {
+#define x(name) TOK_##name,
+ CODETAG_QUERY_TOKENS()
+#undef x
+};
+
+static const char * const token_strs[] = {
+#define x(name) #name,
+ CODETAG_QUERY_TOKENS()
+#undef x
+ NULL
+};
+
+static int parse_range(char *str, unsigned int *first, unsigned int *last)
+{
+ char *first_str = str;
+ char *last_str = strchr(first_str, '-');
+
+ if (last_str)
+ *last_str++ = '\0';
+
+ if (kstrtouint(first_str, 10, first))
+ return -EINVAL;
+
+ if (!last_str)
+ *last = *first;
+ else if (kstrtouint(last_str, 10, last))
+ return -EINVAL;
+
+ return 0;
+}
+
+char *codetag_query_parse(struct codetag_query *q, char *buf)
+{
+ while (1) {
+ char *p = buf;
+ char *str1 = strsep_no_empty(&p, " \t\r\n");
+ char *str2 = strsep_no_empty(&p, " \t\r\n");
+ int ret, token;
+
+ if (!str1 || !str2)
+ break;
+
+ token = match_string(token_strs, ARRAY_SIZE(token_strs), str1);
+ if (token < 0)
+ break;
+
+ switch (token) {
+ case TOK_func:
+ q->function = str2;
+ break;
+ case TOK_file:
+ q->filename = str2;
+ break;
+ case TOK_line:
+ ret = parse_range(str2, &q->first_line, &q->last_line);
+ if (ret)
+ return ERR_PTR(ret);
+ q->match_line = true;
+ break;
+ case TOK_module:
+ q->module = str2;
+ break;
+ case TOK_class:
+ q->class = str2;
+ break;
+ case TOK_index:
+ ret = parse_range(str2, &q->first_index, &q->last_index);
+ if (ret)
+ return ERR_PTR(ret);
+ q->match_index = true;
+ break;
+ }
+
+ buf = p;
+ }
+
+ return buf;
+}
+
+bool codetag_matches_query(struct codetag_query *q,
+ const struct codetag *ct,
+ const struct codetag_module *mod,
+ const char *class)
+{
+ size_t classlen = q->class ? strlen(q->class) : 0;
+
+ if (q->module &&
+ (!mod->mod ||
+ strcmp(q->module, ct->modname)))
+ return false;
+
+ if (q->filename &&
+ strcmp(q->filename, ct->filename) &&
+ strcmp(q->filename, kbasename(ct->filename)))
+ return false;
+
+ if (q->function &&
+ strcmp(q->function, ct->function))
+ return false;
+
+ /* match against the line number range */
+ if (q->match_line &&
+ (ct->lineno < q->first_line ||
+ ct->lineno > q->last_line))
+ return false;
+
+ /* match against the class */
+ if (classlen &&
+ (strncmp(q->class, class, classlen) ||
+ (class[classlen] && class[classlen] != ':')))
+ return false;
+
+ /* match against the fault index */
+ if (q->match_index &&
+ (q->cur_index < q->first_index ||
+ q->cur_index > q->last_index)) {
+ q->cur_index++;
+ return false;
+ }
+
+ q->cur_index++;
+ return true;
+}
--
2.40.1.495.gc816e09b53d-goog
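
For readers who want to try the range syntax outside the kernel, here is a
minimal userspace sketch of the "N" / "N-M" parsing done by parse_range()
above, together with the inclusive comparison that codetag_matches_query()
applies to line numbers. The helper names parse_uint() and line_in_range()
are made up for this illustration, and strtoul() is only an approximate
stand-in for kstrtouint() (this version is somewhat stricter about signs
and trailing characters).

```c
#include <assert.h>
#include <errno.h>
#include <limits.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for kstrtouint(): base 10, whole string, digits only. */
static int parse_uint(const char *s, unsigned int *out)
{
    char *end;
    unsigned long v;

    if (*s < '0' || *s > '9')
        return -EINVAL;
    errno = 0;
    v = strtoul(s, &end, 10);
    if (errno || *end != '\0' || v > UINT_MAX)
        return -EINVAL;
    *out = (unsigned int)v;
    return 0;
}

/* Same shape as parse_range() in the patch; note that it modifies str. */
static int parse_range(char *str, unsigned int *first, unsigned int *last)
{
    char *last_str;

    assert(str && first && last);
    last_str = strchr(str, '-');
    if (last_str)
        *last_str++ = '\0';     /* split "N-M" into "N" and "M" */

    if (parse_uint(str, first))
        return -EINVAL;

    if (!last_str)
        *last = *first;         /* a single number is a one-element range */
    else if (parse_uint(last_str, last))
        return -EINVAL;

    return 0;
}

/* Inclusive on both ends, like the line-number check in the patch. */
static int line_in_range(unsigned int lineno, unsigned int first,
                         unsigned int last)
{
    return lineno >= first && lineno <= last;
}
```

Because the buffer is split in place, callers have to pass a writable
string (a char array, not a string literal).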

Suren Baghdasaryan

May 1, 2023, 12:55:48 PM
Introduce CONFIG_MEM_ALLOC_PROFILING, which provides definitions to easily
instrument memory allocators. It also registers an "alloc_tags" codetag
type with an "allocations" debugfs interface to output allocation tag
information.
CONFIG_MEM_ALLOC_PROFILING_DEBUG is provided for debugging the memory
allocation profiling instrumentation.

Co-developed-by: Kent Overstreet <kent.ov...@linux.dev>
Signed-off-by: Kent Overstreet <kent.ov...@linux.dev>
Signed-off-by: Suren Baghdasaryan <sur...@google.com>
---
.../admin-guide/kernel-parameters.txt | 2 +
include/asm-generic/codetag.lds.h | 14 ++
include/asm-generic/vmlinux.lds.h | 3 +
include/linux/alloc_tag.h | 105 +++++++++++
include/linux/sched.h | 24 +++
lib/Kconfig.debug | 19 ++
lib/Makefile | 2 +
lib/alloc_tag.c | 177 ++++++++++++++++++
scripts/module.lds.S | 7 +
9 files changed, 353 insertions(+)
create mode 100644 include/asm-generic/codetag.lds.h
create mode 100644 include/linux/alloc_tag.h
create mode 100644 lib/alloc_tag.c

diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index 9e5bab29685f..2fd8e56b7af8 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -3770,6 +3770,8 @@

nomce [X86-32] Disable Machine Check Exception

+ nomem_profiling Disable memory allocation profiling.
+
nomfgpt [X86-32] Disable Multi-Function General Purpose
Timer usage (for AMD Geode machines).

diff --git a/include/asm-generic/codetag.lds.h b/include/asm-generic/codetag.lds.h
new file mode 100644
index 000000000000..64f536b80380
--- /dev/null
+++ b/include/asm-generic/codetag.lds.h
@@ -0,0 +1,14 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+#ifndef __ASM_GENERIC_CODETAG_LDS_H
+#define __ASM_GENERIC_CODETAG_LDS_H
+
+#define SECTION_WITH_BOUNDARIES(_name) \
+ . = ALIGN(8); \
+ __start_##_name = .; \
+ KEEP(*(_name)) \
+ __stop_##_name = .;
+
+#define CODETAG_SECTIONS() \
+ SECTION_WITH_BOUNDARIES(alloc_tags)
+
+#endif /* __ASM_GENERIC_CODETAG_LDS_H */
diff --git a/include/asm-generic/vmlinux.lds.h b/include/asm-generic/vmlinux.lds.h
index d1f57e4868ed..985ff045c2a2 100644
--- a/include/asm-generic/vmlinux.lds.h
+++ b/include/asm-generic/vmlinux.lds.h
@@ -50,6 +50,8 @@
* [__nosave_begin, __nosave_end] for the nosave data
*/

+#include <asm-generic/codetag.lds.h>
+
#ifndef LOAD_OFFSET
#define LOAD_OFFSET 0
#endif
@@ -374,6 +376,7 @@
. = ALIGN(8); \
BOUNDED_SECTION_BY(__dyndbg_classes, ___dyndbg_classes) \
BOUNDED_SECTION_BY(__dyndbg, ___dyndbg) \
+ CODETAG_SECTIONS() \
LIKELY_PROFILE() \
BRANCH_PROFILE() \
TRACE_PRINTKS() \
diff --git a/include/linux/alloc_tag.h b/include/linux/alloc_tag.h
new file mode 100644
index 000000000000..d913f8d9a7d8
--- /dev/null
+++ b/include/linux/alloc_tag.h
@@ -0,0 +1,105 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * allocation tagging
+ */
+#ifndef _LINUX_ALLOC_TAG_H
+#define _LINUX_ALLOC_TAG_H
+
+#include <linux/bug.h>
+#include <linux/codetag.h>
+#include <linux/container_of.h>
+#include <linux/lazy-percpu-counter.h>
+#include <linux/static_key.h>
+
+/*
+ * An instance of this structure is created in a special ELF section at every
+ * allocation callsite. At runtime, the special section is treated as
+ * an array of these. Embedded codetag utilizes codetag framework.
+ */
+struct alloc_tag {
+ struct codetag ct;
+ struct lazy_percpu_counter bytes_allocated;
+} __aligned(8);
+
+#ifdef CONFIG_MEM_ALLOC_PROFILING
+
+static inline struct alloc_tag *ct_to_alloc_tag(struct codetag *ct)
+{
+ return container_of(ct, struct alloc_tag, ct);
+}
+
+#define DEFINE_ALLOC_TAG(_alloc_tag, _old) \
+ static struct alloc_tag _alloc_tag __used __aligned(8) \
+ __section("alloc_tags") = { .ct = CODE_TAG_INIT }; \
+ struct alloc_tag * __maybe_unused _old = alloc_tag_save(&_alloc_tag)
+
+extern struct static_key_true mem_alloc_profiling_key;
+
+static inline bool mem_alloc_profiling_enabled(void)
+{
+ return static_branch_likely(&mem_alloc_profiling_key);
+}
+
+static inline void __alloc_tag_sub(union codetag_ref *ref, size_t bytes,
+ bool may_allocate)
+{
+ struct alloc_tag *tag;
+
+#ifdef CONFIG_MEM_ALLOC_PROFILING_DEBUG
+ /* The switch should be checked before this */
+ BUG_ON(!mem_alloc_profiling_enabled());
+
+ WARN_ONCE(ref && !ref->ct, "alloc_tag was not set\n");
+#endif
+ if (!ref || !ref->ct)
+ return;
+
+ tag = ct_to_alloc_tag(ref->ct);
+
+ if (may_allocate)
+ lazy_percpu_counter_add(&tag->bytes_allocated, -bytes);
+ else
+ lazy_percpu_counter_add_noupgrade(&tag->bytes_allocated, -bytes);
+ ref->ct = NULL;
+}
+
+static inline void alloc_tag_sub(union codetag_ref *ref, size_t bytes)
+{
+ __alloc_tag_sub(ref, bytes, true);
+}
+
+static inline void alloc_tag_sub_noalloc(union codetag_ref *ref, size_t bytes)
+{
+ __alloc_tag_sub(ref, bytes, false);
+}
+
+static inline void alloc_tag_add(union codetag_ref *ref, struct alloc_tag *tag, size_t bytes)
+{
+#ifdef CONFIG_MEM_ALLOC_PROFILING_DEBUG
+ /* The switch should be checked before this */
+ BUG_ON(!mem_alloc_profiling_enabled());
+
+ WARN_ONCE(ref && ref->ct,
+ "alloc_tag was not cleared (got tag for %s:%u)\n",\
+ ref->ct->filename, ref->ct->lineno);
+
+ WARN_ONCE(!tag, "current->alloc_tag not set");
+#endif
+ if (!ref || !tag)
+ return;
+
+ ref->ct = &tag->ct;
+ lazy_percpu_counter_add(&tag->bytes_allocated, bytes);
+}
+
+#else
+
+#define DEFINE_ALLOC_TAG(_alloc_tag, _old)
+static inline void alloc_tag_sub(union codetag_ref *ref, size_t bytes) {}
+static inline void alloc_tag_sub_noalloc(union codetag_ref *ref, size_t bytes) {}
+static inline void alloc_tag_add(union codetag_ref *ref, struct alloc_tag *tag,
+ size_t bytes) {}
+
+#endif
+
+#endif /* _LINUX_ALLOC_TAG_H */
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 35e7efdea2d9..33708bf8f191 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -763,6 +763,10 @@ struct task_struct {
unsigned int flags;
unsigned int ptrace;

+#ifdef CONFIG_MEM_ALLOC_PROFILING
+ struct alloc_tag *alloc_tag;
+#endif
+
#ifdef CONFIG_SMP
int on_cpu;
struct __call_single_node wake_entry;
@@ -802,6 +806,7 @@ struct task_struct {
struct task_group *sched_task_group;
#endif

+
#ifdef CONFIG_UCLAMP_TASK
/*
* Clamp values requested for a scheduling entity.
@@ -2444,4 +2449,23 @@ static inline void sched_core_fork(struct task_struct *p) { }

extern void sched_set_stop_task(int cpu, struct task_struct *stop);

+#ifdef CONFIG_MEM_ALLOC_PROFILING
+static inline struct alloc_tag *alloc_tag_save(struct alloc_tag *tag)
+{
+ swap(current->alloc_tag, tag);
+ return tag;
+}
+
+static inline void alloc_tag_restore(struct alloc_tag *tag, struct alloc_tag *old)
+{
+#ifdef CONFIG_MEM_ALLOC_PROFILING_DEBUG
+ WARN(current->alloc_tag != tag, "current->alloc_tag was changed:\n");
+#endif
+ current->alloc_tag = old;
+}
+#else
+static inline struct alloc_tag *alloc_tag_save(struct alloc_tag *tag) { return NULL; }
+#define alloc_tag_restore(_tag, _old)
+#endif
+
#endif
diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug
index 5078da7d3ffb..da0a91ea6042 100644
--- a/lib/Kconfig.debug
+++ b/lib/Kconfig.debug
@@ -961,6 +961,25 @@ config CODE_TAGGING
bool
select KALLSYMS

+config MEM_ALLOC_PROFILING
+ bool "Enable memory allocation profiling"
+ default n
+ depends on DEBUG_FS
+ select CODE_TAGGING
+ select LAZY_PERCPU_COUNTER
+ help
+ Track allocation source code and record total allocation size
+ initiated at that code location. The mechanism can be used to track
+ memory leaks with a low performance impact.
+
+config MEM_ALLOC_PROFILING_DEBUG
+ bool "Memory allocation profiler debugging"
+ default n
+ depends on MEM_ALLOC_PROFILING
+ help
+ Adds warnings with helpful error messages for memory allocation
+ profiling.
+
source "lib/Kconfig.kasan"
source "lib/Kconfig.kfence"
source "lib/Kconfig.kmsan"
diff --git a/lib/Makefile b/lib/Makefile
index 28d70ecf2976..8d09ccb4d30c 100644
--- a/lib/Makefile
+++ b/lib/Makefile
@@ -229,6 +229,8 @@ obj-$(CONFIG_OF_RECONFIG_NOTIFIER_ERROR_INJECT) += \
obj-$(CONFIG_FUNCTION_ERROR_INJECTION) += error-inject.o

obj-$(CONFIG_CODE_TAGGING) += codetag.o
+obj-$(CONFIG_MEM_ALLOC_PROFILING) += alloc_tag.o
+
lib-$(CONFIG_GENERIC_BUG) += bug.o

obj-$(CONFIG_HAVE_ARCH_TRACEHOOK) += syscall.o
diff --git a/lib/alloc_tag.c b/lib/alloc_tag.c
new file mode 100644
index 000000000000..3c4cfeb79862
--- /dev/null
+++ b/lib/alloc_tag.c
@@ -0,0 +1,177 @@
+// SPDX-License-Identifier: GPL-2.0-only
+#include <linux/alloc_tag.h>
+#include <linux/debugfs.h>
+#include <linux/fs.h>
+#include <linux/gfp.h>
+#include <linux/module.h>
+#include <linux/seq_buf.h>
+#include <linux/uaccess.h>
+
+DEFINE_STATIC_KEY_TRUE(mem_alloc_profiling_key);
+
+/*
+ * Won't need to be exported once page allocation accounting is moved to the
+ * correct place:
+ */
+EXPORT_SYMBOL(mem_alloc_profiling_key);
+
+static int __init mem_alloc_profiling_disable(char *s)
+{
+ static_branch_disable(&mem_alloc_profiling_key);
+ return 1;
+}
+__setup("nomem_profiling", mem_alloc_profiling_disable);
+
+struct alloc_tag_file_iterator {
+ struct codetag_iterator ct_iter;
+ struct seq_buf buf;
+ char rawbuf[4096];
+};
+
+struct user_buf {
+ char __user *buf; /* destination user buffer */
+ size_t size; /* size of requested read */
+ ssize_t ret; /* bytes read so far */
+};
+
+static int flush_ubuf(struct user_buf *dst, struct seq_buf *src)
+{
+ if (src->len) {
+ size_t bytes = min_t(size_t, src->len, dst->size);
+ int err = copy_to_user(dst->buf, src->buffer, bytes);
+
+ if (err)
+ return err;
+
+ dst->ret += bytes;
+ dst->buf += bytes;
+ dst->size -= bytes;
+ src->len -= bytes;
+ memmove(src->buffer, src->buffer + bytes, src->len);
+ }
+
+ return 0;
+}
+
+static int allocations_file_open(struct inode *inode, struct file *file)
+{
+ struct codetag_type *cttype = inode->i_private;
+ struct alloc_tag_file_iterator *iter;
+
+ iter = kzalloc(sizeof(*iter), GFP_KERNEL);
+ if (!iter)
+ return -ENOMEM;
+
+ codetag_lock_module_list(cttype, true);
+ iter->ct_iter = codetag_get_ct_iter(cttype);
+ codetag_lock_module_list(cttype, false);
+ seq_buf_init(&iter->buf, iter->rawbuf, sizeof(iter->rawbuf));
+ file->private_data = iter;
+
+ return 0;
+}
+
+static int allocations_file_release(struct inode *inode, struct file *file)
+{
+ struct alloc_tag_file_iterator *iter = file->private_data;
+
+ kfree(iter);
+ return 0;
+}
+
+static void alloc_tag_to_text(struct seq_buf *out, struct codetag *ct)
+{
+ struct alloc_tag *tag = ct_to_alloc_tag(ct);
+ char buf[10];
+
+ string_get_size(lazy_percpu_counter_read(&tag->bytes_allocated), 1,
+ STRING_UNITS_2, buf, sizeof(buf));
+
+ seq_buf_printf(out, "%8s ", buf);
+ codetag_to_text(out, ct);
+ seq_buf_putc(out, '\n');
+}
+
+static ssize_t allocations_file_read(struct file *file, char __user *ubuf,
+ size_t size, loff_t *ppos)
+{
+ struct alloc_tag_file_iterator *iter = file->private_data;
+ struct user_buf buf = { .buf = ubuf, .size = size };
+ struct codetag *ct;
+ int err = 0;
+
+ codetag_lock_module_list(iter->ct_iter.cttype, true);
+ while (1) {
+ err = flush_ubuf(&buf, &iter->buf);
+ if (err || !buf.size)
+ break;
+
+ ct = codetag_next_ct(&iter->ct_iter);
+ if (!ct)
+ break;
+
+ alloc_tag_to_text(&iter->buf, ct);
+ }
+ codetag_lock_module_list(iter->ct_iter.cttype, false);
+
+ return err ? : buf.ret;
+}
+
+static const struct file_operations allocations_file_ops = {
+ .owner = THIS_MODULE,
+ .open = allocations_file_open,
+ .release = allocations_file_release,
+ .read = allocations_file_read,
+};
+
+static int __init dbgfs_init(struct codetag_type *cttype)
+{
+ struct dentry *file;
+
+ file = debugfs_create_file("allocations", 0444, NULL, cttype,
+ &allocations_file_ops);
+
+ return IS_ERR(file) ? PTR_ERR(file) : 0;
+}
+
+static bool alloc_tag_module_unload(struct codetag_type *cttype, struct codetag_module *cmod)
+{
+ struct codetag_iterator iter = codetag_get_ct_iter(cttype);
+ bool module_unused = true;
+ struct alloc_tag *tag;
+ struct codetag *ct;
+ size_t bytes;
+
+ for (ct = codetag_next_ct(&iter); ct; ct = codetag_next_ct(&iter)) {
+ if (iter.cmod != cmod)
+ continue;
+
+ tag = ct_to_alloc_tag(ct);
+ bytes = lazy_percpu_counter_read(&tag->bytes_allocated);
+
+ if (!WARN(bytes, "%s:%u module %s func:%s has %zu allocated at module unload",
+ ct->filename, ct->lineno, ct->modname, ct->function, bytes))
+ lazy_percpu_counter_exit(&tag->bytes_allocated);
+ else
+ module_unused = false;
+ }
+
+ return module_unused;
+}
+
+static int __init alloc_tag_init(void)
+{
+ struct codetag_type *cttype;
+ const struct codetag_type_desc desc = {
+ .section = "alloc_tags",
+ .tag_size = sizeof(struct alloc_tag),
+ .module_unload = alloc_tag_module_unload,
+ };
+
+ cttype = codetag_register_type(&desc);
+ if (IS_ERR_OR_NULL(cttype))
+ return PTR_ERR(cttype);
+
+ return dbgfs_init(cttype);
+}
+module_init(alloc_tag_init);
diff --git a/scripts/module.lds.S b/scripts/module.lds.S
index bf5bcf2836d8..45c67a0994f3 100644
--- a/scripts/module.lds.S
+++ b/scripts/module.lds.S
@@ -9,6 +9,8 @@
#define DISCARD_EH_FRAME *(.eh_frame)
#endif

+#include <asm-generic/codetag.lds.h>
+
SECTIONS {
/DISCARD/ : {
*(.discard)
@@ -47,12 +49,17 @@ SECTIONS {
.data : {
*(.data .data.[0-9a-zA-Z_]*)
*(.data..L*)
+ CODETAG_SECTIONS()
}

.rodata : {
*(.rodata .rodata.[0-9a-zA-Z_]*)
*(.rodata..L*)
}
+#else
+ .data : {
+ CODETAG_SECTIONS()
+ }
#endif
}

--
2.40.1.495.gc816e09b53d-goog
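
The accounting contract between alloc_tag_add() and alloc_tag_sub() in the
patch above can be modelled in a few lines of userspace C. This is only a
sketch under simplifying assumptions: the model_* names are invented for
illustration, a plain long replaces lazy_percpu_counter, and an assert()
stands in for the WARN_ONCE() that the debug option emits.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model of struct alloc_tag: one counter per call site. */
struct model_tag {
    const char *filename;
    unsigned int lineno;
    long bytes_allocated;
};

/* Per-object back-pointer, standing in for union codetag_ref. */
struct model_ref {
    struct model_tag *tag;
};

/* Charge bytes to tag and remember in ref which tag was charged. */
static void model_tag_add(struct model_ref *ref, struct model_tag *tag,
                          size_t bytes)
{
    if (!ref || !tag)
        return;
    assert(!ref->tag);      /* the ref must have been cleared by a prior sub */
    ref->tag = tag;
    tag->bytes_allocated += (long)bytes;
}

/* Uncharge bytes from whichever tag ref points at, then clear ref. */
static void model_tag_sub(struct model_ref *ref, size_t bytes)
{
    if (!ref || !ref->tag)
        return;
    ref->tag->bytes_allocated -= (long)bytes;
    ref->tag = NULL;        /* a repeated sub on the same ref is a no-op */
}
```

The point of clearing the reference on the free side is that the counter
can only be decremented once per charged object, so a tag whose counter is
nonzero at module unload really does have outstanding allocations.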

Suren Baghdasaryan

May 1, 2023, 12:55:50 PM
Introduce helper functions to easily instrument page allocators by
storing a pointer to the allocation tag associated with the code that
allocated the page in a page_ext field.

Signed-off-by: Suren Baghdasaryan <sur...@google.com>
Co-developed-by: Kent Overstreet <kent.ov...@linux.dev>
Signed-off-by: Kent Overstreet <kent.ov...@linux.dev>
---
include/linux/pgalloc_tag.h | 33 +++++++++++++++++++++++++++++++++
lib/Kconfig.debug | 1 +
lib/alloc_tag.c | 17 +++++++++++++++++
mm/page_ext.c | 12 +++++++++---
4 files changed, 60 insertions(+), 3 deletions(-)
create mode 100644 include/linux/pgalloc_tag.h

diff --git a/include/linux/pgalloc_tag.h b/include/linux/pgalloc_tag.h
new file mode 100644
index 000000000000..f8c7b6ef9c75
--- /dev/null
+++ b/include/linux/pgalloc_tag.h
@@ -0,0 +1,33 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * page allocation tagging
+ */
+#ifndef _LINUX_PGALLOC_TAG_H
+#define _LINUX_PGALLOC_TAG_H
+
+#include <linux/alloc_tag.h>
+#include <linux/page_ext.h>
+
+extern struct page_ext_operations page_alloc_tagging_ops;
+struct page_ext *lookup_page_ext(const struct page *page);
+
+static inline union codetag_ref *get_page_tag_ref(struct page *page)
+{
+ if (page && mem_alloc_profiling_enabled()) {
+ struct page_ext *page_ext = lookup_page_ext(page);
+
+ if (page_ext)
+ return (void *)page_ext + page_alloc_tagging_ops.offset;
+ }
+ return NULL;
+}
+
+static inline void pgalloc_tag_dec(struct page *page, unsigned int order)
+{
+ union codetag_ref *ref = get_page_tag_ref(page);
+
+ if (ref)
+ alloc_tag_sub(ref, PAGE_SIZE << order);
+}
+
+#endif /* _LINUX_PGALLOC_TAG_H */
diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug
index da0a91ea6042..d3aa5ee0bf0d 100644
--- a/lib/Kconfig.debug
+++ b/lib/Kconfig.debug
@@ -967,6 +967,7 @@ config MEM_ALLOC_PROFILING
depends on DEBUG_FS
select CODE_TAGGING
select LAZY_PERCPU_COUNTER
+ select PAGE_EXTENSION
help
Track allocation source code and record total allocation size
initiated at that code location. The mechanism can be used to track
diff --git a/lib/alloc_tag.c b/lib/alloc_tag.c
index 3c4cfeb79862..4a0b95a46b2e 100644
--- a/lib/alloc_tag.c
+++ b/lib/alloc_tag.c
@@ -4,6 +4,7 @@
#include <linux/fs.h>
#include <linux/gfp.h>
#include <linux/module.h>
+#include <linux/page_ext.h>
#include <linux/seq_buf.h>
#include <linux/uaccess.h>

@@ -159,6 +160,22 @@ static bool alloc_tag_module_unload(struct codetag_type *cttype, struct codetag_
return module_unused;
}

+static __init bool need_page_alloc_tagging(void)
+{
+ return true;
+}
+
+static __init void init_page_alloc_tagging(void)
+{
+}
+
+struct page_ext_operations page_alloc_tagging_ops = {
+ .size = sizeof(union codetag_ref),
+ .need = need_page_alloc_tagging,
+ .init = init_page_alloc_tagging,
+};
+EXPORT_SYMBOL(page_alloc_tagging_ops);
+
static int __init alloc_tag_init(void)
{
struct codetag_type *cttype;
diff --git a/mm/page_ext.c b/mm/page_ext.c
index dc1626be458b..eaf054ec276c 100644
--- a/mm/page_ext.c
+++ b/mm/page_ext.c
@@ -10,6 +10,7 @@
#include <linux/page_idle.h>
#include <linux/page_table_check.h>
#include <linux/rcupdate.h>
+#include <linux/pgalloc_tag.h>

/*
* struct page extension
@@ -82,6 +83,9 @@ static struct page_ext_operations *page_ext_ops[] __initdata = {
#if defined(CONFIG_PAGE_IDLE_FLAG) && !defined(CONFIG_64BIT)
&page_idle_ops,
#endif
+#ifdef CONFIG_MEM_ALLOC_PROFILING
+ &page_alloc_tagging_ops,
+#endif
#ifdef CONFIG_PAGE_TABLE_CHECK
&page_table_check_ops,
#endif
@@ -90,7 +94,7 @@ static struct page_ext_operations *page_ext_ops[] __initdata = {
unsigned long page_ext_size;

static unsigned long total_usage;
-static struct page_ext *lookup_page_ext(const struct page *page);
+struct page_ext *lookup_page_ext(const struct page *page);

bool early_page_ext __meminitdata;
static int __init setup_early_page_ext(char *str)
@@ -199,7 +203,7 @@ void __meminit pgdat_page_ext_init(struct pglist_data *pgdat)
pgdat->node_page_ext = NULL;
}

-static struct page_ext *lookup_page_ext(const struct page *page)
+struct page_ext *lookup_page_ext(const struct page *page)
{
unsigned long pfn = page_to_pfn(page);
unsigned long index;
@@ -219,6 +223,7 @@ static struct page_ext *lookup_page_ext(const struct page *page)
MAX_ORDER_NR_PAGES);
return get_entry(base, index);
}
+EXPORT_SYMBOL(lookup_page_ext);

static int __init alloc_node_page_ext(int nid)
{
@@ -278,7 +283,7 @@ static bool page_ext_invalid(struct page_ext *page_ext)
return !page_ext || (((unsigned long)page_ext & PAGE_EXT_INVALID) == PAGE_EXT_INVALID);
}

-static struct page_ext *lookup_page_ext(const struct page *page)
+struct page_ext *lookup_page_ext(const struct page *page)
{
unsigned long pfn = page_to_pfn(page);
struct mem_section *section = __pfn_to_section(pfn);
@@ -295,6 +300,7 @@ static struct page_ext *lookup_page_ext(const struct page *page)
return NULL;
return get_entry(page_ext, pfn);
}
+EXPORT_SYMBOL(lookup_page_ext);

static void *__meminit alloc_page_ext(size_t size, int nid)
{
--
2.40.1.495.gc816e09b53d-goog
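
To illustrate the "(void *)page_ext + page_alloc_tagging_ops.offset" lookup
in get_page_tag_ref() above, here is a small userspace model of a per-page
extension record shared by several clients. Everything named model_* is
hypothetical, the 4 KiB page size is an assumption, and the way the offset
is assigned at registration is my reading of how page_ext lays out its
clients rather than something shown in this patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define MODEL_PAGE_SIZE 4096UL  /* assumption: 4 KiB pages */

struct model_tag { long bytes_allocated; };
struct model_ref { struct model_tag *tag; };

/* A client of the per-page record: it asks for size bytes, gets an offset. */
struct model_ext_ops {
    size_t size;
    size_t offset;
};

struct model_ext_table {
    unsigned char *base;    /* nr_pages records of entry_size bytes each */
    size_t entry_size;
    size_t nr_pages;
};

static size_t model_round_up(size_t n)
{
    size_t align = sizeof(void *);

    return (n + align - 1) / align * align;
}

/* Append the client's slice to the record layout (before allocation). */
static void model_ext_register(struct model_ext_table *t,
                               struct model_ext_ops *ops)
{
    assert(ops->size > 0);
    ops->offset = model_round_up(t->entry_size);
    t->entry_size = model_round_up(ops->offset + ops->size);
}

static int model_ext_alloc(struct model_ext_table *t, size_t nr_pages)
{
    t->base = calloc(nr_pages, t->entry_size);
    t->nr_pages = nr_pages;
    return t->base ? 0 : -1;
}

/* Counterpart of get_page_tag_ref(): record base plus the client's offset. */
static struct model_ref *model_page_tag_ref(struct model_ext_table *t,
                                            const struct model_ext_ops *ops,
                                            size_t pfn)
{
    if (!t->base || pfn >= t->nr_pages)
        return NULL;
    return (struct model_ref *)(t->base + pfn * t->entry_size + ops->offset);
}

/* Counterpart of pgalloc_tag_dec(): an order-N block is PAGE_SIZE << N bytes. */
static void model_pgalloc_tag_dec(struct model_ref *ref, unsigned int order)
{
    if (!ref || !ref->tag)
        return;
    ref->tag->bytes_allocated -= (long)(MODEL_PAGE_SIZE << order);
    ref->tag = NULL;
}
```

Storing only a pointer-sized reference per page keeps the extension small;
the counter itself lives in the tag, which is shared by every page
allocated from the same call site.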

Suren Baghdasaryan

May 1, 2023, 12:55:52 PM
After alloc_pages is redefined as a macro, the preprocessor replaces uses
of that name. Rename the conflicting identifiers (the dma_map_ops callback
becomes alloc_pages_op) so that they are not replaced where that is not
intended.

Signed-off-by: Suren Baghdasaryan <sur...@google.com>
---
arch/x86/kernel/amd_gart_64.c | 2 +-
drivers/iommu/dma-iommu.c | 2 +-
drivers/xen/grant-dma-ops.c | 2 +-
drivers/xen/swiotlb-xen.c | 2 +-
include/linux/dma-map-ops.h | 2 +-
kernel/dma/mapping.c | 4 ++--
6 files changed, 7 insertions(+), 7 deletions(-)

diff --git a/arch/x86/kernel/amd_gart_64.c b/arch/x86/kernel/amd_gart_64.c
index 56a917df410d..842a0ec5eaa9 100644
--- a/arch/x86/kernel/amd_gart_64.c
+++ b/arch/x86/kernel/amd_gart_64.c
@@ -676,7 +676,7 @@ static const struct dma_map_ops gart_dma_ops = {
.get_sgtable = dma_common_get_sgtable,
.dma_supported = dma_direct_supported,
.get_required_mask = dma_direct_get_required_mask,
- .alloc_pages = dma_direct_alloc_pages,
+ .alloc_pages_op = dma_direct_alloc_pages,
.free_pages = dma_direct_free_pages,
};

diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c
index 7a9f0b0bddbd..76a9d5ca4eee 100644
--- a/drivers/iommu/dma-iommu.c
+++ b/drivers/iommu/dma-iommu.c
@@ -1556,7 +1556,7 @@ static const struct dma_map_ops iommu_dma_ops = {
.flags = DMA_F_PCI_P2PDMA_SUPPORTED,
.alloc = iommu_dma_alloc,
.free = iommu_dma_free,
- .alloc_pages = dma_common_alloc_pages,
+ .alloc_pages_op = dma_common_alloc_pages,
.free_pages = dma_common_free_pages,
.alloc_noncontiguous = iommu_dma_alloc_noncontiguous,
.free_noncontiguous = iommu_dma_free_noncontiguous,
diff --git a/drivers/xen/grant-dma-ops.c b/drivers/xen/grant-dma-ops.c
index 9784a77fa3c9..6c7d984f164d 100644
--- a/drivers/xen/grant-dma-ops.c
+++ b/drivers/xen/grant-dma-ops.c
@@ -282,7 +282,7 @@ static int xen_grant_dma_supported(struct device *dev, u64 mask)
static const struct dma_map_ops xen_grant_dma_ops = {
.alloc = xen_grant_dma_alloc,
.free = xen_grant_dma_free,
- .alloc_pages = xen_grant_dma_alloc_pages,
+ .alloc_pages_op = xen_grant_dma_alloc_pages,
.free_pages = xen_grant_dma_free_pages,
.mmap = dma_common_mmap,
.get_sgtable = dma_common_get_sgtable,
diff --git a/drivers/xen/swiotlb-xen.c b/drivers/xen/swiotlb-xen.c
index 67aa74d20162..5ab2616153f0 100644
--- a/drivers/xen/swiotlb-xen.c
+++ b/drivers/xen/swiotlb-xen.c
@@ -403,6 +403,6 @@ const struct dma_map_ops xen_swiotlb_dma_ops = {
.dma_supported = xen_swiotlb_dma_supported,
.mmap = dma_common_mmap,
.get_sgtable = dma_common_get_sgtable,
- .alloc_pages = dma_common_alloc_pages,
+ .alloc_pages_op = dma_common_alloc_pages,
.free_pages = dma_common_free_pages,
};
diff --git a/include/linux/dma-map-ops.h b/include/linux/dma-map-ops.h
index 31f114f486c4..d741940dcb3b 100644
--- a/include/linux/dma-map-ops.h
+++ b/include/linux/dma-map-ops.h
@@ -27,7 +27,7 @@ struct dma_map_ops {
unsigned long attrs);
void (*free)(struct device *dev, size_t size, void *vaddr,
dma_addr_t dma_handle, unsigned long attrs);
- struct page *(*alloc_pages)(struct device *dev, size_t size,
+ struct page *(*alloc_pages_op)(struct device *dev, size_t size,
dma_addr_t *dma_handle, enum dma_data_direction dir,
gfp_t gfp);
void (*free_pages)(struct device *dev, size_t size, struct page *vaddr,
diff --git a/kernel/dma/mapping.c b/kernel/dma/mapping.c
index 9a4db5cce600..fc42930af14b 100644
--- a/kernel/dma/mapping.c
+++ b/kernel/dma/mapping.c
@@ -570,9 +570,9 @@ static struct page *__dma_alloc_pages(struct device *dev, size_t size,
size = PAGE_ALIGN(size);
if (dma_alloc_direct(dev, ops))
return dma_direct_alloc_pages(dev, size, dma_handle, dir, gfp);
- if (!ops->alloc_pages)
+ if (!ops->alloc_pages_op)
return NULL;
- return ops->alloc_pages(dev, size, dma_handle, dir, gfp);
+ return ops->alloc_pages_op(dev, size, dma_handle, dir, gfp);
}

struct page *dma_alloc_pages(struct device *dev, size_t size,
--
2.40.1.495.gc816e09b53d-goog
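
The clash that this rename avoids can be reproduced in miniature. The
sketch below assumes the allocator ends up as a function-like macro, as the
commit message implies; alloc_thing, model_ops and the other names are
invented for the example and do not appear in the series.

```c
#include <assert.h>
#include <stddef.h>

static int model_hook_calls;    /* counts how often the wrapper macro ran */

static int real_alloc_thing(int order)
{
    return 1 << order;
}

/* Function-like macro wrapping every call spelled alloc_thing(...). */
#define alloc_thing(order) (model_hook_calls++, real_alloc_thing(order))

struct model_ops {
    /*
     * Had this member been named alloc_thing, a call written as
     * ops->alloc_thing(order) would be rewritten by the macro above into
     * ops->(model_hook_calls++, ...) and fail to compile. A distinct
     * member name keeps the preprocessor away from it.
     */
    int (*alloc_thing_op)(int order);
};

/* A backend implementation that must not go through the wrapper. */
static int model_backend_alloc(int order)
{
    return 100 + order;
}

/* Mirrors the NULL check and indirect call in __dma_alloc_pages(). */
static int model_dispatch(const struct model_ops *ops, int order)
{
    assert(ops != NULL);
    if (!ops->alloc_thing_op)
        return -1;
    return ops->alloc_thing_op(order);
}
```

A function-like macro is only expanded when its name is followed by an
opening parenthesis, which is why the designated initializers and the
member declaration would survive on their own and it is the call through
the ops pointer that breaks.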

Suren Baghdasaryan

May 1, 2023, 12:55:55 PM
Redefine the page allocators to record allocation tags upon their
invocation. Instrument post_alloc_hook() and free_pages_prepare() to
modify the current allocation tag.
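
The wrapping approach described above can be illustrated outside the kernel. The following is only a userspace sketch, assuming GNU C statement expressions (gcc or clang); every model_* name is hypothetical and stands in for the kernel's per-call-site tag, current->alloc_tag, and the renamed allocator, rather than reproducing the actual alloc_tag implementation.

```c
/* Userspace model of the alloc_hooks() pattern; not kernel code. */
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

struct model_tag {
	const char *file;	/* call site that owns this tag */
	int line;
	size_t bytes;		/* bytes attributed to this call site */
	size_t calls;		/* allocations attributed to this call site */
};

/* Stands in for current->alloc_tag (hypothetical name). */
static struct model_tag *model_current_tag;
/* Last tag that was charged; exists only so the model can be inspected. */
static struct model_tag *model_last_charged;

/* The renamed "real" allocator: charges whichever tag is current. */
static void *model_alloc_noprof(size_t size)
{
	void *p = malloc(size);

	if (p && model_current_tag) {
		model_current_tag->bytes += size;
		model_current_tag->calls++;
		model_last_charged = model_current_tag;
	}
	return p;
}

/*
 * One static tag per expansion site. The tag is current only while the
 * wrapped call runs; the previous tag is restored afterwards.
 */
#define model_alloc_hooks(_do_alloc, _res_type)				\
({									\
	static struct model_tag _tag = { __FILE__, __LINE__, 0, 0 };	\
	struct model_tag *_old = model_current_tag;			\
	_res_type _res;							\
									\
	model_current_tag = &_tag;					\
	_res = _do_alloc;						\
	model_current_tag = _old;					\
	_res;								\
})

/* The public name becomes a macro, so each caller gets its own tag. */
#define model_alloc(_size) \
	model_alloc_hooks(model_alloc_noprof(_size), void *)

/* Two distinct call sites, hence two distinct tags. */
static void *site_a(size_t size) { return model_alloc(size); }
static void *site_b(size_t size) { return model_alloc(size); }
```

Calling site_a() twice and site_b() once charges two separate tags, and model_current_tag is back to its previous value after each call. Turning the public name into a macro is what gives every caller its own static tag without changing the callers' source.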

Signed-off-by: Suren Baghdasaryan <sur...@google.com>
---
include/linux/alloc_tag.h | 11 ++++
include/linux/gfp.h | 123 +++++++++++++++++++++++++-----------
include/linux/page_ext.h | 1 -
include/linux/pagemap.h | 9 ++-
include/linux/pgalloc_tag.h | 38 +++++++++--
mm/compaction.c | 9 ++-
mm/filemap.c | 6 +-
mm/mempolicy.c | 30 ++++-----
mm/mm_init.c | 1 +
mm/page_alloc.c | 73 ++++++++++++---------
10 files changed, 208 insertions(+), 93 deletions(-)

diff --git a/include/linux/alloc_tag.h b/include/linux/alloc_tag.h
index d913f8d9a7d8..07922d81b641 100644
--- a/include/linux/alloc_tag.h
+++ b/include/linux/alloc_tag.h
@@ -102,4 +102,15 @@ static inline void alloc_tag_add(union codetag_ref *ref, struct alloc_tag *tag,

#endif

+#define alloc_hooks(_do_alloc, _res_type, _err) \
+({ \
+ _res_type _res; \
+ DEFINE_ALLOC_TAG(_alloc_tag, _old); \
+ \
+ _res = _do_alloc; \
+ alloc_tag_restore(&_alloc_tag, _old); \
+ _res; \
+})
+
+
#endif /* _LINUX_ALLOC_TAG_H */
diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index ed8cb537c6a7..0cb4a515109a 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -6,6 +6,8 @@

#include <linux/mmzone.h>
#include <linux/topology.h>
+#include <linux/alloc_tag.h>
+#include <linux/sched.h>

struct vm_area_struct;

@@ -174,42 +176,57 @@ static inline void arch_free_page(struct page *page, int order) { }
static inline void arch_alloc_page(struct page *page, int order) { }
#endif

-struct page *__alloc_pages(gfp_t gfp, unsigned int order, int preferred_nid,
+struct page *_alloc_pages2(gfp_t gfp, unsigned int order, int preferred_nid,
nodemask_t *nodemask);
-struct folio *__folio_alloc(gfp_t gfp, unsigned int order, int preferred_nid,
+#define __alloc_pages(_gfp, _order, _preferred_nid, _nodemask) \
+ alloc_hooks(_alloc_pages2(_gfp, _order, _preferred_nid, \
+ _nodemask), struct page *, NULL)
+
+struct folio *_folio_alloc2(gfp_t gfp, unsigned int order, int preferred_nid,
nodemask_t *nodemask);
+#define __folio_alloc(_gfp, _order, _preferred_nid, _nodemask) \
+ alloc_hooks(_folio_alloc2(_gfp, _order, _preferred_nid, \
+ _nodemask), struct folio *, NULL)

-unsigned long __alloc_pages_bulk(gfp_t gfp, int preferred_nid,
+unsigned long _alloc_pages_bulk(gfp_t gfp, int preferred_nid,
nodemask_t *nodemask, int nr_pages,
struct list_head *page_list,
struct page **page_array);
-
-unsigned long alloc_pages_bulk_array_mempolicy(gfp_t gfp,
+#define __alloc_pages_bulk(_gfp, _preferred_nid, _nodemask, _nr_pages, \
+ _page_list, _page_array) \
+ alloc_hooks(_alloc_pages_bulk(_gfp, _preferred_nid, \
+ _nodemask, _nr_pages, \
+ _page_list, _page_array), \
+ unsigned long, 0)
+
+unsigned long _alloc_pages_bulk_array_mempolicy(gfp_t gfp,
unsigned long nr_pages,
struct page **page_array);
+#define alloc_pages_bulk_array_mempolicy(_gfp, _nr_pages, _page_array) \
+ alloc_hooks(_alloc_pages_bulk_array_mempolicy(_gfp, \
+ _nr_pages, _page_array), \
+ unsigned long, 0)

/* Bulk allocate order-0 pages */
-static inline unsigned long
-alloc_pages_bulk_list(gfp_t gfp, unsigned long nr_pages, struct list_head *list)
-{
- return __alloc_pages_bulk(gfp, numa_mem_id(), NULL, nr_pages, list, NULL);
-}
+#define alloc_pages_bulk_list(_gfp, _nr_pages, _list) \
+ __alloc_pages_bulk(_gfp, numa_mem_id(), NULL, _nr_pages, _list, NULL)

-static inline unsigned long
-alloc_pages_bulk_array(gfp_t gfp, unsigned long nr_pages, struct page **page_array)
-{
- return __alloc_pages_bulk(gfp, numa_mem_id(), NULL, nr_pages, NULL, page_array);
-}
+#define alloc_pages_bulk_array(_gfp, _nr_pages, _page_array) \
+ __alloc_pages_bulk(_gfp, numa_mem_id(), NULL, _nr_pages, NULL, _page_array)

static inline unsigned long
-alloc_pages_bulk_array_node(gfp_t gfp, int nid, unsigned long nr_pages, struct page **page_array)
+_alloc_pages_bulk_array_node(gfp_t gfp, int nid, unsigned long nr_pages, struct page **page_array)
{
if (nid == NUMA_NO_NODE)
nid = numa_mem_id();

- return __alloc_pages_bulk(gfp, nid, NULL, nr_pages, NULL, page_array);
+ return _alloc_pages_bulk(gfp, nid, NULL, nr_pages, NULL, page_array);
}

+#define alloc_pages_bulk_array_node(_gfp, _nid, _nr_pages, _page_array) \
+ alloc_hooks(_alloc_pages_bulk_array_node(_gfp, _nid, _nr_pages, _page_array), \
+ unsigned long, 0)
+
static inline void warn_if_node_offline(int this_node, gfp_t gfp_mask)
{
gfp_t warn_gfp = gfp_mask & (__GFP_THISNODE|__GFP_NOWARN);
@@ -229,21 +246,25 @@ static inline void warn_if_node_offline(int this_node, gfp_t gfp_mask)
* online. For more general interface, see alloc_pages_node().
*/
static inline struct page *
-__alloc_pages_node(int nid, gfp_t gfp_mask, unsigned int order)
+_alloc_pages_node2(int nid, gfp_t gfp_mask, unsigned int order)
{
VM_BUG_ON(nid < 0 || nid >= MAX_NUMNODES);
warn_if_node_offline(nid, gfp_mask);

- return __alloc_pages(gfp_mask, order, nid, NULL);
+ return _alloc_pages2(gfp_mask, order, nid, NULL);
}

+#define __alloc_pages_node(_nid, _gfp_mask, _order) \
+ alloc_hooks(_alloc_pages_node2(_nid, _gfp_mask, _order), \
+ struct page *, NULL)
+
static inline
struct folio *__folio_alloc_node(gfp_t gfp, unsigned int order, int nid)
{
VM_BUG_ON(nid < 0 || nid >= MAX_NUMNODES);
warn_if_node_offline(nid, gfp);

- return __folio_alloc(gfp, order, nid, NULL);
+ return _folio_alloc2(gfp, order, nid, NULL);
}

/*
@@ -251,32 +272,45 @@ struct folio *__folio_alloc_node(gfp_t gfp, unsigned int order, int nid)
* prefer the current CPU's closest node. Otherwise node must be valid and
* online.
*/
-static inline struct page *alloc_pages_node(int nid, gfp_t gfp_mask,
+static inline struct page *_alloc_pages_node(int nid, gfp_t gfp_mask,
unsigned int order)
{
if (nid == NUMA_NO_NODE)
nid = numa_mem_id();

- return __alloc_pages_node(nid, gfp_mask, order);
+ return _alloc_pages_node2(nid, gfp_mask, order);
}

+#define alloc_pages_node(_nid, _gfp_mask, _order) \
+ alloc_hooks(_alloc_pages_node(_nid, _gfp_mask, _order), \
+ struct page *, NULL)
+
#ifdef CONFIG_NUMA
-struct page *alloc_pages(gfp_t gfp, unsigned int order);
-struct folio *folio_alloc(gfp_t gfp, unsigned order);
-struct folio *vma_alloc_folio(gfp_t gfp, int order, struct vm_area_struct *vma,
+struct page *_alloc_pages(gfp_t gfp, unsigned int order);
+struct folio *_folio_alloc(gfp_t gfp, unsigned int order);
+struct folio *_vma_alloc_folio(gfp_t gfp, int order, struct vm_area_struct *vma,
unsigned long addr, bool hugepage);
#else
-static inline struct page *alloc_pages(gfp_t gfp_mask, unsigned int order)
+static inline struct page *_alloc_pages(gfp_t gfp_mask, unsigned int order)
{
- return alloc_pages_node(numa_node_id(), gfp_mask, order);
+ return _alloc_pages_node(numa_node_id(), gfp_mask, order);
}
-static inline struct folio *folio_alloc(gfp_t gfp, unsigned int order)
+static inline struct folio *_folio_alloc(gfp_t gfp, unsigned int order)
{
return __folio_alloc_node(gfp, order, numa_node_id());
}
-#define vma_alloc_folio(gfp, order, vma, addr, hugepage) \
- folio_alloc(gfp, order)
+#define _vma_alloc_folio(gfp, order, vma, addr, hugepage) \
+ _folio_alloc(gfp, order)
#endif
+
+#define alloc_pages(_gfp, _order) \
+ alloc_hooks(_alloc_pages(_gfp, _order), struct page *, NULL)
+#define folio_alloc(_gfp, _order) \
+ alloc_hooks(_folio_alloc(_gfp, _order), struct folio *, NULL)
+#define vma_alloc_folio(_gfp, _order, _vma, _addr, _hugepage) \
+ alloc_hooks(_vma_alloc_folio(_gfp, _order, _vma, _addr, \
+ _hugepage), struct folio *, NULL)
+
#define alloc_page(gfp_mask) alloc_pages(gfp_mask, 0)
static inline struct page *alloc_page_vma(gfp_t gfp,
struct vm_area_struct *vma, unsigned long addr)
@@ -286,12 +320,21 @@ static inline struct page *alloc_page_vma(gfp_t gfp,
return &folio->page;
}

-extern unsigned long __get_free_pages(gfp_t gfp_mask, unsigned int order);
-extern unsigned long get_zeroed_page(gfp_t gfp_mask);
+extern unsigned long _get_free_pages(gfp_t gfp_mask, unsigned int order);
+#define __get_free_pages(_gfp_mask, _order) \
+ alloc_hooks(_get_free_pages(_gfp_mask, _order), unsigned long, 0)
+extern unsigned long _get_zeroed_page(gfp_t gfp_mask);
+#define get_zeroed_page(_gfp_mask) \
+ alloc_hooks(_get_zeroed_page(_gfp_mask), unsigned long, 0)

-void *alloc_pages_exact(size_t size, gfp_t gfp_mask) __alloc_size(1);
+void *_alloc_pages_exact(size_t size, gfp_t gfp_mask) __alloc_size(1);
+#define alloc_pages_exact(_size, _gfp_mask) \
+ alloc_hooks(_alloc_pages_exact(_size, _gfp_mask), void *, NULL)
void free_pages_exact(void *virt, size_t size);
-__meminit void *alloc_pages_exact_nid(int nid, size_t size, gfp_t gfp_mask) __alloc_size(2);
+
+__meminit void *_alloc_pages_exact_nid(int nid, size_t size, gfp_t gfp_mask) __alloc_size(2);
+#define alloc_pages_exact_nid(_nid, _size, _gfp_mask) \
+ alloc_hooks(_alloc_pages_exact_nid(_nid, _size, _gfp_mask), void *, NULL)

#define __get_free_page(gfp_mask) \
__get_free_pages((gfp_mask), 0)
@@ -354,10 +397,16 @@ static inline bool pm_suspended_storage(void)

#ifdef CONFIG_CONTIG_ALLOC
/* The below functions must be run on a range from a single zone. */
-extern int alloc_contig_range(unsigned long start, unsigned long end,
+extern int _alloc_contig_range(unsigned long start, unsigned long end,
unsigned migratetype, gfp_t gfp_mask);
-extern struct page *alloc_contig_pages(unsigned long nr_pages, gfp_t gfp_mask,
- int nid, nodemask_t *nodemask);
+#define alloc_contig_range(_start, _end, _migratetype, _gfp_mask) \
+ alloc_hooks(_alloc_contig_range(_start, _end, _migratetype, \
+ _gfp_mask), int, -ENOMEM)
+extern struct page *_alloc_contig_pages(unsigned long nr_pages, gfp_t gfp_mask,
+ int nid, nodemask_t *nodemask);
+#define alloc_contig_pages(_nr_pages, _gfp_mask, _nid, _nodemask) \
+ alloc_hooks(_alloc_contig_pages(_nr_pages, _gfp_mask, _nid, \
+ _nodemask), struct page *, NULL)
#endif
void free_contig_range(unsigned long pfn, unsigned long nr_pages);

diff --git a/include/linux/page_ext.h b/include/linux/page_ext.h
index 67314f648aeb..cff15ee5440e 100644
--- a/include/linux/page_ext.h
+++ b/include/linux/page_ext.h
@@ -4,7 +4,6 @@

#include <linux/types.h>
#include <linux/stacktrace.h>
-#include <linux/stackdepot.h>

struct pglist_data;

diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index a56308a9d1a4..b2efafa001f8 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -467,14 +467,17 @@ static inline void *detach_page_private(struct page *page)
}

#ifdef CONFIG_NUMA
-struct folio *filemap_alloc_folio(gfp_t gfp, unsigned int order);
+struct folio *_filemap_alloc_folio(gfp_t gfp, unsigned int order);
#else
-static inline struct folio *filemap_alloc_folio(gfp_t gfp, unsigned int order)
+static inline struct folio *_filemap_alloc_folio(gfp_t gfp, unsigned int order)
{
- return folio_alloc(gfp, order);
+ return _folio_alloc(gfp, order);
}
#endif

+#define filemap_alloc_folio(_gfp, _order) \
+ alloc_hooks(_filemap_alloc_folio(_gfp, _order), struct folio *, NULL)
+
static inline struct page *__page_cache_alloc(gfp_t gfp)
{
return &filemap_alloc_folio(gfp, 0)->page;
diff --git a/include/linux/pgalloc_tag.h b/include/linux/pgalloc_tag.h
index f8c7b6ef9c75..567327c1c46f 100644
--- a/include/linux/pgalloc_tag.h
+++ b/include/linux/pgalloc_tag.h
@@ -6,28 +6,58 @@
#define _LINUX_PGALLOC_TAG_H

#include <linux/alloc_tag.h>
+
+#ifdef CONFIG_MEM_ALLOC_PROFILING
+
#include <linux/page_ext.h>

extern struct page_ext_operations page_alloc_tagging_ops;
-struct page_ext *lookup_page_ext(const struct page *page);
+extern struct page_ext *page_ext_get(struct page *page);
+extern void page_ext_put(struct page_ext *page_ext);
+
+static inline union codetag_ref *codetag_ref_from_page_ext(struct page_ext *page_ext)
+{
+ return (void *)page_ext + page_alloc_tagging_ops.offset;
+}
+
+static inline struct page_ext *page_ext_from_codetag_ref(union codetag_ref *ref)
+{
+ return (void *)ref - page_alloc_tagging_ops.offset;
+}

static inline union codetag_ref *get_page_tag_ref(struct page *page)
{
if (page && mem_alloc_profiling_enabled()) {
- struct page_ext *page_ext = lookup_page_ext(page);
+ struct page_ext *page_ext = page_ext_get(page);

if (page_ext)
- return (void *)page_ext + page_alloc_tagging_ops.offset;
+ return codetag_ref_from_page_ext(page_ext);
}
return NULL;
}

+static inline void put_page_tag_ref(union codetag_ref *ref)
+{
+ if (ref)
+ page_ext_put(page_ext_from_codetag_ref(ref));
+}
+
static inline void pgalloc_tag_dec(struct page *page, unsigned int order)
{
union codetag_ref *ref = get_page_tag_ref(page);

- if (ref)
+ if (ref) {
alloc_tag_sub(ref, PAGE_SIZE << order);
+ put_page_tag_ref(ref);
+ }
}

+#else /* CONFIG_MEM_ALLOC_PROFILING */
+
+static inline union codetag_ref *get_page_tag_ref(struct page *page) { return NULL; }
+static inline void put_page_tag_ref(union codetag_ref *ref) {}
+#define pgalloc_tag_dec(__page, __size) do {} while (0)
+
+#endif /* CONFIG_MEM_ALLOC_PROFILING */
+
#endif /* _LINUX_PGALLOC_TAG_H */
diff --git a/mm/compaction.c b/mm/compaction.c
index c8bcdea15f5f..32707fb62495 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -1684,7 +1684,7 @@ static void isolate_freepages(struct compact_control *cc)
* This is a migrate-callback that "allocates" freepages by taking pages
* from the isolated freelists in the block we are migrating to.
*/
-static struct page *compaction_alloc(struct page *migratepage,
+static struct page *_compaction_alloc(struct page *migratepage,
unsigned long data)
{
struct compact_control *cc = (struct compact_control *)data;
@@ -1704,6 +1704,13 @@ static struct page *compaction_alloc(struct page *migratepage,
return freepage;
}

+static struct page *compaction_alloc(struct page *migratepage,
+ unsigned long data)
+{
+ return alloc_hooks(_compaction_alloc(migratepage, data),
+ struct page *, NULL);
+}
+
/*
* This is a migrate-callback that "frees" freepages back to the isolated
* freelist. All pages on the freelist are from the same zone, so there is no
diff --git a/mm/filemap.c b/mm/filemap.c
index a34abfe8c654..f0f8b782d172 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -958,7 +958,7 @@ int filemap_add_folio(struct address_space *mapping, struct folio *folio,
EXPORT_SYMBOL_GPL(filemap_add_folio);

#ifdef CONFIG_NUMA
-struct folio *filemap_alloc_folio(gfp_t gfp, unsigned int order)
+struct folio *_filemap_alloc_folio(gfp_t gfp, unsigned int order)
{
int n;
struct folio *folio;
@@ -973,9 +973,9 @@ struct folio *filemap_alloc_folio(gfp_t gfp, unsigned int order)

return folio;
}
- return folio_alloc(gfp, order);
+ return _folio_alloc(gfp, order);
}
-EXPORT_SYMBOL(filemap_alloc_folio);
+EXPORT_SYMBOL(_filemap_alloc_folio);
#endif

/*
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 2068b594dc88..80cd33811641 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -2141,7 +2141,7 @@ static struct page *alloc_pages_preferred_many(gfp_t gfp, unsigned int order,
}

/**
- * vma_alloc_folio - Allocate a folio for a VMA.
+ * _vma_alloc_folio - Allocate a folio for a VMA.
* @gfp: GFP flags.
* @order: Order of the folio.
* @vma: Pointer to VMA or NULL if not available.
@@ -2155,7 +2155,7 @@ static struct page *alloc_pages_preferred_many(gfp_t gfp, unsigned int order,
*
* Return: The folio on success or NULL if allocation fails.
*/
-struct folio *vma_alloc_folio(gfp_t gfp, int order, struct vm_area_struct *vma,
+struct folio *_vma_alloc_folio(gfp_t gfp, int order, struct vm_area_struct *vma,
unsigned long addr, bool hugepage)
{
struct mempolicy *pol;
@@ -2240,10 +2240,10 @@ struct folio *vma_alloc_folio(gfp_t gfp, int order, struct vm_area_struct *vma,
out:
return folio;
}
-EXPORT_SYMBOL(vma_alloc_folio);
+EXPORT_SYMBOL(_vma_alloc_folio);

/**
- * alloc_pages - Allocate pages.
+ * _alloc_pages - Allocate pages.
* @gfp: GFP flags.
* @order: Power of two of number of pages to allocate.
*
@@ -2256,7 +2256,7 @@ EXPORT_SYMBOL(vma_alloc_folio);
* flags are used.
* Return: The page on success or NULL if allocation fails.
*/
-struct page *alloc_pages(gfp_t gfp, unsigned order)
+struct page *_alloc_pages(gfp_t gfp, unsigned int order)
{
struct mempolicy *pol = &default_policy;
struct page *page;
@@ -2274,15 +2274,15 @@ struct page *alloc_pages(gfp_t gfp, unsigned order)
page = alloc_pages_preferred_many(gfp, order,
policy_node(gfp, pol, numa_node_id()), pol);
else
- page = __alloc_pages(gfp, order,
+ page = _alloc_pages2(gfp, order,
policy_node(gfp, pol, numa_node_id()),
policy_nodemask(gfp, pol));

return page;
}
-EXPORT_SYMBOL(alloc_pages);
+EXPORT_SYMBOL(_alloc_pages);

-struct folio *folio_alloc(gfp_t gfp, unsigned order)
+struct folio *_folio_alloc(gfp_t gfp, unsigned int order)
{
struct page *page = alloc_pages(gfp | __GFP_COMP, order);

@@ -2290,7 +2290,7 @@ struct folio *folio_alloc(gfp_t gfp, unsigned order)
prep_transhuge_page(page);
return (struct folio *)page;
}
-EXPORT_SYMBOL(folio_alloc);
+EXPORT_SYMBOL(_folio_alloc);

static unsigned long alloc_pages_bulk_array_interleave(gfp_t gfp,
struct mempolicy *pol, unsigned long nr_pages,
@@ -2309,13 +2309,13 @@ static unsigned long alloc_pages_bulk_array_interleave(gfp_t gfp,

for (i = 0; i < nodes; i++) {
if (delta) {
- nr_allocated = __alloc_pages_bulk(gfp,
+ nr_allocated = _alloc_pages_bulk(gfp,
interleave_nodes(pol), NULL,
nr_pages_per_node + 1, NULL,
page_array);
delta--;
} else {
- nr_allocated = __alloc_pages_bulk(gfp,
+ nr_allocated = _alloc_pages_bulk(gfp,
interleave_nodes(pol), NULL,
nr_pages_per_node, NULL, page_array);
}
@@ -2337,11 +2337,11 @@ static unsigned long alloc_pages_bulk_array_preferred_many(gfp_t gfp, int nid,
preferred_gfp = gfp | __GFP_NOWARN;
preferred_gfp &= ~(__GFP_DIRECT_RECLAIM | __GFP_NOFAIL);

- nr_allocated = __alloc_pages_bulk(preferred_gfp, nid, &pol->nodes,
+ nr_allocated = _alloc_pages_bulk(preferred_gfp, nid, &pol->nodes,
nr_pages, NULL, page_array);

if (nr_allocated < nr_pages)
- nr_allocated += __alloc_pages_bulk(gfp, numa_node_id(), NULL,
+ nr_allocated += _alloc_pages_bulk(gfp, numa_node_id(), NULL,
nr_pages - nr_allocated, NULL,
page_array + nr_allocated);
return nr_allocated;
@@ -2353,7 +2353,7 @@ static unsigned long alloc_pages_bulk_array_preferred_many(gfp_t gfp, int nid,
* It can accelerate memory allocation especially interleaving
* allocate memory.
*/
-unsigned long alloc_pages_bulk_array_mempolicy(gfp_t gfp,
+unsigned long _alloc_pages_bulk_array_mempolicy(gfp_t gfp,
unsigned long nr_pages, struct page **page_array)
{
struct mempolicy *pol = &default_policy;
@@ -2369,7 +2369,7 @@ unsigned long alloc_pages_bulk_array_mempolicy(gfp_t gfp,
return alloc_pages_bulk_array_preferred_many(gfp,
numa_node_id(), pol, nr_pages, page_array);

- return __alloc_pages_bulk(gfp, policy_node(gfp, pol, numa_node_id()),
+ return _alloc_pages_bulk(gfp, policy_node(gfp, pol, numa_node_id()),
policy_nodemask(gfp, pol), nr_pages, NULL,
page_array);
}
diff --git a/mm/mm_init.c b/mm/mm_init.c
index 7f7f9c677854..42135fad4d8a 100644
--- a/mm/mm_init.c
+++ b/mm/mm_init.c
@@ -24,6 +24,7 @@
#include <linux/page_ext.h>
#include <linux/pti.h>
#include <linux/pgtable.h>
+#include <linux/stackdepot.h>
#include <linux/swap.h>
#include <linux/cma.h>
#include "internal.h"
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 9de2a18519a1..edd35500f7f6 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -74,6 +74,7 @@
#include <linux/psi.h>
#include <linux/khugepaged.h>
#include <linux/delayacct.h>
+#include <linux/pgalloc_tag.h>
#include <asm/sections.h>
#include <asm/tlbflush.h>
#include <asm/div64.h>
@@ -657,6 +658,7 @@ static inline bool pcp_allowed_order(unsigned int order)

static inline void free_the_page(struct page *page, unsigned int order)
{
+
if (pcp_allowed_order(order)) /* Via pcp? */
free_unref_page(page, order);
else
@@ -1259,6 +1261,7 @@ static __always_inline bool free_pages_prepare(struct page *page,
__memcg_kmem_uncharge_page(page, order);
reset_page_owner(page, order);
page_table_check_free(page, order);
+ pgalloc_tag_dec(page, order);
return false;
}

@@ -1301,6 +1304,7 @@ static __always_inline bool free_pages_prepare(struct page *page,
page->flags &= ~PAGE_FLAGS_CHECK_AT_PREP;
reset_page_owner(page, order);
page_table_check_free(page, order);
+ pgalloc_tag_dec(page, order);

if (!PageHighMem(page)) {
debug_check_no_locks_freed(page_address(page),
@@ -1669,6 +1673,9 @@ inline void post_alloc_hook(struct page *page, unsigned int order,
bool init = !want_init_on_free() && want_init_on_alloc(gfp_flags) &&
!should_skip_init(gfp_flags);
bool zero_tags = init && (gfp_flags & __GFP_ZEROTAGS);
+#ifdef CONFIG_MEM_ALLOC_PROFILING
+ union codetag_ref *ref;
+#endif
int i;

set_page_private(page, 0);
@@ -1721,6 +1728,14 @@ inline void post_alloc_hook(struct page *page, unsigned int order,

set_page_owner(page, order, gfp_flags);
page_table_check_alloc(page, order);
+
+#ifdef CONFIG_MEM_ALLOC_PROFILING
+ ref = get_page_tag_ref(page);
+ if (ref) {
+ alloc_tag_add(ref, current->alloc_tag, PAGE_SIZE << order);
+ put_page_tag_ref(ref);
+ }
+#endif
}

static void prep_new_page(struct page *page, unsigned int order, gfp_t gfp_flags,
@@ -4568,7 +4583,7 @@ static inline bool prepare_alloc_pages(gfp_t gfp_mask, unsigned int order,
*
* Returns the number of pages on the list or array.
*/
-unsigned long __alloc_pages_bulk(gfp_t gfp, int preferred_nid,
+unsigned long _alloc_pages_bulk(gfp_t gfp, int preferred_nid,
nodemask_t *nodemask, int nr_pages,
struct list_head *page_list,
struct page **page_array)
@@ -4704,7 +4719,7 @@ unsigned long __alloc_pages_bulk(gfp_t gfp, int preferred_nid,
pcp_trylock_finish(UP_flags);

failed:
- page = __alloc_pages(gfp, 0, preferred_nid, nodemask);
+ page = _alloc_pages2(gfp, 0, preferred_nid, nodemask);
if (page) {
if (page_list)
list_add(&page->lru, page_list);
@@ -4715,12 +4730,12 @@ unsigned long __alloc_pages_bulk(gfp_t gfp, int preferred_nid,

goto out;
}
-EXPORT_SYMBOL_GPL(__alloc_pages_bulk);
+EXPORT_SYMBOL_GPL(_alloc_pages_bulk);

/*
* This is the 'heart' of the zoned buddy allocator.
*/
-struct page *__alloc_pages(gfp_t gfp, unsigned int order, int preferred_nid,
+struct page *_alloc_pages2(gfp_t gfp, unsigned int order, int preferred_nid,
nodemask_t *nodemask)
{
struct page *page;
@@ -4783,41 +4798,41 @@ struct page *__alloc_pages(gfp_t gfp, unsigned int order, int preferred_nid,

return page;
}
-EXPORT_SYMBOL(__alloc_pages);
+EXPORT_SYMBOL(_alloc_pages2);

-struct folio *__folio_alloc(gfp_t gfp, unsigned int order, int preferred_nid,
+struct folio *_folio_alloc2(gfp_t gfp, unsigned int order, int preferred_nid,
nodemask_t *nodemask)
{
- struct page *page = __alloc_pages(gfp | __GFP_COMP, order,
+ struct page *page = _alloc_pages2(gfp | __GFP_COMP, order,
preferred_nid, nodemask);

if (page && order > 1)
prep_transhuge_page(page);
return (struct folio *)page;
}
-EXPORT_SYMBOL(__folio_alloc);
+EXPORT_SYMBOL(_folio_alloc2);

/*
* Common helper functions. Never use with __GFP_HIGHMEM because the returned
* address cannot represent highmem pages. Use alloc_pages and then kmap if
* you need to access high mem.
*/
-unsigned long __get_free_pages(gfp_t gfp_mask, unsigned int order)
+unsigned long _get_free_pages(gfp_t gfp_mask, unsigned int order)
{
struct page *page;

- page = alloc_pages(gfp_mask & ~__GFP_HIGHMEM, order);
+ page = _alloc_pages(gfp_mask & ~__GFP_HIGHMEM, order);
if (!page)
return 0;
return (unsigned long) page_address(page);
}
-EXPORT_SYMBOL(__get_free_pages);
+EXPORT_SYMBOL(_get_free_pages);

-unsigned long get_zeroed_page(gfp_t gfp_mask)
+unsigned long _get_zeroed_page(gfp_t gfp_mask)
{
- return __get_free_page(gfp_mask | __GFP_ZERO);
+ return _get_free_pages(gfp_mask | __GFP_ZERO, 0);
}
-EXPORT_SYMBOL(get_zeroed_page);
+EXPORT_SYMBOL(_get_zeroed_page);

/**
* __free_pages - Free pages allocated with alloc_pages().
@@ -5009,7 +5024,7 @@ static void *make_alloc_exact(unsigned long addr, unsigned int order,
}

/**
- * alloc_pages_exact - allocate an exact number physically-contiguous pages.
+ * _alloc_pages_exact - allocate an exact number physically-contiguous pages.
* @size: the number of bytes to allocate
* @gfp_mask: GFP flags for the allocation, must not contain __GFP_COMP
*
@@ -5023,7 +5038,7 @@ static void *make_alloc_exact(unsigned long addr, unsigned int order,
*
* Return: pointer to the allocated area or %NULL in case of error.
*/
-void *alloc_pages_exact(size_t size, gfp_t gfp_mask)
+void *_alloc_pages_exact(size_t size, gfp_t gfp_mask)
{
unsigned int order = get_order(size);
unsigned long addr;
@@ -5031,13 +5046,13 @@ void *alloc_pages_exact(size_t size, gfp_t gfp_mask)
if (WARN_ON_ONCE(gfp_mask & (__GFP_COMP | __GFP_HIGHMEM)))
gfp_mask &= ~(__GFP_COMP | __GFP_HIGHMEM);

- addr = __get_free_pages(gfp_mask, order);
+ addr = _get_free_pages(gfp_mask, order);
return make_alloc_exact(addr, order, size);
}
-EXPORT_SYMBOL(alloc_pages_exact);
+EXPORT_SYMBOL(_alloc_pages_exact);

/**
- * alloc_pages_exact_nid - allocate an exact number of physically-contiguous
+ * _alloc_pages_exact_nid - allocate an exact number of physically-contiguous
* pages on a node.
* @nid: the preferred node ID where memory should be allocated
* @size: the number of bytes to allocate
@@ -5048,7 +5063,7 @@ EXPORT_SYMBOL(alloc_pages_exact);
*
* Return: pointer to the allocated area or %NULL in case of error.
*/
-void * __meminit alloc_pages_exact_nid(int nid, size_t size, gfp_t gfp_mask)
+void * __meminit _alloc_pages_exact_nid(int nid, size_t size, gfp_t gfp_mask)
{
unsigned int order = get_order(size);
struct page *p;
@@ -5056,7 +5071,7 @@ void * __meminit alloc_pages_exact_nid(int nid, size_t size, gfp_t gfp_mask)
if (WARN_ON_ONCE(gfp_mask & (__GFP_COMP | __GFP_HIGHMEM)))
gfp_mask &= ~(__GFP_COMP | __GFP_HIGHMEM);

- p = alloc_pages_node(nid, gfp_mask, order);
+ p = _alloc_pages_node(nid, gfp_mask, order);
if (!p)
return NULL;
return make_alloc_exact((unsigned long)page_address(p), order, size);
@@ -6729,7 +6744,7 @@ int __alloc_contig_migrate_range(struct compact_control *cc,
}

/**
- * alloc_contig_range() -- tries to allocate given range of pages
+ * _alloc_contig_range() -- tries to allocate given range of pages
* @start: start PFN to allocate
* @end: one-past-the-last PFN to allocate
* @migratetype: migratetype of the underlying pageblocks (either
@@ -6749,7 +6764,7 @@ int __alloc_contig_migrate_range(struct compact_control *cc,
* pages which PFN is in [start, end) are allocated for the caller and
* need to be freed with free_contig_range().
*/
-int alloc_contig_range(unsigned long start, unsigned long end,
+int _alloc_contig_range(unsigned long start, unsigned long end,
unsigned migratetype, gfp_t gfp_mask)
{
unsigned long outer_start, outer_end;
@@ -6873,15 +6888,15 @@ int alloc_contig_range(unsigned long start, unsigned long end,
undo_isolate_page_range(start, end, migratetype);
return ret;
}
-EXPORT_SYMBOL(alloc_contig_range);
+EXPORT_SYMBOL(_alloc_contig_range);

static int __alloc_contig_pages(unsigned long start_pfn,
unsigned long nr_pages, gfp_t gfp_mask)
{
unsigned long end_pfn = start_pfn + nr_pages;

- return alloc_contig_range(start_pfn, end_pfn, MIGRATE_MOVABLE,
- gfp_mask);
+ return _alloc_contig_range(start_pfn, end_pfn, MIGRATE_MOVABLE,
+ gfp_mask);
}

static bool pfn_range_valid_contig(struct zone *z, unsigned long start_pfn,
@@ -6916,7 +6931,7 @@ static bool zone_spans_last_pfn(const struct zone *zone,
}

/**
- * alloc_contig_pages() -- tries to find and allocate contiguous range of pages
+ * _alloc_contig_pages() -- tries to find and allocate contiguous range of pages
* @nr_pages: Number of contiguous pages to allocate
* @gfp_mask: GFP mask to limit search and used during compaction
* @nid: Target node
@@ -6936,8 +6951,8 @@ static bool zone_spans_last_pfn(const struct zone *zone,
*
* Return: pointer to contiguous pages on success, or NULL if not successful.
*/
-struct page *alloc_contig_pages(unsigned long nr_pages, gfp_t gfp_mask,
- int nid, nodemask_t *nodemask)
+struct page *_alloc_contig_pages(unsigned long nr_pages, gfp_t gfp_mask,
+ int nid, nodemask_t *nodemask)
{
unsigned long ret, pfn, flags;
struct zonelist *zonelist;
--
2.40.1.495.gc816e09b53d-goog

Suren Baghdasaryan

May 1, 2023, 12:55:57 PM
For all page allocations to be tagged, page_ext has to be initialized
before the first page allocation. Early tasks allocate their stacks
using the page allocator before alloc_node_page_ext() initializes the
page_ext area, unless early_page_ext is enabled. These allocations
therefore generate a warning when CONFIG_MEM_ALLOC_PROFILING_DEBUG is
enabled. Enable early_page_ext whenever CONFIG_MEM_ALLOC_PROFILING_DEBUG=y
to ensure page_ext is initialized prior to any page allocation. This
has all the negative effects associated with early_page_ext, such as a
possibly longer boot time, so we enable it only when debugging with
CONFIG_MEM_ALLOC_PROFILING_DEBUG and not universally for
CONFIG_MEM_ALLOC_PROFILING.
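
The ordering problem can be shown with a small userspace model. This is a sketch under stated assumptions, not the kernel's page_ext code: model_ext_get() stands in for page_ext_get(), model_ext_init() for the point where the page_ext area becomes available, and all model_* names are hypothetical.

```c
/* Userspace model: tagging fails for pages allocated before init. */
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MODEL_NR_PAGES 8

struct model_page_ext {
	const char *tag;	/* call-site label, NULL if untagged */
};

static struct model_page_ext model_ext[MODEL_NR_PAGES];
static bool model_ext_ready;		/* set once the extension area exists */
static unsigned int model_untagged;	/* allocations that missed tagging */

/* No slot can be handed out until the area has been initialized. */
static struct model_page_ext *model_ext_get(unsigned int pfn)
{
	assert(pfn < MODEL_NR_PAGES);
	return model_ext_ready ? &model_ext[pfn] : NULL;
}

/* Marks the extension area as available. */
static void model_ext_init(void)
{
	model_ext_ready = true;
}

/* Record the tag if a slot exists; otherwise count the miss. */
static bool model_tag_page(unsigned int pfn, const char *tag)
{
	struct model_page_ext *ext = model_ext_get(pfn);

	if (!ext) {
		model_untagged++;
		return false;
	}
	ext->tag = tag;
	return true;
}
```

In this model, a page tagged before model_ext_init() is counted in model_untagged and stays untagged, which corresponds to the missing-tag warning the debug option would report. Running the initialization first avoids the miss, at the cost of doing that work earlier.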

Signed-off-by: Suren Baghdasaryan <sur...@google.com>
---
mm/page_ext.c | 9 +++++++++
1 file changed, 9 insertions(+)

diff --git a/mm/page_ext.c b/mm/page_ext.c
index eaf054ec276c..55ba797f8881 100644
--- a/mm/page_ext.c
+++ b/mm/page_ext.c
@@ -96,7 +96,16 @@ unsigned long page_ext_size;
static unsigned long total_usage;
struct page_ext *lookup_page_ext(const struct page *page);

+#ifdef CONFIG_MEM_ALLOC_PROFILING_DEBUG
+/*
+ * To ensure correct allocation tagging for pages, page_ext should be available
+ * before the first page allocation. Otherwise early task stacks will be
+ * allocated before page_ext initialization and missing tags will be flagged.
+ */
+bool early_page_ext __meminitdata = true;
+#else
bool early_page_ext __meminitdata;
+#endif
static int __init setup_early_page_ext(char *str)
{
early_page_ext = true;
--
2.40.1.495.gc816e09b53d-goog

Suren Baghdasaryan

May 1, 2023, 12:55:59 PM
When a high-order page is split into smaller ones, each newly split
page should get its own codetag reference. The original codetag is reused
for these pages, but each new reference is recorded as a 0-byte allocation
because the original codetag already accounts for the whole high-order page.

Signed-off-by: Suren Baghdasaryan <sur...@google.com>
---
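For readers following the accounting, here is a minimal userspace sketch of
the split step described in the commit message. It is not the kernel code:
struct alloc_tag_sketch, struct page_ref_sketch, tag_add() and tag_split()
are hypothetical stand-ins for alloc_tag, codetag_ref, alloc_tag_add() and
pgalloc_tag_split(), and the refs counter exists only to make the effect
observable.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for struct alloc_tag; not the kernel definition. */
struct alloc_tag_sketch {
	size_t bytes;	/* bytes currently attributed to this call site */
	size_t refs;	/* references pointing at this tag (sketch only) */
};

/* Hypothetical stand-in for the per-page codetag reference. */
struct page_ref_sketch {
	struct alloc_tag_sketch *tag;	/* NULL when the page is untagged */
};

/* Attach a tag to one reference and account `bytes` against it. */
static void tag_add(struct page_ref_sketch *ref, struct alloc_tag_sketch *tag,
		    size_t bytes)
{
	assert(ref->tag == NULL);	/* a reference is tagged at most once */
	ref->tag = tag;
	tag->bytes += bytes;
	tag->refs++;
}

/*
 * Split step: refs[0] (the head page) keeps its tag and its byte count;
 * refs[1..nr-1] receive the same tag with 0 bytes accounted, so the
 * total attributed to the call site does not change.
 */
static void tag_split(struct page_ref_sketch *refs, unsigned int nr)
{
	struct alloc_tag_sketch *tag = refs[0].tag;
	unsigned int i;

	if (!tag)
		return;		/* untagged allocation, nothing to propagate */

	for (i = 1; i < nr; i++)
		tag_add(&refs[i], tag, 0);
}
```

With a 4-page allocation of 16384 bytes tagged on refs[0], calling
tag_split(refs, 4) leaves the tag's byte total at 16384 while all four
references point at the same tag.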
include/linux/pgalloc_tag.h | 30 ++++++++++++++++++++++++++++++
mm/huge_memory.c | 2 ++
mm/page_alloc.c | 2 ++
3 files changed, 34 insertions(+)

diff --git a/include/linux/pgalloc_tag.h b/include/linux/pgalloc_tag.h
index 567327c1c46f..0cbba13869b5 100644
--- a/include/linux/pgalloc_tag.h
+++ b/include/linux/pgalloc_tag.h
@@ -52,11 +52,41 @@ static inline void pgalloc_tag_dec(struct page *page, unsigned int order)
}
}

+static inline void pgalloc_tag_split(struct page *page, unsigned int nr)
+{
+ int i;
+ struct page_ext *page_ext;
+ union codetag_ref *ref;
+ struct alloc_tag *tag;
+
+ if (!mem_alloc_profiling_enabled())
+ return;
+
+ page_ext = page_ext_get(page);
+ if (unlikely(!page_ext))
+ return;
+
+ ref = codetag_ref_from_page_ext(page_ext);
+ if (!ref->ct)
+ goto out;
+
+ tag = ct_to_alloc_tag(ref->ct);
+ page_ext = page_ext_next(page_ext);
+ for (i = 1; i < nr; i++) {
+ /* New reference with 0 bytes accounted */
+ alloc_tag_add(codetag_ref_from_page_ext(page_ext), tag, 0);
+ page_ext = page_ext_next(page_ext);
+ }
+out:
+ page_ext_put(page_ext);
+}
+
#else /* CONFIG_MEM_ALLOC_PROFILING */

static inline union codetag_ref *get_page_tag_ref(struct page *page) { return NULL; }
static inline void put_page_tag_ref(union codetag_ref *ref) {}
#define pgalloc_tag_dec(__page, __size) do {} while (0)
+static inline void pgalloc_tag_split(struct page *page, unsigned int nr) {}

#endif /* CONFIG_MEM_ALLOC_PROFILING */

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 624671aaa60d..221cce0052a2 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -37,6 +37,7 @@
#include <linux/page_owner.h>
#include <linux/sched/sysctl.h>
#include <linux/memory-tiers.h>
+#include <linux/pgalloc_tag.h>

#include <asm/tlb.h>
#include <asm/pgalloc.h>
@@ -2557,6 +2558,7 @@ static void __split_huge_page(struct page *page, struct list_head *list,
/* Caller disabled irqs, so they are still disabled here */

split_page_owner(head, nr);
+ pgalloc_tag_split(head, nr);

/* See comment in __split_huge_page_tail() */
if (PageAnon(head)) {
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index edd35500f7f6..8cf5a835af7f 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2796,6 +2796,7 @@ void split_page(struct page *page, unsigned int order)
for (i = 1; i < (1 << order); i++)
set_page_refcounted(page + i);
split_page_owner(page, 1 << order);
+ pgalloc_tag_split(page, 1 << order);
split_page_memcg(page, 1 << order);
}
EXPORT_SYMBOL_GPL(split_page);
@@ -5012,6 +5013,7 @@ static void *make_alloc_exact(unsigned long addr, unsigned int order,
struct page *last = page + nr;

split_page_owner(page, 1 << order);
+ pgalloc_tag_split(page, 1 << order);
split_page_memcg(page, 1 << order);
while (page < --last)
set_page_refcounted(last);
--
2.40.1.495.gc816e09b53d-goog

Suren Baghdasaryan

May 1, 2023, 12:56:01 PM
To store a code tag for every slab object, embed a codetag reference
in slabobj_ext when CONFIG_MEM_ALLOC_PROFILING=y.

Signed-off-by: Suren Baghdasaryan <sur...@google.com>
Co-developed-by: Kent Overstreet <kent.ov...@linux.dev>
Signed-off-by: Kent Overstreet <kent.ov...@linux.dev>
---
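A small userspace sketch of the resulting layout may help when reading the
diff. All names ending in _sketch, the two SKETCH_* switches and
obj_exts_bytes() are hypothetical; the union is assumed to be pointer-sized
here, which the patch itself does not state.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical switches standing in for the two Kconfig options. */
#define SKETCH_MEMCG_KMEM		1
#define SKETCH_MEM_ALLOC_PROFILING	1

struct obj_cgroup_sketch;	/* opaque, only pointed to */
struct codetag_sketch;		/* opaque, only pointed to */

/* Stand-in for union codetag_ref; assumed pointer-sized in this sketch. */
union codetag_ref_sketch {
	struct codetag_sketch *ct;
};

/* One entry per slab object; each user contributes its own field. */
struct slabobj_ext_sketch {
#if SKETCH_MEMCG_KMEM
	struct obj_cgroup_sketch *objcg;
#endif
#if SKETCH_MEM_ALLOC_PROFILING
	union codetag_ref_sketch ref;
#endif
} __attribute__((aligned(8)));

/* The 8-byte alignment keeps the entry size a multiple of 8. */
static_assert(sizeof(struct slabobj_ext_sketch) % 8 == 0,
	      "extension entries must stay 8-byte aligned");

/* Bytes of extension metadata needed for a slab holding `objects` objects. */
static size_t obj_exts_bytes(unsigned int objects)
{
	return (size_t)objects * sizeof(struct slabobj_ext_sketch);
}
```

The point of the sketch is that the per-object cost grows by one reference
only when profiling is configured in, and that the memcg field and the
codetag reference are independent of each other.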
include/linux/memcontrol.h | 5 +++++
lib/Kconfig.debug | 1 +
mm/slab.h | 4 ++++
3 files changed, 10 insertions(+)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 5e2da63c525f..c7f21b15b540 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -1626,7 +1626,12 @@ unsigned long mem_cgroup_soft_limit_reclaim(pg_data_t *pgdat, int order,
* if MEMCG_DATA_OBJEXTS is set.
*/
struct slabobj_ext {
+#ifdef CONFIG_MEMCG_KMEM
struct obj_cgroup *objcg;
+#endif
+#ifdef CONFIG_MEM_ALLOC_PROFILING
+ union codetag_ref ref;
+#endif
} __aligned(8);

static inline void __inc_lruvec_kmem_state(void *p, enum node_stat_item idx)
diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug
index d3aa5ee0bf0d..4157c2251b07 100644
--- a/lib/Kconfig.debug
+++ b/lib/Kconfig.debug
@@ -968,6 +968,7 @@ config MEM_ALLOC_PROFILING
select CODE_TAGGING
select LAZY_PERCPU_COUNTER
select PAGE_EXTENSION
+ select SLAB_OBJ_EXT
help
Track allocation source code and record total allocation size
initiated at that code location. The mechanism can be used to track
diff --git a/mm/slab.h b/mm/slab.h
index bec202bdcfb8..f953e7c81e98 100644
--- a/mm/slab.h
+++ b/mm/slab.h
@@ -418,6 +418,10 @@ int alloc_slab_obj_exts(struct slab *slab, struct kmem_cache *s,

static inline bool need_slab_obj_ext(void)
{
+#ifdef CONFIG_MEM_ALLOC_PROFILING
+ if (mem_alloc_profiling_enabled())
+ return true;
+#endif
/*
* CONFIG_MEMCG_KMEM creates vector of obj_cgroup objects conditionally
* inside memcg_slab_post_alloc_hook. No other users for now.
--
2.40.1.495.gc816e09b53d-goog

Suren Baghdasaryan

May 1, 2023, 12:56:03 PM
Account slab allocations using the codetag reference embedded in slabobj_ext.

Signed-off-by: Suren Baghdasaryan <sur...@google.com>
Co-developed-by: Kent Overstreet <kent.ov...@linux.dev>
Signed-off-by: Kent Overstreet <kent.ov...@linux.dev>
---
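The following userspace sketch illustrates the accounting pattern used by
the two hooks in this patch: charge the slab's object size to a tag on
allocation, and subtract the same amount on free after locating the
per-object entry by index. Every name here is hypothetical, the index is
computed with a plain division rather than reciprocal_divide(), and clearing
the reference on free is an assumption of the sketch.

```c
#include <assert.h>
#include <stddef.h>

#define OBJS_PER_SLAB 8

struct alloc_tag_sketch { size_t bytes; };
struct slabobj_ext_sketch { struct alloc_tag_sketch *tag; };

struct slab_sketch {
	char *base;					/* first object */
	size_t obj_size;				/* plays the role of s->size */
	struct slabobj_ext_sketch exts[OBJS_PER_SLAB];	/* one entry per object */
};

/* Map an object address to its slot in the extension vector. */
static unsigned int obj_index(const struct slab_sketch *slab, const void *obj)
{
	size_t idx = (size_t)((const char *)obj - slab->base) / slab->obj_size;

	assert(idx < OBJS_PER_SLAB);
	return (unsigned int)idx;
}

/* Allocation side: remember the tag and charge one object to it. */
static void post_alloc_hook(struct slab_sketch *slab, void *obj,
			    struct alloc_tag_sketch *tag)
{
	struct slabobj_ext_sketch *ext = &slab->exts[obj_index(slab, obj)];

	ext->tag = tag;
	tag->bytes += slab->obj_size;
}

/* Free side: uncharge each object from whatever tag it was charged to. */
static void free_hook(struct slab_sketch *slab, void **p, int objects)
{
	int i;

	for (i = 0; i < objects; i++) {
		struct slabobj_ext_sketch *ext = &slab->exts[obj_index(slab, p[i])];

		if (!ext->tag)
			continue;	/* object was never tagged */
		ext->tag->bytes -= slab->obj_size;
		ext->tag = NULL;	/* assumption: the reference is cleared */
	}
}
```

Because both sides use the same per-cache size, a tag's byte count returns
to its previous value once every object charged to it has been freed.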
include/linux/slab_def.h | 2 +-
include/linux/slub_def.h | 4 ++--
mm/slab.c | 4 +++-
mm/slab.h | 35 +++++++++++++++++++++++++++++++++++
4 files changed, 41 insertions(+), 4 deletions(-)

diff --git a/include/linux/slab_def.h b/include/linux/slab_def.h
index a61e7d55d0d3..23f14dcb8d5b 100644
--- a/include/linux/slab_def.h
+++ b/include/linux/slab_def.h
@@ -107,7 +107,7 @@ static inline void *nearest_obj(struct kmem_cache *cache, const struct slab *sla
* reciprocal_divide(offset, cache->reciprocal_buffer_size)
*/
static inline unsigned int obj_to_index(const struct kmem_cache *cache,
- const struct slab *slab, void *obj)
+ const struct slab *slab, const void *obj)
{
u32 offset = (obj - slab->s_mem);
return reciprocal_divide(offset, cache->reciprocal_buffer_size);
diff --git a/include/linux/slub_def.h b/include/linux/slub_def.h
index f6df03f934e5..e8be5b368857 100644
--- a/include/linux/slub_def.h
+++ b/include/linux/slub_def.h
@@ -176,14 +176,14 @@ static inline void *nearest_obj(struct kmem_cache *cache, const struct slab *sla

/* Determine object index from a given position */
static inline unsigned int __obj_to_index(const struct kmem_cache *cache,
- void *addr, void *obj)
+ void *addr, const void *obj)
{
return reciprocal_divide(kasan_reset_tag(obj) - addr,
cache->reciprocal_size);
}

static inline unsigned int obj_to_index(const struct kmem_cache *cache,
- const struct slab *slab, void *obj)
+ const struct slab *slab, const void *obj)
{
if (is_kfence_address(obj))
return 0;
diff --git a/mm/slab.c b/mm/slab.c
index ccc76f7455e9..026f0c08708a 100644
--- a/mm/slab.c
+++ b/mm/slab.c
@@ -3367,9 +3367,11 @@ static void cache_flusharray(struct kmem_cache *cachep, struct array_cache *ac)
static __always_inline void __cache_free(struct kmem_cache *cachep, void *objp,
unsigned long caller)
{
+ struct slab *slab = virt_to_slab(objp);
bool init;

- memcg_slab_free_hook(cachep, virt_to_slab(objp), &objp, 1);
+ memcg_slab_free_hook(cachep, slab, &objp, 1);
+ alloc_tagging_slab_free_hook(cachep, slab, &objp, 1);

if (is_kfence_address(objp)) {
kmemleak_free_recursive(objp, cachep->flags);
diff --git a/mm/slab.h b/mm/slab.h
index f953e7c81e98..f9442d3a10b2 100644
--- a/mm/slab.h
+++ b/mm/slab.h
@@ -494,6 +494,35 @@ prepare_slab_obj_exts_hook(struct kmem_cache *s, gfp_t flags, void *p)

#endif /* CONFIG_SLAB_OBJ_EXT */

+#ifdef CONFIG_MEM_ALLOC_PROFILING
+
+static inline void alloc_tagging_slab_free_hook(struct kmem_cache *s, struct slab *slab,
+ void **p, int objects)
+{
+ struct slabobj_ext *obj_exts;
+ int i;
+
+ if (!mem_alloc_profiling_enabled())
+ return;
+
+ obj_exts = slab_obj_exts(slab);
+ if (!obj_exts)
+ return;
+
+ for (i = 0; i < objects; i++) {
+ unsigned int off = obj_to_index(s, slab, p[i]);
+
+ alloc_tag_sub(&obj_exts[off].ref, s->size);
+ }
+}
+
+#else
+
+static inline void alloc_tagging_slab_free_hook(struct kmem_cache *s, struct slab *slab,
+ void **p, int objects) {}
+
+#endif /* CONFIG_MEM_ALLOC_PROFILING */
+
#ifdef CONFIG_MEMCG_KMEM
void mod_objcg_state(struct obj_cgroup *objcg, struct pglist_data *pgdat,
enum node_stat_item idx, int nr);
@@ -776,6 +805,12 @@ static inline void slab_post_alloc_hook(struct kmem_cache *s,
s->flags, flags);
kmsan_slab_alloc(s, p[i], flags);
obj_exts = prepare_slab_obj_exts_hook(s, flags, p[i]);
+
+#ifdef CONFIG_MEM_ALLOC_PROFILING
+ /* obj_exts can be allocated for other reasons */
+ if (likely(obj_exts) && mem_alloc_profiling_enabled())
+ alloc_tag_add(&obj_exts->ref, current->alloc_tag, s->size);
+#endif
}

memcg_slab_post_alloc_hook(s, objcg, flags, size, p);
--
2.40.1.495.gc816e09b53d-goog

Suren Baghdasaryan

May 1, 2023, 12:56:06 PM
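Redefine kmalloc, krealloc, kzalloc, kcalloc, etc. as macros that wrap the
underscore-prefixed implementations in alloc_hooks(), so that allocations
made through them are recorded; the matching deallocations are accounted
in the slab free path.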

Signed-off-by: Suren Baghdasaryan <sur...@google.com>
Co-developed-by: Kent Overstreet <kent.ov...@linux.dev>
Signed-off-by: Kent Overstreet <kent.ov...@linux.dev>
---
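A userspace sketch of the wrapping pattern may make the diff easier to
follow. It is only an illustration under stated assumptions: the real
alloc_hooks() is defined elsewhere in this series and takes a result type
and a fallback value, which this sketch drops. sketch_alloc_hooks(),
raw_sketch_alloc(), current_tag and last_tag are hypothetical names, and
the statement expression is a GNU C extension.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical per-call-site record. */
struct alloc_tag_sketch {
	const char *file;
	int line;
	size_t bytes;
};

static struct alloc_tag_sketch *current_tag;	/* stand-in for current->alloc_tag */
static struct alloc_tag_sketch *last_tag;	/* last tag charged, for inspection */

/* The renamed implementation: charges whichever tag is current. */
static void *raw_sketch_alloc(size_t size)
{
	void *p;

	assert(size != 0);	/* keep the sketch away from malloc(0) */
	p = malloc(size);
	if (p && current_tag) {
		current_tag->bytes += size;
		last_tag = current_tag;
	}
	return p;
}

/*
 * Each expansion defines its own static tag, publishes it for the
 * duration of the call, then restores the previous one.
 */
#define sketch_alloc_hooks(_do_alloc)					\
({									\
	static struct alloc_tag_sketch _tag = { __FILE__, __LINE__, 0 };\
	struct alloc_tag_sketch *_old = current_tag;			\
	__typeof__(_do_alloc) _res;					\
									\
	current_tag = &_tag;						\
	_res = (_do_alloc);						\
	current_tag = _old;						\
	_res;								\
})

/* The public name becomes a macro, so the tag lives at the caller. */
#define sketch_alloc(_size) sketch_alloc_hooks(raw_sketch_alloc(_size))

/* An inline function wrapper would share one tag among all its callers. */
static inline void *wrapped_alloc(size_t size)
{
	return sketch_alloc(size);
}
```

Two direct uses of sketch_alloc() are charged to two distinct tags, while
every call through wrapped_alloc() lands on the single tag inside the
wrapper. That is presumably why helpers such as kzalloc and kcalloc become
macros in this patch instead of remaining inline functions.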
include/linux/slab.h | 175 ++++++++++++++++++++++---------------------
mm/slab.c | 16 ++--
mm/slab_common.c | 22 +++---
mm/slub.c | 17 +++--
mm/util.c | 10 +--
5 files changed, 124 insertions(+), 116 deletions(-)

diff --git a/include/linux/slab.h b/include/linux/slab.h
index 99a146f3cedf..43c922524081 100644
--- a/include/linux/slab.h
+++ b/include/linux/slab.h
@@ -213,7 +213,10 @@ int kmem_cache_shrink(struct kmem_cache *s);
/*
* Common kmalloc functions provided by all allocators
*/
-void * __must_check krealloc(const void *objp, size_t new_size, gfp_t flags) __realloc_size(2);
+void * __must_check _krealloc(const void *objp, size_t new_size, gfp_t flags) __realloc_size(2);
+#define krealloc(_p, _size, _flags) \
+ alloc_hooks(_krealloc(_p, _size, _flags), void*, NULL)
+
void kfree(const void *objp);
void kfree_sensitive(const void *objp);
size_t __ksize(const void *objp);
@@ -451,6 +454,8 @@ static __always_inline unsigned int __kmalloc_index(size_t size,
static_assert(PAGE_SHIFT <= 20);
#define kmalloc_index(s) __kmalloc_index(s, true)

+#include <linux/alloc_tag.h>
+
void *__kmalloc(size_t size, gfp_t flags) __assume_kmalloc_alignment __alloc_size(1);

/**
@@ -463,9 +468,15 @@ void *__kmalloc(size_t size, gfp_t flags) __assume_kmalloc_alignment __alloc_siz
*
* Return: pointer to the new object or %NULL in case of error
*/
-void *kmem_cache_alloc(struct kmem_cache *cachep, gfp_t flags) __assume_slab_alignment __malloc;
-void *kmem_cache_alloc_lru(struct kmem_cache *s, struct list_lru *lru,
- gfp_t gfpflags) __assume_slab_alignment __malloc;
+void *_kmem_cache_alloc(struct kmem_cache *cachep, gfp_t flags) __assume_slab_alignment __malloc;
+#define kmem_cache_alloc(_s, _flags) \
+ alloc_hooks(_kmem_cache_alloc(_s, _flags), void*, NULL)
+
+void *_kmem_cache_alloc_lru(struct kmem_cache *s, struct list_lru *lru,
+ gfp_t gfpflags) __assume_slab_alignment __malloc;
+#define kmem_cache_alloc_lru(_s, _lru, _flags) \
+ alloc_hooks(_kmem_cache_alloc_lru(_s, _lru, _flags), void*, NULL)
+
void kmem_cache_free(struct kmem_cache *s, void *objp);

/*
@@ -476,7 +487,9 @@ void kmem_cache_free(struct kmem_cache *s, void *objp);
* Note that interrupts must be enabled when calling these functions.
*/
void kmem_cache_free_bulk(struct kmem_cache *s, size_t size, void **p);
-int kmem_cache_alloc_bulk(struct kmem_cache *s, gfp_t flags, size_t size, void **p);
+int _kmem_cache_alloc_bulk(struct kmem_cache *s, gfp_t flags, size_t size, void **p);
+#define kmem_cache_alloc_bulk(_s, _flags, _size, _p) \
+ alloc_hooks(_kmem_cache_alloc_bulk(_s, _flags, _size, _p), int, 0)

static __always_inline void kfree_bulk(size_t size, void **p)
{
@@ -485,20 +498,32 @@ static __always_inline void kfree_bulk(size_t size, void **p)

void *__kmalloc_node(size_t size, gfp_t flags, int node) __assume_kmalloc_alignment
__alloc_size(1);
-void *kmem_cache_alloc_node(struct kmem_cache *s, gfp_t flags, int node) __assume_slab_alignment
- __malloc;
+void *_kmem_cache_alloc_node(struct kmem_cache *s, gfp_t flags, int node) __assume_slab_alignment
+ __malloc;
+#define kmem_cache_alloc_node(_s, _flags, _node) \
+ alloc_hooks(_kmem_cache_alloc_node(_s, _flags, _node), void*, NULL)

-void *kmalloc_trace(struct kmem_cache *s, gfp_t flags, size_t size)
+void *_kmalloc_trace(struct kmem_cache *s, gfp_t flags, size_t size)
__assume_kmalloc_alignment __alloc_size(3);

-void *kmalloc_node_trace(struct kmem_cache *s, gfp_t gfpflags,
+void *_kmalloc_node_trace(struct kmem_cache *s, gfp_t gfpflags,
int node, size_t size) __assume_kmalloc_alignment
__alloc_size(4);
-void *kmalloc_large(size_t size, gfp_t flags) __assume_page_alignment
+#define kmalloc_trace(_s, _flags, _size) \
+ alloc_hooks(_kmalloc_trace(_s, _flags, _size), void*, NULL)
+
+#define kmalloc_node_trace(_s, _gfpflags, _node, _size) \
+ alloc_hooks(_kmalloc_node_trace(_s, _gfpflags, _node, _size), void*, NULL)
+
+void *_kmalloc_large(size_t size, gfp_t flags) __assume_page_alignment
__alloc_size(1);
+#define kmalloc_large(_size, _flags) \
+ alloc_hooks(_kmalloc_large(_size, _flags), void*, NULL)

-void *kmalloc_large_node(size_t size, gfp_t flags, int node) __assume_page_alignment
+void *_kmalloc_large_node(size_t size, gfp_t flags, int node) __assume_page_alignment
__alloc_size(1);
+#define kmalloc_large_node(_size, _flags, _node) \
+ alloc_hooks(_kmalloc_large_node(_size, _flags, _node), void*, NULL)

/**
* kmalloc - allocate kernel memory
@@ -554,37 +579,40 @@ void *kmalloc_large_node(size_t size, gfp_t flags, int node) __assume_page_align
* Try really hard to succeed the allocation but fail
* eventually.
*/
-static __always_inline __alloc_size(1) void *kmalloc(size_t size, gfp_t flags)
+static __always_inline __alloc_size(1) void *_kmalloc(size_t size, gfp_t flags)
{
if (__builtin_constant_p(size) && size) {
unsigned int index;

if (size > KMALLOC_MAX_CACHE_SIZE)
- return kmalloc_large(size, flags);
+ return _kmalloc_large(size, flags);

index = kmalloc_index(size);
- return kmalloc_trace(
+ return _kmalloc_trace(
kmalloc_caches[kmalloc_type(flags)][index],
flags, size);
}
return __kmalloc(size, flags);
}
+#define kmalloc(_size, _flags) alloc_hooks(_kmalloc(_size, _flags), void*, NULL)

-static __always_inline __alloc_size(1) void *kmalloc_node(size_t size, gfp_t flags, int node)
+static __always_inline __alloc_size(1) void *_kmalloc_node(size_t size, gfp_t flags, int node)
{
if (__builtin_constant_p(size) && size) {
unsigned int index;

if (size > KMALLOC_MAX_CACHE_SIZE)
- return kmalloc_large_node(size, flags, node);
+ return _kmalloc_large_node(size, flags, node);

index = kmalloc_index(size);
- return kmalloc_node_trace(
+ return _kmalloc_node_trace(
kmalloc_caches[kmalloc_type(flags)][index],
flags, node, size);
}
return __kmalloc_node(size, flags, node);
}
+#define kmalloc_node(_size, _flags, _node) \
+ alloc_hooks(_kmalloc_node(_size, _flags, _node), void*, NULL)

/**
* kmalloc_array - allocate memory for an array.
@@ -592,16 +620,18 @@ static __always_inline __alloc_size(1) void *kmalloc_node(size_t size, gfp_t fla
* @size: element size.
* @flags: the type of memory to allocate (see kmalloc).
*/
-static inline __alloc_size(1, 2) void *kmalloc_array(size_t n, size_t size, gfp_t flags)
+static inline __alloc_size(1, 2) void *_kmalloc_array(size_t n, size_t size, gfp_t flags)
{
size_t bytes;

if (unlikely(check_mul_overflow(n, size, &bytes)))
return NULL;
if (__builtin_constant_p(n) && __builtin_constant_p(size))
- return kmalloc(bytes, flags);
- return __kmalloc(bytes, flags);
+ return _kmalloc(bytes, flags);
+ return _kmalloc(bytes, flags);
}
+#define kmalloc_array(_n, _size, _flags) \
+ alloc_hooks(_kmalloc_array(_n, _size, _flags), void*, NULL)

/**
* krealloc_array - reallocate memory for an array.
@@ -610,18 +640,20 @@ static inline __alloc_size(1, 2) void *kmalloc_array(size_t n, size_t size, gfp_
* @new_size: new size of a single member of the array
* @flags: the type of memory to allocate (see kmalloc)
*/
-static inline __realloc_size(2, 3) void * __must_check krealloc_array(void *p,
- size_t new_n,
- size_t new_size,
- gfp_t flags)
+static inline __realloc_size(2, 3) void * __must_check _krealloc_array(void *p,
+ size_t new_n,
+ size_t new_size,
+ gfp_t flags)
{
size_t bytes;

if (unlikely(check_mul_overflow(new_n, new_size, &bytes)))
return NULL;

- return krealloc(p, bytes, flags);
+ return _krealloc(p, bytes, flags);
}
+#define krealloc_array(_p, _n, _size, _flags) \
+ alloc_hooks(_krealloc_array(_p, _n, _size, _flags), void*, NULL)

/**
* kcalloc - allocate memory for an array. The memory is set to zero.
@@ -629,16 +661,14 @@ static inline __realloc_size(2, 3) void * __must_check krealloc_array(void *p,
* @size: element size.
* @flags: the type of memory to allocate (see kmalloc).
*/
-static inline __alloc_size(1, 2) void *kcalloc(size_t n, size_t size, gfp_t flags)
-{
- return kmalloc_array(n, size, flags | __GFP_ZERO);
-}
+#define kcalloc(_n, _size, _flags) \
+ kmalloc_array(_n, _size, (_flags) | __GFP_ZERO)

void *__kmalloc_node_track_caller(size_t size, gfp_t flags, int node,
unsigned long caller) __alloc_size(1);
#define kmalloc_node_track_caller(size, flags, node) \
- __kmalloc_node_track_caller(size, flags, node, \
- _RET_IP_)
+ alloc_hooks(__kmalloc_node_track_caller(size, flags, node, \
+ _RET_IP_), void*, NULL)

/*
* kmalloc_track_caller is a special version of kmalloc that records the
@@ -648,11 +678,10 @@ void *__kmalloc_node_track_caller(size_t size, gfp_t flags, int node,
* allocator where we care about the real place the memory allocation
* request comes from.
*/
-#define kmalloc_track_caller(size, flags) \
- __kmalloc_node_track_caller(size, flags, \
- NUMA_NO_NODE, _RET_IP_)
+#define kmalloc_track_caller(size, flags) \
+ kmalloc_node_track_caller(size, flags, NUMA_NO_NODE)

-static inline __alloc_size(1, 2) void *kmalloc_array_node(size_t n, size_t size, gfp_t flags,
+static inline __alloc_size(1, 2) void *_kmalloc_array_node(size_t n, size_t size, gfp_t flags,
int node)
{
size_t bytes;
@@ -660,75 +689,53 @@ static inline __alloc_size(1, 2) void *kmalloc_array_node(size_t n, size_t size,
if (unlikely(check_mul_overflow(n, size, &bytes)))
return NULL;
if (__builtin_constant_p(n) && __builtin_constant_p(size))
- return kmalloc_node(bytes, flags, node);
+ return _kmalloc_node(bytes, flags, node);
return __kmalloc_node(bytes, flags, node);
}
+#define kmalloc_array_node(_n, _size, _flags, _node) \
+ alloc_hooks(_kmalloc_array_node(_n, _size, _flags, _node), void*, NULL)

-static inline __alloc_size(1, 2) void *kcalloc_node(size_t n, size_t size, gfp_t flags, int node)
-{
- return kmalloc_array_node(n, size, flags | __GFP_ZERO, node);
-}
+#define kcalloc_node(_n, _size, _flags, _node) \
+ kmalloc_array_node(_n, _size, (_flags) | __GFP_ZERO, _node)

/*
* Shortcuts
*/
-static inline void *kmem_cache_zalloc(struct kmem_cache *k, gfp_t flags)
-{
- return kmem_cache_alloc(k, flags | __GFP_ZERO);
-}
+#define kmem_cache_zalloc(_k, _flags) \
+ kmem_cache_alloc(_k, (_flags)|__GFP_ZERO)

/**
* kzalloc - allocate memory. The memory is set to zero.
* @size: how many bytes of memory are required.
* @flags: the type of memory to allocate (see kmalloc).
*/
-static inline __alloc_size(1) void *kzalloc(size_t size, gfp_t flags)
-{
- return kmalloc(size, flags | __GFP_ZERO);
-}
-
-/**
- * kzalloc_node - allocate zeroed memory from a particular memory node.
- * @size: how many bytes of memory are required.
- * @flags: the type of memory to allocate (see kmalloc).
- * @node: memory node from which to allocate
- */
-static inline __alloc_size(1) void *kzalloc_node(size_t size, gfp_t flags, int node)
-{
- return kmalloc_node(size, flags | __GFP_ZERO, node);
-}
+#define kzalloc(_size, _flags) kmalloc(_size, (_flags)|__GFP_ZERO)
+#define kzalloc_node(_size, _flags, _node) kmalloc_node(_size, (_flags)|__GFP_ZERO, _node)

-extern void *kvmalloc_node(size_t size, gfp_t flags, int node) __alloc_size(1);
-static inline __alloc_size(1) void *kvmalloc(size_t size, gfp_t flags)
-{
- return kvmalloc_node(size, flags, NUMA_NO_NODE);
-}
-static inline __alloc_size(1) void *kvzalloc_node(size_t size, gfp_t flags, int node)
-{
- return kvmalloc_node(size, flags | __GFP_ZERO, node);
-}
-static inline __alloc_size(1) void *kvzalloc(size_t size, gfp_t flags)
-{
- return kvmalloc(size, flags | __GFP_ZERO);
-}
+extern void *_kvmalloc_node(size_t size, gfp_t flags, int node) __alloc_size(1);
+#define kvmalloc_node(_size, _flags, _node) \
+ alloc_hooks(_kvmalloc_node(_size, _flags, _node), void*, NULL)

-static inline __alloc_size(1, 2) void *kvmalloc_array(size_t n, size_t size, gfp_t flags)
-{
- size_t bytes;
+#define kvmalloc(_size, _flags) kvmalloc_node(_size, _flags, NUMA_NO_NODE)
+#define kvzalloc(_size, _flags) kvmalloc(_size, (_flags)|__GFP_ZERO)

- if (unlikely(check_mul_overflow(n, size, &bytes)))
- return NULL;
+#define kvzalloc_node(_size, _flags, _node) kvmalloc_node(_size, (_flags)|__GFP_ZERO, _node)

- return kvmalloc(bytes, flags);
-}
+#define kvmalloc_array(_n, _size, _flags) \
+({ \
+ size_t _bytes; \
+ \
+ !check_mul_overflow(_n, _size, &_bytes) ? kvmalloc(_bytes, _flags) : NULL; \
+})

-static inline __alloc_size(1, 2) void *kvcalloc(size_t n, size_t size, gfp_t flags)
-{
- return kvmalloc_array(n, size, flags | __GFP_ZERO);
-}
+#define kvcalloc(_n, _size, _flags) kvmalloc_array(_n, _size, (_flags)|__GFP_ZERO)

-extern void *kvrealloc(const void *p, size_t oldsize, size_t newsize, gfp_t flags)
+extern void *_kvrealloc(const void *p, size_t oldsize, size_t newsize, gfp_t flags)
__realloc_size(3);
+
+#define kvrealloc(_p, _oldsize, _newsize, _flags) \
+ alloc_hooks(_kvrealloc(_p, _oldsize, _newsize, _flags), void*, NULL)
+
extern void kvfree(const void *addr);
extern void kvfree_sensitive(const void *addr, size_t len);

diff --git a/mm/slab.c b/mm/slab.c
index 026f0c08708a..e08bd3496f56 100644
--- a/mm/slab.c
+++ b/mm/slab.c
@@ -3448,18 +3448,18 @@ void *__kmem_cache_alloc_lru(struct kmem_cache *cachep, struct list_lru *lru,
return ret;
}

-void *kmem_cache_alloc(struct kmem_cache *cachep, gfp_t flags)
+void *_kmem_cache_alloc(struct kmem_cache *cachep, gfp_t flags)
{
return __kmem_cache_alloc_lru(cachep, NULL, flags);
}
-EXPORT_SYMBOL(kmem_cache_alloc);
+EXPORT_SYMBOL(_kmem_cache_alloc);

-void *kmem_cache_alloc_lru(struct kmem_cache *cachep, struct list_lru *lru,
+void *_kmem_cache_alloc_lru(struct kmem_cache *cachep, struct list_lru *lru,
gfp_t flags)
{
return __kmem_cache_alloc_lru(cachep, lru, flags);
}
-EXPORT_SYMBOL(kmem_cache_alloc_lru);
+EXPORT_SYMBOL(_kmem_cache_alloc_lru);

static __always_inline void
cache_alloc_debugcheck_after_bulk(struct kmem_cache *s, gfp_t flags,
@@ -3471,7 +3471,7 @@ cache_alloc_debugcheck_after_bulk(struct kmem_cache *s, gfp_t flags,
p[i] = cache_alloc_debugcheck_after(s, flags, p[i], caller);
}

-int kmem_cache_alloc_bulk(struct kmem_cache *s, gfp_t flags, size_t size,
+int _kmem_cache_alloc_bulk(struct kmem_cache *s, gfp_t flags, size_t size,
void **p)
{
struct obj_cgroup *objcg = NULL;
@@ -3510,7 +3510,7 @@ int kmem_cache_alloc_bulk(struct kmem_cache *s, gfp_t flags, size_t size,
kmem_cache_free_bulk(s, i, p);
return 0;
}
-EXPORT_SYMBOL(kmem_cache_alloc_bulk);
+EXPORT_SYMBOL(_kmem_cache_alloc_bulk);

/**
* kmem_cache_alloc_node - Allocate an object on the specified node
@@ -3525,7 +3525,7 @@ EXPORT_SYMBOL(kmem_cache_alloc_bulk);
*
* Return: pointer to the new object or %NULL in case of error
*/
-void *kmem_cache_alloc_node(struct kmem_cache *cachep, gfp_t flags, int nodeid)
+void *_kmem_cache_alloc_node(struct kmem_cache *cachep, gfp_t flags, int nodeid)
{
void *ret = slab_alloc_node(cachep, NULL, flags, nodeid, cachep->object_size, _RET_IP_);

@@ -3533,7 +3533,7 @@ void *kmem_cache_alloc_node(struct kmem_cache *cachep, gfp_t flags, int nodeid)

return ret;
}
-EXPORT_SYMBOL(kmem_cache_alloc_node);
+EXPORT_SYMBOL(_kmem_cache_alloc_node);

void *__kmem_cache_alloc_node(struct kmem_cache *cachep, gfp_t flags,
int nodeid, size_t orig_size,
diff --git a/mm/slab_common.c b/mm/slab_common.c
index 42777d66d0e3..a05333bbb7f1 100644
--- a/mm/slab_common.c
+++ b/mm/slab_common.c
@@ -1101,7 +1101,7 @@ size_t __ksize(const void *object)
return slab_ksize(folio_slab(folio)->slab_cache);
}

-void *kmalloc_trace(struct kmem_cache *s, gfp_t gfpflags, size_t size)
+void *_kmalloc_trace(struct kmem_cache *s, gfp_t gfpflags, size_t size)
{
void *ret = __kmem_cache_alloc_node(s, gfpflags, NUMA_NO_NODE,
size, _RET_IP_);
@@ -1111,9 +1111,9 @@ void *kmalloc_trace(struct kmem_cache *s, gfp_t gfpflags, size_t size)
ret = kasan_kmalloc(s, ret, size, gfpflags);
return ret;
}
-EXPORT_SYMBOL(kmalloc_trace);
+EXPORT_SYMBOL(_kmalloc_trace);

-void *kmalloc_node_trace(struct kmem_cache *s, gfp_t gfpflags,
+void *_kmalloc_node_trace(struct kmem_cache *s, gfp_t gfpflags,
int node, size_t size)
{
void *ret = __kmem_cache_alloc_node(s, gfpflags, node, size, _RET_IP_);
@@ -1123,7 +1123,7 @@ void *kmalloc_node_trace(struct kmem_cache *s, gfp_t gfpflags,
ret = kasan_kmalloc(s, ret, size, gfpflags);
return ret;
}
-EXPORT_SYMBOL(kmalloc_node_trace);
+EXPORT_SYMBOL(_kmalloc_node_trace);

gfp_t kmalloc_fix_flags(gfp_t flags)
{
@@ -1168,7 +1168,7 @@ static void *__kmalloc_large_node(size_t size, gfp_t flags, int node)
return ptr;
}

-void *kmalloc_large(size_t size, gfp_t flags)
+void *_kmalloc_large(size_t size, gfp_t flags)
{
void *ret = __kmalloc_large_node(size, flags, NUMA_NO_NODE);

@@ -1176,9 +1176,9 @@ void *kmalloc_large(size_t size, gfp_t flags)
flags, NUMA_NO_NODE);
return ret;
}
-EXPORT_SYMBOL(kmalloc_large);
+EXPORT_SYMBOL(_kmalloc_large);

-void *kmalloc_large_node(size_t size, gfp_t flags, int node)
+void *_kmalloc_large_node(size_t size, gfp_t flags, int node)
{
void *ret = __kmalloc_large_node(size, flags, node);

@@ -1186,7 +1186,7 @@ void *kmalloc_large_node(size_t size, gfp_t flags, int node)
flags, node);
return ret;
}
-EXPORT_SYMBOL(kmalloc_large_node);
+EXPORT_SYMBOL(_kmalloc_large_node);

#ifdef CONFIG_SLAB_FREELIST_RANDOM
/* Randomize a generic freelist */
@@ -1405,7 +1405,7 @@ __do_krealloc(const void *p, size_t new_size, gfp_t flags)
return (void *)p;
}

- ret = kmalloc_track_caller(new_size, flags);
+ ret = __kmalloc_node_track_caller(new_size, flags, NUMA_NO_NODE, _RET_IP_);
if (ret && p) {
/* Disable KASAN checks as the object's redzone is accessed. */
kasan_disable_current();
@@ -1429,7 +1429,7 @@ __do_krealloc(const void *p, size_t new_size, gfp_t flags)
*
* Return: pointer to the allocated memory or %NULL in case of error
*/
-void *krealloc(const void *p, size_t new_size, gfp_t flags)
+void *_krealloc(const void *p, size_t new_size, gfp_t flags)
{
void *ret;

@@ -1444,7 +1444,7 @@ void *krealloc(const void *p, size_t new_size, gfp_t flags)

return ret;
}
-EXPORT_SYMBOL(krealloc);
+EXPORT_SYMBOL(_krealloc);

/**
* kfree_sensitive - Clear sensitive information in memory before freeing
diff --git a/mm/slub.c b/mm/slub.c
index 507b71372ee4..8f57fd086f69 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -3470,18 +3470,18 @@ void *__kmem_cache_alloc_lru(struct kmem_cache *s, struct list_lru *lru,
return ret;
}

-void *kmem_cache_alloc(struct kmem_cache *s, gfp_t gfpflags)
+void *_kmem_cache_alloc(struct kmem_cache *s, gfp_t gfpflags)
{
return __kmem_cache_alloc_lru(s, NULL, gfpflags);
}
-EXPORT_SYMBOL(kmem_cache_alloc);
+EXPORT_SYMBOL(_kmem_cache_alloc);

-void *kmem_cache_alloc_lru(struct kmem_cache *s, struct list_lru *lru,
+void *_kmem_cache_alloc_lru(struct kmem_cache *s, struct list_lru *lru,
gfp_t gfpflags)
{
return __kmem_cache_alloc_lru(s, lru, gfpflags);
}
-EXPORT_SYMBOL(kmem_cache_alloc_lru);
+EXPORT_SYMBOL(_kmem_cache_alloc_lru);

void *__kmem_cache_alloc_node(struct kmem_cache *s, gfp_t gfpflags,
int node, size_t orig_size,
@@ -3491,7 +3491,7 @@ void *__kmem_cache_alloc_node(struct kmem_cache *s, gfp_t gfpflags,
caller, orig_size);
}

-void *kmem_cache_alloc_node(struct kmem_cache *s, gfp_t gfpflags, int node)
+void *_kmem_cache_alloc_node(struct kmem_cache *s, gfp_t gfpflags, int node)
{
void *ret = slab_alloc_node(s, NULL, gfpflags, node, _RET_IP_, s->object_size);

@@ -3499,7 +3499,7 @@ void *kmem_cache_alloc_node(struct kmem_cache *s, gfp_t gfpflags, int node)

return ret;
}
-EXPORT_SYMBOL(kmem_cache_alloc_node);
+EXPORT_SYMBOL(_kmem_cache_alloc_node);

static noinline void free_to_partial_list(
struct kmem_cache *s, struct slab *slab,
@@ -3779,6 +3779,7 @@ static __fastpath_inline void slab_free(struct kmem_cache *s, struct slab *slab,
unsigned long addr)
{
memcg_slab_free_hook(s, slab, p, cnt);
+ alloc_tagging_slab_free_hook(s, slab, p, cnt);
/*
* With KASAN enabled slab_free_freelist_hook modifies the freelist
* to remove objects, whose reuse must be delayed.
@@ -4009,7 +4010,7 @@ static int __kmem_cache_alloc_bulk(struct kmem_cache *s, gfp_t flags,
#endif /* CONFIG_SLUB_TINY */

/* Note that interrupts must be enabled when calling this function. */
-int kmem_cache_alloc_bulk(struct kmem_cache *s, gfp_t flags, size_t size,
+int _kmem_cache_alloc_bulk(struct kmem_cache *s, gfp_t flags, size_t size,
void **p)
{
int i;
@@ -4034,7 +4035,7 @@ int kmem_cache_alloc_bulk(struct kmem_cache *s, gfp_t flags, size_t size,
slab_want_init_on_alloc(flags, s), s->object_size);
return i;
}
-EXPORT_SYMBOL(kmem_cache_alloc_bulk);
+EXPORT_SYMBOL(_kmem_cache_alloc_bulk);


/*
diff --git a/mm/util.c b/mm/util.c
index dd12b9531ac4..e9077d1af676 100644
--- a/mm/util.c
+++ b/mm/util.c
@@ -579,7 +579,7 @@ EXPORT_SYMBOL(vm_mmap);
*
* Return: pointer to the allocated memory of %NULL in case of failure
*/
-void *kvmalloc_node(size_t size, gfp_t flags, int node)
+void *_kvmalloc_node(size_t size, gfp_t flags, int node)
{
gfp_t kmalloc_flags = flags;
void *ret;
@@ -601,7 +601,7 @@ void *kvmalloc_node(size_t size, gfp_t flags, int node)
kmalloc_flags &= ~__GFP_NOFAIL;
}

- ret = kmalloc_node(size, kmalloc_flags, node);
+ ret = _kmalloc_node(size, kmalloc_flags, node);

/*
* It doesn't really make sense to fallback to vmalloc for sub page
@@ -630,7 +630,7 @@ void *kvmalloc_node(size_t size, gfp_t flags, int node)
flags, PAGE_KERNEL, VM_ALLOW_HUGE_VMAP,
node, __builtin_return_address(0));
}
-EXPORT_SYMBOL(kvmalloc_node);
+EXPORT_SYMBOL(_kvmalloc_node);

/**
* kvfree() - Free memory.
@@ -669,7 +669,7 @@ void kvfree_sensitive(const void *addr, size_t len)
}
EXPORT_SYMBOL(kvfree_sensitive);

-void *kvrealloc(const void *p, size_t oldsize, size_t newsize, gfp_t flags)
+void *_kvrealloc(const void *p, size_t oldsize, size_t newsize, gfp_t flags)
{
void *newp;

@@ -682,7 +682,7 @@ void *kvrealloc(const void *p, size_t oldsize, size_t newsize, gfp_t flags)
kvfree(p);
return newp;
}
-EXPORT_SYMBOL(kvrealloc);
+EXPORT_SYMBOL(_kvrealloc);

/**
* __vmalloc_array - allocate memory for a virtually contiguous array.
--
2.40.1.495.gc816e09b53d-goog

Suren Baghdasaryan

May 1, 2023, 12:56:08 PM
From: Kent Overstreet <kent.ov...@linux.dev>

The compiler apparently does not inline slab_free_freelist_hook() when it
is only marked inline, so mark it __always_inline.

Signed-off-by: Kent Overstreet <kent.ov...@linux.dev>
Signed-off-by: Suren Baghdasaryan <sur...@google.com>
---
mm/slub.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/slub.c b/mm/slub.c
index 8f57fd086f69..9dd57b3384a1 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -1781,7 +1781,7 @@ static __always_inline bool slab_free_hook(struct kmem_cache *s,
return kasan_slab_free(s, x, init);
}

-static inline bool slab_free_freelist_hook(struct kmem_cache *s,
+static __always_inline bool slab_free_freelist_hook(struct kmem_cache *s,
void **head, void **tail,
int *cnt)
{
--
2.40.1.495.gc816e09b53d-goog

Suren Baghdasaryan

May 1, 2023, 12:56:11 PM
From: Kent Overstreet <kent.ov...@linux.dev>

This adds hooks to mempools so that mempool-backed allocations are
attributed to the source line of the caller and show up correctly in
/sys/kernel/debug/allocations.

Various inline functions are converted to macro wrappers so that we can
invoke alloc_hooks() in fewer places.

Signed-off-by: Kent Overstreet <kent.ov...@linux.dev>
Signed-off-by: Suren Baghdasaryan <sur...@google.com>
---
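A minimal user-space sketch, not part of the patch, of why the inline
helpers below become macros. It assumes a per-call-site static record
created by a macro; struct site, tagged_malloc(), HOOKED_MALLOC(),
pool_alloc_inline() and POOL_ALLOC() are hypothetical names, and the real
alloc_hooks() is defined elsewhere in the series and may work differently.
It relies on GNU statement expressions, as kernel code does.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for a code tag: one static record per call site. */
struct site {
	const char *file;
	int line;
	size_t bytes;		/* bytes charged to this site so far */
};

/* Most recently charged site; kept only so the demo can inspect it. */
static const struct site *last_site;

/* Charge the allocation to the given site, then return the memory. */
static void *tagged_malloc(struct site *s, size_t size)
{
	void *p = malloc(size);

	if (p) {
		s->bytes += size;
		last_site = s;
	}
	return p;
}

/*
 * Each expansion creates its own static record carrying the __FILE__ and
 * __LINE__ of the expansion point (GNU statement expression).
 */
#define HOOKED_MALLOC(size)						\
({									\
	static struct site this_site = { __FILE__, __LINE__, 0 };	\
	tagged_malloc(&this_site, (size));				\
})

/* Inline helper: the macro expands here, so all callers share one site. */
static inline void *pool_alloc_inline(size_t size)
{
	return HOOKED_MALLOC(size);
}

/* Macro helper: the expansion, and therefore the site, lands in the caller. */
#define POOL_ALLOC(size)	HOOKED_MALLOC(size)

/* Returns 1 if two calls through the inline helper hit the same site. */
static int inline_calls_share_site(void)
{
	const struct site *first, *second;
	void *a, *b;

	a = pool_alloc_inline(16);
	first = last_site;
	b = pool_alloc_inline(16);
	second = last_site;
	assert(a && b);
	free(a);
	free(b);
	return first == second;
}

/* Returns 1 if two uses of the macro helper get distinct sites. */
static int macro_calls_get_own_site(void)
{
	const struct site *first, *second;
	void *a, *b;

	a = POOL_ALLOC(16);
	first = last_site;
	b = POOL_ALLOC(16);
	second = last_site;
	assert(a && b);
	free(a);
	free(b);
	return first != second && first->line != second->line;
}
```

In this model an inline helper would charge every caller to the helper's
own line in the header, which is the misattribution the macro conversion
avoids.
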
include/linux/mempool.h | 73 ++++++++++++++++++++---------------------
mm/mempool.c | 28 ++++++----------
2 files changed, 45 insertions(+), 56 deletions(-)

diff --git a/include/linux/mempool.h b/include/linux/mempool.h
index 4aae6c06c5f2..aa6e886b01d7 100644
--- a/include/linux/mempool.h
+++ b/include/linux/mempool.h
@@ -5,6 +5,8 @@
#ifndef _LINUX_MEMPOOL_H
#define _LINUX_MEMPOOL_H

+#include <linux/sched.h>
+#include <linux/alloc_tag.h>
#include <linux/wait.h>
#include <linux/compiler.h>

@@ -39,18 +41,32 @@ void mempool_exit(mempool_t *pool);
int mempool_init_node(mempool_t *pool, int min_nr, mempool_alloc_t *alloc_fn,
mempool_free_t *free_fn, void *pool_data,
gfp_t gfp_mask, int node_id);
-int mempool_init(mempool_t *pool, int min_nr, mempool_alloc_t *alloc_fn,
+
+int _mempool_init(mempool_t *pool, int min_nr, mempool_alloc_t *alloc_fn,
mempool_free_t *free_fn, void *pool_data);
+#define mempool_init(...) \
+ alloc_hooks(_mempool_init(__VA_ARGS__), int, -ENOMEM)

extern mempool_t *mempool_create(int min_nr, mempool_alloc_t *alloc_fn,
mempool_free_t *free_fn, void *pool_data);
-extern mempool_t *mempool_create_node(int min_nr, mempool_alloc_t *alloc_fn,
+
+extern mempool_t *_mempool_create_node(int min_nr, mempool_alloc_t *alloc_fn,
mempool_free_t *free_fn, void *pool_data,
gfp_t gfp_mask, int nid);
+#define mempool_create_node(...) \
+ alloc_hooks(_mempool_create_node(__VA_ARGS__), mempool_t *, NULL)
+
+#define mempool_create(_min_nr, _alloc_fn, _free_fn, _pool_data) \
+ mempool_create_node(_min_nr, _alloc_fn, _free_fn, _pool_data, \
+ GFP_KERNEL, NUMA_NO_NODE)

extern int mempool_resize(mempool_t *pool, int new_min_nr);
extern void mempool_destroy(mempool_t *pool);
-extern void *mempool_alloc(mempool_t *pool, gfp_t gfp_mask) __malloc;
+
+extern void *_mempool_alloc(mempool_t *pool, gfp_t gfp_mask) __malloc;
+#define mempool_alloc(_pool, _gfp) \
+ alloc_hooks(_mempool_alloc((_pool), (_gfp)), void *, NULL)
+
extern void mempool_free(void *element, mempool_t *pool);

/*
@@ -61,19 +77,10 @@ extern void mempool_free(void *element, mempool_t *pool);
void *mempool_alloc_slab(gfp_t gfp_mask, void *pool_data);
void mempool_free_slab(void *element, void *pool_data);

-static inline int
-mempool_init_slab_pool(mempool_t *pool, int min_nr, struct kmem_cache *kc)
-{
- return mempool_init(pool, min_nr, mempool_alloc_slab,
- mempool_free_slab, (void *) kc);
-}
-
-static inline mempool_t *
-mempool_create_slab_pool(int min_nr, struct kmem_cache *kc)
-{
- return mempool_create(min_nr, mempool_alloc_slab, mempool_free_slab,
- (void *) kc);
-}
+#define mempool_init_slab_pool(_pool, _min_nr, _kc) \
+ mempool_init(_pool, (_min_nr), mempool_alloc_slab, mempool_free_slab, (void *)(_kc))
+#define mempool_create_slab_pool(_min_nr, _kc) \
+ mempool_create((_min_nr), mempool_alloc_slab, mempool_free_slab, (void *)(_kc))

/*
* a mempool_alloc_t and a mempool_free_t to kmalloc and kfree the
@@ -82,17 +89,12 @@ mempool_create_slab_pool(int min_nr, struct kmem_cache *kc)
void *mempool_kmalloc(gfp_t gfp_mask, void *pool_data);
void mempool_kfree(void *element, void *pool_data);

-static inline int mempool_init_kmalloc_pool(mempool_t *pool, int min_nr, size_t size)
-{
- return mempool_init(pool, min_nr, mempool_kmalloc,
- mempool_kfree, (void *) size);
-}
-
-static inline mempool_t *mempool_create_kmalloc_pool(int min_nr, size_t size)
-{
- return mempool_create(min_nr, mempool_kmalloc, mempool_kfree,
- (void *) size);
-}
+#define mempool_init_kmalloc_pool(_pool, _min_nr, _size) \
+ mempool_init(_pool, (_min_nr), mempool_kmalloc, mempool_kfree, \
+ (void *)(unsigned long)(_size))
+#define mempool_create_kmalloc_pool(_min_nr, _size) \
+ mempool_create((_min_nr), mempool_kmalloc, mempool_kfree, \
+ (void *)(unsigned long)(_size))

/*
* A mempool_alloc_t and mempool_free_t for a simple page allocator that
@@ -101,16 +103,11 @@ static inline mempool_t *mempool_create_kmalloc_pool(int min_nr, size_t size)
void *mempool_alloc_pages(gfp_t gfp_mask, void *pool_data);
void mempool_free_pages(void *element, void *pool_data);

-static inline int mempool_init_page_pool(mempool_t *pool, int min_nr, int order)
-{
- return mempool_init(pool, min_nr, mempool_alloc_pages,
- mempool_free_pages, (void *)(long)order);
-}
-
-static inline mempool_t *mempool_create_page_pool(int min_nr, int order)
-{
- return mempool_create(min_nr, mempool_alloc_pages, mempool_free_pages,
- (void *)(long)order);
-}
+#define mempool_init_page_pool(_pool, _min_nr, _order) \
+ mempool_init(_pool, (_min_nr), mempool_alloc_pages, \
+ mempool_free_pages, (void *)(long)(_order))
+#define mempool_create_page_pool(_min_nr, _order) \
+ mempool_create((_min_nr), mempool_alloc_pages, \
+ mempool_free_pages, (void *)(long)(_order))

#endif /* _LINUX_MEMPOOL_H */
diff --git a/mm/mempool.c b/mm/mempool.c
index 734bcf5afbb7..4fc90735853c 100644
--- a/mm/mempool.c
+++ b/mm/mempool.c
@@ -230,17 +230,17 @@ EXPORT_SYMBOL(mempool_init_node);
*
* Return: %0 on success, negative error code otherwise.
*/
-int mempool_init(mempool_t *pool, int min_nr, mempool_alloc_t *alloc_fn,
+int _mempool_init(mempool_t *pool, int min_nr, mempool_alloc_t *alloc_fn,
mempool_free_t *free_fn, void *pool_data)
{
return mempool_init_node(pool, min_nr, alloc_fn, free_fn,
pool_data, GFP_KERNEL, NUMA_NO_NODE);

}
-EXPORT_SYMBOL(mempool_init);
+EXPORT_SYMBOL(_mempool_init);

/**
- * mempool_create - create a memory pool
+ * mempool_create_node - create a memory pool
* @min_nr: the minimum number of elements guaranteed to be
* allocated for this pool.
* @alloc_fn: user-defined element-allocation function.
@@ -255,15 +255,7 @@ EXPORT_SYMBOL(mempool_init);
*
* Return: pointer to the created memory pool object or %NULL on error.
*/
-mempool_t *mempool_create(int min_nr, mempool_alloc_t *alloc_fn,
- mempool_free_t *free_fn, void *pool_data)
-{
- return mempool_create_node(min_nr, alloc_fn, free_fn, pool_data,
- GFP_KERNEL, NUMA_NO_NODE);
-}
-EXPORT_SYMBOL(mempool_create);
-
-mempool_t *mempool_create_node(int min_nr, mempool_alloc_t *alloc_fn,
+mempool_t *_mempool_create_node(int min_nr, mempool_alloc_t *alloc_fn,
mempool_free_t *free_fn, void *pool_data,
gfp_t gfp_mask, int node_id)
{
@@ -281,7 +273,7 @@ mempool_t *mempool_create_node(int min_nr, mempool_alloc_t *alloc_fn,

return pool;
}
-EXPORT_SYMBOL(mempool_create_node);
+EXPORT_SYMBOL(_mempool_create_node);

/**
* mempool_resize - resize an existing memory pool
@@ -377,7 +369,7 @@ EXPORT_SYMBOL(mempool_resize);
*
* Return: pointer to the allocated element or %NULL on error.
*/
-void *mempool_alloc(mempool_t *pool, gfp_t gfp_mask)
+void *_mempool_alloc(mempool_t *pool, gfp_t gfp_mask)
{
void *element;
unsigned long flags;
@@ -444,7 +436,7 @@ void *mempool_alloc(mempool_t *pool, gfp_t gfp_mask)
finish_wait(&pool->wait, &wait);
goto repeat_alloc;
}
-EXPORT_SYMBOL(mempool_alloc);
+EXPORT_SYMBOL(_mempool_alloc);

/**
* mempool_free - return an element to the pool.
@@ -515,7 +507,7 @@ void *mempool_alloc_slab(gfp_t gfp_mask, void *pool_data)
{
struct kmem_cache *mem = pool_data;
VM_BUG_ON(mem->ctor);
- return kmem_cache_alloc(mem, gfp_mask);
+ return _kmem_cache_alloc(mem, gfp_mask);
}
EXPORT_SYMBOL(mempool_alloc_slab);

@@ -533,7 +525,7 @@ EXPORT_SYMBOL(mempool_free_slab);
void *mempool_kmalloc(gfp_t gfp_mask, void *pool_data)
{
size_t size = (size_t)pool_data;
- return kmalloc(size, gfp_mask);
+ return _kmalloc(size, gfp_mask);
}
EXPORT_SYMBOL(mempool_kmalloc);

@@ -550,7 +542,7 @@ EXPORT_SYMBOL(mempool_kfree);
void *mempool_alloc_pages(gfp_t gfp_mask, void *pool_data)
{
int order = (int)(long)pool_data;
- return alloc_pages(gfp_mask, order);
+ return _alloc_pages(gfp_mask, order);
}
EXPORT_SYMBOL(mempool_alloc_pages);

--
2.40.1.495.gc816e09b53d-goog

Suren Baghdasaryan

May 1, 2023, 12:56:13 PM
From: Kent Overstreet <kent.ov...@linux.dev>

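This avoids a circular header dependency in an upcoming patch by making
hrtimer.h depend only on percpu-defs.h instead of percpu.h. A forward
declaration of struct vm_area_struct is added to time_namespace.h as part
of the change.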

Signed-off-by: Kent Overstreet <kent.ov...@linux.dev>
Signed-off-by: Suren Baghdasaryan <sur...@google.com>
Cc: Thomas Gleixner <tg...@linutronix.de>
---
include/linux/hrtimer.h | 2 +-
include/linux/time_namespace.h | 2 ++
2 files changed, 3 insertions(+), 1 deletion(-)

diff --git a/include/linux/hrtimer.h b/include/linux/hrtimer.h
index 0ee140176f10..e67349e84364 100644
--- a/include/linux/hrtimer.h
+++ b/include/linux/hrtimer.h
@@ -16,7 +16,7 @@
#include <linux/rbtree.h>
#include <linux/init.h>
#include <linux/list.h>
-#include <linux/percpu.h>
+#include <linux/percpu-defs.h>
#include <linux/seqlock.h>
#include <linux/timer.h>
#include <linux/timerqueue.h>
diff --git a/include/linux/time_namespace.h b/include/linux/time_namespace.h
index bb9d3f5542f8..d8e0cacfcae5 100644
--- a/include/linux/time_namespace.h
+++ b/include/linux/time_namespace.h
@@ -11,6 +11,8 @@
struct user_namespace;
extern struct user_namespace init_user_ns;

+struct vm_area_struct;
+
struct timens_offsets {
struct timespec64 monotonic;
struct timespec64 boottime;
--
2.40.1.495.gc816e09b53d-goog

Suren Baghdasaryan

May 1, 2023, 12:56:15 PM
From: Kent Overstreet <kent.ov...@linux.dev>

Upcoming alloc tagging patches require a place to stash per-allocation
metadata.

We already do this when memcg is enabled, so this patch generalizes the
obj_cgroup * vector in struct pcpu_chunk by creating a pcpuobj_ext type,
which an upcoming patch will extend, similarly to the previous
slabobj_ext patch.

Signed-off-by: Kent Overstreet <kent.ov...@linux.dev>
Signed-off-by: Suren Baghdasaryan <sur...@google.com>
Cc: Andrew Morton <ak...@linux-foundation.org>
Cc: Dennis Zhou <den...@kernel.org>
Cc: Tejun Heo <t...@kernel.org>
Cc: Christoph Lameter <c...@linux.com>
Cc: linu...@kvack.org
---
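A minimal user-space sketch, not part of the patch, of the generalization
described above: a vector of pointers becomes a vector of small records,
indexed the same way. struct owner, struct obj_ext, struct chunk and the
chunk_*() helpers are hypothetical names, calloc() stands in for
pcpu_mem_zalloc(), and MIN_ALLOC_SHIFT is an assumed slot granularity.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define MIN_ALLOC_SHIFT	2	/* assumed slot granularity: 4 bytes */

struct owner {			/* stands in for struct obj_cgroup */
	int id;
};

/* One record per slot; members can be added without touching the indexing. */
struct obj_ext {
	struct owner *owner;
};

struct chunk {
	size_t nr_slots;
	struct obj_ext *exts;	/* previously: struct owner **owners */
};

/* Allocate one zeroed record per slot, sized by the record type. */
static int chunk_init(struct chunk *c, size_t bytes)
{
	c->nr_slots = bytes >> MIN_ALLOC_SHIFT;
	c->exts = calloc(c->nr_slots, sizeof(*c->exts));
	return c->exts ? 0 : -1;
}

/* Map a byte offset inside the chunk to its record. */
static struct obj_ext *chunk_ext(struct chunk *c, size_t off)
{
	size_t idx = off >> MIN_ALLOC_SHIFT;

	assert(idx < c->nr_slots);
	return &c->exts[idx];
}

static void chunk_destroy(struct chunk *c)
{
	free(c->exts);
	c->exts = NULL;
}

/* Returns 1 if a stored owner is found again through the same slot. */
static int ext_roundtrip(void)
{
	struct owner me = { 42 };
	struct chunk c;
	int ok;

	if (chunk_init(&c, 64))
		return 0;
	chunk_ext(&c, 8)->owner = &me;
	ok = chunk_ext(&c, 11)->owner == &me &&	/* offsets 8..11 share slot 2 */
	     chunk_ext(&c, 12)->owner == NULL;	/* slot 3 was never written */
	chunk_destroy(&c);
	return ok;
}
```

The only change callers see in this model is the extra member access,
which mirrors the .cgroup accesses introduced in mm/percpu.c below.
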
mm/percpu-internal.h | 19 +++++++++++++++++--
mm/percpu.c | 30 +++++++++++++++---------------
2 files changed, 32 insertions(+), 17 deletions(-)

diff --git a/mm/percpu-internal.h b/mm/percpu-internal.h
index f9847c131998..2433e7b24172 100644
--- a/mm/percpu-internal.h
+++ b/mm/percpu-internal.h
@@ -32,6 +32,16 @@ struct pcpu_block_md {
int nr_bits; /* total bits responsible for */
};

+struct pcpuobj_ext {
+#ifdef CONFIG_MEMCG_KMEM
+ struct obj_cgroup *cgroup;
+#endif
+};
+
+#ifdef CONFIG_MEMCG_KMEM
+#define NEED_PCPUOBJ_EXT
+#endif
+
struct pcpu_chunk {
#ifdef CONFIG_PERCPU_STATS
int nr_alloc; /* # of allocations */
@@ -57,8 +67,8 @@ struct pcpu_chunk {
int end_offset; /* additional area required to
have the region end page
aligned */
-#ifdef CONFIG_MEMCG_KMEM
- struct obj_cgroup **obj_cgroups; /* vector of object cgroups */
+#ifdef NEED_PCPUOBJ_EXT
+ struct pcpuobj_ext *obj_exts; /* vector of object extensions */
#endif

int nr_pages; /* # of pages served by this chunk */
@@ -67,6 +77,11 @@ struct pcpu_chunk {
unsigned long populated[]; /* populated bitmap */
};

+static inline bool need_pcpuobj_ext(void)
+{
+ return !mem_cgroup_kmem_disabled();
+}
+
extern spinlock_t pcpu_lock;

extern struct list_head *pcpu_chunk_lists;
diff --git a/mm/percpu.c b/mm/percpu.c
index 28e07ede46f6..95b26a6b718d 100644
--- a/mm/percpu.c
+++ b/mm/percpu.c
@@ -1392,9 +1392,9 @@ static struct pcpu_chunk * __init pcpu_alloc_first_chunk(unsigned long tmp_addr,
panic("%s: Failed to allocate %zu bytes\n", __func__,
alloc_size);

-#ifdef CONFIG_MEMCG_KMEM
+#ifdef NEED_PCPUOBJ_EXT
/* first chunk is free to use */
- chunk->obj_cgroups = NULL;
+ chunk->obj_exts = NULL;
#endif
pcpu_init_md_blocks(chunk);

@@ -1463,12 +1463,12 @@ static struct pcpu_chunk *pcpu_alloc_chunk(gfp_t gfp)
if (!chunk->md_blocks)
goto md_blocks_fail;

-#ifdef CONFIG_MEMCG_KMEM
- if (!mem_cgroup_kmem_disabled()) {
- chunk->obj_cgroups =
+#ifdef NEED_PCPUOBJ_EXT
+ if (need_pcpuobj_ext()) {
+ chunk->obj_exts =
pcpu_mem_zalloc(pcpu_chunk_map_bits(chunk) *
- sizeof(struct obj_cgroup *), gfp);
- if (!chunk->obj_cgroups)
+ sizeof(struct pcpuobj_ext), gfp);
+ if (!chunk->obj_exts)
goto objcg_fail;
}
#endif
@@ -1480,7 +1480,7 @@ static struct pcpu_chunk *pcpu_alloc_chunk(gfp_t gfp)

return chunk;

-#ifdef CONFIG_MEMCG_KMEM
+#ifdef NEED_PCPUOBJ_EXT
objcg_fail:
pcpu_mem_free(chunk->md_blocks);
#endif
@@ -1498,8 +1498,8 @@ static void pcpu_free_chunk(struct pcpu_chunk *chunk)
{
if (!chunk)
return;
-#ifdef CONFIG_MEMCG_KMEM
- pcpu_mem_free(chunk->obj_cgroups);
+#ifdef NEED_PCPUOBJ_EXT
+ pcpu_mem_free(chunk->obj_exts);
#endif
pcpu_mem_free(chunk->md_blocks);
pcpu_mem_free(chunk->bound_map);
@@ -1648,8 +1648,8 @@ static void pcpu_memcg_post_alloc_hook(struct obj_cgroup *objcg,
if (!objcg)
return;

- if (likely(chunk && chunk->obj_cgroups)) {
- chunk->obj_cgroups[off >> PCPU_MIN_ALLOC_SHIFT] = objcg;
+ if (likely(chunk && chunk->obj_exts)) {
+ chunk->obj_exts[off >> PCPU_MIN_ALLOC_SHIFT].cgroup = objcg;

rcu_read_lock();
mod_memcg_state(obj_cgroup_memcg(objcg), MEMCG_PERCPU_B,
@@ -1665,13 +1665,13 @@ static void pcpu_memcg_free_hook(struct pcpu_chunk *chunk, int off, size_t size)
{
struct obj_cgroup *objcg;

- if (unlikely(!chunk->obj_cgroups))
+ if (unlikely(!chunk->obj_exts))
return;

- objcg = chunk->obj_cgroups[off >> PCPU_MIN_ALLOC_SHIFT];
+ objcg = chunk->obj_exts[off >> PCPU_MIN_ALLOC_SHIFT].cgroup;
if (!objcg)
return;
- chunk->obj_cgroups[off >> PCPU_MIN_ALLOC_SHIFT] = NULL;
+ chunk->obj_exts[off >> PCPU_MIN_ALLOC_SHIFT].cgroup = NULL;

obj_cgroup_uncharge(objcg, pcpu_obj_full_size(size));

--
2.40.1.495.gc816e09b53d-goog

Suren Baghdasaryan

May 1, 2023, 12:56:17 PM
From: Kent Overstreet <kent.ov...@linux.dev>

To store a codetag for every per-cpu allocation, a codetag reference is
embedded into pcpuobj_ext when CONFIG_MEM_ALLOC_PROFILING=y. Hooks that
use the newly introduced codetag are added.

Signed-off-by: Kent Overstreet <kent.ov...@linux.dev>
Signed-off-by: Suren Baghdasaryan <sur...@google.com>
---
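A minimal user-space sketch, not part of the patch, of how a tag reference
stored next to each object lets the free path find the counter that the
allocation path charged. struct tag, struct tag_ref, ext_alloc_hook() and
ext_free_hook() are hypothetical names; the semantics of alloc_tag_add()
and alloc_tag_sub_noalloc() are assumed here, not taken from their
definitions, which are elsewhere in the series.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical per-call-site counter. */
struct tag {
	const char *where;
	long bytes;		/* bytes currently charged to this site */
};

/* Hypothetical reference stored alongside each object. */
struct tag_ref {
	struct tag *tag;
};

struct obj_ext {
	struct tag_ref ref;
};

/* Models a runtime switch such as mem_alloc_profiling_enabled(). */
static bool profiling_enabled = true;

/* Remember which tag paid for the object and charge it. */
static void ext_alloc_hook(struct obj_ext *exts, size_t idx,
			   struct tag *t, size_t size)
{
	assert(t != NULL);
	if (!profiling_enabled || !exts)
		return;
	exts[idx].ref.tag = t;
	t->bytes += (long)size;
}

/* Look the tag up through the stored reference and uncharge it. */
static void ext_free_hook(struct obj_ext *exts, size_t idx, size_t size)
{
	struct tag *t;

	if (!profiling_enabled || !exts)
		return;
	t = exts[idx].ref.tag;
	if (!t)
		return;
	t->bytes -= (long)size;
	exts[idx].ref.tag = NULL;
}

/* Charge 64 bytes to a site, optionally free them, return the balance. */
static long tag_balance(bool do_free)
{
	struct tag site = { "demo.c:1", 0 };
	struct obj_ext exts[8];

	memset(exts, 0, sizeof(exts));
	ext_alloc_hook(exts, 3, &site, 64);
	if (do_free)
		ext_free_hook(exts, 3, 64);
	return site.bytes;
}
```

Because the free path takes no tag argument, the stored reference is what
makes the uncharge possible in this model.
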
mm/percpu-internal.h | 11 +++++++++--
mm/percpu.c | 26 ++++++++++++++++++++++++++
2 files changed, 35 insertions(+), 2 deletions(-)

diff --git a/mm/percpu-internal.h b/mm/percpu-internal.h
index 2433e7b24172..c5d1d6723a66 100644
--- a/mm/percpu-internal.h
+++ b/mm/percpu-internal.h
@@ -36,9 +36,12 @@ struct pcpuobj_ext {
#ifdef CONFIG_MEMCG_KMEM
struct obj_cgroup *cgroup;
#endif
+#ifdef CONFIG_MEM_ALLOC_PROFILING
+ union codetag_ref tag;
+#endif
};

-#ifdef CONFIG_MEMCG_KMEM
+#if defined(CONFIG_MEMCG_KMEM) || defined(CONFIG_MEM_ALLOC_PROFILING)
#define NEED_PCPUOBJ_EXT
#endif

@@ -79,7 +82,11 @@ struct pcpu_chunk {

static inline bool need_pcpuobj_ext(void)
{
- return !mem_cgroup_kmem_disabled();
+ if (IS_ENABLED(CONFIG_MEM_ALLOC_PROFILING))
+ return true;
+ if (!mem_cgroup_kmem_disabled())
+ return true;
+ return false;
}

extern spinlock_t pcpu_lock;
diff --git a/mm/percpu.c b/mm/percpu.c
index 95b26a6b718d..4e2592f2e58f 100644
--- a/mm/percpu.c
+++ b/mm/percpu.c
@@ -1701,6 +1701,32 @@ static void pcpu_memcg_free_hook(struct pcpu_chunk *chunk, int off, size_t size)
}
#endif /* CONFIG_MEMCG_KMEM */

+#ifdef CONFIG_MEM_ALLOC_PROFILING
+static void pcpu_alloc_tag_alloc_hook(struct pcpu_chunk *chunk, int off,
+ size_t size)
+{
+ if (mem_alloc_profiling_enabled() && likely(chunk->obj_exts)) {
+ alloc_tag_add(&chunk->obj_exts[off >> PCPU_MIN_ALLOC_SHIFT].tag,
+ current->alloc_tag, size);
+ }
+}
+
+static void pcpu_alloc_tag_free_hook(struct pcpu_chunk *chunk, int off, size_t size)
+{
+ if (mem_alloc_profiling_enabled() && likely(chunk->obj_exts))
+ alloc_tag_sub_noalloc(&chunk->obj_exts[off >> PCPU_MIN_ALLOC_SHIFT].tag, size);
+}
+#else
+static void pcpu_alloc_tag_alloc_hook(struct pcpu_chunk *chunk, int off,
+ size_t size)
+{
+}
+
+static void pcpu_alloc_tag_free_hook(struct pcpu_chunk *chunk, int off, size_t size)
+{
+}
+#endif
+
/**
* pcpu_alloc - the percpu allocator
* @size: size of area to allocate in bytes
--
2.40.1.495.gc816e09b53d-goog

Suren Baghdasaryan

May 1, 2023, 12:56:19 PM
Redefine __alloc_percpu, __alloc_percpu_gfp and __alloc_reserved_percpu
as macros that wrap __pcpu_alloc() in alloc_hooks(), so that allocations
made through them, and the matching deallocations, are recorded.

Signed-off-by: Kent Overstreet <kent.ov...@linux.dev>
Signed-off-by: Suren Baghdasaryan <sur...@google.com>
---
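A minimal user-space sketch, not part of the patch, of several macros
funnelling into one out-of-line allocator while each call site keeps its
own counter. It assumes the wrapper publishes the site's tag in a
"current tag" variable that the allocator reads, in the spirit of the
current->alloc_tag use in pcpu_alloc_tag_alloc_hook(). All names are
hypothetical, and MODEL_HOOKS() omits the third, failure-value argument
that alloc_hooks() takes in the patch because its handling is not shown
here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical per-call-site record. */
struct tag {
	const char *file;
	int line;
	size_t calls;		/* successful allocations from this site */
};

/* Stands in for current->alloc_tag: the site being serviced right now. */
static struct tag *current_tag;

/* Most recently charged tag; kept only so the demo can inspect it. */
static const struct tag *last_charged;

/* Single out-of-line entry point, loosely modelled on __pcpu_alloc(). */
static void *model_alloc(size_t size, size_t align, bool reserved,
			 unsigned int flags)
{
	void *p;

	(void)align;
	(void)reserved;
	(void)flags;
	p = calloc(1, size);
	if (p && current_tag) {
		current_tag->calls++;
		last_charged = current_tag;
	}
	return p;
}

/* Publish this site's tag, run the call, then restore the previous tag. */
#define MODEL_HOOKS(call, type)						\
({									\
	static struct tag _site_tag = { __FILE__, __LINE__, 0 };	\
	struct tag *_saved = current_tag;				\
	type _res;							\
									\
	current_tag = &_site_tag;					\
	_res = (call);							\
	current_tag = _saved;						\
	_res;								\
})

#define MODEL_ALLOC_FLAGS(size, align, flags)				\
	MODEL_HOOKS(model_alloc(size, align, false, flags), void *)
#define MODEL_ALLOC(size, align)					\
	MODEL_HOOKS(model_alloc(size, align, false, 0), void *)
#define MODEL_ALLOC_RESERVED(size, align)				\
	MODEL_HOOKS(model_alloc(size, align, true, 0), void *)

/* Returns 1 if each call site keeps its own count (valid on first call). */
static int sites_counted_separately(void)
{
	const struct tag *loop_site, *other_site;
	void *p;
	int i;

	for (i = 0; i < 3; i++) {
		p = MODEL_ALLOC(32, 8);
		assert(p);
		free(p);
	}
	loop_site = last_charged;

	p = MODEL_ALLOC_RESERVED(32, 8);
	assert(p);
	free(p);
	other_site = last_charged;

	return loop_site->calls == 3 && other_site->calls == 1 &&
	       loop_site != other_site && current_tag == NULL;
}
```

Saving and restoring the previous value keeps nested wrapped calls from
clobbering the outer site's tag in this model.
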
include/linux/percpu.h | 19 ++++++++----
mm/percpu.c | 66 +++++-------------------------------------
2 files changed, 22 insertions(+), 63 deletions(-)

diff --git a/include/linux/percpu.h b/include/linux/percpu.h
index 1338ea2aa720..51ec257379af 100644
--- a/include/linux/percpu.h
+++ b/include/linux/percpu.h
@@ -2,12 +2,14 @@
#ifndef __LINUX_PERCPU_H
#define __LINUX_PERCPU_H

+#include <linux/alloc_tag.h>
#include <linux/mmdebug.h>
#include <linux/preempt.h>
#include <linux/smp.h>
#include <linux/cpumask.h>
#include <linux/pfn.h>
#include <linux/init.h>
+#include <linux/sched.h>

#include <asm/percpu.h>

@@ -116,7 +118,6 @@ extern int __init pcpu_page_first_chunk(size_t reserved_size,
pcpu_fc_cpu_to_node_fn_t cpu_to_nd_fn);
#endif

-extern void __percpu *__alloc_reserved_percpu(size_t size, size_t align) __alloc_size(1);
extern bool __is_kernel_percpu_address(unsigned long addr, unsigned long *can_addr);
extern bool is_kernel_percpu_address(unsigned long addr);

@@ -124,10 +125,15 @@ extern bool is_kernel_percpu_address(unsigned long addr);
extern void __init setup_per_cpu_areas(void);
#endif

-extern void __percpu *__alloc_percpu_gfp(size_t size, size_t align, gfp_t gfp) __alloc_size(1);
-extern void __percpu *__alloc_percpu(size_t size, size_t align) __alloc_size(1);
-extern void free_percpu(void __percpu *__pdata);
-extern phys_addr_t per_cpu_ptr_to_phys(void *addr);
+extern void __percpu *__pcpu_alloc(size_t size, size_t align, bool reserved,
+ gfp_t gfp) __alloc_size(1);
+
+#define __alloc_percpu_gfp(_size, _align, _gfp) alloc_hooks( \
+ __pcpu_alloc(_size, _align, false, _gfp), void __percpu *, NULL)
+#define __alloc_percpu(_size, _align) alloc_hooks( \
+ __pcpu_alloc(_size, _align, false, GFP_KERNEL), void __percpu *, NULL)
+#define __alloc_reserved_percpu(_size, _align) alloc_hooks( \
+ __pcpu_alloc(_size, _align, true, GFP_KERNEL), void __percpu *, NULL)

#define alloc_percpu_gfp(type, gfp) \
(typeof(type) __percpu *)__alloc_percpu_gfp(sizeof(type), \
@@ -136,6 +142,9 @@ extern phys_addr_t per_cpu_ptr_to_phys(void *addr);
(typeof(type) __percpu *)__alloc_percpu(sizeof(type), \
__alignof__(type))

+extern void free_percpu(void __percpu *__pdata);
+extern phys_addr_t per_cpu_ptr_to_phys(void *addr);
+
extern unsigned long pcpu_nr_pages(void);

#endif /* __LINUX_PERCPU_H */
diff --git a/mm/percpu.c b/mm/percpu.c
index 4e2592f2e58f..4b5cf260d8e0 100644
--- a/mm/percpu.c
+++ b/mm/percpu.c
@@ -1728,7 +1728,7 @@ static void pcpu_alloc_tag_free_hook(struct pcpu_chunk *chunk, int off, size_t s
#endif

/**
- * pcpu_alloc - the percpu allocator
+ * __pcpu_alloc - the percpu allocator
* @size: size of area to allocate in bytes
* @align: alignment of area (max PAGE_SIZE)
* @reserved: allocate from the reserved chunk if available
@@ -1742,8 +1742,8 @@ static void pcpu_alloc_tag_free_hook(struct pcpu_chunk *chunk, int off, size_t s
* RETURNS:
* Percpu pointer to the allocated area on success, NULL on failure.
*/
-static void __percpu *pcpu_alloc(size_t size, size_t align, bool reserved,
- gfp_t gfp)
+void __percpu *__pcpu_alloc(size_t size, size_t align, bool reserved,
+ gfp_t gfp)
{
gfp_t pcpu_gfp;
bool is_atomic;
@@ -1909,6 +1909,8 @@ static void __percpu *pcpu_alloc(size_t size, size_t align, bool reserved,

pcpu_memcg_post_alloc_hook(objcg, chunk, off, size);

+ pcpu_alloc_tag_alloc_hook(chunk, off, size);
+
return ptr;

fail_unlock:
@@ -1935,61 +1937,7 @@ static void __percpu *pcpu_alloc(size_t size, size_t align, bool reserved,

return NULL;
}
-
-/**
- * __alloc_percpu_gfp - allocate dynamic percpu area
- * @size: size of area to allocate in bytes
- * @align: alignment of area (max PAGE_SIZE)
- * @gfp: allocation flags
- *
- * Allocate zero-filled percpu area of @size bytes aligned at @align. If
- * @gfp doesn't contain %GFP_KERNEL, the allocation doesn't block and can
- * be called from any context but is a lot more likely to fail. If @gfp
- * has __GFP_NOWARN then no warning will be triggered on invalid or failed
- * allocation requests.
- *
- * RETURNS:
- * Percpu pointer to the allocated area on success, NULL on failure.
- */
-void __percpu *__alloc_percpu_gfp(size_t size, size_t align, gfp_t gfp)
-{
- return pcpu_alloc(size, align, false, gfp);
-}
-EXPORT_SYMBOL_GPL(__alloc_percpu_gfp);
-
-/**
- * __alloc_percpu - allocate dynamic percpu area
- * @size: size of area to allocate in bytes
- * @align: alignment of area (max PAGE_SIZE)
- *
- * Equivalent to __alloc_percpu_gfp(size, align, %GFP_KERNEL).
- */
-void __percpu *__alloc_percpu(size_t size, size_t align)
-{
- return pcpu_alloc(size, align, false, GFP_KERNEL);
-}
-EXPORT_SYMBOL_GPL(__alloc_percpu);
-
-/**
- * __alloc_reserved_percpu - allocate reserved percpu area
- * @size: size of area to allocate in bytes
- * @align: alignment of area (max PAGE_SIZE)
- *
- * Allocate zero-filled percpu area of @size bytes aligned at @align
- * from reserved percpu area if arch has set it up; otherwise,
- * allocation is served from the same dynamic area. Might sleep.
- * Might trigger writeouts.
- *
- * CONTEXT:
- * Does GFP_KERNEL allocation.
- *
- * RETURNS:
- * Percpu pointer to the allocated area on success, NULL on failure.
- */
-void __percpu *__alloc_reserved_percpu(size_t size, size_t align)
-{
- return pcpu_alloc(size, align, true, GFP_KERNEL);
-}
+EXPORT_SYMBOL_GPL(__pcpu_alloc);

/**
* pcpu_balance_free - manage the amount of free chunks
@@ -2299,6 +2247,8 @@ void free_percpu(void __percpu *ptr)

size = pcpu_free_area(chunk, off);

+ pcpu_alloc_tag_free_hook(chunk, off, size);
+
pcpu_memcg_free_hook(chunk, off, size);

/*
--
2.40.1.495.gc816e09b53d-goog

Suren Baghdasaryan

May 1, 2023, 12:56:21 PM
From: Kent Overstreet <kent.ov...@linux.dev>

Replace linux/percpu.h include with asm/percpu.h to avoid circular
dependency.

Signed-off-by: Kent Overstreet <kent.ov...@linux.dev>
Signed-off-by: Suren Baghdasaryan <sur...@google.com>
---
arch/arm64/include/asm/spectre.h | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/arm64/include/asm/spectre.h b/arch/arm64/include/asm/spectre.h
index db7b371b367c..31823d9715ab 100644
--- a/arch/arm64/include/asm/spectre.h
+++ b/arch/arm64/include/asm/spectre.h
@@ -13,8 +13,8 @@
#define __BP_HARDEN_HYP_VECS_SZ ((BP_HARDEN_EL2_SLOTS - 1) * SZ_2K)

#ifndef __ASSEMBLY__
-
-#include <linux/percpu.h>
+#include <linux/smp.h>
+#include <asm/percpu.h>

#include <asm/cpufeature.h>
#include <asm/virt.h>
--
2.40.1.495.gc816e09b53d-goog

Suren Baghdasaryan

May 1, 2023, 12:56:24 PM
Make the save_stack() function part of the stackdepot API, as
stack_depot_capture_stack(), so that it can be used outside of page_owner.
Also rename task_struct's in_page_owner flag to in_capture_stack to better
convey its wider use.
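
For readers who want to see the recursion guard in isolation, here is a
minimal userspace sketch of the pattern, not the kernel code itself. It
assumes a per-task flag and two reserved sentinel handles; struct task,
handle_t, save_fn, capture_stack() and the save_* helpers are all
hypothetical stand-ins invented for this illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-ins for illustration only; these are not kernel APIs. */
typedef uint32_t handle_t;

enum {
	RECURSION_HANDLE  = 1,   /* returned when capture re-enters itself */
	FAILURE_HANDLE    = 2,   /* returned when saving the trace fails */
	FIRST_REAL_HANDLE = 100, /* first handle for a real saved trace */
};

struct task {
	bool in_capture_stack;   /* models the per-task recursion flag */
};

/* Models the save step: returns 0 on failure, a nonzero handle otherwise. */
typedef handle_t (*save_fn)(struct task *t);

static handle_t capture_stack(struct task *t, save_fn save)
{
	handle_t handle;

	/* A nested call from the save path gets the sentinel, not a trace. */
	if (t->in_capture_stack)
		return RECURSION_HANDLE;
	t->in_capture_stack = true;

	handle = save(t);
	if (!handle)
		handle = FAILURE_HANDLE;

	t->in_capture_stack = false;
	return handle;
}

static handle_t save_ok(struct task *t)
{
	(void)t;
	return FIRST_REAL_HANDLE;
}

static handle_t save_fail(struct task *t)
{
	(void)t;
	return 0;
}

/* A save path that allocates and therefore re-enters the capture. */
static handle_t save_reenters(struct task *t)
{
	handle_t inner = capture_stack(t, save_ok);

	assert(inner == RECURSION_HANDLE);
	return FIRST_REAL_HANDLE + 1;
}
```

In this model, calling capture_stack(&t, save_reenters) still produces a
real handle for the outer call, while the nested call is short-circuited
and never reaches save_ok().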

Signed-off-by: Suren Baghdasaryan <sur...@google.com>
---
include/linux/sched.h | 6 ++--
include/linux/stackdepot.h | 16 +++++++++
lib/stackdepot.c | 68 ++++++++++++++++++++++++++++++++++++++
mm/page_owner.c | 52 ++---------------------------
4 files changed, 90 insertions(+), 52 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 33708bf8f191..6eca46ab6d78 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -942,9 +942,9 @@ struct task_struct {
/* Stalled due to lack of memory */
unsigned in_memstall:1;
#endif
-#ifdef CONFIG_PAGE_OWNER
- /* Used by page_owner=on to detect recursion in page tracking. */
- unsigned in_page_owner:1;
+#ifdef CONFIG_STACKDEPOT
+ /* Used by stack_depot_capture_stack to detect recursion. */
+ unsigned in_capture_stack:1;
#endif
#ifdef CONFIG_EVENTFD
/* Recursion prevention for eventfd_signal() */
diff --git a/include/linux/stackdepot.h b/include/linux/stackdepot.h
index e58306783d8e..baf7e80cf449 100644
--- a/include/linux/stackdepot.h
+++ b/include/linux/stackdepot.h
@@ -164,4 +164,20 @@ depot_stack_handle_t __must_check stack_depot_set_extra_bits(
*/
unsigned int stack_depot_get_extra_bits(depot_stack_handle_t handle);

+/**
+ * stack_depot_capture_init - Initialize stack depot capture mechanism
+ *
+ * Return: Stack depot initialization status
+ */
+bool stack_depot_capture_init(void);
+
+/**
+ * stack_depot_capture_stack - Capture current stack trace into stack depot
+ *
+ * @flags: Allocation GFP flags
+ *
+ * Return: Handle of the stack trace stored in depot, 0 on failure
+ */
+depot_stack_handle_t stack_depot_capture_stack(gfp_t flags);
+
#endif
diff --git a/lib/stackdepot.c b/lib/stackdepot.c
index 2f5aa851834e..c7e5e22fcb16 100644
--- a/lib/stackdepot.c
+++ b/lib/stackdepot.c
@@ -539,3 +539,71 @@ unsigned int stack_depot_get_extra_bits(depot_stack_handle_t handle)
return parts.extra;
}
EXPORT_SYMBOL(stack_depot_get_extra_bits);
+
+static depot_stack_handle_t recursion_handle;
+static depot_stack_handle_t failure_handle;
+
+static __always_inline depot_stack_handle_t create_custom_stack(void)
+{
+ unsigned long entries[4];
+ unsigned int nr_entries;
+
+ nr_entries = stack_trace_save(entries, ARRAY_SIZE(entries), 0);
+ return stack_depot_save(entries, nr_entries, GFP_KERNEL);
+}
+
+static noinline void register_recursion_stack(void)
+{
+ recursion_handle = create_custom_stack();
+}
+
+static noinline void register_failure_stack(void)
+{
+ failure_handle = create_custom_stack();
+}
+
+bool stack_depot_capture_init(void)
+{
+ static DEFINE_MUTEX(stack_depot_capture_init_mutex);
+ static bool utility_stacks_ready;
+
+ mutex_lock(&stack_depot_capture_init_mutex);
+ if (!utility_stacks_ready) {
+ register_recursion_stack();
+ register_failure_stack();
+ utility_stacks_ready = true;
+ }
+ mutex_unlock(&stack_depot_capture_init_mutex);
+
+ return utility_stacks_ready;
+}
+
+/* TODO: teach stack_depot_capture_stack to use off stack temporary storage */
+#define CAPTURE_STACK_DEPTH (16)
+
+depot_stack_handle_t stack_depot_capture_stack(gfp_t flags)
+{
+ unsigned long entries[CAPTURE_STACK_DEPTH];
+ depot_stack_handle_t handle;
+ unsigned int nr_entries;
+
+ /*
+ * Avoid recursion.
+ *
+ * Sometimes page metadata allocation tracking requires more
+ * memory to be allocated:
+ * - when new stack trace is saved to stack depot
+ * - when backtrace itself is calculated (ia64)
+ */
+ if (current->in_capture_stack)
+ return recursion_handle;
+ current->in_capture_stack = 1;
+
+ nr_entries = stack_trace_save(entries, ARRAY_SIZE(entries), 2);
+ handle = stack_depot_save(entries, nr_entries, flags);
+ if (!handle)
+ handle = failure_handle;
+
+ current->in_capture_stack = 0;
+ return handle;
+}
diff --git a/mm/page_owner.c b/mm/page_owner.c
index 8b6086c666e6..9fafbc290d5b 100644
--- a/mm/page_owner.c
+++ b/mm/page_owner.c
@@ -15,12 +15,6 @@

#include "internal.h"

-/*
- * TODO: teach PAGE_OWNER_STACK_DEPTH (__dump_page_owner and save_stack)
- * to use off stack temporal storage
- */
-#define PAGE_OWNER_STACK_DEPTH (16)
-
struct page_owner {
unsigned short order;
short last_migrate_reason;
@@ -37,8 +31,6 @@ struct page_owner {
static bool page_owner_enabled __initdata;
DEFINE_STATIC_KEY_FALSE(page_owner_inited);

-static depot_stack_handle_t dummy_handle;
-static depot_stack_handle_t failure_handle;
static depot_stack_handle_t early_handle;

static void init_early_allocated_pages(void);
@@ -68,16 +60,6 @@ static __always_inline depot_stack_handle_t create_dummy_stack(void)
return stack_depot_save(entries, nr_entries, GFP_KERNEL);
}

-static noinline void register_dummy_stack(void)
-{
- dummy_handle = create_dummy_stack();
-}
-
-static noinline void register_failure_stack(void)
-{
- failure_handle = create_dummy_stack();
-}
-
static noinline void register_early_stack(void)
{
early_handle = create_dummy_stack();
@@ -88,8 +70,7 @@ static __init void init_page_owner(void)
if (!page_owner_enabled)
return;

- register_dummy_stack();
- register_failure_stack();
+ stack_depot_capture_init();
register_early_stack();
static_branch_enable(&page_owner_inited);
init_early_allocated_pages();
@@ -107,33 +88,6 @@ static inline struct page_owner *get_page_owner(struct page_ext *page_ext)
return (void *)page_ext + page_owner_ops.offset;
}

-static noinline depot_stack_handle_t save_stack(gfp_t flags)
-{
- unsigned long entries[PAGE_OWNER_STACK_DEPTH];
- depot_stack_handle_t handle;
- unsigned int nr_entries;
-
- /*
- * Avoid recursion.
- *
- * Sometimes page metadata allocation tracking requires more
- * memory to be allocated:
- * - when new stack trace is saved to stack depot
- * - when backtrace itself is calculated (ia64)
- */
- if (current->in_page_owner)
- return dummy_handle;
- current->in_page_owner = 1;
-
- nr_entries = stack_trace_save(entries, ARRAY_SIZE(entries), 2);
- handle = stack_depot_save(entries, nr_entries, flags);
- if (!handle)
- handle = failure_handle;
-
- current->in_page_owner = 0;
- return handle;
-}
-
void __reset_page_owner(struct page *page, unsigned short order)
{
int i;
@@ -146,7 +100,7 @@ void __reset_page_owner(struct page *page, unsigned short order)
if (unlikely(!page_ext))
return;

- handle = save_stack(GFP_NOWAIT | __GFP_NOWARN);
+ handle = stack_depot_capture_stack(GFP_NOWAIT | __GFP_NOWARN);
for (i = 0; i < (1 << order); i++) {
__clear_bit(PAGE_EXT_OWNER_ALLOCATED, &page_ext->flags);
page_owner = get_page_owner(page_ext);
@@ -189,7 +143,7 @@ noinline void __set_page_owner(struct page *page, unsigned short order,
struct page_ext *page_ext;
depot_stack_handle_t handle;

- handle = save_stack(gfp_mask);
+ handle = stack_depot_capture_stack(gfp_mask);

page_ext = page_ext_get(page);
if (unlikely(!page_ext))
--
2.40.1.495.gc816e09b53d-goog

Suren Baghdasaryan

May 1, 2023, 12:56:27 PM
Add support for code tag context capture when registering a new code tag
type. When context capture for a specific code tag is enabled,
codetag_ref will point to a codetag_ctx object which can be attached
to an application-specific object storing code invocation context.
codetag_ctx has a pointer to its codetag_with_ctx object, which embeds the
codetag object. All context objects of the same code tag are placed on the
codetag_with_ctx.ctx_head linked list. codetag.flags is used to indicate
when context capture for the associated code tag is initialized and
enabled.
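
To illustrate how one reference can point either at a codetag or at a
codetag_ctx, here is a minimal userspace sketch of the "flags as shared
first member" trick. The struct layouts are trimmed down (the list, lock
and kref are omitted), and is_ctx_ref() and ref_to_codetag() are
hypothetical helper names for this illustration. Unlike the patch, the
sketch reads the flags through a pointer to the first member, so that it
does not depend on the kernel's aliasing settings.

```c
#include <assert.h>
#include <stddef.h>

#define CTC_FLAG_CTX_PTR (1u << 0)

/* Trimmed-down models of the structures in the patch. */
struct codetag {
	unsigned int flags;  /* must stay the first member */
	unsigned int lineno;
};

struct codetag_with_ctx {
	struct codetag ct;   /* context list and lock omitted */
};

struct codetag_ctx {
	unsigned int flags;  /* must stay the first member */
	struct codetag_with_ctx *ctc;
};

union codetag_ref {
	struct codetag *ct;
	struct codetag_ctx *ctx;
};

_Static_assert(offsetof(struct codetag, flags) == 0, "flags must be first");
_Static_assert(offsetof(struct codetag_ctx, flags) == 0, "flags must be first");

/* Whatever the ref points at, its first word is a flags field. */
static int is_ctx_ref(const union codetag_ref *ref)
{
	const unsigned int *flags = (const unsigned int *)ref->ct;

	return !!(*flags & CTC_FLAG_CTX_PTR);
}

/* Resolve a reference to its codetag, directly or through the context. */
static struct codetag *ref_to_codetag(const union codetag_ref *ref)
{
	assert(ref && ref->ct);

	if (is_ctx_ref(ref))
		return &ref->ctx->ctc->ct;
	return ref->ct;
}
```

The point of the layout is that a plain codetag never has
CTC_FLAG_CTX_PTR set, so testing that one bit is enough to tell the two
cases apart without storing a separate discriminator in the reference.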

Signed-off-by: Suren Baghdasaryan <sur...@google.com>
---
include/linux/codetag.h | 50 +++++++++++++-
include/linux/codetag_ctx.h | 48 +++++++++++++
lib/codetag.c | 134 ++++++++++++++++++++++++++++++++++++
3 files changed, 231 insertions(+), 1 deletion(-)
create mode 100644 include/linux/codetag_ctx.h

diff --git a/include/linux/codetag.h b/include/linux/codetag.h
index 87207f199ac9..9ab2f017e845 100644
--- a/include/linux/codetag.h
+++ b/include/linux/codetag.h
@@ -5,8 +5,12 @@
#ifndef _LINUX_CODETAG_H
#define _LINUX_CODETAG_H

+#include <linux/container_of.h>
+#include <linux/spinlock.h>
#include <linux/types.h>

+struct kref;
+struct codetag_ctx;
struct codetag_iterator;
struct codetag_type;
struct seq_buf;
@@ -18,15 +22,38 @@ struct module;
* an array of these.
*/
struct codetag {
- unsigned int flags; /* used in later patches */
+ unsigned int flags; /* has to be the first member shared with codetag_ctx */
unsigned int lineno;
const char *modname;
const char *function;
const char *filename;
} __aligned(8);

+/* codetag_with_ctx flags */
+#define CTC_FLAG_CTX_PTR (1 << 0)
+#define CTC_FLAG_CTX_READY (1 << 1)
+#define CTC_FLAG_CTX_ENABLED (1 << 2)
+
+/*
+ * Code tag with context capture support. Contains a list to store context for
+ * each tag hit, a lock protecting the list and a flag to indicate whether
+ * context capture is enabled for the tag.
+ */
+struct codetag_with_ctx {
+ struct codetag ct;
+ struct list_head ctx_head;
+ spinlock_t ctx_lock;
+} __aligned(8);
+
+/*
+ * Tag reference can point to codetag directly or indirectly via codetag_ctx.
+ * Direct codetag pointer is used when context capture is disabled or not
+ * supported. When context capture for the tag is used, the reference points
+ * to the codetag_ctx through which the codetag can be reached.
+ */
union codetag_ref {
struct codetag *ct;
+ struct codetag_ctx *ctx;
};

struct codetag_range {
@@ -46,6 +73,7 @@ struct codetag_type_desc {
struct codetag_module *cmod);
bool (*module_unload)(struct codetag_type *cttype,
struct codetag_module *cmod);
+ void (*free_ctx)(struct kref *ref);
};

struct codetag_iterator {
@@ -53,6 +81,7 @@ struct codetag_iterator {
struct codetag_module *cmod;
unsigned long mod_id;
struct codetag *ct;
+ struct codetag_ctx *ctx;
};

#define CODE_TAG_INIT { \
@@ -63,9 +92,28 @@ struct codetag_iterator {
.flags = 0, \
}

+static inline bool is_codetag_ctx_ref(union codetag_ref *ref)
+{
+ return !!(ref->ct->flags & CTC_FLAG_CTX_PTR);
+}
+
+static inline
+struct codetag_with_ctx *ct_to_ctc(struct codetag *ct)
+{
+ return container_of(ct, struct codetag_with_ctx, ct);
+}
+
void codetag_lock_module_list(struct codetag_type *cttype, bool lock);
struct codetag_iterator codetag_get_ct_iter(struct codetag_type *cttype);
struct codetag *codetag_next_ct(struct codetag_iterator *iter);
+struct codetag_ctx *codetag_next_ctx(struct codetag_iterator *iter);
+
+bool codetag_enable_ctx(struct codetag_with_ctx *ctc, bool enable);
+static inline bool codetag_ctx_enabled(struct codetag_with_ctx *ctc)
+{
+ return !!(ctc->ct.flags & CTC_FLAG_CTX_ENABLED);
+}
+bool codetag_has_ctx(struct codetag_with_ctx *ctc);

void codetag_to_text(struct seq_buf *out, struct codetag *ct);

diff --git a/include/linux/codetag_ctx.h b/include/linux/codetag_ctx.h
new file mode 100644
index 000000000000..e741484f0e08
--- /dev/null
+++ b/include/linux/codetag_ctx.h
@@ -0,0 +1,48 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * code tag context
+ */
+#ifndef _LINUX_CODETAG_CTX_H
+#define _LINUX_CODETAG_CTX_H
+
+#include <linux/codetag.h>
+#include <linux/kref.h>
+
+/* Code tag hit context. */
+struct codetag_ctx {
+ unsigned int flags; /* has to be the first member shared with codetag */
+ struct codetag_with_ctx *ctc;
+ struct list_head node;
+ struct kref refcount;
+} __aligned(8);
+
+static inline struct codetag_ctx *kref_to_ctx(struct kref *refcount)
+{
+ return container_of(refcount, struct codetag_ctx, refcount);
+}
+
+static inline void add_ctx(struct codetag_ctx *ctx,
+ struct codetag_with_ctx *ctc)
+{
+ kref_init(&ctx->refcount);
+ spin_lock(&ctc->ctx_lock);
+ ctx->flags = CTC_FLAG_CTX_PTR;
+ ctx->ctc = ctc;
+ list_add_tail(&ctx->node, &ctc->ctx_head);
+ spin_unlock(&ctc->ctx_lock);
+}
+
+static inline void rem_ctx(struct codetag_ctx *ctx,
+ void (*free_ctx)(struct kref *refcount))
+{
+ struct codetag_with_ctx *ctc = ctx->ctc;
+
+ spin_lock(&ctc->ctx_lock);
+ /* ctx might have been removed while we were using it */
+ if (!list_empty(&ctx->node))
+ list_del_init(&ctx->node);
+ spin_unlock(&ctc->ctx_lock);
+ kref_put(&ctx->refcount, free_ctx);
+}
+
+#endif /* _LINUX_CODETAG_CTX_H */
diff --git a/lib/codetag.c b/lib/codetag.c
index 84f90f3b922c..d891bbe4481d 100644
--- a/lib/codetag.c
+++ b/lib/codetag.c
@@ -1,5 +1,6 @@
// SPDX-License-Identifier: GPL-2.0-only
#include <linux/codetag.h>
+#include <linux/codetag_ctx.h>
#include <linux/idr.h>
#include <linux/kallsyms.h>
#include <linux/module.h>
@@ -92,6 +93,139 @@ struct codetag *codetag_next_ct(struct codetag_iterator *iter)
return ct;
}

+static struct codetag_ctx *next_ctx_from_ct(struct codetag_iterator *iter)
+{
+ struct codetag_with_ctx *ctc;
+ struct codetag_ctx *ctx = NULL;
+ struct codetag *ct = iter->ct;
+
+ while (ct) {
+ if (!(ct->flags & CTC_FLAG_CTX_READY))
+ goto next;
+
+ ctc = ct_to_ctc(ct);
+ spin_lock(&ctc->ctx_lock);
+ if (!list_empty(&ctc->ctx_head)) {
+ ctx = list_first_entry(&ctc->ctx_head,
+ struct codetag_ctx, node);
+ kref_get(&ctx->refcount);
+ }
+ spin_unlock(&ctc->ctx_lock);
+ if (ctx)
+ break;
+next:
+ ct = codetag_next_ct(iter);
+ }
+
+ iter->ctx = ctx;
+ return ctx;
+}
+
+struct codetag_ctx *codetag_next_ctx(struct codetag_iterator *iter)
+{
+ struct codetag_ctx *ctx = iter->ctx;
+ struct codetag_ctx *found = NULL;
+
+ lockdep_assert_held(&iter->cttype->mod_lock);
+
+ if (!ctx)
+ return next_ctx_from_ct(iter);
+
+ spin_lock(&ctx->ctc->ctx_lock);
+ /*
+ * Do not advance if the object was isolated, restart at the same tag.
+ */
+ if (!list_empty(&ctx->node)) {
+ if (list_is_last(&ctx->node, &ctx->ctc->ctx_head)) {
+ /* Finished with this tag, advance to the next */
+ codetag_next_ct(iter);
+ } else {
+ found = list_next_entry(ctx, node);
+ kref_get(&found->refcount);
+ }
+ }
+ spin_unlock(&ctx->ctc->ctx_lock);
+ kref_put(&ctx->refcount, iter->cttype->desc.free_ctx);
+
+ if (!found)
+ return next_ctx_from_ct(iter);
+
+ iter->ctx = found;
+ return found;
+}
+
+static struct codetag_type *find_cttype(struct codetag *ct)
+{
+ struct codetag_module *cmod;
+ struct codetag_type *cttype;
+ unsigned long mod_id;
+ unsigned long tmp;
+
+ mutex_lock(&codetag_lock);
+ list_for_each_entry(cttype, &codetag_types, link) {
+ down_read(&cttype->mod_lock);
+ idr_for_each_entry_ul(&cttype->mod_idr, cmod, tmp, mod_id) {
+ if (ct >= cmod->range.start && ct < cmod->range.stop) {
+ up_read(&cttype->mod_lock);
+ goto found;
+ }
+ }
+ up_read(&cttype->mod_lock);
+ }
+ cttype = NULL;
+found:
+ mutex_unlock(&codetag_lock);
+
+ return cttype;
+}
+
+bool codetag_enable_ctx(struct codetag_with_ctx *ctc, bool enable)
+{
+ struct codetag_type *cttype = find_cttype(&ctc->ct);
+
+ if (!cttype || !cttype->desc.free_ctx)
+ return false;
+
+ lockdep_assert_held(&cttype->mod_lock);
+ BUG_ON(!rwsem_is_locked(&cttype->mod_lock));
+
+ if (codetag_ctx_enabled(ctc) == enable)
+ return false;
+
+ if (enable) {
+ /* Initialize context capture fields only once */
+ if (!(ctc->ct.flags & CTC_FLAG_CTX_READY)) {
+ spin_lock_init(&ctc->ctx_lock);
+ INIT_LIST_HEAD(&ctc->ctx_head);
+ ctc->ct.flags |= CTC_FLAG_CTX_READY;
+ }
+ ctc->ct.flags |= CTC_FLAG_CTX_ENABLED;
+ } else {
+ /*
+ * The list of context objects is intentionally left untouched.
+ * It can be read back and if context capture is re-enabled it
+ * will append new objects.
+ */
+ ctc->ct.flags &= ~CTC_FLAG_CTX_ENABLED;
+ }
+
+ return true;
+}
+
+bool codetag_has_ctx(struct codetag_with_ctx *ctc)
+{
+ bool no_ctx;
+
+ if (!(ctc->ct.flags & CTC_FLAG_CTX_READY))
+ return false;
+
+ spin_lock(&ctc->ctx_lock);
+ no_ctx = list_empty(&ctc->ctx_head);
+ spin_unlock(&ctc->ctx_lock);
+
+ return !no_ctx;
+}
+
void codetag_to_text(struct seq_buf *out, struct codetag *ct)
{
seq_buf_printf(out, "%s:%u module:%s func:%s",
--
2.40.1.495.gc816e09b53d-goog

Suren Baghdasaryan

May 1, 2023, 12:56:28 PM
Implement mechanisms for capturing allocation call context which consists
of:
- allocation size
- pid, tgid and name of the allocating task
- allocation timestamp
- allocation call stack
The patch creates the allocations.ctx debugfs file, which can be written
to enable or disable context capture for a specific code tag. Captured
context can be obtained by reading the same file.
Usage example:

echo "file include/asm-generic/pgalloc.h line 63 enable" > \
/sys/kernel/debug/allocations.ctx
cat /sys/kernel/debug/allocations.ctx
91.0MiB 212 include/asm-generic/pgalloc.h:63 module:pgtable func:__pte_alloc_one
size: 4096
pid: 1551
tgid: 1551
comm: cat
ts: 670109646361
call stack:
pte_alloc_one+0xfe/0x130
__pte_alloc+0x22/0x90
move_page_tables.part.0+0x994/0xa60
shift_arg_pages+0xa4/0x180
setup_arg_pages+0x286/0x2d0
load_elf_binary+0x4e1/0x18d0
bprm_execve+0x26b/0x660
do_execveat_common.isra.0+0x19d/0x220
__x64_sys_execve+0x2e/0x40
do_syscall_64+0x38/0x90
entry_SYSCALL_64_after_hwframe+0x63/0xcd

size: 4096
pid: 1551
tgid: 1551
comm: cat
ts: 670109711801
call stack:
pte_alloc_one+0xfe/0x130
__do_fault+0x52/0xc0
__handle_mm_fault+0x7d9/0xdd0
handle_mm_fault+0xc0/0x2b0
do_user_addr_fault+0x1c3/0x660
exc_page_fault+0x62/0x150
asm_exc_page_fault+0x22/0x30
...

echo "file include/asm-generic/pgalloc.h line 63 disable" > \
/sys/kernel/debug/allocations.ctx

Note that disabling context capture will not clear already captured
context but no new context will be captured.
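
The enable/disable semantics described above can be modelled with a short
userspace sketch. It follows the READY/ENABLED flag logic of
codetag_enable_ctx() from the previous patch, but it is only an
approximation: struct tag, tag_enable_ctx() and tag_hit() are
hypothetical names, and a plain counter stands in for the list of
captured context objects.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define CTC_FLAG_CTX_READY   (1u << 1)
#define CTC_FLAG_CTX_ENABLED (1u << 2)

/* Hypothetical model of one code tag; nr_ctx stands in for the ctx list. */
struct tag {
	unsigned int flags;
	unsigned int nr_ctx;
};

/* Returns true if the enabled state actually changed. */
static bool tag_enable_ctx(struct tag *t, bool enable)
{
	bool enabled = !!(t->flags & CTC_FLAG_CTX_ENABLED);

	if (enabled == enable)
		return false;

	if (enable) {
		/* Initialize the capture state only once. */
		if (!(t->flags & CTC_FLAG_CTX_READY)) {
			t->nr_ctx = 0;
			t->flags |= CTC_FLAG_CTX_READY;
		}
		t->flags |= CTC_FLAG_CTX_ENABLED;
	} else {
		/* Already captured context is intentionally kept. */
		t->flags &= ~CTC_FLAG_CTX_ENABLED;
	}
	return true;
}

/* Models one allocation hitting the tag. */
static void tag_hit(struct tag *t)
{
	assert(t != NULL);

	if (t->flags & CTC_FLAG_CTX_ENABLED)
		t->nr_ctx++;
}
```

In this model, disabling capture only clears the ENABLED bit, so earlier
records stay readable and re-enabling appends to them instead of starting
over.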

Signed-off-by: Suren Baghdasaryan <sur...@google.com>
---
include/linux/alloc_tag.h | 25 +++-
include/linux/codetag.h | 3 +-
include/linux/pgalloc_tag.h | 4 +-
lib/Kconfig.debug | 1 +
lib/alloc_tag.c | 238 +++++++++++++++++++++++++++++++++++-
lib/codetag.c | 20 +--
6 files changed, 272 insertions(+), 19 deletions(-)

diff --git a/include/linux/alloc_tag.h b/include/linux/alloc_tag.h
index 07922d81b641..2a3d248aae10 100644
--- a/include/linux/alloc_tag.h
+++ b/include/linux/alloc_tag.h
@@ -17,20 +17,29 @@
* an array of these. Embedded codetag utilizes codetag framework.
*/
struct alloc_tag {
- struct codetag ct;
+ struct codetag_with_ctx ctc;
struct lazy_percpu_counter bytes_allocated;
} __aligned(8);

#ifdef CONFIG_MEM_ALLOC_PROFILING

+static inline struct alloc_tag *ctc_to_alloc_tag(struct codetag_with_ctx *ctc)
+{
+ return container_of(ctc, struct alloc_tag, ctc);
+}
+
static inline struct alloc_tag *ct_to_alloc_tag(struct codetag *ct)
{
- return container_of(ct, struct alloc_tag, ct);
+ return container_of(ct_to_ctc(ct), struct alloc_tag, ctc);
}

+struct codetag_ctx *alloc_tag_create_ctx(struct alloc_tag *tag, size_t size);
+void alloc_tag_free_ctx(struct codetag_ctx *ctx, struct alloc_tag **ptag);
+bool alloc_tag_enable_ctx(struct alloc_tag *tag, bool enable);
+
#define DEFINE_ALLOC_TAG(_alloc_tag, _old) \
static struct alloc_tag _alloc_tag __used __aligned(8) \
- __section("alloc_tags") = { .ct = CODE_TAG_INIT }; \
+ __section("alloc_tags") = { .ctc.ct = CODE_TAG_INIT }; \
struct alloc_tag * __maybe_unused _old = alloc_tag_save(&_alloc_tag)

extern struct static_key_true mem_alloc_profiling_key;
@@ -54,7 +63,10 @@ static inline void __alloc_tag_sub(union codetag_ref *ref, size_t bytes,
if (!ref || !ref->ct)
return;

- tag = ct_to_alloc_tag(ref->ct);
+ if (is_codetag_ctx_ref(ref))
+ alloc_tag_free_ctx(ref->ctx, &tag);
+ else
+ tag = ct_to_alloc_tag(ref->ct);

if (may_allocate)
lazy_percpu_counter_add(&tag->bytes_allocated, -bytes);
@@ -88,7 +100,10 @@ static inline void alloc_tag_add(union codetag_ref *ref, struct alloc_tag *tag,
if (!ref || !tag)
return;

- ref->ct = &tag->ct;
+ if (codetag_ctx_enabled(&tag->ctc))
+ ref->ctx = alloc_tag_create_ctx(tag, bytes);
+ else
+ ref->ct = &tag->ctc.ct;
lazy_percpu_counter_add(&tag->bytes_allocated, bytes);
}

diff --git a/include/linux/codetag.h b/include/linux/codetag.h
index 9ab2f017e845..b6a2f0287a83 100644
--- a/include/linux/codetag.h
+++ b/include/linux/codetag.h
@@ -104,7 +104,8 @@ struct codetag_with_ctx *ct_to_ctc(struct codetag *ct)
}

void codetag_lock_module_list(struct codetag_type *cttype, bool lock);
-struct codetag_iterator codetag_get_ct_iter(struct codetag_type *cttype);
+void codetag_init_iter(struct codetag_iterator *iter,
+ struct codetag_type *cttype);
struct codetag *codetag_next_ct(struct codetag_iterator *iter);
struct codetag_ctx *codetag_next_ctx(struct codetag_iterator *iter);

diff --git a/include/linux/pgalloc_tag.h b/include/linux/pgalloc_tag.h
index 0cbba13869b5..e4661bbd40c6 100644
--- a/include/linux/pgalloc_tag.h
+++ b/include/linux/pgalloc_tag.h
@@ -6,6 +6,7 @@
#define _LINUX_PGALLOC_TAG_H

#include <linux/alloc_tag.h>
+#include <linux/codetag_ctx.h>

#ifdef CONFIG_MEM_ALLOC_PROFILING

@@ -70,7 +71,8 @@ static inline void pgalloc_tag_split(struct page *page, unsigned int nr)
if (!ref->ct)
goto out;

- tag = ct_to_alloc_tag(ref->ct);
+ tag = is_codetag_ctx_ref(ref) ? ctc_to_alloc_tag(ref->ctx->ctc)
+ : ct_to_alloc_tag(ref->ct);
page_ext = page_ext_next(page_ext);
for (i = 1; i < nr; i++) {
/* New reference with 0 bytes accounted */
diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug
index 4157c2251b07..1b83ef17d232 100644
--- a/lib/Kconfig.debug
+++ b/lib/Kconfig.debug
@@ -969,6 +969,7 @@ config MEM_ALLOC_PROFILING
select LAZY_PERCPU_COUNTER
select PAGE_EXTENSION
select SLAB_OBJ_EXT
+ select STACKDEPOT
help
Track allocation source code and record total allocation size
initiated at that code location. The mechanism can be used to track
diff --git a/lib/alloc_tag.c b/lib/alloc_tag.c
index 4a0b95a46b2e..675c7a08e38b 100644
--- a/lib/alloc_tag.c
+++ b/lib/alloc_tag.c
@@ -1,13 +1,18 @@
// SPDX-License-Identifier: GPL-2.0-only
#include <linux/alloc_tag.h>
+#include <linux/codetag_ctx.h>
#include <linux/debugfs.h>
#include <linux/fs.h>
#include <linux/gfp.h>
#include <linux/module.h>
#include <linux/page_ext.h>
+#include <linux/sched/clock.h>
#include <linux/seq_buf.h>
+#include <linux/stackdepot.h>
#include <linux/uaccess.h>

+#define STACK_BUF_SIZE 1024
+
DEFINE_STATIC_KEY_TRUE(mem_alloc_profiling_key);

/*
@@ -23,6 +28,16 @@ static int __init mem_alloc_profiling_disable(char *s)
}
__setup("nomem_profiling", mem_alloc_profiling_disable);

+struct alloc_call_ctx {
+ struct codetag_ctx ctx;
+ size_t size;
+ pid_t pid;
+ pid_t tgid;
+ char comm[TASK_COMM_LEN];
+ u64 ts_nsec;
+ depot_stack_handle_t stack_handle;
+} __aligned(8);
+
struct alloc_tag_file_iterator {
struct codetag_iterator ct_iter;
struct seq_buf buf;
@@ -64,7 +79,7 @@ static int allocations_file_open(struct inode *inode, struct file *file)
return -ENOMEM;

codetag_lock_module_list(cttype, true);
- iter->ct_iter = codetag_get_ct_iter(cttype);
+ codetag_init_iter(&iter->ct_iter, cttype);
codetag_lock_module_list(cttype, false);
seq_buf_init(&iter->buf, iter->rawbuf, sizeof(iter->rawbuf));
file->private_data = iter;
@@ -125,24 +140,240 @@ static const struct file_operations allocations_file_ops = {
.read = allocations_file_read,
};

+static void alloc_tag_ops_free_ctx(struct kref *refcount)
+{
+ kfree(container_of(kref_to_ctx(refcount), struct alloc_call_ctx, ctx));
+}
+
+struct codetag_ctx *alloc_tag_create_ctx(struct alloc_tag *tag, size_t size)
+{
+ struct alloc_call_ctx *ac_ctx;
+
+ /* TODO: use a dedicated kmem_cache */
+ ac_ctx = kmalloc(sizeof(struct alloc_call_ctx), GFP_KERNEL);
+ if (WARN_ON(!ac_ctx))
+ return NULL;
+
+ ac_ctx->size = size;
+ ac_ctx->pid = current->pid;
+ ac_ctx->tgid = current->tgid;
+ strscpy(ac_ctx->comm, current->comm, sizeof(ac_ctx->comm));
+ ac_ctx->ts_nsec = local_clock();
+ ac_ctx->stack_handle =
+ stack_depot_capture_stack(GFP_NOWAIT | __GFP_NOWARN);
+ add_ctx(&ac_ctx->ctx, &tag->ctc);
+
+ return &ac_ctx->ctx;
+}
+EXPORT_SYMBOL_GPL(alloc_tag_create_ctx);
+
+void alloc_tag_free_ctx(struct codetag_ctx *ctx, struct alloc_tag **ptag)
+{
+ *ptag = ctc_to_alloc_tag(ctx->ctc);
+ rem_ctx(ctx, alloc_tag_ops_free_ctx);
+}
+EXPORT_SYMBOL_GPL(alloc_tag_free_ctx);
+
+bool alloc_tag_enable_ctx(struct alloc_tag *tag, bool enable)
+{
+ static bool stack_depot_ready;
+
+ if (enable && !stack_depot_ready) {
+ stack_depot_init();
+ stack_depot_capture_init();
+ stack_depot_ready = true;
+ }
+
+ return codetag_enable_ctx(&tag->ctc, enable);
+}
+
+static void alloc_tag_ctx_to_text(struct seq_buf *out, struct codetag_ctx *ctx)
+{
+ struct alloc_call_ctx *ac_ctx;
+ char *buf;
+
+ ac_ctx = container_of(ctx, struct alloc_call_ctx, ctx);
+ seq_buf_printf(out, " size: %zu\n", ac_ctx->size);
+ seq_buf_printf(out, " pid: %d\n", ac_ctx->pid);
+ seq_buf_printf(out, " tgid: %d\n", ac_ctx->tgid);
+ seq_buf_printf(out, " comm: %s\n", ac_ctx->comm);
+ seq_buf_printf(out, " ts: %llu\n", ac_ctx->ts_nsec);
+
+ buf = kmalloc(STACK_BUF_SIZE, GFP_KERNEL);
+ if (buf) {
+ int bytes_read = stack_depot_snprint(ac_ctx->stack_handle, buf,
+ STACK_BUF_SIZE - 1, 8);
+ buf[bytes_read] = '\0';
+ seq_buf_printf(out, " call stack:\n%s\n", buf);
+ }
+ kfree(buf);
+}
+
+static ssize_t allocations_ctx_file_read(struct file *file, char __user *ubuf,
+ size_t size, loff_t *ppos)
+{
+ struct alloc_tag_file_iterator *iter = file->private_data;
+ struct codetag_iterator *ct_iter = &iter->ct_iter;
+ struct user_buf buf = { .buf = ubuf, .size = size };
+ struct codetag_ctx *ctx;
+ struct codetag *prev_ct;
+ int err = 0;
+
+ codetag_lock_module_list(ct_iter->cttype, true);
+ while (1) {
+ err = flush_ubuf(&buf, &iter->buf);
+ if (err || !buf.size)
+ break;
+
+ prev_ct = ct_iter->ct;
+ ctx = codetag_next_ctx(ct_iter);
+ if (!ctx)
+ break;
+
+ if (prev_ct != &ctx->ctc->ct)
+ alloc_tag_to_text(&iter->buf, &ctx->ctc->ct);
+ alloc_tag_ctx_to_text(&iter->buf, ctx);
+ }
+ codetag_lock_module_list(ct_iter->cttype, false);
+
+ return err ? : buf.ret;
+}
+
+#define CTX_CAPTURE_TOKENS() \
+ x(disable, 0) \
+ x(enable, 0)
+
+static const char * const ctx_capture_token_strs[] = {
+#define x(name, nr_args) #name,
+ CTX_CAPTURE_TOKENS()
+#undef x
+ NULL
+};
+
+enum ctx_capture_token {
+#define x(name, nr_args) TOK_##name,
+ CTX_CAPTURE_TOKENS()
+#undef x
+};
+
+static int enable_ctx_capture(struct codetag_type *cttype,
+ struct codetag_query *query, bool enable)
+{
+ struct codetag_iterator ct_iter;
+ struct codetag_with_ctx *ctc;
+ struct codetag *ct;
+ unsigned int nfound = 0;
+
+ codetag_lock_module_list(cttype, true);
+
+ codetag_init_iter(&ct_iter, cttype);
+ while ((ct = codetag_next_ct(&ct_iter))) {
+ if (!codetag_matches_query(query, ct, ct_iter.cmod, NULL))
+ continue;
+
+ ctc = ct_to_ctc(ct);
+ if (codetag_ctx_enabled(ctc) == enable)
+ continue;
+
+ if (!alloc_tag_enable_ctx(ctc_to_alloc_tag(ctc), enable)) {
+ pr_warn("Failed to toggle context capture\n");
+ continue;
+ }
+
+ nfound++;
+ }
+
+ codetag_lock_module_list(cttype, false);
+
+ return nfound ? 0 : -ENOENT;
+}
+
+static int parse_command(struct codetag_type *cttype, char *buf)
+{
+ struct codetag_query query = { NULL };
+ char *cmd;
+ int ret;
+ int tok;
+
+ buf = codetag_query_parse(&query, buf);
+ if (IS_ERR(buf))
+ return PTR_ERR(buf);
+
+ cmd = strsep_no_empty(&buf, " \t\r\n");
+ if (!cmd)
+ return -EINVAL; /* no command */
+
+ tok = match_string(ctx_capture_token_strs,
+ ARRAY_SIZE(ctx_capture_token_strs), cmd);
+ if (tok < 0)
+ return -EINVAL; /* unknown command */
+
+ ret = enable_ctx_capture(cttype, &query, tok == TOK_enable);
+ if (ret < 0)
+ return ret;
+
+ return 0;
+}
+
+static ssize_t allocations_ctx_file_write(struct file *file, const char __user *ubuf,
+ size_t len, loff_t *offp)
+{
+ struct alloc_tag_file_iterator *iter = file->private_data;
+ char tmpbuf[256];
+
+ if (len == 0)
+ return 0;
+ /* we don't check *offp -- multiple writes() are allowed */
+ if (len > sizeof(tmpbuf) - 1)
+ return -E2BIG;
+
+ if (copy_from_user(tmpbuf, ubuf, len))
+ return -EFAULT;
+
+ tmpbuf[len] = '\0';
+ parse_command(iter->ct_iter.cttype, tmpbuf);
+
+ *offp += len;
+ return len;
+}
+
+static const struct file_operations allocations_ctx_file_ops = {
+ .owner = THIS_MODULE,
+ .open = allocations_file_open,
+ .release = allocations_file_release,
+ .read = allocations_ctx_file_read,
+ .write = allocations_ctx_file_write,
+};
+
static int __init dbgfs_init(struct codetag_type *cttype)
{
struct dentry *file;
+ struct dentry *ctx_file;

file = debugfs_create_file("allocations", 0444, NULL, cttype,
&allocations_file_ops);
+ if (IS_ERR(file))
+ return PTR_ERR(file);
+
+ ctx_file = debugfs_create_file("allocations.ctx", 0666, NULL, cttype,
+ &allocations_ctx_file_ops);
+ if (IS_ERR(ctx_file)) {
+ debugfs_remove(file);
+ return PTR_ERR(ctx_file);
+ }

- return IS_ERR(file) ? PTR_ERR(file) : 0;
+ return 0;
}

static bool alloc_tag_module_unload(struct codetag_type *cttype, struct codetag_module *cmod)
{
- struct codetag_iterator iter = codetag_get_ct_iter(cttype);
+ struct codetag_iterator iter;
bool module_unused = true;
struct alloc_tag *tag;
struct codetag *ct;
size_t bytes;

+ codetag_init_iter(&iter, cttype);
for (ct = codetag_next_ct(&iter); ct; ct = codetag_next_ct(&iter)) {
if (iter.cmod != cmod)
continue;
@@ -183,6 +414,7 @@ static int __init alloc_tag_init(void)
.section = "alloc_tags",
.tag_size = sizeof(struct alloc_tag),
.module_unload = alloc_tag_module_unload,
+ .free_ctx = alloc_tag_ops_free_ctx,
};

cttype = codetag_register_type(&desc);
diff --git a/lib/codetag.c b/lib/codetag.c
index d891bbe4481d..cbff146b3fe8 100644
--- a/lib/codetag.c
+++ b/lib/codetag.c
@@ -27,16 +27,14 @@ void codetag_lock_module_list(struct codetag_type *cttype, bool lock)
up_read(&cttype->mod_lock);
}

-struct codetag_iterator codetag_get_ct_iter(struct codetag_type *cttype)
+void codetag_init_iter(struct codetag_iterator *iter,
+ struct codetag_type *cttype)
{
- struct codetag_iterator iter = {
- .cttype = cttype,
- .cmod = NULL,
- .mod_id = 0,
- .ct = NULL,
- };
-
- return iter;
+ iter->cttype = cttype;
+ iter->cmod = NULL;
+ iter->mod_id = 0;
+ iter->ct = NULL;
+ iter->ctx = NULL;
}

static inline struct codetag *get_first_module_ct(struct codetag_module *cmod)
@@ -128,6 +126,10 @@ struct codetag_ctx *codetag_next_ctx(struct codetag_iterator *iter)

lockdep_assert_held(&iter->cttype->mod_lock);

+ /* Move to the first codetag if search just started */
+ if (!iter->ct)
+ codetag_next_ct(iter);
+
if (!ctx)
return next_ctx_from_ct(iter);

--
2.40.1.495.gc816e09b53d-goog
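
A note on the command parsing in the patch above: CTX_CAPTURE_TOKENS() is an x-macro, a single list that is expanded twice to produce both the string table and the matching enum, so the two cannot drift apart. The following is a minimal userspace sketch of that pattern, not the kernel code itself. lookup_token() and TOK_COUNT are hypothetical names introduced here, with lookup_token() standing in for the kernel's match_string(), and the nr_args parameter is dropped for brevity.

```c
#include <stddef.h>
#include <string.h>

/* One list drives both the enum and the string table (x-macro pattern). */
#define CTX_CAPTURE_TOKENS() \
	x(disable)           \
	x(enable)

enum ctx_capture_token {
#define x(name) TOK_##name,
	CTX_CAPTURE_TOKENS()
#undef x
	TOK_COUNT	/* hypothetical: number of tokens, not in the patch */
};

static const char *const ctx_capture_token_strs[] = {
#define x(name) #name,
	CTX_CAPTURE_TOKENS()
#undef x
};

/*
 * Hypothetical userspace stand-in for match_string(): returns the index
 * of cmd in the table, or -1 if it is not a known command.
 */
static int lookup_token(const char *cmd)
{
	size_t i;

	for (i = 0; i < TOK_COUNT; i++)
		if (strcmp(ctx_capture_token_strs[i], cmd) == 0)
			return (int)i;
	return -1;
}
```

Because the index returned by the lookup is also the enum value, a caller can compare it directly against TOK_enable, which is what parse_command() does with the result of match_string().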

Suren Baghdasaryan

May 1, 2023, 12:56:31 PM
Include allocations in show_mem reports: print the ten allocation tags
with the largest allocated byte counts.

Signed-off-by: Kent Overstreet <kent.ov...@linux.dev>
Signed-off-by: Suren Baghdasaryan <sur...@google.com>
---
include/linux/alloc_tag.h | 2 ++
lib/alloc_tag.c | 48 +++++++++++++++++++++++++++++++++++----
lib/show_mem.c | 15 ++++++++++++
3 files changed, 60 insertions(+), 5 deletions(-)

diff --git a/include/linux/alloc_tag.h b/include/linux/alloc_tag.h
index 2a3d248aae10..190ab793f7e5 100644
--- a/include/linux/alloc_tag.h
+++ b/include/linux/alloc_tag.h
@@ -23,6 +23,8 @@ struct alloc_tag {

#ifdef CONFIG_MEM_ALLOC_PROFILING

+void alloc_tags_show_mem_report(struct seq_buf *s);
+
static inline struct alloc_tag *ctc_to_alloc_tag(struct codetag_with_ctx *ctc)
{
return container_of(ctc, struct alloc_tag, ctc);
diff --git a/lib/alloc_tag.c b/lib/alloc_tag.c
index 675c7a08e38b..e2ebab8999a9 100644
--- a/lib/alloc_tag.c
+++ b/lib/alloc_tag.c
@@ -13,6 +13,8 @@

#define STACK_BUF_SIZE 1024

+static struct codetag_type *alloc_tag_cttype;
+
DEFINE_STATIC_KEY_TRUE(mem_alloc_profiling_key);

/*
@@ -133,6 +135,43 @@ static ssize_t allocations_file_read(struct file *file, char __user *ubuf,
return err ? : buf.ret;
}

+void alloc_tags_show_mem_report(struct seq_buf *s)
+{
+ struct codetag_iterator iter;
+ struct codetag *ct;
+ struct {
+ struct codetag *tag;
+ size_t bytes;
+ } tags[10], n;
+ unsigned int i, nr = 0;
+
+ codetag_init_iter(&iter, alloc_tag_cttype);
+
+ codetag_lock_module_list(alloc_tag_cttype, true);
+ while ((ct = codetag_next_ct(&iter))) {
+ n.tag = ct;
+ n.bytes = lazy_percpu_counter_read(&ct_to_alloc_tag(ct)->bytes_allocated);
+
+ for (i = 0; i < nr; i++)
+ if (n.bytes > tags[i].bytes)
+ break;
+
+ if (i < ARRAY_SIZE(tags)) {
+ nr -= nr == ARRAY_SIZE(tags);
+ memmove(&tags[i + 1],
+ &tags[i],
+ sizeof(tags[0]) * (nr - i));
+ nr++;
+ tags[i] = n;
+ }
+ }
+
+ for (i = 0; i < nr; i++)
+ alloc_tag_to_text(s, tags[i].tag);
+
+ codetag_lock_module_list(alloc_tag_cttype, false);
+}
+
static const struct file_operations allocations_file_ops = {
.owner = THIS_MODULE,
.open = allocations_file_open,
@@ -409,7 +448,6 @@ EXPORT_SYMBOL(page_alloc_tagging_ops);

static int __init alloc_tag_init(void)
{
- struct codetag_type *cttype;
const struct codetag_type_desc desc = {
.section = "alloc_tags",
.tag_size = sizeof(struct alloc_tag),
@@ -417,10 +455,10 @@ static int __init alloc_tag_init(void)
.free_ctx = alloc_tag_ops_free_ctx,
};

- cttype = codetag_register_type(&desc);
- if (IS_ERR_OR_NULL(cttype))
- return PTR_ERR(cttype);
+ alloc_tag_cttype = codetag_register_type(&desc);
+ if (IS_ERR_OR_NULL(alloc_tag_cttype))
+ return PTR_ERR(alloc_tag_cttype);

- return dbgfs_init(cttype);
+ return dbgfs_init(alloc_tag_cttype);
}
module_init(alloc_tag_init);
diff --git a/lib/show_mem.c b/lib/show_mem.c
index 1485c87be935..5c82f29168e3 100644
--- a/lib/show_mem.c
+++ b/lib/show_mem.c
@@ -7,6 +7,7 @@

#include <linux/mm.h>
#include <linux/cma.h>
+#include <linux/seq_buf.h>

void __show_mem(unsigned int filter, nodemask_t *nodemask, int max_zone_idx)
{
@@ -34,4 +35,18 @@ void __show_mem(unsigned int filter, nodemask_t *nodemask, int max_zone_idx)
#ifdef CONFIG_MEMORY_FAILURE
printk("%lu pages hwpoisoned\n", atomic_long_read(&num_poisoned_pages));
#endif
+#ifdef CONFIG_MEM_ALLOC_PROFILING
+ {
+ struct seq_buf s;
+ char *buf = kmalloc(4096, GFP_ATOMIC);
+
+ if (buf) {
+ printk("Memory allocations:\n");
+ seq_buf_init(&s, buf, 4096);
+ alloc_tags_show_mem_report(&s);
+ printk("%s", buf);
+ kfree(buf);
+ }
+ }
+#endif
}
--
2.40.1.495.gc816e09b53d-goog
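
A note on alloc_tags_show_mem_report() in the patch above: it keeps a small fixed-size table sorted by size and inserts each tag into it, dropping the smallest entry once the table is full. The following is a minimal userspace sketch of that selection step under simplified assumptions. struct top_entry and top_n_insert() are hypothetical names, and the capacity is passed as a parameter instead of being fixed at 10.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for the (tag, bytes) pair used in the patch. */
struct top_entry {
	const char *name;
	size_t bytes;
};

/*
 * Insert e into arr, which holds *nr entries sorted by descending bytes
 * and has room for cap entries. When the table is full the smallest
 * entry is dropped; entries smaller than everything in a full table are
 * ignored.
 */
static void top_n_insert(struct top_entry *arr, size_t *nr, size_t cap,
			 struct top_entry e)
{
	size_t i;

	/* Find the first slot whose entry is smaller than the new one. */
	for (i = 0; i < *nr; i++)
		if (e.bytes > arr[i].bytes)
			break;

	if (i >= cap)
		return;

	if (*nr == cap)
		(*nr)--;	/* make room by forgetting the last entry */

	/* Shift the tail down by one and place the new entry. */
	memmove(&arr[i + 1], &arr[i], sizeof(arr[0]) * (*nr - i));
	arr[i] = e;
	(*nr)++;
}
```

Each insertion costs at most cap comparisons and one memmove, so a full pass over all tags needs no allocation and no sort, which suits a report that may run under memory pressure.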

Suren Baghdasaryan

May 1, 2023, 12:56:33 PM
objext objects are created with the __GFP_NO_OBJ_EXT flag and therefore
have no corresponding objext themselves (otherwise we would get infinite
recursion). When these objects are freed, their codetag is empty, and
with CONFIG_MEM_ALLOC_PROFILING_DEBUG enabled this leads to false
warnings. Introduce a special CODETAG_EMPTY codetag value to mark
allocations which intentionally lack a codetag, avoiding these warnings.
Set objext codetags to CODETAG_EMPTY before freeing to indicate that
the codetag is expected to be empty.

Signed-off-by: Suren Baghdasaryan <sur...@google.com>
---
include/linux/alloc_tag.h | 28 ++++++++++++++++++++++++++++
mm/slab.h | 33 +++++++++++++++++++++++++++++++++
mm/slab_common.c | 1 +
3 files changed, 62 insertions(+)

diff --git a/include/linux/alloc_tag.h b/include/linux/alloc_tag.h
index 190ab793f7e5..2c3f4f3a8c93 100644
--- a/include/linux/alloc_tag.h
+++ b/include/linux/alloc_tag.h
@@ -51,6 +51,28 @@ static inline bool mem_alloc_profiling_enabled(void)
return static_branch_likely(&mem_alloc_profiling_key);
}

+#ifdef CONFIG_MEM_ALLOC_PROFILING_DEBUG
+
+#define CODETAG_EMPTY (void *)1
+
+static inline bool is_codetag_empty(union codetag_ref *ref)
+{
+ return ref->ct == CODETAG_EMPTY;
+}
+
+static inline void set_codetag_empty(union codetag_ref *ref)
+{
+ if (ref)
+ ref->ct = CODETAG_EMPTY;
+}
+
+#else /* CONFIG_MEM_ALLOC_PROFILING_DEBUG */
+
+static inline bool is_codetag_empty(union codetag_ref *ref) { return false; }
+static inline void set_codetag_empty(union codetag_ref *ref) {}
+
+#endif /* CONFIG_MEM_ALLOC_PROFILING_DEBUG */
+
static inline void __alloc_tag_sub(union codetag_ref *ref, size_t bytes,
bool may_allocate)
{
@@ -65,6 +87,11 @@ static inline void __alloc_tag_sub(union codetag_ref *ref, size_t bytes,
if (!ref || !ref->ct)
return;

+ if (is_codetag_empty(ref)) {
+ ref->ct = NULL;
+ return;
+ }
+
if (is_codetag_ctx_ref(ref))
alloc_tag_free_ctx(ref->ctx, &tag);
else
@@ -112,6 +139,7 @@ static inline void alloc_tag_add(union codetag_ref *ref, struct alloc_tag *tag,
#else

#define DEFINE_ALLOC_TAG(_alloc_tag, _old)
+static inline void set_codetag_empty(union codetag_ref *ref) {}
static inline void alloc_tag_sub(union codetag_ref *ref, size_t bytes) {}
static inline void alloc_tag_sub_noalloc(union codetag_ref *ref, size_t bytes) {}
static inline void alloc_tag_add(union codetag_ref *ref, struct alloc_tag *tag,
diff --git a/mm/slab.h b/mm/slab.h
index f9442d3a10b2..50d86008a86a 100644
--- a/mm/slab.h
+++ b/mm/slab.h
@@ -416,6 +416,31 @@ static inline struct slabobj_ext *slab_obj_exts(struct slab *slab)
int alloc_slab_obj_exts(struct slab *slab, struct kmem_cache *s,
gfp_t gfp, bool new_slab);

+
+#ifdef CONFIG_MEM_ALLOC_PROFILING_DEBUG
+
+static inline void mark_objexts_empty(struct slabobj_ext *obj_exts)
+{
+ struct slabobj_ext *slab_exts;
+ struct slab *obj_exts_slab;
+
+ obj_exts_slab = virt_to_slab(obj_exts);
+ slab_exts = slab_obj_exts(obj_exts_slab);
+ if (slab_exts) {
+ unsigned int offs = obj_to_index(obj_exts_slab->slab_cache,
+ obj_exts_slab, obj_exts);
+ /* codetag should be NULL */
+ WARN_ON(slab_exts[offs].ref.ct);
+ set_codetag_empty(&slab_exts[offs].ref);
+ }
+}
+
+#else /* CONFIG_MEM_ALLOC_PROFILING_DEBUG */
+
+static inline void mark_objexts_empty(struct slabobj_ext *obj_exts) {}
+
+#endif /* CONFIG_MEM_ALLOC_PROFILING_DEBUG */
+
static inline bool need_slab_obj_ext(void)
{
#ifdef CONFIG_MEM_ALLOC_PROFILING
@@ -437,6 +462,14 @@ static inline void free_slab_obj_exts(struct slab *slab)
if (!obj_exts)
return;

+ /*
+ * obj_exts was created with __GFP_NO_OBJ_EXT flag, therefore its
+ * corresponding extension will be NULL. alloc_tag_sub() will throw a
+ * warning if slab has extensions but the extension of an object is
+ * NULL, therefore replace NULL with CODETAG_EMPTY to indicate that
+ * the extension for obj_exts is expected to be NULL.
+ */
+ mark_objexts_empty(obj_exts);
kfree(obj_exts);
slab->obj_exts = 0;
}
diff --git a/mm/slab_common.c b/mm/slab_common.c
index a05333bbb7f1..89265f825c43 100644
--- a/mm/slab_common.c
+++ b/mm/slab_common.c
@@ -244,6 +244,7 @@ int alloc_slab_obj_exts(struct slab *slab, struct kmem_cache *s,
* assign slabobj_exts in parallel. In this case the existing
* objcg vector should be reused.
*/
+ mark_objexts_empty(vec);
kfree(vec);
return 0;
}
--
2.40.1.495.gc816e09b53d-goog
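
A note on the CODETAG_EMPTY patch above: the idea is to distinguish "no tag, and that is a bug" (NULL) from "no tag, on purpose" (a sentinel pointer value that can never be a real tag address). The following is a minimal userspace sketch of that distinction, not the kernel implementation. All names (struct tag, struct tag_ref, TAG_EMPTY, ref_sub()) are hypothetical, and the debug warning is modeled as a false return value, which is an assumption made here for testability.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for an allocation tag with a byte counter. */
struct tag {
	long bytes;
};

/* Sentinel mirroring CODETAG_EMPTY: address 1 is never a valid tag. */
#define TAG_EMPTY ((struct tag *)(uintptr_t)1)

struct tag_ref {
	struct tag *tag;
};

static bool ref_is_empty(const struct tag_ref *ref)
{
	return ref->tag == TAG_EMPTY;
}

static void ref_set_empty(struct tag_ref *ref)
{
	if (ref)
		ref->tag = TAG_EMPTY;
}

/*
 * Account a free against ref. Returns false where a debug build would
 * warn (the object has no tag and was not marked as intentionally
 * untagged), true otherwise.
 */
static bool ref_sub(struct tag_ref *ref, long bytes)
{
	if (!ref)
		return true;
	if (!ref->tag)
		return false;		/* unexpected: never tagged */
	if (ref_is_empty(ref)) {
		ref->tag = NULL;	/* expected: intentionally untagged */
		return true;
	}
	ref->tag->bytes -= bytes;
	ref->tag = NULL;
	return true;
}
```

The sentinel check has to come before any dereference of the tag pointer, which matches the placement of the is_codetag_empty() test in __alloc_tag_sub().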

Suren Baghdasaryan

May 1, 2023, 12:56:36 PM
To avoid debug warnings when freeing reserved pages, which were not
allocated with the usual allocators, mark their codetags as empty before
freeing.
Maybe we can annotate reserved pages correctly and avoid this?

Signed-off-by: Suren Baghdasaryan <sur...@google.com>
---
include/linux/mm.h | 8 ++++++++
1 file changed, 8 insertions(+)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 27ce77080c79..f5969cb85879 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -5,6 +5,7 @@
#include <linux/errno.h>
#include <linux/mmdebug.h>
#include <linux/gfp.h>
+#include <linux/pgalloc_tag.h>
#include <linux/bug.h>
#include <linux/list.h>
#include <linux/mmzone.h>
@@ -2920,6 +2921,13 @@ extern void reserve_bootmem_region(phys_addr_t start, phys_addr_t end);
/* Free the reserved page into the buddy system, so it gets managed. */
static inline void free_reserved_page(struct page *page)
{
+ union codetag_ref *ref;
+
+ ref = get_page_tag_ref(page);
+ if (ref) {
+ set_codetag_empty(ref);
+ put_page_tag_ref(ref);
+ }
ClearPageReserved(page);
init_page_count(page);
__free_page(page);
--
2.40.1.495.gc816e09b53d-goog

Suren Baghdasaryan

May 1, 2023, 12:56:38 PM
If slabobj_ext vector allocation for a slab object fails and later
succeeds for another object in the same slab, the slabobj_ext for the
original object will be NULL and will be flagged when
CONFIG_MEM_ALLOC_PROFILING_DEBUG is enabled.
Mark failed slabobj_ext vector allocations using a new objext_flags flag
stored in the lower bits of slab->obj_exts. When a later allocation
succeeds, it marks all tag references in the same slabobj_ext vector as
empty to avoid the warnings raised by CONFIG_MEM_ALLOC_PROFILING_DEBUG
checks.

Signed-off-by: Suren Baghdasaryan <sur...@google.com>
---
include/linux/memcontrol.h | 4 +++-
mm/slab_common.c | 27 +++++++++++++++++++++++++--
2 files changed, 28 insertions(+), 3 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index c7f21b15b540..3eb8975c1462 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -356,8 +356,10 @@ enum page_memcg_data_flags {
#endif /* CONFIG_MEMCG */

enum objext_flags {
+ /* slabobj_ext vector failed to allocate */
+ OBJEXTS_ALLOC_FAIL = __FIRST_OBJEXT_FLAG,
/* the next bit after the last actual flag */
- __NR_OBJEXTS_FLAGS = __FIRST_OBJEXT_FLAG,
+ __NR_OBJEXTS_FLAGS = (__FIRST_OBJEXT_FLAG << 1),
};

#define OBJEXTS_FLAGS_MASK (__NR_OBJEXTS_FLAGS - 1)
diff --git a/mm/slab_common.c b/mm/slab_common.c
index 89265f825c43..5b7e096b70a5 100644
--- a/mm/slab_common.c
+++ b/mm/slab_common.c
@@ -217,21 +217,44 @@ int alloc_slab_obj_exts(struct slab *slab, struct kmem_cache *s,
{
unsigned int objects = objs_per_slab(s, slab);
unsigned long obj_exts;
- void *vec;
+ struct slabobj_ext *vec;

gfp &= ~OBJCGS_CLEAR_MASK;
/* Prevent recursive extension vector allocation */
gfp |= __GFP_NO_OBJ_EXT;
vec = kcalloc_node(objects, sizeof(struct slabobj_ext), gfp,
slab_nid(slab));
- if (!vec)
+ if (!vec) {
+#ifdef CONFIG_MEM_ALLOC_PROFILING_DEBUG
+ if (new_slab) {
+ /* Mark vectors which failed to allocate */
+ slab->obj_exts = OBJEXTS_ALLOC_FAIL;
+#ifdef CONFIG_MEMCG
+ slab->obj_exts |= MEMCG_DATA_OBJEXTS;
+#endif
+ }
+#endif
return -ENOMEM;
+ }

obj_exts = (unsigned long)vec;
#ifdef CONFIG_MEMCG
obj_exts |= MEMCG_DATA_OBJEXTS;
#endif
if (new_slab) {
+#ifdef CONFIG_MEM_ALLOC_PROFILING_DEBUG
+ /*
+ * If vector previously failed to allocate then we have live
+ * objects with no tag reference. Mark all references in this
+ * vector as empty to avoid warnings later on.
+ */
+ if (slab->obj_exts & OBJEXTS_ALLOC_FAIL) {
+ unsigned int i;
+
+ for (i = 0; i < objects; i++)
+ set_codetag_empty(&vec[i].ref);
+ }
+#endif
/*
* If the slab is brand new and nobody can yet access its
* obj_exts, no synchronization is required and obj_exts can
--
2.40.1.495.gc816e09b53d-goog
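
A note on the OBJEXTS_ALLOC_FAIL patch above: it relies on the vector pointer being aligned, so its low bits are always zero and can carry flags in the same word. The following is a minimal userspace sketch of that packing. The names and the two-bit layout (EXT_ALLOC_FAIL, EXT_FLAGS_MASK, ext_pack(), ext_vec(), ext_alloc_failed()) are hypothetical and chosen for illustration; the kernel derives its real mask from __NR_OBJEXTS_FLAGS.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical layout: the two low bits of the word hold flags. */
#define EXT_ALLOC_FAIL ((uintptr_t)0x1)
#define EXT_FLAGS_MASK ((uintptr_t)0x3)

/* Combine a vector pointer (at least 4-byte aligned) with flag bits. */
static uintptr_t ext_pack(void *vec, uintptr_t flags)
{
	return (uintptr_t)vec | (flags & EXT_FLAGS_MASK);
}

/* Recover the vector pointer by masking the flag bits off. */
static void *ext_vec(uintptr_t word)
{
	return (void *)(word & ~EXT_FLAGS_MASK);
}

/* True if an earlier vector allocation for this slab failed. */
static bool ext_alloc_failed(uintptr_t word)
{
	return (word & EXT_ALLOC_FAIL) != 0;
}
```

A word holding only the failure flag decodes to a NULL vector, which is how the patch records a failed allocation without needing a separate field.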

Suren Baghdasaryan

May 1, 2023, 12:56:40 PM
From: Kent Overstreet <kent.ov...@linux.dev>

The new code & libraries added are being maintained - mark them as such.

Signed-off-by: Kent Overstreet <kent.ov...@linux.dev>
Signed-off-by: Suren Baghdasaryan <sur...@google.com>
---
MAINTAINERS | 22 ++++++++++++++++++++++
1 file changed, 22 insertions(+)

diff --git a/MAINTAINERS b/MAINTAINERS
index 3889d1adf71f..6f3b79266204 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -5116,6 +5116,13 @@ S: Supported
F: Documentation/process/code-of-conduct-interpretation.rst
F: Documentation/process/code-of-conduct.rst

+CODE TAGGING
+M: Suren Baghdasaryan <sur...@google.com>
+M: Kent Overstreet <kent.ov...@linux.dev>
+S: Maintained
+F: include/linux/codetag.h
+F: lib/codetag.c
+
COMEDI DRIVERS
M: Ian Abbott <abb...@mev.co.uk>
M: H Hartley Sweeten <hswe...@visionengravers.com>
@@ -11658,6 +11665,12 @@ S: Maintained
F: Documentation/devicetree/bindings/leds/backlight/kinetic,ktz8866.yaml
F: drivers/video/backlight/ktz8866.c

+LAZY PERCPU COUNTERS
+M: Kent Overstreet <kent.ov...@linux.dev>
+S: Maintained
+F: include/linux/lazy-percpu-counter.h
+F: lib/lazy-percpu-counter.c
+
L3MDEV
M: David Ahern <dsa...@kernel.org>
L: net...@vger.kernel.org
@@ -13468,6 +13481,15 @@ F: mm/memblock.c
F: mm/mm_init.c
F: tools/testing/memblock/

+MEMORY ALLOCATION PROFILING
+M: Suren Baghdasaryan <sur...@google.com>
+M: Kent Overstreet <kent.ov...@linux.dev>
+S: Maintained
+F: include/linux/alloc_tag.h
+F: include/linux/codetag_ctx.h
+F: lib/alloc_tag.c
+F: lib/pgalloc_tag.c
+
MEMORY CONTROLLER DRIVERS
M: Krzysztof Kozlowski <krzysztof...@linaro.org>
L: linux-...@vger.kernel.org
--
2.40.1.495.gc816e09b53d-goog

Roman Gushchin

May 1, 2023, 1:47:20 PM
to Suren Baghdasaryan, ak...@linux-foundation.org, kent.ov...@linux.dev, mho...@suse.com, vba...@suse.cz, han...@cmpxchg.org, mgo...@suse.de, da...@stgolabs.net, wi...@infradead.org, liam.h...@oracle.com, cor...@lwn.net, vo...@manifault.com, pet...@infradead.org, juri....@redhat.com, ldu...@linux.ibm.com, catalin...@arm.com, wi...@kernel.org, ar...@arndb.de, tg...@linutronix.de, mi...@redhat.com, dave....@linux.intel.com, x...@kernel.org, pet...@redhat.com, da...@redhat.com, ax...@kernel.dk, mcg...@kernel.org, masa...@kernel.org, nat...@kernel.org, den...@kernel.org, t...@kernel.org, muchu...@linux.dev, rp...@kernel.org, pau...@kernel.org, pasha.t...@soleen.com, yosry...@google.com, yuz...@google.com, dhow...@redhat.com, hu...@google.com, andre...@gmail.com, kees...@chromium.org, ndesau...@google.com, gre...@linuxfoundation.org, ebig...@google.com, ytc...@gmail.com, vincent...@linaro.org, dietmar....@arm.com, ros...@goodmis.org, bse...@google.com, bri...@redhat.com, vsch...@redhat.com, c...@linux.com, pen...@kernel.org, iamjoon...@lge.com, 42.h...@gmail.com, gli...@google.com, el...@google.com, dvy...@google.com, shak...@google.com, songm...@bytedance.com, jba...@akamai.com, rien...@google.com, min...@google.com, kales...@google.com, kerne...@android.com, linu...@vger.kernel.org, linux-...@vger.kernel.org, io...@lists.linux.dev, linux...@vger.kernel.org, linux-...@vger.kernel.org, linu...@kvack.org, linux-...@vger.kernel.org, kasa...@googlegroups.com, cgr...@vger.kernel.org
On Mon, May 01, 2023 at 09:54:10AM -0700, Suren Baghdasaryan wrote:
> Performance overhead:
> To evaluate performance we implemented an in-kernel test executing
> multiple get_free_page/free_page and kmalloc/kfree calls with allocation
> sizes growing from 8 to 240 bytes with CPU frequency set to max and CPU
> affinity set to a specific CPU to minimize the noise. Below is performance
> comparison between the baseline kernel, profiling when enabled, profiling
> when disabled (nomem_profiling=y) and (for comparison purposes) baseline
> with CONFIG_MEMCG_KMEM enabled and allocations using __GFP_ACCOUNT:
>
>                         kmalloc              pgalloc
> Baseline (6.3-rc7)      9.200s               31.050s
> profiling disabled      9.800 (+6.52%)       32.600 (+4.99%)
> profiling enabled       12.500 (+35.87%)     39.010 (+25.60%)
> memcg_kmem enabled      41.400 (+350.00%)    70.600 (+127.38%)

Hm, this makes me think we have a regression with memcg_kmem in one of
the recent releases. When I measured it a couple of years ago, the overhead
was definitely within 100%.

Do you understand what makes your profiling drastically faster than kmem?

Thanks!
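
For reference, the percentages in the quoted table follow from the run times by the usual relative-overhead formula. A minimal sketch in userspace C (the helper name `overhead_pct` is made up for illustration and is not part of the patch set):

```c
/*
 * Relative overhead of a measured run time over a baseline run time,
 * in percent. Hypothetical helper, for checking the quoted table only.
 */
static double overhead_pct(double baseline_s, double measured_s)
{
	return (measured_s - baseline_s) / baseline_s * 100.0;
}
```

With the kmalloc column, `overhead_pct(9.200, 41.400)` comes out at 350, which means the memcg_kmem run takes 4.5 times as long as the baseline, while the profiling-enabled run (12.500s) takes roughly 1.36 times as long.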

Suren Baghdasaryan

May 1, 2023, 2:08:22 PM
to Roman Gushchin, ak...@linux-foundation.org, kent.ov...@linux.dev, mho...@suse.com, vba...@suse.cz, han...@cmpxchg.org, mgo...@suse.de, da...@stgolabs.net, wi...@infradead.org, liam.h...@oracle.com, cor...@lwn.net, vo...@manifault.com, pet...@infradead.org, juri....@redhat.com, ldu...@linux.ibm.com, catalin...@arm.com, wi...@kernel.org, ar...@arndb.de, tg...@linutronix.de, mi...@redhat.com, dave....@linux.intel.com, x...@kernel.org, pet...@redhat.com, da...@redhat.com, ax...@kernel.dk, mcg...@kernel.org, masa...@kernel.org, nat...@kernel.org, den...@kernel.org, t...@kernel.org, muchu...@linux.dev, rp...@kernel.org, pau...@kernel.org, pasha.t...@soleen.com, yosry...@google.com, yuz...@google.com, dhow...@redhat.com, hu...@google.com, andre...@gmail.com, kees...@chromium.org, ndesau...@google.com, gre...@linuxfoundation.org, ebig...@google.com, ytc...@gmail.com, vincent...@linaro.org, dietmar....@arm.com, ros...@goodmis.org, bse...@google.com, bri...@redhat.com, vsch...@redhat.com, c...@linux.com, pen...@kernel.org, iamjoon...@lge.com, 42.h...@gmail.com, gli...@google.com, el...@google.com, dvy...@google.com, shak...@google.com, songm...@bytedance.com, jba...@akamai.com, rien...@google.com, min...@google.com, kales...@google.com, kerne...@android.com, linu...@vger.kernel.org, linux-...@vger.kernel.org, io...@lists.linux.dev, linux...@vger.kernel.org, linux-...@vger.kernel.org, linu...@kvack.org, linux-...@vger.kernel.org, kasa...@googlegroups.com, cgr...@vger.kernel.org
I haven't profiled or looked into kmem overhead closely but I can do
that. I just wanted to see how the overhead compares with the existing
accounting mechanisms.

For kmalloc, the overhead is low because after we create the vector of
slab_ext objects (which is the same as what memcg_kmem does), memory
profiling just increments a lazy counter (which in many cases would be
a per-cpu counter). memcg_kmem operates on cgroup hierarchy with
additional overhead associated with that. I'm guessing that's the
reason for the big difference between these mechanisms, but I didn't
look into the details to understand memcg_kmem performance.

>
> Thanks!
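
The "lazy counter" mentioned above starts out as a single atomic and only switches to per-CPU mode once it is updated often enough; the patch description quoted later in this thread sets the threshold at 256 updates per second. The following is a simplified userspace sketch of that decision only. The struct, field and function names are hypothetical, the real code packs the secondary counter into the high bits of an atomic64_t, and `hz` stands in for the kernel's HZ:

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified model of the upgrade heuristic. */
struct lazy_counter_model {
	uint8_t mod;              /* secondary counter, wraps every 256 updates */
	unsigned long last_wrap;  /* tick of the previous wrap, 0 = never */
	bool percpu;              /* true once the counter has been upgraded */
};

/*
 * Record one update at tick 'now'. Returns true if the counter is (now)
 * in percpu mode. An upgrade happens when two consecutive wraps of the
 * 8-bit secondary counter fall within 'hz' ticks of each other, i.e.
 * when the update rate exceeds about 256 per second.
 */
static bool lazy_counter_model_update(struct lazy_counter_model *c,
				      unsigned long now, unsigned long hz)
{
	if (c->percpu)
		return true;

	if (++c->mod != 0)
		return false;		/* no wrap yet */

	if (c->last_wrap && now - c->last_wrap < hz)
		c->percpu = true;	/* hot counter: upgrade */
	else
		c->last_wrap = now;	/* remember when we wrapped */

	return c->percpu;
}
```

A counter that is touched rarely never pays for a per-CPU allocation, while a hot one stops bouncing a shared cache line after at most a few hundred updates.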

Roman Gushchin

May 1, 2023, 2:15:05 PM
to Suren Baghdasaryan, ak...@linux-foundation.org, kent.ov...@linux.dev, mho...@suse.com, vba...@suse.cz, han...@cmpxchg.org, mgo...@suse.de, da...@stgolabs.net, wi...@infradead.org, liam.h...@oracle.com, cor...@lwn.net, vo...@manifault.com, pet...@infradead.org, juri....@redhat.com, ldu...@linux.ibm.com, catalin...@arm.com, wi...@kernel.org, ar...@arndb.de, tg...@linutronix.de, mi...@redhat.com, dave....@linux.intel.com, x...@kernel.org, pet...@redhat.com, da...@redhat.com, ax...@kernel.dk, mcg...@kernel.org, masa...@kernel.org, nat...@kernel.org, den...@kernel.org, t...@kernel.org, muchu...@linux.dev, rp...@kernel.org, pau...@kernel.org, pasha.t...@soleen.com, yosry...@google.com, yuz...@google.com, dhow...@redhat.com, hu...@google.com, andre...@gmail.com, kees...@chromium.org, ndesau...@google.com, gre...@linuxfoundation.org, ebig...@google.com, ytc...@gmail.com, vincent...@linaro.org, dietmar....@arm.com, ros...@goodmis.org, bse...@google.com, bri...@redhat.com, vsch...@redhat.com, c...@linux.com, pen...@kernel.org, iamjoon...@lge.com, 42.h...@gmail.com, gli...@google.com, el...@google.com, dvy...@google.com, shak...@google.com, songm...@bytedance.com, jba...@akamai.com, rien...@google.com, min...@google.com, kales...@google.com, kerne...@android.com, linu...@vger.kernel.org, linux-...@vger.kernel.org, io...@lists.linux.dev, linux...@vger.kernel.org, linux-...@vger.kernel.org, linu...@kvack.org, linux-...@vger.kernel.org, kasa...@googlegroups.com, cgr...@vger.kernel.org
On Mon, May 01, 2023 at 11:08:05AM -0700, Suren Baghdasaryan wrote:
> On Mon, May 1, 2023 at 10:47 AM Roman Gushchin <roman.g...@linux.dev> wrote:
> >
> > On Mon, May 01, 2023 at 09:54:10AM -0700, Suren Baghdasaryan wrote:
> > > Performance overhead:
> > > To evaluate performance we implemented an in-kernel test executing
> > > multiple get_free_page/free_page and kmalloc/kfree calls with allocation
> > > sizes growing from 8 to 240 bytes with CPU frequency set to max and CPU
> > > affinity set to a specific CPU to minimize the noise. Below is performance
> > > comparison between the baseline kernel, profiling when enabled, profiling
> > > when disabled (nomem_profiling=y) and (for comparison purposes) baseline
> > > with CONFIG_MEMCG_KMEM enabled and allocations using __GFP_ACCOUNT:
> > >
> > >                         kmalloc              pgalloc
> > > Baseline (6.3-rc7)      9.200s               31.050s
> > > profiling disabled      9.800 (+6.52%)       32.600 (+4.99%)
> > > profiling enabled       12.500 (+35.87%)     39.010 (+25.60%)
> > > memcg_kmem enabled      41.400 (+350.00%)    70.600 (+127.38%)
> >
> > Hm, this makes me think we have a regression with memcg_kmem in one of
> > the recent releases. When I measured it a couple of years ago, the overhead
> > was definitely within 100%.
> >
> > Do you understand what makes your profiling drastically faster than kmem?
>
> I haven't profiled or looked into kmem overhead closely but I can do
> that. I just wanted to see how the overhead compares with the existing
> accounting mechanisms.

It's a good idea and I generally think that +25-35% for kmalloc/pgalloc
should be OK for production use, which is great!
In reality, most workloads are not that sensitive to the speed of
memory allocation.

>
> For kmalloc, the overhead is low because after we create the vector of
> slab_ext objects (which is the same as what memcg_kmem does), memory
> profiling just increments a lazy counter (which in many cases would be
> a per-cpu counter).

So does kmem (this is why I'm somewhat surprised by the difference).

> memcg_kmem operates on cgroup hierarchy with
> additional overhead associated with that. I'm guessing that's the
> reason for the big difference between these mechanisms, but I didn't
> look into the details to understand memcg_kmem performance.

I suspect recent rt-related changes and also the wide usage of
rcu primitives in the kmem code. I'll try to look closer as well.

Thanks!

Davidlohr Bueso

May 1, 2023, 2:45:35 PM
to Suren Baghdasaryan, ak...@linux-foundation.org, kent.ov...@linux.dev, mho...@suse.com, vba...@suse.cz, han...@cmpxchg.org, roman.g...@linux.dev, mgo...@suse.de, wi...@infradead.org, liam.h...@oracle.com, cor...@lwn.net, vo...@manifault.com, pet...@infradead.org, juri....@redhat.com, ldu...@linux.ibm.com, catalin...@arm.com, wi...@kernel.org, ar...@arndb.de, tg...@linutronix.de, mi...@redhat.com, dave....@linux.intel.com, x...@kernel.org, pet...@redhat.com, da...@redhat.com, ax...@kernel.dk, mcg...@kernel.org, masa...@kernel.org, nat...@kernel.org, den...@kernel.org, t...@kernel.org, muchu...@linux.dev, rp...@kernel.org, pau...@kernel.org, pasha.t...@soleen.com, yosry...@google.com, yuz...@google.com, dhow...@redhat.com, hu...@google.com, andre...@gmail.com, kees...@chromium.org, ndesau...@google.com, gre...@linuxfoundation.org, ebig...@google.com, ytc...@gmail.com, vincent...@linaro.org, dietmar....@arm.com, ros...@goodmis.org, bse...@google.com, bri...@redhat.com, vsch...@redhat.com, c...@linux.com, pen...@kernel.org, iamjoon...@lge.com, 42.h...@gmail.com, gli...@google.com, el...@google.com, dvy...@google.com, shak...@google.com, songm...@bytedance.com, jba...@akamai.com, rien...@google.com, min...@google.com, kales...@google.com, kerne...@android.com, linu...@vger.kernel.org, linux-...@vger.kernel.org, io...@lists.linux.dev, linux...@vger.kernel.org, linux-...@vger.kernel.org, linu...@kvack.org, linux-...@vger.kernel.org, kasa...@googlegroups.com, cgr...@vger.kernel.org, Andy Shevchenko, Michael Ellerman, Benjamin Herrenschmidt, Paul Mackerras, Michael S. Tsirkin, Jason Wang, Noralf Trønnes
On Mon, 01 May 2023, Suren Baghdasaryan wrote:

>From: Kent Overstreet <kent.ov...@linux.dev>
>
>Previously, string_get_size() outputted a space between the number and
>the units, i.e.
> 9.88 MiB
>
>This changes it to
> 9.88MiB
>
>which allows it to be parsed correctly by the 'sort -h' command.

Wouldn't this break users that already parse it the current way?

Thanks,
Davidlohr

Randy Dunlap

May 1, 2023, 3:18:37 PM
to Suren Baghdasaryan, ak...@linux-foundation.org, kent.ov...@linux.dev, mho...@suse.com, vba...@suse.cz, han...@cmpxchg.org, roman.g...@linux.dev, mgo...@suse.de, da...@stgolabs.net, wi...@infradead.org, liam.h...@oracle.com, cor...@lwn.net, vo...@manifault.com, pet...@infradead.org, juri....@redhat.com, ldu...@linux.ibm.com, catalin...@arm.com, wi...@kernel.org, ar...@arndb.de, tg...@linutronix.de, mi...@redhat.com, dave....@linux.intel.com, x...@kernel.org, pet...@redhat.com, da...@redhat.com, ax...@kernel.dk, mcg...@kernel.org, masa...@kernel.org, nat...@kernel.org, den...@kernel.org, t...@kernel.org, muchu...@linux.dev, rp...@kernel.org, pau...@kernel.org, pasha.t...@soleen.com, yosry...@google.com, yuz...@google.com, dhow...@redhat.com, hu...@google.com, andre...@gmail.com, kees...@chromium.org, ndesau...@google.com, gre...@linuxfoundation.org, ebig...@google.com, ytc...@gmail.com, vincent...@linaro.org, dietmar....@arm.com, ros...@goodmis.org, bse...@google.com, bri...@redhat.com, vsch...@redhat.com, c...@linux.com, pen...@kernel.org, iamjoon...@lge.com, 42.h...@gmail.com, gli...@google.com, el...@google.com, dvy...@google.com, shak...@google.com, songm...@bytedance.com, jba...@akamai.com, rien...@google.com, min...@google.com, kales...@google.com, kerne...@android.com, linu...@vger.kernel.org, linux-...@vger.kernel.org, io...@lists.linux.dev, linux...@vger.kernel.org, linux-...@vger.kernel.org, linu...@kvack.org, linux-...@vger.kernel.org, kasa...@googlegroups.com, cgr...@vger.kernel.org
Hi--

On 5/1/23 09:54, Suren Baghdasaryan wrote:
> From: Kent Overstreet <kent.ov...@linux.dev>
>
> This patch adds lib/lazy-percpu-counter.c, which implements counters
> that start out as atomics, but lazily switch to percpu mode if the
> update rate crosses some threshold (arbitrarily set at 256 per second).
>

From submitting-patches.rst:

Describe your changes in imperative mood, e.g. "make xyzzy do frotz"
instead of "[This patch] makes xyzzy do frotz" or "[I] changed xyzzy
to do frotz", as if you are giving orders to the codebase to change
its behaviour.

> Signed-off-by: Kent Overstreet <kent.ov...@linux.dev>
> Signed-off-by: Suren Baghdasaryan <sur...@google.com>
> ---
> include/linux/lazy-percpu-counter.h | 102 ++++++++++++++++++++++
> lib/Kconfig | 3 +
> lib/Makefile | 2 +
> lib/lazy-percpu-counter.c | 127 ++++++++++++++++++++++++++++
> 4 files changed, 234 insertions(+)
> create mode 100644 include/linux/lazy-percpu-counter.h
> create mode 100644 lib/lazy-percpu-counter.c
>
> diff --git a/include/linux/lazy-percpu-counter.h b/include/linux/lazy-percpu-counter.h
> new file mode 100644
> index 000000000000..45ca9e2ce58b
> --- /dev/null
> +++ b/include/linux/lazy-percpu-counter.h
> @@ -0,0 +1,102 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +/*
> + * Lazy percpu counters:
> + * (C) 2022 Kent Overstreet
> + *
> + * Lazy percpu counters start out in atomic mode, then switch to percpu mode if
> + * the update rate crosses some threshold.
> + *
> + * This means we don't have to decide between low memory overhead atomic
> + * counters and higher performance percpu counters - we can have our cake and
> + * eat it, too!
> + *
> + * Internally we use an atomic64_t, where the low bit indicates whether we're in
> + * percpu mode, and the high 8 bits are a secondary counter that's incremented
> + * when the counter is modified - meaning 55 bits of precision are available for
> + * the counter itself.
> + */
> +
> +#ifndef _LINUX_LAZY_PERCPU_COUNTER_H
> +#define _LINUX_LAZY_PERCPU_COUNTER_H
> +
> +#include <linux/atomic.h>
> +#include <asm/percpu.h>
> +
> +struct lazy_percpu_counter {
> +	atomic64_t v;
> +	unsigned long last_wrap;
> +};
> +
> +void lazy_percpu_counter_exit(struct lazy_percpu_counter *c);
> +void lazy_percpu_counter_add_slowpath(struct lazy_percpu_counter *c, s64 i);
> +void lazy_percpu_counter_add_slowpath_noupgrade(struct lazy_percpu_counter *c, s64 i);
> +s64 lazy_percpu_counter_read(struct lazy_percpu_counter *c);
> +
> +/*
> + * We use the high bits of the atomic counter for a secondary counter, which is
> + * incremented every time the counter is touched. When the secondary counter
> + * wraps, we check the time the counter last wrapped, and if it was recent
> + * enough that means the update frequency has crossed our threshold and we
> + * switch to percpu mode:
> + */
> +#define COUNTER_MOD_BITS 8
> +#define COUNTER_MOD_MASK ~(~0ULL >> COUNTER_MOD_BITS)
> +#define COUNTER_MOD_BITS_START (64 - COUNTER_MOD_BITS)
> +
> +/*
> + * We use the low bit of the counter to indicate whether we're in atomic mode
> + * (low bit clear), or percpu mode (low bit set, counter is a pointer to actual
> + * percpu counters):
> + */
> +#define COUNTER_IS_PCPU_BIT 1
> +
> +static inline u64 __percpu *lazy_percpu_counter_is_pcpu(u64 v)
> +{
> +	if (!(v & COUNTER_IS_PCPU_BIT))
> +		return NULL;
> +
> +	v ^= COUNTER_IS_PCPU_BIT;
> +	return (u64 __percpu *)(unsigned long)v;
> +}
> +
> +/**
> + * lazy_percpu_counter_add: Add a value to a lazy_percpu_counter

For kernel-doc, the function name should be followed by '-', not ':'.
(many places)

> + *
> + * @c: counter to modify
> + * @i: value to add
> + */
> +static inline void lazy_percpu_counter_add(struct lazy_percpu_counter *c, s64 i)
> +{
> +	u64 v = atomic64_read(&c->v);
> +	u64 __percpu *pcpu_v = lazy_percpu_counter_is_pcpu(v);
> +
> +	if (likely(pcpu_v))
> +		this_cpu_add(*pcpu_v, i);
> +	else
> +		lazy_percpu_counter_add_slowpath(c, i);
> +}
> +
> +/**
> + * lazy_percpu_counter_add_noupgrade: Add a value to a lazy_percpu_counter,
> + * without upgrading to percpu mode
> + *
> + * @c: counter to modify
> + * @i: value to add
> + */
> +static inline void lazy_percpu_counter_add_noupgrade(struct lazy_percpu_counter *c, s64 i)
> +{
> +	u64 v = atomic64_read(&c->v);
> +	u64 __percpu *pcpu_v = lazy_percpu_counter_is_pcpu(v);
> +
> +	if (likely(pcpu_v))
> +		this_cpu_add(*pcpu_v, i);
> +	else
> +		lazy_percpu_counter_add_slowpath_noupgrade(c, i);
> +}
> +
> +static inline void lazy_percpu_counter_sub(struct lazy_percpu_counter *c, s64 i)
> +{
> +	lazy_percpu_counter_add(c, -i);
> +}
> +
> +#endif /* _LINUX_LAZY_PERCPU_COUNTER_H */

> diff --git a/lib/lazy-percpu-counter.c b/lib/lazy-percpu-counter.c
> new file mode 100644
> index 000000000000..4f4e32c2dc09
> --- /dev/null
> +++ b/lib/lazy-percpu-counter.c
> @@ -0,0 +1,127 @@
> +// SPDX-License-Identifier: GPL-2.0-only
> +
> +#include <linux/atomic.h>
> +#include <linux/gfp.h>
> +#include <linux/jiffies.h>
> +#include <linux/lazy-percpu-counter.h>
> +#include <linux/percpu.h>
> +
> +static inline s64 lazy_percpu_counter_atomic_val(s64 v)
> +{
> +	/* Ensure output is sign extended properly: */
> +	return (v << COUNTER_MOD_BITS) >>
> +		(COUNTER_MOD_BITS + COUNTER_IS_PCPU_BIT);
> +}
> +
...
> +
> +/**
> + * lazy_percpu_counter_exit: Free resources associated with a
> + * lazy_percpu_counter

Same kernel-doc comment.

> + *
> + * @c: counter to exit
> + */
> +void lazy_percpu_counter_exit(struct lazy_percpu_counter *c)
> +{
> +	free_percpu(lazy_percpu_counter_is_pcpu(atomic64_read(&c->v)));
> +}
> +EXPORT_SYMBOL_GPL(lazy_percpu_counter_exit);
> +
> +/**
> + * lazy_percpu_counter_read: Read current value of a lazy_percpu_counter
> + *
> + * @c: counter to read
> + */
> +s64 lazy_percpu_counter_read(struct lazy_percpu_counter *c)
> +{
> +	s64 v = atomic64_read(&c->v);
> +	u64 __percpu *pcpu_v = lazy_percpu_counter_is_pcpu(v);
> +
> +	if (pcpu_v) {
> +		int cpu;
> +
> +		v = 0;
> +		for_each_possible_cpu(cpu)
> +			v += *per_cpu_ptr(pcpu_v, cpu);
> +	} else {
> +		v = lazy_percpu_counter_atomic_val(v);
> +	}
> +
> +	return v;
> +}
> +EXPORT_SYMBOL_GPL(lazy_percpu_counter_read);
> +
> +void lazy_percpu_counter_add_slowpath(struct lazy_percpu_counter *c, s64 i)
> +{
> +	u64 atomic_i;
> +	u64 old, v = atomic64_read(&c->v);
> +	u64 __percpu *pcpu_v;
> +
> +	atomic_i = i << COUNTER_IS_PCPU_BIT;
> +	atomic_i &= ~COUNTER_MOD_MASK;
> +	atomic_i |= 1ULL << COUNTER_MOD_BITS_START;
> +
> +	do {
> +		pcpu_v = lazy_percpu_counter_is_pcpu(v);
> +		if (pcpu_v) {
> +			this_cpu_add(*pcpu_v, i);
> +			return;
> +		}
> +
> +		old = v;
> +	} while ((v = atomic64_cmpxchg(&c->v, old, old + atomic_i)) != old);
> +
> +	if (unlikely(!(v & COUNTER_MOD_MASK))) {
> +		unsigned long now = jiffies;
> +
> +		if (c->last_wrap &&
> +		    unlikely(time_after(c->last_wrap + HZ, now)))
> +			lazy_percpu_counter_switch_to_pcpu(c);
> +		else
> +			c->last_wrap = now;
> +	}
> +}
> +EXPORT_SYMBOL(lazy_percpu_counter_add_slowpath);
> +
> +void lazy_percpu_counter_add_slowpath_noupgrade(struct lazy_percpu_counter *c, s64 i)
> +{
> +	u64 atomic_i;
> +	u64 old, v = atomic64_read(&c->v);
> +	u64 __percpu *pcpu_v;
> +
> +	atomic_i = i << COUNTER_IS_PCPU_BIT;
> +	atomic_i &= ~COUNTER_MOD_MASK;
> +
> +	do {
> +		pcpu_v = lazy_percpu_counter_is_pcpu(v);
> +		if (pcpu_v) {
> +			this_cpu_add(*pcpu_v, i);
> +			return;
> +		}
> +
> +		old = v;
> +	} while ((v = atomic64_cmpxchg(&c->v, old, old + atomic_i)) != old);
> +}
> +EXPORT_SYMBOL(lazy_percpu_counter_add_slowpath_noupgrade);

These last 2 exported functions could use some comments, preferably in
kernel-doc format.

Thanks.
--
~Randy
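
The bit layout described in the quoted header comment (low bit as the percpu flag, high 8 bits as the secondary counter, 55 bits left for the value) can be checked with a small userspace sketch. The macros below are copied from the quoted patch, with parentheses added around the mask; `pack_delta()` and `unpack_value()` are hypothetical names that mirror the arithmetic in `lazy_percpu_counter_add_slowpath()` and `lazy_percpu_counter_atomic_val()`, with plain integers standing in for atomic64_t:

```c
#include <stdint.h>

#define COUNTER_MOD_BITS	8
#define COUNTER_MOD_MASK	(~(~0ULL >> COUNTER_MOD_BITS))
#define COUNTER_MOD_BITS_START	(64 - COUNTER_MOD_BITS)
#define COUNTER_IS_PCPU_BIT	1

/*
 * Build the 64-bit word that is added to the packed counter for a
 * delta of i: the delta is shifted past the flag bit, clipped to the
 * value field, and the secondary counter is bumped by one.
 */
static uint64_t pack_delta(int64_t i)
{
	uint64_t d = (uint64_t)i << COUNTER_IS_PCPU_BIT;

	d &= ~COUNTER_MOD_MASK;
	d |= 1ULL << COUNTER_MOD_BITS_START;
	return d;
}

/*
 * Recover the signed value from a packed word that is in atomic mode.
 * Shifting left drops the secondary counter; the arithmetic right shift
 * then sign-extends and drops the flag bit. This relies on arithmetic
 * right shift of negative values, which GCC and Clang provide.
 */
static int64_t unpack_value(uint64_t v)
{
	return (int64_t)(v << COUNTER_MOD_BITS) >>
		(COUNTER_MOD_BITS + COUNTER_IS_PCPU_BIT);
}
```

Starting from zero, adding `pack_delta(5)` and then `pack_delta(-8)` leaves a word for which `unpack_value()` returns -3, with the low (percpu) bit still clear.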

Kent Overstreet

May 1, 2023, 3:35:58 PM
to Suren Baghdasaryan, ak...@linux-foundation.org, mho...@suse.com, vba...@suse.cz, han...@cmpxchg.org, roman.g...@linux.dev, mgo...@suse.de, wi...@infradead.org, liam.h...@oracle.com, cor...@lwn.net, vo...@manifault.com, pet...@infradead.org, juri....@redhat.com, ldu...@linux.ibm.com, catalin...@arm.com, wi...@kernel.org, ar...@arndb.de, tg...@linutronix.de, mi...@redhat.com, dave....@linux.intel.com, x...@kernel.org, pet...@redhat.com, da...@redhat.com, ax...@kernel.dk, mcg...@kernel.org, masa...@kernel.org, nat...@kernel.org, den...@kernel.org, t...@kernel.org, muchu...@linux.dev, rp...@kernel.org, pau...@kernel.org, pasha.t...@soleen.com, yosry...@google.com, yuz...@google.com, dhow...@redhat.com, hu...@google.com, andre...@gmail.com, kees...@chromium.org, ndesau...@google.com, gre...@linuxfoundation.org, ebig...@google.com, ytc...@gmail.com, vincent...@linaro.org, dietmar....@arm.com, ros...@goodmis.org, bse...@google.com, bri...@redhat.com, vsch...@redhat.com, c...@linux.com, pen...@kernel.org, iamjoon...@lge.com, 42.h...@gmail.com, gli...@google.com, el...@google.com, dvy...@google.com, shak...@google.com, songm...@bytedance.com, jba...@akamai.com, rien...@google.com, min...@google.com, kales...@google.com, kerne...@android.com, linu...@vger.kernel.org, linux-...@vger.kernel.org, io...@lists.linux.dev, linux...@vger.kernel.org, linux-...@vger.kernel.org, linu...@kvack.org, linux-...@vger.kernel.org, kasa...@googlegroups.com, cgr...@vger.kernel.org, Andy Shevchenko, Michael Ellerman, Benjamin Herrenschmidt, Paul Mackerras, Michael S. Tsirkin, Jason Wang, Noralf Trønnes
It's not impossible - but it's not used in very many places and we
wouldn't be printing in human-readable units if it was meant to be
parsed - it's mainly used for debug output currently.

If someone raises a specific objection we'll do something different,
otherwise I think standardizing on what userspace tooling already parses
is a good idea.

Kent Overstreet

May 1, 2023, 3:38:10 PM
to Roman Gushchin, Suren Baghdasaryan, ak...@linux-foundation.org, mho...@suse.com, vba...@suse.cz, han...@cmpxchg.org, mgo...@suse.de, da...@stgolabs.net, wi...@infradead.org, liam.h...@oracle.com, cor...@lwn.net, vo...@manifault.com, pet...@infradead.org, juri....@redhat.com, ldu...@linux.ibm.com, catalin...@arm.com, wi...@kernel.org, ar...@arndb.de, tg...@linutronix.de, mi...@redhat.com, dave....@linux.intel.com, x...@kernel.org, pet...@redhat.com, da...@redhat.com, ax...@kernel.dk, mcg...@kernel.org, masa...@kernel.org, nat...@kernel.org, den...@kernel.org, t...@kernel.org, muchu...@linux.dev, rp...@kernel.org, pau...@kernel.org, pasha.t...@soleen.com, yosry...@google.com, yuz...@google.com, dhow...@redhat.com, hu...@google.com, andre...@gmail.com, kees...@chromium.org, ndesau...@google.com, gre...@linuxfoundation.org, ebig...@google.com, ytc...@gmail.com, vincent...@linaro.org, dietmar....@arm.com, ros...@goodmis.org, bse...@google.com, bri...@redhat.com, vsch...@redhat.com, c...@linux.com, pen...@kernel.org, iamjoon...@lge.com, 42.h...@gmail.com, gli...@google.com, el...@google.com, dvy...@google.com, shak...@google.com, songm...@bytedance.com, jba...@akamai.com, rien...@google.com, min...@google.com, kales...@google.com, kerne...@android.com, linu...@vger.kernel.org, linux-...@vger.kernel.org, io...@lists.linux.dev, linux...@vger.kernel.org, linux-...@vger.kernel.org, linu...@kvack.org, linux-...@vger.kernel.org, kasa...@googlegroups.com, cgr...@vger.kernel.org
On Mon, May 01, 2023 at 11:14:45AM -0700, Roman Gushchin wrote:
> It's a good idea and I generally think that +25-35% for kmalloc/pgalloc
> should be OK for production use, which is great!
> In reality, most workloads are not that sensitive to the speed of
> memory allocation.

:)

My main takeaway has been "the slub fast path is _really_ fast". No
disabling of preemption, no atomic instructions, just a non-locked
double-word cmpxchg - it's a slick piece of work.

> > For kmalloc, the overhead is low because after we create the vector of
> > slab_ext objects (which is the same as what memcg_kmem does), memory
> > profiling just increments a lazy counter (which in many cases would be
> > a per-cpu counter).
>
> So does kmem (this is why I'm somewhat surprised by the difference).
>
> > memcg_kmem operates on cgroup hierarchy with
> > additional overhead associated with that. I'm guessing that's the
> > reason for the big difference between these mechanisms, but I didn't
> > look into the details to understand memcg_kmem performance.
>
> I suspect recent rt-related changes and also the wide usage of
> rcu primitives in the kmem code. I'll try to look closer as well.

Happy to give you something to compare against :)

Andy Shevchenko

May 1, 2023, 3:57:45 PM
to Kent Overstreet, Suren Baghdasaryan, ak...@linux-foundation.org, mho...@suse.com, vba...@suse.cz, han...@cmpxchg.org, roman.g...@linux.dev, mgo...@suse.de, wi...@infradead.org, liam.h...@oracle.com, cor...@lwn.net, vo...@manifault.com, pet...@infradead.org, juri....@redhat.com, ldu...@linux.ibm.com, catalin...@arm.com, wi...@kernel.org, ar...@arndb.de, tg...@linutronix.de, mi...@redhat.com, dave....@linux.intel.com, x...@kernel.org, pet...@redhat.com, da...@redhat.com, ax...@kernel.dk, mcg...@kernel.org, masa...@kernel.org, nat...@kernel.org, den...@kernel.org, t...@kernel.org, muchu...@linux.dev, rp...@kernel.org, pau...@kernel.org, pasha.t...@soleen.com, yosry...@google.com, yuz...@google.com, dhow...@redhat.com, hu...@google.com, andre...@gmail.com, kees...@chromium.org, ndesau...@google.com, gre...@linuxfoundation.org, ebig...@google.com, ytc...@gmail.com, vincent...@linaro.org, dietmar....@arm.com, ros...@goodmis.org, bse...@google.com, bri...@redhat.com, vsch...@redhat.com, c...@linux.com, pen...@kernel.org, iamjoon...@lge.com, 42.h...@gmail.com, gli...@google.com, el...@google.com, dvy...@google.com, shak...@google.com, songm...@bytedance.com, jba...@akamai.com, rien...@google.com, min...@google.com, kales...@google.com, kerne...@android.com, linu...@vger.kernel.org, linux-...@vger.kernel.org, io...@lists.linux.dev, linux...@vger.kernel.org, linux-...@vger.kernel.org, linu...@kvack.org, linux-...@vger.kernel.org, kasa...@googlegroups.com, cgr...@vger.kernel.org, Andy Shevchenko, Michael Ellerman, Benjamin Herrenschmidt, Paul Mackerras, Michael S. Tsirkin, Jason Wang, Noralf Trønnes
On Mon, May 1, 2023 at 10:36 PM Kent Overstreet
<kent.ov...@linux.dev> wrote:
>
> On Mon, May 01, 2023 at 11:13:15AM -0700, Davidlohr Bueso wrote:
> > On Mon, 01 May 2023, Suren Baghdasaryan wrote:
> >
> > > From: Kent Overstreet <kent.ov...@linux.dev>
> > >
> > > Previously, string_get_size() outputted a space between the number and
> > > the units, i.e.
> > > 9.88 MiB
> > >
> > > This changes it to
> > > 9.88MiB
> > >
> > > which allows it to be parsed correctly by the 'sort -h' command.

But why do we need that? What's the use case?

> > Wouldn't this break users that already parse it the current way?
>
> It's not impossible - but it's not used in very many places and we
> wouldn't be printing in human-readable units if it was meant to be
> parsed - it's mainly used for debug output currently.
>
> If someone raises a specific objection we'll do something different,
> otherwise I think standardizing on what userspace tooling already parses
> is a good idea.

Yes, I NAK this on the basis of
https://english.stackexchange.com/a/2911/153144


--
With Best Regards,
Andy Shevchenko

Kent Overstreet

May 1, 2023, 5:17:13 PM
to Andy Shevchenko, Suren Baghdasaryan, ak...@linux-foundation.org, mho...@suse.com, vba...@suse.cz, han...@cmpxchg.org, roman.g...@linux.dev, mgo...@suse.de, wi...@infradead.org, liam.h...@oracle.com, cor...@lwn.net, vo...@manifault.com, pet...@infradead.org, juri....@redhat.com, ldu...@linux.ibm.com, catalin...@arm.com, wi...@kernel.org, ar...@arndb.de, tg...@linutronix.de, mi...@redhat.com, dave....@linux.intel.com, x...@kernel.org, pet...@redhat.com, da...@redhat.com, ax...@kernel.dk, mcg...@kernel.org, masa...@kernel.org, nat...@kernel.org, den...@kernel.org, t...@kernel.org, muchu...@linux.dev, rp...@kernel.org, pau...@kernel.org, pasha.t...@soleen.com, yosry...@google.com, yuz...@google.com, dhow...@redhat.com, hu...@google.com, andre...@gmail.com, kees...@chromium.org, ndesau...@google.com, gre...@linuxfoundation.org, ebig...@google.com, ytc...@gmail.com, vincent...@linaro.org, dietmar....@arm.com, ros...@goodmis.org, bse...@google.com, bri...@redhat.com, vsch...@redhat.com, c...@linux.com, pen...@kernel.org, iamjoon...@lge.com, 42.h...@gmail.com, gli...@google.com, el...@google.com, dvy...@google.com, shak...@google.com, songm...@bytedance.com, jba...@akamai.com, rien...@google.com, min...@google.com, kales...@google.com, kerne...@android.com, linu...@vger.kernel.org, linux-...@vger.kernel.org, io...@lists.linux.dev, linux...@vger.kernel.org, linux-...@vger.kernel.org, linu...@kvack.org, linux-...@vger.kernel.org, kasa...@googlegroups.com, cgr...@vger.kernel.org, Andy Shevchenko, Michael Ellerman, Benjamin Herrenschmidt, Paul Mackerras, Michael S. Tsirkin, Jason Wang, Noralf Trønnes
On Mon, May 01, 2023 at 10:57:07PM +0300, Andy Shevchenko wrote:
> On Mon, May 1, 2023 at 10:36 PM Kent Overstreet
> <kent.ov...@linux.dev> wrote:
> >
> > On Mon, May 01, 2023 at 11:13:15AM -0700, Davidlohr Bueso wrote:
> > > On Mon, 01 May 2023, Suren Baghdasaryan wrote:
> > >
> > > > From: Kent Overstreet <kent.ov...@linux.dev>
> > > >
> > > > Previously, string_get_size() outputted a space between the number and
> > > > the units, i.e.
> > > > 9.88 MiB
> > > >
> > > > This changes it to
> > > > 9.88MiB
> > > >
> > > > which allows it to be parsed correctly by the 'sort -h' command.
>
> But why do we need that? What's the use case?

As stated in the commit message: to produce output that sort -h knows how
to parse.

> > > Wouldn't this break users that already parse it the current way?
> >
> > It's not impossible - but it's not used in very many places and we
> > wouldn't be printing in human-readable units if it was meant to be
> > parsed - it's mainly used for debug output currently.
> >
> > If someone raises a specific objection we'll do something different,
> > otherwise I think standardizing on what userspace tooling already parses
> > is a good idea.
>
> Yes, I NAK this on the basis of
> https://english.stackexchange.com/a/2911/153144

Not sure I find a style guide on stackexchange more compelling than
interop with a tool everyone already has installed :)
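
To make the two output styles under discussion concrete, here is a minimal userspace sketch of a human-readable size formatter with and without the space. It is not the kernel's string_get_size(); the function name, the `with_space` flag and the two-decimal rounding are assumptions made for illustration. The motivation given in the commit message is that 'sort -h' parses the unit when it directly follows the number.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/*
 * Hypothetical helper: format 'bytes' using binary (IEC) units with two
 * decimals. With with_space != 0 the result looks like "9.88 MiB",
 * otherwise like "9.88MiB". Returns what snprintf() returns.
 */
static int format_size(uint64_t bytes, int with_space, char *buf, size_t len)
{
	static const char *const units[] = { "B", "KiB", "MiB", "GiB", "TiB" };
	double v = (double)bytes;
	size_t u = 0;

	/* Scale down by 1024 until the value fits the current unit. */
	while (v >= 1024.0 && u + 1 < sizeof(units) / sizeof(units[0])) {
		v /= 1024.0;
		u++;
	}

	return snprintf(buf, len, "%.2f%s%s", v, with_space ? " " : "", units[u]);
}
```

For 1536 bytes this produces "1.50KiB" without the space and "1.50 KiB" with it; only the first form keeps the number and its unit in a single token.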

Roman Gushchin

May 1, 2023, 5:18:38 PM
to Kent Overstreet, Suren Baghdasaryan, ak...@linux-foundation.org, mho...@suse.com, vba...@suse.cz, han...@cmpxchg.org, mgo...@suse.de, da...@stgolabs.net, wi...@infradead.org, liam.h...@oracle.com, cor...@lwn.net, vo...@manifault.com, pet...@infradead.org, juri....@redhat.com, ldu...@linux.ibm.com, catalin...@arm.com, wi...@kernel.org, ar...@arndb.de, tg...@linutronix.de, mi...@redhat.com, dave....@linux.intel.com, x...@kernel.org, pet...@redhat.com, da...@redhat.com, ax...@kernel.dk, mcg...@kernel.org, masa...@kernel.org, nat...@kernel.org, den...@kernel.org, t...@kernel.org, muchu...@linux.dev, rp...@kernel.org, pau...@kernel.org, pasha.t...@soleen.com, yosry...@google.com, yuz...@google.com, dhow...@redhat.com, hu...@google.com, andre...@gmail.com, kees...@chromium.org, ndesau...@google.com, gre...@linuxfoundation.org, ebig...@google.com, ytc...@gmail.com, vincent...@linaro.org, dietmar....@arm.com, ros...@goodmis.org, bse...@google.com, bri...@redhat.com, vsch...@redhat.com, c...@linux.com, pen...@kernel.org, iamjoon...@lge.com, 42.h...@gmail.com, gli...@google.com, el...@google.com, dvy...@google.com, shak...@google.com, songm...@bytedance.com, jba...@akamai.com, rien...@google.com, min...@google.com, kales...@google.com, kerne...@android.com, linu...@vger.kernel.org, linux-...@vger.kernel.org, io...@lists.linux.dev, linux...@vger.kernel.org, linux-...@vger.kernel.org, linu...@kvack.org, linux-...@vger.kernel.org, kasa...@googlegroups.com, cgr...@vger.kernel.org
To be fair, it's not an apples-to-apples comparison, because:
1) memcgs are organized in a tree, these days usually with at least 3 layers,
2) memcgs are dynamic. In theory a task can be moved to a different
   memcg while performing a (very slow) allocation, and the original
   memcg can be released. To prevent this we have to perform a lot
   of operations which you can happily avoid.

That said, there is clearly room for optimization, so thank you
for indirectly bringing this up.

Thanks!
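
The first point above can be illustrated with a toy model. The sketch below only contrasts a flat counter update with a charge that is propagated to every ancestor in a tree; the `struct group` type and both functions are hypothetical, and the model ignores the batching, reference counting and RCU protection that real memcg accounting needs to handle the second point.

```c
#include <stddef.h>

/* Hypothetical stand-in for one level of an accounting hierarchy. */
struct group {
	struct group *parent;	/* NULL for the root */
	long usage;		/* bytes charged to this group and below */
};

/* Flat accounting: a single counter update per allocation. */
static void account_flat(long *counter, long size)
{
	*counter += size;
}

/*
 * Hierarchical accounting: the charge is applied to the group and to
 * each of its ancestors. Returns the number of counters touched.
 */
static int charge_hierarchy(struct group *g, long size)
{
	int touched = 0;

	for (; g; g = g->parent) {
		g->usage += size;
		touched++;
	}
	return touched;
}
```

With a three-level tree, one allocation touches three counters in this model instead of one, before any of the lifetime handling is even considered.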

Liam R. Howlett

May 1, 2023, 5:34:42 PM
to Andy Shevchenko, Kent Overstreet, Suren Baghdasaryan, ak...@linux-foundation.org, mho...@suse.com, vba...@suse.cz, han...@cmpxchg.org, roman.g...@linux.dev, mgo...@suse.de, wi...@infradead.org, cor...@lwn.net, vo...@manifault.com, pet...@infradead.org, juri....@redhat.com, ldu...@linux.ibm.com, catalin...@arm.com, wi...@kernel.org, ar...@arndb.de, tg...@linutronix.de, mi...@redhat.com, dave....@linux.intel.com, x...@kernel.org, pet...@redhat.com, da...@redhat.com, ax...@kernel.dk, mcg...@kernel.org, masa...@kernel.org, nat...@kernel.org, den...@kernel.org, t...@kernel.org, muchu...@linux.dev, rp...@kernel.org, pau...@kernel.org, pasha.t...@soleen.com, yosry...@google.com, yuz...@google.com, dhow...@redhat.com, hu...@google.com, andre...@gmail.com, kees...@chromium.org, ndesau...@google.com, gre...@linuxfoundation.org, ebig...@google.com, ytc...@gmail.com, vincent...@linaro.org, dietmar....@arm.com, ros...@goodmis.org, bse...@google.com, bri...@redhat.com, vsch...@redhat.com, c...@linux.com, pen...@kernel.org, iamjoon...@lge.com, 42.h...@gmail.com, gli...@google.com, el...@google.com, dvy...@google.com, shak...@google.com, songm...@bytedance.com, jba...@akamai.com, rien...@google.com, min...@google.com, kales...@google.com, kerne...@android.com, linu...@vger.kernel.org, linux-...@vger.kernel.org, io...@lists.linux.dev, linux...@vger.kernel.org, linux-...@vger.kernel.org, linu...@kvack.org, linux-...@vger.kernel.org, kasa...@googlegroups.com, cgr...@vger.kernel.org, Andy Shevchenko, Michael Ellerman, Benjamin Herrenschmidt, Paul Mackerras, Michael S. Tsirkin, Jason Wang, Noralf Trønnes
* Andy Shevchenko <andy.sh...@gmail.com> [230501 15:57]:
This fixes the output to be better aligned with:
the output of ls -sh
the input expected by find -size

Are there counter-examples of commands that follow the SI Brochure?

Thanks,
Liam

Kent Overstreet

May 1, 2023, 8:11:16 PM5/1/23
to Liam R. Howlett, Andy Shevchenko, Suren Baghdasaryan, ak...@linux-foundation.org, mho...@suse.com, vba...@suse.cz, han...@cmpxchg.org, roman.g...@linux.dev, mgo...@suse.de, wi...@infradead.org, cor...@lwn.net, vo...@manifault.com, pet...@infradead.org, juri....@redhat.com, ldu...@linux.ibm.com, catalin...@arm.com, wi...@kernel.org, ar...@arndb.de, tg...@linutronix.de, mi...@redhat.com, dave....@linux.intel.com, x...@kernel.org, pet...@redhat.com, da...@redhat.com, ax...@kernel.dk, mcg...@kernel.org, masa...@kernel.org, nat...@kernel.org, den...@kernel.org, t...@kernel.org, muchu...@linux.dev, rp...@kernel.org, pau...@kernel.org, pasha.t...@soleen.com, yosry...@google.com, yuz...@google.com, dhow...@redhat.com, hu...@google.com, andre...@gmail.com, kees...@chromium.org, ndesau...@google.com, gre...@linuxfoundation.org, ebig...@google.com, ytc...@gmail.com, vincent...@linaro.org, dietmar....@arm.com, ros...@goodmis.org, bse...@google.com, bri...@redhat.com, vsch...@redhat.com, c...@linux.com, pen...@kernel.org, iamjoon...@lge.com, 42.h...@gmail.com, gli...@google.com, el...@google.com, dvy...@google.com, shak...@google.com, songm...@bytedance.com, jba...@akamai.com, rien...@google.com, min...@google.com, kales...@google.com, kerne...@android.com, linu...@vger.kernel.org, linux-...@vger.kernel.org, io...@lists.linux.dev, linux...@vger.kernel.org, linux-...@vger.kernel.org, linu...@kvack.org, linux-...@vger.kernel.org, kasa...@googlegroups.com, cgr...@vger.kernel.org, Andy Shevchenko, Michael Ellerman, Benjamin Herrenschmidt, Paul Mackerras, Michael S. Tsirkin, Jason Wang, Noralf Trønnes
On Mon, May 01, 2023 at 05:33:49PM -0400, Liam R. Howlett wrote:
> * Andy Shevchenko <andy.sh...@gmail.com> [230501 15:57]:
> This fixes the output to be better aligned with:
> the output of ls -sh
> the input expected by find -size
>
> Are there counter-examples of commands that follow the SI Brochure?

Even perf, which is included in the kernel tree, doesn't include the
space - example perf top output:

0 bcachefs:move_extent_fail
0 bcachefs:move_extent_alloc_mem_fail
3 bcachefs:move_data
0 bcachefs:evacuate_bucket
0 bcachefs:copygc
2 bcachefs:copygc_wait
195K bcachefs:transaction_commit
0 bcachefs:trans_restart_injected

(I'm also going to need to submit a patch that deletes the B suffix or
makes it optional: just because we're using human-readable units doesn't
mean the value is in bytes.)

Kent Overstreet

May 1, 2023, 8:53:49 PM5/1/23
to Andy Shevchenko, Suren Baghdasaryan, ak...@linux-foundation.org, mho...@suse.com, vba...@suse.cz, han...@cmpxchg.org, roman.g...@linux.dev, mgo...@suse.de, wi...@infradead.org, liam.h...@oracle.com, cor...@lwn.net, vo...@manifault.com, pet...@infradead.org, juri....@redhat.com, ldu...@linux.ibm.com, catalin...@arm.com, wi...@kernel.org, ar...@arndb.de, tg...@linutronix.de, mi...@redhat.com, dave....@linux.intel.com, x...@kernel.org, pet...@redhat.com, da...@redhat.com, ax...@kernel.dk, mcg...@kernel.org, masa...@kernel.org, nat...@kernel.org, den...@kernel.org, t...@kernel.org, muchu...@linux.dev, rp...@kernel.org, pau...@kernel.org, pasha.t...@soleen.com, yosry...@google.com, yuz...@google.com, dhow...@redhat.com, hu...@google.com, andre...@gmail.com, kees...@chromium.org, ndesau...@google.com, gre...@linuxfoundation.org, ebig...@google.com, ytc...@gmail.com, vincent...@linaro.org, dietmar....@arm.com, ros...@goodmis.org, bse...@google.com, bri...@redhat.com, vsch...@redhat.com, c...@linux.com, pen...@kernel.org, iamjoon...@lge.com, 42.h...@gmail.com, gli...@google.com, el...@google.com, dvy...@google.com, shak...@google.com, songm...@bytedance.com, jba...@akamai.com, rien...@google.com, min...@google.com, kales...@google.com, kerne...@android.com, linu...@vger.kernel.org, linux-...@vger.kernel.org, io...@lists.linux.dev, linux...@vger.kernel.org, linux-...@vger.kernel.org, linu...@kvack.org, linux-...@vger.kernel.org, kasa...@googlegroups.com, cgr...@vger.kernel.org, Andy Shevchenko, Michael Ellerman, Benjamin Herrenschmidt, Paul Mackerras, Michael S. Tsirkin, Jason Wang, Noralf Trønnes
On Mon, May 01, 2023 at 10:57:07PM +0300, Andy Shevchenko wrote:
> But why do we need that? What's the use case?

It looks like we missed you on the initial CC; here's the use case:
https://lore.kernel.org/linux-fsdevel/ZFAsm0XTqC%2F%2Ff4FP@P9FQF9L96D/T/#mdda814a8c569e2214baa31320912b0ef83432fa9

James Bottomley

May 1, 2023, 10:22:32 PM5/1/23
to Kent Overstreet, Suren Baghdasaryan, ak...@linux-foundation.org, mho...@suse.com, vba...@suse.cz, han...@cmpxchg.org, roman.g...@linux.dev, mgo...@suse.de, wi...@infradead.org, liam.h...@oracle.com, cor...@lwn.net, vo...@manifault.com, pet...@infradead.org, juri....@redhat.com, ldu...@linux.ibm.com, catalin...@arm.com, wi...@kernel.org, ar...@arndb.de, tg...@linutronix.de, mi...@redhat.com, dave....@linux.intel.com, x...@kernel.org, pet...@redhat.com, da...@redhat.com, ax...@kernel.dk, mcg...@kernel.org, masa...@kernel.org, nat...@kernel.org, den...@kernel.org, t...@kernel.org, muchu...@linux.dev, rp...@kernel.org, pau...@kernel.org, pasha.t...@soleen.com, yosry...@google.com, yuz...@google.com, dhow...@redhat.com, hu...@google.com, andre...@gmail.com, kees...@chromium.org, ndesau...@google.com, gre...@linuxfoundation.org, ebig...@google.com, ytc...@gmail.com, vincent...@linaro.org, dietmar....@arm.com, ros...@goodmis.org, bse...@google.com, bri...@redhat.com, vsch...@redhat.com, c...@linux.com, pen...@kernel.org, iamjoon...@lge.com, 42.h...@gmail.com, gli...@google.com, el...@google.com, dvy...@google.com, shak...@google.com, songm...@bytedance.com, jba...@akamai.com, rien...@google.com, min...@google.com, kales...@google.com, kerne...@android.com, linu...@vger.kernel.org, linux-...@vger.kernel.org, io...@lists.linux.dev, linux...@vger.kernel.org, linux-...@vger.kernel.org, linu...@kvack.org, linux-...@vger.kernel.org, kasa...@googlegroups.com, cgr...@vger.kernel.org, Andy Shevchenko, Michael Ellerman, Benjamin Herrenschmidt, Paul Mackerras, Michael S. Tsirkin, Jason Wang, Noralf Trønnes
On Mon, 2023-05-01 at 15:35 -0400, Kent Overstreet wrote:
> On Mon, May 01, 2023 at 11:13:15AM -0700, Davidlohr Bueso wrote:
> > On Mon, 01 May 2023, Suren Baghdasaryan wrote:
> >
> > > From: Kent Overstreet <kent.ov...@linux.dev>
> > >
> > > Previously, string_get_size() outputted a space between the
> > > number and the units, i.e.
> > >  9.88 MiB
> > >
> > > This changes it to
> > >  9.88MiB
> > >
> > > which allows it to be parsed correctly by the 'sort -h' command.
> >
> > Wouldn't this break users that already parse it the current way?
>
> It's not impossible - but it's not used in very many places and we
> wouldn't be printing in human-readable units if it was meant to be
> parsed - it's mainly used for debug output currently.

It is not used just for debug. It's used all over the kernel for
printing out device sizes. The output mostly goes to the kernel print
buffer, so it's anyone's guess as to what, if any, tools are parsing
it, but the concern about breaking log parsers seems to be a valid one.

> If someone raises a specific objection we'll do something different,
> otherwise I think standardizing on what userspace tooling already
> parses is a good idea.

If you want to omit the space, why not simply add your own variant? A
string_get_size_nospace() which would use most of the body of this one
as a helper function but give its own snprintf format string at the
end. It's only a couple of lines longer as a patch and has the bonus
that it definitely wouldn't break anything by altering an existing
output.
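
A rough userspace sketch of that shape, under loud assumptions: this is
not the body of string_get_size(), the names demo_format_size(),
demo_get_size() and demo_get_size_nospace() are hypothetical, and the
formatting is simplified to binary units with two truncated decimals.
The only point is that both entry points share one helper and differ
solely in the separator handed to snprintf():

```c
#include <stdint.h>
#include <stdio.h>

/*
 * Hypothetical shared helper (NOT the kernel's string_get_size()): scale a
 * byte count to binary units and print it with a caller-chosen separator.
 * Simplification: two decimals, truncated rather than rounded.
 */
static void demo_format_size(uint64_t bytes, const char *sep,
			     char *buf, size_t len)
{
	static const char *const units[] = {
		"B", "KiB", "MiB", "GiB", "TiB", "PiB", "EiB"
	};
	uint64_t whole = bytes, rem = 0;
	size_t i = 0;

	/* Divide down by 1024, remembering the last remainder for the fraction. */
	while (whole >= 1024 && i < 6) {
		rem = whole % 1024;
		whole /= 1024;
		i++;
	}

	if (i == 0)
		snprintf(buf, len, "%llu%s%s",
			 (unsigned long long)whole, sep, units[i]);
	else
		snprintf(buf, len, "%llu.%02llu%s%s",
			 (unsigned long long)whole,
			 (unsigned long long)(rem * 100 / 1024),
			 sep, units[i]);
}

/* Existing-style output with the space kept, e.g. "1.50 KiB" for 1536. */
static void demo_get_size(uint64_t bytes, char *buf, size_t len)
{
	demo_format_size(bytes, " ", buf, len);
}

/* New variant without the space, e.g. "1.50KiB" for 1536. */
static void demo_get_size_nospace(uint64_t bytes, char *buf, size_t len)
{
	demo_format_size(bytes, "", buf, len);
}
```

Existing callers keep calling the first wrapper and see no change in
output; only new callers opt in to the second one.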

James

Kent Overstreet

May 1, 2023, 11:18:04 PM5/1/23
to James Bottomley, Suren Baghdasaryan, ak...@linux-foundation.org, mho...@suse.com, vba...@suse.cz, han...@cmpxchg.org, roman.g...@linux.dev, mgo...@suse.de, wi...@infradead.org, liam.h...@oracle.com, cor...@lwn.net, vo...@manifault.com, pet...@infradead.org, juri....@redhat.com, ldu...@linux.ibm.com, catalin...@arm.com, wi...@kernel.org, ar...@arndb.de, tg...@linutronix.de, mi...@redhat.com, dave....@linux.intel.com, x...@kernel.org, pet...@redhat.com, da...@redhat.com, ax...@kernel.dk, mcg...@kernel.org, masa...@kernel.org, nat...@kernel.org, den...@kernel.org, t...@kernel.org, muchu...@linux.dev, rp...@kernel.org, pau...@kernel.org, pasha.t...@soleen.com, yosry...@google.com, yuz...@google.com, dhow...@redhat.com, hu...@google.com, andre...@gmail.com, kees...@chromium.org, ndesau...@google.com, gre...@linuxfoundation.org, ebig...@google.com, ytc...@gmail.com, vincent...@linaro.org, dietmar....@arm.com, ros...@goodmis.org, bse...@google.com, bri...@redhat.com, vsch...@redhat.com, c...@linux.com, pen...@kernel.org, iamjoon...@lge.com, 42.h...@gmail.com, gli...@google.com, el...@google.com, dvy...@google.com, shak...@google.com, songm...@bytedance.com, jba...@akamai.com, rien...@google.com, min...@google.com, kales...@google.com, kerne...@android.com, linu...@vger.kernel.org, linux-...@vger.kernel.org, io...@lists.linux.dev, linux...@vger.kernel.org, linux-...@vger.kernel.org, linu...@kvack.org, linux-...@vger.kernel.org, kasa...@googlegroups.com, cgr...@vger.kernel.org, Andy Shevchenko, Michael Ellerman, Benjamin Herrenschmidt, Paul Mackerras, Michael S. Tsirkin, Jason Wang, Noralf Trønnes
On Mon, May 01, 2023 at 10:22:18PM -0400, James Bottomley wrote:
> It is not used just for debug. It's used all over the kernel for
> printing out device sizes. The output mostly goes to the kernel print
> buffer, so it's anyone's guess as to what, if any, tools are parsing
> it, but the concern about breaking log parsers seems to be a valid one.

Ok, there is sd_print_capacity() - but who in their right mind would be
trying to scrape device sizes, in human readable units, from log
messages when it's available in sysfs/procfs (actually, is it in sysfs?
if not, that's an oversight) in more reasonable units?

Correct me if I'm wrong, but I've yet to hear about kernel log messages
being considered a stable interface, and this seems a bit out there.

But, you did write the code :)

> > If someone raises a specific objection we'll do something different,
> > otherwise I think standardizing on what userspace tooling already
> > parses is a good idea.
>
> If you want to omit the space, why not simply add your own variant? A
> string_get_size_nospace() which would use most of the body of this one
> as a helper function but give its own snprintf format string at the
> end. It's only a couple of lines longer as a patch and has the bonus
> that it definitely wouldn't break anything by altering an existing
> output.

I'm happy to do that - I just wanted to post this version first to see
if we can avoid the fragmentation and do a bit of standardizing with
how everything else seems to do that.

Andy Shevchenko

May 2, 2023, 1:34:35 AM5/2/23
to Kent Overstreet, James Bottomley, Suren Baghdasaryan, ak...@linux-foundation.org, mho...@suse.com, vba...@suse.cz, han...@cmpxchg.org, roman.g...@linux.dev, mgo...@suse.de, wi...@infradead.org, liam.h...@oracle.com, cor...@lwn.net, vo...@manifault.com, pet...@infradead.org, juri....@redhat.com, ldu...@linux.ibm.com, catalin...@arm.com, wi...@kernel.org, ar...@arndb.de, tg...@linutronix.de, mi...@redhat.com, dave....@linux.intel.com, x...@kernel.org, pet...@redhat.com, da...@redhat.com, ax...@kernel.dk, mcg...@kernel.org, masa...@kernel.org, nat...@kernel.org, den...@kernel.org, t...@kernel.org, muchu...@linux.dev, rp...@kernel.org, pau...@kernel.org, pasha.t...@soleen.com, yosry...@google.com, yuz...@google.com, dhow...@redhat.com, hu...@google.com, andre...@gmail.com, kees...@chromium.org, ndesau...@google.com, gre...@linuxfoundation.org, ebig...@google.com, ytc...@gmail.com, vincent...@linaro.org, dietmar....@arm.com, ros...@goodmis.org, bse...@google.com, bri...@redhat.com, vsch...@redhat.com, c...@linux.com, pen...@kernel.org, iamjoon...@lge.com, 42.h...@gmail.com, gli...@google.com, el...@google.com, dvy...@google.com, shak...@google.com, songm...@bytedance.com, jba...@akamai.com, rien...@google.com, min...@google.com, kales...@google.com, kerne...@android.com, linu...@vger.kernel.org, linux-...@vger.kernel.org, io...@lists.linux.dev, linux...@vger.kernel.org, linux-...@vger.kernel.org, linu...@kvack.org, linux-...@vger.kernel.org, kasa...@googlegroups.com, cgr...@vger.kernel.org, Andy Shevchenko, Michael Ellerman, Benjamin Herrenschmidt, Paul Mackerras, Michael S. Tsirkin, Jason Wang, Noralf Trønnes
On Tue, May 2, 2023 at 6:18 AM Kent Overstreet
<kent.ov...@linux.dev> wrote:
> On Mon, May 01, 2023 at 10:22:18PM -0400, James Bottomley wrote:

...

> > > If someone raises a specific objection we'll do something different,
> > > otherwise I think standardizing on what userspace tooling already
> > > parses is a good idea.
> >
> > If you want to omit the space, why not simply add your own variant? A
> > string_get_size_nospace() which would use most of the body of this one
> > as a helper function but give its own snprintf format string at the
> > end. It's only a couple of lines longer as a patch and has the bonus
> > that it definitely wouldn't break anything by altering an existing
> > output.
>
> I'm happy to do that - I just wanted to post this version first to see
> if we can avoid the fragmentation and do a bit of standardizing with
> how everything else seems to do that.

Actually instead of producing zillions of variants, do a %p extension
to the printf() and that's it. We have, for example, %pt with T and
with space to follow users that want one or the other variant. Same
can be done with string_get_size().

Kent Overstreet

May 2, 2023, 2:22:09 AM5/2/23
to Andy Shevchenko, James Bottomley, Suren Baghdasaryan, ak...@linux-foundation.org, mho...@suse.com, vba...@suse.cz, han...@cmpxchg.org, roman.g...@linux.dev, mgo...@suse.de, wi...@infradead.org, liam.h...@oracle.com, cor...@lwn.net, vo...@manifault.com, pet...@infradead.org, juri....@redhat.com, ldu...@linux.ibm.com, catalin...@arm.com, wi...@kernel.org, ar...@arndb.de, tg...@linutronix.de, mi...@redhat.com, dave....@linux.intel.com, x...@kernel.org, pet...@redhat.com, da...@redhat.com, ax...@kernel.dk, mcg...@kernel.org, masa...@kernel.org, nat...@kernel.org, den...@kernel.org, t...@kernel.org, muchu...@linux.dev, rp...@kernel.org, pau...@kernel.org, pasha.t...@soleen.com, yosry...@google.com, yuz...@google.com, dhow...@redhat.com, hu...@google.com, andre...@gmail.com, kees...@chromium.org, ndesau...@google.com, gre...@linuxfoundation.org, ebig...@google.com, ytc...@gmail.com, vincent...@linaro.org, dietmar....@arm.com, ros...@goodmis.org, bse...@google.com, bri...@redhat.com, vsch...@redhat.com, c...@linux.com, pen...@kernel.org, iamjoon...@lge.com, 42.h...@gmail.com, gli...@google.com, el...@google.com, dvy...@google.com, shak...@google.com, songm...@bytedance.com, jba...@akamai.com, rien...@google.com, min...@google.com, kales...@google.com, kerne...@android.com, linu...@vger.kernel.org, linux-...@vger.kernel.org, io...@lists.linux.dev, linux...@vger.kernel.org, linux-...@vger.kernel.org, linu...@kvack.org, linux-...@vger.kernel.org, kasa...@googlegroups.com, cgr...@vger.kernel.org, Andy Shevchenko, Michael Ellerman, Benjamin Herrenschmidt, Paul Mackerras, Michael S. Tsirkin, Jason Wang, Noralf Trønnes
On Tue, May 02, 2023 at 08:33:57AM +0300, Andy Shevchenko wrote:
> Actually instead of producing zillions of variants, do a %p extension
> to the printf() and that's it. We have, for example, %pt with T and
> with space to follow users that want one or the other variant. Same
> can be done with string_get_size().

God no.

Jani Nikula

May 2, 2023, 3:56:13 AM5/2/23
to Suren Baghdasaryan, ak...@linux-foundation.org, kent.ov...@linux.dev, mho...@suse.com, vba...@suse.cz, han...@cmpxchg.org, roman.g...@linux.dev, mgo...@suse.de, da...@stgolabs.net, wi...@infradead.org, liam.h...@oracle.com, cor...@lwn.net, vo...@manifault.com, pet...@infradead.org, juri....@redhat.com, ldu...@linux.ibm.com, catalin...@arm.com, wi...@kernel.org, ar...@arndb.de, tg...@linutronix.de, mi...@redhat.com, dave....@linux.intel.com, x...@kernel.org, pet...@redhat.com, da...@redhat.com, ax...@kernel.dk, mcg...@kernel.org, masa...@kernel.org, nat...@kernel.org, den...@kernel.org, t...@kernel.org, muchu...@linux.dev, rp...@kernel.org, pau...@kernel.org, pasha.t...@soleen.com, yosry...@google.com, yuz...@google.com, dhow...@redhat.com, hu...@google.com, andre...@gmail.com, kees...@chromium.org, ndesau...@google.com, gre...@linuxfoundation.org, ebig...@google.com, ytc...@gmail.com, vincent...@linaro.org, dietmar....@arm.com, ros...@goodmis.org, bse...@google.com, bri...@redhat.com, vsch...@redhat.com, c...@linux.com, pen...@kernel.org, iamjoon...@lge.com, 42.h...@gmail.com, gli...@google.com, el...@google.com, dvy...@google.com, shak...@google.com, songm...@bytedance.com, jba...@akamai.com, rien...@google.com, min...@google.com, kales...@google.com, sur...@google.com, kerne...@android.com, linu...@vger.kernel.org, linux-...@vger.kernel.org, io...@lists.linux.dev, linux...@vger.kernel.org, linux-...@vger.kernel.org, linu...@kvack.org, linux-...@vger.kernel.org, kasa...@googlegroups.com, cgr...@vger.kernel.org, Andy Shevchenko, Michael Ellerman, Benjamin Herrenschmidt, Paul Mackerras, Michael S. Tsirkin, Jason Wang, Noralf Trønnes
On Mon, 01 May 2023, Suren Baghdasaryan <sur...@google.com> wrote:
> From: Kent Overstreet <kent.ov...@linux.dev>
>
> Previously, string_get_size() outputted a space between the number and
> the units, i.e.
> 9.88 MiB
>
> This changes it to
> 9.88MiB
>
> which allows it to be parsed correctly by the 'sort -h' command.

The former is easier for humans to parse, and that should be
preferred. 'sort -h' is supposed to compare "human readable numbers", so
arguably sort does not do its job here.

BR,
Jani.

>
> Signed-off-by: Kent Overstreet <kent.ov...@linux.dev>
> Signed-off-by: Suren Baghdasaryan <sur...@google.com>
> Cc: Andy Shevchenko <an...@kernel.org>
> Cc: Michael Ellerman <m...@ellerman.id.au>
> Cc: Benjamin Herrenschmidt <be...@kernel.crashing.org>
> Cc: Paul Mackerras <pau...@samba.org>
> Cc: "Michael S. Tsirkin" <m...@redhat.com>
> Cc: Jason Wang <jaso...@redhat.com>
> Cc: "Noralf Trønnes" <nor...@tronnes.org>
> Cc: Jens Axboe <ax...@kernel.dk>
> ---
> lib/string_helpers.c | 3 +--
> 1 file changed, 1 insertion(+), 2 deletions(-)
>
> diff --git a/lib/string_helpers.c b/lib/string_helpers.c
> index 230020a2e076..593b29fece32 100644
> --- a/lib/string_helpers.c
> +++ b/lib/string_helpers.c
> @@ -126,8 +126,7 @@ void string_get_size(u64 size, u64 blk_size, const enum string_size_units units,
> else
> unit = units_str[units][i];
>
> - snprintf(buf, len, "%u%s %s", (u32)size,
> - tmp, unit);
> + snprintf(buf, len, "%u%s%s", (u32)size, tmp, unit);
> }
> EXPORT_SYMBOL(string_get_size);

--
Jani Nikula, Intel Open Source Graphics Center

Petr Tesařík

May 2, 2023, 8:35:35 AM5/2/23
to Suren Baghdasaryan, ak...@linux-foundation.org, kent.ov...@linux.dev, mho...@suse.com, vba...@suse.cz, han...@cmpxchg.org, roman.g...@linux.dev, mgo...@suse.de, da...@stgolabs.net, wi...@infradead.org, liam.h...@oracle.com, cor...@lwn.net, vo...@manifault.com, pet...@infradead.org, juri....@redhat.com, ldu...@linux.ibm.com, catalin...@arm.com, wi...@kernel.org, ar...@arndb.de, tg...@linutronix.de, mi...@redhat.com, dave....@linux.intel.com, x...@kernel.org, pet...@redhat.com, da...@redhat.com, ax...@kernel.dk, mcg...@kernel.org, masa...@kernel.org, nat...@kernel.org, den...@kernel.org, t...@kernel.org, muchu...@linux.dev, rp...@kernel.org, pau...@kernel.org, pasha.t...@soleen.com, yosry...@google.com, yuz...@google.com, dhow...@redhat.com, hu...@google.com, andre...@gmail.com, kees...@chromium.org, ndesau...@google.com, gre...@linuxfoundation.org, ebig...@google.com, ytc...@gmail.com, vincent...@linaro.org, dietmar....@arm.com, ros...@goodmis.org, bse...@google.com, bri...@redhat.com, vsch...@redhat.com, c...@linux.com, pen...@kernel.org, iamjoon...@lge.com, 42.h...@gmail.com, gli...@google.com, el...@google.com, dvy...@google.com, shak...@google.com, songm...@bytedance.com, jba...@akamai.com, rien...@google.com, min...@google.com, kales...@google.com, kerne...@android.com, linu...@vger.kernel.org, linux-...@vger.kernel.org, io...@lists.linux.dev, linux...@vger.kernel.org, linux-...@vger.kernel.org, linu...@kvack.org, linux-...@vger.kernel.org, kasa...@googlegroups.com, cgr...@vger.kernel.org, Alexander Viro
On Mon, 1 May 2023 09:54:13 -0700
Suren Baghdasaryan <sur...@google.com> wrote:

> From: Kent Overstreet <kent.ov...@linux.dev>
>
> We're introducing alloc tagging, which tracks memory allocations by
> callsite. Converting alloc_inode_sb() to a macro means allocations will
> be tracked by its caller, which is a bit more useful.
>
> Signed-off-by: Kent Overstreet <kent.ov...@linux.dev>
> Signed-off-by: Suren Baghdasaryan <sur...@google.com>
> Cc: Alexander Viro <vi...@zeniv.linux.org.uk>
> ---
> include/linux/fs.h | 6 +-----
> 1 file changed, 1 insertion(+), 5 deletions(-)
>
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index 21a981680856..4905ce14db0b 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -2699,11 +2699,7 @@ int setattr_should_drop_sgid(struct mnt_idmap *idmap,
> * This must be used for allocating filesystems specific inodes to set
> * up the inode reclaim context correctly.
> */
> -static inline void *
> -alloc_inode_sb(struct super_block *sb, struct kmem_cache *cache, gfp_t gfp)
> -{
> - return kmem_cache_alloc_lru(cache, &sb->s_inode_lru, gfp);
> -}
> +#define alloc_inode_sb(_sb, _cache, _gfp) kmem_cache_alloc_lru(_cache, &_sb->s_inode_lru, _gfp)

Honestly, I don't like this change. In general, pre-processor macros
are ugly and error-prone.

Besides, it works for you only because __kmem_cache_alloc_lru() is
declared __always_inline (unless CONFIG_SLUB_TINY is defined, but then
you probably don't want the tracking either). In any case, it's going
to be difficult for people to understand why and how this works.

If the actual caller of alloc_inode_sb() is needed, I'd rather add it
as a parameter and pass down _RET_IP_ explicitly here.
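
To illustrate the difference, here is a userspace analogy only: it is
not the alloc tagging mechanism from this series, it uses
__FILE__/__LINE__ in place of _RET_IP_, and all demo_* names are
hypothetical. A plain function wrapper records its own location for
every caller, a macro wrapper records each caller's location, and in
both cases the call site reaches the allocator as an explicit
parameter rather than through inlining:

```c
#include <stdlib.h>

/* Hypothetical record of where the most recent allocation was requested. */
struct demo_site {
	const char *file;
	int line;
};

static struct demo_site demo_last_site;

/* The call site is an explicit parameter, so nothing depends on inlining. */
static void *demo_alloc_at(size_t size, const char *file, int line)
{
	demo_last_site.file = file;
	demo_last_site.line = line;
	return malloc(size);
}

/*
 * Function wrapper: __FILE__/__LINE__ expand here, inside the wrapper, so
 * every caller is attributed to this one location.
 */
static void *demo_alloc_via_function(size_t size)
{
	return demo_alloc_at(size, __FILE__, __LINE__);
}

/*
 * Macro wrapper: __FILE__/__LINE__ expand at each use, so every caller is
 * attributed to its own location.
 */
#define demo_alloc_via_macro(size) demo_alloc_at((size), __FILE__, __LINE__)
```

Two calls to demo_alloc_via_function() from different lines leave the
same line number in demo_last_site, while two uses of
demo_alloc_via_macro() on different lines leave different ones.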

Just my two cents,
Petr T

Petr Tesařík

May 2, 2023, 8:37:41 AM5/2/23
to Suren Baghdasaryan, ak...@linux-foundation.org, kent.ov...@linux.dev, mho...@suse.com, vba...@suse.cz, han...@cmpxchg.org, roman.g...@linux.dev, mgo...@suse.de, da...@stgolabs.net, wi...@infradead.org, liam.h...@oracle.com, cor...@lwn.net, vo...@manifault.com, pet...@infradead.org, juri....@redhat.com, ldu...@linux.ibm.com, catalin...@arm.com, wi...@kernel.org, ar...@arndb.de, tg...@linutronix.de, mi...@redhat.com, dave....@linux.intel.com, x...@kernel.org, pet...@redhat.com, da...@redhat.com, ax...@kernel.dk, mcg...@kernel.org, masa...@kernel.org, nat...@kernel.org, den...@kernel.org, t...@kernel.org, muchu...@linux.dev, rp...@kernel.org, pau...@kernel.org, pasha.t...@soleen.com, yosry...@google.com, yuz...@google.com, dhow...@redhat.com, hu...@google.com, andre...@gmail.com, kees...@chromium.org, ndesau...@google.com, gre...@linuxfoundation.org, ebig...@google.com, ytc...@gmail.com, vincent...@linaro.org, dietmar....@arm.com, ros...@goodmis.org, bse...@google.com, bri...@redhat.com, vsch...@redhat.com, c...@linux.com, pen...@kernel.org, iamjoon...@lge.com, 42.h...@gmail.com, gli...@google.com, el...@google.com, dvy...@google.com, shak...@google.com, songm...@bytedance.com, jba...@akamai.com, rien...@google.com, min...@google.com, kales...@google.com, kerne...@android.com, linu...@vger.kernel.org, linux-...@vger.kernel.org, io...@lists.linux.dev, linux...@vger.kernel.org, linux-...@vger.kernel.org, linu...@kvack.org, linux-...@vger.kernel.org, kasa...@googlegroups.com, cgr...@vger.kernel.org
On Mon, 1 May 2023 09:54:16 -0700
Suren Baghdasaryan <sur...@google.com> wrote:

> From: Kent Overstreet <kent.ov...@linux.dev>
>
> This adds a new helper which is like strsep, except that it skips empty
> tokens.
>
> Signed-off-by: Kent Overstreet <kent.ov...@linux.dev>
> Signed-off-by: Suren Baghdasaryan <sur...@google.com>
> ---
> include/linux/string.h | 1 +
> lib/string.c | 19 +++++++++++++++++++
> 2 files changed, 20 insertions(+)
>
> diff --git a/include/linux/string.h b/include/linux/string.h
> index c062c581a98b..6cd5451c262c 100644
> --- a/include/linux/string.h
> +++ b/include/linux/string.h
> @@ -96,6 +96,7 @@ extern char * strpbrk(const char *,const char *);
> #ifndef __HAVE_ARCH_STRSEP
> extern char * strsep(char **,const char *);
> #endif
> +extern char *strsep_no_empty(char **, const char *);
> #ifndef __HAVE_ARCH_STRSPN
> extern __kernel_size_t strspn(const char *,const char *);
> #endif
> diff --git a/lib/string.c b/lib/string.c
> index 3d55ef890106..dd4914baf45a 100644
> --- a/lib/string.c
> +++ b/lib/string.c
> @@ -520,6 +520,25 @@ char *strsep(char **s, const char *ct)
> EXPORT_SYMBOL(strsep);
> #endif
>
> +/**
> + * strsep_no_empt - Split a string into tokens, but don't return empty tokens
                ^^^^
Typo: strsep_no_empty
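
For readers without the patch at hand, the behaviour the commit message
describes (like strsep, except that empty tokens are skipped) can be
sketched in portable userspace C as follows. This is an assumption-laden
illustration, not the function body from the patch, which is not quoted
above; the name demo_strsep_no_empty() is hypothetical:

```c
#include <string.h>

/*
 * Hypothetical sketch: return the next non-empty token from *s, or NULL
 * when none remain. Like strsep(), it writes a NUL over the delimiter
 * that ends the token and advances *s past it.
 */
static char *demo_strsep_no_empty(char **s, const char *delims)
{
	char *tok, *end;

	if (*s == NULL)
		return NULL;

	/* Skip any run of leading delimiters (these would be empty tokens). */
	tok = *s + strspn(*s, delims);
	if (*tok == '\0') {
		*s = NULL;
		return NULL;
	}

	/* Find the end of the token and terminate it in place. */
	end = tok + strcspn(tok, delims);
	if (*end != '\0') {
		*end = '\0';
		*s = end + 1;
	} else {
		*s = NULL;
	}
	return tok;
}
```

Called repeatedly on a writable copy of ",,a,,b," with "," as the
delimiter set, it returns "a", then "b", then NULL.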

Petr T

Petr Tesařík

May 2, 2023, 8:50:21 AM5/2/23
to Suren Baghdasaryan, ak...@linux-foundation.org, kent.ov...@linux.dev, mho...@suse.com, vba...@suse.cz, han...@cmpxchg.org, roman.g...@linux.dev, mgo...@suse.de, da...@stgolabs.net, wi...@infradead.org, liam.h...@oracle.com, cor...@lwn.net, vo...@manifault.com, pet...@infradead.org, juri....@redhat.com, ldu...@linux.ibm.com, catalin...@arm.com, wi...@kernel.org, ar...@arndb.de, tg...@linutronix.de, mi...@redhat.com, dave....@linux.intel.com, x...@kernel.org, pet...@redhat.com, da...@redhat.com, ax...@kernel.dk, mcg...@kernel.org, masa...@kernel.org, nat...@kernel.org, den...@kernel.org, t...@kernel.org, muchu...@linux.dev, rp...@kernel.org, pau...@kernel.org, pasha.t...@soleen.com, yosry...@google.com, yuz...@google.com, dhow...@redhat.com, hu...@google.com, andre...@gmail.com, kees...@chromium.org, ndesau...@google.com, gre...@linuxfoundation.org, ebig...@google.com, ytc...@gmail.com, vincent...@linaro.org, dietmar....@arm.com, ros...@goodmis.org, bse...@google.com, bri...@redhat.com, vsch...@redhat.com, c...@linux.com, pen...@kernel.org, iamjoon...@lge.com, 42.h...@gmail.com, gli...@google.com, el...@google.com, dvy...@google.com, shak...@google.com, songm...@bytedance.com, jba...@akamai.com, rien...@google.com, min...@google.com, kales...@google.com, kerne...@android.com, linu...@vger.kernel.org, linux-...@vger.kernel.org, io...@lists.linux.dev, linux...@vger.kernel.org, linux-...@vger.kernel.org, linu...@kvack.org, linux-...@vger.kernel.org, kasa...@googlegroups.com, cgr...@vger.kernel.org
On Mon, 1 May 2023 09:54:19 -0700
Suren Baghdasaryan <sur...@google.com> wrote:

> Introduce __GFP_NO_OBJ_EXT flag in order to prevent recursive allocations
> when allocating slabobj_ext on a slab.
>
> Signed-off-by: Suren Baghdasaryan <sur...@google.com>
> ---
> include/linux/gfp_types.h | 12 ++++++++++--
> 1 file changed, 10 insertions(+), 2 deletions(-)
>
> diff --git a/include/linux/gfp_types.h b/include/linux/gfp_types.h
> index 6583a58670c5..aab1959130f9 100644
> --- a/include/linux/gfp_types.h
> +++ b/include/linux/gfp_types.h
> @@ -53,8 +53,13 @@ typedef unsigned int __bitwise gfp_t;
> #define ___GFP_SKIP_ZERO 0
> #define ___GFP_SKIP_KASAN 0
> #endif
> +#ifdef CONFIG_SLAB_OBJ_EXT
> +#define ___GFP_NO_OBJ_EXT 0x4000000u
> +#else
> +#define ___GFP_NO_OBJ_EXT 0
> +#endif
> #ifdef CONFIG_LOCKDEP
> -#define ___GFP_NOLOCKDEP 0x4000000u
> +#define ___GFP_NOLOCKDEP 0x8000000u

So now we have two flags that depend on config options, but the first
one is always allocated in fact. I wonder if you could use an enum to
let the compiler allocate bits. Something similar to what Muchun Song
did with section flags.

See commit ed7802dd48f7a507213cbb95bb4c6f1fe134eb5d for reference.

> #else
> #define ___GFP_NOLOCKDEP 0
> #endif
> @@ -99,12 +104,15 @@ typedef unsigned int __bitwise gfp_t;
> * node with no fallbacks or placement policy enforcements.
> *
> * %__GFP_ACCOUNT causes the allocation to be accounted to kmemcg.
> + *
> + * %__GFP_NO_OBJ_EXT causes slab allocation to have no object extension.
> */
> #define __GFP_RECLAIMABLE ((__force gfp_t)___GFP_RECLAIMABLE)
> #define __GFP_WRITE ((__force gfp_t)___GFP_WRITE)
> #define __GFP_HARDWALL ((__force gfp_t)___GFP_HARDWALL)
> #define __GFP_THISNODE ((__force gfp_t)___GFP_THISNODE)
> #define __GFP_ACCOUNT ((__force gfp_t)___GFP_ACCOUNT)
> +#define __GFP_NO_OBJ_EXT ((__force gfp_t)___GFP_NO_OBJ_EXT)
>
> /**
> * DOC: Watermark modifiers
> @@ -249,7 +257,7 @@ typedef unsigned int __bitwise gfp_t;
> #define __GFP_NOLOCKDEP ((__force gfp_t)___GFP_NOLOCKDEP)
>
> /* Room for N __GFP_FOO bits */
> -#define __GFP_BITS_SHIFT (26 + IS_ENABLED(CONFIG_LOCKDEP))
> +#define __GFP_BITS_SHIFT (27 + IS_ENABLED(CONFIG_LOCKDEP))

If the above suggestion is implemented, this could be changed to
something like __GFP_LAST_BIT (the enum's last identifier).
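
Roughly what I have in mind, as a standalone sketch rather than a patch
against gfp_types.h: every DEMO_* and CONFIG_DEMO_* name below is
hypothetical, and only three flags are shown. The enum hands out
consecutive bit numbers to whichever flags are configured in, and its
final identifier doubles as the number of bits in use:

```c
/* Hypothetical config switches; comment one out to see the bits repack. */
#define CONFIG_DEMO_OBJ_EXT 1
/* #define CONFIG_DEMO_LOCKDEP 1 */

/* The compiler assigns consecutive bit numbers to the enabled entries only. */
enum {
	DEMO_ACCOUNT_BIT,
#ifdef CONFIG_DEMO_OBJ_EXT
	DEMO_NO_OBJ_EXT_BIT,
#endif
#ifdef CONFIG_DEMO_LOCKDEP
	DEMO_NOLOCKDEP_BIT,
#endif
	DEMO_LAST_BIT	/* not a flag: the number of bits allocated above */
};

#define DEMO_BIT(nr)		(1u << (nr))

#define DEMO_ACCOUNT		DEMO_BIT(DEMO_ACCOUNT_BIT)

/* A flag that is configured out still needs a definition, as plain zero. */
#ifdef CONFIG_DEMO_OBJ_EXT
#define DEMO_NO_OBJ_EXT		DEMO_BIT(DEMO_NO_OBJ_EXT_BIT)
#else
#define DEMO_NO_OBJ_EXT		0u
#endif

#ifdef CONFIG_DEMO_LOCKDEP
#define DEMO_NOLOCKDEP		DEMO_BIT(DEMO_NOLOCKDEP_BIT)
#else
#define DEMO_NOLOCKDEP		0u
#endif

/* No hand-maintained "26 + IS_ENABLED(...)" arithmetic is needed. */
#define DEMO_BITS_SHIFT		DEMO_LAST_BIT
#define DEMO_BITS_MASK		(DEMO_BIT(DEMO_BITS_SHIFT) - 1)
```

With the switches as written, DEMO_NO_OBJ_EXT takes bit 1,
DEMO_NOLOCKDEP is 0 and DEMO_BITS_SHIFT is 2; enabling
CONFIG_DEMO_LOCKDEP moves the shift to 3 without touching any other
line.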

Petr T

Andy Shevchenko

May 2, 2023, 11:20:04 AM5/2/23
to Kent Overstreet, James Bottomley, Suren Baghdasaryan, ak...@linux-foundation.org, mho...@suse.com, vba...@suse.cz, han...@cmpxchg.org, roman.g...@linux.dev, mgo...@suse.de, wi...@infradead.org, liam.h...@oracle.com, cor...@lwn.net, vo...@manifault.com, pet...@infradead.org, juri....@redhat.com, ldu...@linux.ibm.com, catalin...@arm.com, wi...@kernel.org, ar...@arndb.de, tg...@linutronix.de, mi...@redhat.com, dave....@linux.intel.com, x...@kernel.org, pet...@redhat.com, da...@redhat.com, ax...@kernel.dk, mcg...@kernel.org, masa...@kernel.org, nat...@kernel.org, den...@kernel.org, t...@kernel.org, muchu...@linux.dev, rp...@kernel.org, pau...@kernel.org, pasha.t...@soleen.com, yosry...@google.com, yuz...@google.com, dhow...@redhat.com, hu...@google.com, andre...@gmail.com, kees...@chromium.org, ndesau...@google.com, gre...@linuxfoundation.org, ebig...@google.com, ytc...@gmail.com, vincent...@linaro.org, dietmar....@arm.com, ros...@goodmis.org, bse...@google.com, bri...@redhat.com, vsch...@redhat.com, c...@linux.com, pen...@kernel.org, iamjoon...@lge.com, 42.h...@gmail.com, gli...@google.com, el...@google.com, dvy...@google.com, shak...@google.com, songm...@bytedance.com, jba...@akamai.com, rien...@google.com, min...@google.com, kales...@google.com, kerne...@android.com, linu...@vger.kernel.org, linux-...@vger.kernel.org, io...@lists.linux.dev, linux...@vger.kernel.org, linux-...@vger.kernel.org, linu...@kvack.org, linux-...@vger.kernel.org, kasa...@googlegroups.com, cgr...@vger.kernel.org, Andy Shevchenko, Michael Ellerman, Benjamin Herrenschmidt, Paul Mackerras, Michael S. Tsirkin, Jason Wang, Noralf Trønnes
Could you elaborate on what's wrong with that?

God no to a zillion APIs that do almost the same thing. Today you want a
space, tomorrow some other (special) delimiter.

Thomas Gleixner

May 2, 2023, 11:50:43 AM5/2/23
to Suren Baghdasaryan, ak...@linux-foundation.org, kent.ov...@linux.dev, mho...@suse.com, vba...@suse.cz, han...@cmpxchg.org, roman.g...@linux.dev, mgo...@suse.de, da...@stgolabs.net, wi...@infradead.org, liam.h...@oracle.com, cor...@lwn.net, vo...@manifault.com, pet...@infradead.org, juri....@redhat.com, ldu...@linux.ibm.com, catalin...@arm.com, wi...@kernel.org, ar...@arndb.de, mi...@redhat.com, dave....@linux.intel.com, x...@kernel.org, pet...@redhat.com, da...@redhat.com, ax...@kernel.dk, mcg...@kernel.org, masa...@kernel.org, nat...@kernel.org, den...@kernel.org, t...@kernel.org, muchu...@linux.dev, rp...@kernel.org, pau...@kernel.org, pasha.t...@soleen.com, yosry...@google.com, yuz...@google.com, dhow...@redhat.com, hu...@google.com, andre...@gmail.com, kees...@chromium.org, ndesau...@google.com, gre...@linuxfoundation.org, ebig...@google.com, ytc...@gmail.com, vincent...@linaro.org, dietmar....@arm.com, ros...@goodmis.org, bse...@google.com, bri...@redhat.com, vsch...@redhat.com, c...@linux.com, pen...@kernel.org, iamjoon...@lge.com, 42.h...@gmail.com, gli...@google.com, el...@google.com, dvy...@google.com, shak...@google.com, songm...@bytedance.com, jba...@akamai.com, rien...@google.com, min...@google.com, kales...@google.com, sur...@google.com, kerne...@android.com, linu...@vger.kernel.org, linux-...@vger.kernel.org, io...@lists.linux.dev, linux...@vger.kernel.org, linux-...@vger.kernel.org, linu...@kvack.org, linux-...@vger.kernel.org, kasa...@googlegroups.com, cgr...@vger.kernel.org
On Mon, May 01 2023 at 09:54, Suren Baghdasaryan wrote:
> From: Kent Overstreet <kent.ov...@linux.dev>
>
> This avoids a circular header dependency in an upcoming patch by only
> making hrtimer.h depend on percpu-defs.h
>
> Signed-off-by: Kent Overstreet <kent.ov...@linux.dev>
> Signed-off-by: Suren Baghdasaryan <sur...@google.com>
> Cc: Thomas Gleixner <tg...@linutronix.de>

Reviewed-by: Thomas Gleixner <tg...@linutronix.de>

Petr Tesařík

May 2, 2023, 11:50:57 AM5/2/23
to Suren Baghdasaryan, ak...@linux-foundation.org, kent.ov...@linux.dev, mho...@suse.com, vba...@suse.cz, han...@cmpxchg.org, roman.g...@linux.dev, mgo...@suse.de, da...@stgolabs.net, wi...@infradead.org, liam.h...@oracle.com, cor...@lwn.net, vo...@manifault.com, pet...@infradead.org, juri....@redhat.com, ldu...@linux.ibm.com, catalin...@arm.com, wi...@kernel.org, ar...@arndb.de, tg...@linutronix.de, mi...@redhat.com, dave....@linux.intel.com, x...@kernel.org, pet...@redhat.com, da...@redhat.com, ax...@kernel.dk, mcg...@kernel.org, masa...@kernel.org, nat...@kernel.org, den...@kernel.org, t...@kernel.org, muchu...@linux.dev, rp...@kernel.org, pau...@kernel.org, pasha.t...@soleen.com, yosry...@google.com, yuz...@google.com, dhow...@redhat.com, hu...@google.com, andre...@gmail.com, kees...@chromium.org, ndesau...@google.com, gre...@linuxfoundation.org, ebig...@google.com, ytc...@gmail.com, vincent...@linaro.org, dietmar....@arm.com, ros...@goodmis.org, bse...@google.com, bri...@redhat.com, vsch...@redhat.com, c...@linux.com, pen...@kernel.org, iamjoon...@lge.com, 42.h...@gmail.com, gli...@google.com, el...@google.com, dvy...@google.com, shak...@google.com, songm...@bytedance.com, jba...@akamai.com, rien...@google.com, min...@google.com, kales...@google.com, kerne...@android.com, linu...@vger.kernel.org, linux-...@vger.kernel.org, io...@lists.linux.dev, linux...@vger.kernel.org, linux-...@vger.kernel.org, linu...@kvack.org, linux-...@vger.kernel.org, kasa...@googlegroups.com, cgr...@vger.kernel.org
On Mon, 1 May 2023 09:54:29 -0700
Suren Baghdasaryan <sur...@google.com> wrote:

> After redefining alloc_pages, all uses of that name are being replaced.
> Change the conflicting names to prevent preprocessor from replacing them
> when it's not intended.
>
> Signed-off-by: Suren Baghdasaryan <sur...@google.com>
> ---
> arch/x86/kernel/amd_gart_64.c | 2 +-
> drivers/iommu/dma-iommu.c | 2 +-
> drivers/xen/grant-dma-ops.c | 2 +-
> drivers/xen/swiotlb-xen.c | 2 +-
> include/linux/dma-map-ops.h | 2 +-
> kernel/dma/mapping.c | 4 ++--
> 6 files changed, 7 insertions(+), 7 deletions(-)
>
> diff --git a/arch/x86/kernel/amd_gart_64.c b/arch/x86/kernel/amd_gart_64.c
> index 56a917df410d..842a0ec5eaa9 100644
> --- a/arch/x86/kernel/amd_gart_64.c
> +++ b/arch/x86/kernel/amd_gart_64.c
> @@ -676,7 +676,7 @@ static const struct dma_map_ops gart_dma_ops = {
> .get_sgtable = dma_common_get_sgtable,
> .dma_supported = dma_direct_supported,
> .get_required_mask = dma_direct_get_required_mask,
> - .alloc_pages = dma_direct_alloc_pages,
> + .alloc_pages_op = dma_direct_alloc_pages,
> .free_pages = dma_direct_free_pages,
> };
>
> diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c
> index 7a9f0b0bddbd..76a9d5ca4eee 100644
> --- a/drivers/iommu/dma-iommu.c
> +++ b/drivers/iommu/dma-iommu.c
> @@ -1556,7 +1556,7 @@ static const struct dma_map_ops iommu_dma_ops = {
> .flags = DMA_F_PCI_P2PDMA_SUPPORTED,
> .alloc = iommu_dma_alloc,
> .free = iommu_dma_free,
> - .alloc_pages = dma_common_alloc_pages,
> + .alloc_pages_op = dma_common_alloc_pages,
> .free_pages = dma_common_free_pages,
> .alloc_noncontiguous = iommu_dma_alloc_noncontiguous,
> .free_noncontiguous = iommu_dma_free_noncontiguous,
> diff --git a/drivers/xen/grant-dma-ops.c b/drivers/xen/grant-dma-ops.c
> index 9784a77fa3c9..6c7d984f164d 100644
> --- a/drivers/xen/grant-dma-ops.c
> +++ b/drivers/xen/grant-dma-ops.c
> @@ -282,7 +282,7 @@ static int xen_grant_dma_supported(struct device *dev, u64 mask)
> static const struct dma_map_ops xen_grant_dma_ops = {
> .alloc = xen_grant_dma_alloc,
> .free = xen_grant_dma_free,
> - .alloc_pages = xen_grant_dma_alloc_pages,
> + .alloc_pages_op = xen_grant_dma_alloc_pages,
> .free_pages = xen_grant_dma_free_pages,
> .mmap = dma_common_mmap,
> .get_sgtable = dma_common_get_sgtable,
> diff --git a/drivers/xen/swiotlb-xen.c b/drivers/xen/swiotlb-xen.c
> index 67aa74d20162..5ab2616153f0 100644
> --- a/drivers/xen/swiotlb-xen.c
> +++ b/drivers/xen/swiotlb-xen.c
> @@ -403,6 +403,6 @@ const struct dma_map_ops xen_swiotlb_dma_ops = {
> .dma_supported = xen_swiotlb_dma_supported,
> .mmap = dma_common_mmap,
> .get_sgtable = dma_common_get_sgtable,
> - .alloc_pages = dma_common_alloc_pages,
> + .alloc_pages_op = dma_common_alloc_pages,
> .free_pages = dma_common_free_pages,
> };
> diff --git a/include/linux/dma-map-ops.h b/include/linux/dma-map-ops.h
> index 31f114f486c4..d741940dcb3b 100644
> --- a/include/linux/dma-map-ops.h
> +++ b/include/linux/dma-map-ops.h
> @@ -27,7 +27,7 @@ struct dma_map_ops {
> unsigned long attrs);
> void (*free)(struct device *dev, size_t size, void *vaddr,
> dma_addr_t dma_handle, unsigned long attrs);
> - struct page *(*alloc_pages)(struct device *dev, size_t size,
> + struct page *(*alloc_pages_op)(struct device *dev, size_t size,
> dma_addr_t *dma_handle, enum dma_data_direction dir,
> gfp_t gfp);
> void (*free_pages)(struct device *dev, size_t size, struct page *vaddr,
> diff --git a/kernel/dma/mapping.c b/kernel/dma/mapping.c
> index 9a4db5cce600..fc42930af14b 100644
> --- a/kernel/dma/mapping.c
> +++ b/kernel/dma/mapping.c
> @@ -570,9 +570,9 @@ static struct page *__dma_alloc_pages(struct device *dev, size_t size,
> size = PAGE_ALIGN(size);
> if (dma_alloc_direct(dev, ops))
> return dma_direct_alloc_pages(dev, size, dma_handle, dir, gfp);
> - if (!ops->alloc_pages)
> + if (!ops->alloc_pages_op)
> return NULL;
> - return ops->alloc_pages(dev, size, dma_handle, dir, gfp);
> + return ops->alloc_pages_op(dev, size, dma_handle, dir, gfp);
> }
>
> struct page *dma_alloc_pages(struct device *dev, size_t size,

I'm not impressed. This patch increases churn for code which does not
(directly) benefit from the change, and that for limitations in your
tooling?

Why not just rename the conflicting uses in your local tree, but then
remove the rename from the final patch series?

Suren Baghdasaryan

May 2, 2023, 2:33:27 PM5/2/23
to Petr Tesařík, ak...@linux-foundation.org, kent.ov...@linux.dev, mho...@suse.com, vba...@suse.cz, han...@cmpxchg.org, roman.g...@linux.dev, mgo...@suse.de, da...@stgolabs.net, wi...@infradead.org, liam.h...@oracle.com, cor...@lwn.net, vo...@manifault.com, pet...@infradead.org, juri....@redhat.com, ldu...@linux.ibm.com, catalin...@arm.com, wi...@kernel.org, ar...@arndb.de, tg...@linutronix.de, mi...@redhat.com, dave....@linux.intel.com, x...@kernel.org, pet...@redhat.com, da...@redhat.com, ax...@kernel.dk, mcg...@kernel.org, masa...@kernel.org, nat...@kernel.org, den...@kernel.org, t...@kernel.org, muchu...@linux.dev, rp...@kernel.org, pau...@kernel.org, pasha.t...@soleen.com, yosry...@google.com, yuz...@google.com, dhow...@redhat.com, hu...@google.com, andre...@gmail.com, kees...@chromium.org, ndesau...@google.com, gre...@linuxfoundation.org, ebig...@google.com, ytc...@gmail.com, vincent...@linaro.org, dietmar....@arm.com, ros...@goodmis.org, bse...@google.com, bri...@redhat.com, vsch...@redhat.com, c...@linux.com, pen...@kernel.org, iamjoon...@lge.com, 42.h...@gmail.com, gli...@google.com, el...@google.com, dvy...@google.com, shak...@google.com, songm...@bytedance.com, jba...@akamai.com, rien...@google.com, min...@google.com, kales...@google.com, kerne...@android.com, linu...@vger.kernel.org, linux-...@vger.kernel.org, io...@lists.linux.dev, linux...@vger.kernel.org, linux-...@vger.kernel.org, linu...@kvack.org, linux-...@vger.kernel.org, kasa...@googlegroups.com, cgr...@vger.kernel.org
Thanks for the reference. I'll take a closer look and will try to clean it up.
Ack.

Thanks for reviewing!
Suren.

>
> Petr T

Suren Baghdasaryan

May 2, 2023, 2:39:01 PM5/2/23
to Petr Tesařík, ak...@linux-foundation.org, kent.ov...@linux.dev, mho...@suse.com, vba...@suse.cz, han...@cmpxchg.org, roman.g...@linux.dev, mgo...@suse.de, da...@stgolabs.net, wi...@infradead.org, liam.h...@oracle.com, cor...@lwn.net, vo...@manifault.com, pet...@infradead.org, juri....@redhat.com, ldu...@linux.ibm.com, catalin...@arm.com, wi...@kernel.org, ar...@arndb.de, tg...@linutronix.de, mi...@redhat.com, dave....@linux.intel.com, x...@kernel.org, pet...@redhat.com, da...@redhat.com, ax...@kernel.dk, mcg...@kernel.org, masa...@kernel.org, nat...@kernel.org, den...@kernel.org, t...@kernel.org, muchu...@linux.dev, rp...@kernel.org, pau...@kernel.org, pasha.t...@soleen.com, yosry...@google.com, yuz...@google.com, dhow...@redhat.com, hu...@google.com, andre...@gmail.com, kees...@chromium.org, ndesau...@google.com, gre...@linuxfoundation.org, ebig...@google.com, ytc...@gmail.com, vincent...@linaro.org, dietmar....@arm.com, ros...@goodmis.org, bse...@google.com, bri...@redhat.com, vsch...@redhat.com, c...@linux.com, pen...@kernel.org, iamjoon...@lge.com, 42.h...@gmail.com, gli...@google.com, el...@google.com, dvy...@google.com, shak...@google.com, songm...@bytedance.com, jba...@akamai.com, rien...@google.com, min...@google.com, kales...@google.com, kerne...@android.com, linu...@vger.kernel.org, linux-...@vger.kernel.org, io...@lists.linux.dev, linux...@vger.kernel.org, linux-...@vger.kernel.org, linu...@kvack.org, linux-...@vger.kernel.org, kasa...@googlegroups.com, cgr...@vger.kernel.org
With alloc_pages function becoming a macro, the preprocessor ends up
replacing all instances of that name, even when it's not used as a
function. That is what necessitates this change. If there is a way to
work around this issue without changing all alloc_pages() calls in the
source base I would love to learn it but I'm not quite clear about
your suggestion and if it solves the issue. Could you please provide
more details?

>
> Just my two cents,
> Petr T
>

Kent Overstreet

May 2, 2023, 3:58:07 PM5/2/23
to Petr Tesařík, Suren Baghdasaryan, ak...@linux-foundation.org, mho...@suse.com, vba...@suse.cz, han...@cmpxchg.org, roman.g...@linux.dev, mgo...@suse.de, da...@stgolabs.net, wi...@infradead.org, liam.h...@oracle.com, cor...@lwn.net, vo...@manifault.com, pet...@infradead.org, juri....@redhat.com, ldu...@linux.ibm.com, catalin...@arm.com, wi...@kernel.org, ar...@arndb.de, tg...@linutronix.de, mi...@redhat.com, dave....@linux.intel.com, x...@kernel.org, pet...@redhat.com, da...@redhat.com, ax...@kernel.dk, mcg...@kernel.org, masa...@kernel.org, nat...@kernel.org, den...@kernel.org, t...@kernel.org, muchu...@linux.dev, rp...@kernel.org, pau...@kernel.org, pasha.t...@soleen.com, yosry...@google.com, yuz...@google.com, dhow...@redhat.com, hu...@google.com, andre...@gmail.com, kees...@chromium.org, ndesau...@google.com, gre...@linuxfoundation.org, ebig...@google.com, ytc...@gmail.com, vincent...@linaro.org, dietmar....@arm.com, ros...@goodmis.org, bse...@google.com, bri...@redhat.com, vsch...@redhat.com, c...@linux.com, pen...@kernel.org, iamjoon...@lge.com, 42.h...@gmail.com, gli...@google.com, el...@google.com, dvy...@google.com, shak...@google.com, songm...@bytedance.com, jba...@akamai.com, rien...@google.com, min...@google.com, kales...@google.com, kerne...@android.com, linu...@vger.kernel.org, linux-...@vger.kernel.org, io...@lists.linux.dev, linux...@vger.kernel.org, linux-...@vger.kernel.org, linu...@kvack.org, linux-...@vger.kernel.org, kasa...@googlegroups.com, cgr...@vger.kernel.org, Alexander Viro
It's a one line macro, it's fine.

> Besides, it works for you only because __kmem_cache_alloc_lru() is
> declared __always_inline (unless CONFIG_SLUB_TINY is defined, but then
> you probably don't want the tracking either). In any case, it's going
> to be difficult for people to understand why and how this works.

I think you must be confused. kmem_cache_alloc_lru() is a macro, and we
need that macro to be expanded at the alloc_inode_sb() callsite. It's
got nothing to do with whether or not __kmem_cache_alloc_lru() is inline
or not.

> If the actual caller of alloc_inode_sb() is needed, I'd rather add it
> as a parameter and pass down _RET_IP_ explicitly here.

That approach was considered, but adding an ip parameter to every memory
allocation function would've been far more churn.

Petr Tesařík

May 2, 2023, 4:09:15 PM5/2/23
to Suren Baghdasaryan, ak...@linux-foundation.org, kent.ov...@linux.dev, mho...@suse.com, vba...@suse.cz, han...@cmpxchg.org, roman.g...@linux.dev, mgo...@suse.de, da...@stgolabs.net, wi...@infradead.org, liam.h...@oracle.com, cor...@lwn.net, vo...@manifault.com, pet...@infradead.org, juri....@redhat.com, ldu...@linux.ibm.com, catalin...@arm.com, wi...@kernel.org, ar...@arndb.de, tg...@linutronix.de, mi...@redhat.com, dave....@linux.intel.com, x...@kernel.org, pet...@redhat.com, da...@redhat.com, ax...@kernel.dk, mcg...@kernel.org, masa...@kernel.org, nat...@kernel.org, den...@kernel.org, t...@kernel.org, muchu...@linux.dev, rp...@kernel.org, pau...@kernel.org, pasha.t...@soleen.com, yosry...@google.com, yuz...@google.com, dhow...@redhat.com, hu...@google.com, andre...@gmail.com, kees...@chromium.org, ndesau...@google.com, gre...@linuxfoundation.org, ebig...@google.com, ytc...@gmail.com, vincent...@linaro.org, dietmar....@arm.com, ros...@goodmis.org, bse...@google.com, bri...@redhat.com, vsch...@redhat.com, c...@linux.com, pen...@kernel.org, iamjoon...@lge.com, 42.h...@gmail.com, gli...@google.com, el...@google.com, dvy...@google.com, shak...@google.com, songm...@bytedance.com, jba...@akamai.com, rien...@google.com, min...@google.com, kales...@google.com, kerne...@android.com, linu...@vger.kernel.org, linux-...@vger.kernel.org, io...@lists.linux.dev, linux...@vger.kernel.org, linux-...@vger.kernel.org, linu...@kvack.org, linux-...@vger.kernel.org, kasa...@googlegroups.com, cgr...@vger.kernel.org
On Tue, 2 May 2023 11:38:49 -0700
Ah, right, I admit I did not quite understand why this change is
needed. However, this is exactly what I don't like about preprocessor
macros. Each macro effectively adds a new keyword to the language.

I believe everything can be solved with inline functions. What exactly
does not work if you rename alloc_pages() to e.g. alloc_pages_caller()
and then add an alloc_pages() inline function which calls
alloc_pages_caller() with _RET_IP_ as a parameter?

Petr T

Kent Overstreet

May 2, 2023, 4:18:17 PM5/2/23
to Petr Tesařík, Suren Baghdasaryan, ak...@linux-foundation.org, mho...@suse.com, vba...@suse.cz, han...@cmpxchg.org, roman.g...@linux.dev, mgo...@suse.de, da...@stgolabs.net, wi...@infradead.org, liam.h...@oracle.com, cor...@lwn.net, vo...@manifault.com, pet...@infradead.org, juri....@redhat.com, ldu...@linux.ibm.com, catalin...@arm.com, wi...@kernel.org, ar...@arndb.de, tg...@linutronix.de, mi...@redhat.com, dave....@linux.intel.com, x...@kernel.org, pet...@redhat.com, da...@redhat.com, ax...@kernel.dk, mcg...@kernel.org, masa...@kernel.org, nat...@kernel.org, den...@kernel.org, t...@kernel.org, muchu...@linux.dev, rp...@kernel.org, pau...@kernel.org, pasha.t...@soleen.com, yosry...@google.com, yuz...@google.com, dhow...@redhat.com, hu...@google.com, andre...@gmail.com, kees...@chromium.org, ndesau...@google.com, gre...@linuxfoundation.org, ebig...@google.com, ytc...@gmail.com, vincent...@linaro.org, dietmar....@arm.com, ros...@goodmis.org, bse...@google.com, bri...@redhat.com, vsch...@redhat.com, c...@linux.com, pen...@kernel.org, iamjoon...@lge.com, 42.h...@gmail.com, gli...@google.com, el...@google.com, dvy...@google.com, shak...@google.com, songm...@bytedance.com, jba...@akamai.com, rien...@google.com, min...@google.com, kales...@google.com, kerne...@android.com, linu...@vger.kernel.org, linux-...@vger.kernel.org, io...@lists.linux.dev, linux...@vger.kernel.org, linux-...@vger.kernel.org, linu...@kvack.org, linux-...@vger.kernel.org, kasa...@googlegroups.com, cgr...@vger.kernel.org
On Tue, May 02, 2023 at 10:09:09PM +0200, Petr Tesařík wrote:
> Ah, right, I admit I did not quite understand why this change is
> needed. However, this is exactly what I don't like about preprocessor
> macros. Each macro effectively adds a new keyword to the language.
>
> I believe everything can be solved with inline functions. What exactly
> does not work if you rename alloc_pages() to e.g. alloc_pages_caller()
> and then add an alloc_pages() inline function which calls
> alloc_pages_caller() with _RET_IP_ as a parameter?

Perhaps you should spend a little more time reading the patchset and
learning how the code works before commenting.

Petr Tesařík

May 2, 2023, 4:20:42 PM5/2/23
to Kent Overstreet, Suren Baghdasaryan, ak...@linux-foundation.org, mho...@suse.com, vba...@suse.cz, han...@cmpxchg.org, roman.g...@linux.dev, mgo...@suse.de, da...@stgolabs.net, wi...@infradead.org, liam.h...@oracle.com, cor...@lwn.net, vo...@manifault.com, pet...@infradead.org, juri....@redhat.com, ldu...@linux.ibm.com, catalin...@arm.com, wi...@kernel.org, ar...@arndb.de, tg...@linutronix.de, mi...@redhat.com, dave....@linux.intel.com, x...@kernel.org, pet...@redhat.com, da...@redhat.com, ax...@kernel.dk, mcg...@kernel.org, masa...@kernel.org, nat...@kernel.org, den...@kernel.org, t...@kernel.org, muchu...@linux.dev, rp...@kernel.org, pau...@kernel.org, pasha.t...@soleen.com, yosry...@google.com, yuz...@google.com, dhow...@redhat.com, hu...@google.com, andre...@gmail.com, kees...@chromium.org, ndesau...@google.com, gre...@linuxfoundation.org, ebig...@google.com, ytc...@gmail.com, vincent...@linaro.org, dietmar....@arm.com, ros...@goodmis.org, bse...@google.com, bri...@redhat.com, vsch...@redhat.com, c...@linux.com, pen...@kernel.org, iamjoon...@lge.com, 42.h...@gmail.com, gli...@google.com, el...@google.com, dvy...@google.com, shak...@google.com, songm...@bytedance.com, jba...@akamai.com, rien...@google.com, min...@google.com, kales...@google.com, kerne...@android.com, linu...@vger.kernel.org, linux-...@vger.kernel.org, io...@lists.linux.dev, linux...@vger.kernel.org, linux-...@vger.kernel.org, linu...@kvack.org, linux-...@vger.kernel.org, kasa...@googlegroups.com, cgr...@vger.kernel.org, Alexander Viro
It's not the same. A macro effectively adds a keyword, because it gets
expanded regardless of context; for example, you can't declare a local
variable called alloc_inode_sb, and the compiler errors may be quite
confusing at first. See also the discussion about patch 19/40 in this
series.

> > Besides, it works for you only because __kmem_cache_alloc_lru() is
> > declared __always_inline (unless CONFIG_SLUB_TINY is defined, but then
> > you probably don't want the tracking either). In any case, it's going
> > to be difficult for people to understand why and how this works.
>
> I think you must be confused. kmem_cache_alloc_lru() is a macro, and we
> need that macro to be expanded at the alloc_inode_sb() callsite. It's
> got nothing to do with whether or not __kmem_cache_alloc_lru() is inline
> or not.

Oh no, I am not confused. Look at the definition of
kmem_cache_alloc_lru():

void *kmem_cache_alloc_lru(struct kmem_cache *s, struct list_lru *lru,
			   gfp_t gfpflags)
{
	return __kmem_cache_alloc_lru(s, lru, gfpflags);
}

See? No _RET_IP_ here. That's because it's here:

static __fastpath_inline
void *__kmem_cache_alloc_lru(struct kmem_cache *s, struct list_lru *lru,
			     gfp_t gfpflags)
{
	void *ret = slab_alloc(s, lru, gfpflags, _RET_IP_, s->object_size);

	trace_kmem_cache_alloc(_RET_IP_, ret, s, gfpflags, NUMA_NO_NODE);

	return ret;
}

Now, if __kmem_cache_alloc_lru() is not inlined, then this _RET_IP_
will be somewhere inside kmem_cache_alloc_lru(), which is not very
useful.

But what is __fastpath_inline? Well, it depends:

#ifndef CONFIG_SLUB_TINY
#define __fastpath_inline __always_inline
#else
#define __fastpath_inline
#endif

In short, if CONFIG_SLUB_TINY is defined, it's up to the C compiler
whether __kmem_cache_alloc_lru() is inlined or not.
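The dependence on inlining can be reproduced in user space with GCC or Clang. A minimal sketch, assuming hypothetical names throughout (`RET_IP`, `ip_inlined`, `ip_outofline`, `struct ip_pair`, `wrapper`); `RET_IP` is modelled on the kernel's `_RET_IP_`, which is built on `__builtin_return_address(0)`. When the helper is forced inline, the builtin reports the return address of the enclosing function; when it is kept out of line, it reports a location inside that function.

```c
#include <assert.h>

/* Modelled on the kernel's _RET_IP_ (hypothetical local name). */
#define RET_IP ((unsigned long)__builtin_return_address(0))

/* Forced inline: RET_IP is evaluated in the frame of the function this
 * helper is inlined into, so it names that function's caller. */
static inline __attribute__((always_inline)) unsigned long ip_inlined(void)
{
	return RET_IP;
}

/* Kept out of line: RET_IP names the instruction after the call to this
 * helper, i.e. a location inside wrapper() below. */
static __attribute__((noinline)) unsigned long ip_outofline(void)
{
	return RET_IP;
}

struct ip_pair {
	unsigned long inlined;
	unsigned long outofline;
};

/* Plays the role of kmem_cache_alloc_lru(): the function whose caller
 * we actually want to record. */
static __attribute__((noinline)) struct ip_pair wrapper(void)
{
	struct ip_pair p;

	p.inlined = ip_inlined();	/* address in wrapper()'s caller */
	p.outofline = ip_outofline();	/* address inside wrapper() */
	/* The two helpers disagree about who "the caller" is. */
	assert(p.inlined != p.outofline);
	return p;
}
```

Only the forced-inline variant yields an address that identifies the code calling `wrapper()`, which is the case `__fastpath_inline` guarantees when it expands to `__always_inline`.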

> > If the actual caller of alloc_inode_sb() is needed, I'd rather add it
> > as a parameter and pass down _RET_IP_ explicitly here.
>
> That approach was considered, but adding an ip parameter to every memory
> allocation function would've been far more churn.

See my reply to patch 19/40. Rename the original function, but add an
__always_inline function with the original signature, and let it take
care of _RET_IP_.

Petr T

Suren Baghdasaryan

May 2, 2023, 4:24:50 PM5/2/23
to Petr Tesařík, ak...@linux-foundation.org, kent.ov...@linux.dev, mho...@suse.com, vba...@suse.cz, han...@cmpxchg.org, roman.g...@linux.dev, mgo...@suse.de, da...@stgolabs.net, wi...@infradead.org, liam.h...@oracle.com, cor...@lwn.net, vo...@manifault.com, pet...@infradead.org, juri....@redhat.com, ldu...@linux.ibm.com, catalin...@arm.com, wi...@kernel.org, ar...@arndb.de, tg...@linutronix.de, mi...@redhat.com, dave....@linux.intel.com, x...@kernel.org, pet...@redhat.com, da...@redhat.com, ax...@kernel.dk, mcg...@kernel.org, masa...@kernel.org, nat...@kernel.org, den...@kernel.org, t...@kernel.org, muchu...@linux.dev, rp...@kernel.org, pau...@kernel.org, pasha.t...@soleen.com, yosry...@google.com, yuz...@google.com, dhow...@redhat.com, hu...@google.com, andre...@gmail.com, kees...@chromium.org, ndesau...@google.com, gre...@linuxfoundation.org, ebig...@google.com, ytc...@gmail.com, vincent...@linaro.org, dietmar....@arm.com, ros...@goodmis.org, bse...@google.com, bri...@redhat.com, vsch...@redhat.com, c...@linux.com, pen...@kernel.org, iamjoon...@lge.com, 42.h...@gmail.com, gli...@google.com, el...@google.com, dvy...@google.com, shak...@google.com, songm...@bytedance.com, jba...@akamai.com, rien...@google.com, min...@google.com, kales...@google.com, kerne...@android.com, linu...@vger.kernel.org, linux-...@vger.kernel.org, io...@lists.linux.dev, linux...@vger.kernel.org, linux-...@vger.kernel.org, linu...@kvack.org, linux-...@vger.kernel.org, kasa...@googlegroups.com, cgr...@vger.kernel.org
I don't think that would work because we need to inject the codetag at
the file/line of the actual allocation call. If we pass _RET_IP_ then
we would have to look up the codetag associated with that _RET_IP_
which results in additional runtime overhead.
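The difference between the two approaches can be sketched in user-space C. This is a simplified illustration, not the series' implementation: `struct codetag` as laid out here, `all_tags`, `tagged_malloc` and `my_alloc` are hypothetical, and the macro relies on GNU statement expressions. Each expansion of the macro plants one static tag carrying its own file and line, so accounting is a direct pointer dereference with no lookup keyed on a return address.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical per-call-site record; one instance per macro expansion. */
struct codetag {
	const char *file;
	int line;
	size_t bytes;		/* total bytes requested at this site */
	size_t calls;		/* number of allocations at this site */
	struct codetag *next;	/* simple registry for reporting */
};

/* Hypothetical registry of every tag that has been hit at least once. */
static struct codetag *all_tags;

static void *tagged_malloc(struct codetag *tag, size_t size)
{
	assert(tag != NULL);
	if (tag->calls == 0) {		/* first hit: link into the registry */
		tag->next = all_tags;
		all_tags = tag;
	}
	tag->calls++;
	tag->bytes += size;
	return malloc(size);
}

/* The macro is what makes __FILE__/__LINE__ refer to the caller's source
 * location and gives every call site its own static tag. */
#define my_alloc(size) ({						\
	static struct codetag _tag = {					\
		.file = __FILE__, .line = __LINE__,			\
	};								\
	tagged_malloc(&_tag, (size));					\
})
```

A loop that calls `my_alloc(16)` three times charges 48 bytes to a single tag, while a second `my_alloc()` elsewhere in the source gets a separate tag. An inline function cannot do this, because `__FILE__` and `__LINE__` inside it would name the function's own definition.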

Petr Tesařík

May 2, 2023, 4:39:20 PM5/2/23
to Suren Baghdasaryan, ak...@linux-foundation.org, kent.ov...@linux.dev, mho...@suse.com, vba...@suse.cz, han...@cmpxchg.org, roman.g...@linux.dev, mgo...@suse.de, da...@stgolabs.net, wi...@infradead.org, liam.h...@oracle.com, cor...@lwn.net, vo...@manifault.com, pet...@infradead.org, juri....@redhat.com, ldu...@linux.ibm.com, catalin...@arm.com, wi...@kernel.org, ar...@arndb.de, tg...@linutronix.de, mi...@redhat.com, dave....@linux.intel.com, x...@kernel.org, pet...@redhat.com, da...@redhat.com, ax...@kernel.dk, mcg...@kernel.org, masa...@kernel.org, nat...@kernel.org, den...@kernel.org, t...@kernel.org, muchu...@linux.dev, rp...@kernel.org, pau...@kernel.org, pasha.t...@soleen.com, yosry...@google.com, yuz...@google.com, dhow...@redhat.com, hu...@google.com, andre...@gmail.com, kees...@chromium.org, ndesau...@google.com, gre...@linuxfoundation.org, ebig...@google.com, ytc...@gmail.com, vincent...@linaro.org, dietmar....@arm.com, ros...@goodmis.org, bse...@google.com, bri...@redhat.com, vsch...@redhat.com, c...@linux.com, pen...@kernel.org, iamjoon...@lge.com, 42.h...@gmail.com, gli...@google.com, el...@google.com, dvy...@google.com, shak...@google.com, songm...@bytedance.com, jba...@akamai.com, rien...@google.com, min...@google.com, kales...@google.com, kerne...@android.com, linu...@vger.kernel.org, linux-...@vger.kernel.org, io...@lists.linux.dev, linux...@vger.kernel.org, linux-...@vger.kernel.org, linu...@kvack.org, linux-...@vger.kernel.org, kasa...@googlegroups.com, cgr...@vger.kernel.org
On Tue, 2 May 2023 13:24:37 -0700
OK. If the reference to source code itself must be recorded in the
kernel, and not resolved later (either by the debugfs read fops, or by
a tool which reads the file), then this information can only be
obtained with a preprocessor macro.

I was hoping that a debugging feature could be less intrusive. OTOH
it's not my call to balance the tradeoffs.

Thank you for your patient explanations.

Petr T

Suren Baghdasaryan

May 2, 2023, 4:42:02 PM5/2/23
to Petr Tesařík, ak...@linux-foundation.org, kent.ov...@linux.dev, mho...@suse.com, vba...@suse.cz, han...@cmpxchg.org, roman.g...@linux.dev, mgo...@suse.de, da...@stgolabs.net, wi...@infradead.org, liam.h...@oracle.com, cor...@lwn.net, vo...@manifault.com, pet...@infradead.org, juri....@redhat.com, ldu...@linux.ibm.com, catalin...@arm.com, wi...@kernel.org, ar...@arndb.de, tg...@linutronix.de, mi...@redhat.com, dave....@linux.intel.com, x...@kernel.org, pet...@redhat.com, da...@redhat.com, ax...@kernel.dk, mcg...@kernel.org, masa...@kernel.org, nat...@kernel.org, den...@kernel.org, t...@kernel.org, muchu...@linux.dev, rp...@kernel.org, pau...@kernel.org, pasha.t...@soleen.com, yosry...@google.com, yuz...@google.com, dhow...@redhat.com, hu...@google.com, andre...@gmail.com, kees...@chromium.org, ndesau...@google.com, gre...@linuxfoundation.org, ebig...@google.com, ytc...@gmail.com, vincent...@linaro.org, dietmar....@arm.com, ros...@goodmis.org, bse...@google.com, bri...@redhat.com, vsch...@redhat.com, c...@linux.com, pen...@kernel.org, iamjoon...@lge.com, 42.h...@gmail.com, gli...@google.com, el...@google.com, dvy...@google.com, shak...@google.com, songm...@bytedance.com, jba...@akamai.com, rien...@google.com, min...@google.com, kales...@google.com, kerne...@android.com, linu...@vger.kernel.org, linux-...@vger.kernel.org, io...@lists.linux.dev, linux...@vger.kernel.org, linux-...@vger.kernel.org, linu...@kvack.org, linux-...@vger.kernel.org, kasa...@googlegroups.com, cgr...@vger.kernel.org
Thanks for reviewing and the suggestions! I'll address the actionable
ones in the next version.
Suren.

Dave Chinner

May 2, 2023, 6:50:21 PM5/2/23
to James Bottomley, Kent Overstreet, Suren Baghdasaryan, ak...@linux-foundation.org, mho...@suse.com, vba...@suse.cz, han...@cmpxchg.org, roman.g...@linux.dev, mgo...@suse.de, wi...@infradead.org, liam.h...@oracle.com, cor...@lwn.net, vo...@manifault.com, pet...@infradead.org, juri....@redhat.com, ldu...@linux.ibm.com, catalin...@arm.com, wi...@kernel.org, ar...@arndb.de, tg...@linutronix.de, mi...@redhat.com, dave....@linux.intel.com, x...@kernel.org, pet...@redhat.com, da...@redhat.com, ax...@kernel.dk, mcg...@kernel.org, masa...@kernel.org, nat...@kernel.org, den...@kernel.org, t...@kernel.org, muchu...@linux.dev, rp...@kernel.org, pau...@kernel.org, pasha.t...@soleen.com, yosry...@google.com, yuz...@google.com, dhow...@redhat.com, hu...@google.com, andre...@gmail.com, kees...@chromium.org, ndesau...@google.com, gre...@linuxfoundation.org, ebig...@google.com, ytc...@gmail.com, vincent...@linaro.org, dietmar....@arm.com, ros...@goodmis.org, bse...@google.com, bri...@redhat.com, vsch...@redhat.com, c...@linux.com, pen...@kernel.org, iamjoon...@lge.com, 42.h...@gmail.com, gli...@google.com, el...@google.com, dvy...@google.com, shak...@google.com, songm...@bytedance.com, jba...@akamai.com, rien...@google.com, min...@google.com, kales...@google.com, kerne...@android.com, linu...@vger.kernel.org, linux-...@vger.kernel.org, io...@lists.linux.dev, linux...@vger.kernel.org, linux-...@vger.kernel.org, linu...@kvack.org, linux-...@vger.kernel.org, kasa...@googlegroups.com, cgr...@vger.kernel.org, Andy Shevchenko, Michael Ellerman, Benjamin Herrenschmidt, Paul Mackerras, Michael S. Tsirkin, Jason Wang, Noralf Trønnes
On Tue, May 02, 2023 at 07:42:59AM -0400, James Bottomley wrote:
> On Mon, 2023-05-01 at 23:17 -0400, Kent Overstreet wrote:
> > On Mon, May 01, 2023 at 10:22:18PM -0400, James Bottomley wrote:
> > > It is not used just for debug.  It's used all over the kernel for
> > > printing out device sizes.  The output mostly goes to the kernel
> > > print buffer, so it's anyone's guess as to what, if any, tools are
> > > parsing it, but the concern about breaking log parsers seems to be
> > > a valid one.
> >
> > Ok, there is sd_print_capacity() - but who in their right mind would
> > be trying to scrape device sizes, in human readable units,
>
> If you bother to google "kernel log parser", you'll discover it's quite
> an active area which supports a load of company business models.

That doesn't mean log messages are unchangeable ABI. Indeed, we had
the whole "printk_index_emit()" addition recently to create
an external index of printk message formats for such applications to
use. [*]

> > from log messages when it's available in sysfs/procfs (actually, is
> > it in sysfs? if not, that's an oversight) in more reasonable units?
>
> It's not in sysfs, no. As aren't a lot of things, which is why log
> parsing for system monitoring is big business.

And that big business is why printk_index_emit() exists to allow
them to easily determine how log messages change format and come and
go across different kernel versions.

> > Correct me if I'm wrong, but I've yet to hear about kernel log
> > messages being considered a stable interface, and this seems a bit out
> > there.
>
> It might not be listed as stable, but when it's known there's a large
> ecosystem out there consuming it we shouldn't break it just because you
> feel like it.

But we've solved this problem already, yes?

If the userspace applications are not using the kernel printk format
index to detect such changes between kernel versions, then they
should be. This makes trivial issues, like whether or not we have a
space between units, completely irrelevant, because the entry in the
printk format index for the log output we emit will match whatever
is output by the kernel....
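How a log consumer might use the index can be sketched as follows. This is a rough user-space illustration under assumptions: it relies only on the line layout `<level[,flags]> filename:line function "format"` shown in the commit message quoted at the end of this mail, the names `struct index_entry`, `parse_index_line` and `format_changed` are hypothetical, and the parser does not handle escaped quotes inside the format string.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical in-memory form of one printk index line. */
struct index_entry {
	char level[16];		/* level, plus optional ",flags" */
	char file[256];
	int line;
	char function[128];
	char format[512];	/* text between the double quotes */
};

/* Parse one index line; returns 0 on success, -1 for the "#" header
 * line or anything that does not match the assumed layout. */
static int parse_index_line(const char *s, struct index_entry *e)
{
	assert(s != NULL && e != NULL);
	if (s[0] == '#')
		return -1;
	int n = sscanf(s, "<%15[^>]> %255[^:]:%d %127s \"%511[^\"]\"",
		       e->level, e->file, &e->line, e->function, e->format);
	return n == 5 ? 0 : -1;
}

/* Returns 1 if the same call site now emits a different format string,
 * e.g. after a unit-spacing change between two kernel versions. */
static int format_changed(const struct index_entry *old_e,
			  const struct index_entry *new_e)
{
	return strcmp(old_e->format, new_e->format) != 0;
}
```

A monitoring tool could parse the index of the old and the new kernel, match entries by file and function, and flag the ones for which `format_changed()` returns 1 before updating its matching rules.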

Cheers,

Dave.

[*]
commit 337015573718b161891a3473d25f59273f2e626b
Author: Chris Down <ch...@chrisdown.name>
Date: Tue Jun 15 17:52:53 2021 +0100

printk: Userspace format indexing support

We have a number of systems industry-wide that have a subset of their
functionality that works as follows:

1. Receive a message from local kmsg, serial console, or netconsole;
2. Apply a set of rules to classify the message;
3. Do something based on this classification (like scheduling a
remediation for the machine), rinse, and repeat.

As a couple of examples of places we have this implemented just inside
Facebook, although this isn't a Facebook-specific problem, we have this
inside our netconsole processing (for alarm classification), and as part
of our machine health checking. We use these messages to determine
fairly important metrics around production health, and it's important
that we get them right.

While for some kinds of issues we have counters, tracepoints, or metrics
with a stable interface which can reliably indicate the issue, in order
to react to production issues quickly we need to work with the interface
which most kernel developers naturally use when developing: printk.

Most production issues come from unexpected phenomena, and as such
usually the code in question doesn't have easily usable tracepoints or
other counters available for the specific problem being mitigated. We
have a number of lines of monitoring defence against problems in
production (host metrics, process metrics, service metrics, etc), and
where it's not feasible to reliably monitor at another level, this kind
of pragmatic netconsole monitoring is essential.

As one would expect, monitoring using printk is rather brittle for a
number of reasons -- most notably that the message might disappear
entirely in a new version of the kernel, or that the message may change
in some way that the regex or other classification methods start to
silently fail.

One factor that makes this even harder is that, under normal operation,
many of these messages are never expected to be hit. For example, there
may be a rare hardware bug which one wants to detect if it was to ever
happen again, but its recurrence is not likely or anticipated. This
precludes using something like checking whether the printk in question
was printed somewhere fleetwide recently to determine whether the
message in question is still present or not, since we don't anticipate
that it should be printed anywhere, but still need to monitor for its
future presence in the long-term.

This class of issue has happened on a number of occasions, causing
unhealthy machines with hardware issues to remain in production for
longer than ideal. As a recent example, some monitoring around
blk_update_request fell out of date and caused semi-broken machines to
remain in production for longer than would be desirable.

Searching through the codebase to find the message is also extremely
fragile, because many of the messages are further constructed beyond
their callsite (eg. btrfs_printk and other module-specific wrappers,
each with their own functionality). Even if they aren't, guessing the
format and formulation of the underlying message based on the aesthetics
of the message emitted is not a recipe for success at scale, and our
previous issues with fleetwide machine health checking demonstrate as
much.

This provides a solution to the issue of silently changed or deleted
printks: we record pointers to all printk format strings known at
compile time into a new .printk_index section, both in vmlinux and
modules. At runtime, this can then be iterated by looking at
<debugfs>/printk/index/<module>, which emits the following format, both
readable by humans and able to be parsed by machines:

$ head -1 vmlinux; shuf -n 5 vmlinux
# <level[,flags]> filename:line function "format"
<5> block/blk-settings.c:661 disk_stack_limits "%s: Warning: Device %s is misaligned\n"
<4> kernel/trace/trace.c:8296 trace_create_file "Could not create tracefs '%s' entry\n"
<6> arch/x86/kernel/hpet.c:144 _hpet_print_config "hpet: %s(%d):\n"
<6> init/do_mounts.c:605 prepare_namespace "Waiting for root device %s...\n"
<6> drivers/acpi/osl.c:1410 acpi_no_auto_serialize_setup "ACPI: auto-serialization disabled\n"

This mitigates the majority of cases where we have a highly-specific
printk which we want to match on, as we can now enumerate and check
whether the format changed or the printk callsite disappeared entirely
in userspace. This allows us to catch changes to printks we monitor
earlier and decide what to do about it before it becomes problematic.
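
As an illustration of how a userspace checker might consume this index, here is a minimal sketch in C that splits one line of the format shown above into its fields. The `struct index_entry` type, its buffer sizes and the `parse_index_line()` helper are hypothetical names chosen for this example; only the line layout (`<level[,flags]> filename:line function "format"`) comes from the commit message.

```c
#include <stdio.h>
#include <string.h>

/* One parsed index entry (hypothetical type; buffer sizes are arbitrary). */
struct index_entry {
	char level[16];		/* text between '<' and '>', e.g. "5" */
	char file[128];		/* source file of the printk call site */
	int line;		/* line number of the call site */
	char function[64];	/* enclosing function */
	char format[256];	/* format string without the outer quotes */
};

/* Returns 0 on success, -1 for comment lines or malformed input. */
int parse_index_line(const char *s, struct index_entry *e)
{
	const char *start, *end;
	int consumed = 0;
	size_t len;

	if (s[0] != '<')
		return -1;	/* header/comment lines start with '#' */

	/* %n records how far we got, just past the opening quote. */
	if (sscanf(s, "<%15[^>]> %127[^:]:%d %63s \"%n",
		   e->level, e->file, &e->line, e->function, &consumed) != 4 ||
	    consumed == 0)
		return -1;

	/* The format may contain quotes itself, so cut at the last one. */
	start = s + consumed;
	end = strrchr(start, '"');
	if (!end)
		return -1;

	len = (size_t)(end - start);
	if (len >= sizeof(e->format))
		return -1;
	memcpy(e->format, start, len);
	e->format[len] = '\0';
	return 0;
}
```

A monitoring tool could parse the index of the running kernel this way and compare the `format` field of each entry it cares about against the string its classification rules were written for, flagging entries that changed or vanished.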

There is no additional runtime cost for printk callers or printk itself,
and the assembly generated is exactly the same.

Signed-off-by: Chris Down <ch...@chrisdown.name>
Cc: Petr Mladek <pml...@suse.com>
Cc: Jessica Yu <je...@kernel.org>
Cc: Sergey Senozhatsky <sergey.se...@gmail.com>
Cc: John Ogness <john....@linutronix.de>
Cc: Steven Rostedt <ros...@goodmis.org>
Cc: Greg Kroah-Hartman <gre...@linuxfoundation.org>
Cc: Johannes Weiner <han...@cmpxchg.org>
Cc: Kees Cook <kees...@chromium.org>
Reviewed-by: Petr Mladek <pml...@suse.com>
Tested-by: Petr Mladek <pml...@suse.com>
Reported-by: kernel test robot <l...@intel.com>
Acked-by: Andy Shevchenko <andy.sh...@gmail.com>
Acked-by: Jessica Yu <je...@kernel.org> # for module.{c,h}
Signed-off-by: Petr Mladek <pml...@suse.com>
Link: https://lore.kernel.org/r/e42070983637ac5e384f17fbdbe86d...@chrisdown.name

--
Dave Chinner
da...@fromorbit.com

Kent Overstreet

May 2, 2023, 10:07:37 PM5/2/23
to Andy Shevchenko, James Bottomley, Suren Baghdasaryan, ak...@linux-foundation.org, mho...@suse.com, vba...@suse.cz, han...@cmpxchg.org, roman.g...@linux.dev, mgo...@suse.de, wi...@infradead.org, liam.h...@oracle.com, cor...@lwn.net, vo...@manifault.com, pet...@infradead.org, juri....@redhat.com, ldu...@linux.ibm.com, catalin...@arm.com, wi...@kernel.org, ar...@arndb.de, tg...@linutronix.de, mi...@redhat.com, dave....@linux.intel.com, x...@kernel.org, pet...@redhat.com, da...@redhat.com, ax...@kernel.dk, mcg...@kernel.org, masa...@kernel.org, nat...@kernel.org, den...@kernel.org, t...@kernel.org, muchu...@linux.dev, rp...@kernel.org, pau...@kernel.org, pasha.t...@soleen.com, yosry...@google.com, yuz...@google.com, dhow...@redhat.com, hu...@google.com, andre...@gmail.com, kees...@chromium.org, ndesau...@google.com, gre...@linuxfoundation.org, ebig...@google.com, ytc...@gmail.com, vincent...@linaro.org, dietmar....@arm.com, ros...@goodmis.org, bse...@google.com, bri...@redhat.com, vsch...@redhat.com, c...@linux.com, pen...@kernel.org, iamjoon...@lge.com, 42.h...@gmail.com, gli...@google.com, el...@google.com, dvy...@google.com, shak...@google.com, songm...@bytedance.com, jba...@akamai.com, rien...@google.com, min...@google.com, kales...@google.com, kerne...@android.com, linu...@vger.kernel.org, linux-...@vger.kernel.org, io...@lists.linux.dev, linux...@vger.kernel.org, linux-...@vger.kernel.org, linu...@kvack.org, linux-...@vger.kernel.org, kasa...@googlegroups.com, cgr...@vger.kernel.org, Andy Shevchenko, Michael Ellerman, Benjamin Herrenschmidt, Paul Mackerras, Michael S. Tsirkin, Jason Wang, Noralf Tr�nnes
On Tue, May 02, 2023 at 06:19:27PM +0300, Andy Shevchenko wrote:
> On Tue, May 2, 2023 at 9:22 AM Kent Overstreet
> <kent.ov...@linux.dev> wrote:
> > On Tue, May 02, 2023 at 08:33:57AM +0300, Andy Shevchenko wrote:
> > > Actually instead of producing zillions of variants, do a %p extension
> > > to the printf() and that's it. We have, for example, %pt with T and
> > > with space to follow users that want one or the other variant. Same
> > > can be done with string_get_size().
> >
> > God no.
>
> Any elaboration what's wrong with that?

I'm really not a fan of %p extensions in general (they are what people
reach for because we can't standardize on a common string output API),
but when we'd be passing it bare integers the lack of type safety would
be a particularly big footgun.

> God no for zillion APIs for almost the same. Today you want space,
> tomorrow some other (special) delimiter.

No, I just want to delete the space and output numbers the same way
everyone else does. And if we are stuck with two string_get_size()
functions, %p extensions in no way improve the situation.

Andy Shevchenko

May 3, 2023, 2:30:50 AM5/3/23
to Kent Overstreet, James Bottomley, Suren Baghdasaryan, ak...@linux-foundation.org, mho...@suse.com, vba...@suse.cz, han...@cmpxchg.org, roman.g...@linux.dev, mgo...@suse.de, wi...@infradead.org, liam.h...@oracle.com, cor...@lwn.net, vo...@manifault.com, pet...@infradead.org, juri....@redhat.com, ldu...@linux.ibm.com, catalin...@arm.com, wi...@kernel.org, ar...@arndb.de, tg...@linutronix.de, mi...@redhat.com, dave....@linux.intel.com, x...@kernel.org, pet...@redhat.com, da...@redhat.com, ax...@kernel.dk, mcg...@kernel.org, masa...@kernel.org, nat...@kernel.org, den...@kernel.org, t...@kernel.org, muchu...@linux.dev, rp...@kernel.org, pau...@kernel.org, pasha.t...@soleen.com, yosry...@google.com, yuz...@google.com, dhow...@redhat.com, hu...@google.com, andre...@gmail.com, kees...@chromium.org, ndesau...@google.com, gre...@linuxfoundation.org, ebig...@google.com, ytc...@gmail.com, vincent...@linaro.org, dietmar....@arm.com, ros...@goodmis.org, bse...@google.com, bri...@redhat.com, vsch...@redhat.com, c...@linux.com, pen...@kernel.org, iamjoon...@lge.com, 42.h...@gmail.com, gli...@google.com, el...@google.com, dvy...@google.com, shak...@google.com, songm...@bytedance.com, jba...@akamai.com, rien...@google.com, min...@google.com, kales...@google.com, kerne...@android.com, linu...@vger.kernel.org, linux-...@vger.kernel.org, io...@lists.linux.dev, linux...@vger.kernel.org, linux-...@vger.kernel.org, linu...@kvack.org, linux-...@vger.kernel.org, kasa...@googlegroups.com, cgr...@vger.kernel.org, Andy Shevchenko, Michael Ellerman, Benjamin Herrenschmidt, Paul Mackerras, Michael S. Tsirkin, Jason Wang, Noralf Tr�nnes
On Wed, May 3, 2023 at 5:07 AM Kent Overstreet
<kent.ov...@linux.dev> wrote:
> On Tue, May 02, 2023 at 06:19:27PM +0300, Andy Shevchenko wrote:
> > On Tue, May 2, 2023 at 9:22 AM Kent Overstreet
> > <kent.ov...@linux.dev> wrote:
> > > On Tue, May 02, 2023 at 08:33:57AM +0300, Andy Shevchenko wrote:
> > > > Actually instead of producing zillions of variants, do a %p extension
> > > > to the printf() and that's it. We have, for example, %pt with T and
> > > > with space to follow users that want one or the other variant. Same
> > > > can be done with string_get_size().
> > >
> > > God no.
> >
> > Any elaboration what's wrong with that?
>
> I'm really not a fan of %p extensions in general (they are what people
> reach for because we can't standardize on a common string output API),

The whole story behind, for example, %pt is to _standardize_ the
output of the same stanza in the kernel.

> but when we'd be passing it bare integers the lack of type safety would
> be a particularly big footgun.

There is no difference to any other place in the kernel where we can
shoot into our foot.

> > God no for zillion APIs for almost the same. Today you want space,
> > tomorrow some other (special) delimiter.
>
> No, I just want to delete the space and output numbers the same way
> everyone else does. And if we are stuck with two string_get_size()
> functions, %p extensions in no way improve the situation.

I think it's exactly for the opposite, i.e. standardize that output
once and for all.

Kent Overstreet

May 3, 2023, 3:13:15 AM5/3/23
to Andy Shevchenko, James Bottomley, Suren Baghdasaryan, ak...@linux-foundation.org, mho...@suse.com, vba...@suse.cz, han...@cmpxchg.org, roman.g...@linux.dev, mgo...@suse.de, wi...@infradead.org, liam.h...@oracle.com, cor...@lwn.net, vo...@manifault.com, pet...@infradead.org, juri....@redhat.com, ldu...@linux.ibm.com, catalin...@arm.com, wi...@kernel.org, ar...@arndb.de, tg...@linutronix.de, mi...@redhat.com, dave....@linux.intel.com, x...@kernel.org, pet...@redhat.com, da...@redhat.com, ax...@kernel.dk, mcg...@kernel.org, masa...@kernel.org, nat...@kernel.org, den...@kernel.org, t...@kernel.org, muchu...@linux.dev, rp...@kernel.org, pau...@kernel.org, pasha.t...@soleen.com, yosry...@google.com, yuz...@google.com, dhow...@redhat.com, hu...@google.com, andre...@gmail.com, kees...@chromium.org, ndesau...@google.com, gre...@linuxfoundation.org, ebig...@google.com, ytc...@gmail.com, vincent...@linaro.org, dietmar....@arm.com, ros...@goodmis.org, bse...@google.com, bri...@redhat.com, vsch...@redhat.com, c...@linux.com, pen...@kernel.org, iamjoon...@lge.com, 42.h...@gmail.com, gli...@google.com, el...@google.com, dvy...@google.com, shak...@google.com, songm...@bytedance.com, jba...@akamai.com, rien...@google.com, min...@google.com, kales...@google.com, kerne...@android.com, linu...@vger.kernel.org, linux-...@vger.kernel.org, io...@lists.linux.dev, linux...@vger.kernel.org, linux-...@vger.kernel.org, linu...@kvack.org, linux-...@vger.kernel.org, kasa...@googlegroups.com, cgr...@vger.kernel.org, Andy Shevchenko, Michael Ellerman, Benjamin Herrenschmidt, Paul Mackerras, Michael S. Tsirkin, Jason Wang, Noralf Tr�nnes
On Wed, May 03, 2023 at 09:30:11AM +0300, Andy Shevchenko wrote:
> On Wed, May 3, 2023 at 5:07 AM Kent Overstreet
> <kent.ov...@linux.dev> wrote:
> > On Tue, May 02, 2023 at 06:19:27PM +0300, Andy Shevchenko wrote:
> > > On Tue, May 2, 2023 at 9:22 AM Kent Overstreet
> > > <kent.ov...@linux.dev> wrote:
> > > > On Tue, May 02, 2023 at 08:33:57AM +0300, Andy Shevchenko wrote:
> > > > > Actually instead of producing zillions of variants, do a %p extension
> > > > > to the printf() and that's it. We have, for example, %pt with T and
> > > > > with space to follow users that want one or the other variant. Same
> > > > > can be done with string_get_size().
> > > >
> > > > God no.
> > >
> > > Any elaboration what's wrong with that?
> >
> > I'm really not a fan of %p extensions in general (they are what people
> > reach for because we can't standardize on a common string output API),
>
> The whole story behind, for example, %pt is to _standardize_ the
> output of the same stanza in the kernel.

Wtf does this have to do with the rest of the discussion? The %p thing
seems like a total non sequitur and a distraction.

I'm not getting involved with that. All I'm interested in is fixing the
memory allocation profiling output to make it more usable.

> > but when we'd be passing it bare integers the lack of type safety would
> > be a particularly big footgun.
>
> There is no difference to any other place in the kernel where we can
> shoot into our foot.

Yeah, no, absolutely not. Passing different size integers to
string_get_size() is fine; passing pointers to different size integers
to a %p extension will explode and the compiler won't be able to warn.
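
The type-safety point can be illustrated with a small userspace sketch. Everything here is hypothetical (`size_by_value()`, `size_by_pointer()`, `struct stats`); it is not the kernel's `string_get_size()` or vsprintf code, only a model of the two calling styles. A by-value parameter converts any integer width at the call site, whereas a `void *` parameter accepts any pointer without a diagnostic and the callee has to guess the width.

```c
#include <stdint.h>
#include <string.h>

/* By-value style: narrower integers are converted to 64 bits by the call. */
uint64_t size_by_value(uint64_t bytes)
{
	return bytes;
}

/*
 * Pointer style, as a %p extension would receive its argument: the callee
 * only sees an untyped pointer and must assume a width (64 bits here).
 */
uint64_t size_by_pointer(const void *p)
{
	uint64_t bytes;

	memcpy(&bytes, p, sizeof(bytes));
	return bytes;
}

struct stats {
	uint32_t bytes;		/* 32-bit counter the caller wants to print */
	uint32_t neighbour;	/* unrelated field stored right after it */
};

/* Hypothetical caller that hands a 32-bit field to the pointer-style API. */
uint64_t misuse(void)
{
	struct stats s = { .bytes = 4096, .neighbour = 1 };

	/* &s is also the address of s.bytes, its first member. */
	return size_by_pointer(&s);
}
```

Here `size_by_value()` returns 4096 for a 32-bit argument, while `misuse()` compiles without a warning and returns a value that also contains the bits of `neighbour`.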

>
> > > God no for zillion APIs for almost the same. Today you want space,
> > > tomorrow some other (special) delimiter.
> >
> > No, I just want to delete the space and output numbers the same way
> > everyone else does. And if we are stuck with two string_get_size()
> > functions, %p extensions in no way improve the situation.
>
> I think it's exactly for the opposite, i.e. standardize that output
> once and for all.

So, are you dropping your NACK then, so we can standardize the kernel on
the way everything else does it?

Michal Hocko

May 3, 2023, 3:25:32 AM5/3/23
to Suren Baghdasaryan, ak...@linux-foundation.org, kent.ov...@linux.dev, vba...@suse.cz, han...@cmpxchg.org, roman.g...@linux.dev, mgo...@suse.de, da...@stgolabs.net, wi...@infradead.org, liam.h...@oracle.com, cor...@lwn.net, vo...@manifault.com, pet...@infradead.org, juri....@redhat.com, ldu...@linux.ibm.com, catalin...@arm.com, wi...@kernel.org, ar...@arndb.de, tg...@linutronix.de, mi...@redhat.com, dave....@linux.intel.com, x...@kernel.org, pet...@redhat.com, da...@redhat.com, ax...@kernel.dk, mcg...@kernel.org, masa...@kernel.org, nat...@kernel.org, den...@kernel.org, t...@kernel.org, muchu...@linux.dev, rp...@kernel.org, pau...@kernel.org, pasha.t...@soleen.com, yosry...@google.com, yuz...@google.com, dhow...@redhat.com, hu...@google.com, andre...@gmail.com, kees...@chromium.org, ndesau...@google.com, gre...@linuxfoundation.org, ebig...@google.com, ytc...@gmail.com, vincent...@linaro.org, dietmar....@arm.com, ros...@goodmis.org, bse...@google.com, bri...@redhat.com, vsch...@redhat.com, c...@linux.com, pen...@kernel.org, iamjoon...@lge.com, 42.h...@gmail.com, gli...@google.com, el...@google.com, dvy...@google.com, shak...@google.com, songm...@bytedance.com, jba...@akamai.com, rien...@google.com, min...@google.com, kales...@google.com, kerne...@android.com, linu...@vger.kernel.org, linux-...@vger.kernel.org, io...@lists.linux.dev, linux...@vger.kernel.org, linux-...@vger.kernel.org, linu...@kvack.org, linux-...@vger.kernel.org, kasa...@googlegroups.com, cgr...@vger.kernel.org
On Mon 01-05-23 09:54:10, Suren Baghdasaryan wrote:
> Memory allocation profiling infrastructure provides a low overhead
> mechanism to make all kernel allocations in the system visible. It can be
> used to monitor memory usage, track memory hotspots, detect memory leaks,
> identify memory regressions.
>
> To keep the overhead to the minimum, we record only allocation sizes for
> every allocation in the codebase. With that information, if users are
> interested in more detailed context for a specific allocation, they can
> enable in-depth context tracking, which includes capturing the pid, tgid,
> task name, allocation size, timestamp and call stack for every allocation
> at the specified code location.
[...]
> Implementation utilizes a more generic concept of code tagging, introduced
> as part of this patchset. Code tag is a structure identifying a specific
> location in the source code which is generated at compile time and can be
> embedded in an application-specific structure. A number of applications
> for code tagging have been presented in the original RFC [1].
> Code tagging uses the old trick of "define a special elf section for
> objects of a given type so that we can iterate over them at runtime" and
> creates a proper library for it.
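
The ELF-section trick mentioned in the quoted text can be sketched in userspace C as follows. This is a minimal sketch, not the patchset's code: `struct codetag` here is a simplified stand-in, and `CODETAG_HERE()`, `codetag_count()`, `codetag_find()`, `do_read()` and `do_write()` are hypothetical names. It assumes GCC or Clang targeting ELF, where the linker provides `__start_<name>` and `__stop_<name>` symbols for sections whose names are valid C identifiers.

```c
#include <stddef.h>
#include <string.h>

/* Simplified stand-in: identifies one location in the source code. */
struct codetag {
	const char *file;
	int line;
	const char *function;
};

/*
 * Emit one tag into the "codetags" section at the point of use. The
 * explicit alignment stops the compiler from over-aligning the objects,
 * so the section can be walked like an array.
 */
#define CODETAG_HERE()							\
	do {								\
		static const struct codetag _tag			\
		__attribute__((section("codetags"), used,		\
			       aligned(__alignof__(struct codetag)))) =	\
			{ __FILE__, __LINE__, __func__ };		\
		(void)_tag;						\
	} while (0)

/* Section bounds, defined by the linker. */
extern const struct codetag __start_codetags[];
extern const struct codetag __stop_codetags[];

/* Two example call sites, each contributing one tag at compile time. */
void do_read(void)  { CODETAG_HERE(); }
void do_write(void) { CODETAG_HERE(); }

/* Number of tags the linker collected into the section. */
size_t codetag_count(void)
{
	return (size_t)(__stop_codetags - __start_codetags);
}

/* Walk every tag at runtime and return the one for a given function. */
const struct codetag *codetag_find(const char *function)
{
	const struct codetag *t;

	for (t = __start_codetags; t < __stop_codetags; t++)
		if (strcmp(t->function, function) == 0)
			return t;
	return NULL;
}
```

No registration code runs at startup: the tags exist as soon as the binary is linked, which is what lets them be enumerated even for call sites that never execute.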
>
> To profile memory allocations, we instrument page, slab and percpu
> allocators to record total memory allocated in the associated code tag at
> every allocation in the codebase. Every time an allocation is performed by
> an instrumented allocator, the code tag at that location increments its
> counter by allocation size. Every time the memory is freed the counter is
> decremented. To decrement the counter upon freeing, allocated object needs
> a reference to its code tag. Page allocators use page_ext to record this
> reference while slab allocators use memcg_data (renamed into more generic
> slabobj_ext) of the slab page.
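
The accounting scheme described in the quoted paragraph can be modelled in a few lines of userspace C. This is only a sketch under simplifying assumptions: the back-reference to the tag is kept in a header placed in front of each object, whereas the patchset is described as using page_ext and slabobj_ext; `struct alloc_tag` is reduced to one counter, and `tagged_alloc()`, `tagged_free()` and `alloc_here()` are hypothetical names. The macro uses a GNU statement expression.

```c
#include <stddef.h>
#include <stdlib.h>

/* Simplified per-call-site tag with a single running counter. */
struct alloc_tag {
	const char *file;
	int line;
	size_t bytes;		/* bytes currently allocated from this site */
};

/* Header in front of every object: the reference back to its tag. */
struct obj_header {
	struct alloc_tag *tag;
	size_t size;
};

void *tagged_alloc(struct alloc_tag *tag, size_t size)
{
	struct obj_header *h = malloc(sizeof(*h) + size);

	if (!h)
		return NULL;
	h->tag = tag;
	h->size = size;
	tag->bytes += size;	/* charge the allocating call site */
	return h + 1;
}

void tagged_free(void *p)
{
	struct obj_header *h;

	if (!p)
		return;
	h = (struct obj_header *)p - 1;
	h->tag->bytes -= h->size;	/* uncharge via the stored reference */
	free(h);
}

/* One static tag per call site (GNU statement expression). */
#define alloc_here(size)						\
	({								\
		static struct alloc_tag _tag = { __FILE__, __LINE__, 0 }; \
		tagged_alloc(&_tag, (size));				\
	})
```

The free path needs no lookup: the object itself says which counter to decrement, which is why each allocated object has to carry a reference to its code tag.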
[...]
> [1] https://lore.kernel.org/all/20220830214919...@google.com/
[...]
> 70 files changed, 2765 insertions(+), 554 deletions(-)

Sorry for cutting the cover considerably but I believe I have quoted the
most important/interesting parts here. The approach is not fundamentally
different from the previous version [1] and there was a significant
discussion around this approach. The cover letter neither summarizes nor
addresses the concerns expressed previously, AFAICS. So let me bring those
back up, at least the ones I find the most important:
- This is a big change and it adds a significant maintenance burden
because each allocation entry point needs to be handled specifically.
The cost will grow with the intended coverage, especially where an
allocation is hidden in library code.
- It has been brought up that this is duplicating functionality already
available via existing tracing infrastructure. You should make it very
clear why that is not suitable for the job.
- We already have page_owner infrastructure that provides allocation
tracking data. Why can it not be used/extended?

Thanks!
--
Michal Hocko
SUSE Labs

Kent Overstreet

May 3, 2023, 3:34:35 AM5/3/23
to Michal Hocko, Suren Baghdasaryan, ak...@linux-foundation.org, vba...@suse.cz, han...@cmpxchg.org, roman.g...@linux.dev, mgo...@suse.de, da...@stgolabs.net, wi...@infradead.org, liam.h...@oracle.com, cor...@lwn.net, vo...@manifault.com, pet...@infradead.org, juri....@redhat.com, ldu...@linux.ibm.com, catalin...@arm.com, wi...@kernel.org, ar...@arndb.de, tg...@linutronix.de, mi...@redhat.com, dave....@linux.intel.com, x...@kernel.org, pet...@redhat.com, da...@redhat.com, ax...@kernel.dk, mcg...@kernel.org, masa...@kernel.org, nat...@kernel.org, den...@kernel.org, t...@kernel.org, muchu...@linux.dev, rp...@kernel.org, pau...@kernel.org, pasha.t...@soleen.com, yosry...@google.com, yuz...@google.com, dhow...@redhat.com, hu...@google.com, andre...@gmail.com, kees...@chromium.org, ndesau...@google.com, gre...@linuxfoundation.org, ebig...@google.com, ytc...@gmail.com, vincent...@linaro.org, dietmar....@arm.com, ros...@goodmis.org, bse...@google.com, bri...@redhat.com, vsch...@redhat.com, c...@linux.com, pen...@kernel.org, iamjoon...@lge.com, 42.h...@gmail.com, gli...@google.com, el...@google.com, dvy...@google.com, shak...@google.com, songm...@bytedance.com, jba...@akamai.com, rien...@google.com, min...@google.com, kales...@google.com, kerne...@android.com, linu...@vger.kernel.org, linux-...@vger.kernel.org, io...@lists.linux.dev, linux...@vger.kernel.org, linux-...@vger.kernel.org, linu...@kvack.org, linux-...@vger.kernel.org, kasa...@googlegroups.com, cgr...@vger.kernel.org
We covered this previously, I'll just be giving the same answers I did
before:

> - This is a big change and it adds a significant maintenance burden
> because each allocation entry point needs to be handled specifically.
> The cost will grow with the intended coverage especially there when
> allocation is hidden in a library code.

We've made this as clean and simple as possible: a single new macro
invocation per allocation function, no calling convention changes (that
would indeed have been a lot of churn!)

> - It has been brought up that this is duplicating functionality already
> available via existing tracing infrastructure. You should make it very
> clear why that is not suitable for the job

Tracing people _claimed_ this, but never demonstrated it. Tracepoints
exist but the tooling that would consume them to provide this kind of
information does not exist; it would require maintaining an index of
_every outstanding allocation_ so that frees could be accounted
correctly - IOW, it would be _drastically_ higher overhead, so not at
all comparable.
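
To make the "index of every outstanding allocation" argument concrete, here is a rough userspace sketch of what a consumer of allocation and free events would have to maintain. It assumes that a free event carries only the address being freed, so the consumer has to remember the size and allocating site of each live address itself. All names (`on_alloc_event()`, `on_free_event()`, `bytes_for_site()`, `outstanding()`) and the table sizes are hypothetical, and the sketch says nothing about how large the overhead is in practice; no measurements are given in this thread.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define NBUCKETS 1024
#define NSITES   16

/* One entry per allocation that has not been freed yet. */
struct live_alloc {
	uintptr_t addr;
	size_t size;
	int site;
	struct live_alloc *next;
};

static struct live_alloc *buckets[NBUCKETS];
static size_t site_bytes[NSITES];
static size_t live_entries;

static size_t bucket_of(uintptr_t addr)
{
	return (size_t)((addr >> 4) % NBUCKETS);
}

/* Allocation event: remember which site owns this address. */
int on_alloc_event(uintptr_t addr, size_t size, int site)
{
	struct live_alloc *e;

	if (site < 0 || site >= NSITES)
		return -1;
	e = malloc(sizeof(*e));
	if (!e)
		return -1;
	e->addr = addr;
	e->size = size;
	e->site = site;
	e->next = buckets[bucket_of(addr)];
	buckets[bucket_of(addr)] = e;
	site_bytes[site] += size;
	live_entries++;
	return 0;
}

/* Free event: only the address is known, so look up size and site. */
int on_free_event(uintptr_t addr)
{
	struct live_alloc **pp;

	for (pp = &buckets[bucket_of(addr)]; *pp; pp = &(*pp)->next) {
		struct live_alloc *e = *pp;

		if (e->addr == addr) {
			site_bytes[e->site] -= e->size;
			*pp = e->next;
			free(e);
			live_entries--;
			return 0;
		}
	}
	return -1;	/* free of an address we never saw allocated */
}

size_t bytes_for_site(int site)
{
	return (site >= 0 && site < NSITES) ? site_bytes[site] : 0;
}

size_t outstanding(void)
{
	return live_entries;
}
```

The contrast with the code-tagging approach is that here the per-object state lives in a separate lookup structure that grows with the number of live allocations, instead of a reference stored alongside the object.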

> - We already have page_owner infrastructure that provides allocation
> tracking data. Why it cannot be used/extended?

Page owner is also very high overhead, and the output is not very user
friendly (tracking the full call stack means a lot of related overhead gets
split up, which is generally not what you want), and it doesn't cover slab.

This tracks _all_ memory allocations - slab, page, vmalloc, percpu.

Michal Hocko

May 3, 2023, 3:36:01 AM5/3/23
to Suren Baghdasaryan, ak...@linux-foundation.org, kent.ov...@linux.dev, vba...@suse.cz, han...@cmpxchg.org, roman.g...@linux.dev, mgo...@suse.de, da...@stgolabs.net, wi...@infradead.org, liam.h...@oracle.com, cor...@lwn.net, vo...@manifault.com, pet...@infradead.org, juri....@redhat.com, ldu...@linux.ibm.com, catalin...@arm.com, wi...@kernel.org, ar...@arndb.de, tg...@linutronix.de, mi...@redhat.com, dave....@linux.intel.com, x...@kernel.org, pet...@redhat.com, da...@redhat.com, ax...@kernel.dk, mcg...@kernel.org, masa...@kernel.org, nat...@kernel.org, den...@kernel.org, t...@kernel.org, muchu...@linux.dev, rp...@kernel.org, pau...@kernel.org, pasha.t...@soleen.com, yosry...@google.com, yuz...@google.com, dhow...@redhat.com, hu...@google.com, andre...@gmail.com, kees...@chromium.org, ndesau...@google.com, gre...@linuxfoundation.org, ebig...@google.com, ytc...@gmail.com, vincent...@linaro.org, dietmar....@arm.com, ros...@goodmis.org, bse...@google.com, bri...@redhat.com, vsch...@redhat.com, c...@linux.com, pen...@kernel.org, iamjoon...@lge.com, 42.h...@gmail.com, gli...@google.com, el...@google.com, dvy...@google.com, shak...@google.com, songm...@bytedance.com, jba...@akamai.com, rien...@google.com, min...@google.com, kales...@google.com, kerne...@android.com, linu...@vger.kernel.org, linux-...@vger.kernel.org, io...@lists.linux.dev, linux...@vger.kernel.org, linux-...@vger.kernel.org, linu...@kvack.org, linux-...@vger.kernel.org, kasa...@googlegroups.com, cgr...@vger.kernel.org
On Mon 01-05-23 09:54:44, Suren Baghdasaryan wrote:
[...]
> +static inline void add_ctx(struct codetag_ctx *ctx,
> + struct codetag_with_ctx *ctc)
> +{
> + kref_init(&ctx->refcount);
> + spin_lock(&ctc->ctx_lock);
> + ctx->flags = CTC_FLAG_CTX_PTR;
> + ctx->ctc = ctc;
> + list_add_tail(&ctx->node, &ctc->ctx_head);
> + spin_unlock(&ctc->ctx_lock);

AFAIU every single tracked allocation will get its own codetag_ctx.
There is no aggregation per allocation site or anything else. This looks
like a scalability and a memory overhead red flag to me.

> +}
> +
> +static inline void rem_ctx(struct codetag_ctx *ctx,
> + void (*free_ctx)(struct kref *refcount))
> +{
> + struct codetag_with_ctx *ctc = ctx->ctc;
> +
> + spin_lock(&ctc->ctx_lock);

This could deadlock when the allocator is called from IRQ context.

> + /* ctx might have been removed while we were using it */
> + if (!list_empty(&ctx->node))
> + list_del_init(&ctx->node);
> + spin_unlock(&ctc->ctx_lock);
> + kref_put(&ctx->refcount, free_ctx);

Michal Hocko

May 3, 2023, 3:39:09 AM5/3/23
to Suren Baghdasaryan, ak...@linux-foundation.org, kent.ov...@linux.dev, vba...@suse.cz, han...@cmpxchg.org, roman.g...@linux.dev, mgo...@suse.de, da...@stgolabs.net, wi...@infradead.org, liam.h...@oracle.com, cor...@lwn.net, vo...@manifault.com, pet...@infradead.org, juri....@redhat.com, ldu...@linux.ibm.com, catalin...@arm.com, wi...@kernel.org, ar...@arndb.de, tg...@linutronix.de, mi...@redhat.com, dave....@linux.intel.com, x...@kernel.org, pet...@redhat.com, da...@redhat.com, ax...@kernel.dk, mcg...@kernel.org, masa...@kernel.org, nat...@kernel.org, den...@kernel.org, t...@kernel.org, muchu...@linux.dev, rp...@kernel.org, pau...@kernel.org, pasha.t...@soleen.com, yosry...@google.com, yuz...@google.com, dhow...@redhat.com, hu...@google.com, andre...@gmail.com, kees...@chromium.org, ndesau...@google.com, gre...@linuxfoundation.org, ebig...@google.com, ytc...@gmail.com, vincent...@linaro.org, dietmar....@arm.com, ros...@goodmis.org, bse...@google.com, bri...@redhat.com, vsch...@redhat.com, c...@linux.com, pen...@kernel.org, iamjoon...@lge.com, 42.h...@gmail.com, gli...@google.com, el...@google.com, dvy...@google.com, shak...@google.com, songm...@bytedance.com, jba...@akamai.com, rien...@google.com, min...@google.com, kales...@google.com, kerne...@android.com, linu...@vger.kernel.org, linux-...@vger.kernel.org, io...@lists.linux.dev, linux...@vger.kernel.org, linux-...@vger.kernel.org, linu...@kvack.org, linux-...@vger.kernel.org, kasa...@googlegroups.com, cgr...@vger.kernel.org
On Mon 01-05-23 09:54:45, Suren Baghdasaryan wrote:
[...]
> +struct codetag_ctx *alloc_tag_create_ctx(struct alloc_tag *tag, size_t size)
> +{
> + struct alloc_call_ctx *ac_ctx;
> +
> + /* TODO: use a dedicated kmem_cache */
> + ac_ctx = kmalloc(sizeof(struct alloc_call_ctx), GFP_KERNEL);

You cannot really use GFP_KERNEL here. This is the post_alloc_hook path,
and that has its own gfp context.

Michal Hocko

May 3, 2023, 3:51:52 AM5/3/23
to Kent Overstreet, Suren Baghdasaryan, ak...@linux-foundation.org, vba...@suse.cz, han...@cmpxchg.org, roman.g...@linux.dev, mgo...@suse.de, da...@stgolabs.net, wi...@infradead.org, liam.h...@oracle.com, cor...@lwn.net, vo...@manifault.com, pet...@infradead.org, juri....@redhat.com, ldu...@linux.ibm.com, catalin...@arm.com, wi...@kernel.org, ar...@arndb.de, tg...@linutronix.de, mi...@redhat.com, dave....@linux.intel.com, x...@kernel.org, pet...@redhat.com, da...@redhat.com, ax...@kernel.dk, mcg...@kernel.org, masa...@kernel.org, nat...@kernel.org, den...@kernel.org, t...@kernel.org, muchu...@linux.dev, rp...@kernel.org, pau...@kernel.org, pasha.t...@soleen.com, yosry...@google.com, yuz...@google.com, dhow...@redhat.com, hu...@google.com, andre...@gmail.com, kees...@chromium.org, ndesau...@google.com, gre...@linuxfoundation.org, ebig...@google.com, ytc...@gmail.com, vincent...@linaro.org, dietmar....@arm.com, ros...@goodmis.org, bse...@google.com, bri...@redhat.com, vsch...@redhat.com, c...@linux.com, pen...@kernel.org, iamjoon...@lge.com, 42.h...@gmail.com, gli...@google.com, el...@google.com, dvy...@google.com, shak...@google.com, songm...@bytedance.com, jba...@akamai.com, rien...@google.com, min...@google.com, kales...@google.com, kerne...@android.com, linu...@vger.kernel.org, linux-...@vger.kernel.org, io...@lists.linux.dev, linux...@vger.kernel.org, linux-...@vger.kernel.org, linu...@kvack.org, linux-...@vger.kernel.org, kasa...@googlegroups.com, cgr...@vger.kernel.org
Your answers have shown your insight into tracing is very limited. I
have a clear recollection there were many suggestions on how to get what
you need and willingness to help out. Repeating your previous position
will not help much to be honest with you.

> > - This is a big change and it adds a significant maintenance burden
> > because each allocation entry point needs to be handled specifically.
> > The cost will grow with the intended coverage especially there when
> > allocation is hidden in a library code.
>
> We've made this as clean and simple as possible: a single new macro
> invocation per allocation function, no calling convention changes (that
> would indeed have been a lot of churn!)

That doesn't really make the concern any less relevant. I believe you
and Suren have made a great effort to reduce the churn as much as
possible but looking at the diffstat the code changes are clearly there
and you have to convince the rest of the community that this maintenance
overhead is really worth it. The above statement hasn't helped to
convince me, to be honest.

> > - It has been brought up that this is duplicating functionality already
> > available via existing tracing infrastructure. You should make it very
> > clear why that is not suitable for the job
>
> Tracing people _claimed_ this, but never demonstrated it.

The burden is on you and Suren. You are proposing to implement an
alternative tracing infrastructure.

> Tracepoints
> exist but the tooling that would consume them to provide this kind of
> information does not exist;

Is there any reason why appropriate tooling cannot be developed?

> it would require maintaining an index of
> _every outstanding allocation_ so that frees could be accounted
> correctly - IOW, it would be _drastically_ higher overhead, so not at
> all comparable.

Do you have any actual data points to prove your claim?

> > - We already have page_owner infrastructure that provides allocation
> > tracking data. Why it cannot be used/extended?
>
> Page owner is also very high overhead,

Is there any data to prove that claim? I would be really surprised that
page_owner would give higher overhead than page tagging with profiling
enabled (there is an allocation for each allocation request!!!). We can
discuss the bare-bones page tagging comparison to page_owner because of
the full stack unwinding, but is that overhead really prohibitively costly?
Can we reduce that by trimming the unwinder information?

> and the output is not very user
> friendly (tracking full call stack means many related overhead gets
> split, not generally what you want), and it doesn't cover slab.

Is this something we cannot do anything about? Have you explored any
potential ways?

> This tracks _all_ memory allocations - slab, page, vmalloc, percpu.

Kent Overstreet

May 3, 2023, 4:05:21 AM5/3/23
to Michal Hocko, Suren Baghdasaryan, ak...@linux-foundation.org, vba...@suse.cz, han...@cmpxchg.org, roman.g...@linux.dev, mgo...@suse.de, da...@stgolabs.net, wi...@infradead.org, liam.h...@oracle.com, cor...@lwn.net, vo...@manifault.com, pet...@infradead.org, juri....@redhat.com, ldu...@linux.ibm.com, catalin...@arm.com, wi...@kernel.org, ar...@arndb.de, tg...@linutronix.de, mi...@redhat.com, dave....@linux.intel.com, x...@kernel.org, pet...@redhat.com, da...@redhat.com, ax...@kernel.dk, mcg...@kernel.org, masa...@kernel.org, nat...@kernel.org, den...@kernel.org, t...@kernel.org, muchu...@linux.dev, rp...@kernel.org, pau...@kernel.org, pasha.t...@soleen.com, yosry...@google.com, yuz...@google.com, dhow...@redhat.com, hu...@google.com, andre...@gmail.com, kees...@chromium.org, ndesau...@google.com, gre...@linuxfoundation.org, ebig...@google.com, ytc...@gmail.com, vincent...@linaro.org, dietmar....@arm.com, ros...@goodmis.org, bse...@google.com, bri...@redhat.com, vsch...@redhat.com, c...@linux.com, pen...@kernel.org, iamjoon...@lge.com, 42.h...@gmail.com, gli...@google.com, el...@google.com, dvy...@google.com, shak...@google.com, songm...@bytedance.com, jba...@akamai.com, rien...@google.com, min...@google.com, kales...@google.com, kerne...@android.com, linu...@vger.kernel.org, linux-...@vger.kernel.org, io...@lists.linux.dev, linux...@vger.kernel.org, linux-...@vger.kernel.org, linu...@kvack.org, linux-...@vger.kernel.org, kasa...@googlegroups.com, cgr...@vger.kernel.org
On Wed, May 03, 2023 at 09:51:49AM +0200, Michal Hocko wrote:
> Your answers have shown your insight into tracing is very limited. I
> have a clear recollection there were many suggestions on how to get what
> you need and willingness to help out. Repeating your previous position
> will not help much to be honest with you.

Please enlighten us, oh wise one.

> > > - It has been brought up that this is duplicating functionality already
> > > available via existing tracing infrastructure. You should make it very
> > > clear why that is not suitable for the job
> >
> > Tracing people _claimed_ this, but never demonstrated it.
>
> The burden is on you and Suren. You are proposing to implement an
> alternative tracing infrastructure.

No, we're still waiting on the tracing people to _demonstrate_, not
claim, that this is at all possible in a comparable way with tracing.

It's not on us to make your argument for you, and before making
accusations about honesty you should try to be more honest yourself.

The expectations you're trying to level have never been the norm in the
kernel community, sorry. When there's a technical argument about the
best way to do something, _code wins_ and we've got working code to do
something that hasn't been possible previously.

There's absolutely no rule that "tracing has to be the one and only tool
for kernel visibility".

I'm considering the tracing discussion closed until someone in the
pro-tracing camp shows something new.

> > > - We already have page_owner infrastructure that provides allocation
> > > tracking data. Why it cannot be used/extended?
> >
> > Page owner is also very high overhead,
>
> Is there any data to prove that claim? I would be really surprised that
> page_owner would give higher overhead than page tagging with profiling
> enabled (there is an allocation for each allocation request!!!). We can
> discuss the bare bone page tagging comparison to page_owner because of
> the full stack unwinding but is that overhead really prohibitively costly?
> Can we reduce that by trimming the unwinder information?

Honestly, this isn't terribly relevant, because as noted before page
owner is limited to just page allocations.

>
> > and the output is not very user
> > friendly (tracking full call stack means many related overhead gets
> > split, not generally what you want), and it doesn't cover slab.
>
> Is this something we cannot do anything about? Have you explored any
> potential ways?
>
> > This tracks _all_ memory allocations - slab, page, vmalloc, percpu.

Michal, the discussions with you seem to perpetually go in circles; it's
clear you're negative on the patchset, you keep raising the same
objections while refusing to concede a single point.

I believe I've answered enough, so I'll leave off further discussions
with you.

Andy Shevchenko

May 3, 2023, 5:12:52 AM5/3/23
to Kent Overstreet, James Bottomley, Suren Baghdasaryan, ak...@linux-foundation.org, mho...@suse.com, vba...@suse.cz, han...@cmpxchg.org, roman.g...@linux.dev, mgo...@suse.de, wi...@infradead.org, liam.h...@oracle.com, cor...@lwn.net, vo...@manifault.com, pet...@infradead.org, juri....@redhat.com, ldu...@linux.ibm.com, catalin...@arm.com, wi...@kernel.org, ar...@arndb.de, tg...@linutronix.de, mi...@redhat.com, dave....@linux.intel.com, x...@kernel.org, pet...@redhat.com, da...@redhat.com, ax...@kernel.dk, mcg...@kernel.org, masa...@kernel.org, nat...@kernel.org, den...@kernel.org, t...@kernel.org, muchu...@linux.dev, rp...@kernel.org, pau...@kernel.org, pasha.t...@soleen.com, yosry...@google.com, yuz...@google.com, dhow...@redhat.com, hu...@google.com, andre...@gmail.com, kees...@chromium.org, ndesau...@google.com, gre...@linuxfoundation.org, ebig...@google.com, ytc...@gmail.com, vincent...@linaro.org, dietmar....@arm.com, ros...@goodmis.org, bse...@google.com, bri...@redhat.com, vsch...@redhat.com, c...@linux.com, pen...@kernel.org, iamjoon...@lge.com, 42.h...@gmail.com, gli...@google.com, el...@google.com, dvy...@google.com, shak...@google.com, songm...@bytedance.com, jba...@akamai.com, rien...@google.com, min...@google.com, kales...@google.com, kerne...@android.com, linu...@vger.kernel.org, linux-...@vger.kernel.org, io...@lists.linux.dev, linux...@vger.kernel.org, linux-...@vger.kernel.org, linu...@kvack.org, linux-...@vger.kernel.org, kasa...@googlegroups.com, cgr...@vger.kernel.org, Andy Shevchenko, Michael Ellerman, Benjamin Herrenschmidt, Paul Mackerras, Michael S. Tsirkin, Jason Wang, Noralf Tr�nnes
On Wed, May 3, 2023 at 10:13 AM Kent Overstreet
This is another topic. Yes, there is a discussion to have a compiler
plugin to check this.

> > > > God no for zillion APIs for almost the same. Today you want space,
> > > > tomorrow some other (special) delimiter.
> > >
> > > No, I just want to delete the space and output numbers the same way
> > > everyone else does. And if we are stuck with two string_get_size()
> > > functions, %p extensions in no way improve the situation.
> >
> > I think it's exactly for the opposite, i.e. standardize that output
> > once and for all.
>
> So, are you dropping your NACK then, so we can standardize the kernel on
> the way everything else does it?

No, you are breaking existing users. The NAK stays.
The whole discussion after that is about how users can use both
your format and the existing format without multiplying APIs.

Kent Overstreet

May 3, 2023, 5:16:23 AM5/3/23
to Andy Shevchenko, James Bottomley, Suren Baghdasaryan, ak...@linux-foundation.org, mho...@suse.com, vba...@suse.cz, han...@cmpxchg.org, roman.g...@linux.dev, mgo...@suse.de, wi...@infradead.org, liam.h...@oracle.com, cor...@lwn.net, vo...@manifault.com, pet...@infradead.org, juri....@redhat.com, ldu...@linux.ibm.com, catalin...@arm.com, wi...@kernel.org, ar...@arndb.de, tg...@linutronix.de, mi...@redhat.com, dave....@linux.intel.com, x...@kernel.org, pet...@redhat.com, da...@redhat.com, ax...@kernel.dk, mcg...@kernel.org, masa...@kernel.org, nat...@kernel.org, den...@kernel.org, t...@kernel.org, muchu...@linux.dev, rp...@kernel.org, pau...@kernel.org, pasha.t...@soleen.com, yosry...@google.com, yuz...@google.com, dhow...@redhat.com, hu...@google.com, andre...@gmail.com, kees...@chromium.org, ndesau...@google.com, gre...@linuxfoundation.org, ebig...@google.com, ytc...@gmail.com, vincent...@linaro.org, dietmar....@arm.com, ros...@goodmis.org, bse...@google.com, bri...@redhat.com, vsch...@redhat.com, c...@linux.com, pen...@kernel.org, iamjoon...@lge.com, 42.h...@gmail.com, gli...@google.com, el...@google.com, dvy...@google.com, shak...@google.com, songm...@bytedance.com, jba...@akamai.com, rien...@google.com, min...@google.com, kales...@google.com, kerne...@android.com, linu...@vger.kernel.org, linux-...@vger.kernel.org, io...@lists.linux.dev, linux...@vger.kernel.org, linux-...@vger.kernel.org, linu...@kvack.org, linux-...@vger.kernel.org, kasa...@googlegroups.com, cgr...@vger.kernel.org, Andy Shevchenko, Michael Ellerman, Benjamin Herrenschmidt, Paul Mackerras, Michael S. Tsirkin, Jason Wang, Noralf Tr�nnes
On Wed, May 03, 2023 at 12:12:12PM +0300, Andy Shevchenko wrote:
> > So, are you dropping your NACK then, so we can standardize the kernel on
> > the way everything else does it?
>
> No, you are breaking existing users. The NAK stays.
> The whole discussion after that is about how users can use both
> your format and the existing format without multiplying APIs.

Dave seems to think we shouldn't be, and I'm in agreement.

Vlastimil Babka

May 3, 2023, 5:28:39 AM5/3/23
to Dave Chinner, James Bottomley, Kent Overstreet, Suren Baghdasaryan, ak...@linux-foundation.org, mho...@suse.com, han...@cmpxchg.org, roman.g...@linux.dev, mgo...@suse.de, wi...@infradead.org, liam.h...@oracle.com, cor...@lwn.net, vo...@manifault.com, pet...@infradead.org, juri....@redhat.com, ldu...@linux.ibm.com, catalin...@arm.com, wi...@kernel.org, ar...@arndb.de, tg...@linutronix.de, mi...@redhat.com, dave....@linux.intel.com, x...@kernel.org, pet...@redhat.com, da...@redhat.com, ax...@kernel.dk, mcg...@kernel.org, masa...@kernel.org, nat...@kernel.org, den...@kernel.org, t...@kernel.org, muchu...@linux.dev, rp...@kernel.org, pau...@kernel.org, pasha.t...@soleen.com, yosry...@google.com, yuz...@google.com, dhow...@redhat.com, hu...@google.com, andre...@gmail.com, kees...@chromium.org, ndesau...@google.com, gre...@linuxfoundation.org, ebig...@google.com, ytc...@gmail.com, vincent...@linaro.org, dietmar....@arm.com, ros...@goodmis.org, bse...@google.com, bri...@redhat.com, vsch...@redhat.com, c...@linux.com, pen...@kernel.org, iamjoon...@lge.com, 42.h...@gmail.com, gli...@google.com, el...@google.com, dvy...@google.com, shak...@google.com, songm...@bytedance.com, jba...@akamai.com, rien...@google.com, min...@google.com, kales...@google.com, kerne...@android.com, linu...@vger.kernel.org, linux-...@vger.kernel.org, io...@lists.linux.dev, linux...@vger.kernel.org, linux-...@vger.kernel.org, linu...@kvack.org, linux-...@vger.kernel.org, kasa...@googlegroups.com, cgr...@vger.kernel.org, Andy Shevchenko, Michael Ellerman, Benjamin Herrenschmidt, Paul Mackerras, Michael S. Tsirkin, Jason Wang, Noralf Tr�nnes, Andy Shevchenko
If I understand that correctly from the commit changelog, this would
indeed have helped, but only if the change was reflected in the format string.
But with string_get_size() it's always a %s, and a change to the helper, or a
switch to another variant of the helper that would omit the space, wouldn't
be reflected in the format string at all? I guess that would be an argument
for Andy's suggestion for adding a new %pt / %pT which would then be
reflected in the format string. And also more concise to use than using the
helper, fwiw.

Andy Shevchenko

May 3, 2023, 5:45:02 AM5/3/23
to Vlastimil Babka, Dave Chinner, James Bottomley, Kent Overstreet, Suren Baghdasaryan, ak...@linux-foundation.org, mho...@suse.com, han...@cmpxchg.org, roman.g...@linux.dev, mgo...@suse.de, wi...@infradead.org, liam.h...@oracle.com, cor...@lwn.net, vo...@manifault.com, pet...@infradead.org, juri....@redhat.com, ldu...@linux.ibm.com, catalin...@arm.com, wi...@kernel.org, ar...@arndb.de, tg...@linutronix.de, mi...@redhat.com, dave....@linux.intel.com, x...@kernel.org, pet...@redhat.com, da...@redhat.com, ax...@kernel.dk, mcg...@kernel.org, masa...@kernel.org, nat...@kernel.org, den...@kernel.org, t...@kernel.org, muchu...@linux.dev, rp...@kernel.org, pau...@kernel.org, pasha.t...@soleen.com, yosry...@google.com, yuz...@google.com, dhow...@redhat.com, hu...@google.com, andre...@gmail.com, kees...@chromium.org, ndesau...@google.com, gre...@linuxfoundation.org, ebig...@google.com, ytc...@gmail.com, vincent...@linaro.org, dietmar....@arm.com, ros...@goodmis.org, bse...@google.com, bri...@redhat.com, vsch...@redhat.com, c...@linux.com, pen...@kernel.org, iamjoon...@lge.com, 42.h...@gmail.com, gli...@google.com, el...@google.com, dvy...@google.com, shak...@google.com, songm...@bytedance.com, jba...@akamai.com, rien...@google.com, min...@google.com, kales...@google.com, kerne...@android.com, linu...@vger.kernel.org, linux-...@vger.kernel.org, io...@lists.linux.dev, linux...@vger.kernel.org, linux-...@vger.kernel.org, linu...@kvack.org, linux-...@vger.kernel.org, kasa...@googlegroups.com, cgr...@vger.kernel.org, Andy Shevchenko, Michael Ellerman, Benjamin Herrenschmidt, Paul Mackerras, Michael S. Tsirkin, Jason Wang, Noralf Tr�nnes
(Note, there is no respective %p extension for string_get_size() yet.
%pt is for time and was used as an example when its evolution included
a change like this)

> reflected in the format string. And also more concise to use than using the
> helper, fwiw.



Petr Tesařík

May 3, 2023, 5:50:59 AM5/3/23
to Michal Hocko, Kent Overstreet, Suren Baghdasaryan, ak...@linux-foundation.org, vba...@suse.cz, han...@cmpxchg.org, roman.g...@linux.dev, mgo...@suse.de, da...@stgolabs.net, wi...@infradead.org, liam.h...@oracle.com, cor...@lwn.net, vo...@manifault.com, pet...@infradead.org, juri....@redhat.com, ldu...@linux.ibm.com, catalin...@arm.com, wi...@kernel.org, ar...@arndb.de, tg...@linutronix.de, mi...@redhat.com, dave....@linux.intel.com, x...@kernel.org, pet...@redhat.com, da...@redhat.com, ax...@kernel.dk, mcg...@kernel.org, masa...@kernel.org, nat...@kernel.org, den...@kernel.org, t...@kernel.org, muchu...@linux.dev, rp...@kernel.org, pau...@kernel.org, pasha.t...@soleen.com, yosry...@google.com, yuz...@google.com, dhow...@redhat.com, hu...@google.com, andre...@gmail.com, kees...@chromium.org, ndesau...@google.com, gre...@linuxfoundation.org, ebig...@google.com, ytc...@gmail.com, vincent...@linaro.org, dietmar....@arm.com, ros...@goodmis.org, bse...@google.com, bri...@redhat.com, vsch...@redhat.com, c...@linux.com, pen...@kernel.org, iamjoon...@lge.com, 42.h...@gmail.com, gli...@google.com, el...@google.com, dvy...@google.com, shak...@google.com, songm...@bytedance.com, jba...@akamai.com, rien...@google.com, min...@google.com, kales...@google.com, kerne...@android.com, linu...@vger.kernel.org, linux-...@vger.kernel.org, io...@lists.linux.dev, linux...@vger.kernel.org, linux-...@vger.kernel.org, linu...@kvack.org, linux-...@vger.kernel.org, kasa...@googlegroups.com, cgr...@vger.kernel.org
On Wed, 3 May 2023 09:51:49 +0200
Michal Hocko <mho...@suse.com> wrote:

> On Wed 03-05-23 03:34:21, Kent Overstreet wrote:
>[...]
> > We've made this as clean and simple as possible: a single new macro
> > invocation per allocation function, no calling convention changes (that
> > would indeed have been a lot of churn!)
>
> That doesn't really make the concern any less relevant. I believe you
> and Suren have made a great effort to reduce the churn as much as
> possible but looking at the diffstat the code changes are clearly there
> and you have to convince the rest of the community that this maintenance
> overhead is really worth it.

I believe this is the crucial point.

I have my own concerns about the use of preprocessor macros, which goes
against the basic idea of a code tagging framework (patch 13/40).
AFAICS the CODE_TAG_INIT macro must be expanded on the same source code
line as the tagged code, which makes it hard to use without further
macros (unless you want to make the source code unreadable beyond
imagination). That's why all allocation functions must be converted to
macros.

If anyone ever wants to use this code tagging framework for something
else, they will also have to convert relevant functions to macros,
slowly changing the kernel to a minefield where local identifiers,
struct, union and enum tags, field names and labels must avoid name
conflict with a tagged function. For now, I have to remember that
alloc_pages is forbidden, but the list may grow.
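
To make the concern above concrete, here is a minimal userspace sketch, assuming GCC or Clang (it relies on GNU C statement expressions). The names `my_alloc`, `my_alloc_tagged`, `struct site_tag` and `struct pool_ops` are hypothetical and not taken from the series, and the linked-list registry is only a stand-in for however the real framework collects its tags. It shows why the tag must be created by a macro at the call site, and how an unrelated field with the same name is affected.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for a code tag: one static instance per call site. */
struct site_tag {
    const char *file;
    int line;
    size_t calls;           /* allocations made from this site */
    size_t bytes;           /* bytes requested from this site */
    struct site_tag *next;  /* registry link, used here only for brevity */
};

static struct site_tag *all_tags;

/* The real allocator, renamed so that the macro below can take over its name. */
static void *my_alloc_tagged(struct site_tag *tag, size_t size)
{
    void *p = malloc(size);

    if (!p)
        return NULL;
    if (tag->calls++ == 0) {    /* first use: add the site to the registry */
        tag->next = all_tags;
        all_tags = tag;
    }
    tag->bytes += size;
    return p;
}

/*
 * The tag has to be defined where the caller is, so the public name becomes
 * a function-like macro. ({ ... }) is a GNU C statement expression.
 */
#define my_alloc(size) ({                                              \
    static struct site_tag tag_ = { __FILE__, __LINE__, 0, 0, NULL };  \
    my_alloc_tagged(&tag_, (size));                                    \
})

/* A pre-existing user that happens to reuse the name as a field. */
struct pool_ops {
    void *(*my_alloc)(size_t size);  /* fine: the name is not followed by '(' */
};

static void *pool_get(const struct pool_ops *ops, size_t size)
{
    /* "ops->my_alloc(size)" would now be macro-expanded and fail to build. */
    return (ops->my_alloc)(size);
}

/* Number of call sites that have allocated at least once. */
static size_t count_sites(void)
{
    size_t n = 0;

    for (struct site_tag *t = all_tags; t; t = t->next)
        n++;
    return n;
}
```

Each textual use of `my_alloc(...)` gets its own static tag, so a call inside a loop is still counted as one site. The parenthesized call in `pool_get()` is the usual way to stop a function-like macro from expanding, which is exactly the kind of detail a developer would have to remember for every tagged name.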

FWIW I can see some occurrences of "alloc_pages" under arch/ which are
not renamed by patch 19/40 of this series. For instance, does the
kernel build for s390x after applying the patch series?

New code may also work initially, but explode after adding an #include
later...

HOWEVER, if the rest of the community agrees that the added value of
code tagging is worth all these potential risks, I can live with it.

Petr T

Kent Overstreet

May 3, 2023, 5:54:58 AM5/3/23
to Petr Tesařík, Michal Hocko, Suren Baghdasaryan, ak...@linux-foundation.org, vba...@suse.cz, han...@cmpxchg.org, roman.g...@linux.dev, mgo...@suse.de, da...@stgolabs.net, wi...@infradead.org, liam.h...@oracle.com, cor...@lwn.net, vo...@manifault.com, pet...@infradead.org, juri....@redhat.com, ldu...@linux.ibm.com, catalin...@arm.com, wi...@kernel.org, ar...@arndb.de, tg...@linutronix.de, mi...@redhat.com, dave....@linux.intel.com, x...@kernel.org, pet...@redhat.com, da...@redhat.com, ax...@kernel.dk, mcg...@kernel.org, masa...@kernel.org, nat...@kernel.org, den...@kernel.org, t...@kernel.org, muchu...@linux.dev, rp...@kernel.org, pau...@kernel.org, pasha.t...@soleen.com, yosry...@google.com, yuz...@google.com, dhow...@redhat.com, hu...@google.com, andre...@gmail.com, kees...@chromium.org, ndesau...@google.com, gre...@linuxfoundation.org, ebig...@google.com, ytc...@gmail.com, vincent...@linaro.org, dietmar....@arm.com, ros...@goodmis.org, bse...@google.com, bri...@redhat.com, vsch...@redhat.com, c...@linux.com, pen...@kernel.org, iamjoon...@lge.com, 42.h...@gmail.com, gli...@google.com, el...@google.com, dvy...@google.com, shak...@google.com, songm...@bytedance.com, jba...@akamai.com, rien...@google.com, min...@google.com, kales...@google.com, kerne...@android.com, linu...@vger.kernel.org, linux-...@vger.kernel.org, io...@lists.linux.dev, linux...@vger.kernel.org, linux-...@vger.kernel.org, linu...@kvack.org, linux-...@vger.kernel.org, kasa...@googlegroups.com, cgr...@vger.kernel.org
No, we've got other code tagging applications (that have already been
posted!) and they don't "convert functions to macros" in the way this
patchset does - they do introduce new macros, but as new identifiers,
which we do all the time.

This was simply the least churny way to hook memory allocations.

Kent Overstreet

May 3, 2023, 5:57:28 AM5/3/23
to Petr Tesařík, Michal Hocko, Suren Baghdasaryan, ak...@linux-foundation.org, vba...@suse.cz, han...@cmpxchg.org, roman.g...@linux.dev, mgo...@suse.de, da...@stgolabs.net, wi...@infradead.org, liam.h...@oracle.com, cor...@lwn.net, vo...@manifault.com, pet...@infradead.org, juri....@redhat.com, ldu...@linux.ibm.com, catalin...@arm.com, wi...@kernel.org, ar...@arndb.de, tg...@linutronix.de, mi...@redhat.com, dave....@linux.intel.com, x...@kernel.org, pet...@redhat.com, da...@redhat.com, ax...@kernel.dk, mcg...@kernel.org, masa...@kernel.org, nat...@kernel.org, den...@kernel.org, t...@kernel.org, muchu...@linux.dev, rp...@kernel.org, pau...@kernel.org, pasha.t...@soleen.com, yosry...@google.com, yuz...@google.com, dhow...@redhat.com, hu...@google.com, andre...@gmail.com, kees...@chromium.org, ndesau...@google.com, gre...@linuxfoundation.org, ebig...@google.com, ytc...@gmail.com, vincent...@linaro.org, dietmar....@arm.com, ros...@goodmis.org, bse...@google.com, bri...@redhat.com, vsch...@redhat.com, c...@linux.com, pen...@kernel.org, iamjoon...@lge.com, 42.h...@gmail.com, gli...@google.com, el...@google.com, dvy...@google.com, shak...@google.com, songm...@bytedance.com, jba...@akamai.com, rien...@google.com, min...@google.com, kales...@google.com, kerne...@android.com, linu...@vger.kernel.org, linux-...@vger.kernel.org, io...@lists.linux.dev, linux...@vger.kernel.org, linux-...@vger.kernel.org, linu...@kvack.org, linux-...@vger.kernel.org, kasa...@googlegroups.com, cgr...@vger.kernel.org
On Wed, May 03, 2023 at 11:50:51AM +0200, Petr Tesařík wrote:
> If anyone ever wants to use this code tagging framework for something
> else, they will also have to convert relevant functions to macros,
> slowly changing the kernel to a minefield where local identifiers,
> struct, union and enum tags, field names and labels must avoid name
> conflict with a tagged function. For now, I have to remember that
> alloc_pages is forbidden, but the list may grow.

Also, since you're not actually a kernel contributor yet...

It's not really good decorum to speculate in code review about things
that can be answered by just reading the code. If you're going to
comment, please do the necessary work to make sure you're saying
something that makes sense.

Petr Tesařík

May 3, 2023, 6:24:49 AM5/3/23
to Kent Overstreet, Michal Hocko, Suren Baghdasaryan, ak...@linux-foundation.org, vba...@suse.cz, han...@cmpxchg.org, roman.g...@linux.dev, mgo...@suse.de, da...@stgolabs.net, wi...@infradead.org, liam.h...@oracle.com, cor...@lwn.net, vo...@manifault.com, pet...@infradead.org, juri....@redhat.com, ldu...@linux.ibm.com, catalin...@arm.com, wi...@kernel.org, ar...@arndb.de, tg...@linutronix.de, mi...@redhat.com, dave....@linux.intel.com, x...@kernel.org, pet...@redhat.com, da...@redhat.com, ax...@kernel.dk, mcg...@kernel.org, masa...@kernel.org, nat...@kernel.org, den...@kernel.org, t...@kernel.org, muchu...@linux.dev, rp...@kernel.org, pau...@kernel.org, pasha.t...@soleen.com, yosry...@google.com, yuz...@google.com, dhow...@redhat.com, hu...@google.com, andre...@gmail.com, kees...@chromium.org, ndesau...@google.com, gre...@linuxfoundation.org, ebig...@google.com, ytc...@gmail.com, vincent...@linaro.org, dietmar....@arm.com, ros...@goodmis.org, bse...@google.com, bri...@redhat.com, vsch...@redhat.com, c...@linux.com, pen...@kernel.org, iamjoon...@lge.com, 42.h...@gmail.com, gli...@google.com, el...@google.com, dvy...@google.com, shak...@google.com, songm...@bytedance.com, jba...@akamai.com, rien...@google.com, min...@google.com, kales...@google.com, kerne...@android.com, linu...@vger.kernel.org, linux-...@vger.kernel.org, io...@lists.linux.dev, linux...@vger.kernel.org, linux-...@vger.kernel.org, linu...@kvack.org, linux-...@vger.kernel.org, kasa...@googlegroups.com, cgr...@vger.kernel.org
Yes, new all-lowercase macros which do not expand to a single
identifier are still added under include/linux. It's unfortunate IMO,
but it's a fact of life. You have a point here.

> This was simply the least churny way to hook memory allocations.

This is a bold statement. You certainly know what you plan to do, but
other people keep coming up with ideas... Like, would anyone like to
tag semaphore use: up() and down()?

Don't get me wrong. I can see the benefits of code tagging, and I
agree that my concerns are not very strong. I just want the
consequences to be understood and accepted, so that they don't take us
by surprise.

Petr T

Petr Tesařík

May 3, 2023, 6:26:31 AM5/3/23
to Kent Overstreet, Michal Hocko, Suren Baghdasaryan, ak...@linux-foundation.org, vba...@suse.cz, han...@cmpxchg.org, roman.g...@linux.dev, mgo...@suse.de, da...@stgolabs.net, wi...@infradead.org, liam.h...@oracle.com, cor...@lwn.net, vo...@manifault.com, pet...@infradead.org, juri....@redhat.com, ldu...@linux.ibm.com, catalin...@arm.com, wi...@kernel.org, ar...@arndb.de, tg...@linutronix.de, mi...@redhat.com, dave....@linux.intel.com, x...@kernel.org, pet...@redhat.com, da...@redhat.com, ax...@kernel.dk, mcg...@kernel.org, masa...@kernel.org, nat...@kernel.org, den...@kernel.org, t...@kernel.org, muchu...@linux.dev, rp...@kernel.org, pau...@kernel.org, pasha.t...@soleen.com, yosry...@google.com, yuz...@google.com, dhow...@redhat.com, hu...@google.com, andre...@gmail.com, kees...@chromium.org, ndesau...@google.com, gre...@linuxfoundation.org, ebig...@google.com, ytc...@gmail.com, vincent...@linaro.org, dietmar....@arm.com, ros...@goodmis.org, bse...@google.com, bri...@redhat.com, vsch...@redhat.com, c...@linux.com, pen...@kernel.org, iamjoon...@lge.com, 42.h...@gmail.com, gli...@google.com, el...@google.com, dvy...@google.com, shak...@google.com, songm...@bytedance.com, jba...@akamai.com, rien...@google.com, min...@google.com, kales...@google.com, kerne...@android.com, linu...@vger.kernel.org, linux-...@vger.kernel.org, io...@lists.linux.dev, linux...@vger.kernel.org, linux-...@vger.kernel.org, linu...@kvack.org, linux-...@vger.kernel.org, kasa...@googlegroups.com, cgr...@vger.kernel.org
On Wed, 3 May 2023 05:57:15 -0400
Kent Overstreet <kent.ov...@linux.dev> wrote:

> On Wed, May 03, 2023 at 11:50:51AM +0200, Petr Tesařík wrote:
> > If anyone ever wants to use this code tagging framework for something
> > else, they will also have to convert relevant functions to macros,
> > slowly changing the kernel to a minefield where local identifiers,
> > struct, union and enum tags, field names and labels must avoid name
> > conflict with a tagged function. For now, I have to remember that
> > alloc_pages is forbidden, but the list may grow.
>
> Also, since you're not actually a kernel contributor yet...

I see, I've been around only since 2007...

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=2a97468024fb5b6eccee2a67a7796485c829343a

Petr T

Suren Baghdasaryan

May 3, 2023, 10:32:10 AM5/3/23
to James Bottomley, Kent Overstreet, Petr Tesařík, Michal Hocko, ak...@linux-foundation.org, vba...@suse.cz, han...@cmpxchg.org, roman.g...@linux.dev, mgo...@suse.de, da...@stgolabs.net, wi...@infradead.org, liam.h...@oracle.com, cor...@lwn.net, vo...@manifault.com, pet...@infradead.org, juri....@redhat.com, ldu...@linux.ibm.com, catalin...@arm.com, wi...@kernel.org, ar...@arndb.de, tg...@linutronix.de, mi...@redhat.com, dave....@linux.intel.com, x...@kernel.org, pet...@redhat.com, da...@redhat.com, ax...@kernel.dk, mcg...@kernel.org, masa...@kernel.org, nat...@kernel.org, den...@kernel.org, t...@kernel.org, muchu...@linux.dev, rp...@kernel.org, pau...@kernel.org, pasha.t...@soleen.com, yosry...@google.com, yuz...@google.com, dhow...@redhat.com, hu...@google.com, andre...@gmail.com, kees...@chromium.org, ndesau...@google.com, gre...@linuxfoundation.org, ebig...@google.com, ytc...@gmail.com, vincent...@linaro.org, dietmar....@arm.com, ros...@goodmis.org, bse...@google.com, bri...@redhat.com, vsch...@redhat.com, c...@linux.com, pen...@kernel.org, iamjoon...@lge.com, 42.h...@gmail.com, gli...@google.com, el...@google.com, dvy...@google.com, shak...@google.com, songm...@bytedance.com, jba...@akamai.com, rien...@google.com, min...@google.com, kales...@google.com, kerne...@android.com, linu...@vger.kernel.org, linux-...@vger.kernel.org, io...@lists.linux.dev, linux...@vger.kernel.org, linux-...@vger.kernel.org, linu...@kvack.org, linux-...@vger.kernel.org, kasa...@googlegroups.com, cgr...@vger.kernel.org
On Wed, May 3, 2023 at 5:34 AM James Bottomley
<James.B...@hansenpartnership.com> wrote:
>
> On Wed, 2023-05-03 at 05:57 -0400, Kent Overstreet wrote:
> > On Wed, May 03, 2023 at 11:50:51AM +0200, Petr Tesařík wrote:
> > > If anyone ever wants to use this code tagging framework for
> > > something
> > > else, they will also have to convert relevant functions to macros,
> > > slowly changing the kernel to a minefield where local identifiers,
> > > struct, union and enum tags, field names and labels must avoid name
> > > conflict with a tagged function. For now, I have to remember that
> > > alloc_pages is forbidden, but the list may grow.
> >
> > Also, since you're not actually a kernel contributor yet...
>
> You have an amazing talent for being wrong. But even if you were
> actually right about this, it would be an ad hominem personal attack on
> a new contributor which crosses the line into unacceptable behaviour on
> the list and runs counter to our code of conduct.

Kent, I asked you before and I'm asking you again. Please focus on the
technical discussion and stop personal attacks. That is extremely
counter-productive.

>
> James

Suren Baghdasaryan

May 3, 2023, 11:09:41 AM5/3/23
to Michal Hocko, ak...@linux-foundation.org, kent.ov...@linux.dev, vba...@suse.cz, han...@cmpxchg.org, roman.g...@linux.dev, mgo...@suse.de, da...@stgolabs.net, wi...@infradead.org, liam.h...@oracle.com, cor...@lwn.net, vo...@manifault.com, pet...@infradead.org, juri....@redhat.com, ldu...@linux.ibm.com, catalin...@arm.com, wi...@kernel.org, ar...@arndb.de, tg...@linutronix.de, mi...@redhat.com, dave....@linux.intel.com, x...@kernel.org, pet...@redhat.com, da...@redhat.com, ax...@kernel.dk, mcg...@kernel.org, masa...@kernel.org, nat...@kernel.org, den...@kernel.org, t...@kernel.org, muchu...@linux.dev, rp...@kernel.org, pau...@kernel.org, pasha.t...@soleen.com, yosry...@google.com, yuz...@google.com, dhow...@redhat.com, hu...@google.com, andre...@gmail.com, kees...@chromium.org, ndesau...@google.com, gre...@linuxfoundation.org, ebig...@google.com, ytc...@gmail.com, vincent...@linaro.org, dietmar....@arm.com, ros...@goodmis.org, bse...@google.com, bri...@redhat.com, vsch...@redhat.com, c...@linux.com, pen...@kernel.org, iamjoon...@lge.com, 42.h...@gmail.com, gli...@google.com, el...@google.com, dvy...@google.com, shak...@google.com, songm...@bytedance.com, jba...@akamai.com, rien...@google.com, min...@google.com, kales...@google.com, kerne...@android.com, linu...@vger.kernel.org, linux-...@vger.kernel.org, io...@lists.linux.dev, linux...@vger.kernel.org, linux-...@vger.kernel.org, linu...@kvack.org, linux-...@vger.kernel.org, kasa...@googlegroups.com, cgr...@vger.kernel.org
Thanks for summarizing!

> At least those I find the most important:
> - This is a big change and it adds a significant maintenance burden
> because each allocation entry point needs to be handled specifically.
> The cost will grow with the intended coverage, especially where
> allocation is hidden in library code.

Do you mean that with more allocations in the codebase, more codetags
will be generated? Is that the concern? Or is it, as you commented on
another patch, that the context capturing feature does not limit how many
stacks will be captured?

> - It has been brought up that this is duplicating functionality already
> available via existing tracing infrastructure. You should make it very
> clear why that is not suitable for the job

I experimented with using tracing with _RET_IP_ to implement this
accounting. The major issue is the runtime overhead of the _RET_IP_ to
codetag lookup, which is orders of magnitude higher than with the proposed
code tagging approach. With the code tagging proposal, that link is resolved at
compile time. Since we want this mechanism deployed in production, we
want to keep the overhead to the absolute minimum.
You asked me before how much overhead would be tolerable and the
answer will always be "as small as possible". This is especially true
for slab allocators which are ridiculously fast and regressing them
would be very noticeable (due to the frequent use).

There is another issue, which I think can be solved in a smart way but
will either affect performance or would require more memory. With the
tracing approach we don't know beforehand how many individual
allocation sites exist, so we have to allocate code tags (or similar
structures for counting) at runtime vs compile time. We can be smart
about it and allocate in batches or even preallocate more than we need
beforehand but, as I said, it will require some kind of compromise.
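
The difference between the two approaches described above can be sketched in userspace C. This is only an illustration under stated assumptions: `struct site_counter`, `lookup_site()`, `account_by_ip()`, `account_by_tag()` and the fixed `TABLE_SLOTS` size are all hypothetical, the `ip` argument stands in for `_RET_IP_`, and the open-addressing table is just one possible lookup structure, not what any tracing-based implementation actually used.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct site_counter {
    uintptr_t ip;   /* caller address; 0 marks a free slot */
    size_t bytes;   /* bytes accounted to this site */
};

/*
 * Runtime approach: call sites are only discovered as they allocate, so
 * the table has to be sized by guesswork (or grown at runtime).
 */
#define TABLE_SLOTS 64      /* hypothetical size, must be a power of two */

static struct site_counter table[TABLE_SLOTS];
static size_t lookup_probes;    /* slots inspected so far, for illustration */

/* Find the counter for a caller address, claiming a free slot if needed. */
static struct site_counter *lookup_site(uintptr_t ip)
{
    size_t start = (ip >> 2) & (TABLE_SLOTS - 1);

    for (size_t n = 0; n < TABLE_SLOTS; n++) {
        struct site_counter *c = &table[(start + n) & (TABLE_SLOTS - 1)];

        lookup_probes++;
        if (c->ip == ip)
            return c;
        if (c->ip == 0) {
            c->ip = ip;
            return c;
        }
    }
    return NULL;    /* table full: the site cannot be tracked */
}

/* Every allocation pays for a lookup before it can update its counter. */
static int account_by_ip(uintptr_t ip, size_t size)
{
    struct site_counter *c = lookup_site(ip);

    if (!c)
        return -1;
    c->bytes += size;
    return 0;
}

/*
 * Compile-time approach: the call site already holds a pointer to its own
 * statically allocated counter, so accounting is a single addition.
 */
static void account_by_tag(struct site_counter *tag, size_t size)
{
    tag->bytes += size;
}
```

In the lookup variant the per-allocation cost depends on the table state (collisions, occupancy), and a full table means either dropping sites or allocating more memory at runtime, which is the compromise mentioned above. In the tagged variant both the lookup and the storage decision are made at build time.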

I understand that code tagging creates additional maintenance burdens
but I hope it also produces enough benefits that people will want
this. The cost is also hopefully amortized when additional
applications like the ones we presented in RFC [1] are built using the
same framework.

> - We already have page_owner infrastructure that provides allocation
> tracking data. Why it cannot be used/extended?

1. The overhead.
2. Covers only page allocators.

I didn't think about extending the page_owner approach to slab
allocators but I suspect it would not be trivial. I don't see
attaching an owner to every slab object to be a scalable solution. The
overhead would again be of concern here.

I should point out that there was one important technical concern
about lack of a kill switch for this feature, which was an issue for
distributions that can't disable the CONFIG flag. In this series we
addressed that concern.

[1] https://lore.kernel.org/all/20220830214919...@google.com/

Thanks,
Suren.

Suren Baghdasaryan

May 3, 2023, 11:18:53 AM5/3/23
to Michal Hocko, ak...@linux-foundation.org, kent.ov...@linux.dev, vba...@suse.cz, han...@cmpxchg.org, roman.g...@linux.dev, mgo...@suse.de, da...@stgolabs.net, wi...@infradead.org, liam.h...@oracle.com, cor...@lwn.net, vo...@manifault.com, pet...@infradead.org, juri....@redhat.com, ldu...@linux.ibm.com, catalin...@arm.com, wi...@kernel.org, ar...@arndb.de, tg...@linutronix.de, mi...@redhat.com, dave....@linux.intel.com, x...@kernel.org, pet...@redhat.com, da...@redhat.com, ax...@kernel.dk, mcg...@kernel.org, masa...@kernel.org, nat...@kernel.org, den...@kernel.org, t...@kernel.org, muchu...@linux.dev, rp...@kernel.org, pau...@kernel.org, pasha.t...@soleen.com, yosry...@google.com, yuz...@google.com, dhow...@redhat.com, hu...@google.com, andre...@gmail.com, kees...@chromium.org, ndesau...@google.com, gre...@linuxfoundation.org, ebig...@google.com, ytc...@gmail.com, vincent...@linaro.org, dietmar....@arm.com, ros...@goodmis.org, bse...@google.com, bri...@redhat.com, vsch...@redhat.com, c...@linux.com, pen...@kernel.org, iamjoon...@lge.com, 42.h...@gmail.com, gli...@google.com, el...@google.com, dvy...@google.com, shak...@google.com, songm...@bytedance.com, jba...@akamai.com, rien...@google.com, min...@google.com, kales...@google.com, kerne...@android.com, linu...@vger.kernel.org, linux-...@vger.kernel.org, io...@lists.linux.dev, linux...@vger.kernel.org, linux-...@vger.kernel.org, linu...@kvack.org, linux-...@vger.kernel.org, kasa...@googlegroups.com, cgr...@vger.kernel.org
On Wed, May 3, 2023 at 12:36 AM Michal Hocko <mho...@suse.com> wrote:
>
> On Mon 01-05-23 09:54:44, Suren Baghdasaryan wrote:
> [...]
> > +static inline void add_ctx(struct codetag_ctx *ctx,
> > + struct codetag_with_ctx *ctc)
> > +{
> > + kref_init(&ctx->refcount);
> > + spin_lock(&ctc->ctx_lock);
> > + ctx->flags = CTC_FLAG_CTX_PTR;
> > + ctx->ctc = ctc;
> > + list_add_tail(&ctx->node, &ctc->ctx_head);
> > + spin_unlock(&ctc->ctx_lock);
>
> AFAIU every single tracked allocation will get its own codetag_ctx.
> There is no aggregation per allocation site or anything else. This looks
> like a scalability and a memory overhead red flag to me.

True. The allocations here would not be limited. We could introduce a
global limit to the amount of memory that we can use to store contexts
and maybe reuse the oldest entry (in LRU fashion) when we hit that
limit?
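
A minimal userspace sketch of the capped-store idea, for illustration only. Everything here is hypothetical and not from the patch series: CTX_LIMIT, struct ctx_slot, struct ctx_pool and ctx_pool_store() are made-up names, and the cap counts entries rather than bytes. It reuses the oldest stored entry once the cap is hit; a true LRU would also move an entry to the tail whenever it is accessed, and a kernel version would need locking.

```c
#include <assert.h>
#include <stddef.h>

#define CTX_LIMIT 4	/* hypothetical cap; no value is proposed in the thread */

struct ctx_slot {
	unsigned long tag;	/* stand-in for a captured allocation context */
	struct ctx_slot *next;	/* singly linked list, oldest entry first */
};

struct ctx_pool {
	struct ctx_slot slots[CTX_LIMIT];
	struct ctx_slot *oldest;
	struct ctx_slot *newest;
	size_t used;		/* slots handed out so far, never above CTX_LIMIT */
	size_t evictions;	/* how many times an old entry was overwritten */
};

static void ctx_pool_init(struct ctx_pool *p)
{
	assert(p != NULL);
	p->oldest = NULL;
	p->newest = NULL;
	p->used = 0;
	p->evictions = 0;
}

/* Store a context; once the cap is reached, recycle the oldest entry. */
static struct ctx_slot *ctx_pool_store(struct ctx_pool *p, unsigned long tag)
{
	struct ctx_slot *s;

	assert(p != NULL);
	if (p->used < CTX_LIMIT) {
		s = &p->slots[p->used++];
	} else {
		/* Detach the oldest entry from the head of the list. */
		s = p->oldest;
		p->oldest = s->next;
		if (!p->oldest)
			p->newest = NULL;
		p->evictions++;
	}

	/* Fill the slot and append it at the tail as the newest entry. */
	s->tag = tag;
	s->next = NULL;
	if (p->newest)
		p->newest->next = s;
	else
		p->oldest = s;
	p->newest = s;
	return s;
}
```

The trade-off of this scheme is that the context of a long-lived allocation can be overwritten by later ones, so memory use stays bounded at the cost of losing the oldest records.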

>
> > +}
> > +
> > +static inline void rem_ctx(struct codetag_ctx *ctx,
> > + void (*free_ctx)(struct kref *refcount))
> > +{
> > + struct codetag_with_ctx *ctc = ctx->ctc;
> > +
> > + spin_lock(&ctc->ctx_lock);
>
> This could deadlock when allocator is called from the IRQ context.

I see. spin_lock_irqsave() then?
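
To spell out the failure mode: in the kernel the change would presumably be spin_lock_irqsave(&ctc->ctx_lock, flags) paired with spin_unlock_irqrestore(&ctc->ctx_lock, flags). The sketch below is only a single-CPU userspace model of why that helps, not kernel code. All names in it (irq_handler, model_spin_lock, remove_plain, remove_irqsave and the global flags) are invented for the illustration, and a real spinlock would spin forever where the model just counts a deadlock.

```c
#include <assert.h>
#include <stdbool.h>

/* Userspace model only: none of these are the kernel's definitions. */
static bool irqs_enabled = true;	/* models the CPU interrupt-enable flag */
static bool irq_pending;		/* an interrupt waiting to be delivered */
static bool lock_held;			/* models ctx_lock on a single CPU */
static int deadlocks;			/* handler found the lock already held */
static int handled;			/* handler took the lock successfully */

/* Models an allocator call from IRQ context that needs the same lock. */
static void irq_handler(void)
{
	if (lock_held) {	/* a real spinlock would spin forever here */
		deadlocks++;
		return;
	}
	lock_held = true;
	handled++;
	lock_held = false;
}

/* Deliver a pending interrupt if the model CPU currently allows it. */
static void maybe_deliver_irq(void)
{
	if (irq_pending && irqs_enabled) {
		irq_pending = false;
		irq_handler();
	}
}

static void model_spin_lock(void)
{
	assert(!lock_held);
	lock_held = true;
}

static void model_spin_unlock(void)
{
	lock_held = false;
}

/* Plain lock: interrupts stay enabled inside the critical section. */
static void remove_plain(void)
{
	model_spin_lock();
	irq_pending = true;	/* interrupt arrives mid-section */
	maybe_deliver_irq();	/* delivered at once, handler hits held lock */
	model_spin_unlock();
}

/* irqsave-style lock: interrupts are masked until the lock is dropped. */
static void remove_irqsave(void)
{
	bool flags = irqs_enabled;	/* save the current interrupt state */

	irqs_enabled = false;
	model_spin_lock();
	irq_pending = true;	/* interrupt arrives mid-section */
	maybe_deliver_irq();	/* masked, so nothing is delivered yet */
	model_spin_unlock();
	irqs_enabled = flags;	/* restore the saved state */
	maybe_deliver_irq();	/* handler runs now, lock is free */
}
```

Calling remove_plain() in this model records one deadlock, while remove_irqsave() lets the handler run cleanly after the lock is released.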

Thanks for the feedback!
Suren.

Suren Baghdasaryan

May 3, 2023, 11:24:33 AM5/3/23
to Michal Hocko, ak...@linux-foundation.org, kent.ov...@linux.dev, vba...@suse.cz, han...@cmpxchg.org, roman.g...@linux.dev, mgo...@suse.de, da...@stgolabs.net, wi...@infradead.org, liam.h...@oracle.com, cor...@lwn.net, vo...@manifault.com, pet...@infradead.org, juri....@redhat.com, ldu...@linux.ibm.com, catalin...@arm.com, wi...@kernel.org, ar...@arndb.de, tg...@linutronix.de, mi...@redhat.com, dave....@linux.intel.com, x...@kernel.org, pet...@redhat.com, da...@redhat.com, ax...@kernel.dk, mcg...@kernel.org, masa...@kernel.org, nat...@kernel.org, den...@kernel.org, t...@kernel.org, muchu...@linux.dev, rp...@kernel.org, pau...@kernel.org, pasha.t...@soleen.com, yosry...@google.com, yuz...@google.com, dhow...@redhat.com, hu...@google.com, andre...@gmail.com, kees...@chromium.org, ndesau...@google.com, gre...@linuxfoundation.org, ebig...@google.com, ytc...@gmail.com, vincent...@linaro.org, dietmar....@arm.com, ros...@goodmis.org, bse...@google.com, bri...@redhat.com, vsch...@redhat.com, c...@linux.com, pen...@kernel.org, iamjoon...@lge.com, 42.h...@gmail.com, gli...@google.com, el...@google.com, dvy...@google.com, shak...@google.com, songm...@bytedance.com, jba...@akamai.com, rien...@google.com, min...@google.com, kales...@google.com, kerne...@android.com, linu...@vger.kernel.org, linux-...@vger.kernel.org, io...@lists.linux.dev, linux...@vger.kernel.org, linux-...@vger.kernel.org, linu...@kvack.org, linux-...@vger.kernel.org, kasa...@googlegroups.com, cgr...@vger.kernel.org
I missed that. Would it be appropriate to use the gfp_flags parameter
of post_alloc_hook() here?

Dave Hansen

May 3, 2023, 11:26:48 AM5/3/23
to Suren Baghdasaryan, Michal Hocko, ak...@linux-foundation.org, kent.ov...@linux.dev, vba...@suse.cz, han...@cmpxchg.org, roman.g...@linux.dev, mgo...@suse.de, da...@stgolabs.net, wi...@infradead.org, liam.h...@oracle.com, cor...@lwn.net, vo...@manifault.com, pet...@infradead.org, juri....@redhat.com, ldu...@linux.ibm.com, catalin...@arm.com, wi...@kernel.org, ar...@arndb.de, tg...@linutronix.de, mi...@redhat.com, dave....@linux.intel.com, x...@kernel.org, pet...@redhat.com, da...@redhat.com, ax...@kernel.dk, mcg...@kernel.org, masa...@kernel.org, nat...@kernel.org, den...@kernel.org, t...@kernel.org, muchu...@linux.dev, rp...@kernel.org, pau...@kernel.org, pasha.t...@soleen.com, yosry...@google.com, yuz...@google.com, dhow...@redhat.com, hu...@google.com, andre...@gmail.com, kees...@chromium.org, ndesau...@google.com, gre...@linuxfoundation.org, ebig...@google.com, ytc...@gmail.com, vincent...@linaro.org, dietmar....@arm.com, ros...@goodmis.org, bse...@google.com, bri...@redhat.com, vsch...@redhat.com, c...@linux.com, pen...@kernel.org, iamjoon...@lge.com, 42.h...@gmail.com, gli...@google.com, el...@google.com, dvy...@google.com, shak...@google.com, songm...@bytedance.com, jba...@akamai.com, rien...@google.com, min...@google.com, kales...@google.com, kerne...@android.com, linu...@vger.kernel.org, linux-...@vger.kernel.org, io...@lists.linux.dev, linux...@vger.kernel.org, linux-...@vger.kernel.org, linu...@kvack.org, linux-...@vger.kernel.org, kasa...@googlegroups.com, cgr...@vger.kernel.org
On 5/3/23 08:18, Suren Baghdasaryan wrote:
>>> +static inline void rem_ctx(struct codetag_ctx *ctx,
>>> + void (*free_ctx)(struct kref *refcount))
>>> +{
>>> + struct codetag_with_ctx *ctc = ctx->ctc;
>>> +
>>> + spin_lock(&ctc->ctx_lock);
>> This could deadlock when allocator is called from the IRQ context.
> I see. spin_lock_irqsave() then?

Yes. But, even better, please turn on lockdep when you are testing. It
will find these for you. If you're on x86, we have a set of handy-dandy
debug options that you can add to an existing config with:

make x86_debug.config

That said, I'm as concerned as everyone else that this is all "new" code
and doesn't lean on existing tracing or things like PAGE_OWNER enough.

Kent Overstreet

May 3, 2023, 11:28:22 AM5/3/23
to James Bottomley, Petr Tesařík, Michal Hocko, Suren Baghdasaryan, ak...@linux-foundation.org, vba...@suse.cz, han...@cmpxchg.org, roman.g...@linux.dev, mgo...@suse.de, da...@stgolabs.net, wi...@infradead.org, liam.h...@oracle.com, cor...@lwn.net, vo...@manifault.com, pet...@infradead.org, juri....@redhat.com, ldu...@linux.ibm.com, catalin...@arm.com, wi...@kernel.org, ar...@arndb.de, tg...@linutronix.de, mi...@redhat.com, dave....@linux.intel.com, x...@kernel.org, pet...@redhat.com, da...@redhat.com, ax...@kernel.dk, mcg...@kernel.org, masa...@kernel.org, nat...@kernel.org, den...@kernel.org, t...@kernel.org, muchu...@linux.dev, rp...@kernel.org, pau...@kernel.org, pasha.t...@soleen.com, yosry...@google.com, yuz...@google.com, dhow...@redhat.com, hu...@google.com, andre...@gmail.com, kees...@chromium.org, ndesau...@google.com, gre...@linuxfoundation.org, ebig...@google.com, ytc...@gmail.com, vincent...@linaro.org, dietmar....@arm.com, ros...@goodmis.org, bse...@google.com, bri...@redhat.com, vsch...@redhat.com, c...@linux.com, pen...@kernel.org, iamjoon...@lge.com, 42.h...@gmail.com, gli...@google.com, el...@google.com, dvy...@google.com, shak...@google.com, songm...@bytedance.com, jba...@akamai.com, rien...@google.com, min...@google.com, kales...@google.com, kerne...@android.com, linu...@vger.kernel.org, linux-...@vger.kernel.org, io...@lists.linux.dev, linux...@vger.kernel.org, linux-...@vger.kernel.org, linu...@kvack.org, linux-...@vger.kernel.org, kasa...@googlegroups.com, cgr...@vger.kernel.org
On Wed, May 03, 2023 at 08:33:48AM -0400, James Bottomley wrote:
> On Wed, 2023-05-03 at 05:57 -0400, Kent Overstreet wrote:
> > On Wed, May 03, 2023 at 11:50:51AM +0200, Petr Tesařík wrote:
> > > If anyone ever wants to use this code tagging framework for
> > > something
> > > else, they will also have to convert relevant functions to macros,
> > > slowly changing the kernel to a minefield where local identifiers,
> > > struct, union and enum tags, field names and labels must avoid name
> > > conflict with a tagged function. For now, I have to remember that
> > > alloc_pages is forbidden, but the list may grow.
> >
> > Also, since you're not actually a kernel contributor yet...
>
> You have an amazing talent for being wrong. But even if you were
> actually right about this, it would be an ad hominem personal attack on
> a new contributor which crosses the line into unacceptable behaviour on
> the list and runs counter to our code of conduct.

...Err, what? That was intended _in no way_ as a personal attack.

If I was mistaken I do apologize, but lately I've run across quite a lot
of people offering review feedback to patches I post that turn out to
have 0 or 10 patches in the kernel, and - to be blunt - a pattern of
offering feedback in strong language with a presumption of experience
that takes a lot to respond to adequately on a technical basis.

I don't think a suggestion to spend a bit more time reading code instead
of speculating is out of order! We could all put more effort into how
we offer review feedback.

Kent Overstreet

May 3, 2023, 11:30:54 AM5/3/23
to Petr Tesařík, Michal Hocko, Suren Baghdasaryan, ak...@linux-foundation.org, vba...@suse.cz, han...@cmpxchg.org, roman.g...@linux.dev, mgo...@suse.de, da...@stgolabs.net, wi...@infradead.org, liam.h...@oracle.com, cor...@lwn.net, vo...@manifault.com, pet...@infradead.org, juri....@redhat.com, ldu...@linux.ibm.com, catalin...@arm.com, wi...@kernel.org, ar...@arndb.de, tg...@linutronix.de, mi...@redhat.com, dave....@linux.intel.com, x...@kernel.org, pet...@redhat.com, da...@redhat.com, ax...@kernel.dk, mcg...@kernel.org, masa...@kernel.org, nat...@kernel.org, den...@kernel.org, t...@kernel.org, muchu...@linux.dev, rp...@kernel.org, pau...@kernel.org, pasha.t...@soleen.com, yosry...@google.com, yuz...@google.com, dhow...@redhat.com, hu...@google.com, andre...@gmail.com, kees...@chromium.org, ndesau...@google.com, gre...@linuxfoundation.org, ebig...@google.com, ytc...@gmail.com, vincent...@linaro.org, dietmar....@arm.com, ros...@goodmis.org, bse...@google.com, bri...@redhat.com, vsch...@redhat.com, c...@linux.com, pen...@kernel.org, iamjoon...@lge.com, 42.h...@gmail.com, gli...@google.com, el...@google.com, dvy...@google.com, shak...@google.com, songm...@bytedance.com, jba...@akamai.com, rien...@google.com, min...@google.com, kales...@google.com, kerne...@android.com, linu...@vger.kernel.org, linux-...@vger.kernel.org, io...@lists.linux.dev, linux...@vger.kernel.org, linux-...@vger.kernel.org, linu...@kvack.org, linux-...@vger.kernel.org, kasa...@googlegroups.com, cgr...@vger.kernel.org
My sincere apologies :) I'd searched for your name and email and found
nothing, whoops.