[PATCH v3 00/35] Memory allocation profiling

From: Suren Baghdasaryan
Date: Feb 12, 2024, 4:39:35 PM

Memory allocation profiling, v3 and final:

Overview:
Low overhead [1] per-callsite memory allocation profiling. This is not
just for debug kernels: the overhead is low enough to be deployed in
production.

We're aiming to get this in the next merge window, for 6.9. The feedback
we've gotten has been that even out of tree this patchset has already
been useful, and there's a significant amount of other work gated on the
code tagging functionality included in this patchset [2].

Example output:
root@moria-kvm:~# sort -h /proc/allocinfo|tail
   3.11MiB     2850 fs/ext4/super.c:1408 module:ext4 func:ext4_alloc_inode
   3.52MiB      225 kernel/fork.c:356 module:fork func:alloc_thread_stack_node
   3.75MiB      960 mm/page_ext.c:270 module:page_ext func:alloc_page_ext
   4.00MiB        2 mm/khugepaged.c:893 module:khugepaged func:hpage_collapse_alloc_folio
   10.5MiB      168 block/blk-mq.c:3421 module:blk_mq func:blk_mq_alloc_rqs
   14.0MiB     3594 include/linux/gfp.h:295 module:filemap func:folio_alloc_noprof
   26.8MiB     6856 include/linux/gfp.h:295 module:memory func:folio_alloc_noprof
   64.5MiB    98315 fs/xfs/xfs_rmap_item.c:147 module:xfs func:xfs_rui_init
   98.7MiB    25264 include/linux/gfp.h:295 module:readahead func:folio_alloc_noprof
    125MiB     7357 mm/slub.c:2201 module:slub func:alloc_slab_page

Since v2:
- tglx noticed a circular header dependency between sched.h and percpu.h;
a bunch of header cleanups were merged into 6.8 to ameliorate this [3].

- a number of improvements, including moving alloc_hooks() annotations
to the correct place for better tracking (mempool), and bugfixes.

- looked at alternate hooking methods.
There were suggestions for alternate methods (compiler attribute,
trampolines), but they wouldn't have made the patchset any cleaner (we
still need different function versions, with and without accounting,
to control at which point in a call chain the accounting happens), and
they would have added a dependency on toolchain support.
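
For reference, the hooking pattern looks roughly like this (simplified
sketch, not the exact macros from the series - the real ones also deal
with nested tags and non-pointer return types):

  /* kmalloc() becomes a thin wrapper around the unhooked variant: */
  #define kmalloc(_size, _gfp) alloc_hooks(kmalloc_noprof(_size, _gfp))

  #define alloc_hooks(_do_alloc)                                        \
  ({                                                                    \
          DEFINE_ALLOC_TAG(_alloc_tag, _old); /* static callsite tag */ \
          typeof(_do_alloc) _res = _do_alloc; /* charged to the tag */  \
          alloc_tag_restore(&_alloc_tag, _old);                         \
          _res;                                                         \
  })

Code that wants accounting to happen further up the call chain calls
the _noprof variant directly and wraps its own entry points in
alloc_hooks() instead.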

Usage:
kconfig options:
- CONFIG_MEM_ALLOC_PROFILING
- CONFIG_MEM_ALLOC_PROFILING_ENABLED_BY_DEFAULT
- CONFIG_MEM_ALLOC_PROFILING_DEBUG
adds warnings for allocations that weren't accounted because of a
missing annotation

sysctl:
/proc/sys/vm/mem_profiling

Runtime info:
/proc/allocinfo
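
For example, on a kernel built with CONFIG_MEM_ALLOC_PROFILING=y but
not enabled by default, profiling can be toggled and inspected at
runtime (mirroring the session in the example output above):

  echo 1 > /proc/sys/vm/mem_profiling   # enable accounting
  sort -h /proc/allocinfo | tail        # top allocation sites
  echo 0 > /proc/sys/vm/mem_profiling   # disable accounting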

Notes:

[1]: Overhead
To measure the overhead we are comparing the following configurations:
(1) Baseline with CONFIG_MEMCG_KMEM=n
(2) Disabled by default (CONFIG_MEM_ALLOC_PROFILING=y &&
CONFIG_MEM_ALLOC_PROFILING_ENABLED_BY_DEFAULT=n)
(3) Enabled by default (CONFIG_MEM_ALLOC_PROFILING=y &&
CONFIG_MEM_ALLOC_PROFILING_ENABLED_BY_DEFAULT=y)
(4) Enabled at runtime (CONFIG_MEM_ALLOC_PROFILING=y &&
CONFIG_MEM_ALLOC_PROFILING_ENABLED_BY_DEFAULT=n && /proc/sys/vm/mem_profiling=1)
(5) Baseline with CONFIG_MEMCG_KMEM=y && allocating with __GFP_ACCOUNT

Performance overhead:
To evaluate performance we implemented an in-kernel test executing
multiple get_free_page/free_page and kmalloc/kfree calls, with
allocation sizes growing from 8 to 240 bytes, CPU frequency set to max
and CPU affinity set to a specific CPU to minimize noise. Below are
results from running the test on Ubuntu 22.04.2 LTS with a 6.8.0-rc1
kernel on a 56-core Intel Xeon:

                       kmalloc               pgalloc
(1 baseline)            6.764s               16.902s
(2 default disabled)    6.793s  (+0.43%)     17.007s  (+0.62%)
(3 default enabled)     7.197s  (+6.40%)     23.666s  (+40.02%)
(4 runtime enabled)     7.405s  (+9.48%)     23.901s  (+41.41%)
(5 memcg)              13.388s  (+97.94%)    48.460s  (+186.71%)

Memory overhead:
Kernel size:

          text      data       bss       dec    diff
(1)   26515311  18890222  17018880  62424413
(2)   26524728  19423818  16740352  62688898  264485
(3)   26524724  19423818  16740352  62688894  264481
(4)   26524728  19423818  16740352  62688898  264485
(5)   26541782  18964374  16957440  62463596   39183

Memory consumption on a 56-core Intel CPU with 125GB of memory:
  Code tags:     192 kB
  PageExts:   262144 kB (256MB)
  SlabExts:     9876 kB (9.6MB)
  PcpuExts:      512 kB (0.5MB)

Total overhead (~266MB here) is 0.2% of total memory.

[2]: Improved fault injection is the big one; the alloc_hooks() macro
this patchset introduces is also used for per-callsite fault injection
points in the dynamic fault injection patchset. That means we can
easily do fault injection on a per-module or per-file basis, which
makes it much easier to integrate memory fault injection into existing
tests.

Vlastimil recently raised concerns about exposing GFP_NOWAIT as a
PF_MEMALLOC_* flag, as this might introduce GFP_NOWAIT to allocation
paths that have never had their failure paths tested - this is something
we need to address.

[3]: The circular dependency looks to be unavoidable; the issue is that
alloc_tag_save() -> current -> get_current() requires percpu.h, and
percpu.h requires sched.h because of course it does. But this doesn't
actually cause build errors because we're only using macros, so the main
concern is just not leaving a difficult-to-disentangle minefield for
later.
So, sched.h is now pretty close to being a types-only header that
imports other type headers and declares types - these are the header
cleanups that were merged for 6.8.


Kent Overstreet (11):
lib/string_helpers: Add flags param to string_get_size()
scripts/kallsyms: Always include __start and __stop symbols
fs: Convert alloc_inode_sb() to a macro
mm/slub: Mark slab_free_freelist_hook() __always_inline
mempool: Hook up to memory allocation profiling
xfs: Memory allocation profiling fixups
mm: percpu: Introduce pcpuobj_ext
mm: percpu: Add codetag reference into pcpuobj_ext
mm: vmalloc: Enable memory allocation profiling
rhashtable: Plumb through alloc tag
MAINTAINERS: Add entries for code tagging and memory allocation
profiling

Suren Baghdasaryan (24):
mm: enumerate all gfp flags
mm: introduce slabobj_ext to support slab object extensions
mm: introduce __GFP_NO_OBJ_EXT flag to selectively prevent slabobj_ext
creation
mm/slab: introduce SLAB_NO_OBJ_EXT to avoid obj_ext creation
mm: prevent slabobj_ext allocations for slabobj_ext and kmem_cache
objects
slab: objext: introduce objext_flags as extension to
page_memcg_data_flags
lib: code tagging framework
lib: code tagging module support
lib: prevent module unloading if memory is not freed
lib: add allocation tagging support for memory allocation profiling
lib: introduce support for page allocation tagging
mm: percpu: increase PERCPU_MODULE_RESERVE to accommodate allocation
tags
change alloc_pages name in dma_map_ops to avoid name conflicts
mm: enable page allocation tagging
mm: create new codetag references during page splitting
mm/page_ext: enable early_page_ext when
CONFIG_MEM_ALLOC_PROFILING_DEBUG=y
lib: add codetag reference into slabobj_ext
mm/slab: add allocation accounting into slab allocation and free paths
mm/slab: enable slab allocation tagging for kmalloc and friends
mm: percpu: enable per-cpu allocation tagging
lib: add memory allocations report in show_mem()
codetag: debug: skip objext checking when it's for objext itself
codetag: debug: mark codetags for reserved pages as empty
codetag: debug: introduce OBJEXTS_ALLOC_FAIL to mark failed slab_ext
allocations

Documentation/admin-guide/sysctl/vm.rst | 16 ++
Documentation/filesystems/proc.rst | 28 ++
MAINTAINERS | 16 ++
arch/alpha/kernel/pci_iommu.c | 2 +-
arch/mips/jazz/jazzdma.c | 2 +-
arch/powerpc/kernel/dma-iommu.c | 2 +-
arch/powerpc/mm/book3s64/radix_pgtable.c | 2 +-
arch/powerpc/platforms/ps3/system-bus.c | 4 +-
arch/powerpc/platforms/pseries/vio.c | 2 +-
arch/x86/kernel/amd_gart_64.c | 2 +-
drivers/block/virtio_blk.c | 4 +-
drivers/gpu/drm/gud/gud_drv.c | 2 +-
drivers/iommu/dma-iommu.c | 2 +-
drivers/mmc/core/block.c | 4 +-
drivers/mtd/spi-nor/debugfs.c | 6 +-
.../ethernet/chelsio/cxgb4/cxgb4_debugfs.c | 4 +-
drivers/parisc/ccio-dma.c | 2 +-
drivers/parisc/sba_iommu.c | 2 +-
drivers/scsi/sd.c | 8 +-
drivers/staging/media/atomisp/pci/hmm/hmm.c | 2 +-
drivers/xen/grant-dma-ops.c | 2 +-
drivers/xen/swiotlb-xen.c | 2 +-
fs/xfs/kmem.c | 4 +-
fs/xfs/kmem.h | 10 +-
include/asm-generic/codetag.lds.h | 14 +
include/asm-generic/vmlinux.lds.h | 3 +
include/linux/alloc_tag.h | 188 +++++++++++++
include/linux/codetag.h | 83 ++++++
include/linux/dma-map-ops.h | 2 +-
include/linux/fortify-string.h | 5 +-
include/linux/fs.h | 6 +-
include/linux/gfp.h | 126 +++++----
include/linux/gfp_types.h | 101 +++++--
include/linux/memcontrol.h | 56 +++-
include/linux/mempool.h | 73 +++--
include/linux/mm.h | 8 +
include/linux/mm_types.h | 4 +-
include/linux/page_ext.h | 1 -
include/linux/pagemap.h | 9 +-
include/linux/percpu.h | 27 +-
include/linux/pgalloc_tag.h | 105 +++++++
include/linux/rhashtable-types.h | 11 +-
include/linux/sched.h | 24 ++
include/linux/slab.h | 184 +++++++------
include/linux/string.h | 4 +-
include/linux/string_helpers.h | 11 +-
include/linux/vmalloc.h | 60 +++-
init/Kconfig | 4 +
kernel/dma/mapping.c | 4 +-
kernel/kallsyms_selftest.c | 2 +-
kernel/module/main.c | 25 +-
lib/Kconfig.debug | 31 +++
lib/Makefile | 3 +
lib/alloc_tag.c | 213 +++++++++++++++
lib/codetag.c | 258 ++++++++++++++++++
lib/rhashtable.c | 52 +++-
lib/string_helpers.c | 22 +-
lib/test-string_helpers.c | 4 +-
mm/compaction.c | 7 +-
mm/filemap.c | 6 +-
mm/huge_memory.c | 2 +
mm/hugetlb.c | 8 +-
mm/kfence/core.c | 14 +-
mm/kfence/kfence.h | 4 +-
mm/memcontrol.c | 56 +---
mm/mempolicy.c | 52 ++--
mm/mempool.c | 36 +--
mm/mm_init.c | 10 +
mm/page_alloc.c | 66 +++--
mm/page_ext.c | 13 +
mm/page_owner.c | 2 +-
mm/percpu-internal.h | 26 +-
mm/percpu.c | 120 ++++----
mm/show_mem.c | 15 +
mm/slab.h | 176 ++++++++++--
mm/slab_common.c | 65 ++++-
mm/slub.c | 138 ++++++----
mm/util.c | 44 +--
mm/vmalloc.c | 88 +++---
scripts/kallsyms.c | 13 +
scripts/module.lds.S | 7 +
81 files changed, 2126 insertions(+), 695 deletions(-)
create mode 100644 include/asm-generic/codetag.lds.h
create mode 100644 include/linux/alloc_tag.h
create mode 100644 include/linux/codetag.h
create mode 100644 include/linux/pgalloc_tag.h
create mode 100644 lib/alloc_tag.c
create mode 100644 lib/codetag.c

--
2.43.0.687.g38aa6559b0-goog

From: Suren Baghdasaryan
Date: Feb 12, 2024, 4:39:36 PM

From: Kent Overstreet <kent.ov...@linux.dev>

The new flags parameter allows controlling:
- whether the units suffix is separated by a space, for compatibility
  with sort -h
- whether to append a "B" suffix - we're not always printing bytes.
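
For instance, a caller that wants output that sorts cleanly with
sort -h can drop the space before the unit (illustrative snippet;
nbytes is a stand-in for whatever is being printed):

  char buf[16];

  string_get_size(nbytes, 1, STRING_SIZE_BASE2 | STRING_SIZE_NOSPACE,
                  buf, sizeof(buf));
  /* buf now holds e.g. "3.11MiB" - no space, so sort -h parses it */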

Signed-off-by: Kent Overstreet <kent.ov...@linux.dev>
Signed-off-by: Suren Baghdasaryan <sur...@google.com>
Cc: Andy Shevchenko <an...@kernel.org>
Cc: Michael Ellerman <m...@ellerman.id.au>
Cc: Benjamin Herrenschmidt <be...@kernel.crashing.org>
Cc: Paul Mackerras <pau...@samba.org>
Cc: "Michael S. Tsirkin" <m...@redhat.com>
Cc: Jason Wang <jaso...@redhat.com>
Cc: "Noralf Trønnes" <nor...@tronnes.org>
Cc: Jens Axboe <ax...@kernel.dk>
---
arch/powerpc/mm/book3s64/radix_pgtable.c | 2 +-
drivers/block/virtio_blk.c | 4 ++--
drivers/gpu/drm/gud/gud_drv.c | 2 +-
drivers/mmc/core/block.c | 4 ++--
drivers/mtd/spi-nor/debugfs.c | 6 ++---
.../ethernet/chelsio/cxgb4/cxgb4_debugfs.c | 4 ++--
drivers/scsi/sd.c | 8 +++----
include/linux/string_helpers.h | 11 +++++-----
lib/string_helpers.c | 22 ++++++++++++++-----
lib/test-string_helpers.c | 4 ++--
mm/hugetlb.c | 8 +++----
11 files changed, 42 insertions(+), 33 deletions(-)

diff --git a/arch/powerpc/mm/book3s64/radix_pgtable.c b/arch/powerpc/mm/book3s64/radix_pgtable.c
index c6a4ac766b2b..27aa5a083ff0 100644
--- a/arch/powerpc/mm/book3s64/radix_pgtable.c
+++ b/arch/powerpc/mm/book3s64/radix_pgtable.c
@@ -260,7 +260,7 @@ print_mapping(unsigned long start, unsigned long end, unsigned long size, bool e
if (end <= start)
return;

- string_get_size(size, 1, STRING_UNITS_2, buf, sizeof(buf));
+ string_get_size(size, 1, STRING_SIZE_BASE2, buf, sizeof(buf));

pr_info("Mapped 0x%016lx-0x%016lx with %s pages%s\n", start, end, buf,
exec ? " (exec)" : "");
diff --git a/drivers/block/virtio_blk.c b/drivers/block/virtio_blk.c
index 2bf14a0e2815..94fba7f57079 100644
--- a/drivers/block/virtio_blk.c
+++ b/drivers/block/virtio_blk.c
@@ -934,9 +934,9 @@ static void virtblk_update_capacity(struct virtio_blk *vblk, bool resize)
nblocks = DIV_ROUND_UP_ULL(capacity, queue_logical_block_size(q) >> 9);

string_get_size(nblocks, queue_logical_block_size(q),
- STRING_UNITS_2, cap_str_2, sizeof(cap_str_2));
+ STRING_SIZE_BASE2, cap_str_2, sizeof(cap_str_2));
string_get_size(nblocks, queue_logical_block_size(q),
- STRING_UNITS_10, cap_str_10, sizeof(cap_str_10));
+ 0, cap_str_10, sizeof(cap_str_10));

dev_notice(&vdev->dev,
"[%s] %s%llu %d-byte logical blocks (%s/%s)\n",
diff --git a/drivers/gpu/drm/gud/gud_drv.c b/drivers/gpu/drm/gud/gud_drv.c
index 9d7bf8ee45f1..6b1748e1f666 100644
--- a/drivers/gpu/drm/gud/gud_drv.c
+++ b/drivers/gpu/drm/gud/gud_drv.c
@@ -329,7 +329,7 @@ static int gud_stats_debugfs(struct seq_file *m, void *data)
struct gud_device *gdrm = to_gud_device(entry->dev);
char buf[10];

- string_get_size(gdrm->bulk_len, 1, STRING_UNITS_2, buf, sizeof(buf));
+ string_get_size(gdrm->bulk_len, 1, STRING_SIZE_BASE2, buf, sizeof(buf));
seq_printf(m, "Max buffer size: %s\n", buf);
seq_printf(m, "Number of errors: %u\n", gdrm->stats_num_errors);

diff --git a/drivers/mmc/core/block.c b/drivers/mmc/core/block.c
index 32d49100dff5..1cded1e9aca4 100644
--- a/drivers/mmc/core/block.c
+++ b/drivers/mmc/core/block.c
@@ -2557,7 +2557,7 @@ static struct mmc_blk_data *mmc_blk_alloc_req(struct mmc_card *card,

blk_queue_write_cache(md->queue.queue, cache_enabled, fua_enabled);

- string_get_size((u64)size, 512, STRING_UNITS_2,
+ string_get_size((u64)size, 512, STRING_SIZE_BASE2,
cap_str, sizeof(cap_str));
pr_info("%s: %s %s %s%s\n",
md->disk->disk_name, mmc_card_id(card), mmc_card_name(card),
@@ -2753,7 +2753,7 @@ static int mmc_blk_alloc_rpmb_part(struct mmc_card *card,

list_add(&rpmb->node, &md->rpmbs);

- string_get_size((u64)size, 512, STRING_UNITS_2,
+ string_get_size((u64)size, 512, STRING_SIZE_BASE2,
cap_str, sizeof(cap_str));

pr_info("%s: %s %s %s, chardev (%d:%d)\n",
diff --git a/drivers/mtd/spi-nor/debugfs.c b/drivers/mtd/spi-nor/debugfs.c
index 2dbda6b6938a..f6c3ca430df1 100644
--- a/drivers/mtd/spi-nor/debugfs.c
+++ b/drivers/mtd/spi-nor/debugfs.c
@@ -85,7 +85,7 @@ static int spi_nor_params_show(struct seq_file *s, void *data)

seq_printf(s, "name\t\t%s\n", info->name);
seq_printf(s, "id\t\t%*ph\n", SPI_NOR_MAX_ID_LEN, nor->id);
- string_get_size(params->size, 1, STRING_UNITS_2, buf, sizeof(buf));
+ string_get_size(params->size, 1, STRING_SIZE_BASE2, buf, sizeof(buf));
seq_printf(s, "size\t\t%s\n", buf);
seq_printf(s, "write size\t%u\n", params->writesize);
seq_printf(s, "page size\t%u\n", params->page_size);
@@ -130,14 +130,14 @@ static int spi_nor_params_show(struct seq_file *s, void *data)
struct spi_nor_erase_type *et = &erase_map->erase_type[i];

if (et->size) {
- string_get_size(et->size, 1, STRING_UNITS_2, buf,
+ string_get_size(et->size, 1, STRING_SIZE_BASE2, buf,
sizeof(buf));
seq_printf(s, " %02x (%s) [%d]\n", et->opcode, buf, i);
}
}

if (!(nor->flags & SNOR_F_NO_OP_CHIP_ERASE)) {
- string_get_size(params->size, 1, STRING_UNITS_2, buf, sizeof(buf));
+ string_get_size(params->size, 1, STRING_SIZE_BASE2, buf, sizeof(buf));
seq_printf(s, " %02x (%s)\n", nor->params->die_erase_opcode, buf);
}

diff --git a/drivers/net/ethernet/chelsio/cxgb4/cxgb4_debugfs.c b/drivers/net/ethernet/chelsio/cxgb4/cxgb4_debugfs.c
index 14e0d989c3ba..7d5fbebd36fc 100644
--- a/drivers/net/ethernet/chelsio/cxgb4/cxgb4_debugfs.c
+++ b/drivers/net/ethernet/chelsio/cxgb4/cxgb4_debugfs.c
@@ -3457,8 +3457,8 @@ static void mem_region_show(struct seq_file *seq, const char *name,
{
char buf[40];

- string_get_size((u64)to - from + 1, 1, STRING_UNITS_2, buf,
- sizeof(buf));
+ string_get_size((u64)to - from + 1, 1, STRING_SIZE_BASE2,
+ buf, sizeof(buf));
seq_printf(seq, "%-15s %#x-%#x [%s]\n", name, from, to, buf);
}

diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c
index 0833b3e6aa6e..e23bcb1d1ffa 100644
--- a/drivers/scsi/sd.c
+++ b/drivers/scsi/sd.c
@@ -2731,10 +2731,10 @@ sd_print_capacity(struct scsi_disk *sdkp,
if (!sdkp->first_scan && old_capacity == sdkp->capacity)
return;

- string_get_size(sdkp->capacity, sector_size,
- STRING_UNITS_2, cap_str_2, sizeof(cap_str_2));
- string_get_size(sdkp->capacity, sector_size,
- STRING_UNITS_10, cap_str_10, sizeof(cap_str_10));
+ string_get_size(sdkp->capacity, sector_size, STRING_SIZE_BASE2,
+ cap_str_2, sizeof(cap_str_2));
+ string_get_size(sdkp->capacity, sector_size, 0,
+ cap_str_10, sizeof(cap_str_10));

sd_printk(KERN_NOTICE, sdkp,
"%llu %d-byte logical blocks: (%s/%s)\n",
diff --git a/include/linux/string_helpers.h b/include/linux/string_helpers.h
index 58fb1f90eda5..a54467d891db 100644
--- a/include/linux/string_helpers.h
+++ b/include/linux/string_helpers.h
@@ -17,14 +17,13 @@ static inline bool string_is_terminated(const char *s, int len)
return memchr(s, '\0', len) ? true : false;
}

-/* Descriptions of the types of units to
- * print in */
-enum string_size_units {
- STRING_UNITS_10, /* use powers of 10^3 (standard SI) */
- STRING_UNITS_2, /* use binary powers of 2^10 */
+enum string_size_flags {
+ STRING_SIZE_BASE2 = (1 << 0),
+ STRING_SIZE_NOSPACE = (1 << 1),
+ STRING_SIZE_NOBYTES = (1 << 2),
};

-int string_get_size(u64 size, u64 blk_size, enum string_size_units units,
+int string_get_size(u64 size, u64 blk_size, enum string_size_flags flags,
char *buf, int len);

int parse_int_array_user(const char __user *from, size_t count, int **array);
diff --git a/lib/string_helpers.c b/lib/string_helpers.c
index 7713f73e66b0..a5d7d1caed70 100644
--- a/lib/string_helpers.c
+++ b/lib/string_helpers.c
@@ -19,11 +19,17 @@
#include <linux/string.h>
#include <linux/string_helpers.h>

+enum string_size_units {
+ STRING_UNITS_10, /* use powers of 10^3 (standard SI) */
+ STRING_UNITS_2, /* use binary powers of 2^10 */
+};
+
/**
* string_get_size - get the size in the specified units
* @size: The size to be converted in blocks
* @blk_size: Size of the block (use 1 for size in bytes)
- * @units: units to use (powers of 1000 or 1024)
+ * @flags: units to use (powers of 1000 or 1024), whether to include a
+ * space separator and whether to append a "B" suffix
* @buf: buffer to format to
* @len: length of buffer
*
@@ -34,14 +40,16 @@
* Return value: number of characters of output that would have been written
* (which may be greater than len, if output was truncated).
*/
-int string_get_size(u64 size, u64 blk_size, const enum string_size_units units,
+int string_get_size(u64 size, u64 blk_size, enum string_size_flags flags,
char *buf, int len)
{
+ enum string_size_units units = flags & STRING_SIZE_BASE2
+ ? STRING_UNITS_2 : STRING_UNITS_10;
static const char *const units_10[] = {
- "B", "kB", "MB", "GB", "TB", "PB", "EB", "ZB", "YB"
+ "", "k", "M", "G", "T", "P", "E", "Z", "Y"
};
static const char *const units_2[] = {
- "B", "KiB", "MiB", "GiB", "TiB", "PiB", "EiB", "ZiB", "YiB"
+ "", "Ki", "Mi", "Gi", "Ti", "Pi", "Ei", "Zi", "Yi"
};
static const char *const *const units_str[] = {
[STRING_UNITS_10] = units_10,
@@ -128,8 +136,10 @@ int string_get_size(u64 size, u64 blk_size, const enum string_size_units units,
else
unit = units_str[units][i];

- return snprintf(buf, len, "%u%s %s", (u32)size,
- tmp, unit);
+ return snprintf(buf, len, "%u%s%s%s%s", (u32)size, tmp,
+ (flags & STRING_SIZE_NOSPACE) ? "" : " ",
+ unit,
+ (flags & STRING_SIZE_NOBYTES) ? "" : "B");
}
EXPORT_SYMBOL(string_get_size);

diff --git a/lib/test-string_helpers.c b/lib/test-string_helpers.c
index 9a68849a5d55..0b01ffca96fb 100644
--- a/lib/test-string_helpers.c
+++ b/lib/test-string_helpers.c
@@ -507,8 +507,8 @@ static __init void __test_string_get_size(const u64 size, const u64 blk_size,
char buf10[string_get_size_maxbuf];
char buf2[string_get_size_maxbuf];

- string_get_size(size, blk_size, STRING_UNITS_10, buf10, sizeof(buf10));
- string_get_size(size, blk_size, STRING_UNITS_2, buf2, sizeof(buf2));
+ string_get_size(size, blk_size, 0, buf10, sizeof(buf10));
+ string_get_size(size, blk_size, STRING_SIZE_BASE2, buf2, sizeof(buf2));

test_string_get_size_check("STRING_UNITS_10", exp_result10, buf10,
size, blk_size);
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index ed1581b670d4..26a8028e4bb7 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -3475,7 +3475,7 @@ static void __init hugetlb_hstate_alloc_pages_onenode(struct hstate *h, int nid)
if (i == h->max_huge_pages_node[nid])
return;

- string_get_size(huge_page_size(h), 1, STRING_UNITS_2, buf, 32);
+ string_get_size(huge_page_size(h), 1, STRING_SIZE_BASE2, buf, 32);
pr_warn("HugeTLB: allocating %u of page size %s failed node%d. Only allocated %lu hugepages.\n",
h->max_huge_pages_node[nid], buf, nid, i);
h->max_huge_pages -= (h->max_huge_pages_node[nid] - i);
@@ -3561,7 +3561,7 @@ static void __init hugetlb_hstate_alloc_pages(struct hstate *h)
if (i < h->max_huge_pages) {
char buf[32];

- string_get_size(huge_page_size(h), 1, STRING_UNITS_2, buf, 32);
+ string_get_size(huge_page_size(h), 1, STRING_SIZE_BASE2, buf, 32);
pr_warn("HugeTLB: allocating %lu of page size %s failed. Only allocated %lu hugepages.\n",
h->max_huge_pages, buf, i);
h->max_huge_pages = i;
@@ -3607,7 +3607,7 @@ static void __init report_hugepages(void)
for_each_hstate(h) {
char buf[32];

- string_get_size(huge_page_size(h), 1, STRING_UNITS_2, buf, 32);
+ string_get_size(huge_page_size(h), 1, STRING_SIZE_BASE2, buf, 32);
pr_info("HugeTLB: registered %s page size, pre-allocated %ld pages\n",
buf, h->free_huge_pages);
pr_info("HugeTLB: %d KiB vmemmap can be freed for a %s page\n",
@@ -4527,7 +4527,7 @@ static int __init hugetlb_init(void)
char buf[32];

string_get_size(huge_page_size(&default_hstate),
- 1, STRING_UNITS_2, buf, 32);
+ 1, STRING_SIZE_BASE2, buf, 32);
pr_warn("HugeTLB: Ignoring hugepages=%lu associated with %s page size\n",
default_hstate.max_huge_pages, buf);
pr_warn("HugeTLB: Using hugepages=%lu for number of default huge pages\n",
--
2.43.0.687.g38aa6559b0-goog

From: Suren Baghdasaryan
Date: Feb 12, 2024, 4:39:38 PM

From: Kent Overstreet <kent.ov...@linux.dev>

These symbols are used to denote section boundaries: by always including
them we can unify loading sections from modules with loading built-in
sections, which leads to some significant cleanup.
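
To illustrate the pattern: with the boundary symbols always emitted,
walking a custom section can work the same way for vmlinux and (with
the module loader's help) for modules. A sketch, assuming a section
named "alloc_tags":

  extern struct codetag __start_alloc_tags[];
  extern struct codetag __stop_alloc_tags[];

  static void for_each_builtin_tag(void (*fn)(struct codetag *))
  {
          struct codetag *ct;

          for (ct = __start_alloc_tags; ct != __stop_alloc_tags; ct++)
                  fn(ct); /* same walk applies to a module's copy */
  }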

Signed-off-by: Kent Overstreet <kent.ov...@linux.dev>
Signed-off-by: Suren Baghdasaryan <sur...@google.com>
---
scripts/kallsyms.c | 13 +++++++++++++
1 file changed, 13 insertions(+)

diff --git a/scripts/kallsyms.c b/scripts/kallsyms.c
index 653b92f6d4c8..47978efe4797 100644
--- a/scripts/kallsyms.c
+++ b/scripts/kallsyms.c
@@ -204,6 +204,11 @@ static int symbol_in_range(const struct sym_entry *s,
return 0;
}

+static bool string_starts_with(const char *s, const char *prefix)
+{
+ return strncmp(s, prefix, strlen(prefix)) == 0;
+}
+
static int symbol_valid(const struct sym_entry *s)
{
const char *name = sym_name(s);
@@ -211,6 +216,14 @@ static int symbol_valid(const struct sym_entry *s)
/* if --all-symbols is not specified, then symbols outside the text
* and inittext sections are discarded */
if (!all_symbols) {
+ /*
+ * Symbols starting with __start and __stop are used to denote
+ * section boundaries, and should always be included:
+ */
+ if (string_starts_with(name, "__start_") ||
+ string_starts_with(name, "__stop_"))
+ return 1;
+
if (symbol_in_range(s, text_ranges,
ARRAY_SIZE(text_ranges)) == 0)
return 0;
--
2.43.0.687.g38aa6559b0-goog

From: Suren Baghdasaryan
Date: Feb 12, 2024, 4:39:41 PM

From: Kent Overstreet <kent.ov...@linux.dev>

We're introducing alloc tagging, which tracks memory allocations by
callsite. Converting alloc_inode_sb() to a macro means allocations will
be tracked by its caller, which is a bit more useful.
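
Concretely (sketch): with the old static inline, the hooked allocation
call expanded inside the helper, so every filesystem's inodes were
accounted to include/linux/fs.h; as a macro it expands in each caller:

  /* In ext4_alloc_inode(), fs/ext4/super.c: */
  ei = alloc_inode_sb(sb, ext4_inode_cachep, GFP_NOFS);
  /* ...now accounted as fs/ext4/super.c func:ext4_alloc_inode,
   * matching the /proc/allocinfo sample in the cover letter. */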

Signed-off-by: Kent Overstreet <kent.ov...@linux.dev>
Signed-off-by: Suren Baghdasaryan <sur...@google.com>
Cc: Alexander Viro <vi...@zeniv.linux.org.uk>
---
include/linux/fs.h | 6 +-----
1 file changed, 1 insertion(+), 5 deletions(-)

diff --git a/include/linux/fs.h b/include/linux/fs.h
index ed5966a70495..7794b4182bac 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -3013,11 +3013,7 @@ int setattr_should_drop_sgid(struct mnt_idmap *idmap,
* This must be used for allocating filesystems specific inodes to set
* up the inode reclaim context correctly.
*/
-static inline void *
-alloc_inode_sb(struct super_block *sb, struct kmem_cache *cache, gfp_t gfp)
-{
- return kmem_cache_alloc_lru(cache, &sb->s_inode_lru, gfp);
-}
+#define alloc_inode_sb(_sb, _cache, _gfp) kmem_cache_alloc_lru(_cache, &_sb->s_inode_lru, _gfp)

extern void __insert_inode_hash(struct inode *, unsigned long hashval);
static inline void insert_inode_hash(struct inode *inode)
--
2.43.0.687.g38aa6559b0-goog

From: Suren Baghdasaryan
Date: Feb 12, 2024, 4:39:42 PM

Introduce a GFP bits enumeration to let the compiler track the number
of used bits (which depends on the config options) instead of
hardcoding them. This simplifies the __GFP_BITS_SHIFT calculation.
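
The idiom in miniature (illustrative example, not kernel code): the
enum counts the bits and the masks are derived from its final
enumerator, so adding or removing a flag can never leave the shift
stale:

  enum {
          FLAG_FOO_BIT,
          FLAG_BAR_BIT,
          FLAG_LAST_BIT,          /* == number of used bits */
  };

  #define FLAG_FOO        BIT(FLAG_FOO_BIT)
  #define FLAG_BAR        BIT(FLAG_BAR_BIT)
  #define FLAGS_SHIFT     FLAG_LAST_BIT
  #define FLAGS_MASK      (BIT(FLAGS_SHIFT) - 1)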

Suggested-by: Petr Tesařík <pe...@tesarici.cz>
Signed-off-by: Suren Baghdasaryan <sur...@google.com>
---
include/linux/gfp_types.h | 90 +++++++++++++++++++++++++++------------
1 file changed, 62 insertions(+), 28 deletions(-)

diff --git a/include/linux/gfp_types.h b/include/linux/gfp_types.h
index 1b6053da8754..868c8fb1bbc1 100644
--- a/include/linux/gfp_types.h
+++ b/include/linux/gfp_types.h
@@ -21,44 +21,78 @@ typedef unsigned int __bitwise gfp_t;
* include/trace/events/mmflags.h and tools/perf/builtin-kmem.c
*/

+enum {
+ ___GFP_DMA_BIT,
+ ___GFP_HIGHMEM_BIT,
+ ___GFP_DMA32_BIT,
+ ___GFP_MOVABLE_BIT,
+ ___GFP_RECLAIMABLE_BIT,
+ ___GFP_HIGH_BIT,
+ ___GFP_IO_BIT,
+ ___GFP_FS_BIT,
+ ___GFP_ZERO_BIT,
+ ___GFP_UNUSED_BIT, /* 0x200u unused */
+ ___GFP_DIRECT_RECLAIM_BIT,
+ ___GFP_KSWAPD_RECLAIM_BIT,
+ ___GFP_WRITE_BIT,
+ ___GFP_NOWARN_BIT,
+ ___GFP_RETRY_MAYFAIL_BIT,
+ ___GFP_NOFAIL_BIT,
+ ___GFP_NORETRY_BIT,
+ ___GFP_MEMALLOC_BIT,
+ ___GFP_COMP_BIT,
+ ___GFP_NOMEMALLOC_BIT,
+ ___GFP_HARDWALL_BIT,
+ ___GFP_THISNODE_BIT,
+ ___GFP_ACCOUNT_BIT,
+ ___GFP_ZEROTAGS_BIT,
+#ifdef CONFIG_KASAN_HW_TAGS
+ ___GFP_SKIP_ZERO_BIT,
+ ___GFP_SKIP_KASAN_BIT,
+#endif
+#ifdef CONFIG_LOCKDEP
+ ___GFP_NOLOCKDEP_BIT,
+#endif
+ ___GFP_LAST_BIT
+};
+
/* Plain integer GFP bitmasks. Do not use this directly. */
-#define ___GFP_DMA 0x01u
-#define ___GFP_HIGHMEM 0x02u
-#define ___GFP_DMA32 0x04u
-#define ___GFP_MOVABLE 0x08u
-#define ___GFP_RECLAIMABLE 0x10u
-#define ___GFP_HIGH 0x20u
-#define ___GFP_IO 0x40u
-#define ___GFP_FS 0x80u
-#define ___GFP_ZERO 0x100u
+#define ___GFP_DMA BIT(___GFP_DMA_BIT)
+#define ___GFP_HIGHMEM BIT(___GFP_HIGHMEM_BIT)
+#define ___GFP_DMA32 BIT(___GFP_DMA32_BIT)
+#define ___GFP_MOVABLE BIT(___GFP_MOVABLE_BIT)
+#define ___GFP_RECLAIMABLE BIT(___GFP_RECLAIMABLE_BIT)
+#define ___GFP_HIGH BIT(___GFP_HIGH_BIT)
+#define ___GFP_IO BIT(___GFP_IO_BIT)
+#define ___GFP_FS BIT(___GFP_FS_BIT)
+#define ___GFP_ZERO BIT(___GFP_ZERO_BIT)
/* 0x200u unused */
-#define ___GFP_DIRECT_RECLAIM 0x400u
-#define ___GFP_KSWAPD_RECLAIM 0x800u
-#define ___GFP_WRITE 0x1000u
-#define ___GFP_NOWARN 0x2000u
-#define ___GFP_RETRY_MAYFAIL 0x4000u
-#define ___GFP_NOFAIL 0x8000u
-#define ___GFP_NORETRY 0x10000u
-#define ___GFP_MEMALLOC 0x20000u
-#define ___GFP_COMP 0x40000u
-#define ___GFP_NOMEMALLOC 0x80000u
-#define ___GFP_HARDWALL 0x100000u
-#define ___GFP_THISNODE 0x200000u
-#define ___GFP_ACCOUNT 0x400000u
-#define ___GFP_ZEROTAGS 0x800000u
+#define ___GFP_DIRECT_RECLAIM BIT(___GFP_DIRECT_RECLAIM_BIT)
+#define ___GFP_KSWAPD_RECLAIM BIT(___GFP_KSWAPD_RECLAIM_BIT)
+#define ___GFP_WRITE BIT(___GFP_WRITE_BIT)
+#define ___GFP_NOWARN BIT(___GFP_NOWARN_BIT)
+#define ___GFP_RETRY_MAYFAIL BIT(___GFP_RETRY_MAYFAIL_BIT)
+#define ___GFP_NOFAIL BIT(___GFP_NOFAIL_BIT)
+#define ___GFP_NORETRY BIT(___GFP_NORETRY_BIT)
+#define ___GFP_MEMALLOC BIT(___GFP_MEMALLOC_BIT)
+#define ___GFP_COMP BIT(___GFP_COMP_BIT)
+#define ___GFP_NOMEMALLOC BIT(___GFP_NOMEMALLOC_BIT)
+#define ___GFP_HARDWALL BIT(___GFP_HARDWALL_BIT)
+#define ___GFP_THISNODE BIT(___GFP_THISNODE_BIT)
+#define ___GFP_ACCOUNT BIT(___GFP_ACCOUNT_BIT)
+#define ___GFP_ZEROTAGS BIT(___GFP_ZEROTAGS_BIT)
#ifdef CONFIG_KASAN_HW_TAGS
-#define ___GFP_SKIP_ZERO 0x1000000u
-#define ___GFP_SKIP_KASAN 0x2000000u
+#define ___GFP_SKIP_ZERO BIT(___GFP_SKIP_ZERO_BIT)
+#define ___GFP_SKIP_KASAN BIT(___GFP_SKIP_KASAN_BIT)
#else
#define ___GFP_SKIP_ZERO 0
#define ___GFP_SKIP_KASAN 0
#endif
#ifdef CONFIG_LOCKDEP
-#define ___GFP_NOLOCKDEP 0x4000000u
+#define ___GFP_NOLOCKDEP BIT(___GFP_NOLOCKDEP_BIT)
#else
#define ___GFP_NOLOCKDEP 0
#endif
-/* If the above are modified, __GFP_BITS_SHIFT may need updating */

/*
* Physical address zone modifiers (see linux/mmzone.h - low four bits)
@@ -249,7 +283,7 @@ typedef unsigned int __bitwise gfp_t;
#define __GFP_NOLOCKDEP ((__force gfp_t)___GFP_NOLOCKDEP)

/* Room for N __GFP_FOO bits */
-#define __GFP_BITS_SHIFT (26 + IS_ENABLED(CONFIG_LOCKDEP))
+#define __GFP_BITS_SHIFT ___GFP_LAST_BIT
#define __GFP_BITS_MASK ((__force gfp_t)((1 << __GFP_BITS_SHIFT) - 1))

/**
--
2.43.0.687.g38aa6559b0-goog

From: Suren Baghdasaryan
Date: Feb 12, 2024, 4:39:44 PM

Currently slab pages can store only vectors of obj_cgroup pointers in
page->memcg_data. Introduce a slabobj_ext structure to allow more data
to be stored for each slab object, and wrap obj_cgroup in slabobj_ext
to support the current functionality while allowing slabobj_ext to be
extended in the future.
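
The "more data" this enables arrives later in the series; for example
"lib: add codetag reference into slabobj_ext" extends the struct
roughly like so (sketch, not the final layout):

  struct slabobj_ext {
  #ifdef CONFIG_MEMCG_KMEM
          struct obj_cgroup *objcg;
  #endif
  #ifdef CONFIG_MEM_ALLOC_PROFILING
          union codetag_ref ref;  /* allocation tag for this object */
  #endif
  } __aligned(8);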

Signed-off-by: Suren Baghdasaryan <sur...@google.com>
---
include/linux/memcontrol.h | 20 ++++++---
include/linux/mm_types.h | 4 +-
init/Kconfig | 4 ++
mm/kfence/core.c | 14 +++---
mm/kfence/kfence.h | 4 +-
mm/memcontrol.c | 56 +++--------------------
mm/page_owner.c | 2 +-
mm/slab.h | 92 +++++++++++++++++++++++++++++---------
mm/slab_common.c | 48 ++++++++++++++++++++
mm/slub.c | 64 +++++++++++++-------------
10 files changed, 189 insertions(+), 119 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 20ff87f8e001..eb1dc181e412 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -348,8 +348,8 @@ struct mem_cgroup {
extern struct mem_cgroup *root_mem_cgroup;

enum page_memcg_data_flags {
- /* page->memcg_data is a pointer to an objcgs vector */
- MEMCG_DATA_OBJCGS = (1UL << 0),
+ /* page->memcg_data is a pointer to a slabobj_ext vector */
+ MEMCG_DATA_OBJEXTS = (1UL << 0),
/* page has been accounted as a non-slab kernel page */
MEMCG_DATA_KMEM = (1UL << 1),
/* the next bit after the last actual flag */
@@ -387,7 +387,7 @@ static inline struct mem_cgroup *__folio_memcg(struct folio *folio)
unsigned long memcg_data = folio->memcg_data;

VM_BUG_ON_FOLIO(folio_test_slab(folio), folio);
- VM_BUG_ON_FOLIO(memcg_data & MEMCG_DATA_OBJCGS, folio);
+ VM_BUG_ON_FOLIO(memcg_data & MEMCG_DATA_OBJEXTS, folio);
VM_BUG_ON_FOLIO(memcg_data & MEMCG_DATA_KMEM, folio);

return (struct mem_cgroup *)(memcg_data & ~MEMCG_DATA_FLAGS_MASK);
@@ -408,7 +408,7 @@ static inline struct obj_cgroup *__folio_objcg(struct folio *folio)
unsigned long memcg_data = folio->memcg_data;

VM_BUG_ON_FOLIO(folio_test_slab(folio), folio);
- VM_BUG_ON_FOLIO(memcg_data & MEMCG_DATA_OBJCGS, folio);
+ VM_BUG_ON_FOLIO(memcg_data & MEMCG_DATA_OBJEXTS, folio);
VM_BUG_ON_FOLIO(!(memcg_data & MEMCG_DATA_KMEM), folio);

return (struct obj_cgroup *)(memcg_data & ~MEMCG_DATA_FLAGS_MASK);
@@ -505,7 +505,7 @@ static inline struct mem_cgroup *folio_memcg_check(struct folio *folio)
*/
unsigned long memcg_data = READ_ONCE(folio->memcg_data);

- if (memcg_data & MEMCG_DATA_OBJCGS)
+ if (memcg_data & MEMCG_DATA_OBJEXTS)
return NULL;

if (memcg_data & MEMCG_DATA_KMEM) {
@@ -551,7 +551,7 @@ static inline struct mem_cgroup *get_mem_cgroup_from_objcg(struct obj_cgroup *ob
static inline bool folio_memcg_kmem(struct folio *folio)
{
VM_BUG_ON_PGFLAGS(PageTail(&folio->page), &folio->page);
- VM_BUG_ON_FOLIO(folio->memcg_data & MEMCG_DATA_OBJCGS, folio);
+ VM_BUG_ON_FOLIO(folio->memcg_data & MEMCG_DATA_OBJEXTS, folio);
return folio->memcg_data & MEMCG_DATA_KMEM;
}

@@ -1633,6 +1633,14 @@ unsigned long mem_cgroup_soft_limit_reclaim(pg_data_t *pgdat, int order,
}
#endif /* CONFIG_MEMCG */

+/*
+ * Extended information for slab objects stored as an array in page->memcg_data
+ * if MEMCG_DATA_OBJEXTS is set.
+ */
+struct slabobj_ext {
+ struct obj_cgroup *objcg;
+} __aligned(8);
+
static inline void __inc_lruvec_kmem_state(void *p, enum node_stat_item idx)
{
__mod_lruvec_kmem_state(p, idx, 1);
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 8b611e13153e..9ff97f4e74c5 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -169,7 +169,7 @@ struct page {
/* Usage count. *DO NOT USE DIRECTLY*. See page_ref.h */
atomic_t _refcount;

-#ifdef CONFIG_MEMCG
+#ifdef CONFIG_SLAB_OBJ_EXT
unsigned long memcg_data;
#endif

@@ -306,7 +306,7 @@ struct folio {
};
atomic_t _mapcount;
atomic_t _refcount;
-#ifdef CONFIG_MEMCG
+#ifdef CONFIG_SLAB_OBJ_EXT
unsigned long memcg_data;
#endif
#if defined(WANT_PAGE_VIRTUAL)
diff --git a/init/Kconfig b/init/Kconfig
index deda3d14135b..8ca5285108be 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -949,10 +949,14 @@ config CGROUP_FAVOR_DYNMODS

Say N if unsure.

+config SLAB_OBJ_EXT
+ bool
+
config MEMCG
bool "Memory controller"
select PAGE_COUNTER
select EVENTFD
+ select SLAB_OBJ_EXT
help
Provides control over the memory footprint of tasks in a cgroup.

diff --git a/mm/kfence/core.c b/mm/kfence/core.c
index 8350f5c06f2e..964b8482275b 100644
--- a/mm/kfence/core.c
+++ b/mm/kfence/core.c
@@ -595,9 +595,9 @@ static unsigned long kfence_init_pool(void)
continue;

__folio_set_slab(slab_folio(slab));
-#ifdef CONFIG_MEMCG
- slab->memcg_data = (unsigned long)&kfence_metadata_init[i / 2 - 1].objcg |
- MEMCG_DATA_OBJCGS;
+#ifdef CONFIG_MEMCG_KMEM
+ slab->obj_exts = (unsigned long)&kfence_metadata_init[i / 2 - 1].obj_exts |
+ MEMCG_DATA_OBJEXTS;
#endif
}

@@ -645,8 +645,8 @@ static unsigned long kfence_init_pool(void)

if (!i || (i % 2))
continue;
-#ifdef CONFIG_MEMCG
- slab->memcg_data = 0;
+#ifdef CONFIG_MEMCG_KMEM
+ slab->obj_exts = 0;
#endif
__folio_clear_slab(slab_folio(slab));
}
@@ -1139,8 +1139,8 @@ void __kfence_free(void *addr)
{
struct kfence_metadata *meta = addr_to_metadata((unsigned long)addr);

-#ifdef CONFIG_MEMCG
- KFENCE_WARN_ON(meta->objcg);
+#ifdef CONFIG_MEMCG_KMEM
+ KFENCE_WARN_ON(meta->obj_exts.objcg);
#endif
/*
* If the objects of the cache are SLAB_TYPESAFE_BY_RCU, defer freeing
diff --git a/mm/kfence/kfence.h b/mm/kfence/kfence.h
index f46fbb03062b..084f5f36e8e7 100644
--- a/mm/kfence/kfence.h
+++ b/mm/kfence/kfence.h
@@ -97,8 +97,8 @@ struct kfence_metadata {
struct kfence_track free_track;
/* For updating alloc_covered on frees. */
u32 alloc_stack_hash;
-#ifdef CONFIG_MEMCG
- struct obj_cgroup *objcg;
+#ifdef CONFIG_MEMCG_KMEM
+ struct slabobj_ext obj_exts;
#endif
};

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 1ed40f9d3a27..7021639d2a6f 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2977,13 +2977,6 @@ void mem_cgroup_commit_charge(struct folio *folio, struct mem_cgroup *memcg)
}

#ifdef CONFIG_MEMCG_KMEM
-/*
- * The allocated objcg pointers array is not accounted directly.
- * Moreover, it should not come from DMA buffer and is not readily
- * reclaimable. So those GFP bits should be masked off.
- */
-#define OBJCGS_CLEAR_MASK (__GFP_DMA | __GFP_RECLAIMABLE | \
- __GFP_ACCOUNT | __GFP_NOFAIL)

/*
* mod_objcg_mlstate() may be called with irq enabled, so
@@ -3003,62 +2996,27 @@ static inline void mod_objcg_mlstate(struct obj_cgroup *objcg,
rcu_read_unlock();
}

-int memcg_alloc_slab_cgroups(struct slab *slab, struct kmem_cache *s,
- gfp_t gfp, bool new_slab)
-{
- unsigned int objects = objs_per_slab(s, slab);
- unsigned long memcg_data;
- void *vec;
-
- gfp &= ~OBJCGS_CLEAR_MASK;
- vec = kcalloc_node(objects, sizeof(struct obj_cgroup *), gfp,
- slab_nid(slab));
- if (!vec)
- return -ENOMEM;
-
- memcg_data = (unsigned long) vec | MEMCG_DATA_OBJCGS;
- if (new_slab) {
- /*
- * If the slab is brand new and nobody can yet access its
- * memcg_data, no synchronization is required and memcg_data can
- * be simply assigned.
- */
- slab->memcg_data = memcg_data;
- } else if (cmpxchg(&slab->memcg_data, 0, memcg_data)) {
- /*
- * If the slab is already in use, somebody can allocate and
- * assign obj_cgroups in parallel. In this case the existing
- * objcg vector should be reused.
- */
- kfree(vec);
- return 0;
- }
-
- kmemleak_not_leak(vec);
- return 0;
-}
-
static __always_inline
struct mem_cgroup *mem_cgroup_from_obj_folio(struct folio *folio, void *p)
{
/*
* Slab objects are accounted individually, not per-page.
* Memcg membership data for each individual object is saved in
- * slab->memcg_data.
+ * slab->obj_exts.
*/
if (folio_test_slab(folio)) {
- struct obj_cgroup **objcgs;
+ struct slabobj_ext *obj_exts;
struct slab *slab;
unsigned int off;

slab = folio_slab(folio);
- objcgs = slab_objcgs(slab);
- if (!objcgs)
+ obj_exts = slab_obj_exts(slab);
+ if (!obj_exts)
return NULL;

off = obj_to_index(slab->slab_cache, slab, p);
- if (objcgs[off])
- return obj_cgroup_memcg(objcgs[off]);
+ if (obj_exts[off].objcg)
+ return obj_cgroup_memcg(obj_exts[off].objcg);

return NULL;
}
@@ -3066,7 +3024,7 @@ struct mem_cgroup *mem_cgroup_from_obj_folio(struct folio *folio, void *p)
/*
* folio_memcg_check() is used here, because in theory we can encounter
* a folio where the slab flag has been cleared already, but
- * slab->memcg_data has not been freed yet
+ * slab->obj_exts has not been freed yet
* folio_memcg_check() will guarantee that a proper memory
* cgroup pointer or NULL will be returned.
*/
diff --git a/mm/page_owner.c b/mm/page_owner.c
index 5634e5d890f8..262aa7d25f40 100644
--- a/mm/page_owner.c
+++ b/mm/page_owner.c
@@ -377,7 +377,7 @@ static inline int print_page_owner_memcg(char *kbuf, size_t count, int ret,
if (!memcg_data)
goto out_unlock;

- if (memcg_data & MEMCG_DATA_OBJCGS)
+ if (memcg_data & MEMCG_DATA_OBJEXTS)
ret += scnprintf(kbuf + ret, count - ret,
"Slab cache page\n");

diff --git a/mm/slab.h b/mm/slab.h
index 54deeb0428c6..436a126486b5 100644
--- a/mm/slab.h
+++ b/mm/slab.h
@@ -87,8 +87,8 @@ struct slab {
unsigned int __unused;

atomic_t __page_refcount;
-#ifdef CONFIG_MEMCG
- unsigned long memcg_data;
+#ifdef CONFIG_SLAB_OBJ_EXT
+ unsigned long obj_exts;
#endif
};

@@ -97,8 +97,8 @@ struct slab {
SLAB_MATCH(flags, __page_flags);
SLAB_MATCH(compound_head, slab_cache); /* Ensure bit 0 is clear */
SLAB_MATCH(_refcount, __page_refcount);
-#ifdef CONFIG_MEMCG
-SLAB_MATCH(memcg_data, memcg_data);
+#ifdef CONFIG_SLAB_OBJ_EXT
+SLAB_MATCH(memcg_data, obj_exts);
#endif
#undef SLAB_MATCH
static_assert(sizeof(struct slab) <= sizeof(struct page));
@@ -541,42 +541,90 @@ static inline bool kmem_cache_debug_flags(struct kmem_cache *s, slab_flags_t fla
return false;
}

-#ifdef CONFIG_MEMCG_KMEM
+#ifdef CONFIG_SLAB_OBJ_EXT
+
/*
- * slab_objcgs - get the object cgroups vector associated with a slab
+ * slab_obj_exts - get the pointer to the slab object extension vector
+ * associated with a slab.
* @slab: a pointer to the slab struct
*
- * Returns a pointer to the object cgroups vector associated with the slab,
+ * Returns a pointer to the object extension vector associated with the slab,
* or NULL if no such vector has been associated yet.
*/
-static inline struct obj_cgroup **slab_objcgs(struct slab *slab)
+static inline struct slabobj_ext *slab_obj_exts(struct slab *slab)
{
- unsigned long memcg_data = READ_ONCE(slab->memcg_data);
+ unsigned long obj_exts = READ_ONCE(slab->obj_exts);

- VM_BUG_ON_PAGE(memcg_data && !(memcg_data & MEMCG_DATA_OBJCGS),
+#ifdef CONFIG_MEMCG
+ VM_BUG_ON_PAGE(obj_exts && !(obj_exts & MEMCG_DATA_OBJEXTS),
slab_page(slab));
- VM_BUG_ON_PAGE(memcg_data & MEMCG_DATA_KMEM, slab_page(slab));
+ VM_BUG_ON_PAGE(obj_exts & MEMCG_DATA_KMEM, slab_page(slab));

- return (struct obj_cgroup **)(memcg_data & ~MEMCG_DATA_FLAGS_MASK);
+ return (struct slabobj_ext *)(obj_exts & ~MEMCG_DATA_FLAGS_MASK);
+#else
+ return (struct slabobj_ext *)obj_exts;
+#endif
}

-int memcg_alloc_slab_cgroups(struct slab *slab, struct kmem_cache *s,
- gfp_t gfp, bool new_slab);
-void mod_objcg_state(struct obj_cgroup *objcg, struct pglist_data *pgdat,
- enum node_stat_item idx, int nr);
-#else /* CONFIG_MEMCG_KMEM */
-static inline struct obj_cgroup **slab_objcgs(struct slab *slab)
+int alloc_slab_obj_exts(struct slab *slab, struct kmem_cache *s,
+ gfp_t gfp, bool new_slab);
+
+static inline bool need_slab_obj_ext(void)
+{
+ /*
+ * CONFIG_MEMCG_KMEM creates a vector of obj_cgroup objects conditionally
+ * inside memcg_slab_post_alloc_hook. No other users for now.
+ */
+ return false;
+}
+
+static inline struct slabobj_ext *
+prepare_slab_obj_exts_hook(struct kmem_cache *s, gfp_t flags, void *p)
+{
+ struct slab *slab;
+
+ if (!p)
+ return NULL;
+
+ if (!need_slab_obj_ext())
+ return NULL;
+
+ slab = virt_to_slab(p);
+ if (!slab_obj_exts(slab) &&
+ WARN(alloc_slab_obj_exts(slab, s, flags, false),
+ "%s, %s: Failed to create slab extension vector!\n",
+ __func__, s->name))
+ return NULL;
+
+ return slab_obj_exts(slab) + obj_to_index(s, slab, p);
+}
+
+#else /* CONFIG_SLAB_OBJ_EXT */
+
+static inline struct slabobj_ext *slab_obj_exts(struct slab *slab)
{
return NULL;
}

-static inline int memcg_alloc_slab_cgroups(struct slab *slab,
- struct kmem_cache *s, gfp_t gfp,
- bool new_slab)
+static inline int alloc_slab_obj_exts(struct slab *slab,
+ struct kmem_cache *s, gfp_t gfp,
+ bool new_slab)
{
return 0;
}
-#endif /* CONFIG_MEMCG_KMEM */
+
+static inline struct slabobj_ext *
+prepare_slab_obj_exts_hook(struct kmem_cache *s, gfp_t flags, void *p)
+{
+ return NULL;
+}
+
+#endif /* CONFIG_SLAB_OBJ_EXT */
+
+#ifdef CONFIG_MEMCG_KMEM
+void mod_objcg_state(struct obj_cgroup *objcg, struct pglist_data *pgdat,
+ enum node_stat_item idx, int nr);
+#endif

size_t __ksize(const void *objp);

diff --git a/mm/slab_common.c b/mm/slab_common.c
index 238293b1dbe1..6bfa1810da5e 100644
--- a/mm/slab_common.c
+++ b/mm/slab_common.c
@@ -201,6 +201,54 @@ struct kmem_cache *find_mergeable(unsigned int size, unsigned int align,
return NULL;
}

+#ifdef CONFIG_SLAB_OBJ_EXT
+/*
+ * The allocated objcg pointers array is not accounted directly.
+ * Moreover, it should not come from DMA buffer and is not readily
+ * reclaimable. So those GFP bits should be masked off.
+ */
+#define OBJCGS_CLEAR_MASK (__GFP_DMA | __GFP_RECLAIMABLE | \
+ __GFP_ACCOUNT | __GFP_NOFAIL)
+
+int alloc_slab_obj_exts(struct slab *slab, struct kmem_cache *s,
+ gfp_t gfp, bool new_slab)
+{
+ unsigned int objects = objs_per_slab(s, slab);
+ unsigned long obj_exts;
+ void *vec;
+
+ gfp &= ~OBJCGS_CLEAR_MASK;
+ vec = kcalloc_node(objects, sizeof(struct slabobj_ext), gfp,
+ slab_nid(slab));
+ if (!vec)
+ return -ENOMEM;
+
+ obj_exts = (unsigned long)vec;
+#ifdef CONFIG_MEMCG
+ obj_exts |= MEMCG_DATA_OBJEXTS;
+#endif
+ if (new_slab) {
+ /*
+ * If the slab is brand new and nobody can yet access its
+ * obj_exts, no synchronization is required and obj_exts can
+ * be simply assigned.
+ */
+ slab->obj_exts = obj_exts;
+ } else if (cmpxchg(&slab->obj_exts, 0, obj_exts)) {
+ /*
+ * If the slab is already in use, somebody can allocate and
+ * assign slabobj_exts in parallel. In this case the existing
+ * objcg vector should be reused.
+ */
+ kfree(vec);
+ return 0;
+ }
+
+ kmemleak_not_leak(vec);
+ return 0;
+}
+#endif /* CONFIG_SLAB_OBJ_EXT */
+
static struct kmem_cache *create_cache(const char *name,
unsigned int object_size, unsigned int align,
slab_flags_t flags, unsigned int useroffset,
diff --git a/mm/slub.c b/mm/slub.c
index 2ef88bbf56a3..1eb1050814aa 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -683,10 +683,10 @@ static inline bool __slab_update_freelist(struct kmem_cache *s, struct slab *sla

if (s->flags & __CMPXCHG_DOUBLE) {
ret = __update_freelist_fast(slab, freelist_old, counters_old,
- freelist_new, counters_new);
+ freelist_new, counters_new);
} else {
ret = __update_freelist_slow(slab, freelist_old, counters_old,
- freelist_new, counters_new);
+ freelist_new, counters_new);
}
if (likely(ret))
return true;
@@ -710,13 +710,13 @@ static inline bool slab_update_freelist(struct kmem_cache *s, struct slab *slab,

if (s->flags & __CMPXCHG_DOUBLE) {
ret = __update_freelist_fast(slab, freelist_old, counters_old,
- freelist_new, counters_new);
+ freelist_new, counters_new);
} else {
unsigned long flags;

local_irq_save(flags);
ret = __update_freelist_slow(slab, freelist_old, counters_old,
- freelist_new, counters_new);
+ freelist_new, counters_new);
local_irq_restore(flags);
}
if (likely(ret))
@@ -1881,13 +1881,25 @@ static inline enum node_stat_item cache_vmstat_idx(struct kmem_cache *s)
NR_SLAB_RECLAIMABLE_B : NR_SLAB_UNRECLAIMABLE_B;
}

-#ifdef CONFIG_MEMCG_KMEM
-static inline void memcg_free_slab_cgroups(struct slab *slab)
+#ifdef CONFIG_SLAB_OBJ_EXT
+static inline void free_slab_obj_exts(struct slab *slab)
+{
+ struct slabobj_ext *obj_exts;
+
+ obj_exts = slab_obj_exts(slab);
+ if (!obj_exts)
+ return;
+
+ kfree(obj_exts);
+ slab->obj_exts = 0;
+}
+#else
+static inline void free_slab_obj_exts(struct slab *slab)
{
- kfree(slab_objcgs(slab));
- slab->memcg_data = 0;
}
+#endif

+#ifdef CONFIG_MEMCG_KMEM
static inline size_t obj_full_size(struct kmem_cache *s)
{
/*
@@ -1966,15 +1978,15 @@ static void __memcg_slab_post_alloc_hook(struct kmem_cache *s,
if (likely(p[i])) {
slab = virt_to_slab(p[i]);

- if (!slab_objcgs(slab) &&
- memcg_alloc_slab_cgroups(slab, s, flags, false)) {
+ if (!slab_obj_exts(slab) &&
+ alloc_slab_obj_exts(slab, s, flags, false)) {
obj_cgroup_uncharge(objcg, obj_full_size(s));
continue;
}

off = obj_to_index(s, slab, p[i]);
obj_cgroup_get(objcg);
- slab_objcgs(slab)[off] = objcg;
+ slab_obj_exts(slab)[off].objcg = objcg;
mod_objcg_state(objcg, slab_pgdat(slab),
cache_vmstat_idx(s), obj_full_size(s));
} else {
@@ -1995,18 +2007,18 @@ void memcg_slab_post_alloc_hook(struct kmem_cache *s, struct obj_cgroup *objcg,

static void __memcg_slab_free_hook(struct kmem_cache *s, struct slab *slab,
void **p, int objects,
- struct obj_cgroup **objcgs)
+ struct slabobj_ext *obj_exts)
{
for (int i = 0; i < objects; i++) {
struct obj_cgroup *objcg;
unsigned int off;

off = obj_to_index(s, slab, p[i]);
- objcg = objcgs[off];
+ objcg = obj_exts[off].objcg;
if (!objcg)
continue;

- objcgs[off] = NULL;
+ obj_exts[off].objcg = NULL;
obj_cgroup_uncharge(objcg, obj_full_size(s));
mod_objcg_state(objcg, slab_pgdat(slab), cache_vmstat_idx(s),
-obj_full_size(s));
@@ -2018,16 +2030,16 @@ static __fastpath_inline
void memcg_slab_free_hook(struct kmem_cache *s, struct slab *slab, void **p,
int objects)
{
- struct obj_cgroup **objcgs;
+ struct slabobj_ext *obj_exts;

if (!memcg_kmem_online())
return;

- objcgs = slab_objcgs(slab);
- if (likely(!objcgs))
+ obj_exts = slab_obj_exts(slab);
+ if (likely(!obj_exts))
return;

- __memcg_slab_free_hook(s, slab, p, objects, objcgs);
+ __memcg_slab_free_hook(s, slab, p, objects, obj_exts);
}

static inline
@@ -2038,15 +2050,6 @@ void memcg_slab_alloc_error_hook(struct kmem_cache *s, int objects,
obj_cgroup_uncharge(objcg, objects * obj_full_size(s));
}
#else /* CONFIG_MEMCG_KMEM */
-static inline struct mem_cgroup *memcg_from_slab_obj(void *ptr)
-{
- return NULL;
-}
-
-static inline void memcg_free_slab_cgroups(struct slab *slab)
-{
-}
-
static inline bool memcg_slab_pre_alloc_hook(struct kmem_cache *s,
struct list_lru *lru,
struct obj_cgroup **objcgp,
@@ -2314,7 +2317,7 @@ static __always_inline void account_slab(struct slab *slab, int order,
struct kmem_cache *s, gfp_t gfp)
{
if (memcg_kmem_online() && (s->flags & SLAB_ACCOUNT))
- memcg_alloc_slab_cgroups(slab, s, gfp, true);
+ alloc_slab_obj_exts(slab, s, gfp, true);

mod_node_page_state(slab_pgdat(slab), cache_vmstat_idx(s),
PAGE_SIZE << order);
@@ -2323,8 +2326,7 @@ static __always_inline void account_slab(struct slab *slab, int order,
static __always_inline void unaccount_slab(struct slab *slab, int order,
struct kmem_cache *s)
{
- if (memcg_kmem_online())
- memcg_free_slab_cgroups(slab);
+ free_slab_obj_exts(slab);

mod_node_page_state(slab_pgdat(slab), cache_vmstat_idx(s),
-(PAGE_SIZE << order));
@@ -3775,6 +3777,7 @@ void slab_post_alloc_hook(struct kmem_cache *s, struct obj_cgroup *objcg,
unsigned int orig_size)
{
unsigned int zero_size = s->object_size;
+ struct slabobj_ext *obj_exts;
bool kasan_init = init;
size_t i;
gfp_t init_flags = flags & gfp_allowed_mask;
@@ -3817,6 +3820,7 @@ void slab_post_alloc_hook(struct kmem_cache *s, struct obj_cgroup *objcg,
kmemleak_alloc_recursive(p[i], s->object_size, 1,
s->flags, init_flags);
kmsan_slab_alloc(s, p[i], init_flags);
+ obj_exts = prepare_slab_obj_exts_hook(s, flags, p[i]);
}

memcg_slab_post_alloc_hook(s, objcg, flags, size, p);
--
2.43.0.687.g38aa6559b0-goog
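For readers tracking the conversion, a minimal sketch (not part of the
patch; the helper name is hypothetical) of how the per-object extension
vector ends up being consulted. slab_obj_exts(), obj_to_index() and the
objcg member are the ones used in the hunks above:

	/* Sketch: look up the obj_cgroup recorded for one slab object. */
	static struct obj_cgroup *obj_cgroup_of(struct kmem_cache *s,
						struct slab *slab, void *obj)
	{
		struct slabobj_ext *obj_exts = slab_obj_exts(slab);

		if (!obj_exts)	/* extension vector not allocated yet */
			return NULL;

		/* one slabobj_ext per object, indexed by the object's slot */
		return obj_exts[obj_to_index(s, slab, obj)].objcg;
	}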

Suren Baghdasaryan
Feb 12, 2024, 4:39:46 PM

Introduce the __GFP_NO_OBJ_EXT flag to prevent recursive allocations
when allocating slabobj_ext vectors for a slab.

Signed-off-by: Suren Baghdasaryan <sur...@google.com>
---
include/linux/gfp_types.h | 11 +++++++++++
1 file changed, 11 insertions(+)

diff --git a/include/linux/gfp_types.h b/include/linux/gfp_types.h
index 868c8fb1bbc1..e36e168d8cfd 100644
--- a/include/linux/gfp_types.h
+++ b/include/linux/gfp_types.h
@@ -52,6 +52,9 @@ enum {
#endif
#ifdef CONFIG_LOCKDEP
___GFP_NOLOCKDEP_BIT,
+#endif
+#ifdef CONFIG_SLAB_OBJ_EXT
+ ___GFP_NO_OBJ_EXT_BIT,
#endif
___GFP_LAST_BIT
};
@@ -93,6 +96,11 @@ enum {
#else
#define ___GFP_NOLOCKDEP 0
#endif
+#ifdef CONFIG_SLAB_OBJ_EXT
+#define ___GFP_NO_OBJ_EXT BIT(___GFP_NO_OBJ_EXT_BIT)
+#else
+#define ___GFP_NO_OBJ_EXT 0
+#endif

/*
* Physical address zone modifiers (see linux/mmzone.h - low four bits)
@@ -133,12 +141,15 @@ enum {
* node with no fallbacks or placement policy enforcements.
*
* %__GFP_ACCOUNT causes the allocation to be accounted to kmemcg.
+ *
+ * %__GFP_NO_OBJ_EXT causes slab allocation to have no object extension.
*/
#define __GFP_RECLAIMABLE ((__force gfp_t)___GFP_RECLAIMABLE)
#define __GFP_WRITE ((__force gfp_t)___GFP_WRITE)
#define __GFP_HARDWALL ((__force gfp_t)___GFP_HARDWALL)
#define __GFP_THISNODE ((__force gfp_t)___GFP_THISNODE)
#define __GFP_ACCOUNT ((__force gfp_t)___GFP_ACCOUNT)
+#define __GFP_NO_OBJ_EXT ((__force gfp_t)___GFP_NO_OBJ_EXT)

/**
* DOC: Watermark modifiers
--
2.43.0.687.g38aa6559b0-goog
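To make the recursion concrete, a sketch of the call chain being cut
(hypothetical outer allocation; the actual guard is applied in a later
patch of this series, in alloc_slab_obj_exts()):

	/*
	 * kmalloc(...)
	 *   -> alloc_slab_obj_exts()        allocate the extension vector
	 *     -> kcalloc_node(vector, gfp)  the vector is itself a slab object
	 *       -> alloc_slab_obj_exts()    ...which would want its own vector
	 *         -> kcalloc_node(...)      and so on, without end
	 *
	 * Passing gfp | __GFP_NO_OBJ_EXT for the vector allocation stops
	 * the chain after the first step.
	 */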

Suren Baghdasaryan
Feb 12, 2024, 4:39:48 PM

Slab extension objects can't be allocated before the slab
infrastructure is initialized, but some caches, like kmem_cache and
kmem_cache_node, are created that early; objects from these caches
can't have extension objects. Introduce the SLAB_NO_OBJ_EXT slab flag
to mark such caches and avoid creating extensions for objects
allocated from them.

Signed-off-by: Suren Baghdasaryan <sur...@google.com>
---
include/linux/slab.h | 7 +++++++
mm/slub.c | 5 +++--
2 files changed, 10 insertions(+), 2 deletions(-)

diff --git a/include/linux/slab.h b/include/linux/slab.h
index b5f5ee8308d0..3ac2fc830f0f 100644
--- a/include/linux/slab.h
+++ b/include/linux/slab.h
@@ -164,6 +164,13 @@
#endif
#define SLAB_TEMPORARY SLAB_RECLAIM_ACCOUNT /* Objects are short-lived */

+#ifdef CONFIG_SLAB_OBJ_EXT
+/* Slab created using create_boot_cache */
+#define SLAB_NO_OBJ_EXT ((slab_flags_t __force)0x20000000U)
+#else
+#define SLAB_NO_OBJ_EXT 0
+#endif
+
/*
* ZERO_SIZE_PTR will be returned for zero sized kmalloc requests.
*
diff --git a/mm/slub.c b/mm/slub.c
index 1eb1050814aa..9fd96238ed39 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -5650,7 +5650,8 @@ void __init kmem_cache_init(void)
node_set(node, slab_nodes);

create_boot_cache(kmem_cache_node, "kmem_cache_node",
- sizeof(struct kmem_cache_node), SLAB_HWCACHE_ALIGN, 0, 0);
+ sizeof(struct kmem_cache_node),
+ SLAB_HWCACHE_ALIGN | SLAB_NO_OBJ_EXT, 0, 0);

hotplug_memory_notifier(slab_memory_callback, SLAB_CALLBACK_PRI);

@@ -5660,7 +5661,7 @@ void __init kmem_cache_init(void)
create_boot_cache(kmem_cache, "kmem_cache",
offsetof(struct kmem_cache, node) +
nr_node_ids * sizeof(struct kmem_cache_node *),
- SLAB_HWCACHE_ALIGN, 0, 0);
+ SLAB_HWCACHE_ALIGN | SLAB_NO_OBJ_EXT, 0, 0);

kmem_cache = bootstrap(&boot_kmem_cache);
kmem_cache_node = bootstrap(&boot_kmem_cache_node);
--
2.43.0.687.g38aa6559b0-goog
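As a usage note, any cache that must exist before the allocator can
back extension vectors would be marked the same way. A hedged sketch
with a hypothetical cache:

	/* Sketch: an early boot cache opting out of object extensions. */
	create_boot_cache(my_boot_cache, "my_boot_cache",
			  sizeof(struct my_boot_object),
			  SLAB_HWCACHE_ALIGN | SLAB_NO_OBJ_EXT, 0, 0);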

Suren Baghdasaryan
Feb 12, 2024, 4:39:51 PM

Use __GFP_NO_OBJ_EXT to prevent recursion when allocating slabobj_ext
objects. Also prevent slabobj_ext allocations for kmem_cache objects.

Signed-off-by: Suren Baghdasaryan <sur...@google.com>
---
mm/slab.h | 6 ++++++
mm/slab_common.c | 2 ++
2 files changed, 8 insertions(+)

diff --git a/mm/slab.h b/mm/slab.h
index 436a126486b5..f4ff635091e4 100644
--- a/mm/slab.h
+++ b/mm/slab.h
@@ -589,6 +589,12 @@ prepare_slab_obj_exts_hook(struct kmem_cache *s, gfp_t flags, void *p)
if (!need_slab_obj_ext())
return NULL;

+ if (s->flags & SLAB_NO_OBJ_EXT)
+ return NULL;
+
+ if (flags & __GFP_NO_OBJ_EXT)
+ return NULL;
+
slab = virt_to_slab(p);
if (!slab_obj_exts(slab) &&
WARN(alloc_slab_obj_exts(slab, s, flags, false),
diff --git a/mm/slab_common.c b/mm/slab_common.c
index 6bfa1810da5e..83fec2dd2e2d 100644
--- a/mm/slab_common.c
+++ b/mm/slab_common.c
@@ -218,6 +218,8 @@ int alloc_slab_obj_exts(struct slab *slab, struct kmem_cache *s,
void *vec;

gfp &= ~OBJCGS_CLEAR_MASK;
+ /* Prevent recursive extension vector allocation */
+ gfp |= __GFP_NO_OBJ_EXT;
vec = kcalloc_node(objects, sizeof(struct slabobj_ext), gfp,
slab_nid(slab));
if (!vec)
--
2.43.0.687.g38aa6559b0-goog
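Taken together with the previous two patches, the hook now filters in
this order (a summary sketch mirroring the mm/slab.h hunk above):

	/*
	 * prepare_slab_obj_exts_hook(s, flags, p):
	 *   1. !need_slab_obj_ext()         -> NULL (profiling disabled)
	 *   2. s->flags & SLAB_NO_OBJ_EXT   -> NULL (boot caches)
	 *   3. flags & __GFP_NO_OBJ_EXT     -> NULL (the vector allocation
	 *      itself)
	 *   4. otherwise allocate the slab's vector on first use (WARN on
	 *      failure) and return the slot for object p
	 */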

Suren Baghdasaryan
Feb 12, 2024, 4:39:54 PM

Introduce objext_flags to store additional object extension flags
unrelated to memcg.

Signed-off-by: Suren Baghdasaryan <sur...@google.com>
---
include/linux/memcontrol.h | 29 ++++++++++++++++++++++-------
mm/slab.h | 4 +---
2 files changed, 23 insertions(+), 10 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index eb1dc181e412..f3584e98b640 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -356,7 +356,22 @@ enum page_memcg_data_flags {
__NR_MEMCG_DATA_FLAGS = (1UL << 2),
};

-#define MEMCG_DATA_FLAGS_MASK (__NR_MEMCG_DATA_FLAGS - 1)
+#define __FIRST_OBJEXT_FLAG __NR_MEMCG_DATA_FLAGS
+
+#else /* CONFIG_MEMCG */
+
+#define __FIRST_OBJEXT_FLAG (1UL << 0)
+
+#endif /* CONFIG_MEMCG */
+
+enum objext_flags {
+ /* the next bit after the last actual flag */
+ __NR_OBJEXTS_FLAGS = __FIRST_OBJEXT_FLAG,
+};
+
+#define OBJEXTS_FLAGS_MASK (__NR_OBJEXTS_FLAGS - 1)
+
+#ifdef CONFIG_MEMCG

static inline bool folio_memcg_kmem(struct folio *folio);

@@ -390,7 +405,7 @@ static inline struct mem_cgroup *__folio_memcg(struct folio *folio)
VM_BUG_ON_FOLIO(memcg_data & MEMCG_DATA_OBJEXTS, folio);
VM_BUG_ON_FOLIO(memcg_data & MEMCG_DATA_KMEM, folio);

- return (struct mem_cgroup *)(memcg_data & ~MEMCG_DATA_FLAGS_MASK);
+ return (struct mem_cgroup *)(memcg_data & ~OBJEXTS_FLAGS_MASK);
}

/*
@@ -411,7 +426,7 @@ static inline struct obj_cgroup *__folio_objcg(struct folio *folio)
VM_BUG_ON_FOLIO(memcg_data & MEMCG_DATA_OBJEXTS, folio);
VM_BUG_ON_FOLIO(!(memcg_data & MEMCG_DATA_KMEM), folio);

- return (struct obj_cgroup *)(memcg_data & ~MEMCG_DATA_FLAGS_MASK);
+ return (struct obj_cgroup *)(memcg_data & ~OBJEXTS_FLAGS_MASK);
}

/*
@@ -468,11 +483,11 @@ static inline struct mem_cgroup *folio_memcg_rcu(struct folio *folio)
if (memcg_data & MEMCG_DATA_KMEM) {
struct obj_cgroup *objcg;

- objcg = (void *)(memcg_data & ~MEMCG_DATA_FLAGS_MASK);
+ objcg = (void *)(memcg_data & ~OBJEXTS_FLAGS_MASK);
return obj_cgroup_memcg(objcg);
}

- return (struct mem_cgroup *)(memcg_data & ~MEMCG_DATA_FLAGS_MASK);
+ return (struct mem_cgroup *)(memcg_data & ~OBJEXTS_FLAGS_MASK);
}

/*
@@ -511,11 +526,11 @@ static inline struct mem_cgroup *folio_memcg_check(struct folio *folio)
if (memcg_data & MEMCG_DATA_KMEM) {
struct obj_cgroup *objcg;

- objcg = (void *)(memcg_data & ~MEMCG_DATA_FLAGS_MASK);
+ objcg = (void *)(memcg_data & ~OBJEXTS_FLAGS_MASK);
return obj_cgroup_memcg(objcg);
}

- return (struct mem_cgroup *)(memcg_data & ~MEMCG_DATA_FLAGS_MASK);
+ return (struct mem_cgroup *)(memcg_data & ~OBJEXTS_FLAGS_MASK);
}

static inline struct mem_cgroup *page_memcg_check(struct page *page)
diff --git a/mm/slab.h b/mm/slab.h
index f4ff635091e4..77cf7474fe46 100644
--- a/mm/slab.h
+++ b/mm/slab.h
@@ -560,10 +560,8 @@ static inline struct slabobj_ext *slab_obj_exts(struct slab *slab)
slab_page(slab));
VM_BUG_ON_PAGE(obj_exts & MEMCG_DATA_KMEM, slab_page(slab));

- return (struct slabobj_ext *)(obj_exts & ~MEMCG_DATA_FLAGS_MASK);
-#else
- return (struct slabobj_ext *)obj_exts;
#endif
+ return (struct slabobj_ext *)(obj_exts & ~OBJEXTS_FLAGS_MASK);
}

int alloc_slab_obj_exts(struct slab *slab, struct kmem_cache *s,
--
2.43.0.687.g38aa6559b0-goog
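The mask trick relies on slabobj_ext vectors (and memcg pointers) being
at least 8-byte aligned, which leaves the low bits of the word free to
carry flags. A minimal sketch using the names introduced above:

	/* Sketch: unpacking a flag-tagged obj_exts word. */
	unsigned long word = slab->obj_exts;
	unsigned long flags = word & OBJEXTS_FLAGS_MASK;	/* low bits */
	struct slabobj_ext *vec =
		(struct slabobj_ext *)(word & ~OBJEXTS_FLAGS_MASK);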

Suren Baghdasaryan
Feb 12, 2024, 4:39:56 PM

Add basic infrastructure to support code tagging. A code tag records
common information about a tagged code location: the module name,
function name, file name and line number. Provide functions to
register a new code tag type and to navigate between code tags.

Co-developed-by: Kent Overstreet <kent.ov...@linux.dev>
Signed-off-by: Kent Overstreet <kent.ov...@linux.dev>
Signed-off-by: Suren Baghdasaryan <sur...@google.com>
---
include/linux/codetag.h | 71 ++++++++++++++
lib/Kconfig.debug | 4 +
lib/Makefile | 1 +
lib/codetag.c | 199 ++++++++++++++++++++++++++++++++++++++++
4 files changed, 275 insertions(+)
create mode 100644 include/linux/codetag.h
create mode 100644 lib/codetag.c

diff --git a/include/linux/codetag.h b/include/linux/codetag.h
new file mode 100644
index 000000000000..a9d7adecc2a5
--- /dev/null
+++ b/include/linux/codetag.h
@@ -0,0 +1,71 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * code tagging framework
+ */
+#ifndef _LINUX_CODETAG_H
+#define _LINUX_CODETAG_H
+
+#include <linux/types.h>
+
+struct codetag_iterator;
+struct codetag_type;
+struct seq_buf;
+struct module;
+
+/*
+ * An instance of this structure is created in a special ELF section at every
+ * code location being tagged. At runtime, the special section is treated as
+ * an array of these.
+ */
+struct codetag {
+ unsigned int flags; /* used in later patches */
+ unsigned int lineno;
+ const char *modname;
+ const char *function;
+ const char *filename;
+} __aligned(8);
+
+union codetag_ref {
+ struct codetag *ct;
+};
+
+struct codetag_range {
+ struct codetag *start;
+ struct codetag *stop;
+};
+
+struct codetag_module {
+ struct module *mod;
+ struct codetag_range range;
+};
+
+struct codetag_type_desc {
+ const char *section;
+ size_t tag_size;
+};
+
+struct codetag_iterator {
+ struct codetag_type *cttype;
+ struct codetag_module *cmod;
+ unsigned long mod_id;
+ struct codetag *ct;
+};
+
+#define CODE_TAG_INIT { \
+ .modname = KBUILD_MODNAME, \
+ .function = __func__, \
+ .filename = __FILE__, \
+ .lineno = __LINE__, \
+ .flags = 0, \
+}
+
+void codetag_lock_module_list(struct codetag_type *cttype, bool lock);
+struct codetag_iterator codetag_get_ct_iter(struct codetag_type *cttype);
+struct codetag *codetag_next_ct(struct codetag_iterator *iter);
+
+void codetag_to_text(struct seq_buf *out, struct codetag *ct);
+
+struct codetag_type *
+codetag_register_type(const struct codetag_type_desc *desc);
+
+#endif /* _LINUX_CODETAG_H */
diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug
index 975a07f9f1cc..0be2d00c3696 100644
--- a/lib/Kconfig.debug
+++ b/lib/Kconfig.debug
@@ -968,6 +968,10 @@ config DEBUG_STACKOVERFLOW

If in doubt, say "N".

+config CODE_TAGGING
+ bool
+ select KALLSYMS
+
source "lib/Kconfig.kasan"
source "lib/Kconfig.kfence"
source "lib/Kconfig.kmsan"
diff --git a/lib/Makefile b/lib/Makefile
index 6b09731d8e61..6b48b22fdfac 100644
--- a/lib/Makefile
+++ b/lib/Makefile
@@ -235,6 +235,7 @@ obj-$(CONFIG_OF_RECONFIG_NOTIFIER_ERROR_INJECT) += \
of-reconfig-notifier-error-inject.o
obj-$(CONFIG_FUNCTION_ERROR_INJECTION) += error-inject.o

+obj-$(CONFIG_CODE_TAGGING) += codetag.o
lib-$(CONFIG_GENERIC_BUG) += bug.o

obj-$(CONFIG_HAVE_ARCH_TRACEHOOK) += syscall.o
diff --git a/lib/codetag.c b/lib/codetag.c
new file mode 100644
index 000000000000..7708f8388e55
--- /dev/null
+++ b/lib/codetag.c
@@ -0,0 +1,199 @@
+// SPDX-License-Identifier: GPL-2.0-only
+#include <linux/codetag.h>
+#include <linux/idr.h>
+#include <linux/kallsyms.h>
+#include <linux/module.h>
+#include <linux/seq_buf.h>
+#include <linux/slab.h>
+
+struct codetag_type {
+ struct list_head link;
+ unsigned int count;
+ struct idr mod_idr;
+ struct rw_semaphore mod_lock; /* protects mod_idr */
+ struct codetag_type_desc desc;
+};
+
+static DEFINE_MUTEX(codetag_lock);
+static LIST_HEAD(codetag_types);
+
+void codetag_lock_module_list(struct codetag_type *cttype, bool lock)
+{
+ if (lock)
+ down_read(&cttype->mod_lock);
+ else
+ up_read(&cttype->mod_lock);
+}
+
+struct codetag_iterator codetag_get_ct_iter(struct codetag_type *cttype)
+{
+ struct codetag_iterator iter = {
+ .cttype = cttype,
+ .cmod = NULL,
+ .mod_id = 0,
+ .ct = NULL,
+ };
+
+ return iter;
+}
+
+static inline struct codetag *get_first_module_ct(struct codetag_module *cmod)
+{
+ return cmod->range.start < cmod->range.stop ? cmod->range.start : NULL;
+}
+
+static inline
+struct codetag *get_next_module_ct(struct codetag_iterator *iter)
+{
+ struct codetag *res = (struct codetag *)
+ ((char *)iter->ct + iter->cttype->desc.tag_size);
+
+ return res < iter->cmod->range.stop ? res : NULL;
+}
+
+struct codetag *codetag_next_ct(struct codetag_iterator *iter)
+{
+ struct codetag_type *cttype = iter->cttype;
+ struct codetag_module *cmod;
+ struct codetag *ct;
+
+ lockdep_assert_held(&cttype->mod_lock);
+
+ if (unlikely(idr_is_empty(&cttype->mod_idr)))
+ return NULL;
+
+ ct = NULL;
+ while (true) {
+ cmod = idr_find(&cttype->mod_idr, iter->mod_id);
+
+ /* If module was removed move to the next one */
+ if (!cmod)
+ cmod = idr_get_next_ul(&cttype->mod_idr,
+ &iter->mod_id);
+
+ /* Exit if no more modules */
+ if (!cmod)
+ break;
+
+ if (cmod != iter->cmod) {
+ iter->cmod = cmod;
+ ct = get_first_module_ct(cmod);
+ } else
+ ct = get_next_module_ct(iter);
+
+ if (ct)
+ break;
+
+ iter->mod_id++;
+ }
+
+ iter->ct = ct;
+ return ct;
+}
+
+void codetag_to_text(struct seq_buf *out, struct codetag *ct)
+{
+ seq_buf_printf(out, "%s:%u module:%s func:%s",
+ ct->filename, ct->lineno,
+ ct->modname, ct->function);
+}
+
+static inline size_t range_size(const struct codetag_type *cttype,
+ const struct codetag_range *range)
+{
+ return ((char *)range->stop - (char *)range->start) /
+ cttype->desc.tag_size;
+}
+
+static void *get_symbol(struct module *mod, const char *prefix, const char *name)
+{
+ char buf[64];
+ int res;
+
+ res = snprintf(buf, sizeof(buf), "%s%s", prefix, name);
+ if (WARN_ON(res < 1 || res > sizeof(buf)))
+ return NULL;
+
+ return mod ?
+ (void *)find_kallsyms_symbol_value(mod, buf) :
+ (void *)kallsyms_lookup_name(buf);
+}
+
+static struct codetag_range get_section_range(struct module *mod,
+ const char *section)
+{
+ return (struct codetag_range) {
+ get_symbol(mod, "__start_", section),
+ get_symbol(mod, "__stop_", section),
+ };
+}
+
+static int codetag_module_init(struct codetag_type *cttype, struct module *mod)
+{
+ struct codetag_range range;
+ struct codetag_module *cmod;
+ int err;
+
+ range = get_section_range(mod, cttype->desc.section);
+ if (!range.start || !range.stop) {
+ pr_warn("Failed to load code tags of type %s from the module %s\n",
+ cttype->desc.section,
+ mod ? mod->name : "(built-in)");
+ return -EINVAL;
+ }
+
+ /* Ignore empty ranges */
+ if (range.start == range.stop)
+ return 0;
+
+ BUG_ON(range.start > range.stop);
+
+ cmod = kmalloc(sizeof(*cmod), GFP_KERNEL);
+ if (unlikely(!cmod))
+ return -ENOMEM;
+
+ cmod->mod = mod;
+ cmod->range = range;
+
+ down_write(&cttype->mod_lock);
+ err = idr_alloc(&cttype->mod_idr, cmod, 0, 0, GFP_KERNEL);
+ if (err >= 0)
+ cttype->count += range_size(cttype, &range);
+ up_write(&cttype->mod_lock);
+
+ if (err < 0) {
+ kfree(cmod);
+ return err;
+ }
+
+ return 0;
+}
+
+struct codetag_type *
+codetag_register_type(const struct codetag_type_desc *desc)
+{
+ struct codetag_type *cttype;
+ int err;
+
+ BUG_ON(desc->tag_size <= 0);
+
+ cttype = kzalloc(sizeof(*cttype), GFP_KERNEL);
+ if (unlikely(!cttype))
+ return ERR_PTR(-ENOMEM);
+
+ cttype->desc = *desc;
+ idr_init(&cttype->mod_idr);
+ init_rwsem(&cttype->mod_lock);
+
+ err = codetag_module_init(cttype, NULL);
+ if (unlikely(err)) {
+ kfree(cttype);
+ return ERR_PTR(err);
+ }
+
+ mutex_lock(&codetag_lock);
+ list_add_tail(&cttype->link, &codetag_types);
+ mutex_unlock(&codetag_lock);
+
+ return cttype;
+}
--
2.43.0.687.g38aa6559b0-goog
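Putting the API together, a hedged sketch of how a new tag type would
be registered and walked. The "my_tags" section and struct my_tag are
hypothetical (and the section would still need the __start_/__stop_
linker symbols); the calls are the ones introduced above:

	struct my_tag {
		struct codetag ct;	/* embedded common header */
		u64 hits;
	};

	static struct codetag_type *my_cttype;

	static int __init my_tags_init(void)
	{
		const struct codetag_type_desc desc = {
			.section  = "my_tags",
			.tag_size = sizeof(struct my_tag),
		};
		struct codetag_iterator iter;
		struct codetag *ct;

		my_cttype = codetag_register_type(&desc);
		if (IS_ERR(my_cttype))
			return PTR_ERR(my_cttype);

		/* walk every tag under the read side of mod_lock */
		codetag_lock_module_list(my_cttype, true);
		iter = codetag_get_ct_iter(my_cttype);
		while ((ct = codetag_next_ct(&iter)))
			pr_info("%s:%u\n", ct->filename, ct->lineno);
		codetag_lock_module_list(my_cttype, false);

		return 0;
	}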

Suren Baghdasaryan
Feb 12, 2024, 4:39:57 PM

Add support for code tagging from dynamically loaded modules.

Signed-off-by: Suren Baghdasaryan <sur...@google.com>
Co-developed-by: Kent Overstreet <kent.ov...@linux.dev>
Signed-off-by: Kent Overstreet <kent.ov...@linux.dev>
---
include/linux/codetag.h | 12 +++++++++
kernel/module/main.c | 4 +++
lib/codetag.c | 58 +++++++++++++++++++++++++++++++++++++++--
3 files changed, 72 insertions(+), 2 deletions(-)

diff --git a/include/linux/codetag.h b/include/linux/codetag.h
index a9d7adecc2a5..386733e89b31 100644
--- a/include/linux/codetag.h
+++ b/include/linux/codetag.h
@@ -42,6 +42,10 @@ struct codetag_module {
struct codetag_type_desc {
const char *section;
size_t tag_size;
+ void (*module_load)(struct codetag_type *cttype,
+ struct codetag_module *cmod);
+ void (*module_unload)(struct codetag_type *cttype,
+ struct codetag_module *cmod);
};

struct codetag_iterator {
@@ -68,4 +72,12 @@ void codetag_to_text(struct seq_buf *out, struct codetag *ct);
struct codetag_type *
codetag_register_type(const struct codetag_type_desc *desc);

+#ifdef CONFIG_CODE_TAGGING
+void codetag_load_module(struct module *mod);
+void codetag_unload_module(struct module *mod);
+#else
+static inline void codetag_load_module(struct module *mod) {}
+static inline void codetag_unload_module(struct module *mod) {}
+#endif
+
#endif /* _LINUX_CODETAG_H */
diff --git a/kernel/module/main.c b/kernel/module/main.c
index 36681911c05a..f400ba076cc7 100644
--- a/kernel/module/main.c
+++ b/kernel/module/main.c
@@ -56,6 +56,7 @@
#include <linux/dynamic_debug.h>
#include <linux/audit.h>
#include <linux/cfi.h>
+#include <linux/codetag.h>
#include <linux/debugfs.h>
#include <uapi/linux/module.h>
#include "internal.h"
@@ -1242,6 +1243,7 @@ static void free_module(struct module *mod)
{
trace_module_free(mod);

+ codetag_unload_module(mod);
mod_sysfs_teardown(mod);

/*
@@ -2978,6 +2980,8 @@ static int load_module(struct load_info *info, const char __user *uargs,
/* Get rid of temporary copy. */
free_copy(info, flags);

+ codetag_load_module(mod);
+
/* Done! */
trace_module_load(mod);

diff --git a/lib/codetag.c b/lib/codetag.c
index 7708f8388e55..4ea57fb37346 100644
--- a/lib/codetag.c
+++ b/lib/codetag.c
@@ -108,15 +108,20 @@ static inline size_t range_size(const struct codetag_type *cttype,
static void *get_symbol(struct module *mod, const char *prefix, const char *name)
{
char buf[64];
+ void *ret;
int res;

res = snprintf(buf, sizeof(buf), "%s%s", prefix, name);
if (WARN_ON(res < 1 || res > sizeof(buf)))
return NULL;

- return mod ?
+ preempt_disable();
+ ret = mod ?
(void *)find_kallsyms_symbol_value(mod, buf) :
(void *)kallsyms_lookup_name(buf);
+ preempt_enable();
+
+ return ret;
}

static struct codetag_range get_section_range(struct module *mod,
@@ -157,8 +162,11 @@ static int codetag_module_init(struct codetag_type *cttype, struct module *mod)

down_write(&cttype->mod_lock);
err = idr_alloc(&cttype->mod_idr, cmod, 0, 0, GFP_KERNEL);
- if (err >= 0)
+ if (err >= 0) {
cttype->count += range_size(cttype, &range);
+ if (cttype->desc.module_load)
+ cttype->desc.module_load(cttype, cmod);
+ }
up_write(&cttype->mod_lock);

if (err < 0) {
@@ -197,3 +205,49 @@ codetag_register_type(const struct codetag_type_desc *desc)

return cttype;
}
+
+void codetag_load_module(struct module *mod)
+{
+ struct codetag_type *cttype;
+
+ if (!mod)
+ return;
+
+ mutex_lock(&codetag_lock);
+ list_for_each_entry(cttype, &codetag_types, link)
+ codetag_module_init(cttype, mod);
+ mutex_unlock(&codetag_lock);
+}
+
+void codetag_unload_module(struct module *mod)
+{
+ struct codetag_type *cttype;
+
+ if (!mod)
+ return;
+
+ mutex_lock(&codetag_lock);
+ list_for_each_entry(cttype, &codetag_types, link) {
+ struct codetag_module *found = NULL;
+ struct codetag_module *cmod;
+ unsigned long mod_id, tmp;
+
+ down_write(&cttype->mod_lock);
+ idr_for_each_entry_ul(&cttype->mod_idr, cmod, tmp, mod_id) {
+ if (cmod->mod && cmod->mod == mod) {
+ found = cmod;
+ break;
+ }
+ }
+ if (found) {
+ if (cttype->desc.module_unload)
+ cttype->desc.module_unload(cttype, cmod);
+
+ cttype->count -= range_size(cttype, &cmod->range);
+ idr_remove(&cttype->mod_idr, mod_id);
+ kfree(cmod);
+ }
+ up_write(&cttype->mod_lock);
+ }
+ mutex_unlock(&codetag_lock);
+}
--
2.43.0.687.g38aa6559b0-goog
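A tag type that needs to react to module churn fills in the two new
callbacks. A hedged sketch, reusing the hypothetical my_tags type from
earlier:

	static void my_tags_module_load(struct codetag_type *cttype,
					struct codetag_module *cmod)
	{
		/* cmod->range.start..stop covers the new module's tags */
	}

	static void my_tags_module_unload(struct codetag_type *cttype,
					  struct codetag_module *cmod)
	{
		/* last chance to drop references into the departing module */
	}

	static const struct codetag_type_desc my_desc = {
		.section	= "my_tags",
		.tag_size	= sizeof(struct my_tag),
		.module_load	= my_tags_module_load,
		.module_unload	= my_tags_module_unload,
	};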

Suren Baghdasaryan
Feb 12, 2024, 4:40:00 PM

Skip freeing a module's data section if any of its allocation tags
still have outstanding allocations; otherwise, once those allocations
are freed, accessing their code tags would be a use-after-free.

Signed-off-by: Suren Baghdasaryan <sur...@google.com>
---
include/linux/codetag.h | 6 +++---
kernel/module/main.c | 23 +++++++++++++++--------
lib/codetag.c | 11 ++++++++---
3 files changed, 26 insertions(+), 14 deletions(-)

diff --git a/include/linux/codetag.h b/include/linux/codetag.h
index 386733e89b31..d98e4c8e86f0 100644
--- a/include/linux/codetag.h
+++ b/include/linux/codetag.h
@@ -44,7 +44,7 @@ struct codetag_type_desc {
size_t tag_size;
void (*module_load)(struct codetag_type *cttype,
struct codetag_module *cmod);
- void (*module_unload)(struct codetag_type *cttype,
+ bool (*module_unload)(struct codetag_type *cttype,
struct codetag_module *cmod);
};

@@ -74,10 +74,10 @@ codetag_register_type(const struct codetag_type_desc *desc);

#ifdef CONFIG_CODE_TAGGING
void codetag_load_module(struct module *mod);
-void codetag_unload_module(struct module *mod);
+bool codetag_unload_module(struct module *mod);
#else
static inline void codetag_load_module(struct module *mod) {}
-static inline void codetag_unload_module(struct module *mod) {}
+static inline bool codetag_unload_module(struct module *mod) { return true; }
#endif

#endif /* _LINUX_CODETAG_H */
diff --git a/kernel/module/main.c b/kernel/module/main.c
index f400ba076cc7..658b631e76ad 100644
--- a/kernel/module/main.c
+++ b/kernel/module/main.c
@@ -1211,15 +1211,19 @@ static void *module_memory_alloc(unsigned int size, enum mod_mem_type type)
return module_alloc(size);
}

-static void module_memory_free(void *ptr, enum mod_mem_type type)
+static void module_memory_free(void *ptr, enum mod_mem_type type,
+ bool unload_codetags)
{
+ if (!unload_codetags && mod_mem_type_is_core_data(type))
+ return;
+
if (mod_mem_use_vmalloc(type))
vfree(ptr);
else
module_memfree(ptr);
}

-static void free_mod_mem(struct module *mod)
+static void free_mod_mem(struct module *mod, bool unload_codetags)
{
for_each_mod_mem_type(type) {
struct module_memory *mod_mem = &mod->mem[type];
@@ -1230,20 +1234,23 @@ static void free_mod_mem(struct module *mod)
/* Free lock-classes; relies on the preceding sync_rcu(). */
lockdep_free_key_range(mod_mem->base, mod_mem->size);
if (mod_mem->size)
- module_memory_free(mod_mem->base, type);
+ module_memory_free(mod_mem->base, type,
+ unload_codetags);
}

/* MOD_DATA hosts mod, so free it at last */
lockdep_free_key_range(mod->mem[MOD_DATA].base, mod->mem[MOD_DATA].size);
- module_memory_free(mod->mem[MOD_DATA].base, MOD_DATA);
+ module_memory_free(mod->mem[MOD_DATA].base, MOD_DATA, unload_codetags);
}

/* Free a module, remove from lists, etc. */
static void free_module(struct module *mod)
{
+ bool unload_codetags;
+
trace_module_free(mod);

- codetag_unload_module(mod);
+ unload_codetags = codetag_unload_module(mod);
mod_sysfs_teardown(mod);

/*
@@ -1285,7 +1292,7 @@ static void free_module(struct module *mod)
kfree(mod->args);
percpu_modfree(mod);

- free_mod_mem(mod);
+ free_mod_mem(mod, unload_codetags);
}

void *__symbol_get(const char *symbol)
@@ -2298,7 +2305,7 @@ static int move_module(struct module *mod, struct load_info *info)
return 0;
out_enomem:
for (t--; t >= 0; t--)
- module_memory_free(mod->mem[t].base, t);
+ module_memory_free(mod->mem[t].base, t, true);
return ret;
}

@@ -2428,7 +2435,7 @@ static void module_deallocate(struct module *mod, struct load_info *info)
percpu_modfree(mod);
module_arch_freeing_init(mod);

- free_mod_mem(mod);
+ free_mod_mem(mod, true);
}

int __weak module_finalize(const Elf_Ehdr *hdr,
diff --git a/lib/codetag.c b/lib/codetag.c
index 4ea57fb37346..0ad4ea66c769 100644
--- a/lib/codetag.c
+++ b/lib/codetag.c
@@ -5,6 +5,7 @@
#include <linux/module.h>
#include <linux/seq_buf.h>
#include <linux/slab.h>
+#include <linux/vmalloc.h>

struct codetag_type {
struct list_head link;
@@ -219,12 +220,13 @@ void codetag_load_module(struct module *mod)
mutex_unlock(&codetag_lock);
}

-void codetag_unload_module(struct module *mod)
+bool codetag_unload_module(struct module *mod)
{
struct codetag_type *cttype;
+ bool unload_ok = true;

if (!mod)
- return;
+ return true;

mutex_lock(&codetag_lock);
list_for_each_entry(cttype, &codetag_types, link) {
@@ -241,7 +243,8 @@ void codetag_unload_module(struct module *mod)
}
if (found) {
if (cttype->desc.module_unload)
- cttype->desc.module_unload(cttype, cmod);
+ if (!cttype->desc.module_unload(cttype, cmod))
+ unload_ok = false;

cttype->count -= range_size(cttype, &cmod->range);
idr_remove(&cttype->mod_idr, mod_id);
@@ -250,4 +253,6 @@ void codetag_unload_module(struct module *mod)
up_write(&cttype->mod_lock);
}
mutex_unlock(&codetag_lock);
+
+ return unload_ok;
}
--
2.43.0.687.g38aa6559b0-goog
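The lifetime hazard being closed, sketched step by step:

	/*
	 * 1. module M makes an allocation; the allocation's codetag_ref
	 *    points at a code tag living in M's data section
	 * 2. M is unloaded while that allocation is still live
	 * 3. the eventual free follows ref->ct to update the tag's
	 *    counters: a use-after-free into M's freed data section
	 *
	 * With this patch, if module_unload() reports live tags (returns
	 * false), the module's core data sections are intentionally
	 * leaked instead of freed.
	 */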

Suren Baghdasaryan
Feb 12, 2024, 4:40:02 PM

Introduce CONFIG_MEM_ALLOC_PROFILING, which provides definitions to
easily instrument memory allocators. It registers an "alloc_tags"
codetag type and a /proc/allocinfo interface that outputs allocation
tag information when the feature is enabled.
CONFIG_MEM_ALLOC_PROFILING_DEBUG is provided for debugging the memory
allocation profiling instrumentation.
Memory allocation profiling can be enabled or disabled at runtime using
the /proc/sys/vm/mem_profiling sysctl when
CONFIG_MEM_ALLOC_PROFILING_DEBUG=n.
CONFIG_MEM_ALLOC_PROFILING_ENABLED_BY_DEFAULT enables memory allocation
profiling by default.

Signed-off-by: Suren Baghdasaryan <sur...@google.com>
Co-developed-by: Kent Overstreet <kent.ov...@linux.dev>
Signed-off-by: Kent Overstreet <kent.ov...@linux.dev>
---
Documentation/admin-guide/sysctl/vm.rst | 16 +++
Documentation/filesystems/proc.rst | 28 +++++
include/asm-generic/codetag.lds.h | 14 +++
include/asm-generic/vmlinux.lds.h | 3 +
include/linux/alloc_tag.h | 133 ++++++++++++++++++++
include/linux/sched.h | 24 ++++
lib/Kconfig.debug | 25 ++++
lib/Makefile | 2 +
lib/alloc_tag.c | 158 ++++++++++++++++++++++++
scripts/module.lds.S | 7 ++
10 files changed, 410 insertions(+)
create mode 100644 include/asm-generic/codetag.lds.h
create mode 100644 include/linux/alloc_tag.h
create mode 100644 lib/alloc_tag.c

diff --git a/Documentation/admin-guide/sysctl/vm.rst b/Documentation/admin-guide/sysctl/vm.rst
index c59889de122b..a214719492ea 100644
--- a/Documentation/admin-guide/sysctl/vm.rst
+++ b/Documentation/admin-guide/sysctl/vm.rst
@@ -43,6 +43,7 @@ Currently, these files are in /proc/sys/vm:
- legacy_va_layout
- lowmem_reserve_ratio
- max_map_count
+- mem_profiling (only if CONFIG_MEM_ALLOC_PROFILING=y)
- memory_failure_early_kill
- memory_failure_recovery
- min_free_kbytes
@@ -425,6 +426,21 @@ e.g., up to one or two maps per allocation.
The default value is 65530.


+mem_profiling
+==============
+
+Enable memory profiling (when CONFIG_MEM_ALLOC_PROFILING=y)
+
+1: Enable memory profiling.
+
+0: Disable memory profiling.
+
+Enabling memory profiling introduces a small performance overhead for all
+memory allocations.
+
+The default value depends on CONFIG_MEM_ALLOC_PROFILING_ENABLED_BY_DEFAULT.
+
+
memory_failure_early_kill:
==========================

diff --git a/Documentation/filesystems/proc.rst b/Documentation/filesystems/proc.rst
index 104c6d047d9b..40d6d18308e4 100644
--- a/Documentation/filesystems/proc.rst
+++ b/Documentation/filesystems/proc.rst
@@ -688,6 +688,7 @@ files are there, and which are missing.
============ ===============================================================
File Content
============ ===============================================================
+ allocinfo Memory allocation profiling information
apm Advanced power management info
bootconfig Kernel command line obtained from boot config,
and, if there were kernel parameters from the
@@ -953,6 +954,33 @@ also be allocatable although a lot of filesystem metadata may have to be
reclaimed to achieve this.


+allocinfo
+~~~~~~~~~
+
+Provides information about memory allocations at all locations in the code
+base. Each allocation in the code is identified by its source file, line
+number, module and the function calling the allocation. The number of
+bytes allocated and the number of calls made at each location are reported.
+
+Example output.
+
+::
+
+ > cat /proc/allocinfo
+
+ 153MiB mm/slub.c:1826 module:slub func:alloc_slab_page
+ 6.08MiB mm/slab_common.c:950 module:slab_common func:_kmalloc_order
+ 5.09MiB mm/memcontrol.c:2814 module:memcontrol func:alloc_slab_obj_exts
+ 4.54MiB mm/page_alloc.c:5777 module:page_alloc func:alloc_pages_exact
+ 1.32MiB include/asm-generic/pgalloc.h:63 module:pgtable func:__pte_alloc_one
+ 1.16MiB fs/xfs/xfs_log_priv.h:700 module:xfs func:xlog_kvmalloc
+ 1.00MiB mm/swap_cgroup.c:48 module:swap_cgroup func:swap_cgroup_prepare
+ 734KiB fs/xfs/kmem.c:20 module:xfs func:kmem_alloc
+ 640KiB kernel/rcu/tree.c:3184 module:tree func:fill_page_cache_func
+ 640KiB drivers/char/virtio_console.c:452 module:virtio_console func:alloc_buf
+ ...
+
+
meminfo
~~~~~~~

diff --git a/include/asm-generic/codetag.lds.h b/include/asm-generic/codetag.lds.h
new file mode 100644
index 000000000000..64f536b80380
--- /dev/null
+++ b/include/asm-generic/codetag.lds.h
@@ -0,0 +1,14 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+#ifndef __ASM_GENERIC_CODETAG_LDS_H
+#define __ASM_GENERIC_CODETAG_LDS_H
+
+#define SECTION_WITH_BOUNDARIES(_name) \
+ . = ALIGN(8); \
+ __start_##_name = .; \
+ KEEP(*(_name)) \
+ __stop_##_name = .;
+
+#define CODETAG_SECTIONS() \
+ SECTION_WITH_BOUNDARIES(alloc_tags)
+
+#endif /* __ASM_GENERIC_CODETAG_LDS_H */
diff --git a/include/asm-generic/vmlinux.lds.h b/include/asm-generic/vmlinux.lds.h
index 5dd3a61d673d..c9997dc50c50 100644
--- a/include/asm-generic/vmlinux.lds.h
+++ b/include/asm-generic/vmlinux.lds.h
@@ -50,6 +50,8 @@
* [__nosave_begin, __nosave_end] for the nosave data
*/

+#include <asm-generic/codetag.lds.h>
+
#ifndef LOAD_OFFSET
#define LOAD_OFFSET 0
#endif
@@ -366,6 +368,7 @@
. = ALIGN(8); \
BOUNDED_SECTION_BY(__dyndbg_classes, ___dyndbg_classes) \
BOUNDED_SECTION_BY(__dyndbg, ___dyndbg) \
+ CODETAG_SECTIONS() \
LIKELY_PROFILE() \
BRANCH_PROFILE() \
TRACE_PRINTKS() \
diff --git a/include/linux/alloc_tag.h b/include/linux/alloc_tag.h
new file mode 100644
index 000000000000..cf55a149fa84
--- /dev/null
+++ b/include/linux/alloc_tag.h
@@ -0,0 +1,133 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * allocation tagging
+ */
+#ifndef _LINUX_ALLOC_TAG_H
+#define _LINUX_ALLOC_TAG_H
+
+#include <linux/bug.h>
+#include <linux/codetag.h>
+#include <linux/container_of.h>
+#include <linux/preempt.h>
+#include <asm/percpu.h>
+#include <linux/cpumask.h>
+#include <linux/static_key.h>
+
+struct alloc_tag_counters {
+ u64 bytes;
+ u64 calls;
+};
+
+/*
+ * An instance of this structure is created in a special ELF section at every
+ * allocation callsite. At runtime, the special section is treated as
+ * an array of these. Embedded codetag utilizes codetag framework.
+ */
+struct alloc_tag {
+ struct codetag ct;
+ struct alloc_tag_counters __percpu *counters;
+} __aligned(8);
+
+#ifdef CONFIG_MEM_ALLOC_PROFILING
+
+static inline struct alloc_tag *ct_to_alloc_tag(struct codetag *ct)
+{
+ return container_of(ct, struct alloc_tag, ct);
+}
+
+#ifdef ARCH_NEEDS_WEAK_PER_CPU
+/*
+ * When percpu variables are required to be defined as weak, static percpu
+ * variables can't be used inside a function (see comments for DECLARE_PER_CPU_SECTION).
+ */
+#error "Memory allocation profiling is incompatible with ARCH_NEEDS_WEAK_PER_CPU"
+#endif
+
+#define DEFINE_ALLOC_TAG(_alloc_tag, _old) \
+ static DEFINE_PER_CPU(struct alloc_tag_counters, _alloc_tag_cntr); \
+ static struct alloc_tag _alloc_tag __used __aligned(8) \
+ __section("alloc_tags") = { \
+ .ct = CODE_TAG_INIT, \
+ .counters = &_alloc_tag_cntr }; \
+ struct alloc_tag * __maybe_unused _old = alloc_tag_save(&_alloc_tag)
+
+DECLARE_STATIC_KEY_MAYBE(CONFIG_MEM_ALLOC_PROFILING_ENABLED_BY_DEFAULT,
+ mem_alloc_profiling_key);
+
+static inline bool mem_alloc_profiling_enabled(void)
+{
+ return static_branch_maybe(CONFIG_MEM_ALLOC_PROFILING_ENABLED_BY_DEFAULT,
+ &mem_alloc_profiling_key);
+}
+
+static inline struct alloc_tag_counters alloc_tag_read(struct alloc_tag *tag)
+{
+ struct alloc_tag_counters v = { 0, 0 };
+ struct alloc_tag_counters *counter;
+ int cpu;
+
+ for_each_possible_cpu(cpu) {
+ counter = per_cpu_ptr(tag->counters, cpu);
+ v.bytes += counter->bytes;
+ v.calls += counter->calls;
+ }
+
+ return v;
+}
+
+static inline void __alloc_tag_sub(union codetag_ref *ref, size_t bytes)
+{
+ struct alloc_tag *tag;
+
+#ifdef CONFIG_MEM_ALLOC_PROFILING_DEBUG
+ WARN_ONCE(ref && !ref->ct, "alloc_tag was not set\n");
+#endif
+ if (!ref || !ref->ct)
+ return;
+
+ tag = ct_to_alloc_tag(ref->ct);
+
+ this_cpu_sub(tag->counters->bytes, bytes);
+ this_cpu_dec(tag->counters->calls);
+
+ ref->ct = NULL;
+}
+
+static inline void alloc_tag_sub(union codetag_ref *ref, size_t bytes)
+{
+ __alloc_tag_sub(ref, bytes);
+}
+
+static inline void alloc_tag_sub_noalloc(union codetag_ref *ref, size_t bytes)
+{
+ __alloc_tag_sub(ref, bytes);
+}
+
+static inline void alloc_tag_add(union codetag_ref *ref, struct alloc_tag *tag, size_t bytes)
+{
+#ifdef CONFIG_MEM_ALLOC_PROFILING_DEBUG
+ WARN_ONCE(ref && ref->ct,
+ "alloc_tag was not cleared (got tag for %s:%u)\n",\
+ ref->ct->filename, ref->ct->lineno);
+
+ WARN_ONCE(!tag, "current->alloc_tag not set");
+#endif
+ if (!ref || !tag)
+ return;
+
+ ref->ct = &tag->ct;
+ this_cpu_add(tag->counters->bytes, bytes);
+ this_cpu_inc(tag->counters->calls);
+}
+
+#else
+
+#define DEFINE_ALLOC_TAG(_alloc_tag, _old)
+static inline void alloc_tag_sub(union codetag_ref *ref, size_t bytes) {}
+static inline void alloc_tag_sub_noalloc(union codetag_ref *ref, size_t bytes) {}
+static inline void alloc_tag_add(union codetag_ref *ref, struct alloc_tag *tag,
+ size_t bytes) {}
+
+#endif
+
+#endif /* _LINUX_ALLOC_TAG_H */
diff --git a/include/linux/sched.h b/include/linux/sched.h
index ffe8f618ab86..da68a10517c8 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -770,6 +770,10 @@ struct task_struct {
unsigned int flags;
unsigned int ptrace;

+#ifdef CONFIG_MEM_ALLOC_PROFILING
+ struct alloc_tag *alloc_tag;
+#endif
+
#ifdef CONFIG_SMP
int on_cpu;
struct __call_single_node wake_entry;
@@ -810,6 +814,7 @@ struct task_struct {
struct task_group *sched_task_group;
#endif

+
#ifdef CONFIG_UCLAMP_TASK
/*
* Clamp values requested for a scheduling entity.
@@ -2183,4 +2188,23 @@ static inline int sched_core_idle_cpu(int cpu) { return idle_cpu(cpu); }

extern void sched_set_stop_task(int cpu, struct task_struct *stop);

+#ifdef CONFIG_MEM_ALLOC_PROFILING
+static inline struct alloc_tag *alloc_tag_save(struct alloc_tag *tag)
+{
+ swap(current->alloc_tag, tag);
+ return tag;
+}
+
+static inline void alloc_tag_restore(struct alloc_tag *tag, struct alloc_tag *old)
+{
+#ifdef CONFIG_MEM_ALLOC_PROFILING_DEBUG
+ WARN(current->alloc_tag != tag, "current->alloc_tag was changed:\n");
+#endif
+ current->alloc_tag = old;
+}
+#else
+static inline struct alloc_tag *alloc_tag_save(struct alloc_tag *tag) { return NULL; }
+#define alloc_tag_restore(_tag, _old)
+#endif
+
#endif
diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug
index 0be2d00c3696..78d258ca508f 100644
--- a/lib/Kconfig.debug
+++ b/lib/Kconfig.debug
@@ -972,6 +972,31 @@ config CODE_TAGGING
bool
select KALLSYMS

+config MEM_ALLOC_PROFILING
+ bool "Enable memory allocation profiling"
+ default n
+ depends on PROC_FS
+ depends on !DEBUG_FORCE_WEAK_PER_CPU
+ select CODE_TAGGING
+ help
+ Track allocation source code and record total allocation size
+ initiated at that code location. The mechanism can be used to track
+ memory leaks with a low performance and memory impact.
+
+config MEM_ALLOC_PROFILING_ENABLED_BY_DEFAULT
+ bool "Enable memory allocation profiling by default"
+ default y
+ depends on MEM_ALLOC_PROFILING
+
+config MEM_ALLOC_PROFILING_DEBUG
+ bool "Memory allocation profiler debugging"
+ default n
+ depends on MEM_ALLOC_PROFILING
+ select MEM_ALLOC_PROFILING_ENABLED_BY_DEFAULT
+ help
+ Adds warnings with helpful error messages for memory allocation
+ profiling.
+
source "lib/Kconfig.kasan"
source "lib/Kconfig.kfence"
source "lib/Kconfig.kmsan"
diff --git a/lib/Makefile b/lib/Makefile
index 6b48b22fdfac..859112f09bf5 100644
--- a/lib/Makefile
+++ b/lib/Makefile
@@ -236,6 +236,8 @@ obj-$(CONFIG_OF_RECONFIG_NOTIFIER_ERROR_INJECT) += \
obj-$(CONFIG_FUNCTION_ERROR_INJECTION) += error-inject.o

obj-$(CONFIG_CODE_TAGGING) += codetag.o
+obj-$(CONFIG_MEM_ALLOC_PROFILING) += alloc_tag.o
+
lib-$(CONFIG_GENERIC_BUG) += bug.o

obj-$(CONFIG_HAVE_ARCH_TRACEHOOK) += syscall.o
diff --git a/lib/alloc_tag.c b/lib/alloc_tag.c
new file mode 100644
index 000000000000..4fc031f9cefd
--- /dev/null
+++ b/lib/alloc_tag.c
@@ -0,0 +1,158 @@
+// SPDX-License-Identifier: GPL-2.0-only
+#include <linux/alloc_tag.h>
+#include <linux/fs.h>
+#include <linux/gfp.h>
+#include <linux/module.h>
+#include <linux/proc_fs.h>
+#include <linux/seq_buf.h>
+#include <linux/seq_file.h>
+
+static struct codetag_type *alloc_tag_cttype;
+
+DEFINE_STATIC_KEY_MAYBE(CONFIG_MEM_ALLOC_PROFILING_ENABLED_BY_DEFAULT,
+ mem_alloc_profiling_key);
+
+static void *allocinfo_start(struct seq_file *m, loff_t *pos)
+{
+ struct codetag_iterator *iter;
+ struct codetag *ct;
+ loff_t node = *pos;
+
+ iter = kzalloc(sizeof(*iter), GFP_KERNEL);
+ m->private = iter;
+ if (!iter)
+ return NULL;
+
+ codetag_lock_module_list(alloc_tag_cttype, true);
+ *iter = codetag_get_ct_iter(alloc_tag_cttype);
+ while ((ct = codetag_next_ct(iter)) != NULL && node)
+ node--;
+
+ return ct ? iter : NULL;
+}
+
+static void *allocinfo_next(struct seq_file *m, void *arg, loff_t *pos)
+{
+ struct codetag_iterator *iter = (struct codetag_iterator *)arg;
+ struct codetag *ct = codetag_next_ct(iter);
+
+ (*pos)++;
+ if (!ct)
+ return NULL;
+
+ return iter;
+}
+
+static void allocinfo_stop(struct seq_file *m, void *arg)
+{
+ struct codetag_iterator *iter = (struct codetag_iterator *)m->private;
+
+ if (iter) {
+ codetag_lock_module_list(alloc_tag_cttype, false);
+ kfree(iter);
+ }
+}
+
+static void alloc_tag_to_text(struct seq_buf *out, struct codetag *ct)
+{
+ struct alloc_tag *tag = ct_to_alloc_tag(ct);
+ struct alloc_tag_counters counter = alloc_tag_read(tag);
+ s64 bytes = counter.bytes;
+ char val[10], *p = val;
+
+ if (bytes < 0) {
+ *p++ = '-';
+ bytes = -bytes;
+ }
+
+ string_get_size(bytes, 1,
+ STRING_SIZE_BASE2|STRING_SIZE_NOSPACE,
+ p, val + ARRAY_SIZE(val) - p);
+
+ seq_buf_printf(out, "%8s %8llu ", val, counter.calls);
+ codetag_to_text(out, ct);
+ seq_buf_putc(out, ' ');
+ seq_buf_putc(out, '\n');
+}
+
+static int allocinfo_show(struct seq_file *m, void *arg)
+{
+ struct codetag_iterator *iter = (struct codetag_iterator *)arg;
+ char *bufp;
+ size_t n = seq_get_buf(m, &bufp);
+ struct seq_buf buf;
+
+ seq_buf_init(&buf, bufp, n);
+ alloc_tag_to_text(&buf, iter->ct);
+ seq_commit(m, seq_buf_used(&buf));
+ return 0;
+}
+
+static const struct seq_operations allocinfo_seq_op = {
+ .start = allocinfo_start,
+ .next = allocinfo_next,
+ .stop = allocinfo_stop,
+ .show = allocinfo_show,
+};
+
+static void __init procfs_init(void)
+{
+ proc_create_seq("allocinfo", 0444, NULL, &allocinfo_seq_op);
+}
+
+static bool alloc_tag_module_unload(struct codetag_type *cttype,
+ struct codetag_module *cmod)
+{
+ struct codetag_iterator iter = codetag_get_ct_iter(cttype);
+ struct alloc_tag_counters counter;
+ bool module_unused = true;
+ struct alloc_tag *tag;
+ struct codetag *ct;
+
+ for (ct = codetag_next_ct(&iter); ct; ct = codetag_next_ct(&iter)) {
+ if (iter.cmod != cmod)
+ continue;
+
+ tag = ct_to_alloc_tag(ct);
+ counter = alloc_tag_read(tag);
+
+ if (WARN(counter.bytes, "%s:%u module %s func:%s has %llu allocated at module unload",
+ ct->filename, ct->lineno, ct->modname, ct->function, counter.bytes))
+ module_unused = false;
+ }
+
+ return module_unused;
+}
+
+static struct ctl_table memory_allocation_profiling_sysctls[] = {
+ {
+ .procname = "mem_profiling",
+ .data = &mem_alloc_profiling_key,
+#ifdef CONFIG_MEM_ALLOC_PROFILING_DEBUG
+ .mode = 0444,
+#else
+ .mode = 0644,
+#endif
+ .proc_handler = proc_do_static_key,
+ },
+ { }
+};
+
+static int __init alloc_tag_init(void)
+{
+ const struct codetag_type_desc desc = {
+ .section = "alloc_tags",
+ .tag_size = sizeof(struct alloc_tag),
+ .module_unload = alloc_tag_module_unload,
+ };
+
+ alloc_tag_cttype = codetag_register_type(&desc);
+ if (IS_ERR_OR_NULL(alloc_tag_cttype))
+ return PTR_ERR(alloc_tag_cttype);
+
+ register_sysctl_init("vm", memory_allocation_profiling_sysctls);
+ procfs_init();
+
+ return 0;
+}
+module_init(alloc_tag_init);
diff --git a/scripts/module.lds.S b/scripts/module.lds.S
index bf5bcf2836d8..45c67a0994f3 100644
--- a/scripts/module.lds.S
+++ b/scripts/module.lds.S
@@ -9,6 +9,8 @@
#define DISCARD_EH_FRAME *(.eh_frame)
#endif

+#include <asm-generic/codetag.lds.h>
+
SECTIONS {
/DISCARD/ : {
*(.discard)
@@ -47,12 +49,17 @@ SECTIONS {
.data : {
*(.data .data.[0-9a-zA-Z_]*)
*(.data..L*)
+ CODETAG_SECTIONS()
}

.rodata : {
*(.rodata .rodata.[0-9a-zA-Z_]*)
*(.rodata..L*)
}
+#else
+ .data : {
+ CODETAG_SECTIONS()
+ }
#endif
}

--
2.43.0.687.g38aa6559b0-goog
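For orientation, the intended per-callsite usage looks roughly like the
following hedged sketch (my_kmalloc is hypothetical; the series' real
mechanism is the alloc_hooks() wrapper added in later patches). Each
expansion of DEFINE_ALLOC_TAG() creates one tag in the "alloc_tags"
section, and the instrumented allocator charges it via
current->alloc_tag:

	/* Sketch: tagging one allocation callsite. */
	#define my_kmalloc(size, flags)					\
	({								\
		DEFINE_ALLOC_TAG(_alloc_tag, _old); /* define + save tag */ \
		void *_res = kmalloc(size, flags);			\
		alloc_tag_restore(&_alloc_tag, _old);			\
		_res;							\
	})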

Suren Baghdasaryan
Feb 12, 2024, 4:40:04 PM

Introduce helper functions to easily instrument page allocators by
storing, in a page_ext field, a pointer to the allocation tag
associated with the code that allocated the page.

Signed-off-by: Suren Baghdasaryan <sur...@google.com>
Co-developed-by: Kent Overstreet <kent.ov...@linux.dev>
Signed-off-by: Kent Overstreet <kent.ov...@linux.dev>
---
include/linux/page_ext.h | 1 -
include/linux/pgalloc_tag.h | 73 +++++++++++++++++++++++++++++++++++++
lib/Kconfig.debug | 1 +
lib/alloc_tag.c | 17 +++++++++
mm/mm_init.c | 1 +
mm/page_alloc.c | 4 ++
mm/page_ext.c | 4 ++
7 files changed, 100 insertions(+), 1 deletion(-)
create mode 100644 include/linux/pgalloc_tag.h
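
For orientation, a hedged sketch of the read side these helpers enable;
the codetag_ref lives in the page's page_ext, and the get/put pair
brackets the page_ext reference:

	/* Sketch: report which callsite allocated a given page. */
	union codetag_ref *ref = get_page_tag_ref(page);

	if (ref) {
		if (ref->ct)
			pr_info("page allocated at %s:%u\n",
				ref->ct->filename, ref->ct->lineno);
		put_page_tag_ref(ref);
	}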

diff --git a/include/linux/page_ext.h b/include/linux/page_ext.h
index be98564191e6..07e0656898f9 100644
--- a/include/linux/page_ext.h
+++ b/include/linux/page_ext.h
@@ -4,7 +4,6 @@

#include <linux/types.h>
#include <linux/stacktrace.h>
-#include <linux/stackdepot.h>

struct pglist_data;

diff --git a/include/linux/pgalloc_tag.h b/include/linux/pgalloc_tag.h
new file mode 100644
index 000000000000..a060c26eb449
--- /dev/null
+++ b/include/linux/pgalloc_tag.h
@@ -0,0 +1,73 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * page allocation tagging
+ */
+#ifndef _LINUX_PGALLOC_TAG_H
+#define _LINUX_PGALLOC_TAG_H
+
+#include <linux/alloc_tag.h>
+
+#ifdef CONFIG_MEM_ALLOC_PROFILING
+
+#include <linux/page_ext.h>
+
+extern struct page_ext_operations page_alloc_tagging_ops;
+extern struct page_ext *page_ext_get(struct page *page);
+extern void page_ext_put(struct page_ext *page_ext);
+
+static inline union codetag_ref *codetag_ref_from_page_ext(struct page_ext *page_ext)
+{
+ return (void *)page_ext + page_alloc_tagging_ops.offset;
+}
+
+static inline struct page_ext *page_ext_from_codetag_ref(union codetag_ref *ref)
+{
+ return (void *)ref - page_alloc_tagging_ops.offset;
+}
+
+static inline union codetag_ref *get_page_tag_ref(struct page *page)
+{
+ if (page && mem_alloc_profiling_enabled()) {
+ struct page_ext *page_ext = page_ext_get(page);
+
+ if (page_ext)
+ return codetag_ref_from_page_ext(page_ext);
+ }
+ return NULL;
+}
+
+static inline void put_page_tag_ref(union codetag_ref *ref)
+{
+ page_ext_put(page_ext_from_codetag_ref(ref));
+}
+
+static inline void pgalloc_tag_add(struct page *page, struct task_struct *task,
+ unsigned int order)
+{
+ union codetag_ref *ref = get_page_tag_ref(page);
+
+ if (ref) {
+ alloc_tag_add(ref, task->alloc_tag, PAGE_SIZE << order);
+ put_page_tag_ref(ref);
+ }
+}
+
+static inline void pgalloc_tag_sub(struct page *page, unsigned int order)
+{
+ union codetag_ref *ref = get_page_tag_ref(page);
+
+ if (ref) {
+ alloc_tag_sub(ref, PAGE_SIZE << order);
+ put_page_tag_ref(ref);
+ }
+}
+
+#else /* CONFIG_MEM_ALLOC_PROFILING */
+
+static inline void pgalloc_tag_add(struct page *page, struct task_struct *task,
+ unsigned int order) {}
+static inline void pgalloc_tag_sub(struct page *page, unsigned int order) {}
+
+#endif /* CONFIG_MEM_ALLOC_PROFILING */
+
+#endif /* _LINUX_PGALLOC_TAG_H */
diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug
index 78d258ca508f..7bbdb0ddb011 100644
--- a/lib/Kconfig.debug
+++ b/lib/Kconfig.debug
@@ -978,6 +978,7 @@ config MEM_ALLOC_PROFILING
depends on PROC_FS
depends on !DEBUG_FORCE_WEAK_PER_CPU
select CODE_TAGGING
+ select PAGE_EXTENSION
help
Track allocation source code and record total allocation size
initiated at that code location. The mechanism can be used to track
diff --git a/lib/alloc_tag.c b/lib/alloc_tag.c
index 4fc031f9cefd..2d5226d9262d 100644
--- a/lib/alloc_tag.c
+++ b/lib/alloc_tag.c
@@ -3,6 +3,7 @@
#include <linux/fs.h>
#include <linux/gfp.h>
#include <linux/module.h>
+#include <linux/page_ext.h>
#include <linux/proc_fs.h>
#include <linux/seq_buf.h>
#include <linux/seq_file.h>
@@ -124,6 +125,22 @@ static bool alloc_tag_module_unload(struct codetag_type *cttype,
return module_unused;
}

+static __init bool need_page_alloc_tagging(void)
+{
+ return true;
+}
+
+static __init void init_page_alloc_tagging(void)
+{
+}
+
+struct page_ext_operations page_alloc_tagging_ops = {
+ .size = sizeof(union codetag_ref),
+ .need = need_page_alloc_tagging,
+ .init = init_page_alloc_tagging,
+};
+EXPORT_SYMBOL(page_alloc_tagging_ops);
+
static struct ctl_table memory_allocation_profiling_sysctls[] = {
{
.procname = "mem_profiling",
diff --git a/mm/mm_init.c b/mm/mm_init.c
index 2c19f5515e36..e9ea2919d02d 100644
--- a/mm/mm_init.c
+++ b/mm/mm_init.c
@@ -24,6 +24,7 @@
#include <linux/page_ext.h>
#include <linux/pti.h>
#include <linux/pgtable.h>
+#include <linux/stackdepot.h>
#include <linux/swap.h>
#include <linux/cma.h>
#include <linux/crash_dump.h>
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 150d4f23b010..edb79a55a252 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -53,6 +53,7 @@
#include <linux/khugepaged.h>
#include <linux/delayacct.h>
#include <linux/cacheinfo.h>
+#include <linux/pgalloc_tag.h>
#include <asm/div64.h>
#include "internal.h"
#include "shuffle.h"
@@ -1100,6 +1101,7 @@ static __always_inline bool free_pages_prepare(struct page *page,
/* Do not let hwpoison pages hit pcplists/buddy */
reset_page_owner(page, order);
page_table_check_free(page, order);
+ pgalloc_tag_sub(page, order);
return false;
}

@@ -1139,6 +1141,7 @@ static __always_inline bool free_pages_prepare(struct page *page,
page->flags &= ~PAGE_FLAGS_CHECK_AT_PREP;
reset_page_owner(page, order);
page_table_check_free(page, order);
+ pgalloc_tag_sub(page, order);

if (!PageHighMem(page)) {
debug_check_no_locks_freed(page_address(page),
@@ -1532,6 +1535,7 @@ inline void post_alloc_hook(struct page *page, unsigned int order,

set_page_owner(page, order, gfp_flags);
page_table_check_alloc(page, order);
+ pgalloc_tag_add(page, current, order);
}

static void prep_new_page(struct page *page, unsigned int order, gfp_t gfp_flags,
diff --git a/mm/page_ext.c b/mm/page_ext.c
index 4548fcc66d74..3c58fe8a24df 100644
--- a/mm/page_ext.c
+++ b/mm/page_ext.c
@@ -10,6 +10,7 @@
#include <linux/page_idle.h>
#include <linux/page_table_check.h>
#include <linux/rcupdate.h>
+#include <linux/pgalloc_tag.h>

/*
* struct page extension
@@ -82,6 +83,9 @@ static struct page_ext_operations *page_ext_ops[] __initdata = {
#if defined(CONFIG_PAGE_IDLE_FLAG) && !defined(CONFIG_64BIT)
&page_idle_ops,
#endif
+#ifdef CONFIG_MEM_ALLOC_PROFILING
+ &page_alloc_tagging_ops,
+#endif
#ifdef CONFIG_PAGE_TABLE_CHECK
&page_table_check_ops,
#endif
--
2.43.0.687.g38aa6559b0-goog

Suren Baghdasaryan
Feb 12, 2024, 4:40:06 PM
Each allocation tag generates a per-cpu variable, so more space is
required to store them. Increase PERCPU_MODULE_RESERVE to provide
enough room. A better long-term solution would be to allocate this
memory dynamically.
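
For scale: the reserve grows from 8 << 10 = 8 KiB to 8 << 12 = 32 KiB,
a 4x increase, whenever CONFIG_MEM_ALLOC_PROFILING is enabled.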

Signed-off-by: Suren Baghdasaryan <sur...@google.com>
Signed-off-by: Kent Overstreet <kent.ov...@linux.dev>
Cc: Peter Zijlstra <pet...@infradead.org>
Cc: Tejun Heo <t...@kernel.org>
---
include/linux/percpu.h | 4 ++++
1 file changed, 4 insertions(+)

diff --git a/include/linux/percpu.h b/include/linux/percpu.h
index 8c677f185901..62b5eb45bd89 100644
--- a/include/linux/percpu.h
+++ b/include/linux/percpu.h
@@ -14,7 +14,11 @@

/* enough to cover all DEFINE_PER_CPUs in modules */
#ifdef CONFIG_MODULES
+#ifdef CONFIG_MEM_ALLOC_PROFILING
+#define PERCPU_MODULE_RESERVE (8 << 12)
+#else
#define PERCPU_MODULE_RESERVE (8 << 10)
+#endif
#else
#define PERCPU_MODULE_RESERVE 0
#endif
--
2.43.0.687.g38aa6559b0-goog

Suren Baghdasaryan
Feb 12, 2024, 4:40:08 PM
After redefining alloc_pages as a macro, all uses of that name are
replaced by the preprocessor. Rename the conflicting names so that the
preprocessor does not replace them where no expansion is intended.
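
A self-contained sketch of the collision, with toy names standing in
for the real ones (an illustration, not kernel code): a function-like
macro expands only where its name is followed by '(', which is exactly
what happens at a member call site such as ops->alloc_pages(...).

#include <stdio.h>

static int alloc_pages_real(int order)
{
	return 1 << order;
}

/* Profiling wrapper in the style of this series. */
#define alloc_pages(...) (puts("hooked"), alloc_pages_real(__VA_ARGS__))

struct ops {
	/*
	 * Had this member kept the name alloc_pages, its declaration and
	 * initializer would still be fine (no '(' follows the name), but
	 * a call such as o.alloc_pages(3) would expand into
	 * o.(puts("hooked"), alloc_pages_real(3)), which does not
	 * compile. Renaming the member sidesteps the expansion.
	 */
	int (*alloc_pages_op)(int order);
};

int main(void)
{
	struct ops o = { .alloc_pages_op = alloc_pages_real };

	printf("%d\n", o.alloc_pages_op(3));	/* 8, macro not involved */
	printf("%d\n", alloc_pages(3));		/* wrapper runs, then 8 */
	return 0;
}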

Signed-off-by: Suren Baghdasaryan <sur...@google.com>
---
arch/alpha/kernel/pci_iommu.c | 2 +-
arch/mips/jazz/jazzdma.c | 2 +-
arch/powerpc/kernel/dma-iommu.c | 2 +-
arch/powerpc/platforms/ps3/system-bus.c | 4 ++--
arch/powerpc/platforms/pseries/vio.c | 2 +-
arch/x86/kernel/amd_gart_64.c | 2 +-
drivers/iommu/dma-iommu.c | 2 +-
drivers/parisc/ccio-dma.c | 2 +-
drivers/parisc/sba_iommu.c | 2 +-
drivers/xen/grant-dma-ops.c | 2 +-
drivers/xen/swiotlb-xen.c | 2 +-
include/linux/dma-map-ops.h | 2 +-
kernel/dma/mapping.c | 4 ++--
13 files changed, 15 insertions(+), 15 deletions(-)

diff --git a/arch/alpha/kernel/pci_iommu.c b/arch/alpha/kernel/pci_iommu.c
index c81183935e97..7fcf3e9b7103 100644
--- a/arch/alpha/kernel/pci_iommu.c
+++ b/arch/alpha/kernel/pci_iommu.c
@@ -929,7 +929,7 @@ const struct dma_map_ops alpha_pci_ops = {
.dma_supported = alpha_pci_supported,
.mmap = dma_common_mmap,
.get_sgtable = dma_common_get_sgtable,
- .alloc_pages = dma_common_alloc_pages,
+ .alloc_pages_op = dma_common_alloc_pages,
.free_pages = dma_common_free_pages,
};
EXPORT_SYMBOL(alpha_pci_ops);
diff --git a/arch/mips/jazz/jazzdma.c b/arch/mips/jazz/jazzdma.c
index eabddb89d221..c97b089b9902 100644
--- a/arch/mips/jazz/jazzdma.c
+++ b/arch/mips/jazz/jazzdma.c
@@ -617,7 +617,7 @@ const struct dma_map_ops jazz_dma_ops = {
.sync_sg_for_device = jazz_dma_sync_sg_for_device,
.mmap = dma_common_mmap,
.get_sgtable = dma_common_get_sgtable,
- .alloc_pages = dma_common_alloc_pages,
+ .alloc_pages_op = dma_common_alloc_pages,
.free_pages = dma_common_free_pages,
};
EXPORT_SYMBOL(jazz_dma_ops);
diff --git a/arch/powerpc/kernel/dma-iommu.c b/arch/powerpc/kernel/dma-iommu.c
index 8920862ffd79..f0ae39e77e37 100644
--- a/arch/powerpc/kernel/dma-iommu.c
+++ b/arch/powerpc/kernel/dma-iommu.c
@@ -216,6 +216,6 @@ const struct dma_map_ops dma_iommu_ops = {
.get_required_mask = dma_iommu_get_required_mask,
.mmap = dma_common_mmap,
.get_sgtable = dma_common_get_sgtable,
- .alloc_pages = dma_common_alloc_pages,
+ .alloc_pages_op = dma_common_alloc_pages,
.free_pages = dma_common_free_pages,
};
diff --git a/arch/powerpc/platforms/ps3/system-bus.c b/arch/powerpc/platforms/ps3/system-bus.c
index d6b5f5ecd515..56dc6b29a3e7 100644
--- a/arch/powerpc/platforms/ps3/system-bus.c
+++ b/arch/powerpc/platforms/ps3/system-bus.c
@@ -695,7 +695,7 @@ static const struct dma_map_ops ps3_sb_dma_ops = {
.unmap_page = ps3_unmap_page,
.mmap = dma_common_mmap,
.get_sgtable = dma_common_get_sgtable,
- .alloc_pages = dma_common_alloc_pages,
+ .alloc_pages_op = dma_common_alloc_pages,
.free_pages = dma_common_free_pages,
};

@@ -709,7 +709,7 @@ static const struct dma_map_ops ps3_ioc0_dma_ops = {
.unmap_page = ps3_unmap_page,
.mmap = dma_common_mmap,
.get_sgtable = dma_common_get_sgtable,
- .alloc_pages = dma_common_alloc_pages,
+ .alloc_pages_op = dma_common_alloc_pages,
.free_pages = dma_common_free_pages,
};

diff --git a/arch/powerpc/platforms/pseries/vio.c b/arch/powerpc/platforms/pseries/vio.c
index 2dc9cbc4bcd8..0c90fc4c3796 100644
--- a/arch/powerpc/platforms/pseries/vio.c
+++ b/arch/powerpc/platforms/pseries/vio.c
@@ -611,7 +611,7 @@ static const struct dma_map_ops vio_dma_mapping_ops = {
.get_required_mask = dma_iommu_get_required_mask,
.mmap = dma_common_mmap,
.get_sgtable = dma_common_get_sgtable,
- .alloc_pages = dma_common_alloc_pages,
+ .alloc_pages_op = dma_common_alloc_pages,
.free_pages = dma_common_free_pages,
};

diff --git a/arch/x86/kernel/amd_gart_64.c b/arch/x86/kernel/amd_gart_64.c
index 2ae98f754e59..c884deca839b 100644
--- a/arch/x86/kernel/amd_gart_64.c
+++ b/arch/x86/kernel/amd_gart_64.c
@@ -676,7 +676,7 @@ static const struct dma_map_ops gart_dma_ops = {
.get_sgtable = dma_common_get_sgtable,
.dma_supported = dma_direct_supported,
.get_required_mask = dma_direct_get_required_mask,
- .alloc_pages = dma_direct_alloc_pages,
+ .alloc_pages_op = dma_direct_alloc_pages,
.free_pages = dma_direct_free_pages,
};

diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c
index 50ccc4f1ef81..8a1f7f5d1bca 100644
--- a/drivers/iommu/dma-iommu.c
+++ b/drivers/iommu/dma-iommu.c
@@ -1710,7 +1710,7 @@ static const struct dma_map_ops iommu_dma_ops = {
.flags = DMA_F_PCI_P2PDMA_SUPPORTED,
.alloc = iommu_dma_alloc,
.free = iommu_dma_free,
- .alloc_pages = dma_common_alloc_pages,
+ .alloc_pages_op = dma_common_alloc_pages,
.free_pages = dma_common_free_pages,
.alloc_noncontiguous = iommu_dma_alloc_noncontiguous,
.free_noncontiguous = iommu_dma_free_noncontiguous,
diff --git a/drivers/parisc/ccio-dma.c b/drivers/parisc/ccio-dma.c
index 9ce0d20a6c58..feef537257d0 100644
--- a/drivers/parisc/ccio-dma.c
+++ b/drivers/parisc/ccio-dma.c
@@ -1022,7 +1022,7 @@ static const struct dma_map_ops ccio_ops = {
.map_sg = ccio_map_sg,
.unmap_sg = ccio_unmap_sg,
.get_sgtable = dma_common_get_sgtable,
- .alloc_pages = dma_common_alloc_pages,
+ .alloc_pages_op = dma_common_alloc_pages,
.free_pages = dma_common_free_pages,
};

diff --git a/drivers/parisc/sba_iommu.c b/drivers/parisc/sba_iommu.c
index 784037837f65..fc3863c09f83 100644
--- a/drivers/parisc/sba_iommu.c
+++ b/drivers/parisc/sba_iommu.c
@@ -1090,7 +1090,7 @@ static const struct dma_map_ops sba_ops = {
.map_sg = sba_map_sg,
.unmap_sg = sba_unmap_sg,
.get_sgtable = dma_common_get_sgtable,
- .alloc_pages = dma_common_alloc_pages,
+ .alloc_pages_op = dma_common_alloc_pages,
.free_pages = dma_common_free_pages,
};

diff --git a/drivers/xen/grant-dma-ops.c b/drivers/xen/grant-dma-ops.c
index 76f6f26265a3..29257d2639db 100644
--- a/drivers/xen/grant-dma-ops.c
+++ b/drivers/xen/grant-dma-ops.c
@@ -282,7 +282,7 @@ static int xen_grant_dma_supported(struct device *dev, u64 mask)
static const struct dma_map_ops xen_grant_dma_ops = {
.alloc = xen_grant_dma_alloc,
.free = xen_grant_dma_free,
- .alloc_pages = xen_grant_dma_alloc_pages,
+ .alloc_pages_op = xen_grant_dma_alloc_pages,
.free_pages = xen_grant_dma_free_pages,
.mmap = dma_common_mmap,
.get_sgtable = dma_common_get_sgtable,
diff --git a/drivers/xen/swiotlb-xen.c b/drivers/xen/swiotlb-xen.c
index 0e6c6c25d154..1c4ef5111651 100644
--- a/drivers/xen/swiotlb-xen.c
+++ b/drivers/xen/swiotlb-xen.c
@@ -403,7 +403,7 @@ const struct dma_map_ops xen_swiotlb_dma_ops = {
.dma_supported = xen_swiotlb_dma_supported,
.mmap = dma_common_mmap,
.get_sgtable = dma_common_get_sgtable,
- .alloc_pages = dma_common_alloc_pages,
+ .alloc_pages_op = dma_common_alloc_pages,
.free_pages = dma_common_free_pages,
.max_mapping_size = swiotlb_max_mapping_size,
};
diff --git a/include/linux/dma-map-ops.h b/include/linux/dma-map-ops.h
index 4abc60f04209..9ee319851b5f 100644
--- a/include/linux/dma-map-ops.h
+++ b/include/linux/dma-map-ops.h
@@ -29,7 +29,7 @@ struct dma_map_ops {
unsigned long attrs);
void (*free)(struct device *dev, size_t size, void *vaddr,
dma_addr_t dma_handle, unsigned long attrs);
- struct page *(*alloc_pages)(struct device *dev, size_t size,
+ struct page *(*alloc_pages_op)(struct device *dev, size_t size,
dma_addr_t *dma_handle, enum dma_data_direction dir,
gfp_t gfp);
void (*free_pages)(struct device *dev, size_t size, struct page *vaddr,
diff --git a/kernel/dma/mapping.c b/kernel/dma/mapping.c
index 58db8fd70471..5e2d51e1cdf6 100644
--- a/kernel/dma/mapping.c
+++ b/kernel/dma/mapping.c
@@ -570,9 +570,9 @@ static struct page *__dma_alloc_pages(struct device *dev, size_t size,
size = PAGE_ALIGN(size);
if (dma_alloc_direct(dev, ops))
return dma_direct_alloc_pages(dev, size, dma_handle, dir, gfp);
- if (!ops->alloc_pages)
+ if (!ops->alloc_pages_op)
return NULL;
- return ops->alloc_pages(dev, size, dma_handle, dir, gfp);
+ return ops->alloc_pages_op(dev, size, dma_handle, dir, gfp);
}

struct page *dma_alloc_pages(struct device *dev, size_t size,
--
2.43.0.687.g38aa6559b0-goog

Suren Baghdasaryan
Feb 12, 2024, 4:40:10 PM
Redefine page allocators to record allocation tags upon their
invocation. Instrument post_alloc_hook and free_pages_prepare to
modify the current allocation tag.
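
The shape of the hook is easiest to see in a userspace model. The
following is a simplified sketch, not the kernel's definitions: it
assumes the GCC/Clang extensions (statement expressions, typeof) that
the kernel already relies on, and the hypothetical my_alloc() plays
the role of the real page allocators.

#include <stdio.h>
#include <stdlib.h>

struct alloc_tag {
	const char *file;
	int line;
	unsigned long bytes;
	unsigned long calls;
};

/* Model of the task-local "current tag" the hook installs. */
static __thread struct alloc_tag *current_tag;

#define DEFINE_ALLOC_TAG(_tag, _old)					\
	static struct alloc_tag _tag = { __FILE__, __LINE__, 0, 0 };	\
	struct alloc_tag *_old = current_tag;				\
	current_tag = &_tag

#define alloc_hooks(_do_alloc)						\
({									\
	typeof(_do_alloc) _res;						\
	DEFINE_ALLOC_TAG(_alloc_tag, _old);				\
	_res = _do_alloc;						\
	current_tag = _old;	/* restore the previous tag */		\
	_res;								\
})

static void *my_alloc_noprof(size_t size)
{
	if (current_tag) {	/* charge the caller's tag */
		current_tag->bytes += size;
		current_tag->calls++;
		printf("%s:%d now %lu bytes over %lu calls\n",
		       current_tag->file, current_tag->line,
		       current_tag->bytes, current_tag->calls);
	}
	return malloc(size);
}
#define my_alloc(...) alloc_hooks(my_alloc_noprof(__VA_ARGS__))

int main(void)
{
	void *p = my_alloc(128);	/* accounted to this file:line */
	void *q = my_alloc(64);		/* a separate call site, own tag */

	free(p);
	free(q);
	return 0;
}

Because the tag is a static local, one instance exists per call site
and the counters accumulate per file:line, which is exactly the shape
of the /proc/allocinfo rows. Note also the compaction_alloc() hunk
below: migration code consumes that callback as a function pointer, so
the wrapper there has to remain a real function that applies
alloc_hooks() internally, rather than the usual macro.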

Signed-off-by: Suren Baghdasaryan <sur...@google.com>
Co-developed-by: Kent Overstreet <kent.ov...@linux.dev>
Signed-off-by: Kent Overstreet <kent.ov...@linux.dev>
---
include/linux/alloc_tag.h | 10 +++
include/linux/gfp.h | 126 ++++++++++++++++++++++++--------------
include/linux/pagemap.h | 9 ++-
mm/compaction.c | 7 ++-
mm/filemap.c | 6 +-
mm/mempolicy.c | 52 ++++++++--------
mm/page_alloc.c | 60 +++++++++---------
7 files changed, 160 insertions(+), 110 deletions(-)

diff --git a/include/linux/alloc_tag.h b/include/linux/alloc_tag.h
index cf55a149fa84..6fa8a94d8bc1 100644
--- a/include/linux/alloc_tag.h
+++ b/include/linux/alloc_tag.h
@@ -130,4 +130,14 @@ static inline void alloc_tag_add(union codetag_ref *ref, struct alloc_tag *tag,

#endif

+#define alloc_hooks(_do_alloc) \
+({ \
+ typeof(_do_alloc) _res; \
+ DEFINE_ALLOC_TAG(_alloc_tag, _old); \
+ \
+ _res = _do_alloc; \
+ alloc_tag_restore(&_alloc_tag, _old); \
+ _res; \
+})
+
#endif /* _LINUX_ALLOC_TAG_H */
diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index de292a007138..bc0fd5259b0b 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -6,6 +6,8 @@

#include <linux/mmzone.h>
#include <linux/topology.h>
+#include <linux/alloc_tag.h>
+#include <linux/sched.h>

struct vm_area_struct;
struct mempolicy;
@@ -175,42 +177,46 @@ static inline void arch_free_page(struct page *page, int order) { }
static inline void arch_alloc_page(struct page *page, int order) { }
#endif

-struct page *__alloc_pages(gfp_t gfp, unsigned int order, int preferred_nid,
+struct page *__alloc_pages_noprof(gfp_t gfp, unsigned int order, int preferred_nid,
nodemask_t *nodemask);
-struct folio *__folio_alloc(gfp_t gfp, unsigned int order, int preferred_nid,
+#define __alloc_pages(...) alloc_hooks(__alloc_pages_noprof(__VA_ARGS__))
+
+struct folio *__folio_alloc_noprof(gfp_t gfp, unsigned int order, int preferred_nid,
nodemask_t *nodemask);
+#define __folio_alloc(...) alloc_hooks(__folio_alloc_noprof(__VA_ARGS__))

-unsigned long __alloc_pages_bulk(gfp_t gfp, int preferred_nid,
+unsigned long alloc_pages_bulk_noprof(gfp_t gfp, int preferred_nid,
nodemask_t *nodemask, int nr_pages,
struct list_head *page_list,
struct page **page_array);
+#define __alloc_pages_bulk(...) alloc_hooks(alloc_pages_bulk_noprof(__VA_ARGS__))

-unsigned long alloc_pages_bulk_array_mempolicy(gfp_t gfp,
+unsigned long alloc_pages_bulk_array_mempolicy_noprof(gfp_t gfp,
unsigned long nr_pages,
struct page **page_array);
+#define alloc_pages_bulk_array_mempolicy(...) \
+ alloc_hooks(alloc_pages_bulk_array_mempolicy_noprof(__VA_ARGS__))

/* Bulk allocate order-0 pages */
-static inline unsigned long
-alloc_pages_bulk_list(gfp_t gfp, unsigned long nr_pages, struct list_head *list)
-{
- return __alloc_pages_bulk(gfp, numa_mem_id(), NULL, nr_pages, list, NULL);
-}
+#define alloc_pages_bulk_list(_gfp, _nr_pages, _list) \
+ __alloc_pages_bulk(_gfp, numa_mem_id(), NULL, _nr_pages, _list, NULL)

-static inline unsigned long
-alloc_pages_bulk_array(gfp_t gfp, unsigned long nr_pages, struct page **page_array)
-{
- return __alloc_pages_bulk(gfp, numa_mem_id(), NULL, nr_pages, NULL, page_array);
-}
+#define alloc_pages_bulk_array(_gfp, _nr_pages, _page_array) \
+ __alloc_pages_bulk(_gfp, numa_mem_id(), NULL, _nr_pages, NULL, _page_array)

static inline unsigned long
-alloc_pages_bulk_array_node(gfp_t gfp, int nid, unsigned long nr_pages, struct page **page_array)
+alloc_pages_bulk_array_node_noprof(gfp_t gfp, int nid, unsigned long nr_pages,
+ struct page **page_array)
{
if (nid == NUMA_NO_NODE)
nid = numa_mem_id();

- return __alloc_pages_bulk(gfp, nid, NULL, nr_pages, NULL, page_array);
+ return alloc_pages_bulk_noprof(gfp, nid, NULL, nr_pages, NULL, page_array);
}

+#define alloc_pages_bulk_array_node(...) \
+ alloc_hooks(alloc_pages_bulk_array_node_noprof(__VA_ARGS__))
+
static inline void warn_if_node_offline(int this_node, gfp_t gfp_mask)
{
gfp_t warn_gfp = gfp_mask & (__GFP_THISNODE|__GFP_NOWARN);
@@ -230,82 +236,104 @@ static inline void warn_if_node_offline(int this_node, gfp_t gfp_mask)
* online. For more general interface, see alloc_pages_node().
*/
static inline struct page *
-__alloc_pages_node(int nid, gfp_t gfp_mask, unsigned int order)
+__alloc_pages_node_noprof(int nid, gfp_t gfp_mask, unsigned int order)
{
VM_BUG_ON(nid < 0 || nid >= MAX_NUMNODES);
warn_if_node_offline(nid, gfp_mask);

- return __alloc_pages(gfp_mask, order, nid, NULL);
+ return __alloc_pages_noprof(gfp_mask, order, nid, NULL);
}

+#define __alloc_pages_node(...) alloc_hooks(__alloc_pages_node_noprof(__VA_ARGS__))
+
static inline
-struct folio *__folio_alloc_node(gfp_t gfp, unsigned int order, int nid)
+struct folio *__folio_alloc_node_noprof(gfp_t gfp, unsigned int order, int nid)
{
VM_BUG_ON(nid < 0 || nid >= MAX_NUMNODES);
warn_if_node_offline(nid, gfp);

- return __folio_alloc(gfp, order, nid, NULL);
+ return __folio_alloc_noprof(gfp, order, nid, NULL);
}

+#define __folio_alloc_node(...) alloc_hooks(__folio_alloc_node_noprof(__VA_ARGS__))
+
/*
* Allocate pages, preferring the node given as nid. When nid == NUMA_NO_NODE,
* prefer the current CPU's closest node. Otherwise node must be valid and
* online.
*/
-static inline struct page *alloc_pages_node(int nid, gfp_t gfp_mask,
- unsigned int order)
+static inline struct page *alloc_pages_node_noprof(int nid, gfp_t gfp_mask,
+ unsigned int order)
{
if (nid == NUMA_NO_NODE)
nid = numa_mem_id();

- return __alloc_pages_node(nid, gfp_mask, order);
+ return __alloc_pages_node_noprof(nid, gfp_mask, order);
}

+#define alloc_pages_node(...) alloc_hooks(alloc_pages_node_noprof(__VA_ARGS__))
+
#ifdef CONFIG_NUMA
-struct page *alloc_pages(gfp_t gfp, unsigned int order);
-struct page *alloc_pages_mpol(gfp_t gfp, unsigned int order,
+struct page *alloc_pages_noprof(gfp_t gfp, unsigned int order);
+struct page *alloc_pages_mpol_noprof(gfp_t gfp, unsigned int order,
struct mempolicy *mpol, pgoff_t ilx, int nid);
-struct folio *folio_alloc(gfp_t gfp, unsigned int order);
-struct folio *vma_alloc_folio(gfp_t gfp, int order, struct vm_area_struct *vma,
+struct folio *folio_alloc_noprof(gfp_t gfp, unsigned int order);
+struct folio *vma_alloc_folio_noprof(gfp_t gfp, int order, struct vm_area_struct *vma,
unsigned long addr, bool hugepage);
#else
-static inline struct page *alloc_pages(gfp_t gfp_mask, unsigned int order)
+static inline struct page *alloc_pages_noprof(gfp_t gfp_mask, unsigned int order)
{
- return alloc_pages_node(numa_node_id(), gfp_mask, order);
+ return alloc_pages_node_noprof(numa_node_id(), gfp_mask, order);
}
-static inline struct page *alloc_pages_mpol(gfp_t gfp, unsigned int order,
+static inline struct page *alloc_pages_mpol_noprof(gfp_t gfp, unsigned int order,
struct mempolicy *mpol, pgoff_t ilx, int nid)
{
- return alloc_pages(gfp, order);
+ return alloc_pages_noprof(gfp, order);
}
-static inline struct folio *folio_alloc(gfp_t gfp, unsigned int order)
+static inline struct folio *folio_alloc_noprof(gfp_t gfp, unsigned int order)
{
return __folio_alloc_node(gfp, order, numa_node_id());
}
-#define vma_alloc_folio(gfp, order, vma, addr, hugepage) \
- folio_alloc(gfp, order)
+#define vma_alloc_folio_noprof(gfp, order, vma, addr, hugepage) \
+ folio_alloc_noprof(gfp, order)
#endif
+
+#define alloc_pages(...) alloc_hooks(alloc_pages_noprof(__VA_ARGS__))
+#define alloc_pages_mpol(...) alloc_hooks(alloc_pages_mpol_noprof(__VA_ARGS__))
+#define folio_alloc(...) alloc_hooks(folio_alloc_noprof(__VA_ARGS__))
+#define vma_alloc_folio(...) alloc_hooks(vma_alloc_folio_noprof(__VA_ARGS__))
+
#define alloc_page(gfp_mask) alloc_pages(gfp_mask, 0)
-static inline struct page *alloc_page_vma(gfp_t gfp,
+
+static inline struct page *alloc_page_vma_noprof(gfp_t gfp,
struct vm_area_struct *vma, unsigned long addr)
{
- struct folio *folio = vma_alloc_folio(gfp, 0, vma, addr, false);
+ struct folio *folio = vma_alloc_folio_noprof(gfp, 0, vma, addr, false);

return &folio->page;
}
+#define alloc_page_vma(...) alloc_hooks(alloc_page_vma_noprof(__VA_ARGS__))
+
+extern unsigned long get_free_pages_noprof(gfp_t gfp_mask, unsigned int order);
+#define __get_free_pages(...) alloc_hooks(get_free_pages_noprof(__VA_ARGS__))

-extern unsigned long __get_free_pages(gfp_t gfp_mask, unsigned int order);
-extern unsigned long get_zeroed_page(gfp_t gfp_mask);
+extern unsigned long get_zeroed_page_noprof(gfp_t gfp_mask);
+#define get_zeroed_page(...) alloc_hooks(get_zeroed_page_noprof(__VA_ARGS__))
+
+void *alloc_pages_exact_noprof(size_t size, gfp_t gfp_mask) __alloc_size(1);
+#define alloc_pages_exact(...) alloc_hooks(alloc_pages_exact_noprof(__VA_ARGS__))

-void *alloc_pages_exact(size_t size, gfp_t gfp_mask) __alloc_size(1);
void free_pages_exact(void *virt, size_t size);
-__meminit void *alloc_pages_exact_nid(int nid, size_t size, gfp_t gfp_mask) __alloc_size(2);

-#define __get_free_page(gfp_mask) \
- __get_free_pages((gfp_mask), 0)
+__meminit void *alloc_pages_exact_nid_noprof(int nid, size_t size, gfp_t gfp_mask) __alloc_size(2);
+#define alloc_pages_exact_nid(...) \
+ alloc_hooks(alloc_pages_exact_nid_noprof(__VA_ARGS__))
+
+#define __get_free_page(gfp_mask) \
+ __get_free_pages((gfp_mask), 0)

-#define __get_dma_pages(gfp_mask, order) \
- __get_free_pages((gfp_mask) | GFP_DMA, (order))
+#define __get_dma_pages(gfp_mask, order) \
+ __get_free_pages((gfp_mask) | GFP_DMA, (order))

extern void __free_pages(struct page *page, unsigned int order);
extern void free_pages(unsigned long addr, unsigned int order);
@@ -357,10 +385,14 @@ extern gfp_t vma_thp_gfp_mask(struct vm_area_struct *vma);

#ifdef CONFIG_CONTIG_ALLOC
/* The below functions must be run on a range from a single zone. */
-extern int alloc_contig_range(unsigned long start, unsigned long end,
+extern int alloc_contig_range_noprof(unsigned long start, unsigned long end,
unsigned migratetype, gfp_t gfp_mask);
-extern struct page *alloc_contig_pages(unsigned long nr_pages, gfp_t gfp_mask,
- int nid, nodemask_t *nodemask);
+#define alloc_contig_range(...) alloc_hooks(alloc_contig_range_noprof(__VA_ARGS__))
+
+extern struct page *alloc_contig_pages_noprof(unsigned long nr_pages, gfp_t gfp_mask,
+ int nid, nodemask_t *nodemask);
+#define alloc_contig_pages(...) alloc_hooks(alloc_contig_pages_noprof(__VA_ARGS__))
+
#endif
void free_contig_range(unsigned long pfn, unsigned long nr_pages);

diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index 2df35e65557d..35636e67e2e1 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -542,14 +542,17 @@ static inline void *detach_page_private(struct page *page)
#endif

#ifdef CONFIG_NUMA
-struct folio *filemap_alloc_folio(gfp_t gfp, unsigned int order);
+struct folio *filemap_alloc_folio_noprof(gfp_t gfp, unsigned int order);
#else
-static inline struct folio *filemap_alloc_folio(gfp_t gfp, unsigned int order)
+static inline struct folio *filemap_alloc_folio_noprof(gfp_t gfp, unsigned int order)
{
- return folio_alloc(gfp, order);
+ return folio_alloc_noprof(gfp, order);
}
#endif

+#define filemap_alloc_folio(...) \
+ alloc_hooks(filemap_alloc_folio_noprof(__VA_ARGS__))
+
static inline struct page *__page_cache_alloc(gfp_t gfp)
{
return &filemap_alloc_folio(gfp, 0)->page;
diff --git a/mm/compaction.c b/mm/compaction.c
index 4add68d40e8d..f4c0e682c979 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -1781,7 +1781,7 @@ static void isolate_freepages(struct compact_control *cc)
* This is a migrate-callback that "allocates" freepages by taking pages
* from the isolated freelists in the block we are migrating to.
*/
-static struct folio *compaction_alloc(struct folio *src, unsigned long data)
+static struct folio *compaction_alloc_noprof(struct folio *src, unsigned long data)
{
struct compact_control *cc = (struct compact_control *)data;
struct folio *dst;
@@ -1800,6 +1800,11 @@ static struct folio *compaction_alloc(struct folio *src, unsigned long data)
return dst;
}

+static struct folio *compaction_alloc(struct folio *src, unsigned long data)
+{
+ return alloc_hooks(compaction_alloc_noprof(src, data));
+}
+
/*
* This is a migrate-callback that "frees" freepages back to the isolated
* freelist. All pages on the freelist are from the same zone, so there is no
diff --git a/mm/filemap.c b/mm/filemap.c
index 750e779c23db..e51e474545ad 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -957,7 +957,7 @@ int filemap_add_folio(struct address_space *mapping, struct folio *folio,
EXPORT_SYMBOL_GPL(filemap_add_folio);

#ifdef CONFIG_NUMA
-struct folio *filemap_alloc_folio(gfp_t gfp, unsigned int order)
+struct folio *filemap_alloc_folio_noprof(gfp_t gfp, unsigned int order)
{
int n;
struct folio *folio;
@@ -972,9 +972,9 @@ struct folio *filemap_alloc_folio(gfp_t gfp, unsigned int order)

return folio;
}
- return folio_alloc(gfp, order);
+ return folio_alloc_noprof(gfp, order);
}
-EXPORT_SYMBOL(filemap_alloc_folio);
+EXPORT_SYMBOL(filemap_alloc_folio_noprof);
#endif

/*
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 10a590ee1c89..c329d00b975f 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -2070,15 +2070,15 @@ static struct page *alloc_pages_preferred_many(gfp_t gfp, unsigned int order,
*/
preferred_gfp = gfp | __GFP_NOWARN;
preferred_gfp &= ~(__GFP_DIRECT_RECLAIM | __GFP_NOFAIL);
- page = __alloc_pages(preferred_gfp, order, nid, nodemask);
+ page = __alloc_pages_noprof(preferred_gfp, order, nid, nodemask);
if (!page)
- page = __alloc_pages(gfp, order, nid, NULL);
+ page = __alloc_pages_noprof(gfp, order, nid, NULL);

return page;
}

/**
- * alloc_pages_mpol - Allocate pages according to NUMA mempolicy.
+ * alloc_pages_mpol_noprof - Allocate pages according to NUMA mempolicy.
* @gfp: GFP flags.
* @order: Order of the page allocation.
* @pol: Pointer to the NUMA mempolicy.
@@ -2087,7 +2087,7 @@ static struct page *alloc_pages_preferred_many(gfp_t gfp, unsigned int order,
*
* Return: The page on success or NULL if allocation fails.
*/
-struct page *alloc_pages_mpol(gfp_t gfp, unsigned int order,
+struct page *alloc_pages_mpol_noprof(gfp_t gfp, unsigned int order,
struct mempolicy *pol, pgoff_t ilx, int nid)
{
nodemask_t *nodemask;
@@ -2117,7 +2117,7 @@ struct page *alloc_pages_mpol(gfp_t gfp, unsigned int order,
* First, try to allocate THP only on local node, but
* don't reclaim unnecessarily, just compact.
*/
- page = __alloc_pages_node(nid,
+ page = __alloc_pages_node_noprof(nid,
gfp | __GFP_THISNODE | __GFP_NORETRY, order);
if (page || !(gfp & __GFP_DIRECT_RECLAIM))
return page;
@@ -2130,7 +2130,7 @@ struct page *alloc_pages_mpol(gfp_t gfp, unsigned int order,
}
}

- page = __alloc_pages(gfp, order, nid, nodemask);
+ page = __alloc_pages_noprof(gfp, order, nid, nodemask);

if (unlikely(pol->mode == MPOL_INTERLEAVE) && page) {
/* skip NUMA_INTERLEAVE_HIT update if numa stats is disabled */
@@ -2146,7 +2146,7 @@ struct page *alloc_pages_mpol(gfp_t gfp, unsigned int order,
}

/**
- * vma_alloc_folio - Allocate a folio for a VMA.
+ * vma_alloc_folio_noprof - Allocate a folio for a VMA.
* @gfp: GFP flags.
* @order: Order of the folio.
* @vma: Pointer to VMA.
@@ -2161,7 +2161,7 @@ struct page *alloc_pages_mpol(gfp_t gfp, unsigned int order,
*
* Return: The folio on success or NULL if allocation fails.
*/
-struct folio *vma_alloc_folio(gfp_t gfp, int order, struct vm_area_struct *vma,
+struct folio *vma_alloc_folio_noprof(gfp_t gfp, int order, struct vm_area_struct *vma,
unsigned long addr, bool hugepage)
{
struct mempolicy *pol;
@@ -2169,15 +2169,15 @@ struct folio *vma_alloc_folio(gfp_t gfp, int order, struct vm_area_struct *vma,
struct page *page;

pol = get_vma_policy(vma, addr, order, &ilx);
- page = alloc_pages_mpol(gfp | __GFP_COMP, order,
- pol, ilx, numa_node_id());
+ page = alloc_pages_mpol_noprof(gfp | __GFP_COMP, order,
+ pol, ilx, numa_node_id());
mpol_cond_put(pol);
return page_rmappable_folio(page);
}
-EXPORT_SYMBOL(vma_alloc_folio);
+EXPORT_SYMBOL(vma_alloc_folio_noprof);

/**
- * alloc_pages - Allocate pages.
+ * alloc_pages_noprof - Allocate pages.
* @gfp: GFP flags.
* @order: Power of two of number of pages to allocate.
*
@@ -2190,7 +2190,7 @@ EXPORT_SYMBOL(vma_alloc_folio);
* flags are used.
* Return: The page on success or NULL if allocation fails.
*/
-struct page *alloc_pages(gfp_t gfp, unsigned int order)
+struct page *alloc_pages_noprof(gfp_t gfp, unsigned int order)
{
struct mempolicy *pol = &default_policy;

@@ -2201,16 +2201,16 @@ struct page *alloc_pages(gfp_t gfp, unsigned int order)
if (!in_interrupt() && !(gfp & __GFP_THISNODE))
pol = get_task_policy(current);

- return alloc_pages_mpol(gfp, order,
- pol, NO_INTERLEAVE_INDEX, numa_node_id());
+ return alloc_pages_mpol_noprof(gfp, order, pol, NO_INTERLEAVE_INDEX,
+ numa_node_id());
}
-EXPORT_SYMBOL(alloc_pages);
+EXPORT_SYMBOL(alloc_pages_noprof);

-struct folio *folio_alloc(gfp_t gfp, unsigned int order)
+struct folio *folio_alloc_noprof(gfp_t gfp, unsigned int order)
{
- return page_rmappable_folio(alloc_pages(gfp | __GFP_COMP, order));
+ return page_rmappable_folio(alloc_pages_noprof(gfp | __GFP_COMP, order));
}
-EXPORT_SYMBOL(folio_alloc);
+EXPORT_SYMBOL(folio_alloc_noprof);

static unsigned long alloc_pages_bulk_array_interleave(gfp_t gfp,
struct mempolicy *pol, unsigned long nr_pages,
@@ -2229,13 +2229,13 @@ static unsigned long alloc_pages_bulk_array_interleave(gfp_t gfp,

for (i = 0; i < nodes; i++) {
if (delta) {
- nr_allocated = __alloc_pages_bulk(gfp,
+ nr_allocated = alloc_pages_bulk_noprof(gfp,
interleave_nodes(pol), NULL,
nr_pages_per_node + 1, NULL,
page_array);
delta--;
} else {
- nr_allocated = __alloc_pages_bulk(gfp,
+ nr_allocated = alloc_pages_bulk_noprof(gfp,
interleave_nodes(pol), NULL,
nr_pages_per_node, NULL, page_array);
}
@@ -2257,11 +2257,11 @@ static unsigned long alloc_pages_bulk_array_preferred_many(gfp_t gfp, int nid,
preferred_gfp = gfp | __GFP_NOWARN;
preferred_gfp &= ~(__GFP_DIRECT_RECLAIM | __GFP_NOFAIL);

- nr_allocated = __alloc_pages_bulk(preferred_gfp, nid, &pol->nodes,
+ nr_allocated = alloc_pages_bulk_noprof(preferred_gfp, nid, &pol->nodes,
nr_pages, NULL, page_array);

if (nr_allocated < nr_pages)
- nr_allocated += __alloc_pages_bulk(gfp, numa_node_id(), NULL,
+ nr_allocated += alloc_pages_bulk_noprof(gfp, numa_node_id(), NULL,
nr_pages - nr_allocated, NULL,
page_array + nr_allocated);
return nr_allocated;
@@ -2273,7 +2273,7 @@ static unsigned long alloc_pages_bulk_array_preferred_many(gfp_t gfp, int nid,
* It can accelerate memory allocation especially interleaving
* allocate memory.
*/
-unsigned long alloc_pages_bulk_array_mempolicy(gfp_t gfp,
+unsigned long alloc_pages_bulk_array_mempolicy_noprof(gfp_t gfp,
unsigned long nr_pages, struct page **page_array)
{
struct mempolicy *pol = &default_policy;
@@ -2293,8 +2293,8 @@ unsigned long alloc_pages_bulk_array_mempolicy(gfp_t gfp,

nid = numa_node_id();
nodemask = policy_nodemask(gfp, pol, NO_INTERLEAVE_INDEX, &nid);
- return __alloc_pages_bulk(gfp, nid, nodemask,
- nr_pages, NULL, page_array);
+ return alloc_pages_bulk_noprof(gfp, nid, nodemask,
+ nr_pages, NULL, page_array);
}

int vma_dup_policy(struct vm_area_struct *src, struct vm_area_struct *dst)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index edb79a55a252..58c0e8b948a4 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4380,7 +4380,7 @@ static inline bool prepare_alloc_pages(gfp_t gfp_mask, unsigned int order,
*
* Returns the number of pages on the list or array.
*/
-unsigned long __alloc_pages_bulk(gfp_t gfp, int preferred_nid,
+unsigned long alloc_pages_bulk_noprof(gfp_t gfp, int preferred_nid,
nodemask_t *nodemask, int nr_pages,
struct list_head *page_list,
struct page **page_array)
@@ -4516,7 +4516,7 @@ unsigned long __alloc_pages_bulk(gfp_t gfp, int preferred_nid,
pcp_trylock_finish(UP_flags);

failed:
- page = __alloc_pages(gfp, 0, preferred_nid, nodemask);
+ page = __alloc_pages_noprof(gfp, 0, preferred_nid, nodemask);
if (page) {
if (page_list)
list_add(&page->lru, page_list);
@@ -4527,13 +4527,13 @@ unsigned long __alloc_pages_bulk(gfp_t gfp, int preferred_nid,

goto out;
}
-EXPORT_SYMBOL_GPL(__alloc_pages_bulk);
+EXPORT_SYMBOL_GPL(alloc_pages_bulk_noprof);

/*
* This is the 'heart' of the zoned buddy allocator.
*/
-struct page *__alloc_pages(gfp_t gfp, unsigned int order, int preferred_nid,
- nodemask_t *nodemask)
+struct page *__alloc_pages_noprof(gfp_t gfp, unsigned int order,
+ int preferred_nid, nodemask_t *nodemask)
{
struct page *page;
unsigned int alloc_flags = ALLOC_WMARK_LOW;
@@ -4595,38 +4595,38 @@ struct page *__alloc_pages(gfp_t gfp, unsigned int order, int preferred_nid,

return page;
}
-EXPORT_SYMBOL(__alloc_pages);
+EXPORT_SYMBOL(__alloc_pages_noprof);

-struct folio *__folio_alloc(gfp_t gfp, unsigned int order, int preferred_nid,
+struct folio *__folio_alloc_noprof(gfp_t gfp, unsigned int order, int preferred_nid,
nodemask_t *nodemask)
{
- struct page *page = __alloc_pages(gfp | __GFP_COMP, order,
+ struct page *page = __alloc_pages_noprof(gfp | __GFP_COMP, order,
preferred_nid, nodemask);
return page_rmappable_folio(page);
}
-EXPORT_SYMBOL(__folio_alloc);
+EXPORT_SYMBOL(__folio_alloc_noprof);

/*
* Common helper functions. Never use with __GFP_HIGHMEM because the returned
* address cannot represent highmem pages. Use alloc_pages and then kmap if
* you need to access high mem.
*/
-unsigned long __get_free_pages(gfp_t gfp_mask, unsigned int order)
+unsigned long get_free_pages_noprof(gfp_t gfp_mask, unsigned int order)
{
struct page *page;

- page = alloc_pages(gfp_mask & ~__GFP_HIGHMEM, order);
+ page = alloc_pages_noprof(gfp_mask & ~__GFP_HIGHMEM, order);
if (!page)
return 0;
return (unsigned long) page_address(page);
}
-EXPORT_SYMBOL(__get_free_pages);
+EXPORT_SYMBOL(get_free_pages_noprof);

-unsigned long get_zeroed_page(gfp_t gfp_mask)
+unsigned long get_zeroed_page_noprof(gfp_t gfp_mask)
{
- return __get_free_page(gfp_mask | __GFP_ZERO);
+ return get_free_pages_noprof(gfp_mask | __GFP_ZERO, 0);
}
-EXPORT_SYMBOL(get_zeroed_page);
+EXPORT_SYMBOL(get_zeroed_page_noprof);

/**
* __free_pages - Free pages allocated with alloc_pages().
@@ -4818,7 +4818,7 @@ static void *make_alloc_exact(unsigned long addr, unsigned int order,
}

/**
- * alloc_pages_exact - allocate an exact number physically-contiguous pages.
+ * alloc_pages_exact_noprof - allocate an exact number physically-contiguous pages.
* @size: the number of bytes to allocate
* @gfp_mask: GFP flags for the allocation, must not contain __GFP_COMP
*
@@ -4832,7 +4832,7 @@ static void *make_alloc_exact(unsigned long addr, unsigned int order,
*
* Return: pointer to the allocated area or %NULL in case of error.
*/
-void *alloc_pages_exact(size_t size, gfp_t gfp_mask)
+void *alloc_pages_exact_noprof(size_t size, gfp_t gfp_mask)
{
unsigned int order = get_order(size);
unsigned long addr;
@@ -4840,13 +4840,13 @@ void *alloc_pages_exact(size_t size, gfp_t gfp_mask)
if (WARN_ON_ONCE(gfp_mask & (__GFP_COMP | __GFP_HIGHMEM)))
gfp_mask &= ~(__GFP_COMP | __GFP_HIGHMEM);

- addr = __get_free_pages(gfp_mask, order);
+ addr = get_free_pages_noprof(gfp_mask, order);
return make_alloc_exact(addr, order, size);
}
-EXPORT_SYMBOL(alloc_pages_exact);
+EXPORT_SYMBOL(alloc_pages_exact_noprof);

/**
- * alloc_pages_exact_nid - allocate an exact number of physically-contiguous
+ * alloc_pages_exact_nid_noprof - allocate an exact number of physically-contiguous
* pages on a node.
* @nid: the preferred node ID where memory should be allocated
* @size: the number of bytes to allocate
@@ -4857,7 +4857,7 @@ EXPORT_SYMBOL(alloc_pages_exact);
*
* Return: pointer to the allocated area or %NULL in case of error.
*/
-void * __meminit alloc_pages_exact_nid(int nid, size_t size, gfp_t gfp_mask)
+void * __meminit alloc_pages_exact_nid_noprof(int nid, size_t size, gfp_t gfp_mask)
{
unsigned int order = get_order(size);
struct page *p;
@@ -4865,7 +4865,7 @@ void * __meminit alloc_pages_exact_nid(int nid, size_t size, gfp_t gfp_mask)
if (WARN_ON_ONCE(gfp_mask & (__GFP_COMP | __GFP_HIGHMEM)))
gfp_mask &= ~(__GFP_COMP | __GFP_HIGHMEM);

- p = alloc_pages_node(nid, gfp_mask, order);
+ p = alloc_pages_node_noprof(nid, gfp_mask, order);
if (!p)
return NULL;
return make_alloc_exact((unsigned long)page_address(p), order, size);
@@ -6283,7 +6283,7 @@ int __alloc_contig_migrate_range(struct compact_control *cc,
}

/**
- * alloc_contig_range() -- tries to allocate given range of pages
+ * alloc_contig_range_noprof() -- tries to allocate given range of pages
* @start: start PFN to allocate
* @end: one-past-the-last PFN to allocate
* @migratetype: migratetype of the underlying pageblocks (either
@@ -6303,7 +6303,7 @@ int __alloc_contig_migrate_range(struct compact_control *cc,
* pages which PFN is in [start, end) are allocated for the caller and
* need to be freed with free_contig_range().
*/
-int alloc_contig_range(unsigned long start, unsigned long end,
+int alloc_contig_range_noprof(unsigned long start, unsigned long end,
unsigned migratetype, gfp_t gfp_mask)
{
unsigned long outer_start, outer_end;
@@ -6427,15 +6427,15 @@ int alloc_contig_range(unsigned long start, unsigned long end,
undo_isolate_page_range(start, end, migratetype);
return ret;
}
-EXPORT_SYMBOL(alloc_contig_range);
+EXPORT_SYMBOL(alloc_contig_range_noprof);

static int __alloc_contig_pages(unsigned long start_pfn,
unsigned long nr_pages, gfp_t gfp_mask)
{
unsigned long end_pfn = start_pfn + nr_pages;

- return alloc_contig_range(start_pfn, end_pfn, MIGRATE_MOVABLE,
- gfp_mask);
+ return alloc_contig_range_noprof(start_pfn, end_pfn, MIGRATE_MOVABLE,
+ gfp_mask);
}

static bool pfn_range_valid_contig(struct zone *z, unsigned long start_pfn,
@@ -6470,7 +6470,7 @@ static bool zone_spans_last_pfn(const struct zone *zone,
}

/**
- * alloc_contig_pages() -- tries to find and allocate contiguous range of pages
+ * alloc_contig_pages_noprof() -- tries to find and allocate contiguous range of pages
* @nr_pages: Number of contiguous pages to allocate
* @gfp_mask: GFP mask to limit search and used during compaction
* @nid: Target node
@@ -6490,8 +6490,8 @@ static bool zone_spans_last_pfn(const struct zone *zone,
*
* Return: pointer to contiguous pages on success, or NULL if not successful.
*/
-struct page *alloc_contig_pages(unsigned long nr_pages, gfp_t gfp_mask,
- int nid, nodemask_t *nodemask)
+struct page *alloc_contig_pages_noprof(unsigned long nr_pages, gfp_t gfp_mask,
+ int nid, nodemask_t *nodemask)
{
unsigned long ret, pfn, flags;
struct zonelist *zonelist;
--
2.43.0.687.g38aa6559b0-goog
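One detail worth calling out in the hunks above: get_zeroed_page_noprof()
now calls get_free_pages_noprof(gfp_mask | __GFP_ZERO, 0) directly
instead of going through the __get_free_page() macro. Calling a hooked
macro from inside another allocator would run alloc_hooks() a second
time and, presumably, charge the memory to the wrapper's own source
line rather than to the end caller, which is why the _noprof variants
consistently call other _noprof variants.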

Suren Baghdasaryan
Feb 12, 2024, 4:40:13 PM
When a high-order page is split into smaller ones, each newly split
page should get its own codetag. The original codetag is reused for
these pages, but each new reference is recorded as a 0-byte allocation
because the original codetag already accounts for the whole high-order
page.
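
As a worked example, assuming 4 KiB pages: an order-3 allocation
charges 32 KiB to its tag through the head page's reference. After a
split into eight order-0 pages, the seven tail pages each gain a
reference to the same tag recorded with 0 bytes. When the pages are
later freed individually, each free subtracts PAGE_SIZE << 0 = 4 KiB,
and 8 x 4 KiB = 32 KiB exactly cancels the original charge, so the
tag's byte counter balances.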

Signed-off-by: Suren Baghdasaryan <sur...@google.com>
---
include/linux/pgalloc_tag.h | 30 ++++++++++++++++++++++++++++++
mm/huge_memory.c | 2 ++
mm/page_alloc.c | 2 ++
3 files changed, 34 insertions(+)

diff --git a/include/linux/pgalloc_tag.h b/include/linux/pgalloc_tag.h
index a060c26eb449..0174aff5e871 100644
--- a/include/linux/pgalloc_tag.h
+++ b/include/linux/pgalloc_tag.h
@@ -62,11 +62,41 @@ static inline void pgalloc_tag_sub(struct page *page, unsigned int order)
}
}

+static inline void pgalloc_tag_split(struct page *page, unsigned int nr)
+{
+ int i;
+ struct page_ext *page_ext;
+ union codetag_ref *ref;
+ struct alloc_tag *tag;
+
+ if (!mem_alloc_profiling_enabled())
+ return;
+
+ page_ext = page_ext_get(page);
+ if (unlikely(!page_ext))
+ return;
+
+ ref = codetag_ref_from_page_ext(page_ext);
+ if (!ref->ct)
+ goto out;
+
+ tag = ct_to_alloc_tag(ref->ct);
+ page_ext = page_ext_next(page_ext);
+ for (i = 1; i < nr; i++) {
+ /* New reference with 0 bytes accounted */
+ alloc_tag_add(codetag_ref_from_page_ext(page_ext), tag, 0);
+ page_ext = page_ext_next(page_ext);
+ }
+out:
+ page_ext_put(page_ext);
+}
+
#else /* CONFIG_MEM_ALLOC_PROFILING */

static inline void pgalloc_tag_add(struct page *page, struct task_struct *task,
unsigned int order) {}
static inline void pgalloc_tag_sub(struct page *page, unsigned int order) {}
+static inline void pgalloc_tag_split(struct page *page, unsigned int nr) {}

#endif /* CONFIG_MEM_ALLOC_PROFILING */

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 94c958f7ebb5..86daae671319 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -38,6 +38,7 @@
#include <linux/sched/sysctl.h>
#include <linux/memory-tiers.h>
#include <linux/compat.h>
+#include <linux/pgalloc_tag.h>

#include <asm/tlb.h>
#include <asm/pgalloc.h>
@@ -2899,6 +2900,7 @@ static void __split_huge_page(struct page *page, struct list_head *list,
/* Caller disabled irqs, so they are still disabled here */

split_page_owner(head, nr);
+ pgalloc_tag_split(head, nr);

/* See comment in __split_huge_page_tail() */
if (PageAnon(head)) {
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 58c0e8b948a4..4bc5b4720fee 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2621,6 +2621,7 @@ void split_page(struct page *page, unsigned int order)
for (i = 1; i < (1 << order); i++)
set_page_refcounted(page + i);
split_page_owner(page, 1 << order);
+ pgalloc_tag_split(page, 1 << order);
split_page_memcg(page, 1 << order);
}
EXPORT_SYMBOL_GPL(split_page);
@@ -4806,6 +4807,7 @@ static void *make_alloc_exact(unsigned long addr, unsigned int order,
struct page *last = page + nr;

split_page_owner(page, 1 << order);
+ pgalloc_tag_split(page, 1 << order);
split_page_memcg(page, 1 << order);
while (page < --last)
set_page_refcounted(last);
--
2.43.0.687.g38aa6559b0-goog

Suren Baghdasaryan
Feb 12, 2024, 4:40:15 PM
For all page allocations to be tagged, page_ext has to be initialized
before the first page allocation. Unless early_page_ext is enabled,
early tasks allocate their stacks from the page allocator before
alloc_node_page_ext() initializes the page_ext area, so these
allocations generate a warning when CONFIG_MEM_ALLOC_PROFILING_DEBUG
is enabled. Enable early_page_ext whenever
CONFIG_MEM_ALLOC_PROFILING_DEBUG=y to ensure page_ext is initialized
before any page allocation. This carries all the downsides of
early_page_ext, such as a possibly longer boot time, so it is enabled
only when debugging with CONFIG_MEM_ALLOC_PROFILING_DEBUG and not
universally for CONFIG_MEM_ALLOC_PROFILING.
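
The same behavior remains available as an opt-in at boot: the setup
handler visible in the diff below sets the flag from the command line
(presumably registered for an early_page_ext boot parameter), and this
change merely flips the flag's default when the debug config is set.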

Signed-off-by: Suren Baghdasaryan <sur...@google.com>
---
mm/page_ext.c | 9 +++++++++
1 file changed, 9 insertions(+)

diff --git a/mm/page_ext.c b/mm/page_ext.c
index 3c58fe8a24df..e7d8f1a5589e 100644
--- a/mm/page_ext.c
+++ b/mm/page_ext.c
@@ -95,7 +95,16 @@ unsigned long page_ext_size;

static unsigned long total_usage;

+#ifdef CONFIG_MEM_ALLOC_PROFILING_DEBUG
+/*
+ * To ensure correct allocation tagging for pages, page_ext should be available
+ * before the first page allocation. Otherwise early task stacks will be
+ * allocated before page_ext initialization and missing tags will be flagged.
+ */
+bool early_page_ext __meminitdata = true;
+#else
bool early_page_ext __meminitdata;
+#endif
static int __init setup_early_page_ext(char *str)
{
early_page_ext = true;
--
2.43.0.687.g38aa6559b0-goog

Suren Baghdasaryan
Feb 12, 2024, 4:40:17 PM
To store a code tag for every slab object, a codetag reference is
embedded into slabobj_ext when CONFIG_MEM_ALLOC_PROFILING=y.
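
Size-wise, assuming 64-bit pointers: with both CONFIG_MEMCG_KMEM and
CONFIG_MEM_ALLOC_PROFILING enabled, slabobj_ext carries an objcg
pointer plus a codetag_ref, i.e. 16 bytes of metadata per slab object;
with only one of the two enabled, the __aligned(8) struct stays at 8
bytes.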

Signed-off-by: Suren Baghdasaryan <sur...@google.com>
Co-developed-by: Kent Overstreet <kent.ov...@linux.dev>
Signed-off-by: Kent Overstreet <kent.ov...@linux.dev>
---
include/linux/memcontrol.h | 5 +++++
lib/Kconfig.debug | 1 +
mm/slab.h | 4 ++++
3 files changed, 10 insertions(+)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index f3584e98b640..2b010316016c 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -1653,7 +1653,12 @@ unsigned long mem_cgroup_soft_limit_reclaim(pg_data_t *pgdat, int order,
* if MEMCG_DATA_OBJEXTS is set.
*/
struct slabobj_ext {
+#ifdef CONFIG_MEMCG_KMEM
struct obj_cgroup *objcg;
+#endif
+#ifdef CONFIG_MEM_ALLOC_PROFILING
+ union codetag_ref ref;
+#endif
} __aligned(8);

static inline void __inc_lruvec_kmem_state(void *p, enum node_stat_item idx)
diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug
index 7bbdb0ddb011..9ecfcdb54417 100644
--- a/lib/Kconfig.debug
+++ b/lib/Kconfig.debug
@@ -979,6 +979,7 @@ config MEM_ALLOC_PROFILING
depends on !DEBUG_FORCE_WEAK_PER_CPU
select CODE_TAGGING
select PAGE_EXTENSION
+ select SLAB_OBJ_EXT
help
Track allocation source code and record total allocation size
initiated at that code location. The mechanism can be used to track
diff --git a/mm/slab.h b/mm/slab.h
index 77cf7474fe46..224a4b2305fb 100644
--- a/mm/slab.h
+++ b/mm/slab.h
@@ -569,6 +569,10 @@ int alloc_slab_obj_exts(struct slab *slab, struct kmem_cache *s,

static inline bool need_slab_obj_ext(void)
{
+#ifdef CONFIG_MEM_ALLOC_PROFILING
+ if (mem_alloc_profiling_enabled())